Input files

Overview

MicroCAT currently supports fastq files from raw sequencing data and bam files that have been mapped to the host genome.

  • For fastq files (from different sequencing platforms), MicroCAT starts with quality control and proceeds with demultiplexing and mapping to the user-provided host genome and microbial reference genome. It outputs gene expression data and microbial taxonomic data for the host sample.

  • For bam files (output from tools such as CellRanger, STARsolo, HISAT2, etc.), MicroCAT by default maps the bam files to the microbial reference genome and outputs microbial taxonomic data for the sample.

MicroCAT ultimately outputs microbial taxonomic counts in sparse matrix format and host transcriptome counts (if starting from fastq files). These outputs integrate seamlessly with the current leading single-cell transcriptome analysis toolkits.

Thanks to its use of STARsolo, MicroCAT can in principle support sequencing data from any single-cell transcriptomics platform, particularly:

  • droplet-based (e.g., 10X)

  • well-based (e.g., Seq-Well)

  • plate-based (e.g., Smart-Seq)

  • split-pool-based (e.g., SPLiT-seq)

For convenience, we provide the following pre-configured settings for transcriptome analysis. To request support for other technologies, or for troubleshooting, please open an issue in our GitHub repository.

Initialize sample metadata file

The simplest usage is to initialize with microcat init -s sample.tsv

The sample.tsv file is a crucial input for the Snakemake workflow that processes the sequencing data. It lists the samples to be processed, their associated data files, and the directories those files live in. Here’s how to prepare it.

The sample.tsv file should be a tab-separated values (TSV) file with the following three columns:

  • id: The “id” column is the key MicroCAT uses to organize sequencing files, so it must follow a specific naming format. For droplet-based transcriptomic techniques, the naming format is {Patient}_{Tissue}_{Lane}_{Library}. No segment of the name may contain a period (“.”).

    • {Patient}: represents the source patient of the sequencing sample. It can be any string and may include multiple underscores. For example, OSCC_16_T or sdsadw_dsagew413dxd-231_2134dsccxc.

    • {Tissue}: refers to the sampled tissue from the patient and should be named in the format “S” followed by one or more digits, such as S1 or S23.

    • {Lane}: represents the sequencing lane and should be named in the format “L” followed by a three-digit number. If you are sequencing only one lane, it can be named L001. However, if you are performing multi-lane sequencing for a single patient’s tissue, each lane should get its own name to distinguish them, such as L003 or L005.

    • {Library}: is a three-digit number used to differentiate sequencing library reagents. Typically, we use the same sequencing library preparation scheme and name it 001. If you have multiple sequencing libraries, you should use different numbers to distinguish them, such as 002 or 003.

  • fq1: This is the full path to the first fastq file for the sample (usually the file containing the forward reads). When using STARsolo for host transcriptome alignment, the fastq file names can have any format. However, if you are using CellRanger, the fastq file names should follow the format Sample_S1_{Lane}_R1_{library}.fastq.gz (see the CellRanger tutorial). Here, “Sample” represents the {Patient}_{Tissue} information.

  • fq2: This is the full path to the second fastq file for the sample (usually the file containing the reverse reads). When using STARsolo for host transcriptome alignment, the fastq file names can have any format. However, if you are using CellRanger, the fastq file names should follow the format Sample_S1_{Lane}_R2_{library}.fastq.gz (see the CellRanger tutorial). Here, “Sample” represents the {Patient}_{Tissue} information.
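As a rough illustration of the three-column layout described above, the following Python sketch renders a small sample.tsv. The sample ids and file paths are made-up placeholders, the helper name render_sample_tsv is hypothetical, and the presence of a header row is an assumption that is not confirmed here; check a file known to work with your MicroCAT version before relying on it.

```python
import csv
import io

# Placeholder rows: ids follow {Patient}_{Tissue}_{Lane}_{Library};
# the paths are illustrative only and do not point to real files.
rows = [
    {"id": "OSCC_16_T_S1_L001_001",
     "fq1": "/path/to/fastq/oscc16_lane1_R1.fastq.gz",
     "fq2": "/path/to/fastq/oscc16_lane1_R2.fastq.gz"},
    {"id": "OSCC_16_T_S1_L002_001",
     "fq1": "/path/to/fastq/oscc16_lane2_R1.fastq.gz",
     "fq2": "/path/to/fastq/oscc16_lane2_R2.fastq.gz"},
]

def render_sample_tsv(rows):
    """Return the rows as tab-separated text with columns id, fq1, fq2."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["id", "fq1", "fq2"],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()  # assumption: a header row naming the columns
    writer.writerows(rows)
    return buffer.getvalue()

print(render_sample_tsv(rows), end="")
```

Saving the returned text as sample.tsv gives one line per lane of the same patient tissue, which is how multi-lane runs are distinguished in the id column.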

For plate-based transcriptomic techniques, the naming format is {Patient}_{Tissue}_{Plate}_{Library}. In this case, the {Lane} segment is replaced by a {Plate} segment representing the sequencing plate number. The format for the plate name is “P” followed by a three-digit number. If you are sequencing only one plate, it can be named P001.
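The naming rules for the id column can be checked before running the workflow. The sketch below encodes them as regular expressions; it is an independent illustration, not MicroCAT's own validation code, and the function name parse_sample_id is hypothetical.

```python
import re

# Droplet-based ids: {Patient}_{Tissue}_{Lane}_{Library}
# Patient: any text without a period (underscores allowed).
# Tissue: "S" + digits. Lane: "L" + three digits. Library: three digits.
DROPLET_ID = re.compile(
    r"(?P<patient>[^.]+)_(?P<tissue>S[0-9]+)_(?P<lane>L[0-9]{3})_(?P<library>[0-9]{3})"
)

# Plate-based ids: {Patient}_{Tissue}_{Plate}_{Library}
# Plate: "P" + three digits, taking the place of the lane segment.
PLATE_ID = re.compile(
    r"(?P<patient>[^.]+)_(?P<tissue>S[0-9]+)_(?P<plate>P[0-9]{3})_(?P<library>[0-9]{3})"
)

def parse_sample_id(sample_id, plate_based=False):
    """Split an id into its named segments, or raise ValueError if malformed."""
    pattern = PLATE_ID if plate_based else DROPLET_ID
    match = pattern.fullmatch(sample_id)
    if match is None:
        raise ValueError(f"id does not follow the naming format: {sample_id!r}")
    return match.groupdict()

print(parse_sample_id("OSCC_16_T_S1_L001_001"))
print(parse_sample_id("OSCC_16_T_S1_P001_001", plate_based=True))
```

Because the patient segment may itself contain underscores, the pattern matches it greedily and relies on the fixed shape of the last three segments to find the split point.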

Warning

Whether you are using STARsolo or CellRanger, you can input either plain fastq or compressed fastq.gz files. However, make sure the file format is consistent between fq1 and fq2 on every line of sample.tsv.
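One simple way to catch mixed formats is to compare the file extensions of fq1 and fq2 row by row. The sketch below does only that: it assumes compression can be judged from a ".gz" suffix, it does not open the files, and the helper names are hypothetical.

```python
def is_gzipped(path):
    """Judge compression from the file name only (assumption: .gz suffix)."""
    return path.endswith(".gz")

def inconsistent_rows(rows):
    """Return the ids of rows whose fq1 and fq2 differ in compression."""
    return [
        row["id"]
        for row in rows
        if is_gzipped(row["fq1"]) != is_gzipped(row["fq2"])
    ]

# Placeholder rows; the second one mixes fastq.gz and plain fastq.
example_rows = [
    {"id": "OSCC_16_T_S1_L001_001",
     "fq1": "/path/to/a_R1.fastq.gz", "fq2": "/path/to/a_R2.fastq.gz"},
    {"id": "OSCC_16_T_S1_L002_001",
     "fq1": "/path/to/b_R1.fastq.gz", "fq2": "/path/to/b_R2.fastq"},
]

print(inconsistent_rows(example_rows))
```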