Input files
Overview
MicroCAT currently supports fastq files from raw sequencing data and bam files that have been mapped to the host genome.
For fastq files (from different sequencing platforms), MicroCAT starts with quality control and proceeds with demultiplexing and mapping to the user-provided host genome and microbial reference genome. It outputs gene expression data and microbial taxonomic data for the host sample.
For bam files (output files from tools such as CellRanger, STARsolo, or HISAT2), MicroCAT by default maps the bam files to the microbial reference genome and outputs microbial taxonomic data for the sample.
MicroCAT ultimately outputs microbial taxonomic counts in sparse matrix format and host transcriptome counts (if starting from fastq files). These outputs integrate with the current leading single-cell transcriptome analysis toolkits.
Because it uses STARsolo, MicroCAT can in principle support sequencing data from any single-cell transcriptomics platform, particularly:
droplet-based (e.g., 10X)
well-based (e.g., Seq-Well)
plate-based (e.g., Smart-Seq)
split-pool-based (e.g., SPLiT-seq)
For convenience, we provide the following pre-configured settings for transcriptome analysis. To request support for other technologies or for troubleshooting, please open an issue in our GitHub repository.
Initialize sample metadata file
The simplest usage is to initialize with microcat init -s sample.tsv
The sample.tsv file is a crucial input for the Snakemake workflow that processes the sequencing data. It lists the samples to be processed, their associated data files, and their corresponding directories. Here’s how to prepare it.
The sample.tsv file should be a tab-separated values (TSV) file with the following three columns:
id: The “id” column is key to organizing sequencing files in MicroCAT and must follow a specific naming format. For droplet-based transcriptomic techniques, the naming format is {Patient}_{Tissue}_{Lane}_{Library}. No segment of the name may contain a period (“.”).
{Patient}: the source patient of the sequencing sample. It can be any string and may include multiple underscores, for example OSCC_16_T or sdsadw_dsagew413dxd-231_2134dsccxc.
{Tissue}: the sampled tissue from the patient, named as “S” followed by one or more digits, such as S1 or S23.
{Lane}: the sequencing lane, named as “L” followed by a three-digit number. If you are sequencing only one lane, it can be named L001. If you are performing multi-lane sequencing for a single patient’s tissue, record the different lanes to distinguish them, such as L003 or L005.
{Library}: a three-digit number used to differentiate sequencing library reagents. Typically, the same library preparation scheme is used throughout and is named 001. If you have multiple sequencing libraries, use different numbers to distinguish them, such as 002 or 003.
fq1: The full path to the first fastq file for the sample (usually the file containing the forward reads). When using STARsolo for host transcriptome alignment, the fastq file names can have any format. However, if you are using CellRanger, the fastq file names should follow the format Sample_S1_{Lane}_R1_{library}.fastq.gz (see the Cellranger tutorial), where “Sample” represents the {Patient}_{Tissue} information.
fq2: The full path to the second fastq file for the sample (usually the file containing the reverse reads). When using STARsolo for host transcriptome alignment, the fastq file names can have any format. However, if you are using CellRanger, the fastq file names should follow the format Sample_S1_{Lane}_R2_{library}.fastq.gz (see the Cellranger tutorial), where “Sample” represents the {Patient}_{Tissue} information.
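As an illustration of the three-column layout described above, the following Python sketch renders a small sample.tsv. It rests on assumptions that are not confirmed by this page: that the file starts with a header row naming the columns id, fq1, and fq2, and that the fastq paths (which are hypothetical placeholders) use arbitrary file names, as is allowed when aligning with STARsolo.

```python
import csv
import io

# Hypothetical rows: one patient tissue sequenced on two lanes.
# The paths are placeholders, not real files.
rows = [
    {
        "id": "OSCC_16_T_S1_L001_001",
        "fq1": "/data/fastq/OSCC_16_T_lane1_R1.fastq.gz",
        "fq2": "/data/fastq/OSCC_16_T_lane1_R2.fastq.gz",
    },
    {
        "id": "OSCC_16_T_S1_L002_001",
        "fq1": "/data/fastq/OSCC_16_T_lane2_R1.fastq.gz",
        "fq2": "/data/fastq/OSCC_16_T_lane2_R2.fastq.gz",
    },
]


def render_sample_tsv(rows):
    """Return the rows as tab-separated text with an id/fq1/fq2 header.

    The header row is an assumption made for this sketch.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["id", "fq1", "fq2"],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Print the table; redirect this output to sample.tsv to save it.
print(render_sample_tsv(rows), end="")
```

Each lane gets its own row, so the two rows share the same {Patient} and {Tissue} segments and differ only in the {Lane} segment.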
For plate-based transcriptomic techniques, the naming format is {Patient}_{Tissue}_{Plate}_{Library}. In this case, the “Lane” segment is replaced by a “Plate” segment, which gives the sequencing plate number. The plate name is “P” followed by a three-digit number. If you are sequencing only one plate, it can be named P001.
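The naming rules for the id column can be checked before running the workflow. The sketch below is not part of MicroCAT; parse_sample_id is a hypothetical helper that encodes the rules as stated on this page in a single regular expression: a free-form patient segment without periods, “S” plus digits for the tissue, “L” or “P” plus three digits for the lane or plate, and a three-digit library number.

```python
import re

# Patient: any text without a period (underscores allowed).
# Tissue: "S" + one or more digits. Lane/plate: "L" or "P" + three digits.
# Library: exactly three digits.
ID_PATTERN = re.compile(
    r"(?P<patient>[^.]+)_(?P<tissue>S[0-9]+)_(?P<unit>[LP][0-9]{3})_(?P<library>[0-9]{3})"
)


def parse_sample_id(sample_id):
    """Split an id into its segments or raise ValueError if it is malformed."""
    match = ID_PATTERN.fullmatch(sample_id)
    if match is None:
        raise ValueError(f"id does not follow the naming format: {sample_id!r}")
    fields = match.groupdict()
    unit = fields.pop("unit")
    # Report the third segment as a lane (droplet-based) or a plate (plate-based).
    fields["lane" if unit.startswith("L") else "plate"] = unit
    return fields


print(parse_sample_id("OSCC_16_T_S1_L001_001"))
print(parse_sample_id("OSCC_16_T_S23_P001_002"))
```

Because the patient segment may itself contain underscores, the pattern matches it greedily and lets the last three segments anchor the split.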
Warning
Whether you are using STARsolo or CellRanger, you can provide either uncompressed fastq files or compressed fastq.gz files. However, make sure that the file format is consistent between fq1 and fq2 on each line.
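One way to catch mixed formats early is a small pre-flight check. The sketch below is an assumption about what “consistent” means here: it flags rows where one of fq1 and fq2 ends in .gz and the other does not. It inspects file names only and does not open the files; find_mixed_rows and the example rows are hypothetical.

```python
def is_gzipped(path):
    """Guess compression from the file name suffix only (no file is opened)."""
    return path.endswith(".gz")


def find_mixed_rows(rows):
    """Return the ids of rows whose fq1 and fq2 differ in compression."""
    return [
        row["id"]
        for row in rows
        if is_gzipped(row["fq1"]) != is_gzipped(row["fq2"])
    ]


# Hypothetical rows: the second one mixes fastq.gz and fastq.
rows = [
    {"id": "A_S1_L001_001", "fq1": "/data/A_R1.fastq.gz", "fq2": "/data/A_R2.fastq.gz"},
    {"id": "B_S1_L001_001", "fq1": "/data/B_R1.fastq.gz", "fq2": "/data/B_R2.fastq"},
]

print(find_mixed_rows(rows))
```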