A Small Example of MicroCAT
This guide will walk you through the basic steps to set up and run a MicroCAT analysis pipeline.
1. Prerequisites
- MicroCAT installed: Ensure MicroCAT is installed and the microcat command is available in your terminal.
- Internet connection: Required for downloading necessary files:
  - barcode whitelists
  - Kraken2 database and host reference genome
  - profiles for cluster execution
- Common tools: wget and tar for downloading and extracting files.
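The tool prerequisites above can be checked from a script before starting. The following is a minimal sketch that only looks for the three executables on PATH; the helper name missing_tools is made up for this illustration and is not part of MicroCAT.

```python
import shutil

# Executables named in the prerequisites above.
REQUIRED_TOOLS = ("microcat", "wget", "tar")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All prerequisite tools were found on PATH.")
```

Finding an executable on PATH does not prove that it works, so the microcat --help check is still worth running.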
If you enter microcat --help in the terminal and the following information is displayed, MicroCAT has been installed successfully:
!microcat --help
Usage: microcat [OPTIONS] COMMAND [ARGS]...
███╗ ███╗██╗ ██████╗██████╗ ██████╗ ██████╗ █████╗ ████████╗
████╗ ████║██║██╔════╝██╔══██╗██╔═══██╗██╔════╝██╔══██╗╚══██╔══╝
██╔████╔██║██║██║ ██████╔╝██║ ██║██║ ███████║ ██║
██║╚██╔╝██║██║██║ ██╔══██╗██║ ██║██║ ██╔══██║ ██║
██║ ╚═╝ ██║██║╚██████╗██║ ██║╚██████╔╝╚██████╗██║ ██║ ██║
╚═╝ ╚═╝╚═╝ ╚═════╝╚═╝ ╚═╝ ╚═════╝ ╚═════╝╚═╝ ╚═╝ ╚═╝
Microbiome Identification upon Cell Resolution from Omics-
Computational Analysis Toolbox
Options:
-v, --version Show the version and exit.
-h, --help Show this message and exit.
Commands:
config Quickly adjust microcat's default configurations
debug Execute the analysis workflow on debug mode.
download Download necessary files for running microcat
init Init microcat style analysis project
path Print out microcat install path
run-local Execute the analysis workflow on local computer mode
run-remote Execute the analysis workflow on remote cluster mode
In this tutorial, we mainly use microcat to analyze 10x single-cell RNA-seq data.
2. Initial Setup (One-time or Infrequent)
These steps configure MicroCAT with essential data it needs.
a. Download Barcode Whitelists: MicroCAT uses barcode whitelist files for processing single-cell sequencing data with tools like STARsolo. Download the standard set of whitelists by running:
microcat download whitelist
This command downloads and stores the whitelist files in a location managed by the MicroCAT package, making them available for project initialization.
b. Download and Configure Kraken2 Database: You’ll need a Kraken2 database for taxonomic screening. The following commands download the specified database and configure MicroCAT to use it.
Create a directory for your databases (if you don't have one already). Replace /path/to/your/databases/ with your preferred location:
mkdir -p /path/to/your/databases/kraken2_dbs
Download the Kraken2 database:
wget https://genome-idx.s3.amazonaws.com/kraken/k2_minusb_20250402.tar.gz -P /path/to/your/databases/kraken2_dbs/
Extract the database:
tar -xvzf /path/to/your/databases/kraken2_dbs/k2_minusb_20250402.tar.gz -C /path/to/your/databases/kraken2_dbs/
This will create a directory named k2_minusb_20250402 (or similar, depending on the archive structure) inside /path/to/your/databases/kraken2_dbs/. The actual database files (hash.k2d, opts.k2d, taxo.k2d) will be within this extracted folder.
Update MicroCAT's configuration to point to this database. Make sure to use the path to the directory containing the database files (e.g., /path/to/your/databases/kraken2_dbs/k2_minusb_20250402):
microcat config --krak2_ref /path/to/your/databases/kraken2_dbs/k2_minusb_20250402
This command updates MicroCAT's template configuration files. New projects initialized hereafter will use this database path by default. You can similarly configure paths for other reference genomes (e.g., --starsolo_ref, --cellranger_ref) if needed.
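Because the location of the extracted files depends on the archive structure, it can help to locate the directory that actually holds the three .k2d files before running microcat config. The sketch below walks a root folder and reports every directory that contains all of them; the function name find_kraken2_db_dirs is hypothetical.

```python
import os

# Core files of a Kraken2 database, as listed above.
KRAKEN2_FILES = {"hash.k2d", "opts.k2d", "taxo.k2d"}


def find_kraken2_db_dirs(root):
    """Return every directory under `root` that holds all three .k2d files."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if KRAKEN2_FILES.issubset(filenames):
            hits.append(dirpath)
    return sorted(hits)
```

A directory returned by find_kraken2_db_dirs("/path/to/your/databases/kraken2_dbs") is the value to pass to microcat config --krak2_ref.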
c. (Optional) Download Cluster Profiles: If you plan to run MicroCAT on a cluster (e.g., Slurm, LSF, SGE), download the relevant Snakemake profile:
microcat download profile --cluster your_cluster_type
For example, for Slurm:
microcat download profile --cluster slurm
This downloads profile configurations to ~/.config/snakemake/.
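To confirm which profiles are available after the download, the profile directory can be listed. This sketch assumes the usual Snakemake layout, in which each profile is a subdirectory containing a config.yaml; whether MicroCAT's downloaded profiles follow exactly that layout is an assumption, and list_snakemake_profiles is a hypothetical helper.

```python
from pathlib import Path


def list_snakemake_profiles(config_dir=None):
    """Return names of subdirectories that contain a config.yaml file."""
    # Default location taken from the text above: ~/.config/snakemake/
    base = Path(config_dir) if config_dir else Path.home() / ".config" / "snakemake"
    if not base.is_dir():
        return []
    return sorted(p.name for p in base.iterdir() if (p / "config.yaml").is_file())


if __name__ == "__main__":
    print(list_snakemake_profiles())
```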
3. Download the data
Here we use a single-cell RNA-seq dataset extracted from a Salmonella infection sample to demonstrate the MicroCAT analysis process. The FASTQ data (54 MB) can be downloaded from zenodo-microcat_10x_example.
We create a project folder with a data subfolder for storing the original FASTQ data:
mkdir -p MySingleCellProject/data
Extract the downloaded FASTQ archive into the data folder:
tar -xvzf microcat_10x_example.tar.gz -C MySingleCellProject/data/
%ls /data/comics-sucx/microcat_test/data/microcat_10x_example/
GSM3454529_S1_L001_R1_001.fastq.gz GSM3454529_S1_L003_R1_001.fastq.gz
GSM3454529_S1_L001_R2_001.fastq.gz GSM3454529_S1_L003_R2_001.fastq.gz
GSM3454529_S1_L002_R1_001.fastq.gz GSM3454529_S1_L004_R1_001.fastq.gz
GSM3454529_S1_L002_R2_001.fastq.gz GSM3454529_S1_L004_R2_001.fastq.gz
There are 8 FASTQ files in total: read 1 and read 2 for each of lanes 1-4 of GSM3454529_S1.
We construct a tab-separated values (TSV) file named sample.tsv. This file lists your input samples and their corresponding data files.
For FASTQ input: Required columns: id, fq1, fq2 (column fq2 is for paired-end reads; omit it or leave it empty for single-end). Example sample.tsv for paired-end FASTQ:
id	fq1	fq2
PatientA_TumorS1_L001_Lib1	/full/path/to/data/PatientA_L001_R1.fastq.gz	/full/path/to/data/PatientA_L001_R2.fastq.gz
PatientB_NormalS2_L001_Lib1	/full/path/to/data/PatientB_L001_R1.fastq.gz	/full/path/to/data/PatientB_L001_R2.fastq.gz
For BAM input (e.g., from Cell Ranger for single-cell): Required columns: id, bam, mtx. Example sample.tsv for BAM input:
id	bam	mtx
PatientC_TumorS3_P001_Lib1	/full/path/to/cellranger_output/possorted_genome_bam.bam	/full/path/to/cellranger_output/filtered_feature_bc_matrix/
Important notes for sample.tsv:
- The id column should ideally follow a consistent format, often {Patient}_{Tissue}_{LaneOrPlate}_{Library}.
- Ensure all file paths (fq1, fq2, bam, mtx) are absolute paths or paths relative to where the pipeline will be executed, and that these files are accessible.
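The notes above can be turned into a quick pre-flight check. The following is a rough sketch, not MicroCAT's own validation: it checks the required columns, treats the four-field id format as a hard rule (the text only calls it a common convention), and tests that every listed path exists. The function name validate_sample_sheet is made up for this example.

```python
import csv
import os

REQUIRED_FASTQ = ("id", "fq1")        # fq2 is only needed for paired-end reads
REQUIRED_BAM = ("id", "bam", "mtx")
PATH_COLUMNS = ("fq1", "fq2", "bam", "mtx")


def validate_sample_sheet(tsv_path):
    """Return a list of human-readable problems found in a sample.tsv."""
    problems = []
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        columns = reader.fieldnames or []
        # A "bam" column is taken to mean BAM input; otherwise assume FASTQ.
        required = REQUIRED_BAM if "bam" in columns else REQUIRED_FASTQ
        for name in required:
            if name not in columns:
                problems.append(f"missing required column: {name}")
        for line_no, row in enumerate(reader, start=2):
            sample_id = row.get("id") or ""
            # Assumed convention: {Patient}_{Tissue}_{LaneOrPlate}_{Library}
            if len(sample_id.split("_")) != 4:
                problems.append(
                    f"line {line_no}: id '{sample_id}' does not have 4 '_'-separated fields"
                )
            for col in PATH_COLUMNS:
                value = row.get(col)
                if value and not os.path.exists(value):
                    problems.append(f"line {line_no}: {col} not found: {value}")
    return problems
```

An empty list means no problems were detected. Relative paths are resolved against the current working directory, so the check should be run from the directory where the pipeline will be executed.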
import pandas as pd
df = pd.read_csv('/data/comics-sucx/microcat_test/sample.tsv', sep='\t')
df
 | id | fq1 | fq2 |
---|---|---|---|
0 | GSM3454529_S1_L001_001 | /data/comics-sucx/microcat_test/data/microcat_... | /data/comics-sucx/microcat_test/data/microcat_... |
1 | GSM3454529_S1_L002_001 | /data/comics-sucx/microcat_test/data/microcat_... | /data/comics-sucx/microcat_test/data/microcat_... |
2 | GSM3454529_S1_L003_001 | /data/comics-sucx/microcat_test/data/microcat_... | /data/comics-sucx/microcat_test/data/microcat_... |
3 | GSM3454529_S1_L004_001 | /data/comics-sucx/microcat_test/data/microcat_... | /data/comics-sucx/microcat_test/data/microcat_... |
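A sheet like the one shown above could also be generated from the FASTQ directory instead of being written by hand. The sketch below assumes file names of the form {prefix}_L00N_R1_001.fastq.gz, as in the listing earlier, and builds ids of the form {prefix}_L00N_001 to match the table; the function names and the regular expression are illustrative and not part of MicroCAT.

```python
import csv
import re
from pathlib import Path

# Assumed naming pattern, e.g. GSM3454529_S1_L001_R1_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<prefix>.+)_(?P<lane>L\d{3})_(?P<read>R[12])_(?P<chunk>\d{3})\.fastq\.gz$"
)


def build_rows(fastq_dir):
    """Pair R1/R2 files per lane and return sorted (id, fq1, fq2) rows."""
    pairs = {}
    for path in sorted(Path(fastq_dir).glob("*.fastq.gz")):
        match = FASTQ_NAME.match(path.name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        sample_id = "{prefix}_{lane}_{chunk}".format(**match.groupdict())
        pairs.setdefault(sample_id, {})[match["read"]] = str(path.resolve())
    return [
        (sample_id, reads.get("R1", ""), reads.get("R2", ""))
        for sample_id, reads in sorted(pairs.items())
    ]


def write_sample_sheet(rows, tsv_path):
    """Write rows to a tab-separated file with the id/fq1/fq2 header."""
    with open(tsv_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["id", "fq1", "fq2"])
        writer.writerows(rows)
```

For the example data, write_sample_sheet(build_rows("data/microcat_10x_example"), "sample.tsv") would produce one row per lane with absolute paths in fq1 and fq2.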
4. Create and Configure a New Project
Now, let’s set up a specific analysis project.
a. Initialize Project:
Navigate to the directory where you want to create your new project. Then, run the microcat init command.
For a single-cell RNA-seq project (e.g., using 10x Genomics 3’ v3 chemistry):
microcat init single --project /data/comics-sucx/microcat_test -s /data/comics-sucx/microcat_test/sample.tsv --chemistry tenx_3pv3
- --project <project_name> (e.g., /data/comics-sucx/microcat_test) defines the name of the new directory that will be created.
- -s <sample_tsv> (here, /data/comics-sucx/microcat_test/sample.tsv) points to the sample sheet built in Step 3.
- --chemistry <chemistry_name> is crucial for single-cell workflows to correctly configure alignment parameters. You can find available chemistries in MicroCAT's chemistry_defs.json file or documentation.
- --host <host_aligner> specifies the host alignment tool (e.g., starsolo, cellranger). It is not set in the command above.
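When several projects are initialized from a script, it can be convenient to assemble the init call programmatically. The sketch below only builds the argument list from the options described above and prints it; build_init_command is a hypothetical helper, and nothing is executed.

```python
import shlex


def build_init_command(project, sample_tsv, chemistry, host=None):
    """Assemble a `microcat init single` call as an argv list."""
    command = [
        "microcat", "init", "single",
        "--project", str(project),
        "-s", str(sample_tsv),
        "--chemistry", chemistry,
    ]
    if host is not None:
        command += ["--host", host]  # e.g. "starsolo" or "cellranger"
    return command


command = build_init_command(
    "/data/comics-sucx/microcat_test",
    "/data/comics-sucx/microcat_test/sample.tsv",
    "tenx_3pv3",
)
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when a path contains spaces.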
This command creates a new project directory, or modifies an existing one (e.g., /data/comics-sucx/microcat_test). The directory contains a config.yaml file (pre-filled based on the global settings you configured in Step 2 and the options you provided to init) and other necessary subdirectories such as results/, logs/, and envs/.
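A quick way to confirm that initialization worked is to check for the entries named above. This sketch looks only for config.yaml, results/, logs/, and envs/; a real project may contain more, and the helper name missing_project_entries is made up for illustration.

```python
from pathlib import Path

# Entries named in the text above; a real project may contain more.
EXPECTED_FILES = ("config.yaml",)
EXPECTED_DIRS = ("results", "logs", "envs")


def missing_project_entries(project_dir):
    """Return the expected files and directories that are absent."""
    root = Path(project_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing
```

An empty list from missing_project_entries("/data/comics-sucx/microcat_test") suggests the project skeleton is in place.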
b. (Optional) Customize Project Configuration:
Open the config.yaml file located directly within your project directory (e.g., MySingleCellProject/config.yaml).
You can review and customize various parameters, such as:
- The starting step of the pipeline (params: begin:)
- Tool-specific options for alignment, classification, etc.
- Resource allocations for specific rules (though often managed by profiles for cluster execution).
5. Run the MicroCAT Pipeline
Once your project is set up and configured:
Navigate into your project directory:
cd path/project
Execute the pipeline:
For a local run (on your current machine):
microcat run-local
This command uses the config.yaml and sample.tsv from the current directory.
For a cluster run (if you downloaded profiles): Specify the workflow and the name of the profile you want to use. For example, to run a single-cell workflow using a Slurm profile (assuming a profile named generic exists, typically at path/project/.profile/generic):
microcat run-remote --cluster-engine generic
6. Output
Pipeline results are generated in the results/ subdirectory within your project folder. Log files for each step can be found in the logs/ subdirectory.
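The run and output steps above can be wrapped in a small script. This is a sketch under two assumptions: that microcat run-local is started from inside the project directory, as described above, and that outputs land in results/ and logs/. The function names are hypothetical, and the runner argument exists only so the call can be swapped out for a dry run.

```python
import subprocess
from pathlib import Path


def run_local(project_dir, runner=subprocess.run):
    """Run `microcat run-local` with the project directory as working directory."""
    return runner(["microcat", "run-local"], cwd=str(project_dir), check=True)


def list_outputs(project_dir):
    """Return relative paths of all files under results/ and logs/."""
    root = Path(project_dir)
    found = {}
    for name in ("results", "logs"):
        folder = root / name
        if folder.is_dir():
            found[name] = sorted(
                str(p.relative_to(root)) for p in folder.rglob("*") if p.is_file()
            )
        else:
            found[name] = []
    return found
```

Calling run_local("path/project") followed by list_outputs("path/project") gives a quick overview of what a run produced. Because check=True is set, a failing pipeline raises subprocess.CalledProcessError instead of passing silently.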