A Small Example of MicroCAT

This guide will walk you through the basic steps to set up and run a MicroCAT analysis pipeline.

1. Prerequisites

  • MicroCAT Installed: Ensure MicroCAT is installed and the microcat command is available in your terminal.

  • Internet Connection: Required for downloading the necessary files:

    • barcode whitelists

    • Kraken2 database and host reference genome

    • profiles for cluster execution

  • Common Tools: wget and tar for downloading and extracting files.
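The prerequisites above can also be checked programmatically. The following is a minimal sketch, assuming the three commands named in this list (microcat, wget, tar) are expected on your PATH; the helper name find_missing_tools is hypothetical and not part of MicroCAT.

```python
import shutil


def find_missing_tools(tools=("microcat", "wget", "tar")):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required command-line tools were found.")
```

An empty result only shows that the executables are discoverable; it does not verify their versions.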

To confirm the installation, run microcat --help in the terminal. If the following help text is displayed, MicroCAT has been installed successfully:

!microcat --help
Usage: microcat [OPTIONS] COMMAND [ARGS]...

          ███╗   ███╗██╗ ██████╗██████╗  ██████╗  ██████╗ █████╗ ████████╗
          ████╗ ████║██║██╔════╝██╔══██╗██╔═══██╗██╔════╝██╔══██╗╚══██╔══╝
          ██╔████╔██║██║██║     ██████╔╝██║   ██║██║     ███████║   ██║
          ██║╚██╔╝██║██║██║     ██╔══██╗██║   ██║██║     ██╔══██║   ██║
          ██║ ╚═╝ ██║██║╚██████╗██║  ██║╚██████╔╝╚██████╗██║  ██║   ██║
          ╚═╝     ╚═╝╚═╝ ╚═════╝╚═╝  ╚═╝ ╚═════╝  ╚═════╝╚═╝  ╚═╝   ╚═╝
          Microbiome Identification upon Cell Resolution from Omics-
          Computational Analysis Toolbox

Options:
  -v, --version  Show the version and exit.
  -h, --help     Show this message and exit.

Commands:
  config      Quickly adjust microcat's default configurations
  debug       Execute the analysis workflow on debug mode.
  download    Download necessary files for running microcat
  init        Init microcat style analysis project
  path        Print out microcat install path
  run-local   Execute the analysis workflow on local computer mode
  run-remote  Execute the analysis workflow on remote cluster mode

In this tutorial, we mainly use microcat to analyze 10x single-cell RNA-seq data.

2. Initial Setup (One-time or Infrequent)

These steps configure MicroCAT with essential data it needs.

a. Download Barcode Whitelists: MicroCAT uses barcode whitelist files for processing single-cell sequencing data with tools like STARsolo. Download the standard set of whitelists by running:

microcat download whitelist

This command downloads and stores the whitelist files in a location managed by the MicroCAT package, making them available for project initialization.

b. Download and Configure Kraken2 Database: You’ll need a Kraken2 database for taxonomic screening. The following commands download the specified database and configure MicroCAT to use it.

  1. Create a directory for your databases (if you don’t have one already). Replace /path/to/your/databases/ with your preferred location:

    mkdir -p /path/to/your/databases/kraken2_dbs
    
  2. Download the Kraken2 database:

    wget https://genome-idx.s3.amazonaws.com/kraken/k2_minusb_20250402.tar.gz -P /path/to/your/databases/kraken2_dbs/
    
  3. Extract the database:

    tar -xvzf /path/to/your/databases/kraken2_dbs/k2_minusb_20250402.tar.gz -C /path/to/your/databases/kraken2_dbs/
    

    This will create a directory named k2_minusb_20250402 (or similar, depending on the archive structure) inside /path/to/your/databases/kraken2_dbs/. The actual database files (hash.k2d, opts.k2d, taxo.k2d) will be within this extracted folder.

  4. Update MicroCAT’s configuration to point to this database: Make sure to use the path to the directory containing the database files (e.g., /path/to/your/databases/kraken2_dbs/k2_minusb_20250402).

    microcat config --krak2_ref /path/to/your/databases/kraken2_dbs/k2_minusb_20250402
    

    This command updates MicroCAT’s template configuration files. New projects initialized hereafter will use this database path by default. You can similarly configure paths for other reference genomes (e.g., --starsolo_ref, --cellranger_ref) if needed.
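Because the configured path must be the directory that directly contains the database files, it can be worth checking that directory before starting a project. This is a minimal sketch that looks only for the three file names mentioned in step 3 (hash.k2d, opts.k2d, taxo.k2d); the helper name missing_kraken2_files is hypothetical, and the path is the placeholder used throughout this guide.

```python
from pathlib import Path

# File names listed in the extraction step of this guide.
REQUIRED_K2_FILES = ("hash.k2d", "opts.k2d", "taxo.k2d")


def missing_kraken2_files(db_dir):
    """Return the required Kraken2 files that are absent from `db_dir`."""
    db_path = Path(db_dir)
    return [name for name in REQUIRED_K2_FILES if not (db_path / name).is_file()]


if __name__ == "__main__":
    # Placeholder path from this guide; replace it with your own location.
    db_dir = "/path/to/your/databases/kraken2_dbs/k2_minusb_20250402"
    missing = missing_kraken2_files(db_dir)
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Database directory looks complete.")
```

If files are reported missing, the archive may have extracted into a nested folder, in which case the nested folder is the path to pass to microcat config --krak2_ref.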

c. (Optional) Download Cluster Profiles: If you plan to run MicroCAT on a cluster (e.g., Slurm, LSF, SGE), download the relevant Snakemake profile:

microcat download profile --cluster your_cluster_type

For example, for Slurm:

microcat download profile --cluster slurm

This downloads profile configurations to ~/.config/snakemake/.

3. Download the Data

Here we use a single-cell RNA-seq dataset extracted from a Salmonella infection sample to demonstrate the MicroCAT analysis process. The FASTQ data (54 MB) can be downloaded from zenodo-microcat_10x_example.

We create a project folder with a data subfolder for storing the original FASTQ data:

mkdir -p MySingleCellProject/data

Extract the downloaded FASTQ archive into the data folder:

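tar -xvzf microcat_10x_example.tar.gz -C MySingleCellProject/data/

Listing the extracted folder shows the FASTQ files (in the outputs shown in this guide, the project lives at /data/comics-sucx/microcat_test):

%ls /data/comics-sucx/microcat_test/data/microcat_10x_example/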
GSM3454529_S1_L001_R1_001.fastq.gz  GSM3454529_S1_L003_R1_001.fastq.gz
GSM3454529_S1_L001_R2_001.fastq.gz  GSM3454529_S1_L003_R2_001.fastq.gz
GSM3454529_S1_L002_R1_001.fastq.gz  GSM3454529_S1_L004_R1_001.fastq.gz
GSM3454529_S1_L002_R2_001.fastq.gz  GSM3454529_S1_L004_R2_001.fastq.gz

There are 8 files in total: read 1 and read 2 for each of lanes 1-4 of GSM3454529_S1. We construct a tab-separated values (TSV) file named sample.tsv, with one row per lane. This file lists your input samples and their corresponding data files.

  • For FASTQ input: Required columns: id, fq1, fq2 (column fq2 is for paired-end reads; omit or leave empty for single-end). Example sample.tsv for paired-end FASTQ:

    id	fq1	fq2
    PatientA_TumorS1_L001_Lib1	/full/path/to/data/PatientA_L001_R1.fastq.gz	/full/path/to/data/PatientA_L001_R2.fastq.gz
    PatientB_NormalS2_L001_Lib1	/full/path/to/data/PatientB_L001_R1.fastq.gz	/full/path/to/data/PatientB_L001_R2.fastq.gz
    
  • For BAM input (e.g., from Cell Ranger for single-cell): Required columns: id, bam, mtx. Example sample.tsv for BAM input:

    id	bam	mtx
    PatientC_TumorS3_P001_Lib1	/full/path/to/cellranger_output/possorted_genome_bam.bam	/full/path/to/cellranger_output/filtered_feature_bc_matrix/
    

Important notes for sample.tsv:

  • The id column should ideally follow a consistent format, often {Patient}_{Tissue}_{LaneOrPlate}_{Library}.

  • Ensure all file paths (fq1, fq2, bam, mtx) are absolute paths or paths relative to where the pipeline will be executed, and that these files are accessible.
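For FASTQ input, the sample sheet can be generated rather than typed by hand. The sketch below assumes file names follow the pattern of the example dataset (<prefix>_R1_<number>.fastq.gz with a matching _R2_ file) and builds each id by dropping the read token, which is an assumption chosen to stay close to the id format described above. The function names collect_fastq_pairs and write_sample_tsv are hypothetical helpers, not MicroCAT commands.

```python
import csv
import re
from pathlib import Path

# Assumed naming pattern, modeled on the example files in this guide.
R1_PATTERN = re.compile(r"^(?P<prefix>.+)_R1_(?P<suffix>\d+)\.fastq\.gz$")


def collect_fastq_pairs(fastq_dir):
    """Return one {'id', 'fq1', 'fq2'} row per R1 file, using absolute paths."""
    rows = []
    for r1 in sorted(Path(fastq_dir).glob("*_R1_*.fastq.gz")):
        match = R1_PATTERN.match(r1.name)
        if match is None:
            continue  # skip files that do not fit the assumed pattern
        prefix, suffix = match["prefix"], match["suffix"]
        r2 = r1.with_name(f"{prefix}_R2_{suffix}.fastq.gz")
        if not r2.is_file():
            raise FileNotFoundError(f"No R2 mate found for {r1.name}")
        rows.append({
            "id": f"{prefix}_{suffix}",
            "fq1": str(r1.resolve()),
            "fq2": str(r2.resolve()),
        })
    return rows


def write_sample_tsv(rows, out_path):
    """Write the rows as a tab-separated file with an id/fq1/fq2 header."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["id", "fq1", "fq2"],
            delimiter="\t", lineterminator="\n",
        )
        writer.writeheader()
        writer.writerows(rows)
```

Pointing collect_fastq_pairs at the folder that holds the eight example files and passing the result to write_sample_tsv would give four rows, one per lane.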

import pandas as pd
df = pd.read_csv('/data/comics-sucx/microcat_test/sample.tsv', sep='\t')
df
                       id                                                fq1                                                fq2
0  GSM3454529_S1_L001_001  /data/comics-sucx/microcat_test/data/microcat_...  /data/comics-sucx/microcat_test/data/microcat_...
1  GSM3454529_S1_L002_001  /data/comics-sucx/microcat_test/data/microcat_...  /data/comics-sucx/microcat_test/data/microcat_...
2  GSM3454529_S1_L003_001  /data/comics-sucx/microcat_test/data/microcat_...  /data/comics-sucx/microcat_test/data/microcat_...
3  GSM3454529_S1_L004_001  /data/comics-sucx/microcat_test/data/microcat_...  /data/comics-sucx/microcat_test/data/microcat_...
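Since the pipeline depends on the required columns being present and every listed file being accessible, a quick check of sample.tsv before initializing the project can catch path mistakes early. This is a minimal sketch using only the column rules stated in this section; check_sample_sheet is a hypothetical helper, and MicroCAT may perform its own, different validation.

```python
import csv
from pathlib import Path


def check_sample_sheet(tsv_path):
    """Return a list of problems found in a FASTQ- or BAM-style sample.tsv."""
    problems = []
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        columns = set(reader.fieldnames or [])
        if {"id", "fq1"} <= columns:
            # fq2 is optional: it may be omitted or left empty for single-end data.
            path_columns = ["fq1", "fq2"] if "fq2" in columns else ["fq1"]
        elif {"id", "bam", "mtx"} <= columns:
            path_columns = ["bam", "mtx"]
        else:
            return [f"Unexpected columns: {sorted(columns)}"]
        for line_no, row in enumerate(reader, start=2):
            for column in path_columns:
                value = row.get(column) or ""
                if not value:
                    if column != "fq2":
                        problems.append(f"line {line_no}: {column} is empty")
                elif not Path(value).exists():
                    problems.append(f"line {line_no}: {column} not found: {value}")
    return problems
```

An empty list means every referenced path exists from the current working directory, so relative paths should be checked from the directory where the pipeline will run.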

4. Create and Configure a New Project

Now, let’s set up a specific analysis project.

a. Initialize Project: Navigate to the directory where you want to create your new project. Then, run the microcat init command.

  • For a single-cell RNA-seq project (e.g., using 10x Genomics 3’ v3 chemistry):

    microcat init single --project /data/comics-sucx/microcat_test -s /data/comics-sucx/microcat_test/sample.tsv --chemistry tenx_3pv3
    
    • --project <project_name> (e.g., /data/comics-sucx/microcat_test) defines the project directory to create, or to reuse if it already exists.

    • -s <sample_tsv> points to the sample.tsv file prepared in Step 3.

    • --chemistry <chemistry_name> is crucial for single-cell workflows to correctly configure alignment parameters. You can find available chemistries in MicroCAT’s chemistry_defs.json file or documentation.

    • --host <host_aligner> specifies the host alignment tool (e.g., starsolo, cellranger). The example command above omits it.

This command creates a new project directory, or modifies an existing one (e.g., /data/comics-sucx/microcat_test). The directory contains a config.yaml file, pre-filled from the global settings you configured in Step 2 and the options you passed to init, along with other necessary subdirectories such as results/, logs/, and envs/.

b. (Optional) Customize Project Configuration: Open the config.yaml file located directly within your project directory (e.g., MySingleCellProject/config.yaml). You can review and customize various parameters, such as:

  • The starting step of the pipeline (the begin key under params).

  • Tool-specific options for alignment, classification, etc.

  • Resource allocations for specific rules (though often managed by profiles for cluster execution).

5. Run the MicroCAT Pipeline

Once your project is set up and configured:

  1. Navigate into your project directory (replace path/project with the path to your project):

    cd path/project
    
  2. Execute the pipeline:

    • For a local run (on your current machine):

      microcat run-local
      

      This command uses the config.yaml and sample.tsv from the current directory.

    • For a cluster run (if you downloaded profiles): Specify the name of the profile you want to use. For example, to run the single-cell workflow on Slurm with a profile named generic (assuming such a profile exists, typically at path/project/.profile/generic):

      microcat run-remote --cluster-engine generic
      

6. Output

Pipeline results will be generated in the results/ subdirectory within your project folder. Log files for each step can be found in the logs/ subdirectory.
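When a run stops unexpectedly, the most recently written log files are usually the first place to look. The following is a minimal sketch that lists the newest files under a directory such as logs/; the helper name newest_files is hypothetical, and it makes no assumptions about how MicroCAT names its log files.

```python
from pathlib import Path


def newest_files(directory, limit=5):
    """Return up to `limit` files under `directory`, most recently modified first."""
    root = Path(directory)
    if not root.is_dir():
        return []
    files = [path for path in root.rglob("*") if path.is_file()]
    files.sort(key=lambda path: path.stat().st_mtime, reverse=True)
    return files[:limit]


if __name__ == "__main__":
    # Run from inside the project directory so that logs/ resolves correctly.
    for path in newest_files("logs"):
        print(path)
```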