Databases for MicroCAT Analysis


MicroCAT relies on various reference databases for its analyses, primarily for host genome alignment and metagenomic classification. This section guides you through downloading common databases essential for running MicroCAT pipelines.

1. Host Genome References (for Cell Ranger / STARsolo)

To align sequencing reads to a host genome (e.g., human, mouse), you need a reference genome package. These packages are built around an index for the STAR aligner: Cell Ranger aligns with STAR and its mkref utility generates the index, while STARsolo is a mode of STAR itself. 10x Genomics provides pre-built reference packages that are compatible.

a. Downloading 10x Genomics References:

Visit the 10x Genomics Downloads page to find the latest available reference genomes. Look for references suitable for Cell Ranger (e.g., for human GRCh38, mouse mm10).

Download the desired reference tarball. The exact filename and link will be on the 10x Genomics site. For example, a common human reference might be named similarly to refdata-gex-GRCh38-2020-A.tar.gz.

```
# Example only: copy the current download link and filename from the 10x Genomics website.
wget https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz
```

Extract the downloaded reference archive:
```
tar -xzvf refdata-gex-GRCh38-2020-A.tar.gz
```
This command will create a new directory (e.g., refdata-gex-GRCh38-2020-A) containing all the necessary reference files. (Source: 10x Genomics Cell Ranger Installation Tutorial)
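A quick sanity check after extraction can catch a truncated download early. The sketch below assumes the extracted package contains `fasta`, `genes`, and `star` directories plus a `reference.json` file, which matches the layout of the 2020-A packages; other releases may differ. The function name `check_10x_ref` is hypothetical and requires bash.

```shell
# check_10x_ref DIR: succeed only if DIR contains the entries we expect.
# Assumption: the expected entries follow the layout of the 2020-A packages.
check_10x_ref() {
  local ref_dir="$1" entry missing=0
  for entry in fasta genes star reference.json; do
    if [ ! -e "$ref_dir/$entry" ]; then
      echo "missing: $ref_dir/$entry" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_10x_ref refdata-gex-GRCh38-2020-A && echo "reference looks complete"` prints the message only when all four entries are present.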

b. Configuring MicroCAT for Host References:

After downloading and extracting the host reference, you need to inform MicroCAT of its location. The specific configuration key might depend on the aligner MicroCAT is set to use (e.g., STARsolo or Cell Ranger).

If your MicroCAT workflow uses STARsolo for alignment:

```
microcat config --starsolo_ref /path/to/your/refdata-gex-GRCh38-2020-A
```

If your workflow uses Cell Ranger references (consult MicroCAT documentation for the precise configuration key if it differs):
```
microcat config --cellranger_ref /path/to/your/refdata-gex-GRCh38-2020-A
```
Replace /path/to/your/ with the absolute path to the extracted reference directory (e.g., /opt/refs/refdata-gex-GRCh38-2020-A).
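Since the configuration expects an absolute path, it can help to resolve the path in the shell rather than typing it by hand. This is a minimal sketch; `abs_dir` is a hypothetical helper, and the commented usage line reuses the `--starsolo_ref` flag shown above.

```shell
# abs_dir DIR: print the absolute, symlink-resolved path of an existing directory.
abs_dir() {
  if [ ! -d "$1" ]; then
    echo "not a directory: $1" >&2
    return 1
  fi
  # Run in a subshell so the caller's working directory is unchanged.
  (cd "$1" && pwd -P)
}

# Usage (not run here):
# ref=$(abs_dir ./refdata-gex-GRCh38-2020-A) && microcat config --starsolo_ref "$ref"
```

Because `abs_dir` fails on a missing directory, the `&&` in the usage line prevents a mistyped path from being written into the configuration.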

2. Kraken2 Databases (for Metagenomic Classification)

Kraken2 is a tool used for the taxonomic classification of metagenomic reads, helping to identify microbial and viral sequences within your samples. Building Kraken2 databases from scratch can be computationally intensive. Fortunately, pre-built databases are available, notably from Ben Langmead’s group, hosted on AWS. (Source: benlangmead.github.io/aws-indexes/k2)

a. Downloading Recommended Kraken2 Databases:

We recommend the PlusPF database for comprehensive metagenomic analysis, including detection of protozoa and fungi. According to the Kraken 2 / Bracken Refseq indexes page, PlusPF (for example, the k2_pluspf_20250402 release) contains the Standard collection (RefSeq archaea, bacteria, viral, plasmid, human, UniVec_Core) plus RefSeq protozoa and fungi.

Create a dedicated directory for your Kraken2 databases if you haven’t already:
```
mkdir -p /path/to/your/databases/kraken2_dbs
```
Replace /path/to/your/databases/ with your preferred storage location.

Download the k2_pluspf_20250402.tar.gz database. This archive is approximately 71.8 GB and will expand to about 93.2 GB.

```
# Ensure you use the latest version available on the Kraken 2 / Bracken Refseq indexes page
wget https://genome-idx.s3.amazonaws.com/kraken/k2_pluspf_20250402.tar.gz -P /path/to/your/databases/kraken2_dbs/
```
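Before starting the download, it is worth checking free disk space. Assuming the sizes quoted above, and that the archive and the extracted index coexist until you delete the tarball, the target filesystem needs roughly 71.8 + 93.2 = 165 GB free. The helper below is a hypothetical sketch built on POSIX `df`; it requires bash for `local`.

```shell
# has_free_gb DIR GB: succeed if the filesystem holding DIR has at least GB gigabytes free.
has_free_gb() {
  local dir="$1" need_gb="$2" avail_kb
  # df -Pk prints one line per filesystem in 1024-byte blocks; column 4 is "Available".
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  # Treat 1 GB as 1,000,000 KiB, which slightly overestimates the requirement.
  [ "$avail_kb" -ge $((need_gb * 1000 * 1000)) ]
}
```

For example, `has_free_gb /path/to/your/databases/kraken2_dbs 165 || echo "not enough space"` warns before a long download is wasted.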

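Extract the downloaded database archive into its own directory. These archives appear to unpack their files directly, without a top-level directory, so create the target directory first:
```
mkdir -p /path/to/your/databases/kraken2_dbs/k2_pluspf_20250402
tar -xvzf /path/to/your/databases/kraken2_dbs/k2_pluspf_20250402.tar.gz -C /path/to/your/databases/kraken2_dbs/k2_pluspf_20250402
```
The directory /path/to/your/databases/kraken2_dbs/k2_pluspf_20250402 should now contain the actual Kraken2 database files (e.g., hash.k2d, opts.k2d, taxo.k2d). If your archive does unpack into a subdirectory of its own, use that nested directory as the database path instead.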
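Because the archive layout can vary, a short search for the index files shows which directory to hand to MicroCAT. The sketch below looks at most two levels deep for `hash.k2d` and reports a directory only if `opts.k2d` and `taxo.k2d` sit beside it. The function name `find_kraken2_db` is hypothetical, and the `-maxdepth` option assumes GNU or BSD `find`.

```shell
# find_kraken2_db ROOT: print each directory under ROOT that holds all three index files.
find_kraken2_db() {
  local root="$1" hash dir
  find "$root" -maxdepth 2 -type f -name hash.k2d | while IFS= read -r hash; do
    dir=$(dirname "$hash")
    # Report the directory only if the companion files exist and are non-empty.
    if [ -s "$dir/opts.k2d" ] && [ -s "$dir/taxo.k2d" ]; then
      echo "$dir"
    fi
  done
}
```

Running `find_kraken2_db /path/to/your/databases/kraken2_dbs` prints one line per usable database directory; an empty result suggests an incomplete extraction.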

Alternatively, if fungal detection is not a priority and you prefer a database that is still comprehensive for bacteria, archaea, viruses, and human sequences, the Standard database is a suitable option. For example, k2_standard_20250402 (approx. 66.9 GB archive, 86.8 GB index) can be downloaded from the same Kraken 2 / Bracken Refseq indexes page. The download and extraction process is similar.

For users with memory-constrained environments, consider using the size-capped versions of these databases (e.g., PlusPF-16GB, Standard-16GB, or Standard-8GB) also available on the Kraken 2 / Bracken Refseq indexes page. These offer a trade-off by reducing database size at the cost of some sensitivity.
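To judge whether a database suits your machine, compare the size of `hash.k2d` with total RAM, since Kraken2 by default loads the hash table into memory. The following is a rough sketch that assumes a Linux system exposing `MemTotal` in `/proc/meminfo`; the function name `db_fits_in_ram` and its optional second argument are hypothetical.

```shell
# db_fits_in_ram DB_DIR [MEMINFO]: succeed if hash.k2d is smaller than total RAM.
db_fits_in_ram() {
  local db_dir="$1" meminfo="${2:-/proc/meminfo}" db_kb mem_kb
  # du -k reports the file's disk usage in 1024-byte blocks.
  db_kb=$(du -k "$db_dir/hash.k2d" | awk '{ print $1 }')
  # On Linux, MemTotal is reported in kB.
  mem_kb=$(awk '/^MemTotal:/ { print $2 }' "$meminfo")
  echo "hash.k2d: ${db_kb} kB, RAM: ${mem_kb} kB" >&2
  [ "$db_kb" -lt "$mem_kb" ]
}
```

If the check fails, a size-capped database is the simpler route. Kraken2 also offers a `--memory-mapping` option that avoids loading the whole table into RAM at the cost of speed; whether MicroCAT exposes it is not covered here.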

b. Configuring MicroCAT for the Kraken2 Database:

Once the database is downloaded and extracted, update MicroCAT’s configuration to point to the directory containing these Kraken2 database files:

```
microcat config --krak2_ref /path/to/your/databases/kraken2_dbs/k2_pluspf_20250402
```

Ensure you replace /path/to/your/databases/kraken2_dbs/k2_pluspf_20250402 (or the directory name of your chosen database, e.g., k2_standard_20250402) with the correct absolute path to the extracted database folder.

c. Other Kraken2 Databases:

The Kraken 2 / Bracken Refseq indexes page lists various other pre-built Kraken2 databases beyond the recommended PlusPF and Standard versions. These include:

- More comprehensive databases like PlusPFP (Standard plus protozoa, fungi & plant).
- Specialized databases (e.g., Viral only, for focused viral studies).
- Very large databases like core_nt for broader coverage.
- Size-capped versions of the main databases (e.g., PlusPF-16GB, Standard-8GB, PlusPFP-16GB) which are smaller and suitable for systems with limited memory, though they may have reduced sensitivity.

The process for using any of these databases is similar:

1. Download the chosen database tarball from the Kraken 2 / Bracken Refseq indexes page.
2. Extract the archive.
3. Configure MicroCAT using microcat config --krak2_ref /path/to/extracted_kraken2_db_directory.
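The three steps can be sketched as a small helper that prints the commands for review instead of running them. Two assumptions are built in and should be checked against the indexes page: that every archive follows the URL pattern of the PlusPF example, and that the archive unpacks its files without a top-level directory (hence the `mkdir`). The function name `k2_setup_cmds` is hypothetical.

```shell
# k2_setup_cmds NAME DEST: print (not run) the download, extract, and configure commands.
k2_setup_cmds() {
  local name="$1" dest="$2"
  # Assumption: the URL pattern matches the PlusPF example; confirm the real link first.
  local url="https://genome-idx.s3.amazonaws.com/kraken/${name}.tar.gz"
  echo "mkdir -p ${dest}/${name}"
  # wget -c resumes a partially downloaded file instead of starting over.
  echo "wget -c ${url} -P ${dest}"
  echo "tar -xzf ${dest}/${name}.tar.gz -C ${dest}/${name}"
  echo "microcat config --krak2_ref ${dest}/${name}"
}
```

For example, `k2_setup_cmds k2_standard_20250402 /path/to/your/databases/kraken2_dbs` prints four commands that you can inspect and then run one at a time.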

Always consider the size of the database and its contents to ensure it aligns with your research questions and available computational resources. Larger databases offer more comprehensive classification but require more memory and disk space.
