Command Reference

This section documents every MIMI command-line tool: what it does, when to use it, and what output to expect.

mimi_kegg_extract

Extracts compound information from the KEGG COMPOUND database using its REST API. It can retrieve either all compounds within a molecular weight range or the compounds named in a list of compound IDs.

$ mimi_kegg_extract --help
usage: mimi_kegg_extract [-h] [-l MIN_MASS] [-u MAX_MASS] [-i COMPOUND_IDS] [-o OUTPUT] [-b BATCH_SIZE]

Extract compound information from KEGG

options:
-h, --help            show this help message and exit
-l MIN_MASS, --min-mass MIN_MASS
                        Lower bound of molecular weight in Da
-u MAX_MASS, --max-mass MAX_MASS
                        Upper bound of molecular weight in Da
-i COMPOUND_IDS, --input COMPOUND_IDS
                        Input TSV file containing KEGG compound IDs
-o OUTPUT, --output OUTPUT
                        Output TSV file path (default: kegg_compounds.tsv)
-b BATCH_SIZE, --batch-size BATCH_SIZE
                        Number of compounds to process in each batch (default: 5)

Example:

# Extract compounds by ID
$ mimi_kegg_extract -i compound_ids.tsv -o data/processed/kegg_compounds.tsv

# Extract compounds between 40-1000 Da
$ mimi_kegg_extract -l 40 -u 1000 -o data/processed/kegg_compounds_40_1000Da.tsv

# Post-process the extracted table: first remove comment lines (those starting with '#')
$ grep -v '^#' data/processed/kegg_compounds_40_1000Da.tsv > data/processed/kegg_compounds_40_1000Da_clean.tsv

# Then sort by compound ID (second column)
$ { head -n 1 data/processed/kegg_compounds_40_1000Da_clean.tsv; tail -n +2 data/processed/kegg_compounds_40_1000Da_clean.tsv | sort -k2,2; } > data/processed/kegg_compounds_40_1000Da_sorted.tsv

# Finally keep only the first row for each chemical formula (first column)
$ awk '!seen[$1]++' data/processed/kegg_compounds_40_1000Da_sorted.tsv > data/processed/kegg_compounds_40_1000Da_sorted_uniq.tsv
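
Because the table is sorted by compound ID before this step, the awk filter keeps, for each chemical formula, the row whose compound ID sorts first and drops the rest (isomers such as glucose and fructose share one formula). A minimal sketch on a made-up table; the header names and the file name demo.tsv are illustrative assumptions, not MIMI's actual output:

```shell
# Illustrative table: formula in column 1, compound ID in column 2.
# The header names and file names are made up for this demo.
printf 'CF\tID\tName\n'                  >  demo.tsv
printf 'C3H4O3\tC00022\tPyruvate\n'      >> demo.tsv
printf 'C6H12O6\tC00031\tD-Glucose\n'    >> demo.tsv
printf 'C6H12O6\tC00095\tD-Fructose\n'   >> demo.tsv

# Print a line only the first time its column-1 value is seen.
awk '!seen[$1]++' demo.tsv > demo_uniq.tsv

# The D-Fructose row is dropped because C6H12O6 was already seen.
cat demo_uniq.tsv
```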

mimi_hmdb_extract

Extracts metabolite information from the Human Metabolome Database (HMDB) and converts it to a TSV format compatible with MIMI. Requires a list of metabolites downloaded from HMDB in XML format.

$ mimi_hmdb_extract --help
usage: mimi_hmdb_extract [-h] [--id-tag ID_TAG] -x XML [-l MIN_MASS] [-u MAX_MASS] [-o OUTPUT]

Extract metabolite information from HMDB XML file

options:
-h, --help            show this help message and exit
--id-tag ID_TAG       Preferred ID tag to use. Options: accession, kegg_id, chebi_id, pubchem_compound_id, drugbank_id
-x XML, --xml XML     Path to HMDB metabolites XML file
-l MIN_MASS, --min-mass MIN_MASS
                        Lower bound of molecular weight in Da
-u MAX_MASS, --max-mass MAX_MASS
                        Upper bound of molecular weight in Da
-o OUTPUT, --output OUTPUT
                        Output TSV file path (default: metabolites.tsv)

Example:

# Extract metabolites between 40-1000 Da
$ mimi_hmdb_extract -x data/raw/hmdb_metabolites.xml -l 40 -u 1000 -o data/processed/hmdb_compounds_40_1000Da.tsv

# Post-process the extracted table: first remove comment lines (those starting with '#')
$ grep -v '^#' data/processed/hmdb_compounds_40_1000Da.tsv > data/processed/hmdb_compounds_40_1000Da_clean.tsv

# Then sort by compound ID (second column)
$ { head -n 1 data/processed/hmdb_compounds_40_1000Da_clean.tsv; tail -n +2 data/processed/hmdb_compounds_40_1000Da_clean.tsv | sort -k2,2; } > data/processed/hmdb_compounds_40_1000Da_sorted.tsv

# Finally keep only the first row for each chemical formula (first column)
$ awk '!seen[$1]++' data/processed/hmdb_compounds_40_1000Da_sorted.tsv > data/processed/hmdb_compounds_40_1000Da_sorted_uniq.tsv
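
The three clean-up steps are the same for the KEGG and HMDB tables, so they can be wrapped in one shell function that avoids the intermediate files. This is a sketch, assuming (as in the commands above) that the first non-comment line is the header, the formula is in column 1, and the compound ID is in column 2. clean_compound_table is a hypothetical helper name, not a MIMI command:

```shell
# Hypothetical helper (not part of MIMI): strip comment lines, sort the
# rows by compound ID while keeping the header first, then keep one row
# per chemical formula.
# Usage: clean_compound_table INPUT.tsv OUTPUT.tsv
clean_compound_table() {
    grep -v '^#' "$1" | {
        IFS= read -r header          # first non-comment line is the header
        printf '%s\n' "$header"      # pass the header through unsorted
        sort -k2,2                   # sort the remaining rows by column 2
    } | awk '!seen[$1]++' > "$2"     # keep the first row per column-1 value
}

# Example call (paths taken from the commands above):
# clean_compound_table data/processed/hmdb_compounds_40_1000Da.tsv \
#     data/processed/hmdb_compounds_40_1000Da_sorted_uniq.tsv
```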

Input files:

HMDB provides downloads of the full metabolite database as well as of specific subsets (e.g. serum or saliva). The downloaded files carry the .xml extension and look something like this:

<?xml version="1.0" encoding="UTF-8"?>
<hmdb xmlns="http://www.hmdb.ca">
<metabolite>
  <version>5.0</version>
  <creation_date>2005-11-16 15:48:42 UTC</creation_date>
  <update_date>2021-09-14 15:44:51 UTC</update_date>

  [ data on individual compounds follows ... ]

</metabolite>
</hmdb>

Note that the HMDB database version (5.0) and the date of the last update (2021-09-14) appear at the top of this example file, so they are easy to find when citing the data.
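
These two fields can be read from the command line without opening the file, which can be large. The following is a line-based sketch rather than a real XML parser: it assumes each tag sits on a single line, as in the excerpt above. The file name hmdb_sample.xml and the function name first_tag are illustrative:

```shell
# Small stand-in for an HMDB download, using the layout of the excerpt above.
cat > hmdb_sample.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<hmdb xmlns="http://www.hmdb.ca">
<metabolite>
  <version>5.0</version>
  <creation_date>2005-11-16 15:48:42 UTC</creation_date>
  <update_date>2021-09-14 15:44:51 UTC</update_date>
</metabolite>
</hmdb>
EOF

# first_tag TAG FILE: print the text of the first <TAG>...</TAG> line in
# FILE, then quit so the rest of the file is not read.
first_tag() {
    sed -n "/<$1>/{s:.*<$1>\(.*\)</$1>.*:\1:p;q;}" "$2"
}

hmdb_version=$(first_tag version hmdb_sample.xml)
hmdb_updated=$(first_tag update_date hmdb_sample.xml)
echo "HMDB version $hmdb_version, last updated $hmdb_updated"
```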


mimi_cache_create

Creates precomputed cache files containing theoretical molecular masses and isotope patterns for compounds. Caching significantly speeds up mass comparisons, and the same cache files can be reused for any mass analysis involving the same database and isotope ratios.

$ mimi_cache_create --help
usage: mimi_cache_create [-h] [-l JSON] [-n CUTOFF] -d DBTSV [DBTSV ...] -i {pos,neg} -c DBBINARY

Molecular Isotope Mass Identifier

options:
-h, --help            show this help message and exit
-l JSON, --label JSON
                        Labeled atoms
-n CUTOFF, --noise CUTOFF
                        Threshold for filtering molecular isotope variants with relative abundance below CUTOFF w.r.t. the monoisotopic mass (defaults to 1e-5)
-d DBTSV [DBTSV ...], --dbfile DBTSV [DBTSV ...]
                        File(s) with list of compounds
-i {pos,neg}, --ion {pos,neg}
                        Ionisation mode
-c DBBINARY, --cache DBBINARY
                        Binary DB output file (if not specified, will use base name from JSON file)

Example:

# Create natural abundance cache
$ mimi_cache_create -i neg -d data/processed/kegg_compounds_40_1000Da_sorted_uniq.tsv -c outdir/nat

# Create C13-95% labeled cache
$ mimi_cache_create -i neg -l data/processed/C13_95.json -d data/processed/kegg_compounds_40_1000Da_sorted_uniq.tsv -c outdir/C13_95

mimi_cache_dump

Dumps the contents of a MIMI cache file in human-readable TSV format. Useful for inspecting cache files and verifying their contents.

$ mimi_cache_dump --help
usage: mimi_cache_dump [-h] [-n NUM_COMPOUNDS] [-i NUM_ISOTOPES] [-o OUTPUT] cache_file

MIMI Cache Dump Tool

positional arguments:
cache_file            Input cache file (.pkl)

options:
-h, --help            show this help message and exit
-n NUM_COMPOUNDS, --num-compounds NUM_COMPOUNDS
                        Number of compounds to output (default: all)
-i NUM_ISOTOPES, --num-isotopes NUM_ISOTOPES
                        Number of isotopes per compound to output (default: all)
-o OUTPUT, --output OUTPUT
                        Output file (default: stdout)

Example:

# Dump first 5 compounds with 2 isotopes each
$ mimi_cache_dump -n 5 -i 2 outdir/nat.pkl -o outdir/cache_contents.tsv

mimi_mass_analysis

Analyzes mass spectrometry data by comparing measured masses in sample peak lists against precomputed theoretical molecular masses stored in cache files.

$ mimi_mass_analysis --help
usage: mimi_mass_analysis [-h] -p PPM -vp VPPM -c DBBINARY [DBBINARY ...] -s SAMPLE [SAMPLE ...] -o OUTPUT

Molecular Isotope Mass Identifier

options:
-h, --help            show this help message and exit
-p PPM, --ppm PPM     Parts per million for the mono isotopic mass of chemical formula
-vp VPPM              Parts per million for verification of isotopes
-c DBBINARY [DBBINARY ...], --cache DBBINARY [DBBINARY ...]
                        Binary DB input file(s)
-s SAMPLE [SAMPLE ...], --sample SAMPLE [SAMPLE ...]
                        Input sample file
-o OUTPUT, --output OUTPUT
                        Output file

Example:

# Analyze single sample with natural abundance cache
$ mimi_mass_analysis -p 1.0 -vp 1.0 -c outdir/nat -s data/processed/testdata1.asc -o outdir/results.tsv

# Analyze multiple samples with multiple caches
$ mimi_mass_analysis -p 1.0 -vp 1.0 -c outdir/nat outdir/C13_95 -s data/processed/testdata1.asc data/processed/testdata2.asc -o outdir/batch_results.tsv
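
The -p and -vp tolerances are expressed in parts per million, so the absolute mass window grows with the mass being matched. The sketch below shows the usual ppm arithmetic (mass times ppm divided by one million) as a worked illustration. It is not taken from MIMI's source code, and the function name ppm_window and the mass value are examples only:

```shell
# ppm_window MASS PPM: print the lower and upper bounds of the window
# MASS +/- MASS * PPM / 1000000, rounded to four decimal places.
ppm_window() {
    awk -v m="$1" -v ppm="$2" 'BEGIN {
        d = m * ppm / 1000000
        printf "%.4f %.4f\n", m - d, m + d
    }'
}

# At 1 ppm, a mass of 500 Da has a window of +/- 0.0005 Da.
ppm_window 500 1.0    # prints: 499.9995 500.0005
```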