Command Reference
=================

This section describes each command-line tool available in MIMI: its purpose, when to use it, and what results to expect.

mimi_kegg_extract
-----------------

Extracts compound information from the KEGG COMPOUND database using its REST API. It can retrieve compounds within a specific molecular weight range or from a list of compound IDs.


.. code-block:: text

   $ mimi_kegg_extract --help
   usage: mimi_kegg_extract [-h] [-l MIN_MASS] [-u MAX_MASS] [-i COMPOUND_IDS] [-o OUTPUT] [-b BATCH_SIZE]

   Extract compound information from KEGG

   options:
     -h, --help            show this help message and exit
     -l MIN_MASS, --min-mass MIN_MASS
                           Lower bound of molecular weight in Da
     -u MAX_MASS, --max-mass MAX_MASS
                           Upper bound of molecular weight in Da
     -i COMPOUND_IDS, --input COMPOUND_IDS
                           Input TSV file containing KEGG compound IDs
     -o OUTPUT, --output OUTPUT
                           Output TSV file path (default: kegg_compounds.tsv)
     -b BATCH_SIZE, --batch-size BATCH_SIZE
                           Number of compounds to process in each batch (default: 5)


**Example**::

    # Extract compounds by ID
    $ mimi_kegg_extract -i compound_ids.tsv -o data/processed/kegg_compounds.tsv

    # Extract compounds between 40-1000 Da
    $ mimi_kegg_extract -l 40 -u 1000 -o data/processed/kegg_compounds_40_1000Da.tsv

    # First remove lines starting with '#'
    $ grep -v '^#' data/processed/kegg_compounds_40_1000Da.tsv > data/processed/kegg_compounds_40_1000Da_clean.tsv

    # Then sort by compound ID (second column)
    $ { head -n 1 data/processed/kegg_compounds_40_1000Da_clean.tsv; tail -n +2 data/processed/kegg_compounds_40_1000Da_clean.tsv | sort -k2,2; } > data/processed/kegg_compounds_40_1000Da_sorted.tsv

    # Finally remove duplicate chemical formulas
    $ awk '!seen[$1]++' data/processed/kegg_compounds_40_1000Da_sorted.tsv > data/processed/kegg_compounds_40_1000Da_sorted_uniq.tsv

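
The three clean-up steps above can also be expressed in Python. The following is a minimal sketch and not part of MIMI: the function name ``clean_compound_table`` and the column headers in the sample are hypothetical. It assumes, as the shell comments above do, that the chemical formula is in the first tab-separated column and the compound ID in the second, and that every row has at least two columns.

.. code-block:: python

   def clean_compound_table(lines):
       """Drop '#' comment lines, sort rows by column 2, keep one row per column-1 value."""
       # Step 1: remove comment lines (like grep -v '^#')
       rows = [line.rstrip("\n") for line in lines if not line.startswith("#")]
       header, body = rows[0], rows[1:]

       # Step 2: sort data rows by compound ID, leaving the header in place
       body.sort(key=lambda row: row.split("\t")[1])

       # Step 3: keep only the first row seen for each chemical formula
       seen = set()
       unique = []
       for row in body:
           formula = row.split("\t")[0]
           if formula not in seen:
               seen.add(formula)
               unique.append(row)
       return [header] + unique


   # Hypothetical input: a comment line, a header, and three compounds
   example = [
       "# hypothetical comment line\n",
       "formula\tid\tname\n",
       "C6H12O6\tC00095\tD-Fructose\n",
       "C6H12O6\tC00031\tD-Glucose\n",
       "C3H7NO2\tC00041\tL-Alanine\n",
   ]
   cleaned = clean_compound_table(example)
   for row in cleaned:
       print(row)

Because the rows are sorted by ID before duplicates are removed, the entry with the lowest ID is the one kept for each formula. Unlike ``sort`` and ``awk`` with their default settings, this sketch splits strictly on tabs.
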
mimi_hmdb_extract
-----------------

Extracts metabolite information from the Human Metabolome Database (HMDB) and converts it to a TSV format compatible with MIMI. Requires a list of metabolites downloaded from HMDB in XML format.


.. code-block:: text

   $ mimi_hmdb_extract --help
   usage: mimi_hmdb_extract [-h] -x XML [-l MIN_MASS] [-u MAX_MASS] [-o OUTPUT]

   Extract metabolite information from HMDB XML file

   options:
     -h, --help            show this help message and exit
     --id-tag ID_TAG       Preferred ID tag to use. Options: accession, kegg_id,
                           chebi_id, pubchem_compound_id, drugbank_id
     -x XML, --xml XML     Path to HMDB metabolites XML file
     -l MIN_MASS, --min-mass MIN_MASS
                           Lower bound of molecular weight in Da
     -u MAX_MASS, --max-mass MAX_MASS
                           Upper bound of molecular weight in Da
     -o OUTPUT, --output OUTPUT
                           Output TSV file path (default: metabolites.tsv)


**Example**::

    # Extract metabolites between 40-1000 Da
    $ mimi_hmdb_extract -x data/raw/hmdb_metabolites.xml -l 40 -u 1000 -o data/processed/hmdb_compounds_40_1000Da.tsv

    # First remove lines starting with '#'
    $ grep -v '^#' data/processed/hmdb_compounds_40_1000Da.tsv > data/processed/hmdb_compounds_40_1000Da_clean.tsv

    # Then sort by compound ID (second column)
    $ { head -n 1 data/processed/hmdb_compounds_40_1000Da_clean.tsv; tail -n +2 data/processed/hmdb_compounds_40_1000Da_clean.tsv | sort -k2,2; } > data/processed/hmdb_compounds_40_1000Da_sorted.tsv

    # Finally remove duplicate chemical formulas
    $ awk '!seen[$1]++' data/processed/hmdb_compounds_40_1000Da_sorted.tsv > data/processed/hmdb_compounds_40_1000Da_sorted_uniq.tsv


**Input files:**

HMDB provides downloads for all metabolites in the database, as well as for specific subsets (e.g. serum, saliva). Downloaded files have the extension ``.xml`` and begin something like this:


.. code-block:: text

   <version>5.0</version>
   <creation_date>2005-11-16 15:48:42 UTC</creation_date>
   <update_date>2021-09-14 15:44:51 UTC</update_date>
   [ data on individual compounds follows ... ]


The HMDB database version (5.0) and the date of the last update (2021-09-14) appear at the top of this example file, so the information needed for citation is readily available.

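
To record these values without opening the file by hand, they can be read programmatically. This is a hedged sketch rather than a MIMI feature: ``hmdb_header_info`` is a hypothetical helper, and the tag names ``version``, ``creation_date`` and ``update_date``, as well as the namespace in the sample, are assumptions about the HMDB XML layout that should be checked against the downloaded file.

.. code-block:: python

   import io
   import xml.etree.ElementTree as ET


   def hmdb_header_info(source, wanted=("version", "creation_date", "update_date")):
       """Return the first value found for each wanted tag, reading the XML lazily."""
       found = {}
       for _event, elem in ET.iterparse(source, events=("end",)):
           tag = elem.tag.rsplit("}", 1)[-1]  # drop any XML namespace prefix
           if tag in wanted and tag not in found:
               found[tag] = (elem.text or "").strip()
           if len(found) == len(wanted):
               break  # stop early instead of parsing the whole download
       return found


   # Minimal stand-in for the start of a downloaded file (assumed layout)
   sample = io.BytesIO(b"""<?xml version="1.0" encoding="UTF-8"?>
   <hmdb xmlns="http://www.hmdb.ca">
     <metabolite>
       <version>5.0</version>
       <creation_date>2005-11-16 15:48:42 UTC</creation_date>
       <update_date>2021-09-14 15:44:51 UTC</update_date>
     </metabolite>
   </hmdb>""")

   info = hmdb_header_info(sample)
   print(info["version"], info["update_date"])

``ET.iterparse`` also accepts a file path, so the same function can be pointed at the real download; stopping after the first match avoids reading a large file in full.
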
mimi_cache_create
-----------------

Creates precomputed cache files containing theoretical molecular masses and isotope patterns for compounds. Caching significantly speeds up mass comparisons, and the same cache files can be reused for any mass analysis involving the same database and isotope ratios.

.. code-block:: text

   $ mimi_cache_create --help
   usage: mimi_cache_create [-h] [-l JSON] [-n CUTOFF] -d DBTSV [DBTSV ...] -i {pos,neg} -c DBBINARY

   Molecular Isotope Mass Identifier

   options:
     -h, --help            show this help message and exit
     -l JSON, --label JSON
                           Labeled atoms
     -n CUTOFF, --noise CUTOFF
                           Threshold for filtering molecular isotope variants with
                           relative abundance below CUTOFF w.r.t. the monoisotopic
                           mass (defaults to 1e-5)
     -d DBTSV [DBTSV ...], --dbfile DBTSV [DBTSV ...]
                           File(s) with list of compounds
     -i {pos,neg}, --ion {pos,neg}
                           Ionisation mode
     -c DBBINARY, --cache DBBINARY
                           Binary DB output file (if not specified, will use base name from JSON file)


**Example**::

    # Create natural abundance cache
    $ mimi_cache_create -i neg -d data/processed/kegg_compounds_40_1000Da_sorted_uniq.tsv -c outdir/nat

    # Create C13-95% labeled cache
    $ mimi_cache_create -i neg -l data/processed/C13_95.json -d data/processed/kegg_compounds_40_1000Da_sorted_uniq.tsv -c outdir/C13_95

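
The ``-n`` cutoff can be pictured as a relative-abundance filter. The sketch below is one reading of the help text, not MIMI's actual code: ``filter_isotope_variants`` is a hypothetical helper, the masses and abundances are toy values rather than real isotope data, and the first entry is assumed to be the monoisotopic variant.

.. code-block:: python

   def filter_isotope_variants(variants, cutoff=1e-5):
       """Keep variants whose abundance relative to the first (monoisotopic) entry is >= cutoff.

       variants: list of (mass, abundance) tuples, monoisotopic variant first.
       """
       mono_abundance = variants[0][1]
       return [
           (mass, abundance)
           for mass, abundance in variants
           if abundance / mono_abundance >= cutoff
       ]


   # Toy values for illustration only
   variants = [(100.0, 1.0), (101.0, 0.05), (102.0, 2e-6)]
   kept = filter_isotope_variants(variants)
   print(kept)

With the default cutoff of 1e-5, the third toy variant (relative abundance 2e-6) is dropped, while the first two are kept. A larger cutoff yields a smaller cache at the cost of discarding more low-abundance variants.
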
mimi_cache_dump
---------------

Dumps the contents of a MIMI cache file in human-readable TSV format. Useful for inspecting cache files and verifying their contents.

.. code-block:: text

   $ mimi_cache_dump --help
   usage: mimi_cache_dump [-h] [-n NUM_COMPOUNDS] [-i NUM_ISOTOPES] [-o OUTPUT] cache_file

   MIMI Cache Dump Tool

   positional arguments:
     cache_file            Input cache file (.pkl)

   options:
     -h, --help            show this help message and exit
     -n NUM_COMPOUNDS, --num-compounds NUM_COMPOUNDS
                           Number of compounds to output (default: all)
     -i NUM_ISOTOPES, --num-isotopes NUM_ISOTOPES
                           Number of isotopes per compound to output (default: all)
     -o OUTPUT, --output OUTPUT
                           Output file (default: stdout)


**Example**::

    # Dump first 5 compounds with 2 isotopes each
    $ mimi_cache_dump -n 5 -i 2 outdir/nat.pkl -o outdir/cache_contents.tsv

mimi_mass_analysis
------------------

Analyzes mass spectrometry data by comparing measured masses in sample peak lists against precomputed theoretical molecular masses stored in cache files.

.. code-block:: text

   $ mimi_mass_analysis --help
   usage: mimi_mass_analysis [-h] -p PPM -vp VPPM -c DBBINARY [DBBINARY ...] -s SAMPLE [SAMPLE ...] -o OUTPUT

   Molecular Isotope Mass Identifier

   options:
     -h, --help            show this help message and exit
     -p PPM, --ppm PPM     Parts per million for the mono isotopic mass of chemical formula
     -vp VPPM              Parts per million for verification of isotopes
     -c DBBINARY [DBBINARY ...], --cache DBBINARY [DBBINARY ...]
                           Binary DB input file(s)
     -s SAMPLE [SAMPLE ...], --sample SAMPLE [SAMPLE ...]
                           Input sample file
     -o OUTPUT, --output OUTPUT
                           Output file


**Example**::

    # Analyze single sample with natural abundance cache
    $ mimi_mass_analysis -p 1.0 -vp 1.0 -c outdir/nat -s data/processed/testdata1.asc -o outdir/results.tsv

    # Analyze multiple samples with multiple caches
    $ mimi_mass_analysis -p 1.0 -vp 1.0 -c outdir/nat outdir/C13_95 -s data/processed/testdata1.asc data/processed/testdata2.asc -o outdir/batch_results.tsv

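
Both ``-p`` and ``-vp`` are tolerances expressed in parts per million. As a rough illustration, and assuming MIMI uses the conventional definition of ppm error (the help text does not spell this out), a measured mass matches a theoretical mass when the check below passes. The helper names are hypothetical.

.. code-block:: python

   def ppm_error(measured_mass, theoretical_mass):
       """Signed deviation of the measured mass from the theoretical mass, in ppm."""
       return (measured_mass - theoretical_mass) / theoretical_mass * 1e6


   def within_ppm(measured_mass, theoretical_mass, ppm):
       """True if the measured mass lies within +/- ppm of the theoretical mass."""
       return abs(ppm_error(measured_mass, theoretical_mass)) <= ppm


   # Illustrative masses in Da, tolerance of 1.0 ppm as in the examples above
   print(within_ppm(500.0004, 500.0, 1.0))  # about 0.8 ppm away
   print(within_ppm(500.0006, 500.0, 1.0))  # about 1.2 ppm away

At 500 Da, a tolerance of 1 ppm corresponds to a window of roughly ±0.0005 Da, so the absolute window widens in proportion to the mass being matched.
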