Frequently Asked Questions

Scientists & Research Teams

Technical questions from biologists, bioinformaticians, and research scientists - answered plainly.

What sequencing and assembly standards does IsoGentiX use?

IsoGentiX enforces Earth BioGenome Project (EBP) reference-grade standards for all genome assemblies:

  • Assembly quality - Merqury QV score ≥40 (error rate ≤1 per 10,000 bases); BUSCO completeness ≥90% against the relevant plant lineage database (embryophyta_odb10 or family-level where available)
  • Sequencing depth - ≥40× coverage for short-read polishing; long-read sequencing (PacBio HiFi or ONT R10) targeting ≥50× for de novo assembly
  • Scaffold continuity - N50 scaffold length ≥10 Mb; chromosome-level assembly targeted where resources allow
  • Annotation - structural annotation via MAKER/BRAKER2 pipelines; functional annotation via OrthoFinder against plant reference proteomes; repeat masking with RepeatMasker/RepeatModeler
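
The numeric thresholds in the list above can be checked with a few lines of Python. This is a minimal sketch, not the IsoGentiX pipeline: the function names and the `THRESHOLDS` dictionary are hypothetical, and the QV-to-error-rate conversion assumes the usual Phred scaling, p = 10^(-QV/10).

```python
from typing import Dict, List

# Thresholds quoted in the list above (names are illustrative only).
THRESHOLDS = {"qv": 40.0, "busco_pct": 90.0, "scaffold_n50_bp": 10_000_000}


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled QV to a per-base error probability."""
    return 10 ** (-qv / 10)


def n50(lengths: List[int]) -> int:
    """Return the length L such that scaffolds of length >= L
    together cover at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def meets_standards(qv: float, busco_pct: float,
                    scaffold_lengths: List[int]) -> Dict[str, bool]:
    """Report, per metric, whether the assembly meets the threshold."""
    return {
        "qv": qv >= THRESHOLDS["qv"],
        "busco_pct": busco_pct >= THRESHOLDS["busco_pct"],
        "scaffold_n50_bp": n50(scaffold_lengths) >= THRESHOLDS["scaffold_n50_bp"],
    }
```

For example, `qv_to_error_rate(40)` gives an error probability of about 1 in 10,000 bases, which is where the figure in the first bullet comes from.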

Transcriptome data is generated from total RNA (RNAlater-preserved at collection), sequenced to ≥50M read pairs per sample, trimmed with Trimmomatic, aligned to the genome assembly with STAR, and quantified with featureCounts.

IsoGentiX metabolomics uses two complementary approaches per specimen:

  • LC-MS/MS (targeted and untargeted) - UHPLC separation, positive and negative mode ESI, data-dependent acquisition for untargeted metabolite discovery, and targeted MRM panels for alkaloid quantification. Raw data in mzML format; processing via MZmine 3 and GNPS for spectral networking and compound annotation.
  • NIR metabolite fingerprinting - Near-infrared spectroscopy on fresh/dried plant material, generating a rapid metabolite fingerprint used for specimen-level chemical profiling and chemotyping across populations. Processed as spectra matrices and PCA-reduced fingerprints.

Metabolite annotation uses the following databases: HMDB, KNApSAcK, the Dictionary of Natural Products, and family-specific alkaloid databases. Confidence levels follow the Metabolomics Standards Initiative (MSI) reporting framework (levels 1-4). Unknown compounds are reported at MSI level 4 with molecular formula and spectral data retained for downstream characterisation.
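
As a rough illustration of how MSI levels map onto the evidence available for a feature, the sketch below assigns a level from a small evidence record. The `FeatureEvidence` fields and the decision order are assumptions made for illustration, based on the general MSI level definitions (1: identified against an authentic reference standard; 2: putatively annotated, e.g. by spectral library match; 3: putatively characterised compound class; 4: unknown). It is not a description of IsoGentiX's annotation code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FeatureEvidence:
    """Hypothetical evidence record for one LC-MS/MS feature."""
    matched_reference_standard: bool = False  # authentic standard, same conditions
    library_spectral_match: bool = False      # e.g. MS/MS match in a spectral library
    compound_class: Optional[str] = None      # class-level assignment only
    molecular_formula: Optional[str] = None   # retained even for unknowns


def msi_level(ev: FeatureEvidence) -> int:
    """Return the MSI confidence level (1 = strongest, 4 = unknown)."""
    if ev.matched_reference_standard:
        return 1
    if ev.library_spectral_match:
        return 2
    if ev.compound_class:
        return 3
    return 4
```

A feature with only a molecular formula, such as `FeatureEvidence(molecular_formula="C21H26N2O3")`, falls through to level 4, matching the handling of unknown compounds described above.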

Each specimen collection follows a standardised protocol developed over the IsoGentiX founders' many years of Madagascar fieldwork. At collection:

  • GPS coordinates (±3 m accuracy), altitude, aspect, slope, and habitat type are recorded
  • Phenological state (vegetative, flowering, fruiting, dormant) is documented and photographed
  • Voucher samples are taken for herbarium deposition (PBZT, Antananarivo)
  • Fresh tissue is flash-frozen in liquid nitrogen for genomics/transcriptomics; separate aliquots are preserved in RNAlater; dried samples taken for metabolomics and NIR
  • Soil samples (0-15cm depth) are collected from the rhizosphere zone for XRF elemental analysis and microbiome profiling
  • A GUID is assigned to the specimen at collection; all field data and downstream laboratory data are linked to this GUID from day one

Field metadata is captured in structured JSON forms synced to the IsoGentiX data management system. The GUID, collection date, collector ID, and GPS coordinates are recorded on the IRCC certificate at collection, establishing the provenance chain before any laboratory work begins.
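
To make the GUID linkage concrete, here is a minimal sketch of a field record serialised to JSON. The field names, the use of a random UUID as the GUID, and the validation rules are all assumptions for illustration; the actual IsoGentiX form schema is not described here, and the record shows only a subset of the fields listed above.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import Dict, List

# Phenological states listed in the collection protocol above.
PHENOLOGY_STATES = {"vegetative", "flowering", "fruiting", "dormant"}


@dataclass
class FieldRecord:
    """Hypothetical field-collection record keyed by a GUID."""
    collector_id: str
    collection_date: str  # ISO 8601 date, e.g. "2024-03-15"
    latitude: float
    longitude: float
    altitude_m: float
    phenology: str
    habitat: str
    # Assumption: the GUID is a random UUID generated at collection time.
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))


def validate(record: FieldRecord) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    errors = []
    if not -90.0 <= record.latitude <= 90.0:
        errors.append("latitude out of range")
    if not -180.0 <= record.longitude <= 180.0:
        errors.append("longitude out of range")
    if record.phenology not in PHENOLOGY_STATES:
        errors.append("unknown phenological state")
    return errors


def to_json(record: FieldRecord) -> str:
    """Serialise the record for syncing to a data management system."""
    return json.dumps(asdict(record), sort_keys=True)


def link_lab_result(record: FieldRecord, result: Dict) -> Dict:
    """Attach the specimen GUID to a downstream laboratory result."""
    return {"guid": record.guid, **result}
```

Because every downstream result carries the same `guid`, field and laboratory data can be joined on that key alone.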

IsoGentiX collects a minimum of 5 specimens per target species, distributed across the full geographic range of the species wherever possible. For species with fragmented distributions (common in tsingy and ultramafic specialists), specimens are collected from each distinct population cluster.

The intraspecies sampling design reflects the biological reality that metabolite profiles can vary substantially between geographically isolated populations of the same species - a phenomenon well-documented in alkaloid-producing plants. Collecting multiple specimens per species generates the intraspecies chemical variation data needed to: (a) identify chemotypes, (b) determine which metabolites are consistent across all populations vs. population-specific, and (c) provide the range of biological replicates needed for statistically sound comparative analysis.
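
Point (b) can be illustrated with a simple presence/absence comparison. The sketch below is a deliberate simplification: it treats a metabolite as present in a population if it is detected in at least one specimen from that population, whereas a real analysis would work from intensities and detection thresholds. The function name and the example metabolite labels are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def partition_metabolites(
    populations: Dict[str, List[Set[str]]]
) -> Tuple[Set[str], Dict[str, Set[str]]]:
    """Split metabolites into those shared by every population and
    those found in exactly one population.

    `populations` maps a population name to a list of per-specimen
    metabolite sets.
    """
    # Population-level presence: union over that population's specimens.
    present = {pop: set().union(*specimens)
               for pop, specimens in populations.items()}
    if not present:
        return set(), {}

    # Consistent across all populations.
    core = set.intersection(*present.values())

    # Population-specific: absent from every other population.
    specific = {}
    for pop, mets in present.items():
        others = set().union(*(m for p, m in present.items() if p != pop))
        specific[pop] = mets - others
    return core, specific


core, specific = partition_metabolites({
    "north": [{"alkaloid_A", "alkaloid_B"}, {"alkaloid_A", "alkaloid_C"}],
    "south": [{"alkaloid_A", "alkaloid_D"}],
})
```

In this toy example `alkaloid_A` is the only metabolite common to both populations, while `alkaloid_B` and `alkaloid_C` are specific to "north" and `alkaloid_D` to "south".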

The full multi-omics stack is applied to all specimens, not just representative individuals. Intraspecies genomic variation data (SNPs, structural variants) is an additional output of the multi-specimen approach.

IsoGentiX is a commercial data platform and its primary licensing model is for pharmaceutical, agritech, and AI companies. However, IsoGentiX is open to research collaboration arrangements with academic institutions in specific cases:

  • Collaborative research programmes where academic expertise contributes to data generation (e.g. specialist taxonomic identification, bioinformatics pipeline development)
  • Academic use of publicly released data subsets - IsoGentiX plans to release a proportion of specimen data to public repositories following commercial licensing periods
  • Joint grant applications with academic partners, where IsoGentiX contributes data and commercial partner involvement strengthens the impact case

Academic researchers interested in collaboration should contact us with a brief description of their research programme and what data access would be needed. We evaluate these requests case by case.

Yes - accurate species identification is foundational to the IsoGentiX data programme, and we actively work with Malagasy and international taxonomists. All collections are made with the involvement of trained Malagasy botanists. Voucher specimens are deposited at PBZT (Parc Botanique et Zoologique de Tsimbazaza, Antananarivo) and duplicate specimens at collaborating international herbaria.

Taxonomic identifications are reviewed against current treatments in Tropicos and GBIF, with specialist consultation for genera where Malagasy taxonomy is actively revised. Molecular phylogenetic placement is generated as part of the genome assembly pipeline, providing an independent check on morphological identification for ambiguous specimens.

Field botanists and taxonomists interested in contributing to the collection programme should contact us - particularly those with expertise in Apocynaceae, Didiereaceae, or tsingy-endemic genera, where specialist identification skills are most needed.

Standard domain licences include access to the IsoGentiX data portal with pre-computed analyses: BUSCO reports, assembly statistics, metabolite annotation summaries, PCA plots of intraspecies metabolite variation, and biosynthetic gene cluster predictions (antiSMASH for plant secondary metabolite BGCs).

Enhanced bioinformatics support - including comparative analysis across the licensed domain, custom annotation pipelines, and dedicated data scientist support - is available as an add-on to domain licences. This is particularly relevant for pharmaceutical licensees who want to move from raw data to a shortlist of compounds of interest without building internal bioinformatics capacity.

For AI training data licensees, pre-computed embeddings and normalised feature matrices are available in addition to raw data formats.

Scientific questions or collaboration enquiries

Contact us to discuss data methodology, research collaboration, or technical access requirements in detail.
