Frequently Asked Questions

AI Platforms & Foundation Models

Questions from AI platform teams and life sciences foundation model developers - answered plainly.

If AI can design molecules from scratch, why does biological training data still matter?

AI drug discovery and protein function models are bounded by the chemical and genetic space in their training data. Models trained predominantly on public databases - which are heavily biased toward model organisms and well-explored chemical space - generate outputs that cluster within already-known regions of biology. They cannot extrapolate reliably to genuinely novel chemical structures or gene families because they have never been trained on them.

A 2025 analysis found that 58.1% of AI-generated drug candidates had high structural similarity to existing known compounds. The novelty ceiling is a training data problem, not a model architecture problem. Madagascar's endemic flora - 160 million years of isolated evolution producing chemistry not represented in any public database - provides exactly the non-redundant, structurally novel training signal that next-generation life sciences foundation models require.
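Structural similarity between generated candidates and known compounds is conventionally scored with fingerprint-based Tanimoto similarity. The sketch below is an illustration of that metric only, not the methodology of the cited analysis; the fingerprints are invented bit sets.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Hypothetical fingerprints: bit positions set by substructure hashing.
known_compound = {3, 17, 42, 58, 91, 104}
candidate      = {3, 17, 42, 58, 91, 200}

score = tanimoto(known_compound, candidate)
print(f"Tanimoto similarity: {score:.2f}")  # 5 shared / 7 total bits -> 0.71
```

A high score against every known compound is precisely the "novelty ceiling" symptom: the model is interpolating within charted chemical space rather than generating structures outside it.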

IsoGentiX data is not a competitor to AI drug discovery. It is the training data that allows AI drug discovery to move beyond its current knowledge ceiling.

Public biological databases - GenBank, UniProt, PDB, ChEMBL - have been heavily mined by every major life sciences AI programme. Models trained on these databases have already extracted most of the predictive signal they contain, so additional training on the same sources yields diminishing returns.
Haven't public databases like GenBank and ChEMBL already been mined to exhaustion?

The saturation problem is structural: public databases are disproportionately populated with data from a small number of model organisms (human, mouse, Arabidopsis, E. coli, zebrafish) and from chemical space that has already been explored by conventional drug discovery. The biological diversity that exists in nature - and that represents the majority of evolutionary innovation in molecular biology - is almost entirely absent from these databases.

Madagascar's 12,000+ endemic plant species, with less than 0.08% represented in any public database, represent one of the largest single reservoirs of uncharacterised biological diversity accessible under a commercially deployable data licence.

Most biological databases provide single data types - sequence only, or metabolomics only, or protein structures only. Training AI models on single data types produces models with limited cross-modal generalisation: they can predict within their modality but cannot reliably bridge from sequence to function to metabolite output.
Why does multi-modal, specimen-linked data matter for model training?

IsoGentiX specimen-level data links 8 data layers to a single biological individual: genome, transcriptome, metabolome, proteome, epigenome, microbiome, soil chemistry, and habitat. This is the kind of deeply integrated, multi-modal dataset needed to train models that can predict metabolic output from sequence, identify biosynthetic gene clusters from multi-omics context, or learn the relationship between epigenetic state and metabolite profile.

The specimen-level GUID linkage ensures that all data layers for a given individual remain associated - the training signal is not diluted by aggregation across specimens with different genetic backgrounds, environmental contexts, or phenological states.
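The GUID-linkage pattern described above can be sketched in a few lines. This is an illustrative sketch only, not the IsoGentiX schema: the field names, file names, and GUID format are all invented for the example.

```python
from collections import defaultdict

# Hypothetical per-modality records, each carrying its specimen GUID.
genome_rows     = [{"guid": "MDG-0001", "assembly": "asm_0001.fasta"}]
metabolome_rows = [{"guid": "MDG-0001", "peak_table": "peaks_0001.parquet"}]
habitat_rows    = [{"guid": "MDG-0001", "biome": "spiny forest"}]

def link_by_guid(*modalities):
    """Merge records from every modality under their specimen GUID, so each
    training example stays tied to one biological individual rather than
    being aggregated across specimens with different backgrounds."""
    specimens = defaultdict(dict)
    for rows in modalities:
        for row in rows:
            fields = {k: v for k, v in row.items() if k != "guid"}
            specimens[row["guid"]].update(fields)
    return dict(specimens)

linked = link_by_guid(genome_rows, metabolome_rows, habitat_rows)
print(linked["MDG-0001"]["biome"])  # prints "spiny forest"
```

The design point is that the join key is the specimen, not the species: two individuals of the same species keep separate records, preserving the genetic, environmental, and phenological context of each.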

Standard delivery formats for AI training use cases:
What data formats do you deliver for AI training?

  • Genome/transcriptome - FASTA/FASTQ (raw reads), GFF3 (annotations), pre-computed embeddings available on request
  • Metabolomics - mzML (spectra), processed peak tables in CSV/Parquet, SMILES strings for identified compounds
  • Proteomics - FASTA (sequences), mzML (mass spectra), processed intensity tables
  • Epigenomics - BED/bigWig (methylation tracks), processed CpG matrices
  • Microbiome - FASTQ (16S/shotgun reads), processed OTU tables in TSV/Parquet
  • Metadata - JSON-LD (Darwin Core compliant), with specimen GUID as the linking key across all modalities

Custom bulk delivery formats for large-scale AI training dataset integration are available by arrangement. Parquet and HDF5 delivery for large numerical arrays is supported. Contact us for a technical data format discussion.
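As a rough sketch of what a Darwin Core-style JSON-LD metadata record with a specimen GUID as linking key might look like (the `dwc:` terms are real Darwin Core vocabulary, but the record shape and GUID below are invented for illustration; the actual IsoGentiX schema may differ):

```python
import json

# Illustrative metadata record using Darwin Core (dwc:) terms.
record = {
    "@context": {"dwc": "http://rs.tdwg.org/dwc/terms/"},
    "@id": "urn:uuid:00000000-0000-0000-0000-000000000001",  # specimen GUID
    "dwc:scientificName": "Example species",
    "dwc:country": "Madagascar",
    "dwc:decimalLatitude": -18.9,
    "dwc:decimalLongitude": 47.5,
}

serialized = json.dumps(record, indent=2)
parsed = json.loads(serialized)
guid = parsed["@id"]  # the key shared by every modality file for this specimen
```

In this arrangement, each FASTA, mzML, or Parquet file for the specimen would carry or reference the same `@id`, so a training pipeline can assemble the full 8-layer profile from a single lookup.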

Yes - Nagoya Protocol compliance is specifically what makes IsoGentiX data commercially deployable for AI training where public databases are not. Public databases contain data with unresolved provenance: sequences submitted without PIC documentation, metabolite profiles from specimens with no recorded chain of custody, aggregated datasets with no clear country-of-origin record.
Is IsoGentiX data Nagoya Protocol compliant?

For AI companies operating in jurisdictions covered by the Nagoya Protocol (EU, UK, and increasingly others), using biological data in commercial AI model training without verified provenance creates regulatory exposure. IsoGentiX data comes with complete ABS documentation: IRCC certificates registered in the CBD Clearing House, MAT terms, blockchain-anchored chain of custody, and steganographic watermarking. This is the documentation your legal team needs to demonstrate due diligence under EU Regulation 511/2014.
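Blockchain-anchored chain of custody reduces to verifiable hash linkage: each custody event commits to the hash of the previous one, so altering any past event invalidates every later link. The sketch below illustrates that principle only; it is not the IsoGentiX anchoring scheme, and the event fields are invented.

```python
import hashlib
import json

def event_hash(event: dict, prev_hash: str) -> str:
    """Hash a custody event together with the previous event's hash."""
    payload = json.dumps(event, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode()).hexdigest()

def verify_chain(events: list, hashes: list) -> bool:
    """Recompute every link; any altered event breaks the chain."""
    prev = "0" * 64  # genesis value
    for event, recorded in zip(events, hashes):
        if event_hash(event, prev) != recorded:
            return False
        prev = recorded
    return True

# Hypothetical custody events for one specimen.
events = [
    {"step": "collection", "guid": "MDG-0001", "date": "2025-03-01"},
    {"step": "sequencing", "guid": "MDG-0001", "date": "2025-04-10"},
]
hashes, prev = [], "0" * 64
for e in events:
    prev = event_hash(e, prev)
    hashes.append(prev)

assert verify_chain(events, hashes)  # intact chain verifies
```

Anchoring the latest hash on a public ledger is what makes the record independently auditable: a regulator can recompute the chain from the raw custody events and compare against the anchored value.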

AI training data licences are structured differently from pharmaceutical domain licences. Key differences:
How are AI training data licences structured?

  • Scope - AI training licences typically cover broad dataset access (e.g. all specimens, all modalities) rather than exclusive taxonomic or geographic domains
  • Use restriction - licensed for incorporation into AI model training; any model outputs that identify specific commercialisable compounds or gene targets may require a separate downstream licence
  • Exclusivity - AI training data licences are typically non-exclusive (the same dataset may be licensed to multiple AI platforms); exclusive AI training data licences are available at premium terms
  • Delivery cadence - AI training datasets are typically delivered as point-in-time snapshots with optional update subscriptions, rather than the rolling quarterly release model used for domain licences

Contact us to discuss which licence structure fits your use case - AI training, discovery platform, or both.

IsoGentiX is in the active collection and sequencing phase of the programme. The current dataset covers priority species across the four major Madagascar biomes, with full 8-layer profiles per specimen. The programme is designed to scale to thousands of specimens across hundreds of species over the collection period.
How large is the dataset today?

AI training data licences include update subscriptions that provide access to newly generated data as it is released - each quarterly release adds specimens, species, and geographic coverage. Early AI training data partners therefore benefit from a growing dataset, with the earliest cohort of specimens already processed and available from day one of the licence.

For AI teams that need to understand current dataset scale before committing to a licence, a data inventory summary is available under NDA. Contact us to request one.

Discuss AI training data access

Contact us to discuss dataset scale, formats, and licence structure for your AI platform or foundation model programme.

Get in Touch