Why AI Needs Novel Biological Training Data - and Why Public Databases Aren't Enough
AI drug discovery platforms are hitting a structural ceiling: the genomic data they are trained on is largely redundant, public, and clustered in a narrow slice of biological space. The bottleneck is not model architecture - it is data novelty.
The core problem with public genomic databases for AI training
The world's major biological databases - GenBank, Ensembl, UniProt, NCBI - contain billions of sequence records. They represent extraordinary scientific achievements: decades of coordinated international effort to make biological knowledge openly accessible. For many research purposes, they are indispensable.
But for training AI drug discovery models, they have a structural limitation that no amount of additional compute or architectural innovation can resolve. Public databases are not a representative sample of biological diversity. They are heavily biased toward model organisms: humans, mice, Arabidopsis thaliana, yeast, Escherichia coli. They are biased toward commercially prioritised crops - maize, rice, soybean, wheat - that have received genomic investment proportional to their agricultural economic value. And they are biased toward organisms that can be easily cultured in laboratory conditions, which excludes the overwhelming majority of the planet's plant diversity.
Vast swathes of the evolutionary tree - including most of the planet's 400,000+ plant species - are represented in public databases by partial sequences, voucher specimens with no genomic data, or nothing at all. An AI model trained on this data does not learn the biology of plants. It learns the biology of a narrow, systematically selected subset of plants. When asked to extrapolate to the chemical spaces occupied by unstudied organisms, it does so poorly - not because the model architecture is inadequate, but because the training distribution does not contain the information needed to make accurate predictions in that space.
A 2025 analysis of 71 AI-generated drug candidate molecules found that 58.1% of them had a Tanimoto similarity coefficient above 0.4 to known compounds - meaning the models were largely recombining chemistry they had already seen. The fundamental constraint is training data diversity, not model capability. More compute applied to the same public data produces more sophisticated recombination of the same chemical space, not genuine exploration of novel territory.
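For readers unfamiliar with the metric: the Tanimoto coefficient is the ratio of shared to total fingerprint bits between two molecules - 1.0 for identical fingerprints, 0.0 for no overlap. A minimal sketch, assuming RDKit and Morgan fingerprints (the cited analysis does not specify which fingerprint it used, and the example molecules are arbitrary):

```python
# Illustrative only: Tanimoto similarity between two molecules, using RDKit
# Morgan fingerprints. The SMILES strings are arbitrary small examples.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(smiles_a: str, smiles_b: str, radius: int = 2, n_bits: int = 2048) -> float:
    """Tanimoto coefficient: shared bits / total bits set across both fingerprints."""
    fp_a = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_a), radius, nBits=n_bits)
    fp_b = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_b), radius, nBits=n_bits)
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)

print(tanimoto("CCO", "CCN"))  # ethanol vs. ethylamine
```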
What AI drug discovery platforms actually need
The leading AI drug discovery platforms - Recursion Pharmaceuticals, Isomorphic Labs, BenevolentAI, Insilico Medicine - are not primarily constrained by model architectures. Transformer architectures, graph neural networks, and diffusion models for molecular generation are all mature enough to extract value from good training data. What limits model quality in this domain is the training data itself - specifically, four properties that public databases structurally cannot provide at the scale these platforms require.
Chemical novelty
Compounds not already represented in ChEMBL, PubChem, or established natural product databases. Novel chemistry forces models to explore molecular space they have not previously encountered, expanding the generative range of downstream applications. Natural products from unstudied organisms - particularly those with long evolutionary isolation - are the primary available source of structurally novel chemistry at scale.
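As a concrete illustration of what "not already represented" means in practice, here is a hedged sketch of a nearest-neighbour novelty screen against a known-compound library. The function names, the 0.4 threshold (echoing the analysis above), and the use of RDKit are assumptions for illustration, not a description of any platform's actual pipeline:

```python
# Illustrative novelty screen: keep only candidates whose nearest neighbour in
# a known-compound library (e.g., an export from ChEMBL or PubChem) scores
# below a Tanimoto threshold. All names and the threshold are illustrative.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles: str):
    return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, nBits=2048)

def novel_candidates(candidates: list[str], known: list[str], threshold: float = 0.4) -> list[str]:
    known_fps = [fingerprint(s) for s in known]  # assumed non-empty
    kept = []
    for smiles in candidates:
        # Nearest-neighbour similarity to anything already in the library.
        nearest = max(DataStructs.BulkTanimotoSimilarity(fingerprint(smiles), known_fps))
        if nearest < threshold:
            kept.append(smiles)  # genuinely outside the library's chemical space
    return kept
```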
Linked data across every layer, tied to a single specimen
Data where genomic, transcriptomic, and metabolomic layers are linked at the individual specimen level - so that models can learn causal relationships between gene families and compound outputs, not just population-level correlations between sequence features and compound classes. This is the structural property that public databases, assembled from heterogeneous sources across different individuals and institutions, cannot provide.
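To make the structural point concrete, a hypothetical schema sketch - the class and field names below are invented for illustration, not the IsoGentiX data model:

```python
# Hypothetical schema sketch. The point is structural: every layer carries the
# same specimen_id, so gene content, expression, and metabolite output can be
# joined for a single individual rather than correlated across a population.
from dataclasses import dataclass

@dataclass
class GenomicLayer:
    specimen_id: str
    biosynthetic_gene_clusters: list[str]  # predicted BGCs in this individual's genome

@dataclass
class TranscriptomicLayer:
    specimen_id: str
    tissue: str
    expression_by_gene: dict[str, float]   # e.g., TPM values per gene

@dataclass
class MetabolomicLayer:
    specimen_id: str
    detected_compounds: list[str]          # e.g., InChIKeys of detected metabolites

def is_linked(g: GenomicLayer, t: TranscriptomicLayer, m: MetabolomicLayer) -> bool:
    # The property public aggregates lack: all layers resolve to one individual.
    return g.specimen_id == t.specimen_id == m.specimen_id
```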
Annotated biosynthetic context
Not just "here is a compound" but "here is the biosynthetic gene cluster responsible, here is its tissue-specific expression profile, here is the environmental context in which expression was elevated, here are the regulatory genes modulating output." Biosynthetic context data enables models to learn the logic of natural product chemistry - the rules by which gene families produce structural classes - rather than learning a lookup table of known gene-compound pairs.
Non-public exclusivity
Training on the same public data as every competitor produces models with similar capabilities and similar blind spots. The value of proprietary training data is not only that it is better - it is that it is different. Models trained on genuinely novel biological data explore chemical and genetic space that competitor models have not visited. That asymmetry is a durable competitive advantage, not an incremental improvement.
The Madagascar proposition for AI platforms
Madagascar's endemic flora represents a training data opportunity that is structurally unlike anything currently available at scale in public databases. The combination of properties it offers does not exist in any other single geographic or regulatory access point.
165 million years of evolutionary isolation has produced a flora whose secondary metabolite chemistry has minimal structural overlap with the compound families represented in ChEMBL, PubChem, or any published natural product database - all of which derive primarily from already-studied organisms. When an AI model encounters the alkaloid profiles of Malagasy Apocynaceae or Didiereaceae, it is encountering molecular architecture it has not seen before - which is precisely the condition under which training on that data expands model capability rather than reinforcing existing patterns.
The scale - 12,000+ endemic species, with a target of 10,000+ characterised over five years - is sufficient to train specialised foundation models for natural product chemistry discovery and plant biosynthetic pathway prediction. The eight-layer specimen-level multi-omics architecture provides the linkage across every data layer, tied to a single specimen, that population-level datasets cannot provide. And the FAIR-compliant, blockchain-verified provenance documentation provides the legally defensible commercial clearance that public database data, with its DSI provenance ambiguity post-COP-16, increasingly cannot offer.
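For readers unfamiliar with the mechanism, a minimal sketch of what hash-chained (blockchain-style) provenance verification involves - the event fields are illustrative placeholders, not the actual provenance schema:

```python
# Minimal sketch of hash-chained provenance, illustrative fields only: each
# record embeds the SHA-256 digest of its predecessor, so a retroactive edit
# to any record invalidates verification of every record after it.
import hashlib
import json

def digest(record: dict) -> str:
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

def append_record(chain: list[dict], event: dict) -> None:
    prev = digest(chain[-1]) if chain else "genesis"
    chain.append({**event, "prev_hash": prev})

def verify(chain: list[dict]) -> bool:
    for i, record in enumerate(chain):
        expected = digest(chain[i - 1]) if i else "genesis"
        if record["prev_hash"] != expected:
            return False
    return True

chain: list[dict] = []
append_record(chain, {"event": "field_collection", "specimen_id": "<specimen>", "permit": "<permit-id>"})
append_record(chain, {"event": "sequencing", "specimen_id": "<specimen>", "lab": "<lab-id>"})
assert verify(chain)
```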
"The AI models that will define the next generation of drug discovery are not going to be trained on the same public data as the models that exist today. They are going to be trained on data that is genuinely novel - from organisms that biology has not yet characterised."
What the market is already signalling
The investment and partnership activity of the past three years makes the market's read of the public data ceiling explicit. These are not speculative signals - they are capital allocation decisions by organisations with deep technical understanding of the problem.
Anthropic's acquisition of Coefficient Bio at approximately $400 million - just eight months after the company's founding - reflects a direct bet that proprietary biological data infrastructure, not model architecture, is the primary value driver in biological AI. Google DeepMind's AlphaFold demonstrated the same principle from the other direction: biological data at scale, properly structured, produces models of considerable commercial and scientific value. The structural data underpinning it was unprecedented in coverage and quality, and the model outputs reflected that.
Pharmaceutical companies including Sanofi, Bristol-Myers Squibb, Takeda, and AbbVie have formed or joined AI training data consortia explicitly to access novel biological data that public databases do not contain. The rationale is direct: if every AI drug discovery platform is trained on the same public data, the models converge on the same chemical space and the same failure modes. Proprietary data access is the mechanism for differentiation.
The market is explicitly acknowledging that the public data ceiling has been reached for leading-edge applications. The organisations positioning for the next decade of AI drug discovery are doing so by securing data access, not by optimising model architectures against existing public training sets.
The data platform versus the model: a distinction that matters
IsoGentiX is not building an AI drug discovery model. This distinction is commercially important and frequently misunderstood in conversations about the biological AI space.
The AI model layer - the architectures, training pipelines, and inference systems that pharmaceutical companies and AI drug discovery platforms deploy - is capital-intensive, rapidly evolving, and increasingly commoditised at the architectural level. The competitive differentiation between leading model-builders is, in large part, a function of their training data, not their model designs. Building another drug discovery model on top of public data is a difficult business to differentiate.
The biological data layer - generating genuinely novel, Nagoya-compliant, specimen-level multi-omics from an undercharacterised biodiversity hotspot - requires a different and substantially harder-to-replicate set of capabilities: field operations in remote terrain, cold-chain logistics for cryopreservation, regulatory access negotiated with a national government, community consent relationships built over years, genomics laboratory infrastructure, and a provenance architecture designed to satisfy legal requirements in multiple jurisdictions. These capabilities cannot be assembled quickly. They cannot be replicated from a desk. They are the moat.
If your organisation is evaluating sources of novel biological training data for genomics or chemistry foundation models, three questions matter. Is the data non-redundant with public databases - does it contain chemistry and genetic architecture that your models have not already been trained on? Is every record provenance-documented and legally cleared for commercial use, including in products sold into EU-regulated markets? And is the multi-modal data integrated at specimen level, or fragmented across different individuals in ways that limit causal inference? IsoGentiX is built to answer yes to all three.
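The third question can even be operationalised directly. A small illustrative check, assuming each omics layer exposes its set of specimen identifiers:

```python
# Illustrative check for the third question: what fraction of specimens in a
# dataset have all three omics layers resolving to the same specimen ID?
def specimen_level_coverage(genomic_ids: set[str],
                            transcriptomic_ids: set[str],
                            metabolomic_ids: set[str]) -> float:
    all_ids = genomic_ids | transcriptomic_ids | metabolomic_ids
    fully_linked = genomic_ids & transcriptomic_ids & metabolomic_ids
    return len(fully_linked) / len(all_ids) if all_ids else 0.0

# A dataset aggregated from heterogeneous public sources typically scores near
# zero on this measure; a specimen-level platform should score near one.
```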