IsoGentiX vs Public Databases: What's the Difference?
Public genomic databases have transformed biology. But for pharmaceutical AI, agritech crop science, and Nagoya-compliant commercial use, they have structural limitations that IsoGentiX is specifically built to address.
What public databases are good at
GenBank and the other NCBI resources, Ensembl, and UniProt are extraordinary scientific achievements. Between them they contain billions of sequence records accumulated over four decades of publicly funded science. They have enabled comparative genomics, powered vaccine development, trained the first generation of biological AI models, and made molecular biology accessible to researchers worldwide without computational infrastructure of their own.
For model organism research, academic comparative genomics, and the great majority of basic science applications, they remain the right tool. The scientific community's debt to these databases is enormous and the IsoGentiX position does not minimise it.
For pharmaceutical AI and agritech applications with commercial intent, however, public databases have three structural limitations that compound rather than cancel. Understanding these limitations is essential for any procurement team evaluating biodiversity data sources for a commercial programme.
Limitation 1: Taxonomic bias
Public databases are not representative samples of biological diversity. Their contents reflect decades of funding decisions, logistical constraints, and the tractability of specific organisms in laboratory conditions. The result is a corpus that is heavily biased toward:
- Model organisms - human, mouse, Arabidopsis, yeast, E. coli - which together account for a disproportionate share of total sequence data
- Commercially prioritised crops - maize, rice, soybean, wheat - which have been sequenced extensively because they have economic constituencies that fund genome programmes
- Organisms that culture well in laboratory conditions, skewing toward species compatible with European and North American research infrastructure
The consequence for novel chemistry is severe. Madagascar alone has 12,000+ endemic plant species - one of the world's densest concentrations of chemically unexplored biology. Fewer than 10 of these species have chromosome-level assemblies in any public database.
The fact that 98% of Madagascar's flora remains phytochemically unexplored is not an anomaly - it is representative of how funding and logistics, rather than scientific priority, have shaped biological data collection. Every region that lacks well-funded genomics institutions is similarly underrepresented. The bias is systematic and structural, not a gap that incremental effort will close.
Limitation 2: No Nagoya compliance documentation
This is the most commercially significant distinction, and its legal importance grows with every year the Nagoya Protocol is in force.
Public databases contain no proof that the genetic resources they hold were collected with Prior Informed Consent from the country of origin. No Mutually Agreed Terms have been documented. No chain-of-custody has been maintained between collection event, sequence deposition, and any commercial use that follows.
For any pharmaceutical or agritech company using these sequences in a commercial R&D pipeline - and particularly for AI companies training models on genomic data - this creates a growing legal exposure under the Nagoya Protocol and EU Regulation 511/2014. The obligations under these frameworks apply to the use of genetic resources, not just their physical collection. A company that trains a drug discovery model on sequence data with no provenance documentation is, on the current trajectory of regulatory enforcement, accumulating liability.
The Cali Fund agreed at COP16 (2024) extended benefit-sharing explicitly to Digital Sequence Information - the digital representation of genetic resources, including the sequence data held in public databases. The regulatory direction is unambiguous: benefit-sharing obligations are extending upstream, toward data use rather than just physical access.
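The documentation gap described in this section can be pictured as a set of fields that a sequence record either carries or lacks. The following is a minimal sketch only: the `ProvenanceRecord` class, its field names, and every value are hypothetical illustrations, not the schema of any public database or of IsoGentiX.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProvenanceRecord:
    """Hypothetical provenance fields for one sequence record."""
    accession: str
    country_of_origin: Optional[str] = None
    pic_reference: Optional[str] = None  # Prior Informed Consent
    mat_reference: Optional[str] = None  # Mutually Agreed Terms
    custody_log: tuple = ()              # ordered custody events


def missing_documentation(record):
    """Return the names of the provenance items the record lacks."""
    missing = []
    if not record.country_of_origin:
        missing.append("country_of_origin")
    if not record.pic_reference:
        missing.append("pic_reference")
    if not record.mat_reference:
        missing.append("mat_reference")
    if not record.custody_log:
        missing.append("custody_log")
    return missing


# A record that carries only an accession and no provenance fields.
bare = ProvenanceRecord(accession="EXAMPLE0001")

# A record with every field populated (all values are placeholders).
documented = ProvenanceRecord(
    accession="EXAMPLE0002",
    country_of_origin="MG",
    pic_reference="PIC-placeholder",
    mat_reference="MAT-placeholder",
    custody_log=("collected", "sequenced", "deposited"),
)

print(missing_documentation(bare))
print(missing_documentation(documented))
```

A procurement team could run a check of this kind over a candidate dataset to see how many records would need provenance work before commercial use.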
Limitation 3: Population-level, not specimen-level
In most public databases, the records for a given species are assembled from different individuals. The genome record may come from one plant; the transcriptomics from another; the metabolomics data - if it exists at all - from a third individual in a different location. The data is assigned to species, not to individual specimens.
This produces population-level averages - data that describes what is typical for the species rather than what is true for any individual within it. For classical comparative genomics this is often sufficient. For the causal analysis that AI-driven drug discovery and precision agritech require, it introduces a structural problem.
When genome data and metabolome data come from different individuals, the relationships between genetic features and chemical output reflect population-level correlation rather than individual-level causation. A machine learning model trained on this data learns population averages. It cannot reliably learn the genotype-to-metabolite relationships that make AI-driven drug discovery and crop improvement scientifically productive.
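The effect of unpaired data can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of any real species: the gene copy numbers, the linear genotype-to-metabolite relationship, and the noise level are all invented for illustration. Pairing each genotype with a metabolite value from a different individual is modelled by shuffling.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient, standard library only."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical species: each individual has a gene copy number (0-4) and a
# metabolite level that depends on it, plus measurement noise.
copy_number = [rng.randint(0, 4) for _ in range(500)]
metabolite = [2.0 * c + rng.gauss(0.0, 1.0) for c in copy_number]

# Specimen-level: genotype and metabolite measured on the same individual.
paired_r = pearson(copy_number, metabolite)

# Species-level: the metabolite values come from other individuals of the
# same species, modelled here by shuffling them relative to the genotypes.
unpaired = metabolite[:]
rng.shuffle(unpaired)
unpaired_r = pearson(copy_number, unpaired)

print(f"paired r = {paired_r:.2f}, unpaired r = {unpaired_r:.2f}")
```

In this toy setting the paired correlation is strong while the unpaired one falls close to zero, even though both datasets have identical species-level averages. That is the sense in which a model trained on unpaired data can learn the population mean but not the within-species genotype-to-metabolite relationship.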
IsoGentiX vs public databases: a direct comparison
| Criterion | Public databases (GenBank, Ensembl, UniProt) | IsoGentiX |
|---|---|---|
| Coverage of Madagascar's endemic flora | <0.1% - fewer than 10 chromosome-level assemblies | 10,000+ species target; all 8 layers per specimen |
| Nagoya compliance documentation | None - no PIC, MAT, or chain-of-custody records | Full - IRCC, blockchain-verified PIC/MAT, ABSCH-registered |
| Data integration level | Species-level - different individuals per data type | Specimen-level - all 8 layers from same individual via GUID |
| Commercial use clearance | No - DSI liability risk under EU Reg 511/2014 | Yes - provenance documentation designed for commercial deployment |
| Genome assembly quality | Variable; many low-quality or partial assemblies | EBP standard - Merqury QV ≥40, BUSCO ≥90% |
| Metabolomics data | Limited or absent for most species | LC-MS/MS and NIR fingerprint per specimen |
| Cryopreserved germplasm availability | No | Yes - physical voucher + −80°C and −196°C storage |
| Pricing model | Free to access | Licensed - commercial terms negotiated directly |
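Two rows of the table lend themselves to a short sketch: the assembly quality thresholds and the specimen-level join by GUID. The code below is illustrative only. The thresholds are the ones quoted in the table; the function names, the GUID strings, and the record contents are hypothetical. The QV conversion uses the standard Phred scale, on which QV 40 corresponds to one expected error per 10,000 bases.

```python
# Thresholds quoted in the comparison table.
MIN_QV = 40.0
MIN_BUSCO_PCT = 90.0


def qv_to_error_rate(qv):
    """Phred-scaled QV to per-base error probability (QV = -10 * log10(p))."""
    return 10 ** (-qv / 10)


def passes_assembly_qc(qv, busco_pct):
    """True if an assembly meets both quoted thresholds."""
    return qv >= MIN_QV and busco_pct >= MIN_BUSCO_PCT


def join_by_guid(*layers):
    """Keep only specimens present in every layer and group their records."""
    shared = set(layers[0]).intersection(*layers[1:])
    return {guid: [layer[guid] for layer in layers] for guid in sorted(shared)}


# Hypothetical data layers, each keyed by specimen GUID.
genomes = {
    "spec-001": {"qv": 43.1, "busco_pct": 96.2},
    "spec-002": {"qv": 35.0, "busco_pct": 91.0},
}
metabolomes = {
    "spec-001": {"lcms_features": 1200},
    "spec-003": {"lcms_features": 950},
}

joined = join_by_guid(genomes, metabolomes)
for guid, (genome, metabolome) in joined.items():
    ok = passes_assembly_qc(genome["qv"], genome["busco_pct"])
    print(guid, "QC pass" if ok else "QC fail", metabolome["lcms_features"])
```

Only specimens present in every layer survive the join, which is the practical meaning of "specimen-level": a genome and a metabolome are linked only when they share the same GUID.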
"The question is not whether public databases are valuable - they are. The question is whether they are sufficient for building a commercial AI drug discovery pipeline on Madagascar's biological chemistry. They are not."
Does IsoGentiX compete with public databases?
No. IsoGentiX contributes tier 1 non-commercial genome assemblies for flagship species to the Earth BioGenome Project under a CC-BY academic licence. The scientific community benefits from this public contribution, and the IsoGentiX data architecture is designed to interoperate with public database infrastructure rather than to replace it.
The distinction is between what is appropriate for academic science and what is sufficient for commercial deployment. IsoGentiX provides the commercial data layer: full multi-omics integration, metabolomics, soil chemistry, biosynthesis pathway annotations, population panels, and Nagoya-compliant provenance chains. This layer is available only to licensed commercial subscribers. The public contribution and the commercial offering serve different communities and are not in competition.
The relevant evaluation criterion is not "does IsoGentiX data overlap with what's in GenBank?" It is: "does the data IsoGentiX provides come with the specimen-level integration, Nagoya provenance documentation, and biological novelty that our commercial programme requires?" These are criteria public databases were never designed to meet.