IsoGentiX Knowledge Hub

IsoGentiX vs Public Databases: What's the Difference?

Public genomic databases have transformed biology. But for pharmaceutical AI, agritech crop science, and Nagoya-compliant commercial use, they have structural limitations that IsoGentiX is specifically built to address.

What public databases are good at

GenBank and the other NCBI resources, Ensembl, and UniProt are extraordinary scientific achievements. Between them they contain billions of sequence records accumulated over four decades of publicly funded science. They have enabled comparative genomics, powered vaccine development, trained the first generation of biological AI models, and made molecular biology accessible to researchers worldwide who lack computational infrastructure of their own.

For model organism research, academic comparative genomics, and the great majority of basic science applications, they remain the right tool. The scientific community's debt to these databases is enormous and the IsoGentiX position does not minimise it.

For pharmaceutical AI and agritech applications with commercial intent, however, public databases have three structural limitations that compound rather than cancel. Understanding these limitations is essential for any procurement team evaluating biodiversity data sources for a commercial programme.

Limitation 1: Taxonomic bias

Public databases are not representative samples of biological diversity. Their contents reflect decades of funding decisions, logistical constraints, and the tractability of specific organisms in laboratory conditions. The result is a corpus that is heavily biased toward organisms that were well funded, logistically accessible, and easy to work with in the laboratory.

The consequence for novel chemistry is severe. Madagascar alone has 12,000+ endemic plant species - one of the world's densest concentrations of chemically unexplored biology. Fewer than 10 of these species have chromosome-level assemblies in any public database.

The 98% of Madagascar's flora that remains phytochemically unexplored is not an anomaly - it is representative of how biological data collection has been shaped by funding and logistics rather than scientific priority. Every region that lacks well-funded genomics institutions is similarly underrepresented. The bias is systematic and structural, not a gap that incremental effort will close.

Limitation 2: No Nagoya compliance documentation

This is the most commercially significant distinction, and its legal importance grows with every year the Nagoya Protocol is in force.

Public databases contain no proof that the genetic resources they hold were collected with Prior Informed Consent from the country of origin. No Mutually Agreed Terms have been documented. No chain-of-custody has been maintained between collection event, sequence deposition, and any commercial use that follows.

For any pharmaceutical or agritech company using these sequences in a commercial R&D pipeline - and particularly for AI companies training models on genomic data - this creates a growing legal exposure under the Nagoya Protocol and EU Regulation 511/2014. The obligations under these frameworks apply to the use of genetic resources, not just their physical collection. A company that trains a drug discovery model on sequence data with no provenance documentation is, on the current trajectory of regulatory enforcement, accumulating liability.

The Cali Fund, agreed at CBD COP16 in 2024, extended benefit-sharing explicitly to Digital Sequence Information (DSI) - the digital representation of genetic resources, including the sequence data held in public databases. The regulatory direction is unambiguous: benefit-sharing obligations are extending upstream, toward data use rather than just physical access.
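The three missing documents described in this section (Prior Informed Consent, Mutually Agreed Terms, chain-of-custody) can be treated as a gate on what enters a commercial pipeline. The following is a minimal sketch under stated assumptions: the `SequenceRecord` schema and its field names are hypothetical, invented for illustration, and do not reflect the IsoGentiX data model or any public database format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SequenceRecord:
    """Hypothetical record layout; field names are illustrative assumptions."""
    accession: str
    pic_reference: Optional[str] = None   # Prior Informed Consent document ID
    mat_reference: Optional[str] = None   # Mutually Agreed Terms document ID
    custody_events: List[str] = field(default_factory=list)  # collection -> deposition -> use


def missing_provenance(record: SequenceRecord) -> List[str]:
    """Return the names of the provenance items the record lacks."""
    missing = []
    if not record.pic_reference:
        missing.append("PIC")
    if not record.mat_reference:
        missing.append("MAT")
    if not record.custody_events:
        missing.append("chain-of-custody")
    return missing


def cleared_for_commercial_use(records: List[SequenceRecord]) -> List[SequenceRecord]:
    """Keep only records with no missing provenance items."""
    return [r for r in records if not missing_provenance(r)]


# Illustrative data only: one undocumented record, one fully documented record.
undocumented = SequenceRecord(accession="EXAMPLE-0001")
documented = SequenceRecord(
    accession="EXAMPLE-0002",
    pic_reference="PIC-EXAMPLE",
    mat_reference="MAT-EXAMPLE",
    custody_events=["collected", "sequenced", "deposited"],
)
cleared = cleared_for_commercial_use([undocumented, documented])
```

A record with no provenance fields fails all three checks, which is the situation the section describes for sequences drawn from public databases. Passing such a check shows only that the documents are on file; it is not a legal determination.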

Limitation 3: Population-level, not specimen-level

In public databases, the different data types recorded for a species usually come from different individuals. The genome record for a given species may come from one plant; the transcriptomics from another; the metabolomics data - if it exists at all - from a third individual in a different location. The data is assigned to species, not to individual specimens.

This produces population-level averages - data that describes what is typical for the species rather than what is true for any individual within it. For classical comparative genomics this is often sufficient. For the causal analysis that AI-driven drug discovery and precision agritech require, it introduces a structural problem.

When genome data and metabolome data come from different individuals, the relationships between genetic features and chemical output reflect population-level correlation rather than individual-level causation. A machine learning model trained on this data learns population averages. It cannot reliably learn the genotype-to-metabolite relationships that make AI-driven drug discovery and crop improvement scientifically productive.
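The difference between the two join strategies can be sketched in a few lines. This is an illustrative toy, not a real analysis: the specimen IDs, the species label, the genotype coding, and the metabolite values are all made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (specimen_id, species, value). All values are invented.
genotypes = [
    ("S1", "sp_a", 1),  # 1 = carries a hypothetical biosynthesis gene variant
    ("S2", "sp_a", 0),
    ("S3", "sp_a", 1),
    ("S4", "sp_a", 0),
]
metabolites = [
    ("S1", "sp_a", 9.0),  # abundance of a hypothetical compound
    ("S2", "sp_a", 1.0),
    ("S3", "sp_a", 8.0),
    ("S4", "sp_a", 2.0),
]


def species_level_join(genotypes, metabolites):
    """Join on species only: each species collapses to one pair of averages."""
    g, m = defaultdict(list), defaultdict(list)
    for _, species, value in genotypes:
        g[species].append(value)
    for _, species, value in metabolites:
        m[species].append(value)
    return {sp: (mean(g[sp]), mean(m[sp])) for sp in g if sp in m}


def specimen_level_join(genotypes, metabolites):
    """Join on specimen ID: genotype and metabolite come from the same individual."""
    metabolite_by_id = {sid: value for sid, _, value in metabolites}
    return [
        (sid, gv, metabolite_by_id[sid])
        for sid, _, gv in genotypes
        if sid in metabolite_by_id
    ]


def mean_metabolite_by_genotype(pairs):
    """Average metabolite level for each genotype across paired specimens."""
    groups = defaultdict(list)
    for _, gv, mv in pairs:
        groups[gv].append(mv)
    return {gv: mean(values) for gv, values in groups.items()}


by_species = species_level_join(genotypes, metabolites)
by_specimen = mean_metabolite_by_genotype(specimen_level_join(genotypes, metabolites))
```

With this toy data the species-level join yields a single pair of averages per species, so the variant's association with the compound is invisible. The specimen-level join keeps each individual's genotype next to its own metabolite reading, and the difference between carriers and non-carriers becomes recoverable.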

IsoGentiX vs public databases: a direct comparison

Criterion | Public databases (GenBank, Ensembl, UniProt) | IsoGentiX
Coverage of Madagascar's endemic flora | <0.1% - fewer than 10 chromosome-level assemblies | 10,000+ species target; all 8 layers per specimen
Nagoya compliance documentation | None - no PIC, MAT, or chain-of-custody records | Full - IRCC, blockchain-verified PIC/MAT, ABSCH-registered
Data integration level | Species-level - different individuals per data type | Specimen-level - all 8 layers from same individual via GUID
Commercial use clearance | No - DSI liability risk under EU Reg 511/2014 | Yes - provenance documentation designed for commercial deployment
Genome assembly quality | Variable; many low-quality or partial assemblies | EBP standard - Merqury QV ≥40, BUSCO ≥90%
Metabolomics data | Limited or absent for most species | LC-MS/MS and NIR fingerprint per specimen
Cryopreserved germplasm availability | No | Yes - physical voucher + −80°C and −196°C storage
Pricing model | Free for academic use | Licensed - commercial terms negotiated directly

"The question is not whether public databases are valuable - they are. The question is whether they are sufficient for building a commercial AI drug discovery pipeline on Madagascar's biological chemistry. They are not."
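For readers unfamiliar with the assembly-quality figures in the comparison above: Merqury reports consensus quality on the Phred scale, where QV = -10 · log10(error rate), so QV 40 corresponds to about one base error per 10,000 bases. The sketch below applies the two thresholds quoted in the table; the function and constant names are illustrative assumptions and are not part of Merqury, BUSCO, or any IsoGentiX tooling.

```python
import math

# Thresholds as quoted in the comparison table.
MIN_QV = 40.0          # Merqury consensus quality value (Phred scale)
MIN_BUSCO_PCT = 90.0   # BUSCO completeness, percent


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled QV to an estimated per-base error rate."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error rate to a Phred-scaled QV."""
    return -10 * math.log10(error_rate)


def meets_thresholds(qv: float, busco_pct: float) -> bool:
    """Check an assembly's reported metrics against both thresholds."""
    return qv >= MIN_QV and busco_pct >= MIN_BUSCO_PCT
```

For example, `meets_thresholds(42.3, 95.1)` passes, while an assembly with QV 35 (roughly three errors per 10,000 bases) fails regardless of its BUSCO score.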

Does IsoGentiX compete with public databases?

No. IsoGentiX contributes tier 1 non-commercial genome assemblies for flagship species to the Earth BioGenome Project under CC-BY academic licence. The scientific community benefits from this public contribution, and the IsoGentiX data architecture is designed to interoperate with public database infrastructure rather than to replace it.

The distinction is between what is appropriate for academic science and what is sufficient for commercial deployment. IsoGentiX provides the commercial data layer: full multi-omics integration, metabolomics, soil chemistry, biosynthesis pathway annotations, population panels, and Nagoya-compliant provenance chains. This layer is available only to licensed commercial subscribers. The public contribution and the commercial offering serve different communities and are not in competition.

The right question for procurement teams

The relevant evaluation criterion is not "does IsoGentiX data overlap with what's in GenBank?" It is: "does the data IsoGentiX provides come with the specimen-level integration, Nagoya provenance documentation, and biological novelty that our commercial programme requires?" These are criteria public databases were never designed to meet.