The Data Architecture

Eight layers deep. Every specimen traceable. Nothing approximated.

GUID-linked, cryptographically auditable, EBP-standard multi-omics — built for pharmaceutical AI, agritech pipeline integration, and LLM foundation model training.

The IsoGentiX Architecture

Eight layers of data. One specimen. Unbroken provenance.

Every data product originates from a physical, GPS-logged, FPIC-consented specimen. Each is assigned a permanent GUID at collection — linking every downstream layer to the exact plant, location, and legal authorisation.

Specimen Provenance & GPS Metadata

GPS-logged (±3m), FPIC-consented, permit-governed physical herbarium voucher. GUID assigned at point of collection — the legal and scientific anchor for every layer above.

GPS ±3m · FPIC + PIC documentation · Permit reference · GUID at collection · Nagoya-compliant

Reference Genome (WGS)

Chromosome-level whole-genome assembly from long-read sequencing. Intraspecies variation panels from multiple individuals per species where population size and permit conditions allow.

Merqury QV ≥40 · BUSCO ≥90% · Chromosome-level · Long-read · EBP standard

Transcriptome (RNA-Seq)

Tissue- and condition-specific expression profiling. Stress-response and drought-tolerance profiles with biosynthetic gene cluster identification.

RNA-Seq · Tissue-specific · Stress-response profiles · BGC annotation · DESeq2

Metabolome (LC-MS/MS Targeted)

Targeted alkaloid, terpenoid, and flavonoid panels. Full feature maps per specimen with compounds annotated against public spectral databases.

LC-MS/MS targeted · Alkaloids · Terpenoids · Flavonoids · Spectral DB annotation · Unknown-feature lists

Proteome (LC-MS/MS)

Protein expression profiling linked to transcriptomic data. AlphaFold-compatible structure predictions for uncharacterised enzymes endemic to Malagasy species.

LC-MS/MS proteomics · AlphaFold format · Binding domain analysis · Novel enzyme characterisation

Epigenome (ATAC-Seq / BS-Seq)

Chromatin accessibility and methylation profiling. Stress-condition epigenetic modifications — how species adapted to Madagascar’s extreme edaphic conditions.

ATAC-Seq · Whole-genome BS-Seq · Regulatory elements · Methylation profiles · Stress-condition data

Microbiome (16S / ITS)

Rhizosphere and endophytic profiling from matched soil and plant tissue. Ecological context unavailable from any laboratory culture collection.

16S rRNA bacterial · ITS fungal · Rhizosphere characterisation · Endophyte profiling

Ecological & Ethnobotanical Context

Validated ethnobotanical use records, IUCN status, and soil chemistry from the collection site. Transforms raw omics into interpretable biological intelligence.

IUCN status · Ethnobotanical records · Soil chemistry · pH & mineral analysis · Distribution modelling
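
As a rough illustration of the layer model described above, the eight layers can be represented as an ordered enumeration, so that a specimen's coverage can be checked layer by layer. This is a minimal sketch only: the names `Layer` and `missing_layers` are hypothetical and do not describe the platform's actual schema.

```python
from enum import IntEnum


class Layer(IntEnum):
    """The eight data layers, numbered in the order they are described above."""
    SPECIMEN_PROVENANCE = 1
    REFERENCE_GENOME = 2
    TRANSCRIPTOME = 3
    METABOLOME = 4
    PROTEOME = 5
    EPIGENOME = 6
    MICROBIOME = 7
    ECOLOGICAL_CONTEXT = 8


def missing_layers(available: set[Layer]) -> list[Layer]:
    """Return the layers not yet populated for a specimen, in layer order."""
    return [layer for layer in Layer if layer not in available]


# Hypothetical specimen with provenance, genome, and metabolome data so far.
have = {Layer.SPECIMEN_PROVENANCE, Layer.REFERENCE_GENOME, Layer.METABOLOME}
print([layer.name for layer in missing_layers(have)])
```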

Earth BioGenome Project Standards

Reference-grade genomes. No shortcuts.

The IsoGentiX genome programme operates to Earth BioGenome Project (EBP) standards - the most rigorous benchmarks in the field. Every genome assembly is independently quality-assessed before entering the platform. This is not aspirational: it is a contractual requirement written into every data access agreement.

IsoGentiX treats the EBP quality thresholds as the minimum standard at which a genome assembly is reliably interpretable by AI/ML models and valid for drug target identification. Below these thresholds, gaps and errors in the assembly create false signals - precisely the noise pharmaceutical AI pipelines were built to eliminate.

When a data partner licenses an IsoGentiX genome, they receive a quality certificate alongside the data. Every assembly that does not pass is resequenced - not released.

QV ≥ 40 · Merqury quality value (at most 1 error per 10,000 bp)
≥ 90% · BUSCO completeness (conserved gene presence)
Chromosome-level · assembly target
Long-read · sequencing platform
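
The two numeric thresholds above can be expressed as a simple release gate. The sketch below uses the standard Phred-scale relationship (error probability = 10^(-QV/10)) to show why QV 40 corresponds to one error per 10,000 bp. The function names are illustrative assumptions, not the platform's QA code, and the non-numeric criteria (chromosome-level assembly, long-read sequencing) are not modelled.

```python
QV_MIN = 40.0      # Merqury QV threshold stated above
BUSCO_MIN = 90.0   # BUSCO completeness threshold (%) stated above


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def passes_gate(qv: float, busco_pct: float) -> bool:
    """True only if both numeric thresholds are met (hypothetical helper)."""
    return qv >= QV_MIN and busco_pct >= BUSCO_MIN


# QV 40 corresponds to roughly one error per 10,000 bases.
print(f"{qv_to_error_rate(40):.6f}")

# An assembly with QV 43.2 and BUSCO 94.1% clears both thresholds.
print(passes_gate(43.2, 94.1))
# An assembly below either threshold is held back.
print(passes_gate(38.0, 95.0))
```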

Intraspecies variation: Where population size and permit conditions allow, multiple individuals per species are sequenced to capture edaphic and population-level genetic variation - providing the statistical depth that single-specimen databases cannot offer.

EBP-standard. Merqury QV ≥ 40. BUSCO ≥ 90%. Contractual quality guarantee on every genome assembly delivered.

GUID Architecture

One permanent identifier. Every layer. Unbreakable chain.

Every specimen is assigned a Globally Unique Identifier (GUID) at the moment of physical collection. This GUID travels through every analytical layer - from the herbarium voucher to the final metabolite profile - creating an unbroken, machine-queryable chain of custody that is simultaneously a legal compliance record and a scientific data integrity mechanism.

Specimen IGX-2026-00847 - GUID linkage example

SPECIMEN_GUID · IGX-2026-00847-EST-S04 · Anchor
HERBARIUM_REF · TAN-H-2026-00847 - Parc National Analamazaotra · Physical
PERMIT_REF · MEDD-ABS-2026-0847 - FPIC-VIL-044 · Legal
GENOME_ACCESSION · IGX-WGS-00847 - QV=43.2 - BUSCO=94.1% · Layer 2
METABOLOME_ID · IGX-MET-00847 - 312 compounds annotated · Layer 4
BLOCKCHAIN_HASH · 0x3a7f9c2d-e841b3f0 - immutable audit record · Provenance
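
One way to picture the linkage is a record keyed on the specimen GUID, with a reverse index so that any downstream identifier resolves back to its specimen. The sketch below reuses the identifier values from the example above. The class and field layout are assumptions made for illustration, not the platform's actual data model.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SpecimenRecord:
    """Hypothetical record shape; field names mirror the example above."""
    specimen_guid: str
    herbarium_ref: str
    permit_ref: str
    genome_accession: str
    metabolome_id: str
    blockchain_hash: str


def build_index(records):
    """Map every identifier on every record back to its specimen GUID."""
    index = {}
    for rec in records:
        for f in fields(rec):
            index[getattr(rec, f.name)] = rec.specimen_guid
    return index


example = SpecimenRecord(
    specimen_guid="IGX-2026-00847-EST-S04",
    herbarium_ref="TAN-H-2026-00847",
    permit_ref="MEDD-ABS-2026-0847",
    genome_accession="IGX-WGS-00847",
    metabolome_id="IGX-MET-00847",
    blockchain_hash="0x3a7f9c2d-e841b3f0",
)

index = build_index([example])
# A metabolome hit resolves back to the physical specimen it came from.
print(index["IGX-MET-00847"])
```

A flat reverse index is the simplest choice for a sketch; a production system would more likely hold these links in a database with the GUID as a foreign key.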

For Pharmaceutical Partners

Query by compound family or target class - then trace every hit back to its specimen, its genome, its collection location, and its benefit-sharing terms. No ambiguity. No data hygiene backlog.

For Agritech Partners

Access trait-linked genomic regions with substrate and microbiome context attached. Filter by edaphic condition: laterite, karst, ultramafic. Every result includes the ecological provenance needed to contextualise gene expression in crop-improvement models.

For AI / LLM Partners

Clean, GUID-structured multi-modal biological data in machine-readable formats. Every training sample carries legal provenance metadata - addressing the Nagoya compliance gap that currently prevents most AI platforms from using biological datasets at scale.
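
To make the "filter by edaphic condition" idea concrete, here is a minimal sketch of filtering results by substrate while keeping legal provenance attached to each hit. The `TraitHit` type, the identifiers, and the permit labels are all invented for illustration; the real query interface is not described in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraitHit:
    specimen_guid: str   # hypothetical identifiers, not real platform records
    substrate: str       # "laterite", "karst", or "ultramafic"
    permit_ref: str      # legal provenance carried with every hit


def filter_by_substrate(hits, substrate):
    """Keep only hits collected on the requested substrate type."""
    return [h for h in hits if h.substrate == substrate]


hits = [
    TraitHit("EXAMPLE-0001", "laterite", "PERMIT-A"),
    TraitHit("EXAMPLE-0002", "karst", "PERMIT-B"),
    TraitHit("EXAMPLE-0003", "ultramafic", "PERMIT-C"),
    TraitHit("EXAMPLE-0004", "karst", "PERMIT-B"),
]

# Every returned hit still carries its specimen GUID and permit reference.
for hit in filter_by_substrate(hits, "karst"):
    print(hit.specimen_guid, hit.permit_ref)
```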

Blockchain Provenance

An immutable audit trail from field to API endpoint.

Every data transaction - collection, processing, quality assurance, licensed access - is recorded as an immutable event on a distributed ledger. This is not a marketing claim. It is the technical architecture required to demonstrate Nagoya Protocol compliance to regulators, to satisfy pharmaceutical due-diligence requirements, and to provide AI training data that can withstand legal scrutiny of its provenance.

Field Collection

GUID minted at point of collection. GPS, permit reference, and FPIC documentation hashed and recorded.

Laboratory Processing

Each analytical step (DNA extraction, sequencing run, LC-MS acquisition) logged with instrument ID, operator, and timestamp.

Quality Assurance

QV and BUSCO scores recorded. Assemblies below threshold flagged and held. Pass/fail decision and rationale recorded.

Benefit Sharing

Monetary and non-monetary benefit-sharing disbursements linked to licence events - recorded against the audit trail and reportable to MEDD.

Data Access

Every licensed query logged with partner ID and scope. Permissioned API returns provenance metadata with every data response.
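
The stages above describe an append-only event record. As a simplified illustration of why such a record is tamper-evident, the sketch below chains events with SHA-256 so that altering any earlier event invalidates every later hash. It is a single-process hash chain, not a distributed ledger, and the `AuditLog` class and event fields are assumptions rather than the platform's implementation.

```python
import hashlib
import json


def event_hash(prev_hash: str, payload: dict) -> str:
    """Hash an event together with the hash of the event before it."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class AuditLog:
    GENESIS = "0" * 64  # placeholder hash preceding the first event

    def __init__(self):
        self.events = []  # list of (payload, hash) tuples, oldest first

    def append(self, payload: dict) -> str:
        prev = self.events[-1][1] if self.events else self.GENESIS
        digest = event_hash(prev, payload)
        self.events.append((payload, digest))
        return digest

    def verify(self) -> bool:
        """Recompute the chain; any edited payload breaks it."""
        prev = self.GENESIS
        for payload, digest in self.events:
            if event_hash(prev, payload) != digest:
                return False
            prev = digest
        return True


log = AuditLog()
log.append({"stage": "field_collection", "guid": "IGX-2026-00847-EST-S04"})
log.append({"stage": "laboratory_processing", "step": "sequencing_run"})
log.append({"stage": "quality_assurance", "qv": 43.2, "busco": 94.1, "passed": True})
print(log.verify())  # True

# Editing a recorded event after the fact is detected on verification.
log.events[0][0]["guid"] = "TAMPERED"
print(log.verify())  # False
```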

Edaphic Context & Intraspecies Variation

The same species on different soils is a different dataset.

Madagascar's extraordinary geological diversity - laterite plateau, tsingy limestone, ultramafic substrates, quartzitic spiny desert - means that the same plant species can exhibit dramatically different gene expression profiles, metabolite production, and stress-response mechanisms depending on where it grows.

Most public genomic databases treat a species as a single entity. IsoGentiX captures intraspecies variation at the edaphic level: where population size and permit conditions allow, we sequence multiple individuals per species from different substrate types. This produces variation panels that are directly relevant to crop resilience modelling and pharmaceutical lead diversification.

The microbiome layer (Layer 7) amplifies this: rhizosphere community composition from ultramafic substrates is profoundly different from that on laterite. These microbiome-substrate-plant interactions are entirely absent from any existing public or commercial dataset.
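
A variation panel of the kind described above amounts to grouping specimens of one species by the substrate they were collected on. The sketch below shows that grouping under invented data: the species labels, identifiers, and the `min_substrates` parameter are hypothetical and chosen only to illustrate the idea.

```python
from collections import defaultdict


def variation_panels(specimens, min_substrates=2):
    """Group specimen GUIDs by species and substrate.

    Keeps only species sampled on at least `min_substrates` distinct
    substrate types, since a single substrate gives no edaphic contrast.
    """
    by_species = defaultdict(lambda: defaultdict(list))
    for guid, species, substrate in specimens:
        by_species[species][substrate].append(guid)
    return {
        species: dict(substrates)
        for species, substrates in by_species.items()
        if len(substrates) >= min_substrates
    }


# (guid, species, substrate) tuples; all values are made up.
specimens = [
    ("EXAMPLE-0001", "species_A", "laterite"),
    ("EXAMPLE-0002", "species_A", "ultramafic"),
    ("EXAMPLE-0003", "species_A", "ultramafic"),
    ("EXAMPLE-0004", "species_B", "laterite"),
]

panels = variation_panels(specimens)
print(sorted(panels))           # only species_A spans two substrates
print(panels["species_A"]["ultramafic"])
```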

Four Substrate Types - One Species

Laterite plateau · Base genome

Standard alkaloid profile; high drought-response gene expression

Tsingy limestone karst · +34% alkaloid diversity

Elevated calcium stress → novel secondary metabolite production

Ultramafic substrates · Unique gene clusters

Heavy metal tolerance genes with no analogues in any public database

Quartzitic spiny desert · Desiccation tolerance

Extreme xerophyte adaptations; biosynthetic pathway variants of agritech interest

Unlocking nature's intelligence.

Founding Partners receive first-mover domain exclusivity, direct input into the species prioritisation schedule, and data access that begins before public database availability.
