
“We are drowning in data, but we are starved for knowledge,” Zhong Wang, a computational biologist and staff scientist at the Department of Energy (DOE) Joint Genome Institute (JGI), said recently, paraphrasing Nobel laureate Sydney Brenner.
That’s a concise summary of a very modern scientific problem: Biologists can now generate genomic data far faster than researchers can interpret it. JGI’s answer is the GenomeOcean effort, which released a pilot, 4-billion-parameter genome foundation model trained using supercomputing resources at the National Energy Research Scientific Computing Center (NERSC).
GenomeOcean uses large language models to interpret and generate complex sequences of publicly available biological data, such as DNA, at scale. The goal is to compress the time between hypothesis and insight for scientists working to advance precision medicine, drug development, and environmental genomics.
That effort will soon receive a major boost from NERSC’s next flagship supercomputer, Doudna, or NERSC-10, which will be built by Dell Technologies and powered by the NVIDIA Vera Rubin platform. Doudna will bring new speed, memory, and architectural capabilities aimed at the convergence of artificial intelligence (AI) and high-performance computing.
JGI’s mission: read genomes, understand function, write solutions
Wang framed JGI’s mission in three plain-language questions.
First: How do we read the genome? JGI helps scientists sequence DNA and RNA from samples. Second: How do we understand the function of the genome? That’s the hard middle step – translating sequences into biological meaning. Third: How do we write the genome? This is synthetic biology, in service of DOE priorities including energy and environmental challenges.
GenomeOcean was created to help researchers move faster from raw sequence data to actionable biological insights.
The bottlenecks: genomic scale, metagenomic messiness, and uncertain annotation
As Wang noted, today’s challenge isn’t a lack of data. It’s that the data is vast, unstructured, and only partially labeled, especially in metagenomics, where researchers sequence genetic material from entire microbial communities, rather than single cultured organisms.
In assembled metagenomes alone, JGI has more than 30 terabytes of data, Wang said. And that’s just one part of a broader, petabyte-size biology landscape.
The downstream problem is interpretation. Functional annotation often relies on prediction, which is sometimes accurate and sometimes uncertain, Wang noted. The result can be slow, iterative science that struggles to keep up with the pace of sequencing.
The impetus for GenomeOcean: existing models didn’t match real scenarios
GenomeOcean began as a practical response to what Wang described as a gap between early large language models in the biological arena and the situations JGI scientists face. His team evaluated available approaches and found they weren’t delivering what JGI needed in real workflows.
“Early models were promising, but they didn’t translate cleanly to the day-to-day problems we’re trying to solve in genomics,” Wang said.
But JGI had two critical ingredients: a deep bench of genomic expertise and access to serious compute through NERSC allocations. That combination made it feasible to build a foundation model tuned to scientific utility, not just high-performance computing benchmark appeal.
The GenomeOcean model design: built for efficiency and trust
GenomeOcean borrows the basic concept that made large language models transformative – training at scale to learn structure from raw sequences – and applies it to DNA. The GenomeOcean model was trained on a corpus derived from 220 TB of metagenomic datasets collected from diverse habitats, according to JGI.
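The core mechanic of treating DNA as a language can be illustrated with a toy tokenizer. GenomeOcean’s actual tokenizer and vocabulary aren’t described here, so the function below, and the choice of k, are illustrative assumptions only: it chops a sequence into fixed-size k-mer tokens, a common preprocessing step before a transformer sees the data.

```python
# Minimal sketch of treating DNA as a "language": split a sequence into
# consecutive k-mer tokens. GenomeOcean's real tokenizer is not described
# in the article; this function and the k value are illustrative.

def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Split a DNA string into consecutive, non-overlapping k-mers."""
    seq = seq.upper().strip()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

print(kmer_tokenize("atgcgtacgttagc", k=4))  # ['ATGC', 'GTAC', 'GTTA']
```

Each k-mer token would then be mapped to an integer ID and fed to the model, just as word or subword tokens are in a text language model.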
Wang emphasized two design priorities:
- Efficiency. GPUs are scarce, and science teams can’t assume unlimited compute. GenomeOcean was built to run faster and make better use of available accelerators.
- Accuracy and scientific reliability. In genomics, Wang warned, it can be harder to spot hallucinations than in human language tasks because a plausible-looking sequence may still be biologically wrong. That puts pressure on model selection and evaluation.
Depending on the task, GenomeOcean can be 50 to 100 times faster than other models, Wang said.
Open science – responsibly: public artifacts, plus tools to protect data integrity
GenomeOcean is being developed under open science principles, with publicly available data intended to make the work reproducible and extensible by the broader research community. The official implementation repository, for example, includes guidance for containerized deployment – an important detail for scientific software reuse.
But open doesn’t mean uncontrolled. Wang raised a specific concern: If synthetic sequences generated by models were to leak into public repositories without labeling, they could pollute the “gold standard” datasets that scientists rely on.
To safeguard the integrity of shared scientific data, the team released an artificial DNA detector that Wang said achieves over 99% accuracy in distinguishing synthetic DNA generated by GenomeOcean from real genomic sequences.
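How the detector works internally isn’t described here; the toy sketch below only illustrates the general framing of such a screening tool. It scores a query sequence against a reference k-mer frequency profile and flags large divergence. The function names, the L1-distance metric, and the threshold are all hypothetical; JGI’s actual detector is a trained model reaching the accuracy quoted above.

```python
from collections import Counter

# Toy illustration of synthetic-sequence screening. JGI's real detector is
# a trained model; the k-mer-profile features, L1 metric, and 0.5 threshold
# here are hypothetical stand-ins for that learned decision rule.

def kmer_profile(seq: str, k: int = 4) -> Counter:
    """Relative k-mer frequencies of a DNA sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return Counter({kmer: n / total for kmer, n in counts.items()})

def profile_distance(p: Counter, q: Counter) -> float:
    """L1 distance between two k-mer frequency profiles."""
    return sum(abs(p[kmer] - q[kmer]) for kmer in set(p) | set(q))

def looks_synthetic(seq: str, reference: Counter, threshold: float = 0.5) -> bool:
    """Flag a sequence whose k-mer profile diverges from the reference."""
    return profile_distance(kmer_profile(seq), reference) > threshold
```

The point of the framing is the workflow, not the metric: any sequence submitted to a public repository could be screened this way before it is accepted as “real” data.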
What changes with Doudna: faster training loops, larger models, and fewer bottlenecks
GenomeOcean is already operating at the frontier of what current high-performance computing systems can comfortably support. As model sizes grow, Wang said, the work becomes harder in practical ways: A model may no longer fit on a single GPU, forcing training to be split across devices and making communication between nodes a new bottleneck.
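A quick memory budget shows why. The accounting below is a common rule of thumb, not GenomeOcean’s actual setup: bf16 weights and gradients plus fp32 Adam optimizer state come to roughly 16 bytes per parameter, before activations and buffers.

```python
# Rough memory arithmetic for why growing models stop fitting on one GPU.
# Assumed (not from the article): bf16 weights (2 B) and gradients (2 B),
# plus fp32 Adam master weights and two moment tensors (4 B each),
# about 16 bytes per parameter, excluding activations and buffers.

def training_memory_gb(params_billions: float, bytes_per_param: float = 16) -> float:
    """Approximate training-state memory in GB, excluding activations."""
    return params_billions * bytes_per_param

for size in (4, 40):
    print(f"{size}B params -> ~{training_memory_gb(size):.0f} GB of training state")
```

On that basis, a 4-billion-parameter model already needs on the order of 64 GB of training state, close to the capacity of a single 80 GB accelerator, and a 10x larger model has no choice but to shard its state across many GPUs.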
That’s where the new Doudna supercomputer comes in.
Built by Dell Technologies around NVIDIA’s next-generation Vera Rubin CPU-GPU platform, Doudna will speed up both AI and simulation workflows. It will provide more than 10 times the performance of Perlmutter, NERSC’s current flagship supercomputer, according to the DOE.
To meet NERSC’s requirement for a scalable, flexible, open systems architecture, Doudna will be built on Dell’s PowerEdge Integrated Rack 7000, a high-density, rack-scale infrastructure platform designed for AI and high-performance computing workloads.
“We proposed the PowerEdge IR7000 to future-proof architecture and offer flexible GPU scaling tailored to NERSC’s workflows,” said Paul Perez, technology senior fellow, Dell Technologies.
NVIDIA Quantum-X800 InfiniBand will provide ultra-low-latency, high-bandwidth networking that allows thousands of GPUs to communicate as if they were a single processor, enabling real-time analysis of massive datasets.
Meanwhile, NVIDIA BlueField data processing units will offload networking, security, and storage tasks from the main processors to keep the system’s compute power focused entirely on scientific discovery. Dell’s Omnia open-source software will help teams simplify the management of advanced computing workloads.
From the researcher’s perspective, Wang expects Doudna-class capabilities to change what is feasible in genome research. He noted anticipated improvements in memory, memory speed, and new training “tricks” such as lower-precision (4-bit) training support on future chips – capabilities that could make training large foundation models more practical.
He also offered a rough calculation to illustrate the scale of the work: Training a model at GenomeOcean’s current scale could take 50,000 GPU node hours. With Doudna, Wang estimated that training could be 30 to 50 times faster, enabling dramatically shorter iteration cycles for next-generation models.
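That estimate is easy to reproduce as arithmetic. The 50,000 GPU node-hour figure and the 30-to-50x range come from Wang’s remarks; the 256-node wall-clock illustration below is a hypothetical assumption added here.

```python
# Back-of-envelope version of Wang's estimate. The 50,000 GPU node-hour
# figure and the 30-50x speedup come from his remarks; the 256-node
# wall-clock illustration is an assumption for scale.

def doudna_node_hours(current_hours: float, speedup: float) -> float:
    """Node hours after an estimated end-to-end speedup."""
    return current_hours / speedup

for s in (30, 50):
    print(f"{s}x faster -> ~{doudna_node_hours(50_000, s):,.0f} node hours")

# Spread across, say, 256 nodes in parallel, 50,000 node hours is roughly
# 50_000 / 256, or about 195 hours (around 8 days) of wall clock; at a
# 30x speedup that drops to well under a day.
```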
Doudna is expected to be available to DOE scientists in the first half of 2027.
End-to-end AI enablement: reducing the heavy lift behind big models
Big models aren’t just a compute problem. They’re an operations problem. Even with deep scientific expertise, training large models can be a major lift for researchers whose primary expertise is in the life sciences, not computer science, Wang noted.
Early on, the GenomeOcean team wasn’t equipped to navigate the fast-moving ecosystem of AI frameworks and tooling, Wang recalled. Progress came through collaboration with university partners and internal experts, and through structured learning such as workshops and industry platform training that helped demystify AI systems for domain scientists.
Now, Wang looks forward to exploring the capabilities of the Dell AI Factory with NVIDIA, which bundles everything an organization needs to build, run, and scale AI models – potentially removing the bulk of that AI “lift” for NERSC scientists.
“GenomeOcean is the kind of initiative that benefits from an end-to-end approach, where infrastructure, software, and services work together so researchers can move from training to inference faster, with fewer operational hurdles,” Perez added.
What’s next: bigger models, faster loops, and broader impact
GenomeOcean’s long-term promise is a faster, more reliable scientific loop: train, evaluate, generate sequences, validate, and repeat – faster and at greater scale than traditional approaches allow.
As Doudna comes online, JGI expects to train larger models faster, try new approaches, and make GenomeOcean easier for more researchers to use, while keeping it open and reproducible.