In 2014, A. Murat Eren, PhD, who goes by Meren, had the beginnings of a promising career: an appointment as an Assistant Research Scientist at the Marine Biological Laboratory (MBL) in Woods Hole, Mass., and an opportunity to continue doing what he did well, studying communities of microbes using computer algorithms he developed himself. But that same year he decided to stop doing that and write a new software platform instead.
One year and thousands of lines of code later, that software, called anvi’o (short for “analysis and visualization platform for ‘omics data”), is a unique tool that helps scientists visualize multiple large sets of genetic and molecular data in an easy-to-use, interactive display. It offers a new way to explore and interact with complex data to better understand the trillions of microbes living on, inside and around us.
“This is like jazz music,” said Meren, who is now an Assistant Professor of Medicine at the University of Chicago. “Once you get into it, you realize how deep it goes.”
That depth and complexity of the microbial world is what led him to build anvi’o. He had been studying the human gut microbiome—first at the University of New Orleans and Children’s Hospital New Orleans, then at the MBL—using marker genes, a common method for identifying the various kinds of microbes living in a given environment.
With this approach, scientists extract the DNA from a sample and sequence information-rich segments around a specific gene called the 16S ribosomal RNA gene to see what kinds of bacteria are living there. This gene is part of the ribosome, the part of all living cells that is responsible for producing proteins. The omnipresence of this gene makes it a universal target, and the information it contains gives researchers a starting point for categorizing the microbial community structure, i.e. what kinds of microbes are present in a sample.
This method can produce a comprehensive picture of all the different bacteria present in a sample, but because it focuses on just one gene, it reveals more about who is in a given environment than about what they are doing there.
“It’s like surveying the population of a large city,” Meren said. “If you have limited resources, one way to do it quickly would be to ask everyone for a driver’s license and record which state they were from.”
“This would give you one aspect of the city’s diversity, and possibly allow you to make some generalizations about the population based on things we know about those states, but it doesn’t go much deeper than that,” he said.
Meren worked with marker genes for three years. Wanting a more complete understanding of the microbial world, he and two colleagues from the MBL, Tom Delmont and Özcan Esen, shifted their focus to a burgeoning field called metagenomics. With metagenomics, scientists can assemble large pieces of microbial genomes directly from environmental samples and study them in greater detail.
“In our survey example, it’s like sitting down with survey subjects for a detailed interview, instead of just checking out their driver’s licenses,” Meren said. “In the end, you have a much more complete picture of the people you interviewed, although you can’t cover nearly as many people this way.”
To get metagenomic data, researchers slice up all the DNA found in a sample into short fragments that can be sequenced with high-throughput sequencing equipment. Assembling these short pieces back into contiguous DNA segments, or contigs, and putting the right set of them together to represent distinct genomes is notoriously difficult, however, and remains an area of intense, ongoing research.
So far, scientists have come up with various ways to match contigs that originate from the same genome, such as relying on sequence signatures of contigs and their distribution patterns across samples to link them. Some of these contig collections, or “draft genomes” as they’re sometimes called, might match genomes from known organisms, and some might represent previously undiscovered, novel strains.
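To make the idea of matching contigs concrete, here is a minimal Python sketch of the two signals mentioned above: a sequence signature (the relative frequencies of short DNA "words," or k-mers) and a distribution pattern (how abundant a contig is in each sample). It is not anvi’o's method or that of any particular tool. The function names, the dictionary layout for a contig, and the choice to simply average the two scores are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
import math


def kmer_profile(seq, k=4):
    """Relative frequency of every possible k-mer in seq, in a fixed order."""
    seq = seq.upper()
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only count windows made purely of A, C, G, T (skips ambiguous bases).
    total = sum(counts[km] for km in all_kmers)
    return [counts[km] / total if total else 0.0 for km in all_kmers]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def contig_similarity(contig_a, contig_b, k=4):
    """Hypothetical score: mean of signature and coverage similarity.

    Each contig is assumed to be a dict with a "seq" string and a
    "coverage" list holding one abundance value per sample.
    """
    signature = cosine_similarity(kmer_profile(contig_a["seq"], k),
                                  kmer_profile(contig_b["seq"], k))
    coverage = cosine_similarity(contig_a["coverage"], contig_b["coverage"])
    return (signature + coverage) / 2


# Toy, made-up contigs: two resemble each other, the third does not.
contig_1 = {"seq": "ATATATATATATATAT", "coverage": [10, 0, 5]}
contig_2 = {"seq": "TATATATATATATATA", "coverage": [20, 1, 9]}
contig_3 = {"seq": "GCGCGCGCGCGCGCGC", "coverage": [0, 15, 1]}

print(contig_similarity(contig_1, contig_2, k=2))
print(contig_similarity(contig_1, contig_3, k=2))
```

In this toy case the first pair scores much higher than the second, which is the kind of evidence that would suggest two contigs belong to the same draft genome. Real contigs are thousands of bases long, and real tools use more careful statistics than a plain average.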
The problem, as Meren and his colleagues saw it, was that even if you managed to piece together several complete genomes, there was no good way to visualize or interact with this data. Researchers mostly relied on ad hoc pie charts, bar graphs and cluster diagrams from Excel and other statistical tools that often offered a narrow window into the complex world of microbes.
Meren and his colleagues designed anvi’o so that scientists without an extensive computer science or bioinformatics background could easily visualize and work with metagenomic data. Its main visualization strategy is to display the data in a series of concentric circles. Snippets of DNA, organized to represent likely genomes, form the radiating lines or spokes of the wheel, and each ring overlaid on them is a different sample or set of contextual information. The display is fully interactive: a user can identify and highlight draft genomes or contigs and drill deeper to see more detailed information.
The power of the tool becomes apparent when you see how it can layer different types of data next to each other in the same display: metagenomes, single-cell genomes, and metatranscriptomes, which are the sequenced RNA molecules in a sample and show which organisms are active and producing proteins.
In a paper describing anvi’o in the journal PeerJ, Meren and his colleagues re-analyzed data from several studies to show how the tool could help them investigate these public datasets more deeply. In one particularly striking example, they re-analyzed the data from ocean water samples collected during the 2010 Deepwater Horizon disaster in the Gulf of Mexico. Some microbes consume hydrocarbons like petroleum as a food source, so they naturally congregate in the area after a huge oil spill. Using anvi’o, Meren and his colleagues identified a population of organisms that had previously been overlooked but was highly abundant (based on the metagenome) and highly active (based on the metatranscriptome) in the water plume during the oil spill. These same organisms were undetectable at the same place two years later.
By combining single-cell genomes with metagenomic and metatranscriptomic data generated by the efforts of multiple investigators, anvi’o helped Meren and his colleagues create a holistic perspective of different phases of the oil spill and its impact on the ocean. This side-by-side comparison recovered the parts of the microbial population that were most likely responsible for degrading the oil.
A better understanding of the key microbial players in the oil spill is critical, since it could lead to more effective remediation efforts. If scientists could pinpoint the exact strain of bacteria that breaks down oil after a spill, this knowledge could guide cleanup efforts in similar situations.
But Meren said this example was just one demonstration of anvi’o. The software is freely available online, and he and his team at the Department of Medicine want to make sure researchers elsewhere have access to it.
“In most cases as a scientist, what you can do is defined by the tools that are available to you,” he said. “We want to give all researchers the same benefits anvi’o provides to investigate complex metagenomic data, and make their own exciting new discoveries.”