The CGRL and BRC collaborate to enable computational genomics

July 18, 2016

The Computational Genomics Resource Laboratory (CGRL), a California Institute for Quantitative Biosciences (QB3) core facility, and Berkeley Research Computing (BRC) have joined forces to reduce the time and cost required to analyze data from high-throughput sequencing, also known as next-generation sequencing, the fundamental technique for the modern study of genomes. 

The CGRL supports hundreds of biologists in the Berkeley community, providing consulting and workshops on genomic analyses to help them accomplish their research goals. The facility also gives more than 125 researchers access to computing clusters equipped with genomics software. Now, the CGRL and BRC are combining their respective genomic and computing expertise to provide new benefits to campus researchers.

Professor Brian Staskawicz, one of the founders of the CGRL, and researchers in his group use computational resources made available by this partnership to study the genomics underlying the ongoing war between bacteria and plant cells. As farmers try to efficiently grow produce, such as tomatoes, bacteria use specially evolved weaponry acting as molecular Trojan Horses that enter cells and use the plant’s own genome against it to excavate spaces where the bacteria can thrive. Staskawicz and his colleagues are working to decipher this genomic process in order to thwart these bacteria.

Jason Huff, Director of the day-to-day operations of the CGRL, is working with data from algae in collaboration with Associate Prof. Daniel Zilberman to uncover the evolutionary role played by transposons in the proliferation of noncoding DNA in genomes. As Huff explains:

“One of the biggest problems computing can solve is the assembly of genome sequences we don’t know already.”

A primary research goal for many of the biologists Huff supports is to read, or sequence, genomes, the complete set of genetic material in an organism, composed of long strands of DNA that encode all of the proteins in cells. They use high-throughput sequencing to generate hundreds of millions of short DNA fragments. Software enables researchers to analyze this staggering amount of data, for example to read out a whole genome by finding overlapping regions of the short fragments to reconstruct the much longer original sequence. Sequencing is the first important step toward understanding the role of the genome in the normal growth, as well as disease, of an organism.
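To make the idea of reconstructing a sequence from overlapping fragments concrete, here is a minimal sketch in Python. It is a toy illustration only, not the software the CGRL or its users run: the function names `overlap` and `greedy_assemble` and the `min_len` parameter are hypothetical, and the greedy strategy assumes error-free fragments read from a single strand. Production assemblers use more elaborate graph-based methods and must cope with sequencing errors and repeated regions.

```python
def overlap(a, b, min_len=3):
    """Return the length of the longest suffix of `a` that is also a
    prefix of `b`, or 0 if that overlap is shorter than `min_len`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap.

    Toy assumption: reads are error-free and come from one strand.
    Returns the list of sequences left when no overlaps remain.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing left to merge
        # Join the two fragments, writing the shared region only once.
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three short fragments taken from one longer (made-up) sequence.
fragments = ["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]
print(greedy_assemble(fragments))
```

With hundreds of millions of fragments, comparing every pair as this sketch does becomes impractical, which is one reason genome assembly calls for the memory, storage and fast I/O of a high-performance cluster.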

The CGRL specializes in the expert application of genomic software. CGRL staff install, configure and maintain genomic software within the BRC high-performance computing environment and consult with biologists on the best approach for their specific research question. Staff members also collaborate on projects, as Huff says, “enabling biologists who have their own expertise to save the time of learning how to do the computation. It allows them to do research quickly with an experienced computational collaborator, so they can focus on the approaches for which they are expert.”

Genome sequences can also be used to understand where an organism stands in evolutionary relationship to others. For example, CGRL Bioinformatic Scientist Ke Bi collaborates with Professor David Wake to analyze high-throughput sequencing data from museum samples of extinct species. They are working to reconstruct whole genomes, benefiting from access to the large-memory nodes available in the BRC’s High Performance Computing (HPC) cluster, Savio. This work will produce the first molecular data from these species, giving a better understanding of their positions in the tree of life, despite their unfortunate demise. Bi also collaborates with several researchers in the Museum of Vertebrate Zoology to analyze museum samples up to a century old to understand animal evolution in response to recent climate change.

Researchers in the group of Professor John Taylor, another CGRL founder, have benefited from the partnership between the CGRL and BRC as well. Christopher Hann-Soden, a PhD student in Taylor’s lab, switched from the original CGRL computing cluster to BRC’s Savio cluster. Savio’s fast I/O enables Hann-Soden to reconstruct fungal genomes in mere days, a process that previously took weeks. High-throughput sequencing yields hundreds of millions of DNA fragments, requiring significant storage capacity and fast communication between compute nodes and data stores, all available within Savio.

Motivated by use cases such as these, the CGRL is transitioning users of its original cluster to BRC’s Savio cluster. Researchers gain access to newer, faster computational resources, and the CGRL benefits from infrastructure support -- cooling, electricity, 24/7 monitoring, system administration and help from BRC’s team of high-performance computing experts -- offered at no expense to institutional users and condo contributors. As the CGRL enables biologists to focus on and accomplish their genomic research, BRC enables the CGRL to support and empower these researchers.