Modern Next-Generation Sequencing (NGS) technologies have greatly reduced the cost of full genome sequencing. As the chemical aspect of the sequencing process becomes faster and more parallel the computational components become a more and more significant component of the cost and time.
Other areas of genomics are facing a similar problem, where the volume of data being produced is increasing much faster than the processing rate. For many algorithms, such as de novo assembly and homology search, runtimes of hours or days are common. Reducing the time and expense of this processing provides scientists with the ability to turn more of their raw data into meaningful results.
Our research covers FPGA based approaches to many areas related to processing large genomics datasets.
NGS makes many copies of the DNA or RNA then sequences millions of overlapping short segments in parallel, as shown below. In the short-read alignment problem, the aligner must determine where in the reference genome each short-read originated. Making this problem more difficult, the short-reads of approximately 100 base pairs (bps) potentially contain errors, including inserted bases, deleted bases and incorrectly read bases. The red T on the left-hand side of the figure demonstrates a read error. There are also reads that do not map directly to the reference genome because of a genetic variant in the individual being sequenced. The red Gs in the figure below demonstrate this situation. At this position in the reference genome there is a T base, but the sequenced individual has a G base instead. Discovering these variations is usually the motivation for sequencing a specific individual.
Reference genome and aligned short-reads, with variation highlighted
In order to separate read errors from actual variation, the short-read aligner must map each of millions of reads using inexact matching. Our FPGA accelerated short-read aligner performs this matching using an array of Smith-Waterman engines, while Candidate Alignment Locations (CALs) are identified through a hashing scheme.
Short-read aligner block diagram
Oncology researches may choose to grow human cancer cells in a mouse allowing them to study the behavior of the cancer in vivo at a level of detail that would be impossible with human subjects. This is referred to as a tumor xenograft. One challenge of this approach is that if the cancer cells are sequenced using NGS, some of the reads will come from the human tumor, with close to human DNA, and others will contain mouse DNA. In order to address this and other similar situations, the reads are mapped against multiple references (mouse and human in this case) in an attempt to identify the correct source.
Reads can be classified by which reference they map to better
Using the FPGA accelerated short-read aligner described above, we can align xenograft reads against both the human and mouse reference genome more quickly and accurately than software aligners. After performing both alignments, a variety of classification schemes are possible. The simplest scheme classifies reads as originating from mouse or human cells depending on the higher alignment score. In the chart above, this means that reads in the bottom-right half would be classified as human and reads in the top-left half would be classified as mouse. By observing the small number of Xs and Os on the wrong side of the diagonal you can see that such a simple classification scheme is good, if imperfect.
In some cases a reference genome is not available, making alignment impossible. This can occur if the sequenced species has never had a reference constructed. In this case, the situation is similar to the way the first references were constructed. Alternatively, in a metagenomic study, samples are taken from an area that contains multiple species such as soil, water or intestines. If there are many different unknown speciesí DNA in the samples, it may not be practical to use references for all of them, even if those references exist.
Regardless of the source, de novo assemblers must connect reads to create the longest possible contiguous sequences (contigs). This is a much more difficult problem than reference-based assembly, and requires a much larger working data set because instead of processing each read individually, in sequence or parallel, any read could overlap with any other read. An all-to-all read comparison is obviously too expensive, so carefully designed data structures based on de Bruijn graphs, FM-indices or Bloom filters are often used. However, even these structures are often too large for FPGA onboard memory, requiring creative solutions.
A simple example of short-reads, an early assembly in progress and the resulting contig
Although proteins perform many of the functions fundamental to life, there are also many RNAs with critical functions of their own. The first discovered and most well-known example is the transfer RNA, which carries amino acids to proteins a cell is constructing. These non-coding RNA (ncRNA) typically rely on the chemical properties of their 3D structure to be effective, similar to proteins. Also like proteins, the same ncRNA can exist in many different species or many parts of the same genome. However, the sequence may not be identical in all of those species. Instead, as shown in the Iron Response Element (IRE) ncRNA below, different bases can be substituted in the same position if the complementary base in the correct position is also substituted. This preserves the functional structure even if the sequence changes.
The sequence and secondary structure for the IRE ncRNA in four different species, with a covarying region highlighted
Because the sequence canít be relied upon to identify functionally similar ncRNAs, the problem of modeling and searching for these homologs is non-trivial. One approach, used in the popular tool Infernal, models the paired bases of the ncRNA and identifies sequences that match that model using dynamic programing algorithms. These algorithms, such as Viterbi and CYK, perform well on an FPGA accelerator.
Corey B. Olson, Maria Kim, Cooper Clauson, Boris Kogon, Carl Ebeling, Scott Hauck, Walter L. Ruzzo,"Hardware Acceleration of Short Read Mapping", IEEE Symposium on Field-Programmable Custom Computing Machines, 2012.
Elliott Brossard, Dustin Richmond, Joshua Green, Carl Ebeling, Larry Ruzzo, Corey Olson, Scott Hauck,"A Data-Intensive Programming Model for FPGAs: A Genomics Case Study", Symposium on Application Accelerators in High-Performance Computing (SAAHPC'12), 2012.
Nathaniel McVicar, Larry Ruzzo, Scott Hauck,"Accelerating ncRNA Homology Search with FPGAs", International Conference on Field Programmable Logic and Applications, 2013.
Corey Olson, An FPGA Acceleration of Short Read Human Genome Mapping , M.S. Thesis, University of Washington, Dept. of EE, 2011.
Maria Kim, Accelerating Next Generation Genome Reassembly in FPGAs: Alignment Using Dynamic Programming Algorithms , M.S. Thesis, University of Washington, Dept. of EE, 2011.