*********************************
There is now a CONTENT FREEZE for Mercury while we switch to a new platform. It began on Friday, March 10 at 6pm and will end on Wednesday, March 15 at noon. No new content can be created during this time, but all material in the system as of the beginning of the freeze will be migrated to the new platform, including users and groups. Functionally the new site is identical to the old one. webteam@gatech.edu
*********************************
Title: Distributed Memory Building Blocks for Massive Biological Sequence Analysis
Tony C. Pan
School of Computational Science and Engineering
College of Computing
Georgia Institute of Technology
Date: Monday, March 27th, 2018
Time: 11:00am - 1:00pm
Location: Klaus 1202
Committee:
Dr. Srinivas Aluru (Advisor, School of Computational Science and Engineering, Georgia Tech)
Dr. David Bader (School of Computational Science and Engineering, Georgia Tech)
Dr. Umit Catalyurek (School of Computational Science and Engineering, Georgia Tech)
Dr. King Jordan (School of Biology, Georgia Tech)
Dr. Fredrick Vannberg (School of Biology, Georgia Tech)
Dr. Richard Vuduc (School of Computational Science and Engineering, Georgia Tech)
Abstract:
K-mer indices and de Bruijn graphs are important data structures in bioinformatics with multiple applications ranging from foundational tasks such as error correction, alignment, and genome assembly, to knowledge discovery tasks including repeat detection and SNP identification. While advances in next generation sequencing technologies have dramatically reduced the cost and improved latency and throughput, few bioinformatics tools can efficiently process the data sets at the current generation rate of 1.8 terabases every 3 days.
The volume and velocity with which sequencing data is generated necessitate efficient algorithms and implementation of k-mer indices and de Bruijn graphs, two central components in bioinformatic applications. Existing applications that utilize k-mer counting and de Bruijn graphs, however, tend to provide embedded, specialized implementations. The research presented here represents efforts toward the creation of the first reusable, flexible, and extensible distributed memory parallel libraries for k-mer indexing and de Bruijn graphs. These libraries are intended for simplifying the development of bioinformatics applications for distributed memory environments. For each library, our goals were to create a set of API that are simple to use, and provide optimized implementations based on efficient parallel algorithms. We designed algorithms that minimize communication volume and latency, and developed implementations with better cache utilization and SIMD vectorization.
We present Kmerind, a k-mer counting and indexing library based on distributed memory hash table and distributed sorted arrays, that provide efficient construction, query, and update operations. For de Bruijn graphs, we developed Bruno by leveraging Kmerind functionalities to support parallel de Bruijn graph construction, chain compaction, error removal, and graph traversal and element query. We present the performance and scalability of Kmerind and Bruno in shared memory and distributed memory settings, and compare their performance to existing state-of-the-art tools.