This event is offered virtually. Please click here to join via Zoom.
Bioinformatics, Imaging, and Distributed Data Management: Big Data Science Challenges and Perspectives
Tony Pan, Ph.D.
Senior Research Scientist
Assistant Director of Data Infrastructure, Institute for Data Engineering and Science (IDEaS)
Georgia Tech
ABSTRACT
High content data such as radiology and pathology images and bulk and single cell sequence data are accumulating at an ever-increasing rate. Extracting meaningful information from diverse and high-volume datasets, for example gene-gene interactions from RNAseq data and morphometric features in pathology images, presents complex challenges and opportunities for data access, algorithm design, and computational optimization. Concurrently, multi-site collaborations and data repository efforts are increasingly the norm, with critical requirements in distributed data management, computation, and security and privacy. NSF’s Engineering Research Center for Cell Manufacturing Technologies (CMaT) spans 7 core universities as well as academic and industrial partners with the goal of improving manufacturing processes for cell therapy. Shriner’s Arthrogryposis Multiplex Congenita (AMC) Registry seeks to establish an international consortium and a distributed data repository for this rare disease, to enable better understanding of the mechanism, diagnosis, and treatment of the disease. NCI’s The Cancer Imaging Archive similarly supports radiology and digital pathology imaging-based research through its public data repository and analytics services.
In this talk, I will present some recent work in optimizing bioinformatic algorithms and implementations on multi-core and multi-node architectures, and accelerating image analysis in HPC and cloud environments. Through efficient algorithm design and implementation, we have significantly reduced the computation time for bulk and single cell sequence analysis, genome assembly, and gene regulatory network reconstruction. Using high performance clusters, GPUs, and cloud computing, our work has accelerated feature extraction in microscopy images. For large data sets, these improvements make analyses feasible that would otherwise be impractical. Finally, I will discuss past and ongoing experiences and challenges faced in supporting heterogeneous data management for geographically distributed partners including NCI’s The Cancer Imaging Archive, NSF’s CMaT Center, and Shriner’s AMC Registry.
BIOGRAPHY
Tony Pan is the Assistant Director of Data Infrastructure and a Senior Research Scientist at the Georgia Institute of Technology (GT) Interdisciplinary Institute for Data Engineering and Science (IDEaS). Dr. Pan’s research interests center on developing data science methods to enable large scale biomedical and bioinformatic studies, namely through flexible and extensible data management, high performance computing (HPC) approaches, and efficient sequential and parallel algorithms. Currently, Dr. Pan is leading the data management infrastructure definition and implementation to support the NSF Engineering Research Center for Cell Manufacturing Technologies (CMaT) at GT, and the development of data management infrastructure and gene association studies for the Arthrogryposis Registry at Shriner’s Hospitals for Children. In the areas of HPC and bioinformatics, he has been developing efficient algorithms for genomic sequence analysis, gene regulatory network reconstruction, and single cell sequencing data analysis. His prior experience includes middleware and applications development for large scale data management in distributed environments, including in NCI’s Cancer Bioinformatics Grid (caBIG) project, a multi-institutional effort to create a data sharing. He has also made significant contributions in the areas of large-scale microscopy image and genomic sequence data analysis, leveraging HPC, parallel algorithms, and machine learning methods.