Scaling up computational biology applications

Abstract

The scale of computing applications has been dramatically increasing over the past several years. As applications in computational genomics are let loose on ever-more-complex problems, the scale of the inputs to these applications has shot up. And as the pursuit of parallelism has led to increasing core counts for servers, and increasing numbers of servers and racks for data centers, the scale of the systems that these applications must run on has also risen dramatically. Running applications at large scale (in terms of both input size and system size) is hence of critical importance both to scientists pushing the frontiers of knowledge and to businesses processing increasing amounts of data.

A critical problem in developing large-scale applications is detecting and debugging scaling issues: problems with program behavior that emerge only as a program scales up and remain hidden at small scales. We discuss scaling issues under two broad categories. The first type of scaling issue is correctness bugs, which arise from latent defects in the program's logic. For example, the arithmetic in a conditional check may overflow at large input sizes, resulting in the program executing incorrectly. Or, as the number of threads a program uses increases, the likelihood of a malign race condition occurring may commensurately increase. The second type of scaling issue is performance bottlenecks. For example, as input size increases, a loop whose memory accesses were all cache hits at small scales may start to produce a large number of cache misses. Or, as system size increases, a method that performs communication may slow down as network congestion increases. For brevity, we will typically refer to both correctness bugs and performance bottlenecks as bugs in this proposal.
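The overflow example above can be made concrete with a small sketch. It is illustrative only and is not taken from any application in this project: the function names, the read counts, and the buffer capacity are all hypothetical, and 32-bit wraparound is simulated explicitly because Python integers do not overflow on their own.

```python
# Hypothetical illustration of a correctness bug that appears only at scale.
# Signed 32-bit wraparound is simulated, since Python integers are unbounded.

INT32_MIN = -2**31
INT32_SPAN = 2**32


def wrap_int32(value):
    """Wrap an integer into the signed 32-bit range, as fixed-width arithmetic would."""
    return (value - INT32_MIN) % INT32_SPAN + INT32_MIN


def fits_in_buffer_32bit(num_reads, read_length, capacity):
    """Buggy check: the product is computed in simulated 32-bit arithmetic."""
    total_bases = wrap_int32(num_reads * read_length)
    return total_bases <= capacity


def fits_in_buffer_exact(num_reads, read_length, capacity):
    """Reference check: the product is computed without any width limit."""
    return num_reads * read_length <= capacity


if __name__ == "__main__":
    capacity = 1_000_000_000  # assumed buffer capacity, in bases
    for num_reads, read_length in [(1_000, 100), (30_000_000, 150)]:
        print(
            num_reads,
            read_length,
            fits_in_buffer_32bit(num_reads, read_length, capacity),
            fits_in_buffer_exact(num_reads, read_length, capacity),
        )
```

At the small input both checks agree. At the large input the true product exceeds the 32-bit range, the wrapped value falls below the capacity, and the buggy check wrongly reports that the data fits.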

We are proposing to target computational genomics applications that have the characteristics described above: they are called on to execute on ever larger problem sizes, thanks to the wide availability of next-generation sequencing (NGS) equipment, and their initial implementations leave a lot to be desired in terms of parallelization. We will create automatic techniques to uncover scalability bottlenecks in such programs, along with diagnosis mechanisms that pinpoint the code regions responsible for those bottlenecks. We have already started working with a biology faculty member (Dr. Michael Gribskov) who is developing leading-edge applications in sequence alignment and discovery of RNA structures.
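One simple way to picture what pinpointing a scalability bottleneck could look like is sketched below. This is not the technique the project proposes, which the abstract does not specify. It is a minimal illustration that assumes per-region cost measurements (for example, operation counts) are already available at several input sizes. It fits a growth exponent to each region on a log-log scale and ranks the regions by that exponent. The region names and the numbers are hypothetical.

```python
import math


def growth_exponent(sizes, costs):
    """Least-squares slope of log(cost) against log(size).

    A slope near 1 suggests linear growth; a slope near 2 suggests quadratic growth.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(c) for c in costs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


def rank_regions(sizes, profile):
    """Return (region, exponent) pairs, fastest-growing region first."""
    exponents = {
        region: growth_exponent(sizes, costs) for region, costs in profile.items()
    }
    return sorted(exponents.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical measurements: cost per code region at four input sizes.
    sizes = [1_000, 2_000, 4_000, 8_000]
    profile = {
        "load_reads": [s for s in sizes],          # grows linearly
        "pairwise_align": [s * s for s in sizes],  # grows quadratically
    }
    for region, exponent in rank_regions(sizes, profile):
        print(f"{region}: growth exponent ~ {exponent:.2f}")
```

A region whose cost is small at test scale can still dominate at production scale if its exponent is the largest, which is why the ranking uses growth rate rather than absolute cost.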

Intellectual Merit

To be completed...

Broader Impact

To be completed...

Use of FutureGrid

To be completed...

Scale of Use

To be completed...

Publications


FG-410
Saurabh Bagchi
Purdue University
Active

Project Members

Milind Kulkarni

Timeline

32 weeks 6 days ago