Deployment of Virtual Clusters on a Commercial Cloud Platform for Molecular Docking

Project Information

Discipline
Computer Science (401) 
Subdiscipline
26.0616 Biotechnology Research 
Orientation
Research 
Abstract

The project aims to create virtual clusters that run DOCK, a protein-ligand molecular interaction simulation program, and to deploy them on FutureGrid. This will allow docking tasks to be performed on a large scale cheaply and efficiently. Three areas will be investigated: 1) the elasticity of the virtual clusters, 2) the fault tolerance of the system, and 3) the use of several virtual clusters on various commercial clouds to form a single system. Combining commercial clouds with other clouds, such as FutureGrid, is expected to increase system performance and allow millions of protein-ligand interaction simulations to be run in a massively parallel manner.

Intellectual Merit

The proposed project emphasizes three specialized topics (elasticity, fault tolerance, and multi-cloud operation) in order to flexibly scale and configure virtual machines in a multi-cloud environment and to detect and isolate failures of the system. The merit of this project is that it is a real-world test of a multi-cloud implementation of the virtual screening software DOCK. The results from testing this system will be compared to our previous results using single clusters or grid computing. It is expected that this work will provide a more consistent, stable, and easy-to-use virtual screening workflow on a very large scale. Additionally, with the large amount of resources available on a cloud, the number of simulations that can be run on a virtual cluster is easily scalable with Hadoop/MapReduce, so the system can adapt to the ever-increasing size of chemical compound databases. As a result, the system will facilitate a better understanding and mapping of protein-ligand interactions, furthering knowledge of protein signaling networks in the human body.

Broader Impacts

The deployment of the DOCK simulation program onto clouds will allow for two distinct improvements to the process of drug discovery: 1) a significant increase in the availability of resources to users regardless of their computer science background and 2) a decrease in the duration of the overall drug discovery process. First, making the DOCK simulation program available on clouds gives other scientists access to an essential tool that speeds up drug discovery through its ability to screen millions of compounds in a short amount of time. The intended user will not need extensive knowledge of computer science and will not bear the large financial burden of procuring and maintaining the hardware needed for grid computing. Second, making the DOCK simulation program available on clouds for use by many different communities will facilitate interactions between scientists and have a positive impact on the drug discovery process in the pharmaceutical industry.

Project Contact

Project Lead
Anthony Nguyen (acn004) 
Project Manager
Kohei Ichikawa (kohei) 
Project Members
Anthony Nguyen, Yue Song, Katy Pham, Kohei Ichikawa, Jason Haga  

Resource Requirements

Hardware Systems
  • foxtrot (IBM iDataPlex at UF)
  • sierra (IBM iDataPlex at SDSC)
 
Use of FutureGrid

FutureGrid will host virtual clusters that run protein-ligand interaction simulations. It will be used alongside other commercial and noncommercial clouds (Microsoft Azure and the AIST Cluster) that host the same type of virtual clusters. Tasks will be distributed across the multiple cloud environments, and the clouds will be set up to communicate with each other regarding task distribution and failures.
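
The cross-cloud task distribution and failure handling described above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the project's actual scheduler: the function names (`distribute`, `make_simulated_runner`), the cloud labels, and the rule of retrying a failed task on the next cloud in the list are all hypothetical, and the docking run itself is simulated.

```python
def make_simulated_runner(down_clouds):
    """Return a stand-in for 'run one docking task on a cloud'.

    Hypothetical helper: it only simulates outages and does not call DOCK.
    """
    def run_task(cloud, task):
        if cloud in down_clouds:
            raise RuntimeError(f"{cloud} is unreachable")
        return f"{task} docked on {cloud}"
    return run_task


def distribute(tasks, clouds, run_task):
    """Spread tasks over clouds and fail over to the next cloud on error.

    Task i starts on cloud i mod len(clouds). If that cloud fails, the
    failure is recorded and the task is retried on each remaining cloud
    in order. A task that fails everywhere is left out of `results`.
    """
    results = {}
    failures = []
    for i, task in enumerate(tasks):
        for offset in range(len(clouds)):
            cloud = clouds[(i + offset) % len(clouds)]
            try:
                results[task] = (cloud, run_task(cloud, task))
                break  # task finished, stop trying other clouds
            except RuntimeError as err:
                # Record the failure so it can be reported to the other clouds.
                failures.append((task, cloud, str(err)))
    return results, failures


if __name__ == "__main__":
    clouds = ["futuregrid", "azure", "aist"]  # labels are illustrative only
    runner = make_simulated_runner(down_clouds={"azure"})
    results, failures = distribute(["lig1", "lig2", "lig3", "lig4"], clouds, runner)
    for task, (cloud, _) in results.items():
        print(task, "->", cloud)
    print("failures:", failures)
```

Keeping the failure log separate from the results mirrors the requirement that clouds share information about both task distribution and failures; a real deployment would replace the simulated runner with job submission to each virtual cluster.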

Scale of Use

The scale of use depends on the task. There are three tasks to be accomplished. The first is multi-cloud work in which virtual clusters on FutureGrid communicate with virtual clusters on other clouds; this task will use only about 5-10 VMs. The other two tasks examine the fault tolerance and elasticity of the virtual machine clusters. For these two tasks, hundreds to thousands of VMs will be used in conjunction with Hadoop/MapReduce, because we hope to observe the limits of this method of computing.
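
To make the MapReduce pattern mentioned above concrete, here is a minimal single-process Python sketch of how a virtual screening job might be split into map, shuffle, and reduce steps. It does not use Hadoop and does not call DOCK: `placeholder_score` is a hypothetical stand-in for a docking run, and the assumption that lower scores are better is made only for this illustration.

```python
from collections import defaultdict
import heapq


def placeholder_score(receptor, ligand):
    # Hypothetical stand-in for one docking run; NOT how DOCK computes scores.
    return float(len(receptor) - len(ligand))


def map_phase(pairs, score_fn):
    """Map step: emit (receptor, (score, ligand)) for every pair."""
    for receptor, ligand in pairs:
        yield receptor, (score_fn(receptor, ligand), ligand)


def shuffle(mapped):
    """Group mapped values by receptor, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups


def reduce_phase(groups, top_k):
    """Reduce step: keep the top_k lowest-scoring ligands per receptor."""
    return {
        receptor: heapq.nsmallest(top_k, values)
        for receptor, values in groups.items()
    }


if __name__ == "__main__":
    pairs = [("recA", "aa"), ("recA", "bbbb"), ("recA", "c"), ("recB", "dd")]
    hits = reduce_phase(shuffle(map_phase(pairs, placeholder_score)), top_k=2)
    print(hits)
```

Because each map call depends only on its own receptor-ligand pair, the map step can in principle be spread over as many VMs as are available, which is the property the elasticity and fault tolerance tasks would stress.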

Project Timeline

Submitted
07/03/2014 - 00:54