Applied Cyberinfrastructure Concepts

Project Information

Discipline
Computer Science (401) 
Orientation
Education 
Abstract

Students enrolled in Applied Cyberinfrastructure Concepts (ISTA 420/520, Fall 2014) at the University of Arizona will use the resources provided by FutureGrid. This project-based learning class will introduce fundamental concepts, tools and resources for effectively managing common tasks associated with analyzing large datasets. It will provide familiarity with the cyberinfrastructure (CI) resources available at the University of Arizona campus, the iPlant Collaborative, NSF XSEDE centers, FutureGrid and commercial providers such as Amazon. Students will learn to apply relevant CI skills to a final project and develop wiki-based documentation of these best practices, learning how to collaborate effectively in interdisciplinary team settings and how to target the optimal CI resources and tools for their project. The course will consist of a series of guest lectures by subject matter experts from projects that have developed widely adopted foundational cyberinfrastructure resources, followed by hands-on laboratory exercises focused on those resources (some of which will be tailored for cloud resources on FutureGrid). Students will use these resources, and the practical experience gained from the laboratory exercises, for a final project. The final project will include data sets and requirements provided by domain scientists (genomics and geosciences). Students will be provided access to compute resources at UA campus clusters, the iPlant Collaborative, NSF XSEDE and FutureGrid. Students will also learn how to write a proposal for obtaining future allocations on large-scale national resources through XSEDE.

Intellectual Merit

The computational challenges encountered while managing, analyzing and visualizing data are a common and recurring theme across domains. The rapid evolution of computational capabilities such as cloud computing, Hadoop and NoSQL has augmented the traditional HPC model, providing a wide array of choices for analyzing research data. This course aims to equip students from very diverse disciplines, ranging from astronomy and hydrology to the life sciences and CISE disciplines, with a broad understanding of how these computational resources and technologies can be used effectively for their specific research questions. The course will introduce students to opportunities, guidance and an initial roadmap for gaining access to scalable computational resources and for connecting with a community of practitioners when working with local, regional, national and commercial infrastructure providers.

Broader Impacts

This course will provide an orientation to the wide array of resources that are essential for supporting and fostering creative approaches to managing computational challenges across disciplines. The current roster of students includes many disciplines that have not had access to coursework or formal training covering topics in high-throughput computing. Through this course we intend to foster interdisciplinary collaborations among students from various disciplines while providing practical experience to students who have traditionally not had access to these capabilities. Our goal is to promote computational thinking, which is essential for preparing the current and future generations of “data scientists”, affording them the audacity and ability to scale their research.

Project Contact

Project Lead
Nirav Merchant (nirav) 
Project Manager
Nirav Merchant (nirav) 
Project Members
Eric Lyons, Brayden Streeter, Kent Guerriero, Daniel Spence, Nicholas Callahan, Wenli Zhang, Xiaoran Li, Nathan Lin, Weifeng Li, Sriharsha Mucheli, Daniel Maguire, Shuqing Gu, Austin Paine, Justin Mannin, Mike Hume, Tomasz Stawicki, Stephen Kovalsky, Yun Wang, Alyssa Musante, Munir Jaber

Resource Requirements

Hardware System
  • I don't care (what I really need is a software environment and I don't care where it runs)
 
Use of FutureGrid

Students will learn how to create custom VMs for their specific projects, which require a full LAMP stack to support an integrated bioinformatics application. These projects will include a) the use of iRODS for federated data handling and b) Makeflow (cctools, from Doug Thain at Notre Dame) and Pegasus (Ewa Deelman at USC) to scale out tasks. Students may also choose to implement a Condor cluster or StarCluster (http://web.mit.edu/star/hpc/index.html) for managing their tasks. These applications will store certain data components in a NoSQL database (MongoDB) and a key-value store (Redis), and will use ZeroMQ (http://www.zeromq.org/) as a concurrency framework. Hadoop will be used alongside Apache Pig to scale out some of the queries from MongoDB. These resources will integrate with similar resources running at iPlant and XSEDE (TACC), i.e. iRODS federation with nodes at iPlant and Makeflow workers executing at TACC. Students will be asked to choose different platforms and cloud infrastructures (OpenStack, Amazon) to implement their work, and to compare and contrast their performance and capabilities.
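
As an illustration of the kind of task scale-out that Makeflow supports, the following minimal Python sketch generates a Make-style workflow file with one independent rule per input chunk and a final merge rule. It is not part of the course material: the chunk names, the scripts `analyze.py` and `merge.py`, and the output name `result.txt` are all hypothetical placeholders.

```python
# Sketch: build the text of a Make-style workflow (the rule syntax Makeflow
# reads) that fans out one task per input chunk and then merges the results.
# Assumption: analyze.py, merge.py and all file names are hypothetical.

def makeflow_rules(chunks, analyze="analyze.py", merge="merge.py"):
    """Return workflow text for a simple scatter/gather pattern."""
    rules = []
    outputs = []
    for chunk in chunks:
        out = f"{chunk}.out"
        outputs.append(out)
        # A rule is "targets: sources" followed by a tab-indented command.
        # Listing the script as a source lets the workflow system ship it
        # to whichever worker runs the task.
        rules.append(
            f"{out}: {analyze} {chunk}\n"
            f"\tpython {analyze} {chunk} > {out}\n"
        )
    joined = " ".join(outputs)
    # The gather step depends on every per-chunk output.
    rules.append(
        f"result.txt: {merge} {joined}\n"
        f"\tpython {merge} {joined} > result.txt\n"
    )
    return "\n".join(rules)


if __name__ == "__main__":
    print(makeflow_rules([f"chunk{i}.fa" for i in range(3)]))
```

Because the per-chunk rules do not depend on each other, a workflow engine is free to run them concurrently on whatever workers are available; only the merge rule has to wait for all of them. The printed text could be saved to a file and handed to the `makeflow` command.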

Scale of Use

There are 24 students in the class. Certain assignments will be based on FutureGrid resources and will be conducted in groups of 4. I expect the whole class to use 4-10 VMs for design and prototyping (over a 2-week period), and then to scale out to 40 VMs for the final assignment (for the whole class). The hardware requirement will be machines with 2 to 4 cores, 4 to 8 GB of RAM and 10-50 GB of disk space (EBS-style or some other form of persistent storage). All scale-out will be done under the guidance of, and in collaboration with, the FutureGrid team once the proof-of-concept VMs are functional. All requirements are flexible and can be tailored to the resources available.
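
The aggregate request implied by these figures can be worked out with a short Python sketch. It only multiplies the per-VM ranges stated above by the VM counts; the variable and function names are illustrative, and the totals are simple products rather than a formal allocation request.

```python
# Sketch: aggregate resource envelope implied by the stated per-VM ranges.
# All figures come from the text above; names are illustrative only.

STUDENTS = 24
GROUP_SIZE = 4
PROTOTYPE_VMS = (4, 10)   # VMs for design/prototype, whole class
SCALE_OUT_VMS = 40        # VMs for the final assignment, whole class

PER_VM = {
    "cores": (2, 4),
    "ram_gb": (4, 8),
    "disk_gb": (10, 50),
}


def envelope(vms, per_vm_range):
    """Return (min, max) total for `vms` machines given a per-VM (min, max)."""
    low, high = per_vm_range
    return vms * low, vms * high


groups = STUDENTS // GROUP_SIZE  # 6 groups of 4 students

if __name__ == "__main__":
    print(f"groups: {groups}")
    for name, rng in PER_VM.items():
        low, high = envelope(SCALE_OUT_VMS, rng)
        print(f"scale-out {name}: {low}-{high}")
    # Prints:
    # groups: 6
    # scale-out cores: 80-160
    # scale-out ram_gb: 160-320
    # scale-out disk_gb: 400-2000
```

At full scale-out the class would therefore occupy between 80 and 160 cores, 160 and 320 GB of RAM, and 400 GB and 2 TB of persistent disk, depending on which end of each per-VM range is provisioned.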

Project Timeline

Submitted
07/29/2014 - 09:14