Applied Cyberinfrastructure concepts

Abstract

The resources provided by FutureGrid will be utilized by students enrolled in Applied Cyberinfrastructure Concepts (ISTA 420/520, Fall 2014) at the University of Arizona. This project-based learning class will introduce fundamental concepts, tools, and resources for effectively managing common tasks associated with analyzing large datasets, providing familiarity with cyberinfrastructure (CI) resources available at the University of Arizona campus, the iPlant Collaborative, NSF XSEDE centers, FutureGrid, and commercial providers such as Amazon. Students will learn to apply relevant CI skills to a final project, develop wiki-based documentation of best practices, collaborate effectively in interdisciplinary team settings, and target the optimal CI resources and tools for their project. The course will comprise a series of guest lectures by subject matter experts from projects that have developed widely adopted foundational cyberinfrastructure resources, followed by hands-on laboratory exercises focused on those resources (some of which will be tailored for cloud resources on FutureGrid). Students will apply the practical experience gained in these exercises to a final project built around datasets and requirements provided by domain scientists (genomics and geosciences). Students will be provided access to compute resources at UA campus clusters, the iPlant Collaborative, NSF XSEDE, and FutureGrid. Students will also learn how to write a proposal for obtaining future allocations on large-scale national resources through XSEDE.

Intellectual Merit

The computational challenges encountered while managing, analyzing, and visualizing data are a common and recurring theme across domains. Rapidly evolving computational capabilities such as cloud platforms, Hadoop, and NoSQL databases have augmented the traditional HPC model, providing a wide array of choices for analyzing research data. This course aims to equip students from very diverse disciplines, ranging from astronomy and hydrology to the life sciences and CISE disciplines, with a broad understanding of how these computational resources and technologies can be effectively applied to their specific research questions. The course will give students opportunities, guidance, and an initial roadmap for gaining access to scalable computational resources and for connecting with communities of practitioners when working with local, regional, national, and commercial infrastructure providers.

Broader Impact

This course will provide an orientation to the wide array of resources that are essential to support and foster creative approaches to computational challenges across disciplines. The current roster of students spans many disciplines that have not had access to coursework or formal training covering topics in high-throughput computing; through this course we intend to foster interdisciplinary collaborations among students from various disciplines while providing practical experience to students who have traditionally not had access to these capabilities. Our goal is to espouse the computational thinking that is essential to prepare the current and future generation of “data scientists”, affording them the audacity and ability to scale their research.

Use of FutureGrid

Students will learn how to create custom VMs for their specific projects, which require a full LAMP stack to support an integrated bioinformatics application.
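A custom LAMP VM of this kind might be provisioned at boot with a cloud-init user-data file; the sketch below assumes an Ubuntu/Debian-era image with PHP 5 packages, which is an assumption about the class images, not a stated course requirement.

```yaml
#cloud-config
# Hypothetical cloud-init sketch for a LAMP VM on an OpenStack
# (FutureGrid) instance; package names assume a 2014-era
# Ubuntu/Debian image.
packages:
  - apache2
  - mysql-server
  - php5
  - php5-mysql
  - libapache2-mod-php5
runcmd:
  - service apache2 restart
```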
These projects will include a) use of iRODS for federated data handling and b) Makeflow (cctools, from Doug Thain at Notre Dame) and Pegasus (Ewa Deelman at USC) to provide scale-out of tasks. Students may also choose to implement a Condor cluster or StarCluster (http://web.mit.edu/star/hpc/index.html) for managing their tasks.
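A Makeflow workflow is a make-style file of rules, each listing targets, sources, and a tab-indented command; Makeflow dispatches independent rules to local cores, Condor, or Work Queue workers. The file below is a minimal sketch with hypothetical file names (`analyze.py`, the `input*.fasta` inputs), not course material.

```make
# Minimal Makeflow sketch: two independent analysis tasks that can run
# in parallel, then a merge step that depends on both.
results.txt: part1.out part2.out
	cat part1.out part2.out > results.txt

part1.out: input1.fasta analyze.py
	python analyze.py input1.fasta > part1.out

part2.out: input2.fasta analyze.py
	python analyze.py input2.fasta > part2.out
```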
The above-mentioned applications will store certain data components using a NoSQL database (MongoDB), a key-value store (Redis), and ZeroMQ (http://www.zeromq.org/) as a concurrency framework.
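The ZeroMQ usage here would typically follow the fan-out (PUSH/PULL pipeline) pattern: one process distributes tasks and a pool of workers consumes them. The sketch below shows that pattern using only the Python standard library, with `queue.Queue` standing in for ZeroMQ sockets so it runs without a broker; the task (measuring sequence lengths) and all names are illustrative, and in the real pipeline results would go to Redis or MongoDB rather than a local queue.

```python
import queue
import threading

def worker(tasks: "queue.Queue", results: "queue.Queue") -> None:
    """Consume records until a None sentinel arrives."""
    while True:
        record = tasks.get()
        if record is None:  # sentinel: no more work
            tasks.task_done()
            break
        # Real code would write this to Redis (key/value) or
        # MongoDB (document) via their client libraries.
        results.put((record["id"], len(record["seq"])))
        tasks.task_done()

def run_pipeline(records, n_workers=4):
    """Fan records out to n_workers threads; return {id: length}."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for r in records:
        tasks.put(r)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return dict(results.queue)

lengths = run_pipeline([{"id": "r1", "seq": "ACGT"},
                        {"id": "r2", "seq": "ACGTAC"}])
```

With ZeroMQ proper, the `tasks` queue becomes a PUSH socket and each worker binds a PULL socket, which lets the workers run on separate VMs.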
Hadoop will be utilized to scale out some of the queries from MongoDB, alongside Apache Pig.
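One common way to drive Hadoop from Python is Hadoop Streaming, where the mapper and reducer are ordinary programs reading and writing tab-separated lines on stdin/stdout. The word-count sketch below shows that contract with pure-Python functions; the task and function names are illustrative, not taken from the course.

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit one tab-separated (word, 1) pair per token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(pairs):
    """Reduce phase: sum counts per word.

    Hadoop Streaming delivers the mapper output to the reducer
    sorted by key, which is what groupby relies on here.
    """
    keyed = (p.split("\t") for p in pairs)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(c) for _, c in group)}"

# Simulate the shuffle/sort step locally, then reduce.
mapped = sorted(mapper(["to be or not to be"]))
counts = dict(p.split("\t") for p in reducer(mapped))
```

On a cluster the same pair would be submitted with the `hadoop-streaming` jar, passing these scripts as the `-mapper` and `-reducer` arguments.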
The above-mentioned resources will integrate with similar resources running at iPlant and on XSEDE resources (TACC), i.e., iRODS federation with nodes at iPlant and Makeflow workers executing at TACC.
Students will be asked to choose different platforms/cloud infrastructures (OpenStack, Amazon) on which to implement their work, and to compare and contrast performance and capabilities.
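A simple way to make such cross-platform comparisons fair is to time the identical workload on each platform and compare a robust statistic such as the median. The harness below is a hedged sketch of that idea; the placeholder workload (summing a range) stands in for a real benchmark such as a Makeflow run, and `benchmark` is an illustrative name, not a course-provided tool.

```python
import statistics
import time

def benchmark(task, repeats=5):
    """Return the median wall-clock seconds of `task` over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# The same call would be run on each platform (OpenStack, Amazon)
# and the resulting medians compared.
median_s = benchmark(lambda: sum(range(100000)))
```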

Scale Of Use

There are 24 students in the class; certain assignments will be based on FutureGrid resources and will be conducted in groups of 4. I expect the whole class to use 4-10 VMs for design/prototyping (over a 2-week period) and then scale out to 40 VMs for the final assignment (for the whole class). Hardware requirements will be machines with 2 to 4 cores, 4 to 8 GB RAM, and 10-50 GB of disk space (EBS-style or some other form of persistent storage).
All scale-out will be done under the guidance of, and in collaboration with, the FutureGrid team once the proof-of-concept VMs are functional.
All requirements are flexible and can be tailored to the resources available.

Publications


FG-448
Nirav Merchant
University of Arizona
Active

Project Members

Alyssa Musante
Austin Paine
Brayden Streeter
Daniel Maguire
Daniel Spence
Eric Lyons
Justin Mannin
Kent Guerriero
Mike Hume
Munir Jaber
Nathan Lin
Nicholas Callahan
Shuqing Gu
Sriharsha Mucheli
Stephen Kovalsky
Tomasz Stawicki
Weifeng Li
Wenli Zhang
Xiaoran Li
Yun Wang
