Center for Cloud and Autonomic Computing: New Members
The NSF Cloud and Autonomic Computing Center (CAC) is an innovative research center that collaborates with a globally diverse group of companies to create industry-relevant projects and accelerate the transfer of technology to the private sector. CAC is the only NSF Industry-University Cooperative Research Center that explicitly identifies cloud computing as its main area of emphasis. Our researchers include leading academics in cloud and autonomic computing and top-quality graduate students. Membership in CAC gives companies the opportunity to use research and systems created by the center and to help guide a multiuniversity research program aimed at innovative enterprise solutions.
The Cloud and Autonomic Computing Center provides a proven structure for partnering businesses with top university researchers. The center is funded by the National Science Foundation, industry and government members, and university matching funds. Current and past members have included industry leaders in computer hardware and software, defense systems, and information technology applications. Over CAC's five years of existence, members have included companies such as IBM, Microsoft, Intel, Raytheon, NEC, Ball Aerospace, Northrop-Grumman, Merrill-Lynch, Citrix, Xerox, and Samsung. These members have cooperatively sponsored and participated in more than twenty-five projects on topics including datacenter management, virtualization, cloud applications, cloud bursting, software-defined networking, cybersecurity, and resilient computing. More than seventy technical papers have resulted from these projects, and several projects have led to new software and tools that have benefited company products or practices. Dozens of students have been involved in these efforts, a significant percentage of whom have joined the member companies for internships or after graduation.
The ability to work with recognized scientists and to drive research in areas of interest is an important benefit of membership. This prospectus summarizes the Indiana University and University of Chicago project portfolios.
Cloud Infrastructure Layer
FutureGrid. FutureGrid is a major collaboration involving Indiana University, the University of Chicago, and the University of Florida that supports general high-performance computing (HPC), cloud, and Grid testbeds over a distributed environment, as shown in Figure 1. The system has proven effective in computational and computer science software testing and education, where its rich variety of environments and its interactive models are especially valuable.
Dynamic Provisioning of HPC and Cloud Infrastructures. Indiana University has developed and prototyped a provisioning environment that can dynamically instantiate images spanning HPC, commercial clouds (currently Azure), and the base “research” environments (OpenStack, Eucalyptus, Nimbus, OpenNebula). This environment allows the FutureGrid testbed to be reconfigured on demand. The cross-environment system requires an image library that supports the different deployment systems; this is supported by our universal image library that stores images as templates that can be generated on demand for each environment. Future work will integrate this library with software-defined networks (OpenFlow), the ViNe networking system from the University of Florida, and the GENI community to allow general distributed systems to be instantiated.
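The universal image library described above can be sketched as a store of abstract templates rendered on demand for each deployment system. The following Python sketch is illustrative only; the class, field, and format names are assumptions, not the actual FutureGrid API.

```python
# Sketch of a template-based image library: each image is stored once as an
# abstract template and rendered on demand into a deployment-specific form.
# Targets and disk formats here are illustrative assumptions.

from dataclasses import dataclass, field


@dataclass
class ImageTemplate:
    name: str
    base_os: str
    packages: list = field(default_factory=list)

    def render(self, target: str) -> dict:
        """Generate a deployment-specific image description for one target."""
        formats = {
            "openstack": "qcow2",
            "eucalyptus": "raw",
            "nimbus": "raw",
            "azure": "vhd",
        }
        if target not in formats:
            raise ValueError(f"unsupported target: {target}")
        return {
            "name": f"{self.name}-{target}",
            "os": self.base_os,
            "disk_format": formats[target],
            "packages": list(self.packages),
        }


template = ImageTemplate("hpc-base", "centos-6", ["openmpi", "torque"])
print(template.render("openstack")["disk_format"])  # qcow2
```

Storing templates rather than per-environment binaries is what lets a single library serve HPC, commercial, and research cloud stacks alike.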
Infrastructure-as-a-Service. The University of Chicago is home to the Nimbus project, which released the first open source implementation of Infrastructure-as-a-Service in 2005. Since then, the Nimbus team has been leading research on improvements to IaaS, developing methods of better resource management, utilization, and scheduling in compute clouds; efficient image distribution; and efficient data movement and management in infrastructure clouds. Of particular interest is the ongoing investigation of the frontiers of HPC on infrastructure clouds. Our work in this space strongly emphasizes collaboration with domain scientists and has resulted in pioneering examples of cloud use for nuclear physics, high-energy physics, geosciences, bioinformatics, and other domain sciences.
Networking. Indiana University is developing a cloud-oriented network awareness service, based on the widely deployed perfSONAR system developed by Martin Swany from IU. We are also developing support for high-performance data transfer to support distributed cloud storage with data acceleration, streaming, caching, and replication available as services. We will evaluate these services against widely used cloud NoSQL storage models.
Cloud Platforms
Work on easy-to-use, efficient cloud platforms, frameworks, and programming environments reflects the changing needs and opportunities in the scientific environment. For example, the dropping prices of sensors create unprecedented opportunities and change the nature of many sciences. The Ocean Observatory Initiative (OOI) project, of which the University of Chicago is a partner, is building a sensor-based observatory that is changing ocean science from a traditionally exploratory science to an observatory science (Fig. 2). Leveraging on-demand resource availability in the cloud to provide a timely response to real-time information coming from such sensors is a critical element of its architecture. A significant thrust among the partners is developing methods that make cloud computing capabilities easy for science to leverage.
Cloud Programming Environments and Runtime for Data-Enabled Science. Indiana University is developing a data analytics environment with the architecture of Figure 1 (right), prototyped on FutureGrid. One key feature is the use of iterative MapReduce: Judy Qiu has developed Twister, which runs iterative algorithms on both HPC and cloud environments. Initial performance results are shown in Fig. 3. Further work is focusing on the Map-Collective concept of Twister, with the needed collectives optimized on each environment. SPIDAL (Scalable Parallel Interoperable Data Analytics Library) is another key component, in which we are collecting scalable high-performance data analytics algorithms. Initial work includes MDS (multidimensional scaling) for dimension reduction, topic (hidden factor) determination, and clustering, in which we include both performance and robustness enhancements (the latter coming from deterministic annealing) over the algorithms found in libraries such as R and Mahout.
Secure MapReduce on Hybrid Clouds. Many uses of commercial clouds have been impeded by privacy concerns, as today’s cloud providers offer little assurance for the protection of sensitive user data. This problem cannot be addressed by existing secure outsourcing techniques, which are based on heavyweight cryptographic primitives such as homomorphic encryption and secure multiparty computation and thus typically are too expensive to handle computation at large scale. Indiana University is developing a new approach that divides the computation into two steps. The first requires strong privacy protection but has low compute intensity and hence is suitable for a small, protected private cloud. The second carries the dominant compute cost but is not sensitive and hence can be deployed on a shared resource. XiaoFeng Wang and Haixu Tang from IU have demonstrated this concept for a genomics problem and developed a hybrid MapReduce as the programming environment.
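The two-step split can be illustrated with a toy sketch: the private cloud performs the cheap, privacy-critical step of replacing sensitive values with keyed-hash tokens, and the public cloud performs the heavy aggregation over the opaque tokens. This is only a schematic analogy, assuming a keyed hash suffices to desensitize the data; the actual IU genomics work uses far more careful constructions, and all names here are illustrative.

```python
# Toy sketch of the hybrid two-step split. The secret key never leaves the
# private cloud, so the public step sees only opaque tokens.

import hashlib
import hmac
from collections import Counter

SECRET_KEY = b"held-only-on-the-private-cloud"  # illustrative placeholder


def private_step(records):
    """Low-compute, privacy-critical: replace sensitive data with keyed hashes."""
    return [hmac.new(SECRET_KEY, r.encode(), hashlib.sha256).hexdigest()
            for r in records]


def public_step(hashed_records):
    """High-compute, non-sensitive: heavy aggregation over opaque tokens."""
    return Counter(hashed_records)


reads = ["ACGT", "ACGT", "TTGA"]
counts = public_step(private_step(reads))
print(max(counts.values()))  # 2
```

The design point is that the expensive aggregation runs on cheap shared resources while the cryptographic transformation, which is linear in the data size, stays on the small trusted cluster.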
Scalable and Highly Available Platforms. Cloud computing offers the promise of providing scalable and highly available services, but leveraging it is not easy. For this reason, the most recent focus of the Nimbus team is on easy-to-use, easy-to-scale, and highly available aggregates such as elastic virtual clusters or workflow platforms, spanning resources over multiple, private, community, and commercial clouds. We design and develop services capable of providing high availability and elastic scaling by relying on a distributed resource pool provided by infrastructure clouds. In the context of these services, we investigate resource and storage management in multicloud environments and the issues of performance and cost that they pose. Building on our early work in contextualization, we also provide environment management of aggregates, turning them into desired, repeatable deployments. Such deployments can be used by domain applications or as repeatable platforms for research conducted on experimental platforms such as FutureGrid.
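The elastic-scaling behavior described above ultimately rests on a policy that sizes a worker pool from observed demand, within fixed bounds. The sketch below shows one such policy in plain Python; the thresholds and function names are illustrative assumptions, not the Nimbus API.

```python
# Minimal sketch of an elastic-scaling policy: target enough workers to keep
# roughly `per_worker` queued tasks per worker, clamped to [min_w, max_w].

def desired_workers(queue_length, per_worker=10, min_w=1, max_w=32):
    """Compute the target worker-pool size from current queue depth."""
    needed = -(-queue_length // per_worker)  # ceiling division
    return max(min_w, min(max_w, needed))


print(desired_workers(0))     # 1   (keep a minimum pool alive)
print(desired_workers(95))    # 10
print(desired_workers(1000))  # 32  (capped by the resource budget)
```

In a multicloud deployment, a controller would evaluate such a policy periodically and acquire or release instances across the available infrastructure clouds, which is where the performance and cost trade-offs mentioned above arise.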
Applications
The work at the University of Chicago involves strong interactions with application domain scientists, who are often struggling with new scientific demands and new technological developments. Interacting directly with users working on the edge of what is possible gives us access to upcoming requirements and keeps us constantly searching for new solutions. The Computation Institute at the University of Chicago and Argonne National Laboratory is a meeting point for scientists from disciplines as diverse as nuclear physics (see Fig. 4), the social sciences, biology, and the geosciences. In addition, our group nurtures longstanding collaborations with scientists from many disciplines, among other avenues through the XSEDE and Open Science Grid projects.
The focus at Indiana University is on two important areas where clouds can be used effectively: data-intensive problems with large collective communication (supportable by Iterative MapReduce) and pleasingly parallel problems. The latter is illustrated by our Sensor (Internet of Things) cloud and bioinformatics, where we work with several groups both locally and outside IU.
Education, Outreach, and Workforce Development
Cloud and Data Science Curricula, Tutorials, Virtual Schools, and Workshops. Both IU and the University of Chicago support a variety of activities leading to the development of courses and curricular materials that both teach and rely on distributed infrastructure. Specific activities include courses offered in the School of Informatics and Computing, interactive tutorials offered at conferences, and virtual schools hosted over the internet. An important new development is the use of the MOOC (massive open online course) format to deliver material even when the number of participants is conventional (20–100). All activities draw on FutureGrid clouds in order to allow attendees to experiment interactively with cloud computing and to transition from education to research and development.
Minority-Serving Institutions and Underrepresented Communities. A strong focus at Indiana University is relations with minority-serving national organizations supporting American Indian (AIHEC), African American (NAFEO), and Hispanic (HACU) colleges. We run a strong summer undergraduate research program drawing mainly on MSI participants, and MSIs are well represented at FutureGrid workshops. One focus of MOOCs is delivering very large classes, but an alternative is supporting multiple customized, conventional-sized online classes. We are pursuing the latter by building a repository of lessons (around ten minutes each) that can be composed into custom modules; our initial project is a cloud module for the master’s degree curriculum at Elizabeth City State University, an HBCU.