Managing an Adaptive Cloud Cache for Supporting Data-Intensive Applications

Project Information

Discipline
Computer Science (401) 
Orientation
Research 
Abstract

Today's computing projects, including myriad data mining, analysis, and scientific applications, are becoming increasingly data-intensive. Despite steady advances in hardware and computing paradigms, this so-called data deluge continues to extend processing times for various reasons. For example, in popular scenarios involving parallel and distributed frameworks, large data transfers are necessarily invoked among various machines to communicate or merge results. Meanwhile, the emergence of Cloud computing has been apropos for enabling solutions to the aforementioned problem. In response to the data deluge, processing times might be expedited by harnessing the Cloud's resource elasticity, i.e., the on-demand nature of virtual machine allocation and ostensibly infinite storage units, for a price. Traditionally, many data-intensive applications, including analysis and scientific workflows, are known to contain frequent redundant overlaps in their computation patterns. This implies that many applications can benefit from caching intermediate and final computed results for reuse. Thus, by avoiding repeated heavy computation and amortizing the data movement inherent to those processes, data-intensive applications can be accelerated. However, certain challenges lie within managing such a data cache in the Cloud. For instance, the structure of the physical storage hierarchy (machine memory, local and network disks, persistent storage) should adapt to an application's performance requirements. Furthermore, data placement policies and heuristics for resource consolidation must also be developed to optimize an application's performance and cost effectiveness. We propose developing an elastic Cloud cache manager, which we aim to release to the public as an open source project, that seeks to address the aforementioned challenges.

Intellectual Merit

Our proposed cache would provide a two-tiered system, capable of (1) predicting and managing the costs of provisioning Cloud resources and (2) adaptively managing cache data within the provisioned resources by promoting and demoting data blocks in the storage hierarchy to optimize performance.
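
As an illustration only, the promotion and demotion of data blocks could be sketched as follows. This is a minimal sketch and not the project's design: the class name TieredCache, the two fixed-capacity tiers, and the least-recently-used demotion rule are all assumptions made for the example, since the proposal does not specify a placement policy.

```python
from collections import OrderedDict


class TieredCache:
    """Toy storage hierarchy (hypothetical): a small fast tier over a larger slow tier."""

    def __init__(self, fast_capacity, slow_capacity):
        self.fast_capacity = fast_capacity
        self.slow_capacity = slow_capacity
        # Each tier is ordered from least to most recently used.
        self.fast = OrderedDict()  # stands in for machine memory
        self.slow = OrderedDict()  # stands in for local or network disk

    def put(self, key, value):
        # New or updated blocks enter the fast tier (an assumed policy).
        self.slow.pop(key, None)
        self.fast[key] = value
        self.fast.move_to_end(key)
        self._rebalance()

    def get(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)  # refresh recency
            return self.fast[key]
        if key in self.slow:
            # Promotion: a hit in the slow tier moves the block up.
            value = self.slow.pop(key)
            self.fast[key] = value
            self._rebalance()
            return value
        return None  # miss: the caller would recompute the result

    def _rebalance(self):
        # Demotion: overflow from the fast tier moves down, oldest first.
        while len(self.fast) > self.fast_capacity:
            key, value = self.fast.popitem(last=False)
            self.slow[key] = value
        # Eviction: overflow from the slow tier leaves the cache entirely.
        while len(self.slow) > self.slow_capacity:
            self.slow.popitem(last=False)
```

In this sketch the tier capacities are fixed; the elastic behavior described above would correspond to adjusting them as Cloud resources are provisioned or released.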

Broader Impacts

A cost-conscious cache would benefit multiple stakeholders by helping to accelerate both scientific applications and general service-oriented applications.

Project Contact

Project Lead
David Chiu (david.chiu) 
Project Manager
David Chiu (david.chiu) 

Resource Requirements

Hardware Systems
  • hotel (IBM iDataPlex at U Chicago)
  • india (IBM iDataPlex at IU)
  • sierra (IBM iDataPlex at SDSC)
  • xray (Cray XT5m at IU)
 
Use of FutureGrid

Research: My graduate students and I are exploring caching and data management options on the Cloud. We are focusing on harnessing elasticity to suit our application's storage/caching resource requirements.

Education and Teaching: I teach college courses in Web Data Management and Database Systems, and up to this point, we have been relying on Amazon's Educational Grant for access to Cloud resources, which is useful for teaching distributed data processing, like Map-Reduce.

Scale of Use

Several (10 to 20) VMs for experiments and courses.

Project Timeline

Submitted
11/24/2010 - 23:09