Streaming in the Clouds

Project Information

Discipline
Computer Science (401) 
Orientation
Research 
Abstract

In recent years, BigData has become an important aspect of scientific discovery, a shift referred to as the Fourth Paradigm. Across the wide spectrum of applications and acquisition methods, the ones that will generate the largest amounts of data fall into the category of streaming data, e.g., networks of sensors, observatories, telescopes, or experiments such as the CERN LHC. As the amount of acquired information grows and data sources become increasingly geographically distributed, it becomes important to process the data in scalable and efficient ways. Cloud computing presents an interesting option for a scalable processing platform. However, the question arises of how best to use cloud computing capabilities for geographically distributed stream processing. In this work, we explore and analyze different approaches to streaming data to the cloud and evaluate them in the context of multiple cloud offerings, including Microsoft Azure and FutureGrid's Nimbus and OpenStack installations. We show, using an ATLAS application, that choosing the right approach to streaming data can improve average data rates by a factor of three.


Intellectual Merit

The goal of the project is to understand how well cloud environments support streaming. This is a key question for the future, as BigData is expected to increasingly take the form of stream data.

Broader Impacts

The results and observations can be used by any scientific researchers who need to analyze stream data. The findings will allow them to scale and adjust their experiment configurations so that they can process all the data they need.

Project Contact

Project Lead
Radu Tudoran (rtudoran) 
Project Manager
Pierre Riteau (priteau) 
Project Members
Pierre Riteau, Radu Tudoran, Sergey Panitkin  

Resource Requirements

Hardware Systems
  • hotel (IBM iDataPlex at U Chicago)
  • india (IBM iDataPlex at IU)
  • sierra (IBM iDataPlex at SDSC)
 
Use of FutureGrid

FutureGrid is used to run the virtual machines in which the stream processing is performed. The purpose is to understand how streaming data can be processed in cloud environments.

Scale of Use

The number of VMs used is on the order of tens, up to one hundred. Because the goal is to understand how BigData streaming is supported at large scale, scalability in terms of the number of nodes/VMs is important.

Project Timeline

Submitted
08/20/2013 - 07:55