Climate Data Analytics Using MapReduce

Project Information

Discipline
Atmospheric Sciences (301) 
Subdiscipline
40.04 Atmospheric Sciences and Meteorology 
Orientation
Research 
Abstract

The dramatic growth in climate/weather/ocean (CWO) data is outstripping the growth in compute speed offered by the traditional uniprocessor programming model used to analyze it. New approaches are necessary. Parallel programming in the form of message-passing can address this problem, but at the cost of algorithmic complexity and the added burden of implementing fault tolerance to achieve reliability at the ultrascale. MapReduce, a programming pattern that has been highly successful in the commercial sector, provides a framework with automatic parallelism and fault tolerance, combined with the kind of ease of use favored by scientists for data analytics. This project will explore the potential of MapReduce for climate data analysis by prototyping MapReduce-based climate data analysis components, applying them to climate data analysis problems, and creating value-added climate datasets.

Intellectual Merit

Parallel programming is now a fact of life for large-scale scientific modeling. Analysis of large-scale model output, such as that produced by coupled climate models, has become sufficiently demanding to make parallel computing necessary. One approach is to combine parallel file I/O with message-passing to create scalable data analysis engines. This strategy is especially appropriate for climate model output, where history file sizes grow in proportion to increases in model resolution. The hunt for knowledge from smaller-scale reanalysis data and thousands of weather station records, however, is a different problem and one amenable to MapReduce. This project will explore the feasibility of using MapReduce to perform climate data analysis and data mining. The outcomes of this project will be: a set of Map and Reduce kernels specific to climate data statistical analysis; performance analysis of these kernels for a number of testbed problems; and value-added datasets derived from reanalysis and station timeseries data.
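
As an illustration of what a Map and Reduce kernel pair for station statistics might look like, the following is a minimal single-process sketch in plain Python; it is not the project's implementation and does not use Hadoop or Twister. The record layout, station identifiers, and function names (map_kernel, combine, reduce_kernel, run_job) are all assumptions made for this example. The map step turns each observation into a partial statistic (count, mean, sum of squared deviations), and the reduce step merges partial statistics with the standard pairwise update formula, so the merge can be applied in any grouping, which is the property a MapReduce framework relies on.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input records: (station_id, temperature in degrees C).
RECORDS = [
    ("ST001", 10.0), ("ST002", 20.0), ("ST001", 14.0),
    ("ST002", 22.0), ("ST001", 12.0), ("ST002", 24.0),
]


def map_kernel(record):
    """Map: emit (key, partial statistics) for a single observation."""
    station_id, value = record
    # Partial statistics are (count, mean, M2), where M2 is the sum of
    # squared deviations from the mean; for one value M2 is zero.
    return station_id, (1, value, 0.0)


def combine(a, b):
    """Merge two partial statistics using the pairwise update formula."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def reduce_kernel(partials):
    """Reduce: merge all partials for one key into (count, mean, variance)."""
    n, mean, m2 = reduce(combine, partials)
    variance = m2 / (n - 1) if n > 1 else 0.0  # sample variance
    return n, mean, variance


def run_job(records):
    """Stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, partial in map(map_kernel, records):
        groups[key].append(partial)
    return {key: reduce_kernel(partials) for key, partials in groups.items()}


STATS = run_job(RECORDS)
for station, (count, mean, variance) in sorted(STATS.items()):
    print(station, count, mean, variance)
```

In a real framework the grouping done by run_job would be handled by the shuffle phase, and combine could also serve as a combiner on each node to reduce the data sent over the network.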

Broader Impacts

This project will, if supported, lead to: exploratory data analysis at an unprecedented scale, unearthing new knowledge that will result in peer-reviewed publications; eventual creation of value-added climate and weather datasets derived from large reanalysis, model output, and station timeseries datasets (the long-term strategy is to create web-based “atlases” of derived statistics, e.g., probability density functions estimated from timeseries and spatial sampling); and creation of a “gateway” parallel, fault-tolerant programming style capable of engaging novice scientific programmers (e.g., undergraduate and early-career graduate students in the geosciences).
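
One simple way a probability density function could be estimated from a timeseries in the MapReduce style is a fixed-width histogram, sketched below in plain Python. This is an assumption-laden illustration rather than the project's method: the bin width, the sample values, and the names map_kernel, reduce_kernel, and estimate_pdf are all hypothetical. The map step assigns each value to a bin, the reduce step sums the counts per bin, and a final normalization converts counts to densities.

```python
from collections import Counter

BIN_WIDTH = 5.0  # hypothetical bin width, in the units of the data

# Hypothetical timeseries values (e.g., daily temperatures in degrees C).
SERIES = [1.0, 2.5, 4.0, 6.0, 7.5, 12.0, 13.0, 14.5]


def map_kernel(value):
    """Map: emit (bin index, 1) for one observation."""
    # Floor division keeps negative values in the correct bin.
    return int(value // BIN_WIDTH), 1


def reduce_kernel(pairs):
    """Reduce: sum the counts emitted for each bin index."""
    counts = Counter()
    for bin_index, count in pairs:
        counts[bin_index] += count
    return counts


def estimate_pdf(values):
    """Return {left bin edge: density}; densities integrate to 1."""
    counts = reduce_kernel(map(map_kernel, values))
    total = sum(counts.values())
    return {
        bin_index * BIN_WIDTH: count / (total * BIN_WIDTH)
        for bin_index, count in sorted(counts.items())
    }


PDF = estimate_pdf(SERIES)
for left_edge, density in PDF.items():
    print(left_edge, density)
```

Because bin counts are additive, partial histograms computed on separate nodes can be merged in any order before normalization.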

Project Contact

Project Lead
Jay Larson (larson) 
Project Manager
Jay Larson (larson) 

Resource Requirements

Hardware System
  • I don't care (what I really need is a software environment and I don't care where it runs)
 
Use of FutureGrid

Prototyping, testing, and performance studies of climate data analysis MapReduce kernels; large-scale analysis of reanalysis and station data. All work will use Hadoop or another MapReduce framework (e.g., Twister).

Scale of Use

Initially, usage will consist of short scalability tests; that is, there will be requests for large numbers of nodes for short periods to run testbed problems. Eventually, once performance evaluation is completed and deemed sufficient to support production, some climate data analysis work will be performed, with requests proportionate to how well my applications scale. I suspect I will use the system frequently.

Project Timeline

Submitted
03/01/2012 - 19:50