CDIS White Paper

Last revised: 31 July 2000 4:25 pm
By: SBT
Center for Data Intensive Science



University of Florida

Florida State University

University of Chicago

University of Illinois at Chicago

California Institute of Technology

San Diego Supercomputer Center



OVERVIEW

We propose an NSF-supported Science and Technology Center for Data Intensive Science (CDIS) that links (a) discovery and development of the methodologies, software, engineering, and technology tools critical to qualitatively new modes of extracting and generating scientific knowledge from massive data collections; (b) dissemination of this progress through training of both students and active researchers; and (c) knowledge interchange with partners that develop software and systems for data-intensive functions (management, analysis), both by repurposing emerging business methods for scientific use and by making leading-edge scientific practice accessible to those developers.

The vision for this center is rooted in a rapidly emerging reality. Massive data collections are arising more and more often in multiple, seemingly unconnected scientific specialties. Their appearance and growth result from dramatic increases in both the capability and the capacity of scientific instrumentation (experimental and computational), storage technology, and networks. The largest collections, which today approach the Petabyte range (1,000 Terabytes), are expected to expand by the early part of the next decade to the Exabyte scale, one thousand times larger. We are thus entering an era of data intensive science. In this mode, gaining new insight depends critically on superior storage, retrieval, indexing, searching, filtering, and analysis of extremely large quantities of data. This era promises many new opportunities for systematic discernment of patterns, signals, and correlations that are far too subtle or interwoven to be picked up by simple scanning, ordinary statistical tools, and routine visualization methods. It also offers the opportunity to rethink computational science methodologies that were designed on the assumption that data-intensivity is to be avoided.

This new era also presents enormous long-term Information Technology (IT) challenges. On scaling arguments alone, a more-of-the-same methodological approach clearly will fail. In particular, the large quantities of data may well be distributed across the network environment. Further, both the gross size of the collections accessed and the analytical calculations to be performed on them will force data-intensive applications to harness large numbers of distributed storage systems, networks, and computers. (An oddly charming example is the harnessing of thousands of otherwise idle PCs to process SETI data.) These trends will be catalyzed further by the growing emphasis on multi-specialty team research. As nationally and internationally distributed researchers increasingly work in geographically dispersed teams while the size of the data structures burgeons, the result is a growing requirement for better collaborative tools and related IT infrastructure to maximize the potential and rate of scientific discovery. And these trends will be exploited as creative scientists learn not to be constrained by data-intensivity.

Data intensive specialties are growing rapidly in number and variety. A partial list today includes high energy physics, nuclear physics, whole-sky astronomical digital surveys, gravitational wave searches, crystallography using high intensity X-ray sources, climate modeling, systematic satellite earth observations, molecular genomics, three-dimensional (including whole body) medical scans, proteomics, Virtual Reality (VR) simulations, and digital archiving. Areas such as molecular and materials modeling have been shaped in part by avoidance of data-intensivity. The Center's focus will be on a significant subset of these disciplines, particularly on the common issues and opportunities of data-intensivity faced by all of them. The CDIS research program is enthusiastically, purposefully multidisciplinary, with participants from computer science, computational science, engineering, physics, astronomy, chemistry, and biology.

The logical organization of CDIS is based upon three clusters of shared, cross-cutting research teams:

By design, the teams in each cluster have significant overlap with the others.

CDIS has devised a broad program to extend the knowledge gained through its research efforts to other scientific communities, international collaborators, the IT industry, students at all levels, and the general public. The knowledge transfer effort is particularly strong because of our close collaboration with national and international laboratories and research projects, bioinformatics programs at other institutes, regional and national organizations, and vendors from many portions of the IT industry. CDIS also will have an interactive web interface for visitors to access, search, manipulate, correlate, and visualize sample data sets from our application disciplines and thus gain a sense of the excitement and empowerment involved in scientific careers. A key component of the education and human resource effort is to leverage the strengths of the EOT-PACI program of the NCSA Alliance, taking advantage of its extensive experience and well-developed linkages in designing K-12 and undergraduate programs and in reaching women and underrepresented minorities. In the Southeast, an analogous role will be played by linkage with SURA.

****

Introduction: Data Intensive Science

We propose creation of a Science and Technology Center for Data Intensive Science (CDIS) that will provide the research foundation and essential development for the methodologies, software, engineering, and technology tools common to data intensive science specialties in the next twenty years. The Center's focus will be on the challenges and opportunities offered by immense, often heterogeneous data collections. Today, the largest of these collections is approaching a Petabyte in size; it is expected to attain the Exabyte (1,000 Petabytes) scale early in the next decade. For specialties already in this range, there are several enormous computational science and computer science challenges to the extraction of scientific insights and results. For computational specialties that have avoided data-intensivity, the prospect of exploiting ultra-large archives opens the possibility of rethinking the basic design perspective of key methodologies to achieve new levels of performance. In both cases, the issue is how to unlock the implicit value of ultra-large datasets at a qualitatively higher level than that of a repository. Key aspects of the challenge include:

These in turn carry many implicit challenges. Examples include identification of common problems, efficient sharing of methods and algorithms developed in one specialty (avoidance of wheel-reinvention), re-purposing of commercial data intensive application methods (e.g., data mining, portals), and widespread linkage of new developments in data-intensive specialties to the teaching of science. Such challenges can be met only through a focused, systematic, sustained, multidisciplinary approach. That is the mission of this Center.

Extremely rapid expansion in dataset size is being driven by long-term, continuing exponential increases in data storage and transmission capacity, in parallel with price declines in storage and communications media. Current and foreseeable developments make it likely that by the next decade individual scientists will have access to Petabit (10^15 bits) local storage systems. Similarly, the aggregate storage capacities available to many science communities will reach Exabytes (10^18 bytes). (Current areal densities are roughly 1 gigabit/sq. inch. A 35 gigabit/sq. inch device has been demonstrated, and magnetic systems capable of another factor of 100 have been reported. See "Chemical and Engineering News", June 12, 2000, page 37.) Trends in both experiment and computation are to exploit these developments aggressively by generating and retaining more data. As a result, individuals, teams, and whole research communities are building tremendously large, heterogeneous aggregations of potentially or actually interesting data in their areas, including observational data, video, VR experiences, and so on. In other communities that have long been data-limited to the extent of discarding most of their computed results (for example, materials simulation and modeling), there is now serious consideration of rethinking methodologies to include pattern mining and recognition across vast collections of computed data. New storage capabilities alone obviously are inadequate to realize the scientific potential of ultra-large archives. Current methods of storage, indexing, retrieval, analysis, and presentation will not suffice. Without both qualitative and quantitative improvements in these areas, massive datasets simply will drown research teams in undigested data. Specifically, at least the following questions must be addressed:

Clearly, only a coordinated, multi-specialty attack on these challenges will yield solutions that will be transferable among a variety of disciplines. A useful prototype is Netlib, which provides well-documented, tested mathematical software for incorporation in a wide variety of applications, both scientific and commercial. Using linear algebra as an example, software available through Netlib has long since relieved developers of the burden of writing original code, as the short example below illustrates.
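(A minimal sketch of such reuse, assuming the Python language and the numpy package, whose dense solver dispatches to LAPACK routines distributed through Netlib; the matrix and right-hand side are invented purely for illustration.)

    # Solving a dense linear system A x = b with a community-maintained routine
    # instead of hand-written elimination code. numpy.linalg.solve calls the
    # LAPACK *gesv solvers, which are distributed through Netlib.
    import numpy as np

    A = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
    b = np.array([1.0, 2.0, 3.0])

    x = np.linalg.solve(A, b)   # one library call replaces a hand-coded factorization
    print("solution:", x)
    print("residual norm:", np.linalg.norm(A @ x - b))

A single call of this kind, maintained and tested by the numerical analysis community, is the pattern of reuse CDIS intends to replicate for data-intensive operations.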

(Need 1 or 2 more knowledge transfer arguments here)

Examples of Data Intensive Disciplines

Data-intensivity characterizes much of the most important and far-reaching research being carried out today on the fundamental forces of nature, biological systems, atmospheric and geophysical phenomena, the composition and structure of the universe, and advanced materials. Experiments and activities in these fields include:

Vision of the Center

Despite both the increasing number of scientific specialties that depend on massive datasets and the emerging potential to do new science with such archives, little effort has been invested in finding sharable ways to address the challenges and exploit the opportunities. As a result, valuable methodologies, algorithms, and toolkits devised within one specialty have two typical traits: such solutions almost always (1) are customized for the specialty in which they originated and (2) are essentially unknown to researchers in other areas. This situation is exacerbated by lack of interchange with the Information Technology (IT) industry. Solutions obtained there typically are unfamiliar to researchers, so the potential for repurposing from commercial to research objectives goes unexplored. Reciprocally, successes in scientific research are transmitted to the business sector unsystematically at best and more often not at all.
**SBT comment! The preceding sentences are a pretty brave set of claims. It would be good to have a supporting example or two. Ask Tom DeFanti for examples from Robert Grossman?

CDIS will overcome these limitations by providing a multidisciplinary environment that links faculty, postdocs, and students to pursue individual and group research while leveraging the considerable human and technological resources at the Center to cope with the challenges of accessing massive datasets to extract scientific results. At the same time, scientists from different disciplines will be able to collaborate on specific problems with one another, with visitors, with Center staff, and even with vendors. The resulting cross-fertilization of ideas will provide tools and methodologies that benefit several fields at once and provide educational opportunities for students who want to work in these disciplines or in the IT industry, where their skills will be highly marketable.

The institutions composing CDIS are:

Comment (SBT). Does this map really help us in the preproposal? Lots of acreage for a rather simple message.

Because the normal way for research to be funded and conducted is within each specialty area of a major academic discipline, it is clear that the opportunities (and obstacles) of data-intensivity common to multiple disciplines will not be addressed without explicit organization, commitments, and incentives. Equally clearly, efficient, non-redundant transfer of such knowledge to students and other researchers, let alone to industry, also will not happen without such circumstances. Thus another way to frame the overall goal of CDIS is to be the missing organization that obtains the needed commitments and provides the missing incentives.

The CDIS program strategy is to link selected fundamental scientific specialties with computational science and computer science and engineering expertise to build the shared tools and methodologies necessary to exploit the research potential of ultra-large data collections. The linkages are organized as functional clusters of cross-cutting teams.

Integral to the success of the Center will be the development and upgrading of storage facilities and high-speed networks, as well as technology advances in the IT sector, particularly those involving networking and data storage products. Thus, CDIS will be a distributed Center designed to exploit an extensive hardware and software infrastructure to integrate the IT resources of the geographically separated elements into a single resource usable by all participants. This infrastructure consists of (1) high-speed networks; (2) Grid middleware based on the GriPhyN Project (Caltech, UC, UIC, and SDSC are members of GriPhyN) to unify the IT resources of the center components; and (3) collaborative tools (distance meetings, etc.) developed by Center members and industry. The distributed form naturally incorporates a key reality that underlies the Center's mission: important, large-scale data collections need not be physically located where researchers are concentrated; they can be distributed over several geographical locations yet utilized transparently by researchers who have access to high-speed networks and an appropriate system and software infrastructure, as sketched below.
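(The idea of transparent access can be made concrete with a minimal sketch, written in Python; the replica catalog, site names, latencies, and function names below are purely hypothetical and stand in for the Grid middleware services the Center would actually deploy.)

    # Hypothetical sketch: a researcher refers to a dataset by logical name only;
    # a replica catalog decides which physical copy to use. Nothing here is drawn
    # from GriPhyN or any existing middleware.

    # Logical dataset name -> list of (site, estimated access latency in ms).
    REPLICA_CATALOG = {
        "sky-survey/run-042": [("sdsc.edu", 35.0), ("caltech.edu", 18.0), ("uchicago.edu", 55.0)],
        "cms/higgs-candidates": [("ufl.edu", 12.0), ("caltech.edu", 40.0)],
    }

    def locate_replica(logical_name):
        """Return the site holding the lowest-cost replica of a logical dataset."""
        replicas = REPLICA_CATALOG.get(logical_name)
        if not replicas:
            raise KeyError("no replica registered for %r" % logical_name)
        # A real system would weigh network load, storage latency, and policy;
        # here the estimated latency alone stands in for that cost model.
        site, _latency = min(replicas, key=lambda pair: pair[1])
        return site

    def open_dataset(logical_name):
        """Resolve a logical name and 'open' the dataset at the chosen site."""
        site = locate_replica(logical_name)
        print("fetching %s from %s" % (logical_name, site))
        # Placeholder for the actual transfer or streaming machinery.
        return {"logical_name": logical_name, "site": site}

    if __name__ == "__main__":
        print(open_dataset("cms/higgs-candidates"))

The point of the sketch is the separation of concerns: the researcher's analysis code names what it needs, while the infrastructure decides where the bytes come from.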

The Challenges of Data Intensive Science

***** This section is very rough and may have to be merged into another one. *****

A central goal for CDIS will be to make scientists, especially the youngest researchers with limited opportunities to travel (**SBT comment! I'm missing the point here. What has travel got to do with the challenge?), a highly productive resource, deeply engaged in the life of their experiments and in the ongoing worldwide process of search and discovery. To achieve these technical and human goals, a number of unprecedented challenges in information technology must be met:

Our research program consists of a set of cross-cutting activities that relate fundamental computer science and computational science research to the needs of the application disciplines:

Education and Outreach

The CDIS research program and institute structure, combined with an existing infrastructure for outreach programs, are naturally suited to education and outreach. First, the data intensive sciences that we will pursue within the Center will provide a fertile training ground for students wanting to work in their own or similar disciplines. Moreover, the training they receive will provide ample opportunity for many of them to enter the IT industry, which is moving very quickly toward distributed IT resources as a way of improving e-commerce transactions between businesses (B2B transactions) and between businesses and consumers (B2C transactions). Students will be able to spend time at other Center institutions (while remaining students at their home institutions) and work on large-scale projects they would not otherwise be able to undertake. We expect that this will be a very attractive option for many of our best students.

Our outreach effort builds on successful existing outreach activities. We expect, for example, to base most of our effort on the considerable resources of the EOT-PACI program run by the NCSA Alliance. In addition, UF and FSU participate in the national QuarkNet program sponsored by Fermilab to train high school teachers in high energy physics. CACR runs a program for minority youths. The GriPhyN Project also has a strong minority component represented by the participation of the University of Brownsville, a historically Hispanic institution.

Need courses for undergraduates and graduates. CSIT will have some. New hires in bioinformatics at UF. What about bio organization at UF that puts together curricula (can't think of name)?

Finally, we intend to have a very strong visiting program that will allow guest researchers to spend time at any of the institutes to collaborate on particular problems. We believe that a great deal of cross-fertilization between fields will take place as a direct result of visitor interactions.

Knowledge Transfer

Need letters of support from vendors who can help with computing, data storage and networking technology. Possibilities are IBM, Sun, Compaq, HP, SGI, Cisco. We are also looking at Qwest (networks), Lucent (networks and communications), Digital Island, Akamai. Microsoft and Intel are strong possibilities.

Other ideas: Collaboration with UF Brain Institute, other medical institutes or organizations, national laboratories (Fermilab, Argonne, Brookhaven, JLAB, Advanced Photon Source, CARS), international laboratories (CERN, Chile observatory?), UCAID, SURA, bioinformatics centers, bioinformatics vendors, other IT vendors. Need to identify the vendors here.

Outreach to international students and scientists. CERN is easy. What about Chilean observatory? Are there other examples of reaching international audiences?

The web portal could be an enormously useful tool. Imagine if students or the general public could manipulate something like a body scan or a digital picture of a galaxy and calculate something useful? Or look at a genome and learn something about mutation rates or relatedness to other organisms? (A toy illustration of such a calculation appears below.)
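(A toy illustration, assuming Python; the sequences and the relatedness measure are invented solely to suggest the kind of interactive calculation a portal might offer, not a proposed feature.)

    # Toy portal-style calculation: compare two short, invented DNA fragments and
    # report the fraction of differing sites as a crude proxy for relatedness.

    def fraction_differing(seq_a, seq_b):
        """Fraction of aligned positions at which two equal-length sequences differ."""
        if len(seq_a) != len(seq_b):
            raise ValueError("sequences must be pre-aligned to the same length")
        differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
        return differences / float(len(seq_a))

    if __name__ == "__main__":
        fragment_one = "ATGGTGCACCTGACTCCTGAG"
        fragment_two = "ATGGTGCACCTGACTCCAGAG"
        print("differing fraction: %.3f" % fraction_differing(fragment_one, fragment_two))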

Management

Need text and diagram from David Tanner

Why a Center is Necessary

Look at arguments presented by last year's winners. There are several that we can use.


Comments, etc. to Paul Avery