CDIS White Paper

Last revised: 31 July 2000 4:25 pm
By: SBT
Center for Data Intensive Science



University of Florida

Florida State University

University of Chicago

University of Illinois at Chicago

California Institute of Technology

San Diego Supercomputer Center



OVERVIEW

We propose an NSF-supported Science and Technology Center for Data Intensive Science (CDIS) that links (a) discovery and development of the methodologies, software, engineering, and technology tools critical to qualitatively new modes of extracting and generating scientific knowledge from massive data collections; (b) dissemination of this progress through training of both students and active researchers; and (c) knowledge interchange with partners that develop software and systems for data-intensive functions (management, analysis), both by repurposing emerging business methods for scientific use and by making leading-edge scientific practice accessible to those developers.

The vision for this center is rooted in a rapidly emerging reality. Massive data collections are arising more and more often in multiple, seemingly unconnected scientific specialties. Their appearance and growth result from dramatic increases in both the capability and the capacity of scientific instrumentation (experimental and computational), storage technology, and networks. The largest collections, which today approach the Petabyte range (1,000 Terabytes), are expected to expand by the early part of the next decade to the Exabyte scale, one thousand times larger. We are thus entering an era of data intensive science. In this mode, gaining new insight depends critically on superior storage, retrieval, indexing, searching, filtering, and analysis of extremely large quantities of data. This era promises many new opportunities for systematic discernment of patterns, signals, and correlations that are far too subtle or interwoven to be picked up by simple scanning, ordinary statistical tools, and routine visualization methods. It also offers the opportunity to rethink computational science methodologies that were designed on the assumption that data-intensivity is to be avoided.

This new era also presents enormous long-term Information Technology (IT) challenges. On scaling arguments alone, a more-of-the-same methodological approach clearly will fail. In particular, the large quantities of data may well be distributed across the network environment. Further, both the gross size of the collections accessed and the analytical calculations to be performed on them will force data-intensive applications to harness large numbers of distributed storage systems, networks, and computers. (An oddly charming example is the harnessing of thousands of otherwise idle PCs to process SETI data.) These trends will be catalyzed further by the growing emphasis on multi-specialty team research. As nationally and internationally distributed researchers increasingly work in geographically dispersed teams while the size of the data structures burgeons, the result is a growing requirement for better collaborative tools and related IT infrastructure to maximize the potential and rate of scientific discovery. And these trends will be exploited as creative scientists learn not to be constrained by data-intensivity.

Data intensive specialties are growing rapidly in number and variety. A partial list today includes high energy physics, nuclear physics, whole-sky astronomical digital surveys, gravitational wave searches, crystallography using high intensity X-ray sources, climate modeling, systematic satellite earth observations, molecular genomics, three-dimensional (including whole body) medical scans, proteomics, Virtual Reality (VR) simulations, and digital archiving. Areas such as molecular and materials modeling have been shaped in part by avoidance of data-intensivity. The Center's focus will be on a significant subset of these disciplines, particularly on the common issues and opportunities of data-intensivity faced by all of them. The CDIS research program is enthusiastically, purposefully multidisciplinary, with participants from computer science, computational science, engineering, physics, astronomy, chemistry, and biology.

The logical organization of CDIS is based upon three clusters of shared, cross-cutting research teams:

By design, the teams in each cluster have significant overlap with the others.

CDIS has devised a broad program to extend the knowledge gained through its research efforts to other scientific communities, international collaborators, the IT industry, students at all levels, and the general public. The knowledge transfer effort is particularly strong because of our close collaboration with national and international laboratories and research projects, bioinformatics programs at other institutes, regional and national organizations, and vendors from many portions of the IT industry. CDIS also will have an interactive web interface for visitors to access, search, manipulate, correlate, and visualize sample data sets from our application disciplines and thus gain a sense of the excitement and empowerment involved in scientific careers. A key component of the education and human resource effort is to leverage the strengths of the EOT-PACI program of the NCSA Alliance, taking advantage of its extensive experience and well-developed linkages in designing K-12 and undergraduate programs and in reaching women and underrepresented minorities. In the Southeast, an analogous role will be played by linkage with SURA.

****

Introduction: Data Intensive Science

We propose creation of a Science and Technology Center for Data Intensive Science (CDIS) that will provide the research foundation and essential development for the methodologies, software, engineering, and technology tools common to data intensive science specialties in the next twenty years. The Center's focus will be on the challenges and opportunities offered by immense, often heterogeneous data collections. Today, the largest of these collections is approaching a Petabyte in size; it is expected to attain the Exabyte (1,000 Petabytes) scale early in the next decade. For specialties already in this range, there are several enormous computational science and computer science challenges to the extraction of scientific insights and results. For computational specialties that have avoided data-intensivity, the prospect of exploiting ultra-large archives opens the possibility of rethinking the basic design perspective of key methodologies to achieve new levels of performance. In both cases, the issue is how to unlock the implicit value of ultra-large datasets at a qualitatively higher level than that of a repository. Key aspects of the challenge include:

These in turn carry many implicit challenges. Examples include identification of common problems, efficient sharing of methods and algorithms developed in one specialty (avoidance of wheel-reinvention), re-purposing of commercial data intensive application methods (e.g., data mining, portals), and widespread linkage of new developments in data-intensive specialties to the teaching of science. Such challenges can be met only through a focused, systematic, sustained, multidisciplinary approach. That is the mission of this Center.

Extremely rapid expansion in dataset size is being driven by long-term, continuing exponential increases in data storage and transmission capacity, in parallel with price declines in storage and communications media. Current and foreseeable developments make it likely that by the next decade individual scientists will have access to Petabit (10^15 bits) local storage systems. Similarly, the aggregate storage capacities available to many science communities will reach Exabytes (10^18 bytes). (Current areal densities are roughly 1 gigabit/sq. inch. A 35 gigabit/sq. inch device has been demonstrated, and magnetic systems capable of another factor of 100 have been reported. See "Chemical and Engineering News", June 12, 2000, page 37.) Trends in both experiment and computation are to exploit these developments aggressively by generating and retaining more data. As a result, individuals, teams, and whole research communities are building tremendously large, heterogeneous aggregations of potentially or actually interesting data in their areas, including observational data, video, VR experiences, and so on. In other communities that have long been data-limited to the extent of discarding most of their computed results (for example, materials simulation and modeling), there is now serious consideration of rethinking methodologies to include pattern mining and recognition across vast collections of computed data. New storage capabilities alone obviously are inadequate to realize the scientific potential of ultra-large archives. Current methods of storage, indexing, retrieval, analysis, and presentation will not suffice. Without both qualitative and quantitative improvements in these areas, massive datasets simply will drown research teams in undigested data. Specifically, at least the following questions must be addressed:

Clearly, only a coordinated, multi-specialty attack on these challenges will yield solutions that will be transferable among a variety of disciplines. A useful prototype is Netlib, which provides well-documented, tested mathematical software for incorporation in a wide variety of applications, both scientific and commercial. Using linear algebra as an example, software available through Netlib has long since relieved developers of the burden of writing original code, as the short example below illustrates.
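(A minimal sketch of such reuse, assuming the Python language and the numpy package, whose dense solver dispatches to LAPACK routines distributed through Netlib; the matrix and right-hand side are invented purely for illustration.)

    # Solving a dense linear system A x = b with a community-maintained routine
    # instead of hand-written elimination code. numpy.linalg.solve calls the
    # LAPACK *gesv solvers, which are distributed through Netlib.
    import numpy as np

    A = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
    b = np.array([1.0, 2.0, 3.0])

    x = np.linalg.solve(A, b)   # one library call replaces a hand-coded factorization
    print("solution:", x)
    print("residual norm:", np.linalg.norm(A @ x - b))

A single call of this kind, maintained and tested by the numerical analysis community, is the pattern of reuse CDIS intends to replicate for data-intensive operations.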

(Need 1 or 2 more knowledge transfer arguments here)

Examples of Data Intensive Disciplines

Data-intensivity characterizes much of the most important and far-reaching research being carried out today on the fundamental forces of nature, biological systems, atmospheric and geophysical phenomena, the composition and structure of the universe, and advanced materials. Experiments and activities in these fields include:

Vision of the Center

Despite both the increasing number of scientific specialties that depend on massive datasets and the emerging potential to do new science with such archives, little effort has been invested in finding sharable ways to address the challenges and exploit the opportunities. As a result, valuable methodologies, algorithms, and toolkits devised within one specialty have two typical traits: such solutions almost always (1) are customized for the specialty in which they originated and (2) are essentially unknown to researchers in other areas. This situation is exacerbated by lack of interchange with the Information Technology (IT) industry. Solutions obtained there typically are unfamiliar to researchers, so the potential for repurposing from commercial to research objectives goes unexplored. Reciprocally, successes in scientific research are transmitted to the business sector unsystematically at best and more often not at all.
**SBT comment! The preceding sentences are a pretty brave set of claims. It would be good to have a supporting example or two. Ask Tom DeFanti for examples from Robert Grossman?

CDIS will overcome these limitations by providing a multidisciplinary environment that links faculty, postdocs, and students to pursue individual and group research while leveraging the considerable human and technological resources at the Center to cope with the challenges of accessing massive datasets to extract scientific results. At the same time, scientists from different disciplines will be able to collaborate on specific problems with one another, with visitors, with Center staff, and even with vendors. The resulting cross-fertilization of ideas will provide tools and methodologies that benefit several fields at once and provide educational opportunities for students who want to work in these disciplines or in the IT industry, where their skills will be highly marketable.

The institutions composing CDIS are:

Comment (SBT). Does this map really help us in the preproposal? Lots of acreage for a rather simple message.

Because the normal way for research to be funded and conducted is within each specialty area of a major academic discipline, it is clear that the opportunities (and obstacles) of data-intensivity common to multiple disciplines will not be addressed without explicit organization, commitments, and incentives. Equally clearly, efficient, non-redundant transfer of such knowledge to students and other researchers, let alone to industry, also will not happen without such circumstances. Thus another way to frame the overall goal of CDIS is to be the missing organization that obtains the needed commitments and provides the missing incentives.

The CDIS program strategy is to link selected fundamental scientific specialties with computational science and computer science and engineering expertise to build the shared tools and methodologies necessary to exploit the research potential of ultra-large data collections. The linkages are organized as functional clusters of cross-cutting teams.

Integral to the success of the Center will be the development and upgrading of storage facilities and high-speed networks, as well as technology advances in the IT sector, particularly those involving networking and data storage products. Thus, CDIS will be a distributed Center designed to exploit an extensive hardware and software infrastructure to integrate the IT resources of the geographically separated elements into a single resource usable by all participants. This infrastructure consists of (1) high-speed networks; (2) Grid middleware based on the GriPhyN Project (Caltech, UC, UIC, and SDSC are members of GriPhyN) to unify the IT resources of the center components; and (3) collaborative tools (distance meetings, etc.) developed by Center members and industry. The distributed form naturally incorporates a key reality that underlies the Center's mission: important, large-scale data collections need not be physically located where researchers are concentrated; they can be distributed over several geographical locations yet utilized transparently by researchers who have access to high-speed networks and an appropriate system and software infrastructure, as sketched below.
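(The idea of transparent access can be made concrete with a minimal sketch, written in Python; the replica catalog, site names, latencies, and function names below are purely hypothetical and stand in for the Grid middleware services the Center would actually deploy.)

    # Hypothetical sketch: a researcher refers to a dataset by logical name only;
    # a replica catalog decides which physical copy to use. Nothing here is drawn
    # from GriPhyN or any existing middleware.

    # Logical dataset name -> list of (site, estimated access latency in ms).
    REPLICA_CATALOG = {
        "sky-survey/run-042": [("sdsc.edu", 35.0), ("caltech.edu", 18.0), ("uchicago.edu", 55.0)],
        "cms/higgs-candidates": [("ufl.edu", 12.0), ("caltech.edu", 40.0)],
    }

    def locate_replica(logical_name):
        """Return the site holding the lowest-cost replica of a logical dataset."""
        replicas = REPLICA_CATALOG.get(logical_name)
        if not replicas:
            raise KeyError("no replica registered for %r" % logical_name)
        # A real system would weigh network load, storage latency, and policy;
        # here the estimated latency alone stands in for that cost model.
        site, _latency = min(replicas, key=lambda pair: pair[1])
        return site

    def open_dataset(logical_name):
        """Resolve a logical name and 'open' the dataset at the chosen site."""
        site = locate_replica(logical_name)
        print("fetching %s from %s" % (logical_name, site))
        # Placeholder for the actual transfer or streaming machinery.
        return {"logical_name": logical_name, "site": site}

    if __name__ == "__main__":
        print(open_dataset("cms/higgs-candidates"))

The point of the sketch is the separation of concerns: the researcher's analysis code names what it needs, while the infrastructure decides where the bytes come from.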

The Challenges of Data Intensive Science

***** This section is very rough and may have to be merged into another one. *****

A central goal for CDIS will be to make scientists, especially the youngest researchers with limited opportunities to travel (**SBT comment! I'm missing the point here. What has travel got to do with the challenge?), a highly productive resource, deeply engaged in the life of their experiments and in the ongoing worldwide process of search and discovery. To achieve these technical and human goals, a number of unprecedented challenges in information technology must be met:

Our research program consists of a set of cross-cutting activities that relate fundamental computer science and computational science research to the needs of the application disciplines:

Education and Outreach

The CDIS research program and institute structure, combined with an existing infrastructure for outreach programs, are naturally suited to education and outreach. First, the data intensive sciences that we will pursue within the Center will provide a fertile training ground for students wanting to work in their own or similar disciplines. Moreover, the training they receive will provide ample opportunity for many of them to enter the IT industry, which is moving very quickly toward distributed IT resources as a way of improving e-commerce transactions between businesses (B2B transactions) and between businesses and consumers (B2C transactions). Students will be able to spend time at other Center institutions (while remaining students at their home institutions) and work on large-scale projects they would not otherwise be able to undertake. We expect that this will be a very attractive option for many of our best students.

Our outreach effort builds on successful existing outreach activities. We expect, for example, to base most of our effort on the considerable resources of the EOT-PACI program run by the NCSA Alliance. In addition, UF and FSU participate in the national QuarkNet program sponsored by Fermilab to train high school teachers in high energy physics. CACR runs a program for minority youths. The GriPhyN Project also has a strong minority component represented by the participation of the University of Brownsville, a historically Hispanic institution.

Need courses for undergraduates and graduates. CSIT will have some. New hires in bioinformatics at UF. What about bio organization at UF that puts together curricula (can't think of name)?

Finally, we intend to have a very strong visiting program that will allow guest researchers to spend time at any of the institutes to collaborate on particular problems. We believe that a great deal of cross-fertilization between fields will take place as a direct result of visitor interactions.

Knowledge Transfer

Need letters of support from vendors who can help with computing, data storage and networking technology. Possibilities are IBM, Sun, Compaq, HP, SGI, Cisco. We are also looking at Qwest (networks), Lucent (networks and communications), Digital Island, Akamai. Microsoft and Intel are strong possibilities.

Other ideas: Collaboration with UF Brain Institute, other medical institutes or organizations, national laboratories (Fermilab, Argonne, Brookhaven, JLAB, Advanced Photon Source, CARS), international laboratories (CERN, Chile observatory?), UCAID, SURA, bioinformatics centers, bioinformatics vendors, other IT vendors. Need to identify the vendors here.

Outreach to international students and scientists. CERN is easy. What about Chilean observatory? Are there other examples of reaching international audiences?

The web portal could be an enormously useful tool. Imagine if students or the general public could manipulate something like a body scan or a digital picture of a galaxy and calculate something useful? Or look at a genome and learn something about mutation rates or relatedness to other organisms? (A toy illustration of such a calculation appears below.)
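(A toy illustration, assuming Python; the sequences and the relatedness measure are invented solely to suggest the kind of interactive calculation a portal might offer, not a proposed feature.)

    # Toy portal-style calculation: compare two short, invented DNA fragments and
    # report the fraction of differing sites as a crude proxy for relatedness.

    def fraction_differing(seq_a, seq_b):
        """Fraction of aligned positions at which two equal-length sequences differ."""
        if len(seq_a) != len(seq_b):
            raise ValueError("sequences must be pre-aligned to the same length")
        differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
        return differences / float(len(seq_a))

    if __name__ == "__main__":
        fragment_one = "ATGGTGCACCTGACTCCTGAG"
        fragment_two = "ATGGTGCACCTGACTCCAGAG"
        print("differing fraction: %.3f" % fraction_differing(fragment_one, fragment_two))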

Management

Need text and diagram from David Tanner

Why a Center is Necessary

Look at arguments presented by last year's winners. There are several that we can use.


Comments, etc. to Paul Avery