To: gcf@npac.syr.edu
Subject: TrAC Internet Column -- first draft
Date: Tue, 30 Jul 1996 16:30:12 -0400
From: "David E. Bernholdt"

I won't put in the references until we reach something like final form. Comments appreciated, especially on how I portrayed NHSE.

TrAC - Internet Column

Internet Resource Discovery for Chemistry - Where Are Those Vast Untapped Resources?

David E. Bernholdt and Geoffrey C. Fox

Northeast Parallel Architectures Center, Syracuse University, Syracuse, NY 13244-4100 USA

Introduction

Several of these columns have referred to the remarkable growth of the Internet and the wide range of resources which are becoming available; similar refrains are often heard elsewhere as well. There are, of course, both positive and negative aspects to this. On the good side, there are more numerous, more sophisticated, and more useful resources. But as the information space grows, it becomes harder to keep track of all the new resources of interest.

The world-wide web (WWW) tends to dominate the news these days, since it is the vehicle through which the majority of these new services and resources are offered. Fortunately for the resource discovery problem, there is an increasing number of WWW services which attempt to catalog or index web sites. Although at this point their coverage may be spotty, and their indexing criteria not always well-matched to the technical interests of a chemist, they can already be quite useful (and report millions of "hits" daily). Given the popularity of both the world-wide web as a whole and the search services for it, we expect to see further improvements in this area, including better cataloging of highly specialized technical offerings.

There is, however, more to the Internet than the world-wide web. Chemists in particular have long made extensive use of electronic mailing lists and Usenet newsgroups as forums for discussion, and also anonymous-FTP and other methods for the exchange of software. WWW-based access to resources of this sort is spotty and leaves much to be desired.

Electronic Mailing Lists and Usenet Newsgroups

An initial survey has identified more than 80 chemistry-related electronic mailing lists and Usenet groups, and deeper investigation will surely reveal more. Collectively, and even in some cases individually, these lists have many thousands of subscribers, and the total message volume is measured in thousands per month. These mailing lists and newsgroups clearly provide an enormous resource for those who use them. Unfortunately, these lists may be among the best-kept secrets of the Internet -- they are in general not widely advertised, and not easily discovered through casual browsing. Even after interesting mailing lists are located, there are serious limitations on how they can be used. In many cases, a quick search of the archives of a mailing list would reveal that a question a researcher wants to post has already been thoroughly discussed, and in principle such a search should be faster and more efficient than repeating the question and waiting for responses. However, many lists have no archives at all, while many others offer only e-mail or FTP access to archives, without convenient search capabilities. And although there is often some overlap in the discussions on various lists, the archives that do exist are generally separate and must be searched individually. At present, this can be very tedious and time-consuming, so researchers often neglect an enormous amount of valuable information on the Internet simply because it is too hard to access.

To help address this problem, we have begun a project called "AskNPAC about Chemistry" which gives the chemistry community an entirely different way to use the wealth of information represented by mailing lists and newsgroups. The idea is to provide a single point of access to chemistry-related mailing lists and newsgroups which can be browsed, much in the same way that one uses a mail or news reader, or searched in any combination for specific information. The database will receive a continuous stream of postings from all of the chemistry-related forums we can identify (more than 80 already), and where possible will also include past traffic collected from individual list archives.

The project is based on the "AskNPAC" news database server under development here at the Northeast Parallel Architectures Center (NPAC), mainly to support education and computer science objectives. The system uses the Oracle database software and Oracle's WOW web interface package as its base, operating on a standard Unix workstation. AskNPAC communicates with an HTTP (Hypertext Transfer Protocol) server, so that anyone with a web browser can easily access the databases. And because it is built on a true relational database, rather than the full-text search engine often used for such applications, queries can be focused on sender, date, or other features in addition to the body of the message.
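To illustrate what a relational store makes possible, here is a minimal sketch of a query that combines sender, date, and message body. It uses Python's built-in sqlite3 module as a stand-in rather than Oracle, and the table layout, column names, forum names, and addresses are all hypothetical; nothing here is taken from the actual AskNPAC schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical message table."""
    conn = sqlite3.connect(":memory:")
    # Dates are stored as ISO strings so that plain string comparison
    # orders them chronologically.
    conn.execute(
        """CREATE TABLE messages (
               id INTEGER PRIMARY KEY,
               forum TEXT NOT NULL,
               sender TEXT NOT NULL,
               sent_on TEXT NOT NULL,
               subject TEXT NOT NULL,
               body TEXT NOT NULL
           )"""
    )
    rows = [
        ("chem-list-a", "alice@example.org", "1996-06-03",
         "Basis set question", "Which basis set suits transition metals?"),
        ("chem-list-b", "bob@example.org", "1996-07-12",
         "ANNOUNCE: integral package", "New release of an integral package."),
        ("chem-list-a", "alice@example.org", "1996-07-20",
         "Re: Basis set question", "Thanks, the basis set advice worked."),
    ]
    conn.executemany(
        "INSERT INTO messages (forum, sender, sent_on, subject, body) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    return conn


def search(conn, sender=None, since=None, body_contains=None):
    """Combine any of the optional criteria into one parameterized query."""
    clauses, params = [], []
    if sender is not None:
        clauses.append("sender = ?")
        params.append(sender)
    if since is not None:
        clauses.append("sent_on >= ?")
        params.append(since)
    if body_contains is not None:
        clauses.append("body LIKE ?")
        params.append(f"%{body_contains}%")
    where = " AND ".join(clauses) if clauses else "1"
    sql = (
        "SELECT forum, sent_on, subject FROM messages "
        f"WHERE {where} ORDER BY sent_on"
    )
    return conn.execute(sql, params).fetchall()
```

Calling search(conn, sender="alice@example.org", since="1996-07-01", body_contains="basis set") on the demo data restricts the result to one sender, one date range, and one phrase at the same time, which a search engine that indexes only message text cannot express directly.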

Because mailing lists and newsgroups conveniently provide rapid dissemination of information to a sizable topical audience, they are often used to announce or solicit information about software to solve a particular problem, or other resources. This extends the utility of the database archive beyond "just" the knowledge-base it represents. By extracting particular types of information, such as software announcements, we can help with the discovery of other resources on the Internet which, like mailing lists, are often under-used.

Facilitating Software Reuse

When confronted with a computational problem, most of us would rather find a package, library, or routine that fits our needs than spend our time "reinventing the wheel". And indeed, quite a bit of useful software is available, either on the Internet or off, if we only knew about it. Unfortunately, most existing search services on the web are not well-suited to indexing software resources, particularly in very technical areas such as chemistry.

In some areas, such as numerical analysis, there is a great deal of high-quality free software available which has been collected into large repositories such as the well-known Netlib service offered by the University of Tennessee and Oak Ridge National Laboratory. Although services like Netlib have proven extremely valuable and popular, there is room for further improvement. From a user's point of view, having a large collection of software is an advantage, but it increases the burden on the management side. In addition, it is very hard to include commercial software in such a repository, which would be a particular problem in trying to extend the Netlib model to the chemistry domain. Finally, it still requires a certain amount of sophistication from the user to identify the appropriate software within the repository for the particular task at hand.

One attempt to improve this situation is the National High-Performance Computing & Communications Software Exchange (NHSE), a project of the Center for Research in Parallel Computation, of which NPAC is a member. The NHSE is a distributed or "virtual" repository based on a uniform scheme for cataloging holdings and providing pointers to the appropriate physical repository or commercial source. This means that repositories do not have to be huge to be useful, which should make it easier for organizations to contribute to NHSE (further simplified by the "Repository in a Box" kit to be released later this year). It also makes it easier to provide useful information about software which is not available directly on the Internet, or which is commercial. The cataloging scheme relies on domain-specific taxonomies to help classify and locate software. Having a well-defined route to classifying software which takes into account the nature of the science for which it is used makes it easier for users to locate the appropriate tools for their needs. Beyond this, the NHSE is also developing roadmaps to help less-experienced users locate software more easily.
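As a rough illustration of the "virtual" repository idea, the sketch below models a catalog record that carries a taxonomy path and a pointer to wherever the software actually lives, whether a physical repository or a commercial source. The class, field names, taxonomy terms, package names, and locations are all hypothetical and are not drawn from the NHSE's actual cataloging format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """One hypothetical catalog record; the software itself is stored elsewhere."""
    name: str
    taxonomy: tuple           # path through a domain taxonomy, most general first
    location: str             # pointer to a physical repository or a vendor
    commercial: bool = False  # commercial packages are cataloged the same way


def find_by_category(entries, prefix):
    """Return entries whose taxonomy path starts with the given prefix."""
    prefix = tuple(prefix)
    return [e for e in entries if e.taxonomy[:len(prefix)] == prefix]


# Made-up holdings spread over different (placeholder) locations.
CATALOG = [
    CatalogEntry("eigen-solver-x", ("mathematics", "linear algebra"),
                 "ftp://repo-one.example.org/eigen-solver-x"),
    CatalogEntry("md-engine-y", ("chemistry", "molecular dynamics"),
                 "ftp://repo-two.example.org/md-engine-y"),
    CatalogEntry("qc-suite-z", ("chemistry", "electronic structure"),
                 "contact vendor (placeholder)", commercial=True),
]
```

Here find_by_category(CATALOG, ["chemistry"]) returns both chemistry entries, including the commercial one, even though neither is stored with the catalog itself.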

The NHSE is generally still in the early stages of development, and so far the focus has been on the mathematical software area. This is an area with an impact spanning many scientific fields, and there are already a number of large repositories of generally high-quality software to build upon. In domains such as chemistry, there is further to go, but given the complexity of much chemistry-related software, there is at least as much to gain from making it easier to find and reuse. The initial challenge in developing a chemistry-oriented repository for the NHSE is creating a taxonomy, a task which is currently underway. After that, we must begin cataloging software. The AskNPAC about Chemistry database will be very useful at this stage. We plan to "mine" the database for software announcements, producing a continuous stream of candidates for cataloging. Because of the volume of chemistry software, however, we at NPAC cannot hope to do more than provide basic cataloging data. We would be very interested in working with individuals or organizations to improve the depth of the NHSE repository, either through the development of specialized repositories for specific sub-fields, or through more detailed descriptions, reviews, roadmaps, etc.
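One simple way such mining might begin is a keyword filter over subjects and bodies, sketched below. The pattern, the function name, and the sample messages are assumptions made for illustration only; this is not a description of how AskNPAC selects announcements, and a filter this crude yields candidates for a person to review, not finished catalog entries.

```python
import re

# Hypothetical cues that a posting may announce software.
ANNOUNCE_PATTERN = re.compile(
    r"\b(announc\w*|new release|now available|version\s+\d+(\.\d+)*)\b",
    re.IGNORECASE,
)


def candidate_announcements(messages):
    """Return messages whose subject or body matches one of the cues."""
    return [
        m for m in messages
        if ANNOUNCE_PATTERN.search(m["subject"] + "\n" + m["body"])
    ]


SAMPLE_MESSAGES = [
    {"subject": "ANNOUNCE: integral package",
     "body": "Source is now available by anonymous FTP."},
    {"subject": "Basis set question",
     "body": "Which basis set works for transition metals?"},
    {"subject": "Re: plotting",
     "body": "Try version 2.1 of the viewer."},
]
```

On the sample data the filter keeps the first and third messages. The third is a passing mention rather than an announcement, which shows why the output should be treated as a stream of candidates rather than as catalog data.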

Relationship to Other Internet Information Services

The purpose of both the NHSE Chemistry repository and the AskNPAC about Chemistry database is to make existing resources easier to find and use rather than to replace them. Neither of the projects described here could exist without the numerous mailing lists and software packages already available. There is also a wealth of information available on the WWW, at sites associated with mailing lists and many others--information about software, pointers to WWW resources, etc. It is our experience that this information is often quite detailed, provided by researchers who have done a great deal of work to thoroughly investigate issues in a small domain of interest to them. Our project cannot possibly provide this level of detail--our goal is to bring together a broad spectrum of information which will help the researcher "discover" the detailed information provided by others far more quickly and easily than is currently possible.

Conclusion

Although the growth of the Internet and the popularity of the world-wide web have led to a wide range of resources of interest to chemists, others, of equal or greater interest, have been largely ignored. We have developed the "AskNPAC about Chemistry" database to provide easier access to the thousands of messages each month which appear on chemistry-oriented mailing lists and newsgroups. We are also developing a chemistry repository for the National HPCC Software Exchange in order to simplify the job of locating software to facilitate research. We will mine the AskNPAC database to help identify software to catalog, and we plan to develop other "products" based on information in the database, such as a chemistry-focused web index.

We offer these projects as a service to the chemistry community, and we solicit your assistance in order to make them as valuable as possible. This can range from providing us with feedback about the services or ideas for related offerings you would find useful to a more active involvement, such as hosting an NHSE repository in a sub-field or the development of more detailed roadmaps to the software.