To: gcf@npac.syr.edu
cc: paulc@npac.syr.edu
Subject: CSIR status
Date: Fri, 24 Jan 1997 16:52:30 -0500
From: "David E. Bernholdt"

CSIR is now running as http://www.csir.org, which is a virtual host for nhse.npac. Work still needs to be done on backups, but that seems to be proceeding. We're looking for someone to replace Yang Meng to do some of the legwork associated with CSIR. I have a couple of candidates already and will probably finalize that next week.

I think we're ready to begin publicizing this thing, and the first step is to get that TrAC article out. I have updated it and tried to answer the reviewers. The final version is appended. It is not really much different than our initial submission. Please let me know if you have any changes so I can get this out ASAP.

Thanks,
David
--

TrAC - Internet Column

Internet Resource Discovery for Chemistry - Where Are Those Vast Untapped Resources?

David E. Bernholdt and Geoffrey C. Fox

Northeast Parallel Architectures Center, Syracuse University, Syracuse, NY 13244-4100 USA

Introduction

Several of these columns have referred to the remarkable growth of the Internet and the wide range of resources which are becoming available; similar refrains are often heard elsewhere as well. There are, of course, both positive and negative aspects to this. On the good side, there are more numerous, more sophisticated, and more useful resources. But as the information space grows, it becomes harder to keep track of all the new resources of interest.

The world-wide web (WWW) tends to dominate the news these days, since it is the vehicle through which the majority of these new services and resources are offered. Fortunately for the resource discovery problem, there are an increasing number of WWW services [1-2] which attempt to catalog or index web sites. Although at this point their coverage may be spotty, and their indexing criteria not always well-matched to the technical interests of a chemist, they can be quite useful already (and report millions of "hits" daily). Given the popularity of both the world-wide web as a whole, and search services for it, we expect to see further improvements in this area, including better cataloging of highly specialized technical offerings.

There is, however, more to the Internet than the world-wide web. Chemists in particular have long made extensive use of electronic mailing lists and Usenet newsgroups as forums for discussion, and also anonymous-FTP and other methods for the exchange of software. WWW-based access to resources of this sort is spotty and leaves much to be desired. The Chemistry Software and Information Resources (CSIR) project [3] is designed to make it easier to access and utilize this kind of information.

Electronic Mailing Lists and Usenet Newsgroups

An initial survey has identified more than 80 chemistry-related electronic mailing lists and Usenet groups, and deeper investigation will surely reveal more. Collectively, and even in some cases individually, these lists have many thousands of subscribers and the total message volume is measured in thousands per month. These mailing lists and newsgroups clearly provide an enormous resource for those who utilize them. Unfortunately, these lists may be among the best-kept secrets of the Internet -- they are in general not widely advertised, and not easily discovered through casual browsing. There are several resources which can help identify mailing lists. A periodic posting on the Usenet [4] catalogs more than 1700 lists, and a new web-based service, Liszt [5], claims more than 66,000 lists in its catalog. Neither of these compendia, however, offers many chemistry-related mailing lists. For example, a search of the Liszt catalog for the term "chemistry" produces 111 hits, but more than three-quarters of those are obviously lists of purely local interest -- specific to the operation of a university course or department. A more technical catalog, in this case of biology-related forums, is the BIOSCI service [6] which manages the bionet newsgroup hierarchy and associated mailing lists. Finally, the largest compendium of chemistry-related lists we are aware of is maintained by Kris Boulez [7]. This is the primary source of the aforementioned 80 chemistry-related lists and newsgroups.

Even after interesting mailing lists are located, there are serious limitations on how they can be used. In many cases, a quick search of the archives of a mailing list would reveal that a question a researcher may want to post has already been thoroughly discussed. In principle, such a search should be faster and more efficient than repeating the question and waiting for responses. However, many lists have no archives at all, while many others offer only e-mail or FTP access to archives, without convenient search capabilities. And although there is often some overlap in the discussions on various lists, in general what archives there may be are separate and must be searched individually. At present, this can be very tedious and time-consuming, so researchers often neglect an enormous amount of valuable information on the Internet simply because it is too hard to access.

One component of the CSIR project is the "AskNPAC About Chemistry" Mailing List Archive, which gives the chemistry community an entirely different way to use the wealth of information represented by mailing lists and newsgroups. The idea is to provide a single point of access to chemistry-related mailing lists and newsgroups which can be browsed, much in the same way that one uses a mail or news reader, or searched in any combination for specific information. The database will receive a continuous stream of postings from all of the chemistry-related forums we can identify (more than 80 already), and where possible will also include past traffic collected from individual list archives.

The project is based on the "AskNPAC" news database server under development here at the Northeast Parallel Architectures Center (NPAC), mainly to support education and computer science objectives. The system uses the Oracle database software and Oracle's WOW web interface package as its base, operating on a standard Unix workstation. AskNPAC communicates with an HTTP (Hypertext Transfer Protocol) server, so that anyone with a web browser can easily access the databases. This approach offers several advantages when compared with more familiar web-based mailing list archive tools such as Hypermail [8] and MHonArc [9]. First is the ability to treat both mailing lists and Usenet newsgroups in the same fashion. Second, most web-based archive tools can handle only one list at a time for searching and browsing, while access through the database is much more flexible and need not be limited to a single list at a time. Finally, using a structured database rather than "flat files" makes it easier to offer the user fielded searches (queries by sender, date, etc.) and to bring to bear the sophisticated text indexing and searching features of database packages like Oracle, including word stemming, thesauri, and more advanced queries (these capabilities are under development for AskNPAC).
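To make the idea of fielded searches across several forums concrete, here is a minimal sketch. It is not the AskNPAC implementation: it substitutes Python's built-in sqlite3 module for Oracle, and the table layout, the column names, the forum names, and the sample messages are all assumptions invented for illustration.

```python
import sqlite3

def build_archive():
    """Create a tiny in-memory archive (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages ("
        " forum TEXT, sender TEXT, sent_date TEXT, subject TEXT, body TEXT)"
    )
    rows = [
        ("chemistry-list", "alice@example.org", "1997-01-10",
         "Basis set question", "Which basis set suits transition metals?"),
        ("sci.chem", "bob@example.org", "1997-01-12",
         "Re: Basis set question", "Try one with effective core potentials."),
        ("nmr-list", "alice@example.org", "1997-01-15",
         "Shimming tips", "Any advice on shimming a 500 MHz magnet?"),
    ]
    conn.executemany("INSERT INTO messages VALUES (?, ?, ?, ?, ?)", rows)
    return conn

def search(conn, forums=None, sender=None, since=None, keyword=None):
    """Fielded search; every criterion is optional and they are ANDed."""
    clauses, params = [], []
    if forums:
        # Any combination of mailing lists and newsgroups in one query.
        clauses.append("forum IN (%s)" % ",".join("?" * len(forums)))
        params.extend(forums)
    if sender:
        clauses.append("sender = ?")
        params.append(sender)
    if since:
        # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
        clauses.append("sent_date >= ?")
        params.append(since)
    if keyword:
        clauses.append("(subject LIKE ? OR body LIKE ?)")
        params.extend(["%" + keyword + "%"] * 2)
    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    sql = ("SELECT forum, sender, sent_date, subject FROM messages"
           + where + " ORDER BY sent_date")
    return conn.execute(sql, params).fetchall()

if __name__ == "__main__":
    archive = build_archive()
    for row in search(archive, forums=["chemistry-list", "sci.chem"],
                      keyword="basis set"):
        print(row)
```

Because mailing-list and newsgroup postings sit in the same table, one query can span both kinds of forum, which is the flexibility that one-list-at-a-time archive tools lack. A simple substring match stands in here for the stemming and thesaurus features mentioned above.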

The AskNPAC section of the CSIR home page [3] offers a variety of information to help users familiarize themselves with the AskNPAC service, including introductions to electronic mailing lists and Usenet news and more details about how to subscribe to them on your own, as well as a list of the chemistry-related lists archived by AskNPAC. There is also a link to archives in other topical areas which use the AskNPAC software package.

At present, AskNPAC offers four different user interfaces, each of which can be accessed directly from the CSIR home page. Each interface is specialized for a certain mode of access to the archive. The "hypermail-like" or "line-based" interface resembles that of the popular Hypermail [8] package and is very convenient for browsing newsgroups. Messages can be presented by conversation thread, date, subject, or author and are easily read in sequence. The "forms-based" interface offers the most flexibility for searching the archive contents, and it follows much the same pattern as the form-based interfaces of other WWW search engines. Queries can be performed based on sender, date, or general text ("Query By Keyword"). It is possible to query any combination of lists by selecting them appropriately in the scroll-box of the search forms [10]. The "calendar-based" interface is geared toward accessing messages based on the date they were sent, and the "frame-based" interface offers general capabilities for both browsing and searching, but without the sophistication of the more specialized interfaces. AskNPAC is being transformed from a research project to a user-friendly web tool, so the current user interfaces may seem a bit rough around the edges. Nevertheless, they are functional, and we encourage your feedback to help us improve the interface.

Because mailing lists and newsgroups conveniently provide rapid dissemination of information to a sizable topical audience, they are often used to announce or solicit information about software to solve a particular problem, or other resources. This extends the utility of the database archive beyond "just" the knowledge-base it represents. By extracting particular types of information, such as software announcements, we can help with the discovery of other resources on the Internet which, like mailing lists, are often under-used.

Facilitating Software Reuse

When confronted with a computational problem, most of us would rather find a package, library, or routine that fits our needs than spend our time "reinventing the wheel". And indeed, quite a bit of useful software is available, either on the Internet or off, if we only knew about it. Unfortunately, most existing search services on the web are not well-suited to indexing software resources, particularly in very technical areas such as chemistry.

In some areas, such as numerical analysis, there is a great deal of high-quality free software available which has been collected into large repositories such as the well-known Netlib service [11] offered by the University of Tennessee and Oak Ridge National Laboratory. Although services like Netlib have proven extremely valuable and popular, there is room for further improvement. A large collection of software is an advantage from the user's point of view, but it increases the burden on the management side. In addition, it is very hard to include commercial software in such a repository, which would be a particular problem in trying to extend the Netlib model to the chemistry domain. Finally, it still requires a certain amount of sophistication from the user to identify the appropriate software within the repository for the particular task at hand.

One attempt to improve this situation is the National High-Performance Computing & Communications Software Exchange (NHSE) [12], a project of the Center for Research in Parallel Computation, of which NPAC is a member. The NHSE is a distributed or "virtual" repository based on a uniform idea of cataloging holdings and providing pointers to the appropriate physical repository or commercial source. This means that repositories do not have to be huge to be useful, which should make it easier for organizations to contribute to NHSE (further simplified by the "Repository in a Box" kit now available in beta-test form from the NHSE web pages). It also makes it easier to provide useful information about software which is not available directly on the Internet, or which is commercial. The cataloging scheme relies on domain-specific taxonomies to help classify and locate software. Having a well-defined route to classifying software which takes into account the nature of the science for which it is used makes it easier for users to locate the appropriate tools for their needs. Beyond this, the NHSE is also developing roadmaps to help less-experienced users locate software more easily.
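The notion of a "virtual" repository, a catalog of classified entries that point to wherever the software actually lives, can be sketched as follows. This is only an illustration under stated assumptions: the record fields, the taxonomy categories, the package names, and the locations are all hypothetical and do not reflect the actual NHSE cataloging format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CatalogEntry:
    """One catalog record: a description plus a pointer, not the software."""
    name: str
    category: tuple          # path in a hypothetical domain taxonomy
    location: str            # physical repository or commercial contact
    commercial: bool = False

# Hypothetical entries; none of these packages or addresses are real.
CATALOG = [
    CatalogEntry("ExampleSCF", ("quantum chemistry", "ab initio"),
                 "ftp://ftp.example.org/pub/examplescf/"),
    CatalogEntry("ExampleDFT", ("quantum chemistry", "density functional"),
                 "contact vendor: sales@example.com", commercial=True),
    CatalogEntry("ExampleMD", ("molecular simulation", "molecular dynamics"),
                 "http://www.example.edu/examplemd/"),
]

def find(catalog, *path):
    """Return entries whose taxonomy path begins with the given categories."""
    return [entry for entry in catalog if entry.category[:len(path)] == path]

if __name__ == "__main__":
    for entry in find(CATALOG, "quantum chemistry"):
        kind = "commercial" if entry.commercial else "free"
        print(entry.name, "(" + kind + ")", "->", entry.location)
```

Since each record holds only a pointer, commercial packages and software that is not on the Internet at all can be cataloged alongside freely downloadable code, and a search by taxonomy category works the same way for all of them.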

The NHSE is generally still in the early stages of development, and so far the focus has been on the mathematical software area. This is an area with an impact spanning many scientific fields, and there are already a number of large software repositories to build upon, offering generally high-quality software. In domains such as chemistry, there is further to go, but given the complexity of much chemistry-related software, there is at least as much to gain from making it easier to find and reuse. The Chemistry Software Exchange component of CSIR has recently been "released" with more than 100 software assets in the catalog. Clearly not an exhaustive list, this is meant to act as a seed, encouraging software providers to help flesh it out by informing us of their offerings. At present, catalog entries are given keywords from a rough classification scheme. An important challenge in developing a chemistry-oriented repository for the NHSE is creating a robust, detailed taxonomy, a task which is currently underway. Once the taxonomy is complete, cataloging can begin in earnest. The AskNPAC about Chemistry database will be very useful at this stage. We plan to "mine" the database for software announcements, producing a continuous stream of candidates for cataloging. Because of the volume of chemistry software, however, we at NPAC cannot hope to do more than provide basic cataloging data. We would be very interested in working with individuals or organizations to improve the depth of the NHSE repository, either through the development of specialized repositories for specific sub-fields, or through more detailed descriptions, reviews (for which honoraria are available), roadmaps, etc. A mailing list has been set up to facilitate communications among those interested in helping to develop the NHSE Chemistry Software Exchange. To join, send a message with the text "subscribe nhse-chemistry" to majordomo@npac.syr.edu.
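As a rough illustration of what "mining" the archive for software announcements might look like, the sketch below flags messages whose subject or body contains announcement-like phrases. The article does not say how the mining will be done, so the phrase list, the message format, and the sample postings are all assumptions made for this example, and a real filter would need to be considerably more careful.

```python
import re

# Hypothetical cues that a posting announces software; purely illustrative.
ANNOUNCE_PATTERN = re.compile(
    r"\b(announc\w*|releas\w*|now available|version\s+\d+(\.\d+)*)\b",
    re.IGNORECASE,
)

def announcement_candidates(messages):
    """Yield (subject, matched phrase) for messages that look like announcements."""
    for msg in messages:
        text = msg["subject"] + "\n" + msg["body"]
        match = ANNOUNCE_PATTERN.search(text)
        if match:
            yield msg["subject"], match.group(0)

# Invented sample postings; the package names are not real.
SAMPLE = [
    {"subject": "ANNOUNCE: ExamplePack 2.0",
     "body": "A new release of ExamplePack is now available by anonymous FTP."},
    {"subject": "Question about solvent models",
     "body": "Has anyone compared continuum solvent models?"},
    {"subject": "New program for NMR fitting",
     "body": "Version 1.3 of ExampleFit can be downloaded from our site."},
]

if __name__ == "__main__":
    for subject, phrase in announcement_candidates(SAMPLE):
        print(subject, "<-", phrase)
```

A filter of this kind only produces candidates. Each one would still have to be reviewed and classified by a person before entering the catalog.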

Relationship to Other Internet Information Services

The purpose of both the Chemistry Software Exchange and the AskNPAC Chemistry Archive is to make existing resources easier to find and use rather than to replace them. Neither of the resources described here could exist without the numerous mailing lists and software packages already available. There is also a wealth of information available on the WWW, at sites associated with mailing lists and many others--information about software, pointers to WWW resources, etc. It is our experience that this information is often quite detailed, provided by researchers who have done a great deal of work to thoroughly investigate issues in a small domain of interest to them. The CSIR project cannot possibly provide this level of detail--our goal is to bring together a broad spectrum of information which will help the researcher "discover" the detailed information provided by others far more quickly and easily than is currently possible.

Conclusion

Although the growth of the Internet and the popularity of the world-wide web have led to a wide range of resources of interest to chemists, others, of equal or greater interest, have been largely ignored. As part of our "Chemistry Software and Information Resources" project, we have developed the "AskNPAC About Chemistry" database to provide easier access to the thousands of messages each month which appear on chemistry-oriented mailing lists and newsgroups. We are also developing a chemistry repository for the National HPCC Software Exchange in order to simplify the job of locating software to facilitate research. We will mine the AskNPAC database to help identify software to catalog, and we plan to develop other "products" based on information in the database, such as a chemistry-focused web index.

We offer these projects as a service to the chemistry community, and we solicit your assistance in order to make them as valuable as possible. This can range from providing us with feedback about the services or ideas for related offerings you would find useful to a more active involvement, such as hosting an NHSE repository in a sub-field or the development of more detailed roadmaps to the software.

Acknowledgements

We are very grateful for the efforts of Gang Cheng, developer of the AskNPAC service, and for the work of Paul Coddington and Yang Meng on the Chemistry Software Exchange.

References

[1] A representative, but certainly not complete, list of WWW search services includes: Alta Vista, Excite NetSearch, Inktomi HotBot, Infoseek Guide, LinkStar, Lycos, Magellan, MetaCrawler, NlightN, NRP WWW Yellow Pages, Pathfinder, Open Text, SavvySearch, Search.com, Yahoo, WebCrawler.

[2] Yahoo offers pointers to a variety of search services, directories, and related information.

[3] Chemistry Software and Information Resources (CSIR), http://www.csir.org.

[4] Stephanie da Silva <arielle@taronga.com>, maintainer, Publicly Accessible Mailing Lists, regular posting to Usenet group news.answers, also available as http://www.cis.ohio-state.edu/hypertext/faq/usenet/mail/mailing-lists/top.html.

[5] Liszt Directory of E-Mail Discussion Groups, http://www.liszt.com.

[6] BIOSCI, http://www.bio.net.

[7] Kris Boulez, <Kris.Boulez@rug.ac.be>, maintainer, Overview of Chemical Mailing Lists, http://bionmr1.rug.ac.be/chemistry/overview.html.

[8] Tom Gruber and Kevin Hughes, Hypermail, Version 1.02, Enterprise Integration Technologies, http://www.eit.com/software/hypermail.

[9] E. Hood, MHonArc: An Internet Mail-to-HTML Converter, Version 1.2.3, http://www.oac.uci.edu/indiv/ehood/mhonarc.html.

[10] The method of selecting multiple groups in the scroll-box depends on your computer's operating system and possibly on your choice of WWW browser. For example, on Windows-based platforms the convention is that shift-click selects a contiguous group of items, and control-click allows disjoint selections to be made. With MacOS, it is command-click rather than control-click but otherwise the same. Under Unix/X11, you simply click on each group you want to include (click again to unselect).

[11] Netlib, http://www.netlib.org.

[12] National High Performance Computing and Communications Software Exchange (NHSE), http://www.nhse.org.