Reply-to: gcf@npac.syr.edu
To: dongarra@cs.utk.edu
cc: ken@rice.edu, jpool@ccsf.caltech.edu
Subject: Some Initial Chemistry NHSE Endorsement
Date: Sun, 11 Aug 1996 10:54:02 -0400
From: Geoffrey Fox

The following reviews of an article describing our NHSE Chemistry resource for chemists are basically positive about the approach, and this could be helpful in the CRPC and, of course, NHSE review.

If you have any suggested improvements, please send them to me, as we intend to get it out soon -- Thanks!

Geoffrey Fox gcf@npac.syr.edu, http://www.npac.syr.edu
Phone 3154432163 (Npac central 3154431723) Fax 3154434741

DRAFT ARTICLE ITSELF
************************************************

Internet Resource Discovery for Chemistry - Where Are Those Vast Untapped Resources?

TrAC - Internet Column

David E. Bernholdt and Geoffrey C. Fox

Northeast Parallel Architectures Center, Syracuse University, Syracuse, NY 13244-4100 USA

Introduction

Several of these columns have referred to the remarkable growth of the Internet and the wide range of resources which are becoming available; similar refrains are often heard elsewhere as well. There are, of course, both positive and negative aspects to this. On the positive side, there are more numerous, more sophisticated, and more useful resources. But as the information space grows, it becomes harder to keep track of all the new resources of interest.

The world-wide web (WWW) tends to dominate the news these days, since it is the vehicle through which the majority of these new services and resources are offered. Fortunately for the resource discovery problem, there is an increasing number of WWW services [1-2] which attempt to catalog or index web sites. Although at this point their coverage may be spotty, and their indexing criteria not always well-matched to the technical interests of a chemist, they can be quite useful already (and report millions of "hits" daily). Given the popularity of both the world-wide web as a whole and the search services for it, we expect to see further improvements in this area, including better cataloging of highly specialized technical offerings.

There is, however, more to the Internet than the world-wide web. Chemists in particular have long made extensive use of electronic mailing lists and Usenet newsgroups as forums for discussion, and also anonymous-FTP and other methods for the exchange of software. WWW-based access to resources of this sort is spotty and leaves much to be desired.

Electronic Mailing Lists and Usenet Newsgroups

An initial survey has identified more than 80 chemistry-related electronic mailing lists and Usenet groups, and deeper investigation will surely reveal more. Collectively, and even in some cases individually, these lists have many thousands of subscribers and the total message volume is measured in thousands per month. These mailing lists and newsgroups clearly provide an enormous resource for those who utilize them. Unfortunately, these lists may be among the best-kept secrets of the Internet -- they are in general not widely advertised, and not easily discovered through casual browsing. Even after interesting mailing lists are located, there are serious limitations on how they can be used. In many cases, a quick search of the archives of a mailing list would reveal that a question a researcher may want to post has already been thoroughly discussed, and in principle this should be faster and more efficient than repeating the question and waiting for responses. However, many lists have no archives at all, while many others offer only e-mail or FTP access to archives, without convenient search capabilities. And although there is often some overlap in the discussions on various lists, in general what archives there may be are separate and must be searched individually. At present, this can be very tedious and time-consuming, so researchers often neglect an enormous amount of valuable information on the Internet simply because it is too hard to access.

To help address this problem, we have begun a project called "AskNPAC about Chemistry" [3] which gives the chemistry community an entirely different way to use the wealth of information represented by mailing lists and newsgroups. The idea is to provide a single point of access to chemistry-related mailing lists and newsgroups which can be browsed, much in the same way that one uses a mail or news reader, or searched in any combination for specific information. The database will receive a continuous stream of postings from all of the chemistry-related forums we can identify (more than 80 already), and where possible will also include past traffic collected from individual list archives.

The project is based on the "AskNPAC" news database server under development here at the Northeast Parallel Architectures Center (NPAC), mainly to support education and computer science objectives. The system uses the Oracle database software and Oracle's WOW web interface package as its base, operating on a standard Unix workstation. AskNPAC communicates with an HTTP (Hypertext Transfer Protocol) server, so that anyone with a web browser can easily access the databases. And because it is built on a true relational database, rather than the full-text search engine often used for such applications, queries can be focused on sender, date, or other features in addition to the body of the message.
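To illustrate what field-focused queries of this kind might look like, here is a minimal sketch. It is not the AskNPAC implementation: it uses Python's built-in sqlite3 module as a stand-in for Oracle, and the table layout, column names, forum names, and sample messages are all hypothetical.

```python
import sqlite3


def build_demo_archive():
    """Create an in-memory archive with a hypothetical message schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages ("
        " id INTEGER PRIMARY KEY,"
        " forum TEXT, sender TEXT, sent_on TEXT, subject TEXT, body TEXT)"
    )
    # Invented sample postings; dates are ISO strings so they sort correctly.
    rows = [
        ("example-list-a", "alice@example.org", "1996-07-02",
         "Basis set question", "Which basis set works for transition metals?"),
        ("example-list-b", "bob@example.org", "1996-07-15",
         "ANNOUNCE: new fitting code", "Announcing a free program for curve fitting."),
        ("example-list-a", "alice@example.org", "1996-08-01",
         "Re: Basis set question", "Thanks, the suggested basis set converged."),
    ]
    conn.executemany(
        "INSERT INTO messages (forum, sender, sent_on, subject, body)"
        " VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    return conn


def search(conn, sender=None, since=None, text=None):
    """Return subjects matching any combination of sender, date, and body text."""
    clauses, params = [], []
    if sender is not None:
        clauses.append("sender = ?")
        params.append(sender)
    if since is not None:
        clauses.append("sent_on >= ?")
        params.append(since)
    if text is not None:
        clauses.append("body LIKE ?")
        params.append(f"%{text}%")
    # With no filters, "1" makes the WHERE clause match every row.
    where = " AND ".join(clauses) if clauses else "1"
    sql = f"SELECT subject FROM messages WHERE {where} ORDER BY sent_on"
    return [row[0] for row in conn.execute(sql, params)]
```

Calling `search(conn, sender="alice@example.org", since="1996-07-10")` restricts the result to one sender and a date range, which a plain full-text index over message bodies would not support directly.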

Because mailing lists and newsgroups conveniently provide rapid dissemination of information to a sizable topical audience, they are often used to announce or solicit information about software to solve a particular problem, or other resources. This extends the utility of the database archive beyond "just" the knowledge-base it represents. By extracting particular types of information, such as software announcements, we can help with the discovery of other resources on the Internet which, like mailing lists, are often under-used.

Facilitating Software Reuse

When confronted with a computational problem, most of us would rather find a package, library, or routine that fits our needs than spend our time "reinventing the wheel". And indeed, quite a bit of useful software is available, either on the Internet or off, if we only knew about it. Unfortunately, most existing search services on the web are not well-suited to indexing software resources, particularly in very technical areas such as chemistry.

In some areas, such as numerical analysis, there is a great deal of high-quality free software available which has been collected into large repositories such as the well-known Netlib service offered by the University of Tennessee and Oak Ridge National Laboratory. Although services like Netlib have proven extremely valuable and popular, there is room for further improvement. From a user's point of view, having a large collection of software is an advantage, but it increases the burden on the management side. In addition, it is very hard to include commercial software in such a repository, which would be a particular problem in trying to extend the Netlib model to the chemistry domain. Finally, it still requires a certain amount of sophistication from the user to identify the appropriate software within the repository for the particular task at hand.

One attempt to improve this situation is the National High-Performance Computing & Communications Software Exchange (NHSE), a project of the Center for Research in Parallel Computation, of which NPAC is a member. The NHSE is a distributed or "virtual" repository based on a uniform idea of cataloging holdings and providing pointers to the appropriate physical repository or commercial source. This means that repositories do not have to be huge to be useful, which should make it easier for organizations to contribute to NHSE (further simplified by the "Repository in a Box" kit to be released later this year). It also makes it easier to provide useful information about software which is not available directly on the Internet, or which is commercial. The cataloging scheme relies on domain-specific taxonomies to help classify and locate software. Having a well-defined route to classifying software which takes into account the nature of the science for which it is used makes it easier for users to locate the appropriate tools for their needs. Beyond this, the NHSE is also developing roadmaps to help less-experienced users locate software more easily.
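As a rough picture of what a catalog entry in such a "virtual" repository could carry, consider the following sketch. It is only an illustration under stated assumptions: the field names, the taxonomy categories, the package names, and the example.org locations are all invented here and do not describe the actual NHSE catalog format.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical catalog record: a description plus a pointer, not the software itself."""
    name: str
    taxonomy_path: Tuple[str, ...]   # position in a domain-specific taxonomy
    location: Optional[str]          # pointer to a physical repository, if on the Internet
    commercial: bool                 # commercial packages can be cataloged too
    description: str = ""


def find_by_taxonomy(entries: Sequence[CatalogEntry],
                     prefix: Sequence[str]) -> List[CatalogEntry]:
    """Return every entry classified at or below the given taxonomy node."""
    prefix = tuple(prefix)
    return [e for e in entries if e.taxonomy_path[:len(prefix)] == prefix]


# Invented sample holdings, for illustration only.
CATALOG = [
    CatalogEntry("hypothetical-scf", ("chemistry", "electronic structure"),
                 "ftp://ftp.example.org/pub/scf/", commercial=False),
    CatalogEntry("hypothetical-dock", ("chemistry", "molecular modeling"),
                 None, commercial=True,
                 description="Available from the vendor only."),
    CatalogEntry("hypothetical-eigen", ("mathematics", "linear algebra"),
                 "ftp://ftp.example.org/pub/eigen/", commercial=False),
]
```

Because an entry holds only a pointer (or none at all, for software that is not on the Internet), a contributing site can stay small, and `find_by_taxonomy(CATALOG, ["chemistry"])` still returns both the free and the commercial package.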

The NHSE is generally still in the early stages of development, and so far the focus has been on the mathematical software area. This is an area with an impact spanning many scientific fields, and there are already a number of large repositories of generally high-quality software to build upon. In domains such as chemistry, there is further to go, but given the complexity of much chemistry-related software, there is at least as much to gain from making it easier to find and reuse. The initial challenge in developing a chemistry-oriented repository for the NHSE is creating a taxonomy, a task which is currently underway. After that, we must begin cataloging software. The AskNPAC about Chemistry database will be very useful at this stage. We plan to "mine" the database for software announcements, producing a continuous stream of candidates for cataloging. Because of the volume of chemistry software, however, we at NPAC cannot hope to do more than provide basic cataloging data. We would be very interested in working with individuals or organizations to improve the depth of the NHSE repository, either through the development of specialized repositories for specific sub-fields, or through more detailed descriptions, reviews (for which honoraria are available), roadmaps, etc.
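One simple way such "mining" could begin is with a keyword filter over subjects and bodies, as in the sketch below. The pattern, the message layout, and the sample postings are assumptions made for illustration; the article does not describe the method actually used, and a real filter would need tuning against real traffic and human review of the candidates.

```python
import re

# Hypothetical cue phrases that often accompany software announcements.
ANNOUNCE_PATTERN = re.compile(
    r"\b(announc\w*|new release|now available|version\s+\d+(\.\d+)*)\b",
    re.IGNORECASE,
)


def is_candidate(subject, body):
    """Flag a message whose subject or body contains an announcement cue."""
    return bool(ANNOUNCE_PATTERN.search(subject) or ANNOUNCE_PATTERN.search(body))


def mine_candidates(messages):
    """Return the messages worth passing to a human cataloger."""
    return [m for m in messages if is_candidate(m["subject"], m["body"])]


# Invented sample postings, for illustration only.
SAMPLE = [
    {"subject": "ANNOUNCE: fitting program",
     "body": "A free curve fitting program is on our FTP site."},
    {"subject": "Question about solvent models",
     "body": "Does anyone know a good reference?"},
    {"subject": "Update",
     "body": "Version 2.1 of our viewer is now available."},
]
```

A filter like this trades precision for recall: it will pass some discussion threads that merely mention a version number, which is acceptable if the output is treated as a stream of candidates rather than as finished catalog entries.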

Relationship to Other Internet Information Services

The purpose of both the NHSE Chemistry repository and the AskNPAC about Chemistry database is to make existing resources easier to find and use rather than to replace them. Neither of the projects described here could exist without the numerous mailing lists and software packages already available. There is also a wealth of information available on the WWW, at sites associated with mailing lists and many others--information about software, pointers to WWW resources, etc. It is our experience that this information is often quite detailed, provided by researchers who have done a great deal of work to thoroughly investigate issues in a small domain of interest to them. Our project cannot possibly provide this level of detail--our goal is to bring together a broad spectrum of information which will help the researcher "discover" the detailed information provided by others far more quickly and easily than is currently possible.

Conclusion

Although the growth of the Internet and the popularity of the world-wide web have led to a wide range of resources of interest to chemists, others, of equal or greater interest, have been largely ignored. We have developed the "AskNPAC about Chemistry" database to provide easier access to the thousands of messages each month which appear on chemistry-oriented mailing lists and newsgroups. We are also developing a chemistry repository for the National HPCC Software Exchange in order to simplify the job of locating software to facilitate research. We will mine the AskNPAC database to help identify software to catalog, and we plan to develop other "products" based on information in the database, such as a chemistry-focused web index.

We offer these projects as a service to the chemistry community, and we solicit your assistance in order to make them as valuable as possible. This can range from providing us with feedback about the services or ideas for related offerings you would find useful to a more active involvement, such as hosting an NHSE repository in a sub-field or the development of more detailed roadmaps to the software.

Acknowledgements

We are very grateful for the efforts of Gang Cheng, developer of the AskNPAC service.

References

[1] A representative, but certainly not complete, list of WWW search services includes: Alta Vista, Excite NetSearch, Inktomi HotBot, Infoseek Guide, LinkStar, Lycos, Magellan, MetaCrawler, NlightN, NRP WWW Yellow Pages, Pathfinder, Open Text, SavvySearch, Search.com, Yahoo, WebCrawler.

[2] Yahoo offers pointers to a variety of search services, directories, and related information.

[3] A prototype version of the AskNPAC About Chemistry service is available.

SOME CHEMISTRY REVIEWS of THIS ARTICLE
*********************************

------- Message 1

Date: Thu, 08 Aug 1996 10:27:05 -0400
To: "Stephen R. Heller"
Subject: Re: Article for Review

At 03:47 PM 8/7/96 -0400, you wrote:
>
>I have a new Trends in Analytical Chemistry (TrAC) Internet column
>article which has been submitted. I would like to ask your help in
>reviewing it.
>
> Would you please point your browser at:
>http://pgenome.arsusda.gov:8000/trac1.html
>
>Many thanx

Interesting. He is missing the mailing list mmodlist@chem.iupui.edu (Prof. Clark Still's MacroModel commercial modeling program). I cannot tell if he has the Computational Chemistry List from OSC. (Unless he calls it chemcom ?).

The article could benefit from two or more examples. The Web frontend is not clear on how to search for an e-mail address or a software package. I'd also like a bit more discussion on Oracle searching benefits (if any) and on Hypermail and MHonArc e-mail Web archives. He seems to be using Hypermail, but does not reference it.

Basically I liked the article, but it could be improved, esp. with references to the original mailing list sites and software-enabling technologies (Hypermail).

So Steve - How is that for a fast review?

------- Message 2

Date: Fri, 9 Aug 96 12:56 +0200
To: "Stephen R. Heller"
Subject: Re: TrAC article for review

Steve,

I really enjoyed reading this last article you sent me; furthermore, it was fun checking the "AskNPAC About Chemistry". The Bernholdt and Fox article is an interesting and important one and should be brought to the attention of all chemists. One important thing is that this paper shows that there are other things on the Internet besides the WWW, and that e-mail, Usenet newsgroups, and mailing lists still contain a lot of important and useful material.

The prototype version of AskNPAC About Chemistry covers about 35 mailing list archives and 20 Usenet newsgroups. Searching can be carried out by keywords, header list, date, sender, or URL. It can also be carried out as frame-based, form-based, line-based, or calendar-based. It seems that very few scientists are aware of this system: since its beginning only 77 entries have been recorded from 21 IP addresses (probably 21 users, as I located six entries of mine from the last two days). A good deal of publicity should be given to this facility, as it is the best available tool for searching mailing lists and Usenet groups.

There is no question that this article should be published in TrAC as soon as possible. I think that it would be even nicer if the authors would add two URL addresses to this article: one to one of the lists of chemistry mailing lists and Usenet newsgroups, the other to an Internet guide to listservs and mailing lists (few of these are available around).

------- Message 3

Date: Sat, 10 Aug 1996 13:34:53 +0000
To: "Stephen R. Heller"
Subject: Re: TrAC article for review

This article addresses a critical area of the current Internet, resource discovery for chemistry. Unfortunately, I find it spends very little time on the most important aspect: how to discover chemical resources in the 60-odd million documents out there. Whilst it mentions the various global index engines, the article restricts itself to expressing the hope of "better cataloging of highly specialised offerings" (i.e. chemistry). We are not really given any hints on how this might be achieved (there ARE mechanisms, but I think these must be the subject of another article). Certainly the current state of chemical searches of the Internet will not cause CAS any sleepless nights.

The article then turns its focus to mailing lists and Usenet groups. Apparently, some 80 chemistry-related lists have been identified. I think a table listing these is essential for this article. The Usenet medium in particular is notorious for the poor quality it often contains, and I wonder how this project will address this aspect. CAS, MDLI and many other organisations have long since learnt the lesson of quality control prior to indexing.

I am also intrigued by permanency. Some of the most recent index engines (Inktomi) claim to be able to re-index a resource every 48 hours. Thus the user is likely to find any indexed resource will still be available to them when they request it. Usenet in particular, but also mail lists, are notoriously evanescent. Once a posting is a few weeks old, will the resources described in it still be available? How will such issues be addressed?

In conclusion, I think this is an important area, and work in it is to be encouraged and welcomed. In this article, the authors present a new chemical resource discovery mechanism. However, I feel they have missed a wonderful opportunity to discuss some of the broader and more important issues of the future and how they might be addressed.

------- End of Forwarded Messages