Cataloging your web site


This chapter describes how you can automatically generate web pages that list and categorize the HTML files in your web site. The AutoCatalog feature gives your web users easy access to your content by creating catalog pages that list and sort the documents on your server.

If your web server is one of many in a company, organization, or educational facility, you can also use the AutoCatalog feature to provide a resource description of your web site to any Netscape Catalog Server. Netscape Catalog Server can then provide a central server where users can find information on any of the individual web servers in your organization.

The AutoCatalog feature of Enterprise Server 3.0 provides only a subset of the functionality of Netscape Catalog Server. While the AutoCatalog feature can list and categorize the files on your web site, Netscape Catalog Server also indexes information, provides searching capabilities, and catalogs documents from multiple servers. Netscape Catalog Server is also highly configurable, allowing users to write plug-in functions and define rules for gathering and categorizing documents. If you would like a more robust cataloging tool, you may want to purchase Netscape Catalog Server to work in conjunction with your Enterprise Server.

What can AutoCatalog do for my web site?

If you have a large web site with many files and directories, it can be difficult to organize the content so that your users can quickly find specific information. If your web server also contains directories of information from various groups or people, the content may not be unified.

The AutoCatalog feature creates an organized catalog of all of the documents on your web server. It sorts the documents by title, classification, author, and last-modification time, as shown in Figure 14.1.

Users see your catalog as categorized links

How does AutoCatalog work?

The AutoCatalog feature is actually controlled by an agent process called the catalog agent. The catalog agent accesses your server through HTTP requests. You can either schedule the catalog agent to run at set times or run it manually from a form in the Server Manager. The catalog agent sends requests to your server until it determines there are no more files to catalog.

The catalog agent gathers information in a two-step process. First it enumerates (gathers) the URLs referenced in each HTML file and determines which of these URLs it should catalog. Then it generates a resource description that contains information about the HTML file.
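
The loop below is a minimal Python sketch of this two-step process, included only to illustrate the idea; it is not the catalog agent's actual code. The should_catalog and describe helpers are hypothetical stand-ins for the filtering and resource-description steps described in the next two sections.

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collects the href targets of <a> tags (the enumeration step)."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def catalog(start_url, should_catalog, describe):
    """Walk a site from start_url, recording a resource description
    for every HTML file the filter accepts (illustration only)."""
    pending, seen, descriptions = [start_url], set(), []
    while pending:
        url = pending.pop()
        if url in seen:
            continue
        seen.add(url)
        html = urlopen(url).read().decode("utf-8", errors="replace")
        # Step 1: enumerate the URLs referenced in this HTML file.
        collector = LinkCollector()
        collector.feed(html)
        for link in collector.links:
            absolute = urljoin(url, link)
            if should_catalog(absolute):   # hypothetical filtering step
                pending.append(absolute)
        # Step 2: generate a resource description for this file.
        descriptions.append(describe(url, html))
    return descriptions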

Enumerating the URLs

The catalog agent sends an HTTP request to your server and accesses the first URL you specify. Typically this is the URL to your home page, but you can set it to start in any directory or HTML file in your document root. The catalog agent gets the first HTML document and scans it for information to catalog.

Warning!
If your server uses access control based on hostname or IP address, make sure you've allowed your local machine full access to your web site. Also, if your server is configured to use authentication through entries in a user database, make sure you give access to the catalog agent. See "The process.conf file" on page 267 for more information.
The first scan lists the URLs referenced in the HTML file. The second scan generates a resource description, as described in "Generating a resource description" on page 259. After the catalog agent enumerates the URLs and generates a resource description, it determines which HTML files to scan next. The catalog agent in Netscape Enterprise Server limits the URLs it traverses: it accesses only those HTML files located on your server. Figure 14.2 shows how the catalog agent scans the files on a sample web server.

The catalog agent enumerates URLs, and then generates resource descriptions
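
For example, a should_catalog filter like the one used in the sketch above could enforce this same-server restriction by comparing host names. This is only an illustrative sketch; the agent's own rules come from its configuration files, described later in this chapter.

from urllib.parse import urlparse

def same_server_filter(start_url):
    """Build a should_catalog filter that accepts only HTTP URLs on
    the same server as start_url (a simplified stand-in for the
    catalog agent's built-in restriction)."""
    start_host = urlparse(start_url).netloc.lower()
    def should_catalog(url):
        parsed = urlparse(url)
        return (parsed.scheme in ("http", "https")
                and parsed.netloc.lower() == start_host)
    return should_catalog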

Generating a resource description

For each document found during enumeration, the catalog agent scans the HTML document for information to catalog. For example, the agent might gather the document's title, classification, author, and last-modification time.

After the catalog agent gathers this information from the first HTML file, it uses the enumerated URLs to choose which file to scan next.
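
To illustrate the kind of describe step used in the earlier sketch, the following Python fragment pulls the title and common META fields from a page. This is a hypothetical sketch, not the agent's actual resource-description format.

from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Pulls the <TITLE> text and <META NAME=... CONTENT=...> pairs."""
    def __init__(self):
        super().__init__()
        self.title, self.meta, self._in_title = "", {}, False
    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            fields = dict(attrs)
            if "name" in fields and "content" in fields:
                self.meta[fields["name"].lower()] = fields["content"]
    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
    def handle_data(self, data):
        if self._in_title:
            self.title += data

def describe(url, html):
    """Build a simple resource description for one HTML file."""
    collector = MetaCollector()
    collector.feed(html)
    return {
        "url": url,
        "title": collector.title.strip(),
        "author": collector.meta.get("author", ""),
        "classification": collector.meta.get("classification", ""),
    }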

Generating HTML catalog files

After the catalog agent gathers all of the information for your web server, it generates several HTML files that users will view to find information on your web site. These HTML files are kept in the https-identifier/catalog directory. Users can access the categorized information by going to the catalog directory on your server:

http://yourserver.org/catalog

You can restrict access to this directory and treat it as you do other parts of your web site. You can also turn off the catalog feature, which effectively means no one can access the catalog directory or any of its HTML files.

Using AutoCatalog

To use the AutoCatalog feature with your server, you first turn on the catalog agent and configure it to gather and sort the information about your web site. The catalog agent then collects information from HTML documents on your server, and it creates static HTML files that categorize your server's content in several ways.

Before any user can access the generated HTML files, you must turn on the catalog option. To let users access your server's catalog:

  1. From the Server Manager, choose AutoCatalog|On/Off. The AutoCatalog On/Off form appears.
  2. Click the Server On button.
  3. Save and apply your changes.

See "Accessing catalog files" on page 266 for information on accessing the HTML files created by the catalog agent.

Configuring AutoCatalog

You can configure how the catalog agent accesses the content on your server. You can set directories in your document root where the catalog agent starts cataloging. That is, if you have several directories in your document root (the main directory where you keep your server's content files), you can set the catalog agent to access only certain directories and their subdirectories.

To configure the catalog agent:

  1. From the Server Manager, choose AutoCatalog|Configure. The Configure Catalog Agent form appears. To find your server's document root, choose the Primary Document Directory link in Content Mgmt.
  2. Type the directories where you want the catalog agent to begin searching (it starts with the index.html file in that directory). For example, if your server's document directory has three subdirectories called first, second, and third, and you want the catalog agent to search only the second directory, type /second in the Starting Directories field.
    If you leave the Starting Directories field blank, the catalog agent searches your home page first (this file is usually called index.html), and then it searches any URLs referenced in that file.
  3. Select the speed at which the catalog agent should search your server's directories. The default is 7. The speed setting determines the number of "hits" the server will experience when the catalog agent is working. That is, when the catalog agent is searching through your server's files, it can simultaneously send the server one or more requests for documents. The catalog agent can also wait before sending a request to the server.
    In general, choose a speed setting that is appropriate for your server and its content. If your server is heavily accessed and up-to-the-minute cataloging isn't important, choose a low speed; if cataloging is important to you and your server has periods of low load (perhaps in the early morning hours), run the catalog agent at a high speed. A sketch of how these settings throttle the agent appears after these steps.

    Table 14.1 defines the speed settings.
    Speed settings

    Speed setting   Simultaneous retrievals   Delay (seconds)
    1               1                         60
    2               1                         30
    3               1                         15
    4               1                         5
    5               1                         1
    6               2                         0
    7               4                         0
    8               8                         0
    9               12                        0
    10              16                        0

  4. Enter the username and password that the agent will use to access any password-protected sources that are to be enumerated.
  5. Click OK.
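
To make the "Simultaneous retrievals" and "Delay" columns in Table 14.1 concrete, here is a minimal Python sketch of one way a crawler could honor those two values. It is an illustration only, not the catalog agent's implementation; the throttled_fetch function and its parameters are hypothetical.

import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def throttled_fetch(urls, simultaneous_retrievals, delay_seconds):
    """Fetch URLs with at most simultaneous_retrievals requests in
    flight, pausing delay_seconds before each new request: roughly
    what a speed setting in Table 14.1 controls."""
    results = {}
    with ThreadPoolExecutor(max_workers=simultaneous_retrievals) as pool:
        futures = []
        for url in urls:
            if delay_seconds:
                time.sleep(delay_seconds)  # speed 1, for example, waits 60 seconds
            futures.append((url, pool.submit(lambda u=url: urlopen(u).read())))
        for url, future in futures:
            results[url] = future.result()
    return results

# Speed setting 3 from Table 14.1 corresponds to one retrieval at a
# time with a 15-second delay:
# pages = throttled_fetch(["http://yourserver.org/index.html"], 1, 15)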

Scheduling the catalog agent

You can configure the catalog agent to run at specific times on specific days of the week. This feature is useful if your web site content changes frequently. For example, you might have a site where many people upload content and you don't directly monitor the content changes. Or, you might manage a site whose content is very dynamic and should be cataloged frequently.

Note
If the content on your server changes infrequently or if all of the content changes simultaneously, you'll probably want to recatalog your content manually instead of scheduling times for recataloging. Manual recataloging minimizes the performance impact on your server.
To schedule the catalog agent:

  1. Make sure that cron control is started by choosing Global Settings|Cron Controls in the administration server. For more information on Cron Controls and the administration server, see Managing Netscape Servers.
  2. From the Server Manager, choose AutoCatalog|Schedule. The Schedule Catalog Agent form appears.
  3. Select the hour and minute when you want the catalog agent to run. The drop-down lists let you choose a time in ten-minute increments.
  4. Check the days of the week that you want the catalog agent to run. You can check one or more days.
  5. Check Activate schedule. If you want to stop the server from cataloging your files on a schedule, check Deactivate schedule.
  6. Click OK.

When the catalog agent runs, it logs its progress in a file called robot.log. This file appears in the https-identifier/logs directory under your server root directory. The log file contains the URLs retrieved, generated, and enumerated. This log file gives more detail than the status report (see "Getting a status report for the catalog agent" on page 264).

Controlling the catalog agent manually

You can control the catalog agent manually. This is useful, for example, if you want to catalog new content immediately instead of waiting for a scheduled run, or if you need to check on or stop an agent that is already running.

To manually control the catalog agent:

  1. From the Server Manager, choose AutoCatalog|Manually Control. The Control Catalog Agent form appears.
  2. Select one of the following buttons for controlling the catalog agent:
    Start starts the catalog agent using the settings in the Configure Catalog Agent form.

    Status displays the current status of the agent. See the following section for more information on status.

    Stop Enumeration stops the catalog agent from traversing files, but it continues generating the resource description for the file it's scanning.

    Stop stops the catalog agent that you manually started. If the agent is in the middle of enumerating or generating a resource description, you'll lose that information, but the catalog agent will stop itself and clean up any temporary files it was using. You might use Stop Enumeration instead. The catalog agent will run again later if you scheduled the agent to run at specific times.

    Kill immediately stops the catalog agent. You'll lose any information the catalog agent was working on.

Getting a status report for the catalog agent

Whenever the catalog agent runs, you can get a status report that describes what the catalog agent is doing. To view the status report, click the Status button on the Control Catalog Agent form.

A sample status report for the catalog agent.

Table 14.2 defines all of the status attributes.
Status attributes

Attribute                Description
active                   The number of URLs the catalog agent is currently working on
spawned                  The number of URLs the catalog agent has enumerated but hasn't yet retrieved
retrieved                The number of URLs retrieved through HTTP connections
enumerated               The number of URLs enumerated so far
generated                The number of URLs generated so far
filtered-at-metadata     The number of URLs rejected by the catalog agent when scanning the META data in the HTML files
filtered-at-data         The number of URLs rejected by the catalog agent when scanning the data in the HTML files (for example, if the links reference an external host)
retrievals-pending       The number of URLs remaining to be retrieved
retrievals-active        The number of URLs the agent is currently retrieving
retrievals-active-peak   The highest number of URLs the agent has retrieved simultaneously
deleted                  The number of URLs filtered
migrated                 The number of URLs enumerated but waiting to have resource descriptions processed
defunct                  The number of URLs filtered
spawn-backlog            The number of URLs waiting to be processed by the catalog agent
spawn-string-cache       The number of unique host names that appeared in links
bytes-retrieved          The total number of bytes for all of the files the agent has retrieved through HTTP connections

Accessing catalog files

Once you have a working catalog, you can access the catalog main page at the following URL:

http://yourserver.org/catalog

Catalog files are kept on your server in a directory under the server root directory called https-identifier/catalog. Because this directory is outside your document root directory (where you keep all of your web content), the server creates an additional document directory that maps the URL prefix /catalog to the https-identifier/catalog directory on your hard disk. You can view this setting by choosing Content Mgmt|Additional Document Directories in the Server Manager.

Catalog configuration files

The catalog agent uses two configuration files, filter.conf and process.conf, which are described in the following sections.

The catalog agent also uses and obeys restrictions set in a file called robots.txt. You can use this file to restrict areas of your server from your catalog agent. This file is also used by any other robots or catalog agents that visit your web server.

The filter.conf file

The filter.conf file uses the same syntax as the obj.conf file. It is a series of directives and functions with attributes that define the rules the catalog agent follows (for example, which directory to start cataloging in) and how it generates the resource descriptions. The filter.conf file uses four directives.

You should modify this file only if you plan to use your web server with Netscape Catalog Server. For more information on the configuration files, see the documentation for Netscape Catalog Server.

The process.conf file

The process.conf file configures the catalog agent. It includes information such as the catalog agent's csid, speed setting, email address, username and password, and the URL where it starts enumerating, as shown in the following example.

Example process.conf file

The following sample file shows how you can set a username and password that the catalog agent uses when authenticating to your server. The email address is also used to identify the catalog agent.

<Process csid="x-catalog://www.netscape.com:9999/AutoCatalog" \
speed=10 \
email="user@domain" \
username="anonymous" \
password="robin@" \
http://www.netscape.com/
</Process>

The robots.txt file

The catalog agent is a type of robot; that is, it is a program that gathers information from your web site by recursively following links from one HTML file to another. Many kinds of robots roam the World Wide Web looking for web servers to gather information from. For example, many companies search the web, index documents, and then provide the information as a service to their customers (typically through searchable forms).

Robots are also sometimes called web crawlers or spiders.
Because some web administrators want to control what directories and files a robot can access, the web community designed a standard robots.txt file for excluding robots from web servers. The catalog agent was designed to follow instructions in a robots.txt file. However, not all web robots follow these guidelines.

You can use the robots.txt file to restrict your server's catalog agent, but if your web server is part of the World Wide Web, keep in mind that the robots.txt file might be used by other robots visiting your site.

Note
The catalog agent, and any other robot, is restricted by access control settings and user authentication.

Format for robots.txt

The robots.txt file consists of one or more groups of lines with name-value pairs that instruct the robots. Each group of lines should name the User-Agent type, which is the name of a particular robot. The Netscape catalog agent is called Netscape-Catalog-Agent/1.0. After you specify which User-Agents you want to configure, you include one or more Disallow lines, each listing a directory you want to restrict. You can include one or more groups in your robots.txt file.

Each line in the group has the format

"field:value"

The field name is not case-sensitive, but the value is case-sensitive. You can include comment lines by beginning the comment with the # character. The following example shows one group that configures all robots and tells them not to go into the directory called /usr:

# This is a sample robots.txt file
User-agent: *
Disallow: /usr
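
To see how a robot interprets such a group, you can experiment with Python's standard urllib.robotparser module, which implements the same exclusion convention. This is shown only as an illustration of the rules; it is not part of the catalog agent.

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
# Feed the sample group above directly; calling parser.set_url() and
# parser.read() would instead fetch http://yourserver.org/robots.txt.
parser.parse([
    "User-agent: *",
    "Disallow: /usr",
])

print(parser.can_fetch("Netscape-Catalog-Agent/1.0", "/usr/notes.html"))  # False
print(parser.can_fetch("Netscape-Catalog-Agent/1.0", "/index.html"))      # True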

Example robots.txt files

The following example robots.txt file specifies that no robots should visit any URL starting with /usr or /tmp:

# robots.txt for http://www.mysite.com/
User-agent: *
Disallow: /usr
Disallow: /tmp

The next example restricts all robots from your web site except the Netscape catalog agent:

# robots.txt for http://www.site.com/
User-agent: *
Disallow: /
# Netscape catalog agent is a good robot
User-agent: Netscape-Catalog-Agent/1.0
Disallow:

The following example tells all robots, including the catalog agent, not to traverse your web site:

# No robots allowed!
User-agent: *
Disallow: /

Editing the robots.txt file

You can edit the robots.txt file manually or by using the online form. The Edit Robots.txt form will create a robots.txt file if one does not already exist. If you choose to edit the file manually, use the format described in "Format for robots.txt" on page 269. To edit the robots.txt file using the Edit Robots.txt form:

  1. From the Server Manager, choose AutoCatalog|Edit Robots.txt. The Edit Robots.txt form appears.
  2. In the User-Agent field, enter the names of the User-Agents, or robots, you want to configure, one per line. These are the robots whose access to specific directories you will disallow. The User-Agent names are case-sensitive.
    For example, if you want to configure the Netscape catalog agent, type Netscape-Catalog-Agent/1.0 in the User-Agent field. If you want to configure all robots, you should enter *.

  3. In the Disallow field, enter the names of the directories you want to restrict, listing each directory on a separate line. The directory names are also case-sensitive.
  4. Click OK.


Copyright 1997 Netscape Communications Corporation. All rights reserved.