Cataloging your web site

This chapter describes how you can automatically generate web pages that list and categorize the HTML files in your web site. It can be difficult to find information on large web sites, especially if they aren't well organized. The autocatalog feature lets you automatically provide your web users with easy access to your content by generating catalog pages that categorize your server's content in several ways.

If your web server is one of many in a company, organization, or educational facility, you can use the auto-catalog feature to provide a resource description of your web site to any Netscape Catalog Server. The Netscape Catalog Server can then provide a central server where users can find information on any of the individual web servers in your organization.

What can a catalog do for my web site?

If you have a large web site with many files and directories, it can be difficult to organize the content so that your users can quickly find specific information. Your web server might also contain directories of information from various groups or people, so the content isn't unified.

The catalog feature helps you provide several ways to find information on your web server, as shown in the following figure.

Users see your catalog as categorized links

How do users access my catalog?

Once you have a working catalog, you can provide links on your home page to the catalog main page. Users can also access the catalog by typing the URL:

http://<yourserver.org>/catalog

Catalog files are kept on your server in a directory under the server root directory called https-[identifier]/catalog. Because this directory is outside your document root directory (where you keep all of your web content), the server creates an additional document directory that maps the URL prefix /catalog to the https-[identifier]/catalog directory on your hard disk. You can view this setting by choosing Content Mgmt|Additional Document Directories in the Server Manager.
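If you're curious what this mapping looks like, it appears in your server's obj.conf file as a NameTrans directive that uses the pfx2dir (prefix-to-directory) function. The following is a minimal sketch, assuming a server root of /usr/netscape/server and a server identifier of myserver (both hypothetical):

NameTrans fn=pfx2dir from=/catalog dir=/usr/netscape/server/https-myserver/catalog

The server maintains this entry for you; you don't need to edit obj.conf by hand.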

How does the catalog feature work?

An application called the catalog agent accesses your server through HTTP requests. You can either set up the catalog agent to run at set times or run it manually from a form in the Server Manager. The catalog agent sends requests to your server until it determines there are no more files to catalog.
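Each request the catalog agent sends is an ordinary HTTP request for a document on your server. For illustration only, a request for your home page might look like the following sketch; this assumes the agent identifies itself by the same name it uses with robots.txt files, as described in "The robots.txt file":

GET /index.html HTTP/1.0
User-agent: Netscape-Catalog-Agent/1.0

Because the server answers these requests as it would any browser's, access control and user authentication settings apply to the catalog agent as well.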

The catalog agent gathers information in a two-step process. First it enumerates (gathers) the URLs referenced in each HTML file and determines which of these URLs it should catalog. Then it generates a resource description that contains information about the HTML file.

Enumerating the URLs

The catalog agent sends an HTTP request to your server and accesses the first URL you specify. Typically this is the URL to your home page, but you can set it to start in any directory or HTML file in your document root. The catalog agent gets the first HTML document and scans it for information to catalog.

Warning!
If your server uses access control based on host name or IP address, make sure you've allowed your local machine full access to your web site. Also, if your server is configured to use authentication through entries in a user database, make sure you give access to the catalog agent. See "The process.conf file" on page 212 for more information.

The first scan lists any URLs referenced in the HTML file. The second scan generates a resource description, as described in the following section. After the catalog agent enumerates the URLs and generates a resource description, it determines which HTML files to scan next. The catalog agent in the Netscape Enterprise Server limits the URLs it traverses: it only accesses HTML files located in your server. The following figure shows how the catalog agent scans the files on a sample web server.

The catalog agent enumerates URLs, and then generates resource descriptions.
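For example, suppose the catalog agent scans a page containing the following two links (the file names here are hypothetical):

<A HREF="/products/index.html">Our products</A>
<A HREF="http://www.elsewhere.com/">A partner site</A>

The agent enumerates the first URL because it points to an HTML file located in your server; it rejects the second because the link references an external host. Such rejections are counted in the filtered-at-data status attribute, described later in this chapter.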

Generating a resource description

For each document found during enumeration, the catalog agent scans the HTML document for information to catalog. For example, it gathers information such as the document's title and any descriptive META tags in the file.

After the catalog agent gathers this information from the first HTML file, it uses the enumerated URLs to choose which file to scan next.
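For illustration, the following HTML fragment shows the kind of information a resource description can draw on; the title and META values here are hypothetical:

<HTML>
<HEAD>
<TITLE>Widget Products Overview</TITLE>
<META NAME="Description" CONTENT="An overview of our widget product line.">
<META NAME="Keywords" CONTENT="widgets, products, pricing">
</HEAD>

URLs the agent rejects while scanning this META data are counted in the filtered-at-metadata status attribute, described later in this chapter.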

Generating HTML catalog files

After the catalog agent gathers all of the information for your web server, it generates several HTML files that users will view to find information on your web site. These HTML files are kept in the https-[identifier]/catalog directory. Users can access the categorized information by going to the catalog directory on your server:

http://<yourserver.org>/catalog

You can restrict access to this directory and treat it as you do other parts of your web site. You can also turn off the catalog feature, which effectively means no one can access the catalog directory or any of its HTML files.

Setting up the catalog agent

The catalog agent is the application that runs periodically, sending requests to your server and retrieving and organizing the information on your server.

You can configure the catalog agent to run automatically at specific times or you can manually run the agent. You can also set how you want the catalog agent to interact with your server. For example, you define the directories you want the catalog agent to access.

Using the AutoCatalog feature

To use the AutoCatalog feature with your server, you first turn on the catalog agent and configure it to gather and sort the information about your web site. The catalog agent then collects information from HTML documents on your server and creates static HTML files that categorize your server's content in several ways.

Before any user can access the generated HTML files, you must turn on the catalog option. To let users access your server's catalog,

  1. Choose Auto Catalog|On/Off in the Server Manager.
  2. Click the On button. You'll be prompted to restart the server.
See "How do users access my catalog?" on page 202 for information on accessing the HTML files created by the catalog agent.

Configuring the catalog agent

You can configure how the catalog agent accesses the content on your server. You can set directories in your document root where the catalog agent starts cataloging. That is, if you have several directories in your document root (the main directory where you keep your server's content files), you can set the catalog agent to access only certain directories and their subdirectories.

To configure the catalog agent,

  1. Choose AutoCatalog|Configure Catalog Agent in the Server Manager.
    To find your server's document root, choose the Primary Document Directory link in Content Mgmt.
  2. Type the directories where you want the catalog agent to begin searching (it starts with the index.html file in that directory). For example, if your server's document directory has three subdirectories called first, second, and third, and you only want the catalog agent to search the second directory, type /second in the Starting Points field. If you leave the directories blank, the catalog agent searches your home page first (this file is usually called index.html), and then it searches any URLs referenced in that file.

  3. Select the Speed you want the catalog agent to use when searching your server's directories. The default is 7. The Speed setting determines how many "hits" the server experiences while the catalog agent is working. That is, when the catalog agent is searching through your server's files, it can send the server one or more simultaneous requests for documents, and it can wait between requests. In general, pick a setting that fits your server and its content. If you have a high-access server and up-to-date cataloging isn't very important, choose a low Speed; if your server has periods of low load (perhaps in the early morning hours) and cataloging is very important to you, run the catalog agent with a high Speed.

    The following table defines the speed settings:

    Speed Setting    Simultaneous retrievals    Delay (seconds)
    1                1                          60
    2                1                          30
    3                1                          15
    4                1                          5
    5                1                          1
    6                2                          0
    7                4                          0
    8                8                          0
    9                12                         0
    10               16                         0

  4. Click OK and confirm your changes.

Scheduling the catalog agent

You can configure the catalog agent to run at specific times on specific days of the week. This feature is useful if your web site's content changes frequently. For example, you might have a site where many people upload content and you don't directly monitor the content changes. Or, you might manage a site whose content is very dynamic and should be cataloged frequently.

Note
If the content on your server changes infrequently or if all of the content changes simultaneously, you'll probably want to manually recatalog your content instead of scheduling it. This minimizes the performance impact on your server.

To schedule the catalog agent,

  1. Choose Auto-Catalog|Schedule Catalog Agent in the Server Manager forms.
  2. Select the hour and minute when you want the catalog agent to run. The drop-down lists let you choose a time in ten-minute increments.
  3. Check the days of the week that you want the catalog agent to run. You can check one or more days.
  4. Check Activate schedule. If you want to stop the server from cataloging your files on a schedule, check Deactivate schedule.
  5. Click OK. You'll get a confirmation message.
When the catalog agent runs, it logs its progress in a file called robot.log. This file appears in the https-[identifier]/logs directory under your server root directory. The log file contains the URLs retrieved, generated, and enumerated. This log is more verbose than the status report (see page 209).

Manually controlling the catalog agent

You can manually control the catalog agent. This is useful, for example, when your content changes all at once and you want to recatalog it immediately rather than waiting for a scheduled run, or when you need to stop an agent that is putting too much load on your server.

The manual-control form has several buttons for controlling the catalog agent:

Start starts the catalog agent using the settings in the Configure Catalog Agent form.

Status displays the current status of the agent. See the following section for more information on status.

Stop Enumeration stops the catalog agent from traversing files, but it continues generating the resource description for the file it's scanning.

Stop stops the catalog agent that you manually started. If the agent is in the middle of enumerating or generating a resource description, you'll lose that information, but the catalog agent will stop itself and clean up any temporary files it was using. You might use Stop Enumeration instead. The catalog agent will run again later if you scheduled the agent to run at specific times.

Kill immediately stops the catalog agent. You'll lose any information the catalog agent was working on.

Getting a status report for the catalog agent

Whenever the catalog agent runs, you can get a status report that describes what the catalog agent is doing.

A sample status report for the catalog agent.

The following table defines all of the status attributes.
Attribute                 Description

active                    The number of URLs the catalog agent is currently working on
spawned                   The number of URLs the catalog agent has enumerated but hasn't yet retrieved
retrieved                 The number of URLs retrieved through HTTP connections
enumerated                The number of URLs enumerated so far
generated                 The number of URLs generated so far
filtered-at-metadata      The number of URLs rejected by the catalog agent when scanning the META data in the HTML files
filtered-at-data          The number of URLs rejected by the catalog agent when scanning the data in the HTML files (for example, if the links reference an external host)
retrievals-pending        The number of URLs remaining that need to be retrieved
retrievals-active         The number of URLs the agent is currently retrieving
retrievals-active-peak    The highest number of URLs the agent simultaneously retrieved
deleted                   The number of URLs filtered
migrated                  The number of URLs enumerated but waiting to have resource descriptions processed
defunct                   The number of URLs filtered
spawn-backlog             The number of URLs waiting to be processed by the catalog agent
spawn-string-cache        The number of unique host names that appeared in links
bytes-retrieved           The total number of bytes the catalog agent has retrieved through HTTP connections, for all of the files it has retrieved

Catalog configuration files

The catalog agent and the RDS use two configuration files, filter.conf and process.conf, which the following sections describe.

The catalog agent also uses and obeys restrictions set in a file called robots.txt. You can use this file to restrict areas of your server from your catalog agent. This file is also used by any other robots or catalog agents that visit your web server.

The filter.conf file

The filter.conf file uses the same syntax as the obj.conf file. The file is a series of directives and functions with attributes that define the rules the catalog agent follows (for example, which directory to start cataloging in) and that define how to generate the resource descriptions. The filter.conf file uses four directives.

You should only modify this file if you plan to use your web server with Netscape Catalog Server. For more information on the configuration files and modifying them, see the documentation for Netscape Catalog Server.

The process.conf file

The process.conf file configures the catalog agent. It includes information such as the speed setting and the email address, user name, and password the catalog agent uses.

An example process.conf file

The following sample file shows how you can set a user name and password that the catalog agent uses when authenticating to your server. The email address is also used to identify the catalog agent.

<Process csid="x-catalog://www.netscape.com:9999/AutoCatalog" 
   speed=10 
   email="user@domain" 
   username="anonymous" 
   password="robin@" 
   http://www.netscape.com/
</Process>
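In this sample, speed=10 corresponds to the fastest setting in the speed table: 16 simultaneous retrievals with no delay between requests.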

The robots.txt file

The catalog agent is a type of robot--that is, it is a program that gathers information from your web site by recursively following links from one HTML file to another. There are many different kinds of robots that roam the World Wide Web looking for web servers to use for information gathering. For example, there are many companies that search the web, index documents, and then provide the information as a service to their customers (typically through searchable forms).

Robots are also sometimes called web crawlers or spiders.
Because some web administrators want to control what directories and files a robot can access, the web community designed a standard robots.txt file for excluding robots from web servers. The catalog agent was designed to follow instructions in a robots.txt file. However, not all web robots follow these guidelines.

You can use the robots.txt file to restrict your server's catalog agent, but if your web server is part of the World Wide Web, keep in mind that the robots.txt file might be used by other robots visiting your site.

Note
The catalog agent, and any other robot, is restricted by access control settings and user authentication.

Format for robots.txt

The robots.txt file consists of one or more groups of lines with name-value pairs that instruct the robots. Each group of lines should describe the User-agent type, which is the name a robot uses to identify itself. The Netscape catalog agent is called Netscape-Catalog-Agent/1.0. After you specify which User-agents you want to configure, you include a Disallow line that lists the directories you want to restrict. You can include one or more groups in your robots.txt file.

Each line in the group has the format

"<field>:<value>"

The field name is case-insensitive, but the value is case-sensitive. You can include comment lines by beginning the comment with the # character. The following example shows one group that configures all robots and tells them not to go into the directory called /usr:

# This is a sample robots.txt file
User-agent: *
Disallow: /usr

Example robots.txt files

The following example robots.txt file specifies that no robots should visit any URL starting with /usr or /tmp:

# robots.txt for http://www.mysite.com/
User-agent: *
Disallow: /usr
Disallow: /tmp

The next example restricts all robots from your web site except the Netscape catalog agent:

# robots.txt for http://www.site.com/
User-agent: *
Disallow: /
# Netscape catalog agent is a good robot
User-agent: Netscape-Catalog-Agent/1.0
Disallow:
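
Conversely, you can restrict only the catalog agent while leaving other robots unaffected. The following sketch keeps the catalog agent out of a single directory (the /private directory name is hypothetical):

# Keep the catalog agent out of the private area
User-agent: Netscape-Catalog-Agent/1.0
Disallow: /private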

The following example tells all robots, including the catalog agent, not to traverse your web site.

# No robots allowed!
User-agent: *
Disallow: /