Figure: Users see your catalog as categorized links
How does AutoCatalog work?
The AutoCatalog feature is controlled by an agent process called the catalog agent. The catalog agent accesses your server through HTTP requests. You can either set up the catalog agent to run at set times or run it manually from a form in the Server Manager. The catalog agent sends requests to your server until it determines there are no more files to catalog.
The catalog agent gathers information in a two-step process. First it enumerates (gathers) the URLs referenced in each HTML file and determines which of these URLs it should catalog. Then it generates a resource description that contains information about the HTML file.
Enumerating the URLs
The catalog agent sends an HTTP request to your server and accesses the first URL you specify. Typically this is the URL to your home page, but you can set it to start in any directory or HTML file in your document root. The catalog agent gets the first HTML document and scans it for information to catalog.
Warning!
If your server uses access control based on hostname or IP address, make sure you've allowed your local machine full access to your web site. Also, if your server is configured to use authentication through entries in a user database, make sure you give access to the catalog agent. See "The process.conf file" on page 267 for more information.
The first scan lists any URLs referenced in the HTML file. The second scan generates a resource description, as described in "Generating a resource description" on page 259. After the catalog agent enumerates the URLs and generates a resource description, it determines which HTML files to scan next. The catalog agent in Netscape Enterprise Server limits the URLs it traverses: it accesses only those HTML files located on your server. Figure 14.2 shows how the catalog agent scans the files on a sample web server.
Figure 14.2: The catalog agent enumerates URLs, and then generates resource descriptions
Generating a resource description
For each document found during enumeration, the catalog agent scans the HTML document for information to catalog. For example, the agent might gather the following information:
- The title of the document, taken from the <TITLE> tags.
- The classification, taken from a <META> tag. For example, "General HTML" is the Classification in the following text: <META NAME="Classification" CONTENT="General HTML">
- The author, taken from a <META> tag. For example, "J. Doe" is the Author in the following text: <META NAME="Author" CONTENT="J. Doe">
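Taken together, the head of a page the catalog agent can describe might look like the following hypothetical fragment (the title text is invented for illustration; the <META> lines repeat the examples above):

<HTML>
<HEAD>
<TITLE>A Sample Document</TITLE>
<META NAME="Classification" CONTENT="General HTML">
<META NAME="Author" CONTENT="J. Doe">
</HEAD>

From this file, the catalog agent would record "A Sample Document" as the title, "General HTML" as the classification, and "J. Doe" as the author.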
The catalog agent saves the resource descriptions it generates as HTML files in the https-identifier/catalog directory. Users can access the categorized information by going to the catalog directory on your server:
http://yourserver.org/catalog
You can restrict access to this directory and treat it as you do other parts of your web site. You can also turn off the catalog feature, which effectively means no one can access the catalog directory or any of its HTML files.
See "Accessing catalog files" on page 266 for information on accessing the HTML files created by the catalog agent.
Configuring AutoCatalog
You can configure how the catalog agent accesses the content on your server. You can set directories in your document root where the catalog agent starts cataloging. That is, if you have several directories in your document root (the main directory where you keep your server's content files), you can set the catalog agent to access only certain directories and their subdirectories.
To configure the catalog agent, specify where it should begin in the Starting Directories field. For example, if your server's document directory has three subdirectories called first, second, and third, and you want the catalog agent to search only the second directory, type /second in the Starting Directories field. The catalog agent starts with that directory's index file (index.html), and then it searches any URLs referenced in that file.
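As an illustration, suppose your document root has the hypothetical layout below (the file names other than index.html are invented). With /second in the Starting Directories field, the catalog agent reads /second/index.html and follows only the URLs referenced there:

document-root/
    first/
        index.html      (not cataloged)
    second/
        index.html      (starting point for cataloging)
        intro.html      (cataloged if referenced from index.html)
    third/
        index.html      (not cataloged)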
Table 14.1 defines the speed settings.
Speed setting | Simultaneous retrievals | Delay (seconds) |
---|---|---|
1 | 1 | 60 |
2 | 1 | 30 |
3 | 1 | 15 |
4 | 1 | 5 |
5 | 1 | 1 |
6 | 2 | 0 |
7 | 4 | 0 |
8 | 8 | 0 |
9 | 12 | 0 |
10 | 16 | 0 |
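Like the other choices you make in the Server Manager forms, the speed setting is recorded in the process.conf file (see "Catalog configuration files" later in this chapter), where it appears as a speed attribute:

speed=10

Higher settings retrieve more files simultaneously with less delay between retrievals, so they catalog your content faster but place a heavier load on your server.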
Scheduling the catalog agent
You can configure the catalog agent to run at specific times on specific days of the week. This feature is useful if your web site content changes frequently. For example, you might have a site where many people upload content and you don't directly monitor the content changes. Or, you might manage a site whose content is very dynamic and should be cataloged frequently.
Note
If the content on your server changes infrequently or if all of the content changes simultaneously, you'll probably want to recatalog your content manually instead of scheduling times for recataloging. Manual recataloging minimizes the performance impact on your server.
To schedule the catalog agent, use the cron control, which is started by choosing Global Settings|Cron Controls in the administration server. For more information on Cron Controls and the administration server, see Managing Netscape Servers.
When the catalog agent runs, it logs its progress in a file called robot.log. This file appears in the https-identifier/logs directory under your server root directory. The log file contains the URLs retrieved, generated, and enumerated. This log file gives more detail than the status report (see "Getting a status report for the catalog agent" on page 264).
Controlling the catalog agent manually
You can control the catalog agent manually through the buttons on the Control Catalog Agent form:
Status displays the current status of the agent. See the following section for more information on status.
Stop Enumeration stops the catalog agent from traversing files, but it continues generating the resource description for the file it's scanning.
Stop stops the catalog agent that you manually started. If the agent is in the middle of enumerating or generating a resource description, you'll lose that information, but the catalog agent stops itself and cleans up any temporary files it was using. If you want to keep the information gathered so far, use Stop Enumeration instead. The catalog agent will still run again later if you scheduled it to run at specific times.
Kill immediately stops the catalog agent. You'll lose any information the catalog agent was working on.
Getting a status report for the catalog agent
Whenever the catalog agent runs, you can get a status report that describes what the catalog agent is doing. To view the status report, click the Status button on the Control Catalog Agent form.
Figure: A sample status report for the catalog agent
Accessing catalog files
Once you have a working catalog, you can access the catalog main page at the following URL:
http://yourserver.org/catalog
Catalog files are kept on your server in a directory under the server root directory called https-identifier/catalog. Because this directory is outside your document root directory (where you keep all of your web content), the server creates an additional document directory that maps the URL prefix /catalog to the https-identifier/catalog directory on your hard disk. You can view this setting by choosing Content Mgmt|Additional Document Directories in the Server Manager.
Catalog configuration files
The catalog agent uses the following configuration files:
- The filter.conf file is used by the catalog agent to determine what data to save in the resource descriptions. This file also configures the catalog agent. You should modify this file only if you have Netscape Catalog Server; see that product's documentation for more information on this configuration file.
- The process.conf file configures the catalog agent and tells it where to send the resource descriptions that it generates. This file contains all of the catalog agent settings you specified in the Server Manager forms, including the URL where the catalog agent begins its enumeration.
- The robot.conf file specifies which filter.conf file the catalog agent uses.
- The rdm.conf file contains information for all catalogs served by the resource description server (RDS). RDSs collect resource descriptions from the robots that search the network and send this information to the catalog server. The RDS is actually the back end of Netscape Catalog Server and is not part of the AutoCatalog feature. You should modify this file only if you have Netscape Catalog Server; see that product's documentation for more information.
- The csid.conf file contains configuration information for the servers that the RDS catalogs.
- The robots.txt file lets you restrict areas of your server from your catalog agent. This file is also used by any other robots or catalog agents that visit your web server.
The filter.conf file
The filter.conf file uses the same syntax as the obj.conf file. It is a series of directives and functions with attributes that define the rules the catalog agent follows (such as which directory to start cataloging in) and how to generate the resource descriptions. The filter.conf file uses four directives.
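Because the syntax follows obj.conf, each line of filter.conf names a directive and a function (fn), followed by name="value" attribute pairs. The fragment below is purely illustrative: the directive and function names shown here are invented, and the real names are defined in the Netscape Catalog Server documentation:

# Hypothetical filter.conf fragment; directive and function names are invented
Enumerate fn="enumerate-urls" start="/second"
Generate fn="extract-meta" tags="Classification,Author"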
The process.conf file
The process.conf file configures the catalog agent. It includes information such as:
<Process csid="x-catalog://www.netscape.com:9999/AutoCatalog" \
speed=10 \
email="user@domain" \
username="anonymous" \
password="robin@" \
http://www.netscape.com/
</Process>
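In this sample, speed=10 corresponds to the fastest setting in Table 14.1, and the URL on the last line inside the <Process> block is the URL where the catalog agent begins its enumeration (one of the settings this file records, as noted above). The remaining attributes appear to identify the catalog (csid) and the agent's credentials; that reading is an interpretation of the sample, not documented here.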
Robots are also sometimes called web crawlers or spiders. Because some web administrators want to control what directories and files a robot can access, the web community designed a standard robots.txt file for excluding robots from web servers. The catalog agent was designed to follow instructions in a robots.txt file. However, not all web robots follow these guidelines.
You can use the robots.txt file to restrict your server's catalog agent, but if your web server is part of the World Wide Web, keep in mind that the robots.txt file might be used by other robots visiting your site.
Note
The catalog agent, like any other robot, is restricted by access control settings and user authentication.
Format for robots.txt
The robots.txt file consists of one or more groups of lines with name-value pairs that instruct the robots. Each group of lines should describe the User-Agent type, which is the name of a particular robot. The Netscape catalog agent is called Netscape-Catalog-Agent/1.0. After you specify which User-Agents you want to configure, you include a Disallow line that lists the directories you want to restrict. You can include one or more groups in your robots.txt file.
Each line in the group has the format "field: value". The field name is not case-sensitive, but the value is case-sensitive. You can include comment lines by beginning the comment with the # character. The following example shows one group that configures all robots and tells them not to go into the directory called /usr:
# This is a sample robots.txt file
User-agent: *
Disallow: /usr
The following robots.txt file specifies that no robots should visit any URL starting with /usr or /tmp:
# robots.txt for http://www.mysite.com/
User-agent: *
Disallow: /usr
Disallow: /tmp
The next example restricts all robots from your web site except the Netscape catalog agent:
# robots.txt for http://www.site.com/
User-agent: *
Disallow: /
# Netscape catalog agent is a good robot
User-agent: Netscape-Catalog-Agent/1.0
Disallow:
The following example tells all robots, including the catalog agent, not to traverse your web site:
# No robots allowed!
User-agent: *
Disallow: /
You can edit the robots.txt file manually or by using the online form. The Edit Robots.txt form will create a robots.txt file if one does not already exist. If you choose to edit the file manually, use the format described in "Format for robots.txt" on page 269.
To edit the robots.txt file using the Edit Robots.txt form, type Netscape-Catalog-Agent/1.0 in the User-Agent field to configure the Netscape catalog agent. If you want to configure all robots, enter * instead.
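For example, to exclude only the Netscape catalog agent from a hypothetical directory called /private (the directory name is invented for illustration), the resulting group in the robots.txt file would look like this:

# Keep the Netscape catalog agent out of /private
User-agent: Netscape-Catalog-Agent/1.0
Disallow: /private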