Catalog files are kept on your server in a directory under the server root directory called https-[identifier]/catalog. Because this directory is outside your document root directory (where you keep all of your web content), the server creates an additional document directory that maps the URL prefix /catalog to the https-[identifier]/catalog directory on your hard disk. You can view this setting by choosing Content Mgmt|Additional Document Directories in the Server Manager.
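This prefix-to-directory mapping corresponds to a pfx2dir name-translation directive in the server's obj.conf file. A sketch of what such an entry might look like, assuming a server root of /usr/ns-home and a server identifier of myserver (your paths will differ):

NameTrans fn="pfx2dir" from="/catalog" dir="/usr/ns-home/https-myserver/catalog"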
Warning! If your server uses access control based on host name or IP address, make sure you've allowed your local machine full access to your web site. Also, if your server is configured to use authentication through entries in a user database, make sure you give the catalog agent access. See "The process.conf file" on page 212 for more information.
The catalog agent scans each HTML file twice. The first scan lists any URLs referenced in the HTML file. The second scan generates a resource description, as described in the following section. After the catalog agent enumerates the URLs and generates a resource description, it determines which HTML files to scan next. The catalog agent in the Netscape Enterprise Server limits the URLs it traverses: it only accesses HTML files located on your server. The following figure shows how the catalog agent scans the files on a sample web server.
The catalog agent enumerates URLs, and then generates resource descriptions.
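The enumerate-then-describe loop can be pictured with a short sketch in Python. This is only an illustration of the behavior described above, not the agent's actual implementation; fetch_html, extract_links, and build_resource_description are hypothetical helpers a caller would supply:

from urllib.parse import urljoin, urlparse

# Illustrative sketch only -- not the actual Netscape catalog agent.
# fetch_html(), extract_links(), and build_resource_description() are
# hypothetical helpers supplied by the caller.
def run_catalog_agent(start_url, fetch_html, extract_links,
                      build_resource_description):
    server = urlparse(start_url).netloc   # the agent stays on one server
    pending = [start_url]                 # URLs waiting to be scanned
    seen = {start_url}
    descriptions = {}
    while pending:
        url = pending.pop(0)
        html = fetch_html(url)
        # First scan: enumerate the URLs referenced in the HTML file.
        for link in extract_links(html):
            absolute = urljoin(url, link)
            # The agent only traverses documents located on the same server.
            if urlparse(absolute).netloc == server and absolute not in seen:
                seen.add(absolute)
                pending.append(absolute)
        # Second scan: generate a resource description for the document.
        descriptions[url] = build_resource_description(html)
    return descriptions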
Generating a resource description
For each document found during enumeration, the catalog agent scans the HTML document for information to catalog. For example, it gathers the document's title, author, and classification from <META> tags such as the following. (You can enter the title, author, and classification in Netscape Navigator Gold by choosing Properties|Document from the menu.)

<META NAME="Classification" CONTENT="General HTML">
<META NAME="Author" CONTENT="J. Doe">
The catalog agent stores the resource descriptions it generates in https-[identifier]/catalog. Users can access the categorized information by going to the catalog directory on your server:

http://<yourserver.org>/catalog
You can restrict access to this directory and treat it as you do other parts of your web site. You can also turn off the catalog feature, which effectively means no one can access the catalog directory or any of its HTML files.
Configuring the catalog agent
You can configure how the catalog agent accesses the content on your server. You can set directories in your document root where the catalog agent starts cataloging. That is, if you have several directories in your document root (the main directory where you keep your server's content files), you can set the catalog agent to access only certain directories and their subdirectories.
To configure the catalog agent, enter the directories where you want cataloging to begin in the Starting Points field. For example, to catalog only the directory /second (and its subdirectories) under your document root, enter /second in the Starting Points field. To find your server's document root, choose the Primary Document Directory link in Content Mgmt.
The following table defines the speed settings:

Speed Setting   Simultaneous retrievals   Delay (seconds)
1               1                         60
2               1                         30
3               1                         15
4               1                         5
5               1                         1
6               2                         0
7               4                         0
8               8                         0
9               12                        0
10              16                        0
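For example, at speed setting 3 the agent opens a single connection and retrieves one document every 15 seconds (about four documents per minute). At settings 6 and above the delay drops to zero and the agent opens multiple simultaneous connections, so the load it places on your server rises sharply; setting 10 allows 16 retrievals at once with no delay.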
Note: If the content on your server changes infrequently, or if all of the content changes simultaneously, you'll probably want to recatalog your content manually instead of scheduling it; this minimizes the performance impact on your server. Otherwise, you can schedule the catalog agent to run at regular intervals.
Whenever it runs, the catalog agent keeps a log file in the https-[identifier]/logs directory under your server root directory. The log file contains the URLs retrieved, generated, and enumerated; it is a more verbose list than what you see in the status (see page 209).
Manually controlling the catalog agent
You can manually control the catalog agent, which is useful in several situations.
The following table defines all of the status attributes.
Catalog configuration files
The catalog agent and the RDS use the following configuration files:
The filter.conf file is used by the catalog agent to determine what data to save in the resource descriptions. The file also configures the catalog agent. You should only modify this file if you have Netscape Catalog Server; see that product's documentation for more information on this configuration file.

The process.conf file configures the catalog agent and tells it where to send the resource descriptions that it generates. This file contains all of the catalog agent settings you specified in the Server Manager forms, including the URL where the catalog agent begins its enumeration.

The robot.conf file specifies which filter.conf file the catalog agent uses.

The rdm.conf file contains information for all catalogs served by the RDS server. You should only modify this file if you have Netscape Catalog Server; see that product's documentation for more information.

The csid.conf file contains configuration information for the servers that the RDS server catalogs.
For example, a process.conf file might look like this:

<Process csid="x-catalog://www.netscape.com:9999/AutoCatalog"
         speed=10
         email="user@domain"
         username="anonymous"
         password="robin@">
http://www.netscape.com/
</Process>
Robots are also sometimes called web crawlers or spiders. Because some web administrators want to control what directories and files a robot can access, the web community designed a standard robots.txt file for excluding robots from web servers. The catalog agent was designed to follow the instructions in a robots.txt file; however, not all web robots follow these guidelines. You can use the robots.txt file to restrict your server's catalog agent, but if your web server is part of the World Wide Web, keep in mind that the robots.txt file might also be read by other robots visiting your site.
Note: The catalog agent, like any other robot, is restricted by access control settings and user authentication.
"<field>:<value>"
The field name is case insensitive, but the value is case-sensitive. You can include comment lines by beginning the comment with the # character. The following example shows one group that configures all robots and tells them not to go into the directory called /usr
:
# This is a sample robots.txt file
User-agent: *
Disallow: /usr
The next example keeps all robots out of the /usr and /tmp directories:

# robots.txt for http://www.mysite.com/
User-agent: *
Disallow: /usr
Disallow: /tmp
The next example restricts all robots from your web site except the Netscape catalog agent:
# robots.txt for http://www.site.com/
User-agent: *
Disallow: /
# Netscape catalog agent is a good robot
User-agent: Netscape-Catalog-Agent/1.0
Disallow:

An empty Disallow field means nothing is disallowed, so the catalog agent can traverse the entire site while all other robots are excluded.
The following example tells all robots, including the catalog agent, not to traverse your web site.
# No robots allowed!
User-agent: *
Disallow: /