Introduction to CGI Programming

Setting the Stage for CGI Programming: HTTP, URL, CGI, MIME, HTTPD, and many other acronyms too numerous to mention

Wojtek Furmanski, Nancy McCracken
NPAC, Syracuse University
111 College Place, Syracuse NY 13244-4100
January 31, 1996, updated September 1996

In a Nutshell

MIME stands for Multipurpose Internet Mail Extensions and is the developing standard for the contents of all messages passed over the Internet.
HTTP is the Hypertext Transfer Protocol, the protocol that provides the basis of the World Wide Web: transmitting multimedia documents across the Internet.
HTTPD is the daemon running the HTTP Web server.
URL stands for Uniform Resource Locator and is the universal addressing scheme for all (multimedia) documents on the WWW.
CGI is the Common Gateway Interface, the scheme for interfacing other programs and systems to the HTTP Web protocol, using the same data protocols as the HTTP clients and servers.

References:
HTML and CGI Unleashed, John December and Mark Ginsburg, chapters 19 and 20.
Innumerable web documents.

Internet Documents: Drafts, Memos and Standards

Some material presented here comes from Internet documents. Here is a summary of the various document formats you may find.

Internet Drafts
Working documents of the Internet Engineering Task Force (IETF), its Areas and Working Groups. Other groups may also distribute Internet Drafts. Some of these IDs are labelled by IETF-#. IDs are valid for a maximum of 6 months and may be updated, replaced or obsoleted by other documents at any time.

Internet Memos
Referred to as RFC-# (Request for Comments). More formal and complete than Internet Drafts; they usually represent standard proposals/candidates. Some RFCs are made obsolete by subsequent RFCs; others make it as standards.

Internet Standards
Labelled by STD-# and often associated with the RFC-# specs (e.g.
Internet E-Mail is referred to as RFC-822 or STD-11)

Internet Documents - Examples

Here are a few sample Internet documents relevant for this part of the course.

RFC-822: Crocker, D., "Standard for the Format of ARPA Internet Text Messages", STD 11, RFC 822, UDEL, 1982.
RFC-1036: M. Horton and R. Adams, "Standard for Interchange of USENET Messages", RFC 1036, AT&T, December 1987.
RFC-1521: Borenstein, N. and Freed, N., "MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies", RFC 1521, Bellcore, September 1993.
RFC-1524: Borenstein, N., "A User Agent Configuration Mechanism for Multimedia Mail Format Information", RFC 1524, Bellcore, September 1993.
Internet Draft: Tim Berners-Lee, "Basic HTTP", CERN, 1992/3.

Internet E-Mail (RFC-822)

We all know and use it, but here is a formal specification.
Each message is a stream of 7-bit ASCII chars which contains a header and an optional body, separated from the header by a blank line.
The header consists of a set of entries, one entry per line, each given by a colon-separated key:value pair.
A key contains no spaces or tabs and cannot exceed 63 chars.
The body is a fully unstructured sequence of ASCII chars.
There is a finite set of standard keys and an extension mechanism via the "X"-prefix. The standard set (as used by MH) is:

    Date      Bcc           Resent-Date   Resent-Fcc
    From      Fcc           Resent-From   resent-
    Sender    Message-ID    Resent-To     Message-Id
    To        Subject       Resent-cc     Forwarded
    cc        In-Reply-To   Resent-Bcc    Replied

Multipurpose Internet Mail Extension (MIME)

Goals
Multimedia, multi-language, multi-component extension of RFC-822
Full backward compatibility with RFC-822 (e.g.
retain 7-bit ASCII encoding)
Open design to incorporate multiple well-known formats
Easy extension to new types and formats

History
Working Group formed in fall 90 (Bellcore as lead)
First Internet Draft 1991
Internet Standard proposed in June 92 (RFC-1341)
Revised version in September 93 (RFC-1521)
Now: many implementations in place or in progress, also strong (and confusing?) coupling with the WWW/HTTP protocol

MIME - Extension Model

Retain the RFC-822 header+body format
Add new header fields
Allow for multipart multimedia bodies
Include media type and encoding information in new header fields such as: Content-Type, Content-Description, Content-Transfer-Encoding, Content-ID
Retain 7-bit ASCII for all valid encoding schemes
Implement multi-component bodies via a special 'magic type', Content-Type: multipart
Provide the natural support required for large multimedia message files, such as remote references (similar to hyperlinks but NOT the URL model) and file fragmentation, by further specification of the 'multipart' type.

MIME - "Content-Type" Header Field

A two-level hierarchical typing scheme is adopted, of the form: basetype/subtype
Seven base media types are defined, and this minimal set is enforced: any extension of it must pass the whole ID -> RFC -> STD process.
Less restrictive subtyping of the base types is allowed, for example:

    Content-Type: text/plain
    Content-Type: text/richtext

Some standard subtypes are specified and many more are expected. New subtypes must be registered with the IANA (Internet Assigned Numbers Authority). Private experimental subtypes prefixed with "X-" may be used freely and without registration.
The seven base types are: text, image, audio, video, multipart, message, application.

MIME - Base Content Types

text
Subtypes: plain (just ASCII) and richtext (a simple markup extension including , etc.
tags). Character sets can be further specified in the header value field as follows:

    Content-type: text/plain; charset=us-ascii

Other charsets can be used to support other languages, such as iso-8859-1 (French) or iso-2022-JP (Japanese). These charsets need to be encoded in one of two encoding modes: base64 or quoted-printable. The latter leaves the ASCII subset unchanged and is the more natural choice for charsets that extend ASCII.

image
Standard subtypes: gif, jpeg. Others expected.
base64 is a natural encoding scheme for binary media: it packs three 8-bit chars into four 7-bit ASCII chars, each carrying 6 bits of data.

audio
Standard subtype: basic (single-channel 8 kHz u-law). Others expected.

video
Standard subtype: mpeg. Others plausible.

MIME - Base Content Types, continued

multipart
Specifies a MIME message composed of several parts with possibly different Content-Type fields. Parts are separated by a boundary string, specified in the multipart header entry.
Subtypes: mixed (serial combination of media), parallel (for parallel presentation if possible), alternative (multiple representations of the same data) and digest (all parts are messages)

message
Subtypes: rfc822 (standard ARPA e-mail format), partial (a single chunk of a larger message, chopped into pieces for transmission and then reassembled), external-body (pointer to remote data, similar to a hyperlink/URL but with a different representation)

application
Current subtypes: postscript, ODA
Placeholder for "anything else": several interactive/custom/creative extensions are expected here
Already registered: Andrew-inset, ATOMICMAIL (Bellcore)

MIME - Implementation Status

3+ public domain implementations available
The most popular, Metamail (Bellcore), is a MIME transition package (backward compatible with most current mailing systems)
Some existing implementations: PMDF, IMAP2, C-Client, Mail-Manager, MH-MIME, Z-Mail, Andrew, Pine, Elm, Unix System 5 4.3, STI Document Browser, Servicemail, MIXMH
MIME support in progress by key vendors on most platforms
ATOMICMAIL (also called "computational
mail" or "active mail") project at Bellcore towards interactive extensions of MIME

HTTP - Hypertext Transfer Protocol

HTTP provides an upper level to the Internet: it is built on top of a backbone network, with all the packets flowing from client to server and vice versa using the standard TCP/IP protocol.
It uses MIME formats and concepts, but does not fully conform to MIME, as the WWW is not a mail system.
The HTTP protocol is compatible with other network services such as FTP (File Transfer Protocol) and NNTP (Network News Transfer Protocol).
On a UNIX-based machine, the basic services are enumerated in the file /etc/services. Each service corresponds to a standard port. For example, telnet is mapped to port 23, and FTP is mapped to port 21. All ports below 1024 are privileged: only the system administrator can determine port use.
The HTTP service is assigned to port 80 by default. Its connections are much shorter-lived than those of the other services.

HTTPD - HTTP Daemon

The HTTP daemon is the server which responds to Internet service requests on the standard port 80 (or on another custom port).
The server program is available from NCSA and is easily installed by editing a set of configuration files which give directory locations for documents, CGI scripts, error messages and icons, and which allow for options regarding path names, domain access, and so on.

URL - Uniform Resource Locator

A URL has the standard form

    service://machine:port/file.file-extension

HTML hyperlinks typically use the service http for linking to other documents and media files. Some other Internet services can also be used, such as ftp://machine/file.file-extension. In this way, a Web server can provide other Internet services through the browser interface.
The machine is an Internet address and can be either a symbolic name provided by the Domain Name Service (DNS) or the numeric IP address.
If the port is not specified, it defaults to 80.
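The URL form above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the host name www.example.org and the helper name split_url are hypothetical, and the default of 80 is applied exactly as the text describes it, without distinguishing between services.

```python
from urllib.parse import urlsplit

# Default port named in the text for URLs that omit one.
DEFAULT_PORT = 80


def split_url(url):
    """Split service://machine:port/path into its four parts.

    Hypothetical helper: falls back to port 80 when the URL gives none.
    """
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORT
    return parts.scheme, parts.hostname, port, parts.path


# www.example.org is a placeholder host, not one taken from the notes.
print(split_url("http://www.example.org/docs/index.html"))
print(split_url("http://www.example.org:8080/docs/index.html"))
```

The first call has no explicit port and so reports 80; the second reports the custom port 8080.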
The file.file-extension is given by any Unix path name starting from the directory known to the server as the "document root". Which path names are valid depends on the server options: whether "public_html" is automatically put into the path name, and whether paths starting with "~username" are allowed.
In the http service, the file-extension is used to tell the browser what helper application to use to view the file. Typical file extensions are html, gif, jpeg, mpeg, au, ram, etc.

HTTP - How does it work?

On each hyperlink click, the browser (client) initiates a connection with the server at the "machine" (e.g. using the UNIX BSD connect call on the default port 80, or on a custom user-defined port).
A request is sent to the server, formatted as a MIME-like message.
The server replies with another MIME-like message, which is received by the browser and either formatted in the browser window or viewed with a helper application.
The connection is closed on both sides. (The exception to this is the "server push" connection.)

HTTP - GET Request Example

    GET /document.html HTTP/1.0
    Accept: www/source
    Accept: text/html
    Accept: image/gif
    User-Agent: Lynx/2.2 libwww/2.14
    From: mnotulli@ukonaix.cc.ukans.edu
    -- blank-line-terminating-the-request --

The first line syntax is always: METHOD URL ProtocolVersion
The following lines form the header of an (extended) MIME message.
"User-Agent" specifies the browser type.
"Accept" specifies the MIME types recognized by the browser.
The server is expected to provide the requested data in one of these acceptable formats.

HTTP - Reply Example

    HTTP/1.0 200 OK
    Date: Wednesday, 02-Feb-95 23:04:12 GMT
    Server: NCSA/1.1
    MIME-version: 1.0
    Last-modified: Monday, 15-Nov-94 23:33:16 GMT
    Content-type: text/html
    Content-length: 2345
    -- blank-line-separating-header-and-body --
    Document Title . . .

This message contains both a header and a body.
Some replies contain only a header (e.g.
error reports, such as HTTP/1.0 404 Not Found).
The GET request above also contained a header only, whereas a POST request (see next example) contains both a header and a body.

HTTP - POST Request Example

    POST /cgi-bin/post-query HTTP/1.0
    Accept: www/source
    Accept: text/html
    Accept: video/mpeg
    Accept: image/x-rgb
    Accept: application/postscript
    User-Agent: Lynx/2.2 libwww/2.14
    From: grobe@unanaix.cc.ukans.edu
    Content-type: application/x-www-form-urlencoded
    Content-length: 150
    -- blank-line-separating-header-and-body --
    org=Academic%20Computing%20Services
    &users=10000
    &browser=lynx
    &contact=Michael%20Grobe%20grobe@kuhbuh.cc.ukans.edu

Both a header and a body are present in POST requests. The body is typically used to pass the contents of a form to the server.

Common Gateway Interface (CGI) - an introduction

CGI is an interface for running programs on the server at the request of the client.
The client look-and-feel for accessing CGI programs is identical to that of conventional static HTML, but the server-side implementation is different.
When the user clicks on a CGI link, the server runs the corresponding program and returns its output, rather than the program file itself.

Typical Applications
Support for dynamic generation of HTML documents, such as on-the-fly conversions from other formats.
Interfacing with other (non-HTTP) remote services such as databases (WAIS, RDBMS), video-on-demand, simulations, etc.
Support for two-way interactivity between clients and servers (to be achieved by building some intelligence and multiple choice/response capabilities into the CGI programs).
Interface to and integration with the Forms/GUI area of HTML: submitted forms are handled by suitable CGI processes.

Look at a simple example of an HTML form with its CGI Perl program.
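As a complement, here is a minimal Python sketch of what a form-handling CGI program might do with a POST body like the one in the example above. It is not the Perl program referred to in these notes. The function names are hypothetical, and the sketch assumes the usual CGI convention (not spelled out in these notes) that the server passes the body on standard input and its length in the CONTENT_LENGTH environment variable.

```python
import html
import os
import sys
from urllib.parse import parse_qsl


def decode_form(body):
    """Decode an application/x-www-form-urlencoded body into (key, value) pairs."""
    return parse_qsl(body, keep_blank_values=True)


def build_reply(fields):
    """Build the program output: header lines, a blank line, then an HTML body."""
    items = "".join(
        "<li>{}: {}</li>".format(html.escape(key), html.escape(value))
        for key, value in fields
    )
    return "Content-type: text/html\n\n<html><body><ul>" + items + "</ul></body></html>\n"


def main():
    # Assumption: the server sets CONTENT_LENGTH and feeds the body on stdin.
    # Reading by character count is adequate here because the body is ASCII.
    length = int(os.environ.get("CONTENT_LENGTH") or 0)
    body = sys.stdin.read(length)
    sys.stdout.write(build_reply(decode_form(body)))


# Only run as a CGI program when the server signals a POST request.
if __name__ == "__main__" and os.environ.get("REQUEST_METHOD") == "POST":
    main()
```

Keeping the decoding and the reply building in separate functions lets both be exercised without a Web server, for example by calling decode_form on the body from the POST example.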