IBM's XML Parser for Java - README


IBM's XML for Java is an Extensible Markup Language (XML) processor written in Java (alpha level). XML for Java provides two main functions:




Installation

Windows95, WindowsNT, OS/2 (ZIP archive)
  1. Install JDK-1.1.
  2. Install the unzip or WinZip executable.
  3. Download the XML4J distribution package in ZIP format.
  4. Unzip the distribution package, xml4j_n_n_n.zip, into a directory.
    C:\>unzip some_directory\xml4j_n_n_n.zip
    C:\>cd xml4j
    C:\xml4j>
    	    

    You will see the following files in the xml4j directory:

    FAQ.html FAQ
    README.html this file
    license.html license information
    apiDocs\ directory for API documents
    docs\ directory for documents
    data\personal.dtd sample DTD file
    data\personal.xml sample XML document
    src\ directory for source code
    xml4j_n_n_n.jar contains parser class files
    xml4jSamples_n_n_n.jar contains samples class files
    scripts\ directory for build scripts
    samples\ sample XML4J applications

  5. Try the following command to test your installation. This test parses the input and then regenerates the same XML document.
    C:\xml4j>type data\personal.xml
    C:\xml4j>jre -cp xml4j_n_n_n.jar;xml4jSamples_n_n_n.jar samples.XJParse.XJParse -d data\personal.xml
    		
  6. This step is required only if you have installed JDK 1.1.6 or you experience a run-time fatal error while invoking 'XJParse'.

    This fatal error is caused by a bug in the JIT (symcjit.dll) shipped with JDK 1.1.6. The fix is to apply a patch, which can be downloaded from the JavaSoft website: http://www.javasoft.com/products/jdk/1.1/download-jdk-windows.html

    Installing the patch involves replacing symcjit.dll with the new one.

UNIX
  1. Install JDK-1.1 and GNU gzip.
  2. Download a distribution package in .tar.gz format. (If you have installed the unzip command for UNIX, the ZIP format is also OK.)
  3. Extract the distribution package into a directory.
    # cd /usr/local
    # gzip -dc some_directory/xml4j.n.n.n.tar.gz | tar xvf -
    # cd xml4j
    	    

    You will see the following files in the xml4j directory:

    FAQ.html FAQ
    README.html this file
    license.html license information
    api/ directory for API documents
    docs/ directory for documents
    data/personal.dtd sample DTD file
    data/personal.xml sample XML document
    src/ directory for source code
    xml4j_n_n_n.jar contains class files
    xml4jSamples_n_n_n.jar contains samples class files
    scripts/ directory for build scripts
    samples/ sample XML4J applications

  4. Try the following command to test your installation. This program parses the input and then regenerates the same XML document.
    # cat data/personal.xml
    # jre -cp "xml4j_n_n_n.jar:xml4jSamples_n_n_n.jar" samples.XJParse.XJParse -d data/personal.xml
    	    


Sample Applications

Some of the sample applications provided are listed below. All classes required to run them are in xml4jSamples_n_n_n.jar. Remember that 'jre' ignores the CLASSPATH environment variable, so you must specify any non-standard .jar files (such as Swing) explicitly using the -cp option.

samples.XJParse.XJParse (Java application, previously named 'trlx'):

XJParse is an XML syntax checker. To check an XML document, type:

jre -cp xml4j_n_n_n.jar;xml4jSamples_n_n_n.jar samples.XJParse.XJParse -d <xml-filename>

SiteOutliner (Java application):

SiteOutliner is a Java application that scans a Web site and reports its profile in CDF format. The profile contains a list of links to the pages, showing the structure of the site. The user can limit the files to be scanned by using some conditions, such as file types (extensions) and modified dates. The program can be used in both command prompt and window environments.

CDF Editor (Java application):

CDF Editor is a Java application to edit CDF files. The user loads a CDF file and edits the channels and items.
jre -cp xml4j_n_n_n.jar;xml4jSamples_n_n_n.jar samples.CdfEditor.CdfEditor

CDF Viewer (Java applet):

CDF Viewer is an applet that parses CDF files and visualizes their structures by using a tree.

Validating Generation sample (Java application):

This sample generates a valid element tree according to the specified DTD (specify full pathname to personal.dtd).
jre -cp xml4j_n_n_n.jar;xml4jSamples_n_n_n.jar samples.Miscellaneous.GeneratingSample e:\xml4j\data\personal.dtd

XML Tree-View (Java application):

A sample application using com.ibm.xml.parser.util.TreeFactory. You need to install JFC 1.1 (Swing-1.0) to run this program.
jre -cp "xml4j_n_n_n.jar;xml4jSamples_n_n_n.jar;C:/swing-1.0.2/swingall.jar" samples.Miscellaneous.TreeView data\personal.xml

[A capture of TreeView]

XPointer Demonstration (Java application):

A sample application using the com.ibm.xml.xpointer package. You need to install JFC 1.1 (Swing-1.0) to run this program.
jre -cp "xml4j_n_n_n.jar;xml4jSamples_n_n_n.jar;C:/swing-1.0.2/swingall.jar" samples.Miscellaneous.XPointerDemo data\personal.xml
This program has two functions:
Click a node:
Displays an XPointer expression for the clicked node in a text field.
Enter an XPointer expression and press the "Go" button:
Selects the nodes pointed to by the XPointer.


On certain platforms where 'jre' is not available, you can run these samples using 'java'. To do so, edit the CLASSPATH environment variable to include the parser (xml4j_n_n_n.jar) and samples (xml4jSamples_n_n_n.jar) jar files.


Program Development

This distribution archive includes a file named xml4j_n_n_n.jar. Add this file to your CLASSPATH environment variable, writing a command such as

set CLASSPATH=C:\xml4j\xml4j_n_n_n.jar;. (for Windows)

(assuming that you have installed XML for Java in C:\xml4j.)

setenv CLASSPATH /usr/local/xml4j/xml4j_n_n_n.jar:. (for UNIX, csh/tcsh)

export CLASSPATH="/usr/local/xml4j/xml4j_n_n_n.jar:." (for UNIX, ksh/bash/zsh)
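
To confirm that the CLASSPATH setting took effect, a small check along the following lines may help. This is only a sketch: the class ClassPathCheck and its isOnClassPath helper are hypothetical names made up for illustration, and it assumes that com.ibm.xml.parser.SAXDriver (the class named in the Release Notes) is packaged in xml4j_n_n_n.jar.

```java
// Hypothetical helper, not part of XML4J: reports whether a class can be
// located on the current class path.
class ClassPathCheck {

    static boolean isOnClassPath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClassPathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: this class lives in the parser jar.
        String name = args.length > 0 ? args[0] : "com.ibm.xml.parser.SAXDriver";
        System.out.println(name + (isOnClassPath(name) ? " found" : " NOT found"));
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
    }
}
```

If the class is reported as not found, the printed java.class.path value shows what the JVM actually received.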

The following resources are provided for application development:


Release Notes

This version of the processor is based on the XML 1.0 Recommendation [10-Feb-1998].
The processor supports 37 encodings for `<?xml encoding="...."':
ISO-10646-UCS-4, ISO-10646-UCS-2, UTF-8, UTF-16, US-ASCII, ISO-8859-1 ... ISO-8859-9, ISO-2022-JP, Shift_JIS, EUC-JP, GB2312, Big5, EBCDIC-CP-(US, CA, NL, DK, NO, FI, SE, IT, ES, GB, FR, AR1, HE, CH, ROECE, YU, IS, AR2)
Validating Generation
Applications can recognize information in the Document Type Definition (DTD) and generate a document that has correct structure. See `How to query DTD information' in the programming guide.
W3C Document Object Model (DOM) Level 1 Specification [01-Oct-1998] Support:
Simple API for XML (SAX) 1.0 Support
com.ibm.xml.parser.SAXDriver provides the SAX interface.
Namespaces in XML Proposed Recommendation [17-Nov-1998] Support
See the programming guide.
Element Digest
See the DOMHASH document.
XPointer package
The com.ibm.xml.xpointer package supports parsing XPointer expressions, generating an XPointer instance from a node in a document tree, and searching for the nodes pointed to by an XPointer instance.
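
As an illustration of the SAX 1.0 support listed above, the following sketch counts elements with a SAX 1.0 DocumentHandler. So that it runs without XML4J installed, it obtains an org.xml.sax.Parser from the JDK's bundled parser as a stand-in. With XML4J, the assumption is that you would create a com.ibm.xml.parser.SAXDriver in that place instead, which is not verified here. The class name SaxElementCounter is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

// Hypothetical example class: counts start tags using the SAX 1.0 interfaces.
class SaxElementCounter extends HandlerBase {
    private int count = 0;

    @Override
    public void startElement(String name, AttributeList atts) {
        count++; // called once per element start tag, in document order
    }

    static int countElements(String xml) {
        try {
            // Stand-in SAX 1.0 parser from the JDK. Assumption: with XML4J, a
            // com.ibm.xml.parser.SAXDriver instance would be used here instead.
            Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
            SaxElementCounter handler = new SaxElementCounter();
            parser.setDocumentHandler(handler);
            parser.setErrorHandler(handler); // HandlerBase rethrows fatal errors
            parser.parse(new InputSource(new StringReader(xml)));
            return handler.count;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countElements("<personnel><person/><person/></personnel>"));
    }
}
```

Because the handler only keeps a counter, nothing in this sketch builds a document tree in memory.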


CHANGES

19 Nov 1998
Release 1.1.9
19 Nov 1998
Fixed defects:
When using SAX, the entire document should not be loaded into memory.
NULL pointer exception in the sample CDFEditor.
Parser should not display warnings unless asked to.
XML4J code should not call printStackTrace().
Wild card defect in sample XJParse.
Stop using deprecated methods.
Start using JDK 1.2 JavaDoc for generating API documentation.

09 Nov 1998
Release 1.1.8
09 Nov 1998
Fixed defects:
The sample XJParse can now handle wild cards for filenames.
In the sample XJParse, one can turn off fetching the DTD even when the DOCTYPE line is specified.
Fixed defect in the UTF8 decoder. Can now parse Jim Clark's valid/sa/052.xml.
Parser crashes while handling NOTATIONS.
StringPool#expandTable() crashes.
Fix defect in the way language ID's are checked. It should conform with section 2.12 of XML 1.0 Spec.
Normalization should handle entire sub-tree.
HTML DTD: TABLE content model's error.
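
For reference, the language ID check mentioned above can be approximated with a regular expression. The sketch below follows one reading of the LanguageID grammar in section 2.12 of the XML 1.0 Recommendation [10-Feb-1998] (a two-letter code, or an "i-" or "x-" prefixed code, followed by optional subcodes). It is not XML4J's actual implementation, and the class name LanguageIdCheck is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical checker, based on a reading of XML 1.0 (1998) section 2.12:
//   LanguageID ::= Langcode ('-' Subcode)*
//   Langcode   ::= two letters | ('i'|'I') '-' letters | ('x'|'X') '-' letters
//   Subcode    ::= letters
class LanguageIdCheck {
    private static final Pattern LANGUAGE_ID = Pattern.compile(
            "(?:[A-Za-z]{2}|[iI]-[A-Za-z]+|[xX]-[A-Za-z]+)(?:-[A-Za-z]+)*");

    static boolean isValid(String value) {
        // matches() requires the whole string to fit the pattern.
        return value != null && LANGUAGE_ID.matcher(value).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "en", "en-US", "x-klingon", "eng", "en-" };
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```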

30 Oct 1998
Release 1.1.7
30 Oct 1998
Fixed defects:
In an external DTD, a PE substitution at the end of a CDATA clause causes the parser to fail.
Parameter entity reference in an INCLUDE section generates an error.
Bug in DTD getInsertableElementsForValidContent().
Make some methods in Parent class thread-safe.
Parser must check that default values for attributes conform to their declared types.

23 Oct 1998
Release 1.1.6
23 Oct 1998
Fixed defects:
Content Model matching fails against an entity reference.
ToXMLStringVisitor() prints entity declarations incorrectly.
TerminateSignal should not be ignored in readDTDStream().

09 Oct 1998
Release 1.1.5
09 Oct 1998
Fixed defects:
In the parser's output, the 'quote' character is now represented as &#34; instead of &#x22;. Both Netscape and IE handle this correctly.

02 Oct 1998
Release 1.1.4
02 Oct 1998
Parser now conforms to DOM Level 1 Specification (01-Oct-1998).
Parser now additionally supports EBCDIC-CP-(DK, NO, FI, SE, IT, ES, GB, FR, AR1, HE, CH, ROECE, YU, IS, AR2) encodings.
Fixed defects:
SAX: Missing endElement() notification.
SAX: Distinguish fatalError() and error().

25 Sep 1998
Release 1.1.3
25 Sep 1998
Parser now supports EBCDIC-CP-(US, CA, NL) encodings ( = CP037 Java encoding).
Fixed defects:
SAX: reuse of parser and ErrorHandler calls.
Invalid peer exception in sample program 'GeneratingSample'.
CR's and CRLF's in CDATASection, Comment or PI's are not being replaced by LF.
Overwriting an attribute including entity references is incorrect.
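
The line-end fix above corresponds to the rule in section 2.11 of the XML 1.0 Recommendation: a CR LF pair and a lone CR are both passed to the application as a single LF. A minimal sketch of that normalization follows. The class name LineEndNormalizer is hypothetical, and this is not the parser's own code.

```java
// Hypothetical helper illustrating XML 1.0 line-end normalization.
class LineEndNormalizer {

    static String normalize(String text) {
        // Replace CR LF pairs first so each pair becomes one LF rather than two,
        // then map any remaining lone CR to LF.
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String normalized = normalize("line1\r\nline2\rline3\n");
        System.out.println(normalized.indexOf('\r') < 0 ? "no CR left" : "CR still present");
    }
}
```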

18 Sep 1998
Release 1.1.2
18 Sep 1998
Parser can now be interrupted.
Fixed defects:
Fix byte mask in UTF8 Reader.

11 Sep 1998
Release 1.1.1
11 Sep 1998
Parser error message strings, in English, have been rewritten.
Documented, in README.html, limitation of MS JVM in lack of support for IANA encodings.
Performance enhancements. Optimized input string buffering.
Fixed defects:
Parser can now handle URN's as SYSTEMID.

04 Sep 1998
Release 1.1.0
04 Sep 1998
Namespace API Change: New return values of TXElement#getNamespaceForQName() and TXElement#getNamespaceForPrefix() methods.
Fixed defects:
Conform Util.getInvalidURIChar() to RFC2396.
getDigest() for EntityReference doesn't work.
Don't do anything when newChild parameter equals oldChild parameter in Parent#replaceChild().
An adopted child must sever connections with original parents.
Reduce memory requirements when using SAX.
PIs with 0-length PI data such as <?foo?> crash the parser.
Parser reports errors when a PI occurs just before EOF in a document.
Add commands to XJParse for version # and name-space printing.
Leading '\' not handled properly in context of filenames.

28 Aug 1998
Release 1.0.9
28 Aug 1998
Support attribute-based namespace (WD-xml-names-19980802).
Two new sample DTD's are now bundled. HTML40frameset.xml.dtd and HTML40loose.xml.dtd.
Added printNonSpecifiedAttributes flag to ToXMLPrintVisitor.
Added '-stoponerror' command line option to samples.XJParse.XJParse.
Fixed defects:
Null pointer exception when com.ibm.xml.parser.SAXDriver.parse() is called a second time.
Validate only when target document has !DOCTYPE and one or more !ELEMENT declarations.
Parser should not stop after first validation error.
com.ibm.xml.parser.Parent.realInsert() should not call isCheckOwnerDocument() if it was not created by a factory.
#REQUIRED attributes return wrong getSpecified() flag.
Change ErrorListener.error() method's return type from void to int.
']]>' terminating conditional sections in DTD's are not recognized.
Can't replace the root element in TXDocument.
TXNodeList#replace() doesn't set next/previousSibling of removed Node to null.
HTMLPrintVisitor should not print comments in internal DTD.
Parser, by default, now stops parsing after an error occurs as required by the XML spec.

21 Aug 1998
Conformance to DOM Level 1 Proposed Recommendation [18-Aug-1998]
Changes due to above conformance:
Replaced the files in org.w3c.dom package by java-binding.zip in PR-DOM.
Renamed
NodeList#getSize() to NodeList#getLength()
NamedNodeMap#getSize() to NamedNodeMap#getLength()
NodeType symbols (Node.ELEMENT to Node.ELEMENT_NODE, etc.)
Added
Node#getOwnerDocument()
Removed
DocumentFragment#getMasterDoc()
Notation#setSystemId()
Notation#setPublicId()
Entity#setSystemId()
Entity#setPublicId()
Fixed defects:
Use Exception instead of IOException in API (parser\FormatPrintVisitor.java ...).
Hexadecimal character references cause errors.
TXAttribute#toXMLString() prints contents twice.
TXCDATASection#getNodeType() doesn't return Node.CDATA_SECTION
HTMLPrintVisitor can't print empty content like "<BODY></BODY>".
HTMLPrintVisitor should not print entity references.
ToXMLStringVisitor prints replaced text instead of entity references in attribute values.
Shell scripts now work with a public domain shell (Cygnus-Win32).
SAX resolveEntity() handler that returns an InputSource now works.
DOM: TXAttribute has only String value, no value as child nodes.
XPointer#point() doesn't work against a tree including EntityR.
Document inherits from Node instead of DocumentFragment.
Node#insertBefore()/replaceChild()/appendChild() check types of children.
Attribute#getParentNode()/getPreviousSibling()/getNextSibling() always returns null.
Element#getElementsByTagName() returns all elements when the parameter is "*".
Removed the ElementFactory interface because all factory functions are moved to the TXDocument class.
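
The renamed DOM members listed in this entry can be seen together in the short sketch below. It uses only the standard org.w3c.dom interfaces. The document is built with the JDK's javax.xml.parsers API purely as a stand-in so that the example runs without XML4J, and the class name DomLevel1Names is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class exercising the DOM Level 1 names.
class DomLevel1Names {

    // Stand-in for an XML4J parse: builds a DOM tree with the JDK's parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    // Counts direct children that are elements.
    static int countElementChildren(Node parent) {
        NodeList children = parent.getChildNodes();
        int n = 0;
        for (int i = 0; i < children.getLength(); i++) {  // formerly getSize()
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {  // formerly Node.ELEMENT
                n++;
            }
        }
        return n;
    }

    // "*" matches every descendant element of the root element.
    static int countDescendantElements(Document doc) {
        return doc.getDocumentElement().getElementsByTagName("*").getLength();
    }

    public static void main(String[] args) {
        Document doc = parse("<personnel><person><name/></person>text<person/></personnel>");
        System.out.println(countElementChildren(doc.getDocumentElement()));
        System.out.println(countDescendantElements(doc));
    }
}
```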

14 Aug 1998
The jar file has been split into two: one for parser binaries, and the other for samples binaries.
Fixed defects:
Updated the programming guide.
Reduced memory occupied by Tree nodes.
Removed dependency on Symantec's JIT patch. It is no longer necessary to install the JIT patch over 1.1.6.
Shell scripts (to compile all sources) should now work.
Correctly compiling Message_ja.java with -EUCJIS encoding.

07 Aug 1998
Moved all sample applications to toplevel/samples directory.
Fixed defects:
TreeView sample crashes with Null pointer exception.
TXAttribute#toXMLString() prints attribute value twice.
DTD#makeContentElementList() doesn't return null for EMPTY/ANY elements.
Use UNIX new line conventions in tar distribution.
DOM: TXElement#normalize() isn't implemented.
XPointer#point(TXDocument) should be point(Document).
31 Jul 1998
Parser now conforms to DOM-19980720 spec.
Fixed defects:
DTD#getInsertableElementsForValidContent() doesn't return correct result.
ContentModel#checkAfterTargetPosition() is wrong.
TXComment is always printed as "<!--null-->".
Parser crashes by NullPointerException in init2().
TXPI("foo", "bar") is printed as "<?foobar?>".
DOM: Factory methods in TXDocument aren't used.
ToXMLStringVisitor and FormatPrintVisitor print an internal DTD subset twice, GeneralReference(&foo;) and the reference's contents.
Parent#insertAfter() doesn't work correctly.
Parser#readDTDStream() aborts by NullPointerException.
24 Jul 1998
Release 1.0.4
24 Jul 1998
Added SCCS revision control strings to source files.
22 Jul 1998
Updated documentation.
New exceptions classes defined for TreeTraversal.
7 Jul 1998
Added new parameter to Parser#parseSingleContent() for alias feature
Added new sample: com.ibm.xml.sample.Alias and alias.dtd, alias-sample.xml
Added Stderr#loadCatalog()
Added "-c catalogfile" option to trlx.
6 Jul 1998
Fixed a bug of TXElement#addTextElement()
Modified util.TreeFactory for current DefaultElementFactory
Replaced util.XHFactory with util.HTMLPrintVisitor
Util.backReference() doesn't convert ' and
The parser never warns about redefined entities for lt/gt/amp/quot/apos
Added new sample: com.ibm.xml.sample.HTMLPrint
3 Jul 1998
Moved Format#printSpace() to Util, Format#indent() to Util
Moved DefaultElementFactory#sortStringVector() to Util
Added new class: FormatPrintVisitor, and removed Format
Fixed a bug of Text#insert()
23 Jun 1998
Fixed a bug of a TextDecl in an external parameter entity in an external DTD subset (Parser.java, Token.java)
Fixed a bug that an encoding of DTD wasn't set (Parser.java)
19 Jun 1998
Fixed a bug of parameter entity references in an IGNORE section.
19 Jun 1998
Release 1.0.0
18 Jun 1998
Removed com.ibm.xml.xpointer.Version
Change format of com.ibm.xml.parser.Version
Added DTD#getInsertableElementsForValidContent()
16 Jun 1998
Moved Parser#setNamespace() to TXDocument#setNamespaceParameters()
Added Parser#getNumberOfWarnings()
Added Parser#setEndBy1stError()
12 Jun 1998
Public release.
12 Jun 1998
Added new class: com.ibm.xml.xpointer.Pointed
Some changes for XPointer#point()
Renamed XPointerSample to XPointerDemo
Use javadoc in JDK-1.2beta3
11 Jun 1998
Removed com.ibm.xml.xpointer.RelTermArguments class
Added new method: XPointer#point()
Added new sample program: com.ibm.xml.sample.XPointerSample
10 Jun 1998
Renamed TXAttributeList#toArray() to makeArray() because of a conflict with java.util.Vector#toArray() in JDK-1.2beta
Fixed a bug of conditional section in parameter entities.
Fixed a bug of TXElement#attributeElements()
Added new methods: Child#makeXPointer(), XPointer#makeXPointer(Child)
Some changes for xpointer package.
Added -xpointer option to trlx
8 Jun 1998
Fixed a bug of TXElement#searchAncestors()
Moved searchAncestors() from TXElement to Child
Renamed Namespace#getNSNs()/setNSNs() to Namespace#getNSName()/setNSName()
Removed Namespace#getNSPrefixName()/setNSPrefixName()
Added Namespace#getUniversalName()
Added TXElement#TXElement(TXDocument,String prefix,String localpart)
4 Jun 1998
Added ElementFactory#createText(char[],int,int,boolean) and modified the parser to use this method instead of createText(String,boolean)
3 Jun 1998
Moved trlx and sample programs into com.ibm.xml.sample package
Added java.io.Serializable interface to object model classes.
Added new samples: com.ibm.xml.sample.SerializeSave and com.ibm.xml.sample.SerializeLoad
2 Jun 1998
Fixed some bugs of parameter entity
Fixed a bug of parsing NOTATION attribute
1 Jun 1998
Rewrite parameter entity processing
Integrate EntityValue and Entity
Added new method DTD#getEntity()
28 May 1998
Added Source#Source(InputStream,String) and Source#getEncoding()
Removed Parser#notifyNextEncoding()
Replace almost all code for parameter entities
27 May 1998
Removed TXElement#setUserData() and getUserData()
Removed some deprecated methods
Added attribute normalization (I had forgotten to implement it ;-)
25 May 1998
Removed Parser#setDebugPrintName()
22 May 1998
Fixed a bug of more than one definition for the same attribute
Added Attlist#getAttDef(String)
Fixed a bug of more than one ID attribute in an element
Added Attlist#contains()
Attlist#addElement() returns boolean
Added `isParameter' to a constructor of Entity
Fixed a bug of character encoding detection when no XMLDecl
Fixed a bug of 0Byte entities (xmltest/valid/not-sa/001.xml)
20 May 1998
Improved error checking for TextDecl. (xmltest/not-wf/ext-sa/{002,003}.xml)
Fixed a bug of detection of UTF-16/UCS-2 encodings. (xmltest/valid/ext-sa/014.xml)
19 May 1998
Add a question about VisualAge for Java to FAQ.html.
Removed TXDocument#setRootName()
Removed debug code in ContentModel
Added new class: LibraryException
18 May 1998
Fixed a bug of SAX characters().
13 May 1998
Public release.


TO DO


Limitation of MS JVM

Microsoft's JVM does not support the same set of encodings as Sun's JVM implementation. If your document uses an encoding that Microsoft's JVM lacks (such as ISO-8859-2) and you run in a Windows environment with that JVM, you will get a run-time error. This is a limitation of Microsoft's JVM.
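
One way to find out in advance whether a given JVM knows an encoding is to ask it directly. The sketch below uses java.nio.charset.Charset, which was added to Java after the JDK 1.1 releases this README targets, so it applies to later JVMs only. The class name EncodingCheck is hypothetical, and the names are passed in their Java or IANA form, not necessarily the form used in the XML declaration.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper: reports whether the running JVM supports an encoding.
class EncodingCheck {

    static boolean isSupported(String encodingName) {
        try {
            return Charset.isSupported(encodingName);
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not legal in a charset name.
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = { "UTF-8", "ISO-8859-2", "Shift_JIS", "EUC-JP" };
        for (String name : names) {
            System.out.println(name + " supported: " + isSupported(name));
        }
    }
}
```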


Limitation of SUN JVM

Current releases of the JVM from Sun Microsystems (JDK 1.1.6) do not correctly support EBCDIC encodings: the newline character is not translated correctly.

IBM's implementation of Java 1.1.6 correctly translates EBCDIC characters to Unicode.


Contact

Technical questions and comments to alphaWorks communityXchange or xml4j@us.ibm.com.

Non-technical questions to xml4j@us.ibm.com.

