In the following program, the parse tree is read in from a file. The key statements are the ones that create the Parser object and perform the initial read from the file.
import com.ibm.xml.parser.*;
import java.io.*;

public class GetParseTree {
    public static void main (String args[]) {
        String filename = null;
        if (args.length > 0) {
            filename = args[0];
            if (filename != null) {
                InputStream is;
                try {
                    is = new FileInputStream(filename);
                } catch (FileNotFoundException notFound) {
                    System.err.println(notFound);
                    return;
                }
                //*** The doc is the root of the DOM Tree. It is of type
                //*** TXDocument which implements the DOM Document interface.
                TXDocument doc = new Parser(filename).readStream(is);
            }
        }
    }
}
A Parser instance cannot be reused. An application can call the Parser#readStream() method only once.
You can output the parse tree in XML format into a stream. These lines can be inserted right after the last statement above, and the XML will be echoed to standard output.
//*** Output Document as XML
String charset = "ISO-8859-1"; // MIME charset name
String jencode = MIME2Java.convert(charset);
PrintWriter pw;
try {
    pw = new PrintWriter(new OutputStreamWriter(System.out, jencode));
} catch (UnsupportedEncodingException unsupported) {
    System.err.println(unsupported);
    return;
}
doc.setEncoding(charset);
try {
    doc.print(pw, jencode);
} catch (IOException io) {
    System.err.println(io);
    return;
}
You can configure the parser's behavior after creating a Parser instance and before calling readStream(). For example:
....
Parser parser = new Parser(filename);
//*** Set some Parse options...
parser.setWarningNoDoctypeDecl(false);
parser.setWarningNoXMLDecl(false);
parser.setPreserveSpace(false);
parser.setKeepComment(false);
TXDocument doc = parser.readStream(is);
....
You can control the output of errors produced by the parser. You might want to do this if you want to handle errors yourself, or if you don't want error messages printed to stderr. To handle errors yourself, create an instance of a class that implements the ErrorListener interface, and pass that instance to the Parser constructor.
The Object key parameter of the error() method is an instance of String or Exception. When key is a String, it identifies the type of error (see the source com/ibm/xml/parser/r/Message.java).
import com.ibm.xml.parser.*;
import java.io.*;

class ErrorIgnorer implements ErrorListener {
    public int error(String fname, int lineno, int charoff, Object key, String mes) {
        // do nothing
        return 0;
    }
}
....
//*** Parser uses ErrorIgnorer class
Parser parser = new Parser(filename, new ErrorIgnorer(), null);
....
import com.ibm.xml.parser.*;
import java.io.*;
import java.awt.TextArea;
import java.awt.Frame;

class ErrorEater extends TextArea implements ErrorListener {
    public int error(String fname, int lineno, int charoff, Object key, String mes) {
        append(fname+":"+lineno+":"+mes+"\n");
        return 1;
    }
}

public class UseErrorEater {
    public static void main (String args[]) {
        String filename = null;
        if (args.length > 0) {
            filename = args[0];
            if (filename != null) {
                InputStream is;
                try {
                    is = new FileInputStream(filename);
                } catch (FileNotFoundException notFound) {
                    System.err.println(notFound);
                    return;
                }
                //*** Parser uses ErrorEater TextArea class
                ErrorEater ee = new ErrorEater();
                Frame f = new Frame();
                // allows us to close the frame with the mouse.
                f.addWindowListener(new java.awt.event.WindowAdapter() {
                    public void windowClosing(java.awt.event.WindowEvent e) {
                        System.exit(0);
                    }
                });
                f.setSize(400,300);
                f.add("Center", ee);
                f.show();
                //*** Here is the usage of our ErrorEater...
                Parser parser = new Parser(filename, ee, null);
                TXDocument doc = parser.readStream(is);
            }
        }
    }
}
See the sources com/ibm/xml/parser/trlxml.java and com/ibm/xml/parser/Stderr.java for additional information.
Since our Parser's TXDocument and TXElement are implementation classes of the DOM interfaces Document and Element respectively, you can write a client that traverses the tree using only the DOM interfaces, without referring to our implementation classes. For example:
....
import org.w3c.dom.*;
....
//*** Only refer to DOM Interfaces...
public static void traverseDOMBranch(Node node) {
    // do what you want with this node here...
    System.out.println(node.getNodeName()+":"+node.getNodeValue());
    if (node.hasChildNodes()) {
        NodeList nl = node.getChildNodes();
        int size = nl.getLength();
        for (int i = 0; i < size; i++) {
            traverseDOMBranch(nl.item(i));
        }
    }
}
....
//*** Note how we refer only to DOM Interface references.
Document doc = parser.readStream(is);
Element root = (Element)doc.getDocumentElement();
traverseDOMBranch(root);
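Because the traversal above touches only org.w3c.dom interfaces, it can be tried without XML4J on the classpath. The following is a minimal sketch under that assumption: it swaps in the JDK's built-in javax.xml.parsers parser as a stand-in for Parser#readStream(), and the class and method names (DomTraversalSketch, parse, visit) are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomTraversalSketch {
    /** Parses XML text with the JDK parser (an assumed stand-in for XML4J's Parser). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    /** Depth-first walk that records "name:value" for every node it visits. */
    static void traverse(Node node, List<String> out) {
        out.add(node.getNodeName() + ":" + node.getNodeValue());
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            traverse(children.item(i), out);
        }
    }

    /** Convenience wrapper: parse, then walk from the document element. */
    static List<String> visit(String xml) {
        List<String> out = new ArrayList<>();
        traverse(parse(xml).getDocumentElement(), out);
        return out;
    }

    public static void main(String[] args) {
        for (String line : visit("<A><B>hi</B><C/></A>")) {
            System.out.println(line);
        }
    }
}
```

The walk itself is the same recursion as traverseDOMBranch(); only the way the Document is obtained differs.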
TXDocument can have one TXElement instance, zero or one DTD instances, and instances of TXPI and TXComment as children. All children of TXDocument can be accessed with TXDocument#getChildren() / TXDocument#getChildrenArray(). The TXElement instance can be accessed with TXDocument#getDocumentElement() also.
TXElement can have some instances of TXElement, TXText, TXPI, and TXComment as children. All children of TXElement can be accessed with TXElement#getChildren() / TXElement#getChildrenArray().
Some methods of TXDocument and TXElement return instances of the Child interface. These Child instances are also instances of TXElement, TXText, TXPI, TXComment, or DTD (if a child of TXDocument). To find out which class an instance belongs to, use Node#getNodeType() or the instanceof operator, like this:
import com.ibm.xml.parser.*;
import org.w3c.dom.*;
import java.util.Enumeration;
....
//*** Refer to TX classes...
public static void traverseTX(Node node) {
    // do what you want with this node here...
    if (node instanceof TXElement) {
        TXElement el = (TXElement)node;
        // Do fancy TXElement stuff here...
        System.out.println(node.getNodeName()+":"+node.getNodeValue());
    } else if (node instanceof TXText) {
        TXText te = (TXText)node;
        // Do fancy TXText stuff here...
        System.out.println(node.getNodeName()+":"+node.getNodeValue());
    }
    if (node.hasChildNodes()) {
        NodeList nl = node.getChildNodes();
        int size = nl.getLength();
        for (int i = 0; i < size; i++) {
            traverseTX(nl.item(i));
        }
    }
}
....
Document doc = parser.readStream(is);
Element root = (Element)doc.getDocumentElement();
traverseTX(root);
The XML4J parser keeps all spaces and passes them to applications, according to 2.10 White Space Handling in XML 1.0 Proposed Recommendation. The processor sets the IsIgnorableWhitespace() flag to true on TXText instances which consist of only white spaces.
<MEMBERS>
  <PERSON>Hiroshi</PERSON>
  <PERSON>Naohiko</PERSON>
  <PERSON>
    Kent
  </PERSON>
</MEMBERS>

TXElement (getName():"MEMBERS", getText():"\n Hiroshi\n Naohiko\n \n Kent\n \n")
    TXText ("\n ", ignorable)
    TXElement (getName():"PERSON", getText():"Hiroshi")
        TXText ("Hiroshi")
    TXText ("\n ", ignorable)
    TXElement (getName():"PERSON", getText():"Naohiko")
        TXText ("Naohiko")
    TXText ("\n ", ignorable)
    TXElement (getName():"PERSON", getText():"\n Kent\n ")
        TXText ("\n Kent\n ")
    TXText ("\n", ignorable)
You might find it useful to call TXText#trim(String) / TXText#trim(String,boolean,boolean) when your application does not need to retain leading or trailing spaces.
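The distinction between whitespace-only text nodes and text with surrounding whitespace can be reproduced with plain DOM calls. The sketch below is an illustration only: it uses the JDK's javax.xml.parsers parser rather than XML4J, approximates the ignorable flag with a "whitespace only" test, and uses String.trim() where XML4J code would call TXText#trim(). The class and method names are hypothetical, and the exact indentation of the sample document is an assumption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class WhitespaceSketch {
    /** The MEMBERS sample, with assumed indentation. */
    static final String MEMBERS =
            "<MEMBERS>\n"
          + "  <PERSON>Hiroshi</PERSON>\n"
          + "  <PERSON>Naohiko</PERSON>\n"
          + "  <PERSON>\n    Kent\n  </PERSON>\n"
          + "</MEMBERS>";

    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    /** True for a text node made up only of whitespace (trim() strips all chars <= U+0020). */
    static boolean isWhitespaceOnly(Node node) {
        return node.getNodeType() == Node.TEXT_NODE
                && node.getNodeValue().trim().isEmpty();
    }

    /** Counts the direct children of el that are whitespace-only text nodes. */
    static int countWhitespaceOnlyChildren(Element el) {
        int count = 0;
        NodeList children = el.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (isWhitespaceOnly(children.item(i))) {
                count++;
            }
        }
        return count;
    }

    /** Returns the text of every PERSON element with leading and trailing whitespace removed. */
    static List<String> trimmedPersonNames(Element root) {
        List<String> names = new ArrayList<>();
        NodeList persons = root.getElementsByTagName("PERSON");
        for (int i = 0; i < persons.getLength(); i++) {
            names.add(persons.item(i).getTextContent().trim());
        }
        return names;
    }

    public static void main(String[] args) {
        Element root = parseRoot(MEMBERS);
        System.out.println(countWhitespaceOnlyChildren(root));
        System.out.println(trimmedPersonNames(root));
    }
}
```

The whitespace-only children of MEMBERS correspond to the TXText nodes marked ignorable in the tree above, while the text inside the third PERSON keeps its surrounding whitespace until it is trimmed.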
Creating an end tag hook
In this example, we create a "hook", so we can do processing when specific end tags are encountered.
import com.ibm.xml.parser.*;
import java.io.*;

// AElementHandler handles an Element and doesn't Filter it
class AElementHandler implements ElementHandler {
    public TXElement handleElement(TXElement el) {
        System.out.println("handling:"+el.getName());
        return el;
    }
}
....
Parser parser = new Parser(filename);
parser.addElementHandler(new AElementHandler(), "SPEECH");
TXDocument doc = parser.readStream(is);
Filtering out specific tags
This second example shows you how to filter out tags, by not allowing them to be placed into the parse tree.
import com.ibm.xml.parser.*;
import java.io.*;

// FilterElementHandler handles an Element and Filters LINE elements...
class FilterElementHandler implements ElementHandler {
    public TXElement handleElement(TXElement el) {
        System.out.println("handling:"+el.getName());
        if (el.getName().equals("LINE"))
            return null;
        else
            return el;
    }
}
....
Parser parser = new Parser(filename);
parser.addElementHandler(new FilterElementHandler());
TXDocument doc = parser.readStream(is);
Remember, to filter out tags, return null from handleElement(). These examples can be found in ElementHandlers.java.
When more than one ElementHandler is registered with the parser, the parser first calls the ElementHandlers registered for a specific element name (in the order they were registered) and then calls the ElementHandlers registered for all elements.
Even if an ElementHandler changes the name of a TXElement, the parser selects the remaining ElementHandlers by the original name. When an ElementHandler returns null, the parser stops calling other ElementHandlers.
Parser parse = new Parser(...);
parse.addElementHandler(handler1);
parse.addElementHandler(handler2, "SPEECH");
parse.addElementHandler(handler3, "SPEECH");
parse.addElementHandler(handler4);
TXDocument doc = parse.readStream(is);
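The calling order described above can be modeled in a few lines of plain Java. This is a sketch of the stated rules only, not XML4J's implementation: the HandlerOrderSketch class, its Registration record, and the boolean "filters" flag (which stands for a handler that returns null) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class HandlerOrderSketch {
    /** Hypothetical registration: a label, the element name it is bound to (null = all), and whether it filters. */
    static final class Registration {
        final String label;
        final String elementName;
        final boolean filters;

        Registration(String label, String elementName, boolean filters) {
            this.label = label;
            this.elementName = elementName;
            this.filters = filters;
        }
    }

    private final List<Registration> registrations = new ArrayList<>();

    void add(String label, String elementName, boolean filters) {
        registrations.add(new Registration(label, elementName, filters));
    }

    /** Returns the labels of the handlers called for an element, in calling order. */
    List<String> callOrder(String originalName) {
        List<String> called = new ArrayList<>();
        // Pass 0: handlers bound to this element name. Pass 1: handlers bound to all elements.
        for (int pass = 0; pass < 2; pass++) {
            for (Registration r : registrations) {
                boolean specific = r.elementName != null;
                boolean applies = (pass == 0)
                        ? specific && r.elementName.equals(originalName)
                        : !specific;
                if (applies) {
                    called.add(r.label);
                    if (r.filters) {
                        return called; // models handleElement() returning null
                    }
                }
            }
        }
        return called;
    }

    /** Mirrors the four registrations in the example above. */
    static HandlerOrderSketch guideExample() {
        HandlerOrderSketch sketch = new HandlerOrderSketch();
        sketch.add("handler1", null, false);
        sketch.add("handler2", "SPEECH", false);
        sketch.add("handler3", "SPEECH", false);
        sketch.add("handler4", null, false);
        return sketch;
    }

    public static void main(String[] args) {
        System.out.println(guideExample().callOrder("SPEECH"));
        System.out.println(guideExample().callOrder("LINE"));
    }
}
```

Under these rules a SPEECH element reaches handler2 and handler3 before handler1 and handler4, while any other element reaches only handler1 and handler4. Matching is always done against the name passed in, which models the rule that later handlers are selected by the original element name.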
Make a TXDocument instance.
TXDocument doc = new TXDocument();
Create something, a DTD or root Element.
Element root = doc.createElement("ROOT");
Append the newly created Element.
doc.appendChild(root);
Append something to the root Element you have added.
root.appendChild(doc.createElement("FOO"));
Use the quick reference below to see how to do common XML tasks using XML4J.
To create this XML representation: | Use this code:
---|---
<?xml version="1.0" encoding="ISO-8859-1"?> | TXDocument doc = new TXDocument(); doc.setVersion("1.0"); doc.setEncoding("ISO-8859-1");
<?footarget foodata?> | TXPI pi = (TXPI)doc.createProcessingInstruction("footarget","foodata");
<?footarget?> | TXPI pi = (TXPI)doc.createProcessingInstruction("footarget", "");
<!-- comment --> | TXComment comm = (TXComment)doc.createComment(" comment ");
<!DOCTYPE ROOT SYSTEM "root.dtd"> | DTD dtd = doc.createDTD("ROOT", new ExternalID("root.dtd"));
<!DOCTYPE ROOT [...]> | DTD dtd = doc.createDTD("ROOT", null); dtd.addElement(...);
<!ELEMENT ROOT EMPTY> | ElementDecl ed = doc.createElementDecl("ROOT", doc.createContentModel(ElementDecl.EMPTY));
<!ELEMENT ROOT (#PCDATA\|FOO\|BAR)*> | CMNode model = new CM1op('*', new CM2op('\|', new CM2op('\|', new CMLeaf("#PCDATA"), new CMLeaf("FOO")), new CMLeaf("BAR"))); ContentModel cm = doc.createContentModel(model); ElementDecl ed = doc.createElementDecl("ROOT", cm);
<!ELEMENT ROOT (FOO?, (DL\|DD)+, BAR*)> | CMNode model = new CM2op(',', new CM2op(',', new CM1op('?', new CMLeaf("FOO")), new CM1op('+', new CM2op('\|', new CMLeaf("DL"), new CMLeaf("DD")))), new CM1op('*', new CMLeaf("BAR"))); ContentModel cm = doc.createContentModel(model); ElementDecl ed = doc.createElementDecl("ROOT", cm);
<!ATTLIST ROOT att1 CDATA #IMPLIED att2 (A\|B\|O\|AB) "A"> | Attlist al = doc.createAttlist("ROOT"); AttDef ad = doc.createAttDef("att1"); ad.setDeclaredType(AttDef.CDATA); ad.setDefaultType(AttDef.IMPLIED); al.addElement(ad); ad = doc.createAttDef("att2"); ad.setDeclaredType(AttDef.NAME_TOKEN_GROUP); ad.addElement("A"); ad.addElement("B"); ad.addElement("O"); ad.addElement("AB"); ad.setDefaultStringValue("A"); al.addElement(ad);
<!NOTATION png SYSTEM "viewpng.exe"> | TXNotation no = doc.createNotation("png", new ExternalID("viewpng.exe"));
<!ENTITY version.num "1.1.6"> | Entity ent = doc.createEntityDecl("version.num", "1.1.6", false);
<!ENTITY version.num SYSTEM "version.ent"> | Entity ent = doc.createEntityDecl("version.num", new ExternalID("version.ent"), null);
<!ENTITY logoicon SYSTEM "logo.png" NDATA png> | Entity ent = doc.createEntityDecl("logoicon", new ExternalID("logo.png"), "png");
<ROOT att1="val1" att2="val2">any text</ROOT> | TXElement el = doc.createElement("ROOT"); el.setAttribute("att1", "val1"); el.setAttribute("att2", "val2"); el.addElement(doc.createText("any text"));
<![CDATA[any text]]> | TXCDATASection cd = (TXCDATASection)doc.createCDATASection("any text");
&foobar; | GeneralReference gr = (GeneralReference)doc.createEntityReference("foobar");
NOTE: Any XML node can be created manually using the PseudoNode construct, i.e. with `new PseudoNode("literal");'. For example, `dtd.addElement(new PseudoNode("<!ELEMENT ROOT (FOO, BAR)*>"));' will create `<!ELEMENT ROOT (FOO, BAR)*>'. However, you can use a tree including PseudoNode instances only for printing.
Here is an example program which uses a subset from the table above to generate a DOM tree with an inline DTD. The program also prints out the XML. If you redirect this to a file, you can check for correctness using XJParse.
import com.ibm.xml.parser.*;
import java.io.*;
import org.w3c.dom.*;

/**
 * This class tests the various table entries in the
 * Programming Guide - guide.html, under the section
 *
 * How to make a new XML document.
 *
 * This is to verify the code snippets for accuracy with
 * the latest DOM.
 */
public class MakeNewDocument {
    public static void main (String args[]) {
        TXDocument doc = new TXDocument();

        //<?xml version="1.0" encoding="ISO-8859-1"?>
        doc.setVersion("1.0");
        doc.setEncoding("ISO-8859-1");

        //<?footarget foodata?>
        TXPI pi = (TXPI)doc.createProcessingInstruction("footarget", " foodata");
        doc.appendChild(pi);

        // <!-- comment -->
        TXComment comm = (TXComment)doc.createComment(" comment ");
        doc.appendChild(comm);

        //<!DOCTYPE ROOT [...]>
        DTD dtd = doc.createDTD("ROOT", null);
        doc.appendChild(dtd);

        //<!ELEMENT ROOT (#PCDATA|FOO|BAR)*>
        CMNode model = new CM1op('*',
            new CM2op('|',
                new CM2op('|', new CMLeaf("#PCDATA"), new CMLeaf("FOO")),
                new CMLeaf("BAR")));
        ContentModel cm = doc.createContentModel(model);
        ElementDecl ed = doc.createElementDecl("ROOT", cm);
        dtd.appendChild(ed);
        ElementDecl foodecl = doc.createElementDecl("FOO",
            doc.createContentModel(new CMLeaf("#PCDATA")));
        ElementDecl bardecl = doc.createElementDecl("BAR",
            doc.createContentModel(new CMLeaf("#PCDATA")));
        dtd.appendChild(foodecl);
        dtd.appendChild(bardecl);

        //<!ATTLIST ROOT
        //    att1 CDATA #IMPLIED
        //    att2 (A|B|O|AB) "A">
        Attlist al = doc.createAttlist("ROOT");
        AttDef ad = doc.createAttDef("att1");
        ad.setDeclaredType(AttDef.CDATA);
        ad.setDefaultType(AttDef.IMPLIED);
        al.addElement(ad);
        ad = doc.createAttDef("att2");
        ad.setDeclaredType(AttDef.NAME_TOKEN_GROUP);
        ad.addElement("A");
        ad.addElement("B");
        ad.addElement("O");
        ad.addElement("AB");
        ad.setDefaultStringValue("A");
        al.addElement(ad);
        dtd.appendChild(al);

        //<!NOTATION png SYSTEM "viewpng.exe">
        TXNotation no = doc.createNotation("png", new ExternalID("viewpng.exe"));
        dtd.appendChild(no);

        //<!ENTITY version.num "1.1.6">
        Entity ent = doc.createEntityDecl("version.num", "1.1.6", false);
        dtd.appendChild(ent);

        //<ROOT att1="val1" att2="val2">any text</ROOT>
        TXElement rt = (TXElement)doc.createElement("ROOT");
        rt.setAttribute("att1", "val1");
        rt.setAttribute("att2", "B");
        rt.appendChild(doc.createTextNode("any text"));
        TXElement foo = (TXElement)doc.createElement("FOO");
        TXElement bar = (TXElement)doc.createElement("BAR");
        rt.appendChild(foo);
        rt.appendChild(bar);
        doc.appendChild(rt);

        String encode = MIME2Java.convert("ISO-8859-1");
        PrintWriter pw;
        try {
            pw = new PrintWriter(new OutputStreamWriter(System.out, encode));
        } catch (UnsupportedEncodingException badEncoding) {
            System.err.println(badEncoding);
            return;
        }
        doc.setEncoding("ISO-8859-1");
        try {
            doc.print(pw, encode);
        } catch (IOException io) {
            System.err.println(io);
        }
    }
}
If you want to add functionality to the TXElement class you must subclass TXElement. Then you must subclass the TXDocument class and call Parser#setElementFactory(my-new-TXDocument-subclass). The setElementFactory function is still named so for backwards compatibility.
For example, suppose we wanted to write an XML browser or editor. Not only do we need to remember and store errors, but we also need to map these errors to DOM nodes. To do this, we need to associate the current node with the error as the XML is being parsed. We've already learned how to implement an ErrorListener, and we've learned how to implement an ElementHandler, which we could use to keep track of the current TXElement. However, now we want to capture errors on Nodes, whether they are TXElements or anything else.
import com.ibm.xml.parser.*;
import java.util.Hashtable;
import org.w3c.dom.Node;

class MyElement extends TXElement { .... }
class MyText extends TXText { .... }

class MyFactory extends TXDocument {
    //*** The current Node!
    public static Node currentNode;

    public TXElement createElement(String name) {
        MyElement el = new MyElement(name);
        el.setFactory(this);
        currentNode = el;
        return el;
    }

    public TXText createText(String data, boolean ignorable) {
        MyText te = new MyText(data);
        te.setFactory(this);
        te.setIsIgnorableWhitespace(ignorable);
        currentNode = te;
        return te;
    }
    ....
}

class ErrorFlagger implements ErrorListener {
    static Hashtable errorNodes = new Hashtable();
    static Object previous;
    static String errorString;

    // ErrorListener#error() returns int, as in the ErrorIgnorer and ErrorEater examples.
    public int error(String fname, int lineno, int charoff, Object key, String mes) {
        // Hashtable rejects null keys, so skip errors reported before any node exists.
        if (MyFactory.currentNode == null)
            return 1;
        errorString = mes;
        previous = errorNodes.get(MyFactory.currentNode);
        if (previous != null)
            errorString += (String)previous;
        errorNodes.put(MyFactory.currentNode, errorString);
        return 1; // same return value as the ErrorEater example
    }

    public static String getError(Node node) {
        return (String)errorNodes.get(node);
    }

    public static Hashtable getErrorNodes() {
        return (Hashtable)errorNodes.clone();
    }
}
....
Parser parse = new Parser(...);
parse.setElementFactory(new MyFactory());
TXDocument doc = parse.readStream(is);
// doc has MyElement instances instead of TXElement instances
// doc has MyText instances instead of TXText instances
You must call setFactory(this) in the create*() methods of your factory class.
String systemlit = "http://.../foobar.dtd";
InputStream is = (new URL(systemlit)).openStream();
Parser parse = new Parser(...);
DTD dtd = parse.readDTDStream(is);

Enumeration en = dtd.getAttributeDeclarations("FOO");
while (en.hasMoreElements()) {
    AttDef attd = (AttDef)en.nextElement();
    // attd.getName() is attribute name
}
First, get an AttDef instance using the method above or DTD#getAttributeDeclaration(String,String).
Second, check the attribute type by AttDef#getDeclaredType(), which returns one of the following values.
Enumeration en = dtd.getEntities();
while (en.hasMoreElements()) {
    EntityValue ev = (EntityValue)en.nextElement();
    if (ev.isNDATA()) {
        // Each ev.getName() is a valid value.
    }
}

Enumeration en = attd.elements();
while (en.hasMoreElements()) {
    String s = (String)en.nextElement();
    // Each s is valid.
}

String newid = ...
if (null != dtd.checkID(newid)) {
    // Can't use newid
} else
    dtd.registID(element, newid);

Enumeration en = dtd.IDs();
while (en.hasMoreElements()) {
    String id = (String)en.nextElement();
    // The attribute can take any one of these ids.
}

Enumeration en = attd.elements();
while (en.hasMoreElements()) {
    String s = (String)en.nextElement();
    // Each s is valid.
}
<!ELEMENT PERSON (NAME, HEIGHT, WEIGHT, EMAIL?)>
This DTD declaration means that we must first insert a "NAME" element, then a "HEIGHT" element, a "WEIGHT" element, and finally we may optionally insert an "EMAIL" element.
Applications can get information about these validation rules using DTD#getInsertableElements() or DTD#getAppendableElements().
TXElement el = new TXElement("PERSON");
....
switch (dtd.getContentType("PERSON")) {
  case 0:
    // This element is not declared.
    break;
  case ElementDecl.EMPTY:
    // No element is insertable.
    break;
  case ElementDecl.ANY:
    // Any element is insertable.
    break;
  case ElementDecl.MODEL_GROUP:
    Hashtable tab = dtd.prepareTable("PERSON");
    // This hashtable is reusable for any elements.
    dtd.getAppendableElements(el, tab);
    if (((InsertableElement)tab.get(DTD.CM_ERROR)).status) {
        // This element has incorrect structure.
    } else {
        Enumeration en = tab.elements();
        while (en.hasMoreElements()) {
            InsertableElement ie = (InsertableElement)en.nextElement();
            if (!ie.name.equals(DTD.CM_ERROR)
                && !ie.name.equals(DTD.CM_EOC)
                && ie.status) {
                if (ie.name.equals(DTD.CM_PCDATA)) {
                    // Can append a TXText instance to el.
                } else {
                    // Can append an Element instance named ie.name.
                }
            }
        }
    }
    break;
}
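To make the idea of "appendable elements" concrete, the following sketch works through the PERSON declaration by hand. It is a simplified model written for illustration, not the algorithm XML4J uses: it handles only a flat sequence of required and optional ('?') items, and the AppendableSketch class, its Item type, and the appendable() method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AppendableSketch {
    /** One item of a sequence content model: an element name and whether it carries '?'. */
    static final class Item {
        final String name;
        final boolean optional;

        Item(String name, boolean optional) {
            this.name = name;
            this.optional = optional;
        }
    }

    /** Models <!ELEMENT PERSON (NAME, HEIGHT, WEIGHT, EMAIL?)>. */
    static final List<Item> PERSON = Arrays.asList(
            new Item("NAME", false),
            new Item("HEIGHT", false),
            new Item("WEIGHT", false),
            new Item("EMAIL", true));

    /**
     * Returns the element names that may be appended next, given the children
     * already present. Throws IllegalArgumentException if the existing children
     * do not fit the sequence.
     */
    static List<String> appendable(List<Item> model, List<String> children) {
        int pos = 0;
        for (String child : children) {
            // Skip optional items that were left out, then expect a match.
            while (pos < model.size() && !model.get(pos).name.equals(child)) {
                if (!model.get(pos).optional) {
                    throw new IllegalArgumentException("unexpected element: " + child);
                }
                pos++;
            }
            if (pos == model.size()) {
                throw new IllegalArgumentException("unexpected element: " + child);
            }
            pos++;
        }
        // Everything up to and including the next required item may come next.
        List<String> result = new ArrayList<>();
        for (int i = pos; i < model.size(); i++) {
            result.add(model.get(i).name);
            if (!model.get(i).optional) {
                break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(appendable(PERSON, Arrays.asList("NAME")));
        System.out.println(appendable(PERSON, Arrays.asList("NAME", "HEIGHT", "WEIGHT")));
    }
}
```

An empty result means nothing more can be appended, and the exception plays the role of the "incorrect structure" case in the code above. Repetition operators ('*', '+') and choice groups would need a fuller matcher than this sketch provides.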
XML4J allows you to turn off validation of the content model. This means that the parser will not check whether the XML document being parsed follows the definition in the DTD file. You will still get an error if the parser cannot find the .dtd file referred to in the DOCTYPE line.
To turn off validation, subclass TXDocument, override isCheckValidity() to return false, and call Parser#setElementFactory() with an instance of this subclass.
Here is the code illustrating how this can be done.
...
Parser p = new Parser(fileName);
p.setElementFactory(new TXDocument() {
    public boolean isCheckValidity() {
        return false;
    }
});
TXDocument doc = p.readStream(is);
...
NOTE: The namespace specification is Work in Progress. The implementation of namespaces in XML4J is experimental.
TXElement, TXText, TXComment, and TXPI have getDigest() methods, which return hash digest values calculated for subtrees of the parse tree. By default, getDigest() returns a 128-bit MD5 hash code for the current node and all its children. Therefore, when a child element is modified, each parent element's getDigest() will return a new digest value.
You can use getDigest for fast and efficient comparison of two parse trees, or for detecting changes in XML documents.
See the DOMHASH document for additional details.
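The way a change in a child propagates to every ancestor's digest can be illustrated with the JDK alone. The sketch below is not the DOMHASH algorithm and does not call getDigest(): it is an assumed, simplified scheme that hashes a node's type, name, and value together with the digests of its children, and it ignores attributes. The SubtreeDigestSketch class and its methods are hypothetical.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class SubtreeDigestSketch {
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    /** MD5 over this node's type, name, value, and the digests of its children (attributes ignored). */
    static byte[] digest(Node node) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
        md.update((byte) node.getNodeType());
        md.update(node.getNodeName().getBytes(StandardCharsets.UTF_8));
        md.update((byte) 0); // separator between name and value
        String value = node.getNodeValue();
        if (value != null) {
            md.update(value.getBytes(StandardCharsets.UTF_8));
        }
        // Folding in each child's digest makes a parent's digest depend on its whole subtree.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            md.update(digest(child));
        }
        return md.digest();
    }

    /** Compares two documents by the digest of their root elements. */
    static boolean sameDigest(String xmlA, String xmlB) {
        return Arrays.equals(digest(parseRoot(xmlA)), digest(parseRoot(xmlB)));
    }

    public static void main(String[] args) {
        System.out.println(sameDigest("<A><B>x</B></A>", "<A><B>x</B></A>"));
        System.out.println(sameDigest("<A><B>x</B></A>", "<A><B>y</B></A>"));
    }
}
```

Comparing two 16-byte digests is enough to tell whether two subtrees differ under this scheme, which is the property that makes digest comparison attractive for change detection.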
As per the XML specification, the 'encoding=' declaration is optional only for XML files encoded in UTF-8 or UTF-16. If the XML file is in any other encoding, then the 'encoding' attribute must be present in the XML header declaration (in the first line of the file). The value of the 'encoding' attribute must be a supported encoding name. A list of names of encodings supported by XML4J may be found in the file com.ibm.xml.parser.MIME2Java.html.
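The effect of the encoding declaration can be observed with any conforming parser. The sketch below uses the JDK's javax.xml.parsers parser as a stand-in for XML4J and feeds it the same ISO-8859-1 bytes with and without the declaration. The EncodingDeclSketch class, its constants, and its methods are hypothetical names chosen for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

class EncodingDeclSketch {
    static final String DECL = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>";
    /** Contains U+00E9, which is the single byte 0xE9 in ISO-8859-1. */
    static final String BODY = "<NAME>Ren\u00e9</NAME>";

    /** Encodes the document text as ISO-8859-1 bytes. */
    static byte[] latin1(String xml) {
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Returns the root element's text, or null if the parser rejects the bytes. */
    static String rootText(byte[] bytes) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes))
                    .getDocumentElement()
                    .getTextContent();
        } catch (SAXException | IOException e) {
            // Without a declaration the bytes are read as UTF-8, where a lone 0xE9 is malformed.
            return null;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no XML parser available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootText(latin1(DECL + BODY)) != null);
        System.out.println(rootText(latin1(BODY)) != null);
    }
}
```

With the declaration present the parser decodes the byte 0xE9 as the intended character. Without it the parser assumes UTF-8 and fails on the same input, which is why the declaration is mandatory for other encodings.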