Discover key features of DOM Level 3 Core, Part 2



Contents:

Mapping to the Infoset

Bootstrapping

Document normalization

Accessing type information

Using the DOM Level 3 API in Xerces2


Bootstrapping, mapping to the XML Infoset, accessing type information, and working with Xerces

Level: Intermediate

Arnaud Le Hors (lehors@us.ibm.com), Senior Software Engineer, IBM
Elena Litani (elitani@ca.ibm.com), Staff Software Developer, IBM

26 August 2003

In this two-part article, the authors present some of the key features brought by the W3C Document Object Model (DOM) Level 3 Core Working Draft and show you how to use them with examples in Java code. In this second part, they cover operations on the document, access to type information, and introduce you to the early implementation of this API in the Apache Xerces2 project.

In Part 1, we showed you a set of DOM Level 3 Core features you can use when working with nodes. We now describe how the DOM data model maps to the XML Infoset and how the so-called DOM bootstrapping mechanism lets you remove implementation-dependent code from your application. Then we show how to revalidate a DOM tree in memory so that you can check whether it still complies with your schema, describe how to access element and attribute type information, and show you how to use all this cool stuff in Xerces.

Mapping to the Infoset
One of the important tasks accomplished for DOM Level 3 is the alignment of the DOM data model with the XML Information Set (Infoset), through new methods that expose XML Infoset information that was previously unavailable. For example, you can now query and modify the information stored in an XML declaration, such as version, standalone, and encoding, through the Document interface (which is mapped to the Infoset document information item). Similarly, the base URI and declaration base URI properties are computed according to XML Base and are available on the Node interface. You can also retrieve the XML Infoset element content whitespace property, which indicates whether a Text node contains only ignorable whitespace. You can retrieve it through the Text interface (which maps to the XML Infoset character information item). Listing 1 shows the actual method signatures of the interface in the Java language binding.

Listing 1. Method signatures in Java language binding


// XML Declaration information on
// the org.w3c.dom.Document interface
public String getXmlEncoding();
public void setXmlEncoding(String xmlEncoding);
public boolean getXmlStandalone();
public void setXmlStandalone(boolean xmlStandalone)
                                  throws DOMException;
public String getXmlVersion();
public void setXmlVersion(String xmlVersion)
                                  throws DOMException;

// element content whitespace property on the Text 
// interface
public boolean isWhitespaceInElementContent();

You can also retrieve the value of the attribute type property of an attribute information item -- this is the type of an attribute -- through the schemaTypeInfo attribute of the Attr interface. This is covered in more detail in the section Accessing type information.

In addition, a new feature is provided to put the Document back in a form closest to the XML Infoset, since various editing operations, such as insertion or deletion of nodes, often leave you with a document that is further from the XML Infoset than it might be. This is one of the results you can obtain as part of an operation called document normalization that we describe in the section Document normalization.

Finally, the new Appendix C provides the mappings between the XML Infoset model and the DOM where each XML Infoset information item is mapped to its respective Node, and vice-versa, and each property of an information item is mapped to its respective Node attribute. This appendix should give you a good overview of the DOM data model and show you how to access the information you are looking for.
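To make the Infoset accessors concrete, here is a minimal sketch that reads the XML declaration properties from a parsed document. It assumes the org.w3c.dom interfaces that ship with current Java SE releases, which follow the final Recommendation rather than the Working Draft described here; in particular, the draft's isWhitespaceInElementContent appears there as isElementContentWhitespace. The class name and the parse helper are ours, for illustration only.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class InfosetProperties {

    // Hypothetical helper: parses an XML string with the JAXP default parser.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse sample document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
            + "<root>text</root>");

        // XML declaration information exposed on the Document interface
        System.out.println("version:    " + doc.getXmlVersion());
        System.out.println("encoding:   " + doc.getXmlEncoding());
        System.out.println("standalone: " + doc.getXmlStandalone());

        // Element content whitespace property exposed on the Text interface
        Node first = doc.getDocumentElement().getFirstChild();
        if (first instanceof Text) {
            System.out.println("ignorable whitespace: "
                + ((Text) first).isElementContentWhitespace());
        }
    }
}
```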

Bootstrapping
Previous versions of the DOM specification did not provide any way to bootstrap DOM implementations; therefore, in your applications you had to start with implementation-dependent code. The DOM Level 3 Core specification defines a DOMImplementationRegistry object that lets you find implementations based on the set of features you need. For instance, you can ask for an implementation that supports mutation events. Listing 2 shows how you can use the bootstrapping mechanism in your application to find the appropriate implementation.

Listing 2. Using bootstrapping to find an implementation


// set DOMImplementationRegistry.PROPERTY property 
// to reference all known DOM implementations

System.setProperty(DOMImplementationRegistry.PROPERTY,
                   "org.apache.xerces.dom.DOMImplementationSourceImpl");

// get an instance of DOMImplementationRegistry
DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();

// get a DOM implementation that supports the specified features
DOMImplementation i = registry.getDOMImplementation("MutationEvents");

This has numerous advantages. It not only makes your code independent of the implementation, but it also allows DOM implementers to provide you with implementations that may better suit your needs. This can result in better performance for your application. For instance, Xerces has more than one implementation: One is full-featured and supports many optional modules of the DOM; another is minimal and only supports the core functionality with lighter objects. If you don't need support for mutation events, why should you pay the price of creating objects that carry the weight of such a feature? With the bootstrapping mechanism, you can use the most appropriate implementation for your application.
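The same lookup can be sketched against the registry class bundled with current Java SE releases (org.w3c.dom.bootstrap.DOMImplementationRegistry). This is an illustration under that assumption, not the Xerces 2.4.0 setup shown in Listing 2: no system property is set, so the registry falls back to whatever implementation source the platform provides. The class and method names are ours.

```java
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;

class BootstrapDemo {

    // Returns an implementation supporting the given feature string,
    // or null when no registered implementation source can satisfy it.
    static DOMImplementation find(String features) {
        try {
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            return registry.getDOMImplementation(features);
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalStateException("No DOM implementation source available", e);
        }
    }

    public static void main(String[] args) {
        // Ask only for what the application needs: here, XML support at Level 3.
        DOMImplementation impl = find("XML 3.0");
        if (impl == null) {
            System.out.println("No implementation supports the requested features");
            return;
        }
        // No implementation-specific class is named anywhere in this code.
        Document doc = impl.createDocument(null, "root", null);
        System.out.println("Created <" + doc.getDocumentElement().getNodeName()
            + "> using " + impl.getClass().getName());
    }
}
```

Because getDOMImplementation returns null instead of throwing when nothing matches, checking the result before use is part of the pattern.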

Document normalization
One of the new methods defined in DOM Level 3 is the normalizeDocument method on the Document interface. As the name implies, you can use this method to normalize the document. By default, this method does the following:

Normalizes Text nodes, consolidating adjacent Text nodes into a single Text node
Updates the content of EntityReference nodes according to the entities they reference
Verifies and fixes namespace information in the document, making it namespace well-formed

It is important to note that the namespace normalization algorithm (defined in Appendix B) used in this method only works with namespace-aware nodes -- nodes created using methods with an "NS" suffix, such as createElementNS. Namespace-unaware nodes -- nodes created with the DOM Level 1 methods, such as createElement -- are not fully compatible with any processing that depends on XML Namespaces. If you have DOM Level 1 nodes in your document, normalizeDocument will fail and report an error when trying to perform namespace normalization. In general, you should not create nodes with DOM Level 1 methods if you want to use XML Namespaces and perform any operation on your document that requires XML Namespaces support. This is also true for other operations, such as revalidating the document in memory against an XML Schema.
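The default behavior can be observed with a small experiment. The sketch below assumes the DOM implementation bundled with Java SE and builds a namespace-aware element with two adjacent Text nodes but no namespace declaration, then normalizes the document. The class name and the sample namespace URI are made up for the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DefaultNormalization {

    // Builds a small tree by hand, normalizes it, and returns the root element.
    static Element buildAndNormalize() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM parser available", e);
        }
        // Namespace-aware creation ("NS" suffix); no xmlns:p attribute is added here.
        Element root = doc.createElementNS("urn:example:people", "p:person");
        doc.appendChild(root);
        // Two adjacent Text nodes, as editing operations often leave behind.
        root.appendChild(doc.createTextNode("Ada "));
        root.appendChild(doc.createTextNode("Lovelace"));

        doc.normalizeDocument();
        return root;
    }

    public static void main(String[] args) {
        Element root = buildAndNormalize();
        System.out.println("child nodes:  " + root.getChildNodes().getLength());
        System.out.println("text content: " + root.getTextContent());
        // Namespace fix-up is expected to have declared the prefix on the element.
        System.out.println("xmlns:p present: " + root.hasAttribute("xmlns:p"));
    }
}
```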

You can also configure normalizeDocument, through DOMConfiguration, to perform other operations on your document. For example, you can use this method to get rid of comments, to transform CDATASection nodes into Text nodes, or to discard all the namespace declaration attributes from the tree. You can also use it to easily get your document into a form that naturally maps to the XML Infoset by doing all of the above at once. Listing 3 shows you how to use Document.config to control normalizeDocument.

Listing 3. Using Document.config to control normalizeDocument


// retrieve document configuration
DOMConfiguration config = document.getConfig();
// remove comments from the document
config.setParameter("comments", false);
// remove namespace declarations
config.setParameter("namespace-declarations", false);
// transform document
document.normalizeDocument();

// put document into a form closest to the XML Infoset 
config.setParameter("infoset", true);
// transform document
document.normalizeDocument();
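For readers who want to see the effect of these parameters, here is a self-contained sketch that runs against the DOM bundled with Java SE. It rests on two assumptions that differ from the draft API used in this article: the configuration is obtained with getDomConfig() (the name in the final Recommendation), and boolean parameters are passed as Boolean objects. The class name, helper, and sample document are ours.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ConfiguredNormalization {

    // Parses the given XML, then drops comments and turns CDATA sections into Text.
    static Element stripCommentsAndCdata(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse sample document", e);
        }
        DOMConfiguration config = doc.getDomConfig();
        config.setParameter("comments", Boolean.FALSE);        // discard Comment nodes
        config.setParameter("cdata-sections", Boolean.FALSE);  // CDATASection -> Text
        doc.normalizeDocument();
        return doc.getDocumentElement();
    }

    public static void main(String[] args) {
        Element root = stripCommentsAndCdata(
            "<root><!-- draft --><![CDATA[a < b]]></root>");
        System.out.println("child nodes: " + root.getChildNodes().getLength());
        System.out.println("first child: " + root.getFirstChild().getClass().getSimpleName());
        System.out.println("text:        " + root.getTextContent());
    }
}
```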

The normalizeDocument method also allows you to revalidate your document in memory with respect to its XML Schema or DTD. In the past, to revalidate your document once it had been modified you had to save it to a file and read it back with a validating parser. Using this new method, you can now do this much more efficiently by having the DOM implementation revalidate your document in memory. To do this, you first need to set the validate parameter of the DOMConfiguration to true. You then need to implement a DOMErrorHandler object, to which validation errors will be reported, and register it with the Document using the error-handler parameter. This is very similar to what you would do with a SAX parser. Finally, you can check whether your document is valid by calling normalizeDocument. Later in this article, we show how you can do that using Xerces.
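An error handler for this purpose can be very small. The following sketch assumes the DOMErrorHandler and DOMError interfaces as they appear in the org.w3c.dom package of current Java SE releases; the class name CollectingErrorHandler and its helper methods are hypothetical. It collects every reported problem so that the caller can inspect the list after normalizeDocument returns.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMError;
import org.w3c.dom.DOMErrorHandler;
import org.w3c.dom.Document;

class CollectingErrorHandler implements DOMErrorHandler {

    private final List<String> messages = new ArrayList<>();

    @Override
    public boolean handleError(DOMError error) {
        messages.add(severityName(error.getSeverity()) + ": " + error.getMessage());
        // true asks the implementation to keep going; we give up on fatal errors.
        return error.getSeverity() != DOMError.SEVERITY_FATAL_ERROR;
    }

    static String severityName(short severity) {
        switch (severity) {
            case DOMError.SEVERITY_WARNING:     return "warning";
            case DOMError.SEVERITY_ERROR:       return "error";
            case DOMError.SEVERITY_FATAL_ERROR: return "fatal error";
            default:                            return "unknown";
        }
    }

    List<String> messages() { return messages; }

    boolean isClean() { return messages.isEmpty(); }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        doc.appendChild(doc.createElementNS(null, "person"));

        CollectingErrorHandler handler = new CollectingErrorHandler();
        // Register the handler; for revalidation you would also set "validate"
        // to true and associate a schema with the document.
        doc.getDomConfig().setParameter("error-handler", handler);
        doc.normalizeDocument();

        System.out.println(handler.isClean() ? "no problems reported" : handler.messages());
    }
}
```

Collecting messages instead of printing them keeps the validity check a simple question the caller asks afterwards, much like a SAX ErrorHandler that records errors.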

Currently, no standard API exists for accessing the XML Schema Post-Schema Validation Infoset (PSVI). However, DOM Level 3 allows you to retrieve some PSVI information. For example, if you are interested in getting the PSVI-normalized schema value property, setting the datatype-normalization and validate parameters to "true" on the DOMConfiguration and calling normalizeDocument updates the tree with the XML Schema normalized values -- which means the attribute values and element content in your document will now represent the PSVI-normalized schema value property.

Accessing type information
The previous versions of the DOM did not provide any access to any type information; you had no way to get the type of an attribute or element node in a document. As we mentioned above, this is now possible in DOM Level 3 Core thanks to the introduction of a new interface called TypeInfo. This interface represents a type definition as a pair consisting of a name and a namespace URI. Depending on the schema used to validate your document, what this type definition corresponds to can vary.

If you use a DTD (at load time or with normalizeDocument), TypeInfo on an attribute node represents the type of the attribute. This is the attribute information item's attribute type property in the XML Infoset. However, on an element node, TypeInfo has null for name and null for namespace URI because DTDs do not define element types.

Now, if you use an XML Schema to validate your document, TypeInfo represents the type of the element on an element node, and the type of the attribute on an attribute node. In fact, TypeInfo also represents the PSVI type definition property for the corresponding element and attribute information items.

Note that for this information to be available, the element or attribute has to be valid with respect to the schema used. When the validation fails, DOM implementations are encouraged to provide you with the declared type to help you fix the document accordingly. Also, when the type is anonymous, an implementation-specific unique name is returned to you.
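As a rough illustration, the sketch below reads TypeInfo from an element and from an attribute declared in a DTD. It assumes the getSchemaTypeInfo() accessors found on Element and Attr in the org.w3c.dom package of current Java SE releases; the class name, the describe helper, and the sample document are ours. What gets printed depends on the parser and on whether type information was recorded at load time, so no output is shown.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.TypeInfo;
import org.xml.sax.InputSource;

class TypeInfoDemo {

    // Formats a type definition as {namespace}name, tolerating missing information.
    static String describe(TypeInfo type) {
        if (type == null || type.getTypeName() == null) {
            return "(no type information)";
        }
        String namespace = type.getTypeNamespace();
        return namespace == null
            ? type.getTypeName()
            : "{" + namespace + "}" + type.getTypeName();
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<!DOCTYPE person [\n"
            + "  <!ELEMENT person (#PCDATA)>\n"
            + "  <!ATTLIST person id ID #REQUIRED>\n"
            + "]>\n"
            + "<person id=\"p1\">Ada</person>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        Element person = doc.getDocumentElement();
        Attr id = person.getAttributeNode("id");

        // With a DTD, only the attribute is expected to carry a type.
        System.out.println("element type:   " + describe(person.getSchemaTypeInfo()));
        System.out.println("attribute type: " + describe(id.getSchemaTypeInfo()));
    }
}
```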

Using the DOM Level 3 API in Xerces2
The Apache Xerces2 parser 2.4.0 provides an early implementation of DOM Level 3 Core. However, because the DOM Level 3 Core specification is not yet a W3C Recommendation, the implementation is not part of the default Xerces distribution. To use this functionality, you need to either cast to the Xerces DOM implementation classes (such as org.apache.xerces.dom.DocumentImpl) or build Xerces locally using the "jars-dom3" target. This generates the dom3-xml-apis.jar file that contains the DOM Level 3 API and the dom3-xercesImpl.jar file that contains the implementation of the API. To build Xerces, you need to either extract the source code from CVS or download both the Xerces source and tools distributions.

After you build Xerces with DOM Level 3 support, include the newly-generated jars in your CLASSPATH (dom3-xml-apis.jar and dom3-xercesImpl.jar), and you are ready to start programming using DOM Level 3.

If all you need is a DOM Level 3 Core implementation, you should request the Xerces implementation that supports the "Core" or "XML" features using the bootstrapping mechanism. As we mentioned, this returns a DOM implementation that uses less memory, but does not provide support for optional modules such as traversal.

As we also mentioned, DOM Level 3 introduces a mechanism that allows revalidation of documents in memory. However, the current version of Xerces (2.4.0) only supports revalidation against an XML Schema, not against a DTD. Note that if a DOM implementation supports revalidation against both XML Schema and DTD, and your document references different kinds of schemas (such as a DTD and an XML Schema), it is then unclear which one should be used for revalidation. To specify, for example, that you want to revalidate against the XML Schema, you can either remove the DocumentType node from the document, by retrieving the children of the Document node and removing the DocumentType child node, or you can set the schema-type parameter of the DOMConfiguration.
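Removing the DocumentType node takes only a few lines. This sketch uses Document.getDoctype(), a DOM Level 1 convenience accessor, instead of walking the children of the Document node; the class and method names are hypothetical, and the sample document is built in memory for the purpose of the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

class DropDoctype {

    // Removes the DocumentType child, if any; returns true when one was removed.
    static boolean removeDoctype(Document doc) {
        DocumentType doctype = doc.getDoctype();
        if (doctype == null) {
            return false;
        }
        doc.removeChild(doctype);
        return true;
    }

    // Hypothetical sample: a document that references a DTD named person.dtd.
    static Document sample() {
        try {
            DOMImplementation impl =
                DocumentBuilderFactory.newInstance().newDocumentBuilder().getDOMImplementation();
            DocumentType doctype = impl.createDocumentType("person", null, "person.dtd");
            return impl.createDocument(null, "person", doctype);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM parser available", e);
        }
    }

    public static void main(String[] args) {
        Document doc = sample();
        System.out.println("doctype before: " + doc.getDoctype());
        removeDoctype(doc);
        System.out.println("doctype after:  " + doc.getDoctype());
    }
}
```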

You can associate an XML Schema with a document in two ways:

Add an attribute to the documentElement (root element) with the name xsi:schemaLocation or xsi:noNamespaceSchemaLocation and the schema location as its value
Set the DOMConfiguration schema-location parameter to the location of the schema you want to use during revalidation
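The first option can be sketched as follows for a schema without a target namespace, where the attribute is xsi:noNamespaceSchemaLocation and its value is a single location (xsi:schemaLocation instead takes pairs of namespace URI and location). The class name, method name, and the personal.xsd file name are used for illustration only.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class SchemaHint {

    static final String XSI_NS = "http://www.w3.org/2001/XMLSchema-instance";
    static final String XMLNS_NS = "http://www.w3.org/2000/xmlns/";

    // Adds a schema location hint to the root element using namespace-aware methods.
    static void pointAtSchema(Document doc, String schemaLocation) {
        Element root = doc.getDocumentElement();
        // Declare the xsi prefix explicitly so the tree serializes cleanly.
        root.setAttributeNS(XMLNS_NS, "xmlns:xsi", XSI_NS);
        root.setAttributeNS(XSI_NS, "xsi:noNamespaceSchemaLocation", schemaLocation);
    }

    // Hypothetical sample document with a single root element.
    static Document sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            doc.appendChild(doc.createElementNS(null, "personnel"));
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM parser available", e);
        }
    }

    public static void main(String[] args) {
        Document doc = sample();
        pointAtSchema(doc, "personal.xsd");
        System.out.println(doc.getDocumentElement()
            .getAttributeNS(XSI_NS, "noNamespaceSchemaLocation")); // personal.xsd
    }
}
```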

Note that you should specify schema locations using absolute URIs. If you decide to use a relative URI, it will be resolved relative to the location of the document exposed through the documentURI attribute of the Document interface. Alternatively, you can implement and register a DOMEntityResolver (defined in the DOM Level 3 Load and Save specification) and resolve relative URIs yourself. Listing 4 shows you how to revalidate your document in memory:

Listing 4. Revalidating in memory


// Retrieve configuration
DOMConfiguration config = document.getConfig();
// Set document base URI
document.setDocumentURI("file:///c:/data");
// Configure the normalizeDocument operation
config.setParameter("schema-type", "http://www.w3.org/2001/XMLSchema");
config.setParameter("validate", true);
config.setParameter("schema-location", "personal.xsd");
// Revalidate your document in memory
document.normalizeDocument();
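When the schema location is relative, it matters whether the document URI ends with a slash. The sketch below uses java.net.URI as a stand-in for generic URI resolution; the resolver inside a particular DOM implementation may differ in detail, so treat this as a way to reason about the result and not as a description of what Xerces does. The class name is ours.

```java
import java.net.URI;

class SchemaLocationResolution {

    // Resolves a (possibly relative) schema location against the document URI.
    static URI resolve(String documentUri, String schemaLocation) {
        return URI.create(documentUri).resolve(schemaLocation);
    }

    public static void main(String[] args) {
        // Without a trailing slash, "data" is treated as the document itself,
        // so the relative reference replaces that last path segment.
        System.out.println(resolve("file:///c:/data", "personal.xsd").getPath());
        // prints /c:/personal.xsd

        // With a trailing slash, "data/" is treated as a directory.
        System.out.println(resolve("file:///c:/data/", "personal.xsd").getPath());
        // prints /c:/data/personal.xsd
    }
}
```

If the schema lives in the same directory as the document, setting documentURI to the full location of the document file, or to a directory URI ending in a slash, avoids the first outcome.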

Conclusion
We've shown you how the new features brought by DOM Level 3 Core can save you from writing a lot of code and improve the performance of your application. The less code you write, the less you have to maintain, the fewer bugs you're responsible for, and the better off you'll be! We've also presented and explained how to use several powerful new features, such as revalidation in memory and access to type information -- something developers have been asking for for a long time.

In short, DOM Level 3 Core ought to make your life easier -- especially when combined with other modules such as DOM Load & Save -- and we hope this article helps you take advantage of it.

Resources

Part 1 of this series covers operations on the node, such as renaming, moving nodes from one document to another, setting text content, and so on (developerWorks, August 2003).
Read about the DOM Level 2 Core W3C Recommendation.
To better understand the XML Infoset, read the XML Information Set specification.
Get familiar with the latest DOM Level 3 Core Last Call draft.
You can find out about other W3C specifications, such as XML Schemas and Namespaces in XML, at the W3C's Technical Reports and Publications page.
Learn about the Xerces2 DOM implementation.
Download the latest Xerces-J parser.
Find more XML resources on the developerWorks XML zone, including the introductory tutorial Understanding DOM (developerWorks, July 2003).
For more on bootstrapping with DOM, read this series of tips by Brett McLaughlin:
- Part 1 explains what bootstrapping is, explores the problems associated with it, and lays out the basics for use in DOM Levels 1 and 2 (developerWorks, November 2002).
- Part 2 builds on Part 1 by showing you a better way to bootstrap in your DOM applications (developerWorks, December 2002).
- Part 3 explains the changes to DOM Level 3 that relate to bootstrapping, and how they improve upon DOM Levels 1 and 2 (developerWorks, December 2002).
Stop by the popular XML and Java technology forum here on developerWorks, hosted by Brett McLaughlin; it's an open and honest environment where all things XML and Java can be discussed.

About the authors
Arnaud Le Hors is a Senior Software Engineer at IBM, and is part of the XML Standards Strategy Group. He represents IBM in various W3C Working Groups, such as XML Core and DOM. He is one of the editors of the DOM Level 1, 2, and 3 Core specifications. Arnaud is also one of the developers of Xerces and one of the designers of Xerces2. You can reach him at lehors@us.ibm.com.

Elena Litani is a Staff Software Developer at the IBM Toronto Lab. She is one of the lead developers of Xerces2. For the last two years, Elena has been representing IBM in the W3C DOM Working Group. You can reach her at elitani@ca.ibm.com.
