Discover key features of DOM Level 3 Core, Part 2
Bootstrapping, mapping to the XML Infoset, accessing type
information, and working with Xerces
Arnaud Le Hors (lehors@us.ibm.com), Senior Software Engineer, IBM
Elena Litani (elitani@ca.ibm.com), Staff Software Developer, IBM
26 August 2003
In this two-part article, the authors present some of the
key features brought by the W3C Document Object Model (DOM) Level 3 Core
Working Draft and show you how to use them with examples in Java code.
This second part covers operations on the document and access to
type information, and introduces you to the early implementation of this
API in the Apache Xerces2 project.
In Part 1, we
showed you a set of DOM Level 3 Core features you can use when working
with nodes. We will now describe the mapping of the DOM data model to the
XML Infoset and how to remove implementation-dependent code from your
application with the so-called DOM bootstrapping
mechanism. Then we will show how to revalidate a DOM tree in memory so that
you can check whether it still complies with your schema, describe how to
access element and attribute type information, and show you how to use all
this cool stuff in Xerces.
Mapping to the Infoset
One of the important tasks that was accomplished for
the DOM Level 3 is the alignment of the DOM data model with the XML
Information Set (Infoset) through the addition of new methods to query
missing XML Infoset information. For example, you can now query and modify
the information stored in an XML declaration, such as
version, standalone, and encoding,
through the Document interface (which is mapped to the
Infoset document information item). Similarly, the base URI and
declaration base URI properties are computed according to XML Base and are
available on the Node interface. You can also retrieve the
XML Infoset element content whitespace property. This is the property that
indicates whether a Text node only contains whitespace that
is ignorable. You can retrieve it through the Text interface
(which maps to the XML Infoset character information item). Listing 1
shows the actual method signatures of the interface in the Java language
binding.
Listing 1. Method signatures in Java language binding
// XML Declaration information on
// the org.w3c.dom.Document interface
public String getXmlEncoding();
public void setXmlEncoding(String xmlEncoding);
public boolean getXmlStandalone();
public void setXmlStandalone(boolean xmlStandalone)
throws DOMException;
public String getXmlVersion();
public void setXmlVersion(String xmlVersion)
throws DOMException;
// element content whitespace property on the Text
// interface
public boolean isWhitespaceInElementContent();
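To see these accessors in action, here is a minimal sketch that reads the XML declaration of a parsed document. It assumes the org.w3c.dom interfaces bundled with current Java SE, which follow the finalized DOM Level 3 Core; some names there can differ from the Working Draft signatures in Listing 1. The class name XmlDeclarationDemo and the sample document are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class XmlDeclarationDemo {

    /** Parses the XML text (encoded as UTF-8 bytes) and summarizes its XML declaration. */
    static String describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // These three getters expose the XML declaration on the Document node.
            return "version=" + doc.getXmlVersion()
                    + " encoding=" + doc.getXmlEncoding()
                    + " standalone=" + doc.getXmlStandalone();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(
                "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><root/>"));
    }
}
```

The sample is passed as a byte stream rather than a character stream so that the parser, not the caller, decides how to decode it.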
You can also retrieve the value of the attribute type property of an
attribute information item -- this is the type of an attribute -- through
the schemaTypeInfo attribute of the Attr
interface. This is further detailed in a section below.
In addition, a new feature is provided to put the Document
back in a form closest to the XML Infoset, since various editing
operations, such as insertion or deletion of nodes, often leave you with a
document that is further from the XML Infoset than it might be. This is
one of the results you can obtain from an operation called document
normalization, which we describe in the section Document normalization.
Finally, the new Appendix C provides the mappings between the XML
Infoset model and the DOM where each XML Infoset information item is
mapped to its respective Node , and vice-versa, and each
property of an information item is mapped to its respective
Node attribute. This appendix should give you a good overview
of the DOM data model and show you how to access the information you are
looking for.
Bootstrapping
Previous versions of the DOM
specification did not provide any way to bootstrap DOM implementations;
therefore, in your applications you had to start with
implementation-dependent code. The DOM Level 3 Core specification defines
a DOMImplementationRegistry object that lets you find
implementations based on the set of features you need. For instance, you
can ask for an implementation that supports mutation events. Listing 2
shows how you can use the bootstrapping mechanism in your application to
find the appropriate implementation.
Listing 2. Using bootstrapping to find an implementation
// set the DOMImplementationRegistry.PROPERTY property
// to reference all known DOM implementations
System.setProperty(DOMImplementationRegistry.PROPERTY,
"org.apache.xerces.dom.DOMImplementationSourceImpl");
// get an instance of DOMImplementationRegistry
DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
// get a DOM implementation that supports the specified features
// ("MutationEvents" is the feature string defined by DOM Level 2 Events)
DOMImplementation i = registry.getDOMImplementation("MutationEvents");
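The same lookup can be sketched against the registry that ships with current Java SE, where the class lives in the org.w3c.dom.bootstrap package of the finalized API. This is an illustration under that assumption, not the Xerces 2.4.0 setup described in this article; the class name BootstrapDemo and the choice of the "Core" feature are ours.

```java
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;

final class BootstrapDemo {

    /** Looks up an implementation by feature string; returns null if none matches. */
    static DOMImplementation find(String features) {
        try {
            // No implementation class is named here: the registry locates one.
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            return registry.getDOMImplementation(features);
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalStateException("no usable DOMImplementationSource", e);
        }
    }

    /** Creates an empty document with the given root element name. */
    static Document newDocument(String rootName) {
        DOMImplementation impl = find("Core");
        if (impl == null) {
            throw new IllegalStateException("no implementation supports the Core feature");
        }
        return impl.createDocument(null, rootName, null);
    }

    public static void main(String[] args) {
        System.out.println(newDocument("root").getDocumentElement().getNodeName());
    }
}
```

Because the application only names features, swapping in a lighter or richer implementation becomes a deployment decision rather than a code change.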
This has numerous advantages. It not only makes your code independent
of the implementation, but it also allows DOM implementers to provide you
with implementations that may better suit your needs. This can result in
better performance for your application. For instance, Xerces has more
than one implementation: One is full-featured and supports many optional
modules of the DOM; another is minimal and only supports the core
functionality with lighter objects. If you don't need support for mutation
events, why should you pay the price of creating objects that carry the
weight of such a feature? With the bootstrapping mechanism, you can use
the most appropriate implementation for your application.
Document normalization
One of the new methods defined in DOM Level 3
is the normalizeDocument method on the Document
interface. As the name implies, you can use this method to normalize the
document. By default, this method does the following:
- Normalizes Text nodes, consolidating adjacent Text nodes into a
single Text node
- Updates the content of EntityReference nodes according to the
entities they reference
- Verifies and fixes namespace information in the document, making it
namespace well-formed
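The first of these default behaviors can be sketched as follows. The example assumes the DOM implementation bundled with current Java SE; the class name TextMergeDemo and the sample content are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class TextMergeDemo {

    /** Builds an element with two adjacent Text nodes, normalizes, and returns it. */
    static Element normalizedGreeting() {
        Document doc = newDocument();
        // Use the namespace-aware factory method, as normalization expects.
        Element root = doc.createElementNS(null, "greeting");
        doc.appendChild(root);
        root.appendChild(doc.createTextNode("Hello, "));
        root.appendChild(doc.createTextNode("world"));
        // Default configuration: adjacent Text nodes are consolidated.
        doc.normalizeDocument();
        return root;
    }

    private static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM builder available", e);
        }
    }

    public static void main(String[] args) {
        Element root = normalizedGreeting();
        System.out.println(root.getChildNodes().getLength() + " child: "
                + root.getFirstChild().getNodeValue());
    }
}
```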
It is important to note that the namespace normalization algorithm
(defined in Appendix B) used in this method only works with namespace-aware nodes --
nodes created using methods with an "NS" suffix, such as
createElementNS. Namespace-unaware nodes
-- nodes created with the DOM Level 1 methods, such as
createElement -- are not fully compatible with any processing
that depends on XML Namespaces. If you have DOM Level 1 nodes in your
document, normalizeDocument will fail and report an error
when trying to perform namespace normalization. In general, you should not
create nodes with DOM Level 1 methods if you want to use XML Namespaces
and perform any operation on your document that requires XML Namespaces
support. This is also true for other operations, such as revalidating the
document in memory against an XML Schema.
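As an illustration of namespace fix-up on a namespace-aware node, the following sketch creates an element whose prefix has no declaration and lets normalizeDocument add one. It assumes the DOM implementation bundled with current Java SE; the class name NamespaceFixupDemo and the namespace URI urn:example:items are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class NamespaceFixupDemo {

    /** Namespace reserved for xmlns declaration attributes. */
    static final String XMLNS_NS = "http://www.w3.org/2000/xmlns/";

    /** Returns the value of xmlns:p on the element after normalization. */
    static String declaredUriAfterNormalize() {
        Document doc = newDocument();
        // Namespace-aware creation: the prefix "p" is bound in the node itself,
        // but no xmlns:p attribute exists in the tree yet.
        Element item = doc.createElementNS("urn:example:items", "p:item");
        doc.appendChild(item);
        doc.normalizeDocument();
        return item.getAttributeNS(XMLNS_NS, "p");
    }

    private static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM builder available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("xmlns:p = " + declaredUriAfterNormalize());
    }
}
```

Had the element been created with createElement instead, there would be no namespace information for the algorithm to work from, which is the failure case described above.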
You can also configure normalizeDocument, through
DOMConfiguration, to perform other operations on your
document. For example, you can use this method to get rid of comments, to
transform CDATASection nodes into Text nodes, or
to discard all the namespace declaration attributes from the tree. You can
also use it to easily get your document into a form that naturally maps to
the XML Infoset by doing all of the above at once. Listing 3 shows you how
to use Document.config to control normalizeDocument.
Listing 3. Using Document.config to control normalizeDocument
// retrieve document configuration
DOMConfiguration config = document.getConfig();
// remove comments from the document
config.setParameter("comments", false);
// remove namespace declarations
config.setParameter("namespace-declarations", false);
// transform document
document.normalizeDocument();
// put document into a form closest to the XML Infoset
config.setParameter("infoset", true);
// transform document
document.normalizeDocument();
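A self-contained variant of the comment-stripping step might look like the sketch below. It assumes the finalized API bundled with current Java SE, where the configuration is obtained with getDomConfig() rather than the draft's getConfig(); the class name StripCommentsDemo and the sample content are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class StripCommentsDemo {

    /** Normalizes a small document and counts the comments left under its root. */
    static int commentCount(boolean keepComments) {
        Document doc = newDocument();
        Element root = doc.createElementNS(null, "note");
        doc.appendChild(root);
        root.appendChild(doc.createComment("internal remark"));
        root.appendChild(doc.createTextNode("visible text"));

        DOMConfiguration config = doc.getDomConfig();
        // The boolean is autoboxed to the Boolean value the parameter expects.
        config.setParameter("comments", keepComments);
        doc.normalizeDocument();

        int count = 0;
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.COMMENT_NODE) {
                count++;
            }
        }
        return count;
    }

    private static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM builder available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("kept: " + commentCount(true) + ", stripped: " + commentCount(false));
    }
}
```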
The normalizeDocument method also allows you to revalidate
your document in memory with respect to its XML Schema or DTD. In the
past, to revalidate your document once it had been modified you had to
save it to a file and read it back with a validating parser. Using this
new method, you can now do this much more efficiently by having the DOM
implementation revalidate your document in memory. To do this, you first
need to set the validate parameter of the
DOMConfiguration to true. You then need to
implement a DOMErrorHandler object, to which validation
errors will be reported, and register it with the Document
using the error-handler parameter. This is very similar to
what you would do with a SAX parser. Finally, you can check whether your
document is valid by calling normalizeDocument. Later in this
article, we show how you can do that using Xerces.
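Registering an error handler can be sketched as follows. To stay self-contained, this example does not set the validate parameter or supply a schema; instead it inserts a DOM Level 1 element, for which, as noted earlier, namespace normalization is expected to report an error. Whether and how a particular implementation reports it is not guaranteed by this sketch. The class name ErrorCollectorDemo is hypothetical, and the API used is the finalized one bundled with current Java SE.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMErrorHandler;
import org.w3c.dom.Document;

final class ErrorCollectorDemo {

    /** Normalizes a document and returns whatever the error handler received. */
    static List<String> normalizeAndCollect() {
        Document doc = newDocument();
        // createElement is a DOM Level 1 method: the node is namespace-unaware.
        doc.appendChild(doc.createElement("legacy"));

        List<String> messages = new ArrayList<>();
        DOMErrorHandler handler = error -> {
            messages.add("severity " + error.getSeverity() + ": " + error.getMessage());
            return true; // ask the implementation to continue where it can
        };
        doc.getDomConfig().setParameter("error-handler", handler);

        try {
            doc.normalizeDocument();
        } catch (RuntimeException e) {
            // Assumption: some implementations may abort by throwing.
            messages.add("normalization aborted: " + e);
        }
        return messages;
    }

    private static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM builder available", e);
        }
    }

    public static void main(String[] args) {
        normalizeAndCollect().forEach(System.out::println);
    }
}
```

Collecting messages in a list, rather than printing from inside the handler, keeps the decision about what counts as a failure with the caller.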
Currently, no standard API exists for accessing the XML Schema
Post-Schema Validation Infoset (PSVI). However, DOM Level 3 allows you to
retrieve some PSVI information. For example, if you are interested in
getting the PSVI-normalized schema value property, setting the
datatype-normalization and validate parameters
to "true" on the DOMConfiguration and calling
normalizeDocument updates the tree with the XML Schema
normalized values -- which means the attribute values and element content
in your document will now represent the PSVI-normalized schema value
property.
Accessing type information
The previous versions of the DOM did not provide
any access to any type information; you had no way to get the type of an
attribute or element node in a document. As we mentioned above, this is
now possible in DOM Level 3 Core thanks to the introduction of a new
interface called TypeInfo. This interface represents a type
definition as a pair consisting of a name and a namespace URI. Depending
on the schema used to validate your document, what this type definition
corresponds to can vary.
If you use a DTD (at load time or with normalizeDocument),
TypeInfo on an attribute node represents the type of the
attribute. This is the attribute information item's attribute type
property in the XML Infoset. However, on an element node,
TypeInfo has null for name and null
for namespace URI because DTDs do not define element types.
Now, if you use an XML Schema to validate your document,
TypeInfo represents the type of the element on an element
node, and the type of the attribute on an attribute node. In fact,
TypeInfo also represents the PSVI type definition property
for the corresponding element and attribute information items.
Note that for this information to be available, the element or
attribute has to be valid with respect to the schema used. When the
validation fails, DOM implementations are encouraged to provide you with
the declared type to help you fix the document accordingly. Also, when the
type is anonymous, an implementation-specific unique name is returned to
you.
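The DTD case can be sketched with the finalized API bundled with current Java SE, where both Element and Attr expose getSchemaTypeInfo(). The class name TypeInfoDemo and the sample document are hypothetical, and the type name reported for the attribute depends on the parser in use.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.TypeInfo;
import org.xml.sax.InputSource;

final class TypeInfoDemo {

    /** A document whose internal DTD subset declares one ID attribute. */
    static final String SAMPLE =
            "<!DOCTYPE person [\n"
            + "  <!ELEMENT person EMPTY>\n"
            + "  <!ATTLIST person id ID #REQUIRED>\n"
            + "]>\n"
            + "<person id=\"p1\"/>";

    /** Reports the type names exposed for the root element and its id attribute. */
    static String describe() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(SAMPLE)));
            Element person = doc.getDocumentElement();
            Attr id = person.getAttributeNode("id");
            TypeInfo elementType = person.getSchemaTypeInfo();
            TypeInfo attrType = id.getSchemaTypeInfo();
            // With a DTD, the element has no type name; the attribute may have one.
            return "element " + person.getTagName() + ": " + elementType.getTypeName()
                    + "; attribute " + id.getName() + ": " + attrType.getTypeName();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```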
Using the DOM Level 3 API in Xerces2
The Apache Xerces2 parser 2.4.0 provides an early
implementation of DOM Level 3 Core. However, because the DOM Level 3 Core
specification is not yet a W3C Recommendation, the implementation is not
part of the default Xerces distribution. To use this functionality, you
need to either cast to the Xerces DOM implementation classes (such as
org.apache.xerces.dom.DocumentImpl) or build Xerces locally
using the "jars-dom3" target. This generates the
dom3-xml-apis.jar file that contains the DOM Level 3 API and
the dom3-xercesImpl.jar file that contains the implementation
of the API. To build Xerces, you need to either extract the source code
from CVS or download both the Xerces source and tools
distributions.
After you build Xerces with DOM Level 3 support, include the
newly generated jars in your CLASSPATH
(dom3-xml-apis.jar and dom3-xercesImpl.jar), and
you are ready to start programming using DOM Level 3.
If all you need is a DOM Level 3 Core implementation, you should
request the Xerces implementation that supports the "Core" or
"XML" features using the bootstrapping mechanism.
As we mentioned, this returns a DOM implementation that uses less memory,
but does not provide support for optional modules such as traversal.
As we also mentioned, DOM Level 3 introduces a mechanism that allows
revalidation of documents in memory. However, the current version of
Xerces (2.4.0) only supports revalidation against an XML Schema, not
against a DTD. Note that if a DOM implementation supports revalidation
against both XML Schema and DTD, and your document references different
kinds of schemas (such as a DTD and an XML Schema), it is then unclear
which one should be used for revalidation. To specify, for example, that
you want to revalidate against the XML Schema, you can either remove the
DocumentType node from the document, by retrieving the
children of the Document node and removing the
DocumentType child node, or you can set the
schema-type parameter of the DOMConfiguration.
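The first option, removing the DocumentType node, can be sketched with standard DOM calls. The example uses getDoctype() as a shortcut for walking the children of the Document node; the class name RemoveDoctypeDemo, the root name personnel, and the system identifier personal.dtd are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

final class RemoveDoctypeDemo {

    /** Builds a document that carries a DOCTYPE, then detaches it. */
    static Document withoutDoctype() {
        DOMImplementation impl = domImplementation();
        DocumentType doctype = impl.createDocumentType("personnel", null, "personal.dtd");
        Document doc = impl.createDocument(null, "personnel", doctype);

        // The DocumentType node is an ordinary child of the Document node.
        DocumentType current = doc.getDoctype();
        if (current != null) {
            doc.removeChild(current);
        }
        return doc;
    }

    private static DOMImplementation domImplementation() {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .getDOMImplementation();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM builder available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("doctype after removal: " + withoutDoctype().getDoctype());
    }
}
```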
You can associate an XML Schema with a document in two ways:
- Add an attribute to the documentElement (root element)
with the name xsi:schemaLocation or
xsi:noNamespaceSchemaLocation and the schema location as its value
- Set the DOMConfiguration schema-location
parameter to the location of the schema you want to use during
revalidation
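The attribute-based option can be sketched with setAttributeNS. XML Schema defines two such hint attributes in the http://www.w3.org/2001/XMLSchema-instance namespace: schemaLocation and noNamespaceSchemaLocation. This sketch uses the latter for a document without a target namespace; the class name SchemaHintDemo and the root name personnel are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class SchemaHintDemo {

    /** Namespace of the XML Schema instance attributes. */
    static final String XSI_NS = "http://www.w3.org/2001/XMLSchema-instance";

    /** Creates a root element that points at the given schema location. */
    static Element rootWithSchemaHint(String location) {
        Document doc = newDocument();
        Element root = doc.createElementNS(null, "personnel");
        doc.appendChild(root);
        // Namespace-aware setter, so the attribute is usable by schema processing.
        root.setAttributeNS(XSI_NS, "xsi:noNamespaceSchemaLocation", location);
        return root;
    }

    private static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM builder available", e);
        }
    }

    public static void main(String[] args) {
        Element root = rootWithSchemaHint("personal.xsd");
        System.out.println(root.getAttributeNS(XSI_NS, "noNamespaceSchemaLocation"));
    }
}
```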
Note that you should specify schema locations using absolute URIs. If
you decide to use a relative URI, it will be resolved relative to the
location of the document exposed through the documentURI
attribute of the Document interface. Alternatively, you can
implement and register a DOMEntityResolver (defined in the
DOM Level 3 Load and Save specification) and resolve relative URIs
yourself. Listing 4 shows you how to revalidate your document in
memory:
Listing 4. Revalidating in memory
// Retrieve configuration
DOMConfiguration config = document.getConfig();
// Set document base URI
document.setDocumentURI("file:///c:/data");
// Configure the normalizeDocument operation
config.setParameter("schema-type", "http://www.w3.org/2001/XMLSchema");
config.setParameter("validate", true);
config.setParameter("schema-location", "personal.xsd");
// Revalidate your document in memory
document.normalizeDocument();
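When a relative schema location is combined with a document URI, a trailing slash on the base matters. The sketch below shows generic URI resolution as implemented by java.net.URI; that a given DOM implementation resolves schema locations in exactly this way is an assumption. The class name SchemaLocationResolution is hypothetical.

```java
import java.net.URI;

final class SchemaLocationResolution {

    /** Resolves a schema location against a document URI and returns the resulting path. */
    static String resolvePath(String documentUri, String schemaLocation) {
        return URI.create(documentUri).resolve(schemaLocation).getPath();
    }

    public static void main(String[] args) {
        // Without a trailing slash, the last segment ("data") is replaced.
        System.out.println(resolvePath("file:///c:/data", "personal.xsd"));
        // With a trailing slash, the schema is looked up inside "data".
        System.out.println(resolvePath("file:///c:/data/", "personal.xsd"));
    }
}
```

Under this resolution rule, the first call yields /c:/personal.xsd and the second /c:/data/personal.xsd, which is one reason absolute schema locations are the safer choice.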
Conclusion
We've shown
you how the new features brought by DOM Level 3 Core can save you from
writing a lot of code and improve the performance of your application. The
less code you write, the less you have to maintain, the fewer bugs you're
responsible for, and the better off you'll be! We've also presented and
explained how to use several powerful new features, such as revalidation
in memory and access to type information -- something developers have been
requesting for a long time.
In short, DOM Level 3 Core ought to make your life easier -- especially
when combined with other modules such as DOM Load & Save -- and we
hope this article helps you take advantage of it.
Resources
- Part 1 of this series
covers operations on the node, such as renaming, moving nodes from one
document to another, setting text content, and so on (developerWorks, August
2003).
- Read about the DOM Level 2 Core W3C
Recommendation.
- To better understand the XML Infoset, read the XML Information Set
specification.
- Get familiar with the latest DOM Level 3 Core Last
Call draft.
- You can find out about other W3C specifications, such as XML Schemas
and Namespaces in XML, at the W3C's Technical Reports
and Publications page.
- Learn about the Xerces2 DOM
implementation.
- Download the latest Xerces-J
parser.
- Find more XML resources on the developerWorks XML
zone, including the introductory tutorial Understanding DOM (developerWorks, July
2003).
- For more on bootstrapping with DOM, read this series of tips by
Brett McLaughlin:
  - Part 1 explains what bootstrapping is, explores the problems associated
  with it, and lays out the basics for use in DOM Levels 1 and 2
  (developerWorks, November 2002).
  - Part 2 builds on Part 1 by showing you a better way to bootstrap in
  your DOM applications (developerWorks, December 2002).
  - Part 3 explains the changes to DOM Level 3 that relate to
  bootstrapping, and how they improve upon DOM Levels 1 and 2
  (developerWorks, December 2002).
- Stop by the popular XML and Java technology
forum here on developerWorks, hosted
by Brett McLaughlin; it's an open and honest environment where all
things XML and Java can be discussed.
- Check out IBM WebSphere Studio Site
Developer, a robust, easy-to-use development environment for
creating, building, and maintaining dynamic Web sites, applications, and
Web services.
- Find out how you can become an IBM Certified Developer in
XML and related technologies.
About the authors
Arnaud Le Hors is a Senior Software Engineer at
IBM, and is part of the XML Standards Strategy Group. He represents
IBM in various Working Groups of W3C, such as XML Core and DOM. He's
one of the editors of the DOM Level 1, 2, and 3 Core
specifications. Arnaud is also one of the developers of Xerces and
one of the designers of Xerces2. You can reach him at lehors@us.ibm.com.
Elena Litani is a Staff Software
Developer at the IBM Toronto Lab. She is one of the lead developers
of Xerces2. For the last two years, Elena has been representing IBM
in the W3C DOM Working Group. You can reach her at elitani@ca.ibm.com.