Extracting data from XML documents
Tip: Elements and text in ContentHandler
Brett McLaughlin (brett@oreilly.com), Author, O'Reilly and Associates
14 August 2003
With a solid understanding of the SAX
ContentHandler interface (which you can obtain by reading
my previous tips), you are ready to perform useful tasks with SAX. The
most common task, of course, is obtaining the textual content of a
specific element, and then doing something with that data. This tip
details that process, from locating a certain element to reading its
data.
At this point, you should at least be comfortable with the mechanics of
SAX and the ContentHandler interface. You've seen how events
in a document parse are associated with specific callback methods in this
handler, and how insertion of code in those callbacks is the means by
which a SAX programmer interacts with XML data. However, understanding
theory is hardly enough to write a useful program. To make this theory
practical, this tip will demonstrate some realistic uses of SAX; I'll
focus primarily on elements and textual data, as these are the most common
use-cases of XML.
The first step in dealing with any element's content is simply locating
the element in the XML. Since SAX is going to report each element as it
finds it, this generally means implementing some simple string matching
code in the startElement() method. For example, if you want
to locate an element called myElement, you might have a
comparison like that shown in Listing 1.
Listing 1. Finding the myElement element
    public void startElement(String uri, String localName,
                             String qName, Attributes atts)
            throws SAXException {
        if (localName.equals("myElement")) {
            // Perform business-specific logic for myElement
        } else {
            // Perform business-specific logic for all other elements
        }
    }
This is pretty simple, and nothing you couldn't figure out on your own
with a little experimentation. However, you need to be very careful when
searching for elements in namespaced documents. To illustrate, consider
the XML shown in Listing 2.
Listing 2. A tricky namespace document
    <po:purchaseOrder xmlns:po="http://www.po.com">
      <po:order>
        <po:item id="11-489-09" qty="500">
          <po:name>Aiwa Micro Compact System</po:name>
          <po:manufacturerInfo>
            <mn:name xmlns:mn="http://www.po.com/manufacturers"
                     po:manufacturerId="98001">
              Aiwa
            </mn:name>
            <mn:stock id="XR-M191" />
          </po:manufacturerInfo>
        </po:item>
      </po:order>
    </po:purchaseOrder>
This document is a partially contrived purchase order for a compact
disc/tape player from the Aiwa corporation. The purchase order is in the
namespace identified by the URI http://www.po.com, but it
also includes manufacturer information in the namespace http://www.po.com/manufacturers.
This is a good way to separate groups of data and avoid naming
conflicts; for example, two elements in the document are named
name, but each belongs to a different namespace.
The issue you need to be careful about concerns how you write your SAX
startElement() code. Suppose you want to find out the name of
the item ordered. This would seem simple enough, but can cause some tricky
problems. Re-examine the code shown in Listing 1, and you
should see a big gotcha -- both elements named
name will be picked up by that version of
startElement(), since both have the same local name
(name). So in namespaced documents, you almost always need to
perform two string comparisons, as shown in Listing 3.
Listing 3. Finding the po:name element
    private static final String PO_NAMESPACE_URI = "http://www.po.com";

    public void startElement(String uri, String localName,
                             String qName, Attributes atts)
            throws SAXException {
        if ((localName.equals("name")) && (uri.equals(PO_NAMESPACE_URI))) {
            // Perform business-specific logic for po:name
        } else {
            // Perform business-specific logic for all other elements
        }
    }
The first, and most obvious, change in this code is a new check for the
PO namespace URI in addition to a check on the element's local name. Also,
be sure that you compare on the namespace URI, not the prefix. Checking
for a match on the prefix po will always fail, as the prefix isn't
reported on its own (it appears only inside the qName parameter, and
parsing it out of there is a hack, at best; a different document can bind
any prefix it likes to the same URI). Another thing to notice is that I use a
constant for the URI to compare to. Since this URI will probably be used
for comparison in multiple places, a single static final String keeps the
value in one place, so a typo or a changed URI only has to be fixed once.
Don't expect a memory saving from this, though: identical string literals
such as "http://www.po.com" are already shared by the JVM, so repeating
uri.equals("http://www.po.com") costs readability rather than heap space.
Finally, notice that I always compare the local name first, and
the namespace URI second. You'll almost always find fewer elements with
the same name than elements in the same namespace, so the most restrictive
comparison is performed first; because && short-circuits, the second
comparison is skipped for as many elements as possible.
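To see the difference between the two tests, here is a minimal sketch that runs both against a trimmed-down copy of the Listing 2 document. The class name NameMatchDemo and the helper match() are hypothetical names introduced for illustration. One assumption worth calling out: the sketch obtains its parser through JAXP's SAXParserFactory, which is not namespace-aware by default, so setNamespaceAware(true) is needed before uri and localName carry useful values.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class NameMatchDemo {
    static final String PO_NAMESPACE_URI = "http://www.po.com";

    // Trimmed-down version of the purchase order from Listing 2
    static final String XML =
        "<po:purchaseOrder xmlns:po=\"http://www.po.com\">"
      + "<po:item><po:name>Aiwa Micro Compact System</po:name>"
      + "<po:manufacturerInfo>"
      + "<mn:name xmlns:mn=\"http://www.po.com/manufacturers\">Aiwa</mn:name>"
      + "</po:manufacturerInfo></po:item></po:purchaseOrder>";

    // Returns the namespace URI of every start tag accepted by the chosen test.
    // checkNamespace == false mimics Listing 1; true mimics Listing 3.
    static List<String> match(final boolean checkNamespace) {
        final List<String> hits = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            factory.newSAXParser().parse(new InputSource(new StringReader(XML)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        boolean nameMatches = localName.equals("name");
                        boolean uriMatches = !checkNamespace || uri.equals(PO_NAMESPACE_URI);
                        if (nameMatches && uriMatches) {
                            hits.add(uri);
                        }
                    }
                });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println("local name only:  " + match(false));
        System.out.println("local name + URI: " + match(true));
    }
}
```

With the local-name test alone, both name elements are reported; adding the URI test leaves only the one in the purchase order namespace.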
Now you need to be able to pull the textual value out for an element.
This is simple, but must be done in a non-traditional way. You can't
simply call element.getTextValue() -- in fact, you must work
across three methods! First, locate the element you want, using
startElement() code as you've already seen. Then, you must
grab all the textual content from that element in
characters(). However, beware: This callback may be triggered
multiple times for a single piece of textual content. So the text "Aiwa
Corporation" might be reported as one string of characters through one
invocation of characters(), or as "Aiwa" to one invocation
and " Corporation" to another, or in any of an almost infinite variety of
other ways that involve more than one invocation of
characters(). Because you can't be sure this method will be
called only once, you have to perform a little character management, as
Listing 4 shows.
Listing 4. Catching character content
    private static final String PO_NAMESPACE_URI = "http://www.po.com";
    private StringBuffer elementContent = new StringBuffer();

    public void startElement(String uri, String localName,
                             String qName, Attributes atts)
            throws SAXException {
        if ((localName.equals("name")) && (uri.equals(PO_NAMESPACE_URI))) {
            // Perform business-specific logic for po:name
            // Clear the current character content buffer
            // (StringBuffer has no clear() method; truncate it instead)
            elementContent.setLength(0);
        } else {
            // Perform business-specific logic for all other elements
        }
    }

    public void characters(char[] ch, int start, int len) throws SAXException {
        // Append only the reported slice of the parser's buffer
        elementContent.append(ch, start, len);
    }
The first step here is to add a new member variable, a
StringBuffer called elementContent. You could
use a String, but as advanced Java programmers, you all know
that repeated string concatenation is bad, right? Each concatenation
creates a new String object, so instead you need a
construct that can be appended to without that overhead.
Then, you clear this buffer when you hit the desired element, removing any
content left over from previous elements or callbacks. Finally, every
time content is reported through characters(), you add it to
the buffer. Sometimes, the buffer may only have one piece of content
appended (the entire element's textual content); other times, this
appending may happen four or five times. In either case, your code covers
you and ensures that you get all the content you're looking for.
As you may have noticed, though, something is still missing -- it's
never clear when you actually have all the content you want, and when you
can do something with that content. To handle this, you need to use the
endElement() callback, which informs you when the
element you are targeting for data extraction is closed. Adding some code
like that shown in Listing 5 takes care of this clean-up.
Listing 5. Closing the element loop
    private static final String PO_NAMESPACE_URI = "http://www.po.com";
    private StringBuffer elementContent = new StringBuffer();
    private String elementData;

    public void startElement(String uri, String localName,
                             String qName, Attributes atts)
            throws SAXException {
        if ((localName.equals("name")) && (uri.equals(PO_NAMESPACE_URI))) {
            // Perform business-specific logic for po:name
            // Clear the current character content buffer
            // (StringBuffer has no clear() method; truncate it instead)
            elementContent.setLength(0);
        } else {
            // Perform business-specific logic for all other elements
        }
    }

    public void characters(char[] ch, int start, int len) throws SAXException {
        // Append only the reported slice of the parser's buffer
        elementContent.append(ch, start, len);
    }

    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if ((localName.equals("name")) && (uri.equals(PO_NAMESPACE_URI))) {
            // We're done
            elementData = elementContent.toString();
            // Do something with this data
        }
    }
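The three-callback pattern can be exercised without a parser at all, which makes the chunking behavior easy to observe. The following is a sketch under a few assumptions: the class name NameTextHandler, the accessor getElementData(), and the helper replayInTwoChunks() are hypothetical, and StringBuilder (available since Java 5) is swapped in for the StringBuffer used in the listings because a handler is normally driven by a single thread.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class NameTextHandler extends DefaultHandler {
    private static final String PO_NAMESPACE_URI = "http://www.po.com";
    private final StringBuilder elementContent = new StringBuilder();
    private String elementData;

    private static boolean isPoName(String uri, String localName) {
        return localName.equals("name") && uri.equals(PO_NAMESPACE_URI);
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if (isPoName(uri, localName)) {
            elementContent.setLength(0); // discard text from earlier elements
        }
    }

    @Override
    public void characters(char[] ch, int start, int len) {
        elementContent.append(ch, start, len); // may run several times per text node
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isPoName(uri, localName)) {
            elementData = elementContent.toString(); // content is complete here
        }
    }

    String getElementData() {
        return elementData;
    }

    // Simulates a parser that delivers one text node as two characters() calls
    static String replayInTwoChunks(String first, String second) {
        NameTextHandler handler = new NameTextHandler();
        handler.startElement(PO_NAMESPACE_URI, "name", "po:name", new AttributesImpl());
        handler.characters(first.toCharArray(), 0, first.length());
        handler.characters(second.toCharArray(), 0, second.length());
        handler.endElement(PO_NAMESPACE_URI, "name", "po:name");
        return handler.getElementData();
    }

    public static void main(String[] args) {
        System.out.println(replayInTwoChunks("Aiwa", " Corporation"));
    }
}
```

Because the buffer is only read in endElement(), the result is the same whether the text arrives in one call or several.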
This should seem pretty obvious -- when the element is closed, you've
got all the textual data you want, and can go about the business of using
that data. However, let me warn you of two very important use-cases where
this code will either utterly fail, or work great while reporting
completely incorrect results:
- The element with desired content appears multiple times
- The element with desired content has mixed content (both
textual content and other nested elements)
The first case, in which an element appears multiple times, isn't too
hard to deal with. If you are only using the element's content
temporarily, such as in the body of endElement(), this isn't
an issue; your business code will get triggered each and every time that
element is encountered, each time with the correct data. Since you were
looking ahead and cleared the buffer in startElement(), you
don't have to worry about overlapping data. However, if you are trying to
save the textual content in a storage medium like a Map, you
might end up overwriting data from early elements with data from later
elements (all having the same name), which is a nasty bug to track down. I
recommend that you use SAX as a fire-and-forget mechanism, and not build
up data structures like this in the first place -- so in that case this
becomes a non-issue. Still, it's something to watch out for!
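One way to read the fire-and-forget advice is to hand each value to a callback the moment its element closes, so nothing keyed by element name can be overwritten. The sketch below assumes Java 8 or later for Consumer; the class name NameStreamDemo, the method forEachName(), and the second item in the sample document are all invented for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class NameStreamDemo {
    static final String PO_NAMESPACE_URI = "http://www.po.com";

    // Hypothetical order with two items, so po:name occurs twice
    static final String SAMPLE =
        "<po:order xmlns:po=\"http://www.po.com\">"
      + "<po:item><po:name>Aiwa Micro Compact System</po:name></po:item>"
      + "<po:item><po:name>Aiwa Mini Speaker Set</po:name></po:item>"
      + "</po:order>";

    // Calls sink once per po:name element, in document order
    static void forEachName(String xml, final Consumer<String> sink) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    private final StringBuilder text = new StringBuilder();

                    private boolean isPoName(String uri, String localName) {
                        return localName.equals("name") && uri.equals(PO_NAMESPACE_URI);
                    }

                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        if (isPoName(uri, localName)) {
                            text.setLength(0);
                        }
                    }

                    @Override
                    public void characters(char[] ch, int start, int len) {
                        text.append(ch, start, len);
                    }

                    @Override
                    public void endElement(String uri, String localName, String qName) {
                        if (isPoName(uri, localName)) {
                            sink.accept(text.toString()); // use it now, keep nothing
                        }
                    }
                });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        forEachName(SAMPLE, System.out::println);
    }
}
```

If the values really must be stored, the caller can pass a sink that appends to a list, which keeps every occurrence instead of only the last one.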
The second case is a little trickier, and most common when working with
HTML or XHTML. Suppose you have content like this:
<p>The quick <b>red fox <i>jumps</i></b> over the lazy brown dog.</p>
Further suppose that you want the textual content of the bold element
(b). In this case, you're going to have to decide exactly
what content you want. In the current code, you are going to get a string
like this: red fox jumps. That may be exactly what you want;
if so, great. Notice, though, that this includes the textual content for
the target element, as well as textual content for its child elements. You
may find yourself in a situation where you want only the textual content
of the target element, and would rather omit all nested elements' textual
content. In these cases (which are a bit rare, admittedly), you are going
to need to be a little craftier in your code, a la Listing 6.
Listing 6. Keeping only content for a specific element
    private static final String PO_NAMESPACE_URI = "http://www.po.com";
    private StringBuffer elementContent = new StringBuffer();
    private String elementData;
    private boolean inElement = false;
    private int nestedElements = 0;

    public void startElement(String uri, String localName,
                             String qName, Attributes atts)
            throws SAXException {
        if ((localName.equals("name")) && (uri.equals(PO_NAMESPACE_URI))) {
            // Perform business-specific logic for po:name
            // Clear the current character content buffer
            // (StringBuffer has no clear() method; truncate it instead)
            elementContent.setLength(0);
            inElement = true;
        } else {
            // Perform business-specific logic for all other elements
            // Ensure we don't pick up content for other elements
            if (inElement) {
                nestedElements++;
            }
        }
    }

    public void characters(char[] ch, int start, int len) throws SAXException {
        // Only get content if we're directly inside the target element
        if (inElement && (nestedElements == 0)) {
            elementContent.append(ch, start, len);
        }
    }

    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if ((localName.equals("name")) && (uri.equals(PO_NAMESPACE_URI))) {
            // We're done
            elementData = elementContent.toString();
            inElement = false;
            // Do something with this data
        } else {
            // Remove one from the nested element count, if appropriate
            if (inElement) {
                nestedElements--;
            }
        }
    }
This version of the code adds a boolean variable,
inElement, which ensures that textual content is only picked
up specifically for the element being dealt with. First, that variable is
set whenever the start of the target element is reached. However, you have
to account for nested elements -- thus the counter
nestedElements, which starts at 0 (for no nested elements).
If startElement() is called on a nested element, the count is
incremented; when that element is closed off (through
endElement()), the count is decremented again. Only when the
count is back at 0 is it safe to gather textual content. This is a
bit of a tricky solution, but then again, the problem isn't a trivial one.
Thankfully, it is a rare one, so you won't have to mess with this sort of
code very often.
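To make the two outcomes concrete, here is a sketch that applies the same flag-and-counter idea to the b element from the paragraph example rather than to po:name. The class name MixedContentDemo, the method textOfB(), and the directOnly switch are hypothetical; the elements in this fragment have no namespace, so only the local name is compared.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class MixedContentDemo {
    static final String XML =
        "<p>The quick <b>red fox <i>jumps</i></b> over the lazy brown dog.</p>";

    // directOnly == true skips text that belongs to elements nested inside <b>
    static String textOfB(final boolean directOnly) {
        final StringBuilder text = new StringBuilder();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(XML)),
                new DefaultHandler() {
                    private boolean inElement = false;
                    private int nestedElements = 0;

                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        if (localName.equals("b")) {
                            text.setLength(0);
                            inElement = true;
                        } else if (inElement) {
                            nestedElements++; // entering a child of <b>
                        }
                    }

                    @Override
                    public void characters(char[] ch, int start, int len) {
                        if (inElement && (!directOnly || nestedElements == 0)) {
                            text.append(ch, start, len);
                        }
                    }

                    @Override
                    public void endElement(String uri, String localName, String qName) {
                        if (localName.equals("b")) {
                            inElement = false;
                        } else if (inElement) {
                            nestedElements--; // leaving a child of <b>
                        }
                    }
                });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + textOfB(false) + "]");
        System.out.println("[" + textOfB(true) + "]");
    }
}
```

The first call yields "red fox jumps" and the second yields "red fox " (note the trailing space, which belongs to b itself). The sketch does not handle a b nested inside another b; that would need a depth count for the target element as well.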
At this point, I've exhausted the most common applications of the
ContentHandler interface. Rather than delving into its less
commonly-used functions in the next tips, I'll continue with a look at the
major facets of XML. While I may examine the nooks and crannies of SAX in
tips much further down the line, I'm trying to ground you in SAX and give
you the most commonly-used tools, rather than bore you with esoterica.
Along those lines, then, I'll look at the ErrorHandler
interface in the next tip, and explain how it can add error handling and
reporting capabilities to your XML processing with SAX. Until then, I'll
see you on the newsgroups and online.
Resources
- Read Brett McLaughlin's previous tip, "Get the most from
ContentHandlers" (developerWorks, July
2003).
- In his Working
XML column "Building a compiler for
the SAX ContentHandler", Benoit Marchal begins a series on how to
automate the creation of SAX ContentHandler (developerWorks,
November 2001).
- Get the nitty-gritty details in the XML specification, online at the
W3C.
- Learn even more about SAX with the "Understanding SAX"
tutorial, which demonstrates how to use SAX to retrieve, manipulate, and
output XML data (developerWorks, July
2003).
- Want more on how XML and Java technologies interact? Visit the XML and Java
technology forum hosted by XML/Java technology innovator Brett
McLaughlin.
- Check out XML annotated on XML.com.
- Check out the SAX Project home page.
- See the SAX-standardized features and properties
list.
- Supplement your skills with Java and XML by
Brett McLaughlin (O'Reilly and Associates).
- Find more XML resources on the developerWorks XML zone. For a
complete list of XML tips to date, check out the tips summary
page.
- Check out IBM WebSphere Studio Site
Developer, a robust, easy-to-use development environment for
creating, building, and maintaining dynamic Web sites, applications, and
Web services.
- Find out how you can become an IBM Certified Developer in
XML and related technologies.
About the author
Brett McLaughlin
has been working in computers since the Logo days (Remember the
little triangle?). He currently specializes in building application
infrastructure using Java-related technologies. He has spent the
last several years implementing these infrastructures at Nextel
Communications and Allegiance Telecom, Inc. Brett is one of the
co-founders of the Java Apache project Turbine, which builds a
reusable component architecture for Web application development
using Java servlets. He is also a contributor to the EJBoss project,
an open source EJB application server, and Cocoon, an open source
XML Web-publishing engine.