Extracting (more) data from XML documents
Brett McLaughlin (brett@oreilly.com), Author, O'Reilly and Associates
21 August 2003
The one aspect of data processing with
ContentHandler that the author didn't cover in his last tip
was attribute processing. While attributes are most commonly used for
information transfer between an XML document and an XML processor, they
also often contain valuable business data. In this tip, Brett shows you
how SAX handles attributes and reports them, as well as how you
can use code to extract attribute data.
If you've been following along with this series of tips, you may be
expecting to read about the SAX ErrorHandler interface --
that's what I promised at the end of the last tip, and that was certainly
my intention. However, I've received several requests and suggestions for
coverage of one last aspect of the ContentHandler interface,
which of course I've been discussing for several tips now. Since the
request was a good one, and involved another very common part of XML
processing, I thought it was worth dealing with now. (For those of you who
are just pining over error handling and the like, I hope you can hang on
until my next tip!)
The request, of course, was for XML attribute processing. After the
last several tips, I trust you're confident setting up, registering, and
using ContentHandlers, and that you have no problem locating
a specific element, or getting its textual content. What I left out,
though, was how to obtain an attribute's value. This turns out to be a
pretty simple process, so I'll breeze through it in this tip.
First, you need to locate the attribute (or attributes) that you want
the value for. To accomplish this, you should begin by figuring out which
element the attribute appears on. This can be done by looking at the XML
document you're interested in (Listing 1 shows a simple example), or by
browsing a DTD (shown in Listing 2) or XML Schema. All are valid
approaches -- pick the one you prefer.
Listing 1. A simple XML document
    <?xml version="1.0"?>
    <root>
      <some-element some-attribute="value">Some content in the element</some-element>
      <some-other-element>
        <child age="1" birthDate="06/02/2003">
          More content
        </child>
      </some-other-element>
    </root>
Listing 2. A simple XML DTD (for Listing 1)
    <!ELEMENT root (some-element*, some-other-element+)>
    <!ELEMENT some-element (#PCDATA)>
    <!ATTLIST some-element
        some-attribute CDATA #REQUIRED
    >
    <!ELEMENT some-other-element (child+)>
    <!ELEMENT child (#PCDATA)>
    <!ATTLIST child
        age       CDATA #REQUIRED
        birthDate CDATA #REQUIRED
    >
For the sake of this example, assume that you're looking for the
birthDate attribute. Whether you look at the XML document or
the DTD (or an XML Schema), you should be able to determine that the
birthDate attribute is attached to the child
element. So your first task is to locate that element. Of course, you
already know how to do that, so this is a piece of cake -- if you've
forgotten, Listing 3 is a quick refresher.
Listing 3. Finding the child element
    public void startElement (String uri, String localName,
                              String qName, Attributes atts)
        throws SAXException {
        if (localName.equals("child")) {
            // Deal with the attributes
        }
    }
Now, you are going to start working with a new SAX class:
Attributes. To be accurate, this is actually an interface,
and your parser vendor provides some type of implementation of this
interface. Either way, you're only going to be dealing with the public
interface methods, so don't worry about what goes on under the hood. To
get started, take a look at Listing 4, which is the
Attributes interface in all its glory.
Listing 4. The SAX Attributes interface
    package org.xml.sax;

    public interface Attributes
    {
        // Indexed access.
        public abstract int getLength ();
        public abstract String getURI (int index);
        public abstract String getLocalName (int index);
        public abstract String getQName (int index);
        public abstract String getType (int index);
        public abstract String getValue (int index);

        // Name-based queries
        public int getIndex (String uri, String localName);
        public int getIndex (String qName);
        public abstract String getType (String uri, String localName);
        public abstract String getType (String qName);
        public abstract String getValue (String uri, String localName);
        public abstract String getValue (String qName);
    }
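To get a feel for these methods without running a parser, you can fill in an AttributesImpl (the helper implementation in org.xml.sax.helpers) by hand and query it. The following is only a sketch: the class name AttributesTour and the method sampleChildAttributes() are hypothetical names made up for this illustration, and the attribute values are copied from the child element in Listing 1.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

public class AttributesTour {

    // Hypothetical helper: builds the attribute list of the sample
    // child element by hand instead of getting it from a parser.
    static Attributes sampleChildAttributes() {
        AttributesImpl atts = new AttributesImpl();
        // Arguments: namespace URI, local name, qualified name, type, value
        atts.addAttribute("", "age", "age", "CDATA", "1");
        atts.addAttribute("", "birthDate", "birthDate", "CDATA", "06/02/2003");
        return atts;
    }

    public static void main(String[] args) {
        Attributes atts = sampleChildAttributes();

        // Indexed access
        System.out.println(atts.getLength());              // 2
        System.out.println(atts.getQName(0));              // age

        // Name-based queries
        System.out.println(atts.getValue("birthDate"));    // 06/02/2003
        System.out.println(atts.getType("", "age"));       // CDATA

        // Lookups for an attribute that is not present
        System.out.println(atts.getIndex("missing"));      // -1
        System.out.println(atts.getValue("", "missing"));  // null
    }
}
```

Note the last two calls: a name-based lookup for an attribute that isn't there returns -1 from getIndex() and null from getValue(), so check for those before using the result.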
The interface should be pretty easy to understand; the rest of the tip isn't
going to be anything revelatory. As the comments in this interface
indicate, you can access an attribute by either its name or its index. If
you know the name of the attribute (as you do in the make-believe example
-- birthDate), I recommend using name-based queries, as shown
in Listing 5.
Listing 5. Finding the birthDate attribute
    public void startElement (String uri, String localName,
                              String qName, Attributes atts)
        throws SAXException {
        if (localName.equals("child")) {
            String childValue = atts.getValue("", "birthDate");
            // Do something with the value
        }
    }
Simple enough, right? Notice that I could have used the version that
takes a qName (getValue(String qName)), but I
generally prefer to pass in a URI and local name, just for
self-documentation. The attribute here is in no namespace, so an empty
string works just fine as the URI. I could have also used
getValue("birthDate") and gotten the same results.
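If you want to try this end to end, here is a minimal sketch that parses a trimmed-down version of the sample document from a string and pulls out the birthDate value. The class name BirthDateExample, the nested handler, and the findBirthDate() method are hypothetical names for this illustration. It also assumes you create the parser through JAXP's SAXParserFactory, which is not namespace-aware by default; the sketch turns namespace awareness on so that localName and the URI-plus-local-name lookup are populated.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BirthDateExample {

    // Hypothetical handler: remembers the birthDate of the last
    // child element that carries one.
    static class BirthDateHandler extends DefaultHandler {
        String birthDate;  // stays null if the attribute never appears

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if (localName.equals("child")) {
                // Name-based query; returns null when the attribute is absent
                String value = atts.getValue("", "birthDate");
                if (value != null) {
                    birthDate = value;
                }
            }
        }
    }

    static String findBirthDate(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Assumption: namespace processing is wanted, so localName is filled in
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();

        BirthDateHandler handler = new BirthDateHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.birthDate;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><some-other-element>"
                   + "<child age=\"1\" birthDate=\"06/02/2003\">More content</child>"
                   + "</some-other-element></root>";
        System.out.println(findBirthDate(xml));  // prints 06/02/2003
    }
}
```

Because getValue() returns null for a missing attribute, the handler checks for null before storing the value rather than assuming every child element has a birthDate.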
You can also use index-based access for your attribute work. This isn't
so common when you know the name of the attribute. In fact, it's downright
dangerous in those cases. The SAX specification doesn't guarantee that XML
attributes are going to be reported in the same order that they appear in
the document being processed. This means that even if you can visually
verify that a specific attribute appears second in the list of attributes
on an element, it won't necessarily be reported to the
startElement() method as the second in the attribute list. So
you really shouldn't rely on index-based access for a specific named
attribute.
That said, index-based access is still really useful. For example, it
allows you to check out all attributes, and then to get the name and value
for each. Consider the code in Listing 6, which does just that, all using
index-based access.
Listing 6. Inspecting all attributes for the child element
    public void startElement (String uri, String localName,
                              String qName, Attributes atts)
        throws SAXException {
        if (localName.equals("child")) {
            int numAtts = atts.getLength();
            for (int i=0; i<numAtts; i++) {
                // Use the loop counter as the attribute index
                String attName = atts.getQName(i);
                String value = atts.getValue(i);
                System.out.println(" * Attribute named " + attName +
                                   " found, with value '" + value + "'");
            }
        }
    }
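One way to combine index-based enumeration with the ordering caveat above is to copy each name and value into a map as you loop, and then look values up by name afterward. The sketch below does that for the child element. The class name ChildAttributeDump and the method childAttributes() are hypothetical, and the sketch assumes a namespace-aware parser obtained from JAXP's SAXParserFactory, which is why it reads getLocalName(i) rather than getQName(i). If the document holds several child elements, later ones overwrite earlier entries with the same name.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ChildAttributeDump {

    // Hypothetical helper: collects every attribute on child elements
    // into a name-sorted map, so the result does not depend on the
    // order in which the parser reports the attributes.
    static Map<String, String> childAttributes(String xml) throws Exception {
        final Map<String, String> found = new TreeMap<>();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (localName.equals("child")) {
                    int numAtts = atts.getLength();
                    for (int i = 0; i < numAtts; i++) {
                        found.put(atts.getLocalName(i), atts.getValue(i));
                    }
                }
            }
        };

        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Assumption: namespace processing is on, so local names are reported
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return found;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><some-other-element>"
                   + "<child age=\"1\" birthDate=\"06/02/2003\">More content</child>"
                   + "</some-other-element></root>";
        System.out.println(childAttributes(xml));  // prints {age=1, birthDate=06/02/2003}
    }
}
```

The TreeMap keeps the output sorted by attribute name, so the printed result is the same whatever order the parser uses.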
Well, I really am done with the ContentHandler interface
this time. In my next tip, I will indeed move on to
ErrorHandler, and see exactly how it can be used to handle
everything from a misplaced angle bracket to a missing attribute. I'll
also show you how a single parser (represented by an instance of an
XMLReader) can have multiple handlers registered to it. For
those of you who came to this article all jazzed up about error handling,
sorry! Until my next tip, I'll see you online. Let me know what you
think; as you can see from this article, it does indeed make
a difference.
About the author
Brett McLaughlin has been working in computers since the Logo days
(remember the little triangle?). He currently specializes in building
application infrastructure using Java-related technologies. He has spent
the last several years implementing these infrastructures at Nextel
Communications and Allegiance Telecom, Inc. Brett is one of the
co-founders of the Java Apache project Turbine, which builds a
reusable component architecture for Web application development
using Java servlets. He is also a contributor to the EJBoss project,
an open source EJB application server, and Cocoon, an open source
XML Web-publishing engine.