Tip: Elements and text in ContentHandler



Extracting data from XML documents

Level: Introductory

Brett McLaughlin (brett@oreilly.com)
Author, O'Reilly and Associates
14 August 2003

With a solid understanding of the SAX ContentHandler interface (which you can obtain by reading my previous tips), you are ready to perform useful tasks with SAX. The most common task, of course, is obtaining the textual content of a specific element, and then doing something with that data. This tip details that process, from locating a certain element to reading its data.

At this point, you should at least be comfortable with the mechanics of SAX and the ContentHandler interface. You've seen how events in a document parse are associated with specific callback methods in this handler, and how insertion of code in those callbacks is the means by which a SAX programmer interacts with XML data. However, understanding theory is hardly enough to write a useful program. To make this theory practical, this tip will demonstrate some realistic uses of SAX; I'll focus primarily on elements and textual data, as these are the most common use-cases of XML.

The first step in dealing with any element's content is simply locating the element in the XML. Since SAX is going to report each element as it finds it, this generally means implementing some simple string matching code in the startElement() method. For example, if you want to locate an element called myElement, you might have a comparison like that shown in Listing 1.

Listing 1. Finding the myElement element

public void startElement(String uri, String localName,
                         String qName, Attributes atts)
    throws SAXException {

  if (localName.equals("myElement")) {
    // Perform business-specific logic for myElement
  } else {
    // Perform business-specific logic for all other elements
  }
}
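
To try this comparison end to end, the handler has to be attached to a parser. The following is a minimal sketch, not part of the original listings: the class name ElementLocator, the helper countMatches(), and the sample XML are all made up for illustration. It assumes a JAXP parser, whose SAXParserFactory is not namespace-aware by default; without the setNamespaceAware(true) call, localName may be reported as an empty string and the comparison would never match.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; only the comparison in startElement() mirrors Listing 1.
class ElementLocator extends DefaultHandler {
    int matches = 0;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if (localName.equals("myElement")) {
            matches++;  // stand-in for business-specific logic
        }
    }

    // Hypothetical helper: parses the XML string and returns the match count.
    static int countMatches(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // JAXP factories default to namespace-unaware parsing.
            factory.setNamespaceAware(true);
            ElementLocator handler = new ElementLocator();
            factory.newSAXParser()
                   .parse(new InputSource(new StringReader(xml)), handler);
            return handler.matches;
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><myElement/><other/><myElement>text</myElement></root>";
        System.out.println(countMatches(xml));  // prints 2
    }
}
```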

This is pretty simple, and nothing you couldn't figure out on your own with a little experimentation. However, you need to be very careful when searching for elements in namespaced documents. To illustrate, consider the XML shown in Listing 2.

Listing 2. A tricky namespace document

<po:purchaseOrder xmlns:po="http://www.po.com">
  <po:order>
    <po:item id="11-489-09" qty="500">
      <po:name>Aiwa Micro Compact System</po:name>
      <po:manufacturerInfo xmlns:mn="http://www.po.com/manufacturers">
        <mn:name po:manufacturerId="98001">
          Aiwa
        </mn:name>
        <mn:stock id="XR-M191" />
      </po:manufacturerInfo>
    </po:item>
  </po:order>
</po:purchaseOrder>

This document is a partially contrived purchase order for a compact disc/tape player from the Aiwa corporation. The purchase order is in the namespace identified by the URI http://www.po.com, but also includes manufacturer information in the namespace identified by the URI http://www.po.com/manufacturers. This is a good way to separate out groups of data and avoid name conflicts; for example, two elements in the document are named name, but each belongs to a different namespace.

The issue you need to be careful about concerns how you write your SAX startElement() code. Suppose you want to find out the name of the item ordered. This would seem simple enough, but can cause some tricky problems. Re-examine the code shown in Listing 1, and you should see a big gotcha -- if you simply substitute name for myElement, both elements named name will be picked up by startElement(), since both have the same local name (name). So in namespaced documents, you almost always need to perform two string comparisons, as shown in Listing 3.

Listing 3. Finding the po:name element

private static final String PO_NAMESPACE_URI = "http://www.po.com";

public void startElement(String uri, String localName,
                         String qName, Attributes atts)
    throws SAXException {

  if ((localName.equals("name")) && (uri.equals(PO_NAMESPACE_URI))) {
    // Perform business-specific logic for po:name
  } else {
    // Perform business-specific logic for all other elements
  }
}

The first, and most obvious, change in this code is a new check for the PO namespace URI in addition to the check on the element's local name. Be sure that you compare on the namespace URI, not the prefix. The prefix is never part of localName, and it is only an alias: another document can bind the same URI to a different prefix (or make it the default namespace), so a test against po breaks as soon as the prefix changes. The prefix is visible only through the qName parameter, and using it in this manner is a hack, at best. Another thing to notice is that I use a constant for the URI to compare to. Since this URI will probably be used for comparison in multiple places, a static final String keeps the value in one place, so a typo or a change of namespace can't leave different callbacks comparing against different strings. (The benefit is maintainability, not memory: the JVM interns identical string literals, so repeating uri.equals("http://www.po.com") would not allocate a new String each time.) Finally, notice that I always compare the local name first, and the namespace URI second. You'll almost always find fewer elements with the same name than elements in the same namespace, so the most restrictive comparison is performed first; because && short-circuits, the second comparison is skipped for as many elements as possible. The saving is small, but it costs nothing.
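
The difference between the two checks is easy to see by counting matches. The sketch below is illustrative only: the class NameMatchDemo, its run() helper, and the condensed sample document are assumptions, not code from this tip. It counts how many elements pass the local-name test alone and how many also pass the namespace URI test.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class comparing the checks from Listing 1 and Listing 3.
class NameMatchDemo extends DefaultHandler {
    private static final String PO_NAMESPACE_URI = "http://www.po.com";

    int localNameOnly = 0;  // elements whose local name is "name"
    int qualified = 0;      // ... that are also in the PO namespace

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if (localName.equals("name")) {
            localNameOnly++;
            if (uri.equals(PO_NAMESPACE_URI)) {
                qualified++;
            }
        }
    }

    // Hypothetical helper: parses the XML string and returns the handler.
    static NameMatchDemo run(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            NameMatchDemo handler = new NameMatchDemo();
            factory.newSAXParser()
                   .parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
    }

    public static void main(String[] args) {
        // Condensed stand-in for the document in Listing 2.
        String xml =
            "<po:item xmlns:po='http://www.po.com'"
          + " xmlns:mn='http://www.po.com/manufacturers'>"
          + "<po:name>Aiwa Micro Compact System</po:name>"
          + "<mn:name>Aiwa</mn:name>"
          + "</po:item>";
        NameMatchDemo result = run(xml);
        System.out.println(result.localNameOnly + " " + result.qualified);  // prints 2 1
    }
}
```

Because the test is on the URI, the same handler keeps working if a document binds http://www.po.com to some other prefix.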

Now you need to be able to pull the textual value out for an element. This is simple, but must be done in a non-traditional way. You can't simply call element.getTextValue() -- in fact, you must work across three methods! First, locate the element you want, using startElement() code as you've already seen. Then, you must grab all the textual content from that element in characters(). However, beware: This callback may be triggered multiple times for a single piece of textual content. So the text "Aiwa Corporation" might be reported as one string of characters through one invocation of characters(), or as "Aiwa" to one invocation and " Corporation" to another, or in any of an almost infinite variety of other ways that involves more than one invocation of characters(). Because you can't be sure this method will be called only once, you have to perform a little character management, as Listing 4 shows.

Listing 4. Catching character content

private static final String PO_NAMESPACE_URI = "http://www.po.com";
private StringBuffer elementContent = new StringBuffer();

public void startElement(String uri, String localName,
                         String qName, Attributes atts)
    throws SAXException {

  if ((localName.equals("name")) && (uri.equals(PO_NAMESPACE_URI))) {
    // Perform business-specific logic for po:name

    // Clear the current character content buffer
    // (StringBuffer has no clear() method; setLength(0) empties it)
    elementContent.setLength(0);
  } else {
    // Perform business-specific logic for all other elements
  }
}

public void characters(char[] ch, int start, int len) throws SAXException {
  // Append only the reported slice of the parser's buffer
  elementContent.append(ch, start, len);
}

The first step here is to add a new member variable, a StringBuffer called elementContent. You could use a String, but as advanced Java programmers, you all know that string concatenation is bad, right? So instead, you need to use a construct that can easily be appended to without lots of memory overhead. Then, you clear this buffer when you hit the desired element, removing any content left over from previous iterations or callbacks. Finally, every time content is reported through characters(), you add it to the buffer. Sometimes, the buffer may only have one piece of content appended (the entire element's textual content); other times, this appending may happen four or five times. In either case, your code covers you and ensures that you get all the content you're looking for.
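
How a parser splits text is up to the parser, so the chunking is hard to reproduce on demand. One way to see that the buffer copes with it is to call characters() by hand, as in this sketch. The class ChunkedTextDemo and its feed() helper are hypothetical, and the "##" padding is only there to imitate a parser buffer that holds more than the reported slice.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: feeds text to characters() in arbitrary pieces.
class ChunkedTextDemo extends DefaultHandler {
    private final StringBuffer elementContent = new StringBuffer();

    @Override
    public void characters(char[] ch, int start, int len) {
        // Honor start and len; the array may contain unrelated data.
        elementContent.append(ch, start, len);
    }

    // Hypothetical helper: simulates one characters() call per chunk.
    static String feed(String... chunks) {
        ChunkedTextDemo handler = new ChunkedTextDemo();
        for (String chunk : chunks) {
            // Surround the chunk with padding, as a real parser buffer might.
            char[] buffer = ("##" + chunk + "##").toCharArray();
            handler.characters(buffer, 2, chunk.length());
        }
        return handler.elementContent.toString();
    }

    public static void main(String[] args) {
        System.out.println(feed("Aiwa Corporation"));       // prints Aiwa Corporation
        System.out.println(feed("Aiwa", " Corporation"));   // prints Aiwa Corporation
        System.out.println(feed("Ai", "wa", " Corp", "oration"));  // prints Aiwa Corporation
    }
}
```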

As you may have noticed, though, something is still missing -- it's never clear when you actually have all the content you want, and when you can do something with that content. To handle this, you need the endElement() callback, which tells you when the element you are targeting for data extraction is closed. Adding some code like that shown in Listing 5 takes care of this clean-up.

Listing 5. Closing the element loop

private static final String PO_NAMESPACE_URI = "http://www.po.com";
private StringBuffer elementContent = new StringBuffer();
private String elementData;

public void startElement(String uri, String localName,
                         String qName, Attributes atts)
    throws SAXException {

  if ((localName.equals("name")) && (uri.equals(PO_NAMESPACE_URI))) {
    // Perform business-specific logic for po:name

    // Clear the current character content buffer
    elementContent.setLength(0);
  } else {
    // Perform business-specific logic for all other elements
  }
}

public void characters(char[] ch, int start, int len) throws SAXException {
  elementContent.append(ch, start, len);
}

public void endElement(String uri, String localName, String qName)
    throws SAXException {

  if ((localName.equals("name")) && (uri.equals(PO_NAMESPACE_URI))) {
    // We're done
    elementData = elementContent.toString();

    // Do something with this data
  }
}

This should seem pretty obvious -- when the element is closed, you've got all the textual data you want, and can go about the business of using that data. However, let me warn you of two very important use-cases where this code will either utterly fail, or work great while reporting completely incorrect results:

The element with desired content appears multiple times
The element with desired content has mixed content (both textual content and other nested elements)

The first case, in which an element appears multiple times, isn't too hard to deal with. If you are only using the element's content temporarily, such as in the body of endElement(), this isn't an issue; your business code will get triggered each and every time that element is encountered, each time with the correct data. Since you were looking ahead and cleared the buffer in startElement(), you don't have to worry about overlapping data. However, if you are trying to save the textual content in a storage medium like a Map, you might end up overwriting data from early elements with data from later elements (all having the same name), which is a nasty bug to track down. I recommend that you use SAX as a fire-and-forget mechanism, and not build up data structures like this in the first place -- so in that case this becomes a non-issue. Still, it's something to watch out for!
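
One way to keep the fire-and-forget style when the element repeats is to hand each value to a callback as soon as endElement() fires, instead of storing it in the handler. The sketch below assumes Java 8 or later for java.util.function.Consumer; the class NameCallbackHandler, the onName callback, collectNames(), and the second item in the sample data are all invented for illustration. The list in collectNames() exists only so the demo has something to print.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: reports every po:name value to a caller-supplied callback.
class NameCallbackHandler extends DefaultHandler {
    private static final String PO_NAMESPACE_URI = "http://www.po.com";

    private final StringBuffer elementContent = new StringBuffer();
    private final Consumer<String> onName;

    NameCallbackHandler(Consumer<String> onName) {
        this.onName = onName;
    }

    private static boolean isPoName(String uri, String localName) {
        return localName.equals("name") && uri.equals(PO_NAMESPACE_URI);
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if (isPoName(uri, localName)) {
            elementContent.setLength(0);  // drop text from earlier elements
        }
    }

    @Override
    public void characters(char[] ch, int start, int len) {
        elementContent.append(ch, start, len);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isPoName(uri, localName)) {
            // Fire once per occurrence; nothing is kept in the handler.
            onName.accept(elementContent.toString().trim());
        }
    }

    // Hypothetical helper for the demo: gathers the reported values in order.
    static List<String> collectNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new NameCallbackHandler(names::add));
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml =
            "<po:order xmlns:po='http://www.po.com'>"
          + "<po:item><po:name>Aiwa Micro Compact System</po:name></po:item>"
          + "<po:item><po:name>Hypothetical Second Item</po:name></po:item>"
          + "</po:order>";
        // prints [Aiwa Micro Compact System, Hypothetical Second Item]
        System.out.println(collectNames(xml));
    }
}
```

If the values do have to be stored, an ordered collection such as the List used here keeps every occurrence, whereas a Map keyed on the element name would keep only the last one.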

The second case is a little trickier, and most common when working with HTML or XHTML. Suppose you have content like this:

<p>The quick <b>red fox <i>jumps</i></b> over the lazy brown dog.</p>

Further suppose that you want the textual content of the bold element (b). In this case, you're going to have to decide exactly what content you want. In the current code, you are going to get a string like this: red fox jumps. That may be exactly what you want; if so, great. Notice, though, that this includes the textual content for the target element, as well as textual content for its child elements. You may find yourself in a situation where you want only the textual content of the target element, and would rather omit all nested elements' textual content. In these cases (which are a bit rare, admittedly), you are going to need to be a little craftier in your code, a la Listing 6.

Listing 6. Keeping only content for a specific element

private static final String PO_NAMESPACE_URI = "http://www.po.com";
private StringBuffer elementContent = new StringBuffer();
private String elementData;
private boolean inElement = false;
private int nestedElements = 0;

public void startElement(String uri, String localName,
                         String qName, Attributes atts)
    throws SAXException {

  if ((localName.equals("name")) && (uri.equals(PO_NAMESPACE_URI))) {
    // Perform business-specific logic for po:name

    // Clear the current character content buffer
    elementContent.setLength(0);
    inElement = true;
  } else {
    // Perform business-specific logic for all other elements

    // Ensure we don't pick up content for other elements
    if (inElement) {
      nestedElements++;
    }
  }
}

public void characters(char[] ch, int start, int len) throws SAXException {
  // Only get content if we're directly inside the target element
  if (inElement && (nestedElements == 0)) {
    elementContent.append(ch, start, len);
  }
}

public void endElement(String uri, String localName, String qName)
    throws SAXException {

  if ((localName.equals("name")) && (uri.equals(PO_NAMESPACE_URI))) {
    // We're done
    elementData = elementContent.toString();
    inElement = false;

    // Do something with this data
  } else {
    // Remove one from the nested element count, if appropriate
    if (inElement) {
      nestedElements--;
    }
  }
}

This version of the code adds a boolean variable, inElement, which ensures that textual content is only picked up specifically for the element being dealt with. First, that variable is set whenever the start of the target element is reached, and unset again when that element is closed. However, you have to account for nested elements -- thus the counter nestedElements, which starts at 0 (for no nested elements). If startElement() is called on a nested element, the counter is incremented; when that element is closed off (through endElement()), the counter is decremented again. Only when the counter is back at 0 is it safe to gather textual content. This is a bit of a tricky solution, but then again, the problem isn't a trivial one. Thankfully, it is a rare one, so you won't have to mess with this sort of code very often.
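
To compare the two behaviors on the paragraph shown earlier, the sketch below applies the same flag-and-counter logic to the b element. It is an illustration under stated assumptions rather than a drop-in replacement: the class DirectTextHandler, the directOnly switch, and extract() are hypothetical, and matching is on the local name alone because the sample markup has no namespaces.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: collects either all text or only direct text of one element.
class DirectTextHandler extends DefaultHandler {
    private final String target;       // local name of the element to read
    private final boolean directOnly;  // true: skip text of nested elements
    private final StringBuffer elementContent = new StringBuffer();
    private String elementData;
    private boolean inElement = false;
    private int nestedElements = 0;

    DirectTextHandler(String target, boolean directOnly) {
        this.target = target;
        this.directOnly = directOnly;
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if (localName.equals(target)) {
            elementContent.setLength(0);
            inElement = true;
        } else if (inElement) {
            nestedElements++;  // entered a child of the target
        }
    }

    @Override
    public void characters(char[] ch, int start, int len) {
        if (inElement && (!directOnly || nestedElements == 0)) {
            elementContent.append(ch, start, len);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (localName.equals(target)) {
            elementData = elementContent.toString();
            inElement = false;
        } else if (inElement) {
            nestedElements--;  // left a child of the target
        }
    }

    // Hypothetical helper: returns the collected text, or null if target is absent.
    static String extract(String xml, String target, boolean directOnly) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            DirectTextHandler handler = new DirectTextHandler(target, directOnly);
            factory.newSAXParser()
                   .parse(new InputSource(new StringReader(xml)), handler);
            return handler.elementData;
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<p>The quick <b>red fox <i>jumps</i></b>"
                   + " over the lazy brown dog.</p>";
        System.out.println("[" + extract(xml, "b", false) + "]");  // prints [red fox jumps]
        System.out.println("[" + extract(xml, "b", true) + "]");   // prints [red fox ]
    }
}
```

Note the trailing space in the second result: whitespace that sits directly inside the target element is part of its text, so trim the value if that matters. Like Listing 6, this sketch assumes the target element does not contain another element with the same name.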

At this point, I've exhausted the most common applications of the ContentHandler interface. Rather than delving into its less commonly-used functions in the next tips, I'll continue with a look at the major facets of XML. While I may examine the nooks and crannies of SAX in tips much further down the line, I'm trying to ground you in SAX and give you the most commonly-used tools, rather than bore you with esoterica. Along those lines, then, I'll look at the ErrorHandler interface in the next tip, and explain how it can add error handling and reporting capabilities to your XML processing with SAX. Until then, I'll see you on the newsgroups and online.

Resources

Read Brett McLaughlin's previous tip, "Get the most from ContentHandlers" (developerWorks, July 2003).
In his Working XML column "Building a compiler for the SAX ContentHandler", Benoit Marchal begins a series on how to automate the creation of SAX ContentHandler (developerWorks, November 2001).
Get the nitty-gritty details in the XML specification, online at the W3C.
Learn even more about SAX with the "Understanding SAX" tutorial, which demonstrates how to use SAX to retrieve, manipulate, and output XML data (developerWorks, July 2003).
Want more on how XML and Java technologies interact? Visit the XML and Java technology forum hosted by XML/Java technology innovator Brett McLaughlin.
Check out XML annotated on XML.com.
Check out the SAX Project home page.
See the SAX-standardized features and properties list.
Supplement your skills with Java and XML by Brett McLaughlin (O'Reilly and Associates).
Find more XML resources on the developerWorks XML zone. For a complete list of XML tips to date, check out the tips summary page.
Check out IBM WebSphere Studio Site Developer a robust, easy-to-use development environment for creating, building, and maintaining dynamic Web sites, applications, and Web services.
Find out how you can become an IBM Certified Developer in XML and related technologies.

About the author
Brett McLaughlin has been working in computers since the Logo days (Remember the little triangle?). He currently specializes in building application infrastructure using Java-related technologies. He has spent the last several years implementing these infrastructures at Nextel Communications and Allegiance Telecom, Inc. Brett is one of the co-founders of the Java Apache project Turbine, which builds a reusable component architecture for Web application development using Java servlets. He is also a contributor of the EJBoss project, an open source EJB application server, and Cocoon, an open source XML Web-publishing engine.
