Parsing, indexing, and searching XML with Digester and Lucene
These open source projects can ease your XML-handling tasks
Otis Gospodnetic (otis@apache.org), Software Engineer, Wireless Generation, Inc.
3 June 2003
Java developers can use the SAX interface
to parse XML documents, but this process is rather complex. Digester and
Lucene, two open source projects from the Apache Foundation, cut down
your development time for projects in which you manipulate XML. Lucene
developer Otis Gospodnetic shows you how it's done, with example code
that you can compile and run.
If you've ever wanted to parse XML documents but have found SAX just a
little difficult, this article is for you. In this article, we examine how
to use two open source tools from the Apache Jakarta project, Commons
Digester and Lucene, to handle the parsing, indexing, and searching of XML
documents. Digester parses the XML data, and Lucene handles indexing and
searching. You'll first see how to use each tool on its own and then how
to use them together, with sample code that you can compile and run.
About Digester and Lucene
Commons Digester is a subproject of the Commons
project, which is one of the initiatives developed by the community of
developers who create open source software under the Apache Jakarta
umbrella. Digester offers a simple and high-level interface for the
mapping of XML documents to Java objects. When Digester finds
developer-defined patterns in XML, it will take developer-specified
actions. Digester requires a few additional Java libraries, including an
XML parser compatible with either SAX 2.0 or JAXP 1.1. Digester's home
page, listed in the Resources
section at the end of this article, provides a short list of the libraries
that Digester needs.
Lucene is another Apache Jakarta project. Like Digester, it is a Java
library and not a stand-alone application. Behind its simple indexing and
search interface hides an elegant piece of software capable of handling
many documents.
In the rest of this article, we use Digester to parse a simple XML
file, then illustrate how Lucene creates indices. Then we marry the two
tools to create a Lucene-generated index from our sample XML document, and
finally use Lucene classes to search through that index.
Using Digester to parse XML
We use Digester to parse the simple XML document in
Listing 1, which contains entries in an imaginary address book. To
demonstrate handling of elements with and without attributes, I decided to
make type an attribute of the <contact>
element, while leaving all other elements without any attributes.
Listing 1. XML snippet of a fictitious address book
<?xml version='1.0' encoding='utf-8'?>
<address-book>
  <contact type="individual">
    <name>Zane Pasolini</name>
    <address>999 W. Prince St.</address>
    <city>New York</city>
    <province>NY</province>
    <postalcode>10013</postalcode>
    <country>USA</country>
    <telephone>1-212-345-6789</telephone>
  </contact>
  <contact type="business">
    <name>SAMOFIX d.o.o.</name>
    <address>Ilica 47-2</address>
    <city>Zagreb</city>
    <province></province>
    <postalcode>10000</postalcode>
    <country>Croatia</country>
    <telephone>385-1-123-4567</telephone>
  </contact>
</address-book>
Using Digester to parse the above XML document is very simple, as Listing 2 illustrates.
The most involved part of using Digester is centralized in the
main() method. After creating an instance of Digester, we
have to create rules for actions that are to be triggered when certain
patterns are encountered in the XML document that we are parsing. We'll
look in more detail at each Digester rule that we defined in Listing 2.
Note that the order in which rules are passed to Digester matters a great
deal.
The first rule tells Digester to create an instance of the
AddressBookParser class when the pattern "address-book" is
found. Because <address-book> is the first element in
the address book XML file, this rule will be the first to be triggered
when we use Digester with our XML file.
digester.addObjectCreate("address-book", AddressBookParser.class);
This next rule instructs Digester to create an instance of class
Contact when it finds the <contact> child
element under the <address-book> parent.
digester.addObjectCreate("address-book/contact", Contact.class);
In the following snippet, we set the type property of the
Contact instance when Digester finds the type
attribute of the <contact> element.
digester.addSetProperties("address-book/contact", "type", "type");
Our AddressBookParser class contains several rules that
look similar to the one shown below. They instruct Digester to invoke the
setName() method of the Contact class instance
and use the value enclosed by <name> elements as the
method parameter.
digester.addCallMethod("address-book/contact/name", "setName", 0);
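The Contact class itself belongs to Listing 2 and is not reproduced here. As a rough guide, a bean along the following lines would work with the rules above. This is a minimal sketch rather than the article's actual class: the property names are inferred from the elements in Listing 1 and from the getters used in Listing 6, and the toString() method is added purely for illustration.

```java
// Hypothetical sketch of a Contact bean; the article's own version is in Listing 2.
// Declare the class public in its own Contact.java file when using it with
// Digester, which instantiates and populates it through reflection.
class Contact {
    private String type;
    private String name;
    private String address;
    private String city;
    private String province;
    private String postalcode;
    private String country;
    private String telephone;

    // addObjectCreate() relies on a public no-argument constructor.
    public Contact() { }

    // One setter per property: setType() is the target of addSetProperties(),
    // the others are targets of addCallMethod() rules such as "setName".
    public void setType(String type) { this.type = type; }
    public void setName(String name) { this.name = name; }
    public void setAddress(String address) { this.address = address; }
    public void setCity(String city) { this.city = city; }
    public void setProvince(String province) { this.province = province; }
    public void setPostalcode(String postalcode) { this.postalcode = postalcode; }
    public void setCountry(String country) { this.country = country; }
    public void setTelephone(String telephone) { this.telephone = telephone; }

    public String getType() { return type; }
    public String getName() { return name; }
    public String getAddress() { return address; }
    public String getCity() { return city; }
    public String getProvince() { return province; }
    public String getPostalcode() { return postalcode; }
    public String getCountry() { return country; }
    public String getTelephone() { return telephone; }

    // Illustration only: a short summary for printing.
    @Override
    public String toString() { return type + ": " + name; }

    public static void main(String[] args) {
        Contact contact = new Contact();
        contact.setType("individual");
        contact.setName("Zane Pasolini");
        System.out.println(contact);
    }
}
```

Properties that no rule sets simply stay null, which is why each element in Listing 1 needs its own addCallMethod() rule.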
Finally, this rule tells Digester to call the addContact()
method when it finds the closing </contact> element.
digester.addSetNext("address-book/contact", "addContact");
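Again, it's important that you consider the order in which the rules
are passed to Digester. While we could change the order of the various
addSetProperties() and addCallMethod() rules in our class and still have properly
functioning code, switching the order of addObjectCreate()
and addSetNext() would result in an error. Digester fires the rules for a
closing element in the reverse of the order in which they were added, so after
the switch the Contact object would already have been popped off Digester's
object stack by the time addSetNext() tried to pass it to addContact().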
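To see what these rules save you from writing, here is a rough hand-written equivalent that uses only the SAX parser bundled with the JDK. It is a sketch for comparison, not Digester's implementation: the class name PathMatchingSketch and the "type:name" output format are made up for illustration, and only the type attribute and the <name> element are handled.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; it is not part of Digester or of Listing 2.
class PathMatchingSketch {

    /** Parses address-book XML and returns one "type:name" string per contact. */
    static List<String> parse(String xml) {
        final List<String> contacts = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            // Element names from the root down to the current element.
            private final List<String> path = new ArrayList<>();
            // Character data collected since the most recent start tag.
            private final StringBuilder text = new StringBuilder();
            private String type;
            private String name;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                path.add(qName);
                text.setLength(0);
                if (String.join("/", path).equals("address-book/contact")) {
                    // Roughly what addObjectCreate() plus addSetProperties() do.
                    type = attributes.getValue("type");
                    name = null;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                String current = String.join("/", path);
                if (current.equals("address-book/contact/name")) {
                    // Roughly what addCallMethod(..., "setName", 0) does.
                    name = text.toString().trim();
                } else if (current.equals("address-book/contact")) {
                    // Roughly what addSetNext(..., "addContact") does.
                    contacts.add(type + ":" + name);
                }
                path.remove(path.size() - 1);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
        return contacts;
    }

    public static void main(String[] args) {
        String xml = "<address-book>"
            + "<contact type=\"individual\"><name>Zane Pasolini</name></contact>"
            + "<contact type=\"business\"><name>SAMOFIX d.o.o.</name></contact>"
            + "</address-book>";
        System.out.println(parse(xml));
    }
}
```

Every additional element means another branch and more state in the handler, whereas with Digester it means one more rule.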
Using Lucene to index text
There are four fundamental Lucene classes for indexing
text: IndexWriter, Analyzer,
Document, and Field.
The IndexWriter class creates a new index and adds
documents to an existing index.
Before text is indexed, it is passed through an Analyzer.
Analyzer classes are in charge of extracting indexable tokens
out of text to be indexed and eliminating the rest. Lucene comes with a
few different Analyzer implementations. Some of them deal
with skipping stop words (frequently used words that don't help
distinguish one document from the other, such as a, an, the, in,
and on), for instance, while others deal with converting all tokens
to lowercase letters, so that searches are not case sensitive.
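The difference between these analysis strategies is easy to see in plain Java. The sketch below does not use Lucene and is not how Lucene's Analyzer classes are implemented; the class name AnalyzerSketch, its two methods, and the five-word stop list (taken from the examples above) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; these methods are not Lucene classes.
class AnalyzerSketch {
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "in", "on"));

    /** Splits on runs of whitespace and keeps every token unchanged. */
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Lowercases the whitespace tokens and drops the stop words. */
    static List<String> lowercaseStopTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : whitespaceTokens(text)) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                tokens.add(lower);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "An office in the City";
        System.out.println(whitespaceTokens(text));    // [An, office, in, the, City]
        System.out.println(lowercaseStopTokens(text)); // [office, city]
    }
}
```

Whichever strategy is used at indexing time determines which tokens a later search can match, so the same choice matters again when you query the index.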
An index consists of a set of Documents, and each
Document consists of one or more Fields. Each
Field has a name and a value. You can think of a
Document as a row in an RDBMS, and Fields as
columns in that row.
Let's consider a simple scenario in which we add a single contact entry
with all its fields to the index. Listing 3 shows how we could do it,
using the classes we just described.
Listing 3. Lucene-based address book indexer
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/**
 * <code>AddressBookIndexer</code> class provides a simple
 * example of indexing with Lucene. It creates a fresh
 * index called "address-book" in a temporary directory every
 * time it is invoked and adds a single document with a
 * few fields to it.
 */
public class AddressBookIndexer
{
    public static void main(String[] args) throws Exception
    {
        String indexDir =
            System.getProperty("java.io.tmpdir", "tmp") +
            System.getProperty("file.separator") + "address-book";
        Analyzer analyzer = new WhitespaceAnalyzer();
        boolean createFlag = true;

        IndexWriter writer = new IndexWriter(indexDir, analyzer, createFlag);
        Document contactDocument = new Document();
        contactDocument.add(Field.Text("type", "individual"));
        contactDocument.add(Field.Text("name", "Zane Pasolini"));
        contactDocument.add(Field.Text("address", "999 W. Prince St."));
        contactDocument.add(Field.Text("city", "New York"));
        contactDocument.add(Field.Text("province", "NY"));
        contactDocument.add(Field.Text("postalcode", "10013"));
        contactDocument.add(Field.Text("country", "USA"));
        contactDocument.add(Field.Text("telephone", "1-212-345-6789"));
        writer.addDocument(contactDocument);
        writer.close();
    }
}
What exactly is happening here? Lucene indices are stored in
directories in the filesystem. Each index is contained within a single
directory, and multiple indices cannot share a directory. The first
parameter in IndexWriter's constructor specifies the
directory where the index should be stored. The second parameter provides
the implementation of Analyzer that should be used for
preprocessing the text before it is indexed. The particular implementation
of Analyzer that we are using here uses the whitespace
character as the delimiter for tokenizing the input. The last parameter is
a boolean flag that, when true, tells
IndexWriter to create a new index in the specified directory,
or to overwrite any existing index in that directory. A value of
false instructs IndexWriter to add
Documents to an existing index instead. We then create a
blank Document, and add several Text Fields to
it. After the Document is populated, we add it to the index
through the instance of IndexWriter; finally, we close the
index. Closing the IndexWriter is important, as doing so
ensures that all index changes are flushed to the disk.
It is important to note that Lucene offers several types of
Fields. In this example I used the Text Fields,
because Lucene doesn't just index them, but also stores their original
value verbatim in the index. This allows us to show all the contact fields
when searching the index. To learn more about other types of
Fields in Lucene, see the Resources
section.
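If the distinction between indexing a field and storing it seems abstract, the toy model below may help. It is a plain-Java sketch that does not use Lucene and says nothing about how Lucene lays out its index on disk; the class TinyIndexSketch and its methods are hypothetical. Each field value is split on whitespace and recorded in a lookup table (the indexed part), while the original value is kept unchanged alongside it (the stored part).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the document/field model; not Lucene's implementation.
class TinyIndexSketch {
    // Stored values: document number -> (field name -> original text).
    private final List<Map<String, String>> stored = new ArrayList<>();
    // Indexed tokens: "field:token" -> numbers of the documents containing it.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    /** Adds a document, indexing each field by whitespace-separated tokens. */
    int addDocument(Map<String, String> fields) {
        int docId = stored.size();
        stored.add(new LinkedHashMap<>(fields));
        for (Map.Entry<String, String> field : fields.entrySet()) {
            for (String token : field.getValue().trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                List<Integer> docs = postings.computeIfAbsent(
                    field.getKey() + ":" + token, key -> new ArrayList<>());
                // Record each document at most once per token.
                if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                    docs.add(docId);
                }
            }
        }
        return docId;
    }

    /** Returns the numbers of the documents whose field contains exactly this token. */
    List<Integer> search(String field, String token) {
        return new ArrayList<>(
            postings.getOrDefault(field + ":" + token, Collections.emptyList()));
    }

    /** Returns the stored, unmodified value of a field. */
    String get(int docId, String field) {
        return stored.get(docId).get(field);
    }

    public static void main(String[] args) {
        TinyIndexSketch index = new TinyIndexSketch();
        Map<String, String> contact = new LinkedHashMap<>();
        contact.put("name", "Zane Pasolini");
        contact.put("city", "New York");
        index.addDocument(contact);
        for (int docId : index.search("name", "Zane")) {
            System.out.println("NAME: " + index.get(docId, "name"));
        }
    }
}
```

In this model, a field that was indexed but not stored could still be found by a search, but its original text could not be displayed afterward.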
Marrying Digester and Lucene
Now that you know how to use each of these tools on
its own, we can combine the two classes we've written. We'll use
Digester to handle XML parsing, and Lucene to handle indexing. You can see
the resulting DigesterMarriesLucene class in Listing 4.
Let's look at some selections from this class in more detail. Just as
we did in the AddressBookIndexer class, we need to open the
Lucene index for writing using IndexWriter; we do so here in
Listing 5. We pass in the path to the index directory, the
Analyzer to process all data being indexed, and a
createFlag that is set to true, so that a new index is
created in that directory, overwriting any index that already exists there.
Listing 5. Opening the index for writing
String indexDir =
System.getProperty("java.io.tmpdir", "tmp") +
System.getProperty("file.separator") + "address-book";
Analyzer analyzer = new WhitespaceAnalyzer();
boolean createFlag = true;
// IndexWriter to use for adding contacts to the index
writer = new IndexWriter(indexDir, analyzer, createFlag);
The modified addContact(Contact) method shown in Listing 6
now creates a fresh instance of the Lucene Document every
time it is called. After the Document is populated with data
from the Contact instance that is passed into the method, it
is added to the index through an instance of IndexWriter.
Listing 6. New addContact(Contact) method adds the document to the index
Document contactDocument = new Document();
contactDocument.add(Field.Text("type", contact.getType()));
contactDocument.add(Field.Text("name", contact.getName()));
contactDocument.add(Field.Text("address", contact.getAddress()));
contactDocument.add(Field.Text("city", contact.getCity()));
contactDocument.add(Field.Text("province", contact.getProvince()));
contactDocument.add(Field.Text("postalcode", contact.getPostalcode()));
contactDocument.add(Field.Text("country", contact.getCountry()));
contactDocument.add(Field.Text("telephone", contact.getTelephone()));
writer.addDocument(contactDocument);
Finally, in Listing 7, at the end of the main() method,
the index is optimized and closed to ensure that all
Documents added to it are indeed written to the index on the
disk.
Listing 7. Optimizing and closing the index
// optimize and close the index
writer.optimize();
writer.close();
Using Lucene to search text
Now that we can create a Lucene index from a document
containing address book entries encoded in XML, all we need is the
ability to search that index. Lucene's API for searching is as simple as
the indexing API. In the class in Listing 8, we search the index we
created with the DigesterMarriesLucene class. Here, we run a
query that looks for all contacts that contain the keyword "Zane" in the
field called name.
Listing 8. Searching the address book index created with the Lucene indexer
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import java.io.IOException;

/**
 * <code>AddressBookSearcher</code> class provides a simple
 * example of searching with Lucene. It looks for an entry whose
 * 'name' field contains keyword 'Zane'. The index being searched
 * is called "address-book", located in a temporary directory.
 */
public class AddressBookSearcher
{
    public static void main(String[] args) throws IOException
    {
        String indexDir =
            System.getProperty("java.io.tmpdir", "tmp") +
            System.getProperty("file.separator") + "address-book";
        IndexSearcher searcher = new IndexSearcher(indexDir);
        Query query = new TermQuery(new Term("name", "Zane"));
        Hits hits = searcher.search(query);
        System.out.println("NUMBER OF MATCHING CONTACTS: " + hits.length());
        for (int i = 0; i < hits.length(); i++)
        {
            System.out.println("NAME: " + hits.doc(i).get("name"));
        }
    }
}
You can see that the IndexSearcher class is used for
accessing an existing index. The argument passed to its constructor is the
path to the directory where the index is stored. Lucene provides a few
different query types, and TermQuery is the simplest of them.
The query in the code above will find all listings that contain the term
"Zane" in a field called name. Because a TermQuery is not run
through an Analyzer, the term must match an indexed token exactly,
including its case; "Zane" matches here because WhitespaceAnalyzer
left the token's capitalization untouched at indexing time. The call to
IndexSearcher's search(Query) method executes
the search against the index and returns a collection of matching
Documents in an instance of Hits.
While this search example is very simple, note that Lucene offers a
rich set of search-related features. For instance, you can use several
different types of queries with Lucene: boolean queries, wild-card
queries, phrase queries, and so on. Lucene also offers the ability to
search multiple indices at once, as well as the ability to search indices
located on remote computers. Another useful feature is Lucene's
QueryParser, which supports a powerful and user-friendly
query syntax. For more information about Lucene's query syntax, see the Resources
section.
Conclusion
You should now have a good understanding of how to use Jakarta Commons Digester to
parse XML documents and how to use Jakarta Lucene to index XML documents
and search the resulting index. The approach described in this article
should satisfy the simple XML indexing and searching needs of most
developers. You should also take a look at the Sandbox subproject of
Lucene, which includes examples of indexing of XML documents using SAX 2
and DOM parsers. For more complex and generic solutions, visit Lucene's
contributions page, a link to which is included in Resources.
Resources
About the author
Otis Gospodnetic is an active Apache Jakarta
member, a developer of Lucene, and maintainer of jGuru's Lucene
FAQ. His professional interests include Web crawlers, information
gathering and retrieval, and distributed computing. Otis currently
lives in New York City and can be reached at otis@apache.org.