Parse sequences of characters with the new regex library
Moving beyond StreamTokenizer and StringTokenizer for pattern matching
John Zukowski (jaz@zukowski.net), President, JZ Ventures, Inc.
1 August 2002
While previous versions
of the Java language supported pattern matching, the
StreamTokenizer and StringTokenizer classes
barely scratched the surface of what you can do with patterns. The Java
1.4 (and now 1.4.1) release contains support for pattern matching with
regular expressions in the java.util.regex package. In this
installment of Magic with Merlin, John Zukowski shows you how to
parse sequences of characters with the new regular expression library to
add power to your search patterns.
Parse text strings for patterns
Regular expressions are ways to match patterns
against text -- similar to how a compiler works to generate class files. A
compiler looks for various patterns in the source to convert the source
code expressions into bytecodes. By recognizing these source patterns, the
compiler is able to translate only valid representations of source into
compiled class files.
What are patterns?
In the context of regular expressions, patterns are text representations of
sequences of characters. For instance, if you wanted to know if the word
car existed within a character sequence, you would use the pattern
car because that is how you represent the exact string. For a more
complicated pattern, you can use special characters as placeholders. If
instead of searching for car, you wanted to search for any string
of text that began with the letter c and ended with the letter
r, you would use the c.*r pattern, where the period stands for any
character and the asterisk means zero or more of them. (The c*r pattern,
by contrast, means zero or more c characters followed by an r.) Matched
against a whole string, the c.*r pattern accepts any string of characters
that begins with c and ends with r, as in cougar, cavalier, or
chrysler.
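The two patterns just described can be tried out with a short sketch. The class and method names here (PatternBasics, containsCar, beginsWithCEndsWithR) are made up for illustration; only Pattern and its methods come from java.util.regex. The literal pattern is checked with find(), which looks for it anywhere in the input, while c.*r is checked against the entire input.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class PatternBasics {
    // True if the literal text "car" occurs anywhere in the input.
    static boolean containsCar(String input) {
        return Pattern.compile("car").matcher(input).find();
    }

    // True if the entire input begins with c and ends with r.
    static boolean beginsWithCEndsWithR(String input) {
        return Pattern.matches("c.*r", input);
    }

    public static void main(String[] args) {
        String[] samples = {"car", "cougar", "cavalier", "chrysler", "cat"};
        for (String sample : samples) {
            System.out.println(sample
                + ": contains car=" + containsCar(sample)
                + ", begins with c and ends with r="
                + beginsWithCEndsWithR(sample));
        }
    }
}
```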
How to specify pattern expressions
The main part of pattern matching is coming up
with the expression to use. This expression is compiled into an instance of the
Pattern class, which in turn creates an instance of the
Matcher class to check for matches in the context of a
character sequence. For instance, if you want to validate an e-mail
address, you might check whether the user input matches the pattern of a
sequence of alphanumeric characters, followed by the @ symbol, then
followed by two sets of characters separated by a period. This could be
represented by the expression of
\p{Alnum}+@\w+\.\p{Alpha}{2,3} . (Yes, this does oversimplify
an e-mail address structure and probably would reject certain valid e-mail
addresses, but as an example it's sufficient.)
Before we look at the specifics of the pattern language, let's look at
\p{Alnum}+@\w+\.\p{Alpha}{2,3} in detail. The
\p{Alnum} sequence means a single alphanumeric character (A
through Z, a through z, or 0 through 9). The plus sign (+) after
\p{Alnum} is called a quantifier. It is applied to the
prior part of the expression and means that \p{Alnum} must be
present one or more times. Use an asterisk (*) for zero or more times. The
@ is just that, meaning it must appear after at least one alphanumeric
character for the whole pattern to succeed. The \w+ is
similar to the \p{Alnum}+, but \w also accepts an underscore (_). Some
sets of characters can be expressed in more than one way. The backslash
followed by a period (\.) means a literal period.
Without the preceding backslash, the period alone means any character. The
final \p{Alpha}{2,3} means two or three alphabetic
characters.
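As a quick check of that walk-through, here is a minimal sketch that applies the expression to whole strings. The EmailCheck class and its looksLikeEmail method are hypothetical names chosen for this example. Note that the backslashes are doubled because the expression sits inside a Java string constant.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class EmailCheck {
    // The simplified e-mail expression from the text.
    private static final Pattern EMAIL =
        Pattern.compile("\\p{Alnum}+@\\w+\\.\\p{Alpha}{2,3}");

    // True only if the entire input fits the simplified pattern.
    static boolean looksLikeEmail(String input) {
        return EMAIL.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "jaz@zukowski.net",      // fits the pattern
            "john.doe@example.com",  // valid address, but the dot before @ is rejected
            "jaz@zukowski"           // no period and suffix
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + looksLikeEmail(sample));
        }
    }
}
```

The second sample shows the oversimplification at work: a period in the name part is legal in real addresses, but \p{Alnum}+ does not allow it.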
The whole trick of working with patterns is to learn the specification
language. Let's look at some of the classes of more commonly used
expressions:
- Literals: Any character that doesn't have special meaning
within an expression is considered a literal and matches itself.
- Quantifiers: Certain characters or expressions are used to
count the number of times a literal or grouping can be present in a
character sequence for the sequence to match an expression. Groupings
are specified by a group of characters within parentheses.
  - ? means once or not at all
  - * means zero or more times
  - + means one or more times
- Character classes: A character class is a set of characters
within square brackets where a match would be any one character within
the brackets. You can combine character classes with quantifiers. For
example, [acegikmoqsuwy]* would be any sequence of
characters that includes only the odd-numbered letters of the alphabet.
Certain character classes are predefined:
  - \d -- A digit (from 0 to 9)
  - \D -- A non-digit
  - \s -- A white-space character, like tab or new line
  - \S -- A non-white-space character
  - \w -- A word character (a through z, A through Z, 0 through 9, and
  underscore)
  - \W -- A non-word character (everything else)
- POSIX character classes: Certain character classes match only
US-ASCII characters. For instance:
  - \p{Lower} -- Lowercase characters
  - \p{Upper} -- Uppercase characters
  - \p{ASCII} -- All ASCII characters
  - \p{Alpha} -- An alphabetic character (combining \p{Lower} with
  \p{Upper})
  - \p{Digit} -- A number from 0 to 9
  - \p{Alnum} -- Alphanumeric characters
- Range: Use a dash to specify a character class for an
inclusive range. For instance, [A-J] means the uppercase
letters from A through J.
- Negation: The caret symbol (^) negates the contents of a
character class. For instance, [^A-J] means any character
but A through J.
See the Pattern API documentation (available from Resources)
for additional details on the sequences.
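To get a feel for these building blocks, the following sketch tests a handful of expressions against whole strings using the static Pattern.matches() convenience method. The PatternSampler class and its fits() helper are invented for this example, and the sample inputs are arbitrary.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class PatternSampler {
    // True if the entire input matches the expression.
    static boolean fits(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Each row is an expression followed by an input to test it against.
        String[][] cases = {
            {"colou?r", "color"},                 // ? : once or not at all
            {"(ab)+", "ababab"},                  // + applied to a grouping
            {"\\d*", ""},                         // * : zero or more digits
            {"[acegikmoqsuwy]*", "guess"},        // character class with a quantifier
            {"[A-J]", "K"},                       // range
            {"[^A-J]", "K"},                      // negated range
            {"\\p{Upper}\\p{Lower}+", "Merlin"},  // POSIX classes
            {"\\p{Alnum}+", "jz_ventures"},       // underscore is not alphanumeric
            {"\\w+", "jz_ventures"}               // but it is a word character
        };
        for (String[] c : cases) {
            System.out.println(c[0] + " vs \"" + c[1] + "\" -> " + fits(c[0], c[1]));
        }
    }
}
```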
How to use patterns effectively
Now that you've learned how to specify patterns,
let's use them. You need to ask the Pattern class to compile
them, as shown below. Notice that the backslash character (\) needs to be
escaped in the String constant.
Pattern pattern = Pattern.compile(
"\\p{Alnum}+@\\w+\\.\\p{Alpha}{2,3}");
After you have a compiled pattern, you can use the Pattern
class to split an input line into a series of words based upon the
pattern, or use the Matcher class to do some more complicated
tasks. Here's how to split a character sequence of input, where the
pattern used specifies the separators, not the words:
String words[] = pattern.split(input);
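Because the pattern describes the separators, splitting calls for a different expression than the e-mail one. As a sketch, assume the separators are runs of commas and white space. The SplitWords class and its words() method are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class SplitWords {
    // Assumed separator: one or more commas or white-space characters.
    private static final Pattern SEPARATORS = Pattern.compile("[,\\s]+");

    // Returns the pieces of input found between separators.
    static List<String> words(String input) {
        return Arrays.asList(SEPARATORS.split(input));
    }

    public static void main(String[] args) {
        System.out.println(words("red, green,blue  yellow"));
    }
}
```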
If all you need is to break a character sequence apart at each place the
pattern occurs, the above code snippets are a good place to start. But if you
want to fetch the specific text that matched, you'll need the matcher()
method of Pattern. When given some input, this method will
return a Matcher instance for that input. You then use the
Matcher instance to step through the
different matches for the pattern in the input sequence, or better yet,
use the Matcher instance as a search-and-replace tool:
Matcher matcher = pattern.matcher(input);
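The search-and-replace role can be sketched with the replaceAll() and replaceFirst() methods of Matcher. The class name, method names, and the two patterns below are assumptions picked for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class SearchAndReplace {
    // Replaces every run of white space with a single space.
    static String collapseWhitespace(String input) {
        Matcher matcher = Pattern.compile("\\s+").matcher(input);
        return matcher.replaceAll(" ");
    }

    // Replaces only the first run of digits with a # character.
    static String maskFirstNumber(String input) {
        Matcher matcher = Pattern.compile("\\d+").matcher(input);
        return matcher.replaceFirst("#");
    }

    public static void main(String[] args) {
        System.out.println(collapseWhitespace("a  b\t\tc"));
        System.out.println(maskFirstNumber("Java 1.4 and 1.4.1"));
    }
}
```

In the replacement string, the dollar sign and backslash have special meaning, so replacement text that contains them needs extra care.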
To match the pattern against the whole sequence, use
matches() . To see if just a part of the sequence matches, use
find() :
if (matcher.find()) {
  // Found some string within input sequence
  // That matched the compiled pattern
  String match = matcher.group();
  // Process matching pattern
}
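Calling find() repeatedly moves through the input one match at a time, so a loop collects every match. The sketch below shows that, along with the difference between matches() and find(). The FindAll class and its method names are hypothetical, and it uses generics, which arrived in releases after 1.4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class FindAll {
    // Collects every substring of input that matches the expression.
    static List<String> findAll(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> results = new ArrayList<>();
        while (matcher.find()) {
            results.add(matcher.group());
        }
        return results;
    }

    // True only if the whole input matches.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if any part of the input matches.
    static boolean partialMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String email = "\\p{Alnum}+@\\w+\\.\\p{Alpha}{2,3}";
        String text = "Write to jaz@zukowski.net or bob@example.org today";
        System.out.println(findAll(email, text));
        System.out.println(wholeMatch(email, text));
        System.out.println(partialMatch(email, text));
    }
}
```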
Complete example
These two classes -- Pattern and Matcher -- are the whole
pattern-matching library. Coming up with the right regular expression and
then working with the results of the Matcher class is really
all there is to the library. Until a dedicated book on regular expressions
comes out for the Java language, find a good book on Perl to learn more
about the specific patterns. Listing 1 provides a complete example by
looking for the longest word in a particular file passed in from the
command line as input.
Listing 1. Longest word example
import java.io.*;
import java.nio.*;
import java.nio.channels.*;
import java.nio.charset.*;
import java.util.*;
import java.util.regex.*;

public class Longest {
  public static void main(String args[]) {
    if (args.length != 1) {
      System.err.println("Provide a filename");
      return;
    }
    try {
      // Map File from filename to byte buffer
      FileInputStream input =
        new FileInputStream(args[0]);
      FileChannel channel = input.getChannel();
      int fileLength = (int)channel.size();
      MappedByteBuffer buffer = channel.map(
        FileChannel.MapMode.READ_ONLY, 0, fileLength);
      // Convert to character buffer
      Charset charset = Charset.forName("ISO-8859-1");
      CharsetDecoder decoder = charset.newDecoder();
      CharBuffer charBuffer = decoder.decode(buffer);
      // Create line pattern
      Pattern linePattern =
        Pattern.compile(".*$", Pattern.MULTILINE);
      // Create word pattern: any one punctuation
      // or white-space character separates words
      Pattern wordBreakPattern =
        Pattern.compile("[\\p{Punct}\\s]");
      // Match line pattern to buffer
      Matcher lineMatcher =
        linePattern.matcher(charBuffer);
      // Holder for longest word
      String longest = "";
      // For each line
      while (lineMatcher.find()) {
        // Get line
        String line = lineMatcher.group();
        // Get array of words on line
        String words[] = wordBreakPattern.split(line);
        // Look for longest word
        for (int i=0, n=words.length; i<n; i++) {
          if (words[i].length() > longest.length()) {
            longest = words[i];
          }
        }
      }
      // Report
      System.out.println("Longest word: " + longest);
      // Close
      input.close();
    } catch (IOException e) {
      System.err.println("Error processing");
    }
  }
}
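As a variation on Listing 1, and not the approach the listing itself takes, the words can be matched directly instead of splitting on the separators. This sketch negates the separator class, so each call to find() returns one run of characters that are neither punctuation nor white space. It works on any in-memory character sequence, which makes it easy to try without a file. The LongestInText class and longestWord() method are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class LongestInText {
    // A word is a run of characters that are not punctuation or white space.
    private static final Pattern WORD =
        Pattern.compile("[^\\p{Punct}\\s]+");

    // Returns the first of the longest words, or "" if there are none.
    static String longestWord(CharSequence text) {
        Matcher matcher = WORD.matcher(text);
        String longest = "";
        while (matcher.find()) {
            String word = matcher.group();
            if (word.length() > longest.length()) {
                longest = word;
            }
        }
        return longest;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox\njumps over the lazy-dog, sleepily.";
        System.out.println("Longest word: " + longestWord(text));
    }
}
```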
Resources
- Participate in the discussion forum on this article.
- Read the API documentation for the java.util.regex
package.
- Try out alphaWorks Regex
for Java for Java versions prior to 1.4.
- The developerWorks Linux zone
publishes a monthly column, Cultured Perl, which may provide you
with insight into regular expressions using the Java language.
- Read the complete
collection of Merlin tips by John Zukowski.
- Find more Java technology resources on the developerWorks Java technology
zone.