Magic with Merlin: Parse sequences of characters with the new regex library




Moving beyond StreamTokenizer and StringTokenizer for pattern matching

Level: Intermediate

John Zukowski (jaz@zukowski.net)
President, JZ Ventures, Inc.
1 August 2002

While previous versions of the Java language supported pattern matching, the StreamTokenizer and StringTokenizer classes barely scratched the surface of what you can do with patterns. The Java 1.4 (and now 1.4.1) release contains support for pattern matching with regular expressions in the java.util.regex package. In this installment of Magic with Merlin, John Zukowski shows you how to parse sequences of characters with the new regular expression library to add power to your search patterns.

Parse text strings for patterns
Regular expressions are ways to match patterns against text -- similar to how a compiler works to generate class files. A compiler looks for various patterns in the source to convert the source code expressions into bytecodes. By recognizing these source patterns, the compiler is able to translate only valid representations of source into compiled class files.

What are patterns?
In the context of regular expressions, patterns are text representations of sequences of characters. For instance, if you wanted to know if the word car existed within a character sequence, you would use the pattern car because that is how you represent the exact string. For a more complicated pattern, you can use special characters as placeholders. If instead of searching for car, you wanted to search for any string of text that began with the letter c and ended with the letter r, you would use the c.*r pattern, where the period (.) stands for any single character and the asterisk (*) means the preceding item can appear zero or more times. The c.*r pattern would match any string of characters that begins with c and ends with r, as in cougar, cavalier, or chrysler. (Written without the period, c*r would instead mean zero or more occurrences of the letter c followed by an r.)
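
As a minimal sketch of how the period and the asterisk behave in java.util.regex, the following checks a few words against two expressions. The class name WildcardDemo and the helper wholeMatch() are made up for this illustration; only Pattern.matches() comes from the library.

```java
import java.util.regex.Pattern;

class WildcardDemo {
  // True when the entire input matches the regular expression
  static boolean wholeMatch(String regex, String input) {
    return Pattern.matches(regex, input);
  }

  public static void main(String[] args) {
    String[] words = {"car", "cougar", "cavalier", "chrysler", "cat"};
    for (String word : words) {
      // Every word except "cat" begins with c and ends with r
      System.out.println(word + ": " + wholeMatch("c.*r", word));
    }
    // Without the period, the asterisk repeats the letter c itself
    System.out.println(wholeMatch("c*r", "ccr"));  // true
    System.out.println(wholeMatch("c*r", "car"));  // false
  }
}
```

The period is what lets arbitrary characters sit between the c and the r; the asterisk on its own only repeats whatever comes directly before it.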

How to specify pattern expressions
The main part of pattern matching is coming up with the expression to use. This expression is then retained by the Pattern class before it is passed on to the Matcher class to check for matches in the context of a character sequence. For instance, if you want to validate an e-mail address, you might check whether the user input matches the pattern of a sequence of alphanumeric characters, followed by the @ symbol, then followed by two sets of characters separated by a period. This could be represented by the expression of \p{Alnum}+@\w+\.\p{Alpha}{2,3}. (Yes, this does oversimplify an e-mail address structure and probably would reject certain valid e-mail addresses, but as an example it's sufficient.)

Before we look at the specifics of the pattern language, let's look at \p{Alnum}+@\w+\.\p{Alpha}{2,3} in detail. The \p{Alnum} sequence means a single alphanumeric character (A through Z, a through z, or 0 through 9). The plus sign (+) after \p{Alnum} is called a quantifier. It is applied to the prior part of the expression and means that \p{Alnum} must be present one or more times. Use an asterisk (*) for zero or more times. The @ is just that, meaning it must appear after at least one alphanumeric character for the whole pattern to succeed. The \w+ is similar to the \p{Alnum}+, but also accepts an underscore (_). Some sequences have multiple expressions. The backslash followed by a period (\.) means a literal period. Without the preceding backslash, the period alone means any character. The final \p{Alpha}{2,3} means two or three alphabetic characters.
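
To see the pieces of that expression at work, here is a small sketch that runs it against a handful of sample strings. The class name EmailCheck, the helper looksLikeEmail(), and the sample addresses (apart from the author's own) are assumptions made for the illustration. Note that the backslashes are doubled because the expression sits inside a Java string literal.

```java
import java.util.regex.Pattern;

class EmailCheck {
  // The simplified expression from the text, written as a Java string
  static final String EMAIL = "\\p{Alnum}+@\\w+\\.\\p{Alpha}{2,3}";

  // True when the entire input has the shape described by EMAIL
  static boolean looksLikeEmail(String input) {
    return Pattern.matches(EMAIL, input);
  }

  public static void main(String[] args) {
    String[] samples = {
      "jaz@zukowski.net",      // accepted
      "@zukowski.net",         // rejected: nothing before the @
      "jaz@zukowski",          // rejected: no period and suffix
      "john.doe@example.com",  // rejected although valid: period before the @
      "jaz@example.museum"     // rejected although valid: suffix over 3 letters
    };
    for (String sample : samples) {
      System.out.println(sample + " -> " + looksLikeEmail(sample));
    }
  }
}
```

The last two samples show the oversimplification mentioned earlier: both are plausible addresses that this expression turns away.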

The whole trick of working with patterns is to learn the specification language. Let's look at some of the classes of more commonly used expressions:

Literals: Any character that doesn't have special meaning within an expression is considered a literal and matches itself.
Quantifiers: Certain characters or expressions are used to count the number of times a literal or grouping can be present in a character sequence for the sequence to match an expression. Groupings are specified by a group of characters within parentheses.
- ? means once or not at all
- * means zero or more times
- + means one or more times
Character classes: A character class is a set of characters within square brackets where a match would be any one character within the brackets. You can combine character classes with quantifiers; for example, [acegikmoqsuwy]* would be any sequence of characters that includes only the odd-numbered letters of the alphabet (a, c, e, and so on). Certain character classes are predefined:
- \d -- A digit (from 0 to 9)
- \D -- A non-digit
- \s -- A white-space character, like tab or new line
- \S -- A non white-space character
- \w -- A word character (a through z, A through Z, 0 through 9, and underscore)
- \W -- A non-word character (everything else)
POSIX character classes: Certain character classes match only US-ASCII characters. For instance:
- \p{Lower} -- Lowercase characters
- \p{Upper} -- Uppercase characters
- \p{ASCII} -- All ASCII characters
- \p{Alpha} -- An alphabetic character (combining \p{Lower} with \p{Upper})
- \p{Digit} -- A number from 0 to 9
- \p{Alnum} -- Alphanumeric characters
Range: Use a dash to specify a character class for an inclusive range. For instance, [A-J] means the uppercase letters from A through J.
Negation: The caret symbol ( ^ ) negates the contents of a character class. For instance, [^A-J] means any character but A through J.
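
The classes of expressions listed above can be tried out one at a time. The following is only a sketch: the class name ExpressionClasses, the helper matches(), and the sample strings are invented for the illustration, and each call tests whether the whole input matches the expression.

```java
import java.util.regex.Pattern;

class ExpressionClasses {
  // True when the entire input matches the regular expression
  static boolean matches(String regex, String input) {
    return Pattern.matches(regex, input);
  }

  public static void main(String[] args) {
    // Literal: matches itself
    System.out.println(matches("car", "car"));                   // true
    // Quantifiers, applied to a single character and to a grouping
    System.out.println(matches("colou?r", "color"));             // true
    System.out.println(matches("(ab)+", "ababab"));              // true
    // Character class combined with a quantifier
    System.out.println(matches("[acegikmoqsuwy]*", "cake"));     // true
    System.out.println(matches("[acegikmoqsuwy]*", "bake"));     // false
    // Predefined character classes
    System.out.println(matches("\\d+", "2002"));                 // true
    System.out.println(matches("\\S+", "two words"));            // false
    // POSIX character classes
    System.out.println(matches("\\p{Upper}\\p{Lower}+", "Merlin")); // true
    // Range and negation
    System.out.println(matches("[A-J]", "K"));                   // false
    System.out.println(matches("[^A-J]", "K"));                  // true
  }
}
```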

See the Pattern API documentation (available from Resources) for additional details on the sequences.

How to use patterns effectively
Now that you've learned how to specify patterns, let's use them. You need to ask the Pattern class to compile them, as shown below. Notice that the backslash character ( \ ) needs to be escaped in the String constant.


Pattern pattern = Pattern.compile(
  "\\p{Alnum}+@\\w+\\.\\p{Alpha}{2,3}");

After you have a compiled pattern, you can use the Pattern class to split an input line into a series of words based upon the pattern, or use the Matcher class to do some more complicated tasks. Here's how to split a character sequence of input, where the pattern used specifies the separators, not the words:


String words[] =  pattern.split(input);
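
Because the pattern passed to split() describes the separators, a more typical splitting pattern is a character class of delimiters. Here is a minimal sketch, assuming the separators are commas, semicolons, and white space; the class name SplitDemo, the helper splitOnSeparators(), and the sample input are made up for the illustration.

```java
import java.util.regex.Pattern;

class SplitDemo {
  // Breaks the input apart wherever a run of separators occurs
  static String[] splitOnSeparators(String input) {
    // One or more commas, semicolons, or white-space characters
    Pattern separators = Pattern.compile("[,;\\s]+");
    return separators.split(input);
  }

  public static void main(String[] args) {
    String[] words = splitOnSeparators("red, green;blue  yellow");
    for (String word : words) {
      System.out.println(word);  // red, green, blue, yellow on separate lines
    }
  }
}
```

The plus sign after the character class treats a run of adjacent separators as a single break, so no empty strings appear between the words here.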

If you want to match a pattern multiple times within a character sequence, the above code snippets are a good place to start. But if you want to fetch specific input, you'll need the matcher() method of Pattern. When given some input, this method returns a Matcher instance for that input. You then use the Matcher instance to look through the results to find the different matches for the pattern in the input sequence, or better yet, use the Matcher instance as a search-and-replace tool:


Matcher matcher = pattern.matcher(input);
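
As one possible illustration of the search-and-replace use, the replaceAll() method of Matcher substitutes a replacement string for every match. The class name ReplaceDemo, the helper maskDigits(), and the sample sentence are assumptions for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceDemo {
  // Replaces every run of digits in the input with a single # sign
  static String maskDigits(String input) {
    Pattern digits = Pattern.compile("\\d+");
    Matcher matcher = digits.matcher(input);
    return matcher.replaceAll("#");
  }

  public static void main(String[] args) {
    // Prints: Order # shipped on day #
    System.out.println(maskDigits("Order 66 shipped on day 12"));
  }
}
```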

To match the pattern against the whole sequence, use matches(). To see if just a part of the sequence matches, use find():


if (matcher.find()) {
    // Found some string within input sequence
    // That matched the compiled pattern
    String match = matcher.group();
    // Process matching pattern
}
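
Calling find() repeatedly steps through every match in the input, which makes the difference from matches() easy to see. The following is a sketch under stated assumptions: the class name FindAllDemo, the helper findAll(), and the sample sentence are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllDemo {
  // Collects every substring of the input that matches the expression
  static List<String> findAll(String regex, String input) {
    List<String> results = new ArrayList<String>();
    Matcher matcher = Pattern.compile(regex).matcher(input);
    // Each call to find() resumes where the previous match ended
    while (matcher.find()) {
      results.add(matcher.group());
    }
    return results;
  }

  public static void main(String[] args) {
    String email = "\\p{Alnum}+@\\w+\\.\\p{Alpha}{2,3}";
    String input = "Write to jaz@zukowski.net or info@example.com today";

    // Prints: [jaz@zukowski.net, info@example.com]
    System.out.println(findAll(email, input));

    // matches() requires the whole sequence to fit the pattern
    System.out.println(Pattern.matches(email, input));               // false
    System.out.println(Pattern.matches(email, "jaz@zukowski.net"));  // true
  }
}
```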

Complete example
These two classes -- Pattern and Matcher -- are the whole pattern-matching library. Coming up with the right regular expression and then working with the results of the Matcher class is really all there is to the library. Until a dedicated book on regular expressions comes out for the Java language, find a good book on Perl to learn more about the specific patterns. Listing 1 provides a complete example by looking for the longest word in a particular file passed in from the command line as input.

Listing 1. Longest word example



import java.io.*;
import java.nio.*;
import java.nio.channels.*;
import java.nio.charset.*;
import java.util.*;
import java.util.regex.*;

public class Longest {
  public static void main(String args[]) {
    if (args.length != 1) {
      System.err.println("Provide a filename");
      return;
    }

    try {
      // Map File from filename to byte buffer
      FileInputStream input = 
        new FileInputStream(args[0]);
      FileChannel channel = input.getChannel();
      int fileLength = (int)channel.size();
      MappedByteBuffer buffer = channel.map(
        FileChannel.MapMode.READ_ONLY, 0, fileLength); 

      // Convert to character buffer
      Charset charset = Charset.forName("ISO-8859-1");
      CharsetDecoder decoder = charset.newDecoder();
      CharBuffer charBuffer = decoder.decode(buffer);

      // Create line pattern
      Pattern linePattern = 
        Pattern.compile(".*$", Pattern.MULTILINE);

      // Create word-break pattern: any single punctuation
      // or white-space character separates words
      Pattern wordBreakPattern = 
        Pattern.compile("[\\p{Punct}\\s]");

      // Match line pattern to buffer
      Matcher lineMatcher = 
        linePattern.matcher(charBuffer);

      // Holder for longest word
      String longest = "";

      // For each line
      while (lineMatcher.find()) {

        // Get line
        String line = lineMatcher.group();

        // Get array of words on line
        String words[] = wordBreakPattern.split(line);

        // Look for longest word
        for (int i=0, n=words.length; i<n; i++) {
          if (words[i].length() > longest.length()) {
            longest = words[i];
          }
        }
      }
      // Report
      System.out.println("Longest word: " + longest);
 
      // Close
      input.close();
    } catch (IOException e) {
      System.err.println("Error processing: " + e.getMessage());
    }
  }
}

Resources

Participate in the discussion forum on this article. (You can also click Discuss at the top or bottom of the article to access the forum.)
Read the API documentation for the java.util.regex package.
Try out alphaWorks Regex for Java for Java versions prior to 1.4.
The developerWorks Linux zone publishes a monthly column, Cultured Perl, which may provide you with insight into regular expressions using the Java language.
Read the complete collection of Merlin tips by John Zukowski.
Find more Java technology resources on the developerWorks Java technology zone.

About the author

John Zukowski conducts strategic Java consulting with JZ Ventures, Inc. and serves as the resident guru for a number of jGuru's community-driven Java FAQs. His latest books are Learn Java with JBuilder 6 from Apress and Mastering Java 2: J2SE 1.4 from Sybex. Reach him at jaz@zukowski.net.
