Reading from an output stream
Engineering a framework specifically to solve this problem

Merlin Hughes (merlin@merlin.org), Cryptographer, Baltimore Technologies
9 July 2002
The Java I/O framework is, in general,
extremely versatile. The same framework supports file access, network
access, character conversion, compression, encryption and so forth.
Sometimes, however, it is not quite flexible enough. For example, the
compression streams allow you to write data into a compressed form but
they don't let you read it in a compressed form. Similarly, some
third-party modules are built to write out data, without consideration
for scenarios where applications need to read in the data. In this
article, the first in a two-part series, Java cryptographer and author
Merlin Hughes presents a framework that lets an application efficiently
read data from a source that only supports writing data to an output
stream.
The Java platform has expanded vastly since the early days of
browser-based applets and simple applications. We now have multiple
platforms and profiles and dozens of new APIs, with literally hundreds
more in the making. In spite of its increasing complexity, the Java
language is still a great tool for day-to-day programming tasks. While
sometimes you can get mired in those day-to-day programming problems,
occasionally you're able to step back and see an elegant solution to a
problem you've encountered many times before.
Just the other day, I wanted to compress some data as I read them from
a network connection (I was relaying TCP data, in a compressed form, down
a UDP socket). Remembering that compression has been supported by the Java
platform since version 1.1, I went straight to the package
java.util.zip , expecting to find a solution waiting for me.
Instead, I found a problem: the classes are architected around the normal
case of decompressing data when reading and compressing them when writing,
and not the other way around. Although it is possible to bypass the I/O
classes, I wanted a streams-based solution and did not want to sully my
hands with using the compressor directly.
It occurred to me that I had encountered the exact same problem in
another situation only a short time ago. I have a base-64 transcoding
library, and as with the compression package, it supports decoding data
that are read from a stream and encoding data that are written to a
stream. However, I was in need of a library that would encode data as I
read them from a stream.
As I set out to solve this problem, I realized that I had encountered
it on yet another occasion: when you serialize an XML document, you
typically iterate through the document, writing the nodes to a stream.
However, I had been in the position of needing to read the serialized
form of a document subset in order to reparse it into a new
document.
Taking a step back, I realized that these isolated incidents
represented a general problem: given a data source that incrementally
writes data to an output stream, I need an input stream that will allow me
to read these data, transparently calling on the data source whenever more
data are needed.
In this article, we'll examine three possible solutions to the problem,
settling on a new framework that implements the best of the other
solutions. Then, we'll test out the framework on each of the problems
listed above. We'll briefly touch on performance concerns, but will save
the bulk of that discussion for the next article.
I/O stream basics
First, let's briefly review the Java platform's basic stream classes,
which are illustrated in Figure 1. An OutputStream represents a stream
to which data can be written. Typically, this stream will either be
directly connected to a device, such as a file or a network connection,
or to another output stream, in which case it is termed a filter.
Output stream filters typically transform the data that are written to
them before writing the resulting transformed data to the attached
stream. An InputStream represents a stream of data from which data can
be read. Again, this stream will be either directly connected to a
device or else to another stream. Input stream filters read data from
the attached stream, transform these data, and then allow the
transformed data to be read from them.
Figure 1. I/O stream basics
In terms of my initial problem, the GZIPOutputStream class
is an output stream filter that compresses data that are written to it,
and then writes the compressed data to the attached stream. I needed
an input stream filter that would read data from a stream, compress
them, and let me read the result.
The Java platform, version 1.4, has introduced a new I/O framework,
java.nio . However, much of this framework is concerned with
providing efficient access to operating system I/O resources; and,
although it does provide analogs for some of the traditional
java.io classes and can represent dual-purpose resources that
support both input and output, it does not entirely replace the standard
stream classes and does not directly address the problem that I needed to
solve.
The brute-force solution
Before setting out to find an engineering solution to my problem, I
examined solutions based on the standard Java API classes in terms of
their elegance and efficiency.
The brute-force solution to the problem is to simply read all the data
from the input source, then push them through the transformer (that is,
the compression stream, the encoding stream, or the XML serializer) into a
memory buffer. I can then open a stream to read from this memory buffer,
and I will have solved my problem.
First, I need a general-purpose I/O method. The method in Listing 1
copies all data from an InputStream to an
OutputStream using a small buffer. When the end of the input
is reached (the read() function returns less than zero), the
method returns without closing either stream.
public static void io (InputStream in, OutputStream out)
throws IOException {
byte[] buffer = new byte[8192];
int amount;
while ((amount = in.read (buffer)) >= 0)
out.write (buffer, 0, amount);
}
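For example (an illustrative fragment of my own; the file names are
made up), this method copies one file to another:
InputStream in = new FileInputStream ("source.dat");
OutputStream out = new FileOutputStream ("copy.dat");
io (in, out);
in.close ();
out.close ();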
Listing 2 shows the brute-force solution that lets me read the
compressed form of an input stream. I open a GZIPOutputStream that
writes into a memory buffer (I use a ByteArrayOutputStream). Next, I
copy the input stream into the compression stream, which fills the
memory buffer with compressed data. I then return a
ByteArrayInputStream that lets me read the compressed data back from
this buffer, as shown in Figure 2:
Figure 2. The brute-force solution
public static InputStream bruteForceCompress (InputStream in)
throws IOException {
ByteArrayOutputStream sink = new ByteArrayOutputStream ();
OutputStream out = new GZIPOutputStream (sink);
io (in, out);
out.close ();
byte[] buffer = sink.toByteArray ();
return new ByteArrayInputStream (buffer);
}
An obvious flaw with this solution is that it stores the entire
compressed document in memory. If the document is large, then this
approach will needlessly waste system resources. One of the great features
of using streams is that they allow you to operate on data larger than the
memory of the system you are using: you can process data as you read them,
or generate data as you write them, without ever holding all the data in
memory.
In terms of efficiency, let's look more closely at copying data between
buffers.
The data are read, by the io() method, from the input
source into one buffer. Then they are written from that buffer into a
buffer within the ByteArrayOutputStream (through the
compression that I am ignoring). However, the
ByteArrayOutputStream class operates with an expanding
internal buffer; whenever the buffer becomes full, a new buffer, twice the
size, is allocated and the existing data are copied into it. On average,
every byte is copied twice by this process. (The math is simple: the
average datum is copied twice when entering a
ByteArrayOutputStream ; all the data are copied at least once;
half are copied at least twice; a quarter at least three times, and so
on). The data are then copied from that buffer into a new one for the
ByteArrayInputStream . The data are now available to be read
by the application. In total, the data will be written through four
buffers by this solution. This is a useful baseline for estimating the
efficiency of other techniques.
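To make that arithmetic concrete, here is a small sketch of my own
(assuming the doubling growth policy just described; a
ByteArrayOutputStream actually starts with a 32-byte buffer) that
estimates the average copy count:
long n = 15 * 1024 * 1024; // 15MB of data, as in the benchmark below
long capacity = 32;        // initial buffer size
long copies = n;           // every byte is copied in at least once
while (capacity < n) {
  copies += capacity;      // growing the buffer recopies its contents
  capacity *= 2;
}
System.out.println ((double) copies / n); // prints roughly 2.0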
The piped-streams solution
The piped streams, PipedOutputStream and PipedInputStream, provide a
streams-based connection between the threads of a Java virtual machine.
Data written by one thread into a PipedOutputStream can concurrently be
read by another thread from the associated PipedInputStream.
As such, these classes present a solution to my problem. Listing 3
shows the code that employs one thread to copy data from the input stream
through a GZIPOutputStream and into a
PipedOutputStream . The associated
PipedInputStream will then provide read access to the
compressed data from another thread, as illustrated in Figure 3:
Figure 3. The piped-streams solution
private static InputStream pipedCompress (final InputStream in)
throws IOException {
PipedInputStream source = new PipedInputStream ();
final OutputStream out =
new GZIPOutputStream (new PipedOutputStream (source));
new Thread () {
public void run () {
try {
Streams.io (in, out);
out.close ();
} catch (IOException ex) {
ex.printStackTrace ();
}
}
}.start ();
return source;
}
In theory, this could be a good technique: by employing threads (one
will perform the compression, the other will process the resulting data),
an application can benefit from hardware SMP (symmetric multiprocessing)
or SMT (simultaneous multithreading). Additionally, this solution involves
only two buffer writes: the I/O loop reads data from the input stream into
a buffer before writing through the compressed stream into the
PipedOutputStream . The output stream then stores data in an
internal buffer, which is directly shared with the
PipedInputStream for reading by the application. Furthermore,
because data are streamed through a fixed buffer, they never need to be
read entirely into memory. Instead, only a small working set will be
buffered at any given time.
In practice, however, performance is terrible. The piped streams need
to make use of synchronization, which will be heavily contended between
the two threads. Their internal buffer is too small to process large
amounts of data effectively or to hide the cost of that lock contention.
Additionally, the constant sharing of the buffer will defeat many simple
caching strategies when the workload is spread across the processors of
an SMP machine. Finally, using threads makes exception handling very
difficult: there is no way to push any IOException that may occur down
the pipe for processing by the reader. Overall, this solution is much
too heavyweight to be effective.
Synchronization issues
As has become accepted practice in such libraries as the NIO framework
and the Collections API, synchronization is left as a burden on the
application. If an application expects to make concurrent access to an
object, the application must take the necessary steps to synchronize
that access. None of the code presented in this article is synchronized;
that is, it is not safe for two threads to concurrently access a shared
instance of one of these classes.
Although recent JVMs have greatly improved the performance of
their thread safety mechanisms, synchronization remains an expensive
operation. In the case of I/O, concurrent access to a single stream
is, almost invariably, an error; the order of the resulting data
streams will be nondeterministic, which is very rarely the desired
scenario. As such, to synchronize these classes would be to impose
an unnecessary expense with no tangible benefit.
We'll cover multithreaded considerations in more detail in part 2
of this series; for now, simply note that concurrent access to the
streams I present will result in nondeterministic
errors.
The engineering solution
Now we'll look at an alternative engineering solution to the problem.
This solution provides a framework that is specifically engineered to
solve this class of problem, a framework that provides InputStream
access to data that are produced from a source that incrementally
writes data to an OutputStream. The
fact that data are written incrementally is important. If the source
writes all data to the OutputStream in a single atomic
operation and if threads are not used, we're basically back at the
brute-force technique. If, however, the source can be called on to
incrementally write its data, we've achieved a good balance between the
brute-force and the piped-streams solutions. This solution provides the
brute-force benefit of avoiding threads, while providing the piped benefit
of only holding a small amount of data in memory at any time.
Figure 4 illustrates the complete solution. We'll be examining the source
code for this solution for the remainder of the article.
Figure 4. The engineering solution
An output engine
Listing 4 provides an interface, OutputEngine, that describes my data
sources. As I said, these sources incrementally write data to an output
stream:
package org.merlin.io;
import java.io.*;
/**
* An incremental data source that writes data to an OutputStream.
*
* @author Copyright (c) 2002 Merlin Hughes <merlin@merlin.org>
*
* This program is free software; you can redistribute
* it and/or modify it under the terms of the GNU
* General Public License as published by the Free
* Software Foundation; either version 2
* of the License, or (at your option) any later version.
*/
public interface OutputEngine {
public void initialize (OutputStream out) throws IOException;
public void execute () throws IOException;
public void finish () throws IOException;
}
The initialize() method presents the engine with the
stream to which it should write data. The execute() method
will then be repeatedly called to write data to this stream. When there
are no more data, the engine should close the stream. Finally,
finish() will be called when the engine should shut down.
This may occur before or after the engine has closed its output
stream.
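To illustrate the contract, the following trivial engine (a sketch of
my own; it is not part of the accompanying source) writes a fixed
message one byte at a time:
import java.io.*;
// assumes the same package as, or an import of, org.merlin.io
public class MessageEngine implements OutputEngine {
  private byte[] message = "hello, world".getBytes ();
  private int index = 0;
  private OutputStream out;

  public void initialize (OutputStream out) throws IOException {
    this.out = out;
  }
  public void execute () throws IOException {
    if (index >= message.length) {
      out.close (); // no more data; close the stream
    } else {
      out.write (message, index++, 1); // write incrementally
    }
  }
  public void finish () throws IOException {
    // no resources to release
  }
}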
An I/O stream engine
An output engine that addresses the problem that started me off on this
effort is one that copies data from an input stream through an output
stream filter into the target output stream. This satisfies the property
of incrementality because it can read and write a single buffer at a
time.
The code in Listings 5 through 10 implements such an engine. It is
constructed from an input stream and an output stream factory. Listing
11 is a factory that generates filtered output streams; for instance, it
could return a GZIPOutputStream wrapped around the target output stream.
Listing 5. The I/O stream engine
package org.merlin.io;
import java.io.*;
/**
* An output engine that copies data from an InputStream through
* a FilterOutputStream to the target OutputStream.
*
* @author Copyright (c) 2002 Merlin Hughes <merlin@merlin.org>
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*/
public class IOStreamEngine implements OutputEngine {
private static final int DEFAULT_BUFFER_SIZE = 8192;
private InputStream in;
private OutputStreamFactory factory;
private byte[] buffer;
private OutputStream out;
The constructors for this class just initialise various variables and
the buffer that will be used for transferring data.
public IOStreamEngine (InputStream in, OutputStreamFactory factory) {
this (in, factory, DEFAULT_BUFFER_SIZE);
}
public IOStreamEngine
(InputStream in, OutputStreamFactory factory, int bufferSize) {
this.in = in;
this.factory = factory;
buffer = new byte[bufferSize];
}
In the initialize() method, this engine calls on its factory to wrap
the OutputStream that it has been supplied with. The factory will
typically attach a filter to the OutputStream.
Listing 7. The initialize() method
public void initialize (OutputStream out) throws IOException {
if (this.out != null) {
throw new IOException ("Already initialised");
} else {
this.out = factory.getOutputStream (out);
}
}
In the execute() method, the engine reads a buffer of data
from the InputStream and writes them to the wrapped
OutputStream ; or, if the input is exhausted, it closes the
OutputStream .
public void execute () throws IOException {
if (out == null) {
throw new IOException ("Not yet initialised");
} else {
int amount = in.read (buffer);
if (amount < 0) {
out.close ();
} else {
out.write (buffer, 0, amount);
}
}
}
Finally, when it is shut down, the engine closes its
InputStream .
public void finish () throws IOException {
in.close ();
}
The inner OutputStreamFactory interface, shown below in
Listing 10, describes a class that can return a filtered
OutputStream .
public static interface OutputStreamFactory {
public OutputStream getOutputStream (OutputStream out)
throws IOException;
}
}
Listing 11 shows an example factory that wraps the supplied stream in a
GZIPOutputStream :
public class GZIPOutputStreamFactory
implements IOStreamEngine.OutputStreamFactory {
public OutputStream getOutputStream (OutputStream out)
throws IOException {
return new GZIPOutputStream (out);
}
}
This I/O stream engine with its output stream factory framework is
general enough to support most output stream filtering needs.
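Other filters drop in just as easily; for instance, a raw-deflate
factory (my own variant, analogous to the factory of Listing 11) would
be:
import java.io.*;
import java.util.zip.*;
// assumes the same package as, or an import of, org.merlin.io
public class DeflaterOutputStreamFactory
    implements IOStreamEngine.OutputStreamFactory {
  public OutputStream getOutputStream (OutputStream out) {
    return new DeflaterOutputStream (out); // raw deflate, no gzip header
  }
}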
An output engine input stream
Finally, we need one more piece of code to complete
this solution. The code in Listings 12 through 16 presents an input stream
that reads the data that are written by an output engine. There are, in
fact, two parts to this piece of code: the main class is an input stream
that reads data from an internal buffer. Tightly coupled with this is an
output stream, shown in Listing 17, that fills the internal reading buffer
with data written by the output engine.
The main input stream class will initialise the output engine with its
internal output stream. It can then automatically execute the engine to
receive more data whenever its buffer is empty. The output engine will
write data to its output stream and this will refill the input stream's
internal buffer, allowing the data to be efficiently read by the consuming
application.
package org.merlin.io;
import java.io.*;
/**
* An input stream that reads data from an OutputEngine.
*
* @author Copyright (c) 2002 Merlin Hughes <merlin@merlin.org>
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*/
public class OutputEngineInputStream extends InputStream {
private static final int DEFAULT_INITIAL_BUFFER_SIZE = 8192;
private OutputEngine engine;
private byte[] buffer;
private int index, limit, capacity;
private boolean closed, eof;
The constructors for this input stream take an output engine from which
to read data and an optional buffer size. The stream first initialises
itself, and then it initialises the output engine.
public OutputEngineInputStream (OutputEngine engine) throws IOException {
this (engine, DEFAULT_INITIAL_BUFFER_SIZE);
}
public OutputEngineInputStream (OutputEngine engine, int initialBufferSize)
throws IOException {
this.engine = engine;
capacity = initialBufferSize;
buffer = new byte[capacity];
engine.initialize (new OutputStreamImpl ());
}
The main reading part of the code is a relatively straightforward byte
array-based input stream, much the same as the
ByteArrayInputStream class. However, whenever data are
requested and this stream is empty, it invokes the output engine's
execute() method to refill the read buffer. These new data
can then be returned to the caller. Thus, this class will iteratively read
through data written by the output engine until it completes, whereupon
the eof flag will get set and this stream will return that
the end of file has been reached.
private byte[] one = new byte[1];
public int read () throws IOException {
int amount = read (one, 0, 1);
return (amount < 0) ? -1 : one[0] & 0xff;
}
public int read (byte data[], int offset, int length)
throws IOException {
if (data == null) {
throw new NullPointerException ();
} else if
((offset < 0) || (length < 0) || (offset + length > data.length)) {
throw new IndexOutOfBoundsException ();
} else if (closed) {
throw new IOException ("Stream closed");
} else {
while (index >= limit) {
if (eof)
return -1;
engine.execute ();
}
if (limit - index < length)
length = limit - index;
System.arraycopy (buffer, index, data, offset, length);
index += length;
return length;
}
}
public long skip (long amount) throws IOException {
if (closed) {
throw new IOException ("Stream closed");
} else if (amount <= 0) {
return 0;
} else {
while (index >= limit) {
if (eof)
return 0;
engine.execute ();
}
if (limit - index < amount)
amount = limit - index;
index += (int) amount;
return amount;
}
}
public int available () throws IOException {
if (closed) {
throw new IOException ("Stream closed");
} else {
return limit - index;
}
}
When the consuming application closes this stream, it invokes the
output engine's finish() method so that it can release any
resources that it is using.
public void close () throws IOException {
if (!closed) {
closed = true;
engine.finish ();
}
}
The writeImpl() method is invoked when the output engine
writes data to its output stream. It copies these data into the read
buffer and updates the read limit index; this will make the new data
automatically available to the reading methods.
If, in a single iteration, the output engine writes more data than can
be held in the buffer, then the buffer capacity is doubled. This should
not, however, happen very frequently; the buffer should rapidly expand to
a sufficient size for steady-state operation.
private void writeImpl (byte[] data, int offset, int length) {
if (index >= limit)
index = limit = 0;
if (limit + length > capacity) {
capacity = capacity * 2 + length;
byte[] tmp = new byte[capacity];
System.arraycopy (buffer, index, tmp, 0, limit - index);
buffer = tmp;
limit -= index;
index = 0;
}
System.arraycopy (data, offset, buffer, limit, length);
limit += length;
}
The inner output stream implementation shown below in Listing 17
represents a stream that writes data into the internal input stream
buffer. The code verifies that the parameters are acceptable, and, if so,
it invokes the writeImpl() method.
private class OutputStreamImpl extends OutputStream {
public void write (int datum) throws IOException {
one[0] = (byte) datum;
write (one, 0, 1);
}
public void write (byte[] data, int offset, int length)
throws IOException {
if (data == null) {
throw new NullPointerException ();
} else if
((offset < 0) || (length < 0) || (offset + length > data.length)) {
throw new IndexOutOfBoundsException ();
} else if (eof) {
throw new IOException ("Stream closed");
} else {
writeImpl (data, offset, length);
}
}
Finally, when the output engine closes its output stream, indicating
that it has no more data to write, this output stream sets the input
stream's eof flag, indicating that there are no more data to
read.
public void close () {
eof = true;
}
}
}
The keen reader may note that I could have placed the body of the
writeImpl() method directly in the output stream
implementation: inner classes have access to all the private members of
the enclosing class. However, inner-class access to such fields is a
fraction less efficient than access by a direct method of the enclosing
class. So, for efficiency, and to minimize inter-class dependencies, I use
an additional helper method.
Applying the engineering solution: Compressing data during a read
Listing 19 demonstrates the use of this framework of classes to solve
my initial problem: compressing
data as I read them. The solution boils down to creating an
IOStreamEngine associated with the input stream and a
GZIPOutputStreamFactory , and then attaching an
OutputEngineInputStream to this. Initialisation and
connection of the streams is performed automatically, and compressed data
can then be directly read from the resulting stream. When processing is
complete and the stream is closed, the output engine is automatically shut
down and it closes the original input stream.
private static InputStream engineCompress (InputStream in)
throws IOException {
return new OutputEngineInputStream
(new IOStreamEngine (in, new GZIPOutputStreamFactory ()));
}
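For example, compressing a file to disk with this method might look
like the following fragment (my own illustration; the file names are
made up):
InputStream compressed = engineCompress (new FileInputStream ("data.txt"));
OutputStream sink = new FileOutputStream ("data.txt.gz");
Streams.io (compressed, sink); // the copy loop from Listing 1
compressed.close ();
sink.close ();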
Although it is not surprising that a solution engineered to tackle
this class of problem should result in much cleaner code, the lesson is
worth heeding in general: applying good design techniques, no matter how
small or large the problem, will almost invariably result in cleaner,
more maintainable code.
Testing performance
In terms of efficiency, the IOStreamEngine will read data into
its internal buffer and then write them, through the compression filter,
to the OutputStreamImpl . This writes the data directly into
the OutputEngineInputStream where they are made available for
reading. All told, only two buffer copies are performed, which means that
I should benefit from a combination of the buffer-copying efficiency of
the piped-streams solution and the threadless efficiency of the
brute-force solution.
To test the performance in practice, I wrote a simple test harness (see
test.PerformanceTest in the accompanying source) that uses the three
proposed solutions to read through a block of dummy data using a null
filter. On an 800MHz Linux box running the Java 2 SDK, version 1.4.0,
the following performance was achieved:
- Piped-streams solution: 15KB: 23ms; 15MB: 22,100ms
- Brute-force solution: 15KB: 0.35ms; 15MB: 745ms
- Engineered solution: 15KB: 0.16ms; 15MB: 73ms
The engineered solution to this problem is clearly more efficient than
either of the alternatives based on the standard Java API.
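As a rough sketch of what such a measurement looks like (my own
approximation; the actual test.PerformanceTest may differ), note that a
null filter is simply a factory that returns the target stream
unwrapped:
IOStreamEngine.OutputStreamFactory nullFactory =
  new IOStreamEngine.OutputStreamFactory () {
    public OutputStream getOutputStream (OutputStream out) {
      return out; // pass-through: measures framework overhead only
    }
  };
byte[] data = new byte[15 * 1024]; // 15KB of dummy data
long start = System.currentTimeMillis ();
InputStream in = new OutputEngineInputStream
  (new IOStreamEngine (new ByteArrayInputStream (data), nullFactory));
Streams.io (in, new ByteArrayOutputStream ());
in.close ();
System.out.println ((System.currentTimeMillis () - start) + "ms");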
As an aside, consider that if an output engine could obey a contract
such that after writing data to its output stream it would return without
modifying the array from which it wrote the data, I could provide a
solution that used just a single buffer-copy operation. However, this
contract can rarely be honoured. If needed, an output engine could
advertise its support for this mode of operation by simply implementing an
appropriate marker interface.
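Such a marker interface could be as simple as the following sketch
(purely hypothetical; the name is mine and it appears nowhere in the
accompanying source):
/**
 * Marks an output engine that never modifies the array from which it
 * wrote after a write returns; a reading stream could then borrow that
 * array directly, saving one buffer copy.
 */
public interface StableOutputEngine extends OutputEngine {
}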
Applying the engineering solution: Reading encoded character data
Any problem that can be expressed in terms of providing read access to
an entity that iteratively
writes data to an OutputStream can be solved with this
framework. In this section and the next we'll take a look at examples of
such problems, along with their efficient solutions.
First, consider the case of wanting to read the UTF-8 encoded form of a
character stream: the InputStreamReader class lets you read
binary-encoded character data as a sequence of Unicode characters; it
represents a gateway from a byte input stream to a character input stream.
The OutputStreamWriter class lets you write a sequence of
Unicode characters in binary-encoded form to an output stream; it
represents a gateway from a character output stream to a byte output
stream. The getBytes() method of the String
class converts a string to an encoded byte array. However, none of these
classes directly let you read the UTF-8 encoded form of a character
stream.
The code in Listings 20 through 24 demonstrates a solution that uses
the OutputEngine framework in a very similar manner to the
IOStreamEngine class. Instead of reading from an input stream
and writing through an output stream filter, we read from a character
stream and write through an OutputStreamWriter using a chosen
character encoding.
package org.merlin.io;
import java.io.*;
/**
* An output engine that copies data from a Reader through
* an OutputStreamWriter to the target OutputStream.
*
* @author Copyright (c) 2002 Merlin Hughes <merlin@merlin.org>
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*/
public class ReaderWriterEngine implements OutputEngine {
private static final int DEFAULT_BUFFER_SIZE = 8192;
private Reader reader;
private String encoding;
private char[] buffer;
private Writer writer;
The constructors for this class accept the character stream to read
from, the encoding to use, and an optional buffer size.
public ReaderWriterEngine (Reader in, String encoding) {
this (in, encoding, DEFAULT_BUFFER_SIZE);
}
public ReaderWriterEngine
(Reader reader, String encoding, int bufferSize) {
this.reader = reader;
this.encoding = encoding;
buffer = new char[bufferSize];
}
When this engine is initialised, it attaches an
OutputStreamWriter that writes characters, in the chosen
encoding, to the supplied output stream.
public void initialize (OutputStream out) throws IOException {
if (writer != null) {
throw new IOException ("Already initialised");
} else {
writer = new OutputStreamWriter (out, encoding);
}
}
When this engine is executed, it reads data from the input character
stream, and writes them to the OutputStreamWriter , which
passes them on to the attached output stream in the chosen encoding. From
there, the framework makes them available for reading.
public void execute () throws IOException {
if (writer == null) {
throw new IOException ("Not yet initialised");
} else {
int amount = reader.read (buffer);
if (amount < 0) {
writer.close ();
} else {
writer.write (buffer, 0, amount);
}
}
}
When the engine is finished, it closes down its input.
public void finish () throws IOException {
reader.close ();
}
}
In this case, unlike the compression case, the Java I/O packages
provide no low-level access to the character encoding classes that lie
beneath OutputStreamWriter . As a result, this is the only
effective solution to reading the encoded form of a character stream on a
pre-1.4 release of the Java platform. As of version 1.4, the
java.nio.charset package does provide stream-independent
character encoding and decoding capabilities. However, this package does
not meet our requirement for an input stream-based solution.
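Using the framework, reading the UTF-8 encoded form of a character
stream then reduces to a couple of lines (my own fragment):
Reader reader = new StringReader ("se\u00f1or, stra\u00dfe");
InputStream utf8 = new OutputEngineInputStream
  (new ReaderWriterEngine (reader, "UTF-8"));
// utf8 now yields the UTF-8 bytes of the characters, encoded
// incrementally as the stream is read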
Applying the engineering solution: Reading serialized DOM documents
Finally, let's look at one last use of this framework. The code in
Listings 25 through 29 presents a
solution for reading the serialized form of a DOM document or document
subset. A potential use of this code might be to perform a validating
reparse on part of a DOM document.
package org.merlin.io;
import java.io.*;
import java.util.*;
import org.w3c.dom.*;
import org.w3c.dom.traversal.*;
/**
* An output engine that serializes a DOM tree using a specified
* character encoding to the target OutputStream.
*
* @author Copyright (c) 2002 Merlin Hughes <merlin@merlin.org>
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*/
public class DOMSerializerEngine implements OutputEngine {
private NodeIterator iterator;
private String encoding;
private OutputStreamWriter writer;
The constructors take a DOM node over which to iterate, or a
preconstructed node iterator (this is part of DOM 2), and an encoding to
use for the serialized form.
public DOMSerializerEngine (Node root) {
this (root, "UTF-8");
}
public DOMSerializerEngine (Node root, String encoding) {
this (getIterator (root), encoding);
}
private static NodeIterator getIterator (Node node) {
DocumentTraversal dt = (DocumentTraversal)
((node.getNodeType () == Node.DOCUMENT_NODE)
? node : node.getOwnerDocument ());
return dt.createNodeIterator (node, NodeFilter.SHOW_ALL, null, false);
}
public DOMSerializerEngine (NodeIterator iterator, String encoding) {
this.iterator = iterator;
this.encoding = encoding;
}
During initialisation, the engine attaches an appropriate
OutputStreamWriter to the target output stream.
public void initialize (OutputStream out) throws IOException {
if (writer != null) {
throw new IOException ("Already initialised");
} else {
writer = new OutputStreamWriter (out, encoding);
}
}
During the execution phase, the engine gets the next node from the node
iterator and serializes it to the OutputStreamWriter . When
there are no more nodes, the engine closes its stream.
public void execute () throws IOException {
if (writer == null) {
throw new IOException ("Not yet initialised");
} else {
Node node = iterator.nextNode ();
closeElements (node);
if (node == null) {
writer.close ();
} else {
writeNode (node);
writer.flush ();
}
}
}
There are no resources to free when this engine shuts down.
public void finish () throws IOException {
}
// private void closeElements (Node node) throws IOException ...
// private void writeNode (Node node) throws IOException ...
}
The remaining internals of serializing each node are fairly
uninteresting; the process basically involves writing out the node
according to its type and the XML 1.0 specification, so I will omit that
part of the code from this article. See the accompanying source
for full details.
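Usage follows the now-familiar pattern; for instance (my own fragment,
where element is assumed to be a node of an existing document):
InputStream xml = new OutputEngineInputStream
  (new DOMSerializerEngine (element));
// xml yields the serialized subtree rooted at element, produced one
// node at a time as the stream is read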
Conclusion
What I've presented is a useful framework that lets you efficiently
read, using the standard input stream API, data produced by a system
that can only write to an output stream. This lets us read compressed
or encoded data, serialized documents, and so on. Although this was
possible with the standard Java API, it was not at all efficient using
those classes. That this solution is more efficient than the simplest
brute-force solution, even for small data sizes, is worth noting. Any
application that writes data into a ByteArrayOutputStream for
subsequent processing may benefit from this framework.
The poor performance of the byte-array streams and the incredibly poor
performance of the piped streams are, in fact, the topic of my next
article. In it, I will look at reimplementing those classes with a greater
focus on performance than the original authors of the classes had.
Performance improvements of one hundred times are possible with only a
slightly relaxed API contract.
I hate washing the dishes. However, the ideas behind these classes, as
with most of what I consider my better (although still often trivial)
ideas, came to me while I was washing the dishes. More often than not,
I've found that taking a step back and considering a broader view of a
problem, away from the actual code, will reveal a better solution that
may, in the end, serve you much better than if you took the easy way out.
These solutions often result in cleaner, more efficient, and more
maintainable code.
I honestly fear the day that we get a dishwasher.
About the author
Merlin is a cryptographer and chief technical evangelist with the Irish
e-security company Baltimore Technologies, occasional author, and
part-time janitor and dishwasher; not to be confused with JDK 1.4.
Based in New York, New York (a city so nice, they named it twice), he
can be reached at merlin@merlin.org.