Reading from an output stream
Engineering a framework specifically to solve this problem

Merlin Hughes (merlin@merlin.org), Cryptographer, Baltimore Technologies
9 July 2002
The Java I/O framework is, in general,
extremely versatile. The same framework supports file access, network
access, character conversion, compression, encryption and so forth.
Sometimes, however, it is not quite flexible enough. For example, the
compression streams allow you to write data into a compressed form but
they don't let you read it in a compressed form. Similarly, some
third-party modules are built to write out data, without consideration
for scenarios where applications need to read in the data. In this
article, the first in a two-part series, Java cryptographer and author
Merlin Hughes presents a framework that lets an application efficiently
read data from a source that only supports writing data to an output
stream.
The Java platform has expanded vastly since the early days of
browser-based applets and simple applications. We now have multiple
platforms and profiles and dozens of new APIs, with literally hundreds
more in the making. In spite of its increasing complexity, the Java
language is still a great tool for day-to-day programming tasks. While
sometimes you can get mired in those day-to-day programming problems,
occasionally you're able to step back and see an elegant solution to a
problem you've encountered many times before.
Just the other day, I wanted to compress some data as I read them from
a network connection (I was relaying TCP data, in a compressed form, down
a UDP socket). Remembering that compression has been supported by the Java
platform since version 1.1, I went straight to the package
java.util.zip , expecting to find a solution waiting for me.
Instead, I found a problem: the classes are architected around the normal
case of decompressing data when reading and compressing them when writing,
and not the other way around. Although it is possible to bypass the I/O
classes, I wanted a streams-based solution and did not want to sully my
hands with using the compressor directly.
It occurred to me that I had encountered the exact same problem in
another situation only a short time ago. I have a base-64 transcoding
library, and as with the compression package, it supports decoding data
that are read from a stream and encoding data that are written to a
stream. However, I was in need of a library that would encode data as I
read them from a stream.
As I set out to solve this problem, I realized that I had encountered
it on yet another occasion: when you serialize an XML document, you
typically iterate through the document, writing the nodes to a stream.
However, I had been in the position of needing to read the serialized
form of a document subset in order to reparse it into a new
document.
Taking a step back, I realized that these isolated incidents
represented a general problem: given a data source that incrementally
writes data to an output stream, I need an input stream that will allow me
to read these data, transparently calling on the data source whenever more
data are needed.
In this article, we'll examine three possible solutions to the problem,
settling on a new framework that implements the best of the other
solutions. Then, we'll test out the framework on each of the problems
listed above. We'll briefly touch on performance concerns, but will save
the bulk of that discussion for the next article.
I/O stream basics
First, let's briefly review the Java platform's basic stream classes,
which are illustrated in Figure 1. An OutputStream represents a stream
to which data can be written. Typically, this stream will either be
directly connected to a device, such as a file or a network connection,
or to another output stream, in which case it is termed a filter.
Output stream filters typically transform the data that are written to
them before writing the resulting transformed data to the attached
stream. An InputStream represents a stream of data from which data can
be read. Again, this stream will be either directly connected to a
device or else to another stream. Input stream filters read data from
the attached stream, transform these data, and then allow the
transformed data to be read from them.
Figure 1. I/O stream basics
In terms of my initial problem, the GZIPOutputStream class
is an output stream filter that compresses data that are written to it,
and then writes the compressed data to the attached stream. I needed
an input stream filter that would read data from a stream, compress
them, and let me read the result.
The Java platform, version 1.4, has introduced a new I/O framework,
java.nio . However, much of this framework is concerned with
providing efficient access to operating system I/O resources; and,
although it does provide analogs for some of the traditional
java.io classes and can represent dual-purpose resources that
support both input and output, it does not entirely replace the standard
stream classes and does not directly address the problem that I needed to
solve.
The brute-force solution
Before setting out to find an engineering solution to my problem, I
examined solutions based on the standard Java API classes in terms of
their elegance and efficiency.
The brute-force solution to the problem is to simply read all the data
from the input source, then push them through the transformer (that is,
the compression stream, the encoding stream, or the XML serializer) into a
memory buffer. I can then open a stream to read from this memory buffer,
and I will have solved my problem.
First, I need a general-purpose I/O method. The method in Listing 1
copies all data from an InputStream to an
OutputStream using a small buffer. When the end of the input
is reached (the read() function returns less than zero), the
method returns without closing either stream.
public static void io (InputStream in, OutputStream out)
throws IOException {
byte[] buffer = new byte[8192];
int amount;
while ((amount = in.read (buffer)) >= 0)
out.write (buffer, 0, amount);
}
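For example (an illustrative fragment of my own; the file names are
made up), this method copies one file to another:
InputStream in = new FileInputStream ("source.dat");
OutputStream out = new FileOutputStream ("copy.dat");
io (in, out);
in.close ();
out.close ();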
Listing 2 shows the brute-force solution that lets me read the
compressed form of an input stream. I open a GZIPOutputStream that
writes into a memory buffer (I use a ByteArrayOutputStream). Next, I
copy the input stream into the compression stream, which fills the
memory buffer with compressed data. I then return a
ByteArrayInputStream that lets me read the compressed data back from
this buffer, as shown in Figure 2:
Figure 2. The brute-force solution
public static InputStream bruteForceCompress (InputStream in)
throws IOException {
ByteArrayOutputStream sink = new ByteArrayOutputStream ();
OutputStream out = new GZIPOutputStream (sink);
io (in, out);
out.close ();
byte[] buffer = sink.toByteArray ();
return new ByteArrayInputStream (buffer);
}
An obvious flaw with this solution is that it stores the entire
compressed document in memory. If the document is large, then this
approach will needlessly waste system resources. One of the great features
of using streams is that they allow you to operate on data larger than the
memory of the system you are using: you can process data as you read them,
or generate data as you write them, without ever holding all the data in
memory.
In terms of efficiency, let's look more closely at copying data between
buffers.
The data are read, by the io() method, from the input
source into one buffer. Then they are written from that buffer into a
buffer within the ByteArrayOutputStream (through the
compression that I am ignoring). However, the
ByteArrayOutputStream class operates with an expanding
internal buffer; whenever the buffer becomes full, a new buffer, twice the
size, is allocated and the existing data are copied into it. On average,
every byte is copied twice by this process. (The math is simple: the
average datum is copied twice when entering a
ByteArrayOutputStream ; all the data are copied at least once;
half are copied at least twice; a quarter at least three times, and so
on). The data are then copied from that buffer into a new one for the
ByteArrayInputStream . The data are now available to be read
by the application. In total, the data will be written through four
buffers by this solution. This is a useful baseline for estimating the
efficiency of other techniques.
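To make that arithmetic concrete, here is a small sketch of my own
(assuming the doubling growth policy just described; a
ByteArrayOutputStream actually starts with a 32-byte buffer) that
estimates the average copy count:
long n = 15 * 1024 * 1024; // 15MB of data, as in the benchmark below
long capacity = 32;        // initial buffer size
long copies = n;           // every byte is copied in at least once
while (capacity < n) {
  copies += capacity;      // growing the buffer recopies its contents
  capacity *= 2;
}
System.out.println ((double) copies / n); // prints roughly 2.0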
The piped-streams solution
The piped streams, PipedOutputStream and PipedInputStream, provide a
streams-based connection between the threads of a Java virtual machine.
Data written by one thread into a PipedOutputStream can concurrently be
read by another thread from the associated PipedInputStream.
As such, these classes present a solution to my problem. Listing 3
shows the code that employs one thread to copy data from the input stream
through a GZIPOutputStream and into a
PipedOutputStream . The associated
PipedInputStream will then provide read access to the
compressed data from another thread, as illustrated in Figure 3:
Figure 3. The piped-streams solution
private static InputStream pipedCompress (final InputStream in)
throws IOException {
PipedInputStream source = new PipedInputStream ();
final OutputStream out =
new GZIPOutputStream (new PipedOutputStream (source));
new Thread () {
public void run () {
try {
Streams.io (in, out);
out.close ();
} catch (IOException ex) {
ex.printStackTrace ();
}
}
}.start ();
return source;
}
In theory, this could be a good technique: by employing threads (one
will perform the compression, the other will process the resulting data),
an application can benefit from hardware SMP (symmetric multiprocessing)
or SMT (simultaneous multithreading). Additionally, this solution involves
only two buffer writes: the I/O loop reads data from the input stream into
a buffer before writing through the compressed stream into the
PipedOutputStream . The output stream then stores data in an
internal buffer, which is directly shared with the
PipedInputStream for reading by the application. Furthermore,
because data are streamed through a fixed buffer, they never need to be
read entirely into memory. Instead, only a small working set will be
buffered at any given time.
In practice, however, performance is terrible. The piped streams need
to make use of synchronization, which will be heavily contended between
the two threads. Their internal buffer is too small to process large
amounts of data effectively or to hide the cost of that lock contention.
Additionally, the constant sharing of the buffer will defeat many simple
caching strategies when the workload is spread across the processors of
an SMP machine. Finally, using threads makes exception handling very
difficult: there is no way to push any IOException that may occur down
the pipe for processing by the reader. Overall, this solution is much
too heavyweight to be effective.
Synchronization issues
As has become accepted practice in such libraries as the NIO framework
and the Collections API, synchronization is left as a burden on the
application. If an application expects to make concurrent access to an
object, the application must take the necessary steps to synchronize
that access. None of the code presented in this article is synchronized;
that is, it is not safe for two threads to concurrently access a shared
instance of one of these classes.
Although recent JVMs have greatly improved the performance of
their thread safety mechanisms, synchronization remains an expensive
operation. In the case of I/O, concurrent access to a single stream
is, almost invariably, an error; the order of the resulting data
streams will be nondeterministic, which is very rarely the desired
scenario. As such, to synchronize these classes would be to impose
an unnecessary expense with no tangible benefit.
We'll cover multithreaded considerations in more detail in part 2
of this series; for now, simply note that concurrent access to the
streams I present will result in nondeterministic
errors.
The engineering solution
Now we'll look at an alternative engineering solution to the problem.
This solution provides a framework that is specifically engineered to
solve this class of problem, a framework that provides InputStream
access to data that are produced from a source that incrementally
writes data to an OutputStream. The
fact that data are written incrementally is important. If the source
writes all data to the OutputStream in a single atomic
operation and if threads are not used, we're basically back at the
brute-force technique. If, however, the source can be called on to
incrementally write its data, we've achieved a good balance between the
brute-force and the piped-streams solutions. This solution provides the
brute-force benefit of avoiding threads, while providing the piped benefit
of only holding a small amount of data in memory at any time.
Figure 4 illustrates the complete solution. We'll be examining the source
code for this solution for the remainder of the article.
Figure 4. The engineering solution
An output engine
Listing 4 provides an interface, OutputEngine, that describes my data
sources. As I said, these sources incrementally write data to an output
stream:
package org.merlin.io;
import java.io.*;
/**
* An incremental data source that writes data to an OutputStream.
*
* @author Copyright (c) 2002 Merlin Hughes <merlin@merlin.org>
*
* This program is free software; you can redistribute
* it and/or modify it under the terms of the GNU
* General Public License as published by the Free
* Software Foundation; either version 2
* of the License, or (at your option) any later version.
*/
public interface OutputEngine {
public void initialize (OutputStream out) throws IOException;
public void execute () throws IOException;
public void finish () throws IOException;
}
The initialize() method presents the engine with the
stream to which it should write data. The execute() method
will then be repeatedly called to write data to this stream. When there
are no more data, the engine should close the stream. Finally,
finish() will be called when the engine should shut down.
This may occur before or after the engine has closed its output
stream.
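To illustrate the contract, the following trivial engine (a sketch of
my own; it is not part of the accompanying source) writes a fixed
message one byte at a time:
import java.io.*;
// assumes the same package as, or an import of, org.merlin.io
public class MessageEngine implements OutputEngine {
  private byte[] message = "hello, world".getBytes ();
  private int index = 0;
  private OutputStream out;

  public void initialize (OutputStream out) throws IOException {
    this.out = out;
  }
  public void execute () throws IOException {
    if (index >= message.length) {
      out.close (); // no more data; close the stream
    } else {
      out.write (message, index++, 1); // write incrementally
    }
  }
  public void finish () throws IOException {
    // no resources to release
  }
}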
An I/O stream engine
An output engine that addresses the problem that started me off on this
effort is one that copies data from an input stream through an output
stream filter into the target output stream. This satisfies the property
of incrementality because it can read and write a single buffer at a
time.
The code in Listings 5 through 10 implements such an engine. It is
constructed from an input stream and an output stream factory. Listing
11 is a factory that generates filtered output streams; for instance, it
could return a GZIPOutputStream wrapped around the target output stream.
Listing 5. The I/O stream engine
package org.merlin.io;
import java.io.*;
/**
* An output engine that copies data from an InputStream through
* a FilterOutputStream to the target OutputStream.
*
* @author Copyright (c) 2002 Merlin Hughes <merlin@merlin.org>
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*/
public class IOStreamEngine implements OutputEngine {
private static final int DEFAULT_BUFFER_SIZE = 8192;
private InputStream in;
private OutputStreamFactory factory;
private byte[] buffer;
private OutputStream out;
The constructors for this class just initialise various variables and
the buffer that will be used for transferring data.
public IOStreamEngine (InputStream in, OutputStreamFactory factory) {
this (in, factory, DEFAULT_BUFFER_SIZE);
}
public IOStreamEngine
(InputStream in, OutputStreamFactory factory, int bufferSize) {
this.in = in;
this.factory = factory;
buffer = new byte[bufferSize];
}
In the initialize() method, this engine calls on its factory to wrap
the OutputStream that it has been supplied with. The factory will
typically attach a filter to the OutputStream.
Listing 7. The initialize() method
public void initialize (OutputStream out) throws IOException {
if (this.out != null) {
throw new IOException ("Already initialised");
} else {
this.out = factory.getOutputStream (out);
}
}
In the execute() method, the engine reads a buffer of data
from the InputStream and writes them to the wrapped
OutputStream ; or, if the input is exhausted, it closes the
OutputStream .
public void execute () throws IOException {
if (out == null) {
throw new IOException ("Not yet initialised");
} else {
int amount = in.read (buffer);
if (amount < 0) {
out.close ();
} else {
out.write (buffer, 0, amount);
}
}
}
Finally, when it is shut down, the engine closes its
InputStream .
public void finish () throws IOException {
in.close ();
}
The inner OutputStreamFactory interface, shown below in
Listing 10, describes a class that can return a filtered
OutputStream .
public static interface OutputStreamFactory {
public OutputStream getOutputStream (OutputStream out)
throws IOException;
}
}
Listing 11 shows an example factory that wraps the supplied stream in a
GZIPOutputStream :
public class GZIPOutputStreamFactory
implements IOStreamEngine.OutputStreamFactory {
public OutputStream getOutputStream (OutputStream out)
throws IOException {
return new GZIPOutputStream (out);
}
}
This I/O stream engine with its output stream factory framework is
general enough to support most output stream filtering needs.
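Other filters drop in just as easily; for instance, a raw-deflate
factory (my own variant, analogous to the factory of Listing 11) would
be:
import java.io.*;
import java.util.zip.*;
// assumes the same package as, or an import of, org.merlin.io
public class DeflaterOutputStreamFactory
    implements IOStreamEngine.OutputStreamFactory {
  public OutputStream getOutputStream (OutputStream out) {
    return new DeflaterOutputStream (out); // raw deflate, no gzip header
  }
}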
An output engine input stream
Finally, we need one more piece of code to complete
this solution. The code in Listings 12 through 16 presents an input stream
that reads the data that are written by an output engine. There are, in
fact, two parts to this piece of code: the main class is an input stream
that reads data from an internal buffer. Tightly coupled with this is an
output stream, shown in Listing 17, that fills the internal reading buffer
with data written by the output engine.
The main input stream class will initialise the output engine with its
internal output stream. It can then automatically execute the engine to
receive more data whenever its buffer is empty. The output engine will
write data to its output stream and this will refill the input stream's
internal buffer, allowing the data to be efficiently read by the consuming
application.
package org.merlin.io;
import java.io.*;
/**
* An input stream that reads data from an OutputEngine.
*
* @author Copyright (c) 2002 Merlin Hughes <merlin@merlin.org>
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*/
public class OutputEngineInputStream extends InputStream {
private static final int DEFAULT_INITIAL_BUFFER_SIZE = 8192;
private OutputEngine engine;
private byte[] buffer;
private int index, limit, capacity;
private boolean closed, eof;
The constructors for this input stream take an output engine from which
to read data and an optional buffer size. The stream first initialises
itself, and then it initialises the output engine.
public OutputEngineInputStream (OutputEngine engine) throws IOException {
this (engine, DEFAULT_INITIAL_BUFFER_SIZE);
}
public OutputEngineInputStream (OutputEngine engine, int initialBufferSize)
throws IOException {
this.engine = engine;
capacity = initialBufferSize;
buffer = new byte[capacity];
engine.initialize (new OutputStreamImpl ());
}
The main reading part of the code is a relatively straightforward byte
array-based input stream, much the same as the
ByteArrayInputStream class. However, whenever data are
requested and this stream is empty, it invokes the output engine's
execute() method to refill the read buffer. These new data
can then be returned to the caller. Thus, this class will iteratively read
through data written by the output engine until it completes, whereupon
the eof flag will get set and this stream will return that
the end of file has been reached.
private byte[] one = new byte[1];
public int read () throws IOException {
int amount = read (one, 0, 1);
return (amount < 0) ? -1 : one[0] & 0xff;
}
public int read (byte data[], int offset, int length)
throws IOException {
if (data == null) {
throw new NullPointerException ();
} else if
((offset < 0) || (length < 0) || (offset + length > data.length)) {
throw new IndexOutOfBoundsException ();
} else if (closed) {
throw new IOException ("Stream closed");
} else {
while (index >= limit) {
if (eof)
return -1;
engine.execute ();
}
if (limit - index < length)
length = limit - index;
System.arraycopy (buffer, index, data, offset, length);
index += length;
return length;
}
}
public long skip (long amount) throws IOException {
if (closed) {
throw new IOException ("Stream closed");
} else if (amount <= 0) {
return 0;
} else {
while (index >= limit) {
if (eof)
return 0;
engine.execute ();
}
if (limit - index < amount)
amount = limit - index;
index += (int) amount;
return amount;
}
}
public int available () throws IOException {
if (closed) {
throw new IOException ("Stream closed");
} else {
return limit - index;
}
}
When the consuming application closes this stream, it invokes the
output engine's finish() method so that it can release any
resources that it is using.
public void close () throws IOException {
if (!closed) {
closed = true;
engine.finish ();
}
}
The writeImpl() method is invoked when the output engine
writes data to its output stream. It copies these data into the read
buffer and updates the read limit index; this will make the new data
automatically available to the reading methods.
If, in a single iteration, the output engine writes more data than can
be held in the buffer, then the buffer capacity is doubled. This should
not, however, happen very frequently; the buffer should rapidly expand to
a sufficient size for steady-state operation.
private void writeImpl (byte[] data, int offset, int length) {
if (index >= limit)
index = limit = 0;
if (limit + length > capacity) {
capacity = capacity * 2 + length;
byte[] tmp = new byte[capacity];
System.arraycopy (buffer, index, tmp, 0, limit - index);
buffer = tmp;
limit -= index;
index = 0;
}
System.arraycopy (data, offset, buffer, limit, length);
limit += length;
}
The inner output stream implementation shown below in Listing 17
represents a stream that writes data into the internal input stream
buffer. The code verifies that the parameters are acceptable, and, if so,
it invokes the writeImpl() method.
private class OutputStreamImpl extends OutputStream {
public void write (int datum) throws IOException {
one[0] = (byte) datum;
write (one, 0, 1);
}
public void write (byte[] data, int offset, int length)
throws IOException {
if (data == null) {
throw new NullPointerException ();
} else if
((offset < 0) || (length < 0) || (offset + length > data.length)) {
throw new IndexOutOfBoundsException ();
} else if (eof) {
throw new IOException ("Stream closed");
} else {
writeImpl (data, offset, length);
}
}
Finally, when the output engine closes its output stream, indicating
that it has no more data to write, this output stream sets the input
stream's eof flag, indicating that there are no more data to
read.
public void close () {
eof = true;
}
}
}
The keen reader may note that I could have placed the body of the
writeImpl() method directly in the output stream
implementation: inner classes have access to all the private members of
the enclosing class. However, inner-class access to such fields is a
fraction less efficient than access by a direct method of the enclosing
class. So, for efficiency, and to minimize inter-class dependencies, I use
an additional helper method.
Applying the engineering solution: Compressing data during a read
Listing 19 demonstrates the use of this framework of classes to solve
my initial problem: compressing
data as I read them. The solution boils down to creating an
IOStreamEngine associated with the input stream and a
GZIPOutputStreamFactory , and then attaching an
OutputEngineInputStream to this. Initialisation and
connection of the streams is performed automatically, and compressed data
can then be directly read from the resulting stream. When processing is
complete and the stream is closed, the output engine is automatically shut
down and it closes the original input stream.
private static InputStream engineCompress (InputStream in)
throws IOException {
return new OutputEngineInputStream
(new IOStreamEngine (in, new GZIPOutputStreamFactory ()));
}
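For example, compressing a file to disk with this method might look
like the following fragment (my own illustration; the file names are
made up):
InputStream compressed = engineCompress (new FileInputStream ("data.txt"));
OutputStream sink = new FileOutputStream ("data.txt.gz");
Streams.io (compressed, sink); // the copy loop from Listing 1
compressed.close ();
sink.close ();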
Although it is not surprising that a solution engineered to tackle
this class of problem should result in much cleaner code, the lesson is
worth heeding in general: applying good design techniques, no matter how
small or large the problem, will almost invariably result in cleaner,
more maintainable code.
Testing performance
In terms of efficiency, the IOStreamEngine will read data into
its internal buffer and then write them, through the compression filter,
to the OutputStreamImpl . This writes the data directly into
the OutputEngineInputStream where they are made available for
reading. All told, only two buffer copies are performed, which means that
I should benefit from a combination of the buffer-copying efficiency of
the piped-streams solution and the threadless efficiency of the
brute-force solution.
To test the performance in practice, I wrote a simple test harness (see
test.PerformanceTest in the accompanying source) that uses the three
proposed solutions to read through a block of dummy data using a null
filter. On an 800MHz Linux box running the Java 2 SDK, version 1.4.0,
the following performance was achieved:
- Piped-streams solution: 15KB: 23ms; 15MB: 22,100ms
- Brute-force solution: 15KB: 0.35ms; 15MB: 745ms
- Engineered solution: 15KB: 0.16ms; 15MB: 73ms
The engineered solution to this problem is clearly more efficient than
either of the alternatives based on the standard Java API.
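As a rough sketch of what such a measurement looks like (my own
approximation; the actual test.PerformanceTest may differ), note that a
null filter is simply a factory that returns the target stream
unwrapped:
IOStreamEngine.OutputStreamFactory nullFactory =
  new IOStreamEngine.OutputStreamFactory () {
    public OutputStream getOutputStream (OutputStream out) {
      return out; // pass-through: measures framework overhead only
    }
  };
byte[] data = new byte[15 * 1024]; // 15KB of dummy data
long start = System.currentTimeMillis ();
InputStream in = new OutputEngineInputStream
  (new IOStreamEngine (new ByteArrayInputStream (data), nullFactory));
Streams.io (in, new ByteArrayOutputStream ());
in.close ();
System.out.println ((System.currentTimeMillis () - start) + "ms");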
As an aside, consider that if an output engine could obey a contract
such that after writing data to its output stream it would return without
modifying the array from which it wrote the data, I could provide a
solution that used just a single buffer-copy operation. However, this
contract can rarely be honoured. If needed, an output engine could
advertise its support for this mode of operation by simply implementing an
appropriate marker interface.
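Such a marker interface could be as simple as the following sketch
(purely hypothetical; the name is mine and it appears nowhere in the
accompanying source):
/**
 * Marks an output engine that never modifies the array from which it
 * wrote after a write returns; a reading stream could then borrow that
 * array directly, saving one buffer copy.
 */
public interface StableOutputEngine extends OutputEngine {
}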
Applying the engineering solution: Reading encoded character data
Any problem that can be expressed in terms of providing read access to
an entity that iteratively
writes data to an OutputStream can be solved with this
framework. In this section and the next we'll take a look at examples of
such problems, along with their efficient solutions.
First, consider the case of wanting to read the UTF-8 encoded form of a
character stream: the InputStreamReader class lets you read
binary-encoded character data as a sequence of Unicode characters; it
represents a gateway from a byte input stream to a character input stream.
The OutputStreamWriter class lets you write a sequence of
Unicode characters in binary-encoded form to an output stream; it
represents a gateway from a character output stream to a byte output
stream. The getBytes() method of the String
class converts a string to an encoded byte array. However, none of these
classes directly let you read the UTF-8 encoded form of a character
stream.
The code in Listings 20 through 24 demonstrates a solution that uses
the OutputEngine framework in a very similar manner to the
IOStreamEngine class. Instead of reading from an input stream
and writing through an output stream filter, we read from a character
stream and write through an OutputStreamWriter using a chosen
character encoding.
package org.merlin.io;
import java.io.*;
/**
* An output engine that copies data from a Reader through
* an OutputStreamWriter to the target OutputStream.
*
* @author Copyright (c) 2002 Merlin Hughes <merlin@merlin.org>
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*/
public class ReaderWriterEngine implements OutputEngine {
private static final int DEFAULT_BUFFER_SIZE = 8192;
private Reader reader;
private String encoding;
private char[] buffer;
private Writer writer;
The constructors for this class accept the character stream to read
from, the encoding to use, and an optional buffer size.
public ReaderWriterEngine (Reader in, String encoding) {
this (in, encoding, DEFAULT_BUFFER_SIZE);
}
public ReaderWriterEngine
(Reader reader, String encoding, int bufferSize) {
this.reader = reader;
this.encoding = encoding;
buffer = new char[bufferSize];
}
When this engine is initialised, it attaches an
OutputStreamWriter that writes characters, in the chosen
encoding, to the supplied output stream.
public void initialize (OutputStream out) throws IOException {
if (writer != null) {
throw new IOException ("Already initialised");
} else {
writer = new OutputStreamWriter (out, encoding);
}
}
When this engine is executed, it reads data from the input character
stream, and writes them to the OutputStreamWriter , which
passes them on to the attached output stream in the chosen encoding. From
there, the framework makes them available for reading.
public void execute () throws IOException {
if (writer == null) {
throw new IOException ("Not yet initialised");
} else {
int amount = reader.read (buffer);
if (amount < 0) {
writer.close ();
} else {
writer.write (buffer, 0, amount);
}
}
}
When the engine is finished, it closes down its input.
public void finish () throws IOException {
reader.close ();
}
}
In this case, unlike the compression case, the Java I/O packages
provide no low-level access to the character encoding classes that lie
beneath OutputStreamWriter . As a result, this is the only
effective solution to reading the encoded form of a character stream on a
pre-1.4 release of the Java platform. As of version 1.4, the
java.nio.charset package does provide stream-independent
character encoding and decoding capabilities. However, this package does
not meet our requirement for an input stream-based solution.
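Using the framework, reading the UTF-8 encoded form of a character
stream then reduces to a couple of lines (my own fragment):
Reader reader = new StringReader ("se\u00f1or, stra\u00dfe");
InputStream utf8 = new OutputEngineInputStream
  (new ReaderWriterEngine (reader, "UTF-8"));
// utf8 now yields the UTF-8 bytes of the characters, encoded
// incrementally as the stream is read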
Applying the engineering solution: Reading serialized DOM documents
Finally, let's look at one last use of this framework. The code in
Listings 25 through 29 presents a
solution for reading the serialized form of a DOM document or document
subset. A potential use of this code might be to perform a validating
reparse on part of a DOM document.
package org.merlin.io;
import java.io.*;
import java.util.*;
import org.w3c.dom.*;
import org.w3c.dom.traversal.*;
/**
* An output engine that serializes a DOM tree using a specified
* character encoding to the target OutputStream.
*
* @author Copyright (c) 2002 Merlin Hughes <merlin@merlin.org>
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*/
public class DOMSerializerEngine implements OutputEngine {
private NodeIterator iterator;
private String encoding;
private OutputStreamWriter writer;
The constructors take a DOM node over which to iterate, or a
preconstructed node iterator (this is part of DOM 2), and an encoding to
use for the serialized form.
public DOMSerializerEngine (Node root) {
this (root, "UTF-8");
}
public DOMSerializerEngine (Node root, String encoding) {
this (getIterator (root), encoding);
}
private static NodeIterator getIterator (Node node) {
DocumentTraversal dt = (DocumentTraversal)
((node.getNodeType () == Node.DOCUMENT_NODE)
? node : node.getOwnerDocument ());
return dt.createNodeIterator (node, NodeFilter.SHOW_ALL, null, false);
}
public DOMSerializerEngine (NodeIterator iterator, String encoding) {
this.iterator = iterator;
this.encoding = encoding;
}
During initialisation, the engine attaches an appropriate
OutputStreamWriter to the target output stream.
public void initialize (OutputStream out) throws IOException {
if (writer != null) {
throw new IOException ("Already initialised");
} else {
writer = new OutputStreamWriter (out, encoding);
}
}
During the execution phase, the engine gets the next node from the node
iterator and serializes it to the OutputStreamWriter . When
there are no more nodes, the engine closes its stream.
public void execute () throws IOException {
if (writer == null) {
throw new IOException ("Not yet initialised");
} else {
Node node = iterator.nextNode ();
closeElements (node);
if (node == null) {
writer.close ();
} else {
writeNode (node);
writer.flush ();
}
}
}
There are no resources to free when this engine shuts down.
public void finish () throws IOException {
}
// private void closeElements (Node node) throws IOException ...
// private void writeNode (Node node) throws IOException ...
}
The remaining internals of serializing each node are fairly
uninteresting; the process basically involves writing out the node
according to its type and the XML 1.0 specification, so I will omit that
part of the code from this article. See the accompanying source
for full details.
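Usage follows the now-familiar pattern; for instance (my own fragment,
where element is assumed to be a node of an existing document):
InputStream xml = new OutputEngineInputStream
  (new DOMSerializerEngine (element));
// xml yields the serialized subtree rooted at element, produced one
// node at a time as the stream is read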
Conclusion
What I've presented is a useful framework that lets you efficiently
read, using the standard input stream API, data produced by a system
that can only write to an output stream. This lets us read compressed
or encoded data, serialized documents, and so on. Although this was
possible with the standard Java API, it was not at all efficient using
those classes. That this solution is more efficient than the simplest
brute-force solution, even for small data sizes, is worth noting. Any
application that writes data into a ByteArrayOutputStream for
subsequent processing may benefit from this framework.
The poor performance of the byte-array streams and the incredibly poor
performance of the piped streams are, in fact, the topic of my next
article. In it, I will look at reimplementing those classes with a greater
focus on performance than the original authors of the classes had.
Performance improvements of one hundred times are possible with only a
slightly relaxed API contract.
I hate washing the dishes. However, the ideas behind these classes, as
with most of what I consider my better (although still often trivial)
ideas, came to me while I was washing the dishes. More often than not,
I've found that taking a step back and considering a broader view of a
problem, away from the actual code, will reveal a better solution that
may, in the end, serve you much better than if you took the easy way out.
These solutions often result in cleaner, more efficient, and more
maintainable code.
I honestly fear the day that we get a dishwasher.
About the author
Merlin is a cryptographer and chief technical evangelist with the Irish
e-security company Baltimore Technologies, occasional author, and
part-time janitor and dishwasher; not to be confused with JDK 1.4.
Based in New York, New York (a city so nice, they named it twice), he
can be reached at merlin@merlin.org.