Conversions and encoding schemes
John Zukowski (jaz@zukowski.net), President, JZ Ventures, Inc.
1 October 2002
Three classes in the java.nio.charset package help convert between character sets when moving legacy applications to the Java platform. John Zukowski walks you through these classes and provides an example that demonstrates their features.
By the numbers

At the risk of stating the obvious, computers only understand numbers. What's perhaps less obvious, however, is that because they only understand numbers, computers need some mapping of numerical values to corresponding characters in order to display text. It is these mappings (or character sets) that permit a computer to understand text. For instance, early desktop computers used ASCII for just such a mapping. When a computer that uses ASCII stores the numbers 72, 101, 108, and 112, it knows to display the word "Help," because in ASCII the number 72 is the value of H, 101 is e, 108 is l, and 112 is p. If, however, that computer were an early IBM mainframe using EBCDIC rather than ASCII, the word "Help" would be represented by the numbers 200, 133, 147, and 151.
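You can see the ASCII mapping at work directly in Java. A minimal sketch (the class name is illustrative) that interprets those same four byte values under the US-ASCII character set:

```java
import java.io.UnsupportedEncodingException;

public class AsciiDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // The ASCII code points from the article: H, e, l, p
        byte[] codes = {72, 101, 108, 112};
        // Interpreting the same numbers under the US-ASCII mapping yields text
        String text = new String(codes, "US-ASCII");
        System.out.println(text); // prints "Help"
    }
}
```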
Character set basics

Moving to the Java language, three classes in the java.nio.charset package help with these mappings: Charset, CharsetEncoder, and CharsetDecoder. These classes work together so that you can take text in one mapping and convert it to another. To go from an external mapping to the Java platform's internal mapping (Unicode), you use a decoder. To go from Unicode to a different mapping (or back to the original), you use an encoder. You can't convert directly between two non-Unicode formats with the java.nio.charset package, but by passing through Unicode as an intermediate form you can convert between any two supported character sets.
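As a sketch of that two-step round trip (the sample bytes and character set choices here are illustrative, not from the article), the bytes for "café" in ISO-8859-1 are first decoded into Unicode characters, then encoded into UTF-8:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

public class RoundTrip {
    public static void main(String[] args) throws CharacterCodingException {
        // "café" encoded in ISO-8859-1; 0xE9 is the single-byte é
        byte[] latin1 = {99, 97, 102, (byte) 0xE9};
        // Step 1: decode the external bytes into Unicode characters
        CharBuffer chars = Charset.forName("ISO-8859-1").newDecoder()
                .decode(ByteBuffer.wrap(latin1));
        // Step 2: encode the Unicode characters into a different external format
        ByteBuffer utf8 = Charset.forName("UTF-8").newEncoder().encode(chars);
        // é takes two bytes in UTF-8, so four characters become five bytes
        System.out.println(utf8.remaining()); // prints 5
    }
}
```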
Before you can get a decoder or encoder, you need to get the
Charset for the specific mapping. For instance, US-ASCII is
the name of the mapping for the 7-bit ASCII character set. All you need to
do is pass that name into the forName() method of
Charset as shown here:
Charset charset = Charset.forName("US-ASCII");
Once you have the Charset, just ask it for a CharsetDecoder and a CharsetEncoder:

CharsetDecoder decoder = charset.newDecoder();
CharsetEncoder encoder = charset.newEncoder();
After you have the decoder and encoder, you can then convert between
the different character sets, as shown below:
ByteBuffer bytes = ...;
CharBuffer chars = decoder.decode(bytes);
bytes = encoder.encode(chars);
Of course, if you aren't sure which character sets are available, you need to ask:

SortedMap map = Charset.availableCharsets();
You would then use a specific decoder to go from external bytes to
internal characters. Then, if you needed to send data out of the Java
code, you'd use the encoder to go from internal characters to external
bytes. As far as which specific character sets are available, your runtime
determines the complete set. Every implementation of the Java platform,
however, must support the following encodings:
- US-ASCII: 7-bit ASCII
- ISO-8859-1: ISO Latin Alphabet No. 1
- UTF-8: 8-bit UCS Transformation Format
- UTF-16BE: 16-bit UCS Transformation Format, big-endian byte order
- UTF-16LE: 16-bit UCS Transformation Format, little-endian byte order
- UTF-16: 16-bit UCS Transformation Format, byte order identified by a byte-order mark
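You can confirm this guarantee at runtime with Charset.isSupported(), which checks a name without throwing the exception that forName() raises for unknown sets. A quick sketch:

```java
import java.nio.charset.Charset;

public class RequiredCharsets {
    public static void main(String[] args) {
        // The six encodings every Java platform implementation must support
        String[] required = {"US-ASCII", "ISO-8859-1", "UTF-8",
                             "UTF-16BE", "UTF-16LE", "UTF-16"};
        for (String name : required) {
            // isSupported returns true on every conforming runtime
            System.out.println(name + ": " + Charset.isSupported(name));
        }
    }
}
```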
Individual platforms may also support additional character sets (for instance, on Windows you'll find a windows-1252 character set supported). If you need support for other sets, you can create your own; see the CharsetProvider API in the java.nio.charset.spi package.
Complete example

Listing 1 demonstrates the Java character set conversion features by converting the ASCII byte array values for the word "Help" (72, 101, 108, and 112). Unfortunately, there is no EBCDIC encoder by default, so we'll convert the values to a UTF-16LE byte array instead (which just adds a zero byte as the second byte of each ASCII character).
Listing 1. Converting character sets

import java.nio.*;
import java.nio.charset.*;

public class Convert {
    public static void main(String args[]) {
        System.out.println(Charset.availableCharsets());
        Charset asciiCharset = Charset.forName("US-ASCII");
        CharsetDecoder decoder = asciiCharset.newDecoder();
        byte help[] = {72, 101, 108, 112};
        ByteBuffer asciiBytes = ByteBuffer.wrap(help);
        CharBuffer helpChars = null;
        try {
            helpChars = decoder.decode(asciiBytes);
        } catch (CharacterCodingException e) {
            System.err.println("Error decoding");
            System.exit(-1);
        }
        System.out.println(helpChars);
        Charset utfCharset = Charset.forName("UTF-16LE");
        CharsetEncoder encoder = utfCharset.newEncoder();
        ByteBuffer utfBytes = null;
        try {
            utfBytes = encoder.encode(helpChars);
        } catch (CharacterCodingException e) {
            System.err.println("Error encoding");
            System.exit(-1);
        }
        // Read only up to the buffer's limit; the backing array returned by
        // array() can be larger than the number of bytes actually encoded
        for (int i = 0, n = utfBytes.limit(); i < n; i++) {
            System.out.println(i + ": " + utfBytes.get(i));
        }
    }
}
Note: Besides performing the encoding and decoding manually, you can also provide a Charset to the java.io.InputStreamReader and java.io.OutputStreamWriter constructors and let them do the conversion for you.
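As a sketch of that stream-based approach (the class name is illustrative, and the byte values reuse the "Help" example from above), the reader decodes bytes into characters as you read:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;
import java.nio.charset.Charset;

public class StreamConvert {
    public static void main(String[] args) throws IOException {
        byte[] help = {72, 101, 108, 112};
        // The reader applies the US-ASCII decoding on every read
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(help), Charset.forName("US-ASCII"));
        StringWriter sink = new StringWriter();
        int c;
        while ((c = reader.read()) != -1) {
            sink.write(c);
        }
        System.out.println(sink); // prints "Help"
    }
}
```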