JVM Low-level I/O - Part 2
Character Encoding & Charset on the JVM
Before we can send data between processes or over a network, we need to answer a fundamental question: how do we turn human-readable text into bytes, and bytes back into text? The answer is character encoding, and getting it wrong is the #1 cause of garbled data in IPC systems.
1. Why Encoding Matters
Consider this scenario: Process A writes the string "Héllo" to shared memory as UTF-8. Process B reads the bytes, decodes them as ISO-8859-1, and gets "HÃ©llo". What happened?
The problem: é is 1 byte in ISO-8859-1 (0xE9) but 2 bytes in UTF-8 (0xC3 0xA9). When Process B interprets UTF-8 bytes as ISO-8859-1, each byte maps to a different character.
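The mismatch is easy to reproduce in a few lines: encode with UTF-8, then (incorrectly) decode the same bytes as ISO-8859-1:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // Process A: encode "Héllo" as UTF-8 — é becomes two bytes, 0xC3 0xA9
        byte[] bytes = "Héllo".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length); // 6, not 5

        // Process B: decode the same bytes as ISO-8859-1 —
        // 0xC3 maps to 'Ã' and 0xA9 maps to '©'
        String wrong = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(wrong); // "HÃ©llo"
    }
}
```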
2. A Brief History: ASCII → Unicode → UTF-8
ASCII: The Starting Point
ASCII maps 128 characters to numbers 0–127. It uses 7 bits per character.
Character: A B C ... Z 0 1 ... 9
Decimal: 65 66 67 ... 90 48 49 ... 57
Hex: 41 42 43 ... 5A 30 31 ... 39
Binary: 1000001 1000010 ...
Problem: Only covers English letters, digits, and basic punctuation. No accented characters, no CJK, no emoji.
Unicode: The Universal Map
Unicode doesn't define how to store characters — it defines a number (called a code point) for every character in every language:
| Character | Code Point | Description |
|---|---|---|
| A | U+0041 | Latin Capital A |
| é | U+00E9 | Latin Small E with Acute |
| 中 | U+4E2D | CJK Character "middle" |
| 🚀 | U+1F680 | Rocket Emoji |
As of Unicode 15.0, there are over 149,000 characters defined.
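These code points can be queried directly from Java (`codePointAt` handles surrogate pairs, so the rocket reports its full code point):

```java
public class CodePoints {
    public static void main(String[] args) {
        System.out.printf("U+%04X%n", "A".codePointAt(0));   // U+0041
        System.out.printf("U+%04X%n", "é".codePointAt(0));   // U+00E9
        System.out.printf("U+%04X%n", "中".codePointAt(0));  // U+4E2D
        System.out.printf("U+%04X%n", "🚀".codePointAt(0));  // U+1F680
    }
}
```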
UTF-8: The Encoding
UTF-8 is a variable-width encoding that maps Unicode code points to bytes:
Example: Encoding é (U+00E9) in UTF-8:
Code point: U+00E9 = 0000 0000 1110 1001 (binary)
Range: U+0080 to U+07FF → 2-byte pattern
Pattern: 110xxxxx 10xxxxxx
Fill in: 110 00011 10 101001
Result: 0xC3 0xA9
So é = [C3, A9] in UTF-8
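The bit arithmetic can be checked in code by unpacking the two bytes back into the code point:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bits {
    public static void main(String[] args) {
        byte[] b = "é".getBytes(StandardCharsets.UTF_8);
        System.out.printf("%02X %02X%n", b[0] & 0xFF, b[1] & 0xFF); // C3 A9

        // Reverse the 110xxxxx 10xxxxxx pattern:
        // take 5 payload bits from the lead byte, 6 from the continuation byte
        int codePoint = ((b[0] & 0x1F) << 6) | (b[1] & 0x3F);
        System.out.printf("U+%04X%n", codePoint); // U+00E9
    }
}
```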
3. How Characters Become Bytes
The encoding/decoding process is a two-way translation: encoding turns characters into bytes, and decoding turns those bytes back into characters.
In Java, String objects are stored internally as sequences of UTF-16 code units (char values). When you need to write them to a file, network, or shared memory, you must choose an encoding.
4. Common Encodings Compared
| Encoding | Bytes per Char | ASCII Compatible | Use Case |
|---|---|---|---|
| US-ASCII | 1 (fixed) | ✅ | Legacy systems, protocols |
| ISO-8859-1 | 1 (fixed) | ✅ | Legacy Western European |
| UTF-8 | 1–4 (variable) | ✅ | ⭐ Web, files, IPC (recommended) |
| UTF-16 | 2 or 4 (variable) | ❌ | Java internal, Windows APIs |
| UTF-16BE | 2 or 4 (variable) | ❌ | Big-endian UTF-16 |
| UTF-16LE | 2 or 4 (variable) | ❌ | Little-endian UTF-16, .NET |
| UTF-32 | 4 (fixed) | ❌ | Internal processing (rarely used) |
Storage Comparison for "Hello 世界 🌍"
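A quick way to produce the comparison is to encode the sample string with each charset and measure the byte lengths (UTF-32 is fetched by name, since StandardCharsets has no constant for it):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StorageComparison {
    public static void main(String[] args) {
        String s = "Hello 世界 🌍"; // 10 code points
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 17
        System.out.println(s.getBytes(StandardCharsets.UTF_16).length);   // 24 (2-byte BOM + 22)
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 22
        System.out.println(s.getBytes(Charset.forName("UTF-32")).length); // 40 (4 bytes per code point)
    }
}
```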
For IPC: always use UTF-8 unless you have a specific reason not to. It's compact for ASCII-dominated data and universally supported.
5. Java's Charset API
Java provides the java.nio.charset.Charset class as the foundation for encoding and decoding:
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
// Pre-defined charsets (always available, no exceptions)
Charset utf8 = StandardCharsets.UTF_8;
Charset utf16 = StandardCharsets.UTF_16;
Charset ascii = StandardCharsets.US_ASCII;
Charset latin1 = StandardCharsets.ISO_8859_1;
// List all available charsets
System.out.println("Available charsets: " + Charset.availableCharsets().size());
// Typically 100+ depending on JVM
// Get charset by name
Charset windows1252 = Charset.forName("Windows-1252");
// Default charset (platform-dependent — DANGEROUS!)
Charset defaultCs = Charset.defaultCharset();
System.out.println("Default: " + defaultCs); // e.g., UTF-8 on most modern systems
Simple Encoding/Decoding
String text = "Héllo, 世界!";
// String → bytes
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
System.out.println("UTF-8 length: " + utf8Bytes.length); // 15
byte[] utf16Bytes = text.getBytes(StandardCharsets.UTF_16);
System.out.println("UTF-16 length: " + utf16Bytes.length); // 22 (includes BOM)
// bytes → String
String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
System.out.println(decoded); // "Héllo, 世界!"
⚠️ Never use `new String(bytes)` or `"text".getBytes()` without specifying a charset! These use the platform default encoding, which varies between systems.
6. CharsetEncoder and CharsetDecoder
For fine-grained control — especially when working with ByteBuffers — use CharsetEncoder and CharsetDecoder directly.
Encoding with CharsetEncoder
import java.nio.*;
import java.nio.charset.*;
CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
// Configure error handling
encoder.onMalformedInput(CodingErrorAction.REPLACE);
encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
encoder.replaceWith(new byte[]{'?'});
// Encode
CharBuffer input = CharBuffer.wrap("Héllo, 世界!");
ByteBuffer output = ByteBuffer.allocate(64);
CoderResult result = encoder.encode(input, output, true);
encoder.flush(output);
if (result.isUnderflow()) {
System.out.println("Encoding succeeded!");
}
output.flip();
System.out.println("Encoded " + output.remaining() + " bytes");
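When output space is limited (e.g. a fixed shared-memory segment), the encoder is driven in a loop, draining on OVERFLOW until it reports UNDERFLOW. A sketch with a deliberately tiny buffer:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class EncodeLoop {
    public static byte[] encodeAll(String text) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(text);
        ByteBuffer out = ByteBuffer.allocate(8); // deliberately tiny
        ByteArrayOutputStream all = new ByteArrayOutputStream();

        while (true) {
            CoderResult r = encoder.encode(in, out, true);
            // Drain whatever was produced so far
            out.flip();
            while (out.hasRemaining()) all.write(out.get());
            out.clear();
            if (r.isUnderflow()) break;      // all input consumed
            if (!r.isOverflow()) throw new RuntimeException(r.toString());
        }
        // flush() may emit final bytes for some charsets
        ByteBuffer tail = ByteBuffer.allocate(8);
        encoder.flush(tail);
        tail.flip();
        while (tail.hasRemaining()) all.write(tail.get());
        return all.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encodeAll("Héllo, 世界!");
        System.out.println(bytes.length); // 15, same as getBytes(UTF_8)
    }
}
```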
Decoding with CharsetDecoder
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPLACE);
decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
decoder.replaceWith("?");
// Simulate receiving bytes from a network/file
byte[] rawBytes = "Héllo, 世界!".getBytes(StandardCharsets.UTF_8);
ByteBuffer input = ByteBuffer.wrap(rawBytes);
CharBuffer output = CharBuffer.allocate(64);
CoderResult result = decoder.decode(input, output, true);
decoder.flush(output);
output.flip();
System.out.println("Decoded: " + output.toString());
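The main reason to keep a CharsetDecoder around is streaming input: the decode/compact pattern lets a multi-byte character that is split across two reads survive intact — leftover bytes stay at the front of the input buffer until the rest arrives. A sketch (the chunk split is simulated):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class SplitDecode {
    /** Decodes UTF-8 bytes delivered in two chunks, split at an arbitrary index. */
    public static String decodeTwoChunks(byte[] all, int split) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        CharBuffer out = CharBuffer.allocate(all.length);
        ByteBuffer in = ByteBuffer.allocate(all.length);

        // First "read": bytes may end mid-character
        in.put(all, 0, split).flip();
        decoder.decode(in, out, false); // false: more input is coming
        in.compact();                   // keep any partial sequence at the front

        // Second "read": the rest arrives
        in.put(all, split, all.length - split).flip();
        decoder.decode(in, out, true);
        decoder.flush(out);

        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "世界".getBytes(StandardCharsets.UTF_8); // 6 bytes
        // Split after 4 bytes — right in the middle of 界
        System.out.println(decodeTwoChunks(utf8, 4)); // "世界" — no corruption
    }
}
```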
Error Handling Options

| Action | Behavior |
|---|---|
| `CodingErrorAction.IGNORE` | Silently drops malformed or unmappable input |
| `CodingErrorAction.REPLACE` | Substitutes the replacement value (by default `?` for encoders, `\uFFFD` for decoders) |
| `CodingErrorAction.REPORT` | Stops and reports the error via `CoderResult` (convenience methods throw `CharacterCodingException`) |

Recommendation: Use REPORT during development to catch issues early. Use REPLACE in production when you must be resilient.
7. Encoding with ByteBuffer
Here's the full pattern for encoding text into a ByteBuffer for IPC:
Pattern: Length-Prefixed Strings
This is the most common pattern for writing strings into binary protocols:
/**
* Writes a string into the ByteBuffer as:
* [4 bytes: length of UTF-8 bytes] [N bytes: UTF-8 encoded string]
*/
public static void writeString(ByteBuffer buffer, String value) {
byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
buffer.putInt(utf8.length); // 4-byte length prefix
buffer.put(utf8); // UTF-8 bytes
}
/**
* Reads a length-prefixed string from the ByteBuffer.
*/
public static String readString(ByteBuffer buffer) {
int length = buffer.getInt(); // read 4-byte length
byte[] utf8 = new byte[length];
buffer.get(utf8); // read that many bytes
return new String(utf8, StandardCharsets.UTF_8);
}
Usage:
ByteBuffer buf = ByteBuffer.allocate(256);
// Write multiple strings
writeString(buf, "Hello");
writeString(buf, "世界");
writeString(buf, "🚀 Launch!");
buf.flip();
// Read them back
System.out.println(readString(buf)); // "Hello"
System.out.println(readString(buf)); // "世界"
System.out.println(readString(buf)); // "🚀 Launch!"
Memory layout:
[00 00 00 05][48 65 6C 6C 6F] [00 00 00 06][E4 B8 96 E7 95 8C] [00 00 00 0C][F0 9F 9A 80 20 4C 61 75 6E 63 68 21]
    len=5        "Hello"          len=6         "世界"              len=12          "🚀 Launch!"
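In production IPC code, validate the length prefix before allocating — a corrupt or hostile prefix would otherwise trigger a huge allocation or a BufferUnderflowException. A guarded variant (the cap value is an arbitrary choice for this sketch):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SafeRead {
    // Arbitrary cap for this sketch — pick a limit that fits your protocol
    private static final int MAX_STRING_BYTES = 1 << 20; // 1 MiB

    public static String readStringChecked(ByteBuffer buffer) {
        int length = buffer.getInt();
        // Reject corrupt or hostile length prefixes before allocating
        if (length < 0 || length > MAX_STRING_BYTES || length > buffer.remaining()) {
            throw new IllegalStateException("Bad string length: " + length);
        }
        byte[] utf8 = new byte[length];
        buffer.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```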
Pattern: Null-Terminated Strings (C-style)
Used when interoperating with C/C++ code:
public static void writeCString(ByteBuffer buffer, String value) {
buffer.put(value.getBytes(StandardCharsets.UTF_8));
buffer.put((byte) 0); // null terminator
}
public static String readCString(ByteBuffer buffer) {
int start = buffer.position();
while (buffer.get() != 0) { } // scan for null (BufferUnderflowException if missing)
int length = buffer.position() - start - 1;
buffer.position(start);
byte[] bytes = new byte[length];
buffer.get(bytes);
buffer.get(); // skip null terminator
return new String(bytes, StandardCharsets.UTF_8);
}
8. The BOM (Byte Order Mark) Problem
The BOM is a special Unicode character (U+FEFF) placed at the start of a file to indicate byte order:
// UTF-16 defaults include BOM
byte[] utf16 = "Hello".getBytes(StandardCharsets.UTF_16);
System.out.println(utf16.length); // 12 (2 BOM + 10 data)
// UTF-16BE/LE do NOT include BOM
byte[] utf16be = "Hello".getBytes(StandardCharsets.UTF_16BE);
System.out.println(utf16be.length); // 10 (no BOM)
// For IPC: always use explicit byte order, never rely on BOM
IPC Rule: Use `UTF_8` (no BOM issue) or `UTF_16BE`/`UTF_16LE` (explicit byte order, no BOM). Never use `UTF_16` — it prepends a BOM.
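If you must consume data that may carry a BOM (e.g. files produced by other tools), detect and skip it explicitly. A sketch — the helper name is ours, not a JDK API:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class BomSniff {
    /**
     * If the buffer starts with a UTF-16 BOM, consume it and return the
     * byte order it indicates; otherwise leave the buffer untouched and
     * return null.
     */
    public static ByteOrder skipUtf16Bom(ByteBuffer buf) {
        if (buf.remaining() >= 2) {
            int b0 = buf.get(buf.position()) & 0xFF;     // peek, don't consume
            int b1 = buf.get(buf.position() + 1) & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                buf.position(buf.position() + 2);
                return ByteOrder.BIG_ENDIAN;
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                buf.position(buf.position() + 2);
                return ByteOrder.LITTLE_ENDIAN;
            }
        }
        return null; // no BOM
    }
}
```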
9. Encoding Pitfalls and How to Avoid Them
Pitfall 1: Platform Default Encoding
// ❌ Uses platform default — different on Windows vs Linux!
byte[] bytes = "Hello".getBytes();
String text = new String(bytes);
// ✅ Always specify charset explicitly
byte[] bytes = "Hello".getBytes(StandardCharsets.UTF_8);
String text = new String(bytes, StandardCharsets.UTF_8);
Pitfall 2: Truncating Multi-byte Characters
// ❌ Cutting UTF-8 in the middle of a multi-byte character
String text = "Hello 世界";
byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
// utf8 = [48,65,6C,6C,6F,20,E4,B8,96,E7,95,8C] — 12 bytes
// Truncate to 8 bytes — cuts 世 in the middle!
byte[] truncated = Arrays.copyOf(utf8, 8);
String broken = new String(truncated, StandardCharsets.UTF_8);
// broken = "Hello " + replacement char — both 世 and 界 are lost
// ✅ Only truncate at character boundaries — and use REPORT so any
// accidental mid-character cut fails loudly instead of silently
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPORT);
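A safe truncation helper can back off over UTF-8 continuation bytes (pattern `10xxxxxx`) until it reaches a character boundary; the helper name is ours, not a JDK API:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Truncate {
    /**
     * Truncates UTF-8 bytes to at most maxBytes without cutting a
     * multi-byte character in half: steps back past continuation bytes
     * so the cut lands on a character boundary.
     */
    public static byte[] truncateUtf8(byte[] utf8, int maxBytes) {
        if (utf8.length <= maxBytes) return utf8;
        int end = maxBytes;
        // A byte matching 10xxxxxx is the middle of a character — back up
        while (end > 0 && (utf8[end] & 0xC0) == 0x80) end--;
        return Arrays.copyOf(utf8, end);
    }

    public static void main(String[] args) {
        byte[] utf8 = "Hello 世界".getBytes(StandardCharsets.UTF_8); // 12 bytes
        byte[] safe = truncateUtf8(utf8, 8);                         // backs off to 6
        System.out.println(new String(safe, StandardCharsets.UTF_8)); // "Hello "
    }
}
```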
Pitfall 3: char ≠ character
// Java's char is 16-bit (UTF-16 code unit), NOT a Unicode character
String rocket = "🚀";
System.out.println(rocket.length()); // 2 ← two chars (surrogate pair)
System.out.println(rocket.codePointCount(0, rocket.length())); // 1 ← one character
// When calculating ByteBuffer sizes, use byte length, not String.length()
byte[] rocketBytes = rocket.getBytes(StandardCharsets.UTF_8);
System.out.println(rocketBytes.length); // 4 ← four bytes in UTF-8
Pitfall 4: Mixed encodings in the same buffer
// ❌ Writing different strings with different encodings
buf.put("Hello".getBytes(StandardCharsets.UTF_8));
buf.put("World".getBytes(StandardCharsets.ISO_8859_1));
// How does the reader know which encoding was used for which part?
// ✅ Use one encoding consistently
buf.put("Hello".getBytes(StandardCharsets.UTF_8));
buf.put("World".getBytes(StandardCharsets.UTF_8));
10. Practical Patterns for IPC
Message Encoding Protocol
When designing an IPC protocol, define a clear binary format:
/**
* Message format:
* ┌──────────┬──────────┬──────────────┬──────────────┐
* │ msg_type │ payload │ key_length │ key_bytes │
* │ (1 byte) │ (4 bytes)│ (2 bytes) │ (N bytes) │
* └──────────┴──────────┴──────────────┴──────────────┘
*/
public class IPCMessage {
private byte type;
private int payload;
private String key; // UTF-8 encoded
public void writeTo(ByteBuffer buffer) {
byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
buffer.put(type);
buffer.putInt(payload);
buffer.putShort((short) keyBytes.length);
buffer.put(keyBytes);
}
public static IPCMessage readFrom(ByteBuffer buffer) {
IPCMessage msg = new IPCMessage();
msg.type = buffer.get();
msg.payload = buffer.getInt();
short keyLen = buffer.getShort();
byte[] keyBytes = new byte[keyLen];
buffer.get(keyBytes);
msg.key = new String(keyBytes, StandardCharsets.UTF_8);
return msg;
}
}
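A round trip of this format, written inline so it runs standalone; note the buffer is sized from the encoded key length (1 + 4 + 2 + N bytes), not from key.length():

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class MessageRoundTrip {
    public static void main(String[] args) {
        String key = "user:世界";
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);

        // Size the buffer from the *encoded* key length: 1 + 4 + 2 + N
        ByteBuffer buf = ByteBuffer.allocate(1 + 4 + 2 + keyBytes.length);
        buf.put((byte) 7);                     // msg_type
        buf.putInt(42);                        // payload
        buf.putShort((short) keyBytes.length); // key_length
        buf.put(keyBytes);                     // key_bytes
        buf.flip();

        // Read it back in the same order
        byte type = buf.get();
        int payload = buf.getInt();
        byte[] back = new byte[buf.getShort()];
        buf.get(back);
        String decodedKey = new String(back, StandardCharsets.UTF_8);
        System.out.println(type + " " + payload + " " + decodedKey); // 7 42 user:世界
    }
}
```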
Encoding Negotiation for IPC
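In practice the simplest "negotiation" is none at all: fix UTF-8 in the protocol spec, as recommended above. If you genuinely must support multiple encodings, one hypothetical approach (the header layout and method names here are ours, not a standard) is to have the writer announce its charset by name in a fixed US-ASCII header, which the reader resolves with Charset.forName before decoding any payload:

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetHandshake {
    // Header: [1 byte: name length][N bytes: charset name in US-ASCII]
    public static void writeHeader(ByteBuffer buf, Charset cs) {
        byte[] name = cs.name().getBytes(StandardCharsets.US_ASCII);
        buf.put((byte) name.length);
        buf.put(name);
    }

    public static Charset readHeader(ByteBuffer buf) {
        byte[] name = new byte[buf.get() & 0xFF];
        buf.get(name);
        return Charset.forName(new String(name, StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        writeHeader(buf, StandardCharsets.UTF_8);
        buf.put("Héllo".getBytes(StandardCharsets.UTF_8)); // payload in the announced charset
        buf.flip();

        Charset cs = readHeader(buf);
        byte[] payload = new byte[buf.remaining()];
        buf.get(payload);
        System.out.println(new String(payload, cs)); // "Héllo"
    }
}
```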
11. Summary
Key takeaways:
- Always specify encoding explicitly — never use platform defaults
- UTF-8 is the best default for IPC and file I/O
- Length-prefix your strings in binary protocols
- `String.length()` ≠ byte count — always compute encoded byte length
- Handle encoding errors with `CodingErrorAction.REPORT` or `REPLACE`
- Both sides of IPC must agree on the encoding format