
JVM Low-level I/O - Part 2

Character Encoding & Charset on the JVM


Before we can send data between processes or over a network, we need to answer a fundamental question: how do we turn human-readable text into bytes, and bytes back into text? The answer is character encoding, and getting it wrong is the #1 cause of garbled data in IPC systems.


1. Why Encoding Matters

Consider this scenario: Process A writes the string "Héllo" to shared memory as UTF-8 (6 bytes). Process B reads those 6 bytes, decodes them as ISO-8859-1, and gets "HÃ©llo". What happened?

The problem: é is 1 byte in ISO-8859-1 (0xE9) but 2 bytes in UTF-8 (0xC3 0xA9). When Process B interprets UTF-8 bytes as ISO-8859-1, each byte maps to a different character.
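The failure is easy to reproduce with Java's charset APIs:

```java
import java.nio.charset.StandardCharsets;

// Process A encodes "Héllo" with UTF-8: é becomes two bytes (0xC3 0xA9)
byte[] wire = "Héllo".getBytes(StandardCharsets.UTF_8);
System.out.println(wire.length); // 6

// Process B decodes the same bytes as ISO-8859-1: each byte becomes one char,
// so 0xC3 0xA9 turns into the two characters Ã and ©
String garbled = new String(wire, StandardCharsets.ISO_8859_1);
System.out.println(garbled); // HÃ©llo
```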


2. A Brief History: ASCII → Unicode → UTF-8

ASCII: The Starting Point

ASCII maps 128 characters to numbers 0–127. It uses 7 bits per character.

Character:  A    B    C    ...  Z    0    1    ...  9
Decimal:    65   66   67   ...  90   48   49   ...  57
Hex:        41   42   43   ...  5A   30   31   ...  39
Binary:     1000001  1000010  ...

Problem: Only covers English letters, digits, and basic punctuation. No accented characters, no CJK, no emoji.

Unicode: The Universal Map

Unicode doesn't define how to store characters — it defines a number (called a code point) for every character in every language:

Character    Code Point    Description
────────────────────────────────────────
A            U+0041        Latin Capital A
é            U+00E9        Latin Small E with Acute
中           U+4E2D        CJK Character "middle"
🚀           U+1F680       Rocket Emoji

As of Unicode 15.0, there are over 149,000 characters defined.
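You can read these code points straight off a String with codePointAt, which combines surrogate pairs for characters beyond U+FFFF:

```java
// codePointAt returns the full Unicode code point, even when the
// character is stored as a surrogate pair (like the rocket emoji)
System.out.println(Integer.toHexString("A".codePointAt(0)));  // 41
System.out.println(Integer.toHexString("é".codePointAt(0)));  // e9
System.out.println(Integer.toHexString("中".codePointAt(0))); // 4e2d
System.out.println(Integer.toHexString("🚀".codePointAt(0))); // 1f680
```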

UTF-8: The Encoding

UTF-8 is a variable-width encoding that maps Unicode code points to bytes:

Example: Encoding é (U+00E9) in UTF-8:

Code point: U+00E9 = 0000 0000 1110 1001 (binary)
Range: U+0080 to U+07FF → 2-byte pattern

Pattern:  110xxxxx  10xxxxxx
Fill in:  110 00011  10 101001
Result:   0xC3       0xA9

So é = [C3, A9] in UTF-8
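The same derivation in code; the shift and mask mirror the 110xxxxx/10xxxxxx patterns above:

```java
import java.nio.charset.StandardCharsets;

int cp = 0x00E9;               // code point for é
int b1 = 0xC0 | (cp >> 6);     // 110xxxxx ← top 5 of the 11 payload bits
int b2 = 0x80 | (cp & 0x3F);   // 10xxxxxx ← low 6 bits
System.out.printf("%02X %02X%n", b1, b2); // C3 A9

// The JDK encoder agrees
byte[] bytes = "é".getBytes(StandardCharsets.UTF_8);
System.out.printf("%02X %02X%n", bytes[0] & 0xFF, bytes[1] & 0xFF); // C3 A9
```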

3. How Characters Become Bytes

The encoding/decoding process is a two-way translation: encoding turns characters into bytes, and decoding turns bytes back into characters.

In Java, String objects are stored internally as sequences of UTF-16 code units (with a Latin-1 fast path since Java 9's compact strings), not as raw code points. When you need to write them to a file, network, or shared memory, you must choose an encoding.


4. Common Encodings Compared

Encoding     Bytes per char      ASCII compatible   Use case
─────────────────────────────────────────────────────────────────────────────
US-ASCII     1 (fixed)           ✓                  Legacy systems, protocols
ISO-8859-1   1 (fixed)           ✓                  Legacy Western European
UTF-8 ⭐     1–4 (variable)      ✓                  Web, files, IPC (recommended)
UTF-16       2 or 4 (variable)   ✗                  Java internal, Windows APIs
UTF-16BE     2 or 4 (variable)   ✗                  Big-endian UTF-16
UTF-16LE     2 or 4 (variable)   ✗                  Little-endian UTF-16, .NET
UTF-32       4 (fixed)           ✗                  Internal processing (rarely used)

Storage Comparison for "Hello 世界 🌍"

The string has 10 code points (5 ASCII letters, 2 spaces, 2 CJK characters, 1 emoji):

Encoding    Size        Notes
───────────────────────────────────────────────────────────
UTF-8       17 bytes    ASCII 1 byte each, CJK 3, emoji 4
UTF-16BE    22 bytes    BMP chars 2 bytes, emoji 4 (surrogate pair)
UTF-16      24 bytes    as UTF-16BE plus a 2-byte BOM
UTF-32      40 bytes    4 bytes per code point

For IPC: always use UTF-8 unless you have a specific reason not to. It's compact for ASCII-dominated data and universally supported.
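Measuring that string in each encoding confirms the sizes (UTF-32 is fetched by name, since StandardCharsets has no constant for it):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

String s = "Hello 世界 🌍"; // 10 code points
System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 17
System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 22
System.out.println(s.getBytes(StandardCharsets.UTF_16).length);   // 24 (2-byte BOM)
System.out.println(s.getBytes(Charset.forName("UTF-32")).length); // 40
```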


5. Java's Charset API

Java provides the java.nio.charset.Charset class as the foundation for encoding and decoding:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Pre-defined charsets (always available, no exceptions)
Charset utf8    = StandardCharsets.UTF_8;
Charset utf16   = StandardCharsets.UTF_16;
Charset ascii   = StandardCharsets.US_ASCII;
Charset latin1  = StandardCharsets.ISO_8859_1;

// List all available charsets
System.out.println("Available charsets: " + Charset.availableCharsets().size());
// Typically 100+ depending on JVM

// Get charset by name
Charset windows1252 = Charset.forName("Windows-1252");

// Default charset (platform-dependent — DANGEROUS!)
// Since Java 18 (JEP 400) this defaults to UTF-8; older JVMs pick it up from the OS locale
Charset defaultCs = Charset.defaultCharset();
System.out.println("Default: " + defaultCs); // e.g., UTF-8 on most modern systems

Simple Encoding/Decoding

String text = "Héllo, 世界!";

// String → bytes
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
System.out.println("UTF-8 length: " + utf8Bytes.length);  // 15

byte[] utf16Bytes = text.getBytes(StandardCharsets.UTF_16);
System.out.println("UTF-16 length: " + utf16Bytes.length); // 22 (includes BOM)

// bytes → String  
String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
System.out.println(decoded); // "Héllo, 世界!"

⚠️ Never use new String(bytes) or "text".getBytes() without specifying a charset! These use the platform default encoding, which varies between systems.


6. CharsetEncoder and CharsetDecoder

For fine-grained control — especially when working with ByteBuffers — use CharsetEncoder and CharsetDecoder directly.

Encoding with CharsetEncoder

import java.nio.*;
import java.nio.charset.*;

CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();

// Configure error handling
encoder.onMalformedInput(CodingErrorAction.REPLACE);
encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
encoder.replaceWith(new byte[]{'?'});

// Encode
CharBuffer input = CharBuffer.wrap("Héllo, 世界!");
ByteBuffer output = ByteBuffer.allocate(64);

CoderResult result = encoder.encode(input, output, true);
encoder.flush(output);

if (result.isUnderflow()) {
    System.out.println("Encoding succeeded!");
}

output.flip();
System.out.println("Encoded " + output.remaining() + " bytes");

Decoding with CharsetDecoder

CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPLACE);
decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
decoder.replaceWith("?");

// Simulate receiving bytes from a network/file
byte[] rawBytes = "Héllo, 世界!".getBytes(StandardCharsets.UTF_8);
ByteBuffer input = ByteBuffer.wrap(rawBytes);
CharBuffer output = CharBuffer.allocate(64);

CoderResult result = decoder.decode(input, output, true);
decoder.flush(output);

output.flip();
System.out.println("Decoded: " + output.toString());

Error Handling Options

Action                       Behavior
──────────────────────────────────────────────────────────────────
CodingErrorAction.REPORT     Stop and signal the error (the default for new coders)
CodingErrorAction.REPLACE    Substitute the replacement (� or ?) and continue
CodingErrorAction.IGNORE     Silently drop the offending input

Recommendation: Use REPORT during development to catch issues early. Use REPLACE in production when you must be resilient.
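A small sketch of the difference, feeding both actions a truncated multi-byte sequence:

```java
import java.nio.ByteBuffer;
import java.nio.charset.*;

byte[] bad = { 'H', 'i', (byte) 0xC3 }; // 0xC3 opens a 2-byte sequence that never finishes

// REPORT (the default on a fresh decoder): fail fast
try {
    StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bad));
} catch (CharacterCodingException e) {
    System.out.println("REPORT: " + e.getClass().getSimpleName()); // MalformedInputException
}

// REPLACE: substitute U+FFFD and keep going
CharsetDecoder lenient = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE);
System.out.println(lenient.decode(ByteBuffer.wrap(bad))); // Hi�
```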


7. Encoding with ByteBuffer

Here's the full pattern for encoding text into a ByteBuffer for IPC:

Pattern: Length-Prefixed Strings

This is the most common pattern for writing strings into binary protocols:

/**
 * Writes a string into the ByteBuffer as:
 *   [4 bytes: length of UTF-8 bytes] [N bytes: UTF-8 encoded string]
 */
public static void writeString(ByteBuffer buffer, String value) {
    byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
    buffer.putInt(utf8.length);  // 4-byte length prefix
    buffer.put(utf8);            // UTF-8 bytes
}

/**
 * Reads a length-prefixed string from the ByteBuffer.
 */
public static String readString(ByteBuffer buffer) {
    int length = buffer.getInt();  // read 4-byte length
    byte[] utf8 = new byte[length];
    buffer.get(utf8);              // read that many bytes
    return new String(utf8, StandardCharsets.UTF_8);
}

Usage:

ByteBuffer buf = ByteBuffer.allocate(256);

// Write multiple strings
writeString(buf, "Hello");
writeString(buf, "世界");
writeString(buf, "🚀 Launch!");

buf.flip();

// Read them back
System.out.println(readString(buf)); // "Hello"
System.out.println(readString(buf)); // "世界"
System.out.println(readString(buf)); // "🚀 Launch!"

Memory layout:

[00 00 00 05] 48 65 6C 6C 6F                       ← len=5,  "Hello"
[00 00 00 06] E4 B8 96 E7 95 8C                    ← len=6,  "世界" (3 bytes per char)
[00 00 00 0C] F0 9F 9A 80 20 4C 61 75 6E 63 68 21  ← len=12, "🚀 Launch!"

Pattern: Null-Terminated Strings (C-style)

Used when interoperating with C/C++ code:

public static void writeCString(ByteBuffer buffer, String value) {
    buffer.put(value.getBytes(StandardCharsets.UTF_8));
    buffer.put((byte) 0); // null terminator
}

public static String readCString(ByteBuffer buffer) {
    int start = buffer.position();
    while (buffer.get() != 0) { } // scan for null
    int length = buffer.position() - start - 1;
    
    buffer.position(start);
    byte[] bytes = new byte[length];
    buffer.get(bytes);
    buffer.get(); // skip null terminator
    
    return new String(bytes, StandardCharsets.UTF_8);
}

8. The BOM (Byte Order Mark) Problem

The BOM is a special Unicode character (U+FEFF) placed at the start of a file to indicate byte order:

// UTF-16 defaults include BOM
byte[] utf16 = "Hello".getBytes(StandardCharsets.UTF_16);
System.out.println(utf16.length); // 12 (2 BOM + 10 data)

// UTF-16BE/LE do NOT include BOM
byte[] utf16be = "Hello".getBytes(StandardCharsets.UTF_16BE);
System.out.println(utf16be.length); // 10 (no BOM)

// For IPC: always use explicit byte order, never rely on BOM

IPC Rule: Use UTF_8 (no BOM issue) or UTF_16BE/UTF_16LE (explicit byte order, no BOM). Never use UTF_16 — it prepends a BOM.
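You can see the BOM directly in the encoded bytes:

```java
import java.nio.charset.StandardCharsets;

// Java's UTF_16 charset encodes big-endian and prepends the BOM FE FF
byte[] utf16 = "A".getBytes(StandardCharsets.UTF_16);
System.out.printf("%02X %02X%n", utf16[0] & 0xFF, utf16[1] & 0xFF); // FE FF
System.out.println(utf16.length); // 4: BOM + 00 41

// UTF_16BE writes the bare code units, no BOM
byte[] be = "A".getBytes(StandardCharsets.UTF_16BE);
System.out.println(be.length);    // 2: just 00 41
```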


9. Encoding Pitfalls and How to Avoid Them

Pitfall 1: Platform Default Encoding

// ❌ Uses platform default — different on Windows vs Linux!
byte[] bytes = "Hello".getBytes();
String text = new String(bytes);

// ✅ Always specify charset explicitly
byte[] bytes = "Hello".getBytes(StandardCharsets.UTF_8);
String text = new String(bytes, StandardCharsets.UTF_8);

Pitfall 2: Truncating Multi-byte Characters

// ❌ Cutting UTF-8 in the middle of a multi-byte character
String text = "Hello 世界";
byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
// utf8 = [48,65,6C,6C,6F,20,E4,B8,96,E7,95,8C] — 12 bytes

// Truncate to 8 bytes — cuts 世 in the middle!
byte[] truncated = Arrays.copyOf(utf8, 8);
String broken = new String(truncated, StandardCharsets.UTF_8);
// broken = "Hello " + U+FFFD replacement char; 世 was cut mid-sequence and 界 lost entirely

// ✅ Only truncate at character boundaries
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPORT);
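One safe-truncation approach (a sketch; truncateUtf8 is a hypothetical helper, assuming well-formed UTF-8 input) is to back up over any continuation bytes, which always match the bit pattern 10xxxxxx:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Trims to at most maxBytes without splitting a multi-byte character
public static byte[] truncateUtf8(byte[] utf8, int maxBytes) {
    if (utf8.length <= maxBytes) return utf8;
    int end = maxBytes;
    while (end > 0 && (utf8[end] & 0xC0) == 0x80) end--; // back over 10xxxxxx bytes
    return Arrays.copyOf(utf8, end);
}

byte[] utf8 = "Hello 世界".getBytes(StandardCharsets.UTF_8); // 12 bytes
byte[] safe = truncateUtf8(utf8, 8);                         // backs off to 6 bytes
System.out.println(new String(safe, StandardCharsets.UTF_8)); // "Hello " (boundary-safe)
```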

Pitfall 3: char ≠ character

// Java's char is 16-bit (UTF-16 code unit), NOT a Unicode character
String rocket = "🚀";
System.out.println(rocket.length());     // 2 ← two chars (surrogate pair)
System.out.println(rocket.codePointCount(0, rocket.length())); // 1 ← one character

// When calculating ByteBuffer sizes, use byte length, not String.length()
byte[] rocketBytes = rocket.getBytes(StandardCharsets.UTF_8);
System.out.println(rocketBytes.length);  // 4 ← four bytes in UTF-8

Pitfall 4: Mixed encodings in the same buffer

// ❌ Writing different strings with different encodings
buf.put("Hello".getBytes(StandardCharsets.UTF_8));
buf.put("World".getBytes(StandardCharsets.ISO_8859_1));
// How does the reader know which encoding was used for which part?

// ✅ Use one encoding consistently
buf.put("Hello".getBytes(StandardCharsets.UTF_8));
buf.put("World".getBytes(StandardCharsets.UTF_8));

10. Practical Patterns for IPC

Message Encoding Protocol

When designing an IPC protocol, define a clear binary format:

/**
 * Message format:
 * ┌──────────┬──────────┬──────────────┬──────────────┐
 * │ msg_type │ payload  │ key_length   │ key_bytes    │
 * │ (1 byte) │ (4 bytes)│ (2 bytes)    │ (N bytes)    │
 * └──────────┴──────────┴──────────────┴──────────────┘
 */
public class IPCMessage {
    private byte type;
    private int payload;
    private String key; // UTF-8 encoded
    
    public IPCMessage(byte type, int payload, String key) {
        this.type = type;
        this.payload = payload;
        this.key = key;
    }
    
    private IPCMessage() { } // used by readFrom
    
    public void writeTo(ByteBuffer buffer) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        
        buffer.put(type);
        buffer.putInt(payload);
        buffer.putShort((short) keyBytes.length);
        buffer.put(keyBytes);
    }
    
    public static IPCMessage readFrom(ByteBuffer buffer) {
        IPCMessage msg = new IPCMessage();
        msg.type = buffer.get();
        msg.payload = buffer.getInt();
        
        short keyLen = buffer.getShort();
        byte[] keyBytes = new byte[keyLen];
        buffer.get(keyBytes);
        msg.key = new String(keyBytes, StandardCharsets.UTF_8);
        
        return msg;
    }
}
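A round trip of this wire format, inlined with raw buffer calls so it runs standalone (the field values and the key name "sensor-7" are just examples):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

ByteBuffer buf = ByteBuffer.allocate(128);

// Write: type=1, payload=42, key="sensor-7", matching the format diagram
byte[] key = "sensor-7".getBytes(StandardCharsets.UTF_8);
buf.put((byte) 1).putInt(42).putShort((short) key.length).put(key);
buf.flip();

// Read back in the same field order
byte type = buf.get();       // 1
int payload = buf.getInt();  // 42
byte[] keyBytes = new byte[buf.getShort()];
buf.get(keyBytes);
System.out.println(new String(keyBytes, StandardCharsets.UTF_8)); // sensor-7
```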

Encoding Negotiation for IPC
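The simplest negotiation is no negotiation: pin the encoding in the protocol spec. If it must be dynamic, one sketch (an illustrative scheme, not a standard protocol) is a handshake whose first message carries the charset name encoded in US-ASCII, which every party can decode:

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Writer side: 1-byte length, then the canonical charset name in US-ASCII
ByteBuffer handshake = ByteBuffer.allocate(64);
byte[] name = StandardCharsets.UTF_8.name().getBytes(StandardCharsets.US_ASCII);
handshake.put((byte) name.length).put(name);
handshake.flip();

// Reader side: recover the charset, then use it for all later strings
byte[] nameBytes = new byte[handshake.get()];
handshake.get(nameBytes);
Charset agreed = Charset.forName(new String(nameBytes, StandardCharsets.US_ASCII));
System.out.println(agreed); // UTF-8
```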


11. Summary

Key takeaways:

  1. Always specify encoding explicitly — never use platform defaults

  2. UTF-8 is the best default for IPC and file I/O

  3. Length-prefix your strings in binary protocols

  4. String.length() ≠ byte count — always compute encoded byte length

  5. Handle encoding errors with CodingErrorAction.REPORT or REPLACE

  6. Both sides of IPC must agree on the encoding format