
JVM Low-level I/O - Part 2

Character Encoding & Charset on the JVM


Before we can send data between processes or over a network, we need to answer a fundamental question: how do we turn human-readable text into bytes, and bytes back into text? The answer is character encoding, and getting it wrong is the #1 cause of garbled data in IPC systems.


1. Why Encoding Matters

Consider this scenario: Process A writes the string "Héllo" to shared memory as UTF-8 (6 bytes). Process B reads those 6 bytes, decodes them as ISO-8859-1, and gets "HÃ©llo". What happened?

The problem: é is 1 byte in ISO-8859-1 (0xE9) but 2 bytes in UTF-8 (0xC3 0xA9). When Process B interprets UTF-8 bytes as ISO-8859-1, each byte maps to a different character.
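The failure is easy to reproduce with Java's charset APIs:

```java
import java.nio.charset.StandardCharsets;

// Process A encodes "Héllo" with UTF-8: é becomes two bytes (0xC3 0xA9)
byte[] wire = "Héllo".getBytes(StandardCharsets.UTF_8);
System.out.println(wire.length); // 6

// Process B decodes the same bytes as ISO-8859-1: each byte becomes one char,
// so 0xC3 0xA9 turns into the two characters Ã and ©
String garbled = new String(wire, StandardCharsets.ISO_8859_1);
System.out.println(garbled); // HÃ©llo
```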


2. A Brief History: ASCII → Unicode → UTF-8

ASCII: The Starting Point

ASCII maps 128 characters to numbers 0–127. It uses 7 bits per character.

Character:  A    B    C    ...  Z    0    1    ...  9
Decimal:    65   66   67   ...  90   48   49   ...  57
Hex:        41   42   43   ...  5A   30   31   ...  39
Binary:     1000001  1000010  ...

Problem: Only covers English letters, digits, and basic punctuation. No accented characters, no CJK, no emoji.

Unicode: The Universal Map

Unicode doesn't define how to store characters — it defines a number (called a code point) for every character in every language:

Character    Code Point    Description
────────────────────────────────────────
A            U+0041        Latin Capital A
é            U+00E9        Latin Small E with Acute
中           U+4E2D        CJK Character "middle"
🚀           U+1F680       Rocket Emoji

As of Unicode 15.0, there are over 149,000 characters defined.
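You can read these code points straight off a String with codePointAt, which combines surrogate pairs for characters beyond U+FFFF:

```java
// codePointAt returns the full Unicode code point, even when the
// character is stored as a surrogate pair (like the rocket emoji)
System.out.println(Integer.toHexString("A".codePointAt(0)));  // 41
System.out.println(Integer.toHexString("é".codePointAt(0)));  // e9
System.out.println(Integer.toHexString("中".codePointAt(0))); // 4e2d
System.out.println(Integer.toHexString("🚀".codePointAt(0))); // 1f680
```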

UTF-8: The Encoding

UTF-8 is a variable-width encoding that maps Unicode code points to bytes:

Example: Encoding é (U+00E9) in UTF-8:

Code point: U+00E9 = 0000 0000 1110 1001 (binary)
Range: U+0080 to U+07FF → 2-byte pattern

Pattern:  110xxxxx  10xxxxxx
Fill in:  110 00011  10 101001
Result:   0xC3       0xA9

So é = [C3, A9] in UTF-8
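The same derivation in code; the shift and mask mirror the 110xxxxx/10xxxxxx patterns above:

```java
import java.nio.charset.StandardCharsets;

int cp = 0x00E9;               // code point for é
int b1 = 0xC0 | (cp >> 6);     // 110xxxxx ← top 5 of the 11 payload bits
int b2 = 0x80 | (cp & 0x3F);   // 10xxxxxx ← low 6 bits
System.out.printf("%02X %02X%n", b1, b2); // C3 A9

// The JDK encoder agrees
byte[] bytes = "é".getBytes(StandardCharsets.UTF_8);
System.out.printf("%02X %02X%n", bytes[0] & 0xFF, bytes[1] & 0xFF); // C3 A9
```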

3. How Characters Become Bytes

The encoding/decoding process is a two-way translation: encoding turns characters into bytes, and decoding turns bytes back into characters.

In Java, String objects are stored internally as sequences of UTF-16 code units (with a Latin-1 fast path since Java 9's compact strings), not as raw code points. When you need to write them to a file, network, or shared memory, you must choose an encoding.


4. Common Encodings Compared

Encoding     Bytes per char      ASCII compatible   Use case
─────────────────────────────────────────────────────────────────────────────
US-ASCII     1 (fixed)           ✓                  Legacy systems, protocols
ISO-8859-1   1 (fixed)           ✓                  Legacy Western European
UTF-8 ⭐     1–4 (variable)      ✓                  Web, files, IPC (recommended)
UTF-16       2 or 4 (variable)   ✗                  Java internal, Windows APIs
UTF-16BE     2 or 4 (variable)   ✗                  Big-endian UTF-16
UTF-16LE     2 or 4 (variable)   ✗                  Little-endian UTF-16, .NET
UTF-32       4 (fixed)           ✗                  Internal processing (rarely used)

Storage Comparison for "Hello 世界 🌍"

The string has 10 code points (5 ASCII letters, 2 spaces, 2 CJK characters, 1 emoji):

Encoding    Size        Notes
───────────────────────────────────────────────────────────
UTF-8       17 bytes    ASCII 1 byte each, CJK 3, emoji 4
UTF-16BE    22 bytes    BMP chars 2 bytes, emoji 4 (surrogate pair)
UTF-16      24 bytes    as UTF-16BE plus a 2-byte BOM
UTF-32      40 bytes    4 bytes per code point

For IPC: always use UTF-8 unless you have a specific reason not to. It's compact for ASCII-dominated data and universally supported.
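Measuring that string in each encoding confirms the sizes (UTF-32 is fetched by name, since StandardCharsets has no constant for it):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

String s = "Hello 世界 🌍"; // 10 code points
System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 17
System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 22
System.out.println(s.getBytes(StandardCharsets.UTF_16).length);   // 24 (2-byte BOM)
System.out.println(s.getBytes(Charset.forName("UTF-32")).length); // 40
```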


5. Java's Charset API

Java provides the java.nio.charset.Charset class as the foundation for encoding and decoding:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Pre-defined charsets (always available, no exceptions)
Charset utf8    = StandardCharsets.UTF_8;
Charset utf16   = StandardCharsets.UTF_16;
Charset ascii   = StandardCharsets.US_ASCII;
Charset latin1  = StandardCharsets.ISO_8859_1;

// List all available charsets
System.out.println("Available charsets: " + Charset.availableCharsets().size());
// Typically 100+ depending on JVM

// Get charset by name
Charset windows1252 = Charset.forName("Windows-1252");

// Default charset (platform-dependent — DANGEROUS!)
// Since Java 18 (JEP 400) this defaults to UTF-8; older JVMs pick it up from the OS locale
Charset defaultCs = Charset.defaultCharset();
System.out.println("Default: " + defaultCs); // e.g., UTF-8 on most modern systems

Simple Encoding/Decoding

String text = "Héllo, 世界!";

// String → bytes
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
System.out.println("UTF-8 length: " + utf8Bytes.length);  // 15

byte[] utf16Bytes = text.getBytes(StandardCharsets.UTF_16);
System.out.println("UTF-16 length: " + utf16Bytes.length); // 22 (includes BOM)

// bytes → String  
String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
System.out.println(decoded); // "Héllo, 世界!"

⚠️ Never use new String(bytes) or "text".getBytes() without specifying a charset! These use the platform default encoding, which varies between systems.


6. CharsetEncoder and CharsetDecoder

For fine-grained control — especially when working with ByteBuffers — use CharsetEncoder and CharsetDecoder directly.

Encoding with CharsetEncoder

import java.nio.*;
import java.nio.charset.*;

CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();

// Configure error handling
encoder.onMalformedInput(CodingErrorAction.REPLACE);
encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
encoder.replaceWith(new byte[]{'?'});

// Encode
CharBuffer input = CharBuffer.wrap("Héllo, 世界!");
ByteBuffer output = ByteBuffer.allocate(64);

CoderResult result = encoder.encode(input, output, true);
encoder.flush(output);

if (result.isUnderflow()) {
    System.out.println("Encoding succeeded!");
}

output.flip();
System.out.println("Encoded " + output.remaining() + " bytes");

Decoding with CharsetDecoder

CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPLACE);
decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
decoder.replaceWith("?");

// Simulate receiving bytes from a network/file
byte[] rawBytes = "Héllo, 世界!".getBytes(StandardCharsets.UTF_8);
ByteBuffer input = ByteBuffer.wrap(rawBytes);
CharBuffer output = CharBuffer.allocate(64);

CoderResult result = decoder.decode(input, output, true);
decoder.flush(output);

output.flip();
System.out.println("Decoded: " + output.toString());

Error Handling Options

Action                       Behavior
──────────────────────────────────────────────────────────────────
CodingErrorAction.REPORT     Stop and signal the error (the default for new coders)
CodingErrorAction.REPLACE    Substitute the replacement (� or ?) and continue
CodingErrorAction.IGNORE     Silently drop the offending input

Recommendation: Use REPORT during development to catch issues early. Use REPLACE in production when you must be resilient.
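A small sketch of the difference, feeding both actions a truncated multi-byte sequence:

```java
import java.nio.ByteBuffer;
import java.nio.charset.*;

byte[] bad = { 'H', 'i', (byte) 0xC3 }; // 0xC3 opens a 2-byte sequence that never finishes

// REPORT (the default on a fresh decoder): fail fast
try {
    StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bad));
} catch (CharacterCodingException e) {
    System.out.println("REPORT: " + e.getClass().getSimpleName()); // MalformedInputException
}

// REPLACE: substitute U+FFFD and keep going
CharsetDecoder lenient = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE);
System.out.println(lenient.decode(ByteBuffer.wrap(bad))); // Hi�
```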


7. Encoding with ByteBuffer

Here's the full pattern for encoding text into a ByteBuffer for IPC:

Pattern: Length-Prefixed Strings

This is the most common pattern for writing strings into binary protocols:

/**
 * Writes a string into the ByteBuffer as:
 *   [4 bytes: length of UTF-8 bytes] [N bytes: UTF-8 encoded string]
 */
public static void writeString(ByteBuffer buffer, String value) {
    byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
    buffer.putInt(utf8.length);  // 4-byte length prefix
    buffer.put(utf8);            // UTF-8 bytes
}

/**
 * Reads a length-prefixed string from the ByteBuffer.
 */
public static String readString(ByteBuffer buffer) {
    int length = buffer.getInt();  // read 4-byte length
    byte[] utf8 = new byte[length];
    buffer.get(utf8);              // read that many bytes
    return new String(utf8, StandardCharsets.UTF_8);
}

Usage:

ByteBuffer buf = ByteBuffer.allocate(256);

// Write multiple strings
writeString(buf, "Hello");
writeString(buf, "世界");
writeString(buf, "🚀 Launch!");

buf.flip();

// Read them back
System.out.println(readString(buf)); // "Hello"
System.out.println(readString(buf)); // "世界"
System.out.println(readString(buf)); // "🚀 Launch!"

Memory layout:

[00 00 00 05] 48 65 6C 6C 6F                       ← len=5,  "Hello"
[00 00 00 06] E4 B8 96 E7 95 8C                    ← len=6,  "世界" (3 bytes per char)
[00 00 00 0C] F0 9F 9A 80 20 4C 61 75 6E 63 68 21  ← len=12, "🚀 Launch!"

Pattern: Null-Terminated Strings (C-style)

Used when interoperating with C/C++ code:

public static void writeCString(ByteBuffer buffer, String value) {
    buffer.put(value.getBytes(StandardCharsets.UTF_8));
    buffer.put((byte) 0); // null terminator
}

public static String readCString(ByteBuffer buffer) {
    int start = buffer.position();
    while (buffer.get() != 0) { } // scan for null
    int length = buffer.position() - start - 1;
    
    buffer.position(start);
    byte[] bytes = new byte[length];
    buffer.get(bytes);
    buffer.get(); // skip null terminator
    
    return new String(bytes, StandardCharsets.UTF_8);
}

8. The BOM (Byte Order Mark) Problem

The BOM is a special Unicode character (U+FEFF) placed at the start of a file to indicate byte order:

// UTF-16 defaults include BOM
byte[] utf16 = "Hello".getBytes(StandardCharsets.UTF_16);
System.out.println(utf16.length); // 12 (2 BOM + 10 data)

// UTF-16BE/LE do NOT include BOM
byte[] utf16be = "Hello".getBytes(StandardCharsets.UTF_16BE);
System.out.println(utf16be.length); // 10 (no BOM)

// For IPC: always use explicit byte order, never rely on BOM

IPC Rule: Use UTF_8 (no BOM issue) or UTF_16BE/UTF_16LE (explicit byte order, no BOM). Never use UTF_16 — it prepends a BOM.
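You can see the BOM directly in the encoded bytes:

```java
import java.nio.charset.StandardCharsets;

// Java's UTF_16 charset encodes big-endian and prepends the BOM FE FF
byte[] utf16 = "A".getBytes(StandardCharsets.UTF_16);
System.out.printf("%02X %02X%n", utf16[0] & 0xFF, utf16[1] & 0xFF); // FE FF
System.out.println(utf16.length); // 4: BOM + 00 41

// UTF_16BE writes the bare code units, no BOM
byte[] be = "A".getBytes(StandardCharsets.UTF_16BE);
System.out.println(be.length);    // 2: just 00 41
```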


9. Encoding Pitfalls and How to Avoid Them

Pitfall 1: Platform Default Encoding

// ❌ Uses platform default — different on Windows vs Linux!
byte[] bytes = "Hello".getBytes();
String text = new String(bytes);

// ✅ Always specify charset explicitly
byte[] bytes = "Hello".getBytes(StandardCharsets.UTF_8);
String text = new String(bytes, StandardCharsets.UTF_8);

Pitfall 2: Truncating Multi-byte Characters

// ❌ Cutting UTF-8 in the middle of a multi-byte character
String text = "Hello 世界";
byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
// utf8 = [48,65,6C,6C,6F,20,E4,B8,96,E7,95,8C] — 12 bytes

// Truncate to 8 bytes — cuts 世 in the middle!
byte[] truncated = Arrays.copyOf(utf8, 8);
String broken = new String(truncated, StandardCharsets.UTF_8);
// broken = "Hello " + U+FFFD replacement char; 世 was cut mid-sequence and 界 lost entirely

// ✅ Only truncate at character boundaries
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPORT);
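One safe-truncation approach (a sketch; truncateUtf8 is a hypothetical helper, assuming well-formed UTF-8 input) is to back up over any continuation bytes, which always match the bit pattern 10xxxxxx:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Trims to at most maxBytes without splitting a multi-byte character
public static byte[] truncateUtf8(byte[] utf8, int maxBytes) {
    if (utf8.length <= maxBytes) return utf8;
    int end = maxBytes;
    while (end > 0 && (utf8[end] & 0xC0) == 0x80) end--; // back over 10xxxxxx bytes
    return Arrays.copyOf(utf8, end);
}

byte[] utf8 = "Hello 世界".getBytes(StandardCharsets.UTF_8); // 12 bytes
byte[] safe = truncateUtf8(utf8, 8);                         // backs off to 6 bytes
System.out.println(new String(safe, StandardCharsets.UTF_8)); // "Hello " (boundary-safe)
```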

Pitfall 3: char ≠ character

// Java's char is 16-bit (UTF-16 code unit), NOT a Unicode character
String rocket = "🚀";
System.out.println(rocket.length());     // 2 ← two chars (surrogate pair)
System.out.println(rocket.codePointCount(0, rocket.length())); // 1 ← one character

// When calculating ByteBuffer sizes, use byte length, not String.length()
byte[] rocketBytes = rocket.getBytes(StandardCharsets.UTF_8);
System.out.println(rocketBytes.length);  // 4 ← four bytes in UTF-8

Pitfall 4: Mixed encodings in the same buffer

// ❌ Writing different strings with different encodings
buf.put("Hello".getBytes(StandardCharsets.UTF_8));
buf.put("World".getBytes(StandardCharsets.ISO_8859_1));
// How does the reader know which encoding was used for which part?

// ✅ Use one encoding consistently
buf.put("Hello".getBytes(StandardCharsets.UTF_8));
buf.put("World".getBytes(StandardCharsets.UTF_8));

10. Practical Patterns for IPC

Message Encoding Protocol

When designing an IPC protocol, define a clear binary format:

/**
 * Message format:
 * ┌──────────┬──────────┬──────────────┬──────────────┐
 * │ msg_type │ payload  │ key_length   │ key_bytes    │
 * │ (1 byte) │ (4 bytes)│ (2 bytes)    │ (N bytes)    │
 * └──────────┴──────────┴──────────────┴──────────────┘
 */
public class IPCMessage {
    private byte type;
    private int payload;
    private String key; // UTF-8 encoded
    
    public IPCMessage(byte type, int payload, String key) {
        this.type = type;
        this.payload = payload;
        this.key = key;
    }
    
    private IPCMessage() { } // used by readFrom
    
    public void writeTo(ByteBuffer buffer) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        
        buffer.put(type);
        buffer.putInt(payload);
        buffer.putShort((short) keyBytes.length);
        buffer.put(keyBytes);
    }
    
    public static IPCMessage readFrom(ByteBuffer buffer) {
        IPCMessage msg = new IPCMessage();
        msg.type = buffer.get();
        msg.payload = buffer.getInt();
        
        short keyLen = buffer.getShort();
        byte[] keyBytes = new byte[keyLen];
        buffer.get(keyBytes);
        msg.key = new String(keyBytes, StandardCharsets.UTF_8);
        
        return msg;
    }
}
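A round trip of this wire format, inlined with raw buffer calls so it runs standalone (the field values and the key name "sensor-7" are just examples):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

ByteBuffer buf = ByteBuffer.allocate(128);

// Write: type=1, payload=42, key="sensor-7", matching the format diagram
byte[] key = "sensor-7".getBytes(StandardCharsets.UTF_8);
buf.put((byte) 1).putInt(42).putShort((short) key.length).put(key);
buf.flip();

// Read back in the same field order
byte type = buf.get();       // 1
int payload = buf.getInt();  // 42
byte[] keyBytes = new byte[buf.getShort()];
buf.get(keyBytes);
System.out.println(new String(keyBytes, StandardCharsets.UTF_8)); // sensor-7
```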

Encoding Negotiation for IPC
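The simplest negotiation is no negotiation: pin the encoding in the protocol spec. If it must be dynamic, one sketch (an illustrative scheme, not a standard protocol) is a handshake whose first message carries the charset name encoded in US-ASCII, which every party can decode:

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Writer side: 1-byte length, then the canonical charset name in US-ASCII
ByteBuffer handshake = ByteBuffer.allocate(64);
byte[] name = StandardCharsets.UTF_8.name().getBytes(StandardCharsets.US_ASCII);
handshake.put((byte) name.length).put(name);
handshake.flip();

// Reader side: recover the charset, then use it for all later strings
byte[] nameBytes = new byte[handshake.get()];
handshake.get(nameBytes);
Charset agreed = Charset.forName(new String(nameBytes, StandardCharsets.US_ASCII));
System.out.println(agreed); // UTF-8
```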


11. Summary

Key takeaways:

  1. Always specify encoding explicitly — never use platform defaults

  2. UTF-8 is the best default for IPC and file I/O

  3. Length-prefix your strings in binary protocols

  4. String.length() ≠ byte count — always compute encoded byte length

  5. Handle encoding errors with CodingErrorAction.REPORT or REPLACE

  6. Both sides of IPC must agree on the encoding format