# JVM Low-level I/O - Part 2

Before we can send data between processes or over a network, we need to answer a fundamental question: **how do we turn human-readable text into bytes, and bytes back into text?** The answer is *character encoding*, and getting it wrong is one of the most common causes of garbled data in IPC systems.

* * *

## 1\. Why Encoding Matters

Consider this scenario: Process A writes the string `"Héllo"` to shared memory as UTF-8, which takes 6 bytes. Process B reads those 6 bytes, decodes them as ISO-8859-1, and gets `"HÃ©llo"`. What happened?

![](https://cdn.hashnode.com/uploads/covers/637f189ed7d9bcd845996b4b/b5de0e58-0785-4602-9ac9-baa6b6dd52b4.png align="center")

The problem: `é` is **1 byte** in ISO-8859-1 (0xE9) but **2 bytes** in UTF-8 (0xC3 0xA9). When Process B interprets the UTF-8 bytes as ISO-8859-1, it treats each of those two bytes as a separate character: 0xC3 becomes `Ã` and 0xA9 becomes `©`.
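
The scenario is easy to reproduce. The sketch below encodes the string as UTF-8 and then deliberately decodes it with the wrong charset; the class and method names are illustrative only, and the non-ASCII characters are written as Unicode escapes so the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    /** Encodes with UTF-8, then (wrongly) decodes the same bytes as ISO-8859-1. */
    static String misdecode(String original) {
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "H\u00E9llo"; // U+00E9 is the accented e
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);

        System.out.println(wire.length);         // 6: the accented e takes two bytes
        System.out.println(misdecode(original)); // H, U+00C3, U+00A9, l, l, o
    }
}
```

Decoding the same bytes with `StandardCharsets.UTF_8` instead returns the original string unchanged.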

* * *

## 2\. A Brief History: ASCII → Unicode → UTF-8

![](https://cdn.hashnode.com/uploads/covers/637f189ed7d9bcd845996b4b/9987eb32-d49c-491b-b316-0a9cdb2e0a98.png align="center")

### ASCII: The Starting Point

ASCII maps 128 characters to numbers 0–127. It uses 7 bits per character.

```plaintext
Character:  A    B    C    ...  Z    0    1    ...  9
Decimal:    65   66   67   ...  90   48   49   ...  57
Hex:        41   42   43   ...  5A   30   31   ...  39
Binary:     1000001  1000010  ...
```

**Problem:** Only covers English letters, digits, and basic punctuation. No accented characters, no CJK, no emoji.

### Unicode: The Universal Map

Unicode doesn't define how to *store* characters — it defines a number (called a **code point**) for every character in every language:

```plaintext
Character    Code Point    Description
────────────────────────────────────────
A            U+0041        Latin Capital A
é            U+00E9        Latin Small E with Acute
中           U+4E2D        CJK Character "middle"
🚀           U+1F680       Rocket Emoji
```

As of Unicode 15.0, there are over **149,000** characters defined.

### UTF-8: The Encoding

UTF-8 is a **variable-width** encoding that maps Unicode code points to bytes:

![](https://cdn.hashnode.com/uploads/covers/637f189ed7d9bcd845996b4b/5d32af5a-629b-46d5-a2b7-1167f2aab250.png align="center")

**Example: Encoding** `é` **(U+00E9) in UTF-8:**

```plaintext
Code point: U+00E9 = 0000 0000 1110 1001 (binary)
Range: U+0080 to U+07FF → 2-byte pattern

Pattern:  110xxxxx  10xxxxxx
Fill in:  110 00011  10 101001
Result:   0xC3       0xA9

So é = [C3, A9] in UTF-8
```
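
The same bit-filling can be written out in code. This is a teaching sketch rather than something to use in production (the JDK's own encoder should be preferred); `Utf8ByHand` is a hypothetical name, and the `main` method only compares the manual result with what `String.getBytes` produces.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8ByHand {
    /** Encodes one Unicode code point into UTF-8 by filling the bit patterns directly. */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point: " + cp);
        }
        if (cp < 0x80) {                       // 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {                               // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0xE9, 0x4E2D, 0x1F680 };
        for (int cp : samples) {
            byte[] manual = encode(cp);
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X -> %d byte(s), matches JDK: %b%n",
                    cp, manual.length, Arrays.equals(manual, jdk));
        }
    }
}
```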

* * *

## 3\. How Characters Become Bytes

The encoding/decoding process is a two-way translation:

![](https://cdn.hashnode.com/uploads/covers/637f189ed7d9bcd845996b4b/56229853-3035-41ff-ab6c-70bd5d760761.png align="center")

![](https://cdn.hashnode.com/uploads/covers/637f189ed7d9bcd845996b4b/0cd69851-8216-47af-acef-69d291a9bc56.png align="center")

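In Java, a `String` is exposed as a sequence of UTF-16 code units (`char` values), not as raw code points or bytes; characters outside the Basic Multilingual Plane take two `char`s. (Since Java 9 the JVM may store Latin-1-only strings more compactly, but the API behaves the same.) When you need to write a string to a file, network, or shared memory, you must choose an encoding.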

* * *

## 4\. Common Encodings Compared

| Encoding | Bytes per Char | ASCII Compatible | Use Case |
| --- | --- | --- | --- |
| **US-ASCII** | 1 (fixed) | ✅ | Legacy systems, protocols |
| **ISO-8859-1** | 1 (fixed) | ✅ | Legacy Western European |
| **UTF-8** | 1–4 (variable) | ✅ | ⭐ Web, files, IPC (recommended) |
| **UTF-16** | 2 or 4 (variable) | ❌ | Java internal, Windows APIs |
| **UTF-16BE** | 2 or 4 (variable) | ❌ | Big-endian UTF-16 |
| **UTF-16LE** | 2 or 4 (variable) | ❌ | Little-endian UTF-16, .NET |
| **UTF-32** | 4 (fixed) | ❌ | Internal processing (rarely used) |

### Storage Comparison for "Hello 世界 🌍"

![](https://cdn.hashnode.com/uploads/covers/637f189ed7d9bcd845996b4b/4c581e14-d6e6-4c05-90f2-47d3d9862fb9.png align="center")

> **For IPC: always use UTF-8** unless you have a specific reason not to. It's compact for ASCII-dominated data and universally supported.
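
The byte counts for the sample string can be checked directly. A minimal sketch: `StorageComparison` is an illustrative name, the non-ASCII characters are written as escapes, and `UTF-32BE` is looked up by name on the assumption that the JVM ships it (standard JDK builds do, but it is not one of the charsets every JVM is required to provide).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StorageComparison {
    // "Hello", space, U+4E16, U+754C, space, U+1F30D (written as a surrogate pair)
    static final String SAMPLE = "Hello \u4E16\u754C \uD83C\uDF0D";

    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        System.out.println("UTF-8:    " + encodedLength(SAMPLE, StandardCharsets.UTF_8));    // 17
        System.out.println("UTF-16BE: " + encodedLength(SAMPLE, StandardCharsets.UTF_16BE)); // 22
        System.out.println("UTF-16:   " + encodedLength(SAMPLE, StandardCharsets.UTF_16));   // 24 (22 + 2-byte BOM)
        // Assumption: UTF-32BE is available on this JVM.
        System.out.println("UTF-32BE: " + encodedLength(SAMPLE, Charset.forName("UTF-32BE"))); // 40
    }
}
```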

* * *

## 5\. Java's Charset API

Java provides the `java.nio.charset.Charset` class as the foundation for encoding and decoding:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Pre-defined charsets (always available, no exceptions)
Charset utf8    = StandardCharsets.UTF_8;
Charset utf16   = StandardCharsets.UTF_16;
Charset ascii   = StandardCharsets.US_ASCII;
Charset latin1  = StandardCharsets.ISO_8859_1;

// List all available charsets
System.out.println("Available charsets: " + Charset.availableCharsets().size());
// Typically 100+ depending on JVM

// Get charset by name
Charset windows1252 = Charset.forName("Windows-1252");

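// Default charset: platform-dependent before Java 18, UTF-8 by default since then (JEP 400).
// Still DANGEROUS to rely on: it can be overridden with -Dfile.encoding.
Charset defaultCs = Charset.defaultCharset();
System.out.println("Default: " + defaultCs); // e.g., UTF-8 on most modern systems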
```

### Simple Encoding/Decoding

```java
String text = "Héllo, 世界!";

// String → bytes
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
System.out.println("UTF-8 length: " + utf8Bytes.length);  // 15

byte[] utf16Bytes = text.getBytes(StandardCharsets.UTF_16);
System.out.println("UTF-16 length: " + utf16Bytes.length); // 22 (includes BOM)

// bytes → String  
String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
System.out.println(decoded); // "Héllo, 世界!"
```

> ⚠️ **Never use** `new String(bytes)` **or** `"text".getBytes()` without specifying a charset! These use the platform default encoding, which varies between systems.

* * *

## 6\. CharsetEncoder and CharsetDecoder

For fine-grained control — especially when working with ByteBuffers — use `CharsetEncoder` and `CharsetDecoder` directly.

![](https://cdn.hashnode.com/uploads/covers/637f189ed7d9bcd845996b4b/898aeff1-946b-43b7-a872-13591035b24b.png align="center")

### Encoding with CharsetEncoder

```java
import java.nio.*;
import java.nio.charset.*;

CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();

// Configure error handling
encoder.onMalformedInput(CodingErrorAction.REPLACE);
encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
encoder.replaceWith(new byte[]{'?'});

// Encode
CharBuffer input = CharBuffer.wrap("Héllo, 世界!");
ByteBuffer output = ByteBuffer.allocate(64);

CoderResult result = encoder.encode(input, output, true);
encoder.flush(output);

if (result.isUnderflow()) {
    System.out.println("Encoding succeeded!");
}

output.flip();
System.out.println("Encoded " + output.remaining() + " bytes");
```
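
When the output buffer is smaller than the encoded text, `encode` returns an overflow result and has to be called again after the buffer is drained. The following is one possible shape for that loop, not the only one; `ChunkedEncoder`, its `chunkSize` parameter, and the `ByteArrayOutputStream` sink are assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ChunkedEncoder {
    /** Encodes text to UTF-8 through a small, reusable output buffer. */
    static byte[] encode(String text, int chunkSize) throws CharacterCodingException {
        if (chunkSize < 4) {
            // A single character can need 4 bytes; a smaller buffer would never make progress.
            throw new IllegalArgumentException("chunkSize must be at least 4");
        }
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer in = CharBuffer.wrap(text);
        ByteBuffer out = ByteBuffer.allocate(chunkSize);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();

        CoderResult result;
        do {
            result = encoder.encode(in, out, true);
            if (result.isError()) {
                result.throwException(); // e.g. an unpaired surrogate in the input
            }
            drain(out, sink);            // empty the chunk, then continue on overflow
        } while (result.isOverflow());

        do {
            result = encoder.flush(out);
            drain(out, sink);
        } while (result.isOverflow());

        return sink.toByteArray();
    }

    private static void drain(ByteBuffer out, ByteArrayOutputStream sink) {
        out.flip();
        sink.write(out.array(), out.arrayOffset() + out.position(), out.remaining());
        out.clear();
    }
}
```

In a real IPC writer the `drain` step would typically write to a channel or shared-memory region instead of an in-memory stream.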

### Decoding with CharsetDecoder

```java
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPLACE);
decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
decoder.replaceWith("?");

// Simulate receiving bytes from a network/file
byte[] rawBytes = "Héllo, 世界!".getBytes(StandardCharsets.UTF_8);
ByteBuffer input = ByteBuffer.wrap(rawBytes);
CharBuffer output = CharBuffer.allocate(64);

CoderResult result = decoder.decode(input, output, true);
decoder.flush(output);

output.flip();
System.out.println("Decoded: " + output.toString());
```
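
In IPC the bytes rarely arrive in one piece, and a chunk boundary can fall in the middle of a multi-byte character. A decoder handles this if it is told that more input is coming (`endOfInput = false`) and the undecoded tail is carried over with `compact()`. The sketch below assumes each chunk is at most about 60 bytes so it fits the fixed 64-byte staging buffer; `ChunkedDecoder` is a hypothetical helper.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ChunkedDecoder {
    /** Decodes UTF-8 that arrives as several chunks, possibly split mid-character. */
    static String decode(byte[][] chunks) throws CharacterCodingException {
        if (chunks.length == 0) {
            return "";
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.allocate(64);  // leftover bytes + next chunk (assumed to fit)
        CharBuffer out = CharBuffer.allocate(64); // one char per byte at most, so it cannot overflow
        StringBuilder text = new StringBuilder();

        for (int i = 0; i < chunks.length; i++) {
            in.put(chunks[i]);
            in.flip();
            boolean endOfInput = (i == chunks.length - 1);
            CoderResult result = decoder.decode(in, out, endOfInput);
            if (result.isError()) {
                result.throwException();
            }
            out.flip();
            text.append(out);
            out.clear();
            in.compact(); // keep an incomplete trailing sequence for the next round
        }

        decoder.flush(out);
        out.flip();
        text.append(out);
        return text.toString();
    }
}
```

Decoding each chunk separately with `new String(chunk, UTF_8)` would instead turn every split character into replacement characters.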

### Error Handling Options

![](https://cdn.hashnode.com/uploads/covers/637f189ed7d9bcd845996b4b/95961e51-45f7-41aa-9087-4614008abd4c.png align="center")

**Recommendation:** Use `REPORT` during development to catch issues early. Use `REPLACE` in production when you must be resilient.
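
To see the difference between the actions, the sketch below feeds the same invalid byte sequence (0xFF can never appear in UTF-8) through decoders configured with each `CodingErrorAction`. `ErrorActionDemo` is an illustrative name; it uses the one-argument `decode(ByteBuffer)` convenience method.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ErrorActionDemo {
    /** Decodes bytes as UTF-8 using the given action for malformed and unmappable input. */
    static String decode(byte[] bytes, CodingErrorAction action) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] bad = { 'H', (byte) 0xFF, 'i' };

        // REPLACE substitutes the decoder's replacement string (U+FFFD by default).
        System.out.println(decode(bad, CodingErrorAction.REPLACE).length()); // 3

        // IGNORE silently drops the bad byte.
        System.out.println(decode(bad, CodingErrorAction.IGNORE));           // Hi

        // REPORT surfaces the problem as an exception.
        try {
            decode(bad, CodingErrorAction.REPORT);
        } catch (CharacterCodingException e) {
            System.out.println("Rejected: " + e);
        }
    }
}
```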

* * *

## 7\. Encoding with ByteBuffer

Here's the full pattern for encoding text into a ByteBuffer for IPC:

### Pattern: Length-Prefixed Strings

This is one of the most common patterns for writing strings into binary protocols:

```java
/**
 * Writes a string into the ByteBuffer as:
 *   [4 bytes: length of UTF-8 bytes] [N bytes: UTF-8 encoded string]
 */
public static void writeString(ByteBuffer buffer, String value) {
    byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
    buffer.putInt(utf8.length);  // 4-byte length prefix
    buffer.put(utf8);            // UTF-8 bytes
}

/**
 * Reads a length-prefixed string from the ByteBuffer.
 */
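public static String readString(ByteBuffer buffer) {
    int length = buffer.getInt();  // read 4-byte length
    // Reject corrupt or hostile prefixes before allocating
    if (length < 0 || length > buffer.remaining()) {
        throw new IllegalArgumentException("Invalid string length: " + length);
    }
    byte[] utf8 = new byte[length];
    buffer.get(utf8);              // read that many bytes
    return new String(utf8, StandardCharsets.UTF_8);
}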
```

**Usage:**

```java
ByteBuffer buf = ByteBuffer.allocate(256);

// Write multiple strings
writeString(buf, "Hello");
writeString(buf, "世界");
writeString(buf, "🚀 Launch!");

buf.flip();

// Read them back
System.out.println(readString(buf)); // "Hello"
System.out.println(readString(buf)); // "世界"
System.out.println(readString(buf)); // "🚀 Launch!"
```

**Memory layout:**

![](https://cdn.hashnode.com/uploads/covers/637f189ed7d9bcd845996b4b/5bbf86f5-ae6f-430b-91fc-e4f208f78cde.png align="center")
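
If the intermediate `byte[]` is a concern (for example on a hot path writing into a direct or memory-mapped buffer), the string can be encoded straight into the target buffer and the length prefix filled in afterwards. This is a sketch of that variant under the same layout as above; `DirectStringWriter` is a hypothetical name, and a fresh encoder per call is used for simplicity.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class DirectStringWriter {
    /** Writes [4-byte length][UTF-8 bytes] without allocating a temporary byte array. */
    static void writeString(ByteBuffer buffer, String value) throws CharacterCodingException {
        int start = buffer.position();
        buffer.putInt(0); // placeholder for the length prefix

        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder(); // REPORT is the default action
        CoderResult result = encoder.encode(CharBuffer.wrap(value), buffer, true);
        if (result.isUnderflow()) {
            result = encoder.flush(buffer);
        }
        if (!result.isUnderflow()) {
            buffer.position(start);  // roll back the partial write
            result.throwException(); // BufferOverflowException or CharacterCodingException
        }

        int byteLength = buffer.position() - start - Integer.BYTES;
        buffer.putInt(start, byteLength); // absolute put: back-patch the prefix
    }
}
```

The output is byte-for-byte the same as `writeString` above, so the existing `readString` can read it back.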

### Pattern: Null-Terminated Strings (C-style)

Used when interoperating with C/C++ code:

```java
public static void writeCString(ByteBuffer buffer, String value) {
    buffer.put(value.getBytes(StandardCharsets.UTF_8));
    buffer.put((byte) 0); // null terminator
}

public static String readCString(ByteBuffer buffer) {
    int start = buffer.position();
    while (buffer.get() != 0) { } // scan for null
    int length = buffer.position() - start - 1;
    
    buffer.position(start);
    byte[] bytes = new byte[length];
    buffer.get(bytes);
    buffer.get(); // skip null terminator
    
    return new String(bytes, StandardCharsets.UTF_8);
}
```
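
Two edge cases are worth guarding against with this format: a string that itself contains U+0000 (which would end the value early on the reader's side) and a buffer with no terminator at all (where the scan above runs off the end and throws `BufferUnderflowException` with the position already moved). A more defensive variant might look like the following; `SafeCStrings` is a hypothetical name and the choice of exception types is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class SafeCStrings {
    static void writeCString(ByteBuffer buffer, String value) {
        if (value.indexOf('\0') >= 0) {
            throw new IllegalArgumentException("value contains an embedded NUL");
        }
        buffer.put(value.getBytes(StandardCharsets.UTF_8));
        buffer.put((byte) 0); // null terminator
    }

    static String readCString(ByteBuffer buffer) {
        int start = buffer.position();
        int end = start;
        // Absolute get(int) leaves the position untouched while scanning.
        while (end < buffer.limit() && buffer.get(end) != 0) {
            end++;
        }
        if (end == buffer.limit()) {
            throw new IllegalStateException("no NUL terminator before the buffer limit");
        }
        byte[] bytes = new byte[end - start];
        buffer.get(bytes); // consume the string bytes
        buffer.get();      // consume the terminator
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Because the scan uses absolute reads, a failed read leaves the buffer position where it was, so the caller can wait for more data and retry.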

* * *

## 8\. The BOM (Byte Order Mark) Problem

The BOM is a special Unicode character (U+FEFF) placed at the start of a file to indicate byte order:

![](https://cdn.hashnode.com/uploads/covers/637f189ed7d9bcd845996b4b/c8427717-2c1a-45f5-8717-d69e5981ea09.png align="center")

```java
// StandardCharsets.UTF_16 prepends a BOM when encoding (FE FF, big-endian)
byte[] utf16 = "Hello".getBytes(StandardCharsets.UTF_16);
System.out.println(utf16.length); // 12 (2 BOM + 10 data)

// UTF-16BE/LE do NOT include BOM
byte[] utf16be = "Hello".getBytes(StandardCharsets.UTF_16BE);
System.out.println(utf16be.length); // 10 (no BOM)

// For IPC: always use explicit byte order, never rely on BOM
```

> **IPC Rule:** Use `UTF_8` (no BOM issue) or `UTF_16BE`/`UTF_16LE` (explicit byte order, no BOM). Never use `UTF_16` — it prepends a BOM.
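
BOMs can also show up in UTF-8 data produced by other tools. Java's UTF-8 decoder does not strip the 3-byte UTF-8 BOM (EF BB BF); it comes through as a leading U+FEFF character. A small sketch of inspecting and stripping it, where `BomDemo` and its helper names are illustrative:

```java
import java.nio.charset.StandardCharsets;

class BomDemo {
    /** Decodes UTF-8 and drops a leading BOM (U+FEFF) if one is present. */
    static String decodeUtf8StrippingBom(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        return decoded.startsWith("\uFEFF") ? decoded.substring(1) : decoded;
    }

    /** Formats bytes as space-separated hex for inspection. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("Hi".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 48 00 69
        System.out.println(hex("Hi".getBytes(StandardCharsets.UTF_16LE))); // 48 00 69 00

        byte[] withBom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'H', 'i' };
        System.out.println(new String(withBom, StandardCharsets.UTF_8).length()); // 3: BOM kept
        System.out.println(decodeUtf8StrippingBom(withBom).length());             // 2
    }
}
```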

* * *

## 9\. Encoding Pitfalls and How to Avoid Them

### Pitfall 1: Platform Default Encoding

```java
// ❌ Uses the platform default: before Java 18 this differed between Windows and Linux!
byte[] defaultBytes = "Hello".getBytes();
String defaultText = new String(defaultBytes);

// ✅ Always specify charset explicitly
byte[] bytes = "Hello".getBytes(StandardCharsets.UTF_8);
String text = new String(bytes, StandardCharsets.UTF_8);
```

### Pitfall 2: Truncating Multi-byte Characters

```java
// ❌ Cutting UTF-8 in the middle of a multi-byte character
String text = "Hello 世界";
byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
// utf8 = [48,65,6C,6C,6F,20,E4,B8,96,E7,95,8C] (12 bytes)

// Truncate to 8 bytes: cuts 世 (E4 B8 96) after its second byte!
byte[] truncated = Arrays.copyOf(utf8, 8);
String broken = new String(truncated, StandardCharsets.UTF_8);
// broken = "Hello " + U+FFFD replacement char; 世 is garbled and 界 is dropped entirely

// ✅ Only truncate at character boundaries: back up while the first
// dropped byte is a continuation byte (bit pattern 10xxxxxx)
int end = 8; // desired maximum byte count (must be < utf8.length here)
while (end > 0 && (utf8[end] & 0xC0) == 0x80) {
    end--;
}
String safe = new String(utf8, 0, end, StandardCharsets.UTF_8); // "Hello "

### Pitfall 3: char ≠ character

```java
// Java's char is 16-bit (UTF-16 code unit), NOT a Unicode character
String rocket = "🚀";
System.out.println(rocket.length());     // 2 ← two chars (surrogate pair)
System.out.println(rocket.codePointCount(0, rocket.length())); // 1 ← one character

// When calculating ByteBuffer sizes, use byte length, not String.length()
byte[] rocketBytes = rocket.getBytes(StandardCharsets.UTF_8);
System.out.println(rocketBytes.length);  // 4 ← four bytes in UTF-8
```
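
For sizing a buffer before encoding, there are two options: an upper bound derived from `CharsetEncoder.maxBytesPerChar()`, which needs no encoding pass, or the exact size obtained by encoding once. A minimal sketch of both, with `BufferSizing` as a hypothetical helper class:

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class BufferSizing {
    /** Upper bound on the UTF-8 size of s, computed without encoding it. */
    static int worstCaseUtf8Bytes(String s) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        return (int) Math.ceil(s.length() * (double) encoder.maxBytesPerChar());
    }

    /** Exact UTF-8 size, at the cost of encoding the string once. */
    static int exactUtf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String rocket = "\uD83D\uDE80"; // U+1F680 as a surrogate pair
        System.out.println("chars:      " + rocket.length());          // 2
        System.out.println("exact:      " + exactUtf8Bytes(rocket));   // 4
        System.out.println("worst case: " + worstCaseUtf8Bytes(rocket));
    }
}
```

The worst-case figure over-allocates for mostly ASCII text, so it suits short-lived scratch buffers better than fixed-size message slots.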

### Pitfall 4: Mixed encodings in the same buffer

```java
// ❌ Writing different strings with different encodings
buf.put("Hello".getBytes(StandardCharsets.UTF_8));
buf.put("World".getBytes(StandardCharsets.ISO_8859_1));
// How does the reader know which encoding was used for which part?

// ✅ Use one encoding consistently
buf.put("Hello".getBytes(StandardCharsets.UTF_8));
buf.put("World".getBytes(StandardCharsets.UTF_8));
```

* * *

## 10\. Practical Patterns for IPC

### Message Encoding Protocol

When designing an IPC protocol, define a clear binary format:

```java
/**
 * Message format:
 * ┌──────────┬──────────┬──────────────┬──────────────┐
 * │ msg_type │ payload  │ key_length   │ key_bytes    │
 * │ (1 byte) │ (4 bytes)│ (2 bytes)    │ (N bytes)    │
 * └──────────┴──────────┴──────────────┴──────────────┘
 */
public class IPCMessage {
    private byte type;
    private int payload;
    private String key; // UTF-8 encoded
    
    public void writeTo(ByteBuffer buffer) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        if (keyBytes.length > 0xFFFF) {
            throw new IllegalArgumentException("Key too long for 2-byte length: " + keyBytes.length);
        }
        
        buffer.put(type);
        buffer.putInt(payload);
        buffer.putShort((short) keyBytes.length); // stored as an unsigned 16-bit value
        buffer.put(keyBytes);
    }
    
    public static IPCMessage readFrom(ByteBuffer buffer) {
        IPCMessage msg = new IPCMessage();
        msg.type = buffer.get();
        msg.payload = buffer.getInt();
        
        // Read as unsigned so lengths above 32,767 do not turn negative
        int keyLen = Short.toUnsignedInt(buffer.getShort());
        byte[] keyBytes = new byte[keyLen];
        buffer.get(keyBytes);
        msg.key = new String(keyBytes, StandardCharsets.UTF_8);
        
        return msg;
    }
}
```

![](https://cdn.hashnode.com/uploads/covers/637f189ed7d9bcd845996b4b/e0acf2ff-c505-4d4a-b533-c56637ad7518.png align="center")

### Encoding Negotiation for IPC

![](https://cdn.hashnode.com/uploads/covers/637f189ed7d9bcd845996b4b/604a622d-d964-4ccf-9c35-5fc2e25b33c6.png align="center")

* * *

## 11\. Summary

![](https://cdn.hashnode.com/uploads/covers/637f189ed7d9bcd845996b4b/47b4a443-e484-49f9-8e25-c3fb9017ecef.png align="center")

**Key takeaways:**

1.  **Always specify encoding explicitly** — never use platform defaults
    
2.  **UTF-8 is the best default** for IPC and file I/O
    
3.  **Length-prefix your strings** in binary protocols
    
4.  `String.length()` **≠ byte count** — always compute encoded byte length
    
5.  **Handle encoding errors** with `CodingErrorAction.REPORT` or `REPLACE`
    
6.  **Both sides of IPC must agree** on the encoding format
