Dec 26, 2025

How OrioleDB Implements B-Tree Storage for Tables

Introduction

PostgreSQL's storage engine has remained largely unchanged for decades — a rock-solid but aging heap-based model with MVCC snapshots stored directly in table pages. It's reliable, battle-tested, and proven across millions of deployments worldwide.

But reliability comes at a cost.

OrioleDB, a next-generation PostgreSQL storage engine, is rethinking this foundation from the ground up. Instead of the traditional heap, it implements a B-Tree-based storage layer designed for modern hardware and workloads. This brings innovations like undo logs, compressed page formats, and asynchronous I/O to the core of PostgreSQL.

But what does "B-Tree storage for tables" actually mean — and why does it matter for performance and scalability?

The Problem: Traditional PostgreSQL Storage

How the Heap Works (And Why It's Slow)

In classic PostgreSQL, each table is stored as a heap — an unordered collection of tuples scattered across disk pages. Indexes, including B-Trees, are separate structures that simply point to rows in the heap via tuple IDs (TIDs).

When you query a row, PostgreSQL must:

Read the index page to find the matching key.
Jump to the corresponding heap page using the TID pointer.
Check tuple visibility using MVCC snapshots.

This multi-step process introduces real costs:

Random I/O: Each index lookup requires a jump to a different page, causing disk seeks.
Visibility overhead: Every access triggers MVCC snapshot checks.
Table bloat: Dead tuples linger until VACUUM removes them, wasting storage and cache.

For high-throughput transactional workloads or large analytical scans, these overheads compound. The design works, but it's not optimized for modern hardware or scale.

The Solution: OrioleDB's B-Tree as the Table

A Radical Rethink

OrioleDB eliminates the heap entirely. Every table itself is stored as a B-Tree.

Instead of maintaining a separate index that points back to a heap, OrioleDB merges them. The table is the index. Data is stored in a balanced tree structure with three levels:

Each B-Tree node has three levels:

Level	Name	Contains
Root	Level 2	Key range directory
Internal Nodes	Level 1	Pointers to leaf pages
Leaf Nodes	Level 0	Actual tuples (data rows)

Think of it like:

         Root
        / | \
    Internal ...
     /  |  \
 Leaf Leaf Leaf

Each leaf page stores multiple tuples — compacted, compressed, and version-tracked through external undo logs (more on this later). Because rows are already sorted by key, both index lookups and range scans become inherently fast.

Why This Matters

The key insight: You no longer need a separate index structure for ordered access. The table itself provides the ordering. This eliminates a layer of indirection and the random I/O that comes with it.

Understanding the Structure in Practice

A Concrete Example

Imagine a table with 1,000 rows:

Each page: ~8 KB
Each row: ~100 bytes
Result: ~13 leaf pages needed

OrioleDB builds a small B-Tree like this:

Level	Node Count	Example Function
2 (Root)	1	Directs to internal or leaf pages
1 (Internal)	1–2	Stores key ranges (e.g., id 1–500, 501–1000)
0 (Leaf)	13	Holds the actual tuples

This entire tree fits in shared buffers, enabling fast top-down navigation for every query.

For larger tables with millions of rows, the tree grows deeper, but B-Tree properties ensure the height stays logarithmic. A balanced tree with 1 million tuples might only be 3–4 levels deep.

How Writes Work in OrioleDB

The Insert/Update Path

When you INSERT or UPDATE a row:

Locate the target leaf by traversing the B-Tree path (root ? internal ? leaf).
Update the in-memory page buffer with the new or modified tuple.
Create an Undo Log entry — similar to InnoDB's rollback segments — that records the old version.

Why Undo Logs Change Everything

Unlike PostgreSQL's tuple version chains stored inside the heap (which bloat pages over time), OrioleDB stores old versions externally in undo segments.

Benefits:

Clean data pages: No multi-versioned tuples cluttering leaf nodes.
No table bloat: Each page stays optimally packed.
No VACUUM required: Expired versions are simply discarded from the undo log, not hunted down across the heap.

The result: Predictable write performance and lower storage overhead.

How Reads Get Dramatically Faster

The Read Advantage

Since each table is already a B-Tree, reads benefit immediately:

Point lookups: Only 2–3 page hops (root > internal > leaf). Compare this to PostgreSQL's index lookup + heap jump.

Range scans: Sequential access to adjacent leaf pages — they're naturally ordered. No random I/O across the heap.

Index-only scans: Built-in. Since data is already keyed and organized, the table itself can answer queries without touching an external index.

Scaling with Asynchronous I/O

Combined with asynchronous I/O (available in PostgreSQL 18) and OrioleDB's optimized page cache, multiple pages can be fetched in parallel. This maximizes disk bandwidth and drastically reduces query latency.

For analytical workloads scanning millions of rows, this is transformative.

Crash Recovery: MVCC With Surgical Precision

How Recovery Works

Each transaction writes undo entries as it modifies rows. If a rollback occurs, OrioleDB simply replays undo records to revert changes — no need to search for hidden tuple versions across the heap.

During crash recovery:

WAL ensures that committed changes are durable.
Undo log ensures that uncommitted changes are cleanly rolled back.
The recovery scope is limited to the delta between the last checkpoint and the crash point.

Recovery is faster and more predictable than PostgreSQL's full WAL replay, which must scan every page to restore consistency.

Peeking Inside: Tools for Developers

Exploring OrioleDB's Structure

Developers can inspect the internal B-Tree architecture directly:

-- View internal B-Tree stats
SELECT * FROM orioledb_tree_stat('orio_tbl'::regclass);
-- See detailed node layout
SELECT * FROM orioledb_tbl_structure('orio_tbl'::regclass);

Typical output:

level | count |    avgoccupied
-------+--------+----------------
   0   |  9260  | 7383 bytes (leaf)
   1   |    25  | 7247 bytes (internal)
   2   |     1  | 960 bytes (root)

This gives an exact view of the table's hierarchical organization: 1 root node, 25 internal nodes routing to 9,260 leaf pages. Each metric helps you understand storage efficiency and identify optimization opportunities.

Why This Design Matters: The Comparison

Feature	PostgreSQL Heap	OrioleDB B-Tree
Storage	Unordered heap	Ordered B-Tree
MVCC	Tuple version chains	Undo log
Vacuum	Required	Not required
Read performance	Heap + index lookup	Direct B-Tree navigation
Range scan	Slow (heap random I/O	Sequential (leaf-level)
Crash recovery	Heavy WAL replay	Undo + WAL minimal replay

The outcome:

Higher throughput
Lower storage bloat
Better OLTP + OLAP hybrid performance
Predictable latency

The Bigger Picture: PostgreSQL's Future

OrioleDB is not just another extension — it represents a fundamental evolution of PostgreSQL's storage architecture.

As PostgreSQL 18 introduces native support for asynchronous I/O, OrioleDB is positioned to lead adoption. Its B-Tree design pairs perfectly with multi-block parallel reads, enabling near-SSD saturation on analytical workloads.

What This Enables

Real MVCC without bloat: Transaction isolation without storage penalties.
Parallel async reads/writes: Maximize I/O bandwidth on modern hardware.
Native undo-based recovery: Checkpointing and recovery built from the ground up for performance.

For the next generation of PostgreSQL-powered systems, OrioleDB represents a quantum leap in design thinking.

Final Thoughts

By implementing B-Tree storage for tables, OrioleDB redefines PostgreSQL's storage philosophy. It merges the proven reliability of PostgreSQL's core with innovations inspired by InnoDB, RocksDB, and modern database architecture.

If you're building:

Data-intensive applications
Performance-critical systems
Scalable OLAP/OLTP hybrids
Custom Odoo deployments

OrioleDB deserves serious consideration.

The shift from heap to B-Tree is not a minor optimization — it's a fundamental rethinking of how data is organized, accessed, and recovered. For teams pushing PostgreSQL's limits, OrioleDB offers a path toward higher throughput, cleaner storage, and future-proof scalability.