Introduction
PostgreSQL's storage engine has remained largely unchanged for decades — a rock-solid but aging heap-based model with MVCC snapshots stored directly in table pages. It's reliable, battle-tested, and proven across millions of deployments worldwide.
But reliability comes at a cost.
OrioleDB, a next-generation PostgreSQL storage engine, is rethinking this foundation from the ground up. Instead of the traditional heap, it implements a B-Tree-based storage layer designed for modern hardware and workloads. This brings innovations like undo logs, compressed page formats, and asynchronous I/O to the core of PostgreSQL.
But what does "B-Tree storage for tables" actually mean — and why does it matter for performance and scalability?
The Problem: Traditional PostgreSQL Storage
How the Heap Works (And Why It's Slow)
In classic PostgreSQL, each table is stored as a heap — an unordered collection of tuples scattered across disk pages. Indexes, including B-Trees, are separate structures that simply point to rows in the heap via tuple IDs (TIDs).
When you query a row, PostgreSQL must:
- Read the index page to find the matching key.
- Jump to the corresponding heap page using the TID pointer.
- Check tuple visibility using MVCC snapshots.
This multi-step process introduces real costs:
- Random I/O: Each index lookup requires a jump to a different page, causing disk seeks.
- Visibility overhead: Every access triggers MVCC snapshot checks.
- Table bloat: Dead tuples linger until VACUUM removes them, wasting storage and cache.
For high-throughput transactional workloads or large analytical scans, these overheads compound. The design works, but it's not optimized for modern hardware or scale.
The Solution: OrioleDB's B-Tree as the Table
A Radical Rethink
OrioleDB eliminates the heap entirely. Every table itself is stored as a B-Tree.
Instead of maintaining a separate index that points back to a heap, OrioleDB merges them. The table is the index. Data is stored in a balanced tree structure with three levels:
Each B-Tree node has three levels:
| Level | Name | Contains |
| Root | Level 2 | Key range directory |
| Internal Nodes | Level 1 | Pointers to leaf pages |
| Leaf Nodes | Level 0 | Actual tuples (data rows) |
Think of it like:
Root
/ | \
Internal ...
/ | \
Leaf Leaf Leaf
Each leaf page stores multiple tuples — compacted, compressed, and version-tracked through external undo logs (more on this later). Because rows are already sorted by key, both index lookups and range scans become inherently fast.
Why This Matters
The key insight: You no longer need a separate index structure for ordered access. The table itself provides the ordering. This eliminates a layer of indirection and the random I/O that comes with it.
Understanding the Structure in Practice
A Concrete Example
Imagine a table with 1,000 rows:
- Each page: ~8 KB
- Each row: ~100 bytes
- Result: ~13 leaf pages needed
OrioleDB builds a small B-Tree like this:
| Level | Node Count | Example Function |
| 2 (Root) | 1 | Directs to internal or leaf pages |
| 1 (Internal) | 1–2 | Stores key ranges (e.g., id 1–500, 501–1000) |
| 0 (Leaf) | 13 | Holds the actual tuples |
This entire tree fits in shared buffers, enabling fast top-down navigation for every query.
For larger tables with millions of rows, the tree grows deeper, but B-Tree properties ensure the height stays logarithmic. A balanced tree with 1 million tuples might only be 3–4 levels deep.
How Writes Work in OrioleDB
The Insert/Update Path
When you INSERT or UPDATE a row:
- Locate the target leaf by traversing the B-Tree path (root ? internal ? leaf).
- Update the in-memory page buffer with the new or modified tuple.
- Create an Undo Log entry — similar to InnoDB's rollback segments — that records the old version.
Why Undo Logs Change Everything
Unlike PostgreSQL's tuple version chains stored inside the heap (which bloat pages over time), OrioleDB stores old versions externally in undo segments.
Benefits:
- Clean data pages: No multi-versioned tuples cluttering leaf nodes.
- No table bloat: Each page stays optimally packed.
- No VACUUM required: Expired versions are simply discarded from the undo log, not hunted down across the heap.
The result: Predictable write performance and lower storage overhead.
How Reads Get Dramatically Faster
The Read Advantage
Since each table is already a B-Tree, reads benefit immediately:
Point lookups: Only 2–3 page hops (root > internal > leaf). Compare this to PostgreSQL's index lookup + heap jump.
Range scans: Sequential access to adjacent leaf pages — they're naturally ordered. No random I/O across the heap.
Index-only scans: Built-in. Since data is already keyed and organized, the table itself can answer queries without touching an external index.
Scaling with Asynchronous I/O
Combined with asynchronous I/O (available in PostgreSQL 18) and OrioleDB's optimized page cache, multiple pages can be fetched in parallel. This maximizes disk bandwidth and drastically reduces query latency.
For analytical workloads scanning millions of rows, this is transformative.
Crash Recovery: MVCC With Surgical Precision
How Recovery Works
Each transaction writes undo entries as it modifies rows. If a rollback occurs, OrioleDB simply replays undo records to revert changes — no need to search for hidden tuple versions across the heap.
During crash recovery:
- WAL ensures that committed changes are durable.
- Undo log ensures that uncommitted changes are cleanly rolled back.
- The recovery scope is limited to the delta between the last checkpoint and the crash point.
Recovery is faster and more predictable than PostgreSQL's full WAL replay, which must scan every page to restore consistency.
Peeking Inside: Tools for Developers
Exploring OrioleDB's Structure
Developers can inspect the internal B-Tree architecture directly:
-- View internal B-Tree stats
SELECT * FROM orioledb_tree_stat('orio_tbl'::regclass);
-- See detailed node layout
SELECT * FROM orioledb_tbl_structure('orio_tbl'::regclass);
Typical output:
level | count | avgoccupied
-------+--------+----------------
0 | 9260 | 7383 bytes (leaf)
1 | 25 | 7247 bytes (internal)
2 | 1 | 960 bytes (root)
This gives an exact view of the table's hierarchical organization: 1 root node, 25 internal nodes routing to 9,260 leaf pages. Each metric helps you understand storage efficiency and identify optimization opportunities.
Why This Design Matters: The Comparison
| Feature | PostgreSQL Heap | OrioleDB B-Tree |
| Storage | Unordered heap | Ordered B-Tree |
| MVCC | Tuple version chains | Undo log |
| Vacuum | Required | Not required |
| Read performance | Heap + index lookup | Direct B-Tree navigation |
| Range scan | Slow (heap random I/O | Sequential (leaf-level) |
| Crash recovery | Heavy WAL replay | Undo + WAL minimal replay |
The outcome:
- Higher throughput
- Lower storage bloat
- Better OLTP + OLAP hybrid performance
- Predictable latency
The Bigger Picture: PostgreSQL's Future
OrioleDB is not just another extension — it represents a fundamental evolution of PostgreSQL's storage architecture.
As PostgreSQL 18 introduces native support for asynchronous I/O, OrioleDB is positioned to lead adoption. Its B-Tree design pairs perfectly with multi-block parallel reads, enabling near-SSD saturation on analytical workloads.
What This Enables
- Real MVCC without bloat: Transaction isolation without storage penalties.
- Parallel async reads/writes: Maximize I/O bandwidth on modern hardware.
- Native undo-based recovery: Checkpointing and recovery built from the ground up for performance.
For the next generation of PostgreSQL-powered systems, OrioleDB represents a quantum leap in design thinking.
Final Thoughts
By implementing B-Tree storage for tables, OrioleDB redefines PostgreSQL's storage philosophy. It merges the proven reliability of PostgreSQL's core with innovations inspired by InnoDB, RocksDB, and modern database architecture.
If you're building:
- Data-intensive applications
- Performance-critical systems
- Scalable OLAP/OLTP hybrids
- Custom Odoo deployments
OrioleDB deserves serious consideration.
The shift from heap to B-Tree is not a minor optimization — it's a fundamental rethinking of how data is organized, accessed, and recovered. For teams pushing PostgreSQL's limits, OrioleDB offers a path toward higher throughput, cleaner storage, and future-proof scalability.