What is TidesDB?
TidesDB is a fast, lightweight, and efficient embeddable key-value storage engine library written in C. Built on a log-structured merge-tree (LSM-tree) architecture, it can serve as the foundation for building database systems or be used directly as a standalone key-value or column store.
Key Characteristics
TidesDB is designed for direct integration into applications. It handles keys and values as raw byte sequences without size restrictions (up to system memory limits), giving you full control over serialization. Optimized for write-heavy workloads, it maintains efficient reads through bloom filters, caching, block indices, and compaction.
Core Features
- Lock-free skip list memtable with atomic CAS operations for concurrent reads and writes. Reads scale linearly with CPU cores without blocking. Hot paths including WAL group commit and comparator lookups use lock-free atomic operations. Background operations use fine-grained locks for coordination. Immutable memtables remain searchable during flush, ensuring consistent reads.
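A minimal sketch of the CAS pattern behind such a memtable: inserting into one sorted level of a lock-free linked list, the building block a skip list repeats per level. The types and names here are illustrative, not TidesDB's internals.

```c
#include <stdatomic.h>

/* One level of a lock-free sorted linked list: the building block a
 * CAS-based skip list repeats per level. Illustrative names only. */
typedef struct node {
    const char *key;
    _Atomic(struct node *) next;
} node_t;

/* Link `fresh` after `pred` only if the list has not changed underneath us.
 * On CAS failure the writer re-finds its predecessor and retries; readers
 * never block and never observe a half-linked node. */
static int try_insert_after(node_t *pred, node_t *expected_succ, node_t *fresh)
{
    atomic_store_explicit(&fresh->next, expected_succ, memory_order_relaxed);
    return atomic_compare_exchange_strong_explicit(
        &pred->next, &expected_succ, fresh,
        memory_order_release, memory_order_relaxed);
}
```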
- ACID transactions with full MVCC (Multi-Version Concurrency Control) supporting 5 isolation levels: read-uncommitted, read-committed, repeatable-read, snapshot, and serializable. Each transaction operates on a consistent snapshot with sequence numbers for version ordering. Snapshot isolation provides first-committer-wins conflict detection, preventing dirty reads, non-repeatable reads, and lost updates. Serializable isolation implements SSI (Serializable Snapshot Isolation) with read-write conflict detection to prevent all anomalies, including write skew. Transactions support read-your-own-writes semantics, savepoints for partial rollback, and optimistic concurrency control. Lock-free writes with atomic sequence numbers ensure ordering without blocking.
- Multi-column family atomicity guarantees that transactions spanning multiple column families either commit to all or none. Single-CF transactions use per-CF sequence numbers for maximum performance. Multi-CF transactions use global sequence numbers with high-bit flagging (bit 63) to mark entries that require atomic validation. During recovery, multi-CF transactions are only applied if all participating column families have the complete sequence, ensuring true all-or-nothing semantics without coordinator locks or two-phase commit overhead. This enables referential integrity and cross-table consistency for SQL-like workloads.
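The bit-63 flagging can be pictured with a few helpers. Only the idea (a high bit marking entries that need multi-CF validation during recovery) comes from the description above; the macro and function names are assumptions.

```c
#include <stdint.h>

/* Illustrative only: a high-bit flag distinguishing multi-CF entries that
 * need atomic recovery validation from ordinary single-CF entries. */
#define MULTI_CF_FLAG (UINT64_C(1) << 63)

static inline uint64_t tag_multi_cf(uint64_t global_seq) { return global_seq | MULTI_CF_FLAG; }
static inline int      is_multi_cf(uint64_t seq)         { return (seq & MULTI_CF_FLAG) != 0; }
static inline uint64_t raw_seq(uint64_t seq)             { return seq & ~MULTI_CF_FLAG; }
```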
- Column families provide isolated key-value stores, each with independent configuration, memtables, SSTables, write-ahead logs, etc. Transactions can span multiple column families with per-CF consistent snapshots. Runtime configuration updates allow dynamic tuning of write buffer size, compression, and compaction parameters without restart.
- Bidirectional iterators support seek, forward and backward traversal with heap-based merge-sort across active memtables, immutable memtables, and SSTables. Lock-free iteration with reference counting prevents premature deletion during concurrent operations. Snapshot isolation ensures consistent iteration even during concurrent writes and compactions.
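A rough sketch of merged forward iteration across already-sorted sources. For brevity it picks the next smallest key with a linear scan and ignores versioning, whereas the engine described above uses a heap-based merge and resolves duplicate keys by sequence number. All names are illustrative.

```c
#include <stddef.h>
#include <string.h>

/* One already-sorted source: an active memtable, an immutable memtable,
 * or an SSTable, reduced here to an array of keys and a cursor. */
typedef struct {
    const char **keys;
    size_t       len;
    size_t       pos;
} source_t;

/* Return the smallest not-yet-consumed key across all sources, or NULL
 * when every source is exhausted. A production iterator uses a min-heap. */
static const char *merged_next(source_t *srcs, size_t nsrcs)
{
    size_t best = nsrcs;                     /* sentinel: nothing found yet */
    for (size_t i = 0; i < nsrcs; i++) {
        if (srcs[i].pos >= srcs[i].len)
            continue;
        if (best == nsrcs ||
            strcmp(srcs[i].keys[srcs[i].pos], srcs[best].keys[srcs[best].pos]) < 0)
            best = i;
    }
    if (best == nsrcs)
        return NULL;
    return srcs[best].keys[srcs[best].pos++];
}
```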
- Efficient seek operations using O(log n) skip list positioning and optional block indexes for prefix-boundary-to-block mapping in SSTables. Block indexes sample every Nth block (configurable `index_sample_ratio`, default 1) by storing the first and last key prefix of sampled blocks along with their file position, creating a sparse index that enables binary search to locate the correct block range for a given key. Configurable prefix length (default 16 bytes) balances index size with seek precision. Lower sample ratios (e.g., 8) create denser indexes for faster seeks at the cost of more memory, while higher ratios (e.g., 32) reduce memory usage.
- Hybrid compaction policy with three modes: full preemptive merge for initial levels (minimizes space amplification), dividing merge at a configurable dividing level (creates partition boundaries), and partitioned merge for deep levels (minimizes write amplification). Dynamic Capacity Adjustment (DCA) automatically scales level capacities based on the largest level size using the formula C_i = N_L / T^(L-i), ensuring optimal size ratios as data grows. Dynamic level management adds and removes levels on demand. A configurable dividing level offset allows tuning the transition point between aggressive and incremental compaction strategies.
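The DCA formula reads: level i gets the deepest level's size divided by the ratio T once per level of distance from L. A minimal sketch of the arithmetic, with parameter names that are assumptions:

```c
#include <stdint.h>

/* Dynamic Capacity Adjustment: C_i = N_L / T^(L-i), where N_L is the size
 * of the largest (deepest) level L and T is the size ratio between levels. */
static uint64_t level_capacity(uint64_t largest_level_size, /* N_L */
                               unsigned size_ratio,          /* T   */
                               unsigned deepest_level,       /* L   */
                               unsigned level)               /* i   */
{
    uint64_t cap = largest_level_size;
    for (unsigned i = level; i < deepest_level; i++)
        cap /= size_ratio;        /* divide by T once per level above L */
    return cap;
}
```

For example, with N_L = 1 TiB, T = 10, and L = 4, level 2 gets a capacity of 1 TiB / 10^2, roughly 10 GiB, so capacities grow by the ratio T from the shallowest populated level down to the largest one.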
- Durability through write-ahead log (WAL) with sequence numbers for ordering and automatic recovery on startup that reconstructs memtables from persisted logs. Lock-free WAL group commit batches multiple concurrent writes into single disk operations using atomic reservation (threads atomically reserve buffer space via CAS) and CAS-based leader election (first thread to reach buffer capacity becomes leader and performs the flush), maximizing throughput without blocking. Multi-CF transaction metadata is embedded in WAL entries, recording all participating column families for atomic recovery validation. SSTable metadata persistence ensures accurate recovery of min/max keys and entry counts. Immutable memtables remain searchable during flush, preventing data loss windows.
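A simplified sketch of atomic reservation plus CAS leader election as described above. It omits what a real WAL also needs (waiting for in-flight copies before flushing, resetting the batch, handling records that do not fit), and all names are assumptions.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

enum { WAL_BATCH_CAP = 64 * 1024 };

typedef struct {
    _Atomic uint32_t reserved;       /* bytes reserved so far via fetch-add */
    _Atomic int      leader_claimed; /* 0 = no leader elected yet */
    unsigned char    buf[WAL_BATCH_CAP];
} wal_batch_t;

/* Returns 1 if the caller became leader and must flush the whole batch with
 * one write+fsync, 0 otherwise. */
static int wal_group_append(wal_batch_t *b, const void *rec, uint32_t len)
{
    uint32_t off = atomic_fetch_add_explicit(&b->reserved, len,
                                             memory_order_acq_rel);
    if (off + len > WAL_BATCH_CAP)
        return 0;                    /* omitted: wait for the next batch */

    memcpy(b->buf + off, rec, len);  /* copy into the reserved region */

    if (off + len == WAL_BATCH_CAP ||
        atomic_load(&b->reserved) >= WAL_BATCH_CAP) {
        int expected = 0;            /* first writer to win the CAS leads */
        return atomic_compare_exchange_strong(&b->leader_claimed, &expected, 1);
    }
    return 0;
}
```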
- Concurrent background operations with shared thread pools for flush and compaction. Multiple flush workers process different memtables in parallel without blocking reads or writes. Compaction workers coordinate via a per-column-family atomic `is_compacting` flag to prevent conflicts while allowing parallel compaction across column families (a minimal sketch follows after the next item). Manifest uses fine-grained locks with atomic operation tracking for safe concurrent access. An adaptive multi-tier backpressure system prevents write stalls and keeps write amplification bounded.
- Optional bloom filters provide probabilistic key-existence checks to reduce disk reads, with a configurable false positive rate per column family. Bloom filters are built during SSTable creation and persisted in metadata blocks.
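The per-column-family `is_compacting` coordination described above can be pictured as a single compare-and-swap; the struct shape and helper names are assumptions.

```c
#include <stdatomic.h>

typedef struct {
    _Atomic int is_compacting;   /* 0 = idle, 1 = a worker owns compaction */
    /* ... memtables, levels, configuration ... */
} column_family_t;

/* A worker claims the column family; a second worker's CAS fails, so two
 * workers never compact the same CF while different CFs proceed in parallel. */
static int try_start_compaction(column_family_t *cf)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&cf->is_compacting, &expected, 1);
}

static void finish_compaction(column_family_t *cf)
{
    atomic_store(&cf->is_compacting, 0);
}
```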
- Key-value separation (WiscKey-style) with a configurable value threshold (default 1024*4 bytes). Small values are stored inline with keys in the klog (key log) for fast access. Large values exceeding the threshold are stored in the vlog (value log), with only an offset stored in the klog entry. This reduces write amplification during compaction since large values are not rewritten along with keys. Klog blocks default to 64KB for efficient I/O, while vlog blocks are not fixed in size. Both klog and vlog blocks support optional compression, though WAL entries remain uncompressed for fast writes and recovery. Configurable per column family with runtime updates.
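A sketch of the write-path decision for key-value separation. The threshold matches the default quoted above, but the struct and function names are assumptions, not TidesDB's API.

```c
#include <stddef.h>
#include <stdint.h>

#define VALUE_THRESHOLD (1024 * 4)   /* default separation threshold */

typedef struct {
    int      value_inline;   /* 1 = value bytes follow the key in the klog */
    uint64_t vlog_offset;    /* valid only when value_inline == 0 */
} klog_entry_hdr_t;

/* Decide where a value lives: small values stay inline with the key in the
 * klog; large values go to the vlog and only their offset is recorded. */
static klog_entry_hdr_t place_value(const void *value, size_t value_len,
                                    uint64_t (*vlog_append)(const void *, size_t))
{
    klog_entry_hdr_t hdr = {0};
    if (value_len <= VALUE_THRESHOLD) {
        hdr.value_inline = 1;                       /* stored with the key */
    } else {
        hdr.value_inline = 0;                       /* stored in the vlog */
        hdr.vlog_offset = vlog_append(value, value_len);
    }
    return hdr;
}
```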
- TTL (time-to-live) support for key-value pairs with automatic expiration. Expired entries are skipped during reads and removed during compaction. Sequence number-based MVCC ensures correct handling of expired versions.
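Expiration can be as simple as comparing a stored expiry timestamp against the current time; this tiny sketch assumes a Unix-seconds expiry field, which is an assumption about the representation.

```c
#include <stdint.h>
#include <time.h>

/* An entry with a non-zero expiry timestamp is treated as deleted once the
 * current time passes it: reads skip it and compaction drops it. */
static int entry_is_expired(uint64_t expiry_unix_seconds)
{
    if (expiry_unix_seconds == 0)
        return 0;                     /* 0 = no TTL set */
    return (uint64_t)time(NULL) >= expiry_unix_seconds;
}
```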
- Custom comparators allow registration of user-defined key comparison functions with context pointers for stateful comparisons. Lock-free comparator registry uses atomic COW (copy-on-write) pattern with CAS for registration (allocate new array, copy existing entries, add new entry, atomically swap pointer) and atomic loads for lookups, eliminating contention on hot read paths. Built-in comparators include memcmp, lexicographic, and more. Comparators are used consistently across skip lists, SSTables, indexes, and merge operations and support reverse order.
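A compact sketch of the copy-on-write registration pattern described above (allocate, copy, append, atomically swap). Safe reclamation of the old array is intentionally omitted, and all names are illustrative.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef int (*cmp_fn)(const void *a, size_t alen,
                      const void *b, size_t blen, void *ctx);

typedef struct {
    size_t count;
    cmp_fn fns[];                /* flexible array of registered comparators */
} cmp_registry_t;

static _Atomic(cmp_registry_t *) g_registry;

/* Registration: build a new array, publish it with one atomic pointer swap.
 * Lookups are a single atomic load, so the hot read path never contends. */
static int register_comparator(cmp_fn fn)
{
    for (;;) {
        cmp_registry_t *old = atomic_load(&g_registry);
        size_t n = old ? old->count : 0;
        cmp_registry_t *fresh = malloc(sizeof *fresh + (n + 1) * sizeof(cmp_fn));
        if (!fresh) return -1;
        fresh->count = n + 1;
        if (old) memcpy(fresh->fns, old->fns, n * sizeof(cmp_fn));
        fresh->fns[n] = fn;                       /* append to the copy */
        if (atomic_compare_exchange_strong(&g_registry, &old, fresh))
            return (int)n;                        /* published */
        free(fresh);              /* lost the race: rebuild and retry */
    }
}
```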
- Block manager provides lock-free concurrent I/O using `pread`/`pwrite` for true parallelism: readers never block readers or writers, and writers do not block each other. Atomic file size tracking eliminates syscalls on hot paths. Reference-counted blocks with atomic acquire/release enable safe sharing across threads without locks. Stack-allocated buffers for small writes (< 64KB) avoid heap allocation overhead. xxHash32 checksums with header+footer validation ensure data integrity. Sequential read hints (`posix_fadvise`) optimize OS page cache behavior. Supports up to 4GB blocks with efficient partial reads for key-only scans without loading full values.
- Two-tier caching system for optimal performance:
  - SSTable File Handle Management · Lock-free atomic design manages file descriptor limits without blocking reads. SSTables are opened on demand during first access and kept open for subsequent reads. Each SSTable tracks `last_access_time` atomically for LRU eviction. A global `num_open_sstables` counter enforces a configurable limit (default 512). A background reaper thread wakes at a set interval, updates a cached current time (avoiding syscalls on hot paths), checks whether the limit is reached, and closes the 25% oldest unused SSTables (refcount = 1) based on LRU timestamps. The reaper also updates `cached_available_disk_space` to avoid repeated syscalls during write operations. Newly flushed/compacted SSTables stay open immediately for hot reads. The read path is completely non-blocking, with zero retry loops or cache contention. Level-by-level search with early termination optimizes reads by checking bloom filters and range bounds before taking SSTable references, minimizing unnecessary file opens.
  - Block Cache · Concurrent CLOCK cache for hot SSTable blocks with automatic configuration based on CPU count and memory budget. Caches entire deserialized klog blocks with reference counting (not individual key-value pairs) for maximum efficiency: one cached block serves hundreds of keys, dramatically reducing memory overhead while eliminating disk I/O, decompression, and deserialization for hot data. Uses atomic operations with the CLOCK eviction algorithm (second-chance page replacement) for concurrent reads and writes (see the sketch after this list). Partitioned design (2 partitions per CPU core, up to 128 partitions) minimizes contention with thread-local clock hands. Configurable memory budget (default 256MB) with power-of-2 slot sizing for fast modulo operations. Cache keys use the format `klog_path:block_position` to uniquely identify each block (e.g., `/path/to/db/users/L2P3_1336.klog:65536`). Zero-copy API returns direct pointers to cached arena-allocated blocks with reference-bit protection preventing eviction during access, eliminating copy overhead and race conditions. Blocks are cached after the first read during point lookups and seek operations, remaining valid across transactions. Reference-bit tracking gives recently accessed blocks a "second chance" before eviction. Hybrid hash table + circular array design enables O(1) lookups with linear probing (max 64 probes) and efficient CLOCK scanning. Memory efficient, and significantly reduces disk I/O, decompression, and deserialization CPU overhead for random read workloads while maintaining thread safety across concurrent transactions and compaction.
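The CLOCK (second-chance) eviction referenced in the Block Cache item above, reduced to its core loop. The slot layout and names are assumptions, and a real cache bounds the sweep and integrates it with the hash table lookup path.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    _Atomic int ref_bit;     /* set on access, cleared by the sweeping hand */
    _Atomic int refcount;    /* readers currently holding a zero-copy pointer */
    void       *block;       /* cached, deserialized klog block (or NULL) */
} clock_slot_t;

/* Advance the clock hand, give recently used slots a second chance by
 * clearing their reference bit, skip slots still held by readers, and
 * return the first evictable slot. A real cache bounds this sweep. */
static size_t clock_evict(clock_slot_t *slots, size_t nslots, size_t *hand)
{
    for (;;) {
        size_t i = *hand;
        *hand = (*hand + 1) % nslots;            /* advance the clock hand */
        if (atomic_load(&slots[i].refcount) > 0)
            continue;                            /* in use: cannot evict */
        if (atomic_exchange(&slots[i].ref_bit, 0))
            continue;                            /* had its second chance */
        return i;                                /* victim slot to reuse */
    }
}
```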
- Shared thread pools for background flush and compaction operations with configurable thread counts at the database level (default 2 flush threads, 2 compaction threads). Work queues distribute tasks across workers for parallel processing. Compaction is automatically triggered after each flush when Level 0 reaches or exceeds its capacity, ensuring write amplification stays bounded.
- Three sync modes · `TDB_SYNC_NONE` for maximum performance (OS-managed flushing), `TDB_SYNC_FULL` for maximum durability (fsync on every write), and `TDB_SYNC_INTERVAL` for balanced performance with periodic syncing at configurable microsecond intervals. Structural operations (flush, compaction, WAL rotation) always enforce durability regardless of sync mode to prevent data loss. Configurable per column family.
- Cross-platform support for Linux, macOS, and Windows on both 32-bit and 64-bit architectures with a comprehensive platform abstraction layer. Atomic operations, threading primitives, and file I/O are abstracted for portability.
- Full file portability with explicit little-endian serialization throughout; database files can be copied between any platform (x86, ARM, RISC-V, PowerPC) and architecture (32-bit, 64-bit) without conversion. Fixed-width integer encoding ensures consistent layout.
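Explicit little-endian, fixed-width encoding is what makes files byte-identical across architectures. A minimal pair of helpers, illustrative rather than TidesDB's own encoders:

```c
#include <stdint.h>

/* Encode a 64-bit integer in little-endian order regardless of host
 * byte order, so a file written on any CPU reads back identically. */
static void put_u64_le(unsigned char out[8], uint64_t v)
{
    for (int i = 0; i < 8; i++)
        out[i] = (unsigned char)(v >> (8 * i));   /* byte 0 = least significant */
}

static uint64_t get_u64_le(const unsigned char in[8])
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v |= (uint64_t)in[i] << (8 * i);
    return v;
}
```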
- Clean C API that returns 0 on success and negative error codes on failure for straightforward error handling. Debug logging with configurable verbosity aids development and troubleshooting.
Community
Join the TidesDB Discord Community to ask questions, work on development, and discuss the future of TidesDB.