
TidesDB Tuning Reference


This reference gives small, mid, and large scale configuration presets for the engine and its column families, tuned for modern systems. These presets can be applied through integrations such as FFI libraries and TideSQL. Every number here is derived from the engine knobs, defaults, and clamps in tidesdb.h and tidesdb.c, not from guesswork. Read the mental model first, then pick the tier that matches your hardware and adjust it with the workload overlays.

The library ships with a 64 MB write buffer, a level ratio of 10, a 1 percent bloom false-positive rate, a 512 byte value threshold, an L1 trigger of 4, an L0 stall of 10, two flush and two compaction threads, a 64 MB block cache, 256 open SSTables, a memory limit of 50 percent of RAM, READ_COMMITTED isolation, and LZ4 compression. The presets below move away from these defaults only where a tier benefits.


What “modern systems” means here

| Tier | Profile | RAM for the DB | Cores | Storage | Scale |
|---|---|---|---|---|---|
| Small | edge, container, sidecar, embedded | 1 to 4 GB | 2 to 4 | eMMC / SATA SSD / constrained NVMe | up to tens of GB, a handful of CFs |
| Mid | single production server, typical cloud VM | 8 to 32 GB | 8 to 16 | NVMe SSD | up to ~1 TB, dozens of CFs |
| Large | storage node, high-end server | 64 to 512 GB | 32 to 128 | fast NVMe (or striped) | multi-TB, many CFs, high concurrency |

The mental model

A few formulas in the engine determine almost everything, so it is better to tune with these in mind than knob by knob.

Per-CF write memory is bounded by two mechanisms acting together. The L0 stall throttles writes once the immutable queue reaches l0_queue_stall_threshold, and each immutable holds one write_buffer_size. The active memtable is hard-capped at 2 x write_buffer_size. The worst-case steady-state memory for one busy column family is therefore about

per_cf_mem ~= l0_queue_stall_threshold * write_buffer_size (queued immutables)
+ 2 * write_buffer_size (active memtable ceiling)
+ bloom filters + block indexes (small, resident)

Above the stall sits a last-resort hard cap of l0_queue_stall_threshold + 6 immutables. It scales with the threshold, so raising the threshold raises the cap in lockstep and there is no hidden ceiling.
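The bound above is simple enough to compute before you commit to a tier. The following is a minimal sketch of that arithmetic only. The function names are hypothetical helpers, not part of the TidesDB API, and the small resident cost of bloom filters and block indexes is deliberately left out.

```c
#include <stdint.h>

/* Worst-case steady-state write memory for one busy column family, in bytes.
 * Queued immutables (stall threshold x buffer) plus the 2x active ceiling.
 * Hypothetical helper: bloom filters and block indexes are not counted. */
static uint64_t per_cf_write_mem_bytes(uint64_t l0_queue_stall_threshold,
                                       uint64_t write_buffer_size)
{
    return (l0_queue_stall_threshold + 2) * write_buffer_size;
}

/* Last-resort hard cap on queued immutables: the stall threshold plus 6. */
static uint64_t l0_hard_cap_immutables(uint64_t l0_queue_stall_threshold)
{
    return l0_queue_stall_threshold + 6;
}
```

With the library defaults, per_cf_write_mem_bytes(10, 64ull << 20) gives 768 MB, and the hard cap sits at 16 immutables.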

When more than one column family shares the database in per-CF mode, the effective L0 stall is reduced to resolved_memory_limit / (num_cfs * write_buffer_size), floored at 2. The more column families you run this way, the sooner each one stalls, so the total never exceeds the memory budget. If you have many column families, either give the database a larger max_memory_usage or switch to unified memtable mode, which shares one queue and ignores this split.
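To see how quickly the split bites, the rule can be written out as a few lines of C. This is a sketch under stated assumptions, not the engine's code: the function name is hypothetical, and it assumes the scaled value never rises above the configured threshold, since the text describes the stall as being reduced.

```c
#include <stdint.h>

/* Effective L0 stall for one column family in per-CF mode.
 * Assumption: the result is the smaller of the configured threshold and
 * resolved_memory_limit / (num_cfs * write_buffer_size), floored at 2. */
static uint64_t effective_l0_stall(uint64_t configured_stall,
                                   uint64_t resolved_memory_limit,
                                   uint64_t num_cfs,
                                   uint64_t write_buffer_size)
{
    /* The split only applies when more than one column family shares the DB. */
    if (num_cfs <= 1 || write_buffer_size == 0)
        return configured_stall;

    uint64_t scaled = resolved_memory_limit / (num_cfs * write_buffer_size);
    if (scaled < 2)
        scaled = 2; /* documented floor */

    return scaled < configured_stall ? scaled : configured_stall;
}
```

For example, 16 column families with 64 MB buffers under a 4 GB resolved limit come out at a stall of 4 rather than the configured 10, which is the self-throttling the paragraph above warns about.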

A max_memory_usage of 0 resolves to 50 percent of system RAM, with a floor of 5 percent. A non-zero value is honored but clamped up to that 5 percent floor. The block cache is clamped separately so that both internal caches together use at most 30 percent of the resolved limit. Set the limit explicitly in containers, where the host RAM the engine measures is not your cgroup limit.
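The resolution rules in this paragraph can be sketched as follows. Both function names are hypothetical, and the caller supplies the RAM figure, which is where a container deployment would pass its cgroup limit instead of host RAM. How the engine splits the 30 percent between its two caches is not stated here, so the sketch only computes the combined ceiling.

```c
#include <stdint.h>

/* Resolve max_memory_usage against the RAM the engine measures.
 * 0 means 50 percent of RAM; any other value is clamped up to a
 * 5 percent floor. Hypothetical helper mirroring the prose above. */
static uint64_t resolve_memory_limit(uint64_t max_memory_usage,
                                     uint64_t system_ram)
{
    uint64_t floor_bytes = system_ram / 20; /* 5 percent */
    uint64_t limit = max_memory_usage == 0 ? system_ram / 2 : max_memory_usage;
    return limit < floor_bytes ? floor_bytes : limit;
}

/* Combined ceiling for both internal caches: 30 percent of the limit. */
static uint64_t internal_cache_ceiling(uint64_t resolved_limit)
{
    return resolved_limit / 10 * 3;
}
```

On a 16 GB host, a setting of 0 resolves to 8 GB, and a request for 256 MB is lifted to the 5 percent floor of roughly 819 MB.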

Each open SSTable holds two file descriptors. At open time the engine clamps max_open_sstables to (fd_limit - 64) / 2, so a target of N open SSTables needs roughly 2N + 64 descriptors. About one eighth of that budget is reserved for the write path, and reads back off with the retryable TDB_ERR_BUSY at the cap. To run a large max_open_sstables, raise the ceiling before tidesdb_open().

tidesdb_raise_open_file_limit(32768); /* opt-in, POSIX RLIMIT_NOFILE / Windows CRT cap */
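The descriptor arithmetic is worth checking before you pick a max_open_sstables value. Here is a minimal sketch of the two formulas from the paragraph above. The helper names are hypothetical, and the behavior for a limit of 64 or fewer descriptors is an assumption, since the text does not cover that case.

```c
#include <stdint.h>

/* Descriptors needed to keep n SSTables open: two per SSTable plus 64. */
static uint64_t fds_needed(uint64_t max_open_sstables)
{
    return 2 * max_open_sstables + 64;
}

/* Value max_open_sstables is clamped to at open: (fd_limit - 64) / 2.
 * Assumption: a limit of 64 or less leaves no room for SSTables. */
static uint64_t clamp_open_sstables(uint64_t requested, uint64_t fd_limit)
{
    uint64_t ceiling = fd_limit > 64 ? (fd_limit - 64) / 2 : 0;
    return requested < ceiling ? requested : ceiling;
}
```

Under a common default limit of 1024 descriptors, a request for 8192 open SSTables would be clamped to 480, which is why the larger tiers raise the limit first.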

The max_concurrent_flushes count is pinned one to one to num_flush_threads at open, and a mismatch is normalized with a warning. A single hot column family drains its immutable queue at the speed of the flush pool. Compaction runs one round per column family, but each round fans out across the compaction pool through sub-compaction.


Database-level presets

| Field | Small | Mid | Large | Notes |
|---|---|---|---|---|
| num_flush_threads | 2 | 4 | 8 to 16 | match to concurrently hot CFs; pins max_concurrent_flushes |
| num_compaction_threads | 2 | 4 | 8 to 16 | up to core count on NVMe, 1 to 2 on HDD |
| block_cache_size | 32 MB | 1 to 4 GB | 16 to 64 GB | shared across all CFs; clamped to 30 percent of mem limit |
| max_open_sstables | 64 to 128 | 512 to 1024 | 4096 to 16384 | needs 2N + 64 fds; raise ulimit for mid and large |
| max_memory_usage | 512 MB to 1 GB | 0 (50 percent RAM) or explicit | explicit (~50 percent RAM) | set explicitly inside containers |
| log_to_file | 1 | 1 | 1 | keep LOG in the data dir |
| log_level | TDB_LOG_WARN | TDB_LOG_WARN | TDB_LOG_INFO | drop to WARN in steady state |
| unified_memtable | 0 (1 if many CFs) | per workload | per workload | see the many-column-family overlay |

The max_concurrent_flushes field is left at its default so it tracks num_flush_threads.


Column-family presets

| Field | Small | Mid | Large | Notes |
|---|---|---|---|---|
| write_buffer_size | 8 to 16 MB | 64 MB | 128 to 256 MB | bigger batches, more memory, bigger L1 SSTables |
| level_size_ratio | 10 | 10 | 10 (12 to 15 for multi-TB) | higher ratio means fewer levels and more write amp |
| min_levels | 1 | 1 | 1 to 2 | a deeper floor avoids early level churn on big sets |
| dividing_level_offset | 1 | 1 | 1 | lower is more aggressive compaction |
| l0_queue_stall_threshold | 6 to 8 | 10 | 16 to 20 | drives per-CF write memory (see model) |
| l1_file_count_trigger | 4 | 4 | 4 to 8 | higher tolerates burstier flushing |
| klog_value_threshold | 512 | 512 | 512 to 1024 | values at or above this go to the value log |
| compression_algorithm | TDB_COMPRESS_LZ4 | TDB_COMPRESS_LZ4 | TDB_COMPRESS_LZ4 (ZSTD for cold) | LZ4 is the throughput default |
| enable_bloom_filter | 1 | 1 | 1 | almost always worth it |
| bloom_fpr | 0.01 | 0.01 | 0.01 (0.005 read-heavy) | lower FPR means more bits per key |
| enable_block_indexes | 1 | 1 | 1 | required for fast seeks |
| index_sample_ratio | 1 | 1 | 1 | 1 indexes every block, making lookups definitive |
| block_index_prefix_len | 16 | 16 | 16 to 32 | raise if keys share long prefixes |
| sync_mode | TDB_SYNC_INTERVAL | TDB_SYNC_INTERVAL or FULL | TDB_SYNC_INTERVAL or FULL | see durability |
| sync_interval_us | 128000 | 128000 | 128000 | only used with INTERVAL |
| default_isolation_level | READ_COMMITTED | READ_COMMITTED | per workload | SERIALIZABLE only when you need SSI |
| use_btree | 0 | 0 (1 for point and seek-heavy) | 0 (1 for point and seek-heavy) | B+tree klog favors point lookups and seeks |

What each tier costs in memory

Using the per-CF model above, one busy column family bounds to roughly the following.

| Tier | buffer | L0 stall | worst-case per-CF write memory |
|---|---|---|---|
| Small | 16 MB | 8 | 8x16 + 2x16 = ~160 MB |
| Mid | 64 MB | 10 | 10x64 + 2x64 = ~768 MB |
| Large | 128 MB | 16 | 16x128 + 2x128 = ~2.3 GB |

Multiply by the number of concurrently hot column families and keep the total under max_memory_usage. In per-CF mode the engine auto-scales the stall down to protect the budget, but sizing it yourself avoids surprise throttling.
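Turning that multiplication around tells you how many hot column families a budget can carry at a given tier. A minimal sketch, assuming the whole budget is available to write memory (caches and other resident structures are ignored) and using a hypothetical helper name:

```c
#include <stdint.h>

/* How many concurrently hot column families fit in a memory budget,
 * using the worst-case bound (stall + 2) * write_buffer_size per CF.
 * Assumption: the budget is spent on write memory only. */
static uint64_t max_hot_cfs(uint64_t memory_budget,
                            uint64_t l0_queue_stall_threshold,
                            uint64_t write_buffer_size)
{
    uint64_t per_cf = (l0_queue_stall_threshold + 2) * write_buffer_size;
    return per_cf == 0 ? 0 : memory_budget / per_cf;
}
```

The Small example (768 MB cap, 16 MB buffer, stall of 8) carries 4 hot column families by this measure, and the Large example (128 GB cap, 128 MB buffer, stall of 16) carries 56. Beyond that, expect the auto-scaled stall to start throttling.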

What each tier costs in file descriptors

| Tier | max_open_sstables | fds needed (2N + 64) | recommended ulimit -n |
|---|---|---|---|
| Small | 128 | ~320 | 1024 (default is fine) |
| Mid | 1024 | ~2112 | 4096 or more |
| Large | 8192 | ~16448 | 32768 or more (call tidesdb_raise_open_file_limit) |

Full examples

Small

tidesdb_config_t db = tidesdb_default_config();
db.db_path = "./data";
db.num_flush_threads = 2;
db.num_compaction_threads = 2;
db.block_cache_size = 32 * 1024 * 1024; /* 32 MB */
db.max_open_sstables = 128; /* ~320 fds */
db.max_memory_usage = 768 * 1024 * 1024; /* explicit cap for a small box */
db.log_to_file = 1;
db.log_level = TDB_LOG_WARN;
tidesdb_column_family_config_t cf = tidesdb_default_column_family_config();
cf.write_buffer_size = 16 * 1024 * 1024; /* 16 MB */
cf.l0_queue_stall_threshold = 8;
cf.compression_algorithm = TDB_COMPRESS_LZ4;
cf.sync_mode = TDB_SYNC_INTERVAL;

Mid

tidesdb_raise_open_file_limit(8192); /* before open */
tidesdb_config_t db = tidesdb_default_config();
db.db_path = "./data";
db.num_flush_threads = 4;
db.num_compaction_threads = 4;
db.block_cache_size = 2ull * 1024 * 1024 * 1024; /* 2 GB */
db.max_open_sstables = 1024; /* ~2112 fds */
db.max_memory_usage = 0; /* 50 percent of RAM */
db.log_to_file = 1;
db.log_level = TDB_LOG_WARN;
tidesdb_column_family_config_t cf = tidesdb_default_column_family_config();
cf.write_buffer_size = 64 * 1024 * 1024; /* 64 MB (default) */
cf.l0_queue_stall_threshold = 10;
cf.compression_algorithm = TDB_COMPRESS_LZ4;
cf.sync_mode = TDB_SYNC_INTERVAL; /* TDB_SYNC_FULL for strict durability */

Large

tidesdb_raise_open_file_limit(65536); /* before open */
tidesdb_config_t db = tidesdb_default_config();
db.db_path = "./data";
db.num_flush_threads = 12;
db.num_compaction_threads = 12;
db.block_cache_size = 32ull * 1024 * 1024 * 1024; /* 32 GB */
db.max_open_sstables = 8192; /* ~16448 fds */
db.max_memory_usage = 128ull * 1024 * 1024 * 1024; /* explicit on a 256 GB box */
db.log_to_file = 1;
db.log_level = TDB_LOG_INFO;
tidesdb_column_family_config_t cf = tidesdb_default_column_family_config();
cf.write_buffer_size = 128 * 1024 * 1024;
cf.l0_queue_stall_threshold = 16;
cf.level_size_ratio = 10; /* 12 to 15 for very large datasets */
cf.l1_file_count_trigger = 6;
cf.compression_algorithm = TDB_COMPRESS_LZ4;
cf.sync_mode = TDB_SYNC_INTERVAL;

Workload overlays

Apply these on top of a tier.

For write-heavy ingest, use a larger write_buffer_size for better batching and a higher l0_queue_stall_threshold to absorb bursts at the cost of memory, add flush threads, and choose TDB_SYNC_INTERVAL, or TDB_SYNC_NONE for a rebuildable cache. Keep level_size_ratio at 10 to bound write amplification, and raise it only on very large datasets where fewer levels are worth the extra write amplification.

For read-heavy point lookups, lower bloom_fpr to 0.005 or 0.002 to cut false positives, give the block cache more room, and consider use_btree = 1 for O(log n) klog lookups instead of block scans. Keep index_sample_ratio at 1 so block index lookups are definitive and can short-circuit negative reads.

For range scans and iteration, use_btree = 1 helps seek-then-scan, block indexes must stay enabled, and a larger block cache keeps hot blocks resident. Raise block_index_prefix_len if your keys share long common prefixes.

For delete-heavy workloads, arm the tombstone density trigger so delete-dominated column families are compacted before range scans degrade.

cf.tombstone_density_trigger = 0.30; /* compact when an sstable is >30% tombstones */
cf.tombstone_density_min_entries = 1024; /* ignore tiny sstables */

Use tidesdb_txn_single_delete for keys put at most once between deletes, which lets a put and its delete cancel at the first merge regardless of level, and tidesdb_compact_range to reclaim a known range immediately.

For large values, lower klog_value_threshold so more of them move to the value log and keep the klog scannable, and raise it if values are small and hot so they stay inline. Values at or above the threshold go to the value log. Raise multipart_part_size in object store mode for very large SSTables.

For many column families that map to one logical entity, such as a table plus N secondary index column families, enable unified memtable mode so a transaction touching K column families does one WAL write instead of K.

db.unified_memtable = 1;
db.unified_memtable_write_buffer_size = 0; /* 0 => 64 MB (TDB_DEFAULT_WRITE_BUFFER_SIZE) */
db.unified_memtable_sync_mode = TDB_SYNC_INTERVAL;

Unified mode requires the default memcmp comparator for every column family, since the shared skip list has a single sort order, and it shares one immutable queue, so the per-CF stall auto-scaling does not apply. This is the right default once you exceed a handful of column families per transaction.

Unified memtable sizing and the L0 and L1 split

Unified mode changes what write_buffer_size and the L0 stall mean, and the two buffer knobs drive different things. This is the subtle part.

The unified_memtable_write_buffer_size sizes the one shared memtable. All column families write into it, and it rotates as a whole the moment its total size reaches that value. The threshold is fixed, with no adaptive idle headroom, unlike per-CF mode, which rotates at up to 1.5x its buffer. A value of 0 means 64 MB, the TDB_DEFAULT_WRITE_BUFFER_SIZE constant, not any column family’s write_buffer_size.

Each column family’s own write_buffer_size no longer sizes a memtable in unified mode, and the per-CF active and immutable memtables stay empty. That value still matters, but only as level geometry. It is the capacity floor for that column family’s levels, since DCA never sizes a level below it, so it governs the on-disk level shape rather than the in-memory footprint.

L0 is shared while L1 and everything below it stays per-CF, and this asymmetry is the heart of the model. There is exactly one L0 flush queue, unified_mt.immutables. The L0 stall, the hard cap of stall + 6, and the graduated L0 delays all measure that single shared queue. Backpressure is still evaluated once per column family in a commit, comparing the shared queue depth against each participating column family’s l0_queue_stall_threshold, so the smallest configured threshold among the column families in a transaction is the one that actually stalls it. Set l0_queue_stall_threshold consistently across column families in unified mode, or expect the minimum to win.
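The "minimum wins" behavior follows directly from comparing one shared depth against several thresholds. The sketch below only illustrates that logic. The function names are hypothetical, and it assumes a stall begins when the shared depth reaches a threshold, matching the earlier description of the L0 stall.

```c
#include <stddef.h>
#include <stdint.h>

/* Smallest l0_queue_stall_threshold among the column families that take
 * part in a transaction. Returns 0 for an empty set. */
static uint32_t txn_effective_stall(const uint32_t *thresholds, size_t n)
{
    if (n == 0)
        return 0;
    uint32_t min = thresholds[0];
    for (size_t i = 1; i < n; i++)
        if (thresholds[i] < min)
            min = thresholds[i];
    return min;
}

/* Returns 1 if the shared queue depth has reached any participating
 * column family's threshold, 0 otherwise (assumed stall condition). */
static int txn_would_stall(uint32_t shared_queue_depth,
                           const uint32_t *thresholds, size_t n)
{
    return n > 0 && shared_queue_depth >= txn_effective_stall(thresholds, n);
}
```

With thresholds of 10, 6, and 16 on three column families, a transaction touching all three stalls at a shared depth of 6, even though two of them were configured to tolerate more.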

L1 stays per column family. Flush demux writes each column family’s slice of the rotated memtable to that column family’s own level 1, so every column family keeps its own L1 file count, its own l1_file_count_trigger, and its own compaction. L1 never hard-stalls writes in either mode and only applies graduated delays, and compaction stays one round per column family.

Memory in unified mode is database-wide rather than per column family. Because the queue and the active memtable are shared, the worst-case write memory is bounded once for the whole database, regardless of column family count.

unified_write_mem ~= l0_queue_stall_threshold * unified_memtable_write_buffer_size
+ 2 * unified_memtable_write_buffer_size (shared active ceiling, 2x)

With the defaults of a 64 MB unified buffer and a stall of 10, that is about (10 + 2) * 64 MB = ~768 MB total, no matter how many column families you run. That fixed, count-independent ceiling is the main reason to choose unified mode for many-CF workloads. In per-CF mode the bound is instead per column family, and the engine auto-scales each one’s stall down to protect the budget. So in unified mode you tune memory with the unified buffer and a single l0_queue_stall_threshold, not with per-CF write_buffer_size.
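The contrast between the two modes is easiest to see side by side. This is a sketch of the arithmetic only, with hypothetical helper names. The per-CF figure is the unscaled bound, before the engine's stall auto-scaling pulls it back under the budget.

```c
#include <stdint.h>

#define DEFAULT_WRITE_BUFFER_BYTES (64ull * 1024 * 1024) /* 64 MB */

/* A unified buffer setting of 0 resolves to the 64 MB default. */
static uint64_t resolve_unified_buffer(uint64_t unified_memtable_write_buffer_size)
{
    return unified_memtable_write_buffer_size == 0
               ? DEFAULT_WRITE_BUFFER_BYTES
               : unified_memtable_write_buffer_size;
}

/* Database-wide worst-case write memory in unified mode. */
static uint64_t unified_write_mem_bytes(uint64_t l0_queue_stall_threshold,
                                        uint64_t unified_memtable_write_buffer_size)
{
    return (l0_queue_stall_threshold + 2) *
           resolve_unified_buffer(unified_memtable_write_buffer_size);
}

/* Unscaled worst case in per-CF mode: the same bound once per hot CF. */
static uint64_t per_cf_mode_write_mem_bytes(uint64_t num_hot_cfs,
                                            uint64_t l0_queue_stall_threshold,
                                            uint64_t write_buffer_size)
{
    return num_hot_cfs * (l0_queue_stall_threshold + 2) * write_buffer_size;
}
```

Eight hot column families at the defaults would bound to 6 GB in per-CF mode before auto-scaling, against a flat 768 MB in unified mode.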

For object store and cloud-native deployments, set object_store and a tidesdb_objstore_config_t, and unified memtable mode is enabled automatically. Give local_cache_max_bytes enough room for the hot working set, keep replicate_wal = 1, and set wal_sync_on_commit = 1 for an RPO of zero, or rely on the 1 MB wal_sync_threshold_bytes for a bounded loss window. The max_concurrent_uploads count defaults to 4 and max_concurrent_downloads to 8, and both are worth raising on large nodes. Two per-CF knobs trade remote I/O against read amplification. The object_lazy_compaction flag, off by default, doubles the L1 file-count compaction trigger when an object store is attached, which cuts compaction frequency and upload churn at the cost of more files to read. The object_prefetch_compaction flag, on by default, downloads all evicted merge inputs in parallel before a compaction instead of one at a time.


Durability

| Mode | Behavior | Use when |
|---|---|---|
| TDB_SYNC_NONE | WAL written, not fsynced per commit, synced at flush | rebuildable cache, max throughput |
| TDB_SYNC_INTERVAL | background fsync every sync_interval_us (default 128 ms) | general purpose, good throughput with a bounded loss window |
| TDB_SYNC_FULL | fsync coalesced across concurrent committers | strict durability; unified-mode group commit keeps it fast |

In full-sync unified mode the WAL fsync is coalesced into one fsync per batch of concurrent committers, so TDB_SYNC_FULL stays performant at high concurrency. In per-CF mode every commit fragments across separate WALs, so TDB_SYNC_FULL costs one fsync per commit.


Lower-impact knobs

These rarely need changing from their defaults, but they are real and worth knowing.

The skip_list_max_level (default 12) and skip_list_probability (default 0.25) are the standard probabilistic skip-list parameters. Keep skip_list_max_level below 64, because the memtable write path uses a stack-allocated update array only when the level is under 64, and a higher value forces a per-operation heap allocation. A level of 12 already indexes far more entries than a memtable holds. The unified memtable has its own unified_memtable_skip_list_max_level and unified_memtable_skip_list_probability, and a 0 for either resolves to the same 12 and 0.25 defaults.
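The claim that level 12 is ample follows from the usual skip-list sizing rule, where a list with promotion probability p and maximum level L is well balanced up to about (1/p)^L entries. The sketch below works that out. The helper names are hypothetical, and the rule is the textbook one rather than anything read from the engine source.

```c
/* Approximate entry count a skip list handles well at a given maximum
 * level and promotion probability: (1 / probability) ^ max_level. */
static double skip_list_capacity(int max_level, double probability)
{
    double capacity = 1.0;
    for (int i = 0; i < max_level; i++)
        capacity /= probability;
    return capacity;
}

/* The write path keeps its update array on the stack only below level 64. */
static int uses_stack_update_array(int max_level)
{
    return max_level < 64;
}
```

With the defaults, skip_list_capacity(12, 0.25) is 4^12, or about 16.8 million entries. A 64 MB memtable holding entries of around 100 bytes each (an illustrative size) has roughly 670,000 of them, so the default leaves a wide margin.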

The min_disk_space field (default 100 MB) is a safety floor. Flush and compaction are skipped while free space on the column family directory is below it, so the data stays in memory and writes keep flowing until memory pressure intervenes. Raise it if you want a larger headroom before the engine stops writing new files.
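Because crossing this floor silently pauses flush and compaction, it can help to watch free space from outside the engine. The following is a minimal monitoring sketch for POSIX systems using statvfs. The function name is hypothetical, and whether the engine measures free space the same way (available blocks times fragment size) is an assumption.

```c
#include <stdint.h>
#include <sys/statvfs.h>

/* Returns 1 if free space on the filesystem holding path is below
 * min_disk_space bytes, 0 if it is not, and -1 if statvfs fails.
 * Assumption: "free" means blocks available to unprivileged users. */
static int below_disk_floor(const char *path, uint64_t min_disk_space)
{
    struct statvfs st;
    if (statvfs(path, &st) != 0)
        return -1;

    uint64_t free_bytes = (uint64_t)st.f_bavail * (uint64_t)st.f_frsize;
    return free_bytes < min_disk_space ? 1 : 0;
}
```

Called periodically with the column family directory and the configured floor (100 MB by default), a result of 1 signals that new SSTables have stopped being written and memory will start to climb.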

The log_truncation_at field (default 24 MB) matters only with log_to_file = 1. The LOG file is truncated once it grows past this size, which bounds log disk use.

The num_compaction_threads count sets both the number of compaction worker threads and the shared sub-compaction helper budget, so one round of a single column family can fan out across the whole pool. Size it to your storage parallelism, one to two on HDD and up to the core count on NVMe, not to your column family count.


Things to watch

Set max_memory_usage explicitly in containers, because the auto-resolve uses host RAM rather than your cgroup limit.

A high max_open_sstables is silently clamped if the file-descriptor limit cannot honor it, so call tidesdb_raise_open_file_limit() before tidesdb_open().

The max_concurrent_flushes count follows num_flush_threads, so set the thread count and do not fight the one-to-one pin.

The comparator is permanent. It cannot change after a column family is created without corrupting key order, and unified mode forces memcmp.

Many column families in per-CF mode self-throttle, since the effective L0 stall shrinks with column family count. Grow max_memory_usage or switch to unified mode.

The block cache shares the 30 percent memory clamp with the B+tree node cache, so a large block_cache_size may be clamped down relative to max_memory_usage.

The min_disk_space floor gates flush and compaction, not just writes. If free space drops below it, new SSTables stop being written and memory climbs until pressure relief intervenes, so monitor disk free space against this floor.

A skip_list_max_level of 64 or more costs a heap allocation on every memtable write, so stay below 64. The default 12 is fine.