TidesDB Tuning Reference
This reference gives small, mid, and large scale configuration presets for the
engine and its column families, tuned for modern systems. These configurations
can also be applied through integrations such as FFI libraries and TideSQL. Every number here
is derived from the engine knobs, defaults, and clamps in tidesdb.h and
tidesdb.c, not from guesswork. Read the mental model first, then pick the tier
that matches your hardware and adjust it with the workload overlays.
The library ships with a 64 MB write buffer, a level ratio of 10, a 1 percent bloom false-positive rate, a 512 byte value threshold, an L1 trigger of 4, an L0 stall of 10, two flush and two compaction threads, a 64 MB block cache, 256 open SSTables, a memory limit of 50 percent of RAM, READ_COMMITTED isolation, and LZ4 compression. The presets below move away from these defaults only where a tier benefits.
What “modern systems” means here
| Tier | Profile | RAM for the DB | Cores | Storage | Scale |
|---|---|---|---|---|---|
| Small | edge, container, sidecar, embedded | 1 to 4 GB | 2 to 4 | eMMC / SATA SSD / constrained NVMe | up to tens of GB, a handful of CFs |
| Mid | single production server, typical cloud VM | 8 to 32 GB | 8 to 16 | NVMe SSD | up to ~1 TB, dozens of CFs |
| Large | storage node, high-end server | 64 to 512 GB | 32 to 128 | fast NVMe (or striped) | multi-TB, many CFs, high concurrency |
The mental model
A few formulas in the engine determine almost everything, so it is better to tune with these in mind than knob by knob.
Per-CF write memory is bounded by two mechanisms acting together. The L0 stall
throttles writes once the immutable queue reaches l0_queue_stall_threshold, and
each immutable holds one write_buffer_size. The active memtable is hard-capped
at 2 x write_buffer_size. The worst-case steady-state memory for one busy
column family is therefore about
```
per_cf_mem ~= l0_queue_stall_threshold * write_buffer_size   (queued immutables)
            + 2 * write_buffer_size                          (active memtable ceiling)
            + bloom filters + block indexes                  (small, resident)
```
Above the stall sits a last-resort hard cap of l0_queue_stall_threshold + 6
immutables. It scales with the threshold, so raising the threshold raises the cap
in lockstep and there is no hidden ceiling.
When more than one column family shares the database in per-CF mode, the
effective L0 stall is reduced to
resolved_memory_limit / (num_cfs * write_buffer_size), floored at 2. The more
column families you run this way, the sooner each one stalls, so the total never
exceeds the memory budget. If you have many column families, either give the
database a larger max_memory_usage or switch to unified memtable mode, which
shares one queue and ignores this split.
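The stall reduction above can be sketched as a small function. This is a minimal model of the stated formula, not the engine's code: `effective_l0_stall` and `MIB` are hypothetical names, and the cap at the configured threshold is an assumption drawn from the word "reduced".

```c
#include <assert.h>
#include <stdint.h>

#define MIB ((uint64_t)1 << 20)

/* Hypothetical helper, not part of the TidesDB API. Models the per-CF
 * stall reduction: resolved_memory_limit / (num_cfs * write_buffer_size),
 * floored at 2. Assumes the result never exceeds the configured value. */
uint64_t effective_l0_stall(uint64_t resolved_memory_limit,
                            uint64_t num_cfs,
                            uint64_t write_buffer_size,
                            uint64_t configured_threshold)
{
    assert(num_cfs > 0 && write_buffer_size > 0);

    uint64_t scaled = resolved_memory_limit / (num_cfs * write_buffer_size);
    if (scaled > configured_threshold)
        scaled = configured_threshold;   /* assumed: only ever reduced */
    if (scaled < 2)
        scaled = 2;                      /* floor stated in the text */
    return scaled;
}
```

Under these assumptions, a 4 GB resolved limit with 16 column families and 64 MB buffers gives an effective stall of 4 rather than the configured 10, which is the self-throttling the paragraph describes.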
A max_memory_usage of 0 resolves to 50 percent of system RAM, with a floor of 5
percent. A non-zero value is honored but clamped up to that 5 percent floor. The
block cache is clamped separately so that both internal caches together use at
most 30 percent of the resolved limit. Set the limit explicitly in containers,
where the host RAM the engine measures is not your cgroup limit.
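As a rough illustration of these rules, the sketch below resolves a memory limit and the combined cache ceiling. The function names are hypothetical and the integer rounding is an assumption; the engine's own logic may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

#define GIB ((uint64_t)1 << 30)

/* Hypothetical helper mirroring the text: 0 resolves to 50 percent of
 * system RAM, and a non-zero value is clamped up to a 5 percent floor. */
uint64_t resolve_memory_limit(uint64_t max_memory_usage, uint64_t system_ram)
{
    assert(system_ram > 0);

    uint64_t floor_bytes = system_ram / 20;      /* 5 percent of RAM */
    if (max_memory_usage == 0)
        return system_ram / 2;                   /* 50 percent of RAM */
    return max_memory_usage < floor_bytes ? floor_bytes : max_memory_usage;
}

/* Combined ceiling for both internal caches: 30 percent of the limit. */
uint64_t cache_budget(uint64_t resolved_limit)
{
    return resolved_limit / 10 * 3;
}
```

The container case is the one to watch: on a 256 GB host, a limit of 0 resolves to 128 GB in this model, however small the cgroup limit is.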
Each open SSTable holds two file descriptors. At open time the engine clamps
max_open_sstables to (fd_limit - 64) / 2, so a target of N open SSTables needs
roughly 2N + 64 descriptors. About one eighth of that budget is reserved for the
write path, and reads back off with the retryable TDB_ERR_BUSY at the cap. To
run a large max_open_sstables, raise the ceiling before tidesdb_open().
```c
tidesdb_raise_open_file_limit(32768); /* opt-in, POSIX RLIMIT_NOFILE / Windows CRT cap */
```
The max_concurrent_flushes count is pinned one to one to num_flush_threads at
open, and a mismatch is normalized with a warning. A single hot column family
drains its immutable queue at the speed of the flush pool. Compaction runs one
round per column family, but each round fans out across the compaction pool
through sub-compaction.
Database-level presets
| Field | Small | Mid | Large | Notes |
|---|---|---|---|---|
| num_flush_threads | 2 | 4 | 8 to 16 | match to concurrently hot CFs; pins max_concurrent_flushes |
| num_compaction_threads | 2 | 4 | 8 to 16 | up to core count on NVMe, 1 to 2 on HDD |
| block_cache_size | 32 MB | 1 to 4 GB | 16 to 64 GB | shared across all CFs; clamped to 30 percent of mem limit |
| max_open_sstables | 64 to 128 | 512 to 1024 | 4096 to 16384 | needs 2N + 64 fds; raise ulimit for mid and large |
| max_memory_usage | 512 MB to 1 GB | 0 (50 percent RAM) or explicit | explicit (~50 percent RAM) | set explicitly inside containers |
| log_to_file | 1 | 1 | 1 | keep LOG in the data dir |
| log_level | TDB_LOG_WARN | TDB_LOG_WARN | TDB_LOG_INFO | drop to WARN in steady state |
| unified_memtable | 0 (1 if many CFs) | per workload | per workload | see the many-column-family overlay |
The max_concurrent_flushes field is left at its default so it tracks
num_flush_threads.
Column-family presets
| Field | Small | Mid | Large | Notes |
|---|---|---|---|---|
| write_buffer_size | 8 to 16 MB | 64 MB | 128 to 256 MB | bigger batches, more memory, bigger L1 SSTables |
| level_size_ratio | 10 | 10 | 10 (12 to 15 for multi-TB) | higher ratio means fewer levels and more write amp |
| min_levels | 1 | 1 | 1 to 2 | a deeper floor avoids early level churn on big sets |
| dividing_level_offset | 1 | 1 | 1 | lower is more aggressive compaction |
| l0_queue_stall_threshold | 6 to 8 | 10 | 16 to 20 | drives per-CF write memory (see model) |
| l1_file_count_trigger | 4 | 4 | 4 to 8 | higher tolerates burstier flushing |
| klog_value_threshold | 512 | 512 | 512 to 1024 | values at or above this go to the value log |
| compression_algorithm | TDB_COMPRESS_LZ4 | TDB_COMPRESS_LZ4 | TDB_COMPRESS_LZ4 (ZSTD for cold) | LZ4 is the throughput default |
| enable_bloom_filter | 1 | 1 | 1 | almost always worth it |
| bloom_fpr | 0.01 | 0.01 | 0.01 (0.005 read-heavy) | lower FPR means more bits per key |
| enable_block_indexes | 1 | 1 | 1 | required for fast seeks |
| index_sample_ratio | 1 | 1 | 1 | 1 indexes every block, making lookups definitive |
| block_index_prefix_len | 16 | 16 | 16 to 32 | raise if keys share long prefixes |
| sync_mode | TDB_SYNC_INTERVAL | TDB_SYNC_INTERVAL or FULL | TDB_SYNC_INTERVAL or FULL | see durability |
| sync_interval_us | 128000 | 128000 | 128000 | only used with INTERVAL |
| default_isolation_level | READ_COMMITTED | READ_COMMITTED | per workload | SERIALIZABLE only when you need SSI |
| use_btree | 0 | 0 (1 for point and seek-heavy) | 0 (1 for point and seek-heavy) | B+tree klog favors point lookups and seeks |
What each tier costs in memory
Using the per-CF model above, the write memory of one busy column family is bounded at roughly the following.
| Tier | buffer | L0 stall | worst-case per-CF write memory |
|---|---|---|---|
| Small | 16 MB | 8 | 8x16 + 2x16 = ~160 MB |
| Mid | 64 MB | 10 | 10x64 + 2x64 = ~768 MB |
| Large | 128 MB | 16 | 16x128 + 2x128 = ~2.3 GB |
Multiply by the number of concurrently hot column families and keep the total
under max_memory_usage. In per-CF mode the engine auto-scales the stall down to
protect the budget, but sizing it yourself avoids surprise throttling.
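The table rows can be reproduced with a few lines of arithmetic, which is handy when sizing a configuration that falls between tiers. This is a sketch of the per-CF formula only: the helper names are hypothetical, and bloom filters and block indexes are left out because the model treats them as small.

```c
#include <assert.h>
#include <stdint.h>

#define MIB ((uint64_t)1 << 20)

/* Worst-case steady-state write memory for one busy column family,
 * ignoring bloom filters and block indexes. */
uint64_t per_cf_write_mem(uint64_t l0_queue_stall_threshold,
                          uint64_t write_buffer_size)
{
    assert(write_buffer_size > 0);
    return l0_queue_stall_threshold * write_buffer_size  /* queued immutables */
         + 2 * write_buffer_size;                        /* active memtable ceiling */
}

/* Nonzero when `hot_cfs` busy column families fit under the budget. */
int fits_budget(uint64_t hot_cfs, uint64_t stall,
                uint64_t write_buffer_size, uint64_t max_memory_usage)
{
    return hot_cfs * per_cf_write_mem(stall, write_buffer_size)
           <= max_memory_usage;
}
```

With mid-tier values, four hot column families need 3072 MB and fit a 4 GB budget, while eight need 6144 MB and do not.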
What each tier costs in file descriptors
| Tier | max_open_sstables | fds needed (2N + 64) | recommended ulimit -n |
|---|---|---|---|
| Small | 128 | ~320 | 1024 (default is fine) |
| Mid | 1024 | ~2112 | 4096 or more |
| Large | 8192 | ~16448 | 32768 or more (call tidesdb_raise_open_file_limit) |
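The descriptor arithmetic behind this table is simple enough to check in code before picking a ulimit. The helpers below are hypothetical names that restate the 2N + 64 rule and the open-time clamp from the mental model; they do not call into the engine.

```c
#include <assert.h>
#include <stdint.h>

/* Descriptors needed to keep N SSTables open: two per SSTable plus 64. */
uint64_t fds_needed(uint64_t max_open_sstables)
{
    return 2 * max_open_sstables + 64;
}

/* Value max_open_sstables would be clamped to under `fd_limit`,
 * following the (fd_limit - 64) / 2 rule. */
uint64_t clamp_open_sstables(uint64_t requested, uint64_t fd_limit)
{
    assert(fd_limit > 64);

    uint64_t cap = (fd_limit - 64) / 2;
    return requested < cap ? requested : cap;
}
```

Under a common default limit of 1024 descriptors, a request for 8192 open SSTables would be clamped to 480 in this model, which is why the large tier raises the limit first.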
Full examples
Small
```c
tidesdb_config_t db = tidesdb_default_config();
db.db_path = "./data";
db.num_flush_threads = 2;
db.num_compaction_threads = 2;
db.block_cache_size = 32 * 1024 * 1024;    /* 32 MB */
db.max_open_sstables = 128;                /* ~320 fds */
db.max_memory_usage = 768 * 1024 * 1024;   /* explicit cap for a small box */
db.log_to_file = 1;
db.log_level = TDB_LOG_WARN;

tidesdb_column_family_config_t cf = tidesdb_default_column_family_config();
cf.write_buffer_size = 16 * 1024 * 1024;   /* 16 MB */
cf.l0_queue_stall_threshold = 8;
cf.compression_algorithm = TDB_COMPRESS_LZ4;
cf.sync_mode = TDB_SYNC_INTERVAL;
```
Mid
```c
tidesdb_raise_open_file_limit(8192); /* before open */

tidesdb_config_t db = tidesdb_default_config();
db.db_path = "./data";
db.num_flush_threads = 4;
db.num_compaction_threads = 4;
db.block_cache_size = 2ull * 1024 * 1024 * 1024;   /* 2 GB */
db.max_open_sstables = 1024;                       /* ~2112 fds */
db.max_memory_usage = 0;                           /* 50 percent of RAM */
db.log_to_file = 1;
db.log_level = TDB_LOG_WARN;

tidesdb_column_family_config_t cf = tidesdb_default_column_family_config();
cf.write_buffer_size = 64 * 1024 * 1024;           /* 64 MB (default) */
cf.l0_queue_stall_threshold = 10;
cf.compression_algorithm = TDB_COMPRESS_LZ4;
cf.sync_mode = TDB_SYNC_INTERVAL;                  /* TDB_SYNC_FULL for strict durability */
```
Large
```c
tidesdb_raise_open_file_limit(65536); /* before open */

tidesdb_config_t db = tidesdb_default_config();
db.db_path = "./data";
db.num_flush_threads = 12;
db.num_compaction_threads = 12;
db.block_cache_size = 32ull * 1024 * 1024 * 1024;    /* 32 GB */
db.max_open_sstables = 8192;                         /* ~16448 fds */
db.max_memory_usage = 128ull * 1024 * 1024 * 1024;   /* explicit on a 256 GB box */
db.log_to_file = 1;
db.log_level = TDB_LOG_INFO;

tidesdb_column_family_config_t cf = tidesdb_default_column_family_config();
cf.write_buffer_size = 128 * 1024 * 1024;
cf.l0_queue_stall_threshold = 16;
cf.level_size_ratio = 10;                            /* 12 to 15 for very large datasets */
cf.l1_file_count_trigger = 6;
cf.compression_algorithm = TDB_COMPRESS_LZ4;
cf.sync_mode = TDB_SYNC_INTERVAL;
```
Workload overlays
Apply these on top of a tier.
For write-heavy ingest, use a larger write_buffer_size for better batching and a
higher l0_queue_stall_threshold to absorb bursts at the cost of memory, add
flush threads, and choose TDB_SYNC_INTERVAL, or TDB_SYNC_NONE for a
rebuildable cache. Keep level_size_ratio at 10 to bound write amplification, and
raise it only if reads can absorb the extra level depth.
For read-heavy point lookups, lower bloom_fpr to 0.005 or 0.002 to cut false
positives, give the block cache more room, and consider use_btree = 1 for
O(log n) klog lookups instead of block scans. Keep index_sample_ratio at 1 so
block index lookups are definitive and can short-circuit negative reads.
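To put rough numbers on the bloom_fpr trade-off, the textbook result for a Bloom filter with an optimal number of hash functions gives the bits per key $m/n$ needed for a false-positive rate $p$. Whether the engine sizes its filters at exactly this optimum is an assumption here, so treat the figures as estimates.

$$
\frac{m}{n} = -\frac{\ln p}{(\ln 2)^2} \approx 1.44 \, \log_2 \frac{1}{p}
$$

For $p = 0.01$ this is about 9.6 bits per key, for $p = 0.005$ about 11.0, and for $p = 0.002$ about 12.9. Halving the rate from 0.01 to 0.005 therefore costs roughly 15 percent more filter memory, and going to 0.002 roughly 35 percent more.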
For range scans and iteration, use_btree = 1 helps seek-then-scan, block indexes
must stay enabled, and a larger block cache keeps hot blocks resident. Raise
block_index_prefix_len if your keys share long common prefixes.
For delete-heavy workloads, arm the tombstone density trigger so delete-dominated column families are compacted before range scans degrade.
```c
cf.tombstone_density_trigger = 0.30;        /* compact when an sstable is >30% tombstones */
cf.tombstone_density_min_entries = 1024;    /* ignore tiny sstables */
```
Use tidesdb_txn_single_delete for keys put at most once between deletes, which
lets a put and its delete cancel at the first merge regardless of level, and
tidesdb_compact_range to reclaim a known range immediately.
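How the two tombstone knobs interact can be sketched as a predicate. This is an illustration only: `tombstone_trigger_fires` is a hypothetical name, and the strict comparison on density and the inclusive entry floor are assumptions based on the comments in the snippet above, not on the engine source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical predicate: nonzero when an SSTable with `entries` entries,
 * `tombstones` of them deletes, would qualify for compaction.
 * Assumes density > trigger and entries >= min_entries. */
int tombstone_trigger_fires(uint64_t tombstones, uint64_t entries,
                            double density_trigger, uint64_t min_entries)
{
    assert(tombstones <= entries);

    if (entries == 0 || entries < min_entries)
        return 0;                                /* ignore tiny sstables */
    return (double)tombstones / (double)entries > density_trigger;
}
```

With the values shown above, an SSTable of 10000 entries fires at 4000 tombstones but not at 2000, and one with only 1000 entries never fires regardless of density.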
For large values, raise klog_value_threshold so more of them move to the value
log and keep the klog scannable, and lower it if values are small and hot so they
stay inline. Raise multipart_part_size in object store mode for very large
SSTables.
For many column families that map to one logical entity, such as a table plus N secondary index column families, enable unified memtable mode so a transaction touching K column families does one WAL write instead of K.
```c
db.unified_memtable = 1;
db.unified_memtable_write_buffer_size = 0;   /* 0 => 64 MB (TDB_DEFAULT_WRITE_BUFFER_SIZE) */
db.unified_memtable_sync_mode = TDB_SYNC_INTERVAL;
```
Unified mode requires the default memcmp comparator for every column family,
since the shared skip list has a single sort order, and it shares one immutable
queue, so the per-CF stall auto-scaling does not apply. This is the right default
once you exceed a handful of column families per transaction.
Unified memtable sizing and the L0 and L1 split
Unified mode changes what write_buffer_size and the L0 stall mean, and the two
buffer knobs drive different things. This is the subtle part.
The unified_memtable_write_buffer_size sizes the one shared memtable. All column
families write into it, and it rotates as a whole the moment its total size
reaches that value. The threshold is fixed, with no adaptive idle headroom, unlike
per-CF mode, which rotates at up to 1.5x its buffer. A value of 0 means 64 MB, the
TDB_DEFAULT_WRITE_BUFFER_SIZE constant, not any column family’s
write_buffer_size.
Each column family’s own write_buffer_size no longer sizes a memtable in unified
mode, and the per-CF active and immutable memtables stay empty. That value still
matters, but only as level geometry. It is the capacity floor for that column
family’s levels, since DCA never sizes a level below it, so it governs the on-disk
level shape rather than the in-memory footprint.
L0 is shared while L1 and everything below it stays per-CF, and this asymmetry is
the heart of the model. There is exactly one L0 flush queue,
unified_mt.immutables. The L0 stall, the hard cap of stall + 6, and the
graduated L0 delays all measure that single shared queue. Backpressure is still
evaluated once per column family in a commit, comparing the shared queue depth
against each participating column family’s l0_queue_stall_threshold, so the
smallest configured threshold among the column families in a transaction is the
one that actually stalls it. Set l0_queue_stall_threshold consistently across
column families in unified mode, or expect the minimum to win.
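The "minimum wins" behavior can be made concrete with a short sketch. The function name is hypothetical; it only restates the rule that the shared queue depth is compared against each participating column family's threshold.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest l0_queue_stall_threshold among the column families taking
 * part in one transaction, which is the depth at which it stalls. */
uint64_t effective_unified_stall(const uint64_t *thresholds, size_t n)
{
    assert(thresholds != NULL && n > 0);

    uint64_t min = thresholds[0];
    for (size_t i = 1; i < n; i++) {
        if (thresholds[i] < min)
            min = thresholds[i];
    }
    return min;
}
```

A transaction touching column families configured with 10, 16, and 6 stalls once the shared queue reaches 6, so one low outlier throttles every transaction that includes it.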
L1 stays per column family. Flush demux writes each column family’s slice of the
rotated memtable to that column family’s own level 1, so every column family keeps
its own L1 file count, its own l1_file_count_trigger, and its own compaction. L1
never hard-stalls writes in either mode and only applies graduated delays, and
compaction stays one round per column family.
Memory in unified mode is database-wide rather than per column family. Because the queue and the active memtable are shared, the worst-case write memory is bounded once for the whole database, regardless of column family count.
```
unified_write_mem ~= l0_queue_stall_threshold * unified_memtable_write_buffer_size
                   + 2 * unified_memtable_write_buffer_size   (shared active ceiling, 2x)
```
With the defaults of a 64 MB unified buffer and a stall of 10, that is about
(10 + 2) * 64 MB = ~768 MB total, no matter how many column families you run.
That fixed, count-independent ceiling is the main reason to choose unified mode
for many-CF workloads. In per-CF mode the bound is instead per column family, and
the engine auto-scales each one’s stall down to protect the budget. So in unified
mode you tune memory with the unified buffer and a single l0_queue_stall_threshold,
not with per-CF write_buffer_size.
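A short comparison shows why the count-independent bound matters. The helpers are hypothetical names restating the two formulas in this reference; the per-CF figure is the unscaled bound, meaning what the budget would need to be for the engine not to shrink each stall.

```c
#include <assert.h>
#include <stdint.h>

#define MIB ((uint64_t)1 << 20)

/* Database-wide bound in unified mode: one shared queue plus the
 * 2x active memtable ceiling. */
uint64_t unified_write_mem(uint64_t l0_queue_stall_threshold,
                           uint64_t unified_buffer)
{
    assert(unified_buffer > 0);
    return (l0_queue_stall_threshold + 2) * unified_buffer;
}

/* Unscaled per-CF bound: the same formula once per column family. */
uint64_t per_cf_total_write_mem(uint64_t num_cfs, uint64_t stall,
                                uint64_t write_buffer_size)
{
    return num_cfs * (stall + 2) * write_buffer_size;
}
```

At the defaults, 20 column families would need about 15 GB unscaled in per-CF mode, against a fixed 768 MB in unified mode.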
For object store and cloud-native deployments, set object_store and a
tidesdb_objstore_config_t, and unified memtable mode is enabled automatically.
Give local_cache_max_bytes enough room for the hot working set, keep
replicate_wal = 1, and set wal_sync_on_commit = 1 for an RPO of zero, or rely
on the 1 MB wal_sync_threshold_bytes for a bounded loss window. The
max_concurrent_uploads count defaults to 4 and max_concurrent_downloads to 8,
and both are worth raising on large nodes. Two per-CF knobs trade remote I/O
against read amplification. The object_lazy_compaction flag, off by default,
doubles the L1 file-count compaction trigger when an object store is attached,
which cuts compaction frequency and upload churn at the cost of more files to
read. The object_prefetch_compaction flag, on by default, downloads all evicted
merge inputs in parallel before a compaction instead of one at a time.
Durability
| Mode | Behavior | Use when |
|---|---|---|
| TDB_SYNC_NONE | WAL written, not fsynced per commit, synced at flush | rebuildable cache, max throughput |
| TDB_SYNC_INTERVAL | background fsync every sync_interval_us (default 128 ms) | general purpose, good throughput with a bounded loss window |
| TDB_SYNC_FULL | fsync coalesced across concurrent committers | strict durability; unified-mode group commit keeps it fast |
In full-sync unified mode the WAL fsync is coalesced into one fsync per batch of
concurrent committers, so TDB_SYNC_FULL stays performant at high concurrency. In
per-CF mode every commit fragments across separate WALs, so TDB_SYNC_FULL costs
one fsync per commit.
Lower-impact knobs
These rarely need changing from their defaults, but they are real and worth knowing.
The skip_list_max_level (default 12) and skip_list_probability (default 0.25)
are the standard probabilistic skip-list parameters. Keep skip_list_max_level
below 64, because the memtable write path uses a stack-allocated update array only
when the level is under 64, and a higher value forces a per-operation heap
allocation. A level of 12 already indexes far more entries than a memtable holds.
The unified memtable has its own unified_memtable_skip_list_max_level and
unified_memtable_skip_list_probability, and a 0 for either resolves to the same
12 and 0.25 defaults.
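The claim that 12 levels is ample follows from the usual skip-list sizing rule, where a list with promotion probability p and L levels is appropriate for about (1/p)^L entries. The sketch below computes that figure; the function name is hypothetical and it takes 1/p as an integer to stay in integer arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate entry count a skip list handles before its top level
 * saturates: (1/p)^max_level. `inverse_probability` is 1/p, so 4 for
 * p = 0.25. Saturates at UINT64_MAX instead of overflowing. */
uint64_t skip_list_capacity(uint64_t inverse_probability, unsigned max_level)
{
    assert(inverse_probability >= 2);

    uint64_t capacity = 1;
    for (unsigned i = 0; i < max_level; i++) {
        if (capacity > UINT64_MAX / inverse_probability)
            return UINT64_MAX;
        capacity *= inverse_probability;
    }
    return capacity;
}
```

For the defaults this gives 4^12, about 16.7 million entries, which a 64 MB memtable could only reach with entries of roughly 4 bytes each.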
The min_disk_space field (default 100 MB) is a safety floor. Flush and
compaction are skipped while free space on the column family directory is below
it, so the data stays in memory and writes keep flowing until memory pressure
intervenes. Raise it if you want a larger headroom before the engine stops writing
new files.
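For monitoring free space against this floor from outside the engine, a POSIX-only sketch using statvfs is shown below. The helper names are hypothetical, and how the engine itself measures free space is not stated in this reference, so the comparison is an approximation of its check.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/statvfs.h>

/* Writes the bytes available to unprivileged writers on the filesystem
 * holding `path`. Returns 0 on success, -1 on failure. */
int free_bytes_at(const char *path, uint64_t *out)
{
    struct statvfs st;

    if (path == NULL || out == NULL || statvfs(path, &st) != 0)
        return -1;
    *out = (uint64_t)st.f_bavail * (uint64_t)st.f_frsize;
    return 0;
}

/* Nonzero when free space is under the floor, the condition in which
 * flush and compaction are described as being skipped. */
int below_min_disk_space(uint64_t free_bytes, uint64_t min_disk_space)
{
    return free_bytes < min_disk_space;
}
```

Polling this against the column family directory and alerting somewhat above the configured floor leaves time to react before writes start accumulating in memory.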
The log_truncation_at field (default 24 MB) matters only with log_to_file = 1.
The LOG file is truncated once it grows past this size, which bounds log disk
use.
The num_compaction_threads count sets both the number of compaction worker
threads and the shared sub-compaction helper budget, so one round of a single
column family can fan out across the whole pool. Size it to your storage
parallelism, one to two on HDD and up to the core count on NVMe, not to your
column family count.
Things to watch
Set max_memory_usage explicitly in containers, because the auto-resolve uses
host RAM rather than your cgroup limit.
A high max_open_sstables is silently clamped if the file-descriptor limit cannot
honor it, so call tidesdb_raise_open_file_limit() before tidesdb_open().
The max_concurrent_flushes count follows num_flush_threads, so set the thread
count and do not fight the one-to-one pin.
The comparator is permanent. It cannot change after a column family is created
without corrupting key order, and unified mode forces memcmp.
Many column families in per-CF mode self-throttle, since the effective L0 stall
shrinks with column family count. Grow max_memory_usage or switch to unified
mode.
The block cache shares the 30 percent memory clamp with the B+tree node cache, so
a large block_cache_size may be clamped down relative to max_memory_usage.
The min_disk_space floor gates flush and compaction, not just writes. If free
space drops below it, new SSTables stop being written and memory climbs until
pressure relief intervenes, so monitor disk free space against this floor.
A skip_list_max_level of 64 or more costs a heap allocation on every memtable
write, so stay below 64. The default 12 is fine.