Use cases for bluestore_dio = false in Ceph

Question 1

What are sensible use cases for setting bluestore_dio = false in Ceph 19?

I already know that it activates RAM cache instead of writing new data directly into mass storage.

I would like to accelerate bulk reads and writes of ~250 MB multimedia (OGG, MPEG-4, FITS, RAW,…) files (via CephFS) and would like to know if I can expect performance gains when deactivating DIO.

My Ceph cluster contains four ARM (RK3588) nodes, each with an 8 TB M.2 SSD and 16 GB of RAM, each running Ubuntu 24 and connected by a 10 Gbit/s network.

Answer

Ceph BlueStore OSDs do not use the Linux VFS and its page cache; they sit directly on block devices rather than on files. There is no BlueStore OSD option corresponding to journal_dio, so that kind of setting does not apply.

Memory management is instead handled by the Ceph application itself. Following the hardware recommendations, I suggest:

tune OSDs to at least 4 GB each
total OSD memory not exceeding 50% system memory, less if other applications share the resources
keep a 20% margin of system memory free to avoid nasty reclaim performance and OOM kills (which is consistent with capacity planning for other latency-sensitive applications)
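As a rough worked example of those three rules, here is a minimal Python sketch. The function name and its parameters are made up for illustration, and the inputs (16 GB of RAM, one OSD per node) are taken from the question and treated as GiB for simplicity.

```python
def osd_memory_budget(total_ram_gib, osds_per_host,
                      osd_share=0.50, margin=0.20, min_per_osd_gib=4.0):
    """Apply the rules of thumb above to one host (hypothetical helper)."""
    if osds_per_host < 1:
        raise ValueError("need at least one OSD per host")
    # Rule 2: all OSDs together get at most osd_share of system memory.
    osd_pool_gib = total_ram_gib * osd_share
    per_osd_gib = osd_pool_gib / osds_per_host
    left_gib = total_ram_gib - osd_pool_gib
    return {
        "per_osd_gib": per_osd_gib,
        # Rule 1: each OSD should get at least 4 GB.
        "meets_minimum": per_osd_gib >= min_per_osd_gib,
        # Rule 3: what is left must cover the 20% margin.
        "margin_ok": left_gib >= total_ram_gib * margin,
        "left_for_system_gib": left_gib,
    }


# One OSD on a 16 GB node, as in the question.
budget = osd_memory_budget(total_ram_gib=16, osds_per_host=1)
print(budget)
```

With one OSD per 16 GB node the 50% cap leaves up to 8 GB for the OSD, which is above the 4 GB minimum. Anything else running on the node (MON, MDS, CephFS clients) has to fit in the remainder.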

While there are a few config directives that can tune this, a simple option is a static configuration via osd_memory_target. Returning to that hardware recs page:

BlueStore uses its own memory to cache data rather than relying on the operating system’s page cache. When using the BlueStore OSD back end you can adjust the amount of memory that the OSD attempts to consume by changing the osd_memory_target configuration option.

Setting the osd_memory_target below 2GB is not recommended. Ceph may fail to keep the memory consumption under 2GB and extremely slow performance is likely.

Setting the memory target between 2GB and 4GB typically works but may result in degraded performance: metadata may need to be read from disk during IO unless the active data set is relatively small.

4GB is the current default value for osd_memory_target. This default was chosen for typical use cases, and is intended to balance RAM cost and OSD performance.

Setting the osd_memory_target higher than 4GB can improve performance when there are many (small) objects or when large (256GB/OSD or more) data sets are processed. This is especially true with fast NVMe OSDs.
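osd_memory_target is given in bytes, so a small helper can save a conversion mistake. This is only a sketch: the helper name is hypothetical, the 6 GiB value is an arbitrary example rather than a recommendation, and it only prints a `ceph config set` command for you to review and run yourself.

```python
GIB = 1024 ** 3


def osd_memory_target_command(gib, who="osd"):
    """Build a `ceph config set` command line for osd_memory_target.

    `who` is the config target: "osd" for all OSDs, or a single
    daemon such as "osd.0".
    """
    target_bytes = int(gib * GIB)
    return f"ceph config set {who} osd_memory_target {target_bytes}"


# Illustrative value only; pick yours from your own memory budget.
command = osd_memory_target_command(6)
print(command)
```

The default of 4 GiB corresponds to 4294967296 bytes, which is a quick way to check that the conversion is right.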

For monitoring, the usual metrics for watching Linux memory include:

Qualitative measure of little to no memory delay via PSI data in /proc/pressure/memory (and the similar numbers per cgroup).
Quantitative measurements from /proc/meminfo (and the similar numbers per cgroup). From these you can tell whether MemAvailable / MemTotal leaves a reasonable capacity margin, whether you misconfigured huge pages, how much more anonymous memory BlueStore uses compared with a file-oriented DBMS, and more.
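Both files are plain text and easy to read from a script. The following is a minimal sketch with hypothetical helper names. It assumes the usual layouts: `some`/`full` lines with `avg10=`, `avg60=`, `avg300=` and `total=` fields in /proc/pressure/memory, and `Key: value kB` lines in /proc/meminfo.

```python
from pathlib import Path


def parse_psi(text):
    """Parse PSI text into {"some": {"avg10": ...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        pairs = (field.split("=", 1) for field in fields)
        result[kind] = {key: float(value) for key, value in pairs}
    return result


def parse_meminfo(text):
    """Parse /proc/meminfo text into {name: number}, mostly in kB."""
    values = {}
    for line in text.strip().splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values


def available_fraction(meminfo):
    """MemAvailable as a fraction of MemTotal."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"]


if __name__ == "__main__":
    try:
        psi = parse_psi(Path("/proc/pressure/memory").read_text())
        mem = parse_meminfo(Path("/proc/meminfo").read_text())
    except OSError as exc:
        # PSI may be missing or disabled on some kernels.
        print(f"could not read memory statistics: {exc}")
    else:
        print("PSI some avg60:", psi["some"]["avg60"])
        print(f"available: {available_fraction(mem):.0%} of MemTotal")
```

A `some avg60` value that stays near zero, together with an available fraction above the 20% margin, would suggest the OSD memory target is not squeezing the host.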

In terms of which storage API to use, double-check that your Ceph package is linked against libaio, for Linux native async I/O. Since that is the default, it probably is.

If you have a hypothesis that the storage API, manual cache sizing, or other tuning affects your performance, test it. Only you can test your workload. Performance depends on an incredible number of variables: how your data is structured, how you map it to OSDs, the performance characteristics of your model of SSD, host resources like CPU and memory, Ceph FileStore versus BlueStore, and so on.

However, I doubt you need to do anything as exotic as enabling io_uring or the next generation ceph crimson to make effective use of that small cluster.

As an exercise in finding bottlenecks, check if you can saturate that 10g link. Synthetic load, real load, take your pick. I expect that network to be relatively easy to max out across 8 disks.
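For a sense of scale, here is the back-of-the-envelope arithmetic for that link. It assumes the network is 10 Gbit/s, uses decimal megabytes, and ignores protocol overhead and replication traffic, so real throughput will be lower. The function names are made up for illustration.

```python
def link_mb_per_s(gbit_per_s):
    """Theoretical line rate in MB/s (decimal, no protocol overhead)."""
    return gbit_per_s * 1e9 / 8 / 1e6


def transfer_seconds(file_mb, gbit_per_s):
    """Best-case time to move one file of file_mb megabytes."""
    return file_mb / link_mb_per_s(gbit_per_s)


rate = link_mb_per_s(10)
per_file = transfer_seconds(250, 10)
print(f"line rate: {rate:.0f} MB/s, 250 MB file: {per_file:.2f} s")
```

If a single client's sequential reads and writes already come close to that line rate, the link is the limit and tuning OSD caching is unlikely to show up in bulk transfers.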