Playing with bcache: NVMe Write-Back Caching over HDD
Over the weekend I decided to revisit bcache, Linux’s block-layer cache, to try to squeeze more performance out of a large spinning disk by caching it with a small SSD. The concept is simple: attach a fast SSD (in this case, a 256 GB NVMe drive) in front of a slower HDD and let bcache absorb and smooth out writes by using the SSD as a write-back cache.
Unfortunately, the results were not what I expected. Instead of a big performance bump, I discovered that my NVMe drive—despite being a Samsung MZVLQ256HBJD—is shockingly slow for sustained writes, barely better than the HDD I was trying to accelerate.
Setup Steps
Here’s the setup I used:
- Wipe the drives
Make sure there’s no lingering filesystem or bcache metadata:
sudo wipefs -a /dev/sda
sudo wipefs -a /dev/nvme0n1p4
- Create backing device on the HDD
sudo make-bcache --bdev --wipe-bcache /dev/sda
- Create cache device on the NVMe SSD
sudo make-bcache --cache --wipe-bcache /dev/nvme0n1p4
- Attach the cache to the backing device
Get the cache set UUID and attach it to the backing device:
# Find the cache set UUID
bcache-super-show /dev/nvme0n1p4 | grep cset.uuid
# Attach (replace <UUID> with the cset.uuid value from above)
echo <UUID> | sudo tee /sys/block/bcache0/bcache/attach
- Enable write-back mode
echo writeback | sudo tee /sys/block/bcache0/bcache/cache_mode
- Format and mount
sudo mkfs.ext4 /dev/bcache0
sudo mount /dev/bcache0 /mnt
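Before benchmarking, it’s worth sanity-checking that the cache actually attached and that write-back is active. A minimal check via sysfs (assuming the device showed up as bcache0, as above) could look like:
# Should report "clean" or "dirty" once a cache is attached, not "no cache"
cat /sys/block/bcache0/bcache/state
# The active mode is shown in [brackets]; expect [writeback]
cat /sys/block/bcache0/bcache/cache_mode
# Amount of dirty data still waiting to be flushed to the HDD
cat /sys/block/bcache0/bcache/dirty_data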
Benchmarking Reality
To get a baseline, I ran a basic fio test directly on a file on my root filesystem, which resides on the NVMe drive:
fio --name=test --filename=/home/osan/testfile --size=10G --bs=1M --rw=write \
    --ioengine=libaio --iodepth=32 --direct=0
The result: ~194 MiB/s write bandwidth. And that’s not a typo.
With clat latencies in the hundreds of milliseconds and 99th percentile latencies >2 seconds, the drive showed behavior much closer to a spinning disk than what you’d expect from an NVMe device. Even enabling write-back caching in bcache didn’t improve performance much—it simply matched the NVMe’s raw write speed, which isn’t saying much.
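In hindsight, a run with --direct=1 would have bypassed the page cache and measured sustained write behavior more honestly. A sketch of that variant (same test file path as above, runtime chosen arbitrarily):
fio --name=sustained-write --filename=/home/osan/testfile --size=10G --bs=1M --rw=write \
    --ioengine=libaio --iodepth=32 --direct=1 --runtime=300 --time_based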
The drive reports a PCIe Gen3 x4 link at 8 GT/s, so the bus isn’t the bottleneck. The drive is simply bad at sustained writes: probably DRAM-less and aggressively optimized for bursty client workloads.
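For reference, the negotiated link can be read straight from the PCIe link status; a quick check (the bus address below is just an example, find yours with lspci | grep -i nvme):
# LnkSta shows the negotiated speed and width, e.g. "Speed 8GT/s, Width x4"
sudo lspci -vv -s 01:00.0 | grep LnkSta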
Takeaways
If you’re planning to use an NVMe SSD as a bcache write-back device, don’t assume all NVMe drives are fast. Many low-end OEM drives—especially DRAM-less models like the MZVLQ256HBJD—fall off a performance cliff under sustained write load.
In my case, bcache didn’t offer a performance boost because the cache layer was just as slow as the backing device. In retrospect, I would’ve been better off checking the SSD’s sustained write behavior first using something like:
sudo nvme smart-log /dev/nvme0n1
Or running long-duration writes and monitoring latency and thermal throttling.
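For the monitoring half, something like this while a sustained fio write is running would have surfaced thermal throttling and wear early; the fields in the grep are simply the ones I’d watch:
# Poll temperature, wear and error counters every 5 seconds during a long write
watch -n 5 'sudo nvme smart-log /dev/nvme0n1 | grep -Ei "temperature|percentage_used|media_errors"'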
There’s still value in bcache—especially if you have a real SSD or enterprise-grade NVMe to use as a cache. But with a sluggish consumer NVMe like mine, bcache turns into a no-op at best and a source of extra latency at worst.
Sparse files
A really long time ago I started writing something about sparse files to show how they can be used for automatically growing file systems. Unfortunately I never got very far, but it’s one of those things I still really enjoy playing around with, so I might as well try to finish this entry.
A sparse file is essentially an empty file with a declared size: the blocks are never allocated up front, so you can end up with a huge file that only occupies a small part of the hard drive. Real disk space is only consumed as the file’s contents are actually written. This can be really useful for creating loopback disk images, for example. Below is an example of how I create an empty image file that takes up 0 bytes of disk space but is 512 megabytes in size according to the filesystem.
oan@work7:~$ dd if=/dev/zero of=file.img bs=1 count=0 seek=512M
0+0 records in
0+0 records out
0 bytes (0 B) copied, 9.0712e-05 s, 0.0 kB/s
Here we actually create the file; notice the seek=512M. Combined with count=0, this makes dd seek 512 megabytes into the file without writing any data, leaving a 512 MB sparse file behind.
oan@work7:~$ du -shx file.img
0 file.img
And here you see the amount of disk space actually used by the file at this point.
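As a side note, the same kind of sparse file can also be created with truncate, and ls can show the apparent size next to the allocated blocks; a small sketch using the same filename:
# Equivalent way to create a 512 MB sparse file
truncate -s 512M file.img
# Compare apparent size (512M) with blocks actually allocated (still 0 here)
ls -lsh file.img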
oan@work7:~$ mkfs.btrfs file.img
SMALL VOLUME: forcing mixed metadata/data groups
btrfs-progs v4.0
See http://btrfs.wiki.kernel.org for more information.
Turning ON incompat feature 'mixed-bg': mixed data and metadata block groups
Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
Turning ON incompat feature 'skinny-metadata': reduced-size metadata extent refs
Created a data/metadata chunk of size 8388608
ERROR: device scan failed 'file.img' - Operation not permitted
fs created label (null) on file.img
nodesize 4096 leafsize 4096 sectorsize 4096 size 512.00MiB
At this point we create a btrfs filesystem inside the sparse file we previously created.
oan@work7:~$ du -shx file.img
4.1M file.img
And now you can see that the filesystem metadata we created uses 4.1 megabytes of actual disk space. The space available inside the filesystem should be roughly 512M minus that 4.1M of metadata.
oan@work7:~$ sudo mount -t btrfs -o loop file.img file
oan@work7:~$ sudo dd if=/dev/urandom of=./file/lalala.rand bs=1K count=4K
4096+0 records in
4096+0 records out
4194304 bytes (4.2 MB) copied, 0.344203 s, 12.2 MB/s
We mount the file system using a loopback device and create a small 4M file inside it.
oan@work7:~$ du -shx file.img
8.1M file.img
And here you can see how the actual diskspace used has grown to 8.1M in size.
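To make the sparse-versus-apparent distinction explicit, du can report both views and df shows the free space as seen from inside the mounted image (paths as used above):
# Apparent size of the image vs. blocks actually allocated on disk
du -h --apparent-size file.img
du -h file.img
# Free space according to the filesystem inside the image
df -h ./file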
This is a really interesting way of using your disk space efficiently, for example when running Yocto builds locally. I hope this can be useful for others as well.