All About the Linux Kernel: Bcache
The 3.10 Linux kernel release late last month brought a raft of new features worth celebrating for Linux developers and sysadmins alike. This release was especially satisfying, though, to kernel developer Kent Overstreet, who saw years of hard work pay off with the inclusion of the Bcache patch set in 3.10.
Bcache allows Linux machines to use flash-based SSDs (solid-state drives) as a cache for slower, less expensive hard disk drives. It can be used in servers, workstations, high-end storage arrays, or “anywhere you want IO to be faster, really,” Overstreet said.
“If you don't want to shell out for all SSD storage, using bcache makes the machine you're using feel just about as fast as if it was using just SSDs,” he said. “I've been using it on my various machines at home and at work for quite a while now.”
Bcache lives in the kernel’s block layer, below the filesystem, alongside other block device utilities such as md RAID, device mapper, and DRBD. (See the bcache documentation and a full list of bcache features and performance notes.)
Years in Development
Overstreet originally took on bcache “for fun” as a side project and worked on it alone for more than a year before it started getting attention. Then Google took notice and hired him to work specifically on bcache.
He worked on coding it for another year, with help from Adam Berkan and Ricky Benitez at Google who contributed some code, along with design and code reviews. But bcache was never rolled out on a large scale at Google, “for reasons of a political/"vision" nature,” Overstreet said.
He’s since worked mostly alone to maintain it, with help from the open source and Linux kernel communities on patches and testing, “which is hugely important and impossible for me to fully cover myself,” Overstreet said.
He’s now moved on to a new company, Datera, which will be relying heavily on bcache, “so development should pick back up again,” he said.
Future Bcache Features
Overstreet can celebrate the 3.10 release, at least a little, now that bcache is mostly complete with no big changes remaining. But in the near future – though probably not as soon as 3.11 – he has a number of bcache enhancements planned. These include:
- RAID stripe awareness.
“Partial stripe writes on raid 5/6 are quite expensive, they require a read/modify/write of the parity blocks. This will add knowledge of the stripe layout to the writeback code, so that when deciding which writes to do writeback for it biases in favor of stripes that are already dirty, and background writeback preferentially flushes full stripes first.”
- The ability to add miss data to the cache when the btree node is full.
“If we get a cache miss and the btree node it'd go in is full, we can't add that data to the cache. On normal workloads this is mostly a non-issue, because there'll be some write activity and the btree node splits will just happen on writes.
“But if you're benchmarking reads or random reads, and trying to warm the cache by just doing reads - it's a really annoying issue then because the cache never fully warms up and if you don't know about this issue, it's quite baffling and frustrating.”
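The RAID stripe awareness idea in the first item can be illustrated with a small sketch. This is not bcache's writeback code; the function name `writeback_order` and the parameter `blocks_per_stripe` are hypothetical, and the model assumes each stripe holds a fixed number of data blocks. It only shows the ordering Overstreet describes: fully dirty stripes are flushed first, because they can be written without a read/modify/write of parity, followed by the stripes with the most dirty blocks.

```python
from collections import defaultdict


def writeback_order(dirty_blocks, blocks_per_stripe):
    """Order stripes for background writeback (illustrative only).

    dirty_blocks: iterable of dirty block numbers on the backing device.
    blocks_per_stripe: assumed number of data blocks in one RAID stripe.
    Returns stripe numbers, dirtiest first; ties break on stripe number.
    """
    if blocks_per_stripe <= 0:
        raise ValueError("blocks_per_stripe must be positive")

    # Group dirty blocks by the stripe they fall into.
    per_stripe = defaultdict(set)
    for block in dirty_blocks:
        per_stripe[block // blocks_per_stripe].add(block)

    # A stripe with blocks_per_stripe dirty blocks is a full stripe and
    # sorts to the front; partial stripes follow by dirty-block count.
    return sorted(per_stripe, key=lambda s: (-len(per_stripe[s]), s))


if __name__ == "__main__":
    # Stripe 0 is fully dirty, stripe 2 has two dirty blocks, stripe 1 has one.
    print(writeback_order([0, 1, 2, 3, 4, 9, 10], blocks_per_stripe=4))  # [0, 2, 1]
```

A real implementation would also have to bias which incoming writes are held for writeback toward stripes that are already dirty, which this sketch leaves out.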
In the longer term, Overstreet would like to add support for multiple SSDs in a cache set and full data checksumming. He’s already made some progress on these changes; the potential for supporting multiple SSDs was “baked into the design ages ago.”
Multiple SSDs “will allow us to mirror dirty data and metadata, but not clean data - you get redundancy without wasting SSD space duplicating clean cached data,” he said.
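The space saving he describes can be made concrete with a little arithmetic. Clean cached data also exists on the backing disk, so losing an SSD loses nothing; only dirty data and metadata need a second copy. The sketch below is an assumption-laden illustration, not bcache's allocation logic: the names `copies_needed`, `ssd_space_used` and `full_mirror_space` are hypothetical, and it assumes exactly two copies of anything that is mirrored.

```python
def copies_needed(kind, num_ssds):
    """Copies kept on SSD for one cached extent (hypothetical policy)."""
    if kind not in ("clean", "dirty", "metadata"):
        raise ValueError(f"unknown kind: {kind}")
    # Clean data can be re-read from the backing disk, so one copy is enough.
    wanted = 1 if kind == "clean" else 2
    # With a single SSD there is nowhere to put a second copy.
    return min(wanted, num_ssds)


def ssd_space_used(clean_gb, dirty_gb, metadata_gb, num_ssds):
    """SSD space consumed when only dirty data and metadata are mirrored."""
    return (clean_gb * copies_needed("clean", num_ssds)
            + dirty_gb * copies_needed("dirty", num_ssds)
            + metadata_gb * copies_needed("metadata", num_ssds))


def full_mirror_space(clean_gb, dirty_gb, metadata_gb):
    """SSD space consumed if everything were mirrored across two SSDs."""
    return 2 * (clean_gb + dirty_gb + metadata_gb)


if __name__ == "__main__":
    # Example workload: 90 GB clean, 8 GB dirty, 2 GB metadata, two SSDs.
    print(ssd_space_used(90, 8, 2, num_ssds=2))   # 110
    print(full_mirror_space(90, 8, 2))            # 200
```

In this made-up workload, selective mirroring uses 110 GB of SSD space where mirroring everything would use 200 GB, with the same protection for the data that exists nowhere else.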
Even farther off, he sees the potential for using bcache as the basis for a new, faster local filesystem with smaller and cleaner code. “But who knows when I'll have time to work on it?” he said.