
Comments (87)

  • cornholio
    You know you need to be careful when an Amazon engineer will argue for a database architecture that fully leverages (and makes you dependent on) the strengths of their employer's product. In particular:

    > Commit-to-disk on a single system is both unnecessary (because we can replicate across storage on multiple systems) and inadequate (because we don’t want to lose writes even if a single system fails).

    This is surely true for certain use cases, say financial applications which must guarantee 100% uptime, but I'd argue the vast, vast majority of applications are perfectly OK with local commit and rapid recovery from remote logs and replicas. The point is, the cloud won't give you that distributed consistency for free; you will pay for it both in money and in complexity that in practice will lock you in to a specific cloud vendor. I.e., make cloud and hosting services impossible to commoditize by the database vendors, which is exactly the point.
  • wpietri
    I should add that the bond between relational databases and spinning rust goes back further. My dad, who started working as a programmer in the 60s with just magtape as storage, talked about the early era of disks as a big step forward, but one requiring a lot of detailed work to decide where to put the data and how to find it again. For him, databases were a solution to the problems that disks created for programmers. And I can certainly imagine that. Suddenly you have to deal with way more data stored in multiple dimensions (platter, cylinder, sector) with wildly nonlinear access times (platter rotation, head movement). I can see how commercial solutions to that problem would have been wildly popular, but also built around solving a number of problems that no longer matter.
  • mrkeen
    > Design decisions like write-ahead logs, large page sizes, and buffering table writes in bulk were built around disks where I/O was SLOW, and where sequential I/O was order(s)-of-magnitude faster than random.

    Overall speed is irrelevant; what mattered was the relative speed difference between sequential and random access. And since there's still a massive difference between sequential and random access with SSDs, I doubt the overall approach of using buffers needs to be reconsidered.
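    A rough way to see that gap on your own hardware; this is a sketch (not from the comment), with an arbitrary file path and sizes, and the OS page cache will flatter both numbers unless the file is much larger than RAM:

    ```python
    # Rough micro-benchmark: sequential vs random 4 KiB reads from one file.
    # Results are only indicative; the page cache and the SSD's FTL both interfere.
    import os, random, time

    PATH = "testfile.bin"        # placeholder path
    FILE_SIZE = 256 << 20        # 256 MiB test file (small enough for a quick run)
    BLOCK = 4096
    N_READS = 20_000

    # Create the test file once.
    if not os.path.exists(PATH):
        with open(PATH, "wb") as f:
            for _ in range(FILE_SIZE // (1 << 20)):
                f.write(os.urandom(1 << 20))

    def bench(offsets):
        with open(PATH, "rb", buffering=0) as f:
            start = time.perf_counter()
            for off in offsets:
                f.seek(off)
                f.read(BLOCK)
            return time.perf_counter() - start

    seq = [i * BLOCK for i in range(N_READS)]
    rnd = [random.randrange(0, FILE_SIZE // BLOCK) * BLOCK for _ in range(N_READS)]

    print("sequential:", bench(seq))
    print("random:    ", bench(rnd))
    ```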
  • zokier
    The author could have started by surveying the current state of the art instead of just falsely assuming that DB devs have been resting on their laurels for the past decades. If you want to see a (relational) DB for SSD, just check out stuff like myrocks on zenfs+; it's pretty impressive stuff.
  • pmontra
    A tangent:

    > Companies are global, businesses are 24/7

    Only a few companies are global, so only a few of them should optimize for that kind of workload. However, maybe every startup in SV must aim at becoming global, so probably that's what most of them must optimize for, even the ones that eventually fail to get traction. 24/7 is different, because even the customers of local companies, even B2B ones, might feel like doing some work at midnight once in a while. They'll be disappointed to find the server down.
  • adsharma
    Re: keeping the relational model

    This made sense for product catalogs, employee/dept and e-commerce types of use cases. But it's an extremely poor fit for storing a world model that LLMs are building in an opaque and probabilistic way.

    Prediction: a new data model will take over in the next 5 years. It might use some principles from many decades of relational DBs, but will also be different in fundamental ways.
  • ksec
    It may be worth pointing out that the current highest-capacity EDSFF drives offer ~8PB in 1U. That is 320PB per rack, and current roadmaps point, in 10 years' time, to 1000+ PB or 1EB per rack.

    Designing databases for SSD would still go a very, very long way before what I think the author is suggesting, which is designing for the cloud or the datacenter.
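    Back-of-the-envelope for the per-rack figure, assuming roughly 40 usable rack units (the 40 is an assumption, not from the comment):

    ```python
    # Capacity per rack from per-1U density (numbers from the comment, plus one assumption).
    pb_per_1u = 8        # ~8 PB per 1U of EDSFF drives
    usable_u = 40        # assumption: ~40 of a rack's units hold storage
    print(pb_per_1u * usable_u, "PB per rack")   # -> 320 PB
    ```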
  • ljosifov
    Not for SSD specifically, but I assume the compact design doesn't hurt: duckdb saved my sanity recently. Single file, columnar, with builtin compression I presume (given the columnar layout, even the simplest compression may be very effective), and with $ duckdb -ui /path/to/data/base.duckdb opening a notebook in the browser. Didn't find a single thing to dislike about duckdb, as a single user. To top it off, afaik it can be zero-copy 'overlaid' on top of a bunch of parquet binary files to provide SQL over them? (Didn't try it; would be amazing if it works well.)
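    A minimal sketch of that parquet overlay from Python, assuming the duckdb package is installed; the paths and column names below are made up:

    ```python
    # Sketch: querying parquet files in place with DuckDB (hypothetical paths/columns).
    import duckdb

    con = duckdb.connect("base.duckdb")   # or duckdb.connect() for an in-memory DB

    # read_parquet() scans the files where they are; globs work too.
    rows = con.execute("""
        SELECT some_column, count(*) AS n
        FROM read_parquet('/path/to/data/*.parquet')
        GROUP BY some_column
        ORDER BY n DESC
    """).fetchall()
    print(rows)
    ```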
  • firesteelrain
    At first glance this reads like a storage-interface argument, but it’s really about media characteristics. SSDs collapse the random-vs-sequential gap, yet most DB engines still optimize for throughput instead of latency variance and write amplification. That mismatch is the interesting part.
  • exabrial
    > Commit-to-disk on a single system is both unnecessary

    If you believe this, then what you want already exists. For example: MySQL has in-memory tables, but also this design pretty much sounds like NDB. I don’t think I’d build a database the way they are describing for anything serious. Maybe a social network or other unimportant app where the consequences of losing data aren’t really a big deal.
  • londons_explore
    Median database workloads are probably doing writes of just a few bytes per transaction, i.e. 'set last_login_time = now() where userid=12345'.

    Due to the interface between SSD and host OS being block based, you are forced to write a full 4k page. Which means you really still benefit from a write-ahead log to batch together all those changes, at least up to page size, if not larger.
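    A rough sketch of that arithmetic; the 16-byte update and the 100-transaction group-commit batch are made-up numbers:

    ```python
    # Why tiny writes want batching: each update touches a few bytes, but the
    # device only accepts whole 4 KiB pages.
    PAGE = 4096
    update_bytes = 16                     # e.g. one timestamp plus row overhead (made up)

    # Unbatched: one full page write per transaction.
    waste_unbatched = PAGE / update_bytes                 # 256x write amplification

    # Batched into a WAL: group-commit 100 transactions as one sequential log write.
    batch = 100
    log_bytes = batch * update_bytes                      # 1600 bytes
    pages = -(-log_bytes // PAGE)                         # ceil -> 1 page
    waste_batched = pages * PAGE / log_bytes              # ~2.6x

    print(waste_unbatched, waste_batched)
    ```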
  • Havoc
    I'm a little bit surprised enterprise isn't sticking with Optane for this. It's EoL tech at this point, but it'll still smoke top-of-the-line NVMes for small QD1 I/O, which I'd think you'd want for some databases.
  • hyperman1
    Postgres allows you to choose a different page size (at initdb time? At compile time?). The default is 8K. I've always wondered if 32K wouldn't be a better value, and this article points in the same direction.
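    For what it's worth, the block size is fixed when the server is built (the configure option --with-blocksize takes a value in kB, default 8), so an existing cluster can only report it. A small sketch of checking it, assuming a reachable local server and the psycopg2 driver:

    ```python
    # Sketch: report the page size of a running PostgreSQL server.
    import psycopg2

    conn = psycopg2.connect("dbname=postgres")       # assumes a local default setup
    with conn.cursor() as cur:
        cur.execute("SHOW block_size;")              # read-only; chosen at build time
        print("block_size:", cur.fetchone()[0], "bytes")   # default: 8192
    conn.close()
    ```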
  • PunchyHamster
    > WALs, and related low-level logging details, are critical for database systems that care deeply about durability on a single system. But the modern database isn’t like that: it doesn’t depend on commit-to-disk on a single system for its durability story. Commit-to-disk on a single system is both unnecessary (because we can replicate across storage on multiple systems) and inadequate (because we don’t want to lose writes even if a single system fails).

    And then a bug crashes your database cluster all at once, and now instead of missing seconds you miss minutes, because some smartass thought "surely if I send the request to 5 nodes, some of it will land on disk in the reasonably near future?"

    I love how this industry invents best practices that are actually good, and then people just invent badly researched reasons to... not do them.
  • ritcgab
    SSDs are more of a black box per se. The FTL adds another layer of indirection, and FTLs are mostly proprietary and vendor-specific. So the performance of SSDs is not generalizable.
  • gethly
    > I’d move durability, read and write scale, and high availability into being distributed

    So, essentially just CQRS, which is usually handled at the application level with event sourcing and similar techniques.
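    A toy sketch of that split, with invented names: the write path only appends immutable events, and read models are folded from the log instead of mutating a table in place:

    ```python
    # Toy event-sourcing / CQRS sketch (all names invented).
    from dataclasses import dataclass, field

    @dataclass
    class Event:
        kind: str
        payload: dict

    @dataclass
    class EventLog:
        events: list = field(default_factory=list)

        def append(self, event: Event):          # write side: append-only
            self.events.append(event)

    def balances_view(log: EventLog) -> dict:
        """Read side: fold the log into a query-friendly view; can be rebuilt anytime."""
        out = {}
        for e in log.events:
            acct, amount = e.payload["account"], e.payload["amount"]
            if e.kind == "deposited":
                out[acct] = out.get(acct, 0) + amount
            elif e.kind == "withdrawn":
                out[acct] = out.get(acct, 0) - amount
        return out

    log = EventLog()
    log.append(Event("deposited", {"account": "a1", "amount": 100}))
    log.append(Event("withdrawn", {"account": "a1", "amount": 30}))
    print(balances_view(log))   # {'a1': 70}
    ```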
  • dbzero
    Please give a try to dbzero. It eliminates the database from the developer's stack completely, by replacing the database with the DISTIC memory model (durable, infinite, shared, transactional, isolated, composable). It's built for the SSD/NVMe drive era.
  • danielfalbo
    Reminds me of: Databases on SSDs, Initial Ideas on Tuning (2010) [1]

    [1] https://www.dr-josiah.com/2010/08/databases-on-ssds-initial-...
  • dist1ll
    Is there more detail on the design of the distributed multi-AZ journal? That feels like the meat of the architecture.
  • raggi
    It may not matter for clouds with massive margins, but there are substantial opportunities for optimizing wear.
  • ghqqwwee
    I’m a bit disappointed the article doesn’t mention Aerospike. It’s not an RDBMS but a KV DB commonly used in adtech, and extremely performant for that use case. Anyway, it’s actually designed for SSDs, which makes it possible to persist all writes even when the NIC is saturated with write operations. Of course the aggregate bandwidth of the attached SSD hardware needs to be faster than the throughput of the NIC, but not by much; there’s very little overhead in the software.
  • sscdotopen
    Umbra: A Disk-Based System with In-Memory Performance, CIDR'20

    https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf
  • toolslive
    but... but... SSDs/NVMes are not really block devices. Not wrangling them into a block-device interface but using their full set of features can already yield major improvements. Two examples: metadata and indexes need smaller granularities compared to data, and an NVMe can do this quite naturally. Another example is that data can be sent directly from the device to the network, without the CPU being involved.
  • sreekanth850
    Unpopular opinion: databases were designed for 1980s-90s mechanics, and the one thing that never innovates is the DB. They still use B-trees/LSM trees that were optimized for spinning disks. The inefficiency is masked by hardware innovation and speed (Moore's Law).
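    For context, the write path that made LSM trees attractive on spinning disks (buffer updates in memory, then flush one sorted run sequentially) looks roughly like this toy sketch; all names and thresholds are invented:

    ```python
    # Toy LSM-flavored store: in-memory buffer, flushed as sorted sequential runs.
    class TinyLSM:
        def __init__(self, flush_threshold=4):
            self.memtable = {}             # recent writes, buffered in memory
            self.runs = []                 # flushed sorted runs, newest last
            self.flush_threshold = flush_threshold

        def put(self, key, value):
            self.memtable[key] = value
            if len(self.memtable) >= self.flush_threshold:
                # One sequential write of a sorted run instead of many in-place updates.
                self.runs.append(sorted(self.memtable.items()))
                self.memtable = {}

        def get(self, key):
            if key in self.memtable:
                return self.memtable[key]
            for run in reversed(self.runs):   # newest flushed run wins
                for k, v in run:
                    if k == key:
                        return v
            return None

    db = TinyLSM()
    for i in range(6):
        db.put(f"k{i}", i)
    print(db.get("k2"), db.get("k5"))   # 2 5
    ```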