I built a storage engine in Go (from scratch without any AI), here's the entire process documented.
I spent the last 2 months building a storage engine from scratch understand how storage engines actually work. To respect the rules of this subreddit, we will not discuss the features ΓΈr benchmarks, for that you should read the repo yourself. To understand how I came up with the bits and pieces to put together this project, follow along. So here's my entire journey. Two things: - I built this project because of a recent interest in low-level and systems programming, that combined with my general affinity towards stateful systems made this project an obvious choice. - I see a lot of cool projects (on various social platforms) and when I attempt to read their code, it's obvious that it's written by AI. I intend to share my thought process here because I want to spread awareness that it's very much possible to build something like a storage engine using your first principles intuition. ---------------- Let's begin: 1 . Given all the knowledge I had and using first principles thinking, I setup an in-memory KV store but then it's obvious I had to make it persistent. I added persistence by writing a single line to the file every time someone made a PUT request. The file now had a bunch of {key: "hello", value: "world"}\n. So during startup, I would read all these lines and recover everything into in-memory. 2 . For me, at this point that's all the upfront knowledge I had. So I asked some very basic questions: > How would I recover the entire file into memory on startup? At some point it just wouldn't be possible because the file is growing unbounded. This means that I must not load everything in-memory and instead access the file directly > But then if I read every line top to bottom on every GET then my latency would be literally obliterated? This means I must somehow efficiently query the file. I came up with a solution, I created files based on alphabets, all keys with prefix A will end up in file A, all keys with prefix B will end up in file B and so on. By first principles, it broke down the problem space but still suffered from the problem: scanning all keys and also added another one: multiple files. The first win of breaking down the problem space felt good so I kept treading this path. And then out of the two mini-problems, I decided to fix the "multiple files" one. So after some thinking, I decided everything will be in a single file. At the very top I will have a line which tracks the offsets for every prefix: {A: 65, B: 300...} and so I can make my queries efficient. Now the main issue still remains, I am still querying everything all the lines top-to-bottom to find any key, even if the search space had reduced. Plus who's to say keys can only ever begin with an alphabetical prefix π. So I had to thought of another way. Now at the same time I was reading a lot about databases on the Internet. A lotttttttt. And anyone who has even tried to read up something about databases, knows the first thing that jumps at you: "BTree v/s LSM Tree". And given this I realised, "oh wait I studied BST (binary search tree) before, it allows me to search for any key and it will only take O(logn) because half of the search space is already reduced". That genuinely felt like a eureka moment. So I was all excited and fired up. I opened my laptop to code it all up. But I sat there and I couldn't write anything. A weird question came up: how do you even persist a BST on disk? Like I only ever knew you could write strings to disk, but how do I convert a BST to a string. So I went down a rabbithole, but the wrong one. I kept searching for how do you convert a BST to string and convert it. You know you could use first principles here as well: convert data nodes of BST to a tree-like format in string and then attempt to parse it back to a BST (I know I might be sounding dumb saying this). But it's a very very very obvious design choice and a very fatal choice as well. But imagine, this is similar to what few of the resources suggested. But then you should have questions, you should always have questions: > While parsing, would you read every single data item and insert it back to the BST? No. That would be so horrible for performance. > And with this approach, you would be dealing with the BST in its entirety, that would literally defeat the entire purpose of using the data structure. Why should I not use my original approach then? > So I needed to backtrack a little. So we're back at the same idea again, how do you persist a BST to a disk? And after some thought and exploration work (aka Googling), I realised the right question is: how do you persist a data structure to disk? And yet again, people started giving examples of how "BTree" is persisted to disk and I am like wth man, again. At this point, I absolutely had to study what's so impressive about a BTree. And let me tell you, it's not just impressive, it's beautiful. It's a piece of art. For anyone interested ( https://www.youtube.com/watch?v=FgWbADOG44s ) > BTree is meant to be persisted to disk and it's designed in a way such that traversing it on disk would not thrash your performance. > I must have spent almost 2 weeks understanding its semantics, cuz I genuinely believe building a BTree was far more mind-bending for me to build than Raft (it's a consensus algo, out of scope, just know it's pretty notorious to build as well) > This is the first time I setup the repo and started doing an in-mem implementation of BTree because I know how it worked before even porting it over to my storage engine. Okay, I understand BTree, but how do you persist one on disk π. This time I was asking the right questions. Because I found my answer. You never persist the data structure to the disk. You persist raw bytes to the disk and you assume it's a data structure and using a pre-decided format (aka the on-disk format / disk layout), you extract those raw and naked bytes one by one and perform operations on them. Pretty interesting, right? Absolutely. It took me another 2 weeks to build this one out. In hindsight, once I had this persistent version of a BTree done, I think a lot of my knowledge compounded for some reason. Not even knowledge, I guess the fear of hitting a wall just went out the door. I was comfortable being uncomfortable. Imagine putting in 10-12 hours every day just to get a persistent BTree up and running. Which itself took a whole month. It's brutal but I am so happy I took that decision of not using AI. But I had to change my game now. The few things: > I decided to use AI, not for the code but to help combat "I don't know what I don't know" issue. Like for example, knowing the internals of mmap would've taken me god knows how many weeks but AI was able to speedrun me through OS internals and help me understand what exact purpose is mmap solving. > I also decided to read database resources upfront. I mean was reading about databases, but not internals of it. And I did so because at the end of the day, I would be a fool trying to build everything from scratch. Like what extra knowledge would that provide? I already suffered and built a tough mindset through that 1 month ordeal. I didn't need to waste extra time on deriving things from scratch. But then I was also mindful not to ask AI to write the code for me. Only to understand the "whys", to help shorten down the time to understand a concept X from weeks to hours. Further features were pretty easy to build, because I got way too efficient at this and that I had the help of AI to understand the internals of it. > mmap: I was hitting a performance wall and I found an absolute goldmine of an article by ScyllaDB talking about different ways of I/O. I chose mmap because it seemed to bring me into deep internals of OS and I guess I developed some masochism by this point. > free-page list: My database size was upwards of 500MB due to copy-on-write semantics. I had to bring it down, the first principles thought was to re-use the abandoned pages. > page allocator: I had zero clue about this honestly. I was using the traditional read/write methods on the db file. Claude helped me point out this. It suggested using raw syscalls and since I had some experience from mmap, I realised fallocate. While looking for how to implement fallocate, I found an article that described "pre-allocating chunks upfront" to prevent allocating pages on every write. DISCLAIMER 1: I am an engineer with 3-4 YoE (so a pre-LLM engineer). I had always been this way in my entire programming journey, to want to know the WHYs of stuff I am learning. So I guess a lot of my mental models and emotional capabilities transferred over here. It's not my first project. It's not my first time writing Go. A lot of stuff also might have originated due to the fact that I read distributed systems papers a lot (as I mentioned I have a strong affinity for storage systems). DISCLAIMER 2: I had absolutely used tons of AI but the post would be too long to describe everything here. I had zero clue about mmap, but I had read it in the GFS paper but I didn't know how it worked internally. It was a 7 hour-ish study session with Claude to understand OS, kernel, mmap and what not. My OS fundamentals aren't on a crazy level. That's all from my side. I hope you liked this breakdown. BUT BUT BUT I wouldn't recommended anyone doing this. And I wouldn't do this myself ever again. I don't know why I went down this path. But it would be far more beneficial if I took a top-down approach meaning looking at production grade systems and why they have done it the way they have done it -- and then moving from there. You could learn way faster and way more that way. submitted by /u/mxdvf [link] [comments]
No comments yet.