Research blog: Holes in my Raft

Dec 2, 2023

I decided to stop doing the homework for CS 6.824: Distributed Systems. My implementation of the consensus algorithm Raft had too many holes in it, and I got tired of shoveling out water.

I may come back to it, but for now there are other things I would like to spend development time on. I still plan to read the research papers and watch the lectures for the rest of the course. I am also starting to read papers on LLM infrastructure. My friend suggested I start with NVIDIA’s Megatron paper.

The question now becomes: what do I spend development time on? I think there are two areas of interest: 1) LLM training and serving and 2) real-time systems. The LLM training/serving work dovetails nicely with the presentation I plan to give in February, and in reading the first few bits of Megatron, I am reminded how much I like and miss NLP. ❤ My current team manages the serving infrastructure on Capital One’s ML platform, so I’ve been more removed from ML for the last few months, and it’s definitely fun to re-engage! I think I will also get a paper dump from my friend who works at DeepMind on which papers are relevant to LLM training.

Looking at development avenues in LLM training, here is my attempt to brainstorm some ideas:

Implement the distributed training approaches outlined in Megatron and other papers
Try again to implement TorchElastic, which is a fault-tolerant training paradigm that leverages Kubernetes
Deploy/serve an LLM in connection with a real-time data pipeline/system
This could be related to my previous idea of an audio blog, where I call a language model to process and transcribe an audio file when it is uploaded to an S3 bucket. This is nice because it combines LLM serving with an event-driven design. I could also try to use Terraform or another infrastructure-as-code tool to configure this setup. One thing that is hard about side projects is that you want to tear them down once they are finished. I don’t want to have a million little services on AWS racking up charges, but it would be nice to come back to things in the future, and having infrastructure-as-code files would make them easier to re-provision.
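
To make the event-driven part of the audio blog idea a bit more concrete, here is a rough Python sketch of what a Lambda-style handler for the upload event might look like. It only relies on the documented shape of an S3 event notification (the Records list with bucket name and object key). Everything else is an assumption for illustration: the names audio_objects and handler, the list of audio suffixes, and the transcribe hook, which stands in for whatever model call would actually be made.

```python
from urllib.parse import unquote_plus

# Assumption: the audio formats the blog would accept.
AUDIO_SUFFIXES = (".mp3", ".wav", ".m4a")


def audio_objects(event):
    """Yield (bucket, key) for each audio file named in an S3 event notification."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in notifications (spaces arrive as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.lower().endswith(AUDIO_SUFFIXES):
            yield bucket, key


def handler(event, context=None, transcribe=None):
    """Lambda-style entry point.

    `transcribe` is a hypothetical hook for the model call; the default is a
    placeholder so the sketch runs without any cloud or model access.
    """
    if transcribe is None:
        transcribe = lambda bucket, key: f"<transcript of s3://{bucket}/{key}>"
    # Map each uploaded audio key to its transcript.
    return {key: transcribe(bucket, key) for bucket, key in audio_objects(event)}


if __name__ == "__main__":
    # Hypothetical upload event with one audio file and one non-audio file.
    sample_event = {
        "Records": [
            {"s3": {"bucket": {"name": "audio-blog"},
                    "object": {"key": "posts/my+first+post.mp3"}}},
            {"s3": {"bucket": {"name": "audio-blog"},
                    "object": {"key": "notes.txt"}}},
        ]
    }
    print(handler(sample_event))
```

Passing the transcription step in as a function keeps the event parsing testable locally, and the real model call could be swapped in later without touching the handler.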

I think this last idea seems fun and does involve some real-time infrastructure.

To type a bit about my real-time interest, I watched a video on Shein, the fast-fashion retailer, and I was impressed by how they leverage real-time data from inside their app. Part of my team works on the real-time serving platform, so I have some exposure to real-time data. And as I think about myself as a more general engineer, I think getting experience building real-time systems would be valuable.

As an aside, I’ve had a lot of success reading research papers while walking around. I find walking and reading really easy, which is great because I find reading while sitting down kinda difficult.

Written by Dustin Zubke