
Kaizen! Let it crash (Friends)
The Changelog: Software Development, Open Source
1hr 41min Jan 17, 2026
Gerhard is back for Kaizen 22! We’re diving deep into those pesky out-of-memory errors, analyzing our new Pipedream instance status checker, and trying to figure out why someone in Asia downloads a single episode so much.
Changelog++ members save 6 minutes on this episode because they made the ads disappear. Join today!
Sponsors:
- Namespace – Speed up your development and testing workflows using your existing tools. (Much) faster GitHub actions, Docker builds, and more. At an unbeatable price.
- Depot – 10x faster builds? Yes please. Build faster. Waste less time. Accelerate Docker image builds, and GitHub Actions workflows. Easily integrate with your existing CI provider and dev workflows to save hours of build time.
- Squarespace – A website makes it real! Use code CHANGELOG to save 10% on your first website purchase.
Featuring:
- Gerhard Lazu – Website, GitHub, LinkedIn, X
- Jerod Santo – Website, GitHub, LinkedIn, Mastodon, X
- Adam Stacoviak – Website, GitHub, LinkedIn, Mastodon, X
Show Notes:
- Kaizen 22 discussion #554
- Tw93/mole: 🐹 deep clean and optimize your mac.
- Stuff Goes Bad: Erlang in Anger
- Abacus.ai - the world’s first super assistant for professionals and enterprises
Something missing or broken? PRs welcome!
Adam Stacoviak
How else would you learn? Let it crash.
Gerhard Lazu
Exactly. The best things happen when things fail... \[laughter\] Seriously. If it's in a controlled way, right? I think that's something which isn't said. It's implied. It has to be a controlled failure, where you have the boundary, and things will not blow up. I mean, they'll blow up, but in the fireworks sort of way, where it's a controlled explosion.
Adam Stacoviak
Yeah.
Jerod Santo
Right. Tiny little crashes to learn from. Welcome, everyone, to Kaizen \#22, with the incomparable Gerhard Lazu... He's here to let us know how he lets it crash. It's that song, "Let it snow, let it snow, let it snow", only - you know how to replace... Hey Gerhard, how are you?
Gerhard Lazu
Hey, Jerod. I'm good, thank you. Thank you. I had a great holiday. It was a great couple of weeks where I've managed to finally disconnect. It's been, I don't know, 20 years since I had two weeks completely off...
Jerod Santo
Nice.
Gerhard Lazu
Even my holidays are only a week. So this was very different, very enjoyable, and I feel so refreshed... So I'm firing on all cylinders.
Jerod Santo
You unplugged, and now you're plugged back in.
Gerhard Lazu
Pretty much.
Adam Stacoviak
Plug it in.
Gerhard Lazu
I stopped it, and I started it, and it's brand new.
Adam Stacoviak
It's Glade, man. I'm Glade over here, man. Plug it in, plug it in. You know what I'm saying? Smell the scent, the fresh New Year's scent called 2026...
Jerod Santo
Some people are going to say this is going to be the best year ever. I've heard it said. What do you think, Gerhard?
Adam Stacoviak
They keep saying that, and I'm excited about them.
Gerhard Lazu
They said that about 2020. \[laughter\]
Jerod Santo
2020... We have to admit, it was off to a killer start. I mean, it was really going well.
Gerhard Lazu
Right. Pun intended, killer start...
Adam Stacoviak
What happened in 2020?
Gerhard Lazu
It was COVID. Pun intended? Killer start? That was 2020. 2020 was the year of COVID, and everyone's "Oh, this is going to be the best year ever", and then we had three years of misery. So I think --
Jerod Santo
Put it behind us.
Gerhard Lazu
I just want an easygoing year. You know what I mean? Last year, 2025, 1st of January, we were building shelves. We were redoing studies and whatnot... And the whole year was full on. Like, it was nonstop. Every week there was something significant happening. And this year we would love for it to be a bit more chill, maybe a bit more meaningful... So that's what we're thinking. But how about you, Adam? How are your holidays?
Adam Stacoviak
My holidays were filled with barbecue, and good times.
Gerhard Lazu
Wow. Even in winter... So barbecue never stops. It doesn't know seasons.
Adam Stacoviak
It never stops in Texas. Actually, just to shower you all with a few of my picks from my most recent barbecue adventures... If you're in Zulip, go to the general channel, look for barbecue with three bangs after it, because - why do one bang when you can do three?
Jerod Santo
Bang, bang, bang.
Adam Stacoviak
Some recent ribs... My gosh, my ribs method is on point, my spatchcock chicken method is on point... No one is disappointed at my barbecue joint.
Gerhard Lazu
Very nice. Look at that, we're going to add some meat on this slide... That's what happened in real time.
Jerod Santo
Wow. Real-time meat added. This is -- this is intense.
Gerhard Lazu
Yeah. And again, just to be clear, it's Adam's barbecue. Okay? So, joking aside, we're talking about barbecue.
Adam Stacoviak
Well...
Gerhard Lazu
I think we have to leave it there... \[laughs\]
Jerod Santo
Let's move on.
Gerhard Lazu
I think we have to leave it there.
Adam Stacoviak
I didn't show a burger, but I do make a mean burger, too. Thank you, Gerhard, for assuming that is something I do rock really good. My smash burgers are on point.
Gerhard Lazu
Very nice, very nice. I'm looking forward to that. And so...
Adam Stacoviak
One day...
Gerhard Lazu
My favorite Christmas tree - this is what it looked like.
Adam Stacoviak
Oh, yes.
Jerod Santo
Hm... What is that?
Gerhard Lazu
And for those that are listening, it's a networking cabinet. There's lots of blue lights flashing. This is happening in the loft... You have many terabits of network throughput. There's some switches, there's UniFi, there's Mikrotik... This is maybe five years in the works, and every Christmas I take time to improve it little by little. So this year I went really crazy. I redid the whole thing: the DHCP, the networks, the VLANs... Man, it's beautiful.
Adam Stacoviak
Your VLANs are beautiful...
Gerhard Lazu
They are. They are.
Adam Stacoviak
I want to be a guest on your network, man. I'm going to get blocked from everything, okay?
Gerhard Lazu
Well, well, there's a big story happening in the background, and it is going to be -- I think this will be amazing. This will be the best network that I have run in my life... But the blue, and the darkness, and it's like -- that was one more Christmas tree in our house, and this was it, where I would just go and tinker for a few hours in between the Christmas dinner and all the Christmas festivities... So it was nice just to spend a bit of time tinkering with hardware. And I'm sure that many of you listening, when it comes Christmas time, when things start quieting down, you get the little projects that you didn't have time for throughout the year, and then you have some fun. So I'm wondering, did any of you do anything fun this Christmas? ...but nerdy fun, that's what I mean by that.
Adam Stacoviak
Nerdy fun... Well, I got upset with something...
Jerod Santo
That's not fun.
Adam Stacoviak
\[00:07:47.23\] ...and so I decided to just let it roll. You know what I'm trying to say? I got upset with the amount of RAM usage on my machine... And while I liked the application, I was like "You know what? I'm just kind of tired of having four gigs--" I think it was -- no, it was like 1.2 gigs of RAM being used by CleanMyMac... Fancy little utility application, helps you tune, and pay attention, and stuff like that... And I decided to remake it, and that was it. So I remade it. It's called Mac Tuner. I know there used to be a MacTuner.com, which was, I think, a Mac magazine, I believe... But Mac Tuner fit. I might change it, who knows... But for now it's called Mac Tuner. It does all the things, all the things. Analyze, clean up, uninstall... And not just that fake uninstall; the real one, where you get the dirty dirties out. You know what I'm saying, the dirties? All the dirties are out, okay?
Gerhard Lazu
My mind is on the dirty burger that you mentioned earlier... \[laughter\]
Adam Stacoviak
Yeah. I mean, that's about as nerdy as I can get. I mean, I made a little utility that's for me for now. Soon to be open source, though; soon to be.
Gerhard Lazu
It will be soon.
Adam Stacoviak
Yeah.
Jerod Santo
Very nice.
Adam Stacoviak
I mean, why not, right? Share with the world.
Jerod Santo
Well, I didn't create a Mac Tuner, but I've found one. I also was thinking, CleanMyMac - how long am I going to run this thing? And the answer is "As long as I ran it, because I'm done now." I found a tool called Mole, M-O-L-E, which is a command line macOS cleaner that does everything. So maybe you've got some competition here, Adam. Maybe you can come out and throw some blows down, like "Here's why I'm better than Mole." It's got a TUI, it's all command line-based, it does cleaning, optimizing, uninstalling, DaisyDisk, Explorer...
Adam Stacoviak
Oh, gosh.
Jerod Santo
...all from -- yeah.
Adam Stacoviak
I'm feeling it. I'm feeling intimidated over here, okay?
Jerod Santo
You're starting to sweat?
Gerhard Lazu
I think he just changed his mind about open-sourcing it.
Jerod Santo
Here's your domain name idea, Adam. \[unintelligible 00:09:45.15\]
Adam Stacoviak
That's good. I could do that.
Jerod Santo
So I've been using that, and I'm very excited, because who doesn't want to just have all the things right there in their command line? And I didn't spend any tokens on it. Adam's got some tokens involved, but his also works the exact way he wants it to.
Adam Stacoviak
Yeah. Yeah, absolutely. Mine leverages some Recast stuff as well. It's kind of cool.
Jerod Santo
Sweet. Open source that sucker.
Adam Stacoviak
One day.
Jerod Santo
Which day is that?
Gerhard Lazu
Not today. \[laughter\] Definitely not right now.
Jerod Santo
But it's going to be one day...
Adam Stacoviak
One day. There's a bigger launch awaiting, is all I'll say. There's a bigger launch awaiting till I'm going to open-source some things.
Gerhard Lazu
I've been using AppCleaner for many, many years... Now, there's no TUI, there's no CLI. It's just a regular app. It's a really old one.
Adam Stacoviak
But you just drag and drop onto it, right?
Gerhard Lazu
Pretty much, yeah. And you also have a list of applications... But it's so old that it's difficult to find it these days, and it hasn't been updated in a very long time... So I will check Mole out.
Jerod Santo
Mole's really cool. Brew-install Mole and you're done. So you can check it out right here while we're talking. And I liked AppZapper... And I think AppZapper doesn't exist anymore, but the cool thing about that was that it would literally make the zap sound, as it -- yeah. You drop your app on it and it zapped it. And I just liked that sound.
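For anyone following along at home, the install Jerod describes should be roughly the following - the formula name is assumed from what's said here, and the Tw93/mole README may require a tap:

```sh
# Install the Mole macOS cleaner via Homebrew
brew install mole
```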
Gerhard Lazu
That's the only feature that your application needs to have, Adam. If it zaps...
Jerod Santo
Mole does not zap, so there you have it.
Adam Stacoviak
"Make it zap" is our tagline, actually. Make it zap.
Jerod Santo
Make it zap.
Gerhard Lazu
There you go. I think that's a very good debate, actually.
Adam Stacoviak
What about you, Gerhard, besides your Christmas tree? Did you...?
Gerhard Lazu
I will come back to that. I will come back to the Christmas tree, yeah.
Jerod Santo
This guy's got stories, man.
Gerhard Lazu
Oh, man. Oh, yes. I have to tease them and be very disciplined, because there's too much stuff. So I have to be very careful, because it will be an hour and I will not shut up talking about this thing. I mean, it's just like -- anyway. So we will come back to that, I promise.
Adam Stacoviak
Okay.
Gerhard Lazu
\[00:11:45.18\] Last time, when we finished Kaizen \#21, this was one of the last thoughts that we shared, which is what's next. So BAM... Remember BAM? That happened live. OOM crashes, out of memory crashes, and a bunch of other things. The good news is that only one thing happened. OOM crashes...
Jerod Santo
You've only got one thing to talk about... \[laughs\]
Gerhard Lazu
...but this rabbit hole is really, really deep.
Jerod Santo
Okay. Alright. Take us down the rabbit hole. The OOM, out of memory.
Gerhard Lazu
Who remembers this book? Erlang in Anger.
Jerod Santo
Erlang in Anger.
Gerhard Lazu
Stuff Goes Bad, by Fred Hebert. Ferd.ca.
Jerod Santo
Now, I remember "Learn You Some Erlang for Great Good", but I do not remember this one in particular. So I'm not sure why the other one hit my radar, because he wrote both of them, it seems... But when did this one come out?
Gerhard Lazu
So this one, if I look -- I just switched to the browser... 2016, 2017, while he was still at Heroku. Remember Heroku? Those were the days.
Jerod Santo
I do.
Gerhard Lazu
So about 10 years ago. And Fred -- I mean, if you don't know his blog... It's just amazing. I'll just click it very quickly, just to have a look... I think it's one of the best blogs out there. There's so much goodness here. So much. But one of my favorites is queues, and queuing, and how queues don't protect from overload. So queues don't fix overload. And this is so relevant to today's conversation as well. But there's a lot of stuff in the Erlang ecosystem, and there's many, many things that Fred wrote over the years, that are so relevant to today.
So if I click on Download PDF - by the way, this is a... It's amazing this book is open source. You can download it, open source, freely available, Creative Commons license... And I'm going to make this a little bit bigger, so we can see what's happening. And if I search for "Let it crash", it's page number one. It's in the introduction.
Jerod Santo
There you go. Page one.
Gerhard Lazu
Page one. And this idea of "Let it crash" really comes from the Erlang ecosystem. It's very well renowned there because of how the Erlang VM works, and how all the processes, and the supervision trees just -- it was built this way. And we know a thing or two about Erlang, Jerod, right? ...because the application, Elixir, the Phoenix framework runs on the same principle.
Jerod Santo
I know a thing, and you know two, so that's how we get to a thing or two.
Gerhard Lazu
And Adam - I'm sure he knows the big one. But we don't know whether he's going to share it. The point is, when you think about "Let it crash", Jerod, from your development experience with Erlang, with Elixir, Phoenix - is there any situation, any moment where you could experience it and you realized "Huh, that's nice"?
Jerod Santo
When I let it crash?
Gerhard Lazu
When you let it crash.
Jerod Santo
Well, it's nice that the \[unintelligible 00:14:49.07\] seems to handle a lot of the problems with letting it crash. It just goes again, or there's a supervision tree, and things watching each other, and I don't have to think about it very much. I can't think of an instance in development where I was like "This is really useful", but I'm sure you could come up with one.
Gerhard Lazu
Yeah. So, you know, when we write code, we tend to write it very defensively... Typically try/catch. So you feel like you need to account for every single scenario. And the "Let it crash" philosophy is about not preventing failure, but learning from it. What that means is you need to have a context where it's safe for things to crash, and the overall system will still remain stable.
So how can you build a resilient system - and really, this is about resiliency - where the core of the system will remain running, and the system as a whole will remain running even though parts of it may experience failures? ...but those failures will not bring everything down. And that's really important. So fewer try/catch blocks, don't code defensively, let it crash, and separate the code that solves the problem from the code that fixes the failures. And the more you can lean into the framework, or the VM, or whatever you have - the system - to deal with failures, the better off you are to focus on the things that are unique to your application.
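To make that concrete for listeners: here's a minimal Elixir sketch of the shape Gerhard is describing - not the Changelog app's actual code, just a supervision tree where the failure handling lives outside the worker:

```elixir
defmodule Demo.Application do
  use Application

  def start(_type, _args) do
    # The supervisor is the "code that fixes the failures";
    # the worker below is the "code that solves the problem".
    children = [Demo.Worker]

    # :one_for_one restarts only the child that crashed, leaving siblings alone.
    Supervisor.start_link(children, strategy: :one_for_one, name: Demo.Supervisor)
  end
end

defmodule Demo.Worker do
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  @impl true
  def init(state), do: {:ok, state}

  # No defensive try/rescue here: bad input (say, dividing by zero)
  # crashes this process, and the supervisor restarts it in a clean state.
  @impl true
  def handle_call({:divide, a, b}, _from, state) do
    {:reply, a / b, state}
  end
end
```

Calling `GenServer.call(Demo.Worker, {:divide, 1, 0})` kills the worker process; the worker comes back fresh, and nothing else in the tree is disturbed.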
Jerod Santo
Yeah.
Gerhard Lazu
\[00:16:12.15\] And Erlang is well renowned for that.
Jerod Santo
Kind of the opposite philosophy that Go took, as I write some Go code and I write some Elixir code... Where with Go it's handle every error condition right after you potentially raise one, and make sure that there's no error. And if you're not dealing with it, then you're not writing robust software. And the other philosophy is "Let it crash and deal with it elsewhere." I think they're both legitimate, depending on what you're building.
Gerhard Lazu
Agreed. Well, in our case, we had a lot of crashes to deal with... \[laughter\]
Jerod Santo
Yeah, we're taking the Erlang style...
Adam Stacoviak
Oh, gosh...
Gerhard Lazu
So what we are going to have a look at is all the times that the Pipe Dream has been crashing since our last Kaizen. So since Kaizen 21, which was October 17th, we had a lot of crashes. And there's a certain property about the system - and this is Varnish specifically - that made these crashes pretty okay. And the property which I'm referring to is that when you start varnishd, the daemon, Varnish itself runs as a thread, and you have many, many threads that do different things. So when we had these out of memory crashes, all that happened was the thread was killed. Which means that the system as a whole didn't crash, the VM didn't, the Firecracker VM didn't crash... The application needed to restart. It was just a thread that was using too much memory, and it restarted within seconds, as in maybe two seconds, and everything was back to normal. Obviously, the cache was cold, but it was good. And that's why the memory looked a bit interesting, in that it doesn't release all the memory, the VM doesn't restart... There are no long hangs; it crashes and restarts really, really quickly. So that's a nice property.
Jerod Santo
Well, that confuses me. So how does Fly know about it then, if it's just happening inside of Varnish?
Gerhard Lazu
So it's looking at the process ID, "Which process uses the most memory?" And it's the same process that's asking for more memory. So basically, it will just send a signal to that process, and kill that process. But that is just a thread; that maps to a thread. So Varnish itself didn't crash; it's just a thread that maps to a process ID that crashed, and then it was restarted by the Varnish daemon.
Jerod Santo
Okay, so where is Fly involved in that? Because Fly is aware - I see all these Fly notices, and I get the Fly emails.
Gerhard Lazu
Right. So Fly is aware that there is a process on the machine that is using too much memory, and more memory is being requested. And then it looks like "Okay, which process do I kill?" And in this case, a process with the most memory will get shot, and will get killed.
Jerod Santo
So Fly as a platform can actually reach in and kill that process without killing the machine, rebooting the VM, or Firecracker, or whatever?
Gerhard Lazu
So the Fly platform - it integrates with that functionality, which is a kernel thing; it's a Linux functionality. That's why an out of memory crash would happen even if you had a single machine; something is using too much memory, you don't have any swap... How do you give out more memory when there's no memory left, and the system is becoming unstable? So then you get just a single process which gets killed. In Fly's case, they surface that. They surface the fact that there was an out of memory crash, there was an out of memory event, and they send you an email when that happens. It doesn't mean that the machine had to restart, it doesn't mean that it stopped serving traffic... It just means there was something that just had to go away, because it was using too much memory.
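As a side note for the curious: this is the standard Linux OOM killer at work, and you can peek at its candidate list on any Linux machine. A small illustrative sketch - nothing Fly-specific:

```sh
# Print each process's OOM badness score (highest first); the top entry
# is the likeliest victim the next time the kernel runs out of memory.
for p in /proc/[0-9]*; do
  printf '%s %s\n' "$(cat "$p/oom_score" 2>/dev/null)" "$(cat "$p/comm" 2>/dev/null)"
done | sort -rn | head
```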
When I say "too much memory", obviously it's a bit more complicated than that, because something was asking for memory, the kernel didn't have any more memory to allocate, so it just had to look at what needs to be killed so that it can allocate more memory... Because something is using too much memory. And it just so happens it would be this process, and this thread. So how many crashes do you think the Pipe Dream had since Kaizen 21, since October? So we're talking about three months, maybe a bit more than that...
Jerod Santo
\[00:20:16.01\] So Gerhard has presented us a multiple choice quiz. A is 20, B is 40, C is 80, D is 160. Now, I know that I personally receive an email every time this happens, and so I have a little bit of a feeler into this. I delete them, so I can't go do a quick search. Adam, do you get emails when these Fly things crash?
Adam Stacoviak
I don't.
Jerod Santo
Okay, good for you.
Adam Stacoviak
Not to my knowledge. And if I do, they're in a box that doesn't get looked at.
Jerod Santo
You've been saving on some email bandwidth...
Gerhard Lazu
I do know, because we send the email... So let's go back to this one. If I click on this one... Let's take this one, and you can see everyone that gets an email. I'm just going to make this a little bit bigger, so you can see: it's Jerod, Adam and Gerhard.
Jerod Santo
Oh, you do get it.
Adam Stacoviak
I do get it.
Gerhard Lazu
Yeah. So there must be a filter...
Jerod Santo
He just doesn't look at it.
Adam Stacoviak
A superhuman saving me. Nice...
Gerhard Lazu
That's okay. So what do we think?
Jerod Santo
Good thing other people are looking at it...
Gerhard Lazu
It's not an Adam problem. That's the thing. So that's a good thing. He's doing the right thing. He's just saving his inbox for more important messages.
Jerod Santo
They ran an LLM on that and set it to the side. So I feel like 160 is too many. I don't think I've gotten 160 emails since October on this particular thread. 20 feels not enough. I've certainly got more than 20 emails. So I'm between 40 and 80, and I'm going to think that -- gosh, that's a tough one. I'm going to go with 40. Adam, what do you think?
Adam Stacoviak
I'd go with 40 as well.
Jerod Santo
Oh, I got it. Yes...!
Gerhard Lazu
43 exactly.
Adam Stacoviak
The price is right.
Jerod Santo
The price is right. Alright, cool.
Gerhard Lazu
Yeah.
Jerod Santo
43 crashes from October to December; through the end of the year.
Gerhard Lazu
Yeah. And then obviously, there were periods when we had quite a few. So if we were to think about what could be happening in Varnish such that it's running out of memory and crashing... So this is us trying to think about the sort of traffic that we serve, trying to think about everything -- I mean, now we see every single request that hits Changelog, the CDN as well... And it's a lot of requests.
Jerod Santo
Yeah.
Gerhard Lazu
So there was something in the system that was using way too much memory, and as a result, the process - or the thread in this case - was crashing.
Jerod Santo
I mean, I could guess it, but I might even have some insight. So... Should I just say it, or do you want Adam to guess? I mean, my guess based on - also I saw some emails flying through, but... Already I would have suspected that we just have too many large files. These 60 to 80 to 100 megabyte MP3 files loaded into memory, flying every which direction... And you just can't load up that much memory without some sort of fancy freeing mechanism. And it's just trying to hold all these MP3s in RAM, I think, and it just can't do it. So that's my guess.
Gerhard Lazu
Yeah. That was a good guess. And I think -- the next question is going to be to the audience, because we know too much.
Jerod Santo
How are they going to answer it? It's not real time.
Gerhard Lazu
Well, just think about it... We will give some time for people to think.
Jerod Santo
Okay. We'll do like a delay here. So if they have a -- what's it called? The feature where you skip silences on... They're not gonna have any time to think about this.
Gerhard Lazu
Right. Okay.
Jerod Santo
So quickly, turn that feature off, give yourself some time to think... Go ahead.
Gerhard Lazu
Yeah. Or pause. We can also say pause. Now is a good time to pause. And then - what could be the problem? So you're right - all those large files. We had all the MP3 files; many, many MP3 files. They're large. All trying to be cached in memory. And that was a problem. So what is many? Well, at this point we have thousands of MP3 files, across all the podcasts, since the beginning of time. Large means anywhere from 30-40 megabytes, to 100+ megabytes. So that's -- I mean, just think, if you had to load a thousand files of 100 megabytes each... That's a lot of memory that you need to have available.
\[00:24:30.17\] And the problem is that once you store these large files, as we discovered, you get memory fragmentation. Imagine that you have all the memory available, you keep storing all these files, and at some point there's no more memory left. So what do you do? Well, you need to see what you can evict from memory, so that you can store the new file. So imagine that you evict a few of those objects, but maybe they aren't big enough, and you haven't evicted them fast enough. So then you have this big file that can't fit anywhere, because the holes that you have in memory aren't big enough for this file to fit. And there's no defragmentation or anything like that running in the background... Which means that even though technically you kind of would have space in the memory, for the specific files you may not. And then it can't be stored in memory. Now, the thing in Varnish is actually called - I kid you not - n\_lru\_nuked.
Jerod Santo
Nice.
Gerhard Lazu
So I think the connection to the nuke and to the book, and to "Let it crash" is right there. So lru\_nuked, basically - it's like a forced eviction. So it's an event where an object has to be evicted from the cache just to make room for a new one, because the storage is full. So you can see how many times this has happened. And that's like an important metric that if we look at, we can see, "We had too many of these events." Many objects were being nuked from memory to make room for new objects, but sometimes they wouldn't fit.
So how badly did it nuke? Because we can measure this, we can look at this. And this is what that looks like from a memory perspective. So you can see that the instance was running at about maybe four gigs of memory, and then we had a massive spike within minutes, like one or two minutes, to 16 gigabytes. So that's a lot of data that had to fit in memory. And you can already see where this is going... Scrapers, and bots, and LLMs... We have so many things happening. And then you can see the memory, it went up. The thread was killed, the child was killed... The Varnish \[unintelligible 00:26:40.14\] memory came down again, and then it went up again. So on the graph that we see here, we can see the first spike, then, maybe a minute apart, the second spike, another crash... It took a little while for it to restore. We're talking maybe 10 seconds. And then we stabilized around 10 gigabytes. From a CPU perspective, we got like a hundred percent CPU utilization when this happens. Everything is full on, everything -- the instance is really struggling to allocate and deallocate and free up memory... And more importantly, we have a lot of traffic flowing through. So how much? 2.29 gigabits, specifically. 2.29 gigabits...
Jerod Santo
Per second.
Gerhard Lazu
Per second, exactly. And these happen so quickly; you have a huge rush of traffic coming in... And then nothing.
**Break**: \[00:27:40.00\]
So why is more traffic coming into the instance than going out? So this is the traffic that the instance is receiving. So we're receiving 2.29 gigabits, but we're only sending 145 megabits. Now is a good time to pause and think about why this is happening.
Jerod Santo
Yeah, don't skip silence. So when we say the instance, we mean the Varnish instance.
Gerhard Lazu
The Varnish instance, yeah.
Jerod Santo
Which sits between our end user, whatever that is - or users - and our application. Well, actually, and our Cloudflare, not our application.
Gerhard Lazu
All our backends. And we have a couple of backends.
Jerod Santo
Yes, but in the case of MP3 files it's our R2 origin.
Gerhard Lazu
That's correct.
Jerod Santo
So Varnish is receiving a bunch of data, and sending back an order of magnitude less data. And what's it receiving - I don't know, man. I mean, my guess would be we're uploading MP3s... Now, that's gonna go straight through the app to R2... Just a DDoS? I mean, what is it? I don't know.
Gerhard Lazu
Yeah... So it is a DDoS, but it's specifically downloading MP3 files, or starting to download MP3 files, but never finishing.
Jerod Santo
Hanging...
Gerhard Lazu
So you get all these requests for MP3 files, for large files... Varnish is going and fetching them as quickly as it can... So pulling all this data in, so it has it in memory, but the client is never around long enough.
Adam Stacoviak
It terminates.
Gerhard Lazu
Yeah, exactly. So they basically abort, but Varnish is still pulling in all the data. Now, there is a property... It's called beresp.do\_stream, and it's set to true. So what this does - a very weird thing - is it tells Varnish not to buffer the entire backend response if the client is slow. So I'm not going to fetch the entire MP3 file if you only want the first, I don't know, minute, or two, or a range, or something like that. Now, this is on by default. So by default, that's how Varnish behaves. So we wouldn't need to enable this. But if the object is uncacheable, it cannot be stored in cache - do you see where I'm going with this? Memory - you can't store it in memory... So you keep pulling these files over and over again, and maybe even just fragments of them... So even though the client never receives them, you may be pulling hundreds of files, and the client just goes away. So you're not pulling the entire file, but you're still pulling enough, and not able to fit it anywhere, and it just becomes a mess.
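In VCL terms, the behavior being described looks roughly like this - a sketch, not the Pipely code; as noted, streaming is already the default in modern Varnish, so you'd only set it explicitly to document or override it:

```vcl
sub vcl_backend_response {
    # Stream the body to the client while it is still being fetched,
    # instead of buffering the entire backend response first.
    set beresp.do_stream = true;

    # The catch from the conversation: if the object is uncacheable,
    # the bytes fetched for a client that hangs up are pure wasted work.
}
```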
Adam Stacoviak
This reminds me of the '90s, when you used to go jean shopping...
Gerhard Lazu
\[laughs\] Tell us... Do tell, Adam.
Adam Stacoviak
And you'd go into Abercrombie & Fitch - which I never shopped at, but let's just imagine I did... I'd go in there and be like "I like all these jeans. Get them all." I'm trying them all on, and then I just bounce.
Jerod Santo
Yeah, the person goes to collect them all, they come back and you're not there.
Adam Stacoviak
Here's a dressing room full of jeans, and Adam's gone. Bye-bye. See ya.
Jerod Santo
This really sounds like you're speaking from experience. Was this a prank \[unintelligible 00:32:54.20\]
Adam Stacoviak
I just made it up just now. I'm just creative like that, you know? On the fly. Creativity.
Gerhard Lazu
\[laughs\] That's a good one. That's a good one. On the fly, yes.
Jerod Santo
It is on the fly...
Gerhard Lazu
It is. On the fly.io. Boom!
Jerod Santo
Well, what could we do then? What's going on here?
Gerhard Lazu
Exactly. So this was one of the things where I had to deep-dive and understand what on Earth is going on. Where do we store things, what's happening... So there's a lot, lot more that went into this pull request. It's pull request 44. I'm calling it "The elephant in the room." I'm going to switch to the browser, just to have a look at that.
So the title of the pull request is "Storing MP3 files in the file cache." But that's just the tip, right? The most obvious thing is, "Well, you either have lots and lots of memory to give Varnish", which honestly would be impractical, in the sense that it would be way too expensive to store all these files in memory. The next best thing is to have something like a file cache. And by the way, we're talking about open source Varnish. That's really important. Anyone can use this, anyone can configure this... You can configure a file cache, which will basically pre-allocate a file on disk, and that's where these large files will be stored. Pull request 44, the one that we're looking at, is in the Pipely repository. That's what this adds.
\[00:34:19.08\] But there's significantly more stuff... And if I'm going to -- so there's quite a few files. I highlighted a few, so I'm going to look at this one... So it's not just that. You also need to tune, for example, thread pools, you need to tune the minimum, the maximum... You need to tune the workspace backend, like "How many memory structures get allocated?" You need to configure the nuke limit... And there's a couple more things that we had to go through, just to make things stable.
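None of the exact values are read out on air, so here's a hedged sketch of the kind of varnishd invocation being described - the storage and parameter names are real Varnish ones, but the sizes and limits are illustrative, not the ones in PR 44:

```sh
# Small objects stay in RAM (malloc); large MP3s go to a pre-allocated
# file on disk. nuke_limit bounds how many objects a single miss may
# evict to make room; workspace_backend sizes per-fetch memory structures.
varnishd \
  -a :8080 \
  -f /etc/varnish/default.vcl \
  -s memory=malloc,3g \
  -s disk=file,/var/lib/varnish/cache.bin,40g \
  -p nuke_limit=1000 \
  -p thread_pool_min=100 \
  -p thread_pool_max=300 \
  -p workspace_backend=128k
```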
Now, I'm just going to very quickly mention these things. You can go and have a look at the pull request to see what else went into it.. So this was the one file. The other one was the regions. That's another thing. Not all regions would suffer from this. So you don't want to allocate too much memory or too much CPU to regions where maybe they don't get a lot of traffic. And you would think that this thing is easy, but oh man, I have a surprise for you... You can't mix and match sizes easily in Fly. So you can't say "Create application groups, and this group will be the small group, and that group will be the big group, and this is just one application..."
Jerod Santo
Really...?
Gerhard Lazu
It's not straightforward. So you have to -- again, this is how I solved it. Maybe someone listening to this will tell me, "Hey, Gerhard, you're wrong." I would love to know that, seriously. So the way I solved it is we deploy in all the regions, because you specify the size once. So you say "My starting size is the large instance type." It has a certain number of cores, a certain number of memory... And by the way, the disk is the same in all of them, because that's another problem, so we will sidebar that, or put a pin in that.
So when it comes to the initial deployment, you deploy the one size across all the application instances, and then you go and need to check to see which instances should be scaled down, so that you have the capacity, but the regions that don't need the capacity can just bring them down. And you do a rolling deploy, in that you replace one for one, you have plenty of capacity to handle the traffic while instances are being rolled... All that good stuff. But we have hot regions, and then we have cold regions. And there's quite a few things here. Again, if someone knows how to do this better, I would love to hear about that.
And we have the TOML, we have the primary region... There's a couple of things here... We'll come back to services and -- HTTP services. That's a fun one. We'll leave that for a little bit later. Fly Just... We can see how we do the flyctl deploy, we disable HA, because we want only one instance per region... We have 15 regions in total. We specify the CPUs, the memory, all that good stuff, including environment variables... Oh, that's another thing - we need to adjust the Varnish size based on the memory the instance has. We need to say "Hey, Varnish, you get 70%." And that's the other thing that this does. Same thing for the file size. You can't take up the entire disk. We tell it, based on the disk that we provision, how much space it should use from the disk that gets created. There's a scaling there, so that's another good one...
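A rough illustration of that sizing step - a hypothetical entrypoint script, not the actual Pipely one; the 70% figure comes from the conversation, the 80% disk fraction is made up:

```sh
#!/bin/sh
# Give Varnish ~70% of the instance's RAM, and a bounded slice of the
# provisioned disk, then start varnishd in the foreground.
TOTAL_KB=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
MALLOC_MB=$(( TOTAL_KB * 70 / 100 / 1024 ))

DISK_KB=$(df --output=avail /var/lib/varnish | tail -1)
FILE_MB=$(( DISK_KB * 80 / 100 / 1024 ))

exec varnishd -F \
  -f /etc/varnish/default.vcl \
  -s memory=malloc,"${MALLOC_MB}m" \
  -s disk=file,/var/lib/varnish/cache.bin,"${FILE_MB}m"
```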
I'm going through the pull request to see if there's anything else. Oh man, this was a pain... So recreating -- like, writing tests for this. Everything is tested, in the sense of which requests would go where - basically, which files would get cached in the file store, and which files would be cached in the memory store. So how do you write the tests? Some Varnish logging is included, you have to have anchors... There's quite a few things. So that's assetsbackend.vtc. And part of this - it was a huge refactoring. So if you look at the lines of code, I wouldn't say it's that many. 1,500 were added, and 1,470 were deleted. So not much changed. I mean, the net is 30 new lines. But there was a huge, massive refactoring as part of this.
\[00:38:21.16\] So there's -- again, this was, I think, two-three days of figuring it out, trying things, refactoring things... And if you think that an LLM can help you - well, you try this. \[laughs\] It takes longer to go through those iterations than if you know what you're looking for; then it tends to be easier.
Anyway, it's very dense, very specific, very difficult to make sure that it's doing the right thing. But it's all there. We have the mock backends, we're reusing things... We split the VCLs -- by the way, we finished the splits, so it's easier to reuse them. So there's quite a few things there.
Now, this is Kaizen, so we are wondering what improved. After all this work, we rolled it out... What improved? And to answer this question, we need to figure out which region is the busiest one. So out of all the regions that we serve - we have 15 in total - which ones get the most traffic? It's those hot regions. We're looking at Fly, the Grafana dashboard for our Fly application, the instance of the Pipedream, the current one... And we can see that SJC - San Jose, California - is a nice, big, red circle, which means it has the most traffic... And also NRT, which is Tokyo.
Jerod Santo
Huh.
Gerhard Lazu
Apparently.
Jerod Santo
We're big in Japan.
Gerhard Lazu
Yeah. And Europe, there's quite a few... So if I'm going to pull this down a little bit; let's see... No, I wanted to go here.
Adam Stacoviak
What about that new continent? Are we big there?
Gerhard Lazu
The new continent? Australia?
Adam Stacoviak
No, there's a new one. There's a new-new one.
Jerod Santo
Well, what's it called?
Gerhard Lazu
Which is the new one?
Adam Stacoviak
I don't know. There's a headline... I thought y'all would get the joke.
Gerhard Lazu
No.
Adam Stacoviak
Over the holiday there was speculation there was a new continent being announced.
Gerhard Lazu
Narnia?
Adam Stacoviak
Maybe... It could have been Narnia.
Gerhard Lazu
\[laughs\] No, no, no.
Adam Stacoviak
With the closet.
Gerhard Lazu
So right now, even this list is basically -- if you think about it, it kind of makes sense. It's US East, US West, Europe... But we have quite a few instances in Europe. We have four. It's more geographically spread in Europe. And we have Asia. So these are like the big ones. Australia, Africa, and South America - they're not as busy. They are the least busy regions. Cool... So which instance would you like us to have a look at? So I have a queue right here...
Jerod Santo
SJC, baby. Let's go. Let's go big.
Gerhard Lazu
SJC, baby. Alright, let's see that. So I'm running flyctl ssh console. I'm using two flags. -s, which is the short one for --select... It'll prompt me for which instance I want to select. And then I have -C. Capital C. It's different than lowercase c; they do different things. I give it the command to run. And it's \[unintelligible 00:41:11.25\] which will give me all the statistics from Varnish at a point in time, since this instance has been running. I will select SJC... There you go. And it will give me all this data, which is all the counters that Varnish is incrementing, keeping track of different things: the origins, the backends, the memory pool, the disk pool, the lock counters... There's so much stuff. I'm really, really impressed by how many things Varnish has.
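From context, the command behind the \[unintelligible\] is presumably varnishstat; a hedged sketch of the invocation being described (the app name here is made up):

```sh
# -s / --select prompts for which machine to SSH into;
# -C runs a single command instead of opening a shell.
fly ssh console --app our-pipedream-app -s -C "varnishstat -1"
```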
So this is what we're going to do. "We", because AI... We're going to copy all of this, we're going to ask AI what it thinks of this... How about that? \[laughs\] There's just too much data here, so let's be serious about it. So question to you - which is your favorite AI, Jerod? Which one do you use?
Jerod Santo
Oh, I don't like any of them. I would probably start with Claude, and then I would go to Grok, and then I would go to ChatGPT, third.
Gerhard Lazu
\[00:42:11.24\] Okay. So Claude, which one? Which version? Which model?
Jerod Santo
Opus, man. Give us the Opus.
Gerhard Lazu
Opus. Okay. So we're looking at abacus.ai, something I've been using for a long, long time... It allows you -- I'm only paying $10 per month for it. Not sponsored, not affiliated in any way... It's just something that I've picked for myself, and I can basically pick any model, and I can just run this. So I have something prepared, so I'm going to drop this - it's all the data - and we're going to read through something that I prepared ahead of time.
Jerod Santo
You pre-prompted this.
Gerhard Lazu
I pre-prompted this, exactly.
Jerod Santo
Okay. You've been engineering this prompt for weeks.
Gerhard Lazu
Not really, but...
Jerod Santo
Oh, that's a long prompt.
Gerhard Lazu
So we're going to read it, and in the meantime, Adam will think about his favorite LLM to try. And I have mine. So we'll try three LLMs to see what they say.
Jerod Santo
Oh, my goodness.
Gerhard Lazu
So I'll need to read the prompt now, while everybody thinks, "No...! You should be using-" whatever LLM you should be using. "You are a Varnish 7 expert. You need to prepare four distinct responses, and be explicit about the person that you're addressing. One, a seasoned sysadmin that has been living and breathing infrastructure for the last 20 years. Be precise, think deeply, and approach the setup from a hardware perspective. Two, an Elixir application developer that embraces Erlang's "Let it crash" concept. You need to give it straight, give it fast, and keep it relevant to their application. Use the app and the nightly backends. Assets and feeds are important, but less relevant. Cloudflare R2. Three, the business person that is selling this thing. They care about costs, efficiency and simplicity. Keep it high level and relevant for someone that doesn't care about the tech, but cares about the outcomes. And four, the audience of a podcast where this is being discussed. Make it general, relatable and fun. Make analogies, keep it light and engaging." I have "fun" too many times. We don't wanna make it too fun... \[laughs\]
Jerod Santo
That's a lot of fun.
Gerhard Lazu
Yeah, that's one too many funs.
Jerod Santo
That's right. Well... \[unintelligible 00:44:16.27\]
Gerhard Lazu
"Now that you understand your audience, please analyze the following VarnishStat output for the... SJC." Look, I already knew that you would pick --
Jerod Santo
How did you know I'd go for the big one...?
Gerhard Lazu
I have no idea. "Focus on things that work well, things that could be improved, and anything else that you find interesting. And by the way, ignore the synthetic requests." It will keep mentioning these... I get so fed up with this. We have health checks that run every five seconds, so they are normal.
Jerod Santo
Okay.
Gerhard Lazu
So I'm going to copy this, I'm going to run this, and I'm also going to open a new window for Adam. So which LLM should we pick, Adam? Which is your favorite?
Adam Stacoviak
You mean model?
Gerhard Lazu
Model, yeah. Which model?
Adam Stacoviak
We just used it... But I'd probably back up to like Codex... Which is like GPT-5, latest... 5.1, 5.2...
Gerhard Lazu
There you go. So GPT Codex. My favorite one is Gemini, so I'm going to drop it... And let's see how do they compare.
Jerod Santo
Ah, Gemini. You're in a different tab now. So Abacus can't do Gemini?
Gerhard Lazu
It might, but I have my own Pro account.
Jerod Santo
Gotcha.
Gerhard Lazu
So that's something else. I use Veo, I use Nano Banana... Quite a few things. Transcripts... It's all part of the package. So it can, but that's what I prefer. Cool. So Claude Opus 4.5, for the seasoned sysadmin.
Jerod Santo
This is you, Gerhard.
Gerhard Lazu
This is me, exactly. Thank you for noticing. For knowing who's who. \[laughs\]
Jerod Santo
You're welcome. I'm following...
Gerhard Lazu
\[00:45:45.09\] So what's working well? Rock-solid stability. So by the way, the instance has been running for 5.4 days. We had all these improvements shipped, and we are able to observe how our busiest instance works... And that's what this is, basically. That was... The window moved. Cool. So after 5.4 days, zero child panics, zero crashes, zero thread failures. This is important. It means no threads died, no threads had to be restarted; everything is healthy on this instance. It didn't crash. So this instance didn't crash.
Zero lock contention across all subsystems. Your CPU cache lines are happy. Excellent hit ratio, 93%. We like that. We really like that. We have backend connection pooling, with a two to one reuse ratio, and memory pressure is minimal: 132 LRU nukes in the last five days. So very few objects had to be removed from memory. Thread pools - 300 threads, zero queuing, zero drops. That's perfect.
Areas to investigate. Disk storage allocator failures... We have disk c\_fail events... We are hitting storage fragmentation. The disk is 97% full. We have 48 gigabytes used. That's how many MP3 files are stored. By the way, how many MP3 files total do you think we have?
Jerod Santo
Size or files?
Gerhard Lazu
Size.
Jerod Santo
Size. Well, if we had a thousand episodes at a hundred megs each, which - neither of those things are true... That'd be a hundred gigs. So a hundred is too big, but a thousand is too small. I'm going to say 80 gigs.
Gerhard Lazu
Adam, do you want to guess?
Adam Stacoviak
That math checks out... I was gonna say like a terabyte, but that's probably raw WAV files, versus not...
Gerhard Lazu
All the files that we store in R2 - and this includes all the assets; but we know that the MP3 files are the biggest - it's close to 250 gigabytes. We may have some duplicates; I don't know, I haven't checked. But that's how much we have in R2.
Jerod Santo
Yeah. Well, we also have Plus Plus for the last couple of years, which means every episode has two files, not just one. So... That makes sense.
Gerhard Lazu
So we should go higher... Now, we use this in every single region, so maybe we want to reduce the number of regions... But I think --
Jerod Santo
We need a third category called "Super-hot."
Gerhard Lazu
Super-hot, yes. Maybe.
Jerod Santo
Which is like SJC and Tokyo, right?
Gerhard Lazu
That's possible, yeah. There's four, which - we know they're really, really hot. Yeah, yeah. But honestly, this is happening across multiple regions, and...
Jerod Santo
It is.
Gerhard Lazu
...we'll get to some interesting things. So okay. Synthetic responses, grace hits... All good. For the Elixir developer - and I think this is you, Jerod. Do you want to read it out?
Jerod Santo
Oh. "Well, the TL;DR is Varnish is doing its job. Your app backend is well protected." Do you want me to read the whole thing?
Gerhard Lazu
If you want... I mean, it's shielded...
Jerod Santo
It's 95% shielded. No failures, zero backend failures... That's because of - you know, my code doesn't really let it crash very often.
Gerhard Lazu
Exactly. Your code is -- yeah, it crashes internally, not externally. \[laughs\]
Jerod Santo
That's right. My thing is doing its thing. It is generating some uncacheable responses, but you know, we do have some that we just don't want to be cached... Ooh, one fetch failure. Negligible. Yeah, I agree. We don't need to worry about that. And in the end, it says "Whoever wrote this is really good at what they do."
Gerhard Lazu
I agree. That's exactly what it says. \[laughs\]
Jerod Santo
And they should be proud of themselves... And congratulations on such a great hire.
Gerhard Lazu
Yeah. I agree. I agree.
Jerod Santo
\[laughs\]
Gerhard Lazu
I think the hire needs a promotion, and a bonus, I think...
Jerod Santo
There you go.
Gerhard Lazu
Alright. For the business person, the caching layer is performing excellently. Adam, do you recognize yourself? Or shall I continue with this?
Adam Stacoviak
You can read it.
Gerhard Lazu
93% of requests never touch your servers. Massive cost savings on compute. Do you know how many requests per second the application is serving? Maximum, by the way. What's the maximum RPS for this amazing Elixir Phoenix application, for the homepage?
Adam Stacoviak
Probably a lot. Gosh... Thousands? Tens of thousands?
Gerhard Lazu
Maximum. Okay. Jerod?
Adam Stacoviak
100,000?
Gerhard Lazu
The database connection is involved.
Jerod Santo
Concurrently?
Gerhard Lazu
Concurrently, yes. I don't know, I'd say not very many. To our homepage?
\[00:50:15.13\] The homepage.
Jerod Santo
That'd be like 12. 12 requests a second.
Gerhard Lazu
Yeah. 17. \[laughs\]
Jerod Santo
17! I'm right in there, baby.
Gerhard Lazu
Someone that knows their code. So 17 requests per second. \[laughter\] So if all these requests were hitting the application, we'd need so much compute to serve that. So much caching... Obviously, we've removed all the caching. Now we're joking about this, because we purposefully removed all the caching from the application.
Jerod Santo
Right.
Gerhard Lazu
I remember that a couple of years back, because we said "This has no place in the application. The application gets restarted, we need to store this somewhere, we need a cluster..." It was just really messy to handle it at that layer, which is why we introduced this. Five plus days running without any issues... By the way, this is the last deploy. So maybe by the next Kaizen, if we do no more deploys, we'll be able to see how well it handles. Zero failures on the infrastructure side, and three terabytes of data served to users. Three terabytes. So in five days, this one instance served three terabytes.
Without your application servers breaking a sweat. Storage is getting full, so basically we need more storage... For the podcast audience --
Jerod Santo
Oh yeah, it's gonna be fun.
Gerhard Lazu
Imagine a really good receptionist at a busy office. This Varnish server is like having someone at the front desk who remembers everything. Out of 100 people who walk in asking questions, 93 of them get their answers immediately from the receptionist, without ever bothering the experts in the back office. What's cool? It's been running for over five days straight without a coffee break, or a single mistake. That sounds cruel to me, but let's go with it...
Jerod Santo
Good job. \[laughter\]
Gerhard Lazu
It served three terabytes of data... That's like streaming about a thousand HD movies. This one instance streamed a thousand HD movies in five days. And the experts only had to answer 7% of the questions. The one quirk - the filing cabinet is getting full. It's like when your receptionist's desk drawers are stuffed, and they occasionally have to throw away old notes to make room for new ones. Not a crisis, just time to get a bigger cabinet. Okay, I think the \[unintelligible 00:52:29.09\] 300 workers... I think that's too detailed right there.
Adam Stacoviak
That's good fun there. Good job.
Gerhard Lazu
Do we care about GPT, or Gemini? We can only use one. We can only pick one.
Adam Stacoviak
Gemini is getting some good hotness...
Jerod Santo
Let's check Gemini. We'll see how it adds up. Oh, it's still thinking...
Gerhard Lazu
Let's see. I think it's finished. Maybe that's -- let me just close that. Did it finish? I think it did. Alright, so let's go up... Slow thinking... I did the thinking -- I could have gone Pro as well.
Adam Stacoviak
Show thinking.
Jerod Santo
Oh, show thinking.
Gerhard Lazu
Show thinking. Slow think... \[laughs\]
Jerod Santo
I thought I said "Slow thinking." I was like "Come on, speed up, Gemini."
Gerhard Lazu
Show thinking. There's quite a lot there... Anyway, we're not going to look into that.
Jerod Santo
No.
Gerhard Lazu
So the instance has been up for 5.3 days. The MGT uptime - I like it. It's telling me which of that long list of counters is important. From a system perspective, the threading model is perfectly dialed in. 300 threads across two pools, with zero threads limited, and zero thread queue length. The kernel isn't context-switching unnecessarily, and we aren't starving for execution units. However, we have a critical IO and storage allocation bottleneck in the SMF - the file stevedore. I'm not sure what "stevedore" means. That's a weird one... \[laughs\]
Jerod Santo
I'm not sure Gemini knows what "stevedore" means either.
Gerhard Lazu
\[00:53:51.06\] The problem is SMF disk c\_fail is at 19,000 plus - a 73% failure rate for disk allocations. In g\_space we have only 145 megabytes free, while we have 48 gigabytes used. "You're effectively out of contiguous blocks, or hitting the ceiling of the pre-allocated file." That sounds right. Varnish is forced to fall back to SMA RAM, or simply not cache large objects. While SMA has 2.1 gigabytes free, the disk is thrashing. The hardware: "If this is spinning rust, your IO is likely thrashing." It's not; it's NVMe. So yeah, there's quite a few things there. Interestingly, it's low, relatively -- suggesting the lock contention on the shared memory log is negligible. Nice. We have low lock contention. The bottleneck is strictly at the storage layer. We basically just need a bigger disk.
The Elixir developer - the "let it crash" proponent. I like it. Your app backend is rock solid. Well done, Jerod. Varnish hasn't seen a single backend failure or an unhealthy event in five days. The let it crash philosophy is working. If the app is crashing, it's recovering. So far, Varnish doesn't even notice. I don't think it is... I think the application is good. The good news is we're seeing an effective 75% hit rate. Now, this is different, right? We had 93% there, so I would like to understand the discrepancy. That's another reason why you should use more than one, by the way. And then obviously, double-check the data, because sometimes they give you things which aren't true... The crash concern - the disk cache is thrashing. Again, this is not for the application developer. I think this is for the sysadmin in me. The straight talk... Your MAIN s\_synth is high. So we have health checks. There's delta responses... So we have a lot of synthetic requests. Sorry, synthetic responses. This is, again, a Varnish thing.
The business person - efficiency... we're currently serving this much of our traffic from Varnish... I think I know what's happened. I don't think it's taking into account the synthetic requests. Those should be removed from the total number of requests.
Jerod Santo
So you think Claude has the right number?
Gerhard Lazu
I think so, yeah. Yeah, I think so. This means for customers - we have cost efficiency, that's good. The risk... There's the bottom line. I think this was the fun one, but I think this one is a library analogy. I think we can stop it here.
Jerod Santo
The library analogy versus the secretary analogy.
Gerhard Lazu
I think that was a better one. I got a barista one, I thought it was a very good one...
Jerod Santo
Oh, yeah... For queuing, or for what?
Gerhard Lazu
For queuing, yeah. The barista analogy I thought was very good. This is using books, and whatnot. The library hasn't burned down... That is fun. \[laughs\]
Jerod Santo
Oh, that's a good thing.
Gerhard Lazu
That is fun. So I think Gemini is getting a bit funnier. The nightly feeds in the app are still humming along. Nice. So that's what we have... And that was only half the problem.
**Break**: \[00:56:53.09\]
That was only half the problem. So we're at the midpoint... \[laughs\]
Jerod Santo
Oh... I was feeling good. I felt like we had it all fixed. What else is the problem?
Gerhard Lazu
Oh, man... This is when all the fun begins. So. Do you remember this, Jerod?
Jerod Santo
Yes. MP3 requests intermittently hang in Newark, New Jersey. This was our good friend, John Spurlock, who's been on the show before, and is a podcast nerd... In fact, he runs op3.dev and other podcast nerdery things. And so he really knows his stuff. And so when he reports issues, I don't say "Did you try rebooting?" I take it seriously. So I shared it with you... And he actually did some additional digging for us. Go ahead.
Gerhard Lazu
Mm-hm. So in terms of - you tested this, and I think you had issues as well. So we've confirmed this, for sure.
Jerod Santo
I did. Certain times, certain files... Actually, it would be all requests at certain times. I assume that it was that particular pop - as we could call them; or pipe, in the Pipe Dream - that was hanging. And then it would go away... And he actually had the same problem. He had a Friday night deploy of Friends, and he was trying to listen to it on Friday. Couldn't get to it. By Saturday morning he could get to it. So it's intermittent hanging. Very difficult to diagnose, very difficult - I assume - to debug. And then it just comes back to normal... I thought it was maybe the out of memory thing; like, it's just in some sort of fugue state until it reboots, and then it works again... But you go ahead.
Gerhard Lazu
That's what I thought. That's why I did a deep dive on this. This was November, end of November, beginning of November... So November, I was just trying to figure out what on Earth was going on. Just from the sides... I didn't have too much time. But if you look at this response, there's quite a few things there. This is my initial one, an investigation, trying to understand what's happening, giving a couple of debug headers, a couple of extra headers that the request can be made of -- sorry, can be run with, so we'd just get a bit more details. Forcing regions as well... So there's quite a few things there; I was checking into that.
This is Don McKinnon. He also had issues that day, so he pasted some results... So thank you. Thank you, Don, for adding this. This was helpful. So I'm still scrolling, I'm still scrolling... There we go. Super-helpful. I had confirmed that the requests were hanging... "You are getting the hangs this afternoon as well." This was only three weeks ago... So this had been going on for a while. I dug deeper and I found the problem. The problem was that in the Fly config we had the concurrency type set to connections, not requests. So it's possible to configure this - again, you're configuring the Fly proxy that sits in front of the application - to limit how much traffic hits your application. So requests: how many in-flight requests should the Fly proxy forward to your application before it stops? Because you don't want it to get overloaded. So before it starts throttling, it starts slowing clients down... And that's when you start seeing Fly edge errors.
\[01:02:09.02\] Connections, you would use for something that has long-running connections, like a database, for example. In our case, it's not a database, it's an HTTP application... So requests would have been the right concurrency type. I have no idea why I picked connections. It was the wrong one. But the effect was, as you can see here, we had 2,700 long-running connections on that edge, in that region. In this case it was, I think, the orange one... I think EWR, right? So EWR had all these connections open. The clients weren't closing the connections, the proxy was full, and no more connections could be forwarded to the application.
Long-running connections like that usually mean clients which are not doing the right thing. You shouldn't have that many long-running connections. So the problem was a misconfiguration on our side, which meant that slow, long-running connections were basically blocking other connections from coming through. So that was the problem there.
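For reference, the setting in question lives in fly.toml's concurrency block. A minimal sketch, assuming an HTTP service - the port and limits here are illustrative, not Pipely's actual values:

```toml
# Hypothetical fly.toml fragment - values are illustrative.
[http_service]
  internal_port = 9000

  [http_service.concurrency]
    # "connections" counts open TCP connections (sensible for databases
    # and other long-lived protocols); "requests" counts in-flight HTTP
    # requests, which is the right type for an HTTP app behind the proxy.
    type = "requests"
    soft_limit = 200   # proxy starts preferring other instances
    hard_limit = 250   # proxy stops sending new work to this instance
```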
And I thought that was it, but... But... There was more. So there was a last comment, last week... We now have a check that runs every hour - and I'll talk about the check as well - and what was interesting is we had response bodies timing out in two regions. So 13 regions were fine, but even after this configuration fix there were two regions - IAD and EWR - where, when we were using HTTP/2 - and for some reason this is important... When we were using HTTP/2, the Fly proxy would not forward the connection correctly. As in, it would start, it would serve the response, we could see the headers coming back from our instances; what we wouldn't get is the body. So the body would always be "zero bytes served." And we could see this happening; we could see the connections that, by the way, were still open... They shouldn't have been open, because the application had changed, so those connections should have been dropped... There was something not quite right. My suspicion is the Fly proxy layer. Because when we were forcing HTTP/1.1, everything was working fine. And by the way, the Fly proxy, when it talks to our Varnish instance, uses HTTP/1.1. And you can see that in the headers.
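One way to reproduce what's described here is to request the same file twice, forcing each protocol version; the URL below is a placeholder, and fly-prefer-region is Fly's dynamic request routing header for pinning a request to a region:

```sh
# Over HTTP/2, headers came back but the body stalled at zero bytes:
curl --http2 -sv -o /dev/null \
  -H "fly-prefer-region: ewr" \
  https://example.changelog.com/some-episode.mp3

# Forcing HTTP/1.1 against the same region worked fine:
curl --http1.1 -sv -o /dev/null \
  -H "fly-prefer-region: ewr" \
  https://example.changelog.com/some-episode.mp3
```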
So the proxy to Varnish was fine, but the client to the proxy was not. And HTTP/2 is a very complex protocol. There are so many things which just don't work the way people would expect. So anyway, the issue fixed itself... That's the important thing. \[laughs\] So opening this --
Jerod Santo
Not super-satisfying...
Gerhard Lazu
Yeah, that was very nice to see. And there was something -- Maya Ilaros? How would you read this...?
Jerod Santo
Maya Ilaros.
Gerhard Lazu
Ilaros, there you go. Someone on the Fly community forum was very helpful - they noticed that we had a misconfiguration in our fly.toml. We were using services as well as HTTP service - and this is bad, by the way. This is very, very bad. So everything was happy; we could push this config, the applications were running, everything was fine... But because we had these two things together, it was apparently creating some issues. And all we did was explicitly set the idle timeout. The idle timeout - that's the one where, if after 60 seconds the connection isn't doing anything, it will be forcefully terminated by the proxy. So that part was important.
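The actual fix is captured in pull request 49; as a rough sketch of the shape of the problem - the section names follow Fly's config reference, though the exact placement of the idle timeout key is our assumption:

```toml
# Don't define both of these for the same app. Each is valid on its own,
# and validation passes, but together they create two competing proxy
# configurations:
#
#   [[services]]       # the low-level TCP service definition
#   [http_service]     # the higher-level HTTP service definition

[http_service]
  internal_port = 9000

  [http_service.http_options]
    # Idle connections get forcefully terminated by the proxy after this
    # long. 60 seconds matches the behavior described above.
    idle_timeout = 60
```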
So anyway, we made the change, we pushed the change... But even before we pushed the change, the proxy started behaving. And now there's pull request 49 - we right-sized it, we made a few changes, I captured all the details, the configuration, the commands... It's all there if you want to read it. But most importantly, now we have a check that runs against all regions, every hour, on the hour, in CI/CD, using Hurl... And what I'm thinking is - shall we try running that locally, to see how it behaves? Because that's how I started it: I was running it locally.
Jerod Santo
\[01:06:27.16\] Yeah.
Gerhard Lazu
So on the left-hand side -- I'm back in the terminal. On the left-hand side I am monitoring my internet connection. Remember that Christmas tree? This is related to that Christmas tree. So I'm at the top of the Christmas tree; I'm at the gateway, the core router. It's a MikroTik CCR2004. Pretty good... 10 gigabits per second, maximum. Now, my internet connection isn't 10 gigabits, it's 2.5, which is plenty for this test. So every second it's showing me how many packets and how many bits we're receiving and transmitting. And again, we are recording, everything's happening live, so you can see it jumping as we're pushing more data to Riverside.
Cool. So I'm going to run just check now. And just check - it's one of the commands, one of the just recipes that we have in the Pipely repository. And check - all it does is run Hurl with a couple of flags. It downloads an MP3 file, it downloads feeds... It basically connects to all the different backends, and it sees how quickly it can get data back. We're transferring about -- that was quick, that was eight seconds. I'm going to run it again, and as I run this, pay attention to the left-hand side... It will go to 120 megabits per second. That's the MP3 file being downloaded. So every single time this runs, a full MP3 file gets downloaded, alongside a few other things.
Okay, I can open the reports... We're not going to look into that, because we're going to run something more interesting now. We'll do check-all. And what check-all does is run the same command against all the regions. I'm at 2.3 gigabits per second. We're downloading all the files... We can see the responses coming back. EWR just sped by, \[unintelligible 01:08:18.24\] sped by... So all the different endpoints are returning. Now, I'm based in London... Obviously, the further away you are -- so for example, this was South America, that was LAX... So a couple of instances are slower to respond. And all this happens via headers. When you connect to Fly, you can tell it "Hey, I want to connect to a specific region", and that's what routes the request to that region.
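The real checks live in the Pipely repository (pull request 49); as a minimal sketch of what one of these Hurl checks could look like - the URL is a placeholder, and the thresholds are illustrative:

```hurl
# Download one episode MP3 from a specific Fly region and assert on it.
# fly-prefer-region routes the request to that region's instances.
GET https://example.changelog.com/uploads/some-episode.mp3
fly-prefer-region: nrt

HTTP 200
[Asserts]
header "Content-Type" contains "audio"
duration < 100000   # milliseconds - mirrors the 100-second cap mentioned later
```

Run with something like `hurl --test check.hurl`; looping over region codes gives you the check-all behavior.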
Jerod Santo
That's cool.
Gerhard Lazu
And again, it's all captured in that pull request, and you can see what it looks like. The Check All one - Johannesburg, that's usually slow. And the slowest one is Tokyo for me. Sydney as well can be slow. So we still haven't received the responses from there. We should get that shortly. I'm pulling now 50 megabits, 20 megabits... It's just slowing down. And it's just -- the connections between now and there... The last one there was Tokyo.
In 60 seconds I pulled about two gigs, roughly. It's a lot of data that gets pulled down - the feeds, and all of that. And anyone can run it. I would recommend you not run this, because we have to pay for this bandwidth... But our CI runs it, just to make sure that everything works. And if we look at it running every hour - I think I'm going to tune this down - you can see there are no more connections hanging. So we got to the bottom of that as well.
Jerod Santo
If it ever comes back -- because it went away on its own. If it comes back on its own, we're worried about it.
Gerhard Lazu
Exactly. Now we have a system that is able to inform us when there's a problem. So let's go to three... We're on page number three. This one, for example, took more than five minutes. So sometimes, when the connectivity is a bit slow, some regions can be slow - that's when you get these timeouts. So this is capped at five minutes... The last one that failed was a while ago. So you can see we're January 5th... There we go. There's one that failed January 4th. Check all instances... So let's see Run, and we'll see exactly which region failed. Execution... NRT, that's Tokyo, and as you can see, we have 100 seconds. So if after 100 seconds it doesn't download, it just times out. And we were pulling data, but it didn't finish downloading the entire MP3. And we're downloading 100 and something megabytes.
Jerod Santo
\[01:10:34.02\] Very cool. I mean, not cool that it didn't finish, but cool that that was a while ago, and we can actually test this. Now, do we need to be doing such a large file? Is that part of the test? Or can we test a smaller file, and still get the same results?
Gerhard Lazu
We could, yes. This was the file that was reported. So we'd need to find a smaller MP3 file, absolutely. I think we can also reduce the frequency. We don't have to run it every hour. This was obviously in preparation for this conversation...
Adam Stacoviak
What about episode 456?
Gerhard Lazu
That's coming up. That's coming up. That's the deepest rabbit hole, so I'm leaving that for last. That's coming, Adam... \[laughs\]
Adam Stacoviak
One thing I suggested though in our Zulip - and I didn't check to see if this is even a thing, but... To validate - you know, if the Fly CLI could validate the TOML file for you. Because you could have checked the TOML file for syntax errors, or just do's and don'ts, essentially... And it didn't.
Gerhard Lazu
It does have a validation subcommand. Syntactically, it's correct. The config is valid - I mean, it was applied... But it combines two things it shouldn't. So at least I would expect a warning, like "Hey, you're using both services and HTTP service."
Adam Stacoviak
That's right. Yeah - validate syntax, and then validate the expected, true TOML config. Don't combine or conflate two values, or overwrite one, or... You know, just that kind of thing. That's how I would defensively do something like that in a CLI, to protect my user from a poor config. Then they wouldn't have been holding it wrong for so long.
Gerhard Lazu
Yup. I agree. So it's the impact of that configuration, indeed. So this is something -- we can see, again, the same logs. We can see -- this one here goes to 50 megabytes per second; that's 400 megabits. When we have these peaks, it's usually when the benchmarks run, or when the checks run, because they put significant pressure on the instances - and we can see them and pick them up straight away. So that's what this is.
Alright... So remember this guy - this guy was writing on March 29th. So it's almost two years ago that this guy was saying "We will run into all sorts of issues that we end up sinking all kinds of time into." So this guy had a good hunch. This is Jerod, March 29th... \[laughter\] And we just went through a couple of examples of issues that we had to deal with as part of this. But because of this, we understand the traffic, and we understand how the application behaves, and how the backends behave, at a very deep level. So you were right, Jerod. We did sink all sorts of time into it -- how many lines? Let's see, how many lines do we have now?
Jerod Santo
I wanted 20 lines...
Gerhard Lazu
590 lines. \[laughs\] 590 lines we have in total, of Varnish config. It's more than 20 lines. By the way, we have the roadmap to 2.0... This is 1.0 that we tagged and shipped. It solved a lot of issues. But that was the easy stuff, okay? So for everyone that stuck with us, something really good is coming up.
Jerod Santo
It's starting to get harder from here.
Gerhard Lazu
\[01:14:02.07\] And Adam was already mentioning it... Episode 456. There's something special about episode 456.
Jerod Santo
Oh, yeah.
Gerhard Lazu
So what is special about it? What stands out to you, Jerod?
Jerod Santo
Oh, it's just getting rocked with downloads.
Gerhard Lazu
So episode 456, "OAuth, it's complicated"... And by the way, this was recorded in 2021, it was published, again, August 2021... For some reason, it's been downloaded a lot in recent months. It has over one million downloads. This is the most popular episode on the Changelog, ever.
Jerod Santo
The most downloaded episode...
Gerhard Lazu
It's crazy. It's crazy.
Jerod Santo
Oh, so you guys looked into this?
Gerhard Lazu
We did, yes. We dug into this.
Jerod Santo
Okay. I didn't know you guys were doing this.
Gerhard Lazu
So we just had a quick look to understand what is happening here... We have Honeycomb open - remember, every single request which comes through the Pipe Dream, through Pipely, we send to Honeycomb, so we're able to look at it... This is the last 60 days, and I have filtering done in such a way that I'm only looking at this one file. How many times has this file been downloaded in the last two months? You can see the peaks, right? And by the way, this is gigabytes, and the period is four hours. So we are peaking at about a hundred -- actually, the peak was here. We had almost 300... 300? 400? Anyway, close to 400 gigabytes in a four-hour period.
Adam Stacoviak
That's just too much.
Gerhard Lazu
I think so... I know this is a great episode, great conversation, but --
Jerod Santo
I remember that conversation. It was good.
Gerhard Lazu
Like, who is downloading this file 400 times - or actually more than 400 times - every four hours, consistently, for months on end?
Jerod Santo
Super-fan.
Gerhard Lazu
A super-fan. So we can see the different regions... Now, this is spread across the entire world. It's not just one region. This is really, really big - if this were a DDoS attack, I think it would class as one. And in the last six months -- sorry, in the last two months, 60 days - we served 30 terabytes in San Jose, California alone. In Tokyo, we served 515 terabytes. This is a big number. And if you look in this column, the distinct client IPs - we had over 10,000 IPs downloading this file. So this is not one or two IPs; this is thousands and thousands of IPs which keep downloading this file over and over and over again. So I don't know how we would block 10,000 IPs...
Jerod Santo
Right.
Gerhard Lazu
The VCL would be crazy.
Jerod Santo
Well, that episode was starring Aaron Parecki, who is a very talented person. And he is the co-founder of IndieWebCamp, and a big fan of the IndieWeb, as well as OAuth, obviously... So my hunch is Aaron's very interested in being the most downloaded episode ever, and he controls a fleet of machines from all around the world... And he points them wherever he wishes. And he thinks "You know what I'm going to do? I'm going to get the number one spot on these guys' download charts." And so I'm thinking Aaron Parecki is the man with the mask on, and we pulled the mask off. It was him this whole time. What do you think, Gerhard?
Gerhard Lazu
I think that we need to speak -- see, I don't want to say the specific language... I think we need to go to Asia. I think we need to visit a couple of cities in Asia... \[laughter\] And find the IPs which are responsible for this, because this is a crazy amount of traffic. Asia, it just so happens - if we look at it, Asia is basically the continent we are getting the most downloads from, because of this one episode. And this is actual traffic being served; these are not just HEAD requests, or GET requests. These are bytes being sent to thousands and thousands of machines in Asia, every single hour. So whoever is doing this, please stop. Please. \[laughter\]
Adam Stacoviak
It's on a cycle.
Jerod Santo
\[01:18:13.11\] So we need to knock on doors. We need to go over there and knock on some doors, and say "Excuse me, is this IP address at this home?" And then they might say yes, and say "Would you please stop? What's going on over here?" What could they possibly benefit from this? What could they be getting?
Gerhard Lazu
Maybe - maybe - we're the speed test. \[laughter\] Someone is using us to speed-test their connection. Who knows?
Jerod Santo
Yeah, maybe.
Gerhard Lazu
That's the only thing I can imagine.
Adam Stacoviak
Well, that's a lot of IP addresses.
Gerhard Lazu
It is.
Adam Stacoviak
And it's across multiple regions, which...
Gerhard Lazu
Multiple data centers, yes. So multiple regions, Fly regions are serving these IPs, yes. They're all coming from Asia, by the way... Again, I don't want to mention any names, because there's no bad guys here, right? We just want to assume that someone left the oven on...
Jerod Santo
I don't know, man...
Adam Stacoviak
It's like the blinker on when you're driving. I would say "Hey, you're not turning. It's time to turn that blinker off."
Gerhard Lazu
So the way I can see us mitigating this - and this is a hard problem because of the number of IPs which are hitting us - is we basically start blocking entire net blocks, entire network blocks... Unfortunately, some genuine listeners might be caught in this, and basically Changelog will not be available - or at least the MP3s will not be available - to a portion of users.
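In Varnish, blocking net blocks would be an ACL check in vcl_recv - a minimal sketch, using a reserved documentation IP range; note that behind Fly's proxy you'd need the real client IP (e.g. via the PROXY protocol) rather than the proxy's address:

```vcl
# Illustrative only - 203.0.113.0/24 is a documentation range, not a real offender.
acl abusive_networks {
    "203.0.113.0"/24;
}

sub vcl_recv {
    # Deny MP3 requests from the listed networks with a synthetic 403.
    if (client.ip ~ abusive_networks && req.url ~ "\.mp3") {
        return (synth(403, "Forbidden"));
    }
}
```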
The other one is - obviously, we can, and we should... This is like the next problem. We should enable some throttling, because there's more stuff happening here. So we don't have any sort of throttling. We assume fairness, we're assuming goodwill, we're assuming decency, and we're not seeing that here.
Jerod Santo
Well, that's the internet.
Gerhard Lazu
So to be honest, whoever is doing this - and it's not LLMs; I had a look. We have that problem as well, but in this case it's not LLMs. This is something completely different. So my hope is that whoever's doing this listens to this episode - maybe we put this in the intro: whoever's downloading episode 456, please stop. Because otherwise we'll need to take the next step...
Jerod Santo
I'm not sure that's --
Gerhard Lazu
I know it's a bit of a cat and a mouse game, but that's what will need to happen. Because we need to pay for this bandwidth.
Adam Stacoviak
This is only Varnish, right? This is only the cache layer where this is happening?
Gerhard Lazu
This is only the cache layer, yes. Yeah.
Adam Stacoviak
And so what mechanisms are in Varnish to do throttling, or rate limiting, or just anything like that whatsoever?
Gerhard Lazu
There are VMODs, which are basically modules that Varnish loads to give it extra functionality. One such VMOD - and I've looked at this; it is free and open source - is the throttle VMOD. Now, that means that we need to start keeping track of IPs, and it will use a bit more memory... That's okay, we have more memory. And then we need to start applying limits to how many downloads specific IPs can do. And we can limit it to MP3 files only. So if we have a bot, or, for example, an RSS aggregator or something like that, we're okay serving those requests... Because again, that's what Varnish is meant to do. The problem here is that we're serving a lot of bytes for MP3s - the same MP3, over and over - and that cannot be real traffic.
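The free, open-source module being described is most likely vsthrottle, from the varnish-modules collection. A minimal sketch, with illustrative limits - the key prefix and numbers are assumptions, not Pipely's config:

```vcl
import vsthrottle;

sub vcl_recv {
    if (req.url ~ "\.mp3") {
        # Per-client-IP budget: at most 20 MP3 requests per hour;
        # offenders are denied for 10 minutes. Tracking is in-memory,
        # which is the extra memory cost mentioned above.
        if (vsthrottle.is_denied("mp3:" + client.ip, 20, 1h, 10m)) {
            return (synth(429, "Too Many Requests"));
        }
    }
}
```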
Adam Stacoviak
Yeah. I mean, in this case you could potentially tie it just to this MP3, like you said - because this is not at all a normal MP3 scenario. Like, if you request this MP3 with this kind of request signature, X per whatever... I mean, I didn't examine the actual signature of the requests, but that's probably how I'd investigate it... Begin to isolate it. Does that require us to write a lot of defensive code against that kind of scenario?
Gerhard Lazu
I don't think so. It's just system integration; we just need to add more configuration. And back to Jerod's point - we're now chasing new problems that we didn't even think we would have... But we have what looks to me like an actor that's not very - I want to say this in a nice way... An unfriendly actor, that is not very happy, and they are very angrily downloading our MP3 over and over and over again, thousands of times, across thousands of IPs... And this is not cool. Because ultimately, we end up paying for this bandwidth. That is not helping anyone.
\[01:22:23.19\] But that's one. It's not the only one. We have one more. So you can see here, for example - this is the last seven days; we have seven terabytes that were transferred in the last seven days. Seven terabytes? Maybe it's more than that. It needs to be more... The geocode does not exist. Okay, I was expecting to see more than that. Anyway, Asia is where we can see that pattern, but we also have these spikes in Europe sometimes. And this spike, which I wanted to focus on - we know that someone that connects to Frankfurt downloaded the static favicon 170,000 times in the span of, I don't know, an hour or two. Two, three hours. So we get requests like this, that are putting stress on these instances. And that was a pass request as well, which means it went past the cache - which means they must have had a cookie set, or something like that, that was basically preventing the cache from working in this case... Which - again, that's how it's supposed to work.
So anyway - that was, unfortunately, not the best thing we could have ended on, but it's a thing, and it's food for thought. More work to be done. There are many things that we didn't get to talk about, that we didn't have time for... For example, we didn't talk about the Nightly... By the way, Nightly is now being served by the Pipe Dream as well. And the reason we had to do this is because it sometimes gets scraped - it would get hit really heavily... It's a very small app, it's NGINX, but if I open it -- so let's just click on that one. That's pull request 46. Before, it was basically topping out at 141 requests per second. Now it's at 1,300 - so it's almost 10x, an order of magnitude more. The latency went way, way down... And the only thing we had to do was put Varnish in front of it.
Jerod Santo
Nice. Well, that's nice.
Gerhard Lazu
Yeah, that's one more thing there, and you can go and have a look at how it works... There's a benchmark here, a small benchmark... And that's it. We have the last one for the road - but before we do that, anything else you want to talk about before I share one last thought?
Adam Stacoviak
I suppose - what do we do if we know these downloads are happening? We're here on the podcast, just politely asking them to stop... Do we just let it keep happening?
Gerhard Lazu
Well, we could set up some sort of throttling... I think that would be the easiest thing. Now, it will impact everyone... I don't want to start blocking, again, IP ranges, net blocks, because we don't know who's going to be caught in there. They may change to other IP blocks. That's entirely possible. We don't know how this will play out. We can't block an entire country, an entire continent - especially if it's a big one... I don't think that's reasonable. So really, throttling is, I think, the fairest thing. And we can throttle MP3s specifically. Because we do have, for example -- I see them; we have a Python client and a Go client, and every week they come and download all our MP3s. I don't know why they do that, but every seven days they request every single MP3 that we have. So they're scraping the website, and then pulling everything down. I don't know why.
Adam Stacoviak
Yeah.
Gerhard Lazu
\[01:25:49.11\] Again, the more I was looking at this -- because I was working so deep in it, I started noticing these behaviors that you would normally not see... So it's one of the advantages, I suppose, of working so close to the traffic, to all the requests, and having this level of understanding and visibility into every single request. It really helps, down to the IP level.
Adam Stacoviak
Something like that though, like the Go client and the Python client - would that be a Honeycomb thing? Where would that be --
Gerhard Lazu
Yeah. It's Honeycomb, yeah. You can filter by user agent, for example, and you can see that there'll be... For example, say -- no, I don't want to show any IPs or anything like that, so that's why I'm not going to screen-share it... But once we start digging into that, you can say "Group by client, by user agent", and you can say "Filter by MP3s" - like, "URL contains MP3." And that will group them... And you can say "Oh, and by the way, only show me where there are more than, for example, 100 downloads." And then you'll start seeing the outliers - the clients that are downloading certain MP3s, or MP3s in general, excessively.
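In terms of Honeycomb's Query API, that query looks roughly like this - the column names are guesses at whatever attributes Pipely sends, and the thresholds are illustrative:

```json
{
  "time_range": 5184000,
  "breakdowns": ["http.user_agent"],
  "calculations": [{ "op": "COUNT" }],
  "filters": [{ "column": "http.url", "op": "contains", "value": ".mp3" }],
  "havings": [{ "calculate_op": "COUNT", "op": ">", "value": 100 }],
  "orders": [{ "op": "COUNT", "order": "descending" }]
}
```

That reads as "group by user agent, only MP3 URLs, only groups with more than 100 requests, busiest first" over the last 60 days (5,184,000 seconds).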
Now, that can be spoofed. That's the other thing. We have, for example -- the request agent, like the user agent, it's empty; it's an empty string. That also happens. Because you don't have to send the header if you don't want to.
Adam Stacoviak
Yeah, you can also send whatever you want to, if you want to.
Gerhard Lazu
So that can be spoofed.
Adam Stacoviak
Yeah. It's like, whenever you build systems like this, and then even when you observe them - I guess you don't expect, but you kind of hope, that clients, a.k.a. people, behave. That they're going to use the system for the system's purpose, not to download and scrape the entire thing once a week... I mean, in that case, somebody could have their own web archiver, and they could have altruistic reasons for it. I think that's kind of silly, but you know - downloading the entire contents onto somebody's disk once per week seems like "I want your thing. And I want to keep getting your thing. And if it ever changes, I want to make sure I have that snapshot." I don't understand it. It doesn't make any sense. What would make anybody do that? What is the purpose and motivation to keep doing that - to even commit the compute, or the script, or the time to do that? What are they getting from it? I don't know.
Jerod Santo
We need to go over there and knock on some doors, man, and ask them. "Why are you doing this...?!"
Gerhard Lazu
"What's up with that?"
Adam Stacoviak
Every door...
Jerod Santo
"What's in it for you?"
Adam Stacoviak
...in Asia. "Do you listen to the Changelog?" "Yes, I do." "How many times?!"
Gerhard Lazu
Yeah.
Jerod Santo
"Tell us about 456. You know what 456 means, don't you?!" \[laughter\]
Gerhard Lazu
Yeah. So this is, I think, a really delicate and a really important point to discuss... Because this is how good systems become bad systems.
Adam Stacoviak
It's true. Yeah. You have to treat everybody as bad...
Gerhard Lazu
Exactly. We don't want to be doing this, but we are forced to do something about something which isn't good... It's not benefiting anyone, and we have to step in and do something about it. Now, we have to do it... I was expecting this to stop, but it's still happening, even to this day. We made Varnish -- I mean, now that it's stable, it's able to serve more traffic... We just had the biggest spike, because now the system is more stable, but it means that bad actors -- again, I shouldn't be using that. Unhappy people, unhappy clients...
Jerod Santo
Use it...! Who are you gonna offend? The only person you can offend is the one who's doing this, and I'm fine with it.
Gerhard Lazu
Yeah.
Jerod Santo
\[01:29:34.18\] They need to knock it off. Here's a -- this might be a cudgel... But if we're trying to solve the problem of "They're taking our bandwidth for something that's no longer relevant, or interesting, and it's been out there for years", what if we could just toggle certain episodes? ...and this might be a cat and mouse game as well. But at a certain point it's like "Well, just give them the R2 URL, and not the CDN URL, and just let Cloudflare deal with it." Just let them download directly from Cloudflare, and we're just out of the equation then. We don't care about the stats, we don't care about anything. We're just like -- you know, we've served this file plenty of times through our CDN. Now we're going to just let R2 serve it. What do you think about that idea?
Gerhard Lazu
I can see this being a very simple fix for this specific episode, because we can just serve a Location header - we just redirect, and that's it. We're done with it. So it'd be another synthetic response.
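In VCL, "a Location header as another synthetic response" is the classic synth-redirect pattern - roughly like this, where the path and bucket URL are placeholders:

```vcl
sub vcl_recv {
    # 752 is an arbitrary internal status used to trigger the redirect.
    if (req.url ~ "^/uploads/podcast/456/") {
        return (synth(752));
    }
}

sub vcl_synth {
    if (resp.status == 752) {
        set resp.status = 302;
        # Hand the client straight to object storage and step aside.
        set resp.http.Location = "https://bucket.example.r2.dev" + req.url;
        return (deliver);
    }
}
```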
Jerod Santo
The question is if they're actually malicious, then they switch to a new episode and start doing that one, right?
Gerhard Lazu
Exactly. Exactly. And we have other clients, which are -- for example, we've seen \[unintelligible 01:30:38.28\] they're basically busting the cache and then purposefully going to R2 directly, and Varnish almost acts like a proxy in this case.
Jerod Santo
Right.
Gerhard Lazu
So we have that as well. We have -- every now and then we have this random client that comes and downloads all the episodes... And that's not a problem. So I think that some sort of a throttle would make sense, which would keep the system fair to everybody. But the throttle needs to be high enough so that it doesn't impact anyone else.
Now, if our requests grow - if our audience grows, or we become more popular and we get more requests - obviously, we'd need to be aware of where the limit is, and start increasing the limits once we're throttling too much, maybe. But that seems more long-term, and it seems like a more well-engineered approach, in a way... But certainly, the simplest thing would be just "Take this one URL." I mean, that could be done in minutes - roll it out, and then we would stop this abuse for this specific MP3. That would be the easiest thing, for sure. So yeah, I can see how pragmatic that approach is, and I like the pragmatism.
Jerod Santo
Well, it's at least worth checking to see if, you know, the mouse is still alive over there.
Gerhard Lazu
Right.
Jerod Santo
You know?
Gerhard Lazu
Yup.
Jerod Santo
And if they are - well, then we'll know that this is a cat and mouse game. But if it's just somebody left the blinker on, we're just gonna turn their blinker off for them, and see if the problem goes away. And if it changes to a new MP3, then yeah, we need more generic solutions. But we may not need that at all.
Gerhard Lazu
I do have to say that the internet these days is very different from the internet even a year ago. With the rise of LLMs and AIs, I'm starting to see patterns in our traffic which are unlike any other time. We have these very big spikes, when a lot of data is being requested in very short periods of time, from -- I mean, the user agents don't make much sense... I mean, I know they're spoofed. There are many IPs being used... So it's almost like some system which wants a lot of our content is doing silly things, because some requests just don't make sense. For example, what benefit does the static favicon have? What's up with that? It just makes no sense.
Adam Stacoviak
It's a small file. Maybe it's a heartbeat, or a version of a heartbeat.
Gerhard Lazu
Maybe... But this is the first time I've seen this specific file being downloaded this many times. I haven't seen this before... Which makes me think - is this a trend we'll start seeing, more and more requests that don't make sense? And then you start having to set up some form of protection against all sorts of clients that are just doing the wrong thing.
Adam Stacoviak
Yeah, you need like a defensive layer by default.
Gerhard Lazu
Exactly. Exactly, yeah. And something that would be fair to regular clients... Like, for example, when I want to do a benchmark... I mean, sure, it's me. I wouldn't want other people to do that, but I'm testing the system, making sure the real-world production system, everywhere in the world, is working correctly... And I'm aware of what that means, and what it costs... And by the way, my IPs are removed from all the stats, because otherwise you'd see those massive benchmarks... So we account for that. But we can't account for all these weird clients.
\[01:34:05.02\] It's a challenge... I think it's a good one, but it just sets us up to -- you know, when you become older, it feels like this is more of an adult problem. So we got the thing barely working, we got it out there, we made it stable, reliable, all that... And now we're hitting -- it almost feels like a new layer of problems. And this to me is a hint as to the next phase.
Jerod Santo
Oh, to be a kid again...
Adam Stacoviak
Yeah. Well, one positive thing I think is the robustness of our observability. Being able to have this visibility is great, because otherwise we're like "Wow..." We pat ourselves on the back. "Aaron Parecki, let's get you back on the pod, because man, you are big all over the world. That's amazing. A million downloads."
Jerod Santo
Big in Asia. So what's your one last thing for the road ahead? I agree \[unintelligible 01:34:54.22\]
Gerhard Lazu
What's my one last thing? So I'll keep it short, we'll keep it fun. I mentioned the Christmas tree, and the various things I had going on over the holidays... So - Make It Work Club. That's the place. You're there. Both of you are there. So you can join whenever you want. Next Thursday... Yeah, next Thursday, I'm going to talk about the 100-gigabit WAN. The 100-gigabit WAN. So why would I need such a thing?
Adam Stacoviak
Smokin'.
Gerhard Lazu
It's smokin', for sure. So I thought -- the CCR2004 has like four CPUs, it has multiple 10-gigabit SFP+ ports... It even has two SFP28 ports. But it doesn't have a switch chip. And people that know a little bit about hardware - you want a switch chip to do hardware offloading, L3, and even L4. So after I bought the CCR2004 - it was almost like a Christmas present - I thought "Surely, this will be enough for the rest of my life." And no, I had to get the flagship. So I'll be talking about that - the LAN, the setup, quite a few things - coming up. And it just goes to show how much I enjoy the hardware side of things as well, the networking side of things... Like, I shaved two milliseconds off my WAN. It's amazing. Little things like that. It was already good - it was already sub five milliseconds - but I wanted sub three milliseconds. It is now 2.4 milliseconds. And what it means... Like, why would I do this? So first of all, I'm all about improving. Every winter I improve the network. In this specific instance, I wanted the pages just to be snappier, things to load a lot quicker, to handle a bit more traffic, but also to not have any impact... I was running that benchmark at 2.5, like -- look, I'm going to do another one right now. Let's see. I have a speed test... I have a speed test right here, speed test London... Let's go for this one. So we're recording, we're streaming, and I'm just pulling 2.5, 2.6 gigabits down. And there's no interruption on my network. So it's just my bread and butter. That's how I work.
\[01:37:20.27\] And by the way, if you see any buffering or any slowing down, let me know. I see Adam a bit more pixelated. Maybe you can see pixelated too, I don't know... But yeah, I just pulled six gigabytes, three down, three up... And it's just what I do every day.
Jerod Santo
It's just what he does.
Gerhard Lazu
I work with this stuff, and -- yeah, I enjoy it. And by the way, this is the slower gateway router. So I'm getting the proper one set up, and I'll talk about that. And there's so many things there. VLANing is quite a thing... I have a new IPv4 block, by the way... So some would say that I'm preparing for hosting something. And maybe I am, I don't know. We'll see how that works. But I just realized that my home connection - obviously, I couldn't serve all the MP3s that were being downloaded. That would really cripple my connection if that was happening... But I'm at 2.5 gigabits. The next one will be five gigabits, and the hardware can do it. And the five gigabits - I mean, that's like a decent server.
Jerod Santo
Sure.
Gerhard Lazu
And if you can do five gigabits all day, every day... Sorry. Yeah, gigabits. Gigabits per second. That's pretty decent. So I'm just waiting for more internet.
Jerod Santo
I was gonna say, you're gonna have a hundred gigabit WAN, but you're not gonna have a connection for it, right?
Gerhard Lazu
Right. So... Very few places in the world have that. So if I was in Switzerland, I would get 25 gigabits.
Jerod Santo
Now, would you move? Would you move for this?
Gerhard Lazu
Of course.
Jerod Santo
\[laughs\] "Of course..."
Gerhard Lazu
The only reason to move...
Jerod Santo
I know that sensation.
Gerhard Lazu
Yeah. It's the 25-gigabit connection. But I know that a hundred gig is coming... So we'll see. They either ship it by the time I move, or I move, and then they ship it. So it's one or the other.
Jerod Santo
Okay.
Gerhard Lazu
The important thing is, I have the router to handle that.
Jerod Santo
You'll be ready. You'll be ready.
Gerhard Lazu
I'll be ready. Exactly. So I'm a prepper... I'm prepping for that.
Jerod Santo
Prepping for good internet.
Gerhard Lazu
And interestingly -- five years ago, when I got the previous router, I did the same thing. There's a forum post; it's like a follow-up... So I just did a follow-up recently, at this milestone... I've been at this for a good number of years, optimizing my network and making sure that it's in tip-top condition.
Adam Stacoviak
Relentless... I love it. So relentless.
Jerod Santo
Good stuff, Gerhard. Well, that's a happy note to end on, right? That's a happy note to end on.
Adam Stacoviak
Observability in a hundred gigabit? That's the way to do it.
Jerod Santo
Alright. Well, the good news for Kaizen is we have a lot to work on.
Adam Stacoviak
Always. Always.
Gerhard Lazu
That's what it seems. We know how to pick them, don't we?
Adam Stacoviak
Oh, my gosh... The rabbit hole goes deep, and we keep going in.
Jerod Santo
Kaizen!
Adam Stacoviak
Bye, friends.
Gerhard Lazu
Kaizen.
