
Kaizen! Let it crash (Friends)
The Changelog: Software Development, Open Source
1hr 41min Jan 17, 2026
Gerhard is back for Kaizen 22! We’re diving deep into those pesky out-of-memory errors, analyzing our new Pipedream instance status checker, and trying to figure out why someone in Asia downloads a single episode so much.
Changelog++ members save 6 minutes on this episode because they made the ads disappear. Join today!
Sponsors:
- Namespace – Speed up your development and testing workflows using your existing tools. (Much) faster GitHub actions, Docker builds, and more. At an unbeatable price.
- Depot – 10x faster builds? Yes please. Build faster. Waste less time. Accelerate Docker image builds, and GitHub Actions workflows. Easily integrate with your existing CI provider and dev workflows to save hours of build time.
- Squarespace – A website makes it real! Use code CHANGELOG to save 10% on your first website purchase.
Featuring:
- Gerhard Lazu – Website, GitHub, LinkedIn, X
- Jerod Santo – Website, GitHub, LinkedIn, Mastodon, X
- Adam Stacoviak – Website, GitHub, LinkedIn, Mastodon, X
Show Notes:
- Kaizen 22 discussion #554
- Tw93/mole: 🐹 deep clean and optimize your mac.
- Stuff Goes Bad: Erlang in Anger
- Abacus.ai - the world’s first super assistant for professionals and enterprises
Something missing or broken? PRs welcome!
Adam Stacoviak
How else would you learn? Let it crash.
Gerhard Lazu
Exactly. The best things happen when things fail... \[laughter\] Seriously. If it's in a controlled way, right? I think that's something which isn't said. It's implied. It has to be a controlled failure, where you have the boundary, and things will not blow up. I mean, they'll blow up, but in the fireworks sort of way, where it's a controlled explosion.
Adam Stacoviak
Yeah.
Jerod Santo
Right. Tiny little crashes to learn from. Welcome, everyone, to Kaizen \#22, with the incomparable Gerhard Lazu... He's here to let us know how he lets it crash. It's that song, "Let it snow, let it snow, let it snow", only - you know how to replace... Hey Gerhard, how are you?
Gerhard Lazu
Hey, Jerod. I'm good, thank you. Thank you. I had a great holiday. It was a great couple of weeks where I've managed to finally disconnect. It's been, I don't know, 20 years since I had two weeks completely off...
Jerod Santo
Nice.
Gerhard Lazu
Even my holidays are only a week. So this was very different, very enjoyable, and I feel so refreshed... So I'm firing on all cylinders.
Jerod Santo
You unplugged, and now you're plugged back in.
Gerhard Lazu
Pretty much.
Adam Stacoviak
Plug it in.
Gerhard Lazu
I stopped it, and I started it, and it's brand new.
Adam Stacoviak
It's Glade, man. I'm Glade over here, man. Plug it in, plug it in. You know what I'm saying? Smell the scent, the fresh New Year's scent called 2026...
Jerod Santo
Some people are going to say this is going to be the best year ever. I've heard it said. What do you think, Gerhard?
Adam Stacoviak
They keep saying that, and I'm excited about them.
Gerhard Lazu
They said that about 2020. \[laughter\]
Jerod Santo
2020... We have to admit, it was off to a killer start. I mean, it was really going well.
Gerhard Lazu
Right. Pun intended, killer start...
Adam Stacoviak
What happened in 2020?
Gerhard Lazu
It was COVID. Pun intended? Killer start? That was 2020. 2020 was the year of COVID, and everyone's "Oh, this is going to be the best year ever", and then we had three years of misery. So I think --
Jerod Santo
Put it behind us.
Gerhard Lazu
I just want an easygoing year. You know what I mean? Last year, 2025, 1st of January, we were building shelves. We were redoing studies and whatnot... And the whole year was full on. Like, it was nonstop. Every week there was something significant happening. And this year we would love for it to be a bit more chill, maybe a bit more meaningful... So that's what we're thinking. But how about you, Adam? How are your holidays?
Adam Stacoviak
My holidays were filled with barbecue, and good times.
Gerhard Lazu
Wow. Even in winter... So barbecue never stops. It doesn't know seasons.
Adam Stacoviak
It never stops in Texas. Actually, just to shower you all with a few of my picks from my most recent barbecue adventures... If you're in Zulip, go to the general channel, look for barbecue with three bangs after it, because - why do one bang when you can do three?
Jerod Santo
Bang, bang, bang.
Adam Stacoviak
Some recent ribs... My gosh, my ribs method is on point, my spatchcock chicken method is on point... No one is disappointed at my barbecue joint.
Gerhard Lazu
Very nice. Look at that, we're going to add some meat on this slide... That's what happened in real time.
Jerod Santo
Wow. Real-time meat added. This is -- this is intense.
Gerhard Lazu
Yeah. And again, just to be clear, it's Adam's barbecue. Okay? So, joking aside, we're talking about barbecue.
Adam Stacoviak
Well...
Gerhard Lazu
I think we have to leave it there... \[laughs\]
Jerod Santo
Let's move on.
Gerhard Lazu
I think we have to leave it there.
Adam Stacoviak
I didn't show a burger, but I do make a mean burger, too. Thank you, Gerhard, for assuming that is something I do rock really good. My smash burgers are on point.
Gerhard Lazu
Very nice, very nice. I'm looking forward to that. And so...
Adam Stacoviak
One day...
Gerhard Lazu
My favorite Christmas tree - this is what it looked like.
Adam Stacoviak
Oh, yes.
Jerod Santo
Hm... What is that?
Gerhard Lazu
And for those that are listening, it's a networking cabinet. There's lots of blue lights flashing. This is happening in the loft... You have many terabits of network throughput. There's some switches, there's UniFi, there's Mikrotik... This is maybe five years in the works, and every Christmas I take time to improve it little by little. So this year I went really crazy. I redid the whole thing: the DHCP, the networks, the VLANs... Man, it's beautiful.
Adam Stacoviak
Your VLANs are beautiful...
Gerhard Lazu
They are. They are.
Adam Stacoviak
I want to be a guest on your network, man. I'm going to get blocked from everything, okay?
Gerhard Lazu
Well, well, there's a big story happening in the background, and it is going to be -- I think this will be amazing. This will be the best network that I have run in my life... But the blue, and the darkness, and it's like -- that was one more Christmas tree in our house, and this was it, where I would just go and tinker for a few hours in between the Christmas dinner and all the Christmas festivities... So it was nice just to spend a bit of time tinkering with hardware. And I'm sure that many of you listening, when it comes Christmas time, when things start quieting down, you get the little projects that you didn't have time for throughout the year, and then you have some fun. So I'm wondering, did any of you do anything fun this Christmas? ...but nerdy fun, that's what I mean by that.
Adam Stacoviak
Nerdy fun... Well, I got upset with something...
Jerod Santo
That's not fun.
Adam Stacoviak
\[00:07:47.23\] ...and so I decided to just let it roll. You know what I'm trying to say? I got upset with the amount of RAM usage on my machine... And while I liked the application, I was like "You know what? I'm just kind of tired of having four gigs--" I think it was -- no, it was like 1.2 gigs of RAM being used by CleanMyMac... Fancy little utility application, helps you tune, and pay attention, and stuff like that... And I decided to remake it, and that was it. So I remade it. It's called Mac Tuner. I know there used to be a MacTuner.com, which was, I think, a Mac magazine, I believe... But Mac Tuner fit. I might change it, who knows... But for now it's called Mac Tuner. It does all the things, all the things. Analyze, clean up, uninstall... And not just that fake uninstall; the real one, where you get the dirty dirties out. You know what I'm saying, the dirties? All the dirties are out, okay?
Gerhard Lazu
My mind is on the dirty burger that you mentioned earlier... \[laughter\]
Adam Stacoviak
Yeah. I mean, that's about as nerdy as I can get. I mean, I made a little utility that's for me for now. Soon to be open source, though; soon to be.
Gerhard Lazu
It will be soon.
Adam Stacoviak
Yeah.
Jerod Santo
Very nice.
Adam Stacoviak
I mean, why not, right? Share with the world.
Jerod Santo
Well, I didn't create a Mac Tuner, but I've found one. I also was thinking, CleanMyMac - how long am I going to run this thing? And the answer is "As long as I ran it, because I'm done now." I found a tool called Mole, M-O-L-E, which is a command line macOS cleaner that does everything. So maybe you've got some competition here, Adam. Maybe you can come out and throw some blows down, like "Here's why I'm better than Mole." It's got a TUI, it's all command line-based, it does cleaning, optimizing, uninstalling, DaisyDisk, Explorer...
Adam Stacoviak
Oh, gosh.
Jerod Santo
...all from -- yeah.
Adam Stacoviak
I'm feeling it. I'm feeling intimidated over here, okay?
Jerod Santo
You're starting to sweat?
Gerhard Lazu
I think he just changed his mind about open-sourcing it.
Jerod Santo
Here's your domain name idea, Adam. \[unintelligible 00:09:45.15\]
Adam Stacoviak
That's good. I could do that.
Jerod Santo
So I've been using that, and I'm very excited, because who doesn't want to just have all the things right there in their command line? And I didn't spend any tokens on it. Adam's got some tokens involved, but his also works the exact way he wants it to.
Adam Stacoviak
Yeah. Yeah, absolutely. Mine leverages some Recast stuff as well. It's kind of cool.
Jerod Santo
Sweet. Open source that sucker.
Adam Stacoviak
One day.
Jerod Santo
Which day is that?
Gerhard Lazu
Not today. \[laughter\] Definitely not right now.
Jerod Santo
But it's going to be one day...
Adam Stacoviak
One day. There's a bigger launch awaiting, is all I'll say. There's a bigger launch awaiting till I'm going to open-source some things.
Gerhard Lazu
I've been using AppCleaner for many, many years... Now, there's no TUI, there's no CLI. It's just a regular app. It's a really old one.
Adam Stacoviak
But you just drag and drop onto it, right?
Gerhard Lazu
Pretty much, yeah. And you also have a list of applications... But it's so old that it's difficult to find it these days, and it hasn't been updated in a very long time... So I will check Mole out.
Jerod Santo
Mole's really cool. Brew-install Mole and you're done. So you can check it out right here while we're talking. And I liked AppZapper... And I think AppZapper doesn't exist anymore, but the cool thing about that was that it would literally make the zap sound, as it -- yeah. You drop your app on it and it zapped it. And I just liked that sound.
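For anyone following along at home, the install Jerod describes should be roughly the following - the formula name is assumed from what's said here, and the Tw93/mole README may require a tap:

```sh
# Install the Mole macOS cleaner via Homebrew
brew install mole
```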
Gerhard Lazu
That's the only feature that your application needs to have, Adam. If it zaps...
Jerod Santo
Mole does not zap, so there you have it.
Adam Stacoviak
"Make it zap" is our tagline, actually. Make it zap.
Jerod Santo
Make it zap.
Gerhard Lazu
There you go. I think that's a very good debate, actually.
Adam Stacoviak
What about you, Gerhard, besides your Christmas tree? Did you...?
Gerhard Lazu
I will come back to that. I will come back to the Christmas tree, yeah.
Jerod Santo
This guy's got stories, man.
Gerhard Lazu
Oh, man. Oh, yes. I have to tease them and be very disciplined, because there's too much stuff. So I have to be very careful, because it will be an hour and I will not shut up talking about this thing. I mean, it's just like -- anyway. So we will come back to that, I promise.
Adam Stacoviak
Okay.
Gerhard Lazu
\[00:11:45.18\] Last time, when we finished Kaizen \#21, this was one of the last thoughts that we shared, which is what's next. So BAM... Remember BAM? That happened live. OOM crashes, out of memory crashes, and a bunch of other things. The good news is that only one thing happened. OOM crashes...
Jerod Santo
You've only got one thing to talk about... \[laughs\]
Gerhard Lazu
...but this rabbit hole is really, really deep.
Jerod Santo
Okay. Alright. Take us down the rabbit hole. The OOM, out of memory.
Gerhard Lazu
Who remembers this book? Erlang in Anger.
Jerod Santo
Erlang in Anger.
Gerhard Lazu
Stuff Goes Bad, by Fred Hebert. Ferd.ca.
Jerod Santo
Now, I remember "Learn You Some Erlang for Great Good", but I do not remember this one in particular. So I'm not sure why the other one hit my radar, because he wrote both of them, it seems... But when did this one come out?
Gerhard Lazu
So this one, if I look -- I just switched to the browser... 2016, 2017, while he was still at Heroku. Remember Heroku? Those were the days.
Jerod Santo
I do.
Gerhard Lazu
So about 10 years ago. And Fred -- I mean, if you don't know his blog... It's just amazing. I'll just click it very quickly, just to have a look... I think it's one of the best blogs out there. There's so much goodness here. So much. But one of my favorites is queues, and queuing, and how queues don't protect from overload. So queues don't fix overload. And this is so relevant to today's conversation as well. But there's a lot of stuff in the Erlang ecosystem, and there's many, many things that Fred wrote over the years, that are so relevant to today.
So if I click on Download PDF - by the way, this is a... It's amazing this book is open source. You can download it, open source, freely available, Creative Commons license... And I'm going to make this a little bit bigger, so we can see what's happening. And if I search for "Let it crash", it's page number one. It's in the introduction.
Jerod Santo
There you go. Page one.
Gerhard Lazu
Page one. And this idea of "Let it crash" really comes from the Erlang ecosystem. It's very well renowned there because of how the Erlang VM works, and how all the processes, and the supervision trees just -- it was built this way. And we know a thing or two about Erlang, Jerod, right? ...because the application, Elixir, the Phoenix framework runs on the same principle.
Jerod Santo
I know a thing, and you know two, so that's how we get to a thing or two.
Gerhard Lazu
And Adam - I'm sure he knows the big one. But we don't know whether he's going to share it. The point is, when you think about "Let it crash", Jerod, from your development experience with Erlang, with Elixir, Phoenix - is there any situation, any moment where you could experience it and you realized "Huh, that's nice"?
Jerod Santo
When I let it crash?
Gerhard Lazu
When you let it crash.
Jerod Santo
Well, it's nice that the \[unintelligible 00:14:49.07\] seems to handle a lot of the problems with letting it crash. It just goes again, or there's a supervision tree, and things watching each other, and I don't have to think about it very much. I can't think of an instance in development where I was like "This is really useful", but I'm sure you could come up with one.
Gerhard Lazu
Yeah. So, you know, when we write code, we tend to write it very defensively... Typically try/catch. So you feel like you need to account for every single scenario. And the "Let it crash" philosophy is about not preventing failure, but learning from it. What that means is you need to have a context where it's safe for things to crash, and the overall system will still remain stable.
So how can you build a resilient system - and really, this is about resiliency - where the core of the system will remain running, and the system as a whole will remain running even though parts of it may experience failures? ...but those failures will not bring everything down. And that's really important. So fewer try/catch blocks, don't code defensively, let it crash, and separate the code that solves the problem from the code that fixes the failures. And the more you can lean into the framework, or the VM, or whatever you have - the system - to deal with failures, the better off you are to focus on the things that are unique to your application.
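To make that concrete for listeners: here's a minimal Elixir sketch of the shape Gerhard is describing - not the Changelog app's actual code, just a supervision tree where the failure handling lives outside the worker:

```elixir
defmodule Demo.Application do
  use Application

  def start(_type, _args) do
    # The supervisor is the "code that fixes the failures";
    # the worker below is the "code that solves the problem".
    children = [Demo.Worker]

    # :one_for_one restarts only the child that crashed, leaving siblings alone.
    Supervisor.start_link(children, strategy: :one_for_one, name: Demo.Supervisor)
  end
end

defmodule Demo.Worker do
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  @impl true
  def init(state), do: {:ok, state}

  # No defensive try/rescue here: bad input (say, dividing by zero)
  # crashes this process, and the supervisor restarts it in a clean state.
  @impl true
  def handle_call({:divide, a, b}, _from, state) do
    {:reply, a / b, state}
  end
end
```

Calling `GenServer.call(Demo.Worker, {:divide, 1, 0})` kills the worker process; the worker comes back fresh, and nothing else in the tree is disturbed.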
Jerod Santo
Yeah.
Gerhard Lazu
\[00:16:12.15\] And Erlang is well renowned for that.
Jerod Santo
Kind of the opposite philosophy that Go took, as I write some Go code and I write some Elixir code... Where with Go it's handle every error condition right after you potentially raise one, and make sure that there's no error. And if you're not dealing with it, then you're not writing robust software. And the other philosophy is "Let it crash and deal with it elsewhere." I think they're both legitimate, depending on what you're building.
Gerhard Lazu
Agreed. Well, in our case, we had a lot of crashes to deal with... \[laughter\]
Jerod Santo
Yeah, we're taking the Erlang style...
Adam Stacoviak
Oh, gosh...
Gerhard Lazu
So what we are going to have a look at is all the times that the Pipe Dream has been crashing since our last Kaizen. So since Kaizen 21, which was October 17th, we had a lot of crashes. And there's a certain property about the system - and this is Varnish specifically - that made these crashes pretty okay. And the property which I'm referring to is that when you start varnishd, the daemon, Varnish itself runs as a thread, and you have many, many threads that do different things. So when we had these out of memory crashes, all that happened was the thread was killed. Which means that the system as a whole didn't crash, the VM didn't, the Firecracker VM didn't crash... The application needed to restart. It was just a thread that was using too much memory, and it restarted within seconds, as in maybe two seconds, and everything was back to normal. Obviously, the cache was cold, but it was good. And that's why the memory looked a bit interesting, in that it doesn't release all the memory, the VM doesn't restart... There are no long hangs; it crashes and restarts really, really quickly. So that's a nice property.
Jerod Santo
Well, that confuses me. So how does Fly know about it then, if it's just happening inside of Varnish?
Gerhard Lazu
So it's looking at the process ID, "Which process uses the most memory?" And it's the same process that's asking for more memory. So basically, it will just send a signal to that process, and kill that process. But that is just a thread; that maps to a thread. So Varnish itself didn't crash; it's just a thread that maps to a process ID that crashed, and then it was restarted by the Varnish daemon.
Jerod Santo
Okay, so where is Fly involved in that? Because Fly is aware - I see all these Fly notices, and I get the Fly emails.
Gerhard Lazu
Right. So Fly is aware that there is a process on the machine that is using too much memory, and more memory is being requested. And then it looks like "Okay, which process do I kill?" And in this case, a process with the most memory will get shot, and will get killed.
Jerod Santo
So Fly as a platform can actually reach in and kill that process without killing the machine, rebooting the VM, or Firecracker, or whatever?
Gerhard Lazu
So the Fly platform - it integrates with that functionality, which is a kernel thing; it's a Linux functionality. That's why an out of memory crash would happen even if you had a single machine; something is using too much memory, you don't have any swap... How do you give out more memory when there's no memory left, and the system is becoming unstable? So then you get just a single process which gets killed. In Fly's case, they surface that. They surface the fact that there was an out of memory crash, there was an out of memory event, and they send you an email when that happens. It doesn't mean that the machine had to restart, it doesn't mean that it stopped serving traffic... It just means there was something that just had to go away, because it was using too much memory.
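As a side note for the curious: this is the standard Linux OOM killer at work, and you can peek at its candidate list on any Linux machine. A small illustrative sketch - nothing Fly-specific:

```sh
# Print each process's OOM badness score (highest first); the top entry
# is the likeliest victim the next time the kernel runs out of memory.
for p in /proc/[0-9]*; do
  printf '%s %s\n' "$(cat "$p/oom_score" 2>/dev/null)" "$(cat "$p/comm" 2>/dev/null)"
done | sort -rn | head
```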
When I say "too much memory", obviously it's a bit more complicated than that, because something was asking for memory, the kernel didn't have any more memory to allocate, so it just had to look at what needs to be killed so that it can allocate more memory... Because something is using too much memory. And it just so happens it would be this process, and this thread. So how many crashes do you think the Pipe Dream had since Kaizen 21, since October? So we're talking about three months, maybe a bit more than that...
Jerod Santo
\[00:20:16.01\] So Gerhard has presented us a multiple choice quiz. A is 20, B is 40, C is 80, D is 160. Now, I know that I personally receive an email every time this happens, and so I have a little bit of a feeler into this. I delete them, so I can't go do a quick search. Adam, do you get emails when these Fly things crash?
Adam Stacoviak
I don't.
Jerod Santo
Okay, good for you.
Adam Stacoviak
Not to my knowledge. And if I do, they're in a box that doesn't get looked at.
Jerod Santo
You've been saving on some email bandwidth...
Gerhard Lazu
I do know, because we send the email... So let's go back to this one. If I click on this one... Let's take this one, and you can see everyone that gets an email. I'm just going to make this a little bit bigger, so you can see: it's Jerod, Adam and Gerhard.
Jerod Santo
Oh, you do get it.
Adam Stacoviak
I do get it.
Gerhard Lazu
Yeah. So there must be a filter...
Jerod Santo
He just doesn't look at it.
Adam Stacoviak
A superhuman saving me. Nice...
Gerhard Lazu
That's okay. So what do we think?
Jerod Santo
Good thing other people are looking at it...
Gerhard Lazu
It's not an Adam problem. That's the thing. So that's a good thing. He's doing the right thing. He's just saving his inbox for more important messages.
Jerod Santo
They ran an LLM on that and set it to the side. So I feel like 160 is too many. I don't think I've gotten 160 emails since October on this particular thread. 20 feels not enough. I've certainly got more than 20 emails. So I'm between 40 and 80, and I'm going to think that -- gosh, that's a tough one. I'm going to go with 40. Adam, what do you think?
Adam Stacoviak
I'd go with 40 as well.
Jerod Santo
Oh, I got it. Yes...!
Gerhard Lazu
43 exactly.
Adam Stacoviak
The price is right.
Jerod Santo
The price is right. Alright, cool.
Gerhard Lazu
Yeah.
Jerod Santo
43 crashes from October to December; through the end of the year.
Gerhard Lazu
Yeah. And then obviously, there were periods when we had quite a few. So if we were to think about what could be happening in Varnish such that it's running out of memory and crashing... So this is us trying to think about the sort of traffic that we serve, trying to think about everything -- I mean, now we see every single request that hits Changelog, the CDN as well... And it's a lot of requests.
Jerod Santo
Yeah.
Gerhard Lazu
So there was something in the system that was using way too much memory, and as a result, the process - or the thread in this case - was crashing.
Jerod Santo
I mean, I could guess it, but I might even have some insight. So... Should I just say it, or do you want Adam to guess? I mean, my guess based on - also I saw some emails flying through, but... Already I would have suspected that we just have too many large files. These 60 to 80 to 100 megabyte MP3 files loaded into memory, flying every which direction... And you just can't load up that much memory without some sort of fancy freeing mechanism. And it's just trying to hold all these MP3s in RAM, I think, and it just can't do it. So that's my guess.
Gerhard Lazu
Yeah. That was a good guess. And I think -- the next question is going to be to the audience, because we know too much.
Jerod Santo
How are they going to answer it? It's not real time.
Gerhard Lazu
Well, just think about it... We will give some time for people to think.
Jerod Santo
Okay. We'll do like a delay here. So if they have a -- what's it called? The feature where you skip silences on... They're not gonna have any time to think about this.
Gerhard Lazu
Right. Okay.
Jerod Santo
So quickly, turn that feature off, give yourself some time to think... Go ahead.
Gerhard Lazu
Yeah. Or pause. We can also say pause. Now is a good time to pause. And then - what could be the problem? So you're right - all those large files. We had all the MP3 files; many, many MP3 files. They're large. All trying to be cached in memory. And that was a problem. So what is many? Well, at this point we have thousands of MP3 files, across all the podcasts, since the beginning of time. Large means anywhere from 30-40 megabytes, to 100+ megabytes. So that's -- I mean, just think, if you had to load a thousand files of 100 megabytes each... That's a lot of memory that you need to have available.
\[00:24:30.17\] And the problem is that once you store these large files, as we discovered, you get memory fragmentation. Imagine that you have all the memory available, you keep storing all these files, and at some point there's no more memory left. So what do you do? Well, you need to see what you can evict from memory, so that you can store the new file. So imagine that you evict a few of those objects, but maybe they aren't big enough, and you haven't evicted them fast enough. So then you have this big file that can't fit anywhere, because the holes that you have in memory aren't big enough for this file to fit. And there's no defragmentation or anything like that running in the background... Which means that even though technically you kind of would have space in the memory, for the specific files you may not. And then it can't be stored in memory. Now, the thing in Varnish is actually called - I kid you not - n\_lru\_nuked.
Jerod Santo
Nice.
Gerhard Lazu
So I think the connection to the nuke and to the book, and to "Let it crash" is right there. So lru\_nuked, basically - it's like a forced eviction. So it's an event where an object has to be evicted from the cache just to make room for a new one, because the storage is full. So you can see how many times this has happened. And that's like an important metric that if we look at, we can see, "We had too many of these events." Many objects were being nuked from memory to make room for new objects, but sometimes they wouldn't fit.
So how badly did it nuke? Because we can measure this, we can look at this. And this is what that looks like from a memory perspective. So you can see that the instance was running at about maybe four gigs of memory, and then we had a massive spike within minutes, like one or two minutes, to 16 gigabytes. So that's a lot of data that had to fit in memory. And you can already see where this is going... Scrapers, and bots, and LLMs... We have so many things happening. And then you can see the memory, it went up. The thread was killed, the child was killed... The Varnish \[unintelligible 00:26:40.14\] memory came down again, and then it went up again. So on the graph that we see here, we can see the first spike, then, maybe a minute apart, the second spike, another crash... It took a little while for it to restore. We're talking maybe 10 seconds. And then we stabilized around 10 gigabytes. From a CPU perspective, we got like a hundred percent CPU utilization when this happens. Everything is full on, everything -- the instance is really struggling to allocate and deallocate and free up memory... And more importantly, we have a lot of traffic flowing through. So how much? 2.29 gigabits, specifically. 2.29 gigabits...
Jerod Santo
Per second.
Gerhard Lazu
Per second, exactly. And these happen so quickly; you have a huge rush of traffic coming in... And then nothing.
**Break**: \[00:27:40.00\]
So why is more traffic coming into the instance than going out? So this is the traffic that the instance is receiving. So we're receiving 2.29 gigabits, but we're only sending 145 megabits. Now is a good time to pause and think about why this is happening.
Jerod Santo
Yeah, don't skip silence. So when we say the instance, we mean the Varnish instance.
Gerhard Lazu
The Varnish instance, yeah.
Jerod Santo
Which sits between our end user, whatever that is - or users - and our application. Well, actually, and our Cloudflare, not our application.
Gerhard Lazu
All our backends. And we have a couple of backends.
Jerod Santo
Yes, but in the case of MP3 files it's our R2 origin.
Gerhard Lazu
That's correct.
Jerod Santo
So Varnish is receiving a bunch of data, and sending back an order of magnitude less data. And what's it receiving - I don't know, man. I mean, my guess would be we're uploading MP3s... Now, that's gonna go straight through the app to R2... Just a DDoS? I mean, what is it? I don't know.
Gerhard Lazu
Yeah... So it is a DDoS, but it's specifically downloading MP3 files, or starting to download MP3 files, but never finishing.
Jerod Santo
Hanging...
Gerhard Lazu
So you get all these requests for MP3 files, for large files... Varnish is going and fetching them as quickly as it can... So pulling all this data in, so it has it in memory, but the client is never around long enough.
Adam Stacoviak
It terminates.
Gerhard Lazu
Yeah, exactly. So they basically abort, but Varnish is still pulling in all the data. Now, there is a property... It's called beresp.do\_stream, and it's set to true. So what this does - a very weird thing - is it tells Varnish not to buffer the entire backend response if the client is slow. So I'm not going to fetch the entire MP3 file if you only want the first, I don't know, minute, or two, or a range, or something like that. Now, this is on by default. So by default, that's how Varnish behaves. So we wouldn't need to enable this. But if the object is uncacheable, it cannot be stored in cache - do you see where I'm going with this? Memory - you can't store it in memory... So you keep pulling these files over and over again, and maybe even just fragments of them... So even though the client never receives them, you may be pulling hundreds of files, and the client just goes away. So you're not pulling the entire file, but you're still pulling enough, and not able to fit it anywhere, and it just becomes a mess.
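In VCL terms, the behavior being described looks roughly like this - a sketch, not the Pipely code; as noted, streaming is already the default in modern Varnish, so you'd only set it explicitly to document or override it:

```vcl
sub vcl_backend_response {
    # Stream the body to the client while it is still being fetched,
    # instead of buffering the entire backend response first.
    set beresp.do_stream = true;

    # The catch from the conversation: if the object is uncacheable,
    # the bytes fetched for a client that hangs up are pure wasted work.
}
```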
Adam Stacoviak
This reminds me of the '90s, when you used to go jean shopping...
Gerhard Lazu
\[laughs\] Tell us... Do tell, Adam.
Adam Stacoviak
And you'd go into Abercrombie & Fitch - which I never shopped at, but let's just imagine I did... I'd go in there and be like "I like all these jeans. Get them all." I'm trying them all on, and then I just bounce.
Jerod Santo
Yeah, the person goes to collect them all, they come back and you're not there.
Adam Stacoviak
Here's a dressing room full of jeans, and Adam's gone. Bye-bye. See ya.
Jerod Santo
This really sounds like you're speaking from experience. Was this a prank \[unintelligible 00:32:54.20\]
Adam Stacoviak
I just made it up just now. I'm just creative like that, you know? On the fly. Creativity.
Gerhard Lazu
\[laughs\] That's a good one. That's a good one. On the fly, yes.
Jerod Santo
It is on the fly...
Gerhard Lazu
It is. On the fly.io. Boom!
Jerod Santo
Well, what could we do then? What's going on here?
Gerhard Lazu
Exactly. So this was one of the things where I had to deep-dive and understand what on Earth is going on. Where do we store things, what's happening... So there's a lot, lot more that went into this pull request. It's pull request 44. I'm calling it "The elephant in the room." I'm going to switch to the browser, just to have a look at that.
So the title of the pull request is "Storing MP3 files in the file cache." But that's just the tip, right? The most obvious thing is, "Well, you either have lots and lots of memory to give Varnish", which honestly would be impractical, in the sense that it would be way too expensive to store all these files in memory. The next best thing is to have something like a file cache. And by the way, we're talking about open source Varnish. That's really important. Anyone can use this, anyone can configure this... You can configure a file cache, which will basically pre-allocate a file on disk, and that's where these large files will be stored. Pull request 44, the one that we're looking at, is in the Pipely repository. That's what this adds.
\[00:34:19.08\] But there's significantly more stuff... And if I'm going to -- so there's quite a few files. I highlighted a few, so I'm going to look at this one... So it's not just that. You also need to tune, for example, thread pools, you need to tune the minimum, the maximum... You need to tune the workspace backend, like "How many memory structures get allocated?" You need to configure the nuke limit... And there's a couple more things that we had to go through, just to make things stable.
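None of the exact values are read out on air, so here's a hedged sketch of the kind of varnishd invocation being described - the storage and parameter names are real Varnish ones, but the sizes and limits are illustrative, not the ones in PR 44:

```sh
# Small objects stay in RAM (malloc); large MP3s go to a pre-allocated
# file on disk. nuke_limit bounds how many objects a single miss may
# evict to make room; workspace_backend sizes per-fetch memory structures.
varnishd \
  -a :8080 \
  -f /etc/varnish/default.vcl \
  -s memory=malloc,3g \
  -s disk=file,/var/lib/varnish/cache.bin,40g \
  -p nuke_limit=1000 \
  -p thread_pool_min=100 \
  -p thread_pool_max=300 \
  -p workspace_backend=128k
```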
Now, I'm just going to very quickly mention these things. You can go and have a look at the pull request to see what else went into it.. So this was the one file. The other one was the regions. That's another thing. Not all regions would suffer from this. So you don't want to allocate too much memory or too much CPU to regions where maybe they don't get a lot of traffic. And you would think that this thing is easy, but oh man, I have a surprise for you... You can't mix and match sizes easily in Fly. So you can't say "Create application groups, and this group will be the small group, and that group will be the big group, and this is just one application..."
Jerod Santo
Really...?
Gerhard Lazu
It's not straightforward. So you have to -- again, this is how I solved it. Maybe someone listening to this will tell me, "Hey, Gerhard, you're wrong." I would love to know that, seriously. So the way I solved it is we deploy in all the regions, because you specify the size once. So you say "My starting size is the large instance type." It has a certain number of cores, a certain number of memory... And by the way, the disk is the same in all of them, because that's another problem, so we will sidebar that, or put a pin in that.
So when it comes to the initial deployment, you deploy the one size across all the application instances, and then you go and need to check to see which instances should be scaled down, so that you have the capacity, but the regions that don't need the capacity can just bring them down. And you do a rolling deploy, in that you replace one for one, you have plenty of capacity to handle the traffic while instances are being rolled... All that good stuff. But we have hot regions, and then we have cold regions. And there's quite a few things here. Again, if someone knows how to do this better, I would love to hear about that.
And we have the TOML, we have the primary region... There's a couple of things here... We'll come back to services and -- HTTP services. That's a fun one. We'll leave that for a little bit later. Fly Just... We can see how we do the flyctl deploy, we disable HA, because we want only one instance per region... We have 15 regions in total. We specify the CPUs, the memory, all that good stuff, including environment variables... Oh, that's another thing - we need to adjust the Varnish size based on the memory the instance has. We need to say "Hey, Varnish, you get 70%." And that's the other thing that this does. Same thing for the file size. You can't take up the entire disk. We tell it, based on the disk that we provision, how much space it should use from the disk that gets created. There's a scaling there, so that's another good one...
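A rough illustration of that sizing step - a hypothetical entrypoint script, not the actual Pipely one; the 70% figure comes from the conversation, the 80% disk fraction is made up:

```sh
#!/bin/sh
# Give Varnish ~70% of the instance's RAM, and a bounded slice of the
# provisioned disk, then start varnishd in the foreground.
TOTAL_KB=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
MALLOC_MB=$(( TOTAL_KB * 70 / 100 / 1024 ))

DISK_KB=$(df --output=avail /var/lib/varnish | tail -1)
FILE_MB=$(( DISK_KB * 80 / 100 / 1024 ))

exec varnishd -F \
  -f /etc/varnish/default.vcl \
  -s memory=malloc,"${MALLOC_MB}m" \
  -s disk=file,/var/lib/varnish/cache.bin,"${FILE_MB}m"
```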
I'm going through the pull request to see if there's anything else. Oh man, this was a pain... So recreating -- like, writing tests for this. Everything is tested, in the sense of which requests would go where - basically, which files would get cached in the file store, and which files would be cached in the memory store. So how do you write the tests? Some Varnish logging is included, you have to have anchors... There's quite a few things. So that's assetsbackend.vtc. And part of this - it was a huge refactoring. So if you look at the lines of code, I wouldn't say it's that many. 1,500 were added, and 1,470 were deleted. So not much changed. I mean, the net is 30 new lines. But there was a huge, massive refactoring as part of this.
\[00:38:21.16\] So there's -- again, this was, I think, two-three days of figuring it out, trying things, refactoring things... And if you think that an LLM can help you - well, you try this. \[laughs\] It takes longer to go through those iterations than if you know what you're looking for; then it tends to be easier.
Anyway, it's very dense, very specific, very difficult to make sure that it's doing the right thing. But it's all there. We have the mock backends, we're reusing things... We split the VCLs -- by the way, we finished the splits, so it's easier to reuse them. So there's quite a few things there.
Now, this is Kaizen, so we are wondering what improved. After all this work, we rolled it out... What improved? And to answer this question, we need to figure out which region is the busiest one. So out of all the regions that we serve - we have 15 in total - which ones get the most traffic? It's those hot regions. We're looking at Fly, the Grafana dashboard for our Fly application, the instance of the Pipedream, the current one... And we can see that SJC - San Jose, California - is a nice, big, red circle, which means it has the most traffic... And also NRT, which is Tokyo.
Jerod Santo
Huh.
Gerhard Lazu
Apparently.
Jerod Santo
We're big in Japan.
Gerhard Lazu
Yeah. And Europe, there's quite a few... So if I'm going to pull this down a little bit; let's see... No, I wanted to go here.
Adam Stacoviak
What about that new continent? Are we big there?
Gerhard Lazu
The new continent? Australia?
Adam Stacoviak
No, there's a new one. There's a new-new one.
Jerod Santo
Well, what's it called?
Gerhard Lazu
Which is the new one?
Adam Stacoviak
I don't know. There's a headline... I thought y'all would get the joke.
Gerhard Lazu
No.
Adam Stacoviak
Over the holiday there was speculation there was a new continent being announced.
Gerhard Lazu
Narnia?
Adam Stacoviak
Maybe... It could have been Narnia.
Gerhard Lazu
\[laughs\] No, no, no.
Adam Stacoviak
With the closet.
Gerhard Lazu
So right now, even this list is basically -- if you think about it, it kind of makes sense. It's US East, US West, Europe... But we have quite a few instances in Europe. We have four. It's more geographically spread in Europe. And we have Asia. So these are like the big ones. Australia, Africa, and South America - they're not as busy. They are the least busy regions. Cool... So which instance would you like us to have a look at? So I have a queue right here...
Jerod Santo
SJC, baby. Let's go. Let's go big.
Gerhard Lazu
SJC, baby. Alright, let's see that. So I'm running flyctl ssh console. I'm using two flags. -s, which is the short one for --select... It'll prompt me for which instance I want to select. And then I have -C. Capital C. It's different than lowercase c; they do different things. I give it the command to run. And it's \[unintelligible 00:41:11.25\] which will give me all the statistics from Varnish at a point in time, since this instance has been running. I will select SJC... There you go. And it will give me all this data, which is all the counters that Varnish is incrementing, keeping track of different things: the origins, the backends, the memory pool, the disk pool, the lock counters... There's so much stuff. I'm really, really impressed by how many things Varnish has.
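From context, the command behind the \[unintelligible\] is presumably varnishstat; a hedged sketch of the invocation being described (the app name here is made up):

```sh
# -s / --select prompts for which machine to SSH into;
# -C runs a single command instead of opening a shell.
fly ssh console --app our-pipedream-app -s -C "varnishstat -1"
```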
So this is what we're going to do. "We", because AI... We're going to copy all of this, we're going to ask AI what it thinks of this... How about that? \[laughs\] There's just too much data here, so let's be serious about it. So question to you - which is your favorite AI, Jerod? Which one do you use?
Jerod Santo
Oh, I don't like any of them. I would probably start with Claude, and then I would go to Grok, and then I would go to ChatGPT, third.
Gerhard Lazu
\[00:42:11.24\] Okay. So Claude, which one? Which version? Which model?
Jerod Santo
Opus, man. Give us the Opus.
Gerhard Lazu
Opus. Okay. So we're looking at abacus.ai, something I've been using for a long, long time... It allows you -- I'm only paying $10 per month for it. Not sponsored, not affiliated in any way... It's just something that I've picked for myself, and I can basically pick any model, and I can just run this. So I have something prepared, so I'm going to drop this - it's all the data - and we're going to read through something that I prepared ahead of time.
Jerod Santo
You pre-prompted this.
Gerhard Lazu
I pre-prompted this, exactly.
Jerod Santo
Okay. You've been engineering this prompt for weeks.
Gerhard Lazu
Not really, but...
Jerod Santo
Oh, that's a long prompt.
Gerhard Lazu
So we're going to read it, and in the meantime, Adam will think about his favorite LLM to try. And I have mine. So we'll try three LLMs to see what they say.
Jerod Santo
Oh, my goodness.
Gerhard Lazu
So I'll need to read the prompt now, while everybody thinks, "No...! You should be using-" whatever LLM you should be using. "You are a Varnish 7 expert. You need to prepare four distinct responses, and be explicit about the person that you're addressing. One, a seasoned sysadmin that has been living and breathing infrastructure for the last 20 years. Be precise, think deeply, and approach the setup from a hardware perspective. Two, an Elixir application developer that embraces Erlang's "Let it crash" concept. You need to give it straight, give it fast, and keep it relevant to their application. Use the app and the nightly backends. Assets and feeds are important, but less relevant. Cloudflare R2. Three, the business person that is selling this thing. They care about costs, efficiency and simplicity. Keep it high level and relevant for someone that doesn't care about the tech, but cares about the outcomes. And four, the audience of a podcast where this is being discussed. Make it general, relatable and fun. Make analogies, keep it light and engaging." I have "fun" too many times. We don't wanna make it too fun... \[laughs\]
Jerod Santo
That's a lot of fun.
Gerhard Lazu
Yeah, that's one too many funs.
Jerod Santo
That's right. Well... \[unintelligible 00:44:16.27\]
Gerhard Lazu
"Now that you understand your audience, please analyze the following VarnishStat output for the... SJC." Look, I already knew that you would pick --
Jerod Santo
How did you know I'd go for the big one...?
Gerhard Lazu
I have no idea. "Focus on things that work well, things that could be improved, and anything else that you find interesting. And by the way, ignore the synthetic requests." It will keep mentioning these... I get so fed up with this. We have health checks that run every five seconds, so they are normal.
Jerod Santo
Okay.
Gerhard Lazu
So I'm going to copy this, I'm going to run this, and I'm also going to open a new window for Adam. So which LLM should we pick, Adam? Which is your favorite?
Adam Stacoviak
You mean model?
Gerhard Lazu
Model, yeah. Which model?
Adam Stacoviak
We just used it... But I'd probably back up to like Codex... Which is like GPT-5, latest... 5.1, 5.2...
Gerhard Lazu
There you go. So GPT Codex. My favorite one is Gemini, so I'm going to drop it... And let's see how do they compare.
Jerod Santo
Ah, Gemini. You're in a different tab now. So Abacus can't do Gemini?
Gerhard Lazu
It might, but I have my own Pro account.
Jerod Santo
Gotcha.
Gerhard Lazu
So that's something else. I use Veo, I use Nano Banana... Quite a few things. Transcripts... It's all part of the package. So it can, but that's what I prefer. Cool. So Claude Opus 4.5, for the seasoned sysadmin.
Jerod Santo
This is you, Gerhard.
Gerhard Lazu
This is me, exactly. Thank you for noticing. For knowing who's who. \[laughs\]
Jerod Santo
You're welcome. I'm following...
Gerhard Lazu
\[00:45:45.09\] So what's working well? Rock-solid stability. So by the way, the instance has been running for 5.4 days. We had all these improvements shipped, and we are able to observe how our busiest instance works... And that's what this is, basically. That was... The window moved. Cool. So after 5.4 days, zero child panics, zero crashes, zero thread failures. This is important. It means no threads died, no threads had to be restarted; everything is healthy on this instance. It didn't crash. So this instance didn't crash.
Zero lock contention across all subsystems. Your CPU cache lines are happy. Excellent hit ratio, 93%. We like that. We really like that. We have backend connection pooling, with a two to one reuse ratio, and memory pressure is minimal: 132 LRU nukes in the last five days. So very few objects had to be removed from memory. Thread pools - 300 threads, zero queuing, zero drops. That's perfect.
Areas to investigate. Disk storage allocator failures... We have disk c\_fail events... We are hitting storage fragmentation. The disk is 97% full. We have 48 gigabytes used. That's how many MP3 files are stored. By the way, how many MP3 files total do you think we have?
Jerod Santo
Size or files?
Gerhard Lazu
Size.
Jerod Santo
Size. Well, if we had a thousand episodes at a hundred megs each, which - neither of those things are true... That'd be a hundred gigs. So a hundred is too big, but a thousand is too small. I'm going to say 80 gigs.
Gerhard Lazu
Adam, do you want to guess?
Adam Stacoviak
That math checks out... I was gonna say like a terabyte, but that's probably raw WAV files, versus not...
Gerhard Lazu
All the files that we store in R2 - and this includes all the assets; but we know that the MP3 files are the biggest - it's close to 250 gigabytes. We may have some duplicates; I don't know, I haven't checked. But that's how much we have in R2.
Jerod Santo
Yeah. Well, we also have Plus Plus for the last couple of years, which means every episode has two files, not just one. So... That makes sense.
Gerhard Lazu
So we should go higher... Now, we use this in every single region, so maybe we want to reduce the number of regions... But I think --
Jerod Santo
We need a third category called "Super-hot."
Gerhard Lazu
Super-hot, yes. Maybe.
Jerod Santo
Which is like SJC and Tokyo, right?
Gerhard Lazu
That's possible, yeah. There's four, which - we know they're really, really hot. Yeah, yeah. But honestly, this is happening across multiple regions, and...
Jerod Santo
It is.
Gerhard Lazu
...we'll get to some interesting things. So okay. Synthetic responses, grace hits... All good. For the Elixir developer - and I think this is you, Jerod. Do you want to read it out?
Jerod Santo
Oh. "Well, the TL;DR is Varnish is doing its job. Your app backend is well protected." Do you want me to read the whole thing?
Gerhard Lazu
If you want... I mean, it's shielded...
Jerod Santo
It's 95% shielded. No failures, zero backend failures... That's because of - you know, my code doesn't really let it crash very often.
Gerhard Lazu
Exactly. Your code is -- yeah, it crashes internally, not externally. \[laughs\]
Jerod Santo
That's right. My thing is doing its thing. It is generating some uncacheable responses, but you know, we do have some that we just don't want to be cached... Ooh, one fetch failure. Negligible. Yeah, I agree. We don't need to worry about that. And in the end, it says "Whoever wrote this is really good at what they do."
Gerhard Lazu
I agree. That's exactly what it says. \[laughs\]
Jerod Santo
And they should be proud of themselves... And congratulations on such a great hire.
Gerhard Lazu
Yeah. I agree. I agree.
Jerod Santo
\[laughs\]
Gerhard Lazu
I think the hire needs a promotion, and a bonus, I think...
Jerod Santo
There you go.
Gerhard Lazu
Alright. For the business person, the caching layer is performing excellently. Adam, do you recognize yourself? Or shall I continue with this?
Adam Stacoviak
You can read it.
Gerhard Lazu
93% of requests never touch your servers. Massive cost savings on compute. Do you know how many requests per second the application is serving? Maximum, by the way. What's the maximum RPS for this amazing Elixir Phoenix application, for the homepage?
Adam Stacoviak
Probably a lot. Gosh... Thousands? Tens of thousands?
Gerhard Lazu
Maximum. Okay. Jerod?
Adam Stacoviak
100,000?
Gerhard Lazu
The database connection is involved.
Jerod Santo
Concurrently?
Gerhard Lazu
Concurrently, yes. I don't know, I'd say not very many. To our homepage?
\[00:50:15.13\] The homepage.
Jerod Santo
That'd be like 12. 12 requests a second.
Gerhard Lazu
Yeah. 17. \[laughs\]
Jerod Santo
17! I'm right in there, baby.
Gerhard Lazu
Someone that knows their code. So 17 requests per second. \[laughter\] So if all these requests were hitting the application, we'd need so much compute to serve that. So much caching... Obviously, we've removed all the caching. Now we're joking about this, because we purposefully removed all the caching from the application.
Jerod Santo
Right.
Gerhard Lazu
I remember that a couple of years back, because we said "This has no place in the application. The application gets restarted, we need to store this somewhere, we need a cluster..." It was just really messy to handle it at that layer, which is why we introduced this. Five plus days running without any issues... By the way, this is the last deploy. So maybe by the next Kaizen, if we do no more deploys, we'll be able to see how well it handles. Zero failures on the infrastructure side, and three terabytes of data served to users. Three terabytes. So in five days, this one instance served three terabytes.
Without your application servers breaking a sweat. Storage is getting full, so basically we need more storage... For the podcast audience --
Jerod Santo
Oh yeah, it's gonna be fun.
Gerhard Lazu
Imagine a really good receptionist at a busy office. This Varnish server is like having someone at the front desk who remembers everything. Out of 100 people who walk in asking questions, 93 of them get their answers immediately from the receptionist, without ever bothering the experts in the back office. What's cool? It's been running for over five days straight without a coffee break, or a single mistake. That sounds cruel to me, but let's go with it...
Jerod Santo
Good job. \[laughter\]
Gerhard Lazu
It served three terabytes of data... That's like streaming about a thousand HD movies. This one instance streamed a thousand HD movies in five days. And the experts only had to answer 7% of the questions. The one quirk - the filing cabinet is getting full. It's like when your receptionist's desk drawers are stuffed, and they occasionally have to throw away old notes to make room for new ones. Not a crisis, just time to get a bigger cabinet. Okay, I think the \[unintelligible 00:52:29.09\] 300 workers... I think that's too detailed right there.
Adam Stacoviak
That's good fun there. Good job.
Gerhard Lazu
Do we care about GPT, or Gemini? We can only use one. We can only pick one.
Adam Stacoviak
Gemini is getting some good hotness...
Jerod Santo
Let's check Gemini. We'll see how it adds up. Oh, it's still thinking...
Gerhard Lazu
Let's see. I think it's finished. Maybe that's -- let me just close that. Did it finish? I think it did. Alright, so let's go up... Slow thinking... I did the thinking -- I could have gone Pro as well.
Adam Stacoviak
Show thinking.
Jerod Santo
Oh, show thinking.
Gerhard Lazu
Show thinking. Slow think... \[laughs\]
Jerod Santo
I thought I said "Slow thinking." I was like "Come on, speed up, Gemini."
Gerhard Lazu
Show thinking. There's quite a lot there... Anyway, we're not going to look into that.
Jerod Santo
No.
Gerhard Lazu
So the instance has been up for 5.3 days. The MGT uptime - I like it. It's telling me which of that long list of counters is important. From a system perspective, the threading model is perfectly dialed in. 300 threads across two pools, with zero threads limited, and zero thread queue length. The kernel isn't context-switching unnecessarily, and we aren't starving for execution units. However, we have a critical IO and storage allocation bottleneck in the SMF - the file stevedore. I'm not sure what "stevedore" means. That's a weird one... \[laughs\]
Jerod Santo
I'm not sure Gemini knows what "stevedore" means either.
Gerhard Lazu
\[00:53:51.06\] The problem is SMF disk c\_fail is at 19,000 plus - a 73% failure rate for disk allocations. In g\_space we have only 145 megabytes free, while we have 48 gigabytes used. "You're effectively out of contiguous blocks, or hitting the ceiling of the pre-allocated file." That sounds right. Varnish is forced to fall back to SMA RAM, or simply not cache large objects. While SMA has 2.1 gigabytes free, the disk is thrashing. The hardware: "If this is spinning rust, your IO is likely thrashing." It's not; it's NVMe. So yeah, there's quite a few things there. Interestingly, it's low, relatively -- suggesting the lock contention on the shared memory log is negligible. Nice. We have low lock contention. The bottleneck is strictly at the storage layer. We basically just need a bigger disk.
The Elixir developer - the "let it crash" proponent. I like it. Your app backend is rock solid. Well done, Jerod. Varnish hasn't seen a single backend failure or an unhealthy event in five days. The let it crash philosophy is working. If the app is crashing, it's recovering. So far, Varnish doesn't even notice. I don't think it is... I think the application is good. The good news is we're seeing an effective 75% hit rate. Now, this is different, right? We had 93% there, so I would like to understand the discrepancy. That's another reason why you should use more than one, by the way. And then obviously, double-check the data, because sometimes they give you things which aren't true... The crash concern - the disk cache is thrashing. Again, this is not for the application developer. I think this is for the sysadmin in me. The straight talk... Your MAIN s\_synth is high. So we have health checks. There's delta responses... So we have a lot of synthetic requests. Sorry, synthetic responses. This is, again, a Varnish thing.
The business person - efficiency... we're currently serving this much of our traffic from Varnish... I think I know what's happened. I don't think it's taking into account the synthetic requests. Those should be removed from the total number of requests.
Jerod Santo
So you think Claude has the right number?
Gerhard Lazu
I think so, yeah. Yeah, I think so. This means for customers - we have cost efficiency, that's good. The risk... There's the bottom line. I think this was the fun one, but I think this one is a library analogy. I think we can stop it here.
Jerod Santo
The library analogy versus the secretary analogy.
Gerhard Lazu
I think that was a better one. I got a barista one, I thought it was a very good one...
Jerod Santo
Oh, yeah... For queuing, or for what?
Gerhard Lazu
For queuing, yeah. The barista analogy I thought was very good. This is using books, and whatnot. The library hasn't burned down... That is fun. \[laughs\]
Jerod Santo
Oh, that's a good thing.
Gerhard Lazu
That is fun. So I think Gemini is getting a bit funnier. The nightly feeds in the app are still humming along. Nice. So that's what we have... And that was only half the problem.
**Break**: \[00:56:53.09\]
That was only half the problem. So we're at the midpoint... \[laughs\]
Jerod Santo
Oh... I was feeling good. I felt like we had it all fixed. What else is the problem?
Gerhard Lazu
Oh, man... This is when all the fun begins. So. Do you remember this, Jerod?
Jerod Santo
Yes. MP3 requests intermittently hang in Newark, New Jersey. This was our good friend, John Spurlock, who's been on the show before, and is a podcast nerd... In fact, he runs op3.dev and other podcast nerdery things. And so he really knows his stuff. And so when he reports issues, I don't say "Did you try rebooting?" I take it seriously. So I shared it with you... And he actually did some additional digging for us. Go ahead.
Gerhard Lazu
Mm-hm. So in terms of - you tested this, and I think you had issues as well. So we've confirmed this, for sure.
Jerod Santo
I did. Certain times, certain files... Actually, it would be all requests at certain times. I assume that it was that particular pop - as we could call them; or pipe, in the Pipe Dream - that was hanging. And then it would go away... And he actually had the same problem. He had a Friday night deploy of Friends, and he was trying to listen to it on Friday. Couldn't get to it. By Saturday morning he could get to it. So it's intermittent hanging. Very difficult to diagnose, very difficult - I assume - to debug. And then it just comes back to normal... I thought it was maybe the out of memory thing; like, it's just in some sort of fugue state until it reboots, and then it works again... But you go ahead.
Gerhard Lazu
That's what I thought. That's why I did a deep dive on this. This was November, end of November, beginning of November... So November, I was just trying to figure out what on Earth was going on. Just from the sides... I didn't have too much time. But if you look at this response, there's quite a few things there. This is my initial one, an investigation, trying to understand what's happening, giving a couple of debug headers, a couple of extra headers that the request can be made of -- sorry, can be run with, so we'd just get a bit more details. Forcing regions as well... So there's quite a few things there; I was checking into that.
This is Don McKinnon. He also had issues that day, so he pasted some results... So thank you. Thank you, Don, for adding this. This was helpful. So I'm still scrolling, I'm still scrolling... There we go. Super-helpful. I had confirmed that the requests were hanging... "You are getting the hangs this afternoon as well." This was only three weeks ago... So this had been going on for a while. I dug deeper and I found the problem. The problem was that in the Fly config we had the concurrency type set to connections, not requests. So it's possible to configure this - again, you're configuring the Fly proxy that sits in front of the application - to limit how much traffic hits your application. So requests: how many in-flight requests should the Fly proxy forward to your application before it stops? Because you don't want it to get overloaded. So before it starts throttling, it starts slowing clients down... And that's when you start seeing Fly edge errors.
\[01:02:09.02\] Connections, you would use for something that has long-running connections, like a database, for example. In our case, it's not a database, it's an HTTP application... So requests would have been the right concurrency type. I have no idea why I picked connections. It was the wrong one. But the effect was, as you can see here, we had 2,700 long-running connections on that edge, in that region. In this case it was, I think, the orange one... I think EWR, right? So EWR had all these connections open. The clients weren't closing the connections, the proxy was full, and no more connections could be forwarded to the application.
Long-running connections like that usually mean clients which are not doing the right thing. You shouldn't have that many long-running connections. So the problem was a misconfiguration on our side, which meant that slow, long-running connections were basically blocking other connections from coming through. So that was the problem there.
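For reference, the setting in question lives in fly.toml's concurrency block. A minimal sketch, assuming an HTTP service - the port and limits here are illustrative, not Pipely's actual values:

```toml
# Hypothetical fly.toml fragment - values are illustrative.
[http_service]
  internal_port = 9000

  [http_service.concurrency]
    # "connections" counts open TCP connections (sensible for databases
    # and other long-lived protocols); "requests" counts in-flight HTTP
    # requests, which is the right type for an HTTP app behind the proxy.
    type = "requests"
    soft_limit = 200   # proxy starts preferring other instances
    hard_limit = 250   # proxy stops sending new work to this instance
```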
And I thought that was it, but... But... There was more. So there was a last comment, last week... We now have a check that runs every hour - and I'll talk about the check as well - and what was interesting is we had response bodies timing out in two regions. So 13 regions were fine, but even after this configuration fix there were two regions - IAD and EWR - where, when we were using HTTP/2 - and for some reason this is important... When we were using HTTP/2, the Fly proxy would not forward the connection correctly. As in, it would start, it would serve the response, we could see the headers coming back from our instances; what we wouldn't get is the body. So the body would always be "zero bytes served." And we could see this happening; we could see the connections that, by the way, were still open... They shouldn't have been open, because the application had changed, so those connections should have been dropped... There was something not quite right. My suspicion is the Fly proxy layer. Because when we were forcing HTTP/1.1, everything was working fine. And by the way, the Fly proxy, when it talks to our Varnish instance, uses HTTP/1.1. And you can see that in the headers.
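One way to reproduce what's described here is to request the same file twice, forcing each protocol version; the URL below is a placeholder, and fly-prefer-region is Fly's dynamic request routing header for pinning a request to a region:

```sh
# Over HTTP/2, headers came back but the body stalled at zero bytes:
curl --http2 -sv -o /dev/null \
  -H "fly-prefer-region: ewr" \
  https://example.changelog.com/some-episode.mp3

# Forcing HTTP/1.1 against the same region worked fine:
curl --http1.1 -sv -o /dev/null \
  -H "fly-prefer-region: ewr" \
  https://example.changelog.com/some-episode.mp3
```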
So the proxy to Varnish was fine, but the client to the proxy was not. And HTTP/2 is a very complex protocol. There are so many things which just don't work the way people would expect. So anyway, the issue fixed itself... That's the important thing. \[laughs\] So opening this --
Jerod Santo
Not super-satisfying...
Gerhard Lazu
Yeah, that was very nice to see. And there was something -- Maya Ilaros? How would you read this...?
Jerod Santo
Maya Ilaros.
Gerhard Lazu
Ilaros, there you go. Someone on the Fly community forum was very helpful - they noticed that we had a misconfiguration in our fly.toml. We were using services as well as HTTP service - and this is bad, by the way. This is very, very bad. So everything was happy; we could push this config, the applications were running, everything was fine... But because we had these two things together, it was apparently creating some issues. And all we did was explicitly set the idle timeout. The idle timeout - that's the one where, if after 60 seconds the connection isn't doing anything, it will be forcefully terminated by the proxy. So that part was important.
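The actual fix is captured in pull request 49; as a rough sketch of the shape of the problem - the section names follow Fly's config reference, though the exact placement of the idle timeout key is our assumption:

```toml
# Don't define both of these for the same app. Each is valid on its own,
# and validation passes, but together they create two competing proxy
# configurations:
#
#   [[services]]       # the low-level TCP service definition
#   [http_service]     # the higher-level HTTP service definition

[http_service]
  internal_port = 9000

  [http_service.http_options]
    # Idle connections get forcefully terminated by the proxy after this
    # long. 60 seconds matches the behavior described above.
    idle_timeout = 60
```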
So anyway, we made the change, we pushed the change... But even before we pushed the change, the proxy started behaving. And now there's pull request 49 - we right-sized it, we made a few changes, I captured all the details, the configuration, the commands... It's all there if you want to read it. But most importantly, now we have a check that runs against all regions, every hour, on the hour, in CI/CD, using Hurl... And what I'm thinking is - shall we try running that locally, to see how it behaves? Because that's how I started it: I was running it locally.
Jerod Santo
\[01:06:27.16\] Yeah.
Gerhard Lazu
So on the left-hand side -- I'm back in the terminal. On the left-hand side I am monitoring my internet connection. Remember that Christmas tree? This is related to that Christmas tree. So I'm at the top of the Christmas tree; I'm at the gateway, the core router. It's a MikroTik CCR2004. Pretty good... 10 gigabits per second, maximum. Now, my internet connection isn't 10 gigabits, it's 2.5, which is plenty for this test. So every second it's showing me how many packets and how many bits we're receiving and transmitting. And again, we are recording, everything's happening live, so you can see it jumping as we're pushing more data to Riverside.
Cool. So I'm going to run just check now. And just check - it's one of the commands, one of the just recipes that we have in the Pipely repository. And check - all it does is run Hurl with a couple of flags. It downloads an MP3 file, it downloads feeds... It basically connects to all the different backends, and it sees how quickly it can get data back. We're transferring about -- that was quick, that was eight seconds. I'm going to run it again, and as I run this, pay attention to the left-hand side... It will go to 120 megabits per second. That's the MP3 file being downloaded. So every single time this runs, a full MP3 file gets downloaded, alongside a few other things.
Okay, I can open the reports... We're not going to look into that, because we're going to run something more interesting now. We'll do check-all. And what check-all does is run the same command against all the regions. I'm at 2.3 gigabits per second. We're downloading all the files... We can see the responses coming back. EWR just sped by, \[unintelligible 01:08:18.24\] sped by... So all the different endpoints are returning. Now, I'm based in London... Obviously, the further away you are -- so for example, this was South America, that was LAX... So a couple of instances are slower to respond. And all this happens via headers. When you connect to Fly, you can tell it "Hey, I want to connect to a specific region", and that's what routes the request to that region.
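The real checks live in the Pipely repository (pull request 49); as a minimal sketch of what one of these Hurl checks could look like - the URL is a placeholder, and the thresholds are illustrative:

```hurl
# Download one episode MP3 from a specific Fly region and assert on it.
# fly-prefer-region routes the request to that region's instances.
GET https://example.changelog.com/uploads/some-episode.mp3
fly-prefer-region: nrt

HTTP 200
[Asserts]
header "Content-Type" contains "audio"
duration < 100000   # milliseconds - mirrors the 100-second cap mentioned later
```

Run with something like `hurl --test check.hurl`; looping over region codes gives you the check-all behavior.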
Jerod Santo
That's cool.
Gerhard Lazu
And again, it's all captured in that pull request, and you can see what it looks like. The Check All one - Johannesburg, that's usually slow. And the slowest one is Tokyo for me. Sydney as well can be slow. So we still haven't received the responses from there. We should get that shortly. I'm pulling now 50 megabits, 20 megabits... It's just slowing down. And it's just -- the connections between now and there... The last one there was Tokyo.
In 60 seconds I pulled about two gigs, roughly. It's a lot of data that gets pulled down - the feeds, and all of that. And anyone can run it. I would recommend you not run this, because we have to pay for this bandwidth... But our CI runs it, just to make sure that everything works. And if we look at it running every hour - I think I'm going to tune this down - you can see there are no more connections hanging. So we got to the bottom of that as well.
Jerod Santo
If it ever comes back -- because it went away on its own. If it comes back on its own, we're worried about it.
Gerhard Lazu
Exactly. Now we have a system that is able to inform us when there's a problem. So let's go to three... We're on page number three. This one, for example, took more than five minutes. So sometimes, when the connectivity is a bit slow, some regions can be slow - that's when you get these timeouts. So this is capped at five minutes... The last one that failed was a while ago. So you can see we're January 5th... There we go. There's one that failed January 4th. Check all instances... So let's see Run, and we'll see exactly which region failed. Execution... NRT, that's Tokyo, and as you can see, we have 100 seconds. So if after 100 seconds it doesn't download, it just times out. And we were pulling data, but it didn't finish downloading the entire MP3. And we're downloading 100 and something megabytes.
Jerod Santo
\[01:10:34.02\] Very cool. I mean, not cool that it didn't finish, but cool that that was a while ago, and we can actually test this. Now, do we need to be doing such a large file? Is that part of the test? Or can we test a smaller file, and still get the same results?
Gerhard Lazu
We could, yes. This was the file that was reported. So we'd need to find a smaller MP3 file, absolutely. I think we can also reduce the frequency. We don't have to run it every hour. This was obviously in preparation for this conversation...
Adam Stacoviak
What about episode 456?
Gerhard Lazu
That's coming up. That's coming up. That's the deepest rabbit hole, so I'm leaving that for last. That's coming, Adam... \[laughs\]
Adam Stacoviak
One thing I suggested though in our Zulip - and I didn't check to see if this is even a thing, but... To validate - you know, if the Fly CLI could validate the TOML file for you. Because you could have checked the TOML file for syntax errors, or just do's and don'ts, essentially... And it didn't.
Gerhard Lazu
It does have a validation subcommand. Syntactically, it's correct. The config is valid - I mean, it was applied... But it combines two things it shouldn't. So at least I would expect a warning, like "Hey, you're using both services and HTTP service."
Adam Stacoviak
That's right. Yeah - validate syntax, and then validate the expected, true TOML config. Don't combine or conflate two values, or overwrite one, or... You know, just that kind of thing. That's how I would defensively do something like that in a CLI, to protect my user from a poor config. Then they wouldn't have been holding it wrong for so long.
Gerhard Lazu
Yup. I agree. So it's the impact of that configuration, indeed. So this is something -- we can see, again, the same logs. We can see -- this one here goes to 50 megabytes per second; that's 400 megabits. When we have these peaks, it's usually when the benchmarks run, or when the checks run, because they put significant pressure on the instances - and we can see them and pick them up straight away. So that's what this is.
Alright... So remember this guy - this guy was writing on March 29th. So it's almost two years ago that this guy was saying "We will run into all sorts of issues that we end up sinking all kinds of time into." So this guy had a good hunch. This is Jerod, March 29th... \[laughter\] And we just went through a couple of examples of issues that we had to deal with as part of this. But because of this, we understand the traffic, and we understand how the application behaves, and how the backends behave, at a very deep level. So you were right, Jerod. We did sink all sorts of time into it -- how many lines? Let's see, how many lines do we have now?
Jerod Santo
I wanted 20 lines...
Gerhard Lazu
590 lines. \[laughs\] 590 lines we have in total, of Varnish config. It's more than 20 lines. By the way, we have the roadmap to 2.0... This is 1.0 that we tagged and shipped. It solved a lot of issues. But that was the easy stuff, okay? So for everyone that stuck with us, something really good is coming up.
Jerod Santo
It's starting to get harder from here.
Gerhard Lazu
\[01:14:02.07\] And Adam was already mentioning it... Episode 456. There's something special about episode 456.
Jerod Santo
Oh, yeah.
Gerhard Lazu
So what is special about it? What stands out to you, Jerod?
Jerod Santo
Oh, it's just getting rocked with downloads.
Gerhard Lazu
So episode 456, "OAuth, it's complicated"... And by the way, this was recorded in 2021, it was published, again, August 2021... For some reason, it's been downloaded a lot in recent months. It has over one million downloads. This is the most popular episode on the Changelog, ever.
Jerod Santo
The most downloaded episode...
Gerhard Lazu
It's crazy. It's crazy.
Jerod Santo
Oh, so you guys looked into this?
Gerhard Lazu
We did, yes. We dug into this.
Jerod Santo
Okay. I didn't know you guys were doing this.
Gerhard Lazu
So we just had a quick look to understand what is happening here... We have Honeycomb open - remember, every single request which comes through the Pipe Dream, through Pipely, we send to Honeycomb, so we're able to look at it... This is the last 60 days, and I have filtering done in such a way that I'm only looking at this one file. How many times has this file been downloaded in the last two months? You can see the peaks, right? And by the way, this is gigabytes, and the period is four hours. So we are peaking at about a hundred -- actually, the peak was here. We had almost 300... 300? 400? Anyway, close to 400 gigabytes in a four-hour period.
Adam Stacoviak
That's just too much.
Gerhard Lazu
I think so... I know this is a great episode, great conversation, but --
Jerod Santo
I remember that conversation. It was good.
Gerhard Lazu
Like, who is downloading this file 400 times - or actually more than 400 times - every four hours, consistently, for months on end?
Jerod Santo
Super-fan.
Gerhard Lazu
A super-fan. So we can see the different regions... Now, this is spread across the entire world. It's not just one region. This is really, really big - if this were a DDoS attack, I think it would class as one. And in the last six months -- sorry, in the last two months, 60 days - we served 30 terabytes in San Jose, California alone. In Tokyo, we served 515 terabytes. This is a big number. And if you look in this column, the distinct client IPs - we had over 10,000 IPs downloading this file. So this is not one or two IPs; this is thousands and thousands of IPs which keep downloading this file over and over and over again. So I don't know how we would block 10,000 IPs...
Jerod Santo
Right.
Gerhard Lazu
The VCL would be crazy.
Jerod Santo
Well, that episode was starring Aaron Parecki, who is a very talented person. And he is the co-founder of IndieWebCamp, and a big fan of the IndieWeb, as well as OAuth, obviously... So my hunch is Aaron's very interested in being the most downloaded episode ever, and he controls a fleet of machines from all around the world... And he points them wherever he wishes. And he thinks "You know what I'm going to do? I'm going to get the number one spot on these guys' download charts." And so I'm thinking Aaron Parecki is the man with the mask on, and we pulled the mask off. It was him this whole time. What do you think, Gerhard?
Gerhard Lazu
I think that we need to speak -- see, I don't want to say the specific language... I think we need to go to Asia. I think we need to visit a couple of cities in Asia... \[laughter\] And find the IPs which are responsible for this, because this is a crazy amount of traffic. Asia, it just so happens - if we look at it, Asia is basically the continent we are getting the most downloads from, because of this one episode. And this is actual traffic being served; these are not just HEAD requests, or GET requests. These are bytes being sent to thousands and thousands of machines in Asia, every single hour. So whoever is doing this, please stop. Please. \[laughter\]
Adam Stacoviak
It's on a cycle.
Jerod Santo
\[01:18:13.11\] So we need to knock on doors. We need to go over there and knock on some doors, and say "Excuse me, is this IP address at this home?" And then they might say yes, and say "Would you please stop? What's going on over here?" What could they possibly benefit from this? What could they be getting?
Gerhard Lazu
Maybe - maybe - we're the speed test. \[laughter\] Someone is using us to speed-test their connection. Who knows?
Jerod Santo
Yeah, maybe.
Gerhard Lazu
That's the only thing I can imagine.
Adam Stacoviak
Well, that's a lot of IP addresses.
Gerhard Lazu
It is.
Adam Stacoviak
And it's across multiple regions, which...
Gerhard Lazu
Multiple data centers, yes. So multiple regions, Fly regions are serving these IPs, yes. They're all coming from Asia, by the way... Again, I don't want to mention any names, because there's no bad guys here, right? We just want to assume that someone left the oven on...
Jerod Santo
I don't know, man...
Adam Stacoviak
It's like the blinker on when you're driving. I would say "Hey, you're not turning. It's time to turn that blinker off."
Gerhard Lazu
So the way I can see us mitigating this - and this is a hard problem because of the number of IPs which are hitting us - is we basically start blocking entire net blocks, entire network blocks... Unfortunately, some genuine listeners might be caught in this, and basically Changelog will not be available - or at least the MP3s will not be available - to a portion of users.
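In Varnish, blocking net blocks would be an ACL check in vcl_recv - a minimal sketch, using a reserved documentation IP range; note that behind Fly's proxy you'd need the real client IP (e.g. via the PROXY protocol) rather than the proxy's address:

```vcl
# Illustrative only - 203.0.113.0/24 is a documentation range, not a real offender.
acl abusive_networks {
    "203.0.113.0"/24;
}

sub vcl_recv {
    # Deny MP3 requests from the listed networks with a synthetic 403.
    if (client.ip ~ abusive_networks && req.url ~ "\.mp3") {
        return (synth(403, "Forbidden"));
    }
}
```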
The other one is - obviously, we can, and we should... This is like the next problem. We should enable some throttling, because there's more stuff happening here. So we don't have any sort of throttling. We assume fairness, we're assuming goodwill, we're assuming decency, and we're not seeing that here.
Jerod Santo
Well, that's the internet.
Gerhard Lazu
So to be honest, whoever is doing this - and it's not LLMs; I had a look. We have that problem as well, but in this case it's not LLMs. This is something completely different. So my hope is that whoever's doing this listens to this episode - maybe we put this in the intro: whoever's downloading episode 456, please stop. Because otherwise we'll need to take the next step...
Jerod Santo
I'm not sure that's --
Gerhard Lazu
I know it's a bit of a cat and a mouse game, but that's what will need to happen. Because we need to pay for this bandwidth.
Adam Stacoviak
This is only Varnish, right? This is only the cache layer where this is happening?
Gerhard Lazu
This is only the cache layer, yes. Yeah.
Adam Stacoviak
And so what mechanisms are in Varnish to do throttling, or rate limiting, or just anything like that whatsoever?
Gerhard Lazu
There are VMODs, which are basically modules that Varnish loads to give it extra functionality. One such VMOD - and I've looked at this; it is free and open source - is the throttle VMOD. Now, that means that we need to start keeping track of IPs, and it will use a bit more memory... That's okay, we have more memory. And then we need to start applying limits to how many downloads specific IPs can do. And we can limit it to MP3 files only. So if we have a bot, or, for example, an RSS aggregator or something like that, we're okay serving those requests... Because again, that's what Varnish is meant to do. The problem here is that we're serving a lot of bytes for MP3s - the same MP3, over and over - and that cannot be real traffic.
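The free, open-source module being described is most likely vsthrottle, from the varnish-modules collection. A minimal sketch, with illustrative limits - the key prefix and numbers are assumptions, not Pipely's config:

```vcl
import vsthrottle;

sub vcl_recv {
    if (req.url ~ "\.mp3") {
        # Per-client-IP budget: at most 20 MP3 requests per hour;
        # offenders are denied for 10 minutes. Tracking is in-memory,
        # which is the extra memory cost mentioned above.
        if (vsthrottle.is_denied("mp3:" + client.ip, 20, 1h, 10m)) {
            return (synth(429, "Too Many Requests"));
        }
    }
}
```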
Adam Stacoviak
Yeah. I mean, in this case you could potentially tie it just to this MP3, like you said - because this is not at all a normal MP3 scenario. Like, if you request this MP3 with this kind of request signature, X per whatever... I mean, I didn't examine the actual signature of the requests, but that's probably how I'd investigate it... Begin to isolate it. Does that require us to write a lot of defensive code against that kind of scenario?
Gerhard Lazu
I don't think so. It's just system integration; we just need to add more configuration. And back to Jerod's point - we're now chasing new problems that we didn't even think we would have... But we have what looks to me like an actor that's not very - I want to say this in a nice way... An unfriendly actor, that is not very happy, and they are very angrily downloading our MP3 over and over and over again, thousands of times, across thousands of IPs... And this is not cool. Because ultimately, we end up paying for this bandwidth. That is not helping anyone.
\[01:22:23.19\] But that's one. It's not the only one. We have one more. So you can see here, for example - this is the last seven days; we have seven terabytes that were transferred in the last seven days. Seven terabytes? Maybe it's more than that. It needs to be more... The geocode does not exist. Okay, I was expecting to see more than that. Anyway, Asia is where we can see that pattern, but we also have these spikes in Europe sometimes. And this spike, which I wanted to focus on - we know that someone that connects to Frankfurt downloaded the static favicon 170,000 times in the span of, I don't know, an hour or two. Two, three hours. So we get requests like this, that are putting stress on these instances. And that was a pass request as well, which means it went past the cache - which means they must have had a cookie set, or something like that, that was basically preventing the cache from working in this case... Which - again, that's how it's supposed to work.
So anyway - that was, unfortunately, not the best thing we could have ended on, but it's a thing, and it's food for thought. More work to be done. There are many things that we didn't get to talk about, that we didn't have time for... For example, we didn't talk about the Nightly... By the way, Nightly is now being served by the Pipe Dream as well. And the reason we had to do this is because it sometimes gets scraped - it would get hit really heavily... It's a very small app, it's NGINX, but if I open it -- so let's just click on that one. That's pull request 46. Before, it was basically topping out at 141 requests per second. Now it's at 1,300 - so it's almost 10x, an order of magnitude more. The latency went way, way down... And the only thing we had to do was put Varnish in front of it.
Jerod Santo
Nice. Well, that's nice.
Gerhard Lazu
Yeah, that's one more thing there, and you can go and have a look at how it works... There's a benchmark here, a small benchmark... And that's it. We have the last one for the road - but before we do that, anything else you want to talk about before I share one last thought?
Adam Stacoviak
I suppose - what do we do if we know these downloads are happening? We're here on the podcast, just politely asking them to stop... Do we just let it keep happening?
Gerhard Lazu
Well, we could set up some sort of throttling... I think that would be the easiest thing. Now, it will impact everyone... I don't want to start blocking, again, IP ranges, net blocks, because we don't know who's going to be caught in there. They may change to other IP blocks. That's entirely possible. We don't know how this will play out. We can't block an entire country, an entire continent - especially if it's a big one... I don't think that's reasonable. So really, throttling is, I think, the fairest thing. And we can throttle MP3s specifically. Because we do have, for example -- I see them; we have a Python client and a Go client, and every week they come and download all our MP3s. I don't know why they do that, but every seven days they request every single MP3 that we have. So they're scraping the website, and then pulling everything down. I don't know why.
Adam Stacoviak
Yeah.
Gerhard Lazu
\[01:25:49.11\] Again, the more I was looking at this -- because I was working so deep in it, I started noticing these behaviors that you would normally not see... So it's one of the advantages, I suppose, of working so close to the traffic, to all the requests, and having this level of understanding and visibility into every single request. It really helps, down to the IP level.
Adam Stacoviak
Something like that though, like the Go client and the Python client - would that be a Honeycomb thing? Where would that be --
Gerhard Lazu
Yeah. It's Honeycomb, yeah. You can filter by user agent, for example, and you can see that there'll be... For example, say -- no, I don't want to show any IPs or anything like that, so that's why I'm not going to screen-share it... But once we start digging into that, you can say "Group by client, by user agent", and you can say "Filter by MP3s" - like, "URL contains MP3." And that will group them... And you can say "Oh, and by the way, only show me where there are more than, for example, 100 downloads." And then you'll start seeing the outliers - the clients that are downloading certain MP3s, or MP3s in general, excessively.
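In terms of Honeycomb's Query API, that query looks roughly like this - the column names are guesses at whatever attributes Pipely sends, and the thresholds are illustrative:

```json
{
  "time_range": 5184000,
  "breakdowns": ["http.user_agent"],
  "calculations": [{ "op": "COUNT" }],
  "filters": [{ "column": "http.url", "op": "contains", "value": ".mp3" }],
  "havings": [{ "calculate_op": "COUNT", "op": ">", "value": 100 }],
  "orders": [{ "op": "COUNT", "order": "descending" }]
}
```

That reads as "group by user agent, only MP3 URLs, only groups with more than 100 requests, busiest first" over the last 60 days (5,184,000 seconds).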
Now, that can be spoofed. That's the other thing. We have, for example -- the request agent, like the user agent, it's empty; it's an empty string. That also happens. Because you don't have to send the header if you don't want to.
Adam Stacoviak
Yeah, you can also send whatever you want to, if you want to.
Gerhard Lazu
So that can be spoofed.
Adam Stacoviak
Yeah. It's like, whenever you build systems like this, and then even when you observe them - I guess you don't expect, but you kind of hope, that clients, a.k.a. people, behave. That they're going to use the system for the system's purpose, not to download and scrape the entire thing once a week... I mean, in that case, somebody could have their own web archiver, and they could have altruistic reasons for it. I think that's kind of silly, but you know - downloading the entire contents onto somebody's disk once per week seems like "I want your thing. And I want to keep getting your thing. And if it ever changes, I want to make sure I have that snapshot." I don't understand it. It doesn't make any sense. What would make anybody do that? What is the purpose and motivation to keep doing that - to even commit the compute, or the script, or the time to do that? What are they getting from it? I don't know.
Jerod Santo
We need to go over there and knock on some doors, man, and ask them. "Why are you doing this...?!"
Gerhard Lazu
"What's up with that?"
Adam Stacoviak
Every door...
Jerod Santo
"What's in it for you?"
Adam Stacoviak
...in Asia. "Do you listen to the Changelog?" "Yes, I do." "How many times?!"
Gerhard Lazu
Yeah.
Jerod Santo
"Tell us about 456. You know what 456 means, don't you?!" \[laughter\]
Gerhard Lazu
Yeah. So this is, I think, a really delicate and a really important point to discuss... Because this is how good systems become bad systems.
Adam Stacoviak
It's true. Yeah. You have to treat everybody as bad...
Gerhard Lazu
Exactly. We don't want to be doing this, but we are forced to do something about something which isn't good... It's not benefiting anyone, and we have to step in and do something about it. Now, we have to do it... I was expecting this to stop, but it's still happening, even to this day. We made Varnish -- I mean, now that it's stable, it's able to serve more traffic... We just had the biggest spike, because now the system is more stable, but it means that bad actors -- again, I shouldn't be using that. Unhappy people, unhappy clients...
Jerod Santo
Use it...! Who are you gonna offend? The only person you can offend is the one who's doing this, and I'm fine with it.
Gerhard Lazu
Yeah.
Jerod Santo
\[01:29:34.18\] They need to knock it off. Here's a -- this might be a cudgel... But if we're trying to solve the problem of "They're taking our bandwidth for something that's no longer relevant, or interesting, and it's been out there for years", what if we could just toggle certain episodes? ...and this might be a cat and mouse game as well. But at a certain point it's like "Well, just give them the R2 URL, and not the CDN URL, and just let Cloudflare deal with it." Just let them download directly from Cloudflare, and we're just out of the equation then. We don't care about the stats, we don't care about anything. We're just like -- you know, we've served this file plenty of times through our CDN. Now we're going to just let R2 serve it. What do you think about that idea?
Gerhard Lazu
I can see this being a very simple fix for this specific episode, because we can just serve a Location header - we just redirect, and that's it. We're done with it. So it'd be another synthetic response.
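In VCL, "a Location header as another synthetic response" is the classic synth-redirect pattern - roughly like this, where the path and bucket URL are placeholders:

```vcl
sub vcl_recv {
    # 752 is an arbitrary internal status used to trigger the redirect.
    if (req.url ~ "^/uploads/podcast/456/") {
        return (synth(752));
    }
}

sub vcl_synth {
    if (resp.status == 752) {
        set resp.status = 302;
        # Hand the client straight to object storage and step aside.
        set resp.http.Location = "https://bucket.example.r2.dev" + req.url;
        return (deliver);
    }
}
```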
Jerod Santo
The question is if they're actually malicious, then they switch to a new episode and start doing that one, right?
Gerhard Lazu
Exactly. Exactly. And we have other clients, which are -- for example, we've seen \[unintelligible 01:30:38.28\] they're basically busting the cache and then purposefully going to R2 directly, and Varnish almost acts like a proxy in this case.
Jerod Santo
Right.
Gerhard Lazu
So we have that as well. We have -- every now and then we have this random client that comes and downloads all the episodes... And that's not a problem. So I think that some sort of a throttle would make sense, which would keep the system fair to everybody. But the throttle needs to be high enough so that it doesn't impact anyone else.
Now, if our requests grow - if our audience grows, or we become more popular and we get more requests - obviously, we'd need to be aware of where the limit is, and start increasing the limits once we're throttling too much, maybe. But that seems more long-term, and it seems like a more well-engineered approach, in a way... But certainly, the simplest thing would be just "Take this one URL." I mean, that could be done in minutes - roll it out, and then we would stop this abuse for this specific MP3. That would be the easiest thing, for sure. So yeah, I can see how pragmatic that approach is, and I like the pragmatism.
Jerod Santo
Well, it's at least worth checking to see if, you know, the mouse is still alive over there.
Gerhard Lazu
Right.
Jerod Santo
You know?
Gerhard Lazu
Yup.
Jerod Santo
And if they are - well, then we'll know that this is a cat and mouse game. But if it's just somebody left the blinker on, we're just gonna turn their blinker off for them, and see if the problem goes away. And if it changes to a new MP3, then yeah, we need more generic solutions. But we may not need that at all.
Gerhard Lazu
I do have to say that the internet these days is very different from the internet even a year ago. With the rise of LLMs and AIs, I'm starting to see patterns in our traffic which are unlike any other time. We have these very big spikes, when a lot of data is being requested in very short periods of time, from -- I mean, the user agents don't make much sense... I mean, I know they're spoofed. There are many IPs being used... So it's almost like some system which wants a lot of our content is doing silly things, because some requests just don't make sense. For example, what benefit does the static favicon have? What's up with that? It just makes no sense.
Adam Stacoviak
It's a small file. Maybe it's a heartbeat, or a version of a heartbeat.
Gerhard Lazu
Maybe... But this is the first time I've seen this specific file being downloaded this many times. I haven't seen this before... Which makes me think - is this a trend we'll start seeing, more and more requests that don't make sense? And then you start having to set up some form of protection against all sorts of clients that are just doing the wrong thing.
Adam Stacoviak
Yeah, you need like a defensive layer by default.
Gerhard Lazu
Exactly. Exactly, yeah. And something that would be fair to regular clients... Like, for example, when I want to do a benchmark... I mean, sure, it's me. I wouldn't want other people to do that, but I'm testing the system, making sure the real-world production system, everywhere in the world, is working correctly... And I'm aware of what that means, and what it costs... And by the way, my IPs are removed from all the stats, because otherwise you'd see those massive benchmarks... So we account for that. But we can't account for all these weird clients.
\[01:34:05.02\] It's a challenge... I think it's a good one, but it just sets us up to -- you know, when you become older, it feels like this is more of an adult problem. So we got the thing barely working, we got it out there, we made it stable, reliable, all that... And now we're hitting -- it almost feels like a new layer of problems. And this to me is a hint as to the next phase.
Jerod Santo
Oh, to be a kid again...
Adam Stacoviak
Yeah. Well, one positive thing I think is the robustness of our observability. Being able to have this visibility is great, because otherwise we're like "Wow..." We pat ourselves on the back. "Aaron Parecki, let's get you back on the pod, because man, you are big all over the world. That's amazing. A million downloads."
Jerod Santo
Big in Asia. So what's your one last thing for the road ahead? I agree \[unintelligible 01:34:54.22\]
Gerhard Lazu
What's my one last thing? So I'll keep it short, we'll keep it fun. I mentioned the Christmas tree, and the various things I had going on over the holidays... So - Make It Work Club. That's the place. You're there. Both of you are there. So you can join whenever you want. Next Thursday... Yeah, next Thursday, I'm going to talk about the 100-gigabit WAN. The 100-gigabit WAN. So why would I need such a thing?
Adam Stacoviak
Smokin'.
Gerhard Lazu
It's smokin', for sure. So I thought -- the CCR2004 has like four CPUs, it has multiple 10-gigabit SFP+ ports... It even has two SFP28 ports. But it doesn't have a switch chip. And people that know a little bit about hardware - you want a switch chip to do hardware offloading, L3, and even L4. So after I bought the CCR2004 - it was almost like a Christmas present - I thought "Surely, this will be enough for the rest of my life." And no, I had to get the flagship. So I'll be talking about that - the LAN, the setup, quite a few things - coming up. And it just goes to show how much I enjoy the hardware side of things as well, the networking side of things... Like, I shaved two milliseconds off my WAN. It's amazing. Little things like that. It was already good - it was already sub five milliseconds - but I wanted sub three milliseconds. It is now 2.4 milliseconds. And what it means... Like, why would I do this? So first of all, I'm all about improving. Every winter I improve the network. In this specific instance, I wanted the pages just to be snappier, things to load a lot quicker, to handle a bit more traffic, but also to not have any impact... I was running that benchmark at 2.5, like -- look, I'm going to do another one right now. Let's see. I have a speed test... I have a speed test right here, speed test London... Let's go for this one. So we're recording, we're streaming, and I'm just pulling 2.5, 2.6 gigabits down. And there's no interruption on my network. So it's just my bread and butter. That's how I work.
\[01:37:20.27\] And by the way, if you see any buffering or any slowing down, let me know. I see Adam a bit more pixelated. Maybe you can see pixelated too, I don't know... But yeah, I just pulled six gigabytes, three down, three up... And it's just what I do every day.
Jerod Santo
It's just what he does.
Gerhard Lazu
I work with this stuff, and -- yeah, I enjoy it. And by the way, this is the slower gateway router. So I'm getting the proper one set up, and I'll talk about that. And there's so many things there. VLANing is quite a thing... I have a new IPv4 block, by the way... So some would say that I'm preparing for hosting something. And maybe I am, I don't know. We'll see how that works. But I just realized that my home connection - obviously, I couldn't serve all the MP3s that were being downloaded. That would really cripple my connection if that was happening... But I'm at 2.5 gigabits. The next one will be five gigabits, and the hardware can do it. And the five gigabits - I mean, that's like a decent server.
Jerod Santo
Sure.
Gerhard Lazu
And if you can do five gigabits all day, every day... Sorry. Yeah, gigabits. Gigabits per second. That's pretty decent. So I'm just waiting for more internet.
Jerod Santo
I was gonna say, you're gonna have a hundred gigabit WAN, but you're not gonna have a connection for it, right?
Gerhard Lazu
Right. So... Very few places in the world have that. So if I was in Switzerland, I would get 25 gigabits.
Jerod Santo
Now, would you move? Would you move for this?
Gerhard Lazu
Of course.
Jerod Santo
\[laughs\] "Of course..."
Gerhard Lazu
The only reason to move...
Jerod Santo
I know that sensation.
Gerhard Lazu
Yeah. It's the 25-gigabit connection. But I know that a hundred gig is coming... So we'll see. They either ship it by the time I move, or I move, and then they ship it. So it's one or the other.
Jerod Santo
Okay.
Gerhard Lazu
The important thing is, I have the router to handle that.
Jerod Santo
You'll be ready. You'll be ready.
Gerhard Lazu
I'll be ready. Exactly. So I'm a prepper... I'm prepping for that.
Jerod Santo
Prepping for good internet.
Gerhard Lazu
And interestingly -- five years ago, when I got the previous router, I did the same thing. There's a forum post; it's like a follow-up... So I just did a follow-up recently, at this milestone... I've been at this for a good number of years, optimizing my network and making sure that it's in tip-top condition.
Adam Stacoviak
Relentless... I love it. So relentless.
Jerod Santo
Good stuff, Gerhard. Well, that's a happy note to end on, right? That's a happy note to end on.
Adam Stacoviak
Observability in a hundred gigabit? That's the way to do it.
Jerod Santo
Alright. Well, the good news for Kaizen is we have a lot to work on.
Adam Stacoviak
Always. Always.
Gerhard Lazu
That's what it seems. We know how to pick them, don't we?
Adam Stacoviak
Oh, my gosh... The rabbit hole goes deep, and we keep going in.
Jerod Santo
Kaizen!
Adam Stacoviak
Bye, friends.
Gerhard Lazu
Kaizen.
