Ship It! ā€“ Episode #83

šŸŽ„ Planning for failure to ship faster šŸŽ

with Alex Sims, Solutions Architect & Senior Software Engineer at James & James


Eight months ago, in šŸŽ§ episode 49, Alex Sims (Solutions Architect & Senior Software Engineer at James & James) shared with us his ambition to help migrate a monolithic PHP app running on AWS EC2 to a more modern architecture. The idea was some serverless, some EKS, and many incremental improvements.

So how did all of this work out in practice? How did the improved system cope with the Black Friday peak, as well as all the following Christmas orders? Thank you Alex for sharing with us your Ship It! inspired Kaizen story. Itā€™s a wonderful Christmas present! šŸŽ„šŸŽ


Sponsors

Sourcegraph ā€“ Transform your code into a queryable database to create customizable visual dashboards in seconds. Sourcegraph recently launched Code Insights ā€” now you can track what really matters to you and your team in your codebase. See how other teams are using this awesome feature at about.sourcegraph.com/code-insights

Raygun ā€“ Never miss another mission-critical issue again ā€” Raygun Alerting is now available for Crash Reporting and Real User Monitoring, to make sure you are quickly notified of the errors, crashes, and front-end performance issues that matter most to you and your business. Set thresholds for your alert based on an increase in error count, a spike in load time, or new issues introduced in the latest deployment. Start your free 14-day trial at Raygun.com

Notes & Links

šŸ“ Edit Notes

Pick Service - High Level
Pick Service - Auth Flow
Gerhard & Alex

Chapters

1. 00:00 Welcome (01:04)
2. 01:04 Sponsor: Sourcegraph (01:43)
3. 02:46 Intro (05:56)
4. 08:43 Moving towards Kubernetes (04:14)
5. 12:57 Feature flagging, did it work? (04:00)
6. 16:57 Black Friday and Christmas (06:59)
7. 23:56 From problem to solution (01:37)
8. 25:40 Sponsor: Raygun (02:07)
9. 27:53 Datadog (01:44)
10. 29:37 MySQL and MariaDB (05:28)
11. 35:06 SLIs and SLOs (03:17)
12. 38:22 Day to day before Christmas (02:42)
13. 41:07 The improvement list (03:06)
14. 44:14 Deployment pipeline for legacy (03:10)
15. 47:24 It's going great (05:02)
16. 52:26 Big plans for 2023 (02:54)
17. 55:20 Where software meets the real world (01:31)
18. 56:51 More Kubernetes or less? (00:50)
19. 57:41 RUM (Datadog) (02:55)
20. 1:00:36 Key takeaway (01:39)
21. 1:02:15 Wrap up (02:00)
22. 1:04:17 Outro (00:52)

Transcript

šŸ“ Edit Transcript


Play the audio to listen along while you enjoy the transcript. šŸŽ§

Alex, welcome back to Ship It.

Yeah, itā€™s great to be back. It doesnā€™t feel that long since our last chat.

No, it wasnā€™t. It was episode #49, April(ish)ā€¦ Six months, actually; six, seven months.

So much has changedā€¦

Yes. The title is ā€œImproving an e-commerce fulfillment platform.ā€ A lot of big words thereā€¦ The important one is the improving part, right?

Indeed, yeah. So much has changed, and itā€™s really interestingā€¦ I think the last time we spoke, we fulfilled about 15 million orders, and weā€™re closely approaching 20 million. So almost another 5 million orders in six months. Itā€™s just crazy the pace weā€™re moving at this year.

Thatā€™s nice. Itā€™s so rare to have someone be able to count things so precisely as you areā€¦ And itā€™s a meaningful thing, right? Itā€™s literally shipping physical things to people around the world, right? Because youā€™re not just in the UK. Now, the main company is based in the UK; you have fulfillment centers around the world.

Yeah, we've now got four sites. We're in the UK, we've got two sites in the US, in Columbus, Ohio, and we've just opened one up in Vegas, and we've got another site in Auckland. So it's growing pretty quick, and I think this year we're opening two more sites.

This year 2022?

Sorry, yeah - we opened one site this year, which is the Vegas oneā€¦ Oh, and Venlo; thatā€™s the Netherlandsā€¦ And yeah, weā€™ve got plans for two more sites next year, I believe.

Nice. So an international shipping company that shipped 5 million orders in the last six-seven months. Very nice.

Funny story on thatā€¦ Iā€™d love to imagine it was just me, sat there, counting orders as they go out the door, but weā€™ve actually got a big LED sign thatā€™s mounted to the wall, and every time an order dispatches, it ticks up. Itā€™s a nice bit of fun. Thereā€™s one in the office and thereā€™s one mounted on the wall in the fulfillment center, and itā€™s quite interesting to see that ticking up, especially this time of year, when the numbers really start to move.

Yeah. Okay, okay. So for the listeners that havenā€™t listened to episode #49 yet, and the keyword is ā€œyetā€, right? Thatā€™s a nudge; go and check it outā€¦ What is it that you do?

Yeah, so I mischaracterized myself last time as a 4PL; weā€™re actually a 3PL. I was corrected. And essentially, we act on behalf of our clients. So imagine youā€™re somebody that sells socks, and you have a Shopify account, and you come to us and we connect to your Shopify account, we ingest your orders, and we send them out to your customers via the cheapest shipping method, whether that be like Royal Mail in the UK, or even like FedEx, going international [unintelligible 00:05:58.17] We handle all of that. And then we provide tracking information back to your customers, and give you insights on your stock management, and ā€“ yeah, thereā€™s tons of moving parts outside of just the fulfillment part. Itā€™s all about how much information can we provide you on your stock, to help you inform decisions on when you restock with us.

Okay. So that is James & James. James & James, the company - thatā€™s what the company does. How about you? What do you do in the company?

Yeah, so I've sort of transitioned through many roles over the last few years. Started this year, I was a senior engineer, and I've transitioned to a solution architect role this year. Main motivation for that is we've predominantly been a monolithic – we had a big monolith that was on a very legacy version of Symfony; Symfony 1.4, to be specific… And we want to start making tactical incisions to start breaking some of those core parts of our application now into additional services, that use slightly more up-to-date frameworks that aren't going to take us years to upgrade, say, a 1.4 version of Symfony to something modern. We've decided it's going to be easier to extract services out, and put them into new frameworks that we can upgrade as we need to, and it's sort of my job to oversee all of the technical decisions we're taking in the framework, but also how we plan upgrades, how we stitch all these new systems together, and most importantly, how we provide sort of like a cohesive experience to the end user. I think there's six services running behind the scenes. To them it's just one sort of UI that's a portal into it all.

Yeah. When you say end users, this is both your staff and your customers, right?

Exactly. We have two applications, one called CommandPort which is our sort of internal tool where we capture orders, and pick and pack them and dispatch them, and then we have the ControlPort which is what our clients use, which is their sort of portal into whatā€™s going on inside the warehouse, without all of the extra information they donā€™t really care about.

[08:08] Okay. And where do these services ā€“ I say services; I mean, where do these applications run? Because as you mentioned, thereā€™s multiple services behind them. So these two applications, where do they run?

Yeah, so they run in AWS, on some EC2 instances, but we have recently created an EKS cluster for all of our new services, and weā€™re slowly trying to think about how we can transition our old legacy application into the cluster, and start spinning down some of these old EC2 instances.

Okay. I remember in episode #49 thatā€™s what we started talking about, right? Like, the very early steps towards the Kubernetes architecture, or like Kubernetes-based architecture, to see what makes sense, what should you pick, why would you pick one thing over another thingā€¦ Thatā€™s been six months ago. How did it work in practice, that migration, that transition?

Yeah, so it worked pretty well. So one of our biggest projects over these last six months has been to rewrite Pick, which is one of our largest parts of our operation, into a new application. So what we ended up doing - we created a Remix application, which is a React framework, and thatā€™s deployed on the edge using Lambda, just so you get pretty much fast response times from wherever youā€™re requesting it fromā€¦ So that sits outside the cluster. And then we have a new Pick API, which is built using Laravel; thatā€™s deployed inside of EKS, and also a new auth service, which is deployed inside of EKS as well.

So currently, the shape of our cluster is two services running inside of EKS, and then our EC2 instances make requests into the cluster, and that lambda function also makes requests into the cluster. We have three nodes in there, operating on a blue/green deploy strategy. It was actually really interesting, we got bitten by a configuration error.

Okayā€¦

This might make you laughā€¦ To set the scene - itā€™s Friday night, the shift is just handed over to the next shift manager in the FC. Weā€™ve been Canary-releasing one or two operators for the last two weeks, doing some testing in production on the new Pick service, and everythingā€™s been going flawlessly. Weā€™re like ā€œThis is such a great deployment. Weā€™re happy. Thereā€™s been no errors. Letā€™s roll it out to 30% of everybody thatā€™s running on tonightā€™s shift.ā€

And earlier that day, I was speaking with one of our ops engineers, and I said, ā€œItā€™s really bugging me that we only have one node in our cluster. It doesnā€™t really make much sense. Could we scale it to three nodes, and then also do blue/green deploy on that?ā€ He was like, ā€œYes, sure. No worries.ā€ We added two more nodes to the cluster, we deployed the app over those three nodes. He sort of looked at the state of Kubernetes, and he was like, ā€œYeah, itā€™s great. I can see all three instances running, I can see traffic going to all of themā€¦ Yeah, no worries. Call it a day.ā€

I started getting pinged on WhatsApp, and they're saying "Everything in Pick's broken. If we refresh the page, it takes us back to the start of our Pick route. We're having to rescan all the items again… Someone's got a trolley with 100 stops on it, and they're having to go to the start…" And I'm like "What the f is going on?" And it turned out that in the environment variables that we'd set for the application, we'd set the cache driver to be file instead of Redis.

Ahhā€¦ Okay.

So as soon as someone got directed to another node, they lost all of their progress, and they were getting reset. So that taught me to not just deploy on a Friday night and be happy that the tests passed, becauseā€¦
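
For readers following along, here's a minimal sketch of the Laravel cache configuration involved in this kind of incident – not the actual James & James config, just an illustration of why the file driver breaks as soon as there is more than one node:

```php
<?php

// config/cache.php – illustrative only. With the 'file' driver, cached pick state
// lives on one node's local disk, so a request routed to a different node behind the
// load balancer sees none of it. Pointing the default store at Redis keeps that
// state shared across every node in the cluster.
return [
    'default' => env('CACHE_DRIVER', 'redis'), // the env var was effectively 'file' in the incident

    'stores' => [
        'file' => [
            'driver' => 'file',
            'path'   => storage_path('framework/cache/data'), // local to a single node
        ],
        'redis' => [
            'driver'     => 'redis',
            'connection' => 'cache', // shared by all nodes
        ],
    ],
];
```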

Oh, yes. And then I think ā€“ because youā€™ve been testing with like a single instance, right? ā€¦and everything looked good. So going from one to three seemed like ā€œSure, this is gonna work. No big deal.ā€ Itā€™s so easy to scale things in Kubernetes when you have that.

Yeahā€¦

And then things like thisā€¦ ā€œAhā€¦ Okay.ā€ That sounds like a gun to your foot. What could possibly happen? [laughs] Okay, wowā€¦

[12:10] It was really nice to have an escape hatch, though. So we deployed everything behind LaunchDarkly. So we have feature flags in there. And literally, what I did is I switched off the ā€“ scaled the rollout down to 0%, everyone fell back to the old system, and it was only the cached state that was poisoned. So their actual state of what they picked had all been committed to the database. So as soon as I scaled that down to zero, they fell back to the old system, and were able to continue, and I think we only really had like 10 minutes of downtime. So it was really nice to have that back-out method.
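
A rough sketch of the kind of flag gate described here, assuming a recent LaunchDarkly PHP server SDK (older versions build an LDUser instead of an LDContext); the flag key, context attributes and routing comments are hypothetical:

```php
<?php

use LaunchDarkly\LDClient;
use LaunchDarkly\LDContext;

$client = new LDClient(getenv('LAUNCHDARKLY_SDK_KEY'));

$operatorId = 1234; // hypothetical operator identifier
$context = LDContext::builder('operator-' . $operatorId)
    ->set('warehouse', 'uk-fc') // hypothetical attribute used for targeting
    ->build();

// The percentage rollout itself is configured in the LaunchDarkly dashboard; the code
// only asks which side of the flag this operator landed on. Dialling the rollout back
// to 0% sends everyone to the legacy flow again, without a deploy.
if ($client->variation('new-pick-service', $context, false)) {
    // route this operator to the new Remix frontend / Laravel Pick API
} else {
    // fall back to the legacy Pick flow in the monolith
}
```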

Yeahā€¦ But you say downtime - to me, that sounds like degradation, right? 30% of requests were degraded. I mean, they behaved in a way that was not expected. So did ā€“ again, Iā€™m assumingā€¦ Did the majority of users have a good experience?

No, everybody that was being targeted ā€“ sort of 30% of operators that were going to the new service, everyone had a bad experience.

Right. But the 70% of operators, they were okay.

Oh yeah, exactly.

Yeah. So the majority was okay. Okayā€¦ Well, feature flags for the win, right?

Yeah, it was really nice, because this is the first time weā€™ve deployed a new service like this, and it was the first time trying feature flags. And even though we had an incident, it was really nice to have that graceful backout, and be confident that we could still roll forward. And in the WhatsApp chat with our operations manager, we were just sending emojisā€¦ roll forward, and itā€™s like, rolling panda down a hill. He was just like ā€œYeah, no worriesā€¦ā€

[laughs] That's what you want. That's it. That's the mindset, right? That's like the mindset of trying something new. You think it's going to work, but you can never be too confident. The more confident you are, the more – I don't know, the more painful, I think, the failure… Like, if you're 100% confident it's going to work and it doesn't, what then? Versus "I think it's going to work. Let's try it. I mean, if it won't, this is the blast radius… I'm very aware of the worst possible scenario, and I'm okay with that risk", especially when it comes to production, especially when it comes to systems that cost money when they're down. So imagine if this would have happened to 100% of the stuff. I mean, you'd be basically stopped for like 10 minutes, and that is very costly.

Yeah. And itā€™s been really nice to see like the mindset of people outside of tech evolve over the past couple of years. There was a time where we would code-freeze, everything would be locked down, and nothing would happen for two months. And slowly, as weā€™ve started to be able to introduce things that mitigate risk, the mindset of those people external to us has also changed, and itā€™s just a really nice thing to see that we can keep iterating and innovating throughout those busy periods.

Once you replace fear with courage, amazing things happen. Have the courage to figure out how to apply a change like thisā€¦ Risky, because all changes are risky if you think about it, in production. The bigger it is, the hotter it runs, the more important the blast radius becomes. I donā€™t think that youā€™ll never make a mistake. You will.

No, exactly.

Sooner or later. The odds are in your favor, but every now and then, things go wrong. Cool. Okay.

I mean, I was very confident with this until I realized Iā€™d broken all of the reporting on that service that I shared in the last episode; it just completely fell on its face.

Really?

[15:46] Because I found in the old system it did two saves, and we use change data capture to basically analyze the changes on the record as they happen in real time with Kafka. And it ultimately did two saves. It did one to change the status of a trolley from a picking state to an end shift state, and one change to divorce the relationship with the operator from that trolley. And in the application that consumes it, it checks for the presence of the operator ID that needs to be on the trolley, and the status needs to change in that row. If that case wasnā€™t satisfied, it would skip it, and that trolley would never be released, which means the report would never be generated.

And what ended up happening is I saw that old code and went ā€œWhy would I want to do two saves back to back, when I can just bundle it all up into one and be like micro-efficient?ā€

[laughs] Of course.

ā€œOh, okay. Yeah, Iā€™m just gonna take down like a weekā€™s worth of reporting.ā€ Yeah, that wasnā€™t fun.

All great ideas.

We could live without it, though. Itā€™s all edge stuff, and ā€“ yeah, we can live without it. Itā€™s fixed now, butā€¦ Yeah, finding those things and going ā€œOh, my god, I canā€™t believe thatā€™s a thingā€¦ā€
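
Purely for illustration, here's a rough reconstruction of the consumer-side check described above – the field and status names are hypothetical. The change-data-capture events arrive via Kafka as before/after images of the trolley row, and the report is only released when a single change both moves the status and still has the operator attached:

```php
<?php

function shouldReleaseReport(array $before, array $after): bool
{
    $statusChanged   = $before['status'] !== $after['status']
                       && $after['status'] === 'end_shift';
    $operatorPresent = !empty($after['operator_id']);

    // Old code: save #1 changed the status while the operator was still attached
    // (passes this check); save #2 then detached the operator.
    // "Optimised" code: one save changed the status and removed the operator in the
    // same row, so operator_id was already null, the event was skipped, and the
    // report was never generated.
    return $statusChanged && $operatorPresent;
}
```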

Okayā€¦ Thatā€™s a good one. So you had two, possibly the biggest events ā€“ no, I think theyā€™re probably the biggest events. I mean, I donā€™t work in the physical shipping world, but I imagine that Black Friday and Christmas are the busiest periods for the shipping industry as a whole. I think itā€™s like the run-up, right? Because the things have to be there by Black Friday, and things have to be there by Christmas. How did those two major events work out for you with all these changes to the new system that started six months ago?

So to give an idea of what our normal sort of daily volume is, and maybe set the scene a bit - we're normally about 12,000 orders a day, I think, and on the ramp up to Black Friday, from about the 20th of November, we were up to, like, 20,000 a day. And on Black Friday I think 31,000 was our biggest day of orders. And to also set the picture a little bit better, in the last six months I said we've done about 5 million orders; in the last 15 days, we've done about 400,000 orders across all of our sites.

Thatā€™s a lot.

So yeah, volume really ramps up. And we were really, really confident this year, going in from like a system architecture perspective. We'd had a few days where we had some spiky volume and nothing seemed to let up, but it seemed to all – not start going wrong, because we never really had a huge amount of downtime… But a lot of our alarms in Datadog were going off, and Slack was getting really bombarded, and we had a few pages that were 503ing, because they were just timing out… We were suddenly like "What's going on? Why is the system all of a sudden going really slow?" And we'd released this change recently called "label at pack." And essentially, what it did is as you're packing an order, previously, you'd have to like pack all the items, and then once you've packed all the items, you weighed the order, and then once you've weighed the order, you wait for a label to get printed… But it was really slow, because that weighing step you don't need; you already know what's going in the box, you know what box you're choosing, so you don't need that weigh step. And it means as soon as you start packing that order, we can in the background go off and make a request to all of our carriers, quote for a label, and print it.

So at the time that you finished packing all the stuff in the box, youā€™ve got a label ready to go. But what we didnā€™t realize is that AJAX request wasnā€™t getting fired just once; it was getting fired multiple times. And that would lead to requests taking upwards of like sometimes 30 or 40 seconds to print a labelā€¦ If you have tens of these requests going off, and weā€™ve got 80 packing desks, thatā€™s a lot of requests that the systemā€™s making, and it really started to slow down other areas of the system. So we ended up putting some SLOs in, which would basically tell us if a request takes longer than eight seconds to fire, weā€™ll burn some of the error budget. And we said ā€œOh, we want 96% of all of our labels to be printed within eight seconds.ā€ And I think within an hour, we burned all of our budget, and we were like, ā€œWhatā€™s going on? How is this happening?ā€ And it was only when we realized that the AJAX request was getting fired multiple times that we changed it. And as soon as that fix went out, the graph was like up here, and it just took a nosedive down, everything was sort of printing within eight or nine seconds, and the system seemed to be a lot more stable.
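
The fix described here was stopping the AJAX call from firing more than once on the client; as a complementary, purely illustrative server-side guard, this sketch uses a Laravel atomic cache lock so repeated "quote and print a label" requests for the same order short-circuit instead of hitting every carrier again. The function and callable names are hypothetical:

```php
<?php

use Illuminate\Support\Facades\Cache;

function printLabelOnce(string $orderId, callable $quoteCheapestLabel): array
{
    // Atomic lock keyed on the order, auto-released after 60 seconds if we crash.
    $lock = Cache::lock("label-request:{$orderId}", 60);

    if (! $lock->get()) {
        // A label request for this order is already in flight – don't quote all the
        // carriers a second (or tenth) time.
        return ['status' => 'already-requested'];
    }

    try {
        return ['status' => 'printed', 'label' => $quoteCheapestLabel($orderId)];
    } finally {
        $lock->release();
    }
}
```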

[20:24] There's also a few pages that are used for reporting, they're like our internal KPIs to see how many units and orders we've picked, at operator level, by day, week, month… And they're used a lot by shift managers in the FC. And historically, they're a bit slow… But in peak, when we're doing a lot more queries than normal, they were going really slow. I think – what was happening? I'm not sure how much technical detail you want to go into…

Go for it.

Yeah, we use ORM in our legacy application, and we greedy-fetch a lot of stuff.

Okayā€¦

We definitely over-fetchā€¦

From the database, right?

From the database.

Youā€™re getting a lot of records, a lot of rows; any scanning, anythingā€¦

Yeah, just tons of rows, and we've got a reasonably-sized buffer pool. So all those queries run in memory. But what happens is when the memory in the buffer pool is used up, those queries will start running on disk. And once they start running on disk, it significantly degrades performance.

Yeah. Let me guess - spinning disks? HDDs?

So I thought weā€™d upgraded to SSDs on our RDS instance, but I need to go back and clarify that.

That will make a big difference. And then thereā€™s another step up; so you go from HDDs to SSDs, and then you go from SSDs to NVMEs.

Yeah, I think that's where we need to go. I think we're at SSD, but it's still on those – like, scanning millions of rows queries, and over-fetching like 100 columns or more at a time, maybe 200 columns, the amount of joins that those queries are doing… Yeah, they're going straight into the table. But yeah, they were essentially taking the system offline because they would just run for like 10-15 minutes, eat a connection up for that entire time, and then you've got someone out there hitting Refresh, so you've got 30 or 40 of these queries being run, and no one else can make requests to the database, and it chokes. So we ended up finding that if we changed, or forced different indexes to be used in some of those queries, and reduced the breadth of the columns, they are able to still run, within tens of seconds; so it's still not ideal, but it was enough to not choke the system out.
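
A hedged sketch of the shape of that change: swap an ORM "fetch everything" report query for a hand-written one that names only the columns the report needs and forces the index the optimiser was ignoring. The table, column and index names are invented for illustration, not the real schema:

```php
<?php

$pdo = new PDO('mysql:host=127.0.0.1;dbname=reports', 'reporting_ro', getenv('DB_PASSWORD'));

$sql = <<<SQL
SELECT o.id, o.dispatched_at, o.site_id, p.operator_id
FROM orders o FORCE INDEX (idx_orders_dispatched_at)
JOIN picks p ON p.order_id = o.id
WHERE o.dispatched_at >= :from
  AND o.dispatched_at <  :to
SQL;

// Narrow columns plus a forced range index keeps the query in memory instead of
// spilling a 200-column join out to disk.
$stmt = $pdo->prepare($sql);
$stmt->execute([':from' => '2022-11-20 00:00:00', ':to' => '2022-11-27 00:00:00']);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
```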

And luckily, these things all started happening just ahead of Black Friday, so then we were in a much better position by the time Black Friday came along. We also found that we accidentally, three years ago, used the Redis KEYS command to do some look-ups from Redis, and didn't realize the documentation says "Use this with extreme care in production", because it does a scan over the entire keyspace.

Okayā€¦

Yeah. And when you've got 50 million keys in there, it locks Redis for a while, and then things also don't work. So we swapped that with SCAN, and that alleviated a ton of stress on Redis. So yeah, there's some really pivotal changes that we made this year. They weren't massive in terms of like from a commit perspective, but they made a huge difference on the performance of our system.
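
Here's a minimal sketch of that KEYS-to-SCAN swap, using the phpredis extension. KEYS walks the whole keyspace in one blocking call, which is what locks Redis up with ~50 million keys in there; SCAN does the same walk in small, cursor-driven steps. The key pattern is hypothetical:

```php
<?php

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->setOption(Redis::OPT_SCAN, Redis::SCAN_RETRY); // keep iterating past empty chunks

// Before (blocks the server while it inspects every key):
// $keys = $redis->keys('order:*:status');

// After: iterate with SCAN, a chunk at a time.
$iterator = null;
$keys = [];
while ($chunk = $redis->scan($iterator, 'order:*:status', 1000)) {
    foreach ($chunk as $key) {
        $keys[] = $key;
    }
}
```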

Thatā€™s it. I mean, thatā€™s the key, right? It doesnā€™t matter how many lines of code you write; people that still think in lines of code, and like ā€œHow big is this change?ā€ You actually want the really small, tiny decisions that donā€™t change a lot at the surface, but have a huge impact. Some call them low-hanging fruit. I think thatā€™s almost like it doesnā€™t do them justice. I think like the big, fat, juicy fruits, which are down - those are the ones you ought to pick, because they make a huge difference to everything. Go for those.

Iā€™m wondering, how did you figure out that it was the database, it was like this buffer pool, and it was the disks? What did it look like from ā€œWe have a problemā€ to ā€œWe have a solution. The solution worksā€? What did that journey look like for you?

[24:12] Yeah, so Iā€™m not sure how much of this was sort of attributed to luckā€¦ But we sort of dived straight into the database.

Thereā€™s no coincidence. Thereā€™s no coincidence, Iā€™m convinced of that. Everything happens for a reason. [laughs]

Thereā€™s no correlation.

You just donā€™t know it yet. [laughs]

But yeah, we just connected to the database, ran SHOW PROCESSLIST, and saw that the queries had been running for a long time… It's like "Hm… We should probably start killing off all these ones that have been sat there for like 1000 seconds. They don't look healthy…" [laughs]

Okay. So before we killed them, we sort of copied the contents of that query, pasted it back in, and put an EXPLAIN before it, and just sort of had a look at the execution plan… And then saw how many rows it was considering, saw the breadth of the columns that are being used by that query, and then when we tried to run it again, it gives you sort of status updates of what the query is doing. And when it's just like copying to temp table for about over two minutes, you're like "That's probably running on disk and not in memory." So there's a bit of an educated assumption there of – we weren't 100% confident that's what was happening, but based on what the database was telling us it's doing, we were probably assuming that's what was happening. Now, none of us are DBAs, I just want to sort of clear that up… But that was our best educated guess, correlated with what we could find online.
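
For reference, a hedged PDO sketch of those triage steps: list the queries that have been running for a long time, then EXPLAIN one of them before deciding whether to kill it. The 300-second threshold and connection details are arbitrary examples:

```php
<?php

$pdo = new PDO('mysql:host=127.0.0.1;dbname=information_schema', 'reporting_ro', getenv('DB_PASSWORD'));

// Roughly "SHOW PROCESSLIST", filtered to the worrying ones:
$longRunning = $pdo->query(
    "SELECT ID AS id, TIME AS seconds, STATE AS state, LEFT(INFO, 120) AS query_head
       FROM information_schema.PROCESSLIST
      WHERE COMMAND = 'Query' AND TIME > 300
      ORDER BY TIME DESC"
)->fetchAll(PDO::FETCH_ASSOC);

foreach ($longRunning as $proc) {
    echo "#{$proc['id']} {$proc['seconds']}s ({$proc['state']}): {$proc['query_head']}\n";
    // Paste the full query back in with EXPLAIN to see rows examined and indexes used;
    // a long stretch of "Copying to tmp table" is the hint it has spilled out of the
    // buffer pool onto disk. If it is clearly hopeless:  KILL <id>;
}
```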

Is there something that you think you could have in place, or are thinking of putting in place to make this sort of troubleshooting easier? ā€¦to make, first of all figuring out there is a problem and the problem is most likely the database?

So we already have some of that. We use APM in Datadog, and it automatically breaks out like queries as their own spans on a trace, so you can see when you've got a slow-running query. And we do have some alarms that go off if queries exceed a certain breakpoint. But there are certain pages and certain queries that we understand are slow, and we kick those into like a "Known slow pages" dashboard, that we don't tend to look at as much, and that we don't want bombarding Slack, because we don't want to be getting all these alarms going off for things we know are historically slow.

There's a few of us on the team - shout-out to George; he's a bit of a wizard on Datadog at the moment, and really gets stuck in there building these dashboards. And those are the dashboards that we tend to lean towards first; you can sort of correlate slow queries when disk usage goes up on the database, and those dashboards are really helpful… But normally, when we're in the thick of it, the first thing I run to isn't Datadog, and I don't know why, because it paints a really clear picture of what's going on.

I tend to – I think it's muscle memory, and… Over the past five years, when we didn't have Datadog, I would run straight to the database first and start doing SHOW PROCESSLIST, and what's in there, and why is that slow… And then I'd forget to go check our monitoring tool. So I think for me there's a bit more of a learning curve of how do I retrain myself to approach a problem looking at our tooling first, rather than at the database.

Okay. So Datadog has the APM stuff. From the application perspective, what other integrations do you use to get a better understanding of the different layers in the system? Obviously, there's the application, there's the database server itself, then there's the - MySQL or PostgreSQL?

We use MariaDB.

MariaDB, okay.

So itā€™s a variant of MySQL.

In my head - MySQL. Legacy - MySQL. [laughs] It's like a fork… Like, "Which one is it?" The MySQL fork. So I don't know, does Datadog have some integration for MySQL/MariaDB, so that you can look inside of what's happening in the database?

I believe it does. And I think we actually integrated it. I just have never looked at it.

Oh, right. Youā€™re just like not opening the right tab, I seeā€¦ [laughs]

Yeah, because if I look at integrations, we've got like 15 things enabled. We've got one for EKS… Oh, we do have one for RDS, so we should be able to see… We have it for Kafka as well, so we can see any lag on topics, and when consumers stop responding… So those sorts of things alert us when our edge services are down. Yeah, I think we monitor a lot, but we haven't yet fully embraced the culture of "Let's get everyone to learn what's available to them", and that's something that I hope we sort of shift more towards in '23.

That sounds like a great improvement, because each of you having almost like a source of truthā€¦ Like, when something is wrong, where do I go first? Great. And then when Iā€™m here, what happens next? So having almost like a ā€“ I want to call it like play-by-play, but itā€™s a bit more than that. Itā€™s a bit of ā€œWhat are the important thingsā€“ā€, like the forks, if you wish, in the road, where I know itā€™s the app, or itā€™s the instance, like the frontend instances if you have such a thing, or itā€™s the database. And then even though we have services ā€“ I think services make things a little bit more interesting, because then you have to look at services themselves, rather than the applicationsā€¦ And then I know thereā€™s toolsā€¦ Like, service meshes come to mind; if anything, thatā€™s the one thing that service meshes should help with, is understand the interaction between services, when they degrade, automatic SLIs, SLOs, all that stuff.

So thatā€™s something that at least one person would care about full-time, and spent full-time, and like they know it outside in, or inside out, however you wanna look at it; it doesnā€™t really matterā€¦ But they understand, and they share it with everyone, so that people know where to go, and they go ā€œThatā€™s the entry point. Follow this. If it doesnā€™t work, let us know how we can improve itā€, so on and so forth. But that sounds ā€“ itā€™s like that shared knowledge, which is so important.

[32:17] Itā€™s a bit of an interesting place, because we have a wiki on our GitHub, and in that wiki there are some play-by-plays of common issues that occur. I think weā€™ve got playbooks for like six or seven of them, and when the alarm goes off in Datadog, thereā€™s a reference to that wiki document.

So for those six or seven things, anybody can respond to that alarm and confidently solve the issue. But we haven't continued to do that, because there aren't that many common issues that frequently occur, that we've actually then gone and applied a permanent fix for. We've got a few of these alarms that have been going off for years, and it's just like, "Hey, when this happens, go and do these steps", and you can resolve it. And as a solutions architect, one of my things that I really want to tackle next year is providing more documentation over the entire platform, to sort of give people a resource of "Something's happened in production. How do I start tracing the root cause of that, and then verifying that what I've done has fixed it for any service that sort of talks about that?" But yeah, we're not there yet. Hopefully, in our next call we touch on that documentation.

Yeah, of course. The only thing that matters is that you keep improving. I mean, to be honest, everything else, any incidents that come your way, any issues - opportunities to learn. Thatā€™s it. Have you improved, having had that opportunity to learn? And if you have, thatā€™s great. Thereā€™ll be many others; they just keep coming at you. All you have to do is just be ready for them. Thatā€™s it. And have an open mind.

And Iā€™m wonderingā€¦ So I know that the play-by-plays and playbooks are only so useful, because almost every new issue is like a new one. Right? You havenā€™t seen that before. Would it help if youā€™re able to isolate which part of the system is the problem? The database versus the CDN (if you have such a thing), network, firewall, things like that?

Yeah, it would be really useful. And one thing we're trying to do to help us catalog all of these is, anytime we have an incident – we've not gone for proper incidents [unintelligible 00:34:19.25] We were looking at incident.io. We haven't sprung for it yet. We just have an incidents channel inside of Slack, and we essentially start a topic there, and we record all of the steps that happened throughout that incident inside of that log. So if we ever need to go back and revisit it, we can see exactly what caused the issue, and also what services or pieces of infrastructure were affected… Because Slack search is pretty nice. You can start jumping into that incidents channel, something's gone wrong, you do a search and you can normally find something that might point you in the right direction of where you need to steer your investigation. We know it's not the most perfect solution, but it's worked so far.

If it works, it works. If it works, thatā€™s it. You mentioned SLI and SLOs, and how they helped you understand better what is happening. I mean, first of all, signaling thereā€™s a problem with something that affects users, and then being able to dig into it, and troubleshoot, and fix it. Are SLIs and SLOs a new thing that you started using?

Yeah, weā€™re really sort of dipping our toes in the water and starting to implement them across our services. I think we currently have just two SLOs.

Itā€™s better than zeroā€¦

Exactly. We haven't yet decided on SLIs. We've got a chat next week with George, and we're going to sit down and think what components make up this SLO that can sort of give us an indication before it starts triggering that we've burned too much of our budget. So we've both got like a shared interest in SRE, and we're trying to translate that into James & James… But yeah, that's still very much amateur, and just experimenting as we go, but it's nice to see at the peak this year that the SLO that we did create gave us some real value back… Whereas previously, we would have just let it silently fail in the background, and be none the wiser.

[36:14] Yeah, that's amazing. It is just like another tool in your toolbox, I suppose… I don't think you want too many. They're not supposed to be used like alarms. Right? Especially when, you know, you're like thousands and thousands of engineers… By the way, how many are you now in the engineering department?

I think weā€™re eight permanent and four contract, I believe.

Okay, so 12 people in total. Again, thatā€™s not a big team, and it means that everyone gets to experience pretty much everything that happens in some shape or form. I think youā€™re slightly bigger than a two-pizza team, I thinkā€¦ Unless the pizzas are really, really large. [laughter] So youā€™re not like ā€“ sure, it can be one team, and I can imagine that like retros, if you have them, or stand-ups, or things like that are getting a bit more complicated with 12 people. Still manageable, but 20? Forget about it. Itā€™s just like too much.

Yeah, it was getting a bit toughā€¦ And what we do now is we have a single stand-up once a week, an hour long. Everyone gets in, and sort of unites their teams, and what weā€™ve been doingā€¦ And then we have like breakout teams. So weā€™ve got four sub-teamsā€¦

That makes sense.

And yeah, we have our dailies with them, and that seems way more manageable.

That makes sense. Yeah, exactly. But still, youā€™re small enough, again, to have a good understanding of most of the system, right? I mean, once you get to like 20, 30, 40, it just becomes a lot more difficult, because the system grows, more services, different approaches, and maybe you donā€™t want consensus, because thatā€™s very expensive, right? The more you get, the more expensive that gets; it just doesnā€™t scale very well.

What Iā€™m thinking is SLIs and SLOs are a great tool. A few of them that you can all agree on, all understand, and at least focus on that. Focus on delivering good SLOsā€¦ No; actually, good SLIs, right? SLIs that match, that everyone can agree on, everyone understands, and itā€™s a bit of clarity in what is a chaotic ā€“ because it is, right? When you have two, three incidents happening at the same timeā€¦ It does happen.

Okay. Okay. So these past few weeks have been really interesting for you, because itā€™s been the run-up to Christmas. More orders, as you mentioned, the system was getting very busyā€¦ What was the day to day like for you? Because I think you were mentioning at some point that you were with the staff, on the picking floor, using the system that you have improved over those months. What was that like?

Yeah, it was really interesting. This year I really wanted to just use the Pick part of the system. So last year I did a bunch of packing of orders, and that was fine. But after spending sort of like four months rewriting Pick, I really wanted to just take a trolley out and just go pick a ton of orders and experience it for myself. So yeah I did three, four days down there, picked like a thousand ordersā€¦

Wow… Okay. Lots, lots of socks; too many socks. [laughter]

I don't want to see another pair of socks for a while. But yeah, it was really nice to sort of get down there and involved with everybody, and sort of going around and talking to operators, and then sort of saying parts of the system they liked, but also parts they didn't like, and parts they felt slowed them down, versus what the old one did… And it got some really, really useful feedback on what we could then put into the system going into 2023. And we try and do – we have like two or three [unintelligible 00:39:41.06] days a year where we will all go down into the FC and we'll do some picking and packing, or looking in, just so we can get a feel for what's going on down there, and how well the systems are behaving.

[39:54] But at the peak, when itā€™s our most busy time of the year, itā€™s sort of like, everybody, all hands-on deck, weā€™ll get down there, all muddle in, DJ plays some music in the warehouse, and weā€™ve got doughnuts and stuff going around, soā€¦ Itā€™s a nice time of the year; everybody sort of gets together and muddles in and makes sure that we get all the orders out in time. I did some statistics earlier, and out of the 300,000 orders that left our UK warehouse, we processed them all within a day.

Wowā€¦

So it gives you an idea of how quickly those orders need to come in and get out once we receive them.

Thatā€™s a lot of like 300k a day ā€“ this is likeā€¦ How many hours do you work?

It's a 24/7 operation.

24/7. Okay. So that is 12,500 per hourā€¦ That is three and a half orders per second.

Thatā€™s crazy, isnā€™t it?

Every second, 3.5 orders gets ready. Can you imagine that? And thatā€™s like 24/7. Thatā€™s crazy. Wowā€¦

And we're still quite small in the e-commerce space. It's gonna be interesting to see where we are this time next year.

Six months ago, you were thinking of starting to use Kubernetes. You have some services now running; you even got to experience what the end users seeā€¦ What are you thinking in terms of improvements? What is on your list?

Oh, that's a really hard one… I want to get more tests of our legacy system to run. So we had another incident where we'd essentially deployed a change, and it took production down for like six or seven minutes for our internal stuff… And it would have been caught by a smoke test. Like, outright, the system just wouldn't have booted. And we've now put a deployment pipeline in place which will run those smoke tests and ensure the application boots, and it will just run through a couple of common pages… And that was a result of that incident.

But what we really want to do is gain more confidence that when we deploy anything into production for that existing system, weā€™re not going to degrade performance, or take down like certain core parts of the application. What we want to probably do is come up with a reasonable time to deploy. Maybe the test harness that runs canā€™t take more than ten minutes to deploy to productionā€¦ Because we still want to keep that agility that weā€™ve got.

One of the real benefits that we've got working here is that deployment time to production is under, sort of, two or three minutes. And if we have a bug, we can revert really quickly, or we can iterate on it quickly, and push out. So having a deployment pipeline that sits in the way and takes over 10 minutes to run - that's really going to affect your agility. So yeah, next year I really want the team to work on hardening our deployment pipeline, just so we can keep gaining confidence in what we're releasing… Especially as we plan to scale our team out, we're going to have many more commits going through on a daily basis.
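
A hedged sketch of the kind of smoke test being described: after a deploy candidate comes up, hit a handful of core pages and fail the pipeline if the app will not even boot and render them. It's written as a plain PHPUnit test against a base URL supplied by the pipeline; the routes listed are hypothetical:

```php
<?php

use PHPUnit\Framework\TestCase;

final class SmokeTest extends TestCase
{
    public function testCorePagesRespond(): void
    {
        $baseUrl = rtrim(getenv('SMOKE_BASE_URL') ?: 'http://localhost:8080', '/');

        foreach (['/login', '/orders', '/pick/trolleys'] as $path) {
            $context = stream_context_create(['http' => ['timeout' => 10, 'ignore_errors' => true]]);
            $body    = file_get_contents($baseUrl . $path, false, $context);
            $status  = (int) substr($http_response_header[0] ?? '', 9, 3); // e.g. "HTTP/1.1 200 OK" -> 200

            $this->assertNotFalse($body, "No response from {$path}");
            $this->assertLessThan(500, $status, "{$path} returned HTTP {$status}");
        }
    }
}
```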

Now, when you say deploying, Iā€™m wondering - do you use blue/green for your legacy app?

No. Not yet.

Because if you had two versions running at any given point in time… So the old one, the legacy one, and just basically change traffic, the way it's spread, then rollbacks would be nearly instant. I mean, the connections, obviously, they would have to maybe reconnect depending on how the app works, whether they're persistent, whether they retry… And everything goes back as it was. And if it's a new one, if it doesn't boot, so if it can't boot in your incident's case, then it never gets promoted to live, because it never came up, and it's not healthy.

Yeah, that would be really nice if we could get that in place. I think our deployment pipeline for legacy at the moment is just - they push these new changes to these twelve nodes, and do it all in one go. And then flush the cache on the last node that you deploy to. Itā€™s very basic. Whereas the newer services do have like all the bells and whistles of blue/green, and integration, unit tests that run against it to give us that confidence.
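
For illustration, a hedged sketch of the health endpoint a blue/green (or rolling) strategy leans on: the new colour only starts receiving traffic once this returns 200, so a build that cannot boot or cannot reach its backing services is never promoted. It's shown as a Laravel route, in the style of the newer services; the path is hypothetical:

```php
<?php

use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Route;

Route::get('/healthz', function () {
    try {
        DB::select('SELECT 1');          // can we reach MariaDB?
        Cache::store('redis')->get('_'); // can we reach Redis?
    } catch (\Throwable $e) {
        return response()->json(['status' => 'unhealthy'], 503);
    }

    return response()->json(['status' => 'ok']);
});
```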

[44:13] Would migrating the legacy app to Kubernetes be an option?

We're thinking about it. There's only one issue that I've run into so far… So I've Dockerized the application, it runs locally, but there's one annoying thing where it can't request assets. And this is probably some gap in my knowledge of Docker – it runs all in its Docker network, and then when it tries to go out to fetch assets, it's referencing the Docker container name, where it should actually be referencing something else, which would be like outside of that Docker network… And that causes assets not to load. So once I fix that, we'll be able to move into production. But that's a pretty big deal-breaker at the moment.

Yeah, of course. When you say assets, do you mean static assets, like JavaScript, CSS images, things like that?

Yeah, like our PDFs, and those sorts of things.

Okay. Okay. So like the static filesā€¦ Okay. Okay. Interesting. I remember ā€“ I mean, that took us a while, because the static assetsā€¦ I mean, in our case, in the Changelog app, before it went on to Kubernetes, it had volume requirements, a persistent volume requirement. And the thing which enabled us to consider, just consider scaling to more than one was decoupling the static assets from the volume from the app. If the app needs to mount a volume, it just makes things very, very difficult. So moving those to S3 made a huge, huge difference. In your case, Iā€™m assuming itā€™s another service that has to be running; itā€™s trying to access another service that has the assets.

Yeah, yeah. Because weā€™ve got a bunch of stuff in S3 and requesting that, itā€™s fine. But itā€™s any time it needs to request something thatā€™s on that host, and then itā€™s using the Docker container name rather than the host name. And the whole reason is just because of the way that legacy application is written; itā€™s a configuration variable that says, ā€œWhatā€™s the name of my service that I need to reach out to?ā€ But when youā€™re accessing it externally into the container, you can resolve it with the container name; but when the container tries to resolve it internally to itself, it then falls over and doesnā€™t work.

Oh, I see what you mean. Okay. Okay. And you canā€™t make it like localhost, or something like that.

Exactly. On my local machine, itā€™s like manager.controlport.local. But then internally, Docker would see that as DefaultPHP, which is the name of the container. But itā€™s trying to go for the manager.controlport.local, which doesnā€™t exist on that network. So then it just goes ā€œI donā€™t know what youā€™re talking aboutā€, and thatā€™s the end of it.
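
One common workaround for this "container can't resolve its own public hostname" problem is to keep two base URLs in config: the one the browser uses, and the one the app uses when it calls itself from inside the Docker network (the service/container name, or host.docker.internal in local dev). This is only a sketch; the option and host names are invented for illustration, not the real ControlPort configuration:

```php
<?php

// config/urls.php – illustrative only.
return [
    // What the operator's browser sees:
    'public_base_url'   => getenv('APP_URL') ?: 'https://manager.controlport.local',

    // What the app should use for requests it makes to itself (assets, PDFs, internal
    // endpoints) from inside the container network:
    'internal_base_url' => getenv('APP_INTERNAL_URL') ?: 'http://default-php',
];
```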

Well, as itā€™s becoming obvious, I am like ā€“ how shall I say? How should I say this? Iā€™m like a magpie. Itā€™s a shiny thing, I have to understand ā€œWhatā€™s the problem?ā€ ā€œThe problem?ā€ Like, ā€œOh, I love this. Like, tell me more about it.ā€ Iā€™m basically sucked into troubleshooting your problem live as weā€™re recording thisā€¦ [laughter] Okay, I think weā€™ll put a pin in it for now, and change the subject. This is really fascinatingā€¦ But letā€™s go to a different place. Okay. What are the things that went well for you, and for your team, in the last six months, as youā€™ve been improving various parts of your system?

Yeah. So I think the biggest thing thatā€™s been really our biggest success in this year is that whole rewrite of the Pick application. The fact we went from no services ā€“ I just sort of want to be clear as well, when I talk about services, how weā€™re planning to structure the application, weā€™re not going like true microservice, like hundreds of services under each domain part of the system. What weā€™re really striving to do is say - we have this specific part of domain knowledge in our system; say like Pick, for example. We also have Pack, and maybe GoodsIn. And we want to split those like three core services out into their own applications, and as we scale the team, weā€™ve then got the ability to say, ā€œTeam X looks after Pick. Team Y looks after Pack.ā€ And theyā€™re discrete and standalone, so we could just manage them as their own separate applications.

Is there a Poc? I had to ask that… There's Pick, there's Pack… There has to be a Poc. [laughter] Those are such great names.

Yeah, no Pocā€¦

Okay, thereā€™s lots and lots of POCs. Right? Lots of proof of concepts.

Yeah… We had a POC six months ago, and it's now actual real production. It's now Pick. It evolved from a POC to a Pick.

Yeah. It was really fascinating to sort of go from – we've never put a microservice out into production, and we've now somehow got this cluster that's running two microservices… And the user experience from the operator's perspective - they either go to the old legacy application that has its frontend, or the new Remix application. And regardless of which one you go to, it feels like the same user experience. And to build that in six months, and have a cohesive end-to-end experience… Yeah, it's something that we're really, really proud of, for delivering that in such a short period of time.

And also to not have that many catastrophic failures on something so big. It is really nerve-wracking, being responsible for carving out something thatā€™s used every single day, building a new variant of it that performs significantly better, but also introduces some new ideas to actually gain operational efficiency. And then to see it like out in the wild, and people are using it, and the operation is still running, nothingā€™s fallen on its face, apart from when we didnā€™t set the cache driver to be Redisā€¦ But apart from that, it felt seamless. And sort of re-educating the team as well to start thinking about feature flags, and the benefits of Canary releases, and how that gives external stakeholders confidenceā€¦ Yeah, thereā€™s a lot of new tooling that came in, and Iā€™m really happy with how the team started to adopt it.

Yeah. Not to mention SLIs and SLOs that the business cares about, and the users care aboutā€¦ And you can say, ā€œHey, look at this. Weā€™re good. The system is too stable; we have to break something, dang it.ā€ [laughs]

[laughs] Yeah. I think the next stage is to put a status page up so that our stakeholders and clients can sort of see uptime of the service, and sort of gain an understanding of what's going on behind the scenes. But we'll only really be able to do that once we've got a list of SLIs and SLOs in place that will drive those.

Only if it's real-time. The most annoying thing is when you know GitHub is down, but GitHub doesn't know it's down. It's like, "Dang it… I can guarantee you that GitHub is down." Five minutes later, status page, of course it's down. So that's the most annoying thing about status pages, is when they're not real-time. I know there will be a little bit of a delay, like seconds… Even 30 seconds is okay. But I think if it's SLI and SLO-driven, that's a lot better, because you start seeing that degradation, as it happens, with some delay; 15-30 seconds, that's acceptable. Any more than a minute and it's masking too many things.

Yeah. So Iā€™m completely new to all this stuff. I thought the status page was driven by those SLIs and SLOs. Is that not something that ā€“ thatā€™d be really cool.

It depends whichā€¦ I mean, thereā€™s obviously various services that do this; you pay for them, and itā€™s like a service which is provided; sometimes it can be a dashboard, a status page. I mean, like a read-only thing. They are somewhat betterā€¦ Itā€™s just like, deciding what to put on it, you know? And then when you have an incident, how do you summarize that? How do you capture that? How do you communicate to people that maybe donā€™t need to know all the details, but they should just know thereā€™s a problem. So itā€™s almost like you would much rather have almost like checksā€¦ You know, like when a check fails, it goes from green to red, you know thereā€™s a problem with the thing. Itā€™s near real-time. But you hide, like ā€“ because to be honest, I donā€™t care why itā€™s down; I just want confirmation that thereā€™s a problem on your end, and itā€™s not a problem on my end, or somewhere in between.

Okayā€¦ So we talked about the status page, we talked aboutā€¦ What else did we talk about? Things that you would like to improve.

[52:34] Yeah. SLIs, SLOsā€¦

Yup, thatā€™s right.

ā€¦and the deployment pipeline for legacy.

Ah, yes. That was the one. How could I forget that? A deployment pipeline. Okay, cool. So these seem very specific things, very ā€“ almost like itā€™s easy to imagine, easy to work withā€¦ What about some higher-level things that you have planned for 2023? The year will be long, for sure.

Yeah. So weā€™ve sort of had a big change this year, where [unintelligible 00:53:02.09] Weā€™ve got changes to Pick, and weā€™re changing Pack next yearā€¦ But weā€™re trying to think from like an operational perspective how can we gain more efficiency out of our packers. And right now, when youā€™ve finished picking a trolley, you put it in like a drop zone, and then someone could ā€“ theyā€™re called a water spider. They come in, they grab the trolley, they shimmy it off to the packing desk, and then the packer puts it into a bin, and that water spider comes back and takes the bin thatā€™s full of orders over to a dispatch desk. And what we want to do is start automating that last bit of the journey, from the pack station to dispatch and labeling. Essentially, what weā€™ll do then is an operator will finish packing their order, they will put it onto a conveyor belt, and that conveyor belt will have a bunch of like sensors on it, which will sort of do weighing as the order is like conveyancing from the pack desk to the outbound desk. And if the order is not within like a valid tolerance that weā€™re happy with, we will kick it back into a ā€œproblem orderā€ bin, which will be like reweighed and relabeled. Because I said earlier, we got rid of the weighing step, and thereā€™s a certain variance that our carriers will tolerate, and say ā€œYep, thatā€™s fine. It should have been like X amount of grams weā€™ll still process it.ā€ But if we go like too much under or too much over, we can get chargebacks from the carrier, to say ā€œHey, you sent us this order, and it didnā€™t have the correct weight.ā€ So we want to start handling those in-house.

And what's gonna be really interesting is building the SLOs and SLIs around that. Like, how many orders are we skipping the weighing for at Pack and putting onto the conveyancing system, and how many orders are we kicking out? And have like an error budget on that, and seeing like how accurate our product weights are in the system, how accurate our packaging weights are… It's gonna be really interesting to see that in operation next year.

So I think the plan is weā€™ll probably get an independent contractor to come in and set up the conveyancing. But then we want our own bespoke software running in that pipeline that we can hook into, and start pulling data out of that. And Iā€™m really, really excited to start working on some of those automation pieces.
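
A toy sketch of the tolerance check described above: compare the weight reported by the conveyor's scales against the expected weight (products plus packaging) and kick the order out to the "problem order" bin if it falls outside what the carrier will tolerate. The 5% figure and the function name are hypothetical:

```php
<?php

function withinCarrierTolerance(float $measuredGrams, float $expectedGrams, float $tolerance = 0.05): bool
{
    if ($expectedGrams <= 0.0) {
        return false; // no reliable expected weight on file – always reweigh by hand
    }

    return abs($measuredGrams - $expectedGrams) / $expectedGrams <= $tolerance;
}

// e.g. withinCarrierTolerance(512.0, 500.0) is true (ship it),
//      withinCarrierTolerance(620.0, 500.0) is false (kick it to the problem bin)
```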

Itā€™s really interesting how youā€™re combining the software with the real world, right? So how everything you do ā€“ like, you can literally go in the warehouse and see how the software is being used, what is missing, what software is missing, what can be made more efficientā€¦ Because what youā€™ve just described, itā€™s a real-world process that can be simplified, can be made more efficient by adding a bit more software. And that belt. Very important. With the right sensors.

See, one of the really interesting parts about our company is everything, end-to-end, is bespoke; from like order ingest, to order being dispatched from the warehouse - we control everything in that pipeline.

[56:01] The only thing we depend on is buying labels from carriers. I mean, we spoke at some point about managing our own price matrices in real time of the carriers, and doing our own quoting and printing our own labels… Maybe one day we'll go in that direction, but it's a lot of work. And there's companies out there that are dedicated to doing that, so we have those as partners for now. But apart from that, pretty much everything out in the FC is completely bespoke.

You mentioned FC a couple of timesā€¦ Fulfillment center.

That's what it is. I was thinking, "What is it? What is FC?" It's not a football club… [laughter] Because it's like the World Cup is on, so FC - it's easy to associate, we're primed to associate with a football club… So it's not that. Fulfillment center, that's what it is.

Yeah. We used to say warehouses, but I think fulfillment center is more accurate to what we do.

Do you see more Kubernetes in your future, just about the same amount, or less? What do you think?

So I think purely because weā€™re moving to a more service-oriented architecture, weā€™re probably going to continue to depend on Kubernetes. I canā€™t see how practical a world would be where we have to keep provisioning new EC2 instances, and setting up our deployment pipelines to have specific EC2 instances as targets, and managing all the ingress to those instances manuallyā€¦ It just feels a little bit messy. Having one point of entry to the cluster, and also being able to like pull that from AWS to like GCP in the future if we ever wanted to move cloud providers - I think for us it makes more sense to stay on Kubernetes.

Okay. Technology-wise, Datadog was also mentioned, so I'm feeling a lot of love for Datadog coming from you, because it just makes a lot of things simpler and easier to understand, even though it's not muscle memory just yet… Are there other services that you quite enjoyed using recently?

Yeah. I'm shouting out to Datadog again, but it's just – it's another part of their ecosystem; they have something called RUM, Real User Monitoring. And when we actually deployed the Pick service, we were getting tons of feedback, but there was no real way to correlate the weird edge cases people were having, and we installed RUM. Basically, what it does is it records the user session end-to-end, takes screenshots and then uploads it to Datadog, and you can play that session back and watch it through, but it will also have like a timeline of all the different events that that operator clicked on through that timeline. So you can scrub through it and attach as much meta information to that trace as you'd like, just like with any other OpenTelemetry trace.

So in our example, we began to get lost, because we couldnā€™t correlate a screen recording to some actual like picker data that was stored in S3ā€¦ Whereas now, we store the picker into S3, which is like all of the raw data that the operator interacts with from an API perspectiveā€¦ But we also take that picker ID and attach it to the trace, along with their user ID, and along with the trolley they were picking onā€¦ So now we can just go into Datadog and say, ā€œHey, give me all the traces for this user, on this trolley.ā€ And if they said like they had a problem on Sunday with that trolley, we can now easily find that screen recording and watch it back. And we can also then correlate that with all of the backend traces that happened in that time period.

So we used to use Datadog and Sentry, and even though I have a lot of love for Sentry, and I think theyā€™re a great product, having it all under one roof and being able to tie all the traces together and get an end-to-end picture of exactly what a journey looked like - Iā€™m really starting to enjoy that experience with Datadog.
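
A hedged sketch (assuming the dd-trace-php extension is installed) of the backend half of what's described here: attach the pick run ID, operator ID and trolley to the active APM span so it can later be joined up with the RUM session and the raw data in S3. The tag names and variables are hypothetical, and the browser-side RUM setup lives in the frontend and isn't shown:

```php
<?php

// Hypothetical identifiers pulled from the current request:
$pickRunId  = 'pick-run-1234';
$operatorId = 5678;
$trolleyId  = 42;

$span = \DDTrace\active_span();

if ($span !== null) {
    $span->meta['pick.run_id']      = $pickRunId;
    $span->meta['pick.operator_id'] = (string) $operatorId;
    $span->meta['pick.trolley_id']  = (string) $trolleyId;
}
```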

Nice. Very nice. Okay. That makes a lot ā€“ I mean, it makes sense. I would want to use that. If I were in that position, why wouldnā€™t I want that? It sounds super-helpfulā€¦ And if it works for you, itā€™s most likely to work for others. Interesting. Anything else apart from these two?

[01:00:05.20] Iā€™m trying to think what else Iā€™ve usedā€¦ I mean, I was looking a bit at Honeycomb, and I really wanted to get it up and running for us, but they donā€™t yet have a PHP SDK. You have to sort of set it up with an experimental one thatā€™s sort of community-driven. I just havenā€™t had the time to plug into it. I went through their interactive demos, and I really, really, really want to try it. Itā€™s bugging me that we canā€™t make it work for us just yet. But no, no additional tooling. Those are the two at the top of my list.

Okay. So as we prepare to wrap up, for those that stuck with us all the way to the end, is there a key takeaway that youā€™d like them to have from this conversation? Something that ā€“ if they were to remember one thing, what would that thing be?

I think donā€™t be scared to keep moving the needle, and keep iterating on what youā€™ve got; even if you want to try a new service in production, having the sort of foresight to say ā€œWe can gracefully roll this out and scale out, but also gracefully roll it back if weā€™ve got issuesā€ is really powerful. And from my experience, like Iā€™ve touched on today, the more you do it, the more confidence you can get the rest of your business to have in your deployments. And that sort of leads to being able to keep iterating and deploy more frequently. And thatā€™s what we all want to do, right? We want to just keep making change and seeing like positive effects in production.

How do you replace fear with courage? How do you keep improving?

Just keep failing. Failing and learning from it. Like, thereā€™s no real secret formula. The first time you fail, as long as you can retrospect on that failure and take some key learnings away from itā€¦ And I think the more you fail, the better. As long as youā€™re not failing to the point where youā€™re taking production completely offline and costing your business like thousands of pounds, and maybe your customers lose confidence in your productā€¦ As long as youā€™ve risk-assessed what youā€™re deploying and you have a backup strategy, I think thatā€™s how you replace fear with courage - just knowing youā€™ve got that safety blanket of being able to eject.

Yeah. Well, itā€™s been a great pleasure, Alex, to watch you go from April, when we first talked, and we posted some diagrams, to now Decemberā€¦ You successfully sailed through Black Friday, Christmas as well, a lot of orders, physical orders have been shipped, a lot of socks, by the sounds of itā€¦

Yeahā€¦ Everyoneā€™s getting socks for Christmas this year.

Exactly. Yeah, apparently. Itā€™s great to see, from afar, and for those brief moments from closer up, to understand what youā€™re doing, how youā€™re doing it, how youā€™re approaching problems that I think are fairly universal, right? Taking production down. Everyone is afraid of that. Different stakes, based on your company, but still, taking production down is a big deal. Learning from when things fail, trying new things outā€¦ Itā€™s okay if itā€™s not going to work out, but at least youā€™ve tried, youā€™ve learned, and you know ā€œOkay, itā€™s not that.ā€ Itā€™s maybe something else, most probably. And not accepting the status quo. Each of us have legacy. Our best idea six months ago is todayā€™s legacy. Right? And it is what it is; itā€™s served its purpose, and now itā€™s time for something new. Keep moving, keep improvingā€¦ Thereā€™s always something more, something better that you can do.

I completely agree. It's been great to come back on, and I look forward to sharing the automation piece sometime next year.

Yeah. And Iā€™m looking forward to adding some more diagrams in the show notesā€¦ Because I remember your 10-year roadmap - that was a great one. Iā€™m wondering how that has changed, if at allā€¦ And what is new in your current architecture, compared to what we had. I think this is like the second wave of improvements. Six months ago we had the first wave, we could see how well that worked in productionā€¦ And now we have the second wave of improvements. Very exciting.

Yeah, Iā€™ll send those over when I have them.

Thank you, Alex. Thank you. Thatā€™s a merry Christmas present, for sure. Merry Christmas, everyone. See you in the new year.

Merry Christmas. Cheers.


Our transcripts are open source on GitHub. Improvements are welcome. šŸ’š
