Eight months ago, in 🎧 episode 49, Alex Sims (Solutions Architect & Senior Software Engineer at James & James) shared with us his ambition to help migrate a monolithic PHP app running on AWS EC2 to a more modern architecture. The idea was some serverless, some EKS, and many incremental improvements.
So how did all of this work out in practice? How did the improved system cope with the Black Friday peak, as well as all the following Christmas orders? Thank you Alex for sharing with us your Ship It! inspired Kaizen story. It's a wonderful Christmas present! 🎁
Featuring
Sponsors
Sourcegraph – Transform your code into a queryable database to create customizable visual dashboards in seconds. Sourcegraph recently launched Code Insights – now you can track what really matters to you and your team in your codebase. See how other teams are using this awesome feature at about.sourcegraph.com/code-insights
Raygun – Never miss another mission-critical issue again – Raygun Alerting is now available for Crash Reporting and Real User Monitoring, to make sure you are quickly notified of the errors, crashes, and front-end performance issues that matter most to you and your business. Set thresholds for your alert based on an increase in error count, a spike in load time, or new issues introduced in the latest deployment. Start your free 14-day trial at Raygun.com
Notes & Links
Chapters
Chapter Number | Chapter Start Time | Chapter Title | Chapter Duration |
1 | 00:00 | Welcome | 01:04 |
2 | 01:04 | Sponsor: Sourcegraph | 01:43 |
3 | 02:46 | Intro | 05:56 |
4 | 08:43 | Moving towards Kubernetes | 04:14 |
5 | 12:57 | Feature flagging, did it work? | 04:00 |
6 | 16:57 | Black Friday and Christmas | 06:59 |
7 | 23:56 | From problem to solution | 01:37 |
8 | 25:40 | Sponsor: Raygun | 02:07 |
9 | 27:53 | Datadog | 01:44 |
10 | 29:37 | MySQL and MariaDB | 05:28 |
11 | 35:06 | SLIs and SLOs | 03:17 |
12 | 38:22 | Day to day before Christmas | 02:42 |
13 | 41:07 | The improvement list | 03:06 |
14 | 44:14 | Deployment pipeline for legacy | 03:10 |
15 | 47:24 | It's going great | 05:02 |
16 | 52:26 | Big plans for 2023 | 02:54 |
17 | 55:20 | Where software meets the real world | 01:31 |
18 | 56:51 | More Kubernetes or less? | 00:50 |
19 | 57:41 | RUM (Datadog) | 02:55 |
20 | 1:00:36 | Key takeaway | 01:39 |
21 | 1:02:15 | Wrap up | 02:00 |
22 | 1:04:17 | Outro | 00:52 |
Transcript
Play the audio to listen along while you enjoy the transcript. 🎧
Alex, welcome back to Ship It.
Yeah, it's great to be back. It doesn't feel that long since our last chat.
No, it wasn't. It was episode #49, April(ish)… Six months, actually; six, seven months.
So much has changed…
Yes. The title is "Improving an e-commerce fulfillment platform." A lot of big words there… The important one is the improving part, right?
Indeed, yeah. So much has changed, and it's really interesting… I think the last time we spoke, we fulfilled about 15 million orders, and we're closely approaching 20 million. So almost another 5 million orders in six months. It's just crazy the pace we're moving at this year.
That's nice. It's so rare to have someone be able to count things so precisely as you are… And it's a meaningful thing, right? It's literally shipping physical things to people around the world, right? Because you're not just in the UK. Now, the main company is based in the UK; you have fulfillment centers around the world.
Yeah, we've now got four sites. We're in the UK, we've got two sites in the US, in Columbus, Ohio, and we've just opened one up in Vegas, and we've got another site in Auckland. So it's growing pretty quick, and I think this year we're opening two more sites.
This year 2022?
Sorry, yeah - we opened one site this year, which is the Vegas one… Oh, and Venlo; that's the Netherlands… And yeah, we've got plans for two more sites next year, I believe.
Nice. So an international shipping company that shipped 5 million orders in the last six-seven months. Very nice.
Funny story on thatā¦ Iād love to imagine it was just me, sat there, counting orders as they go out the door, but weāve actually got a big LED sign thatās mounted to the wall, and every time an order dispatches, it ticks up. Itās a nice bit of fun. Thereās one in the office and thereās one mounted on the wall in the fulfillment center, and itās quite interesting to see that ticking up, especially this time of year, when the numbers really start to move.
Yeah. Okay, okay. So for the listeners that havenāt listened to episode #49 yet, and the keyword is āyetā, right? Thatās a nudge; go and check it outā¦ What is it that you do?
Yeah, so I mischaracterized myself last time as a 4PL; weāre actually a 3PL. I was corrected. And essentially, we act on behalf of our clients. So imagine youāre somebody that sells socks, and you have a Shopify account, and you come to us and we connect to your Shopify account, we ingest your orders, and we send them out to your customers via the cheapest shipping method, whether that be like Royal Mail in the UK, or even like FedEx, going international [unintelligible 00:05:58.17] We handle all of that. And then we provide tracking information back to your customers, and give you insights on your stock management, and ā yeah, thereās tons of moving parts outside of just the fulfillment part. Itās all about how much information can we provide you on your stock, to help you inform decisions on when you restock with us.
Okay. So that is James & James. James & James, the company - that's what the company does. How about you? What do you do in the company?
Yeah, so I've sort of transitioned through many roles over the last few years. Started this year, I was a senior engineer, and I've transitioned to a solutions architect role this year. Main motivation for that is we've predominantly been a monolithic – we had a big monolith that was on a very legacy version of Symfony; Symfony 1.4, to be specific… And we want to start making tactical incisions to start breaking some of those core parts of our application now into additional services, that use slightly more up-to-date frameworks that aren't going to take us years to upgrade, say, a 1.4 version of Symfony to something modern. We've decided it's going to be easier to extract services out, and put them into new frameworks that we can upgrade as we need to, and it's sort of my job to oversee all of the technical decisions we're taking in the framework, but also how we plan upgrades, how we stitch all these new systems together, and most importantly, how we provide sort of like a cohesive experience to the end user. I think there's six services running behind the scenes. To them it's just one sort of UI that's a portal into it all.
Yeah. When you say end users, this is both your staff and your customers, right?
Exactly. We have two applications, one called CommandPort which is our sort of internal tool where we capture orders, and pick and pack them and dispatch them, and then we have the ControlPort which is what our clients use, which is their sort of portal into whatās going on inside the warehouse, without all of the extra information they donāt really care about.
[08:08] Okay. And where do these services ā I say services; I mean, where do these applications run? Because as you mentioned, thereās multiple services behind them. So these two applications, where do they run?
Yeah, so they run in AWS, on some EC2 instances, but we have recently created an EKS cluster for all of our new services, and weāre slowly trying to think about how we can transition our old legacy application into the cluster, and start spinning down some of these old EC2 instances.
Okay. I remember in episode #49 thatās what we started talking about, right? Like, the very early steps towards the Kubernetes architecture, or like Kubernetes-based architecture, to see what makes sense, what should you pick, why would you pick one thing over another thingā¦ Thatās been six months ago. How did it work in practice, that migration, that transition?
Yeah, so it worked pretty well. So one of our biggest projects over these last six months has been to rewrite Pick, which is one of our largest parts of our operation, into a new application. So what we ended up doing - we created a Remix application, which is a React framework, and thatās deployed on the edge using Lambda, just so you get pretty much fast response times from wherever youāre requesting it fromā¦ So that sits outside the cluster. And then we have a new Pick API, which is built using Laravel; thatās deployed inside of EKS, and also a new auth service, which is deployed inside of EKS as well.
So currently, the shape of our cluster is two services running inside of EKS, and then our EC2 instances make requests into the cluster, and that lambda function also makes requests into the cluster. We have three nodes in there, operating on a blue/green deploy strategy. It was actually really interesting, we got bitten by a configuration error.
Okayā¦
This might make you laughā¦ To set the scene - itās Friday night, the shift is just handed over to the next shift manager in the FC. Weāve been Canary-releasing one or two operators for the last two weeks, doing some testing in production on the new Pick service, and everythingās been going flawlessly. Weāre like āThis is such a great deployment. Weāre happy. Thereās been no errors. Letās roll it out to 30% of everybody thatās running on tonightās shift.ā
And earlier that day, I was speaking with one of our ops engineers, and I said, āItās really bugging me that we only have one node in our cluster. It doesnāt really make much sense. Could we scale it to three nodes, and then also do blue/green deploy on that?ā He was like, āYes, sure. No worries.ā We added two more nodes to the cluster, we deployed the app over those three nodes. He sort of looked at the state of Kubernetes, and he was like, āYeah, itās great. I can see all three instances running, I can see traffic going to all of themā¦ Yeah, no worries. Call it a day.ā
I started getting pinged on WhatsApp, and they're saying "Everything in Pick's broken. If we refresh the page, it takes us back to the start of our Pick route. We're having to rescan all the items again… Someone's got a trolley with 100 stops on it, and they're having to go to the start…" And I'm like "What the f is going on?" And it turned out that in the environment variables that we'd set for the application, we'd set the cache driver to be file instead of Redis.
Ahh… Okay.
So as soon as someone got directed to another node, they lost all of their progress, and they were getting reset. So that taught me to not just deploy on a Friday night and be happy that the tests passed, because…
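For readers who want to picture the failure mode: in a Laravel-style setup the cache store is chosen by an environment variable, and anything progress-like written through a per-node file store silently stops being shared the moment you run more than one node. A minimal sketch, with hypothetical keys and values rather than the actual James & James configuration:

```php
<?php
// .env on a single node (works, because there is only one filesystem):
//   CACHE_DRIVER=file
// .env once you scale to multiple nodes (state must live in a shared store):
//   CACHE_DRIVER=redis

// config/cache.php (Laravel-style): the default store comes from the env var,
// so the same image behaves very differently depending on deployment config.
return [
    'default' => env('CACHE_DRIVER', 'file'),

    'stores' => [
        // Per-node storage: each pod/instance keeps its own copy of the data.
        'file' => [
            'driver' => 'file',
            'path'   => storage_path('framework/cache/data'),
        ],

        // Shared storage: every node reads and writes the same keys,
        // so a picker bouncing between nodes keeps their progress.
        'redis' => [
            'driver'     => 'redis',
            'connection' => 'cache',
        ],
    ],
];
```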
Oh, yes. And then I think ā because youāve been testing with like a single instance, right? ā¦and everything looked good. So going from one to three seemed like āSure, this is gonna work. No big deal.ā Itās so easy to scale things in Kubernetes when you have that.
Yeahā¦
And then things like thisā¦ āAhā¦ Okay.ā That sounds like a gun to your foot. What could possibly happen? [laughs] Okay, wowā¦
[12:10] It was really nice to have an escape hatch, though. So we deployed everything behind LaunchDarkly. So we have feature flags in there. And literally, what I did is I switched off the – scaled the rollout down to 0%, everyone fell back to the old system, and it was only the cached state that was poisoned. So their actual state of what they picked had all been committed to the database. So as soon as I scaled that down to zero, they fell back to the old system, and were able to continue, and I think we only really had like 10 minutes of downtime. So it was really nice to have that back-out method.
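The escape hatch Alex describes is essentially a percentage rollout with a kill switch. The sketch below is purely illustrative, not the LaunchDarkly SDK: bucket each operator deterministically, compare against a rollout percentage that can be dialled to zero, and fall back to the old path when the flag is off.

```php
<?php

/**
 * Deterministically decide whether an operator is in the new-Pick rollout.
 * Illustrative only: a flag service like LaunchDarkly does this (and much
 * more) for you; the point is that dialling $rolloutPercent to 0 instantly
 * routes everyone back to the legacy flow.
 */
function inNewPickRollout(string $operatorId, int $rolloutPercent): bool
{
    if ($rolloutPercent <= 0) {
        return false;               // kill switch: everyone on the old system
    }

    // Hash the operator ID into a stable bucket from 0-99, so the same
    // operator always lands on the same side of the split.
    $bucket = hexdec(substr(md5('pick-rewrite:' . $operatorId), 0, 8)) % 100;

    return $bucket < $rolloutPercent;
}

// Usage sketch: route the request based on the flag (IDs are hypothetical).
$operatorId = 'op-12345';
$rollout    = 30;   // 30% of tonight's shift

if (inNewPickRollout($operatorId, $rollout)) {
    // forward to the new Remix/Laravel Pick service
} else {
    // fall back to the legacy flow
}
```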
Yeahā¦ But you say downtime - to me, that sounds like degradation, right? 30% of requests were degraded. I mean, they behaved in a way that was not expected. So did ā again, Iām assumingā¦ Did the majority of users have a good experience?
No, everybody that was being targeted ā sort of 30% of operators that were going to the new service, everyone had a bad experience.
Right. But the 70% of operators, they were okay.
Oh yeah, exactly.
Yeah. So the majority was okay. Okayā¦ Well, feature flags for the win, right?
Yeah, it was really nice, because this is the first time weāve deployed a new service like this, and it was the first time trying feature flags. And even though we had an incident, it was really nice to have that graceful backout, and be confident that we could still roll forward. And in the WhatsApp chat with our operations manager, we were just sending emojisā¦ roll forward, and itās like, rolling panda down a hill. He was just like āYeah, no worriesā¦ā
[laughs] That's what you want. That's it. That's the mindset, right? That's like the mindset of trying something new. You think it's going to work, but you can never be too confident. The more confident you are, the more – I don't know, the more painful, I think, the failure… Like, if you're 100% confident it's going to work and it doesn't, what then? Versus "I think it's going to work. Let's try it. I mean, if it won't, this is the blast radius… I'm very aware of the worst possible scenario, and I'm okay with that risk", especially when it comes to production, especially when it comes to systems that cost money when they're down. So imagine if this would have happened to 100% of the stuff. I mean, you'd be basically stopped for like 10 minutes, and that is very costly.
Yeah. And itās been really nice to see like the mindset of people outside of tech evolve over the past couple of years. There was a time where we would code-freeze, everything would be locked down, and nothing would happen for two months. And slowly, as weāve started to be able to introduce things that mitigate risk, the mindset of those people external to us has also changed, and itās just a really nice thing to see that we can keep iterating and innovating throughout those busy periods.
Once you replace fear with courage, amazing things happen. Have the courage to figure out how to apply a change like thisā¦ Risky, because all changes are risky if you think about it, in production. The bigger it is, the hotter it runs, the more important the blast radius becomes. I donāt think that youāll never make a mistake. You will.
No, exactly.
Sooner or later. The odds are in your favor, but every now and then, things go wrong. Cool. Okay.
I mean, I was very confident with this until I realized Iād broken all of the reporting on that service that I shared in the last episode; it just completely fell on its face.
Really?
[15:46] Because I found in the old system it did two saves, and we use change data capture to basically analyze the changes on the record as they happen in real time with Kafka. And it ultimately did two saves. It did one to change the status of a trolley from a picking state to an end shift state, and one change to divorce the relationship with the operator from that trolley. And in the application that consumes it, it checks for the presence of the operator ID that needs to be on the trolley, and the status needs to change in that row. If that case wasnāt satisfied, it would skip it, and that trolley would never be released, which means the report would never be generated.
And what ended up happening is I saw that old code and went "Why would I want to do two saves back to back, when I can just bundle it all up into one and be like micro-efficient?"
[laughs] Of course.
"Oh, okay. Yeah, I'm just gonna take down like a week's worth of reporting." Yeah, that wasn't fun.
All great ideas.
We could live without it, though. It's all edge stuff, and – yeah, we can live without it. It's fixed now, but… Yeah, finding those things and going "Oh, my god, I can't believe that's a thing…"
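To make the consumer check Alex described a moment ago concrete: the downstream change-data-capture consumer only releases a trolley when it sees the status transition while the operator is still attached, so collapsing the two saves into one changed the shape of the events it receives. A hedged sketch of that kind of gating logic, with hypothetical field names rather than the actual consumer code:

```php
<?php

/**
 * Sketch of the check a CDC consumer might apply to a trolley row coming off
 * Kafka. Field names are invented; the point is that the release only fires
 * when the status transition and the operator ID arrive in the expected order.
 */
function shouldReleaseTrolley(array $before, array $after): bool
{
    $statusChanged = ($before['status'] ?? null) === 'picking'
        && ($after['status'] ?? null) === 'end_shift';

    // The consumer expects the operator to still be attached on this event;
    // a separate, later event removes the operator.
    $operatorStillPresent = !empty($after['operator_id']);

    return $statusChanged && $operatorStillPresent;
}

// With two saves, event 1 satisfies both conditions and the trolley is
// released; event 2 then detaches the operator. Bundled into one save,
// operator_id is already null on the only event, so nothing is released
// and the report never gets generated.
$before       = ['status' => 'picking',   'operator_id' => 42];
$afterBundled = ['status' => 'end_shift', 'operator_id' => null];

var_dump(shouldReleaseTrolley($before, $afterBundled)); // bool(false)
```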
Okayā¦ Thatās a good one. So you had two, possibly the biggest events ā no, I think theyāre probably the biggest events. I mean, I donāt work in the physical shipping world, but I imagine that Black Friday and Christmas are the busiest periods for the shipping industry as a whole. I think itās like the run-up, right? Because the things have to be there by Black Friday, and things have to be there by Christmas. How did those two major events work out for you with all these changes to the new system that started six months ago?
So to give an idea of what our normal sort of daily volume is, and maybe set the scene a bit - we're normally about 12,000 orders a day, I think, and on the ramp-up to Black Friday, from about the 20th of November, we were up to roughly 20,000 a day. And on Black Friday I think 31,000 was our biggest day of orders. And to also set the picture a little bit better, in the last six months I said we've done about 5 million orders; in the last 15 days, we've done about 400,000 orders across all of our sites.
Thatās a lot.
So yeah, volume really ramps up. And we were really, really confident this year, going in from like a system architecture perspective . Weād had a few days where we had some spiky volume and nothing seemed to let up, but it seemed to all ā not start going wrong, because we never really had a huge amount of downtimeā¦ But a lot of our alarms in Datadog were going off, and Slack was getting really bombarded, and we had a few pages that were 503ing, because they were just timing outā¦ We were suddenly like āWhatās going on? Why is the system all of a sudden going really slow?ā And weād released this change recently called ālabel at pack.ā And essentially, what it did is as youāre packing an order, previously, youād have to like pack all the items, and then once youāve packed all the items, you weighed the order, and then once youāve weighed the order, you wait for a label to get printedā¦ But it was really slow, because that weighing step you donāt need; you already know whatās going in the box, you know what box youāre choosing, so you donāt need that weigh step. And it means as soon as you start packing that order, we can in the background go off and make a request to all of our carriers, quote for a label, and print it.
So at the time that you finished packing all the stuff in the box, you've got a label ready to go. But what we didn't realize is that AJAX request wasn't getting fired just once; it was getting fired multiple times. And that would lead to requests taking upwards of like sometimes 30 or 40 seconds to print a label… If you have tens of these requests going off, and we've got 80 packing desks, that's a lot of requests that the system's making, and it really started to slow down other areas of the system. So we ended up putting some SLOs in, which would basically tell us if a request takes longer than eight seconds to fire, we'll burn some of the error budget. And we said "Oh, we want 96% of all of our labels to be printed within eight seconds." And I think within an hour, we burned all of our budget, and we were like, "What's going on? How is this happening?" And it was only when we realized that the AJAX request was getting fired multiple times that we changed it. And as soon as that fix went out, the graph was like up here, and it just took a nosedive down, everything was sort of printing within eight or nine seconds, and the system seemed to be a lot more stable.
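The SLO arithmetic Alex mentions is simple enough to sketch: with a 96% target, the error budget is the 4% of label prints allowed to take longer than eight seconds, and every slow print burns a slice of it. A rough illustration with invented numbers, not their actual Datadog monitor:

```php
<?php

/**
 * Rough error-budget math for a "96% of labels print within 8 seconds" SLO.
 * Purely illustrative; a tool like Datadog computes this once the SLO is defined.
 */
function errorBudgetReport(int $totalPrints, int $slowPrints, float $target = 0.96): array
{
    $allowedSlow = (1.0 - $target) * $totalPrints;   // the error budget, in events
    $burned      = $allowedSlow > 0 ? $slowPrints / $allowedSlow : INF;

    return [
        'budget_events' => (int) floor($allowedSlow),
        'budget_burned' => round($burned * 100, 1) . '%',
        'slo_still_met' => $slowPrints <= $allowedSlow,
    ];
}

// Say 2,000 labels were printed in the first hour and 120 of them took longer
// than 8 seconds: the budget is 80 slow prints, so it is already 150% burned,
// which is roughly the "burned it all within an hour" moment described above.
print_r(errorBudgetReport(2000, 120));
```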
[20:24] Thereās also a few pages that are used for reporting, theyāre like our internal KPIs to see how many units and orders weāve picked, and operator level, by day, week, monthā¦ And theyāre used a lot by shift managers in the FC. And historically, theyāre a bit slowā¦ But in peak, when weāre doing a lot more queries than normal, weāre going really slow. I think ā what was happening? Iām not sure how much technical detail you want to go intoā¦
Go for it.
Yeah, we use ORM in our legacy application, and we greedy-fetch a lot of stuff.
Okayā¦
We definitely over-fetchā¦
From the database, right? L
From the database.
Youāre getting a lot of records, a lot of rows; any scanning, anythingā¦
Yeah, just tons of rows, and weāve got a reasonably-sized buffer pool. So all those queries run in memory. But what happens is when the memory in the buffer pool is used up, those queries will start running on disk. And once they start running on disk, it significantly degraded performance.
Yeah. Let me guess - spinning disks? HDDs?
So I thought weād upgraded to SSDs on our RDS instance, but I need to go back and clarify that.
That will make a big difference. And then thereās another step up; so you go from HDDs to SSDs, and then you go from SSDs to NVMEs.
Yeah, I think that's where we need to go. I think we're at SSD, but it's still on those – like, scanning-millions-of-rows queries, and over-fetching like 100 columns or more at a time, maybe 200 columns, the amount of joins that those queries are doing… Yeah, they're going straight into the table. But yeah, they were essentially taking the system offline, because they would just run for like 10-15 minutes, eat a connection up for that entire time, and then you've got someone out there hitting Refresh, so you've got 30 or 40 of these queries being run, and no one else can make requests to the database, and it chokes. So we ended up finding that if we changed, or forced different indexes to be used in some of those queries, and reduced the breadth of the columns, they are able to still run, within tens of seconds; so it's still not ideal, but it was enough to not choke the system out.
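The shape of that fix can be sketched as a before/after query: stop selecting every column across wide joins and, where the optimizer keeps choosing badly, force the index by hand. MariaDB and MySQL both support FORCE INDEX; the table, column, and index names below are made up for illustration, as are the PDO connection details.

```php
<?php
// Before: over-fetching every column across wide joins. Once the working set
// no longer fits in memory, sorts and temp tables spill to disk and everything
// slows to a crawl.
$before = "
    SELECT o.*, i.*, c.*
    FROM orders o
    JOIN order_items i ON i.order_id = o.id
    JOIN clients c     ON c.id = o.client_id
    WHERE o.dispatched_at BETWEEN :from AND :to
";

// After: only the columns the report actually needs, plus an explicit index
// hint where the optimizer kept picking a full scan.
$after = "
    SELECT o.id, o.dispatched_at, o.client_id, i.sku, i.quantity
    FROM orders o FORCE INDEX (idx_orders_dispatched_at)
    JOIN order_items i ON i.order_id = o.id
    WHERE o.dispatched_at BETWEEN :from AND :to
";

// Hypothetical usage:
$pdo  = new PDO('mysql:host=db;dbname=fulfilment', 'user', 'secret');
$stmt = $pdo->prepare($after);
$stmt->execute([':from' => '2022-11-20', ':to' => '2022-11-27']);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
```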
And luckily, these things all started happening just ahead of Black Friday, so then we were in a much better position by the time Black Friday came along. We also found that we accidentally, three years ago, used the Redis KEYS command to do some look-ups from Redis, and didn't realize in the documentation it says "Use this with extreme care in production, because it does an O(N) scan over the entire keyspace."
Okay…
Yeah. And when you've got 50 million keys in there, it locks Redis for a while, and then things also don't work. So we swapped that with SCAN, and that alleviated a ton of stress on Redis. So yeah, there's some really pivotal changes that we made this year. They weren't massive in terms of like from a commit perspective, but they made a huge difference on the performance of our system.
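The KEYS-versus-SCAN change is worth spelling out, because it is such an easy trap: KEYS walks the whole keyspace in one blocking call, while SCAN iterates in small cursor-driven batches. A minimal sketch, assuming the phpredis extension; the hostname and key pattern are invented.

```php
<?php
// Assumes the phpredis extension; host and key pattern are illustrative.
$redis = new Redis();
$redis->connect('redis.internal', 6379);

// What not to do on a 50-million-key instance: KEYS blocks the server while
// it scans the entire keyspace, so everything else queues behind it.
// $keys = $redis->keys('pick:session:*');

// SCAN instead: cursor-based, a batch at a time, never holds the server for
// the whole traversal. SCAN_RETRY makes phpredis keep calling until it has
// keys or the iteration is finished, so the simple while-loop idiom works.
$redis->setOption(Redis::OPT_SCAN, Redis::SCAN_RETRY);

$matched  = [];
$iterator = null; // phpredis advances this cursor by reference

while ($batch = $redis->scan($iterator, 'pick:session:*', 1000)) {
    foreach ($batch as $key) {
        $matched[] = $key;
    }
}

echo count($matched) . " keys matched\n";
```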
Thatās it. I mean, thatās the key, right? It doesnāt matter how many lines of code you write; people that still think in lines of code, and like āHow big is this change?ā You actually want the really small, tiny decisions that donāt change a lot at the surface, but have a huge impact. Some call them low-hanging fruit. I think thatās almost like it doesnāt do them justice. I think like the big, fat, juicy fruits, which are down - those are the ones you ought to pick, because they make a huge difference to everything. Go for those.
I'm wondering, how did you figure out that it was the database, it was like this buffer pool, and it was the disks? What did it look like from "We have a problem" to "We have a solution. The solution works"? What did that journey look like for you?
[24:12] Yeah, so I'm not sure how much of this was sort of attributed to luck… But we sort of dived straight into the database.
There's no coincidence. There's no coincidence, I'm convinced of that. Everything happens for a reason. [laughs]
There's no correlation.
You just don't know it yet. [laughs]
But yeah, we just connected to the database, ran SHOW PROCESSLIST and saw that the queries had been running for a long time… It's like "Hm… We should probably start killing off all these ones that have been sat there for like 1000 seconds. They don't look healthy…" [laughs]
Okay. So before we killed them, we sort of copied the contents of that query, pasted it back in, and put an EXPLAIN before it, and just sort of had a look at the execution plan… And then saw how many rows it was considering, saw the breadth of the columns that are being used by that query, and then when we tried to run it again, it gives you sort of status updates of what the query is doing. And when it's just like copying to temp table for over two minutes, you're like "That's probably running on disk and not in memory." So there's a bit of an educated assumption there of – we weren't 100% confident that's what was happening, but based on what the database was telling us it's doing, we were probably assuming that's what was happening. Now, none of us are DBAs, I just want to sort of clear that up… But that was our best educated guess, correlated with what we could find online.
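For anyone retracing that triage on a MariaDB/MySQL box, the moving parts are the processlist to find the long runners, EXPLAIN to read the execution plan, and KILL for the queries that have clearly gone off the rails. A hedged sketch using PDO; the connection details and thresholds are invented, and this is the sort of thing to run carefully, not blindly, against production.

```php
<?php
// Connection details are invented; the statements are standard MySQL/MariaDB.
$pdo = new PDO('mysql:host=db;dbname=fulfilment', 'user', 'secret');

// 1. Find queries that have been running suspiciously long (here: > 5 minutes).
$longRunners = $pdo->query(
    "SELECT id, time, state, info
     FROM information_schema.processlist
     WHERE command = 'Query' AND time > 300
     ORDER BY time DESC"
)->fetchAll(PDO::FETCH_ASSOC);

foreach ($longRunners as $row) {
    // 2. Look at the plan before killing anything. Huge 'rows' estimates and
    //    'Using temporary; Using filesort' in Extra are the usual tells that
    //    a query has spilled out of memory and onto disk.
    if ($row['info'] !== null && stripos(ltrim($row['info']), 'SELECT') === 0) {
        foreach ($pdo->query('EXPLAIN ' . $row['info']) as $planRow) {
            print_r($planRow);
        }
    }

    // 3. Only then reclaim the connection.
    $pdo->exec('KILL ' . (int) $row['id']);
}
```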
Is there something that you think you could have in place, or are thinking of putting in place to make this sort of troubleshooting easier? ā¦to make, first of all figuring out there is a problem and the problem is most likely the database?
So we already have some of that. We use APM in Datadog, and it automatically breaks out like queries as their own spans on a trace, so you can see when youāve got a slow-running query. And we do have some alarms that go off if queries exceed a certain breakpoint. But there are certain pages and certain queries that we understand are slow, and we kick those into like a āKnown slow pagesā dashboard, that we donāt tend to look at as much, and we donāt want bombarding Slack, because we donāt want to be getting all these alarms going off for things we know are historically slow.
There's a few of us on the team - shout-out to George; he's a bit of a wizard on Datadog at the moment, and really gets stuck in there building these dashboards. And those are the dashboards that we tend to lean towards first; you can sort of correlate slow queries with when disk usage goes up on the database, and those dashboards are really helpful… But normally, when we're in the thick of it, the first thing I run to isn't Datadog, and I don't know why, because it paints a really clear picture of what's going on.
I tend to – I think it's muscle memory, and… Over the past five years, when we didn't have Datadog, I would run straight to the database first, and start doing a SHOW PROCESSLIST – what's in there, and why is that slow… And then I'd forget to go check our monitoring tool. So I think for me there's a bit more of a learning curve of how do I retrain myself to approach a problem looking at our tooling first, rather than at the database.
Okay. So Datadog has the APM stuff. From the application perspective, what other integrations do you use to get a better understanding of the different layers in the system? Obviously, there's the application, there's the database server itself, then there's the – MySQL, or PostgreSQL?
We use MariaDB.
MariaDB, okay.
So it's a variant of MySQL.
In my head - MySQL. Legacy - MySQL. [laughs] It's like a fork… Like, "Which one is it?" The MySQL fork. So I don't know, does Datadog have some integration for MySQL/MariaDB, so that you can look inside of what's happening in the database?
I believe it does. And I think we actually integrated it. I just have never looked at it.
Oh, right. Youāre just like not opening the right tab, I seeā¦ [laughs]
Yeah, because if I look at integrations, we've got like 15 things enabled. We've got one for EKS… Oh, we do have one for RDS, so we should be able to see… We have it for Kafka as well, so we can see any lag on topics, and when consumers stop responding… So those sorts of things alert us when our edge services are down. Yeah, I think we monitor a lot, but we haven't yet fully embraced the culture of "Let's get everyone to learn what's available to them", and that's something that I hope we sort of shift more towards in '23.
That sounds like a great improvement, because each of you having almost like a source of truthā¦ Like, when something is wrong, where do I go first? Great. And then when Iām here, what happens next? So having almost like a ā I want to call it like play-by-play, but itās a bit more than that. Itās a bit of āWhat are the important thingsāā, like the forks, if you wish, in the road, where I know itās the app, or itās the instance, like the frontend instances if you have such a thing, or itās the database. And then even though we have services ā I think services make things a little bit more interesting, because then you have to look at services themselves, rather than the applicationsā¦ And then I know thereās toolsā¦ Like, service meshes come to mind; if anything, thatās the one thing that service meshes should help with, is understand the interaction between services, when they degrade, automatic SLIs, SLOs, all that stuff.
So thatās something that at least one person would care about full-time, and spent full-time, and like they know it outside in, or inside out, however you wanna look at it; it doesnāt really matterā¦ But they understand, and they share it with everyone, so that people know where to go, and they go āThatās the entry point. Follow this. If it doesnāt work, let us know how we can improve itā, so on and so forth. But that sounds ā itās like that shared knowledge, which is so important.
[32:17] Itās a bit of an interesting place, because we have a wiki on our GitHub, and in that wiki there are some play-by-plays of common issues that occur. I think weāve got playbooks for like six or seven of them, and when the alarm goes off in Datadog, thereās a reference to that wiki document.
So for those six or seven things, anybody can respond to that alarm and confidently solve the issue. But we havenāt continued to do that, because there arenāt that many common issues that frequently occur, that weāve actually then gone and applied a permanent fix for you. Weāve got a few of these alarms that have been going off for years, and itās just like, āHey, when this happens, go and do these stepsā, and you can resolve it. And as a solutions architect, one of my things that I really want to tackle next year is providing more documentation over the entire platform, to sort of give people a resource of āSomethingās happened in production. How do I start tracing the root cause of that, and then verifying that what Iāve done has fixed it for any service that sort of talks about that?ā But yeah, weāre not there yet. Hopefully, in our next call we touch on that documentation.
Yeah, of course. The only thing that matters is that you keep improving. I mean, to be honest, everything else, any incidents that come your way, any issues - opportunities to learn. Thatās it. Have you improved, having had that opportunity to learn? And if you have, thatās great. Thereāll be many others; they just keep coming at you. All you have to do is just be ready for them. Thatās it. And have an open mind.
And Iām wonderingā¦ So I know that the play-by-plays and playbooks are only so useful, because almost every new issue is like a new one. Right? You havenāt seen that before. Would it help if youāre able to isolate which part of the system is the problem? The database versus the CDN (if you have such a thing), network, firewall, things like that?
Yeah, it would be really useful. And one thing we're trying to do to help us catalog all of these is anytime we have an incident. We've not gone for proper incident [unintelligible 00:34:19.25] We were looking at incident.io. We haven't sprung for it yet. We just have an incidents channel inside of Slack, and we essentially start a topic there, and we record all of the steps that happened throughout that incident inside of that log. So if we ever need to go back and revisit it, we can see exactly what caused the issue, and also what services or pieces of infrastructure were affected… Because Slack search is pretty nice. You can start jumping into that incidents channel, something's gone wrong, you do a search and you can normally find something that might point you in the right direction of where you need to steer your investigation. We know it's not the most perfect solution, but it's worked so far.
If it works, it works. If it works, thatās it. You mentioned SLI and SLOs, and how they helped you understand better what is happening. I mean, first of all, signaling thereās a problem with something that affects users, and then being able to dig into it, and troubleshoot, and fix it. Are SLIs and SLOs a new thing that you started using?
Yeah, we're really sort of dipping our toes in the water and starting to implement them across our services. I think we currently have just two SLOs.
It's better than zero…
Exactly. We haven't yet decided on SLIs. We've got a chat next week with George, and we're going to sit down and think what components make up this SLO that can sort of give us an indication before it starts triggering that we've burned too much of our budget. So we've both got like a shared interest in SRE, and we're trying to translate that into James & James… But yeah, that's still very much amateur, and just experimenting as we go, but it's nice to see at the peak this year that the SLO that we did create gave us some real value back… Whereas previously, we would have just let it silently fail in the background, and be none the wiser.
[36:14] Yeah, thatās amazing. It is just like another tool in your toolbox, I supposeā¦ I donāt think you want too many. Theyāre not supposed to be used like alarms. Right? Especially when, you know youāre like thousands and thousands of engineersā¦ By the way, how many are you now in the engineering department?
I think weāre eight permanent and four contract, I believe.
Okay, so 12 people in total. Again, thatās not a big team, and it means that everyone gets to experience pretty much everything that happens in some shape or form. I think youāre slightly bigger than a two-pizza team, I thinkā¦ Unless the pizzas are really, really large. [laughter] So youāre not like ā sure, it can be one team, and I can imagine that like retros, if you have them, or stand-ups, or things like that are getting a bit more complicated with 12 people. Still manageable, but 20? Forget about it. Itās just like too much.
Yeah, it was getting a bit toughā¦ And what we do now is we have a single stand-up once a week, an hour long. Everyone gets in, and sort of unites their teams, and what weāve been doingā¦ And then we have like breakout teams. So weāve got four sub-teamsā¦
That makes sense.
And yeah, we have our dailies with them, and that seems way more manageable.
That makes sense. Yeah, exactly. But still, youāre small enough, again, to have a good understanding of most of the system, right? I mean, once you get to like 20, 30, 40, it just becomes a lot more difficult, because the system grows, more services, different approaches, and maybe you donāt want consensus, because thatās very expensive, right? The more you get, the more expensive that gets; it just doesnāt scale very well.
What Iām thinking is SLIs and SLOs are a great tool. A few of them that you can all agree on, all understand, and at least focus on that. Focus on delivering good SLOsā¦ No; actually, good SLIs, right? SLIs that match, that everyone can agree on, everyone understands, and itās a bit of clarity in what is a chaotic ā because it is, right? When you have two, three incidents happening at the same timeā¦ It does happen.
Okay. Okay. So these past few weeks have been really interesting for you, because itās been the run-up to Christmas. More orders, as you mentioned, the system was getting very busyā¦ What was the day to day like for you? Because I think you were mentioning at some point that you were with the staff, on the picking floor, using the system that you have improved over those months. What was that like?
Yeah, it was really interesting. This year I really wanted to just use the Pick part of the system. So last year I did a bunch of packing of orders, and that was fine. But after spending sort of like four months rewriting Pick, I really wanted to just take a trolley out and just go pick a ton of orders and experience it for myself. So yeah I did three, four days down there, picked like a thousand ordersā¦
Wow… Okay. Lots, lots of socks; too many socks. [laughter]
I don't want to see another pair of socks for a while. But yeah, it was really nice to sort of get down there and involved with everybody, and sort of going around and talking to operators, and them sort of saying which parts of the system they liked, but also parts they didn't like, and parts they felt slowed them down, versus what the old one did… And I got some really, really useful feedback on what we could then put into the system going into 2023. And we try and do – we have like two or three [unintelligible 00:39:41.06] days a year where we will all go down into the FC and we'll do some picking and packing, or looking in, just so we can get a feel for what's going on down there, and how well the systems are behaving.
[39:54] But at the peak, when itās our most busy time of the year, itās sort of like, everybody, all hands-on deck, weāll get down there, all muddle in, DJ plays some music in the warehouse, and weāve got doughnuts and stuff going around, soā¦ Itās a nice time of the year; everybody sort of gets together and muddles in and makes sure that we get all the orders out in time. I did some statistics earlier, and out of the 300,000 orders that left our UK warehouse, we processed them all within a day.
Wowā¦
So it gives you an idea of how quickly those orders need to come in and get out once we receive them.
That's a lot, like 300k a day – this is like… How many hours do you work?
It's a 24/7 operation.
24/7. Okay. So that is 12,500 per hour… That is three and a half orders per second.
That's crazy, isn't it?
Every second, 3.5 orders get ready. Can you imagine that? And that's like 24/7. That's crazy. Wow…
And we're still quite small in the e-commerce space. It's gonna be interesting to see where we are this time next year.
Six months ago, you were thinking of starting to use Kubernetes. You have some services now running; you even got to experience what the end users seeā¦ What are you thinking in terms of improvements? What is on your list?
Oh, that's a really hard one… I want to get more tests of our legacy system to run. So we had another incident where we'd essentially deployed a change, and it took production down for like six or seven minutes for our internal stuff… And it would have been caught by a smoke test. Like, outright, the system just wouldn't have booted. And we've now put a deployment pipeline in place which will run those smoke tests and ensure the application boots, and it will just run through a couple of common pages… And that was a result of that incident.
But what we really want to do is gain more confidence that when we deploy anything into production for that existing system, weāre not going to degrade performance, or take down like certain core parts of the application. What we want to probably do is come up with a reasonable time to deploy. Maybe the test harness that runs canāt take more than ten minutes to deploy to productionā¦ Because we still want to keep that agility that weāve got.
One of the real benefits that we've got working here is deployment time to production is under sort of two or three minutes. And if we have a bug, we can revert really quickly, or we can iterate on it quickly, and push out. So having a deployment pipeline that sits in the way and takes over 10 minutes to run - that's really going to affect your agility. So yeah, next year I really want the team to work on hardening our deployment pipeline, just so we can keep gaining confidence in what we're releasing… Especially as we plan to scale our team out, we're going to have many more commits going through on a daily basis.
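A smoke test of the kind Alex describes does not need to be elaborate: boot the app, hit a couple of critical pages, and fail the deploy if anything is not a 200. A minimal PHPUnit-style sketch; the base URL and page list are hypothetical.

```php
<?php

use PHPUnit\Framework\TestCase;

/**
 * Minimal deploy-time smoke test: the goal is only "the application boots and
 * its core pages respond", not full coverage. Base URL and paths are invented.
 */
final class SmokeTest extends TestCase
{
    private const BASE_URL = 'https://staging.commandport.example';

    /** The handful of pages that must never break after a deploy. */
    private const CRITICAL_PATHS = ['/login', '/pick', '/pack', '/healthz'];

    public function testCriticalPagesRespond(): void
    {
        foreach (self::CRITICAL_PATHS as $path) {
            $ch = curl_init(self::BASE_URL . $path);
            curl_setopt_array($ch, [
                CURLOPT_RETURNTRANSFER => true,
                CURLOPT_NOBODY         => true,   // status code is enough
                CURLOPT_TIMEOUT        => 10,     // keep the whole run fast
            ]);
            curl_exec($ch);
            $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
            curl_close($ch);

            $this->assertSame(200, $status, "Smoke test failed for {$path}");
        }
    }
}
```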
Now, when you say deploying, Iām wondering - do you use blue/green for your legacy app?
No. Not yet.
Because if you had two versions running at any given point in timeā¦ So the old one, the legacy one, and just basically change traffic, the way itās spread, then rollbacks would be nearly instant. I mean, the connections, obviously, they would have to maybe reconnect depending on how the app works, where theyāre persistent, whether retryā¦ And everything goes back as it was. And if itās a new one, if it doesnāt boot, so if it canāt boot in your incidents case, then it never gets promoted to live, because it never came up, and itās not healthy.
Yeah, that would be really nice if we could get that in place. I think our deployment pipeline for legacy at the moment just pushes the new changes to the twelve nodes, and does it all in one go. And then flushes the cache on the last node that you deploy to. It's very basic. Whereas the newer services do have like all the bells and whistles of blue/green, and integration and unit tests that run against it to give us that confidence.
[44:13] Would migrating the legacy app to Kubernetes be an option?
We're thinking about it. So there's only one issue that I've run into so far… So I've Dockerized the application, it runs locally, but there's one annoying thing where it can't request assets. And this is probably some gap in my knowledge of Docker – it runs all in its Docker network, and then when it tries to go out to fetch assets, it's referencing the Docker container name, where it should actually be referencing something else, which would be like outside of that Docker network… And that causes assets to fail to load. So once I fix that, we'll be able to move into production. But that's a pretty big deal-breaker at the moment.
Yeah, of course. When you say assets, do you mean static assets, like JavaScript, CSS images, things like that?
Yeah, like our PDFs, and those sorts of things.
Okay. Okay. So like the static filesā¦ Okay. Okay. Interesting. I remember ā I mean, that took us a while, because the static assetsā¦ I mean, in our case, in the Changelog app, before it went on to Kubernetes, it had volume requirements, a persistent volume requirement. And the thing which enabled us to consider, just consider scaling to more than one was decoupling the static assets from the volume from the app. If the app needs to mount a volume, it just makes things very, very difficult. So moving those to S3 made a huge, huge difference. In your case, Iām assuming itās another service that has to be running; itās trying to access another service that has the assets.
Yeah, yeah. Because we've got a bunch of stuff in S3 and requesting that, it's fine. But it's any time it needs to request something that's on that host, and then it's using the Docker container name rather than the host name. And the whole reason is just because of the way that legacy application is written; it's a configuration variable that says, "What's the name of my service that I need to reach out to?" But when you're accessing it externally into the container, you can resolve it with the container name; but when the container tries to resolve it internally to itself, it then falls over and doesn't work.
Oh, I see what you mean. Okay. Okay. And you can't make it like localhost, or something like that.
Exactly. On my local machine, it's like manager.controlport.local. But then internally, Docker would see that as DefaultPHP, which is the name of the container. But it's trying to go for the manager.controlport.local, which doesn't exist on that network. So then it just goes "I don't know what you're talking about", and that's the end of it.
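One common way out of that (not necessarily the fix they will land on) is to stop using a single hostname for both audiences: the browser keeps the public name, while server-side requests from inside the container use a name Docker can resolve, such as the compose service name or localhost. A hedged sketch of that configuration split; all hostnames and env var names here are invented.

```php
<?php
// Hypothetical settings: split "what the browser sees" from "what the app
// calls server-side", so in-network requests resolve correctly.

// Used when building links and redirects for the user's browser.
$publicBaseUrl   = getenv('APP_PUBLIC_URL')   ?: 'https://manager.controlport.local';

// Used for server-to-server asset/PDF fetches from inside the Docker network:
// a name Docker's DNS can resolve (e.g. the compose service name), or
// http://localhost when the asset host is this same container.
$internalBaseUrl = getenv('APP_INTERNAL_URL') ?: 'http://defaultphp';

// Server-side fetch goes through the internal name...
$pdf = file_get_contents($internalBaseUrl . '/documents/labels/123.pdf');

// ...while anything rendered into HTML for the browser uses the public one.
echo '<a href="' . $publicBaseUrl . '/documents/labels/123.pdf">Download label</a>';
```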
Well, as itās becoming obvious, I am like ā how shall I say? How should I say this? Iām like a magpie. Itās a shiny thing, I have to understand āWhatās the problem?ā āThe problem?ā Like, āOh, I love this. Like, tell me more about it.ā Iām basically sucked into troubleshooting your problem live as weāre recording thisā¦ [laughter] Okay, I think weāll put a pin in it for now, and change the subject. This is really fascinatingā¦ But letās go to a different place. Okay. What are the things that went well for you, and for your team, in the last six months, as youāve been improving various parts of your system?
Yeah. So I think the biggest thing thatās been really our biggest success in this year is that whole rewrite of the Pick application. The fact we went from no services ā I just sort of want to be clear as well, when I talk about services, how weāre planning to structure the application, weāre not going like true microservice, like hundreds of services under each domain part of the system. What weāre really striving to do is say - we have this specific part of domain knowledge in our system; say like Pick, for example. We also have Pack, and maybe GoodsIn. And we want to split those like three core services out into their own applications, and as we scale the team, weāve then got the ability to say, āTeam X looks after Pick. Team Y looks after Pack.ā And theyāre discrete and standalone, so we could just manage them as their own separate applications.
[48:25] Is there a Poc? I had to ask that… There's Pick, there's Pack… There has to be a Poc. [laughter] Those are such great names.
Yeah, no Poc…
Okay, there's lots and lots of POCs. Right? Lots of proof of concepts.
Yeah… We had a POC six months ago, and it's now actual real production. It's now Pick. It evolved from a POC to a Pick.
Right.
Yeah. It was really fascinating to sort of go from – we've never put a microservice out into production, and we've now somehow got this cluster that's running two microservices… And the user experience from the operator's perspective - they either go to the old legacy application that has its frontend, or the new Remix application. And regardless of which one you go to, it feels like the same user experience. And to build that in six months, and have a cohesive end-to-end experience… Yeah, it's something that we're really, really proud of, for delivering that in such a short period of time.
And also to not have that many catastrophic failures on something so big. It is really nerve-wracking, being responsible for carving out something thatās used every single day, building a new variant of it that performs significantly better, but also introduces some new ideas to actually gain operational efficiency. And then to see it like out in the wild, and people are using it, and the operation is still running, nothingās fallen on its face, apart from when we didnāt set the cache driver to be Redisā¦ But apart from that, it felt seamless. And sort of re-educating the team as well to start thinking about feature flags, and the benefits of Canary releases, and how that gives external stakeholders confidenceā¦ Yeah, thereās a lot of new tooling that came in, and Iām really happy with how the team started to adopt it.
Yeah. Not to mention SLIs and SLOs that the business cares about, and the users care aboutā¦ And you can say, āHey, look at this. Weāre good. The system is too stable; we have to break something, dang it.ā [laughs]
[laughs] Yeah. I think the next stage is to put a status page up so that our stakeholders and clients can sort of see uptime of the service, and sort of gain an understanding of what's going on behind the scenes. But we'll only really be able to do that once we've got a list of SLIs and SLOs in place that will drive those.
only if itās real-time. The most annoying thing is when you know GitHub is down, but GitHub doesnāt know itās down. Itās like, āDang itā¦ I can guarantee you that GitHub is down.ā Five minutes later, status page, of course itās down. So thatās the most annoying thing about status pages, is when theyāre not real-time. I know there will be a little bit of a delay, like secondsā¦ Even 30 seconds is okay. But I think if itās SLI and SLO-driven, thatās a lot better, because you start seeing that degradation, as it happens, with some delay; 15-30 seconds, thatās acceptable. Any more than a minute and itās masking too many things.
Yeah. So Iām completely new to all this stuff. I thought the status page was driven by those SLIs and SLOs. Is that not something that ā thatād be really cool.
It depends whichā¦ I mean, thereās obviously various services that do this; you pay for them, and itās like a service which is provided; sometimes it can be a dashboard, a status page. I mean, like a read-only thing. They are somewhat betterā¦ Itās just like, deciding what to put on it, you know? And then when you have an incident, how do you summarize that? How do you capture that? How do you communicate to people that maybe donāt need to know all the details, but they should just know thereās a problem. So itās almost like you would much rather have almost like checksā¦ You know, like when a check fails, it goes from green to red, you know thereās a problem with the thing. Itās near real-time. But you hide, like ā because to be honest, I donāt care why itās down; I just want confirmation that thereās a problem on your end, and itās not a problem on my end, or somewhere in between.
Okay… So we talked about the status page, we talked about… What else did we talk about? Things that you would like to improve.
Yup, that's right.
…and the deployment pipeline for legacy.
Ah, yes. That was the one. How could I forget that? A deployment pipeline. Okay, cool. So these seem very specific things, very ā almost like itās easy to imagine, easy to work withā¦ What about some higher-level things that you have planned for 2023? The year will be long, for sure.
Yeah. So we've sort of had a big change this year, where [unintelligible 00:53:02.09] We've got changes to Pick, and we're changing Pack next year… But we're trying to think from like an operational perspective how can we gain more efficiency out of our packers. And right now, when you've finished picking a trolley, you put it in like a drop zone, and then someone – they're called a water spider – they come in, they grab the trolley, they shimmy it off to the packing desk, and then the packer puts it into a bin, and that water spider comes back and takes the bin that's full of orders over to a dispatch desk. And what we want to do is start automating that last bit of the journey, from the pack station to dispatch and labeling. Essentially, what we'll do then is an operator will finish packing their order, they will put it onto a conveyor belt, and that conveyor belt will have a bunch of like sensors on it, which will sort of do weighing as the order is like conveyancing from the pack desk to the outbound desk. And if the order is not within like a valid tolerance that we're happy with, we will kick it back into a "problem order" bin, which will be like reweighed and relabeled. Because as I said earlier, we got rid of the weighing step, and there's a certain variance that our carriers will tolerate, and say "Yep, that's fine. It should have been like X amount of grams; we'll still process it." But if we go like too much under or too much over, we can get chargebacks from the carrier, to say "Hey, you sent us this order, and it didn't have the correct weight." So we want to start handling those in-house.
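The tolerance check at the heart of that conveyor step is easy to picture in code: compare the measured weight against the expected weight (items plus packaging) and kick anything outside the carrier's allowed variance into the problem-order lane. All numbers and names below are illustrative.

```php
<?php

/**
 * Illustrative tolerance check for the conveyor: expected weight is what the
 * system thinks the items plus packaging should weigh; tolerance is whatever
 * variance the carrier will accept before raising a chargeback.
 */
function withinCarrierTolerance(float $measuredGrams, float $expectedGrams, float $tolerancePercent = 5.0): bool
{
    if ($expectedGrams <= 0.0) {
        return false; // no reliable expectation, so route to a manual reweigh
    }

    $variance = abs($measuredGrams - $expectedGrams) / $expectedGrams * 100.0;

    return $variance <= $tolerancePercent;
}

// Example: the system expects 480g (items + box + fill), the belt measures 512g.
// That is roughly a 6.7% variance, so this order would be kicked to the problem
// bin for a reweigh and relabel instead of flowing straight to dispatch.
var_dump(withinCarrierTolerance(512.0, 480.0)); // bool(false)
```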
And what's gonna be really interesting is building the SLOs and SLIs around that. Like, how many orders are we weighing at Pack, how many are we skipping the weighing at Pack for and putting on the conveyancing system, and how many orders are we kicking out? And have like an error budget on that, and seeing like how accurate our product weights are in the system, how accurate our packaging weights are… It's gonna be really interesting to see that in operation next year.
So I think the plan is weāll probably get an independent contractor to come in and set up the conveyancing. But then we want our own bespoke software running in that pipeline that we can hook into, and start pulling data out of that. And Iām really, really excited to start working on some of those automation pieces.
Itās really interesting how youāre combining the software with the real world, right? So how everything you do ā like, you can literally go in the warehouse and see how the software is being used, what is missing, what software is missing, what can be made more efficientā¦ Because what youāve just described, itās a real-world process that can be simplified, can be made more efficient by adding a bit more software. And that belt. Very important. With the right sensors.
See, one of the really interesting parts about our company is everything, end-to-end, is bespoke; from like order ingest, to order being dispatched from the warehouse - we control everything in that pipeline.
[56:01] The only thing we depend on is buying labels from carriers. I mean, we spoke at some point about managing our own price matrices for the carriers in real time, and doing our own quoting and printing our own labels… Maybe one day we'll go in that direction, but it's a lot of work. And there's companies out there that are dedicated to doing that, so we have those as partners for now. But apart from that, pretty much everything out in the FC is completely bespoke.
You mentioned FC a couple of timesā¦ Fulfillment center.
Yes.
That's what it is. I was thinking, "What is it? What is FC?" It's not a football club… [laughter] Because it's like the World Cup is on, so FC – it's easy to associate, we're primed to associate with a football club… So it's not that. Fulfillment center, that's what it is.
Yeah. We used to say warehouses, but I think fulfillment center is more accurate to what we do.
Do you see more Kubernetes in your future, just about the same amount, or less? What do you think?
So I think purely because we're moving to a more service-oriented architecture, we're probably going to continue to depend on Kubernetes. I can't see how practical a world would be where we have to keep provisioning new EC2 instances, and setting up our deployment pipelines to have specific EC2 instances as targets, and managing all the ingress to those instances manually… It just feels a little bit messy. Having one point of entry to the cluster, and also being able to move that from AWS to, like, GCP in the future if we ever wanted to move cloud providers – I think for us it makes more sense to stay on Kubernetes.
Okay. Technology-wise, Datadog was also mentioned, so Iām feeling a lot of love for Datadog coming from you, because it just makes a lot of things simpler, even though easier to understand, even though itās not muscle memory just yetā¦ Are there other services that you quite enjoyed using recently?
Yeah. I'm shouting out to Datadog again, but it's just – it's another part of their ecosystem; they have something called RUM, Real User Monitoring. And when we actually deployed the Pick service, we were getting tons of feedback, but there was no real way to correlate the weird edge cases people were having, and we installed RUM. Basically, what it does is it records the user session end-to-end, takes screenshots and then uploads it to Datadog, and you can play that session back and watch it through, but it will also have like a timeline of all the different events that that operator clicked on through that timeline. So you can scrub through it and attach as much meta information to that trace as you'd like, just like with any other OpenTelemetry trace.
So in our example, we began to get lost, because we couldn't correlate a screen recording to some actual like picker data that was stored in S3… Whereas now, we store the picker into S3, which is like all of the raw data that the operator interacts with from an API perspective… But we also take that picker ID and attach it to the trace, along with their user ID, and along with the trolley they were picking on… So now we can just go into Datadog and say, "Hey, give me all the traces for this user, on this trolley." And if they said like they had a problem on Sunday with that trolley, we can now easily find that screen recording and watch it back. And we can also then correlate that with all of the backend traces that happened in that time period.
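The correlation trick is mostly about stamping the same identifiers on everything. A hedged sketch of the backend half, assuming the dd-trace-php tracer's active_span() helper is available; the tag names and IDs are invented, not the actual implementation.

```php
<?php
// Assumes the Datadog PHP tracer (dd-trace-php) is installed; tag names and
// IDs are invented. The idea: put the same identifiers on the APM trace that
// the RUM session and the raw S3 payload already carry, so any one of them
// can be used to find the other two.
function tagPickTrace(string $pickId, string $operatorId, string $trolleyId): void
{
    if (!function_exists('DDTrace\\active_span')) {
        return; // tracer not loaded (e.g. local dev), so silently skip
    }

    $span = \DDTrace\active_span();
    if ($span === null) {
        return;
    }

    $span->meta['pick.id']     = $pickId;     // matches the S3 object key
    $span->meta['operator.id'] = $operatorId; // matches the RUM user
    $span->meta['trolley.id']  = $trolleyId;
}

// e.g. inside the Pick API controller handling a scan:
tagPickTrace('pick-9f3a2c', 'op-12345', 'trolley-88');
```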
So we used to use Datadog and Sentry, and even though I have a lot of love for Sentry, and I think theyāre a great product, having it all under one roof and being able to tie all the traces together and get an end-to-end picture of exactly what a journey looked like - Iām really starting to enjoy that experience with Datadog.
Nice. Very nice. Okay. That makes a lot ā I mean, it makes sense. I would want to use that. If I were in that position, why wouldnāt I want that? It sounds super-helpfulā¦ And if it works for you, itās most likely to work for others. Interesting. Anything else apart from these two?
[01:00:05.20] Iām trying to think what else Iāve usedā¦ I mean, I was looking a bit at Honeycomb, and I really wanted to get it up and running for us, but they donāt yet have a PHP SDK. You have to sort of set it up with an experimental one thatās sort of community-driven. I just havenāt had the time to plug into it. I went through their interactive demos, and I really, really, really want to try it. Itās bugging me that we canāt make it work for us just yet. But no, no additional tooling. Those are the two at the top of my list.
Okay. So as we prepare to wrap up, for those that stuck with us all the way to the end, is there a key takeaway that youād like them to have from this conversation? Something that ā if they were to remember one thing, what would that thing be?
I think donāt be scared to keep moving the needle, and keep iterating on what youāve got; even if you want to try a new service in production, having the sort of foresight to say āWe can gracefully roll this out and scale out, but also gracefully roll it back if weāve got issuesā is really powerful. And from my experience, like Iāve touched on today, the more you do it, the more confidence you can get the rest of your business to have in your deployments. And that sort of leads to being able to keep iterating and deploy more frequently. And thatās what we all want to do, right? We want to just keep making change and seeing like positive effects in production.
How do you replace fear with courage? How do you keep improving?
Just keep failing. Failing and learning from it. Like, thereās no real secret formula. The first time you fail, as long as you can retrospect on that failure and take some key learnings away from itā¦ And I think the more you fail, the better. As long as youāre not failing to the point where youāre taking production completely offline and costing your business like thousands of pounds, and maybe your customers lose confidence in your productā¦ As long as youāve risk-assessed what youāre deploying and you have a backup strategy, I think thatās how you replace fear with courage - just knowing youāve got that safety blanket of being able to eject.
Yeah. Well, itās been a great pleasure, Alex, to watch you go from April, when we first talked, and we posted some diagrams, to now Decemberā¦ You successfully sailed through Black Friday, Christmas as well, a lot of orders, physical orders have been shipped, a lot of socks, by the sounds of itā¦
Yeahā¦ Everyoneās getting socks for Christmas this year.
Exactly. Yeah, apparently. Itās great to see, from afar, and for those brief moments from closer up, to understand what youāre doing, how youāre doing it, how youāre approaching problems that I think are fairly universal, right? Taking production down. Everyone is afraid of that. Different stakes, based on your company, but still, taking production down is a big deal. Learning from when things fail, trying new things outā¦ Itās okay if itās not going to work out, but at least youāve tried, youāve learned, and you know āOkay, itās not that.ā Itās maybe something else, most probably. And not accepting the status quo. Each of us have legacy. Our best idea six months ago is todayās legacy. Right? And it is what it is; itās served its purpose, and now itās time for something new. Keep moving, keep improvingā¦ Thereās always something more, something better that you can do.
I completely agree. It's been great to come back on, and I look forward to sharing the automation piece sometime next year.
Yeah. And Iām looking forward to adding some more diagrams in the show notesā¦ Because I remember your 10-year roadmap - that was a great one. Iām wondering how that has changed, if at allā¦ And what is new in your current architecture, compared to what we had. I think this is like the second wave of improvements. Six months ago we had the first wave, we could see how well that worked in productionā¦ And now we have the second wave of improvements. Very exciting.
Yeah, Iāll send those over when I have them.
Thank you, Alex. Thank you. Thatās a merry Christmas present, for sure. Merry Christmas, everyone. See you in the new year.
Merry Christmas. Cheers.
Our transcripts are open source on GitHub. Improvements are welcome. 💚