Ship It! – Episode #35
How I found my lost network packets
3 routers, 2 fibre lines & 1 networking expert later
Today Gerhard shares the entire story behind his lost packets. He is talking with Drew Marshall, director at Trunk Networks and No One Internet, a Cloud Services Provider & ISP based in Sussex, UK.
Gerhard’s Vodafone ISP gateway was losing packets, and recording some of the previous episodes used to be challenging as his internet connection would cut out for up to 10 seconds at a time, multiple times per recording session. He was convinced that his Unifi Dream Machine Pro was not the issue. Drew helped Gerhard realise that it actually was. Not only has Gerhard’s DNS latency improved by 3x, but he can now fail-over between two WAN connections. And because nothing beats a real-world experiment, you can guess what is coming in this episode 😉
You will find latency & packet loss graphs, speed test runs, and a few other interesting things in the show notes. We hope that they inspire you to set up a better home network. Most importantly, may you find your humble & brilliant Drew.
Notes & Links
- Gerhard’s issue with ISP gateway timeouts
- Vodafone UK Community: Periodic 100% packet loss
- Fair Internet Report: Trunk Networks
- nefilim/pinger - simple network latency and packet loss monitor using Grafana and Prometheus
- Grafana dashboard: Internet-QoS.json
- MikroTik 5009UG text config export
I met Drew virtually two months ago, while trying to understand why my network packets were getting lost, and Drew is the main reason why my home ISP connection is now rock-solid. And when I record these podcasts, they flow like butter. Welcome, Drew.
So why did you help me out?
Well, I mean, initially you raised a support ticket to us, and you asked some interesting questions… And like a lot of IT professionals who use networks a lot, maybe didn’t necessarily fully understand everything that was going on… And we see this quite a lot actually, people who come to us, they’ll say “This is my problem. What can I do to fix it?” And actually, we then take you right back to the beginning and go “Okay, so what makes you think this is a problem? Why are you pushing down this road?” And then we’ll spend some time looking at, well, actually, what was the problem, why was it here… And then come up with some things that we can try – which is kind of what we did with yourself, really. You said “Oh yes, I’ve got this problem with my network packets”, we did some traceroutes and looked at some other bits and pieces, and actually where we went with this one was “So, Gerhard, have you tried swapping out your home broadband router?” “Well, yes, I have, and I’ve put this one in, and although it doesn’t do what I want it to do, it actually worked better.” Okay, right… So whilst we haven’t got the right router, we’ve clearly got an issue going on here. It’s not further into the network, it’s not where your ISP hands off in terms of the server you’re trying to get to, or the servers you’re trying to consume… This is happening much closer to home. And we see this quite a lot. To be honest, I was quite pleased, because you did actually at least have the common sense to have made sure that you’d wired everything before you started shouting about it. We get a lot of complaints from customers who are very unhappy that their brand new gigabit broadband service doesn’t give them a gigabit when they’re at the bottom of the garden, in the shed…
That’s a good one…
They’ve recently lined with that nice foil reflective insulation so it’s not cold… And it’s like, well, what do you expect? So yeah, that was quite refreshing… We very much wanted to work with you to try to fix your problem. You weren’t a customer at the time; for us, that’s not necessarily all about whether we’re making pound notes from you, but actually let’s make it better. Hopefully, you liked what we did, and elect to become a customer into the future… And that you have, just to be fair. So…
Yeah. I mean, that is the thing that really surprised me. At the time, I was convinced that my issue was the ISP. My ISP was Vodafone, I live in Milton Keynes, they’re running on the CityFibre network, and I was convinced that there’s something going on in the Vodafone network, because I was getting packets lost to the gateway, and then obviously, everything from there was not the way it should have been.
To be honest, and to be fair, I’ve had various issues since April 2020, and they were changing, they were somewhat improving, but I never got to the point where I was happy with my connection. And I was thinking, “You know what - I’ve been banging this drum for about a year. Maybe it’s time to do something else.” When I was recording these podcasts, my guests - they were just cutting out for me. It was getting to the point where I couldn’t even record a podcast without my connection going down, and that was just ridiculous… Because this is fiber to the premises, one gigabit symmetric - it should be amazing. And it was for most of the time, but you know, I wasn’t streaming, where content buffers and things are okay if there’s a temporary, short interruption.
So for me it was really surprising that even though you were not my ISP, you were so helpful, and we got to the bottom of it. It took us weeks to get to the bottom, you were patient, you were there, and you weren’t my ISP. So I think kindness goes a long way, and that’s what I really liked about how we interacted, and I really wanted to tell this story.
So why are you passionate, Drew, about low-latency, high-performance networks? Because it’s in your title, right?
Yeah, it is on my LinkedIn profile, you’re right… Because for me, the network is the enabler. It’s the thing that has to happen before any of the other things that we consume can exist. Probably the best example I could bring would be maybe Netflix and Amazon Prime and those sorts of streaming services. Now, I’m old enough to go all the way back to dial-up days, and of course, those services didn’t exist. And in fact, actually you were quite lucky to get a picture. I remember – I’ll come back to this story in a bit, but it revolved around the day that, sadly, the Twin Towers came down, and the internet maxed out, and all the rest of it.
[08:18] But my passion, really, for low-latency/high-capacity networks is about ensuring that people can continue to take advantage of whatever the next thing is. Today, most of the U.K. have got fiber to the cabinet (FTTC), which is the hybrid fiber that the big networks have arguably missold as being a fiber connection. It’s not, of course; it’s half-fiber, half-copper. And when that became the common connection for most subscribers in the U.K., Netflix was born. And people said, “Well, what can we do with this?” Well, it’s to deliver films, or deliver on-demand services. And that’s great; we’ve got up to kind of compressed 4K, and higher definitions, and this sort of thing, and that’s fantastic… But along comes full fiber. So fiber all the way to the home, gigabit services. In fact, it actually goes beyond that, of course, because the expensive bit is the glass tube between your house and wherever it’s backhauled to. But once that’s in place, you just change each end and you can go to 10 gig, you can go to 100 gig, 400 gig, 800 gig, whatever the next standard will be.
So we’ve got this great upgrade path, we’ve got great bandwidth, we’ve got great capacity… I’ve got no idea what the next big thing is going to be; unfortunately, I’m not clever enough to work that one out. If I could do that, then I’d be a very rich man, of course… But what I do know is that until the network is in place to deliver it, whatever it will be will never be delivered. So we’ve always gotta take that responsibility – as an ISP, we’re gonna take the responsibility that we have to make the decision that we’re the horse, and it’s got to come before the cart. And that’s all there is to it.
So yes, I am passionate about delivering the best products, at the best capacities… I get very unhappy when I feel that as an industry we let our customers down. I don’t like the idea of selling something as fiber when it’s not really fiber. You know, if you go to the way-back machine, you’ll find that our website has never talked about the fact that it’s fiber. The product we’ve sold has been FTTC. We’ve made absolutely no bones about it, and that’s because actually I refuse to get on that bandwagon, and my business partner backed me in that decision. And that’s something that I believe very passionately about.
I also don’t believe in, and we’ve never signed up to, the current Ofcom code of conduct, which basically says that as long as you tell your consumer that their broadband won’t go quick enough at peak times, that’s alright. Well, for me that’s not alright, actually… If I’ve sold you a 100 meg connection, then I expect you to get, give or take, 100 meg, whether that’s 4 o’clock in the morning, 4 o’clock in the afternoon, or even 8 o’clock in the evening…
I’m not sure I sign up to where some of the larger businesses have gone. My daughter has recently moved into a house in Bristol, and the only available service was via a large national player. The only sensible available service was via a large national player. She phoned me up and she said “Dad, I’ve got this contract in front of me, and I’m buying a 200 meg service, and it says that it’s quite acceptable that it can do 25 meg at peak time.”
Yes, I know.
That is just ridiculous… That really gets me. I think unrealistic expectations is top, and then broken promises is next. And they interchange. So it’s one or the other.
[11:55] Give me what you tell me you’re gonna give me, because that’s what I’m paying for. Am I paying less when it’s 25 megabits? I’m not. It’s always the same price, it doesn’t matter how fast or how slow it is. You’re right.
And then I think hand in hand is when I expect to get something and I don’t get that thing, and then I don’t know what the problem is, and people aren’t straight. Is it your connection, is it like your gateway? Where is the bottleneck?
And I think you mentioned something very interesting when we were exchanging emails… Is that people - they don’t even know what they’re getting. They don’t even realize where the problems are, because they don’t know. And this isn’t just the consumers; this is also support, because they don’t know what they’re supporting.
No, absolutely. And again, as a business, we decided a long time ago that this was something – we weren’t prepared to sign that code of conduct, because we didn’t think it was right. It doesn’t seem fair or reasonable. So actually, we only provide what people pay for, and we expect to do that 24 hours a day, seven days a week. Yes, okay, there’s gonna be some contention; the internet is a contended resource. Your local gym is a contended resource. They’ve only got a dozen bikes, or rowing machines, or whatever… And if every one of their subscribers turned up on that 9 o’clock Monday morning, there wouldn’t be enough machines to go around. That’s not how the business works. You know, they just need to make sure they’ve got enough rowing machines for when their peak happens. And the same thing with the internet. But it doesn’t mean that you look around and you think “Oh, I’ll tell you what - Mrs. Miggins doesn’t mind waiting 20 minutes for a rowing machine… So what we’ll do is we’ll take one out. That way I don’t have to pay for quite so many.” It’s that sort of thing that – actually, I’d rather offer people a product at a fair price to get what they’re paying for, than to try and offer a cut-price deal and… You know, I’ve gotta make money; we’re a business to make money, and making a profit, although some people would like to suggest otherwise, is not a rude thing to do. Our profits go into paying our members of staff, and paying for the infrastructure that people use, and all the other things that go on. We pride ourselves on being a very proud local business, and a lot of our business is focused locally to us in the Sussex and South-East corner.
We like to be able to go, “Look, these are our members of staff, and this is one of our young engineers, and he’s just had a child, and he can afford this, because we pay him, because you pay us.” This is how the world goes around.
Maybe it’s a slightly old-fashioned tradition, or view, but… Yeah, we don’t think that’s unreasonable.
No, I think that works really well, because from my perspective, you’d never sold anything, and I love that approach. You were there to help, and that was it. If you’re happy, and if things work for you, we are happy. Because that’s why we’re doing this. Those are our reasons; it’s not to make money, it’s not to get rich, it’s to help people. And I love that approach.
So how many ISPs would you say that are helping customers with their routers - which is not an ISP-supplied router… [laughs] I know that you know where this is going, but I don’t think many do…
…because my existing ISP said “You know what - it’s your router. We can’t support your router.” The issue was, yes, with my router, but “can you prove that what’s on the other side works well?” And they can’t. So as a result, you don’t know whether it’s my router or whether it’s your gateway. And they couldn’t prove that.
So we were going in circles until I met you, and you said “Well, okay, these are the steps. Let’s go through them.” And when I say “we”, you have to go through them… “And then let’s figure out what the problem is.” And that’s exactly what happened.
So the problem was the router, it was not an ISP-supplied one… And neither of us knew what the problem was, but we were willing to work together, and that made all the difference.
Yeah, absolutely. And the official line in terms of supported routers is that “We can’t help you with your supplied router”, as in “We can’t help you program it. We can’t help you beyond giving you the username and password that you need to be able to log into our service.” However, what I did and what we do as a business for our customers is to go through the fairly basic steps. We swapped a couple of routers around, we tried some different combinations. Is it the cable? Is it the router itself? Is it the way that your router is connected into your network?
[16:14] If we build this thing up one baby step at a time, then we can eliminate elements. And by eliminating those elements, you can get to a point where you can either - and I think I’ve probably put this into an email, actually… You get to a point where you go “It’s definitely not my kit. Vodafone, you need to sort yourself out… Because I’ve proven - I’ve done this and I’ve done that, and I’ve gone all the way through this…” And at the end of the day, although lots of people like to think that they know lots about the internet - and quite a lot of them do; some of them know a lot less than they like to proclaim… And actually, it’s not as tricky as it sounds. If you can cover off the first three layers, then you’re pretty much there.
I do this training course with our new engineers, where we sit down and we go “Right, so the first thing is: is there a light on? Have you got a light flashing? If you have, then we can pretty much tick the box that says layer one is working.” Then we get into “Can you see the gateway? Can you see the next hop?” Then you get into “Can we route through that?” And once you’ve built up that level, the next steps beyond that are out of your control, and it’s just trying to work out, “Well, who is it you need to go and shout at in order to get what you’re after?” And ultimately, from your perspective, it’s really simple; you just want an internet connection that works…
…so that if you want to record a podcast, you can do so without every fifth word dropping, or whatever else is going on.
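As an editorial aside, the bottom-up routine Drew describes can be sketched in a few lines of Python. This is only an illustration: the layer names, the wording of the questions, and the `first_failing_layer` helper are assumptions made for this sketch, not a procedure Trunk Networks publishes.

```python
from typing import Optional

# Ordered bottom-up, following the three questions from the training course.
# The labels and question wording are illustrative assumptions.
LAYERS = [
    ("layer 1", "Is the link light on or flashing?"),
    ("layer 2", "Can you see the gateway / next hop?"),
    ("layer 3", "Can you route through that gateway?"),
]


def first_failing_layer(results: dict[str, bool]) -> Optional[str]:
    """Return the lowest layer whose check failed, or None if all passed.

    A missing result counts as a failure: a layer is not proven until tested.
    """
    for name, _question in LAYERS:
        if not results.get(name, False):
            return name
    return None


# Layer 3 "passing" is ignored here because layer 2 has not been proven yet.
print(first_failing_layer({"layer 1": True, "layer 2": False, "layer 3": True}))
```

The point of walking the list in order is the one made in the conversation: a higher layer is not worth diagnosing until everything beneath it has been ticked off.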
And this goes back to my point about why am I passionate about low-latency/high-bandwidth networks. Well, actually, it’s not so much low-latency, it’s not necessarily high-bandwidth, it’s about stuff that makes other stuff work. That’s the key element, the low-latency and the high bandwidth is just the enabler for whatever the next thing is that we want to make work. But actually, at a very basic level, I don’t want people sitting there watching wheels of doom go around in circles, and their Love Actually film stop playing because they’ve got to that particularly poignant moment… That’s not what we’re about. We want stuff to just work, and it should do.
You’re right, it is that reliability element. It is that met expectations; I know what to expect, I’m not being unreasonable, and what I expect to happen does happen. When you have those instabilities, whatever the reason may be - which is what my issue was; stopping and starting it wouldn’t have fixed it. Same as “replace the cable” wouldn’t have fixed it. So we went through the steps and we realized, “You know what - it must be this thing.” And I didn’t know, you didn’t know, but there was a process that we followed.
So I think there is a lot to be said about good processes and knowing what you should do, without reverting to diagrams and flow charts. Like, okay, I’m support; the first line… I’ll have you going through these steps. And they help, but then when some people – I mean, I think for me it was very frustrating talking to first-line support, trying to explain things to them, and they’re telling me things which are unrelated to my problem. I say “Look, that can’t be it, because of this, that and that.” It’s like, layer one, layer two. You’re telling me it’s layer seven, and I’m telling you it’s not layer seven. And I say, like, “Okay, which layer do you think it is?” It’s like, “Oh, it’s not good. How did this happen?”
The problem is that if you leap straight into layer seven and you ignore the other six layers, then you’re trying to diagnose a problem without actually understanding. Does anything else work? At one point I think I looked at some log outputs of yours that suggested that the router interface was flapping. That could have been a dodgy Ethernet cable. Something as simple as that. But until you say, “Well, actually, let’s replace that. Let’s prove layer one works”, you can’t move to layer two without knowing that the lights are still on.
[19:55] In the same way, as much as it frustrates people, we sometimes have to go through the simple things like “It has got electricity, hasn’t it?”
Because if the internet goes down at four o’clock every Friday, when the cleaner comes around… We have all heard the joke about bed six in the ward, where people seem to die every Friday at six, because the cleaner comes in, unplugs the life support machine, plugs the Hoover in, cleans the floor, plugs it back in again.
Right… [laughs] That’s a good one.
It’s a similar sort of process. You have to go through “What is going on?” Let’s do the basic ticks, let’s go through the basic stuff… Because there’s a lot stronger argument to have, particularly with large ISPs where they’ve perhaps got more customer service people, shall we say, on their level one help desks… You’ve gotta better cut through that “Have you tried rebooting it again?” “Yes I’ve done that lots.” And the problem is that they’ll get to a point where they’ll want to hang up on you, because you’ve had your 20 minutes. They’ll set you a task, and you’ll go away and do it, and phone back to tell them it’s not working, and they’ll start again at the beginning…
…and actually, that’s no good to you. What you need to do is pick up again from “I did what you asked me to do. It didn’t make a difference. What’s the next step?” Not “Let’s go back to square one again.”
Because actually, that’s just really frustrating. So it’s just following that base process and making sure that you’ve covered off all the bits and pieces. And some of them are more obvious than others. Even I kick myself for missing something that ought to be glaringly obvious, but isn’t something that the customer always realizes. Not everybody is technical; you have to keep reminding yourself. And sometimes that whole electricity question is one that people don’t even realize, or don’t even think about, because it’s so obvious. And solely because it’s so obvious, it can’t be the fault.
Yeah. I know what you mean.
In my case the problem was actually the router which I had, and I was convinced that was not the issue. That was the UniFi UDM Pro… And the way that was holding the PPPoE connection was flapping. That was it. It couldn’t hold the connection stable long enough. And there was like all sorts of issues. Now, the UDM Pro, for those that know it - it’s like an all-in-one thing. It has a drive, you plug in cameras, it’s running all sorts of applications… So it is an all-in-one solution; it’s more like an appliance. You can run UniFi Protect, you can run the UniFi Doorbell, all sorts of things. I’m not sure how much this is related to it, but what I do know is that it’s trying to do too many things, and when it does too many things, as with all things, it doesn’t do them all as well as maybe it could.
So as soon as I put the MikroTik in - a really old one; this was like a seven-year-old RB2011 - all problems went away. Everything was working. Now, because the CPU was really slow on that specific device (I think it was like 600 MHz, an older ARM one), the throughput would max out at 700 megabits. The connection can do 900, so I wasn’t getting the full – like, what I was paying for. However, my problem with latency, with packets getting lost - all of it disappeared, simply by putting a different router in. So it was a device problem, but not what I assumed was the problem. But going through that process really helped, for sure.
[24:12] One of the interesting things that we saw when we upgraded our broadband gateways about a year ago now - and our new gateways are able to actually drop and re-establish a PPP session in something in the order of 7 to 10 seconds… Now, if you’ve got a flapping PPP interface like you had, actually it’s very possible, particularly on a nice, fast, low-latency CityFibre or full-fiber connection, you can actually drop and re-establish that session really quite quickly, and it will only exhibit itself as being some packet loss. And it’s a bit loose and wooly as to what that is and where that is, and particularly if it’s doing it randomly. So it’s not every five minutes, or every 30 seconds, or whatever. If it’s doing it at random intervals, it can exhibit itself as being a random packet loss. And the problem – it’s quite interesting, you’ve sent me some forum links to Vodafone forums where people were making comments about the various different elements of network failure, and stuff… And I’m looking at some of these and I’m thinking, “That’s great”, but this is somebody who has either a little knowledge and is trying to make themselves look much better than they are, or actually has absolutely no idea… Because I’m looking at some of this stuff and they go “Look, there’s a traceroute here, and - could you see the latency at the far end?” Well, I can, but actually, at hop 5 your connection could have dropped, re-established, dropped a couple of packets, and then continued. Because the traceroute doesn’t care. The traceroute says “I will trace for 64 hops”, or whatever your OS allows you to do. So yes, it can drop two in the middle and continue and pick up, and you just think “Well, that’s a bit weird.” Equally, there are traceroutes - the BBC is quite a good one. If you do a traceroute to the BBC, you’ll find the second to last hop disappears into their own MPLS core somewhere, because the response comes from a router that has no routing back to it.
You get a line of stars. That doesn’t mean it’s broken…
Yeah, that’s right.
…it just means that the box in the middle can’t talk back to you. That doesn’t mean it doesn’t work. The BBC website still loads fine. The tools that we have, for the average member of the public, which are typically ping and traceroute or some similar variant, are just too blunt to be able to accurately pin the problem on a particular element. So much traffic these days is tunneled in some way, shape or form. It’s actually really difficult to be able to narrow down and go “Well, yes, inside that tunnel is a problem.”
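One way to make ping slightly less blunt, offered here as an editorial sketch rather than anything used in the episode, is to look at runs of consecutive lost probes instead of the overall loss percentage. A session that drops and comes back within 7 to 10 seconds would show up as one short burst, which a loss figure averaged over an hour hides. The `loss_bursts` helper and the sample data below are hypothetical.

```python
def loss_bursts(replies: list[bool], interval_s: float = 1.0) -> list[tuple[int, float]]:
    """Find runs of consecutive lost pings.

    replies[i] is True if probe i was answered. Returns (start_index, duration_s)
    for each run, where duration is the run length times the probe interval.
    """
    bursts = []
    run_start = None
    for i, ok in enumerate(replies):
        if not ok and run_start is None:
            run_start = i                      # a burst begins
        elif ok and run_start is not None:
            bursts.append((run_start, (i - run_start) * interval_s))
            run_start = None                   # the burst has ended
    if run_start is not None:                  # still losing probes at the end
        bursts.append((run_start, (len(replies) - run_start) * interval_s))
    return bursts


# Made-up sample: 60 one-second probes with a single gap of 8 lost probes.
sample = [True] * 20 + [False] * 8 + [True] * 32
print(loss_bursts(sample))  # [(20, 8.0)]
```

A single 8-second burst and eight isolated one-second losses produce the same loss percentage, but they point at very different causes.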
So what tools would you recommend to understand what the problems are better? What do you normally use?
If I’m honest, ping and traceroute are your friend. But you have to be able to interpret what’s going on, and some of that involves a lot more understanding of the infrastructure. For example, the latest forum post that you sent me just before we had our chat - there was a person on there that was saying, “Look, it’s got to be a CityFibre problem, because hops 3 and 4 - can you see the latency?” Well, actually, I can comfortably tell you that CityFibre are running a transparent layer one service over a passive optical networking kit. You won’t get a ping or a traceroute off of a layer one connection. So CityFibre’s congestion, if there is any, is invisible. The only way you can measure that is potentially to the first hop gateway, but even then, you can’t guarantee where you’re connected to.
[27:49] A lot of the protocols that we still use in networking date back to 40 years ago, back to NASA and the original ARPANET and all of that kind of really beginning infrastructure. Yeah, we’ve upgraded some of them, we’ve plugged in a load of stuff over the top, we’ve built newer methods of doing things, but actually, when you run it all the way back down to the base level, almost every network will be pinned on BGP for its gateway routing, it will have some form of IGP in terms of OSPF or IS-IS, neither of which, unlike what somebody put in that forum post, has any form of latency detection. OSPF works on the shortest number of hops. If you’ve got seven routers that are on a low-latency path, that’ll be de-preferenced over the three routers, that might be over a much higher latency path. And actually, we as network administrators have to recognize that and have to manually adjust for it.
So there isn’t anything that measures latency. Cisco invented a proprietary protocol about 15-20 years ago that did have some latency measurement stuff going on in it, but because it was proprietary, the world hated it, and invented MPLS instead. It didn’t quite, but it died when MPLS came out.
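The three-router versus seven-router example can be made concrete with a small editorial sketch. It assumes every link carries the same cost, which is the situation in which shortest-path selection reduces to "fewest hops"; real deployments assign configurable per-link costs, and tuning those is one way administrators do the manual adjustment Drew mentions. The latency figures are invented for illustration.

```python
# Hypothetical per-hop latencies in milliseconds for two paths
# between the same pair of endpoints.
paths = {
    "three-router path": [10.0, 10.0, 10.0],  # few hops, slow links
    "seven-router path": [1.0] * 7,           # many hops, fast links
}

# What an equal-cost, hop-counting selection prefers.
fewest_hops = min(paths, key=lambda name: len(paths[name]))

# What a latency-aware selection would prefer.
lowest_latency = min(paths, key=lambda name: sum(paths[name]))

print(fewest_hops)     # three-router path (3 hops, 30 ms in total)
print(lowest_latency)  # seven-router path (7 hops, 7 ms in total)
```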
So actually, what you’ve got to do is you’ve got to take the data you’re collecting from your traceroute and you’ve got to overlay what you know about your network architecture. I think it actually might have even been yourself that asked “Why am I connected to a broadband gateway in Manchester, when actually I want it to be one in London? Because I live in Milton Keynes.” Well, the way that that works is that we connect all those various sites up over ostensibly a layer 2 protocol; we tunnel in MPLS, because it gives us faster failover, and that sort of thing, for the backhaul network. But fundamentally, it’s a layer two switch network; it has to be, because when your router sends its PPP multicast packet out and says “Does anybody fancy logging me in?”, that has to be done over multicast, it has to be done over layer two, switched, it’s all MAC-addressed etc. But actually, the control mechanism we have over this - you’ll love this. The first one to respond is the one that gets the connection.
So if Manchester is ten miles closer in network terms, or is half a millisecond quicker to respond than London, you’re going to Manchester.
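The "first one to respond wins" behaviour, as Drew describes it, can be modelled as picking the gateway with the smallest reply time among those that answer at all. This is an editorial sketch: the `winning_gateway` function, the gateway names as dictionary keys, and the timings are all assumptions for illustration.

```python
from typing import Optional


def winning_gateway(response_ms: dict[str, Optional[float]]) -> Optional[str]:
    """Pick the gateway whose reply arrives first.

    response_ms maps a gateway name to its reply time in milliseconds,
    or None if that gateway is out of service and never replies.
    """
    answering = {name: ms for name, ms in response_ms.items() if ms is not None}
    if not answering:
        return None
    return min(answering, key=answering.get)


# Manchester answers half a millisecond sooner, so it takes the session.
print(winning_gateway({"London": 3.0, "Manchester": 2.5}))   # Manchester
# With Manchester out of service, the session lands in London instead.
print(winning_gateway({"London": 3.0, "Manchester": None}))  # London
```

Note that nothing in this selection looks at where the subscriber lives, which is why a Milton Keynes connection can end up terminating in Manchester.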
Now I get it, yeah.
And there’s not a lot you can do to influence that, unless they manage to change the way that your packets are routed at a much lower level than you’ve got any chance of seeing. So this is why I say that traceroute is great, but it lives at a network level that is above where your packets are really traveling. The bit you can see is happening much later than the bit that could be influencing how your connection works. And that’s the joys of dealing with a smaller ISP, of course, because you’re closer to the bloke that built it, or that designed it.
That has its advantages, for sure. I really enjoyed being able to talk to someone that knows what he’s talking… You know, when I say “Hey Drew, what’s up with this?” You can actually give me an answer, because as you mentioned, you built the thing, so you know exactly what’s underneath it. I think that’s super-rare, and I appreciate it so much, you have no idea. When I see that, I recognize it, like “Yes, that’s exactly what I want.” And I’m so grateful that ISPs and people like you exist. They’re real people I can talk to. It’s just amazing. It’s almost like you speak my language, I understand what you mean, and I can ask you questions with the confidence that I’ll get an accurate answer, including “I don’t know. Let me find out.” That’s so refreshing. That’s so refreshing.
So I noticed two days ago something very interesting happened. My gateway changed, and the way it manifested itself - this is like my Vodafone one; and we’ll get to the secondary line as well in a minute, but let’s just finish this, because it’s really interesting.
[31:52] So two days ago, my internet went down, my PPP stayed open, but no traffic was getting through. So no packets were getting routed, no traffic was getting through. I had to recycle the connection, and I connected to another endpoint, another gateway, which was closer. The reason why I say it was closer - the pings, the latency was 3 milliseconds. Before, it was 8 milliseconds. So before my packets were getting routed anywhere, just like getting to the gateway used to be 8, and now it’s 3, overnight. How would you explain that? What do you think actually happened?
Not knowing where you were connected to prior to your 8 milliseconds, I would say that one of potentially a couple of things happened. One was that the gateway that you’re now connected to with your three milliseconds wasn’t available, for whatever reason. They took it out of service to do maintenance, it might even be a new BNG they’ve recently deployed.
So I give you our network design, or some element of it, but I’ll try and draw some analogies with where you are with Vodafone. So for example in Worthing, where we’re present with CityFibre, we’ve actually got a network gateway in the exchange in Worthing. The benefit that brings is that for our gaming customers, who are an amusing breed, because they’re hell-bent on that first hop latency… Well, actually, in Worthing it’s sub-millisecond. Sometimes the whole millisecond, but not normally. Because actually, the bit of fiber they’re traveling down is the bit of PON fiber that CityFibre have got deployed, then hops through the CityFibre fiber exchange, lands on a router that we have there that does the whole backhaul handling piece, and shoot-drops straight onto the BNG in Worthing, where the session is terminated, you get given an IP address, you then come back around the corner, jump back on the router that took your incoming session, and then that routes you out to the big, wide world.
Now, if that BNG in Worthing isn’t available, or it’s busy, or whatever, we have a number of other broadband gateways around the rest of the network, in London, designed to handle that overflow, if you like. So if a customer in Worthing doesn’t land on that Worthing gateway - maybe I’ve taken it out of service to do a firmware upgrade, for example - they’ll end up terminating in London. So their first hop goes from being a millisecond or better, to being more like 4 or 5 milliseconds, because what’s gonna happen is that the router – so they’re coming through the CityFibre FEX as I’ve mentioned already, they’ll end up on the core router that handles that backhaul, it sends the multicast PPP request into the whole of that layer two network.
Now, that layer two network actually traverses about half a dozen different fiber exchanges, main data centers, and other places. Basically, anywhere I’ve got a BNG, that multicast packet will be sent. Now, as I mentioned already, whichever BNG reacts first dictates who gets the session. So if the Worthing one - which should be the first one, because it’s definitely the closest - is unavailable to react, then one of the others will take it. That means your first hop goes from being a millisecond or less to being 4-5 milliseconds, as I’ve said already… Because actually, you’ve disappeared into a tunnel. You can’t see that, it’s all layer two, but you’ve disappeared off into a tunnel that’s then presented itself to the next available broadband gateway, which has then terminated, hung some IPs on it, and you’re out into the big, wide world.
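As a rough illustration of the "whichever BNG reacts first gets the session" behaviour described above, here is a minimal Python sketch. The gateway names and response times are hypothetical, chosen only to mirror the Worthing and London figures mentioned; it models nothing more than "the fastest available responder wins" and is not a description of how any particular BNG negotiates sessions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Gateway:
    name: str
    response_ms: float      # hypothetical time to answer the discovery request
    available: bool = True  # False while out of service, e.g. for maintenance


def pick_gateway(gateways: List[Gateway]) -> Optional[Gateway]:
    """Return the available gateway that answers first, or None if none can."""
    candidates = [g for g in gateways if g.available]
    return min(candidates, key=lambda g: g.response_ms, default=None)


gateways = [
    Gateway("worthing", response_ms=1.0),
    Gateway("london-a", response_ms=5.0),
    Gateway("london-b", response_ms=5.5),
]

print(pick_gateway(gateways).name)  # worthing

# Take the closest gateway out of service, e.g. for a firmware upgrade.
gateways[0].available = False
print(pick_gateway(gateways).name)  # london-a
```

The customer does nothing differently in either case; the session simply lands on whichever gateway answered, and the first hop latency changes with it.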
The reality - if you did a ping to, let’s say, Google, and that was 8 milliseconds away, you did a traceroute to it, and if you connected to the Worthing broadband gateway, you’d go a millisecond, it would then be 5 milliseconds to the core, another millisecond to Google… There’s your 7-8 milliseconds. If you’re terminated in London, your first hop is the 5-6 milliseconds, and then a millisecond to Google. It equates to the same thing, it’s just that the first hop is a bigger number, because you’re further away, because unfortunately I can’t get your packets to go quicker than the speed of light.
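The traceroute arithmetic above can be written out as a small Python sketch. The per-hop values are illustrative picks from the ranges mentioned in the conversation (not measurements), and the hop names are made up for the example.

```python
from itertools import accumulate

# Per-hop latencies in milliseconds: illustrative values only.
VIA_WORTHING = [("bng-worthing", 1.0), ("core", 5.0), ("google", 1.0)]
VIA_LONDON = [("bng-london", 6.0), ("google", 1.0)]


def cumulative_latency(path):
    """Return (hop name, cumulative latency in ms) pairs, traceroute-style."""
    names = [name for name, _ in path]
    totals = accumulate(ms for _, ms in path)
    return list(zip(names, totals))


for label, path in (("Worthing BNG", VIA_WORTHING), ("London BNG", VIA_LONDON)):
    hops = cumulative_latency(path)
    print(label, "- first hop:", hops[0][1], "ms, end to end:", hops[-1][1], "ms")
```

With these numbers both paths end up at 7 ms to the destination; only the size of the first hop differs.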
Yeah, that’s right. That is a limit that everybody has to contend with… And even that’s pretty good. It’s like, all the extra hops that you need to make in between.
This first hop I think is really important, and the first hop will be from your router to the gateway, wherever that may be. On Vodafone, this used to be 8-9 milliseconds. So you mentioned sub-millisecond - I would love to have that. And while I don’t have that with Vodafone, what happened recently is it went down to about five milliseconds, three milliseconds… Actually, it was 3-4 milliseconds, it fluctuates, but it’s twice as good as it used to be. So that is a good improvement.
Now, Trunk Networks is my secondary ISP, and I can switch between the two. Currently, my setup is configured so that I disable one, and then the other one, even though it’s live, the routing only uses one of the gateways. So when I disable the primary, the Vodafone, I fall back to the secondary. And even then, my first hop is five milliseconds. So the quickest that my gateway is – so that’s like my starting point, whether it’s DNS after that when I go to Google’s DNS, or Cloudflare’s DNS, or OpenDNS, it will be on top of those 4-5 milliseconds in this case. And that’s still good. If it’s over 10 milliseconds, I think that’s when you start feeling it. And it’s not only that, like when it starts fluctuating. So from that perspective, 5 milliseconds as a starting point - great. It’s when it gets higher than that when it’s bad. And when you have packet loss, it’s even worse.
[40:11] So one problem which I have - and I hope that you’ll be able to explain to me why that is - is when I switch from one ISP to another one. So when I disable one of the connections, there’s like a maybe 20-30 seconds period, as if the existing connection stopped working. And TCP connections - new ones are okay, but existing ones stop working. Why is that?
That’s almost certainly going to be because you’ve got two different ISPs on your router, and you’ll be running a network address translation (NAT) to the WAN-IPs that your router has been given. Now, obviously, your router has to keep a memory of state, if you like, of those NAT addresses.
So to give you an idea, you go to the BBC’s website, your computer initiates that connection, the router makes the network address translation, so it takes your internal IP and maps it to your external one, and sends it out to the big, wide world. Now, what it has to do is remember when the BBC sends you the data back, it has to remap that back to your computer. Now, if you swap ISPs halfway through, suddenly your WAN-IP address has changed. Now, that NAT table, so the return path is almost certainly the one that is causing you the problem, because it’s saying “I’ve got a return path and I’m expecting back in from interface A”, and actually, that’s not happening at the moment. So it gets itself in a little bit of a mess, which is why new connections work fine, because it’s writing new address table entries.
If you actually cleared down the NAT table as part of your switchover, you’d probably find that actually it will be pretty much seamless.
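To make the stale NAT entry problem concrete, here is a toy Python model of a router with two WAN interfaces. It is a sketch under stated assumptions: the class, the interface names and the IP addresses (all from documentation ranges) are hypothetical, and real routers track far more state than this.

```python
class NatRouter:
    """Toy NAT: each flow is pinned to the WAN interface it first used."""

    def __init__(self, wan_ips):
        self.wan_ips = dict(wan_ips)            # interface name -> WAN IP
        self.active = next(iter(self.wan_ips))  # first interface starts active
        self.table = {}                         # flow -> pinned interface

    def send(self, lan_ip, lan_port, remote):
        """Return the source WAN IP used for this flow, or None if it is stuck."""
        flow = (lan_ip, lan_port, remote)
        iface = self.table.setdefault(flow, self.active)
        if iface != self.active:
            return None  # stale entry: still mapped to the disabled interface
        return self.wan_ips[iface]

    def fail_over(self, iface, flush=False):
        """Switch the active WAN interface, optionally clearing the NAT table."""
        self.active = iface
        if flush:
            self.table.clear()


router = NatRouter({"wan1": "192.0.2.10", "wan2": "198.51.100.20"})
old_flow = ("192.168.1.50", 50000, "203.0.113.5:443")
new_flow = ("192.168.1.50", 50001, "203.0.113.5:443")

print(router.send(*old_flow))   # 192.0.2.10
router.fail_over("wan2")
print(router.send(*old_flow))   # None: existing connection is stuck
print(router.send(*new_flow))   # 198.51.100.20: new connections work
router.fail_over("wan2", flush=True)
print(router.send(*old_flow))   # 198.51.100.20: fresh mapping after the flush
```

In this model, flushing the table lets the old flow pick up a new mapping straight away instead of waiting for the stale entry to expire. On a real network the remote server still sees a different source IP, so an existing TCP session has to be re-established; the flush just removes the wait.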
Interesting. Okay, right. So it is that NAT table I need to look into clearing when I do a switchover. It’s not enough just to disable one and basically re-enable the other one, because they’re both connected… So that actually does make sense. Okay.
For example, the way that we do high availability - we do this a lot in the corporate world with leased lines… So we’ll put a fiber, an Ethernet circuit in, and we’ll probably look to put a broadband failover in - because actually, in the great scheme of things, they’re pretty cheap, and actually are good enough in a DR scenario that you can continue to function. So what we’ll do is we’ll give you an IP address that routes off of both circuits, preference one over the other; in the event of a failure, that route gets torn out, but at no point does your NAT translation ever change.
So your IP remains the same at all times, and all that happens is you failover between them. That’s not possible unfortunately in the home broadband world, particularly when you’re using two ISPs, because obviously you’ve got a Vodafone IP on one port and you’ve got a Trunk Networks IP on another. And I can’t announce a Vodafone IP, and Vodafone can’t announce mine. So that’s one of those compromises that you have to make.
You talked about in a previous podcast a seven-minute delay when you destroyed your website and it rebuilt itself - well, this is kind of one of those similar network things where it’s like –
Yeah, got it.
Yeah. There’s a 30-second failover, but it’s – it failed over, you know? And if the internet, as it obviously is, is very important to your life, then actually it failed over, it recovered all by itself… If you were, I don’t know, maybe you weren’t recording a podcast, but you were watching a Netflix film, for example - it would failover and you almost certainly wouldn’t notice. It’d do it seamlessly.
So when you said that the NAT tables, the routing tables have outdated information when it comes to how to negotiate the packets, where to send the packets, was it on the MikroTik, or was it on the UniFi?
[44:01] If I’m honest, I don’t know. My guess would be probably the UniFi. But because you’ve got two lots of tables going on, it would be difficult to be able to tell. Now, having said that, let’s think about that… No, it probably isn’t, actually; I’m lying. I reckon it’d be on the MikroTik, because that’s the one that’s handling the actual final endpoint. The NAT translation between your UniFi and your MikroTik won’t change.
That’s right, yes.
So actually, that should be absolutely fine. It’s only the one IP address element which is the one that needs to actually make the change. You’ve not changed any other piece of infrastructure, so you should be fine there.
Okay, so this is something that I definitely need to look into, because I would like to make it as seamless as possible. Small improvements, right? This is like the one improvement - have two connections, so if one fails, I can still record, you can still hear me. I can’t hear you, but… All I have to do is refresh. So I’m really glad that we were able to test that. Because I’m also recording locally, we caught all of that on my end.
Obviously, with Riverside the way it works - it will just catch up. It buffers locally in the browser, and then it just re-uploads anything that it didn’t manage to upload.
That was fun.
That was fun. [laughs] That was really amazing. I really enjoyed that.
Okay. So the other question which I have is around the latency. What I have noticed – so Vodafone is 900 megabits and it’s symmetric. The Trunk Networks - it is a failover sort of a setup, and it’s 100 meg download, 20 meg upload.
So when we started this recording, I switched to Trunk Networks, and what I noticed is that the latency on the Trunk gateway went up. So it used to be like a flat 5-6 milliseconds. When we started recording, the spikes were 100 milliseconds, 100+ milliseconds… Why do you think that’s happening? How would you explain that?
It’s because you’re actually saturating your uploads. Now, what I’ve noticed since you’ve moved to the Vodafone circuit is actually the background behind you no longer goes fuzzy and drifts in and out.
So that was telling me that your upstream was struggling with the amounts of bandwidth that it was demanding. So when BT, when Openreach decided that they were going to invent ADSL (asymmetric digital subscriber line), it’s bigger down than it is up. It was designed always for downloading stuff.
So with particularly TCP - because of UDP’s fire and forget - TCP requires acknowledgment packets to be sent back, it requires that negotiation that goes on between your device and my server, or whatever… So the idea was that you had just enough bandwidth to be able to send back enough ACK packets to hit your maximum download threshold. That theory hasn’t changed. So give or take, at 100 meg (90, or whatever it is), your 20 meg upload is just enough to send back that many ACK packets that you hit that upper threshold. It was never really designed for uploading per se. It’s only more modern enterprises such as CityFibre who have said “Well, actually, you can have symmetric, because why not?” And that to me is a much better solution. Quite clearly, when you’re recording this podcast, you are doing more than 20 meg, or pretty close to 20 meg. So if I’m honest, what you’re doing is you’re saturating your uplink; that’s automatically going to mean that packets get dropped, and as your packets start to get dropped, your latency will increase, because part of what you’re measuring is how long it takes for a packet to get there and back. Well, if it never makes it there, it’s always gonna increase your latency figure.
Okay, that makes sense.
That’s what you’re seeing.
That’s what I need, more upload.
More upload, yeah. I’m afraid so.
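The effect of a saturated uplink can be sketched with a very simple queue model in Python. Everything here is an assumption made for illustration: a single first-in-first-out buffer of 2 megabits in front of a 20 Mbit/s uplink, sampled once per second. It is not a measurement of the actual line.

```python
def queue_delay_ms(offered_mbps, capacity_mbps, seconds, buffer_mbit=2.0):
    """Queueing delay (ms) at the end of each second for a FIFO uplink buffer.

    Traffic above the link capacity piles up in the buffer; anything that does
    not fit in the buffer is dropped.
    """
    backlog = 0.0  # megabits currently waiting to be sent
    delays = []
    for _ in range(seconds):
        backlog = max(0.0, backlog + offered_mbps - capacity_mbps)
        backlog = min(backlog, buffer_mbit)  # overflow is dropped
        delays.append(backlog * 1000 / capacity_mbps)
    return delays


print(queue_delay_ms(18, 20, 4))  # [0.0, 0.0, 0.0, 0.0]
print(queue_delay_ms(21, 20, 4))  # [50.0, 100.0, 100.0, 100.0]
```

In this model, going just 1 Mbit/s over the uplink capacity is enough to fill the buffer within a couple of seconds: every packet then waits behind a full queue, and the excess is dropped. Under the assumed buffer size that gives the same order of latency as the 100 millisecond spikes described above.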
Okay. So in your experience – I mean, we’ve been talking about my connection quite a bit. But I’m wondering, in your experience, what does a good home connection look like? And I think about remote workers - what should they care about? How would they go about setting it up? What would you recommend? And first of all, what does it look like?
[48:14] Well, it’s an interesting question. Assuming money to be no object - and the home broadband is one of those really interesting areas that people don’t always consider. The classic - and I’ll go back to the gaming industry, because the gaming industry has led a lot of PC development, and hardware development, and so on. So people spend thousands of pounds on really, really nice gaming PCs, and then 21.99 a month on the broadband.
[laughs] Right. Okay… That to me is like the really expensive TV, but the really cheap cable. And you’re wondering, “Why does this look bad?” It’s the cable. [laughs]
Exactly. You buy a really nice 4k or 8k UltraHD TV, and you plug your aerial into it, and you wonder why it is that it doesn’t look very good. Well, you have to make sure that your network is aligned with the hardware you’re running it on. That’s just how it is.
So in terms of a home internet connection - and these days, home working, and home internet, and all the rest of it are now in one big melting pot. So my first question would be “So how important is the internet to you?” If the answer is “Well, actually, I run my business on it. I run my life on it. Actually, I want this to be as near as is reasonably possible 100% up”, then I would say you’ve gotta be really looking at two connections, like you have, ideally from two different providers, because that covers off “What happens if the network goes bad? What happens if something bad happens?” And actually, we’ve covered you off – so not only are you with Vodafone and Trunk Networks, but actually… Vodafone is with CityFibre, so that is a different physical cable in the streets, that’s to a different physical aggregation point, that’s plugged into a different physical bit of network, in Vodafone. And then we’ve given you an Openreach circuit, which is why it’s asymmetric, not symmetrical like CityFibre would be… But that’s then on a separate piece of fiber, taking a separate route, in separate ducts, different bits of road, that will go back to the BT exchange, and then over a different backhaul, onto us as a different network… So if any one of us, and if any one of those components goes bad on you - give or take, 30 seconds, you’ll fail over - your life will continue, and depending on what you’re doing, you may not even notice. That’s kind of where I believe people should be. We can fail over transit providers. We can have a major BNG failure - he says, touching wood, hoping that doesn’t happen - but… We can suffer those things; we’ve got more than one of them. Customers simply end up disconnecting and reconnecting somewhere else. And although that may not be completely seamless, hopefully within 30 seconds to a minute your router has done its thing, our network’s done its thing, you’re back connected, we’re all online, everybody’s happy again.
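One way to picture where a "give or take, 30 seconds" failover could come from is a router that probes its primary link at a fixed interval and only switches after several failures in a row. The Python sketch below is purely hypothetical: the 10-second interval and the threshold of three are assumed values, not the settings of any router discussed here.

```python
def detection_time_s(probe_results, interval_s, threshold):
    """Seconds until `threshold` consecutive probes have failed, or None.

    probe_results holds one boolean per probe (True = link answered),
    with probes sent every `interval_s` seconds.
    """
    streak = 0
    for count, ok in enumerate(probe_results, start=1):
        streak = 0 if ok else streak + 1
        if streak == threshold:
            return count * interval_s
    return None


# Link goes down and stays down: failover after three missed probes.
print(detection_time_s([False, False, False], interval_s=10, threshold=3))  # 30

# A single successful probe in between resets the count and delays failover.
print(detection_time_s([False, True, False, False, False], 10, 3))  # 50
```

Requiring several consecutive failures avoids flapping between links on a single lost probe, at the cost of a slower switchover.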
That’s how we design our network, that’s how Vodafone designs its network… And actually, if your home network is that important to you, it’s kind of how you should be thinking about your home network as well. It’s worth spending a few pounds on it. You have to take a view – you know, if you replace your PC every 3-5 years probably, probably closer to 3 if you use it extensively… Your network kit probably lasts you 5-10 years.
Yeah, that’s right.
So actually, it’s got over double the lifecycle of your PC. And if you’re happy spending 1,500 pounds on a new PC every three years, then spending 500 pounds on a router that outlasts your PC by twice its life actually isn’t a bad investment. And then - yeah, putting the right connections on the end of it to give you as much resilience as you can makes sense.
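The cost comparison above works out as follows in a few lines of Python. The prices are the ones quoted; the six-year router lifetime is an assumption, taken as twice the three-year PC lifetime.

```python
def cost_per_year(price_gbp, lifetime_years):
    """Purchase price spread evenly over the years the device is in use."""
    return price_gbp / lifetime_years


pc = cost_per_year(1500, 3)     # PC replaced every three years
router = cost_per_year(500, 6)  # router assumed to last twice as long

print(f"PC: £{pc:.2f} per year, router: £{router:.2f} per year")
```

On those assumptions the router costs roughly a sixth of what the PC does per year of use.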
[52:01] Now, if it’s not that important, or you’re not in the lucky position of having Openreach and CityFibre on your street – you know, we can still look at, for example, a 4G failover; plug a SIM card or a dongle into your router, buy one that has that functionality, and then you can have 4G or 5G as a failover option. It’s just about how important is that to you. And these days, I would argue, for home working - very.
Absolutely. Yeah, like two of each, except your wife. That’s what I’m thinking. [laughter]
Two routers? Get two routers, that’s okay. And by the way, they shouldn’t maybe be from the same provider. And they will be a different age, and it’s okay. Like, when you get a new one, you don’t have to wait ten years. In five years you get another one. You refresh the oldest one, and then you have one older, one newer.
So yeah, the same thing for phones, or even laptops, or workstations, whatever you have. So when I recycle, I keep the old one around just in case I need it. And then you can gift it to someone, or whatever the case may be. Reuse. That is a big deal. Okay.
And actually, that reuse thing is a really interesting point… One of the things that we recognized – and we supply what I would deem to be decent routers for our broadband connections… We don’t lock them down to us. Yes, they are set with our ACS details, so they auto-provision themselves… But you can create a new profile and set your own username and password up and all the rest of it; they don’t have to talk to our ACS. But most importantly, because we supply decent routers, we want them to last five years. What we don’t want to be doing is filling some smoky landfill site in China with yet another bit of hardware because you’ve changed ISP.
We’re actually encouraging – now, if you go buy your own router, we’ll give you a credit; we’ll actually give you a router credit for using your own connection as if we bought you a router. But that’s one less bit of plastic in some landfill site.
So we want people to have more than one router, we want people to recycle or reuse their router wherever we can… But you talk about the 5-year and 10-year - that’s great; we want to encourage that, because actually that’s the right thing to do for the planet, the right thing to do for you as well… It fits all ways round and it just makes sense, really.
Everyone wins. I love those scenarios where everyone wins.
Just keep looking for those and keep doing that. So who is Trunk Networks to me? Well, Trunk Networks to me is a U.K. ISP, a small one, that has an amazing team. Everyone that I interacted with via support tickets, like you, Drew… You just give so much. I can feel the passion, I can feel the commitment, and I feel in safe hands. Who is Trunk Networks to you?
Well, Trunk Networks to me I really hope is what you’ve experienced. We have got some fabulous members of staff, who are very passionate, who do work hard to offer the best service that we can for our customers. We’re a business that’s not frightened of saying “We don’t know”, but we’ll go away and we’ll damn well find out for you.
We’re passionate about delivering what we’re selling to you. As I mentioned previously, we want to make a profit, because actually that’s what makes the world go round…
That’s what keeps you in business. You won’t be around if you don’t make a profit, so it’s in my interest for you to make a profit, so that you’re around.
Yeah. And every morning I wake up and I think, you know, every one of our members of staff, the work that I put in helps to pay their mortgage, and that drives me to continue to push forward with what we’re doing. And I know that Darren, my business partner, is absolutely no different. He’s every bit as focused on delivering a great product, with happy customers, for people who want to actually turn around at the end of the day and go “You know, that was great. Yes, I had a problem, but these guys worked as hard as they could to fix it.” If there’s not a lot I can do to stop it…
[55:55] We’ll figure it out. It’s okay.
…you know, we’ll work damn hard to fix it for you, and we’ll do that as quickly as we can, based on the fact that typically the stuff that breaks is outside of our control. If the fiber to your house breaks, it’s not one of our engineers that will come and fix it, unfortunately. But we’ll get on the case with our third-party provider and shout at them until you are fixed. That’s what we do.
Okay. So as we prepare to wrap up, what would you say is the most important takeaway for our listeners, from our conversation?
I think the most important thing that I really would like to hammer home is that if the internet is important to you, the cheapest isn’t necessarily going to be the best option. It’s an option, of course it is. There are large ISPs that make a very decent return on selling “pile it high, sell it cheap” connections. But with that comes the potential for things to go wrong. If you want people who care, people who can work with you to fix your problem… We want to be part of the solution, we don’t want to be part of the problem. And we want to make sure that we’re part of your team. We’re on team customer, believe it or not. That’s kind of our mantra and our thought process.
If off the back of our conversation today one person turns around and says, “You know, actually, I kind of see where they’re coming from”, then it’s been a success.
Well, that’s exactly what it has been for me. It started as a conversation, random questions, “Hey, can you help?” Yes, you could. “This is what we can do. This is what we offer. This is how we operate.” We went through that process together, and I never felt like you were selling me anything, and I loved that. I was the one wanting to buy, because I realized how amazing that was, and how rare that was. And no sponsorship, nothing like that, because - why? I mean, it doesn’t work like that. You have to basically put your money where your mouth is, or where your mind is. I certainly did that, and it was great, and I wanted to share this story with all the details, with how basically I went through this process, and what I ended up with… Which by the way, I’m really happy about. We can still improve it, and I know that we will improve it.
My interest is in sharing this with you all that are listening to this, and see if it works for you. I don’t know what your ISP choice is, or where you are in the world, but this is how I solved my problem, and hopefully it will inspire you to solve yours.
Thank you, Drew. This has been amazing. I’ve been looking forward to this for a long time, and I’m looking forward to what we do next year. Thank you.
Yes. Thank you very much for your time – and you’re a customer, of course, but… It has been an absolute pleasure, and it was really great to be able to fix your problem. And most importantly of all, you’re wearing a nice, big smile, which for me says that we got it right, so… Thank you.
Thank you, Drew.
Our transcripts are open source on GitHub. Improvements are welcome. 💚