In today's episode we have the pleasure of hosting Audun Fauchald Strand, Principal Software Engineer at NAV.no, Norway's Labour & Welfare Administration. We will be talking about NAIS.io, the application platform that runs on-prem as well as on the public cloud.
Imagine hundreds of developers shipping on an average day 300 changes into a system which processes $100,000,000 worth of transactions on a quiet week. If you think this is hard, consider the context: a government institution which must comply with all laws & regulations.
Featuring
Sponsors
Sourcegraph - Transform your code into a queryable database to create customizable visual dashboards in seconds. Sourcegraph recently launched Code Insights - now you can track what really matters to you and your team in your codebase. See how other teams are using this awesome feature at about.sourcegraph.com/code-insights
Raygun - Never miss another mission-critical issue again - Raygun Alerting is now available for Crash Reporting and Real User Monitoring, to make sure you are quickly notified of the errors, crashes, and front-end performance issues that matter most to you and your business. Set thresholds for your alert based on an increase in error count, a spike in load time, or new issues introduced in the latest deployment. Start your free 14-day trial at Raygun.com
Notes & Links
- NAIS.io - Application platform and DevEx toolbox for teams digitalizing NAV.no
- docs.nais.io - references, step-by-step guides & some good YAML
- NAV.no deployment stats, 2009 - 2022
- NAV Teknisk retning (technical direction)
- Being NAIS at a distance - How we work in a hybrid world
- Do we need an internal technology platform? - The case for platforms at NAV
- Changing Service Mesh - How we swapped Istio with Linkerd with hardly any downtime
- NAIS @ GitHub
- How to Optimize for Fast Flow Using Alignment and Autonomy
- Do you know what's cool? Keeping your #kubernetes clusters secure.
Chapters
Chapter Number | Chapter Start Time | Chapter Title | Chapter Duration |
1 | 00:00 | Welcome | 01:03 |
2 | 01:03 | Sponsor: Sourcegraph | 03:04 |
3 | 04:07 | Intro | 04:20 |
4 | 08:27 | NAV vs NAIS | 01:43 |
5 | 10:10 | What's good about having NAIS? | 04:03 |
6 | 14:13 | Why K8s? | 02:04 |
7 | 16:17 | Kollide, Unleash, how close are you? | 01:33 |
8 | 17:50 | Favorite NAIS tool | 02:38 |
9 | 20:29 | Kyverno | 03:19 |
10 | 23:48 | Sponsor: Raygun | 01:58 |
11 | 25:46 | How do you secure the data? | 01:17 |
12 | 27:02 | Cilium and eBPF? | 03:48 |
13 | 30:51 | How long has Audun been in tech? | 03:08 |
14 | 33:59 | Challenges from Covid | 06:33 |
15 | 40:31 | Success story | 02:26 |
16 | 42:58 | How big is NAIS? | 02:53 |
17 | 45:51 | GCP in practice | 01:12 |
18 | 47:03 | Missing from GCP | 02:08 |
19 | 49:11 | Migrating the older services | 02:22 |
20 | 51:33 | A good day for Audun | 02:23 |
21 | 53:56 | The SLSA model | 01:09 |
22 | 55:05 | Anything bigger? | 01:16 |
23 | 56:21 | Inspiration for NAV | 02:00 |
24 | 58:21 | Go listen to this talk | 01:16 |
25 | 59:37 | Wrap up | 01:31 |
26 | 1:01:08 | Outro | 00:46 |
Transcript
I've heard about NAIS.io, an application infrastructure service built on Kubernetes, from Vincent Ambo, our guest from episode 37. This application platform was built specifically to increase the rate of shipping code from a few times a week to hundreds of times each day. The surprising part is that this application platform is running Norway's welfare payments. So we are talking about many billions of dollars worth of transactions every year. It's huge. One of the masterminds behind it is joining us today to talk about it. Audun, welcome to Ship It.
Hi, thank you. Thanks for having me.
So who is the other mastermind behind this application platform? I know there's two of you.
Well, actually, the application platform is - there's a team, I think. So I think there's two of us working as principal engineers at NAV, but the application platform is built by a team. I was a part of that team when I started, and then I worked there for two or three years full-time, and now I'm more like everywhere in the company, because there's so many things to do, and the application platform works so well, we need to fix all the other stuff.
But there's like you, and this other person... I've even seen your talks that talk about that; you're mostly like the public figures when it comes to this, right? And the ones that have a lot to do with it. So who is your partner in crime?
Oh yeah, his name is Truls Jørgensen. We had this big change of strategy in our company, NAV, five years ago, where we went from fully outsourced to trying to in-source and hire our own developers. And Truls was the first one, and I was the sixth or seventh one, I think... And now we are 300 or 400, it depends a bit on what you count as a developer. But if you say product developers, I think we're up to 400 people.
That is a lot of people to be working on a lot of applications and a lot of code, and manage a lot of complexity. So before we started recording, you described the two of you as Waldorf and Statler, the two angry old men from the Muppet Show that constantly complain about everything. What is the most recent thing that you complained about?
Well, today we had a discussion about GDPR [unintelligible 00:06:47.17] although it's really important, sometimes it feels like there's so much - the technology says something, and the lawyer says another thing, and the trick is to kind of balance the two out, and that's always difficult, because a lot of the time one of the sides kind of says, "Well, everything has to be like I say", but you have to balance the value of using cloud technology with the risk to privacy.
Yeah. And I'm assuming that running everything in-house is not an option.
Well, we used to have that when we started. The first iteration of NAIS was basically running on-prem. We have a strategy where we go quite slowly, both going from our old legacy systems to NAIS, the Kubernetes platform, but at the same time also going quite slowly from on-prem to cloud. We don't want to do lift and shift, we want to modernize our applications and to get the full value, both of using Kubernetes, but also using the cloud, because we don't see - well, there's not that much value to gain from just moving old stuff to new infrastructure. You need to modernize the applications and make better applications. We always say, none of the users of NAV care about our application platform. We're not here to make better application platforms, we're here to make better services. And better services come from better applications. And then the platform can of course help with that, but that's not why NAV is NAV, to make application platforms... Although that would have been quite cool, actually... [laughs]
[00:08:25.25] Hm... So what is the difference between NAV and NAIS.io?
Well, NAV is the biggest governmental agency in Norway. As you mentioned, we pay out about a third of the federal budget in Norway. We have everything from age-related pensions, to parental benefits, to sickness benefits, and we also have the responsibility of helping people get back into work and kind of have the whole working system working as well as possible. So that's NAV. And we used to be many different organizations, and then we had a big merger in 2006, where the politicians thought that if they'd just put all these different organizations into one organization, then everybody will start to cooperate, and data will flow between the different systems, and everything. It turns out that wasn't as easy as just putting them into the same organization; there's still monolithic software causing problems. Three monoliths aren't necessarily better than one monolith. So we still have that.
And so that's NAV... And NAIS is basically our open source platform that we started building in 2017, which was kind of a kickstart of the whole in-sourcing process, where we thought that we should - because when we go from no, or almost no, developers and we want to hire a lot, we needed to make it visible and clear that being a developer in our company is good. It's possible to work with good technology, and have development speed, and have all the good things, so we kind of... We used it as a branding exercise as well as a technology platform to help us get developers in.
So five years - well, 2016; six years almost now, it will be since NAIS has been around... What are the benefits that you've seen in this time of having NAIS?
I think there's multiple. Maybe the clearest one is what you said - before, we used to deploy at nighttime, and have manual testing periods, and so deployments were maybe something that happened... Well, in 2005 or 2006 it was four times a year, and then it grew very slowly. But it was always coordinated, always the big releases. But what NAIS did was make it possible for the teams to handle this whole process themselves, and they didn't have to do - there was no technical reason from the platform or infrastructure side that made it necessary for them to coordinate with anyone when they wanted to deploy. Of course, there might be dependencies between applications, but that's a different thing to fix. So now we're up to 1,500 releases a week, and we've been quite steady on that for a few years, I think. We have about 1,300-1,400 applications, so on average, there's one deployment per application per week. But we have some data showing that most of the deploys come from a smaller part of the applications. There's a few applications that change a lot, and some that hardly change at all. I think that's a logical consequence of doing microservices, because some is more support, and some [unintelligible 00:11:37.07]
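As a rough check of the "one deployment per application per week" figure, the division can be written out using only the numbers quoted in the answer above. This is a minimal sketch; the function name is made up for illustration and is not part of any NAV tooling.

```python
def deploys_per_app_per_week(releases_per_week: int, applications: int) -> float:
    """Average number of weekly deployments per application."""
    return releases_per_week / applications


# Figures quoted in the conversation: about 1,500 releases a week
# across roughly 1,300 to 1,400 applications.
low = deploys_per_app_per_week(1500, 1400)
high = deploys_per_app_per_week(1500, 1300)
print(f"{low:.2f} to {high:.2f} deployments per application per week")
```

The average hides the skew mentioned in the same answer: a few applications account for most of the deployments, and some hardly change at all.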
But the other side which has taken me a bit by surprise is the fact that we have this platform, which has very - you can say it has quite tight entry conditions... It's only Docker containers, but we say you have to be stateless so we can deploy, because Kubernetes can move your application.
[00:12:00.25] And we do log collection this way, and metrics this way, and alerts this way. It's quite unifying. We tried to do what Spotify called kind of making a golden path, to make it easy to do it the right way. And that works almost too well. Almost all the teams do almost everything the same way. So although we say there's no real guardrails on programming language, for instance, or stuff like that, people copy from each other and learn from each other, so it's quite unified how we do development. And we have a limited number of external services that are available to NAIS; we have Postgres and Kafka. And that means that Postgres and Kafka are basically the two most important architectural [unintelligible 00:12:47.12] That drives the technology development in a quite clear direction, I would say.
So we have a quite consistent architecture, and I thought that almost the opposite would happen, because you can do whatever programming language you want. The organization is so big, and there's so many different problems to solve, I thought the diversity would almost be bigger. But it turns out it's quite unified, our architecture. And I think that's a good thing, although I'm always a bit scared of what that means. I don't want us to relax either, I want us to be able to see when there's new, interesting stuff happening that we need to use.
Yeah. Okay. So going back to how many deploys you do, there is data.nav.no (and I'll put a link in the show notes) that shows how these deploys have changed over the years. And I think it starts in 2009. So there's a lot of data, 13 years' worth of data, to see how many services you had, how often you deploy... That is so insightful. I was surprised that this data is public, by the way. This is amazing, for anyone to see just, you know, how big this platform is in terms of applications, in terms of deploys... That was really interesting. So I have to ask this - why Kubernetes? What drove you to Kubernetes?
Well, I think there's two answers to this. When we look back on it, I think we want to use open source technology; we want to have - although it's not important for us to use many clouds at the same time. We think that that probably costs more than it gives. But to use open source APIs as the main boundary between the application and the platform makes it easy to move and makes a better distinction between the application and the platform.
When I started to make application platforms - I think it's 2014 - Mesos was the thing. So we used Mesos and Marathon. I can't even remember all the things we used, but it was kind of a completely different platform. And then we had the problems of - they weren't really cooperating well. There was just a bunch of open source projects, and we had to spend a lot of time just updating everything and figuring out how to use them. And then, at that time, I think 0.8 or something of Kubernetes was released, and someone in our team knew someone from Google and said, "Well, this is good." So we looked at it, and it did well all the things that the Mesos universe did badly. Everything was one big package. You just had to figure out how this worked, and then it solved everything. So after that, it feels like Kubernetes basically won that space. And then all the cloud vendors came running, or offered that as a service as well.
[00:15:56.28] So I think the main - there doesn't seem to be that many alternatives that are... As open source, you could go all the way to some kind of serverless thing, and then be more cloud-dependent, but I'm not sure I see that as a good move, at least not for organizations of our size.
Yeah, yeah. That's right. Okay. So I'm looking at NAIS.io and I see a lot of great components there. Grafana, InfluxDB, Linkerd... A few that I do not recognize. There's Kollide, OSquery, Unleash... How close are you to those components?
Well, they're different things. Unleash is a feature toggle system. It was created initially in the company I worked for before NAV, a company called FINN, which is basically the Norwegian eBay, and it's now a big open source project. It's one of the two big players among feature toggle systems. So a lot of our teams need feature toggles to be able to have the deployment speed.
And then Kollide and OSquery are part of a feature we developed quite late in NAIS, where it's more about controlling the laptops we use and how the laptops connect to our clusters. So we call that NAIS device, and Kollide is basically a hosted service that glues together OSquery, which is an agent that runs on the laptop and checks if everything is up to date and the laptop is sound, and handles the management of that, and how to communicate with users, telling them "Well, you need to update macOS", or "You need to do a Chrome upgrade." And then we built some gateways ourselves, kind of just the last bit, so we can control exactly how our laptops access our production environments.
Hmm, interesting. So which of these components do you use most often? Because there's quite a few, and it's not an exhaustive list. Is there something that you use on a daily basis? I'm assuming Kollide and OSquery, because that must be running on your laptop... But what else is there that is more hands-on, that you're much more aware of? Because I think Kollide and OSquery - you install them, they provide connectivity, and they just kind of get out of the way.
I would say my favorite tool of all the tools of NAIS is probably Grafana.
Really? Okay...
I used to be a backend developer, making applications, and just the sheer joy, and all the interesting things you get out of looking into what happens in production, and making graphs of everything, and trying to figure out why stuff is happening, and what does this mean when this goes up and this goes down.
Whenever I - at least when I was an application developer, whenever I didn't know what to do, I could always find something to measure and try to get more insight into what's actually happening in our application. Interestingly, we kind of - as a platform at least, or as a company, we moved a bit away, or we extended how we do that. Now we're also more into getting the data out of the databases as well, trying to think of that data as a product, and not just do the real-time monitoring, but also try to do more aggregated monitoring, or reports even. It's kind of a sister platform of NAIS, our data platform called Nada, which we tried to do that with... So it helps the teams to be even more conscious of all the data they have, and what they can learn from looking at the data.
Did you say Nada?
Yeah.
Like nothing? That means nothing. Okay... That's an interesting name. [laughs]
The reason for that name - at least my version of the history of the reason for that name - is that we didn't want the platform to own the data. Because traditionally, the data warehouse is a central team, and the data warehouse team owns both the platform and the data. But we wanted to do the same thing as with NAIS, because NAIS doesn't own the applications running on it, and we wanted a data platform that's a platform; and the teams should own the data. The application teams should own the data. So basically, the Nada name is - also, it's NAV Data, of course... But we wanted it to be clear that the platform is a platform, and the teams own the data.
[00:20:25.02] Okay. That's a good one. So I know that you run other services, other components, as you call them, which are not listed on the NAIS.io website. There is a tweet which I noticed three hours ago, very recent... "Do you know what's cool? Keeping your Kubernetes cluster secure. At NAV.no we use Kyverno to ensure no pod runs unchecked. And the question is, what is your best tip securing Kubernetes clusters? We want to hear." I'll put a link to the tweet. I mean, when this comes out, if you want to answer, it will be a few weeks later, but still, it will be around... What do you think about Kyverno, and how do you think about securing things? ...because this must be a very important topic, considering the data and the transactions, and what is happening in your applications.
I would say, answering that question from more of a top-down perspective, first, I think the main thing with security... When you're making an application platform and you want to help the teams secure their applications, it's really important to understand the needs of the developers, to make sure that any security feature you add is usable. Because in my experience, there's been loads of security people that are so into security that they make this principle that is almost impossible to adhere to.
So at some time in the process there will be something where the developer has to choose "Should I follow the principle, or should I deliver on time?" Most of the time, they will deliver on time. So you have to make the security things easy to use. In my experience, it's more important that it's usable than that it's 100% secure. Because if you make all the principles and all the things that are needed to make it 100% secure, and the team doesn't follow that because it's impossible, then you have [unintelligible 00:22:13.19] when the people responsible for security think everything is okay, and the people in the team don't want to tell the security team what they haven't done.
For instance, we use a service mesh. Before, we used to have these network [unintelligible 00:22:30.10] in our on-prem architecture, where we had two [unintelligible 00:22:33.02] one for the internal applications and one for the external applications. And basically, perimeter safety. So if you came inside the firewall, you had access to everything. But instead, we used the service mesh and zero trust principles to basically put a small firewall around every application, and make it the team's responsibility to configure this firewall.
So the teams - and it's a part of the configuration of the application, what applications can talk to you and what applications do you need to talk to. So instead of a central firewall, and some kind of person in the middle that always has too much to do, you make it a team's responsibility to configure this, and then everything works better.
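The "small firewall around every application" idea can be sketched as a default-deny check over per-application allow lists. This is an illustration only: the class, function, and application names below are hypothetical, and the sketch does not reproduce the NAIS configuration format or how the service mesh enforces it.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    """Allow lists declared by the team that owns the application."""
    inbound: set[str] = field(default_factory=set)   # apps allowed to call this app
    outbound: set[str] = field(default_factory=set)  # apps this app may call


def is_call_allowed(policies: dict[str, AccessPolicy], caller: str, callee: str) -> bool:
    """Default deny: a call passes only if both sides have declared it."""
    caller_policy = policies.get(caller)
    callee_policy = policies.get(callee)
    if caller_policy is None or callee_policy is None:
        return False
    return callee in caller_policy.outbound and caller in callee_policy.inbound


# Hypothetical applications, for illustration only.
policies = {
    "benefit-frontend": AccessPolicy(outbound={"benefit-api"}),
    "benefit-api": AccessPolicy(inbound={"benefit-frontend"}, outbound={"payment-api"}),
    "payment-api": AccessPolicy(inbound={"benefit-api"}),
}

print(is_call_allowed(policies, "benefit-frontend", "benefit-api"))
print(is_call_allowed(policies, "benefit-frontend", "payment-api"))
```

Requiring both the caller and the callee to declare the connection keeps the decision with the two owning teams, rather than with a central firewall administrator.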
What about the data? How do you secure the stateful data that is persisted at rest, PostgreSQL for example, or anything else that you use for persistence? Flat files maybe, you have those as well... I don't know.
Well, there isn't one way we're doing it, because we're so big, and we have a gazillion different requirements. But for Postgres, we use - NAIS now runs on GCP. So we basically use the managed GCP service for Postgres. And we considered the bring-your-own-key architecture, but right now, it feels like that increases the risk of us losing the key, and thus losing the data, more than the risk of actually losing the data. So although that's something we consistently rethink, right now we mostly use the normal features of GCP, and then have some extra backup things, because we want to make sure we have everything running inside Norway as well, or have the data accessible inside Norway.
Right. Okay, that makes sense. Do you make use of anything like eBPF to secure or at least have visibility all the way down into system calls, not just like network traffic?
I know the plan is for us to - I'm not entirely sure right now, but our plan is to go to Cilium as a service mesh, from Linkerd. We were first on Istio, and then Istio felt like it was a bit too much for our needs, and then we went to Linkerd, and now we're looking at Cilium, which is eBPF. But that's how we approached that problem.
That's interesting. It's great that using a platform that promotes open source, and has a very rich ecosystem - it allows you without too much investment to be able to go from one provider to another, from one solution to another, which by the way, there are open source versions of software you can try out, there are paid-for versions... So it's nice that you can switch between these things. How did that work out for you in practice? How did it work out for you going from Istio to Linkerd? Was that, would you say, a seamless migration or transition, or were there complications that you couldn't foresee?
[00:28:15.26] First of all, I just wanted to say we blogged about that; there's a blog post which I presume you can put in the show notes after... I wasn't the main part of that process, but as far as I can tell, it wasn't that difficult. It took some time, because you had to change something in all the applications... But we have a really good dev environment that we can do these things in. And as far as I remember, it was something you did in approximately one day, moving all the several hundred applications from one to another... But yeah.
That's impressive.
And I think, more or less going all the way back to the question you had about why Kubernetes - one of the main reasons Kubernetes is so good is you have all this... You have this API which is incredibly well thought through. And they made this in 2015 or something, and it still kind of makes sense as an API, even though they changed a lot of the things behind it, and they made it extensible. But you have this API that's so good, and it matches so well with what an application and an infrastructure does. So it makes it possible to create tools like a service mesh, and it makes it possible to change implementations even of the service mesh, or the implementation of the Docker runtime, or container runtime or whatever, with almost no disruptions to the actual users of the platform.
I remember in the old days most of these things were almost impossible, and it took weeks and months to plan and do.
Yeah, that's right. Yeah, I remember the pain. It was like hard. Like, you wouldn't even think. Like "No, no, too expensive. Let's just not do that." And that's how a lot of the great ideas would end up, because the implementation was just not worth it.
And now you can even buy it hosted. So most of the stuff you don't even have to think about, like updating, or changing nodes, or increasing the capacity... It's not even clicking a button. [laughs]
Yeah, that's amazing. I love that part, too. So if anyone is curious to learn more about this platform, there's some great content on docs.nais.io. There are references with diagrams, there are step-by-step guides... There's even 300 lines of YAML for the NAIS application example. There is a lot of Kubernetes YAML, best practices, and other content worth reading. I enjoyed digging into the Deploy section. I was really surprised, there's like so much good stuff there. So have a look if you're curious. We'll add a link in the show notes, but it's docs.nais.io.
So before we change subjects, there's something that I wanted to ask you since the beginning... I know that you've been in tech for quite a few decades. So how long has it been that you've been in tech? Do you remember when you started?
I left university in 2003, I think... So I've basically been working - I started as a consultant, and then I realized consultancy isn't what I want to do; I want to be part of the company that owns the product. So - well, it's almost 20 years... 20, yeah.
Okay. So in a few sentences, as a very brief summary, what were the last ten years in tech? What were your last ten years in tech? There was the Kubernetes part of it, but what else happened that brought you to where you are today, a principal engineer at NAV?
Well, I used to be a Java developer. I had really identified as a Java developer... And a bit by chance, I got the role as a lead developer for the infrastructure and operations team at one company, and then I realized I could use all the experience I had as a frustrated backend developer to make application platforms. And basically, I've been doing a lot of that since then, just figuring out, doing all the things I learned, or I couldn't do easily before, trying to make that possible.
[00:32:07.23] And then for the last few years, it's been more and more about making everything fit together, not just the application platform, but making the management understand what's important, and why making software is completely different from doing other things that the Norwegian government doesn't finance, for instance.
Yeah, yeah, that's right. So if you were to write an application today, would you still pick Java?
No, I would probably do Kotlin. I don't program as much as I want to anymore, but mostly I program in Kotlin and Golang. And I think programming in Kotlin is more fun, and I might think that programming in Golang is a bit frustrating, but it feels like it will last longer, and be stable for longer...
I see.
So if it's my decision, I would probably still go for Kotlin, because that's more fun, and you can feel more clever when you write Kotlin than when you write Golang.
Okay, okay, that's a good one. And if you were to choose where to run this application, what would your choice be?
At NAIS? Well, I mostly write things for NAV, and then the question is similar, but I've always thought - I mostly worked at big companies, with hundreds of developers, and I have this lingering thing in my head where maybe all the things I think are good for those companies might not be good for small companies. So I'd probably try to figure out some other ways of doing more serverless, or more higher-level abstractions from some of the cloud vendors, for instance, just to make sure... I have this suspicion that at some point in the future that's going to be even easier, even for the big organizations, but I'm not necessarily sure we're there yet. But it's difficult to say.
Okay, so the last few years have been really challenging for governments around the world, especially welfare systems around the world... And obviously, we're talking about COVID, about the pandemic... It's been really, really tough. So what challenges did COVID bring for your platform, for NAIS?
Well, the very first challenge was - I think this was the 10th or 11th of March in 2020.
That's very specific... Okay, this is gonna be good. [laughs] Alright...
Because that's when basically the prime minister of Norway said, "Well, everybody has to stay at home", unless you have a really good reason not to; basically, unless you're a fireman or work at the hospital, or something, you have to work from home. Before that, most people went to the office on most days, and all the tech and all the infrastructure was basically built around that. So luckily, we had just enough - we had the necessary things to be able to start working from home, but it was kind of a challenge and we had to relearn how to communicate and how to work as a team, basically. And I think that was interesting. It worked quite well when everybody was at home, and I think it's an even more interesting challenge now when some people want to stay home, and some people want to go to the office, because it's much more difficult to solve this challenge when the teams are more hybrid.
But that wasn't the most difficult thing that happened during the pandemic... Because of this order from the Prime Minister, we had a lot of - and I think the English word here is "furloughed." We had a lot of workers in Norway - not at NAV, but in Norway, furloughed. And according to the rules of Norway, then you're supposed to get the benefit. I think normally there's around 1,000 of those applications a day, and now we have like several hundred thousand furloughed people in a week or so.
[00:35:57.13] We were still early in our transformation, and so most of those applications would normally be handled manually by caseworkers. So our estimate was this is going to take a year, for the current systems to handle all of these applications before everybody gets their money. And people needed money.
So the government in Norway tried to make some alternative ways of handling this. So they had a list of 12 different things, I think, and me, and the team, and Truls, and others started working on one of them, where we wanted to kind of have a... At the same time, we wanted to make the laws describing this benefit as basically an advance of the normal unemployment benefit. So we had to make the law, and then we had to make a system that implemented that law. And we had to do it really quickly.
I still remember we had stand-up meetings at eight o'clock in the morning, four o'clock in the afternoon, and 10 o'clock at night. Every day for two weeks or something, where we tried to figure out what we could put in the law, because that was limited by what is marked within the law, and what we could implement. So we had to balance kind of what's possible to implement in a week or so, and what's necessary to put in the law to reduce the risk of people misusing this opportunity.
That's amazing. And this is a country we're talking about. This is not like a big tech company. This is like - you're dealing with the benefits of a whole country, right? Like, that's like your responsibility. Wow, that is big.
So the law was ready on a Thursday, I think, and then we managed to build the system in basically three days. And I'm really proud to say we built that system using pair programming, and we had the user testing late Sunday night... And then we went live on - I can't remember if it was the Monday or the Tuesday. And then we had a gradual rollout, using Unleash, actually, so we could make sure that the system kind of worked well. We had the increased pressure, and then in a week we had paid out one billion NOK. A week from when the law was ready to when we had a billion NOK paid out.
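A gradual rollout of the kind mentioned here is commonly done by hashing each user into a stable bucket and raising a percentage over time. The following is a minimal sketch of that general technique, not Unleash's API or NAV's code; the function names, the feature name, and the user IDs are all made up for illustration.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in the range 0..99 for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, feature: str, rollout_percent: int) -> bool:
    """True if the user falls inside the current rollout percentage."""
    return rollout_bucket(user_id, feature) < rollout_percent


# Hypothetical users and feature name, for illustration only.
users = [f"user-{n}" for n in range(1000)]
for percent in (0, 10, 50, 100):
    enabled = sum(is_enabled(u, "advance-payment", percent) for u in users)
    print(f"{percent:3d}% rollout -> {enabled} of {len(users)} users enabled")
```

Because the bucket depends only on the feature and the user, someone who gets the feature at 10% keeps it at 50%, which makes it possible to watch the system under gradually increasing load.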
Okay. That was a good system. What did you write it in, by any chance? Was it Kotlin?
This was Kotlin.
Really?
Kotlin running on NAIS and a Postgres database.
Wow...
But we actually - it was a lot of reuse. We had some strange components; we had the calculator that people used to figure out what they could get in a benefit if they needed to. But that calculator kind of had the functionality we needed to get the data, the data to calculate what people should get in these new benefits.
So basically, you used the calculator as an API. So we kind of grabbed things from everywhere, and the payment system had this old file-based interface that we usedā¦ So some of the integrations was like totally modern, with Kafka and asynchronous, and another one was writing files to disk, and a Bash script moving that file to another deskā¦
Thatās crazy.
ā¦and then the payment system picking up the file and putting it into making payments. So it was everything. We took whatever we could find, basically.
Lots of respect to the people that built that, because as cobbled together as it was, it worked, right? And we’re talking billions like with a big B. 1 billion NOK I’m pretty sure is at least as much as like $1 billion.
No, I think it’s a 10th. Yeah, I think $1 is 10 NOK. But still, it’s loads of money in Norway.
Right. So just 100 million, right? Like, in a week; just $100 million. That’s okay. Like, some Bash scripts, and some Kotlin, and some Postgres, some Kafka… That’s just amazing. And it all worked. Okay. And how long have you been using that system for?
Well, of course, when we built it, we said, “Well, this isn’t going to last long”, and I think we turned it off a few months ago.
[00:40:09.24] Okay. It served its purpose. It served its purpose. Wow. Okay. Sometimes it’s just like you have to make it work, and that’s all the time that you have. So it’s not like “We’ll ship it next week.” It’s not an option. Especially if like the Prime Minister says, “Okay, a week from now those payments will start going out.” You have to deliver. Wow, that’s amazing. Do you imagine that being a success story, if you didn’t have the platform that you had at the time? Can you imagine like making it work without it?
Not in that timeframe, and maybe not as secure, because we could probably make something like that work quickly, but then we’d have to build even more stuff. And in that timeframe, the less you have to build, the better, because you’re bound to make mistakes and cut corners and everything when you have to do things that quickly… So the more things you could use that are hardened, and work, the better. So I think the security part is probably what we got from using the platform.
Hm… How many people were involved in this project in that one week?
I think we were maybe 20 people. I think we had around 10 developers, and lawyers, and everything.
Wow. Okay. That’s amazing.
And we had - this was one of the things I had to make… I think Norway had a 12-point plan, or something, and I’ve implemented a few of them, and then other parts of the government implemented the rest of them.
So today how many developers are working on the platform and using the platform? I think you mentioned 400, roughly?
Well, yeah, we have 400 in-source developers, and then we have a few hundred consultants as well. So I think we’re up to 600-700 product developers at NAV. I think we have 800 seats on GitHub.
Wow… That’s a big org. Okay. And how are they structured? How many teams do you have? Or do you even have teams?
Yeah, we have teams. We have - well, it kinda depends. I think we have about 100 actual teams doing product development, and then we have some management structures, and everything; that’s kind of our teams, but not that kind of teams. And as we got big, we tried to organize even more, so we have what we call product areas, where we kind of divide NAV into - some teams work with the work stuff, and some people work with the health stuff, and some people work with the family benefits. So we need to split NAV up into smaller parts for it to be comprehensible.
Okay, it makes sense. I mean, there’s like a lot of people obviously organizing that, and being aware of what everyone does, and not duplicating efforts, like “I did it this way, and you did it that way. Okay, we have to reconcile” and all that good stuff. I’m wondering just how big is this platform in terms of resources? I’m thinking CPUs, memory… Things like that.
Well, at least our production cluster running in GCP has 50 nodes; it has a total of almost 800 virtual CPUs and 1.6 terabytes of memory.
Wow. That is a big cluster. Okay…
And our architecture is one big cluster instead of multiple small ones. It’s kind of a religious question, I think.
How are you finding that configuration, having one big cluster versus a couple of smaller ones?
Well, of course, we do divide it; so we have namespaces for each team, and stuff. The question is probably “Do you want to have more separation?” But I find that it’s easier to manage one cluster, although lately, we’ve been working more on making it possible to make more clusters, because we’re experimenting with providing NAIS clusters to other governmental agencies as well in Norway. And to be able to do that, we have to make the process of making new clusters more robust and more automated, because we want different other companies to have their own clusters, and other setups.
[00:44:09.17] One thing which I remember when we were using Kubernetes - again, the scale was very different, but upgrades sometimes wouldn’t go as smoothly. And then what do you do? What do you do if you have a single cluster that you do an in-place upgrade, that doesn’t go out too smoothly? You know, some component doesn’t interact well with other components… What do you do then? Did you have any such problems in the past?
We had more problems - or maybe not problems… It was more work when it was on-prem. But this, I think, is one of the good things of the managed service. Google does everything for us. So either we decide when to do it manually, which is probably for major upgrades; and for minor upgrades, it’s just a maintenance window, and it kind of happens.
One of the reasons it’s important for us to modernize the applications before we migrate to Kubernetes is these kinds of operations become easier as well… Because if the application is robust enough to handle a node dying, because it’s moved to another node, then upgrading the cluster is also much easier.
Okay, I see. That makes sense, especially if the applications are stateless, and you can run more than one instance, then, we have reduced capacity for a while. But then if you have NAIS - you’re basically draining a node, the application knows to spin up extra instances somewhere else, and that’s okay. It’s like minimal disruption. It’s no different to scale-up, in a way.
No different to a deployment, or whatever. So I think that’s one of the - again, the value for this, for our sake, is better applications. That’s the core value of doing all of this.
So you’ve been using GCP for a few years now… What was it like in practice to use them?
Well, I think before we went too far into the cloud journey, we kind of had a rather small - we checked the different cloud vendors, at least three big ones. We realized Alibaba isn’t for us right now, so it’s basically…
[laughs] That’s a good one.
…so it’s basically S3, AWS or GCP. We looked a bit at the offerings, and we focused mostly on hosted Kubernetes, because we knew that was the big thing we had… And especially at the time - I think this was 2019 - the difference in quality was quite big. I think they’re closer now, and I probably think it would be a more difficult comparison now. But at that time, it felt like Google had by far the best hosted Kubernetes… Which kind of makes sense, because they’re the most, they’re the biggest - they’re the fathers of Kubernetes.
Yeah, I know what you mean. Do you feel like there’s something missing in GCP? Something that you would want to have?
Well, we are quite conservative in what we use. So I’m not really sure we… As I said, we want to focus on using open source components, or at least APIs of open source components. There seems to be a trend where the cloud vendor will say, “Well, this database is Postgres-compatible, but we won’t tell you what’s behind…” And that’s kind of okay. But as long as we want to use open source APIs and open source components, the number of services we use is quite small.
So I’m not really sure… We could probably get - Kafka, for instance, we’re buying from a different vendor; we’re running it on GCP, but we’re buying it from a company called Aiven, which is a Finnish company hosting open source databases. So yeah. That’s really a problem, but we’re quite conservative in the technologies we use, so I’m not really sure I can answer what I need, other than more open source… Well, Elastic, and Kafka, and everything. But Aiven gives us that.
[00:48:07.29] I think that’s a good strategy, right? The boring technology is what you would want to have, considering the stability that you require, right? You don’t want to be on the cutting edge, you don’t want to be trying things out; you want to go with proven, tested, reliable software, that is open source preferably, so that if you want to or if you need to make a change, you can contribute that… And something that you can trust that will be around for the next 10-20 years, ideally, at least.
Yeah. Because we’re no startup, and basically, we’re not in a competitive marketplace. We’re part of the nation. So we have systems, not running on NAIS, but mainframe systems that are 40 years old. I’m not necessarily sure that the code we write now will run for 40 years, but the problem we’re solving is going to need to be solved for many, many decades to come. So it’s better to spend some more time doing it properly now, than trying to redo everything every fifth year because we hurried when we started.
That’s right. So are there any migration plans for the older services that have been around for decades?
Yeah. But then again, we’re basically rewriting everything… Well, that’s not true; for some of the systems, we’re rewriting them. For some of them, we’re looking into more different migration strategies. I didn’t know this was possible before, but you can take COBOL code and translate it into Java. Really strange… I looked at the Java, it looks really strange; it looks like COBOL, but it is Java. And then you can run it on normal servers. And then you can reduce the cost of the infrastructure quite a lot, because mainframes are really expensive, and Linux servers aren’t. But of course, there’s risks involved, because we have systems that have to work, and you’re making them run on the new technology. But our main strategy is to basically recreate the products that run on the old systems, on new architecture, and build them again with teams… Basically we try to frame the problems within an organization that can live as long as the problems need to be solved. Because I think the biggest thing is to have the teams knowing the domain, not having the systems being able to solve it, because of the timeframe we’re working in.
And what would a COBOL job ad even look like these days? Where would you find those people? [laughs] That’d be really, really hard. Okay…
As far as I understand, there’s loads of important stuff running on COBOL in the world. And a lot of the people who wrote them and know COBOL are getting old, and ready for retirement. So at some point, I presume it’s going to be very lucrative to learn COBOL, because not everybody has the opportunities we have to modernize, so…
Yeah, that’s true. That’s a good point. Okay. COBOL on Kubernetes. That is a startup idea right there… [laughs]
Well, when we started introducing Kubernetes, I think I had the argument, at least five different times, of how Kubernetes is basically exactly like the mainframe. There are obvious similarities, but there are also clear differences.
Okay. What does a good day for Audun look like?
Yeah, that’s a really good question. [laughs] I had a really good summer holiday… The most fun thing in the world is to code. But then again, whenever I’m coding, I realize, at least most of the time, there’s bigger problems that need to be solved to make it fun to code. I spend a lot of my time trying to fix the big problems, and then hoping at some point we can code again.
[00:52:06.18] But of course, it’s also important to code, so I try to - or me and Truls and a few other people, we try to code a bit every week. And then the important thing is to find the things you can make that are important. That’s valuable, but not important, because sometimes we haven’t got the time to deliver… We can’t promise when anything will be finished, but it’s fun to make things that people like.
So trying to find kind of the small things… Right now we are working on trying to take the application configuration in NAIS, the NAIS YAML file, which basically says “What applications do you need access to, and what applications have access to you, and what Kafka topics do you need to write the [unintelligible 00:52:50.29] to?” And take this information out of the cluster and make a visualization of all the applications and who talks to who. And that’s fun, and I think it’s going to be useful, but no one’s asked for it, so no one can tell us we’re late.
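The visualization idea described above amounts to turning per-application access declarations into a directed graph. Below is a minimal Python sketch under stated assumptions: the simplified dictionary shape (`name`, `inbound`, `outbound`) and the application names are hypothetical stand-ins, not the real NAIS manifest schema, and the output is plain Graphviz DOT text rather than whatever the team actually builds.

```python
def build_edges(apps):
    """Collect (caller, callee) pairs from each app's declared access rules.

    Each app is a dict with a "name" and optional "inbound" / "outbound"
    lists of application names (a hypothetical, simplified shape).
    """
    edges = set()
    for app in apps:
        name = app["name"]
        for target in app.get("outbound", []):
            edges.add((name, target))   # this app calls target
        for source in app.get("inbound", []):
            edges.add((source, name))   # source is allowed to call this app
    return sorted(edges)


def to_dot(edges):
    """Render the edges as Graphviz DOT text, one arrow per line."""
    lines = ["digraph apps {"]
    lines += [f'  "{src}" -> "{dst}";' for src, dst in edges]
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical applications, for illustration only.
    apps = [
        {"name": "benefit-api", "outbound": ["calculator"], "inbound": ["frontend"]},
        {"name": "calculator", "inbound": ["benefit-api"]},
        {"name": "frontend", "outbound": ["benefit-api"]},
    ]
    print(to_dot(build_edges(apps)))
```

Using a set means a connection declared on both sides (outbound on the caller, inbound on the callee) shows up as a single arrow, which keeps the resulting picture readable.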
Well, as you know, a lot of the time it’s the ideas or the things that no one asks for that prove to be the game-changing ones. No one needs this until like “How did we live without it? Like, everyone needs that.”
Yeah… And the other thing that’s part of a really good day is when we manage to get all the other disciplines of NAV to understand - or that we learn something that’s important, from the lawyers, or the management, or whatever, and they also understand a bit more about what’s important to do, what’s the important frameworks to have in place to do modern software development? That’s not necessarily the same as running other parts of the government, because the soft part of software makes everything a bit different.
Yeah, for sure. So talking about frameworks, I know that you mentioned security a few times… I’ve seen a blog post about SLSA… Where do you stand on the whole supply chain security, the SLSA model, things like that?
Well, I think at least for us it was an important next step. You’re kind of building blocks from kind of the basic stuff, and then you go further up, and you realize there’s always more problems to solve. When we open-source, and when we trust the teams as much as we do, it’s important to make the systems that can basically prove that the trust we’ve given them was okay; that we can say, “Well, we can see that this happened from that team”, and we know that this is okay.
For instance, when Log4Shell came, and although we’ve managed to get a handle on it, it was obvious that we could have responded even quicker by saying, “Well, what applications are affected by this?” and to automate that. This kind of feels like the next big thing, or the next thing, at least. One of the next things; there’s always multiple things.
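The question “What applications are affected by this?” can be automated once there is an inventory of which dependency versions each application ships. The following is a minimal Python sketch, assuming such an inventory already exists as a plain dictionary; the application names, the inventory shape, and the helper names are hypothetical. The `2.15.0` threshold is used purely as an example value, and the real affected ranges should always be taken from the official advisory.

```python
def parse_version(text):
    """Turn a dotted numeric version such as "2.14.1" into a comparable tuple.

    Assumption: versions are purely numeric; suffixes like "-beta9" are not handled.
    """
    return tuple(int(part) for part in text.split("."))


def affected_apps(inventory, package, first_fixed):
    """Return the apps that ship `package` in a version older than `first_fixed`.

    `inventory` maps an application name to a {package: version} dict.
    """
    threshold = parse_version(first_fixed)
    hits = []
    for app, deps in inventory.items():
        version = deps.get(package)
        if version is not None and parse_version(version) < threshold:
            hits.append(app)
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    inventory = {
        "benefit-api": {"log4j-core": "2.14.1"},
        "calculator": {"log4j-core": "2.17.0"},
        "frontend": {"some-other-lib": "1.0.0"},
    }
    print(affected_apps(inventory, "log4j-core", "2.15.0"))
```

The lookup itself is trivial; the hard part in practice is producing a trustworthy inventory for every deployed artifact, which is the kind of build provenance that supply chain frameworks such as SLSA are concerned with.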
So is there something more significant than this that you’re working on in the context of NAV? Something that is important to you?
Well, the one thing I mentioned - trying to see if we can make NAIS a platform for more than NAV… Because I think we are one of the biggest organizations in Norway, and we have 25 people working on platforms; some of the smaller governmental agencies maybe have 10 developers. And there is no reason to believe that they have the capacity to make as good a platform, or think through the security aspects as well as we do. So if we can manage to make that possible for them, and help them as well, I think that’s really good for Norway.
I know the UK had something similar with gov.uk. They had this platform-as-a-service, I think they had almost 30 different organizations running on this central platform…
[00:56:01.20] Yes, that’s right. Alpha Gov, I remember that. Yeah, I haven’t checked it recently to see where they are at now, but I remember that. That was a very interesting model. I know that the US government was doing something similar, and that was a reference at the time. It was many years ago - five, six, maybe more. Okay. Was that by any chance an inspiration for NAV?
Well, it’s something - one thing we really learned from gov.uk was the open sourcing. I remember reading their principles on open sourcing from gov.uk, and basically… Well, we started to translate, and then realized we could just link to it, and say, “We agree totally with this.”
Yeah.
And basically, because of that, we open-source almost all the code we write; not just the application platform, but everything we write at NAV, almost everything is open source, except for fraud detection, and some experiments with the laws that aren’t finished yet… And of course, some security aspects, like passwords, and everything. Most of the code we write now is open source.
Do you find that other people contribute to that, or comment? What is the interaction with that open source code from the public?
Most of the interaction and most of the use of the open source platform is kind of obscure libraries. We have one small Kubernetes operator that talks to [unintelligible 00:57:28.03] which is used by multiple companies. And then we have a Kafka testing library that someone used… But it turns out that there isn’t much of a market for open source unemployment benefit systems, for instance… [laughs]
Right. I see what you mean. Okay. So not much competition there… Okay.
No. It’s more about openness than about people. And we think that people should - we implement the laws that are public, so the code should be open and public as well.
Yeah, that’s right. Do you find it helps when it comes to hiring, when it comes to recruiting?
Yeah, absolutely. It feels like a valid proposition that software developers really like that we say, “Well, we code open.”
You’re attracting a certain type of developer that I think it’s very good to have. Okay, okay. Are there any talks that you or someone else from your team gave recently that you would like us to link in the show notes?
Well, Truls and I were at QCon in London in May, talking about NAIS and how we do technical governance, basically… That’s probably the best one for an international audience.
Is it public, the talk?
I think so. [unintelligible 00:58:45.04] If it isn’t public, I think it’s going to be public at some point, but I’m not entirely sure when.
Okay, I’ll check it out. I know that you have a very good blog, the NAIS.io blog. There’s a post on SLSA, there’s a few others… I think you mentioned about service meshes, there’s a post there, too… I really like it. I mean, there’s not too many posts there, so it doesn’t feel overwhelming… But what is there is very compressed, it’s very good… “The learnings from this” or “This is what we’re thinking about that.” And there’s not a lot, but it’s very valuable. I’ve found it just like browsing through it.
The newest post is about Elm as a frontend platform, the frontend framework. So the application platform concept is kind of a bit stretched now…
Okay. So as we are preparing to wrap this conversation up, is there a takeaway that you’d like our listeners to have from today?
For big organizations, I think an application platform is really valuable. And I think the main thing to think about when you make application platforms is to treat the internal developers of your company as users, and basically make an application platform the same way as you make an application. Do experiments, and think of the product, and try to figure out how can you solve the problems of your users, and then solve them.
Yeah. And write good docs. The docs on NAIS.io - they’re really, really good. There’s so much good stuff there. I really like that. Okay…
I haven’t written any of it, because I’m not a good writer… I think I wrote one chapter somewhere in there, but most of it is written by other people. But I agree, it’s really good.
Alright, any shout-outs that you want to give to anyone from NAIS, from NAV, people that you work with that are doing amazing work and you want to give a shout-out to them?
No, the shout-out could maybe go to our /navikt GitHub profile, where you see all the other open source code, not just NAIS. I think that’s a good place to start.
Okay. Excellent. Alright, Audun. Well, I had a lot of fun today. Thank you very much for sharing so many amazing things with us, and I’m looking forward to next time. Thank you.
Thank you.