Continuous integration and continuous delivery are both terms we have heard, but what do they really mean? What does CI/CD look like when done well? What are some pitfalls we might want to avoid? In this episode Jérôme and Marko, authors of the book “CI/CD with Docker and Kubernetes” join us to share their thoughts.
Hello, everybody. Welcome to Go Time. Today we are joined by Marko Anastasov. Marko, do you wanna say hi?
Hello, everyone. Thanks for having me.
And we’re also joined by Jérôme Petazzoni. Jérôme, do you wanna say hi to everybody?
Marko is the co-founder of Semaphore, which is a continuous integration/continuous deployment service, and Jérôme was part of the team that created Docker. He plays a dozen musical instruments, and you also teach containers and Kubernetes, is that correct?
Okay. And then we’re also joined by Kris Brandow, our other host. Kris, do you wanna say hi?
Alright, so if it wasn’t clear by the guests, today we’re gonna be talking about continuous integration and continuous deployment… So I guess to kick it off, let’s just start with something basic - what is continuous integration and continuous deployment?
Continuous integration is essentially a process of frequently integrating each other’s work as developers into some kind of a central branch. For a lot of us, as a developer, when you think about it, the association is tests, building and testing your code. Why is that? That’s because in order for us to integrate often, we need to figure out very quickly if what we’re integrating works. So that’s what got us to the practices of automation and having automated tests.
Continuous delivery is kind of a broader method of developing software in which you apply a set of practices, one of which is continuous integration, where you make sure that your code is always in a deployable state. Typically, in practice, that means that at least your deployment process, which follows after running tests, is also automated and usually simple enough and robust enough.
A follow-up question I would have is why is it that we always see these terms together? CI/CD is almost like a single term these days, when it sort of sounds like they’re actually separate things that just kind of get bundled together.
Yeah. For example, in my personal journey as a developer, I first discovered continuous integration, and I was led to it through basically realizing the importance of automated tests, and getting feedback often… And I think that’s probably a frequent case.
[00:04:17.16] On the other hand, deployment is – you know, even when you’re having a prototype and you don’t have any tests, and you’re not even thinking about CI, you are… And maybe it’s a web app, maybe it’s a hobby project; the way you’re deploying it basically is continuous deployment, typically. So maybe you do a git push and it goes live.
So there’s some kind of a mix in terminology, because these two things are typically done together, in teams of a certain size and codebases of a certain size. It’s just that when you talk about maybe just continuous delivery, for example, it’s maybe too ambiguous for people to also understand that it includes CI. The way I see it, it’s just so we know what we’re talking about.
Okay. So if we’re looking at this CI/CD, what problems does it solve that would sort of cause a company to want to look into it? Why is it something that’s taking off and been adopted so much recently?
I think it’s all a matter of developer velocity, like being able to ship things faster, so that we shorten the time it takes between the moment when I hit Save in my code editor and the moment when I can see if my stuff works or not.
I remember when I was a teenager I was lucky to have my dad who wrote code among other things, and I remember somewhere I saw something on like – I think it was an ad for Turbo Pascal, and there was something like “Oh, that thing can compile like 57,000 lines per second.” I don’t remember the exact figures, because that was a long time ago, but I remember back then I was thinking “What’s the point of a thing that can compile more code than I’m maybe ever gonna write my entire life in one second? Why is that an important figure?” And long after, I kind of thought “Well, maybe it matters, because usually when we compile a big codebase - you know the XKCD joke, when you see the folks on their office chairs and they’re fighting with straw sticks, and the boss comes up like “Hey, what are you doing here?” and they’re like “Oh, we’re just waiting for the compile to finish”, and they’re like “Oh. Okay, fine.”
So back then we were waiting for stuff to finish to compile, and today we are waiting for knowing that the code works. So it has to go through build, and maybe some deployment, and some test environment, and then we need to wait for people to actually QA the code etc.
So if we can automate as many of these steps as possible, we’re saving time. If I can hit Save, push to a branch or whatever, and then I know that there is a bunch of automation that’s going to build my code, test it, deploy it in some staging environment, and then send me a notification, whether it’s Slack or whatever, to let me know “Hey, your code is deployed on this staging environment. Now you can have a look”, maybe it’s me who’s gonna have a look, maybe it’s somebody from QA, or some co-worker, or the peer or the manager who asked me to deliver that specific feature…
So if we can shorten that time, if instead of taking a whole day because I have to open a JIRA ticket for somebody to put my stuff in production, if it’s done automatically in five minutes, then it means I can iterate every five minutes, instead of iterating every day. So I can iterate multiple times an hour, I can make multiple experiments and multiple mistakes multiple times an hour, instead of just once per day. So to me, that’s what it’s all about. It’s making it so that I can try many things quickly, and that I can fail fast and fix my bugs, and then try again, and at the end of the day I was able to try and fail and eventually succeed maybe 10, 20, 50 times, instead of just one time.
[09:45] That makes sense. So when you were talking about that, you mentioned pushing to a staging environment, and having QA, and processes that in general, at least in my head, I sort of associate with larger projects, rather than a small project with one or two developers, perhaps. Would you say that this is something that becomes more valuable as the team size grows and the project scale grows, or is it something you tend to use no matter what the team size?
Both, I would say. A while ago, yeah, I would have agreed, like “Oh, this is something extremely complicated. I don’t know if I want that for my little pet project.” And I think there were a couple of things that made me change my mind about that. The first one was when I saw Heroku more than a decade ago, just when I joined dotCloud, so the company that would eventually become Docker… And Docker was initially a PaaS company competing with Heroku; and the ability to just push my code, and instead of pushing it to a repo, I push it to something that builds and deploys it - that was great, and it was really easy to do. That was the whole point of Heroku, and that’s what dotCloud was emulating and adding support for the languages, and so on. And that worked even for tiny, little projects.
I think this is a very important point, in a way. Even if you don’t maybe initially plan (or at all) to write tests, it’s really a good idea to set up a deployment pipeline, assuming you’re building something for other humans. The idea is just make that process – like, once you’re done writing the code, automate everything that needs to happen next, until other people can see it or use it. Make it basically one command. And the thing that typically does all the work, if it’s multiple steps in between, then that’s the task for the CD pipeline.
So are there situations where you think that using continuous integration or continuous deployment is a bad idea? Or maybe not a bad idea, but perhaps something that might not provide as much value.
Perhaps when it takes a lot of effort, for some reason. It’s the kind of thing that it’s a good idea to do it, but if it makes you jump through extremely complex hoops, and if it makes you waste a lot of time because of the setup or because of these very peculiar, special setups that you have, then yeah; then I could question it. But this shouldn’t become an excuse. We shouldn’t say “Oh, my app is special, so I can’t do CI.” I prefer the “Yes, and…” approach, like “Well, yes, I should do CI, and currently I cannot because this and this. But once I have solved this special problem, then I will be able to do it.”
[13:52] For instance, in the Kubernetes ecosystem a while ago I had this thought, I was like “Wow, I really wish I could run a bunch of tests on a brand new Kubernetes cluster each time.” Imagine that you push your code, and the thing is going to deploy a complete cluster, and test the code on the cluster, and then tear down the cluster. And a few years ago, that seemed – I wouldn’t say impossible, but kind of ridiculous maybe, because like “Okay, this is going to take a lot of resources, a lot of time etc.” And today, you can use something like kind for instance, to do that very easily and very quickly, just because things evolved a lot, and we got lots of contributions, new projects etc. So things that seemed extremely complicated and expensive a while ago, now are super commonplace and relatively easy to do.
So I think it’s great to not set anything in stone and accept the “Yes, I cannot do it today because X, but once we solve X, then I will be able to do it.”
Yeah, I would add to also consider that there are different flavors, for example, of continuous delivery. Maybe you’re working in an industry where it’s just not possible; regulations do not allow, or you don’t wanna maybe continuously deploy changes to the code that runs the airplanes, or medical devices… And on the other hand, continuously deploying changes of a complex codebase which has no tests is a huge risk, and such teams are not really continuously deploying… But they are aware of the risks and they have usually a very elaborate process; maybe they do it weekly, or monthly, and there are several people involved who need to sign off. There’s a QA team going through scenarios, checking everything all the time… So there are different maturity levels in each situation.
For the CI, I would maybe rephrase it - it doesn’t make sense to write automated tests for that project, and then maybe it becomes a little more clear. If you’re just prototyping, you don’t exactly know what you’re gonna end up with; writing tests may not be the right time to be test-driven… But as soon as you have some clarity on what you’re building and you’re working towards having that somehow see the light of day, again, in the hands of some kind of user, whether the user is another developer, or just a user, where you basically have some kind of an agreement that what you’re gonna write should work, I kind of see no reason not to write at least some tests. And if it’s maybe a lack of practice, or skill, fine. But that’s maybe a different subject, like how do you get better at it.
Marko, you mentioned deploying in cases where regulations don’t allow it, for example deploying to an airplane software for that. I think - at least in my mind, most of the time when I think about CI/CD, it’s more web apps… But I know that it can be used in other scenarios. So do you have any experience, or can you speak to what that setup might be like, and what delivery means in that sense?
I think about from the customers of Semaphore, who are working in some other types of some maybe non-usual industries, at least for most developers… But off the top of my head I wouldn’t know; in most cases, a lot of industries are kind of being transformed, and everybody is writing some kind of a web app, some kind of a maybe mobile app.
[17:44] I was recently talking to some people who were working on some satellite technology, where it’s not a web app, it’s not Linux or anything, it’s a real-time operating system. In that case, in such scenarios - also kind of recalling some experiences from my early career, when I worked on some embedded systems - writing tests is not so widespread in those projects. It’s more about manual QA, and then there is some kind of a release cycle, definitely less frequent than daily.
I was about to mention, when you deploy stuff that runs in space or in airplanes or something like that, you can definitely do CI, but CD is not really an option, just because the deployment itself can’t happen as easily and automatically as pushing to a server… That’s actually a bunch of industrial processes and industrial code where ideally you can do some CI, but it’s often pretty complicated, because you have to mock a bunch of things… And then CD is not really an option, because the code runs in an air-gapped environment, or maybe I should say sometimes space-gapped environment… These are very specific environments, of course.
Yeah. I was actually recently looking up – there’s this language called Verilog, which people use to write chips; you define chips in code. And there is a TDD framework for Verilog as well… So yeah, things have progressed everywhere, I would say.
I think another area where you might do CI and not CD is library development. If you’re not building something that’s actually run on a server somewhere, but someone else is gonna consume, that would definitely be a candidate for “I still wanna run all my tests and make sure everything’s working, but I’m not gonna deploy, and I’m not gonna make a release for every commit I merge or issue I close.”
I’ve seen some software where they do a build of the binaries they’re gonna have, and then they actually have tests that run with the binaries, that stub out some stuff… So when they’re calling Git, or whatever else. So they still almost do continuous delivery, in the sense that they make a binary; it’s just not one that actually gets shipped to users. So it’s like a weird middle ground where it does most of the things. You don’t wanna release a new version to your user every two hours; that would be pretty awful. But you can still get some of the benefits. And then finally, once a week, actually bundle it all up to be one final binary that you know has been tested all week long.
So when we’re looking at CI and CD, what is the typical setup that you guys see? What tools are being used, and why are those tools useful?
I don’t know if there is really a typical setup. To me, the core thing is that there is always a notion of a pipeline, even if it’s not really called that way… But it’s a sequence of operations that we run. If you look at the configuration options and how people run - whether it’s Semaphore, Travis, Jenkins etc. - it’s always the same overall principle. You prepare the environment, you run the things, a bunch of tests, maybe there is some matrix going on because you have many combinations of versions of things to test, and you need to collect all these logs, and at the end you get like a yay or nay.
[22:24] And then in tooling, what I’ve seen is that there is what they would call maybe the venerable ones, the ancient ones; I’m thinking, for instance, tools like Travis or Jenkins, just to give one in the SaaS space and one more in the on-prem space. And then there has been a lot of new tools that appeared to leverage new stuff. Obviously, containers happened, so we want a way to leverage that… And very often, the more ancient platforms did not allow that, or at least not at once, or not in an elegant way… So that made a space for a bunch of new players to be like “Okay, we’re going to support containers and a bunch of other technologies from day one, in a way that makes sense for people who actually write a Dockerfile and want to run that code in containers”, as opposed to just want to tick a box, saying “Oh, yes, I use CI on these containers”, but that just means they’re using it somewhere.
So yeah, on the tools themselves it would be this kind of 2D matrix, kind of on-prem, and then more SaaS-oriented, even though many tools actually play on both sides… And to me personally, I kind of see – it’s not a very clear line – the pre-container and post-container environments almost. It’s pretty telling.
When I first started seeing CI for the first time, I know it was with a lot of tools like Travis, where it definitely felt like you could just take what you had, and it would somehow magically make it work. Whereas now, it seems like most of the new products just have to support containers, and then it almost feels like since that’s become so widely adopted, one of the upsides at least to using them is that you can generally pick and choose the tools that seem right for your setup… Whereas – I know before, when you were using Travis, it would magically work most of the time, but if something didn’t work, it could sometimes be a pain to figure out “How do I test this really weird scenario where I need some random software installed on the server?”
So is that true in your opinion, that the ecosystem has sort of evolved because of how prevalent Docker and containers have become?
Sure, Docker was very disruptive for the CI and CD space, because it introduced an entirely new abstraction process of building, testing, deploying software. Typically, developers previously did not deal with the things that Docker represents, so for all the CI, for example - Semaphore is a cloud-only service, so that’s what I know best.
For example, the early cloud-based services like Travis or Semaphore had very simple capabilities, in terms of the kind of workflows that you could run. Basically, you could have a sequence of steps, or maybe a sequence of parallel jobs, and that’s pretty much it. Maybe some services had also a separate deployment step… But some even didn’t have that.
So in the case of Docker containers, even if you don’t have that problem, Jon, that you describe, like there’s something weird and maybe I wanna define my own environment with a container; I don’t have that problem, but I need to build a container, that’s what I need to ship to production.
[26:20] When you start, when you do a build, so you build a container, and then maybe you have a relatively large test suite, so you wanna parallelize it… You would ideally build a container once, and then the term is “fan out” to several parallel jobs, and re-use that container; not rebuild it five times, but reuse it five times. That’s where an early version of Semaphore, for example – we basically had to reinvent what Semaphore was at one point, a few years ago, because of this and some other scenarios we wanted to support. Like, this was not possible. You had to rebuild the container in all the parallel jobs. When you’re actually working with containers all day, that’s not really acceptable, and then it suddenly doesn’t matter how good and useful and beneficial to you that CI tool was previously; suddenly, it’s just not the right fit.
But from the CI provider standpoint, to make that new scenario possible, and a bunch of others that are kind of related, and maybe not so obvious, it’s a lot of work. Some of us who were doing cloud-based CI, we had to basically reinvent our solutions… Or not. Some have not done it. Or some new players obviously appeared. It was a pretty important change in the industry.
So when you’re talking about running this continuous integration, and you had said that even if you don’t need a separate environment, you can basically fan out the builds - why is that speed important? I guess the way I would phrase this is I’ve definitely been in teams that have quick feedback from continuous integration, and then other teams where continuous integration is something where you push your code and then you check 15 minutes later to see what’s happening. So can you sort of speak to how that affects the developer experience?
I think it comes back to what I was explaining earlier about iterating faster and being able to try and experiment more things in a given day. There is a kind of quest for the fastest deployment time; I think that’s almost verbatim the title of a talk by Ellen Körbes, who works at Tilt, and has this amazing talk which is about how short can it be between the moment when I push the button and my code ends up running on my Kubernetes cluster. And I think the answer is something like you can go all the way down to four seconds, or something like that. Of course, in that case we’re not talking about CI; it’s a very special case. But that addresses exactly that need for speed.
I think that for most of the code that we write this is maybe not required, because I can test things locally. Ideally, I can just save and build, and I try my thing, it works… But if I’m working on something more complex, that interacts with an environment that is really hard to mock… For instance, let’s say you write a Kubernetes operator, because that’s a super-fashionable thing these days, and many people do that, so you end up writing your thing in Go, and then you need to run it on a Kubernetes cluster… So especially when you learn in the beginning - I did that recently, and honestly, it’s the kind of thing where you’re trying to put things together from the docs and the sample code that you’ve seen, and the idea you have in your head of how it works… But a number of times I’d just put the line here, and honestly, I had no idea what it would do; I was hoping it would get me closer to what I wanted, but I really had no other option than trying it out, poking at it and seeing what happens.
[30:22] In that case, of course, I’m not in CI, but I’m in hopefully some kind of CD. If I can work locally, that’s great. But if I need to interact with a big cluster, that has a bunch of pods and containers and load balancers etc, in that case I need to deploy to maybe not the real thing, but at least a thing that is real enough for all my tests… And then I want that to be fast. Because again, if I’m in that learning stage where I’m at the point of printf debugging, and things like that, that ideally we shouldn’t do, but sometimes we still have to fall back to that - well, in that case, I want things to build and deploy really quickly. I’m willing to take a lot of shortcuts to make that happen, just like in the example I was giving, for instance. I’m not talking about CI yet, I’m just learning and I think it’s also an important point in modern CI and CD pipelines; it’s the “How can we shortcut some parts?” or “How can we make the thing suitable both for local development experimentation, and then get that as close as possible to the CI and CD form?”
It’s a need that I felt a lot of times. I was mentioning Tilt recently - it’s one of the tools which fills a big gap in the container, but particularly Kubernetes ecosystem, because we still don’t have a really nice developer experience with Kubernetes the way we had with Compose in Docker… So when I saw that tool, Tilt, I was like “Wow, this is really great”, and I started to use it and almost abuse it… And then I started to wonder, “Well, I describe my whole stack with that tool, which is just for development, but now I want to make that into a deployment tool. Do I have to start all over again?” And it turns out that other folks had similar ideas, and I realized even though at first it was a development tool, folks added some CI commands, so that you can basically say “Okay, instead of just spinning up all my services and containers etc. and then work with this development cycle iteration, change code, save etc”, now you work more in a CI mindset where you run the tool to bring everything up once, perhaps run your tests and shut everything down. I think there’s going to be a lot of evolution in that space, because we have great CI tools, great CD tools, great local development tools, great this and that… But more and more, we need tools that are able to do both - that can salsa AND tango, not just one or the other.
One question I have is that – like, most of the time we’re talking about CI/CD, we’re sort of thinking about something that we can run locally, and then we can deploy it to see how it works as a released product at that point. But you had mentioned developer speed and some of those different use cases… I guess one that I’ve always sort of questioned is “Could there be a case where CI/CD almost replaces somebody running stuff locally, if we got the feedback loop quick enough?” And I guess one of the examples that came to mind for me was - in a previous episode we talked with the creator of Play With Go, which I think stemmed from Play With Docker, which I believe you have some familiarity with, Jérôme…
I don’t remember if you were one of the creators of it… Is that correct?
Well, it was created by two Docker captains, and I would butcher the names, so I don’t want to pronounce them… But Marcos and Jonathan… And I helped a little bit in some points, but mostly by cheering and encouraging them, because I think that what they made was really amazing, at a time where all these tools like … and so on were emerging… So yeah, I see what you mean.
[34:25] Yeah, I was sort of thinking about the – the Play With Go version at least, it uses CUE and some other stuff, so that when you’re writing a guide, it builds that all and pushes it. But at least right now in its current state, actually writing a guide means that you have to pull the whole thing, get it running locally, get all the scripts running locally, and all that… Whereas if you wanna lower the barrier to entry, it would be ideal if somebody can just write the script and have some sort of CI/CD pipeline that just spits out something and says “Well, this is what it looks like, roughly.” Maybe it’s not perfect, but it allows them to skip that – you know, I just wanna write a two-page guide, I don’t really wanna have to figure out how to install this entire system and set it all up.
Yeah, absolutely. I agree. In a way, containers made it easy to do that between “normal code”, but now if my code is doing things with containers, then how do I put that in containers itself? So that’s how we had projects like Docker-in-Docker, and things like that… Or for instance, another project that I’ve seen recently and which I think for now is kind of flying under the radar, but when people see what it can do, it’s going to blow up… It’s something called Sysbox, which – basically, to simplify it, it lets you run the equivalent of privileged containers, but kind of safely, or at least in a safer way, which means that all this stuff like Docker-in-Docker or Kubernetes-in-Docker etc. other workloads where you typically think “Oh, I need a VM”, these things could now run in containers, and that’s going to make a bunch of things doable… Just like I was saying earlier, a few years ago it was like “No, I can’t do that, because that seems impossible”, and then today, with the new tools, the new – it could be some kernel feature that you didn’t see coming up, and then unlocks some really interesting use cases etc. So yeah, CI and dev - I think these things are going to get closer and closer.
I would add to Jon’s initial question - I think large web apps, over time they develop large test suites. You have a lot of unit tests, which are maybe not so complex to run locally, but usually end-to-end tests or acceptance tests are the more demanding ones… And what I’ve seen from our own internal experience, and also a lot of Semaphore users, is if you’re developing some kind of a SaaS, developers typically don’t run the whole test suite locally; they just push to CI on the feature branches… Because in CI they have a very elaborate parallelization and optimization. So if they would run everything sequentially, the total time would maybe even be above an hour… But in CI, they actually got it down to around ten minutes. So it’s just more convenient to push and wait for feedback.
It’s also nice because in that case you can sort of push and go back to work… Whereas running locally, at least you have to have a second tab, or something open to let it happen, and it might slow down your computer, depending on what you’re developing on. Because I know some people are running on Chromebooks and things like that, where sometimes it’s a little trickier.
To ask a question related to that, and to step back, talking about tools again - if you were choosing tools today… Let’s say you have a web app - I think a lot of listeners build web applications or something along those lines - and you wanted to start off with continuous integration/continuous delivery, how would you go about choosing tools, and where do you think they’re gonna get the most bang for their buck if they’re just trying to get something starting out? How would you go about thinking through that process?
[38:15] Excellent question. For me, my personal approach is to try to aim for the simplest tool that would do the job. Not too simple, because otherwise I can’t do what I do, but also not too complex, because it’s really easy to fall down the rabbit hole of complexity.
For instance, I’ve seen so many folks going with Kubernetes or Docker, just because they thought it would be the thing to do, like it’s fashionable, and then when we look at “Okay, what are you running in it?” “Well, we just have Go microservices”, or maybe it’s only Python. And then when we look at it, we’re like “Well, are you really going to get something from (again) Kubernetes, or Docker, or whatever?” Because maybe you are in one of these scenarios where you don’t need that extra-complexity. In that case, I would be happy to do without. I’m happy to use something like Docker when there is a mix of different languages, and some exotic databases, and things like that, because when I land on a project like that, I know that it’s going to take minutes, not hours or days to bring up the dev environment. But if all you have to do is go get/go build, it’s pretty hard to get easier than that.
So I don’t think I would point to a specific tool; I won’t tell you “Oh, you should absolutely use that thing or that thing”, but rather think about what’s the easiest tool that’s going to work for me, and try to not overcomplicate things.
So Marko, I assume you’re a bit more biased.
Maybe I’m wrong, but… Where do you see Semaphore fitting into it? What’s your bread and butter use case that you think people would be like “Yes, you should definitely go check this out”?
Yeah. I’ll just maybe add to Jérôme’s point - if you are just a beginner in this whole area, maybe not even think about CI and CD. Maybe first invest time in learning test-driven development. It’s gonna level up your skills in designing code and thinking about systems, and writing cleaner code. If you got that mostly right, then just make sure that the way you run tests, or build your application from scratch is very simple. Ideally, one line, one command. If you have that, if you’re not leaking any complexities, but you keep it simple like that, then choosing a tool is gonna be – you’ll get it done in one hour in the afternoon, whatever you are maybe familiar with somehow, or heard about, or is able to get you to a passing build very quickly.
I can share how I see companies evaluate choices… Typically, they look at what they are building today and what the technical requirements of their systems are. Most of Semaphore’s customers are building some kind of a SaaS, or they’re some kind of a technology company. They usually have a relatively large codebase, and in that case they benefit from Semaphore the most, because Semaphore is the fastest cloud-based CI service; everybody’s free to fact-check that.
So typically, people have different teams, maybe they’re building mobile apps… You know, it depends on what frameworks, what languages they’re using; once you put all that on paper, there are usually some edge cases where suddenly not every tool fits the bill. You also need to figure out “Can you use cloud-based?” Can you outsource the whole process, or is something forcing you to do it yourself? That’s an important junction.
And once you’re kind of through all that, if more than one option remains, I would evaluate just what’s the user experience. Is it easy enough for developers to use, or it’s like developers don’t want to work with pipelines but it’s more like pushing you to have a magical person or a team working on pipelines… Which is not so great, in my opinion; I think developers should own basically the pipelines of the project, have full autonomy… And you know, just see performance, basically. If there are differences – there are huge differences, in some cases even 2X among cloud services, so I think it matters a lot if you’re getting feedback in 15 or 30 minutes.
[42:50] It’s definitely a big difference between 15 and 30 minutes if you’re waiting to figure out if something works. As a developer, I can imagine that would – I mean, it can almost change your productivity by 2x-3x factors at times.
Marko, you mentioned that if you focus on getting your app set up – basically, having it set up well ahead of time, so you have tests there, it’s relatively simple to run those tests… Are there any other pitfalls or mistakes people make that lead to issues when they start looking at CI/CD?
Well, one thing that maybe people who have not been previously practicing CI usually do - they work in very long-living branches; so they accumulate a lot of changes in feature branches, which just makes it more difficult to integrate. That’s something to avoid.
In conversation, I do use the term feature branch, but – I don’t know. For me, a feature branch is something that you do a git checkout and you’re gonna merge maybe one hour later, not one month later. Yeah, just make sure that you work in small batches of changes; you can basically hide undeveloped features behind simple if statements, and basically just carry on, merge piece by piece. We talked about avoiding unnecessary complexity, as Jérôme talked about it…
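As a minimal sketch of hiding an unfinished feature behind a simple if statement: the flag name `NEW_CHECKOUT` and the `checkout` function are hypothetical, chosen only for illustration, and a real project might read the flag from configuration or a feature-flag service instead of the environment.

```go
package main

import (
	"fmt"
	"os"
)

// newCheckoutEnabled reports whether the unfinished feature should run.
// The environment variable name NEW_CHECKOUT is a hypothetical example.
func newCheckoutEnabled() bool {
	return os.Getenv("NEW_CHECKOUT") == "1"
}

// checkout picks the code path. The new path can be merged into the main
// branch piece by piece while staying switched off for users.
func checkout(enabled bool) string {
	if enabled {
		return "new checkout flow (work in progress)"
	}
	return "existing checkout flow"
}

func main() {
	fmt.Println(checkout(newCheckoutEnabled()))
}
```

With the flag off by default, each small merge leaves the main branch releasable even though the feature is incomplete.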
The feature branches is definitely a good one to keep in mind, because I kind of am in the same mindset as you, where even if you’re gonna spend more than an hour on a feature branch, I try to keep it as something that – I want it to emerge as one single commit, that describes everything being done. And if you have too much code for that, it kind of is a sign that you’re keeping a feature branch open way too long. And that doesn’t mean inside the branch it ends up being one commit as I’m developing, because sometimes I just wanna save my work, or whatever… But eventually, I’ll squash the whole thing and merge it in, so I want it to kind of be one commit at that point, that describes hopefully one small feature, or some part of the feature being described there.
Oh, flaky tests? I was gonna say, that’s the one that I’ve seen the most. Where CI became useless for me was when I worked on a project that we would actually deploy, and then maybe 50% of the time the CI would fail… And at that point, it wasn’t useful feedback, because you couldn’t tell “Well, is it something broken, or is it just a test that doesn’t run correctly all the time?” And it kind of made that CI like a weird – you’d wait ten minutes to get your feedback, and then be like “Well, now I just need to run the test again to see if that was actually broken, or if it wasn’t.” And when we’re talking about speed, that means that half your tests are gonna take 20 minutes now, potentially, to double-check if it’s correct or not.
Yeah, and we talk about the same kind of things around monitoring, and observability, and the false positives, when your monitoring system pings you or pages you, especially in the middle of the night… If it’s a fluke, it’s going to be terrible, first because it sucks to be pinged by a machine in the middle of the night, and then especially if you know that half of the time, even if it’s just 10% of the time, you know it’s a fluke… So now it’s like the story of the child who cries wolf, basically, because since the monitoring is nagging you constantly, then you don’t pay attention when it becomes important.
And I think for the test scenario that you mention here, the behavior you describe is conscientious because it’s like “Well, I’m going to run my tests again”, but some folks might just be like “Well, if the test can’t be trusted, I’m just gonna stop paying attention altogether and not care.” So in that case, yeah, we need to fix this test.
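One way to tell "broken" from "flaky" is to run the same check several times and look for mixed results. The sketch below assumes a check can be modeled as a function returning pass or fail; the `classify` helper is hypothetical and not part of any CI product.

```go
package main

import "fmt"

// classify runs check the given number of times and reports whether it
// always passed, always failed, or gave mixed results (flaky).
func classify(check func() bool, runs int) string {
	passed := 0
	for i := 0; i < runs; i++ {
		if check() {
			passed++
		}
	}
	switch passed {
	case runs:
		return "pass"
	case 0:
		return "fail"
	default:
		return "flaky"
	}
}

func main() {
	calls := 0
	// A stand-in for a test that fails every other run.
	unstable := func() bool {
		calls++
		return calls%2 == 0
	}
	fmt.Println(classify(unstable, 4))
}
```

For real Go tests, `go test -count=N` repeats each test N times, which is a cheap way to surface this kind of instability locally.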
[46:46] To bounce off something that was said earlier - I’m also a huge fan of the developers owning the CI and the process around it. However, I’m also very pro bringing in, maybe for a short engagement, some expert commando team to help you figure out what you need and how to set it up, and quickly explain to developers, “This is how you’re going to be autonomous.” I’ve done that for container stuff a number of times, just because these ecosystems are so big. Ideally, in the best possible world, we would do the research and pick the solution ourselves, but sometimes it really helps if someone can sit down with you and listen to what you’re using, and the code you’re trying to run, and then tell you “I can at least help you narrow down your search to this and this and that. And personally, this is how I would do it”, and then, if they do it for you, empower you to maintain it after.
So to speak of what I know - yeah, writing the first Dockerfile from scratch can be extremely difficult, especially doing it well, with all the multi-stage builds, bells and whistles, etc. However, once you have that Dockerfile, adding one extra dependency or changing something - that’s way, way, way easier. So there’s a little bit of both here.
I have a question, I guess related to CI/CD, around build systems, and at what point it makes sense to bring in maybe something better than a makefile or a shell script, like Bazel, or Pants, or Buck, or one of those things… That seems very connected to the CI/CD pipeline, that equation.
Yeah, that’s super-connected, and I really liked how you mentioned Bazel, because I had a friend who kind of helped me understand what exactly was the point of Bazel, because from outside I had seen some container examples, because for a while in the previous years all I was doing was containers, basically… And I couldn’t really understand “Okay, what’s the point of using Bazel for containers? That seems super-complicated.” And then my friend basically explained to me “Well, if you have a team of 100-200 developers constantly shipping code, and you have this test suite which kind of grows and grows and grows, and now each time you change one line of code in this little, tiny dependency at the front of the codebase, you end up having to re-run everything, and quickly that complexity blows up… Maybe not exponential, but at least it’s not linear anymore.”
And so you quickly get from the point where your test suite might take – you know, in the beginning it’s a few minutes, and then it’s a few hours, and then suddenly it’s a few days, and then you’re like “No, we can’t do this anymore.” And with something like Bazel, then you can express dependencies in a really nice way.
[51:44] To me, it was to understand that yeah, something like make and makefiles helps me to rebuild just what I need, and with something like Bazel I can take this one step further and not only build only what I need, but also test only what I need, and build only the artifacts that I need etc. and I can bring back down that incredibly long test time, I can bring it back to something reasonable, and my developers can, again, wait just minutes instead of days to see results.
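The idea of building and testing only what a change touches can be sketched as a walk over the reverse dependency graph. This is an illustration of the concept, not how Bazel is implemented; the target names and the `affected` function are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// affected returns every target that depends, directly or transitively,
// on the changed target, including the changed target itself.
// deps maps a target to the targets it depends on.
func affected(deps map[string][]string, changed string) []string {
	// Invert the graph: for each dependency, record who depends on it.
	rdeps := map[string][]string{}
	for target, ds := range deps {
		for _, d := range ds {
			rdeps[d] = append(rdeps[d], target)
		}
	}

	// Breadth-first walk from the changed target up to its dependents.
	seen := map[string]bool{changed: true}
	queue := []string{changed}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, parent := range rdeps[cur] {
			if !seen[parent] {
				seen[parent] = true
				queue = append(queue, parent)
			}
		}
	}

	out := make([]string, 0, len(seen))
	for t := range seen {
		out = append(out, t)
	}
	sort.Strings(out)
	return out
}

func main() {
	// Hypothetical targets, for illustration only.
	deps := map[string][]string{
		"api":     {"auth", "storage"},
		"auth":    {"crypto"},
		"billing": {"storage"},
		"storage": {},
		"crypto":  {},
	}
	// Only these targets need rebuilding and retesting after a change to crypto.
	fmt.Println(affected(deps, "crypto"))
}
```

Everything outside the returned set can reuse earlier results, which is where the drop from days back to minutes would come from.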
The flipside is, of course, the complexity of the tool. The situation of my friend, basically I had the impression that there was like one full-time engineer kind of maintaining the Bazel build system for them - which if you’re talking about hundreds of engineers shipping code behind that, that’s reasonable, because tooling is so important… But I’ve also seen the other extreme, where you have folks who can’t even comfortably write Dockerfiles, and there was this one dude who showed up with Bazel and was like “Oh, this is awesome. I’m going to put Bazel files everywhere”, and nobody can understand or maintain it, and it’s just [unintelligible 52:56] because people just kind of run it and pray, and when they need to tweak something, it gets complicated.
But yeah, it’s a continuum. From makefiles, Bazel, containers, all the container build systems that we have now, because even though I keep talking about Dockerfiles etc., we have other things now as well, so it’s meshed in.
Yeah, I don’t have experience with Bazel; we’re still using make, so…
It sounds like it’s one of those things where it starts to become obvious that you need something else when it happens, if things are getting too slow…
And I personally haven’t been in that situation yet either, so I’m thankful for that… But at the same time, it’s nice to know there’s tools available.
I just wanna say about flaky tests - what I think most people don’t know is, from a CI provider, I was able to see that basically everybody, every organization has them, and people are usually kind of ashamed that they have flaky tests… So I’m just here to tell you you’re definitely not alone. It’s just part of the work, part of the complexity, it’s just about how you deal with it… And yeah, I definitely wanna encourage people to invest a little bit of time in maintenance of their tests for code as well. They need maintenance and some polish.
It’s definitely something good to keep in mind… And I think you’re probably right, I don’t think I’ve ever seen an organization that doesn’t eventually introduce a flaky test. Now, they might be quicker removing it, but I think they do get introduced over time.
Okay, so I’m gonna play this intro thing for everybody, and then we can jump into your unpopular opinions.
Okay, so Jérôme, Marko, do you have any unpopular opinions you’d like to share? Whenever we do this, typically Jerod will take your unpopular opinion, make it into a little Twitter poll, and he’ll poll anybody who’s following the @GoTimeFM Twitter; he’ll poll them to see if it’s unpopular. I will warn you that most of that audience is gonna be Go developers, so sometimes opinions that might be unpopular overall aren’t unpopular there… But it’s completely fine if it’s not unpopular; we’re just interested in different opinions than what the norms are.
[55:32] Well, mine would be that we have to stop insisting that updates, etc. need to be distributed over HTTPS; very often when I say that all my security friends and even non-friends are like “No, you don’t know what you’re talking about. It’s very important, because we have this, and this, and this attacks.” And then when I explain, I’m like “No, no, no… Sure, distribute the metadata - list of packages, versions, checksums over HTTPS all you want. But the big bits - you can serve that over HTTP, FTP, etc.” And the reason being that serving over HTTPS costs a lot of money, not because TLS is complicated, and whatever, but because if you’re using HTTP or FTP, you can just let the world mirror your stuff. That’s the way that Debian and Slackware and all these distros have operated for decades, on a shoestring as far as the budget.
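The split described here (trusted metadata with checksums, untrusted bulk data) relies on verifying each download against its checksum. A minimal sketch, assuming SHA-256 checksums and a hypothetical `verify` helper; how the checksum itself is fetched and trusted is outside the sketch.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// verify reports whether data matches the hex-encoded SHA-256 checksum.
// The checksum is assumed to come from a trusted channel (the metadata);
// the data may come from any untrusted mirror.
func verify(data []byte, wantHex string) bool {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]) == wantHex
}

func main() {
	pkg := []byte("package contents fetched from a mirror")
	sum := sha256.Sum256(pkg)
	// Stand-in for the value that would be read from trusted metadata.
	trusted := hex.EncodeToString(sum[:])

	fmt.Println("intact download accepted:", verify(pkg, trusted))
	fmt.Println("tampered download accepted:", verify(append(pkg, '!'), trusted))
}
```

A mirror that alters the payload is caught at verification time, so the bulk transfer does not need to be confidential or authenticated on its own.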
If you take the Docker Hub - and I’m not going to give you numbers from when I was at Docker, because I don’t even know if I knew these numbers, and I wouldn’t remember… But just taking the public numbers from the beginning of this year, Docker said in some PR stuff that they had like 15 petabytes of images on the Docker Hub… So storing that on S3 would be at least $300,000/month, not counting transfer. Transfer - again, I took some numbers that Docker published in the beginning of this year, like 8 billion pulls per month. And I went with an average 10 megs per pull, which is really low… That would give you a bill of four million dollars per month, just to operate the Docker Hub, and these are pretty optimistic estimations.
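The back-of-the-envelope numbers can be reproduced with a few lines of arithmetic. The unit prices below are assumptions picked to match the figures quoted above (roughly $0.02 per GB-month of storage and $0.05 per GB transferred); they are not official pricing, and decimal units (1 PB = 1,000,000 GB) are assumed.

```go
package main

import "fmt"

const (
	storagePricePerGB  = 0.02 // assumed USD per GB-month, not an official price
	transferPricePerGB = 0.05 // assumed USD per GB transferred, not an official price
)

// storageCost returns the monthly storage bill for the given petabytes.
func storageCost(petabytes float64) float64 {
	return petabytes * 1e6 * storagePricePerGB
}

// transferCost returns the monthly transfer bill for a number of pulls
// with an average size in megabytes per pull.
func transferCost(pulls, mbPerPull float64) float64 {
	return pulls * mbPerPull / 1e3 * transferPricePerGB
}

func main() {
	fmt.Printf("storage:  $%.0f per month\n", storageCost(15))
	fmt.Printf("transfer: $%.0f per month\n", transferCost(8e9, 10))
}
```

Under these assumptions, transfer dominates storage by more than an order of magnitude, which is why mirroring the bulk data matters so much more than where the origin copy lives.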
So if only that was mirrorable easily over plain HTTP, FTP etc. and you just served the metadata over TLS, and perhaps have an origin copy over TLS for the one odd scenario where somebody is running this attack against you, or they prevent you from updating etc. I’m not saying that this would have changed the fate of Docker, but I’m curious to see what the parallel universe where things have been made differently in that regard looks like. A world where you can have something like Docker Hub that doesn’t end up costing in the six, seven, eight digits range per month to some company somewhere.
So do you have any guesses as to how much that would actually save? Do you think it would cut the costs in half, or…?
Oh, I think it would save like 99%, or something like that… Which sounds completely like “What?!” But if you look at Linux distros - and I’m talking about stuff like Debian, Slackware, Arch Linux, I’m not aware of… You know, there is not a Debian Inc. or Arch Linux LLC or whatever paying for all the mirrors etc. It’s just like companies, universities, labs, ISPs etc. who decide to just mirror all that, because they feel like it’s the public good. It’s the commons. It’s something that we maintain.
At some point when I was running a hosting company in France a while ago, we had mirrors as well, first for our own convenience, because when we deployed machines, it was so convenient to have something in our network, and it was also good to make that available for others.
So at the end of the day - yeah, I think it would slash the costs by a factor of maybe 100 or 1,000, something like that.
I think this is a very important message for whoever is building maybe the next company that’s with the goal of being kind of a backbone in the community…
Yeah. I’m thinking about npm as well, and I don’t know how much it might cost, but I’m scared to think about it.
Yeah, yeah. I remember being a college student, downloading Gentoo Linux, obviously looking to download from the mirror of my local university… But today I guess most people have faster internet. But still, I think every organization would want to download from the closest source. I think it’s not even a question of like a budget. It is going to be faster and more convenient, so…
[1:00:17] I can definitely say, when I’ve worked at companies that have some of that stuff mirrored internally, that’s also – you can tell when you’re getting stuff which ones are mirrored internally versus which ones aren’t, because it’s a drastic difference.
Yeah. So if only Docker, in that case, I would say – if I had tried to make my case back then to my co-workers when we designed that whole protocol, if only it had been plain HTTP for the data bits, then it could have been mirrored transparently. But yeah, I’m curious to see what that parallel universe looks like.
Isn’t that why they just recently did the changes? Or I’m assuming that’s why they did the changes recently that you have to be signed in after like 200?
Yeah, I guess at some point – I mean, it’s just so much money… And especially because we in the CI space are also guilty as charged. The number of times where I’ve set up a pipeline, and when I look at it, I’m like “Well, this kind of sucks, because I end up pulling these images from the Docker Hub each time. Is there any way I could not do that?” And it turns out that it’s complicated.
I remember having these Linux install parties, where you get together with a bunch of nerdy friends and you’re like “Hey, we’re going to install Linux. It’s going to be fun!” And I remember setting up a transparent proxy for that, and it was fairly easy, and nobody had to do anything, and everybody could just pull the packages from the proxy… Try and do that for the Docker Hub. You can’t, because it’s over HTTPS. Well, you can, but it gets really tricky. You have to set up a transparent TLS proxy, inject certificates, and suddenly, all the security that you had, your hard-earned security that you got from TLS goes down the drain, because you’re adding this kind of backdoor, so that you can have the caching proxy. So yeah, that’s nice.
That makes me wonder about the middle road that, say, Go modules went, where it still has that security, but it’s also able to be distributed. Is that a good middle road, or do you think it should still just kind of be strictly HTTP?
I guess it’s also maybe a size problem; the issue is magnified for container images, because it’s so easy to end up with four-gig container images, and you haven’t even started putting your code in it. And then you end up with a pipeline that just pulls these four gigs 20 times, because that’s how things work. And when nobody’s paying for it, nobody has an incentive to try to improve that; the main incentive is “Hm, maybe I could make smaller images, because this pipeline is getting slow, and I have a hunch that if my images were smaller, my CI would run faster…”
But yeah, at the end of the day someone’s paying for it, and at some point I get that the someone here (Docker) was footing that bill… So that’s where we are now.
Marko, do you have an unpopular opinion you’d like to share?
Yeah, I have one which is in tune with today’s topic, although we’ll see how often this happens when you’re writing small Go services… So mine is that it’s not proper continuous integration if it takes more than ten minutes to get feedback… Which is essentially about drawing a line somewhere, saying what’s good enough.
[1:03:50] The idea is it’s good enough if as a developer you don’t completely lose focus while you wait… And it’s kind of around ten minutes. Basically, if you wait any longer… I mean, you might still remain focused for 15, but you know, going any more… It just sucks. From a developer point, it’s like somebody took away my keyboard and I’m not able to do my work, do what I enjoy… Which sucks.
It’s about around the time it would take to go make a coffee, or tea, or something, and come back. And if it’s not done by then, then we’ve got an issue.
I think that makes sense. It’s something that’s hard to explain to somebody who’s not a developer, how distracting it can be to go do something else for a half hour and then come back to what you were trying to do…
I’m guessing most developers have struggled to explain that to somebody else, but it is a real pain point, where if you have to wait too long, it’s hard to keep that focus.
Yeah, yeah. The way you could maybe explain it to somebody who’s not a developer is like – okay, let’s say it’s one hour, and there’s 12 of us working on a project… And how many working hours do we have? At most eight… So technically, it’s not possible for all of us to push and merge something in one day. So think about the implications of that, and how often we’re gonna basically check in and do stuff together… So yeah. I think pretty quickly you can run into very hard limitations. Or if you have flaky tests, as we talked about, you need to re-run, but there’s two other guys re-running stuff on master, and it’s 3 PM, so you might as well just go home.
In the scenario you described it could even get to the point where code’s still running the next morning when people come into the office, which would be even worse… Like, if it’s long enough and you have enough people, that could potentially be a real – because as soon as something gets committed, you pretty much have to run against that new commit at that point, so it’s not like you can parallelize all this and count it as correct.
That’s why maybe the thing of being able to cut corners – I’m thinking if you’re adding commits to a feature branch, it might make sense to just cancel whatever had been scheduled on that branch before… And I guess each time we accomplish something and get progress in the tooling, we’re like “Okay, now we have, for instance, a matrix of different versions etc.”, we always can imagine a new feature, a new thing that we didn’t even think about before, but now that we have this foundation, we’re already thinking about building the next floor, the next level on top of that.
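Cancelling whatever was scheduled earlier on the same branch maps naturally onto Go's `context` cancellation. The `Scheduler` type below is a hypothetical sketch, not the API of any CI service; it assumes each build watches its context and stops when it is done.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// Scheduler tracks the running build per branch and cancels it when a
// newer commit arrives on the same branch.
type Scheduler struct {
	mu      sync.Mutex
	running map[string]context.CancelFunc
}

// NewScheduler returns an empty scheduler.
func NewScheduler() *Scheduler {
	return &Scheduler{running: map[string]context.CancelFunc{}}
}

// Start cancels any build already registered for branch and returns a
// context for the new one. The build is expected to stop when ctx is done.
func (s *Scheduler) Start(branch string) context.Context {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cancel, ok := s.running[branch]; ok {
		cancel() // the older build on this branch is now obsolete
	}
	ctx, cancel := context.WithCancel(context.Background())
	s.running[branch] = cancel
	return ctx
}

func main() {
	s := NewScheduler()
	first := s.Start("feature/login")
	second := s.Start("feature/login") // a new push to the same branch

	fmt.Println("first build cancelled:", first.Err() != nil)
	fmt.Println("second build cancelled:", second.Err() != nil)
}
```

Builds on other branches are untouched, so only work that can no longer matter is thrown away.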
I don’t know if the ten minutes – is it really an unpopular opinion, or is it unpopular because it’s hard to do and people are like “No, I’m not gonna commit to that, because that’s way too hard.”
Yeah, there’s probably a lot to it. When I talk about it, people kind of get defensive, like “Oh, you don’t know my code. It has to be this way.”
It’s one of those things where in theory everybody likes it, but in practice nobody’s willing to actually put in the effort to make sure it happens.
I guess, Marko, you’re saying that it should be important enough that you put in the effort to make sure it happens.
[1:07:31] Yeah, yeah. But it can probably be made easier with a tool, if you wanna – you don’t need to run all the tests immediately. For example, your tools should let you run unit tests first, and efficiently proceed further to maybe end-to-end tests… Because if you have a problem in unit tests, it’s probably fundamental enough that it doesn’t matter what the result is on the end-to-end stuff. So there’s things like that… Or if you have multiple projects in a repository, the tools should let you say “If this directory changed, then do this. But don’t do anything else.”
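Both ideas (stop at the first failing stage, and only build the directories that changed) can be sketched in a few lines. The `Stage` type, `runStages`, `projectsToBuild`, and the directory names are all hypothetical; real CI tools express this in their own pipeline configuration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errStageFailed is a placeholder error used by the example stages.
var errStageFailed = errors.New("stage failed")

// Stage is one step of a pipeline.
type Stage struct {
	Name string
	Run  func() error
}

// runStages executes stages in order and stops at the first failure.
// It returns the names of the stages that actually ran.
func runStages(stages []Stage) ([]string, error) {
	var ran []string
	for _, s := range stages {
		ran = append(ran, s.Name)
		if err := s.Run(); err != nil {
			return ran, fmt.Errorf("%s: %w", s.Name, err)
		}
	}
	return ran, nil
}

// projectsToBuild returns the project directories that contain at least
// one of the changed files. Files outside every project are ignored.
func projectsToBuild(changed, projects []string) []string {
	var out []string
	for _, p := range projects {
		for _, f := range changed {
			if strings.HasPrefix(f, p+"/") {
				out = append(out, p)
				break
			}
		}
	}
	return out
}

func main() {
	stages := []Stage{
		{Name: "unit", Run: func() error { return errStageFailed }},
		{Name: "end-to-end", Run: func() error { return nil }},
	}
	ran, err := runStages(stages)
	fmt.Println("ran:", ran, "error:", err)

	changed := []string{"billing/invoice.go", "README.md"}
	fmt.Println("build:", projectsToBuild(changed, []string{"api", "billing"}))
}
```

Putting the cheap stage first means the slow end-to-end stage never starts when the unit tests already show the change is broken.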
I feel like a part of this too is code maintenance over time. The reason you wind up at like “Oh, my CI pipeline is taking like 20 minutes, or an hour”, it’s usually like “Oh, well you didn’t design parallelism into your tests”, or even into your unit tests. I’m definitely guilty of that, where it’s like “Oh, I’m just writing tests”, and I’ve written this code in a way where it’s just like “Oh, it’s using some global state, or whatever, so everything has to run synchronously, one after the other.” And “Oh, I could spend the ten minutes now and fix that, but I don’t feel like I need to do it”, and then three months down the road it’s like everything’s been built up around this concept, and now it’s like “Oh, this is a giant project to remove this global state, so now I just don’t really wanna do it, and we’re just gonna suffer because of it… When I could have spent that 10-20 minutes to have not introduced that global state in the first place.” It always reminds me of those slippery slopes; that first step just makes you slide all the way down.
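A small sketch of the alternative to global state: each check builds its own value, so checks can run concurrently without stepping on each other. The `Store` type and `check` function are hypothetical stand-ins for application state and a test body.

```go
package main

import (
	"fmt"
	"sync"
)

// Store holds state that might otherwise live in a package-level variable.
type Store struct {
	items map[string]int
}

// NewStore returns an empty, independent Store.
func NewStore() *Store { return &Store{items: map[string]int{}} }

// Add increases the count for name by n.
func (s *Store) Add(name string, n int) { s.items[name] += n }

// Count returns the current count for name.
func (s *Store) Count(name string) int { return s.items[name] }

// check stands in for a test body: it builds its own Store, so it never
// observes writes made by another check running at the same time.
func check(n int) bool {
	s := NewStore()
	for i := 0; i < n; i++ {
		s.Add("widget", 1)
	}
	return s.Count("widget") == n
}

func main() {
	var wg sync.WaitGroup
	results := make([]bool, 8)
	for i := range results {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = check(100 * (i + 1))
		}(i)
	}
	wg.Wait()
	fmt.Println(results)
}
```

In a real `_test.go` file, tests written this way can call `t.Parallel()` safely, whereas a shared package-level map would force them to run one after the other.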
Some of them are hard to avoid, too. An example I can give is if you wanna run a test with a real database, then you need to have a database spun up. And spinning up one Postgres database to test with is pretty easy, but you might not wanna run six tests in parallel, because they might interfere with each other. So it’s an easy way to be like “Okay, well this makes sense. We’re just gonna have the one database. And spinning up four is gonna be kind of annoying, so let’s not do that.”
But there are some tools – I think Dockertest can actually help with that, if I recall correctly. I think it can spin up multiple copies of Postgres. I’d have to go look, but I don’t remember.
It used to be one of my demos in the early, early Docker days. I was loading data in a Postgres database, and then doing a Docker commit, and then spinning up like ten containers with that load of the data, because it makes for a cool demo… But then it also kind of muddied up the message a little bit, because you don’t really want to Docker-commit your database data in the container image, except for that kind of scenario… But yeah, there are some interesting things to do there.
Alright. Well, Jérôme, Marko, thank you for joining us. It’s been great talking about CI and CD with you both. Hopefully, everybody else who’s listening had a good experience and learned a lot. We’ll see you next time on Go Time.