Ops Icon


DevOps, infrastructure, etc.
162 Stories
All Topics

Ops matthewtejo.substack.com

Why Twitter didn’t go down (from a real Twitter SRE)

Matthew Tejo:

Twitter supposedly lost around 80% of its work force. What ever the real number is, there are whole teams with out engineers on it now. Yet, the website goes on and the tweets keep coming. This left a lot wondering what exactly was going on with all those engineers and made it seem like it was all just bloat. I’d like to explain my little corner of Twitter (though it wasn’t so little) and some of the work that went on that kept this thing running.

This is a detailed post about Twitter’s caching system that Matthew and others built while working there, and it’s brilliantly summed up by commenter Johnny Manu40:

When everything works fine, they wonder why they hired you. When everything stops working, they wonder why they hired you. I.T. in a nutshell.

Go github.com

A configuration management system for Pets, not Cattle

This is for people who need to administer a handful of machines, all fairly different from each other and all Very Important. Those systems are not Cattle! They’re actually a bit more than Pets. They’re almost Family. For example: a laptop, workstation, and that personal tiny server in Sweden. They are all named after something dear.

Liz Rice and I spoke about pets & cattle recently on The Changelog. I asked her, “from your perspective, are the people still doing it the old-school pets way?”

This tool is a great example of Pets-style administration being alive and well.

Brandon Willett willett.io

How to build software like an SRE

This is awesome — Brandon, please write down more lessons learned, even if they’re just something for you to reference later on.

I’ve been doing this “reliability” stuff for a little while now (~5 years), at companies ranging from about 20 developers to over 2,000. I’ve always cared primarily about the software elements I describe as living “outside” the application – like, how does it get its configuration? What kinds of instances does it run on, and are those the best kinds to use? What steps does it take on its path from “code in a repository” to “running in production”? And I’ve always kept track of what I liked – which mechanisms allowed fast iteration and which caused frustration, which led to outages and which prevented them.

… So! With that out of the way – this is how I’d rebuild it all from scratch if I could.

Lane Wagner wagslane.dev

DevOps: an idea so good, no one admits they don’t do it

Lane Wagner:

The ideas behind the DevOps movements undeniably changed the software development world for the better - but by now, the term “DevOps” has lost all meaning.

According to Lane, some companies believe in DevOps ideas, but are unwilling to invest in changing their ways. Others aren’t convinced that some of the “scarier” ideas are all that helpful.

Ops specbranch.com

Use one big server

A lot of ink is spent on the “monoliths vs. microservices” debate, but the real issue behind this debate is about whether distributed system architecture is worth the developer time and cost overheads. By thinking about the real operational considerations of our systems, we can get some insight into whether we actually need distributed systems for most things.

Scaling up has always been easier than scaling out. It’s amazing what one beefy server can do these days…

Machine Learning github.com

A collection of resources to learn about MLOps

While still in its infancy, MLOps has attracted machine learning engineers and software engineers in general. With every new paradigm comes new challenges and opportunities to learn. In this primer, we highlight a few available resources to upskill and inform yourself on the latest in the world of MLOps.

Good resources, regardless of whether you think MLOps is its own thing or should be rolled into DevOps.

Gergely Orosz newsletter.pragmaticengineer.com

Inside the longest Atlassian outage of all time

Gergely Orosz did an excellent job detailing the ins & outs of Atlassian’s epic outage:

Hundreds of companies have no access to JIRA, Confluence and OpsGenie. What can engineering teams learn from the poor handling of this outage?

The TL;DR on the cause of the outage is a script that was supposed to “mark for deletion” some records also had “permanently delete” functionality and was run against a wrong list of IDs, improperly deleting 400 of their customers. Oh, and their backup restore process is really good at doing all customers, but not a subset. Ruh roh!

Lots to learn here, and Gergely puts a fine point on the biggest takeaways. A must-read!

Ops squeaky.ai

Why we don't use a staging environment

The Squeaky team goes from dev straight to prod (same here), but many people advocate for (and use) staging or other “pre-live” environments instead.

While there are obvious benefits to deploying to different environments, at Squeaky we’ve decided to take a different approach. We only have two environments: our laptops, and production. Once we merge into the main branch, it will be immediately deployed to production.

Perhaps that sounds unusual, but so far it’s outweighed the benefits of pre-live environments, and we believe it’s helping us to ship faster, and lower the number of issues on production. So, I thought I’d write this post to share why we think it works, and why you should consider it too.

Lars Wikman underjord.io

Fundamentals & deployment

Lars Wikman reacts to Gerhard’s excellent conversation with Kelsey Hightower on Ship It!

So he essentially said, I’m interpreting here, that when it comes to deploying software to servers the documented manual steps for deploying something need to be the canonical reference. Then whether you build bash scripts, Ansible playbooks, Makefiles, Dockerfiles, Terraform, Kubernetes or something else to encode that procedure into something repeatable and scalable that’s a separate step. Having documented the process required to set it up means that there is an answer to the question: How do I get this running? An answer that doesn’t require you to parse the .yml files or grok Ansible roles and groups.

Lars springs forward from there with many thoughts of his own on the matter.

Rich Burroughs loft.sh

7 open source cloud native tools that aren’t Kubernetes

Rich Burroughs:

When you hear the phrase “cloud native,” is Kubernetes the first thing that comes to your mind? It is for me, and I expect I’m not alone. Kubernetes is now the second-largest open source project after Linux, and it’s the big fish in the cloud native pond. But there are many other projects in the CNCF landscape and the broader cloud native community.

So, I thought I’d list some cloud native tools that can be very useful for teams that aren’t using Kubernetes or aren’t using it for every workload. Here are 7 of them that I like a lot.

If Rich’s name rings a bell, that’s because he was just on Ship It! last week. 😉

The New Stack Icon The New Stack

Will Grafana become easier to use in 2022?

B. Cameron Gain on The New Stack:

Despite an ample amount of documentation and demos made available by Grafana Labs and community members, Grafana can be a challenge to set up (although those that do get its dashboards working generally sing its praises). Many manual configurations and steps are required when installing the different dashboard options. Once installed, many users can be overwhelmed with the number of logs and other data to process for monitoring and observability.

Grafana sure does produce pretty (useful) dashboards 👇, but I do find it overwhelming at times.

Will Grafana become easier to use in 2022?

Nora Jones changelog.com/posts

“Incident” shouldn’t be a four-letter word

We truly believe that incident analysis can be your organization’s secret weapon that will allow you to gain value from your incidents, but we know getting started can be a daunting task. We’ve been in your shoes and we’ve seen and heard how excruciatingly intimidating it is for many engineers to lead an incident review. This guide is your toolbox, packed with practical, easy-to-adopt strategies for getting you set up to do your first one.

The New Stack Icon The New Stack

Wait, do we need to hold up on GitOps?

Eric Gregory asks (and answers) himself a question on The New Stack:

For years now, blogs, webinars and white papers have opined that GitOps is the Next Big Thing, yet here a respected voice in the field is saying to tread carefully. So what gives? Do we need to pump the brakes? Is GitOps just a lot of unwarranted hype? Or is there a missing piece of the puzzle here? As in so many things, the answer is: It’s complicated. GitOps can be transformative for some teams, but it’s not a one-size-fits-all solution.

Ops nomadproject.io

Nomad vs. Kubernetes

This page is built by the Nomad folks, so keep that in mind when reading through the comparison;

Kubernetes is an orchestration system for containers originally designed by Google, now governed by the Cloud Native Computing Foundation (CNCF) and developed by Google, Red Hat, and many others. Kubernetes and Nomad support similar core use cases for application deployment and management, but they differ in a few key ways. Kubernetes aims to provide all the features needed to run Linux container-based applications including cluster management, scheduling, service discovery, monitoring, secrets management and more. Nomad only aims to focus on cluster management and scheduling and is designed with the Unix philosophy of having a small scope while composing with tools like Consul for service discovery/service mesh and Vault for secret management.

I’m just excited to see strong competition in this space, and had never heard of Nomad prior to today. If you’ve used it and have experience/opinions, I’d love to hear ’em!

Docker gitlab.com

Harbormaster – easily deploy many Docker-Compose apps on a single host

Here’s their pitch:

Do you have a home server you want to run a few apps on, but don’t want everything to
break every time you upgrade the OS? Do you want automatic updates but don’t want to buy
an extra 4 servers so you can run Kubernetes?

Do you have a work server that you want to run a few small services on, but don’t want
to have to manually manage it? Do you find that having every deployment action be in
a git repo more tidy?

Harbormaster is for you.

You create a YAML config file with all the git repos you want it to include and it’ll watch them for changes (on a timer) and do the necessary cloning/pulling, service restarting, etc. that needs doing to make it all run. Simple. Neat!

Career iximiuz.com

DevOps, SRE, and Platform Engineering

Ivan Velichko:

I compiled this thread on Twitter, and all of a sudden, it got quite some attention. So here, I’ll try to elaborate on the topic a bit more. Maybe it would be helpful for someone trying to make a career decision or just improve general understanding of the most hyped titles in the industry.

Titles come and go, and it’s worth knowing which ones are coming and which ones are going. This article is a good place to catch up if you haven’t been tracking. Oh, and there’s a pod for that too. 😉

  0:00 / 0:00