Ops


DevOps, infrastructure, etc.
134 Stories

Machine Learning github.com

A collection of resources to learn about MLOps

While still in its infancy, MLOps has attracted machine learning engineers and software engineers in general. With every new paradigm comes new challenges and opportunities to learn. In this primer, we highlight a few available resources to upskill and inform yourself on the latest in the world of MLOps.

Good resources, regardless of whether you think MLOps is its own thing or should be rolled into DevOps.

Gergely Orosz newsletter.pragmaticengineer.com

Inside the longest Atlassian outage of all time

Gergely Orosz did an excellent job detailing the ins & outs of Atlassian’s epic outage:

Hundreds of companies have no access to JIRA, Confluence and OpsGenie. What can engineering teams learn from the poor handling of this outage?

The TL;DR on the cause of the outage: a script that was supposed to “mark for deletion” some records also had “permanently delete” functionality, and it was run against the wrong list of IDs, improperly deleting 400 of their customers. Oh, and their backup restore process is really good at restoring all customers at once, but not a subset. Ruh roh!

Lots to learn here, and Gergely puts a fine point on the biggest takeaways. A must-read!
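
To make the failure mode concrete, here is a minimal sketch of a deletion helper that defaults to the safe path. It is not Atlassian's script: `Record`, `Store`, `delete_records`, the `kind` check, and the batch limit are all hypothetical names and assumptions chosen for illustration. The idea is that soft delete and dry run are the defaults, and the whole batch is validated before anything is touched.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Record:
    record_id: str
    kind: str  # hypothetical category, e.g. "app" or "site"
    marked_for_deletion: bool = False


@dataclass
class Store:
    records: dict[str, Record] = field(default_factory=dict)

    def add(self, record: Record) -> None:
        self.records[record.record_id] = record


def delete_records(store, ids, expected_kind, permanent=False,
                   dry_run=True, max_batch=50):
    """Validate the whole batch first, then soft-delete unless told otherwise.

    Returns the list of IDs that were (or, in a dry run, would be) affected.
    """
    if len(ids) > max_batch:
        raise ValueError(f"refusing to touch {len(ids)} records (limit {max_batch})")

    unknown = [i for i in ids if i not in store.records]
    wrong_kind = [i for i in ids
                  if i in store.records and store.records[i].kind != expected_kind]
    if unknown or wrong_kind:
        # Fail before any record is modified, so a bad list changes nothing.
        raise ValueError(f"bad ID list: unknown={unknown}, wrong_kind={wrong_kind}")

    if dry_run:
        return list(ids)

    for i in ids:
        if permanent:
            del store.records[i]  # irreversible path, must be requested explicitly
        else:
            store.records[i].marked_for_deletion = True  # reversible default
    return list(ids)
```

With these defaults, a caller has to pass both `dry_run=False` and `permanent=True` to destroy data, and a list containing IDs of the wrong kind is rejected as a whole.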

Ops squeaky.ai

Why we don't use a staging environment

The Squeaky team goes from dev straight to prod (same here), but many people advocate for (and use) staging or other “pre-live” environments instead.

While there are obvious benefits to deploying to different environments, at Squeaky we’ve decided to take a different approach. We only have two environments: our laptops, and production. Once we merge into the main branch, it will be immediately deployed to production.

Perhaps that sounds unusual, but so far it’s outweighed the benefits of pre-live environments, and we believe it’s helping us to ship faster, and lower the number of issues on production. So, I thought I’d write this post to share why we think it works, and why you should consider it too.

Lars Wikman underjord.io

Fundamentals & deployment

Lars Wikman reacts to Gerhard’s excellent conversation with Kelsey Hightower on Ship It!

So he essentially said, I’m interpreting here, that when it comes to deploying software to servers the documented manual steps for deploying something need to be the canonical reference. Then whether you build bash scripts, Ansible playbooks, Makefiles, Dockerfiles, Terraform, Kubernetes or something else to encode that procedure into something repeatable and scalable that’s a separate step. Having documented the process required to set it up means that there is an answer to the question: How do I get this running? An answer that doesn’t require you to parse the .yml files or grok Ansible roles and groups.

Lars springs forward from there with many thoughts of his own on the matter.

Rich Burroughs loft.sh

7 open source cloud native tools that aren’t Kubernetes

Rich Burroughs:

When you hear the phrase “cloud native,” is Kubernetes the first thing that comes to your mind? It is for me, and I expect I’m not alone. Kubernetes is now the second-largest open source project after Linux, and it’s the big fish in the cloud native pond. But there are many other projects in the CNCF landscape and the broader cloud native community.

So, I thought I’d list some cloud native tools that can be very useful for teams that aren’t using Kubernetes or aren’t using it for every workload. Here are 7 of them that I like a lot.

If Rich’s name rings a bell, that’s because he was just on Ship It! last week. 😉

The New Stack

Will Grafana become easier to use in 2022?

B. Cameron Gain on The New Stack:

Despite an ample amount of documentation and demos made available by Grafana Labs and community members, Grafana can be a challenge to set up (although those that do get its dashboards working generally sing its praises). Many manual configurations and steps are required when installing the different dashboard options. Once installed, many users can be overwhelmed with the number of logs and other data to process for monitoring and observability.

Grafana sure does produce pretty (useful) dashboards 👇, but I do find it overwhelming at times.

Nora Jones changelog.com/posts

“Incident” shouldn’t be a four-letter word

We truly believe that incident analysis can be your organization’s secret weapon that will allow you to gain value from your incidents, but we know getting started can be a daunting task. We’ve been in your shoes and we’ve seen and heard how excruciatingly intimidating it is for many engineers to lead an incident review. This guide is your toolbox, packed with practical, easy-to-adopt strategies for getting you set up to do your first one.

The New Stack

Wait, do we need to hold up on GitOps?

Eric Gregory asks (and answers) himself a question on The New Stack:

For years now, blogs, webinars and white papers have opined that GitOps is the Next Big Thing, yet here a respected voice in the field is saying to tread carefully. So what gives? Do we need to pump the brakes? Is GitOps just a lot of unwarranted hype? Or is there a missing piece of the puzzle here? As in so many things, the answer is: It’s complicated. GitOps can be transformative for some teams, but it’s not a one-size-fits-all solution.

Ops nomadproject.io

Nomad vs. Kubernetes

This page is built by the Nomad folks, so keep that in mind when reading through the comparison:

Kubernetes is an orchestration system for containers originally designed by Google, now governed by the Cloud Native Computing Foundation (CNCF) and developed by Google, Red Hat, and many others. Kubernetes and Nomad support similar core use cases for application deployment and management, but they differ in a few key ways. Kubernetes aims to provide all the features needed to run Linux container-based applications including cluster management, scheduling, service discovery, monitoring, secrets management and more. Nomad only aims to focus on cluster management and scheduling and is designed with the Unix philosophy of having a small scope while composing with tools like Consul for service discovery/service mesh and Vault for secret management.

I’m just excited to see strong competition in this space, and had never heard of Nomad prior to today. If you’ve used it and have experience/opinions, I’d love to hear ’em!

Docker gitlab.com

Harbormaster – easily deploy many Docker-Compose apps on a single host

Here’s their pitch:

Do you have a home server you want to run a few apps on, but don’t want everything to break every time you upgrade the OS? Do you want automatic updates but don’t want to buy an extra 4 servers so you can run Kubernetes?

Do you have a work server that you want to run a few small services on, but don’t want to have to manually manage it? Do you find that having every deployment action be in a git repo more tidy?

Harbormaster is for you.

You create a YAML config file with all the git repos you want it to include and it’ll watch them for changes (on a timer) and do the necessary cloning/pulling, service restarting, etc. that needs doing to make it all run. Simple. Neat!
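
The watch-on-a-timer idea can be sketched in a few lines. This is not Harbormaster's code or config format: `App`, `plan_actions`, and the action names are hypothetical, and the revision lookups are passed in as plain dictionaries so the decision logic stays separate from any real `git` or Docker Compose calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class App:
    name: str
    repo_url: str


def plan_actions(apps, deployed_revs, remote_revs):
    """Decide what to do for each app on one polling tick.

    deployed_revs / remote_revs map app name -> commit hash (missing if unknown).
    Returns a list of (app name, action) pairs.
    """
    actions = []
    for app in apps:
        remote = remote_revs.get(app.name)
        deployed = deployed_revs.get(app.name)
        if remote is None:
            actions.append((app.name, "skip"))  # remote unreachable this tick
        elif deployed is None:
            actions.append((app.name, "clone"))  # first time we see this app
        elif deployed != remote:
            actions.append((app.name, "pull_and_restart"))
        else:
            actions.append((app.name, "noop"))  # already up to date
    return actions
```

A real loop would call something like this on a timer, carry out each action, and record the new revision once the service has restarted.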

Ivan Velichko iximiuz.com

DevOps, SRE, and Platform Engineering

Ivan Velichko:

I compiled this thread on Twitter, and all of a sudden, it got quite some attention. So here, I’ll try to elaborate on the topic a bit more. Maybe it would be helpful for someone trying to make a career decision or just improve general understanding of the most hyped titles in the industry.

Titles come and go, and it’s worth knowing which ones are coming and which ones are going. This article is a good place to catch up if you haven’t been tracking. Oh, and there’s a pod for that too. 😉

Zach Bloomquist zach.bloomqu.ist

Reliable, deliverable, self-hosted email

This sounds too good to be true, because it kind of is. There is no escaping the cloud (because of email trust) or the requirement of sysadmin’ing this setup (sending/receiving email is critical). If you slack on the details or upkeep, it’s your email.

I have been on an ongoing quest to free myself from cloud services for years now. During this time, I have hosted my personal email (@bloomqu.ist) on a Google Workspace (formerly Google Apps, then G Suite) account, which, while convenient, also means that my personal emails are at the whims of one of the world’s most privacy-hostile companies.

Don’t get me wrong – what Zach shared is quite possible, but it’s still too time-consuming and difficult to host your own email. It’s untenable long-term. There’s a billion-dollar business waiting for someone to seriously compete with Google on email, and not be evil. Fastmail comes to mind. I could be wrong, but I would characterize them as an alternative, not a serious competitor to Google.

Ops incident.io

Incidents are for everyone

A perspective on incidents that makes a lot of sense actually, and captures the “Why?” perfectly. My highlights:

  1. Incidents involve more people than we think. Tooling just makes it really hard for them to help.
  2. We have more incidents than we realise. We just don’t hear about them.
  3. Your whole team, on the same team.
  4. Practice makes perfect.

Ops tech.channable.com

Nix is the ultimate DevOps toolkit

At Channable we use Nix to build and deploy our services and to manage our development environments. This was not always the case: in the past we used a combination of ecosystem-specific tools and custom scripts to glue them together. Consolidating everything with Nix has helped us standardize development and deployment workflows, eliminate “works on my machine”-problems, and avoid unnecessary rebuilds. In this post we want to share what problems we encountered before adopting Nix, how Nix solves those, and how we gradually introduced Nix into our workflows.

If Nix is intriguing to you, you’re going to love an upcoming episode of The Changelog. 😉

HackerNoon

Why ML in production is (still) broken and ways we can fix it

Hamza Tahir on HackerNoon:

By now, chances are you’ve read the famous paper about hidden technical debt by Sculley et al. from 2015. As a field, we have accepted that the actual share of Machine Learning is only a fraction of the work going into successful ML projects. The resulting complexity, especially in the transition to “live” environments, lead to large amounts of failed ML projects never reaching production.

Productionizing ML workflows has been a trending topic on Practical AI lately…

Machine Learning huyenchip.com

The MLOps tooling landscape in early 2021 (284 tools)

Chip Huyen:

While looking for these MLOps tools, I discovered some interesting points about the MLOps landscape:

  1. Increasing focus on deployment
  2. The Bay Area is still the epicenter of machine learning, but not the only hub
  3. MLOps infrastructures in the US and China are diverging
  4. More interests in machine learning production from academia

If MLOps is new to you, Practical AI did a deep dive on the topic that will help you sort it out. Or if you’d prefer a shallow dive… just watch this.

Gerhard Lazu changelog.com/posts

The new changelog.com setup for 2020

In this post I share the latest 2020 and beyond details for changelog.com’s infrastructure.

Why Kubernetes? How is Kubernetes simpler than what we had before? What was our journey to running production on Kubernetes? What worked well? What could have been better? What comes next for changelog.com? Read this post and listen to episode #419 to learn all the details.