Changelog Interviews – Episode #295

Scaling all the things at Slack

with Julia Grace

Julia Grace joined the show to talk about scaling all the things at Slack. Julia is currently the Senior Director of Infrastructure Engineering at Slack, and has been there since 2015, so she’s seen Slack during its hyper-growth. We talked about Slack’s growth and scale challenges, scaling engineering teams, the responsibilities and challenges of being a manager, communicating up and communicating down, quality of service and reliability, and what it takes to build high performing leadership teams.

Sponsors

Airbrake – Airbrake is an exception reporting service, currently providing error monitoring for 50,000 applications with support for 18 programming languages.

DigitalOcean – DigitalOcean is simplicity at scale. Whether your business is running one virtual machine or ten thousand, DigitalOcean gets out of your way so your team can build, deploy, and scale faster and more efficiently. New accounts get $100 in credit to use in your first 60 days.

GoCD – GoCD is an on-premise open source continuous delivery server created by ThoughtWorks that lets you automate and streamline your build-test-release cycle for reliable, continuous delivery of your product.

O'Reilly Velocity Conference – Future-proof your systems and yourself. Learn about performance, monitoring and observability, scalability, serverless, security, and leadership. Use the discount code CHANGELOG to get 20% off Gold, Silver, and Bronze passes. Location and dates: San Jose, California, June 11-14. Learn more - oreil.ly/2J3gCBP

Transcript

So we’re here on the leadership frontlines - I love that line in your summary for your session at Velocity, Julia.

Thank you.

We’re here on the *leadership* frontlines, and I think one thing we talk about a lot in engineering and software development is scaling… But you often think about scaling software, not so much scaling teams. From my point of view - and I’m sure Jerod will agree - Slack has been in this constant scale motion. You’ve never been able to just kind of like chill out in some sort of infrastructure setup; you’ve always been scaling…

It is true…

…so this conversation is essentially about scaling all the things. What do you think?

When I interviewed at Slack two and a half years ago - in a few months it will be three years - I was told by Cal Henderson, our CTO, he said “We don’t have anyone for you to manage, but we’re gonna hire some people and we’ll figure it out.” And by the time that had elapsed between when I interviewed and when I started, I think we had hired five more people, and so my first day I was managing a team of about seven engineers, I think, at that time… Seven incredible front-end engineers, because we needed someone to manage front-end engineers, and I would never characterize myself as a front-end engineer; I’m much more of a back-end engineer, which means I have huge respect for folks who do front-end work.

So I come on board, I’m really forced into a fascinating situation where I was not the subject matter expert, as I had said earlier. I’m definitely a back-end person, so I really had to focus on becoming a great manager, because when I would for example look at pull requests, I didn’t know if some of the code that was being written - if those were the right architectural decisions… So I really had to defer to the team, and I really had to get really good at asking a lot of questions.

[04:26] I actually did read a lot of code. I learned a lot about JavaScript in that initial period. And then if you fast-forward from the seven front-end engineers two and a half years ago, I now run an organization that didn’t even exist 18 months ago (infrastructure), that has 75 people. That’s 10x growth in two-and-a-half years, and over that time, every six months, my job would totally change - from managing front-end engineers, to managing both front-end and back-end, to managing a junior manager, to leading the infrastructure organization, which was small at the time, which then grew from 12, to 50, to now 75, and so… I look back two-and-a-half years ago and I barely recognize the job I used to have, because we’ve grown so much.

We have a very small team here, Julia, and I’ve always been on small teams; I’ve never even been on a team of 70, let alone managed one… So I hear that number and I’m just immediately overwhelmed, I start sweating, my hands are getting a little sweaty just thinking about the responsibility – and when you first started, two-and-a-half or three years ago, it wasn’t very long, and you had a team of seven. Do you sleep well at night? Do you feel like you have the weight of the world on your shoulders? That’s just a lot of people…

Yeah, I agree with that.

So I have learned a lot over that time. I feel as a leader that the only way that I can be successful is by having an incredible - and having the privilege of having an incredible - team. With that comes hiring amazing people, so that I can sleep well at night. When you think about the number 75, you think about 75 humans - that’s a lot of folks, but the organization is divided in different ways, and I have incredible leaders, that again, I have the privilege of working with, who lead some of those sub-organizations.

Part of growing fast is learning how to delegate and give things away really rapidly. There were so many things I worked on in the early days. An example would be hiring processes, and how we hire (especially at the time) front-end engineers, because that was my jam. Now I’ve given that away to somebody who did additional iterations, made it even better, and then they grew and scaled in their role, and they gave it away to someone else. And I say that in a wonderful way, where we’re always iterating and growing and changing and evolving on everything that we do.

The really big challenge that I’ve found is you have to learn how to do that for yourself, because growing – in a hypergrowth company, you have to grow and evolve with the company, and that is one of the hardest things, and having the mindset of like every day I go to work and I do things I don’t know how to do, that I’ve never done before, and then I get reasonably good at them and then I hand them off to someone else who they’ve never done it before, but they can pick up where I left off and make it even better. The hiring process is one example of that sort of thing.

[08:12] Also, how do we communicate as a 75-person organization, how do we propagate knowledge, how do we propagate decisions - and that means both top-down, like from myself and my boss, but also bottoms-up, from the engineering frontlines, the critical decisions that we’re making, propagating that up to myself and then my boss.

It’s interesting to hear how – you know, in hypergrowth companies, to take on the role(s) over time, you really have to accept that… Or maybe you would learn in the book Who Moved My Cheese the idea of change. You cannot be a fearful person of change, because it’s inevitable; I think it’s probably the case in most lives anyways, but even more so in a hypergrowth company where you’ve got to accept change. And if you’re the kind of person that can’t deal with change that rapidly, maybe it’s not for you.

Absolutely. I mean, I do think that just like every company isn’t for everyone, there’s folks who are attracted to that high velocity of change, and then there are also individuals for very legitimate and understandable reasons where it may not be the right environment.

One of the things I always say with my team, especially given we’ve just moved to a new office building, I was running around, trying to find all the conference rooms on the various floors… The thing we say now is the only thing constant at Slack is change. But that doesn’t necessarily mean volatility. It doesn’t mean things are hectic and scary. Instead, it means that we’re always trying to learn and iterate and grow and learn how to do things better, and I myself as a leader am always trying to figure out what are the things that I’m dropping on the floor, what are the areas that I need to improve, and a big part of being able to do that is creating a safe and inclusive culture so that people can provide you with feedback, because the only way that you’ll be able to learn and grow really rapidly is with really excellent high bandwidth feedback from the people below you, my peers in my case, and from upper management.

I think it’s so true, that point about the only constant is change at Slack as a software company, and I think it can be applied to anybody who’s writing software or running businesses on software. That’s the only thing that we know is gonna happen, is that things are going to change, and that we don’t know as much right now as we’re gonna know later… So we build our systems and we design things in order that they can change - malleable, as opposed to rigid - and we do as little as we can now, because we’re gonna be smarter later and we can make wiser decisions later.

So you’re in a senior role now, right?

I am. I mean, it depends on how you define senior.

Do you have it in your title? [laughter]

I do have it in my title, thus it must be true.

The point I’m getting at is so if you’re a senior now, were you always senior? Is this new for you? And maybe share some points along your path of like scaling you from someone who wasn’t senior, the things you’ve learned and the things you’ve had to endure to get to a senior role and some of the responsibilities you hold day to day.

Absolutely. So I have definitely not always been in a senior role, and let me tell you, it has been a long and fantastic journey to get here, that was never always up and to the right. My career has taken many different twists and turns, and I’ve tried out product management, and I’ve tried founding a company… So I’ve done all kinds of things and learned so many things along the way.

At Slack - it’s funny, when I joined I was a senior engineering manager, so maybe it comes full circle. Then I transitioned into an engineering director when I started running infrastructure, and I’m now a senior director, so I got that senior back.

[12:25] In the beginning, when I was managing that team of seven front-end engineers, I was – and again, not hands-on from an “I was writing code”, because as we’ve talked about, I wasn’t the right person to be making the technical decisions, although I can understand the technology quite well and quite quickly… But I knew what the team was working on, I knew the challenges, I knew with a very high degree what was coming next for them. Engineering was much smaller then at Slack, it was less than 100 people. The group that I was in had about 25 engineers, so I knew what our larger plan was for all those 25 engineers.

I would often sit in meetings that were talking about – and again, as a manager, I do attend a lot of meetings. The goal of opt-in attending a lot of those meetings is to gather information, and to also see when people are blocked and how I can help them, and how I can also help transmit information throughout the organization.

So I would sit in meetings that would be talking about things at the feature level, and as I transitioned to lead infrastructure, one of the things that happened was – this was a brand new engineering organization. When engineering teams get big enough, you have to subdivide them in some sort of logical way, but always knowing that org structures and how you divide - that’s a very hard problem.

We had had a logical division there of how we would divide it, and now I was running this new organization… So I had this really exciting, unique opportunity to figure out “Well, what is the mission and what is the vision and what is the strategy for infrastructure?” So instead of thinking necessarily about the feature level that I had before, and the vision and the larger plan being set by the senior directors and VPs in that previous engineering organization, I was thus in those shoes.

So I had to figure out what are the current challenges with our infrastructure, how are we scaling right now, what’s breaking, and how are we gonna scale through the next huge jump and growth in our user base? What are things that are important for us to work on, but not urgent? What are the fires that are burning? So I really had to deeply understand from this infrastructure perspective what was going on, and I had to create a compelling vision that resonated not only with the engineers, but with the senior executives - the CTO, the VP of engineering, even the CEO, Stewart Butterfield; I presented this vision to him, as well. So it moved from feature level, again, to all of infrastructure.

Now, as a senior director, as my boss likes to tell me - my boss is Michael Lopp, who many of you may know him on the internet as Rands - my role is not only to stay involved in infrastructure… I mean, I love this team; I feel like it is such an incredible, incredible organization… But to think about all of engineering, and the company as well.

[15:53] When I joined, around two-and-a-half years ago, engineering was around 100 people, and now I think we’re at around 350 people. So thinking at that larger scale, thinking about how we make decisions that impact across all of the organizations and impact other places of the company… So it’s all about leveling up the scale at which you’re thinking about, and when you do that, you can then have even greater impact. But one of the hardest challenges with that is that now you need to influence – and again, I deeply think… One of the most profound lessons that I learned in my career was when I became a product manager and I had to learn how to influence people (people being the engineers) when I was not their manager and I did not have explicit authority to tell them what to do.

The higher up that you go in management, your job is all about influence. Ultimately, the engineers in my organization and other organizations - they decide what code they’re going to write that day and what code they’re not going to write that day. They make all of the decisions. Now, I try to influence those decisions by giving them additional context, by giving them background, by talking about why what they’re working on is so important… But at the end of the day, they decide their destiny, and I am there to help support and guide them. And the higher you go up in an organization, you have to be able to influence even more people in the organization, and that’s incredibly difficult to do.

That last part about the dream part is something that resonates with me. I’ve played the role of product manager for a bit, and that’s such a truth - to be able to influence someone, you have to be able to share with them a dream to strive for. And when you don’t have that explicit control over their day-to-day code they can write, or even manage them to guide them that way every step of the way, you have to be able to kind of cast some sort of vision or dream for them to follow, because otherwise they’re just gonna do what they have to do to ship code and keep it simple.

Absolutely. You have to inspire and compel folks to be aligned with where you think the organization should go, with the exciting challenges… You need to be able to craft a message that really resonates with the team.

So Julia, as I listen to you talk and I’m trying to have takeaways of like what makes a great leader, and I’m always thinking in the context of a developer, and like “What makes a great developer/leader?” or “What turns a great developer into a great leader?” and I’m thinking about your points about a) communication - I think that’s an obvious one, and definitely the most paramount thing. If I think about the best developers I’ve met, and we’ve interviewed a lot of them on the show, like “What makes them stand out?” a) their ability to communicate, absolutely. Communicate their thoughts.

So let’s set that one aside, and say aside from communication, the next thing that I think of with great developers is their ability to kind of inhabit an entire system. Like, the more of a system you can keep in your head, the whole thing, holistically, the better you are as a developer, I believe, and I’ve found.

So what you were talking about was really even transitioning from developer to leader, or holding both roles, is the ability to speak about that system at different levels, to communicate about it, to speak up to people either above you in the organization, or to speak down to people below you or in your employ… And that to me sounds like you have to be able to inhabit the system maybe even at different levels, like conceptually, in big picture, small picture. Is that the case?

Okay, and is that a learned skill, or is that just a natural thing? How do you get to be able to do that?

I love it, great question. I am very much a systems thinker. Before the ship sailed on my coding career three years ago when I stopped writing code day to day, I am always thinking about systems and how systems interact with other systems.

Now, the way that I employ that systems thinking now is I’m thinking about systems of people, and how other people communicate with other people. So just like different systems have API contracts and they have different protocols with which they talk to each other, humans are the same way - they have different preferences for how people talk with them, the vocabulary that they use, that maps really nicely to the different protocols… And I absolutely feel like these are learned skills.

In the – oh my goodness… Without revealing my age, in the many, many years, like over 15 years that I’ve been a post-graduate - I did my graduate work in computer science, and that was a long time ago, and in the time that I’ve been a professional software developer, and now a manager, I have learned these skills. I also really deeply believe that anyone can learn anything, and that if given the right environment and the right teachers, people can absolutely rise to the occasion.

And we can talk a little bit more about that, but I think this goes to the – you had started the question around “How do I employ those skills now?” So as I think about the systems of people, and I think about the different relationships, I’m also thinking about “How can I communicate in a way to those different audiences?” The analogy in software is then like “What language do I need to speak to this system?” or “What vocabulary do I need to use to this human?” If I speak too fast, I might need to get rate-limited, so… There’s so many analogies of how humans interact with some of the systems, and I don’t mean to say that in a robotic way, but we all have preferences, and if you can understand someone’s preferences - and that’s I think a really important part of leadership, is building that rapport and that connection with people, so you can understand their preferences. Because if I start sending requests to a system in the wrong language or malformed requests, they’re gonna be thrown away or I’m gonna get error codes. So in this world, I need to deeply understand the people, just like I need to deeply understand the systems.

[24:20] So that goes back to what makes, I think, not only great developers, but also great leaders, and I think it’s important to note that the skills that make really senior developers great are often the same skills that make really senior managers great. You can be a leader in an organization if you’re a manager - or if you’re not a manager; if you’re an individual contributor. Leadership comes from everywhere. But really great leaders, whether it be managers or individual contributors - they’re really fantastic, as you highlighted, communicators. They know the protocols, the vocabularies, the error codes, the exit statuses - they know all these things, but they then can level up the people around them; they can grow them by being able to teach them new things, so they can help the (let’s say) next generation, the more junior developers, the more junior managers.

The more senior folks teach them about the systems, the mental models that they’ve developed about how the humans interact or how the systems interact… And as those junior folks learn and grow and they tackle problems, they then can refine and grow their own mental model about how these systems and humans interact with one another.

It’s interesting… I mean, I agree with you specifically on the overlap in skills between a great developer or a great leader, or perhaps manager, if that’s the same person… We do have this idea of the Peter principle in management where people tend to get - and I’ll just summarize it; not the exact principle, but the way I think of it is people often get promoted to their level of incompetency, right? They’re very good at this thing, so therefore a promotion comes and moves them into a role that they’re not good at, and that’s unfortunate at the time, because they were really good at the other thing, but they need this new position in order to achieve a salary raise, or something like that.

This happens a lot with developers, like you’re a great developer, and all of a sudden now you’re a manager of developers, and even though there’s overlap in those skills, that doesn’t mean you immediately recognize or can apply – arguably, software systems are easier to understand than people are, right? Like, we’re way more complicated in many ways.

People are non-deterministic…

Yeah! So do you have tips and tricks, or thoughts on developers who find themselves in the position of manager or a leader, and all of a sudden they feel like they don’t have the chops to thrive in that position?

Well, I would say – I have so many different thoughts… I think first if you’re – I do view management as a different job. Not a better job, not a worse job, but a different job. So the challenge is, as a developer, do you want to change your job? And if the answer is no, you love your job, you love what you do, and you want to continue growing, then hopefully you can, whether that be through promotion or through projects.

Assume a role where you’re able to teach a larger number of people, lead them through like a tech lead position, something along those lines. I think it’s very important that companies have a track where very senior technologists do not have to become managers. Because again, I do very deeply feel like it is a different job. Sometimes, when I talk about what I do every day, which if you want to know the one-sentence version, it is “I read and write English documents”, that is what I do all day… [laughter] You know, a lot of developers would say, “Well, I don’t wanna do that. I wanna continue to program.”

[28:09] I do read code occasionally, but those times are fewer and farther between. So if you want to enter a world where you’re spending more time in Google Docs than in Emacs or VI, then potentially management is for you. But in order to make that transition, it’s so important that your organization can support you with training, with mentorship, with people who can give you feedback so you can learn and grow, because you’re doing a totally new and different job.

At Slack, I’ve managed many people who tried out management, managing a small team. Very senior technologists; technologists who had been programming professionally for 15-20 years, they wanted to try management. And I do think that it is important if people do want to try it, and their intentions are pure, meaning their intention is not because they want more power, but they want to really foster and grow and help others… We tried it out, and some of those individuals have been exceptional managers. And some of those folks have realized that this is not their calling… And huge hats off to them, because it can be so hard once you embark down a different path to realize that it’s not for you.

So some of those individuals have transitioned back to individual contributor roles, and I really wanna highlight, it’s not a demotion, it’s a transition back to a different job, to a job where they’re fantastic. So in large part, some of the folks on my management team used to be very senior engineers, and some of them transitioned to IC roles, and others have been managers for a decade or more. So I think it’s important at a company to also be able to provide the no-penalty, so to speak, for transitioning back to the role if you realize that it’s not for you.

Just as a point of clarification, when you say IC roles, what are you referring to?

Oh, I’m sorry, individual contributor roles… So roles where you’re not managing people.

So it’s very important to emphasize what you said there when you go back to an IC role, that this isn’t like a demotion, or this isn’t a step backwards in your career, because management is a different job, but not necessarily a higher calling, so to speak. Is that what you’re trying to say?

Yes. It is not a better job, it is not a more powerful job, it is not a more prestigious job, it’s only–

Just different.

I’m sitting here thinking about this and I’m like, I’m that kind of person where I naturally by my talents just gravitate towards management type roles, because I can be in an individual contributor role, and I will just naturally wanna lead. It’s just something that comes out, it’s not something I’m like “I’ve gotta…” It’s just my DNA. And if you put me in an individual contributor role, I’m probably depressed, or aloof, or not that invested… But if you give me an opportunity to influence and change, and determine vision and where we’re gonna go - that’s where I thrive.

I imagine there’s a lot of developers out there who are like that as well, so how do you keep being a programmer, but then also leverage those skills, too?

[31:50] I completely understand. To give you a quick story, I have a daughter and she’s three, and we go to music class every Saturday, and there’s – at times, my daughter is very aware of who’s following the rules and who’s not… So I was very self-conscious about this, because she likes to correct other people and she likes to stand up there and be in charge - I say that in a positive way - and the teacher said “There’s always a supervisor in every class.”

And I think, good for you to acknowledge that this is where you can be most successful and where you’re deriving the most value for yourself and for others. It can be often difficult to know… Especially early in my career, I didn’t know what were the environments and the situations and the ways where I really was able to thrive, so good for you for knowing that.

This goes back to the notion of even though management and individual contributor (development, programmer) roles - even though those are different jobs, the higher up that you go in both of them, the skills to some degree converge, where instead of you building the features, you’re leading, communicating, growing, teaching others.

So instead of me communicating vision and strategy through written English words and presentations and PowerPoint decks, some of the very senior principal engineers and architects in my organization - they don’t manage people. They do write a lot of English documents about how we’re gonna build a super complex, difficult feature, but then they’ll also lead discussions around different approaches.

One of the things that we do here at Slack is we have what’s called the Software Design Workshop, and any engineer in the organization - junior or senior - can write up a technical document about how they’re going to approach a feature. Then they bring it to the workshop, and people opt in if it’s a topic they’re interested in, because again, our engineering organization is quite large… But people can come, and then we have a discussion - spirited, and I think fantastic discussion about how should we build this? Are there any interesting edge cases? And those discussions - I think actually this is really important - are led by other engineers, and not led by managers. I don’t actually know that much anymore about what the edge cases are… But let me tell you, some of the very senior engineers in our organization - you bet they do, because they’ve been around for a while, and they’ve probably seen the different patterns and they’ve seen systems fail, and they have really great mental models for our systems.

So they then help facilitate the discussion, which leads to the why communication is so important… They facilitate discussion, ask questions, but ensure that the engineers presenting - who might be senior or might be junior - have a safe space to present their ideas and walk away feeling like they’ve learned something.

So I think that’s the – there’s still such a huge need for very senior developers and programmers in organizations, because let’s face it, sometimes those developers and programmers, especially in senior roles, have more credibility than the managers do, because they’re in the trenches, on the front lines sometimes with the other engineers, and I’m no longer in the trenches. So I ask questions, but I’m not there debugging if we have some sort of incident… The engineers are doing that.

So Julia, you said “In the trenches” - now you’re really speaking our language. These are common idioms and phrases that Adam and I often use. So as a manager, as a leader on the management side, as you said, you don’t have that day-to-day in the trenches – you’re day-to-day, but you’re not in the debugger; how do you keep your street cred with your teams? How do you stay relevant and not become the Pointy-haired Boss that is so laughed at in the Dilbert comics?

[36:11] I ask a lot of questions. Early in my management career - and I see this with a lot of more junior managers… They think their job is to have all the answers, and they think that their job, like a la Dilbert, is to tell people what to do. I very much believe that is not my job. I ask a lot of questions because I don’t have all the answers.

Ideally, there are very few hard decisions honestly that I’m making, because I’ve created an environment with my team where they have the context, they know what we’re building, they know why we’re building it, they know why it’s important, and then they decide how they’re going to approach actually building things.

I try very hard not to be prescriptive. My job is not to tell people how to do things, but to again, set the context and let them run free… Because let me tell you, they’re gonna come up with much more innovative, interesting, creative solutions to things than I’m going to come up with. As a manager, I manage “What is the outcome? What do we want to achieve by building this? How do we build it - that is up to you all! Run free!” My engineers are in the debugger, and I am not.

So again, coming back to your question, all I do all day is ask questions. Let’s say that someone comes to me and they’re stuck; this happens – not regularly, but with like a somewhat normal cadence, where maybe we’re deadlocked on a decision, we don’t know what to do, so when the team comes to me and they’ve decided “We want Julia to weigh in, because we don’t know what to do.” So the first thing that I do in all of those discussions is I start asking a lot of questions, because I probably don’t have all the context.

Most of the time – and it’s almost like I’m rubber duck debugging the team. They come, and I just start asking questions. And usually through all the questions that I ask - and I’m not being prescriptive, I’m not telling them what to do - they come to a logical conclusion, and the team has decided how to effectively fix something themselves. I think that’s fantastic, because in an ideal world, the team is able to function and make decisions, and I manage myself out of a job, meaning they don’t need me, because they understand what they’re doing, and they understand the business requirements, and they deeply know the purpose of what we’re building.

Now, there’s also a lot of situations where maybe I ask a lot of questions and it’s a hard call, and I have to make a decision, and part of what I do as a now senior leader is when I do have to make a decision, I can ask questions rapidly and then be able to make the decision quickly… Because the last thing I would want to do is to block a team from being able to do something.

So the credibility comes through asking people and getting them to volunteer what they think the solution should be, versus coming in and being that boss who doesn’t know what’s going on, but is telling people to do things that those developers are actually diametrically opposed to.

[40:05] It’s getting people to talk, really, right? In a lot of cases, unless you do that, the silence will come in and you’ll pontificate, rather than say “Hey, where should this go? Here are the problems I’m seeing from this meeting” or “I’m information gathering here, I see this problem there. Here’s the collective problem - how does this impact you and how can we solve that?” Is that what you mean by that?

Absolutely. The last thing people wanna hear me do is get up on a pedestal and give a speech about how they should solve a problem, or the implementation details of a problem. It is all about (exactly) asking questions, getting them to talk, getting them at times to see things from a different perspective.

So Julia, you’ve been scaling up the team at the speed of the business, which as we mentioned earlier in the conversation, has been very rapid, and you’re from 7(ish) to up to a 70-person team; I’m sure there’s plenty of other teams… But your major goal is to keep the infrastructure up with the demand on the platform and the business. Give us some insights into exactly the infrastructure of Slack, some of the technical hurdles you all have been dealing with, maybe some success stories, maybe some bad days.

For sure. From a technical perspective, the founders of Slack previously had started Flickr, which was a photo sharing site that they then sold to Yahoo! many years ago. So Cal Henderson, who I work with very closely - he’s our CTO - during the Flickr days him and co-founder Serguei, who is actually a part of the infrastructure organization; I directly managed Serguei for some time - brilliant technologist… They learned how to scale PHP.

Cal even wrote a book about scaling PHP and how Flickr, the first consumer web startup of tremendous scale, how they went through that period. So when Cal and Serguei and Stewart Butterfield (our CEO), when they went to start Slack, they knew how to scale PHP. We have a large PHP monolith with many services that we’ve split out right now.

We recently hired – or actually, you know, all the years blend together now, but we have a chief architect, Keith Adams, and he came from Facebook; Facebook was also a really large PHP shop, and they then created the Hack language, and use HHVM (HipHop Virtual Machine)… So we are now transitioning to using Hack and HHVM. Fantastic, fantastic performance improvements there, as well as typing, as well as many of the affordances of modern languages.

[44:19] We tend to, at Slack, use boring technology, and part of the reason for that - and I say boring with so much love - is because we have to know how to operate the technologies that we use at incredible, incredible scale, and so we don’t want to be on the bleeding edge, because we have to have incredibly high uptime, because we have so many companies, from the NASAs of the world, to Capital One, to IBM, to eBay - all of these companies run the backbone of their business on Slack, and we can never ever go down.

So in scaling infrastructure, we build a lot of the services that connect to the monolith. Some of those services are written in Java, some of them are written in Go… And so I now manage a fantastic team of machine learning and search engineers. We have that office based out of New York, and so they’re also experimenting with some Java and Go services to connect to the monolith, and we’re slowly – we do not have a microservices model, but when it makes sense, we split things off the monolith and potentially turn them into external services.

In the early days of having services, that we’ve either split off or that were always separate, we didn’t deeply understand SLAs around those services, and so as we’ve grown, one of the ways in which we’ve matured is understanding and having performance targets and also SLAs for all the services that we’re building… And these are not external SLAs, these are SLAs for ourselves, because infrastructure is a horizontal organization, meaning we build, of course, common infrastructure that’s used by 300-350 engineers that are in product engineering, like building on top of what we’ve built.
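
As an illustration of what an internal performance target for a service could look like, here is a minimal sketch in Python. It is not Slack's tooling; the function names, the sample latencies, and the 200 ms target are all assumptions made up for the example. It computes a nearest-rank percentile over request latencies and checks it against a target.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile (pct in 1..100) of a non-empty list of numbers."""
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_target(latencies_ms, pct, target_ms):
    """True if the pct-th percentile latency is within the (hypothetical) target."""
    return percentile(latencies_ms, pct) <= target_ms


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies = [12, 15, 11, 240, 14, 13, 16, 12, 18, 17]

print("p50:", percentile(latencies, 50))
print("p99 within 200 ms:", meets_target(latencies, 99, 200))
```

Checking a high percentile rather than the average is a common choice, because a single slow outlier can hide behind a healthy-looking mean.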

It reminds me of a conversation we had a long time ago now… Man, time flies. In 2014 we had Sara Golemon - who worked at Facebook at the time; I believe she still does, but perhaps not - come on the show and talk all about the PHP language spec, making PHP awesome, the work they were doing with HHVM and Hack… And there’s been a whole bunch of engineering efforts by many companies now; I’m happy to hear Slack is contributing and using PHP and helping make it an awesome language of today.

So we use Slack every day, almost all day, every day, and you mentioned it always has to be up… Like, I can’t think of a time – I think there was one time when Slack was down… I’m just trying to think if you had any real bad days.

Another one of our often used services is Twitter, and they historically have had many bad days, and we even lovingly think of the Fail Whale of years past…

Oh, yes…

Actually, Twitter had some downtime maybe last week, and I noticed the Fail Whale was gone. It’s like a weird Octocat looking thing instead. I was like, “That’s not endearing, give us the Fail Whale.” But Slack really hasn’t had… I mean, Adam, can you think of a time where it was just like, “Well, Slack’s down… I guess we’ll just email each other”?

No, I don’t. The only thing I would really ever notice - and this isn’t a dig - is just maybe slower service, not down service… Which could be - but isn’t - as bad.

Service degradation.

[47:54] Yeah. You know, we’re starting up the app, and it takes ten seconds versus instant or closer to instant… Those kinds of things. Or slow notifications. When you rely on notifications, iOS notifications, and you’ve already had the conversation and then finally on your iOS device you get a notification or two of the conversation you’ve already had… Those kinds of things. I’m sure that they’re not quite down, but they’re like – it’s sort of not relevant anymore, so how do you deal with non-relevant, distributed notifications that should have been closer to real-time, that are now just not important anymore?

See, I think you’re highlighting on a really interesting question, which I see the parallels, and sometimes no internet is better than slow internet… Where you want a service to be really quick, you want to ensure that your notifications show up instantaneously… Imagine if you get a DM from your boss and you wanna know, you wanna be able to respond, if that’s your relationship with your boss… So we think about this a lot, and we think about it especially with respect to we’ve grown so fast, and now that over half the messages in Slack are sent outside of the U.S., we have to have an infrastructure that allows Slack to boot instantly everywhere around the world, meaning Houston, Omaha - since Adam, Jerod, I know you’re out there - but also in Japan, in Asia-Pacific.

So we run Slack in the cloud, and we’ve been cloud from day one, and as part of the infrastructure organization, we’ve had to build a lot of tooling to understand what our performance is around the world… And also, you were talking about notifications, and especially on mobile, we use the infrastructure - I believe it’s APNs, which is the infrastructure provided by Apple, to send notifications on iOS; there’s also an analogous system on Android. So one of the most difficult things is providing a service that is used 24 hours/day, 7 days/week, around the globe, that people need to do their jobs, that has to almost be more reliable than the internet backbone.

What I mean by that is there are parts of the world where the internet backbone is less reliable, especially in Asia-Pacific. Let’s say that you get a DM and it doesn’t come in fast enough. It seems delayed; Slack seems slow. You don’t care that there’s DNS issues that are happening in your part of the world; you need Slack to be fast, and that’s what you expect. So one of the awesome challenges here is figuring out how to provide that level of service when we don’t control the racked machines; we don’t have our own data centers, so how do we do that when we don’t have that low, low, low level of control? And the way that we do it and the way that we’re figuring it out - because again, the scale is just tremendous - is by building software. And that makes me incredibly excited. We are figuring out how to work with different vendors to build really resilient fault-tolerant software that can provide you that level of experience, when fundamentally the underlying infrastructure - the cloud providers that we run on, and then the internet backbone lines that they run on - do not provide the level of uptime that we need.

[51:57] It’s an interesting perspective to think about that too, because - I’ll also say this - we’re not paying you. So we’re obviously not complaining…

Do you mean that in terms of “We’re on the free version of Slack”?

Right, exactly. I think of this as an interesting problem, because you have such a unique type of software, where you have a lot of people using for free, and a lot of people – you know, in your own terms; we’re not ripping you off, it’s the way things work. [laughter]

Oh yeah, of course!

But the point is–

If we were, that’d be a really bad confession right there. “By the way, we’re not paying you…” [laughter] “Surprise!”

I’m not beating down your door and demanding it from you, but you know, we talked about uptime or downtime reliability - you know, I’ve never really seen Slack down, but I’ve seen it be slow, or I’ve seen it be delayed… And you’re right, I don’t think like “Hey, DNS isn’t working properly here in Houston, Texas”, I just think “Slack is not working right.” I blame you, not the DNS, or the other problems in the internet backbone.

Or when S3 went down.

Yeah, or S3 went down, or something changed to make things not work right.

So we definitely have had – so we run our business on Slack; we’re Slack on Slack all day, and when we do have service interruptions or things are slow, it really heavily impacts our ability to do work. And when you build software, we do everything we can to ensure we have unit testing, and load testing, and we have linting, and we have tooling… But we’re all engineers here, we make mistakes. We all wish that we could write perfect code and never deploy bugs, but of course we do! So the challenge becomes – so we absolutely have had situations where Slack has gone down for periods of time, so what we’ve done is ensure that when things happen (because they do), that we’re able to recover and detect those problems instantaneously.

So in an ideal world, we accidentally break something, or S3 goes down, or a huge storm in Northern Virginia impacts U.S. East, the Amazon facility out there, and we’re able to detect that and we route the traffic or revert the bug and do that without you ever noticing. That’s the world that we’re moving towards - being able to detect and recover really rapidly, so that you all will never know that anything happened.

And you’re doing that through relationships – you mentioned talking to vendors…

So we talk with our vendors very regularly. We also build software that can handle network flakiness, in case something does happen with the underlying network… And I feel like that’s such a fascinating engineering challenge, because it’s like trying to understand ways in which something will fail. When you build software, there’s often obvious edge cases, and then there’s things that happen where you’re like “I never could have imagined that that ever would have happened”, and now we have another if/else clause, and how to handle that… [laughs]

As someone who runs an infrastructure organization, I think a lot about the challenges at scale that involve vendors, that involve us building and baking in resiliency into our software. Another fascinating thing, at least to me, about Slack is we open – most people have Slack open for ten hours a day on the desktop; they’re probably not sending messages for ten hours a day, but they have it passively open, and we open a WebSocket connection and we’re sending incremental diffs, if you will, across that WebSocket.

[56:13] The reason that at times - and we’re very heavily working on this - that startup time might be slow is because we need to send you a whole lot of data across the WebSocket. Now, remember, WebSockets are a bidirectional communication, so we are sending like “Is Jerod in new channels? Has Adam gotten some new DMs? Has someone mentioned Jerod or Adam?” We’re sending all of this information, the state of the world since you last connected, across the WebSocket. And then once you’re connected, we’re able to send you smaller bits of information about things that have changed.
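
The pattern described here, a large initial payload followed by small incremental updates, can be sketched roughly as follows. This is a simplified illustration in Python, not Slack's protocol: the event types (`snapshot`, `channel_joined`, `dm_received`, `mention`) and the shape of the state are hypothetical, and no real WebSocket is opened.

```python
def apply_event(state, event):
    """Return the client state after applying one event from the connection."""
    kind = event["type"]
    if kind == "snapshot":
        # The large first message: the full state of the world since last connect.
        return {
            "channels": set(event["channels"]),
            "unread_dms": dict(event["unread_dms"]),
            "mentions": event["mentions"],
        }
    # Every later message is a small diff applied on top of the existing state.
    new_state = {
        "channels": set(state["channels"]),
        "unread_dms": dict(state["unread_dms"]),
        "mentions": state["mentions"],
    }
    if kind == "channel_joined":
        new_state["channels"].add(event["channel"])
    elif kind == "dm_received":
        sender = event["from"]
        new_state["unread_dms"][sender] = new_state["unread_dms"].get(sender, 0) + 1
    elif kind == "mention":
        new_state["mentions"] += 1
    return new_state


# Hypothetical message sequence: one snapshot, then incremental diffs.
events = [
    {"type": "snapshot", "channels": ["general"], "unread_dms": {}, "mentions": 0},
    {"type": "channel_joined", "channel": "infra"},
    {"type": "dm_received", "from": "adam"},
    {"type": "dm_received", "from": "adam"},
    {"type": "mention"},
]

state = None
for event in events:
    state = apply_event(state, event)
print(state)
```

The cost asymmetry is visible even in this toy version: the snapshot grows with everything the user belongs to, while each diff stays small no matter how large the workspace is.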

Now, one of the things that happens that’s particularly precarious is if we see millions of users - or tens of millions - get knocked offline… Like, let’s say there’s a storm; let’s say we deploy a bug. Let’s say the internet backbone in Singapore has a blip, and suddenly all those users are knocked offline, they all immediately hit Refresh, or they wait, and suddenly we’ve got millions, tens of millions of users trying to reconnect. And those are the things that are really difficult.

Building systems that can handle those - what we call “reconnection storms”, that’s really interesting and has been really hard, because you really have to build infrastructure that can handle so much greater than your current load… But that’s not just sending the data; it’s querying for the data, it is packaging it, getting it to clients, ensuring the clients can parse it efficiently… All of these things. And I think that’s such an exciting challenge.
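
One widely used client-side mitigation for this kind of reconnection storm is exponential backoff with random jitter, so that clients knocked offline at the same moment do not all retry at the same moment. The sketch below is a generic Python illustration of that technique under assumed parameters (`base`, `cap`); the conversation does not say whether or how Slack's clients do this.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    Uses exponential backoff with "full jitter": the upper bound doubles on
    each attempt up to `cap`, and the actual delay is drawn uniformly between
    zero and that bound so that clients spread out instead of retrying in sync.
    """
    ceiling = min(cap, base * 2 ** attempt)
    return rng.uniform(0, ceiling)


# Five hypothetical clients that all dropped at once, each on its third retry.
rng = random.Random(42)  # seeded only to make the demo repeatable
delays = [reconnect_delay(3, rng=rng) for _ in range(5)]
print([round(d, 2) for d in delays])
```

Jitter only smooths the arrival of reconnects; the server side still has to query, package, and deliver the initial state for every client that comes back.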

I’m sure we can go much, much deeper on these challenges, and I think these are probably never-ending, and probably not even the most – maybe fun for you to talk about, but not always fun to reveal.

Maybe before – we wanna ask you one or two questions about your upcoming talk here at Velocity, but I think maybe… Share what you can, just to give listeners kind of a scale of like how many users? Is there any public information around paid or unpaying users, so you can help the audience and listeners understand the scale you’re actually operating at when it comes to a concurrent user base, or something like that?

Yeah, absolutely. So we have over nine million weekly active users and over six million daily active users. There’s over two million paid users. What I think is super-cool… So of those paid users, the two million paid, 43% of Fortune 100 companies use Slack. So a lot of the companies that you think about - credit cards, Capital One - they’re running their business on Slack. Super-cool. TicketMaster, if you’re buying tickets on Slack. But also a lot of fascinating and exciting, like NASA - I have talked about that before - doing really cool stuff on top of Slack. And of course, there’s a lot of technology companies - the PayPals, the LinkedIns, Spotify, Pinterest… They’re all running their businesses on Slack.

[59:46] And not only do we have rapidly growing numbers of users, but I think the demands on the service in terms of “We can never go down” are really high. If you think about consumer internet businesses, for example Twitter (we talked about them earlier), or Facebook - when those services go down, of course that really sucks, and there’s clearly a loss for those companies in ad revenue. When Slack goes down, the people at Capital One can’t do their work, and that’s terrible. If TicketMaster goes down, then potentially they can’t process orders. So the reliability and scalability constraints are real, and I think that’s really exciting, because I think it means that we’ve built an incredible product, that people love and that people rely on every day to do their job, and ideally, we’re in the background… Like, we just work.

Segueing into – you had asked a few other numbers… When I first started using Slack, before I ever thought about working at the company, I had started a company and I installed all the engineering integrations on top of Slack. So I did GitHub, continuous integration, PagerDuty, we also use Zendesk for our customer support tickets… So we have a really active and vibrant developer community that builds on top of the APIs, so we’ve got something like 1,000 apps in our app directory. The app directory was actually the first big launch that I was part of at Slack, which was super cool, because you had to go search for apps, and that’s how in my early Slack days I found apps - I would do a Google search; now you can search in the app directory.

This is just so cool, because I’ve built apps before I even joined the company; I built integrations, so I could send data, pipe data from our systems into Slack, so that if something was going wrong, I hooked up our error servers to Slack so that I could see the channel light up, versus waiting for the email or waiting for the page, because I was in Slack all the time…

There’s something like 155,000 of these weekly active developers building on Slack. So that’s a lot of people building on Slack, and I think that’s so cool, because they are building things that we never could have imagined, in a wonderful way.

I think what’s interesting is that you’ve got - I think you said two million(ish) paying users, but roughly nine million in a week, right?

To me, that’s just crazy because of what you’re doing for your uptime, that the large majority of your users aren’t paying you.

You’re feeling guilty, aren’t you, Adam?

No, I’m just feeling like – you know, it’s just the world we live in, but it’s just like… You know, you kind of understand why services charge a higher premium for what they do, because it takes a lot to run them, but you’ve got a large majority using a service for free, but they get – maybe not the same, but a very similar service. We probably get a very similar service that one of these companies gets that pays you.

Yeah, I mean, I think they’ve done a good job setting forth what seems to be a solid business model with free versus paid, and it seems like everybody’s happy that way, at least so far… Right, Julia?

So what I love is that all of the changes that we make to ensure that the service is better for big customers, every single small customer and free team benefits, too. That’s really exciting, because not only can the people who use Slack for work hopefully have a better experience, but then in the communities that you run, you’re also able to benefit from all of those things as well.

[1:03:50] I think another – active user numbers vary, and as an enterprise software company, we see there’s periods of time when… You know, Mondays are really big days for us, for example, because everybody comes back online, because they spent the weekend hopefully chilling out… So those numbers fluctuate based on the calendar year, the – I think what’s super interesting from my perspective was, you know, in the early days of Slack, usage would dip over the holidays, because people weren’t working. They’d take a week off, or two weeks off for winter holidays and New Year’s, but as we’ve grown, we see that less because there are more companies using Slack that don’t have that dip in the holidays. The companies that don’t are the ones like credit card processors, for example, or anyone in e-commerce.

So as we start to see more and more folks – the “nice break” we used to get for the holidays, that doesn’t exist anymore. It’s been really cool to watch… We used to be able to say “Only one person is triaging bugs”, but that’s not the case anymore. So that’s been wonderful, the challenges of success and the challenges of growth.

Well, Julia, I wanna plug your talk here at Velocity here in a bit. We work closely with O’Reilly, especially around Velocity, Fluent and OSCON conferences they put on, and we’re always happy to talk with speakers like yourself, speaking at this conference. So you’re giving a talk called “Scaling yourself during hypergrowth”, and I think we’re actually gonna title this podcast “Scale all the things” or “Scaling all the things” - one of the two.

I’m excited about this talk. We have some team members who are gonna be there. Listeners, if you’re checking this out and you’re gonna go to that conference or you’d like to, we can give you 20% off either a gold, silver or bronze pass. Use the code “changelog”, check the show notes for a link. We’ll also include a link to Julia’s talk there as well; maybe you can catch it. If not, maybe it’ll be on YouTube, who knows…?

Anything you wanna share with us in closing to some of the things either in your talk, or things we haven’t covered that you wanna say as we tail out?

Thank you both so much for having me. If any of these challenges - of scale, of growth - resonate with any of you listeners out there and you’re interested in learning more and working on some of these things, you can find me easily on Twitter; I would love to–

I love the Twitter handle, by the way.

Oh, thank you.

Jewelia. [laughs]

You know, a last final story… When I was 18 years old and I went to college, I had to choose my email address and it had to be more than five characters… So Julia, how I spell my name, is five characters and it wouldn’t work, so immediately (as an 18-year-old) I had to come up with this handle, and you know what? Many decades later, it’s still around, so… Any 18-year-olds out there? [laughs]

Nice…

The decisions live with you.

It’s a big decision.

And now, of course, you can have five characters or less; the world is a different place. If only… So find me on Twitter, come to the talk - Velocity is such a great conference… Huge shout-out to the O’Reilly folks, who do an incredible, incredible job. I feel so honored to be able to talk about these things both in a keynote and a session, so we can dig deep there… And then it should be on YouTube later, so come find me, I’d love to talk more. I hope that you are having a wonderful, delightful experience on all of your Slacks.

There you go. All your Slacks. I’m on many Slacks. Julia, thank you so much for your time. It’s been a pleasure talking to you, and I appreciate you coming on the show.

Thank you both so much.

