Earlier this year Brian Long gave one of my all-time favorite talks about software development: how we built our real-time infrastructure and the build vs. buy decisions we faced. Brian gave his talk about 6 months after we started the project, and I sat down with him to catch up another 6 months later.
You can listen to our discussion below or, for the readers amongst you, read the lightly edited transcript of my conversation with Brian that follows.
Brian White: Hi, I'm Brian White, Director of Engineering here at Intercom.
Brian Long: And I'm Brian Long, engineering manager for our delivery team.
Brian W.: Back in the summer, you gave a really great talk about how we built our real-time infrastructure. It's a project we started around the beginning of 2015, and now that we're almost at the end of 2015, it seems like a good time to reflect on some of the decisions we made and how the project went. One of the really interesting aspects of your talk, for me, was how you tried to answer the ‘build or buy’ question. How has that changed since the summer?
Brian L.: This is actually quite an interesting question. With this project, we had something pretty rare happen. When we built our realtime system, even at the point when I gave the talk describing how we built a system for terminating web sockets ourselves to replace a particular vendor, a decent chunk of our system still relied on that third party provider for terminating web sockets. So it was a bit half and half – for our end users, we terminated the web socket connections ourselves. And for our admins, the direct customers of our apps, we still relied on this third party to terminate them.
So that gave us an interesting insight into how this ‘build or buy’ decision played out, which is quite rare for a project like this. We actually got to see how both options turned out. I think my overall summary would be the build approach was far, far harder than we had hoped it would be. I think that's probably not surprising for anybody who's tried to build software. This stuff always turns out way harder than your nice, optimistic projections, as you break ground on the nice green field site.
But at the same time, living with the vendor was far more difficult and more expensive than we thought it would be as well. So I'm still happy enough to sit around and say we made the right call in moving away from the vendor, who, by the way, we have removed from our system entirely at this point in time. So I'm happy to say we took the right approach, but it has definitely been a difficult path to get to the reliability we wanted.
Brian W.: Yeah, it's a pretty interesting project, where we got to go down both paths. We had a clear idea at the start. I know yourself and your team had a very strong idea of where you wanted to go and how you wanted to build this all yourself. You had a great graph in your original talk that showed the cost of ‘build or buy’, including what one vendor offered to give us a better deal. How does that graph look today?
Brian L.: I think some things were simplified in that graph. Maybe it didn't fully capture all of our costs at the time. For example, I don't think we were correctly accounting for the cost of using DynamoDB at the time; it acted as a routing table in our system to figure out how we direct messages as they flow through the system between admins and users. So while I think our overall cost today is maybe quite a lot higher than I would've estimated at that time, I think we've actually come quite close to keeping it relatively flat over the year, which is actually a pretty impressive achievement, considering the multiples of growth that we've seen through the system.
The way we've managed to do that is by reducing all those other expenses that weren't captured originally, as I drew that graph; things I hadn't really accounted for correctly. These are expenses, by the way, that applied equally in the case of using a third party vendor or building our system ourselves. And the cost of living with third parties kept going up and up and up, so that represents a significant saving. So overall, we've managed to keep our costs relatively flat through the year. Simplification of the system, cheaper prices from AWS and aggressively reserving capacity have really helped.
Brian W.: So prioritization at the start of a project is pretty crucial, and it sounds like you've simplified a lot of the system now. At the start of a project, though, if you focus on the wrong thing, you end up learning the most important lessons too late. You're pretty much doomed. At the same time, I can't think of any system we've worked on, at any company, where there wasn't some little manual task we never quite got around to automating. Maybe it didn't happen frequently enough for us to care. Maybe we just got into bad habits. It sounds like you've been addressing some of the more strategic things, but is there anything in your service that's still in that to-do pile?
Brian L.: I think we've done a reasonably good job of replacing any components that weren't reliable. As you were saying, there’s always some little manual task you don't get around to or some pain point that just doesn't bite you enough. I think my biggest regret is around how hard it is for us to deploy this system. Some concerted effort early on would've paid off – far beyond the amount of effort that we had to put in.
It's definitely one of those typical problems where it’s easy to say, "We'll deploy it the way we have this time. Next time, we'll fix it." To give you an idea of the problem: whenever we deploy to this system, it drops all of the active connections. This causes all of our clients to try to reconnect, which they all do for their first attempt pretty much immediately, and that first attempt will fail for the majority of clients.
We have some amount of backoff, some jitter, but it does take quite a while for the system to stabilize. It can take, during peak for one of our two stacks, a half an hour before we recover all the connections that we lost during a deploy. And then we repeat that process for our second stack. Right there you're talking about a deployment procedure that takes at least an hour.
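The backoff and jitter mentioned above can be sketched roughly as follows. This is a minimal illustration of exponential backoff with full jitter, not Intercom's actual client code; the function name `reconnect_delay` and the `base` and `cap` values are assumptions chosen for the example.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return how many seconds a client should wait before retry `attempt`.

    `attempt` starts at 0. The upper bound doubles with each attempt
    (exponential backoff) until it reaches `cap`, and the actual delay is
    drawn uniformly between 0 and that bound (full jitter), so clients
    that were all disconnected at the same moment do not retry in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the demo repeatable
    for attempt in range(8):
        delay = reconnect_delay(attempt, rng=rng)
        print(f"attempt {attempt}: wait {delay:.2f}s")
```

Spreading retries out like this trades a slower recovery for any individual client against a smaller reconnect spike on the servers, which fits the long stabilization times described here.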
In reality, it's slightly more than that. When you have a heavyweight deployment like this, you end up adopting all of these really bad patterns around deployment. We've got some great blog posts and talks about how we really value continuous deployment at Intercom.
It really pains me that we're not there yet with this system. It's not an easy problem to solve. To some extent, the nature of the problem is a system that lives to terminate these connections and just sit on them, holding them open. When you disrupt that, it's going to be somewhat painful. But I think there are a bunch of things we can do to improve that we just haven't made time for, and that's definitely a regret I have.
Brian W.: That's pretty interesting. All services at Intercom, with the exception of yours, are fully automated, fully continuously deployed. And it encourages a virtuous cycle of small changes. There are very few 4,000-line diffs. Does that change the development style on that service too? Do changes tend to be bigger and batched up before a more traditional deployment?
Brian L.: We haven't fully swung to a crazy model of monthly deployments, where there's a release manager or anything quite that extreme. We still tend to deploy changes, as they're ready to go. But yeah, the diffs are going to be slightly bigger. The size of the change is going to be slightly bigger. Sometimes we hold off deployments during peak times because there's just a longer recovery time.
There's a longer amount of time you have to babysit the deployment. So it definitely changes our behavior and not in a good way. In a perfect world, we would be deploying nice small changes and deploying them as and when we need to, as we do with the rest of our systems here.
Brian W.: So I'm going to be the annoying engineering director and ask, when do you plan to fix this?
Brian L.: For now, we're pretty happy with where the system is, so development is more or less paused right now. I think before we start into development on this system again, before we start building on any new features or making any significant changes, it's something we have to get right. It's something we do have to fix there and then. It's going to take some discipline to make sure we do that because I'm pretty sure I've made that promise to myself and, perhaps, to the team before already.
Brian W.: Yeah, it can be hard to prioritize work like that, particularly when there's some exciting new feature in the product and you want to just build it. You say we'll include it in the cost of this project, but maybe we'll do it after we've done the feature work, or something like that. I think I've told myself dozens of those lies over the years.
I did have a look around the code base recently, and it's a totally different system right now. What's changed since you built it, since that initial launch of the service?
Brian L.: So this was comparing what we have now to maybe the system I described in the talk I gave earlier this year. It does look quite different even to me. We have removed the vendor, as I described. We've also ended up removing our use of DynamoDB. We were using that as a routing table to describe how we pass messages between admins and users. It was causing us some amount of operational pain, in making sure we had provisioned the correct amount of capacity for Dynamo.
So you can imagine a system like this, as admins and users show up, you can imagine how it is susceptible to sudden spikes in load as large customers arrive. And we have to suddenly populate this database with tons and tons of entries for all of these new users, who we've never seen before.
So that required us to over-provision our capacity in Dynamo versus where we needed to be at a steady state. Right now, DynamoDB doesn't make it too easy to automatically scale your provisioned capacity. So we ended up sitting in this halfway point of being somewhat over-provisioned but not enough to avoid alarms when the inevitable large customer did land.
We came up with a nice scheme for encoding the information we needed into the URLs that the end clients connected to, and that allowed us to turn this real-time system into, essentially, a stateless system that gets all of the information it needs from the clients that connect to it. There's some signing, or encryption, of the details in that connection URL to make sure people can't forge things or connect to things they shouldn't be allowed to connect to. But overall, it's a pretty elegant approach. I can't take credit for it. Aidan on the team came up with it, but it has reduced the amount I get paged a lot, so I'm pretty happy with it.
We also decided to use a framework called Atmosphere. We struggled a little bit with that, as it wasn't so compatible with Jetty; we hit thread deadlocks and so on. Those problems continued for quite a while, to the point where we were so comfortable digging through the code base to try and hunt down these problems that we realised, for the particular thing we're using it for, it isn't actually doing a whole lot of heavy lifting for us. So we could actually just get rid of it and build what we need directly on top of Jetty's APIs for interacting with web sockets. We ended up doing that, and it brought performance and general reliability improvements.
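To make the idea of a signed connection URL concrete, here is a minimal Python sketch. The transcript does not describe the real scheme, so everything in it is an assumption for illustration: the parameter names (`app_id`, `user_id`, `role`), the host `rt.example.com`, the shared secret, and the choice of HMAC-SHA256 signing instead of encryption.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical secret known only to the servers that issue and verify URLs.
SECRET = b"hypothetical-shared-secret"


def sign_connection_url(base_url, app_id, user_id, role, secret=SECRET):
    """Build a URL that carries its own routing details plus a signature."""
    params = {"app_id": app_id, "user_id": user_id, "role": role}
    payload = urlencode(sorted(params.items()))
    sig = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return f"{base_url}?{payload}&sig={sig}"


def verify_connection_url(url, secret=SECRET):
    """Return the routing details if the signature is valid, else None."""
    query = parse_qs(urlsplit(url).query)
    sig = query.pop("sig", [""])[0]
    # Rebuild the payload exactly as it was built when signing.
    payload = urlencode(sorted((key, values[0]) for key, values in query.items()))
    expected = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    return {key: values[0] for key, values in query.items()}


if __name__ == "__main__":
    url = sign_connection_url("wss://rt.example.com/connect", "app1", "u42", "user")
    print(verify_connection_url(url))  # routing details recovered from the URL
    print(verify_connection_url(url.replace("role=user", "role=admin")))  # None
```

Because the server can recompute the signature from the URL alone, it needs no lookup table to decide where a connection belongs, which is what makes a stateless design possible. A production scheme would likely also include an expiry time in the signed payload.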
Brian W.: That's a pretty interesting evolution of your system. As an engineer, if I was starting this project, give me good frameworks, like Atmosphere, give me DynamoDB so I don't have to manage a horrible database myself. Those would be the building blocks I think I'd lean on to get myself up and running.
But the simplifications you described, do you think if you had gone with those in the first place, gone straight on top of Jetty, figured out the approach, that that would've saved time? Or do you think those came out of the lessons you learned building it?
Brian L.: That's a really tough one to answer. It's pretty easy to sit around and think, "Yeah, yeah, we should've just built it the right way to begin with." But the challenge is you don't know these things ahead of time. You sometimes have to stumble through and learn about them the hard way. It's almost like finding out what's the cheapest way to learn that lesson?
This is a good example of one of those situations where you want to try and fail fast and fail early. Maybe we could've approached this a little better. There were probably times, earlier on in the process, where we could've taken the opportunity to accept the limitations of how we had initially designed the system, changed course and not tried to live with our initial choices for quite as long as we did.
Brian W.: We've talked about all the things that we could've done better. It's good to reflect on those things, but I suppose the important thing to finish on is that it’s an incredibly successful service. It's really reliable. It powers a whole bunch of things that make Intercom special. It's a fundamental service that a lot of our product builds on top of. Are there any exciting things in the future for it? Is there stuff you're planning to come back and add to it?
Brian L.: I'm pretty happy with where it is right now. I think the act of removing the extraneous components that I described simplified the system, making it as simple as we could make it. I think it's aligned the system to be ready to be extended. We can build on it in so many ways now. It's relatively free from technical debt, apart from our deployment problem, which I swear we will fix. So right now, we're more or less the same as I described back in the talk. For any message that's sent from a user to an admin, that message is sent to all admins for a particular app.
As you can imagine, if we have particularly large apps with lots of active admins, the level of traffic there starts to grow and grow. I think we're particularly seeing this in our mobile devices, where the broadcast nature is liable to become a noticeable drain on battery, as we forward these messages out to all the admins. That includes messages related to conversations or events that will not trigger any visible or noticeable difference on your mobile device, because it's not a conversation or a part of the inbox that you're looking at. It's just something totally unrelated.
So I think the next big improvement to the system will be to just make it more clever. I think a pretty reactive and quick update could be to opt in to certain types of messages, related to certain types of content. So when you're looking at a mobile device at a screen of six conversations, you're only going to find events related to those six conversations.
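The opt-in idea described above could look something like the following minimal Python sketch. It is an illustration only, not the planned design: the class name `SubscriptionRouter`, its methods, and the shape of the event dictionary are all assumptions.

```python
class SubscriptionRouter:
    """Forward an event only to connections that opted in to its conversation."""

    def __init__(self):
        # connection id -> set of conversation ids currently on screen
        self._watching = {}

    def set_visible(self, connection_id, conversation_ids):
        """Record which conversations a connection wants events for."""
        self._watching[connection_id] = set(conversation_ids)

    def disconnect(self, connection_id):
        """Forget a connection's subscriptions when it goes away."""
        self._watching.pop(connection_id, None)

    def recipients(self, event):
        """Return the connections that should receive this event."""
        conversation_id = event["conversation_id"]
        return sorted(
            connection_id
            for connection_id, watched in self._watching.items()
            if conversation_id in watched
        )


if __name__ == "__main__":
    router = SubscriptionRouter()
    router.set_visible("phone", [1, 2, 3, 4, 5, 6])  # six conversations on screen
    router.set_visible("desktop", [7])
    print(router.recipients({"conversation_id": 3}))   # ['phone']
    print(router.recipients({"conversation_id": 99}))  # []
```

Compared with broadcasting every message to every admin, a device here receives nothing for conversations it is not displaying, which is the battery saving the opt-in approach is aiming for.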
Brian W.: Sounds like fun.
Brian L.: Yep, should be.
Brian W.: Cool. Well, thanks a lot for taking the time to chat, Brian.