Scaling MongoDB to One Million Connections (and beyond)

We recently held a Scale and Reliability event with a focus on driving excellence with values, not rules.

In the first of our talks from this event, Niall and Barisa describe how we scaled our MongoDB infrastructure above one million concurrent connections. Watch the video below, or for the readers, a lightly edited transcript follows.

Niall: Hi, I'm Niall. I was told to re-introduce myself. So I'm Niall, and I'm kind of a big deal.

So, as Rich said, we use MongoDB to store an awful lot of data. It’s all of our user data. And one of the biggest features of the Intercom app is the ability for our customers to store custom data about their customers. This custom data allows app admins to do things like send messages when someone meets specific criteria.

Initially this data was pretty small. Intercom had a small number of customers sending us custom data. But over time, the amount of data we're storing has actually increased quite significantly. Today, our customers are storing billions of data points for hundreds of millions of their customers.

And this data is totally freeform. Like, it can be anything. You can store whatever you want, like how many songs your customers have listened to, or the date they subscribed or updated their plans, or what type of movies they like. And it's kind of endless possibility stuff.

And our primary store for this data is MongoDB. We chose Mongo because its document structure allows for arbitrary freeform data, much like our custom data, and also allows us to sort and quickly filter this data. Mongo databases are essentially memory mapped files, and the main bottlenecks in Mongo are RAM and I/O. So keeping as much data in RAM as possible is key to Mongo performance.

Our production setup is actually really simple. We run one database for each customer, and all of their customer data resides in that database. We run multiple databases per replica set, and we run multiple replica sets per server. Each replica set has failover members in a different availability zone for redundancy.
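To make that concrete, here's a minimal sketch of what storing freeform custom data in a per-customer database looks like. It uses the official MongoDB Go driver rather than our actual Rails stack, and the hostnames, replica set name, database name, and fields are all illustrative assumptions, not our production values.

```go
package main

import (
	"context"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Connect to a replica set; the hostnames and replica set name are made up.
	uri := "mongodb://mongo-a:27017,mongo-b:27017,mongo-c:27017/?replicaSet=rs0"
	client, err := mongo.Connect(ctx, options.Client().ApplyURI(uri))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	// One database per customer (the name here is hypothetical).
	users := client.Database("customer_42").Collection("users")

	// Custom data is freeform: the document can hold whatever attributes the
	// customer wants to track, alongside identifiers the app filters on.
	_, err = users.InsertOne(ctx, bson.M{
		"user_id":         "u_123",
		"plan":            "pro",
		"songs_listened":  184,
		"subscribed_at":   time.Date(2015, 6, 1, 0, 0, 0, 0, time.UTC),
		"favourite_genre": "jazz",
	})
	if err != nil {
		log.Fatal(err)
	}
}
```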

We've maintained this architecture as the company has grown. It's operationally quite simple, and it reduces complexity of our application code.

And as Intercom grows, we're adding more and more users each day, and those users are generating more and more data each day. By the end of 2015, we reckon we'll have about nine times more data in MongoDB than we did at the start of this year.

More users means more web traffic. And as we scale, we're putting more and more pressure on our databases. Each connection to Mongo has a cost of one megabyte of RAM. Every time we scale out our front ends, we're creating more and more connections to Mongo, which is taking precious memory away from the database. And eventually we reach a tipping point where Mongo is spending significant RAM on connections instead of data.
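The back-of-the-envelope arithmetic behind that tipping point is simple; the fleet numbers below are purely illustrative, but they show how quickly front-end scaling multiplies into connection memory on the database side.

```go
package main

import "fmt"

func main() {
	const (
		frontendHosts    = 500 // illustrative fleet size, not our real numbers
		processesPerHost = 20  // e.g. app server worker processes per host
		connsPerProcess  = 100 // connections each worker opens across replica sets
		mbPerConnection  = 1   // roughly 1 MB of RAM per Mongo connection
	)

	totalConns := frontendHosts * processesPerHost * connsPerProcess
	ramMB := totalConns * mbPerConnection

	fmt.Printf("connections: %d\n", totalConns)                   // 1,000,000
	fmt.Printf("RAM spent on connections: ~%d GB\n", ramMB/1024)  // ~976 GB, about a terabyte
}
```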

This is compounded by the fact that at Intercom, we deploy to production many times per day. Each deployment restarts all running instances of our app. All connections to Mongo are shut down and restarted. The result is that Mongo churns hundreds of gigabytes of RAM every time we deploy code.

So if Mongo can't hold enough data in RAM, things get really bad really quickly. Our users see increased latency and errors, and queries start backing up, which only compounds the issues. It's also hurting our team. We're spending our time fighting fires. We're getting paged a lot, and we're tired, and we're unhappy, and we're feeling a lot of operational fatigue.

So how do we get out of this? The fastest, simplest answer is get a bigger box. Get more RAM, scale out. Unfortunately, we're using the biggest boxes we can get, and so this is not an option. We considered manually partitioning our databases to reduce the amount of data on each server. This would reduce the amount of data required to be kept in RAM, but wouldn't do very much for the connection problems or for the memory churn. It would also require significant rewriting of our app in order to deal with this new logic.

We also considered splitting read and write operations across different replicas, but this would have actually cost us read consistency, and that's kind of important, so we considered it a deal-breaker. So, no.

The recommended scaling path for Mongo is actually sharding, and that's what our vendor suggested we do. On paper at least, Mongo sharding is pretty attractive. You get a router which can proxy and load balance connections to your replica sets, and this would reduce the connection load on our servers, and would probably reduce connection churn. And also, sharding would allow us to transparently grow our databases across many machines.
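For a sense of what the sharding path involves, here's a hedged sketch of enabling it with MongoDB's admin commands via the Go driver. The mongos address, database, collection, and shard key are hypothetical; this is what adopting sharding would look like in general, not anything we deployed.

```go
package main

import (
	"context"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// In a sharded cluster the app talks to a mongos router, not to the
	// replica sets directly.
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://mongos-1:27017"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	admin := client.Database("admin")

	// Enable sharding on a (hypothetical) per-customer database...
	if err := admin.RunCommand(ctx, bson.D{{Key: "enableSharding", Value: "customer_42"}}).Err(); err != nil {
		log.Fatal(err)
	}

	// ...and shard its users collection on a chosen shard key.
	cmd := bson.D{
		{Key: "shardCollection", Value: "customer_42.users"},
		{Key: "key", Value: bson.D{{Key: "user_id", Value: "hashed"}}},
	}
	if err := admin.RunCommand(ctx, cmd).Err(); err != nil {
		log.Fatal(err)
	}
}
```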

Unfortunately, sharded Mongo has several restrictions in how you use it. So if you're using certain features, you cannot use sharded Mongo without changing how your application talks to Mongo.

We also run an Elasticsearch cluster, and we continually import Mongo data into it. Sharding would have totally broken our import pipeline. And so we considered sharded Mongo impossible, at least in the short term.

So our problem is we are burning lots of RAM, and we're putting lots of stress on the databases when we deploy. The simplest way out of this is to use a shared connection pool of persistent connections to Mongo. But this needs to be protocol-aware. It needs to handle things like authentication. It needs to handle replica fail-overs.
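To illustrate the pooling idea, here's a deliberately naive Go sketch: many client connections share a small, fixed set of persistent upstream connections. The addresses and pool size are made up, and a real proxy can't just copy bytes like this; it has to understand the Mongo wire protocol so it can interleave requests from many clients, handle authentication, and follow replica set failovers.

```go
package main

import (
	"io"
	"log"
	"net"
)

const (
	listenAddr   = "127.0.0.1:6000"  // where app processes would connect (hypothetical port)
	upstreamAddr = "127.0.0.1:27017" // the real mongod
	poolSize     = 50                // persistent upstream connections shared by all clients
)

func main() {
	// Pre-open a small, fixed pool of persistent connections to Mongo.
	pool := make(chan net.Conn, poolSize)
	for i := 0; i < poolSize; i++ {
		c, err := net.Dial("tcp", upstreamAddr)
		if err != nil {
			log.Fatal(err)
		}
		pool <- c
	}

	ln, err := net.Listen("tcp", listenAddr)
	if err != nil {
		log.Fatal(err)
	}
	for {
		client, err := ln.Accept()
		if err != nil {
			log.Print(err)
			continue
		}
		go func(client net.Conn) {
			defer client.Close()
			upstream := <-pool                  // borrow a pooled connection...
			defer func() { pool <- upstream }() // ...and return it when done

			// Naive byte copy: this leases one upstream per client for the whole
			// session. A protocol-aware proxy like Dvara instead parses Mongo
			// wire-protocol messages so many clients can share each upstream,
			// and so it can handle auth and replica set failovers.
			go io.Copy(upstream, client)
			io.Copy(client, upstream)
		}(client)
	}
}
```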

The good news is that a Mongo connection proxy actually already exists. It's written by Parse. It's called Dvara. And we spoke with Parse, and we found that they've actually got a very similar setup to us. They'd also investigated sharded Mongo, and they rejected it for the same reasons we did.

So now my colleague Barisa is going to tell you how we evaluated and deployed Dvara at Intercom.

Barisa: Hi. At this point in the story, we know two things. We know we have a problem: a business problem, operational pain for our engineers, and a fairly bad customer experience due to Mongo availability. And we also have an idea of how we're going to fix it, using Dvara, the Mongo connection proxy.

So in the rest of the talk, I'll cover two main topics. The first is the process we used to evaluate whether or not we should use Dvara, which is fairly generic. The second is how we actually deployed Dvara to production, and what our findings were from running it.

So, what is Dvara? As Niall already mentioned, Dvara is a connection pooling proxy for Mongo. It's implemented in Go, and it's been open sourced by Parse, who run one of the largest MongoDB installations in the world. And it was built to solve exactly the same problem we have. So far, it sounds good.

But before we decide to deploy this completely new software on our business critical path, we want to be able to answer four questions. The first question is: is it still running in production anywhere? We spoke with Parse, and they told us they're still running it in production and are really happy with how it performs. So far, so good.

The next question is: how good is the code? Is it something we can read, understand, and actually tweak if we want to? From what we saw, it's easy to read, it has good unit test coverage, and it has good application metrics. And since we already have some Go expertise in-house, modifying it should be fairly easy.

The third question is: is it easy to adapt to what we need inside Intercom? There were only two small things we needed to do. First, export its metrics to our own metrics system. Second, enable authentication from the proxy to the actual database, since we run an authenticated Mongo setup.

And the last question is: how easy is it to deploy? It should be fairly easy, since it's just a single binary that we put on the host and start. Quite easy.

So before we actually proceed with deploying Dvara to production, we want to define simple success criteria. We define the project as a success if, after we deploy Dvara, Mongo availability has improved, and with it the customer experience, and the pager pain for engineers has been reduced.

We also expect the total number of connections for Mongo to be reduced, but that's not actually the business problem we're trying to solve.

So how do we go about integrating Dvara into our system? The first thing we did was add Dvara to our continuous integration system. Every unit test that had so far been running against Mongo directly now runs through Dvara instead. And after doing that, we saw that all the tests still pass, so that's quite good.

We also found that there was an unmerged authentication PR, so we merged it into our own fork. I tested it on my own local development machine, and it seemed to work. So that was also quite quick and quite short.

We also exported metrics to Datadog, the cloud metrics system we use. We made sure Dvara starts on every host at boot, and we updated the Rails config to use Dvara. These are all small steps, though it took a while to make sure we had everything up and running.
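As an illustration of the metrics step, here's a hedged sketch of shipping a gauge to Datadog over DogStatsD with the datadog-go client. The metric name, tags, and the hard-coded connection count are assumptions for the example, not Dvara's actual instrumentation.

```go
package main

import (
	"log"
	"time"

	"github.com/DataDog/datadog-go/statsd"
)

func main() {
	// The Datadog agent listens for DogStatsD packets on localhost:8125 by default.
	client, err := statsd.New("127.0.0.1:8125")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	client.Namespace = "dvara."

	for range time.Tick(10 * time.Second) {
		// In a real integration this value would come from the proxy's
		// internal counters; here we just send a made-up number.
		clientConns := 123.0
		if err := client.Gauge("client_connections", clientConns, []string{"replicaset:rs0"}, 1); err != nil {
			log.Print(err)
		}
	}
}
```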

And the next bit is that we just need to deploy to production. I'm sure everything will be great. But things happen. We got our first outage, and we got paged. As a reminder, one of the success criteria for this project is lowering the number of pages, and here I am causing more. So it's not a great start, but we'll see how it goes.

So I caused an outage. We rolled back. And at this point, we wanted to figure out what happened, because we weren't really sure. We tried it again on one host in production and could reproduce it easily there, but I couldn't reproduce it on my own development box or in any of the unit tests.

And after about two days of investigation, checking the Rails logs, doing TCP dumps, and checking the Mongo logs, we finally saw an unusual increase in authentication requests in the Mongo logs. That's where we found a bug in the authentication process. It seems the unmerged authentication PR we had merged was unmerged for a good reason: it had never been tested in production. Well, we tested it. It doesn't work.

So at this point, we know that the Parse version of Dvara we tested runs without any problem; it's the authentication PR that doesn't work. Luckily a colleague, Alex, he sits right there, with a good understanding of the Mongo wire protocol and of Go, was able to quickly produce a working solution in about 300 lines of code. His work, as well as other small tweaks we made to Dvara, is already open source.

So we deployed it to production again, and this time we were a bit more careful. First we deployed to a single host and made sure it ran there, and only then did we spread it to the rest of the fleet.

And on this slide, we can see what happened when we deployed Dvara. It shows two months of time: the left half is the month before deploying Dvara, and the right half is the month after. Before deploying Dvara, we had a peak of about one million connections to the Mongo hosts in total, and after deploying it, that dropped to a consistent 50,000 or so. That's a really good improvement.

This graph shows before Dvara and after Dvara for a single day. Before Dvara, we had connection churn every time we did any sort of deployment; after Dvara, there is no connection churn.

Now, although we saved about a terabyte of RAM (one megabyte per connection, times a million connections), the key improvement was the removal of connection churn. That churn was putting the Linux page cache under pressure, and thus flushing Mongo's critical working set out of memory.

But technical details aside, is the project actually a success or not? Well, before Dvara, in the previous few months, we had about 150 Mongo failovers. After deploying Dvara, in the last month and a half, we've had only two. Every Mongo failover has a brief impact on a portion of our customers and the potential to page our engineers.

So far it looks really good, and we also have really detailed paging metrics. Here we see that in week 35 we had Mongo capacity issues, and we were doing a bit of firefighting to work around them. It wasn't going well, but we kept trying.

We also saw that, for the long term, we needed to deploy Dvara. So in week 45, we deployed it. The week after, we had far more pages, but they were unrelated. We fixed those unrelated pages as well, and a few weeks later, life for our engineers had improved, and the customer experience had improved as well.

And now, what are the key takeaways? Well, the first is that deciding to scale out was the right thing, but we were a bit too late to it. We decided to do it only when we, and our customers, were already experiencing the pain. Doing it sooner would have produced a better experience.

The second is that it's always good to decide how you want to scale. We had a choice. We could have gone with the vendor approach, sharding, which is more of a DBA path, where you adapt your own application to work the way the vendor recommends. Or, as in our case, we could decide it's better to use Dvara, in which case we have to be confident that we can fix any bug we find in Dvara within our own setup. We chose to do that, and in the end it's worked out really well.

And the last bit is to start small on the most valuable piece. In this particular case, we progressed linearly: we evaluated Dvara as it was, and afterwards we did a fair bit of work to make sure we could deploy it. We never actually tested Dvara on a single production host before doing all that extra work. And if we had, I probably wouldn't have broken production.

Now, surprises during deployment to production make an interesting story afterwards, but they're not something that I'd like to repeat.

Thank you.

Rich: That's awesome, guys. That's a great talk. Thanks very much. Do we have any questions for Niall and Barisa?

Attendee: Why did you deploy to all hosts instead of one?

Barisa: Well, I wasn't completely clear. I didn't deploy to all hosts; I deployed to a fraction of our fleet that we call a profile, which wasn't that critical. So even if there was an outage, it was an outage for customers who could tolerate it and just back off and come again later. But it was still completely surprising, so that was bad.

Rich: Great question. Any other questions?

Attendee: Did you notice customer connections getting any slower? Did you have metrics set up to check whether customer connection speed was affected?

Niall: I don't think there's any actual increase in latency. The main thing is that overall latency is down, because Mongo is able to serve queries faster with the proxy than without it.

Rich: One more question?

Attendee: A bit of a hypothetical, but if Parse hadn’t solved the problem for you, do you have a sense of what you might have had to do, like re-architect or try and do this without it?

Niall: So ultimately, we probably would have gone with sharded Mongo. Dvara is not the only connection proxy out there; there are actually more. And, fun fact, about two years ago Intercom decided to start writing one itself. It never made it to production, and I'm not even sure what state the code is in. But yeah, we were even writing one ourselves two years ago, so we probably would have dusted it off. It's hard to say.

Attendee: What do you open source and what do you not open source?

Rich: Anything that isn't absolutely core to the secret sauce of Intercom, we love to open source. We have a bunch of different open source projects. The infrastructure team has a couple of open source ones. We have an open source tool called Banjaxed, which is our incident management tool.

We definitely have a lot more stuff we would like to open source, but sometimes it's like, you know, you're kind of building really fast to do something to meet a specific need, and then you need to take a little bit more time to kind of make it a little bit more abstract in order to enable you to open source it.

But we have actually benefited so much from open source technology, and the Parse example is a perfect one. It's definitely one of our engineering core values that we do things like this in order to give back knowledge.

Attendee: Do you get much back from it? Good feedback?

Rich: I guess we do, but that isn't the reason we do it. We don't actually do it to get something back. It's more that you're trying to even the scales for all the stuff you've gotten out of open source, and pay it forward.

Attendee: Do you push changes you make back upstream?

Rich: Yeah, absolutely.