Scale Something: How Draw Something rode its rocket ship of growth
April 18, 2012 | By Jason Pearlman




["A few days after Draw Something launched, we started to notice something...strange." In this Gamasutra guest article, OMGPOP CTO Jason Pearlman offers an extensive technical breakdown of how a tiny systems team handled the game's explosive growth.]

I've worked at OMGPOP for almost four years now and have seen it transform from a dating company into a games company, and then find its footing in mobile games. We've done tons of stuff and tried many different technologies and business plans.

I've always seen us as the little guy that can use fast prototyping, agile development, and the latest tech in order to gain an advantage. Also, being in the game world means you get to test out a lot of different ideas to see what sticks. In my time here, I've made avatar systems, a text adventure game engine, a full-featured MySQL sharding library, a multiplayer real-time platformer built on our JavaScript game engine, an AIM bot system, a whole slew of chat room games powered by a bot framework we created, and tons more.

On the backend of all these games is a tiny systems team of three people -- myself, Christopher Holt and Manchul Park. We built everything from scratch and thought we had our approach to building and scaling backend systems down pretty well. That was until Draw Something came along.

Countdown

The story of Draw Something’s development actually starts around four years ago when the first version of the game was created on our website OMGPOP.com (then iminlikewithyou.com). It was called "Draw My Thing" and was a real-time drawing game. It was fun and had a somewhat big player base relative to how big our site was at that time. We also made a Facebook version of it, and the game developed a pretty large following.

Last year, we decided to make a mobile version of the game. At that point, OMGPOP was still trying to find its way in the world. We did everything we could to land that hit game. For us, like many developers, it meant working on as many games as possible, and fast. Draw Something was no different. We knew the game had potential, but no one could’ve predicted how big of a hit it would become. From a technical standpoint, we treated Draw Something like a lot of our previous games. The backend team has always built things to be efficient, fast, and to scale.

We’ve learned to keep things simple. The original backend for Draw Something was designed as a simple key/value store with versioning. The service was built into our existing ruby API (using the merb framework and thin web server). Our initial idea was why not use our existing API for all the stuff we've done before, like users, signup/login, virtual currency, inventory; and write some new key/value stuff for Draw Something? Since we design for scale, we initially chose Amazon S3 as our data store for all this key/value data. The idea behind this was why not sacrifice some latency but gain unlimited scalability and storage.

The rest of our stack is pretty standard. Anyone who wants to build scalable systems will try to make every layer scale independently from the rest. At the front we use the NGINX web server, which points to an HAProxy software load balancer, which in turn hits our Ruby API running on the Thin web server. The main datastore behind this is MySQL, sharded when absolutely necessary. We use memcached heavily, and Redis for our asynchronous queueing via the awesome Ruby library resque.
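For the queueing piece, resque jobs are just plain Ruby classes with a queue name and a class-level perform method, and enqueued jobs live in Redis until a worker picks them up. A minimal sketch, with a hypothetical job name:

require "resque"

# Minimal resque usage sketch; the job class and queue name are hypothetical.
class NotifyPartnerJob
  @queue = :notifications

  # Resque workers call perform with the arguments passed to enqueue.
  def self.perform(user_id, drawing_id)
    # In a real app this might send the "your partner drew something" push.
    puts "notify user #{user_id} about drawing #{drawing_id}"
  end
end

# Called from the web request path; a separate worker process
# (QUEUE=notifications rake resque:work) picks the job up via Redis.
Resque.enqueue(NotifyPartnerJob, 42, 1001)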

Lift Off

A few days after Draw Something launched, we started to notice something…strange. The game was growing -- on its own. And it was growing fast. On release day, it reached 30,000 downloads. About 10 days later, the rocket ship lifted off -- downloads accelerated exponentially, soon topping a million.

Celebrities started tweeting about the game (from John Mayer to Pauly D), sending us even more traffic. And people playing the game weren't leaving -- they were hooked, so total usage climbed even higher than the number of people downloading the game every day.

Most engineers develop their software to scale, but know that in any complex system, even if you try to benchmark and test it, it’s hard to tell exactly where things will fall over, in what way, what system changes need to be made, and when.

Ground Control

The first issue we ran into was that our usual API is really fast, so running the Thin web server the way we always had (single threaded, handling one request at a time) had been fine. But with the unpredictable response times of the public cloud, a single slow request could back everything up.

So we watched and saw things starting to back up, and knew this was not sustainable. In the meantime, we just continued to bring up more and more servers to buy some time. Fortunately, we had anticipated this and designed the Draw Something API so that we could easily break it out from our main API and framework. Always interested in the latest tech, we had been looking at Ruby 1.9, fibers, and in particular EventMachine + em-synchrony for a while. Combined with the need for a solution ASAP, this led us to Goliath, a non-blocking Ruby app server written by the team at PostRank. Over the next 24 hours I ported over the key/value code and other supporting libraries, wrote a few tests, and we launched the service live. The result was great: we went from 115 app instances on more than six servers to just 15 app instances.
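For reference, a bare-bones Goliath endpoint looks roughly like the sketch below; the real Draw Something service was of course much more involved, so treat this purely as an illustration of the programming model.

require "goliath"
require "json"

# Illustrative Goliath endpoint. Goliath runs each request in its own fiber
# on top of EventMachine, so a slow backend call yields the fiber instead of
# blocking the whole process the way a single-threaded Thin worker would.
class KeyValueAPI < Goliath::API
  use Goliath::Rack::Params   # parses query/body params into env['params']

  def response(env)
    key = env['params']['key']
    # A non-blocking datastore call (via an em-synchrony aware driver) would
    # go here; this sketch just echoes the key back.
    [200, { 'Content-Type' => 'application/json' }, { 'key' => key }.to_json]
  end
end

# Save as key_value_api.rb and run with: ruby key_value_api.rb -sv -p 9000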

The smooth sailing was short-lived. We quickly started seeing spikes and other really strange performance issues. At this point, we were pretty much working around the clock. Things got really bad around 1 a.m. one night, which is when we realized the main issue -- our cloud data store was throwing errors on 90 percent of our requests. Shortly after, we received an email from our vendor telling us we were "too hot" and causing issues, so they would have to start rate limiting us.

At this point, the service was pretty much a black box to us, and we needed to gain more control. We were now receiving around 30 drawings per second, a huge number (at least to us at the time). So there we were at 1 a.m., needing a completely new backend that could scale and handle our current traffic. We had been using Membase for a while for some small systems, and decided that it would make the most sense, since it had worked well for us.

We brought up a small cluster of Membase (a.k.a. Couchbase), rewrote the entire app, and deployed it live at 3 a.m. that same night. Almost instantly, our cloud data store issues subsided, although we still relied on it for a lazy migration of data into our new Couchbase cluster. With these improvements the game continued to grow, onward and upward.
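The lazy-migration pattern is simple to sketch: reads hit the new store first and fall back to the old one on a miss, backfilling as they go. The sketch below assumes duck-typed client objects with get/set methods; it is not the actual migration code.

# Illustrative read-through ("lazy") migration from the old S3-backed store
# into Couchbase: on a Couchbase miss, pull from S3 and copy the value
# forward so the next read for that key never touches S3 again.
class LazyMigratingStore
  def initialize(couchbase, s3_store)
    @couchbase = couchbase   # new primary store
    @s3_store  = s3_store    # legacy store being drained lazily
  end

  def get(key)
    value = @couchbase.get(key)
    return value if value

    legacy = @s3_store.get(key)
    @couchbase.set(key, legacy) if legacy   # backfill the new store
    legacy
  end

  def set(key, value)
    @couchbase.set(key, value)   # all new writes go only to Couchbase
  end
end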
 
The next week was even more of a blur. Other random datastore problems started to pop up, along with having to scale other parts of the infrastructure. During this time, we were trying to do some due diligence and speak to anyone we could about how they would handle our exploding growth.

I must've spoken to 10-plus smart, awesome people, including Tom Pinckney and his great team from Hunch, Frank Speiser and his team from SocialFlow, Fredrik Nylander from Tumblr, Artur Bergman from Fastly, Michael Abbot, formerly of Twitter, and many others. The funny part was that every person I spoke to gave me a different, yet equally valid, answer on how they would handle the challenge. All of this was more moral support than anything, and it made us realize our own answers were just as valid as those of these other teams, for whom we have great respect. So we continued along the path we had started on, and went with our gut on what tech to pick and how to implement it.

Even with the issues we were having with Couchbase, we decided it was too much of a risk to move off our current infrastructure and go with something completely different. At this point, Draw Something was being played by 3-4 million players each day. We contacted Couchbase and got some advice, which essentially was to expand our clusters, eventually onto really beefy machines with SSDs and tons of RAM. Over the next few days we did this, made multiple clusters, and sharded between them for even more scalability. We were also continuing to improve and scale all of our backend services as traffic continued to skyrocket. We were now averaging hundreds of drawings per second.
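Sharding between clusters can be as simple as hashing the key to pick a cluster. Here's a sketch; the routing scheme (CRC32 of the key, modulo the number of clusters) is an assumption for illustration, not necessarily the exact scheme used.

require "zlib"

# Illustrative key-based sharding across several clusters.
class ShardedStore
  def initialize(clusters)
    @clusters = clusters   # one client object per Couchbase cluster
  end

  def cluster_for(key)
    @clusters[Zlib.crc32(key) % @clusters.size]
  end

  def get(key)
    cluster_for(key).get(key)
  end

  def set(key, value)
    cluster_for(key).set(key, value)
  end
end

One caveat with a plain modulo scheme is that adding a cluster reshuffles most keys, so consistent hashing is the usual choice when the cluster count is expected to keep changing.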

At one point our growth was so huge that our players -- millions of them -- were doubling every day. It's actually hard to wrap your head around the fact that if your usage doubles every day, that probably means your servers have to double every day too. Thankfully our systems were pretty automated, and we were bringing up tons of servers constantly. Eventually we were able to overshoot and catch up with growth by placing one order of around 100 servers. Even with this problem solved, we noticed bottlenecks elsewhere.

This had us on our toes and working 24 hours a day. I think at one point we were up for around 60-plus hours straight, never leaving the computer. We had to scale out our web servers using DNS load balancing, run multiple HAProxies, break tables off MySQL into their own databases, transparently shard tables, and more. This was all being done on demand, live, and usually in the middle of the night.
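To illustrate the "break tables off into their own databases" step: with an ActiveRecord-style ORM, a single hot model can be pointed at a dedicated MySQL server while everything else stays on the main database. The hostnames below are hypothetical and the actual app was built on a different stack, so treat this purely as a sketch of the idea.

require "active_record"

# Everything else stays on the main database...
ActiveRecord::Base.establish_connection(
  adapter:  "mysql2",
  host:     "db-main.internal",       # hypothetical hostname
  database: "omgpop_main"
)

# ...while the hot table gets its own connection pool on a dedicated server.
class Drawing < ActiveRecord::Base
  establish_connection(
    adapter:  "mysql2",
    host:     "db-drawings.internal", # hypothetical hostname
    database: "drawings"
  )
end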

We were very lucky that most of our layers were scalable with little or no major modification needed. Helping us along the way were our very detailed custom server monitoring tools, which let us keep a close eye on load and memory and even provided real-time usage stats on the game, which helped with capacity planning. We eventually ended up with easy-to-launch "clusters" of our app that included NGINX, HAProxy, and Goliath servers, each independent of everything else; launching one increased our capacity by a fixed amount. At this point our drawings per second were in the thousands, and traffic that had looked huge a week earlier was just a small bump on the current graphs.

Looking Ahead

Everyone at OMGPOP was very supportive of our work and fully realized how important what we were doing was for our company. We would walk in to applause, bottles of whiskey on our desk, and positive (but tense) faces.

It's rare to see growth of this magnitude in such a short period of time. It's also rare to get a look under the hood at what it takes to grow a game at this scale. To date, Draw Something has been downloaded more than 50 million times, all within 50 days. At its peak, about 3,000 drawings are created every second. Along with the game's success, we're quite proud to say that although there were a few rough patches, we were able to keep Draw Something up and running. If the game had gone down, our huge growth would've come to a dead stop.

This week, we’re thrilled to release some new features in Draw Something like comments and being able to save drawings. Our players have been dying for them.

Now that we’re part of Zynga (technically Zynga Mobile New York), we’re able to re-focus efforts on making Draw Something as good as possible, while still maintaining the culture that makes OMGPOP such a special place. We’re even making plans to move the game over to Zynga’s zCloud infrastructure that’s tuned and built specially to handle workloads for social games.

Looking back at the roller coaster ride of the last few weeks, it’s crazy to think how far we’ve come in such a short period of time. Coming from a small company where a handful of people needed to do everything to Zynga where we have access to the people and the technology needed to grow -- from engineers to the zCloud -- it’s amazing.

In the end, we finally found our hit game. Despite the late hours, near misses and near meltdowns, we landed on a backend approach that works. Here at OMGPOP we call that drawsome.


Comments


Joe E
Congrats to you and your team - it sounds like a tough yet exhilarating ride, and in the end, all the hard work paid off. Glad to know what I have to "look forward" to in case of success ;)

We are also approaching the scalability issue as "make the code flexible, and design hardware as need arises". It also seems the bottlenecks vary depending on the type of game - more real-time action oriented vs. persistent data-centric, which makes it even harder to predict for the next game. So, if you had to do it again knowing what you know (but with the same resources as before), what would be different?

Jason Pearlman
Thanks! It was pretty intense for a while. It's that other side of the coin where you switch from worrying about the game doing well to worrying about keeping up with the traffic ;)

Keeping things simple is always the best way to go; no one wants to try to scale an overly complicated system. Also, don't tie yourself down to any particular service or vendor too much - we were easily able to switch web frameworks, datastores, and more - on demand, when we needed to (and will continue to do so in the future).

Of course, hindsight is 20/20, so I'd say we would have done what anyone would probably do if they could go back: prepare more for the onslaught of traffic that was coming. For us this would mean better analytics on the server end; just knowing every detail of your app for every request can go a long way in narrowing down bottlenecks and other issues. We would also have done more research on the vendors and platforms we used, and we would have queued up way more servers so we could easily deploy more whenever we wanted instead of having to wait at times.

But I will say I believe we were pretty well prepared, from automated scripts to set up all of our servers, to an awesome custom monitoring system, to really simple and flexible server side code - things worked in our favor!

Jason Harwood
Amazing read Jason! Thanks for sharing and congrats on the success of Draw Something. It's amazing that what the player sees is rarely representative of what is actually going on under the hood and the hard work that you and your team have had to put in just to keep the momentum going.

John Fries
Jason,

Thank you for making Draw Something, it's a truly wonderful game that has opened up a new avenue of creativity for me and my family.

I'm an avid Draw Something player, and it's an incredibly polished app that (to me) indicates many, many iterations of product development. That's why it seems incongruous to me that most of the stories in the tech press are describing its success as some kind of overnight fluke. I'm hoping you'd care to tell us more about the design history of Draw Something. It's one thing to humbly admit that no one could have predicted such a huge hit, but I'd like to hear more about the design methodology, wireframes, user feedback, failed experiments, gut feelings, etc., that allowed you to even have a chance of finding the magic formula.

Gratefully,
John

Rasmus Gunnarsson
http://wosland.podgamer.com/how-to-lose-210m-in-two-seconds/ Check this. The exact game behind Draw Something has already been done a couple of times. I'm guessing a good name + the right exposure really kicked it off though.

Matt Rix
Rasmus, it's not the "exact game" at all, and anyone who says that simply hasn't played both games. Yes, Draw Something had a better name, but that's only a small part of what made it popular. People who say "Charadium is better because it has more features" have no respect for the amount of effort it takes to make something simple. More than anything else, I believe Draw Something got popular because it's non-competitive, and because they solved the text-entry issue in a smart way.

Josh Bakken
Like

Jeremy Alessi
Thank you for sharing!

Thayn Moore
Great article! Congrats on your explosive success. I hope that they didn't forget the backend guys when they divvied up the spoils from your recent purchase. :)

Quick question: what does Draw Something use as a frontend UI framework? Are you using something that makes it easy to port between the different mobile platforms? I know you work on the backend, but this seems to be a big problem for a small startup. You mentioned NGINX; does that mean that you are hosting a web frame and running everything as HTML5?

Thanks!

Jason Pearlman
Hey Thayn,

Thanks! ;) - We actually do a lot of our own UI inside Marmalade, a cross-platform C/C++ framework. It has worked for us, but I wouldn't say it's the best option; there are tons of cross-platform solutions out there, and it all depends on the immediate needs of your game/team!

Jason

Mustafa Hanif
Amazing, amazing read. Thank you so much, it was technical, emotional and inspiring. Great work on your success.

Tom Kal
Jason,

Can you elaborate on this part:

"our cloud data store was throwing errors on 90 percent of our requests"

- Is this S3?
- Were you employing a hash pre-fix on the keys as explained in this article to maximize performance (i.e. a 4 char hash prefix to ensure content is evenly distributed across S3 infrastructure; the article below claims millions of requests per second are possible): http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html
- Were these SlowDown 5xx errors basically?

Would greatly appreciate your feedback, as we are considering using S3 as a key/value database for a specific workload where the increased latency and lack of querying would be acceptable. However, this was entirely predicated on the idea that we'd achieve "infinite" network and IO scalability. If S3 routinely throws SlowDown 5xx errors under heavy, prolonged load over time, it would be great to know. It's hard to test, because I'm concerned about its viability over long stretches of time (under production load over months).

Thanks

