The post you’re reading now focuses on the scaling implications of using Clojure and AWS. This companion post tells the tale of business drivers and implications of the same choices.
When Room Key acquired hotelicopter in July of 2011, we were still finishing our B2B product, an embeddable hotel search engine. At the time, our existing site handled a trickle of traffic, and our customers’ sites did only a trickle more. But we had big, BIG plans.
In typical high-growth startup fashion, we planned to conquer the world and become the ITA Software of hotels: the Big Switch In the Sky. As CTO, I needed to build an architecture that would handle so-called “Web Scale” traffic. We never quite got to those traffic levels as hotelicopter, but as Room Key, the story has improved remarkably.
In my view, this architecture and what my team subsequently built has reached the goal of “Web Scale” quite admirably. Over the course of the last 7 months (we launched in January 2012), we’ve gone from about 1,000 uniques/day on hotelicopter’s site, to 600,000+/day on roomkey.com. That’s 60,000% growth in 7 months.
In fact, over the course of one 10 day period, we increased our daily traffic by over 50%.
(We thank our hotel partners for this meteoric growth; they send us exit traffic from their websites.)
We’ve done this with no late nights, no frantic war-room huddles, no downtime, no muss and no fuss. We’ve spent $0 (ZE-ro) capex, and our opex spend still remains at what most of our peers in the industry would consider “latte money”.
How’d we pull it off? Having a world-class team makes all the difference, to be sure. But I’m aiming to sketch a few details on the architecture and technology choices that I feel made achieving this fast growth and large scale possible.
The Problem Domain
Consumers want to book hotel rooms, but given the option, would prefer to do business directly with the hotel chain, rather than an intermediary like Priceline or Hotels.com, known in the industry as an OTA (Online Travel Agency).
Hotel chains maintain global state of all reservations, inventory, rates and availability in complex centralized systems known as CRSs (Central Reservation Systems). To select partners, hotel chains grant API access for querying rates and availability from their CRSs. The grantee (i.e., us) builds a technology integration which hits the CRS’s API to obtain a notion of what hotels have availability, which rooms and rates are available, and so on. These APIs are jealously guarded, persnickety, and generally obnoxious to work with. Oh, and they have awful, terrible latency, too.
Inventory (hotel property) lists and static content (photos, amenities, descriptive content, geocodes, etc) are generally exchanged out of band, cached and maintained by the distribution partner. The quality of this static content leaves a lot to be desired. Generally it must be scrubbed before ingestion. Like a carrot, or potato.
Enriching data, such as travel reviews, are obtained from 3rd parties, and must be matched to properties in the hotel chains’ inventory lists. The lack of any standard and universal identifier for hotels makes this job quite difficult, but that’s a subject for another day.
As an aggregator of our hotel partners’ inventory, our job is to present all this data in a clean, uniform interface. Make it look easy.
When a user does a search, we use our cached and scrubbed inventory lists to figure out which properties are a good match for the user. Then we go off in real time, querying our partners’ CRS APIs for rates and availability. We wait around a few seconds, collating and scrubbing the replies, before showing the user which hotels are available in their chosen destination, with the prices we’ve obtained from our partners.
When the user finds a hotel they’d like to book, we ship them off to the appropriate partner’s site, where they complete the reservation. We get paid when that happens.
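To make that flow concrete, here is a minimal sketch in Python. It is an illustration, not the production code: the inventory records, the `Reply` shape, and the function names are all hypothetical, and the real-time partner queries are replaced by canned replies.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cached, scrubbed inventory: one flat record per property.
INVENTORY = [
    {"id": "h1", "name": "Hotel A", "city": "Chicago"},
    {"id": "h2", "name": "Hotel B", "city": "Chicago"},
    {"id": "h3", "name": "Hotel C", "city": "Boston"},
]


@dataclass(frozen=True)
class Reply:
    """One partner answer for one property (hypothetical shape)."""
    hotel_id: str
    rate: Optional[float]  # None means no availability


def match_properties(city):
    """Step 1: pick candidate properties from the cached inventory."""
    return [h for h in INVENTORY if h["city"] == city]


def collate(candidates, replies):
    """Step 3: keep available hotels at their lowest rate, cheapest first."""
    best = {}
    for r in replies:
        if r.rate is None:
            continue  # partner reported no availability
        if r.hotel_id not in best or r.rate < best[r.hotel_id]:
            best[r.hotel_id] = r.rate
    results = [
        {"name": h["name"], "rate": best[h["id"]]}
        for h in candidates
        if h["id"] in best
    ]
    return sorted(results, key=lambda x: x["rate"])


# Step 2 (the real-time partner queries) is stubbed out with canned replies.
candidates = match_properties("Chicago")
replies = [Reply("h1", 329.0), Reply("h1", 310.0), Reply("h2", None)]
results = collate(candidates, replies)
```

Here `results` holds only Hotel A, at the lower of its two quoted rates; Hotel B drops out because its partner reported no availability.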
Three key insights drove many of the architectural decisions I’ve made for us. Exploiting these facets of our problem domain, and letting our architecture reflect them, has made our fast growth and large scale possible.
First, our partners’ hotel content data is messy, voluminous, error-prone, and must in essence be maintained relationally, so that we can do effective data curation and enhancement.
Second, although munging, folding and manipulating this data must be done, in the end, the user experience demands no live relations between data elements. At the point that a user conducts a search, the hotel universe becomes flat and immutable.
Finally, if Bob searches for hotels and finds that the Park Hyatt in Chicago has one geocode, and at the very same time, Jane finds that the Park Hyatt has a slightly different geocode, no one will notice, or care. If Jane searches for hotels in Chicago and finds 67, and Bob searches and finds 68, it’s unlikely that anyone will notice, or care.
Put another way, users of this system have a high tolerance for inconsistent reads. Bob’s and Jane’s hotel universes need not be identical. (They can’t be completely divergent; eventual consistency is fine.)
So: A-ha! The messy relational data could live in a secluded back-end “content sausage factory”, whose sole purpose in life would be to produce a crisp, non-relational version of the hotel universe as known best at that point in time.
This “golden master”, non-relational database of hotels could then be shipped off to the live, operational system which faces users.
Moreover, different users might be exposed to different versions of the “golden master” hotel database, allowing us to test and to do progressive and continuous rollouts.
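One simple way to expose different users to different snapshots is to hash each user into a stable bucket. The sketch below assumes that approach; the snapshot names, the percentages, and the function names are hypothetical, and nothing here is meant as the actual rollout mechanism.

```python
import hashlib

# Hypothetical snapshot versions and the share of users each should receive.
ROLLOUT = [("snapshot-41", 90), ("snapshot-42", 10)]  # percentages sum to 100


def bucket(user_id):
    """Map a user id to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def snapshot_for(user_id):
    """Pick the snapshot whose percentage band contains the user's bucket."""
    b = bucket(user_id)
    upper = 0
    for version, percent in ROLLOUT:
        upper += percent
        if b < upper:
            return version
    raise ValueError("rollout percentages must sum to 100")
```

Because the bucket depends only on the user id, Bob keeps seeing the same snapshot on every request, while widening the rollout is just a matter of changing the percentages.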
Decision One: I put relational data on one side and “static”, non-relational data on the other, with a big wall of verification process between them.
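A rough sketch of that wall, in Python with an in-memory SQLite database standing in for the relational side. The schema, the sample rows, and the `verify` rule are assumptions made up for illustration; the real verification process is not described here.

```python
import json
import sqlite3

# Hypothetical back-end schema: hotels and amenities live in separate tables.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE hotel (id INTEGER PRIMARY KEY, name TEXT, lat REAL, lon REAL);
    CREATE TABLE amenity (hotel_id INTEGER REFERENCES hotel(id), label TEXT);
    INSERT INTO hotel VALUES (1, 'Hotel A', 41.89, -87.62), (2, 'Hotel B', NULL, NULL);
    INSERT INTO amenity VALUES (1, 'pool'), (1, 'wifi'), (2, 'wifi');
""")


def flatten(conn):
    """Join the relational tables into one self-contained document per hotel."""
    docs = []
    hotels = conn.execute("SELECT id, name, lat, lon FROM hotel ORDER BY id")
    for hid, name, lat, lon in hotels:
        labels = [
            row[0]
            for row in conn.execute(
                "SELECT label FROM amenity WHERE hotel_id = ? ORDER BY label",
                (hid,),
            )
        ]
        docs.append(
            {"id": hid, "name": name, "lat": lat, "lon": lon, "amenities": labels}
        )
    return docs


def verify(doc):
    """The wall: reject documents not fit to ship (here, a missing geocode)."""
    return bool(doc["name"]) and doc["lat"] is not None and doc["lon"] is not None


# Only verified, flat documents cross over to the user-facing side.
golden_master = [d for d in flatten(db) if verify(d)]
snapshot = json.dumps(golden_master, sort_keys=True)
```

Hotel B never reaches the snapshot because it has no geocode; the user-facing side only ever sees flat documents that passed the check.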
This led to Decision Two. Because the data set is small, we can “bake” the entire content database into a version of our software. Yep, you read that right. We build our software with an embedded instance of Solr and we take the normalized, cleansed, non-relational database of hotel inventory, and jam that in as well, when we package up the application for deployment.
Egads, Colin! That’s wrong! Data is data and code is code!
We earn several benefits from this unorthodox choice. First, we eliminate a significant point of failure - a mismatch between code and data. Any version of software is absolutely, positively known to work, even fetched off of disk years later, regardless of what godawful changes have been made to our content database in the meantime. Deployment and configuration management for differing environments becomes trivial.
Second, we achieve horizontal shared-nothing scalability in our user-facing layer. That’s kinda huge. Really huge.
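As a toy illustration of shipping code and data as one unit, the Python sketch below packs a data snapshot and a manifest into a single archive and reads the data back from that same archive at startup. The file layout, the manifest fields, and the function names are invented for the example; the real build embeds Solr, which this sketch does not attempt to model.

```python
import io
import json
import zipfile


def build_artifact(code_version, hotels):
    """Package the data snapshot and a manifest into one deployable archive."""
    manifest = {"code_version": code_version, "hotel_count": len(hotels)}
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("manifest.json", json.dumps(manifest))
        z.writestr("data/hotels.json", json.dumps(hotels))
    return buf.getvalue()


def load_hotels(artifact):
    """At startup, read hotel data from the artifact itself, not a shared database."""
    with zipfile.ZipFile(io.BytesIO(artifact)) as z:
        manifest = json.loads(z.read("manifest.json"))
        hotels = json.loads(z.read("data/hotels.json"))
    if len(hotels) != manifest["hotel_count"]:
        raise RuntimeError("artifact data does not match its manifest")
    return manifest["code_version"], hotels


artifact = build_artifact("1.4.2", [{"id": 1, "name": "Hotel A"}])
version, hotels = load_hotels(artifact)
```

Every instance started from the same artifact sees exactly the same data, and no instance needs to reach a shared store to serve a search.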
We reap several benefits from the decision to push presentation into a fat client running in the user’s browser. First, our UI gets faster without us having to pay for new servers that take advantage of Moore’s Law. Second, our UI gets faster because browser makers are dumping tons of time and energy into improved JS performance. Third, HTML is verbose and JSON isn’t, saving us both compute cycles and bandwidth on the server side.
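The third point is easy to check with a back-of-the-envelope comparison. The Python sketch below serializes the same made-up result list once as compact JSON and once as server-rendered markup; the fields and the HTML template are hypothetical, and the exact ratio will depend on the real markup.

```python
import json

# Hypothetical search results; the fields are made up for illustration.
hotels = [{"name": f"Hotel {i}", "rate": 100 + i} for i in range(50)]

# What the server sends when the browser does the rendering.
as_json = json.dumps(hotels, separators=(",", ":"))

# What the server would send if it rendered the markup itself (made-up template).
as_html = "".join(
    f'<li class="hotel"><span class="name">{h["name"]}</span>'
    f'<span class="rate">${h["rate"]}</span></li>'
    for h in hotels
)

# Print both payload sizes in bytes for comparison.
print(len(as_json.encode("utf-8")), len(as_html.encode("utf-8")))
```

With this template the markup carries the same two fields per hotel wrapped in several times as many bytes of tags and attributes, all of which the server would otherwise have to generate and transmit.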
These decisions in toto yield a picture of a three-layer cake, with a messy relational tier on the bottom, an operational API and SOA tier in the middle, and a fat client presentation layer on top.
The single decision that has yielded the most scaling benefit to date has been baking a non-relational snapshot of our hotel inventory and content into our application, giving us shared-nothing scalability.
Beyond architectural decisions, choices I made about our technology stack and operating environment have also had major implications for our ability to scale.
Clojure and The Stack
When a user conducts a search, we go off in real time, spinning out dozens or scores of individual web services calls to our partners’ CRS APIs. We then have to wait around for those calls to return, and collate the results before presenting them to the fat client application running in the user’s browser.
Doing this with reasonable performance has proven challenging, and has led to a particularly significant decision about our technology stack.
We started off our journey with Ruby as our language of choice. I began using Ruby long before there was a Rails (late 2001), and like me, the team really liked its productivity, expressiveness, and the enthusiasm of its community. However, with its green threads and global interpreter lock, it proved difficult to use “normal” Ruby means to achieve the performance we needed. All that IO to our partners and the concomitant response parsing was realllly slow.
Concurrency would seem to be a likely candidate solution.
I’ve been around the block enough times to know how difficult it can be to get concurrency right with traditional solutions that use threads. Inevitably, you have to wrestle with the granularity of your locking, which requires deep wizardry. Worse, you will, seemingly by definition, run into hairy, almost-impossible-to-debug situations where some thread stepped on some other thread’s data, and everyone’s pointing fingers at each other.
I wasn’t eager to jump into those waters. I’ve been bitten before.
So, we started using EventMachine, a Ruby framework that employs the reactor pattern and presents an async, event-based paradigm. This helped. One downside was that EventMachine demands a style of programming in which the flow of control is inverted, and in which there can be no real blocking operations. That’s pretty tough for mere mortals to wrap their heads around. The upside is you don’t end up with the “Who stepped on my data?” problem. But the relatively long and CPU-intensive parsing of partner responses meant we still could eke out only a few dozen concurrent user searches on a machine. It just wasn’t fast enough.
I don’t intend this to be an indictment of Ruby’s ability to scale. Ruby can be scaled. But it takes work - time and resources that we just didn’t have, when our entire development organization was four people.
So we turned to Clojure, a JVM-hosted Lisp. The things that attracted me to the language were performance, interoperability with Java’s large and mature stable of libraries, and especially the language’s approach to immutability and concurrency. I wondered if finally, Clojure would be the solution to achieving a dense user::machine ratio without the headaches of traditional threads or the obfuscation of the event-driven model.
It turns out that Clojure wins, handily. With language features like Software Transactional Memory (STM), refs and atoms, the “Who stepped on my data?” problem goes away. Couple those features with agents and futures, and you end up with a system that’s fast and doesn’t take a rocket surgeon to understand.
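The shape of that fan-out can be sketched with futures in any language that has them. The example below uses Python’s `concurrent.futures` as a stand-in, not Clojure, and it does not model STM, refs, atoms or agents. The partner names, latencies, and function names are hypothetical. Each worker returns a fresh immutable tuple and touches no shared state, which is the property that makes the “Who stepped on my data?” problem disappear. Python’s own interpreter lock limits CPU-bound parsing, so treat this as an illustration of structure rather than of performance.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

# Hypothetical partners and their simulated response latencies, in seconds.
PARTNER_LATENCY = {"chain-a": 0.01, "chain-b": 0.02, "chain-c": 1.0}


def query_partner(partner, destination):
    """Stand-in for a slow CRS API call; returns a new tuple, mutates nothing."""
    time.sleep(PARTNER_LATENCY[partner])
    return (partner, destination, "ok")


def search(destination, timeout=0.25):
    """Fan out one call per partner, then collate whatever returned in time."""
    pool = ThreadPoolExecutor(max_workers=len(PARTNER_LATENCY))
    futures = [pool.submit(query_partner, p, destination) for p in PARTNER_LATENCY]
    done, _not_done = wait(futures, timeout=timeout)
    pool.shutdown(wait=False)  # do not block the user on stragglers
    return sorted(f.result() for f in done)


replies = search("Chicago")
```

The slow partner simply misses the deadline and is left out of `replies`; the code reads top to bottom, with no inverted flow of control.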
Our user::machine density skyrocketed, resulting in a system that has scaled to nearly 700,000 uniques/day with just a handful of machines.
The final piece of the puzzle that has enabled our ability to scale has been Amazon’s public cloud, AWS.
I shouldn’t need to sell anyone on using the cloud, but incredibly, some folks don’t seem to see it. Trust me, these days, infrastructure is pronounced “cloud”. You don’t own it, or build it, or even lay eyes on it, ever.
The shift doesn’t come for free. You have to figure out how to build your application so that it can leverage the cloud’s strengths and accommodate its shortcomings. It’s not perfect, but it gives you extraordinary flexibility, cost savings, and most importantly for us, the opportunity to scale.
Last month, when traffic shot up by over 50%, no one noticed.
Well, that’s not completely true. Our ops ninja knew it, but the sudden growth in traffic didn’t cause any heartaches. Our system just grew. The “elastic” really matters.
Could scaling like this have been accomplished using premises-based hosting, or traditional outsourced hosting? Sure. But it would have required time, attention and expertise. It would have cost a bunch more, too.
We use AWS’s Elastic Beanstalk, a “platform as a service” which gives us tools to easily deploy our application to EC2 instances running Tomcat, behind a turnkey elastic load balancer, with a fancy host manager for versioning our software, detecting and restarting misbehaving instances, etc. We have continuous integration watching our GitHub repo, building and deploying the tip of our master branch to a special Beanstalk environment each time a commit is made.
For those not familiar with the AWS model for load balancing, you might want to look here.
Ditto for Beanstalk.
This stuff should sound obvious and righteous. If you’re building software, and not doing this stuff, you’re doing it wrong.
Specifically, AWS enabled us to scale by letting us forget about the physical implementation of our system. When a new machine - a new unit of user-handling capacity - becomes available as an API call, or better yet, is created and deployed automatically to handle increased demand, my developers (and our ops guru) can spend time thinking about hard problems, instead of “Are there enough slots in the rack for another box?” and “What will our traffic be tomorrow and will we have enough capacity for it?”
It enabled us to scale by abstracting away details like IP addresses and DNS, backups, storage, and monitoring. In short, like Clojure, it gets infrastructure, which is not an essential but an incidental source of complexity, out of our way.
Today we use nearly every single flavor of the AWS alphabet soup: EC2, S3, Route 53, Beanstalk, SQS, SNS, SDB, EMR, you name it.
We’ve grown an amazing 60,000% since we launched Room Key, and we haven’t had a single war-room huddle. We haven’t had any sleepless nights. We haven’t even broken a sweat. I have no doubt that we can scale a similar amount in the next few months.
To what do I attribute this amazing ability to grow, handling ever-greater traffic? First, an architecture that was specifically engineered to scale, leveraging the features of our problem domain. Second, a technology stack featuring Clojure, which allows mere mortal developers to build a highly concurrent system, yielding an enviable user::machine density. Lastly, we embraced the true nature of Amazon’s public cloud, baking AWS into the DNA of our system.
It may not be a recipe that everyone can follow, but it sure has worked well for us.