The post you’re reading now focuses on the scaling implications of using Clojure and AWS. This companion post tells the tale of business drivers and implications of the same choices.
When Room Key acquired hotelicopter in July of 2011, we were still finishing our B2B product, an embeddable hotel search engine. At the time, our existing site handled a trickle of traffic, and our customers’ sites did only a trickle more. But we had big, BIG plans.
In typical high-growth startup fashion, we planned to conquer the world and become the ITA Software of hotels: the Big Switch In the Sky. As CTO, I needed to build an architecture that would handle so-called “Web Scale” traffic. We never quite got to those traffic levels as hotelicopter, but as Room Key, the story has improved remarkably.
In my view, this architecture and what my team subsequently built has reached the goal of “Web Scale” quite admirably. Over the course of the last 7 months (we launched in January 2012), we’ve gone from about 1,000 uniques/day on hotelicopter’s site, to 600,000+/day on roomkey.com. That’s 60,000% growth in 7 months.
In fact, over the course of one 10 day period, we increased our daily traffic by over 50%.
(We thank our hotel partners for this meteoric growth; they send us exit traffic from their websites.)
We’ve done this with no late nights, no frantic war-room huddles, no downtime, no muss and no fuss. We’ve spent $0 (ZE-ro) capex, and our opex spend still remains at what most of our peers in the industry would consider “latte money”.
How’d we pull it off? Having a world-class team makes all the difference, to be sure. But, I’m aiming to sketch a few details on the architecture and technology choices that I feel made achieving this fast growth and large scale possible.
The Problem Domain
Consumers want to book hotel rooms, but given the option, would prefer to do business directly with the hotel chain, rather than an intermediary like Priceline or Hotels.com, known in the industry as an OTA (Online Travel Agency).
Hotel chains maintain global state of all reservations, inventory, rates and availability in complex centralized systems known as CRSs (Central Reservation Systems). To select partners, hotel chains grant API access for querying rates and availability from their CRSs. The grantee (i.e., us) builds a technology integration which hits the CRS’s API to obtain a notion of what hotels have availability, which rooms and rates are available, and so on. These APIs are jealously guarded, persnickety, and generally obnoxious to work with. Oh, and they have awful, terrible latency, too.
Inventory (hotel property) lists and static content (photos, amenities, descriptive content, geocodes, etc) are generally exchanged out of band, cached and maintained by the distribution partner. The quality of this static content leaves a lot to be desired. Generally it must be scrubbed before ingestion. Like a carrot, or potato.
Enriching data, such as travel reviews, are obtained from 3rd parties, and must be matched to properties in the hotel chains’ inventory lists. The lack of any standard and universal identifier for hotels makes this job quite difficult, but that’s a subject for another day.
As an aggregator of our hotel partners’ inventory, our job is to present all this data in a clean, uniform interface. Make it look easy.
When a user does a search, we use our cached and scrubbed inventory lists to figure out which properties are a good match for the user. Then we go off in real time, querying our partners’ CRS APIs for rates and availability. We wait around a few seconds, collating and scrubbing the replies, before showing the user which hotels are available in their chosen destination, with the prices we’ve obtained from our partners.
When the user finds a hotel they’d like to book, we ship them off to the appropriate partner’s site, where they complete the reservation. We get paid when that happens.
Three key insights drove many of the architectural decisions I’ve made for us. Exploiting these facets of our problem domain, and letting our architecture reflect them, has made our fast growth and large scale possible.
First, our partners’ hotel content data is messy, voluminous, error-prone, and must in essence be maintained relationally, so that we can do effective data curation and enhancement.
Second, although munging, folding and manipulating this data must be done, in the end, the user experience demands no live relations between data elements. At the point that a user conducts a search, the hotel universe becomes flat and immutable.
Finally, if Bob searches for hotels and finds that the Park Hyatt in Chicago has one geocode, and at the very same time, Jane finds that the Park Hyatt has a slightly different geocode, no one will notice, or care. If Jane searches for hotels in Chicago and finds 67, and Bob searches and finds 68, it’s unlikely that anyone will notice, or care.
Put another way, users of this system have a high tolerance for inconsistent reads. Bob’s and Jane’s hotel universes need not be identical. (They can’t be completely divergent; eventual consistency is fine.)
So: A-ha! The messy relational data could live in a secluded back-end “content sausage factory”, whose sole purpose in life would be to produce a crisp, non-relational version of the hotel universe as known best at that point in time.
This “golden master”, non-relational database of hotels could then be shipped off to the live, operational system which faces users.
Moreover, different users might be exposed to different versions of the “golden master” hotel database, allowing us to test and to do progressive and continuous rollouts.
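As an illustration only, here is a minimal Python sketch of one way to expose different users to different snapshot versions. The post does not say how users were actually routed, so the hashing scheme, the function names and the version labels are all assumptions. The idea is simply that a stable hash of the user id yields a repeatable bucket, which makes a progressive rollout a matter of raising one percentage.

```python
import hashlib


def rollout_bucket(user_id: str) -> int:
    """Map a user id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def snapshot_for(user_id: str, stable: str, candidate: str, candidate_percent: int) -> str:
    """Choose which snapshot version a user sees (hypothetical helper).

    Users whose bucket falls below candidate_percent get the candidate
    snapshot; everyone else stays on the stable one.
    """
    if rollout_bucket(user_id) < candidate_percent:
        return candidate
    return stable
```

Because Bob and Jane tolerate slightly different hotel universes, nothing breaks while the two versions coexist.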
Decision One: I put relational data on one side and “static”, non-relational data on the other, with a big wall of verification process between them.
This led to Decision Two. Because the data set is small, we can “bake in” the entire content database into a version of our software. Yep, you read that right. We build our software with an embedded instance of Solr and we take the normalized, cleansed, non-relational database of hotel inventory, and jam that in as well, when we package up the application for deployment.
Egads, Colin! That’s wrong! Data is data and code is code!
We earn several benefits from this unorthodox choice. First, we eliminate a significant point of failure - a mismatch between code and data. Any version of software is absolutely, positively known to work, even fetched off of disk years later, regardless of what godawful changes have been made to our content database in the meantime. Deployment and configuration management for differing environments becomes trivial.
Second, we achieve horizontal shared-nothing scalability in our user-facing layer. That’s kinda huge. Really huge.
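To make the "bake it in" idea concrete, here is a rough Python sketch of such a build step. It is not the real pipeline: the schema, the sample rows and every function name are made up, and a plain JSON blob stands in for the embedded Solr index described above. It flattens relational rows into self-contained documents, runs them past a verification wall, and serializes the result so each application instance can load its own private copy.

```python
import json
import sqlite3


def demo_connection():
    """Build a tiny in-memory relational store with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE hotel (id INTEGER PRIMARY KEY, name TEXT, lat REAL, lon REAL);
        CREATE TABLE amenity (hotel_id INTEGER, name TEXT);
        INSERT INTO hotel VALUES (1, 'Example Hotel A', 41.89, -87.62);
        INSERT INTO hotel VALUES (2, 'Example Hotel B', 38.03, -78.48);
        INSERT INTO amenity VALUES (1, 'wifi'), (1, 'pool'), (2, 'parking');
        """
    )
    return conn


def build_snapshot(conn):
    """Flatten the relational rows into self-contained documents."""
    hotels = conn.execute("SELECT id, name, lat, lon FROM hotel ORDER BY id").fetchall()
    docs = []
    for hotel_id, name, lat, lon in hotels:
        amenities = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM amenity WHERE hotel_id = ? ORDER BY name", (hotel_id,)
            )
        ]
        docs.append({"id": hotel_id, "name": name, "lat": lat, "lon": lon, "amenities": amenities})
    return docs


def verify(docs):
    """The wall: refuse to ship a snapshot containing obviously bad records."""
    for doc in docs:
        if not doc["name"]:
            raise ValueError(f"hotel {doc['id']} has no name")
        lat, lon = doc["lat"], doc["lon"]
        if lat is None or lon is None or not (-90 <= lat <= 90 and -180 <= lon <= 180):
            raise ValueError(f"hotel {doc['id']} has an impossible geocode")


def bake(docs, version):
    """Serialize the verified snapshot so it can ship inside the application package."""
    verify(docs)
    return json.dumps({"version": version, "hotels": docs}, sort_keys=True)


def load_snapshot(blob):
    """What each app instance would do at startup: read its own private copy."""
    return json.loads(blob)
```

Since every instance carries its own read-only copy, adding capacity means starting another instance; there is no shared database to contend for.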
We also render the UI in a fat JavaScript client running in the user’s browser, fed JSON by our servers, rather than generating HTML on the server side. We reap several benefits from this decision. First, our UI gets faster without us having to pay for new servers that take advantage of Moore’s Law. Second, our UI gets faster because browser makers are dumping tons of time and energy into improved JS performance. Third, HTML is verbose and JSON isn’t, saving us both compute cycles and bandwidth on the server side.
These decisions in toto yield a picture of a three-layer cake, with a messy relational tier on the bottom, an operational API and SOA tier in the middle, and a fat client presentation layer on top.
The single decision that has yielded the most scaling benefit to date has been baking a non-relational snapshot of our hotel inventory and content into our application, giving us shared-nothing scalability.
Beyond architectural decisions, choices I made about our technology stack and operating environment have also had major implications for our ability to scale.
Clojure and The Stack
When a user conducts a search, we go off in real time, spinning out dozens or scores of individual web services calls to our partners’ CRS APIs. We then have to wait around for those calls to return, and collate the results before presenting them to the fat client application running in the user’s browser.
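For readers who want to see the shape of that fan-out, here is a minimal sketch in Python (not the Ruby or Clojure discussed in this post). The partner callables, the 3 second default deadline and the offer format are assumptions for illustration. It submits one call per partner to a thread pool, waits up to the deadline, keeps whatever came back cleanly, and collates the offers by price.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def search_partners(partners, query, deadline_seconds=3.0):
    """Call every partner concurrently and keep whatever answers before the deadline.

    partners maps a partner name to a callable taking the query and returning
    a list of (hotel_id, price) offers. Slow or failing partners are dropped.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(partners)))
    try:
        futures = {pool.submit(fetch, query): name for name, fetch in partners.items()}
        done, _late = wait(futures, timeout=deadline_seconds)
        replies = {}
        for future in done:
            if future.exception() is None:
                replies[futures[future]] = future.result()
        return replies
    finally:
        # Do not block on stragglers; their results are simply discarded.
        # cancel_futures requires Python 3.9 or later.
        pool.shutdown(wait=False, cancel_futures=True)


def collate(replies):
    """Merge per-partner replies into one list of offers, cheapest first."""
    offers = [
        {"partner": partner, "hotel_id": hotel_id, "price": price}
        for partner, rows in replies.items()
        for hotel_id, price in rows
    ]
    return sorted(offers, key=lambda offer: offer["price"])
```

A thread per partner call is the simplest thing that works in a sketch like this. It says nothing about how many such searches one machine can sustain.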
Doing this with reasonable performance has proven challenging, and has led to a particularly significant decision about our technology stack.
We started off our journey with Ruby as our language of choice. I began using Ruby long before there was a Rails (late 2001), and like me, the team really liked its productivity, expressiveness, and the enthusiasm of its community. However, with its green threads and global interpreter lock, it proved difficult to use “normal” Ruby means to achieve the performance we needed. All that IO to our partners and the concomitant response parsing was realllly slow.
Concurrency would seem to be a likely candidate solution.
I’ve been around the block enough times to know how difficult it can be to get concurrency right with traditional solutions that use threads. Inevitably, you have to wrestle with the granularity of your locking, which requires deep wizardry. Worse, you will, seemingly by definition, run into hairy, almost-impossible-to-debug situations where some thread stepped on some other thread’s data, and everyone’s pointing fingers at each other.
I wasn’t eager to jump into those waters. I’ve been bitten before.
So, we started using EventMachine, a Ruby framework that employs the reactor pattern and presents an async, event-based paradigm. This helped. One downside was that EventMachine demands a style of programming in which the flow of control is inverted, and in which there can be no real blocking operations. That’s pretty tough for mere mortals to wrap their heads around. The upside is you don’t end up with the “Who stepped on my data?” problem. But the relatively long and CPU-intensive parsing of partner responses meant we still could eke out only a few dozen concurrent user searches on a machine. It just wasn’t fast enough.
I don’t intend this to be an indictment of Ruby’s ability to scale. Ruby can be scaled. But it takes work - time and resources that we just didn’t have, when our entire development organization was four people.
So we turned to Clojure, a JVM-hosted Lisp. The things that attracted me to the language were performance, interoperability with Java’s large and mature stable of libraries, and especially the language’s approach to immutability and concurrency. I wondered if finally, Clojure would be the solution to achieving a dense user::machine ratio without the headaches of traditional threads or the obfuscation of the event-driven model.
It turns out that Clojure wins, handily. With language features like Software Transactional Memory (STM), refs and atoms, the “Who stepped on my data?” problem goes away. Couple those features with agents and futures, and you end up with a system that’s fast and doesn’t take a rocket surgeon to understand.
Our user::machine density skyrocketed, resulting in a system that has scaled to nearly 700,000 uniques/day with just a handful of machines.
The final piece of the puzzle that has enabled our ability to scale has been Amazon’s public cloud, AWS.
I shouldn’t need to sell anyone on using the cloud, but incredibly, some folks don’t seem to see it. Trust me, these days, infrastructure is pronounced “cloud”. You don’t own it, or build it, or even lay eyes on it, ever.
The shift doesn’t come for free. You have to figure out how to build your application so that it can leverage the cloud’s strengths and accommodate its shortcomings. It’s not perfect, but it gives you extraordinary flexibility, cost savings, and most importantly for us, the opportunity to scale.
Last month, when traffic shot up by over 50%, no one noticed.
Well, that’s not completely true. Our ops ninja knew it, but the sudden growth in traffic didn’t cause any heartaches. Our system just grew. The “elastic” really matters.
Could scaling like this have been accomplished using premise-based hosting, or traditional outsourced hosting? Sure. But it would have required time, attention and expertise. It would cost a bunch more, too.
We use AWS’s Elastic Beanstalk, a “platform as a service” which gives us tools to easily deploy our application to EC2 instances running Tomcat, behind a turnkey elastic load balancer, with a fancy host manager for versioning our software, detecting and restarting misbehaving instances, etc. We have continuous integration watching our GitHub repo, building and deploying the tip of our master branch to a special Beanstalk environment each time a commit is made.
For those not familiar with the AWS model for load balancing, you might want to look here.
Ditto for Beanstalk.
This stuff should sound obvious and righteous. If you’re building software, and not doing this stuff, you’re doing it wrong.
Specifically, AWS enabled us to scale by letting us forget about the physical implementation of our system. When a new machine - a new unit of user-handling capacity - becomes available as an API call, or better yet, is created and deployed automatically to handle increased demand, my developers (and our ops guru) can spend time thinking about hard problems, instead of “Are there enough slots in the rack for another box?” and “What will our traffic be tomorrow and will we have enough capacity for it?”
It enabled us to scale by abstracting away details like IP addresses and DNS, backups, storage, and monitoring. In short, like Clojure, it gets infrastructure, which is not an essential but an incidental source of complexity, out of our way.
Today we use nearly every single flavor of the AWS alphabet soup: EC2, S3, Route 53, Beanstalk, SQS, SNS, SDB, EMR, you name it.
We’ve grown an amazing 60,000% since we launched Room Key, and we haven’t had a single war-room huddle. We haven’t had any sleepless nights. We haven’t even broken a sweat. I have no doubt that we can scale a similar amount in the next few months.
To what do I attribute this amazing ability to grow, handling ever-greater traffic? First, an architecture that was specifically engineered to scale, leveraging the features of our problem domain. Second, a technology stack featuring Clojure, which allows mere mortal developers to build a highly concurrent system, yielding an enviable user::machine density. Lastly, we embraced the true nature of Amazon’s public cloud, baking AWS into the DNA of our system.
It may not be a recipe that everyone can follow, but it sure has worked well for us.
We are looking for kick-ass developers! Come work at the best gig in Central Virginia!
Flowers and chocolates to the Room Key team members who contributed.
Extra special thanks and props to Lawrence Krubner for excellent constructive criticism and feedback during the writing of this story.
;; Who Gives a Shit? …You Might?
You should read this story if you want to learn about the choices I made as CTO at a little startup in Charlottesville, Virginia between late 2007 and 2011.
You should read it if you want to hear about real-world mistakes a CTO made.
You should read it if you’re trying to build your own company.
You should read it if you’re a technology professional interested in new technologies and their impact on real problems.
You should read it if you’re Paul Graham. Or if you want to be like him.
This epic poem captures, in an off-the-cuff way, much of the story of what happened at hotelicopter over many years. It features numerous omissions, many exaggerations, some half-truths, and a few lies. It also contains a few nuggets of information that may be of some interest if you’re out there in the trenches, doing your own startup, or working on some tech. Namaste.
;; The Ballad of hotelicopter
Once upon a time (circa 2006), two bright, shiny, newly-minted graduates from the University of Virginia’s Darden School of Business decided to launch a startup. They came up with a doozy of an idea: a mashup of the best of Facebook, Tripadvisor and Kayak. It would be a hotel metasearch engine that would use social recommendations to find you the best hotel. It was one hell of an ambitious idea. They won business plan competitions. They high-fived each other over lattes. They started working on a prototype. Eventually they got the attention of a serial entrepreneur who had taken his own travel company public a few years earlier, and they managed to convince him to be their lead angel investor.
You would think two smart, earnest and hardworking MBAs would figure out that they’d bitten off too much to chew with this ambitious plan, and they did, eventually, but it took until early 2009.
Not long after arriving at the company in late 2007, I had argued with the founders to adopt a B2B approach, but to no avail. Mea culpa.
In our first pivot, we reoriented the business, away from the social and reviews aspects, stripping it down to just a Kayak-like metasearch site focused solely on hotels. “We’ll be the next consumer destination for hotel bookers! w00t!” That was the story, anyway.
The company raised a sizeable chunk of money from a small group of wealthy angel investors, and re-branded itself as hotelicopter (it was previously known as VibeAgent).
Some of you may recall our 2009 April Fool’s stunt, which to this day stands as the singular best piece of guerilla marketing chutzpah I’ve ever seen. The former CEO of hotelicopter still gets a tip of my hat for that one!
Despite my arguments that we concentrate on a B2B model, hotelicopter pressed on with an ambitious plan to build traffic using SEO and SEM. But, for two smarty-pants MBAs, one smarty-pants CTO (myself), and an experienced CMO, no one really stepped back far enough from the day to day to do the math. Every year, the other competitors in the space spend about a BILLION dollars on marketing. Our budget was, ummm… about $100,000. Even the wildly successful flying hotel prank couldn’t save us.
The flying hotels came home to roost about a year later, in early 2010, when our CMO quit and we collectively realized that ::cough:: Colin was right about going B2B. (You really should listen to me. All the time.)
Over time, I was able to convince our CEO to adopt the customer development methodology we needed to match our agile product development approach. But that’s a story for another day.
The fun now really began. The founders and our lead investor handed me all the rope I could carry. I had more than enough to hang myself and the rest of the company along with me. I knew we had to be nimble, and we had to build a platform, not a web site. We needed something scalable, something that could grow easily to the much maligned “web scale”. It had to support a three-sided platform, with publishers, travelers and hotel suppliers. We needed a solution that could integrate easily on a spectrum of publisher levels: from white-labeled web sites that we hosted, to portable widget-like search solutions, to API-level integrations. The platform had to accommodate the stone-age interfaces of hotel suppliers, and the twitch-game timing of web marketing that our publishers demanded. Oh, and I had to do this with a four person team.
At the time, the company’s technology stack hadn’t evolved much from its prototype: a monolithic LAMP application, slathered in the worst kind of PHP you can imagine, with a giant spaghetti hairtarball of relational data behind it. In a brief consulting gig I took with them before joining as CTO, I had extracted the core metasearch functionality from the big ball of mud, rewritten it in Ruby using the async IO framework EventMachine, and set it stand-alone. But the rest of the system looked like a total loss. Incidental and unwarranted complexity overwhelmed the existing architecture.
You may be aghast that this was the case, but in the 15 years or so that I’ve been working with startups, I can tell you that this state of affairs was absolutely normal. Par for the course.
For example, at that point, the site ran out of one ginormous subdirectory with hundreds of PHP files scattered like chunks of gorgonzola on your salad, sticking to one another with tenacious glee. There was a “lib” directory, which you’d think would hold much of the supporting library code, but a good fraction of that actually lived in “site”, and some in “server”. The previous programming staff had felt it good and worthwhile to roll their own half-assed MVC framework, including a barely-baked library for page caching (which broke and took the site down at regular intervals), and components for database abstraction that only worked with - wait for it - MySQL. Every single goddamn file was littered with SQL, like bacon bits on this demonic salad. There was a “log” directory, but the search logs weren’t kept there, they were in “server”. Etc., etc. It made you want to eat a gun.
The database was even worse. “Facebook-meets-Kayak-meets-Tripadvisor” sounded so good during the business plan competitions, but no one knew what they were doing when they built it, and the data model… There was no data model, really. Hundreds of tables with distressingly similar names. There were columns in tables that contained string concatenations of fields from other columns in other tables, generally glommed together with pipes or semicolons, or some other ick, except when they weren’t. There were missing indexes, huge, ponderous unused indexes, replication that worked by sheer luck, and every single fucking field on every table was prefixed with “hotel_”. It was nightmarish. “Normal form?” “Sure, it’s normally a clusterfuck.”
We dubbed this hairy mudball salad “PHP Hell”.
I suppose this could be construed as indirectly throwing rocks at PHP. Hmm. Yep, that’s pretty much true. Discuss.
Oh, yeah, lest I forget: There were no tests. None.
Faced with the oft-recurring question of “Evolve or Big Rewrite (tm),” I resolved to do the latter. Risky! But, at the time it seemed justifiable. And now in retrospect, it was what saved us. To their eternal credit, the founders and our lead angel investor gave deep, patient and ongoing support to an anxiety-inducing and time consuming process.
Taking a deeeeep breath, I chucked out the old team, and chucked out our entire code base, from a standing start at the start of 2010. We left the existing site running, on life support, while we wiped the slate clean.
The time had come for a different class of developers, and we needed them to have the latitude to work to the best of their ability. One by one I fired the old team members or they left, and in their place I hired veteran, self-starting software craftsmen. Folks who can’t help but code, whose intellectual curiosity is matched only by their desire to make lasting and significant contributions to the success of the company they work at. Folks who read xkcd, issue pull requests, hack robots after dinner at home, play Mario Kart obsessively, and who have long, hard, passionate arguments about SCRUM tools and unit testing. Bad ass muthafuckas.
This redemptive catharsis played out over the course of 2010, with the size of the team remaining more-or-less constant as we went.
I hired programmers based on their previous work. Everyone had to submit a code sample. I was hiring craftsmen! If you were hiring a woodworker to hand build chairs for your dining room, you’d want to see the chairs that he’d made previously, right? Software is no different. I was looking for folks who were very, very good, and I didn’t really give a damn what their backgrounds were. I ended up with a guy with a computer engineering degree who had recently been wielding a soldering iron, a Berkley grad who’d majored in Jazz, a messenger biker, and a Brit whose background was English Lit - The Bard, to be precise.
The next big decision I had to make, after “Big Rewrite”, and “New Team”, was “Traditional Infrastructure or Cloud?” Our experience with our traditional hosting provider had been good. Well, as good as it can be. Frankly, I was tired of trying to guess when we’d run out of space in our rack, and whether or not the next rack over had a 4U or 8U slot available, and blah blah fucking blah.
I considered Rackspace, but previous experience with their cloud offering had been underwhelming. That really only left Amazon Web Services (AWS). I confess I didn’t make this decision very scientifically. Maybe my age and the fact that I’ve done this stuff every damn day for twenty years has baked it all into my subconscious, but I figured we’d go for broke. Hell, I’d fired the team, hired a bunch of new guns, and we had not a single damn line of source code to start with. Why not, right? In for a penny, in for a pound.
Turns out, this decision looks prescient in hindsight too. Embracing the true nature of on-demand computing infrastructure deeply, fundamentally changes the game of architecting big distributed systems. When obtaining computes, bandwidth, resources, databases, the whole shmear allllll turns into function calls, the world changes. This realization came by degrees, and as it did, we incorporated this learning aggressively into the evolving architecture.
This shuffling all transpired over the course of just a few months, and concurrently with it, I was constantly sketching and re-sketching the outlines of a highly scalable, modular architecture. The team kicked the pieces around generally, and worked in earnest on the most critical and/or least-likely-to-change bits.
We kicked into high gear in the Spring of 2010.
Today I love saying “Why spend money on computes and bandwidth to render HTML? We let the browser do that for us.” Fine work, Mr. Southall.
Similarly, Matt Mitchell, a senior developer at Room Key now, came to me saying, “Colin, there’s this tool called Solr, and I think it will make searching our hotel database much easier.” Again, I can only shake my head at what happens when you hire smart people, treat them like adults, and actually listen to what they come up with. Solr quickly went from a piecewise solution to our searching needs to something far more interesting. In our problem domain, stale reads of certain data are tolerable, and exploiting that was a lever we could pull. Eventually, we ended up baking an instance of Solr/Lucene directly into our individual application processes, making it possible to achieve true linear horizontal scalability for the application.
As of this writing, we’re on track to having 9 million uniques a month, from zero at the start of 2012. We did so with absolutely no fuss, no late nights, no hand wringing, and a laughably small amount of additional opex spend. But I digress.
With the revelation about Solr, we were able to decouple the front end of the stack, which had stringent performance and scalability requirements, from the back end, into which we could now stuff all of our messy relational data and processing intensive activities. With a giant wall between them.
At this point, we were still focused on using Ruby/EventMachine as a cornerstone of our technology stack. But here we hit a snag.
Yep. I’m gonna beat up Ruby, because it was a mistake.
I started using Ruby back before there was a Rails. Before the invention of fire. (Late 2000.) No one loved Ruby more than I did, both as an individual practitioner, and as a CTO. Ruby rocks.
Except. Yeah, except it doesn’t scale. ::ducks::
OK, that’s not completely fair. Ruby does scale. But it doesn’t scale well, or easily, and in benchmarking and stress testing, I was seeing that we were going to have to use a small truckload of resources at AWS, or spend a bunch of preciousssss developer time making it scale. It was too expensive. I wanted high user::machine density, and I didn’t want to have my developers do handstands to get it.
Yeah, I could’ve done lots of things. I didn’t have time to do that shit. Not with a four person team. Not with cash running out, and promises to keep, and no time to work for Ruby. I needed something that would work for me. This thing had to be FAST. It had to drive hardware to the limit without driving us crazy.
The bottom line is that it was too much work to make Ruby go fast enough.
I knew I didn’t want to sacrifice the pure programming joy that Ruby delivers. Ruby makes smart, intense, SEAL-team-dangerous developers happy. It’s a great big chainsaw katana of object deliciousness. It gets out of your way. I wanted something that made programming fun.
So, Java was out. Wayyyyy out.
I also wanted something with maturity, libraries, support… something with gravitas. Python? Meh.
I tried Scala, and threw up a little in my mouth.
I looked at Go. I tried to like it. Then Io. Erlang. Haskell.
Finally, I looked at Clojure.
You can read my blog entry about my satori experience with Clojure; I won’t belabor it from an individual practitioner viewpoint, here. Instead, let me tell you that as the CTO at a cash-strapped startup, Clojure was the answer to a prayer. Just like Paul Graham says about the averages. [http://www.paulgraham.com/avg.html]
A little background on our application might help. Our job is to give prospective hotel bookers a view into what their options are. At the time we were making the decision to migrate from Ruby to Clojure, the system was using so-called “realtime” rates and availability checks with hotel suppliers to get that information. That meant that when a visitor conducted a search, we would spin up dozens of individual HTTP requests to hotel supplier sites to get rates and availability data at that very moment. We’d parse the responses, collate them, and present them back to the UI in just a few seconds.
You might ask why we didn’t cache that information, but suffice it to say there were significant business drivers for that decision.
Managing this concurrent (and long-running) IO was a major theme for us, and Ruby did so reasonably well using EventMachine. However, we had to normalize the returned data into a single unified data model, and none of our partners had simple (or compact) XML representations of the data, so not only did we have IO issues, but CPU-bound processing issues as well. The combination of the two made the EventMachine implementation suffer from less-than-stellar throughput, and because of Ruby’s green threads implementation and global interpreter lock, we had to run oodles of Ruby processes on each box to achieve reasonable throughput.
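To give a feel for the normalization step, here is a small Python sketch. Both partner formats are invented for illustration (real CRS payloads are far bulkier), and the parser names and the unified record shape are assumptions. Each partner gets its own parser, and every parser emits the same flat offer record, so everything downstream sees one data model.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal


def parse_partner_a(xml_text):
    """Hypothetical partner A: one <Rate> element per hotel, price as a decimal string."""
    root = ET.fromstring(xml_text)
    return [
        {
            "partner": "a",
            "hotel_id": rate.get("hotelCode"),
            "price": Decimal(rate.findtext("Amount")),
            "currency": rate.findtext("Currency"),
        }
        for rate in root.iter("Rate")
    ]


def parse_partner_b(xml_text):
    """Hypothetical partner B: attributes only, price in integer cents."""
    root = ET.fromstring(xml_text)
    return [
        {
            "partner": "b",
            "hotel_id": prop.get("id"),
            "price": Decimal(prop.get("price_cents")) / 100,
            "currency": prop.get("cur"),
        }
        for prop in root.iter("property")
    ]


# One parser per partner; all of them produce the same record shape.
PARSERS = {"a": parse_partner_a, "b": parse_partner_b}


def normalize(partner, xml_text):
    """Turn one partner's raw reply into the unified offer records."""
    return PARSERS[partner](xml_text)
```

This parsing is the CPU-bound half of the work. It runs once per partner reply, dozens of times per search.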
Perhaps just as importantly, the reactor pattern’s upside-down flow of control style of programming was (and is) a pain in the ass. It was hard to read, hard to maintain, and generally obstreperous.
I can already hear you Ruby folks protesting “Fibers!” and so on. Heh. Have fun storming that castle.
Clojure was a whole different story, and addressed these issues admirably, for all the reasons you’ll discover when you look into it further. (Hint, hint.)
While the team was wrestling with other pieces of the system, I personally prototyped the piece of our stack with the highest scalability and performance demands using Clojure, and benchmarked. It was immediately obvious that it was a game changer. Thus we began our journey from being a mostly-Ruby shop to an almost-solely-Clojure shop.
Again, to their credit, hotelicopter’s founders didn’t bat an eyelash. Lisp, Ruby, blah blah blah. Just get it done, Colin.
I’m OK with that.
Bear in mind that at this point, we already had a running business - the consumer-oriented hotel metasearch engine. It was running on the awful PHP code, and we were putting just enough time into it so it wouldn’t completely fall over.
Meantime, we had done the leg-work to figure out what it was we could and should build as the first step towards our new B2B platform. This new platform was what I had resolved to build in Clojure on AWS.
We began picking off pieces and building them, learning Clojure as we went. It turned out to be a very steep learning curve, but despite that we were doing reasonably well with the new language and environment within a few months.
Furiously coding away, in true MVP (Minimum Viable Product) style, we launched early customers while still filling in the gaps in the platform. It worked! It was fast, reliable, simple, and scalable. It looked like we had a winner.
I’m gonna pause this part of the story, and pull another thread. I’ll weave them back together shortly.
;; The Saga of Hotel Distribution - Or - How to Boil a Frog
Back in the Good Old Days (tm), before the invention of the Intertubes, to book a hotel you went to see a human being. A travel agent.
This agent of travel was endowed with special powers. Namely, the power to access an arcane oracular system known as a GDS - a “Global Distribution System”. This system allowed the travel agent to search for hotel rates and availability, and to book your rooms! How exciting, and how quaint!
When a travel agent booked a room for you like this, they were paid a 10% commission by the hotel.
Back then, the Internet was a weird, fringy thing. The hotel chains and hotel owners figured it was a flash in the pan. Like laserdisks, or Segas.
The earliest Online Travel Agencies (OTAs) connected to the GDSs to get inventory and book rooms. But then something else started to happen…
When the hotel suppliers were approached by these fledgling OTAs to sell hotel rooms directly - not through the GDSs - they figured, sure. Why not? We have some “distressed inventory” - some hotel rooms that we can’t sell, chronically, and we’ll give these online travel agents this cruddy inventory. We’ll sell it to them wholesale, cuz we’re not gonna make a dime from it any other way. And they can do what they will with this inventory.
So they started giving inventory directly to the OTAs. And the OTAs started selling. Before you knew it, the hotel suppliers started giving the OTAs non-distressed inventory, at a discount. Not much, ya know. Just a little.
Bit by bit… Like the frog that doesn’t figure out it should jump out of the cold pan of water on the stove until it’s boiling and too late… The water started getting hotter. And hotter.
This distribution channel for hotels opened a Pandora’s box of problems. For one thing, the customer was at arm’s length from the hotel. The hotel didn’t get to shape the customer experience of booking. The hotel was at the mercy of the OTA as far as being compared to other hotels. The customer wasn’t exposed to messaging about the hotel’s loyalty program, or even basic branding. The list goes on and on.
But maybe the very worst part was (and is) how much it cost the hotel. Usually about 30% - three times as much as paying a travel agent’s commission in the Good Old Days.
The poor frogs, er, hotels, were waking up to the fact that they were now hooked on OTAs for distribution. But, they couldn’t quit the crack cocaine of selling inventory through the OTA channel - it was too much volume. Too big to fail! Boiling water!
It got worse.
After 9/11, the bottom fell out of the flights market. Airlines were scrabbling to stay alive, and commissions for selling airline tickets plummeted. In a competitive frenzy, one by one, the OTAs stopped charging customers fees for booking flights.
All of this spelt trouble for the OTAs. Flights had been the lion’s share of their revenue. They had quarterly numbers to make. They needed to make up this lost revenue somewhere… but where?
Oh, yeah. Let’s get it from the hotels!
It was a bad scene. Finally, the hotels cried “Uncle!”. Some of the biggest hotels in the world got together and said, “Let’s do something. Let’s start our own OTA. We’ll own it! It will send visitors directly to our own web site to book! And we’ll be able to set the commission. Let’s make it low, like it was in the Good Old Days! Huzzah!” So say we all.
So around the beginning of 2010, they formed a joint venture. That joint venture, Room Key, needed a technology platform. One that looked suspiciously like hotelicopter’s.
;; Against the Grain
See, these two subplots DO intersect.
The joint venture began looking for a company to acquire or to partner with, to find a web platform that could grow to meet their ambitions. They looked at hotelicopter. For a variety of serendipitous reasons, we looked pretty good. Soon enough, the acquisition process was in full swing, and they put hotelicopter under the microscope. Actually it felt more like a colonoscopy.
Every single one of the technology choices I made as CTO of a scrappy startup was called into question. They questioned anything that didn’t fit the model of how they do business. I.e., everything. The culture clash was epic.
They wondered if Amazon was reliable. This might seem like a strange thing, but in their world, the world of proprietary infrastructure, which is remarkably unreliable, it made sense. Literally, they wanted to know how many nines, how many OC-48s, disaster recovery, etc. They wondered if Amazon scaled. They wondered if our architecture scaled. They wondered if our code was fast enough.
When I say “wondered”, here, I mean wondered in the way a dentist pulling an abscessed tooth out of your mouth wonders if he’ll need to stand up and put one foot on the chair to get more leverage, or maybe just use the drill some more?
They questioned the choice of Clojure. They asked for justification why we wouldn’t be using Java. They wondered where we’d ever be able to find enough programmers to work on such a fringy language. They couldn’t understand what Solr was, much less why we used it. Their collective eyes glazed over at discussions of the genetic algorithms we used to optimize Solr weights. They disputed my assertions about our ability to test the system. And so on, and so on, ad nauseam.
The choices I’d made never felt more against the grain than during the due diligence carried out on hotelicopter. I’ll touch on a couple of the juicy ones.
Regarding Amazon, there were a variety of questions, ranging from “What is it?” to “Isn’t operating your own infrastructure cheaper?” to “Does it really scale?” to “Does your application scale?” The silly questions were easy enough to rebut, but the last two - “Does Amazon scale?” and “Does your app scale on Amazon?” were persistent and difficult to explain. To address these issues, it seemed far more effective to “Show, don’t tell,” and so we used developer time to build load and stress tests, and we ran them. No real surprise, we found a couple of bugs, but the point was amply made that yes, Amazon does in fact work as advertised, and more importantly, our architecture would scale to handle the anticipated load.
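For readers who have never written one, a load test of this kind can be very small. The following is a minimal sketch in Python under stated assumptions, not the harness we built: `handle_request` is a hypothetical stand-in for one call to the system under test, and the report fields are illustrative. It fires a fixed number of requests at a chosen concurrency and summarizes errors and latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(i):
    """Hypothetical stand-in for one request to the system under test."""
    time.sleep(0.001)  # pretend the service takes about a millisecond
    return 200


def timed_call(i):
    """Run one request and return (status, elapsed seconds)."""
    start = time.perf_counter()
    status = handle_request(i)
    return status, time.perf_counter() - start


def run_load(total_requests, concurrency):
    """Issue total_requests calls, at most `concurrency` at a time."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))
    latencies = sorted(elapsed for _, elapsed in results)
    p95_index = min(len(latencies) - 1, int(len(latencies) * 0.95))
    return {
        "requests": total_requests,
        "errors": sum(1 for status, _ in results if status != 200),
        "p50": latencies[len(latencies) // 2],
        "p95": latencies[p95_index],
    }
```

Running `run_load` repeatedly with increasing `concurrency` and plotting the error count and the p95 latency gives the kind of graph that settles a "does it scale?" argument better than any amount of talking.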
Explaining AWS was a pain in the butt. Explaining Clojure was a whole different animal.
Here, the questions bordered on incredulous criticism. A few choice derisive phrases came up, including “toy language”, “your pet language”, and so on.
There were concerns about finding talent that had a kernel of truth in them. My response was that I was looking for the veteran software craftsmen I described above, and they were damned hard to find no matter what, and that pretty much anyone we hired would have to be trained to use the language. I’m not sure that went over so well.
Another question was, “How is Clojure suited to large programming teams?” My response was that I never intended us to have a large programming team. Programming in the large is a higher-order “code smell” (“architecture smell”?) that means you’re doing it wrong. Decoupled, distributed systems mean you shouldn’t be worrying about this problem any more. “Just pass messages.” I don’t think that was quite what they expected to hear, either.
I laid some of the skepticism to rest once I was able to explain that Clojure was a JVM-hosted language, which meant that much of the Java ecosystem could be leveraged, including debugging tools, profiling, etc. Although no one said so aloud, I think that they took this to mean that if the acquisition was completed, the system could be migrated to Java. Heh heh.
Finally, there were questions about the performance characteristics of Clojure. Those were easy enough to address by pointing to the results of the scaling tests we conducted.
Looking back, I think that two things made the “sell” of Clojure and AWS (and all of our other beating-the-averages decisions) possible: 1) the empirical results we could show, and 2) the fact that Clojure, and hence the system, was hosted on the JVM. Ultimately I think the former is what sealed the deal; I didn’t have to have *arguments* about the characteristics of our system. Instead, I could simply point to pretty graphs and charts.
The due diligence dragged on and on, but you already know the punchline. Eventually we got the thumbs up, and so hotelicopter was acquired by the joint venture in the Fall of 2011, and became Room Key. I like to think that when that happened, the state of the art for Online Travel Agencies (OTAs) just got a little better.
You might wonder why we didn’t refuse the offer, and continue to remain hotelicopter, free and wildly tipping over apple carts in the hospitality industry. The truth is that we weren’t making enough money fast enough, and our very, very patient investors had waited long enough. They wanted out, and I can’t blame them. We exited handsomely, and although it was no home run, it was a respectable double. I’m not complaining.
;; Things We Flubbed and Things We Did Right
There are a few things I didn’t cover above, like our use of genetic algorithms, and for that I apologise. But this is already a bit of a War and Peace, and if you’ve made it this far you deserve a break. Thank you for your time and attention, and good luck out there.
;; Quotes From the Characters
“There was no bureaucracy or process or politics to restrain team members from taking risks and doing good work. Of course, everyone was accountable, and if you took a risk, you had to justify it with results. But the important thing was that the culture encouraged and inspired good workers to do their best.”
“Clojure takes a similar world-view. Unlike Java, where two-thirds of what you write is for the compiler, Clojure really gets out of your way. When you write Java, you’re constantly thinking about the bureaucracy/process that the compiler demands. The compiler enforces the kind of oversight and restrictive management that big slow-moving corporate cultures love. In a way, it lets managers manage their coders without having to stand over their shoulders.”
“Clojure, on the other hand, trusts the developer entirely and merely asks him to express his intent. This allows good developers to do really good work. Of course, it also allows bad developers [to] really make a mess. So as an organization using Clojure, you have to make a different choice. Instead of hiring potentially mediocre programmers and throwing them into an environment that polices them, you hire really good developers and trust them.”
— Andrew Diamond, Senior Developer formerly with hotelicopter / Room Key
“I don’t think the value of the AWS infrastructure can be overstated. The number of things we don’t worry about and the number of people we don’t employ would be considered black magic by a good portion of the 1990s computer industry.”
— Chris Hapgood, Senior Room Key Developer
“If you have a gut feeling, go with it … but follow through. I remember when I suggested that we chuck mongo and use Solr only. This was risky, and I paid in sweating bullets… but only for one ridiculously stressful afternoon :) - totally worth it ha.”
“Team is everything. Everyone we have complements [sic] each other. Personality is critical. Passion is a must. Everyone on our team rocks.”
“Oh, leave your ego at home. Trust me it feels good.”
“Don’t be afraid of a huge challenge or change, embrace it. Even if it hurts at first.”
“Stay positive when around your team members. It’s too easy to complain, and it’s contagious.”
— Matt Mitchell, Senior Room Key Developer
“Before joining the hotelicopter team, I had already begun to experiment with rudimentary single-page apps using AJAX to pull in JSON data and which manipulated the browser DOM accordingly. I was already very encouraged by the speedy and responsive user experience that resulted.”
“Then I arrived at hotelicopter and found an old-school server-side app with PHP-generated HTML and lots of sluggish round-trips to the server. The product director John Demarchi and I started to formulate ideas for a next-generation hotel search UI. His existing vision (which I immediately subscribed to) was the idea of what he termed “site-as-application” (or in other words what we would nowadays term a web-app) - a site that looked and responded more like a traditional desktop application.”
“So when the dev team sat down with Colin and started to thrash out what a newly architected search engine might look like, it seemed like a natural fit and so, with his blessing, I set about building the first prototype and consulting with the back-end team on an API… except this germ of an idea eventually grew into something even better.”
“One day, Colin suggested, “why not build this thing to be portable?” and so the UI became not just a single-page web application but an SDK comprising data models and mutually aware UI components that could be placed on any web page anywhere and bring fully featured (and monetisable) hotel content and metasearch functionality to anyone who needed it. The potential was (and still is) enormous.”
— Tom Southall, Room Key Front End Development Manager
;; Executive Summary
Re-reading this saga, and pondering Andrew’s observation about big company bureaucracy / process, I’m struck that the adage, “Your code will end up reflecting the culture of your company,” (roughly paraphrased) is deeply, profoundly true. The problems of our work are truly sociological, not technological.
It was a really fun ride, and Room Key is going to be even more fun. I’m still looking for kick-ass developers. Send me some code.
I feel similarly about Java and Lisp. Also felt the same way about C++ as I do about Java. For most of the last 3 decades MUMPS has been my bread and butter, but more and more I’m ready for a new challenge and it does seem like Lisp/Scheme/Clojure could be it. It will be interesting if you can learn and use Clojure without learning and using at least a bit of Java.