AMO Scalability: Then and Now

Struggling with scalability on AMO is nothing new, but the tools we use to solve the problems have changed over time. Here is a bit of information on the performance evolution AMO has gone through. I wanted to link to the Wayback Machine for all our old versions, but I get "Redirect Errors" for the addons.mozilla.org domain, so I'll have to make do with code repositories.

Version 1 of AMO wasn't concerned with caching. It was straight PHP talking directly to a single MySQL box. Short, easy, and not very scalable.

Version 2 of AMO progressed through several caching systems. The site used the Smarty template engine, so our first step was to turn on the built-in Smarty cache. That didn't give us the performance we needed, so Mike Morgan started caching page output in PEAR's Cache_Lite. I don't remember the specifics of this implementation since it was so short-lived (less than a month), but the CVS log mentions problems with "scalability in a clustered environment." Our next step was to store the same page output in memcached instead of Cache_Lite, which brought pretty satisfying results. Thus began our abuse of memcached.
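
If you've never seen the pattern, page-output caching is about as simple as it sounds. Here's a minimal sketch using the PECL Memcache extension; the hostname, key scheme, TTL, and render_page() helper are made up for illustration and aren't AMO's actual code:

    <?php
    // Look for a cached copy of the rendered page; on a miss, render it
    // (with Smarty, in AMO's case) and store the output for the next request.
    $memcache = new Memcache();
    $memcache->connect('memcache1.example.com', 11211);

    $cacheKey = 'page:' . md5($_SERVER['REQUEST_URI']);
    $html = $memcache->get($cacheKey);

    if ($html === false) {
        $html = render_page($_SERVER['REQUEST_URI']); // hypothetical render function
        $memcache->set($cacheKey, $html, 0, 300);      // keep it for five minutes
    }

    echo $html;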

In addition to memcached and expanding the number of web servers it ran on, version 2 also boasted two other significant performance improvements. The first was the ability to talk to a slave database for read-only queries, which, when combined with a load balancer, let us scale database servers horizontally. The second was installing a NetScaler in front of addons.mozilla.org, giving us the benefits of a reverse proxy cache and SSL offloading. These changes bought us precious time when hordes of Firefox 1.5 users were clamoring for add-ons. In fact, I'd say we were in pretty good shape at that point.
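
The read/write split itself is a small idea in code: writes go to the master, read-only queries go to a slave (the load balancer decides which physical slave actually answers). A rough sketch with mysqli, using placeholder hostnames and an invented query:

    <?php
    // Writes hit the master; read-only queries go to a slave, which may
    // lag slightly behind but is fine for browsing pages.
    $master = new mysqli('db-master.example.com', 'amo', 'secret', 'addons');
    $slave  = new mysqli('db-slave.example.com',  'amo', 'secret', 'addons');

    // Read-only query served from the slave.
    $result = $slave->query('SELECT id, name FROM addons WHERE status = 4');

    // Anything that modifies data has to go to the master.
    $addonId = 1234; // example id
    $stmt = $master->prepare('UPDATE addons SET downloads = downloads + 1 WHERE id = ?');
    $stmt->bind_param('i', $addonId);
    $stmt->execute();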

Fast forward to Version 3 (the current version). We've expanded the memcache servers from one to two, and instead of page output we're storing database queries and their results. We're still using a single master database but are now using two slaves for read-only queries. There are several NetScalers around the world caching pages locally[1] for closer regions. We've survived quite a while on this system, but we're starting to push the envelope again, and we're going to need to make some changes to scale for Firefox 3 while still providing a good user experience. I'll write more about our plans as they develop.
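
Caching query results instead of whole pages amounts to memoizing a SELECT, keyed on the SQL. Roughly like this (the function and key names are illustrative, not our actual code):

    <?php
    // Return the rows for $sql, hitting the database only on a cache miss.
    function cached_query(mysqli $db, Memcache $memcache, $sql, $ttl = 60) {
        $key = 'sql:' . md5($sql);
        $rows = $memcache->get($key);
        if ($rows === false) {
            $rows = array();
            $result = $db->query($sql);
            while ($row = $result->fetch_assoc()) {
                $rows[] = $row;
            }
            $memcache->set($key, $rows, 0, $ttl); // arrays are serialized automatically
        }
        return $rows;
    }

The obvious catch is invalidation: a write either has to delete the affected keys, or you live with slightly stale data until the TTL expires.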

[1] Users who are logged in to AMO don't get the local caches; their connections always go to San Jose, CA.

5 Comments

Just curious, how is this charted against the usage growth of AMO?
-- Seamus, 18 Apr 2008
Just curious, how is this charted against the usage growth of AMO?


We started using Urchin on AMO about this time last year, so I don't have numbers from before that without time-consuming log crunching. I can tell you we're averaging 125M hits per day right now, and a year ago it was 50M.
-- Wil Clouser, 18 Apr 2008
125M hits/day is approximately 1500 hits/second on average, but more interesting, what is the maximum? And are those "hits" or "page views" (i.e. does it include images and other resources)? And how many of the page views can't just be flat files served by lighttpd (i.e. all public and most of the private it seems, cause the only real difference from the public pages are the first name shown -- which could just be added by javascript from a cookie or from a json-request with a long cache time and a cache-killer in a cookie)? Just curious.

I don't know if you changed it, but last time there was talk about AMO design, it seemed like you didn't use the most optimized schema for the translations (i.e. one general translations table, instead of a translations table for each table with data, with the fields as columns instead of rows). Also, what about pre-generating translated php-files?
-- AndersH, 21 Apr 2008
125M hits/day is approximately 1500 hits/second on average, but more interesting, what is the maximum? And are those “hits” or “page views” (i.e. does it include images and other resources)?

Those are hits (including images and other resources), so it's probably not a particularly good stat since our redesign added a lot of images, etc. On the other hand, we've also offloaded update checks to another server, so those shouldn't be counted anymore. As far as the maximum, it's close to 200M right now.

The better statistic, page views, is around 5-7M/day. This has been pretty volatile lately (probably due to the redesign) but it looks fairly consistent averaged over the past few months.


And how many of the page views can’t just be flat files served by lighttpd (i.e. all public and most of the private it seems, cause the only real difference from the public pages are the first name shown — which could just be added by javascript from a cookie or from a json-request with a long cache time and a cache-killer in a cookie)? Just curious.

Plenty of it could be, but I suspect the NetScaler offers performance as good as lighttpd, if not better (<-- totally unresearched claim! :) ). One of our obvious-in-hindsight mistakes was storing add-on preview images in the database. I think it started with the idea of just storing icons in the db (max 32x32), which wouldn't be bad. Then it somehow grew to encompass add-on preview images, which can be much larger, and it's slowing our queries down. A pretty bad idea, and I think it would be a perf boost to store those on disk instead of the db.
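
If it helps to see the alternative spelled out, the idea is just to write the uploaded preview to the filesystem and keep only its path in the database. The table, column, and directory names here are invented for the example:

    <?php
    // Save the uploaded preview image to disk and store just its path,
    // so result sets stay small and the file can be served statically.
    $addonId = 1234; // example id
    $db = new mysqli('db-master.example.com', 'amo', 'secret', 'addons');

    $previewDir = '/data/previews/' . $addonId;
    if (!is_dir($previewDir)) {
        mkdir($previewDir, 0755, true);
    }
    $previewPath = $previewDir . '/' . basename($_FILES['preview']['name']);
    move_uploaded_file($_FILES['preview']['tmp_name'], $previewPath);

    $stmt = $db->prepare('INSERT INTO previews (addon_id, path) VALUES (?, ?)');
    $stmt->bind_param('is', $addonId, $previewPath);
    $stmt->execute();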

Also, regarding the javascript idea - that's actually what we did in version 2. In v3 there are more changes than just the name - the menu changes depending on whether you are a developer, admin, localizer, etc., and there are some "edit" buttons if you have permission for things. Certainly possible with js, but more to think about.


I don’t know if you changed it, but last time there was talk about AMO design, it seemed like you didn’t use the most optimized schema for the translations (i.e. one general translations table, instead of a translations table for each table with data, with the fields as columns instead of rows). Also, what about pre-generating translated php-files?

Like the images above, we haven't really altered the db schema much even though we should. It'll probably be targeted for another update (the next one being far less visual and more backend).
-- Wil Clouser, 21 Apr 2008
Also in that history was a pair and then a triplet of Squid proxies pointing at a single back-end box, followed by a triplet of Apache+mod_proxy+mod_cache proxies sitting in front of that same box, followed by 4 webservers behind an LVS load balancer with each of those proxying the /developer/ subdirectory back to that original single box, before we had the Netscaler.
-- Dave Miller, 21 Apr 2008

Post a comment

Feel free to email me with any comments, and I'll be happy to post them on the articles. This is a static site so it's not automatic.