Archive for February, 2008

Hungry Hungry Add-ons Manager

Friday, February 15th, 2008

For about 48 hours the AMO API was thrashing because of the popularity (and hunger!) of the new add-ons manager in Firefox 3 Beta 3.  The dust has settled and the servers are humming happily along, so now is a good time to blog about what happened and how we’ll handle future releases successfully.

Stop.  Take a deep breath.  Alright, here we go.

What happened this week?

With most major bugs ironed out and the API functional, we got a rude awakening this week: we found out exactly how much traffic the improved Add-ons Manager can generate.  It’s a nice problem to have, and we’re happy it’s been well received.

Wednesday, around peak time, the API started clobbering our databases:

db03 load average

Shortly after we entered our peak traffic window, we had to turn off the API to keep the normal AMO working.  Diagnosis found that:

  • Load was not reaching the read-only slave; it was concentrated on the master read/write database (mrdb03)
  • The memcached hit rate was down to 60% from the usual 90%
  • When our databases hit peak CPU, the app cluster would tumble because of the requests piling up
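To see why that drop in hit rate hurts so much, here is a rough back-of-the-envelope sketch. The request volume is a made-up number for illustration, not a figure from the incident; only the 60% and 90% hit rates come from the diagnosis above.

```python
def db_queries(requests: int, hit_rate: float) -> int:
    """Number of requests that miss the cache and fall through to the database."""
    return round(requests * (1.0 - hit_rate))


# Hypothetical request volume, chosen only to make the arithmetic easy to read.
REQUESTS = 1_000_000

usual = db_queries(REQUESTS, 0.90)     # 10% of requests miss the cache
degraded = db_queries(REQUESTS, 0.60)  # 40% of requests miss the cache

print(usual, degraded, degraded / usual)
```

Going from 90% to 60% hits means the miss rate goes from 10% to 40%, so the databases see roughly four times the query traffic for the same number of requests.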

How was it fixed?

Wednesday, IT and Webdev spent quite a bit of time getting the API back up.  Starting with the three points above, we:

  • Off-loaded read-only traffic to DB slaves
  • Investigated optimizations for the API
  • Looked at cache rules and cache policies for both memcache and the hardware load balancer

However, Thursday didn’t fare any better for the cluster.  This time the slaves started to melt near peak time — forcing us to once again temporarily disable the API.  Under-utilizing memcache was the main issue.  Cache headers were fine, slave was utilized, app nodes were fine — just too many damn queries flying at our database servers! :)

Load got high, but we disabled the API before it became critical

So on Thursday we continued looking into what was going on and tried to figure out why our cache hit rate was so low (60% instead of 90%).  Digging through AMO, we found that CACHE_PAGES_FOR, which sets the expiry time on memcache records when calling Memcache::set(), was set to 60 seconds.  We increased this to 7200 seconds to cache database traffic aggressively, and were collectively off for Valentine’s dinner.
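The effect of that expiry time can be sketched with a small read-through cache. This is an illustration only: it uses an in-memory dictionary as a stand-in for memcache, it is written in Python rather than AMO's PHP, and the TTLCache class and get_or_set method are hypothetical names, not AMO code. Only CACHE_PAGES_FOR and the 60 and 7200 second values come from the post.

```python
import time
from typing import Any, Callable, Dict, Tuple

CACHE_PAGES_FOR = 7200  # seconds; the value the post raised this to (was 60)


class TTLCache:
    """In-memory stand-in for memcache with a per-record expiry time."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_set(self, key: str, loader: Callable[[], Any],
                   ttl: float = CACHE_PAGES_FOR) -> Any:
        """Return the cached value, or run loader() and cache its result."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Missing or expired: this is the path that reaches the database.
        self.misses += 1
        value = loader()
        self._store[key] = (now + ttl, value)
        return value
```

With a 60 second expiry, a record requested again two minutes later has already expired and costs another database query; with 7200 seconds the same request is served from cache. The trade-off is that cached data can be up to two hours stale.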

The next day, Memcache was our valentine.

mrdb03 survived Friday without a blip

db04 load was higher than the read/write master

The combination of our efforts worked:

  • Overall query traffic was reduced dramatically
  • The traffic that did make it past memcache was well distributed across 2 read-only slaves (db04, db04-2)
  • App code was optimized to reduce overhead and unnecessary database traffic, among other things by placing a hard limit on the number of search results returned by the API
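The hard limit in the last point can be as simple as clamping whatever the client asks for. The sketch below is a guess at the general shape, not the actual AMO change: the cap of 10, the clamp_limit name and its defaults are all assumptions made for illustration.

```python
MAX_RESULTS = 10  # hypothetical cap; the real limit is not stated in the post


def clamp_limit(requested, default: int = MAX_RESULTS,
                maximum: int = MAX_RESULTS) -> int:
    """Turn a client-supplied result count into a safe value for the query."""
    try:
        n = int(requested)
    except (TypeError, ValueError):
        # Missing or non-numeric input falls back to the default.
        return default
    if n < 1:
        return default
    # Never return more than the server-side maximum, whatever was requested.
    return min(n, maximum)


print(clamp_limit("500"), clamp_limit("3"), clamp_limit(None))
```

The point is that the server, not the client, decides the upper bound on how much work a single search request can cause.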

How will we scale?

So these growing pains will help us move forward.  Here is our plan of attack for scaling this beast for the Firefox 3 onslaught:

  • Move the API (services.addons.mozilla.org) to a separate docroot with its own read-only slaves and more aggressive caching policies that are separate from the main AMO
  • Optimize client code to reduce the number of requests needed to retrieve data, and employ local caching for redundant content or content that rarely changes
  • Offload even more traffic onto read-only slaves
  • Upgrade CakePHP to the latest 1.1.x stable branch, which optimizes auto-generated queries quite a bit (thanks to clouserw for researching this)
  • Refactor how we pull localized strings from our database
  • Optimize our search performance on AMO and the API
  • Switch default CakePHP data source to read-only slaves
  • Find ways to use memcache at higher levels (caching larger objects instead of at just query level)
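Several of these items come down to sending reads to the slaves and writes to the master. A minimal sketch of that routing, assuming simple round-robin across slaves, might look like the following. The Router class is hypothetical and is not how CakePHP data sources work; the host names are the ones mentioned earlier in this post.

```python
import itertools
from typing import Iterable


class Router:
    """Send SELECT statements to read-only slaves, everything else to the master."""

    def __init__(self, master: str, slaves: Iterable[str]) -> None:
        self.master = master
        slave_list = list(slaves)
        # Round-robin over the slaves; fall back to the master if there are none.
        self._slaves = itertools.cycle(slave_list) if slave_list else None

    def pick(self, sql: str) -> str:
        """Return the host that should run this statement."""
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self._slaves is not None:
            return next(self._slaves)
        return self.master


router = Router("mrdb03", ["db04", "db04-2"])
print(router.pick("SELECT id FROM addons"))
print(router.pick("UPDATE addons SET status = 4"))
```

A real setup also has to deal with replication lag: a read that must see a write made a moment ago may still need to go to the master.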

Once again it was a great team effort to get things running smoothly.  Thanks to IT for helping us troubleshoot this.  We’ll continue to build on this experience to ensure better reliability in future releases.

Looking back at the last three days, the Firefox 3 Beta 3 release was a success in more ways than one.  It showed everyone what the web can do, but it also helped us wrap our heads around the API and how much traffic it generates.  All of this will make for a better Firefox 3.0 release.

AMO 3.2 Preview

Friday, February 15th, 2008

AMO has a new look and we need your help to polish it off. Please tell us what you think!

Here are some screenshots:

  • Rec vs. Exp
  • Experimental close-up
  • Dev Stats
  • Featured add-on
  • Reviews
  • App Chooser
  • Developer CP Nav

Aside from a new look, here are a few highlights in AMO 3.2:

What we’d like to know:

  • Does the reskin help you find what you need quicker?
  • Does the absence of “types” confuse things? (plugins, search plugins, themes, extensions)
  • What should we do to make things better/easier to use for you?

Keep in mind that we are still ironing out some wrinkles. For more information:

Thanks, and looking forward to hearing from everyone.

AMO Update r10238

Monday, February 11th, 2008

Yes, we are over 10,000 commits in our Subversion repository. This latest update for AMO trunk includes the following fixes, among others:

  • Update sk locale from bug 367271
  • Improving install experience for non-browser apps (bug 401272, r=clouserw)
  • adding GUID to categories RSS feed to enable feed readers to distinguish fresh items from old ones (bug 411834)
  • merging new strings from Thunderbird install experience fix (r9576, bug 401272) into all other locales
  • fixing “all versions” RSS feed, bug 392183
  • Fix bug 394590
  • Fix bug 378782
  • Total download counting in maintenance script; bug 409341; r=morgamic
  • Adding pt_PT locale from bug 391197
  • Update pt-BR locale from bug 380221
  • Update zh-CN locale from bug 407472
  • fixing data sanitization for UTF-8 characters: bug 412580, r=laura
  • Adding support for application wildcards in categories; bug 408525; r=clouserw
  • adding test for UTF-8 sanitization (bug 412580)
  • fixing pagination sanitization, bug 412580, r=fligtar
  • Checking in reviewcount column and maint script from bug 408680.
  • Firefox 3 additem notices; bug 406898; r=morgamic
  • fix bug 415085
  • Fixing sanitization of discussion dates on addons detail page (bug 414541)
  • Unflag sr-flagged add-on; bug 371214; r=fwenzel
  • minor change to bin database class; bug 409341; r=morgamic
  • Checking in review count column stuff from 408680. r=fwenzel.
  • fixing memcaching for select queries that start with whitespace (bug 416403, r=morgamic)
  • Update fr locale from bug 366239
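As an aside on the memcaching fix for select queries that start with whitespace (bug 416403): the sketch below shows one plausible reading of that kind of bug, not the actual patch. Both function names are hypothetical.

```python
def is_select_naive(sql: str) -> bool:
    """Checks the very first characters, so leading whitespace defeats it."""
    return sql[:6].lower() == "select"


def is_select(sql: str) -> bool:
    """Strips leading whitespace before checking the statement type."""
    return sql.lstrip()[:6].lower() == "select"


query = "\n    SELECT id FROM addons"
print(is_select_naive(query), is_select(query))
```

If a caching layer only caches statements it recognizes as selects, a check like the naive one would silently skip every query written with a leading newline or indent, and all of those would go straight to the database.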

I want to thank everyone on the AMO team for their hard work, especially the localizers who have worked really hard to port AMO to their native languages. 2008 is already turning out to be a great year, so let’s keep it up!