Jun 4

Databases and Web Applications

Category: MySQL

The Internet is full of stuff.  When you get right down to it, one of the amazing things about the low cost of entry to writing software is that we can end up with such a diverse ecosystem.  If you’ve ever tried to pick software to power a wiki, blog, forum, webmail, or any of the zillions of other things you can do on the web, you will quickly realize that there are so, so many options.  Frankly it’s overwhelming.

Mozilla is no exception to the rule, either.  With the number of sites that we run, we’ve ended up using a rather large number of different applications.  Two or three wiki platforms, WordPress, Drupal, and many custom applications that have been written over the years.  This is all well and good, but we run into a real issue when it comes to scaling these applications at the database layer.  Almost every application that we run suffers from the same problem: it assumes that one database will handle all of its needs.

In the real world that we live in, especially at the scale of Mozilla, that’s sometimes not the case.  Whether it’s because the databases don’t have enough RAM or fast enough disks, or because there are too many applications on one machine, the sad truth is that capacity limits will be hit at some point.  Sure, you can generally buy a new machine with faster disks and more RAM, but after doing that a few times, what do you do with the older hardware?  What about when there are no faster disks, no faster processors?  (This is the age old question about scaling up versus scaling out, yes, yes it is.)

Let us first have a little bit of discussion to establish some basic database concepts.  These might not be very useful for some of you who are familiar with the subject matter, but some of you may not be used to think about how you’re actually using them.  Humor me for a few minutes while I build a common ground for us to work from here.  (Or skip ahead!  Your call.)

In general, you can break up database access into two activities: getting data out (SELECT) and putting data in (INSERT, UPDATE, DELETE, REPLACE, etc).  Or, in more common terms: reading and writing.  After all, that’s what a database does, right?  It stores data.  Someone puts the data in, someone pulls the data out.  That’s about as simple as it gets, really.

(The astute reader will probably note that I’m avoiding more complex topics like transactions, replication lag, and other things as I meander through the rest of this post.  I may get into those in later articles, but for now, I’m not too concerned about that in the examples we’re discussing.)

Now let me describe the typical database setup here at Mozilla.  We have four or five database clusters.  (Note that the term ‘cluster’ simply means a logical collection of database machines, we’re not actually talking about MySQL Cluster.)  In most of these clusters there are two machines: the master and the slave.  The slave is a read-only mirror of the master, but is constantly replicating changes from the master, ensuring that the copy of the data on the slave is in sync with the master at all times.  This is pretty straightforward when you think about it.  All of that “putting data in” (write traffic) must happen on the master, but you can get data out of either machine.  (They have the same data!  You can read from either one!)

It doesn’t always have to be setup like this, however: it’s easy to have more than one slave!  In fact, you can have as many slaves as you want and still only have one master.  If you think about it, this lets you scale out - but only on your read traffic!  Since you have more slaves, you can handle more of the “getting data out” part of things, but you don’t get any extra ability to put data in since everybody still has to put all of the data in.  (No, I’m not going to discuss scaling the write traffic right now.)

If you are having some trouble picturing this or aren’t entirely sure what it looks like, here’s a bit of a thought experiment.  Let’s pretend the database is like a book.  (Which it is, if you think about it.  Books contain information, so do databases.  Books have an index in the back, databases have indexes.)

The first situation you have is that of the “only one database” setup.  There, you have one book in your hands.  Now let’s say your friends want to read from the book.  Obviously only so many people can be crowded around the book at the same time before it gets too crowded.  You don’t have to worry about who’s writing to it though, because you can always see the current/correct/most up to date data.  Everyone is on the same book!

Now let’s say you have too many people who want to see what’s in the book.  What do you do?  Either you tell everybody to take turns (some applications can do this - batch style applications - but it’s not useful for web/interactive applications so we can’t use this answer) or you can make a copy of the book!  Now you have two books.  Well, the second book is a copy of the first book at a certain point in time, so you must have somebody copying all of the changes made in the first book over to the second book.  But now you can have twice as many people reading the books, because you have twice as many books!

If you add a third book, the capacity grows as you’d expect.  50% more room for eyeballs than two books.  The fourth book nets you 33% more read capacity than just three books.  But note that you never get more write capacity!  Every time someone changes the “master” book, that change must still be written to all of the “slave” books.  (Again, not going into write scaling here…)

Anyway, let’s move on.  Typically, web applications are very read heavy.  Blogs, wikis, forums, and just about every other application that we run is skewed towards pulling data out of a database and presenting it for the users to view.  Lots of eyeballs, so we need lots of books!  This is also the pattern of use that works really well in the master-slave environment we just talked about.  Since there are so many reads going on, if you’re running out of capacity on your cluster, you can just add another book/slave.

This is how the Addons site (for example) works.  It sends all of its read requests (lots of them) randomly to one of the database slaves allocated for it.  And since we expect the traffic for this application to go up with Firefox 3, we can just add another database.  (I’m actually doing that this week, pre-emptively building a new database for it!)  This allows us here in IT to scale the application without involving the web development team, which is a win for everybody.  Hurrah!

But now, to bring this back to the main reason I wrote this post: most of the software we use - the blogs, wikis, etc - don’t know how to use slaves.  They only use a single database, which leaves us with little ability to scale the applications.  What if the Spread Firefox site suddenly tripled in traffic?  Well, the existing database could probably handle it, sure.  What about an order of magnitude increase?  We’d have to move it off of the database it’s on and find some beefier digs!  How about another order of magnitude?  I don’t think we could take it, because we’re going to be limited to one database no matter how hard we try.

(The diligent reader might note that I am being somewhat loose with details… there are other ways [such as caching] to scale, and we do use them.  If SFX did end up growing like that, we’d find a way to deal with it!  But that’s not the point of this particular exercise.  I want more options!)

Anyway.  For the web developers out there, this is something to consider as you write your applications.  Make your database layer scale!  You might think this is a hard thing to program for, but it’s really not!  You simply need to be able to get two database handles: one to the master, one to (one of) the slaves.  Then you decide which one to use in each situation that you come up against and use it!

* Rendering out a page full of comments for someone to view?  Use the slave!  You’re not writing anything, so this is easily done.

* Letting somebody post to their blog?  Use the master!  You need to send all of that write traffic on up.

* Doing a full text search or using INSTR on a table to try to find something?  Use those slaves!  That’s what they’re there for!

* Updating a hit count?  Write that stuff to the master!

I think by this point you’re getting the idea.  I really hope this blog post wasn’t too basic and might be useful to somebody out there.  Yes, there are definitely some areas I’ve glossed over.  But hey, it feels like enough rambling for today.  :-)

5 Comments so far

  1. Colin June 4th, 2008 10:37 pm

    Interestingly, this is something I’ve just been considering at work with Wordpress and SimpleMachines Forums as all the traffic from that hits our Master DB, with no way to offload it to the slave.

    However, for someone who doesn’t necessarily have the time to rewrite the entire application, it’s not always easy to make the changes required to get it working :(

  2. LpSolit June 5th, 2008 9:58 am

    Bugzilla already knows how to use DB slaves for years. \o/

  3. Mark Smith June 5th, 2008 9:59 am

    Colin - yeah! That’s one thing I am really hoping to use MySQL Proxy, Proximo, dormando’s Proxy for MySQL, etc for. But each of them seems to have its own set of particular issues.

    LpSolit - the shadow database stuff in 3.0.x? That only sends search queries to the slave, it doesn’t actually send page view traffic. It should send all of that to the slaves! :)

  4. Axel Hecht June 5th, 2008 3:18 pm

    When surfing around on django stuff, I came across http://softwaremaniacs.org/soft/mysql_cluster/en/. Did anyone have a look at that? Does it do the good job well?

    Or at least better than nothing or writing sql by hand? Which is something I kinda try not to.

  5. Dean J. Garrett June 6th, 2008 5:20 pm

    I used a web-hosted database service for a client project last year and it worked out very well because it didn’t require any database software to buy or install. All the interfaces were simple HTML. Of course my client’s requirements were pretty simple, to publish a growing list of coffeehouses. But if that’s all you need, a software-as-a-service offering is pretty good. Check out: http://www.HostedDatabase.com

Leave a comment