Archive for January, 2008

Applied diagnostics, part 1.

January 30th, 2008 | Category: MySQL

In the first of what is sure to be many similar posts, today I’m going to write a little bit about an issue that was faced two weeks back.  For lack of any particular projects (my third day on the job, they were still letting me warm up and poke around free-form) I was getting a little intimate with the database that most of the queries for Bugzilla and Bonsai and various other projects run against.

A brief description of the setup: this database cluster is two databases, one Master and one Slave.  All write queries from the various sites go to the master, and all (well, many) read queries go to the slave.  It’s fairly straightforward.  For the past some weeks (I’m not entirely sure on the timeframe) the slave had been having random load spikes that was causing it to become unavailable and page the currently on call member of the team to deal with it.

This was quite annoying for whoever was on call, as it tended to happen most around 3-6AM.  For me, the first step in diagnosing any problem with a database is generally to get a good birds eye view of what is going on and try to understand where the problem is.  On a conceptual level, a database runs into load issues when it is maxing out on some particular resource that it needs to answer a query.  This may sound basic, but it’s the platform for some hardcore applied diagnostics!

There are many ways you can approach the problem of “my database is overloaded!”, but two big ones I can think of are analyzing the traffic and analyzing the system.  Both are useful, and you will probably want to do both to get a clear picture of what is going on.  You have to start somewhere though, and in this particular case, I started with analyzing the traffic itself.

First, I took a look to see what the machine was busy doing.  SHOW FULL PROCESSLIST run in MySQL will give you a list of all active connections and any currently running queries.  Note that a lot of the time, you may not see anything!  On most healthy servers, you will probably see none or only a few queries in the output.  In a server with problems, you might see a lot of queries in the output.  Or just a few queries that are taking a long time!  If you do see queries, look at them.  Understand what they’re doing.  Are you seeing the same query over and over?  Or is it a bunch of different queries?  What stare they in?  Is the time field climbing and climbing?

In this case, I saw one particular query that was being executed on the slave at the rate of several per second.  My first expectation was that this query was bad - perhaps it wasn’t hitting an index?  Here’s the query so you can follow along:

SELECT type, UNIX_TIMESTAMP(ci_when), people.who, repositories.repository, dirs.dir, files.file, revision, stickytag, branches.branch, addedlines, removedlines, descs.description
FROM checkins,people,repositories,dirs,files,branches,descs
WHERE people.id=whoid AND repositories.id=repositoryid AND dirs.id=dirid AND files.id=fileid AND branches.id=branchid AND descs.id=descid AND repositories.repository = ‘/l10n’ AND ci_when >= ‘2008/01/02 20:22:03′

(Apologies for this not being very readable… I can’t seem to figure out how to work a PRE tag in this blogging software…)

My gut instinct on seeing this query is: ew, lots of joins.  But overall, it’s not too scary.  I started analyzing the tables involved, taking full advantage of MySQL’s EXPLAIN command, and discovered that the query is about as optimal as it’s going to get.  Every join is on an index, and everything looked good.  With that rudimentary check done, and not seeing anything particularly wrong with the query, I ran it to look at the performance.

Woah, gravy!  The query takes some 6+ seconds to run!  That’s generally not a good thing, especially when you are seeing many of these queries being executed every second.  Debugging began in earnest at this point, since my gut instinct here was that something in the database wasn’t tuned.  I mean, a relatively simple query taking several seconds to execute, doesn’t sound right!  I next dove into the server itself, to determine what was going on.

Presumably there’s a bottleneck somewhere.  Something is making these queries take six seconds.  Well, as you might expect, your best friends for locating bottlenecks are going to be the tools you use every day.  A few moments watching top showed that the CPU was maxing out when these queries were hitting the server pretty hard.  Some time watching vmstat revealed almost no disk activity.  Well, that rules out I/O throughput issues.  At this point it’s fairly easy to decide to continue investigating into MySQL itself, the server doesn’t seem to be having grievous problems.

Diving into SHOW STATUS gave me two particular areas of question at first blush, but neither of which seemed likely to be the culprit.  The number of opened tables (Opened_tables) was quite high - in the millions - but wasn’t growing at a rate that was overly worrying.  Yes, the query in question has a lot of joins, but even maxing out on concurrent threads we were not hitting the table limit too hard.  I did raise the limit on open tables, but performance was negligibly impacted, if at all.

The second thing I noticed was that the query cache was not being used that well.  When the number of hits is approximately equal to the number of inserts, and the number of pruned items is similarly high, then it’s not doing you much good.  Of course, since the queries are all varying by a few seconds each time (look back at the ci_when comparison in the WHERE clause), there’s really not much gain in having the query cache enabled for this particular query.  I bumped the cache size a little to reduce the number of low memory prunes, but again performance was unchanged.

At this point, something else caught my eye - the number of Key_read_requests that MySQL was reporting.  It was huge!  Compared to the number of queries being run, the value was very, very high.  This is about when it dawned on me - this query really is scanning a lot of data!  I started investigating, and sure enough, every time this query runs it has to do tens to hundreds of thousands of key reads.  No amount of index caching is going to help this query, it just has so much data to dig through that it’s literally taking that long to push the bytes around.  A little crestfallen, I backtracked slightly to try to find another approach.

One thing in particular bothered me about this query.  The conditional on the ci_when column was exactly two weeks ago, and every time the query ran it was creeping forward by only a second or two each time.  If it was possible to make this query use more stable values (say, even times every 15 minutes or so) then we could take advantage of the query cache, and instantly improve server overall performance if not the performance of the individual queries.  Some asking around revealed that this query belonged to the Tinderbox application, and some more poking showed that nobody needed it to be exactly two weeks ago.

With that knowledge, and the help of Dave Miller, I submitted a one line patch to the application to change the times generated.  It was something like this:

$time -= $time % 3600;

Which basically flattens the time down onto even hours.  The SQL is then generated and fed to the database.  Once this went live, the load on the server plummeted - here’s the CPU graph:

DB CPU Graph

Sometimes the problem is the database, perhaps your disks are slow, you don’t have enough memory, the tables are not properly indexed for the queries you’re doing, etc.  Sometimes, however, the problems are at another end, and a simple change to the application can negate an incredible amount of load by making the two play together better.

At any rate, that’s a lot of words to say: “this query wasn’t taking advantage of the query cache, so I fixed it.”  I aim these entries towards the new diagnostician, someone who hasn’t gone through this, and is trying to learn about how to approach a problem and not just how to solve one particular problem.  The process is more interesting than the solution!  Enjoy.

No comments

An introduction, of sorts.

January 22nd, 2008 | Category: Personal

I tend towards the verbose, and if you stick around to this blog (and I manage to write in it) then you will probably become familiar with this.  (This is the end of the only nod to the apologetic I will ever make for my writing style.)

Questions are welcome.  If at any time you are curious about how something works or why something was done in a certain way, ask.  I will do my best to answer.  If you want to chat, you can find me on the Mozilla IRC network (irc.mozilla.org) as xb95.

For introduction: my name is Mark Smith.  I am a computer handyman of sorts, with previous experience doing things such as Perl programming for Danga Interactive and Six Apart, infrastructure design and construction from silicon up to app (again for Six Apart), .NET and Python programming for CCP Games, and plenty of my own projects on the side.  I generally consider myself more of an engineer than an admin, but my experience lends me many skills that find themselves well applied to Keeping Things Running and Making Things Run Better.

My interests are solidly on scale.  How do you take something and make it run for millions of users?  Working at Danga and later Six Apart was a really great environment for this, as I’d been with the LiveJournal project since 2001-ish (starting as a volunteer, moving to become an employee in 2004).  Working with the great people at these companies really helped me to understand many of the problems around Scale and some of the solutions to them.  The year I spent in Iceland working for CCP was good for showing me more solutions.  Their vastly different approach was fascinating, and I think the time I spent there was well spent.

I started here at the Mozilla Corporation (MoCo!) last week on Monday as a member of the IT team.  Responsibilities will include anything and everything, with a specific ear towards database (MySQL) related work, issues of scale, and wherever else I can be of use.  I’ve already done some work on one of the databases, and then ended up submitting a patch for Tinderbox.  This promises to be a very interesting job and a great learning experience.  Plus, being able to work on something that affects so many people in such a meaningful way … wow.

That’s about all of the introduction that is useful, I think.  Again, please find me on IRC should you need me (I think I’m on all of the major networks; Perl, Freenode, Mozilla, …) and always as xb95.

An actual informational post will be forthcoming sometime later today or tomorrow.  So I don’t flood you … yet.

No comments