Archive for the 'Socorro' Category

Socorro Database Migration

Monday, September 22nd, 2008

In our last post we described one of our Q4 goals as:

work with PostgreSQL to update and optimize our hardware and software configurations

Aravind has gone ahead and migrated the Socorro database to its new hardware and we’ve upgraded PostgreSQL to 8.3 in order to take advantage of performance optimizations that went into the most recent release. We’ve also updated the configuration so xlogs and logs are written to a different data store than the main database.

Unfortunately, this required a full migration over the weekend and some downtime. This morning we are running a full vacuum, which will also take a considerable amount of time to finish. There is no ETA, but we will post an update when it is completed.

Update: We have postponed the full vacuum/analyze until a maintenance window later this week. It was taking far too long to complete, and we can’t afford to lock tables for that long.

The Crashing Edge

Wednesday, September 17th, 2008

This will be the first in a weekly series of posts about the crash reporting system, much like the other *edge blogs out there.

In short, we’ve rewritten most of the system to accommodate throughput that is more than 10 times the projected traffic. This is not because Firefox 3 is crashing more. We are seeing increased traffic because:

  • Client-side throttling has not been effective
  • Overall number of users has increased
  • Mac Intel builds now submit crashes properly

What have we been up to?

First, two bug lists:

Here are some issues that have been addressed in the last 4-6 weeks (not all of which had bugs):

  • server-side throttling now possible and configurable
  • improved separation between monitor and processor jobs
  • updated collector now saves unprocessed crashes grouped by hour
  • mysterious death of main monitor thread resolved
  • collector, monitor, processor now use a common config
  • SVN structure simplified, no longer required to use /dist or /scripts/dist-* scripts to copy things inside SVN
  • Python/Pylons reporter replaced with PHP/Kohana reporter
  • web application now clustered with memcache support
  • removal of SQLAlchemy from all tiers of the system
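Two of the items above, configurable server-side throttling and storing unprocessed crashes grouped by hour, can be illustrated with a minimal sketch. This is not Socorro's actual code: the function names, the `accept_fraction` setting, the directory layout and the `.dump` suffix are all assumptions made for illustration.

```python
import posixpath
import random
from datetime import datetime


def make_throttle(accept_fraction, rng=None):
    """Build a predicate that keeps roughly `accept_fraction` of submissions.

    `accept_fraction` is a hypothetical config value between 0.0 and 1.0.
    """
    if not 0.0 <= accept_fraction <= 1.0:
        raise ValueError("accept_fraction must be between 0 and 1")
    rng = rng or random.Random()

    def should_accept(report):
        # The report content is ignored in this sketch; a real rule set
        # might also look at fields such as product or version.
        return rng.random() < accept_fraction

    return should_accept


def hourly_bucket_path(root, received_at, crash_id):
    """Return a storage path that groups crashes by the hour they arrived.

    The year/month/day/hour layout is an assumed convention, not the
    collector's real one.
    """
    bucket = received_at.strftime("%Y/%m/%d/%H")
    return posixpath.join(root, bucket, crash_id + ".dump")
```

With `accept_fraction` set to 0.1, roughly one submission in ten would be kept, and everything that arrived during the same hour would land in the same directory.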

Known Issues

Known issues are mostly listed in the 0.6 buglist. If you know of a problem that isn’t already filed, please file it to help us keep track of things. Pressing issues with the reporter are:

  • database bottleneck
  • inefficient queries
  • corrupted summary tables for topcrashers

What’s Next?

Our goals as we move into Q4 2008 are:

  • work with PostgreSQL to update and optimize our hardware and software configurations
  • move expensive aggregate queries to cron jobs, summary tables
  • move focus from performance and maintenance to feature development (e.g. new reports, graphs)
  • fully document the system to make sure it is easy to contribute or learn about how this works
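The second goal above, moving expensive aggregate queries into cron jobs that fill summary tables, could look roughly like the sketch below. It uses an in-memory SQLite database so that it runs anywhere; the production system uses PostgreSQL, and the table names, column names and functions here are hypothetical.

```python
import sqlite3


def create_schema(conn):
    """Create a raw reports table and a precomputed summary table."""
    conn.execute("CREATE TABLE reports (id INTEGER PRIMARY KEY, signature TEXT)")
    conn.execute(
        "CREATE TABLE topcrashers_summary (signature TEXT PRIMARY KEY, crash_count INTEGER)"
    )


def refresh_topcrashers(conn):
    """Rebuild the summary table; this is the part a cron job would run."""
    with conn:  # commit on success, roll back on error
        conn.execute("DELETE FROM topcrashers_summary")
        conn.execute(
            "INSERT INTO topcrashers_summary (signature, crash_count) "
            "SELECT signature, COUNT(*) FROM reports GROUP BY signature"
        )


def top_crashers(conn, limit=10):
    """Cheap read used by the web tier: no aggregation at request time."""
    cursor = conn.execute(
        "SELECT signature, crash_count FROM topcrashers_summary "
        "ORDER BY crash_count DESC, signature LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()
```

The design choice is that page views only read the small summary table, while the expensive `GROUP BY` over the full reports table runs on a schedule.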

In the next week, we will be focusing on documentation and making the reporter usable.  As we resolve our scaling issues, if anybody is interested in revamping some of the UI, you are welcome to jump in.  Find us in #breakpad on IRC.

Socorro Processor Updates

Monday, April 21st, 2008

Last Friday we pushed some important updates to Socorro:

  • Bug 426940 - Reduce or eliminate delay in collector to monitor hand-off
  • Bug 426940 - Fix processor handling of error conditions
  • Bug 428300 - status page too slow

This means:

  • When you submit a crash report you won’t have to wait longer than 30-60 seconds to view your report
  • The processor now has better handling of minidump_stackwalk fatal errors
  • There is an improved server status page where you can view stats on the current queue

Thanks to Lars and Aravind for getting this out the door.  The next couple of weeks will be spent improving reporter performance and closing out milestone 0.5 bugs.

Crash Analysis: now in Open Source flavor

Monday, April 21st, 2008

History can tell you that companies don’t disclose crashes in their software. They keep a pretty close eye on what crashes and bugs are disclosed.

Mozilla doesn’t.

Rather than being the exception, openness is the rule, and that is one of the coolest things about being a part of this. My job, my everyday tasks, they aren’t secret, and they are not to drive profits. They are to drive the web.

[Socorro screenshot]

In that spirit, our crash reporting system (Socorro) is available to whoever wants to view it. Aside from user-bound statistics, crash information is available in full and anybody in the community can learn about where in the code their client crashed. They can also help provide hints or comments about what they were doing at the time they crashed.

This opens the door for the community to learn valuable things about their software and how they use it:

  • What crashes the most? What crashes the most over time? What is the breakdown across branches, versions and products?
  • Where did we crash? Crash signatures provide a head start for locating the cause of a crash. From there, full stack traces are available to analyze the call stack and find the source of the actual crash.
  • What was installed? What modules were installed for a given crash? Soon we will also be able to understand what extensions were installed so we can understand the correlation between core client crashes and crashes caused by faulty extensions. The end result is a closer relationship with the extension developer community and better quality in our add-ons space.
  • How are we doing? Overall, the jackpot question is: are we crashing more or less? How are we doing with this beta, alpha or rc1? Are we regressing in real-life situations despite positive automated testing results?

All of this was possible because of a collaborative effort between quite a few parties:

  • Mark Mentovai and the breakpad team, for writing a great client and processor under a flexible open source license that is easy to integrate
  • Ted Mielczarek for his work on the client, processor and integrating the project into Firefox 3
  • Benjamin Smedberg and Robert Sayre for their work in getting the initial versions of the breakpad server off the ground

Where do we go from here?

Of the many projects we have in 2008, this is one of the most exciting. It’s an opportunity to open up information that hasn’t historically been available to the masses, and hack on a great tool for improving the quality of all Mozilla projects.

Socorro Updates

Tuesday, July 31st, 2007

Socorro has had a few improvements over the last week. bsmedberg, ispiked and luser worked hard to bring you:

  • Graphs to show crash population over time
  • Crash reports by operating system
  • Crash reports by build and operating system

There are plenty of fun graphs that show trends over time. For example, see this graph that shows a common crash in gfxFont::Draw:

[Graph: gfxFont::Draw crashes]

Up next, we’ll be finishing up milestone 2 items and working toward milestone 3. For more about the project and what we’re working on:

If you have any comments please find us in mozilla.dev.quality or #breakpad@irc.mozilla.org on IRC!