Archive for the 'Breakpad' Category

Socorro’s File System Storage

Friday, October 10th, 2008

As the scope and depth of the Socorro/Breakpad project have evolved over the last nine months, the most stable requirement of the project has been a file system as the initial server-side storage for submitted crash dumps. The file system gets used as an ad hoc hierarchical database, but it isn’t optimized for the type of lookups that we need to do. Unfortunately, the original implementation had no index by name, which left us searching for files (sort of like using find over NFS with 9 million entries). We patched and then patched the patches as the magnitude of our scaling problem erupted in front of us. Now that the emergencies are over and server-side throttling is in place, we can revisit and re-engineer a proper interface for the file system.

Indexing by a single key is simple in a file system. The path is the key to a file. Indexing by a second key in a file system requires links. We chose to use symbolic links as our pointers to indicate relationships between files. There’s an interesting thing about symbolic links: if the path they point to isn’t too long, the path is stored right with the inode. Reading the target path from a symbolic link is fast. We created two radix sort hierarchies, one for name and one for date. We store our uuid-named data files on the name branch. Then we have symbolic links span the gap between the branches to indicate the time interval in which a file was submitted.
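As a rough illustration of the two-branch idea, here is a minimal Python sketch. It is not the actual JsonDumpStorage code: the directory names ("name", "date"), the two-level radix depth, the five-minute time slots, the ".json" suffix and the function names are all assumptions made for the example.

```python
import os
from datetime import datetime


def radix_path(uuid, depth=2, width=2):
    """Turn the leading characters of a uuid into nested directory names."""
    return os.path.join(*[uuid[i * width:(i + 1) * width] for i in range(depth)])


def date_path(when, minutes_per_slot=5):
    """Build a date-branch path down to a fixed-width time slot."""
    slot = (when.minute // minutes_per_slot) * minutes_per_slot
    return os.path.join(when.strftime("%Y"), when.strftime("%m"),
                        when.strftime("%d"), when.strftime("%H"), "%02d" % slot)


def store_dump(root, uuid, payload, when):
    """Write the data file on the name branch and link to it from the date branch."""
    name_dir = os.path.join(root, "name", radix_path(uuid))
    date_dir = os.path.join(root, "date", date_path(when))
    os.makedirs(name_dir, exist_ok=True)
    os.makedirs(date_dir, exist_ok=True)

    data_file = os.path.join(name_dir, uuid + ".json")
    with open(data_file, "w") as f:
        f.write(payload)

    # The link is named after the uuid and points at the directory holding
    # the data file. A short relative target keeps the stored path small.
    link = os.path.join(date_dir, uuid)
    os.symlink(os.path.relpath(name_dir, date_dir), link)
    return data_file, link
```

Calling store_dump(root, uuid, payload, datetime(2008, 10, 10, 14, 37)) under these assumptions puts the data file under name/ab/cd/ (for a uuid starting with "abcd") and a symbolic link under date/2008/10/10/14/35/.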

[Figure: the new Socorro file system layout]

Our applications use these two hierarchies in the same order every time. On first reading the file system, we’re looking only for dumps that we haven’t seen before. We walk the hierarchy on the date side and we know that every symbolic link we encounter represents a new crash dump. We pass the uuid on to the next application for processing and then we remove the symbolic link. That guarantees that any link we find on the date side is always new. From then on, we will only need to look up that uuid by name. The remaining hierarchy of the name branch is optimized for that.
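The "walk the date side, hand off, remove the link" loop might look something like the following Python sketch. The function name and the callback parameter are hypothetical, and it assumes each link on the date branch is named after the uuid it represents.

```python
import os


def process_new_dumps(date_root, handle_uuid):
    """Hand every uuid linked under date_root to handle_uuid, then drop the link.

    Returns the number of new dumps found. handle_uuid stands in for
    whatever passes the uuid on to the next application.
    """
    count = 0
    for dirpath, dirnames, filenames in os.walk(date_root):
        # os.walk lists links to directories in dirnames and other links
        # in filenames, so both lists are checked.
        for entry in list(dirnames) + filenames:
            path = os.path.join(dirpath, entry)
            if not os.path.islink(path):
                continue
            handle_uuid(entry)
            # Removing the link is what marks this dump as "seen".
            os.unlink(path)
            if entry in dirnames:
                dirnames.remove(entry)  # nothing left to descend into
            count += 1
    return count
```

Because the link is removed only after the hand-off, a crash partway through leaves the remaining links in place to be picked up on the next walk.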

We also use this hierarchy for long term storage of crash dumps using a different file system root. Using the name branch, we can look up files by uuid very quickly even if the number of stored files is huge. When it’s time to retire old data that is no longer needed, we can walk the date branch to find and delete only items older than a threshold.
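Retiring old data from the long-term store could be sketched in Python roughly as below. This is only an illustration under assumed conventions: a date branch laid out as date/YYYY/MM/DD/..., links named after the uuid and pointing at the name-branch directory that holds a uuid + ".json" data file. The function name is hypothetical.

```python
import os
from datetime import date


def retire_older_than(root, cutoff):
    """Delete data files and links for every day before cutoff; return the uuids removed."""
    removed = []
    date_root = os.path.join(root, "date")
    for dirpath, dirnames, filenames in os.walk(date_root):
        parts = os.path.relpath(dirpath, date_root).split(os.sep)
        if len(parts) < 3 or not all(p.isdigit() for p in parts[:3]):
            continue  # still above the day level
        day = date(int(parts[0]), int(parts[1]), int(parts[2]))
        if day >= cutoff:
            dirnames[:] = []  # recent enough: skip this whole subtree
            continue
        for entry in list(dirnames) + filenames:
            link = os.path.join(dirpath, entry)
            if not os.path.islink(link):
                continue
            # Resolve the link before removing it to find the data file.
            data_file = os.path.join(os.path.realpath(link), entry + ".json")
            if os.path.exists(data_file):
                os.remove(data_file)
            os.unlink(link)
            if entry in dirnames:
                dirnames.remove(entry)
            removed.append(entry)
    return removed
```

Only the date branch is walked, so the cost depends on how much old data there is rather than on the total number of stored files.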

We expect this new system to speed up priority processing significantly. The uniform API embodied by our new class JsonDumpStorage will increase reliability and consistency across the applications that use the file system.

For more information, see: SocorroFileSystem
For progress information on deployment: Bug 458798

Socorro Delays

Monday, July 7th, 2008

Over the last week, we’ve encountered some problems in our monitor and processor caused by a large number of pending jobs:

  • main monitor thread takes > 1 hr to complete a full scan of pending jobs on disk
  • priority job processing depends on this thread

To fix these delays, we have made priority job monitoring a separate thread from the main queue thread, and we are working to bring the delay back down to 1-2 minutes (which is what we’re used to!).

Currently, we are blocked by issues with this new method related to filesystem scanning. This is blocking the archiving and data re-import mentioned in our last set of updates. See the related code.

We are working to fix this ASAP, and will provide updates this evening.

Socorro Database Updates

Monday, June 23rd, 2008

The Socorro database, which is the main database for Firefox 3 crashes, will be going through some maintenance upgrades this week. Starting tonight, this means:

  • Report data prior to June 23rd will be temporarily unavailable.
  • New reports will work as usual on our new database.
  • Old data will be imported into the new database using an improved partitioning plan.

Reasons why we need to do this now:

  • Firefox 3 crash throughput has been about triple the projected amount (due to the popularity of Firefox 3).
  • The current database has a partition that is unmanageable without significant downtime (2-3 days)
  • The Socorro reporter is not responsive because of the size of the current partition.

No individual reports will be lost; they will be restored over the next week. Please let us know if you have any questions.

Socorro Processor Updates

Monday, April 21st, 2008

Last Friday we pushed some important updates to Socorro:

  • Bug 426940 - Reduce or eliminate delay in collector to monitor hand-off
  • Bug 426940 - Fix processor handling of error conditions
  • Bug 428300 - status page too slow

This means:

  • When you submit a crash report you won’t have to wait longer than 30-60 seconds to view your report
  • The processor now has better handling of minidump_stackwalk fatal errors
  • There is an improved server status page where you can view stats on the current queue

Thanks to Lars and Aravind for getting this out the door.  The next couple of weeks will be spent improving reporter performance and closing out milestone 0.5 bugs.

Crash Analysis: now in Open Source flavor

Monday, April 21st, 2008

History can tell you that companies don’t disclose crashes in their software. They keep a pretty close eye on what crashes and bugs are disclosed.

Mozilla doesn’t.

Rather than being the exception, openness is the rule, and that is one of the coolest things about being a part of this. My job, my everyday tasks, they aren’t secret, and they are not to drive profits. They are to drive the web.

[Figure: Socorro screenshot]

In that spirit, our crash reporting system (Socorro) is available to whoever wants to view it. Aside from user-bound statistics, crash information is available in full and anybody in the community can learn about where in the code their client crashed. They can also help provide hints or comments about what they were doing at the time they crashed.

This opens the door for the community to learn valuable things about their software and how they use it:

  • What crashes the most? What crashes the most over time? What is the breakdown across branches, versions and products?
  • Where did we crash? Crash signatures provide a head start for locating the cause of a crash. From there, full stack traces are available to analyze the call stack and find the source of the actual crash.
  • What was installed? What modules were installed for a given crash? Soon we will also be able to understand what extensions were installed so we can understand the correlation between core client crashes and crashes caused by faulty extensions. The end result is a closer relationship with the extension developer community and better quality in our add-ons space.
  • How are we doing? Overall, the jackpot question is: are we crashing more or less? How are we doing with this beta, alpha or rc1? Are we regressing in real-life situations despite positive automated testing results?

All of this was possible because of a collaborative effort between quite a few parties:

  • Mark Mentovai and the breakpad team, for writing a great client and processor under a flexible open source license that is easy to integrate
  • Ted Mielczarek for his work on the client, processor and integrating the project into Firefox 3
  • Benjamin Smedberg and Robert Sayre for their work in getting the initial versions of the breakpad server off the ground

Where do we go from here?

Of the many projects we have in 2008, this is one of the most exciting. It’s an opportunity to open up information that hasn’t historically been available to the masses, and hack on a great tool for improving the quality of all Mozilla projects.

Socorro Updates

Friday, April 4th, 2008

We’ve pushed some important updates in the last couple of days:

  • refactor of processor code, which is 1/3 of the breakpad server
    architecture
  • update of reporter to allow for instant queuing of requested reports

This means:

  • If you submit a crash, going to that crash page will:
    • Show you a “haven’t queued it yet” page (instead of a 404
      page) that will update in < 10 min
    • Once queued, you’ll see a “report pending” page that will
      redirect to the finished report in < 21 seconds
  • Wait time for reports from testers is reduced to 10 min max,
    and as little as 21 seconds in the best case
  • We are working on eliminating the 10 min portion but there are
    reasons why we can’t spam the monitor that is responsible for
    queuing new reports that are on disk — more on that next week (I
    want this to get down to: load, wait 20 seconds, BAM! see your report)

Thanks for everyone’s patience with the crash report backlog during releases — we hope this helps many of you.

Let me know if you have any questions. More to come in the next few weeks! Thanks to Lars, Ted and Aravind for their help with developing, testing and pushing these updates.

Socorro Updates

Tuesday, July 31st, 2007

Socorro has had a few improvements over the last week. bsmedberg, ispiked and luser worked hard to bring you:

  • Graphs to show crash population over time
  • Crash reports by operating system
  • Crash reports by build and operating system

There are plenty of fun graphs that show trends over time. For example, see this graph that shows a common crash in gfxFont::Draw:

[Graph: crashes in gfxFont::Draw over time]

Up next, we’ll be finishing up milestone 2 items and working towards milestone 3. For more about the project and what we’re working on:

If you have any comments please find us in mozilla.dev.quality or #breakpad@irc.mozilla.org on IRC!

Partitioning Fun in PostgreSQL

Tuesday, May 15th, 2007

Last week I learned a few things (the hard way) about PostgreSQL (pgsql) partitioning:

  • You really have to read the “caveats” part of the manual (scroll down, very bottom)
  • Server config matters (SET constraint_exclusion = on;) if you want to avoid unnecessarily checking out-of-bounds partitions
  • Child tables don’t inherit permissions, so owning the parent table won’t automagically grant you access to child tables — which in this case prevented me from setting up my explicit indexes.
  • Foreign key constraints on a parent table are not inherited. This is actually a bug in pgsql. They plan on changing “CREATE TABLE LIKE” in pgsql 8.3 but fixing this for INHERIT statements is a bit different since it means inheriting for the lifetime of the parent->child relationship (instead of one-time-only during creation). For more, read chizu’s thread on the issue — he actually submitted a patch that fixed things for table creation. :)
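In practice, the points above mean that anything inheritance does not carry over has to be repeated by hand for each child table. Here is a small Python sketch that builds those statements for one partition. The helper name and every table, column and role name used with it are made up for illustration, and identifiers are interpolated unquoted, so it is only suitable for trusted, hard-coded names.

```python
def child_partition_ddl(parent, child, column, start, end,
                        role, fk_column, fk_table, fk_ref_column):
    """Return the SQL statements to set up one child partition explicitly."""
    return [
        # The CHECK constraint is what lets the planner skip this table
        # when constraint_exclusion is on.
        "CREATE TABLE %s (CHECK (%s >= DATE '%s' AND %s < DATE '%s')) INHERITS (%s);"
        % (child, column, start, column, end, parent),
        # Permissions on the parent are not inherited by the child.
        "GRANT SELECT, INSERT, UPDATE, DELETE ON %s TO %s;" % (child, role),
        # Indexes are created per child table.
        "CREATE INDEX %s_%s_idx ON %s (%s);" % (child, column, child, column),
        # Foreign keys on the parent are not inherited either.
        "ALTER TABLE %s ADD FOREIGN KEY (%s) REFERENCES %s (%s);"
        % (child, fk_column, fk_table, fk_ref_column),
    ]
```

For example, child_partition_ddl("reports", "reports_20070514", "crash_date", "2007-05-14", "2007-05-21", "reporter", "build_id", "builds", "id") returns four statements that can be reviewed and then run in order.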

Long story short, partitioning is tricky business. For more, see bsmedberg’s blog, where he talks about the performance of queries spanning multiple partitions and his battle with the PostgreSQL query planner.

From the very beginning, even before you start messing with the query planner, I think it’s important to understand the pitfalls of inheritance in pgsql and how it will affect your use of partitions.

So just be extra careful when you read the manual and make sure you’re aware of these issues when you plan your schema. If you’re not, it’ll come back to bite you.