Recently, we’ve been working on planning out the future of Socorro. If you’re not familiar with it, Socorro is Mozilla’s crash reporting system.
You may have noticed that Firefox has become a lot less crashy recently – we’ve seen a 40% improvement over the last five months. The data from crash reports enables our engineers to find, diagnose, and fix the most common crashes, so crash reporting is critical to these improvements.
On our peak day each week we receive 2.5 million crash reports and process 15% of those, for a total of 50 GB. In total, we receive around 320 GB each day! Right now we are handicapped by the limitations of our file system storage (NFS) and our database’s ability to handle really large tables. However, we are in the process of moving to Hadoop, and currently all our crashes are also being written to HBase. Soon this will become our main data storage, and we’ll be able to do a lot more interesting things with the data. We’ll also be able to process 100% of crashes. We want to do this because the long tail of crashes is increasingly interesting, and we may be able to get insights from the data that were not previously possible.
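To put those numbers in perspective, here is a rough back-of-the-envelope sketch. It assumes the 50 GB figure refers to the processed 15% on the peak day and that gigabytes are decimal; neither is stated explicitly above, so treat the results as estimates only.

```python
# Figures quoted in the post for the peak day of the week.
REPORTS_RECEIVED = 2_500_000
PROCESSED_FRACTION = 0.15
# Assumption: the 50 GB total covers only the processed reports,
# and "GB" means 10**9 bytes.
PROCESSED_BYTES = 50 * 10**9


def processed_reports(received: int, fraction: float) -> int:
    """Number of reports that make it through the throttle."""
    return round(received * fraction)


def average_size_kb(total_bytes: int, report_count: int) -> float:
    """Average size of one processed report, in kilobytes."""
    return total_bytes / report_count / 1000


def projected_full_volume_gb(total_bytes: int, fraction: float) -> float:
    """Volume if 100% were processed at the same average size."""
    return total_bytes / fraction / 10**9


processed = processed_reports(REPORTS_RECEIVED, PROCESSED_FRACTION)
avg_kb = average_size_kb(PROCESSED_BYTES, processed)
full_gb = projected_full_volume_gb(PROCESSED_BYTES, PROCESSED_FRACTION)

print(f"processed reports on a peak day: {processed}")
print(f"average processed report size:   {avg_kb:.0f} KB")
print(f"projected volume at 100%:        {full_gb:.0f} GB")
```

Under these assumptions, the projected volume at 100% lands in the same ballpark as the roughly 320 GB we receive each day, which is why moving off NFS matters before the throttle can be lifted.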
I’ll start by taking a look at how things have worked to date.
The data flows as follows:
Future Socorro releases are a joint project between Webdev, Metrics, and IT. Some of our milestones focus on infrastructure improvements, others on code changes, and still others on UI improvements. Features generally work their way through to users in this order.
The current production version is 1.6.3, which was released last Wednesday. We don’t usually do second dot point releases but we did 1.6.1, 1.6.2, and 1.6.3 to get Out Of Process Plugin (OOPP) support out to engineers as it was implemented.
When an OOPP becomes unresponsive, a pair of twin crashes is generated: one for the plugin process and one for the browser process. For beta and pre-release products, both of these crashes are available for inspection via Socorro. Unfortunately, Socorro throttles crash submissions from released products due to capacity constraints. This means one or the other of the twins may not be available for inspection. This limitation will vanish with the release of Socorro 1.8.
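To see why a throttled hang pair is so often incomplete, here is a rough sketch. It assumes, purely for illustration, that each twin is accepted independently with the same probability. How Socorro’s throttle actually selects reports is not described here, so the independence assumption and the function name are hypothetical; the 15% rate is the release channel figure quoted in this post.

```python
def pair_availability(accept_rate: float) -> dict:
    """Chance of seeing both, exactly one, or neither twin of a hang pair.

    Assumption (not from Socorro itself): each twin is accepted
    independently with probability `accept_rate`.
    """
    both = accept_rate * accept_rate
    exactly_one = 2 * accept_rate * (1 - accept_rate)
    neither = (1 - accept_rate) ** 2
    return {"both": both, "exactly_one": exactly_one, "neither": neither}


release_channel = pair_availability(0.15)
for outcome, probability in release_channel.items():
    print(f"{outcome}: {probability:.2%}")
```

Under that assumption only a small fraction of hang pairs would arrive complete, which is why removing the throttle matters for hang analysis.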
You can now see whether a given crash signature is a hang or a crash, and whether it was plugin or browser related. In the signature tables, if you see a stop sign symbol, that’s a hang. A window means it is crash report information from the browser, and a small blue brick means it is crash report information from the plugin.
If you are viewing one half of a hang pair for a pre-release or beta product, you’ll find a link to the other half at the top right of the report.
You can also limit your searches (using the Advanced Search Filters) to look just at hangs or just at crashes, or to filter by whether a report is browser or plugin related.
We are in the process of baking 1.7. The key feature of this release is that we will no longer be relying on NFS in production. All crash report submissions are already stored in HBase, but with Socorro 1.7, we will retrieve the data from HBase for processing and store the processed result back into HBase.
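The 1.7 data flow (read the raw submission from the store, process it, write the result back to the same store) can be sketched as follows. This is not Socorro code and does not use the HBase API: `InMemoryCrashStore`, `process_all`, and `fake_processor` are hypothetical names, and a pair of dictionaries stands in for HBase.

```python
from typing import Callable, Dict, List


class InMemoryCrashStore:
    """Hypothetical stand-in for the crash store (NOT the HBase API)."""

    def __init__(self) -> None:
        self.raw: Dict[str, bytes] = {}
        self.processed: Dict[str, dict] = {}

    def put_raw(self, crash_id: str, dump: bytes) -> None:
        self.raw[crash_id] = dump

    def unprocessed_ids(self) -> List[str]:
        # Raw submissions that have no processed result yet.
        return [cid for cid in self.raw if cid not in self.processed]

    def put_processed(self, crash_id: str, report: dict) -> None:
        self.processed[crash_id] = report


def process_all(store: InMemoryCrashStore,
                processor: Callable[[bytes], dict]) -> int:
    """Read each pending raw crash, process it, store the result back."""
    handled = 0
    for crash_id in store.unprocessed_ids():
        report = processor(store.raw[crash_id])
        store.put_processed(crash_id, report)
        handled += 1
    return handled


def fake_processor(dump: bytes) -> dict:
    # Placeholder for real processing such as minidump_stackwalk.
    return {"dump_size": len(dump)}


store = InMemoryCrashStore()
store.put_raw("crash-1", b"\x00" * 10)
store.put_raw("crash-2", b"\x00" * 20)
handled = process_all(store, fake_processor)
print(f"processed {handled} crashes")
```

The point of the sketch is that raw and processed data live in the same store, so no shared file system sits between submission and processing.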
In 1.8, we will migrate the processors and minidump_stackwalk instances to run on our Hadoop nodes, further distributing our architecture. This will give us the ability to scale up to the amount of data we have as it grows over time. You can see how this will simplify our architecture in the following diagram.
With this release, the 15% throttling of Firefox release channel crashes goes away entirely.
You may have noticed 1.9 is missing. In 2.0 we will be making the power of HBase available to the end user, so expect some significant UI changes.
Right now we are in the process of specifying the PRD for 2.0. This involves interviewing a lot of people on the Firefox, Platform, and QA teams. If we haven’t scheduled you for an interview and you think we ought to talk to you, please let us know.
This is a big list, obviously. We need your feedback – what should we work on first?
One thing that we’ve learned so far through the interviews is that people are not familiar with the existing features of Socorro, so expect further blog posts with more information on how best to use it!
As always, we welcome feedback and input on our plans.
You can contact the team at socorro-dev@mozilla.com, or me personally at laura@mozilla.com.
In addition, we always welcome contributions. You can find our code repository at
http://code.google.com/p/socorro/
We hold project meetings on a Wednesday afternoon – details and agendas are here
https://wiki.mozilla.org/Breakpad/Status_Meetings
morgamic on May 21st, 2010 at 1:14 am
Great post, Laura. I’m excited about the changes coming up and how it will improve the stability of Firefox. We’ve come a long way but we still have a lot of room to grow; it’s a good place to be.
David Tenser on May 27th, 2010 at 1:50 am
Looks great! Some questions:
* It doesn’t look like it’s possible to compare 3.6.3 with 3.6.4 by showing both graphs at the same time. Any plans for that?
* Why is 3.6.4 listed as a current release? Shouldn’t it be 3.6.3?
* If I compare the crashes per ADUs on 3.6.3 and 3.6.4, it looks like 3.6.4 crashes 10x more often. 3.6.3 is about 0.25 and 3.6.4 is around 3.0. I assume this is wrong?