‘Breakpad’ Archives
Three Weeks with the New Socorro File System
Three weeks ago today, we deployed the new Socorro file system into production. It was the first in a series of engineered improvements to the Socorro codebase. By “engineered”, I mean that it was the first major improvement to the code that wasn't done during an emergency with a gun to our heads. For the previous half year, we've been reactive instead of proactive. The new file system has performed quite well. The most outward expression of this improvement is the speed at which priority jobs are processed. A priority job is any submitted ... Read More »
Socorro’s File System Storage
As the scope and depth of the Socorro/Breakpad project have evolved over the last nine months, the most stable requirement of the project has been a file system as the initial server-side storage for submitted crash dumps. The file system gets used as an ad hoc hierarchical database, but it isn't optimized for the type of lookups that we need to do. Unfortunately, the original implementation had no indexing by name, which left us searching for entries (sort of like using find over NFS with 9 million entries). We patched and then patched the patches as the magnitude ... Read More »
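To make the "indexing by name" idea concrete, here is a minimal sketch of a layout where the directory is computed from the name itself, so a lookup is a path calculation rather than a search. This is an illustration only, not Socorro's actual layout: the function names, the `.dump` suffix, and the two-level, two-character split are all assumptions.

```python
import os


def radix_path(root, crash_id, depth=2, width=2):
    """Return the directory for crash_id, derived from its leading characters.

    Hypothetical scheme: "ab12cd34" -> <root>/ab/12
    """
    parts = [crash_id[i * width:(i + 1) * width] for i in range(depth)]
    return os.path.join(root, *parts)


def store_dump(root, crash_id, data):
    """Write a dump under its computed directory and return the file path."""
    directory = radix_path(root, crash_id)
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, crash_id + ".dump")
    with open(path, "wb") as f:
        f.write(data)
    return path


def find_dump(root, crash_id):
    """Look a dump up by name with one existence check, no directory walk."""
    path = os.path.join(radix_path(root, crash_id), crash_id + ".dump")
    return path if os.path.exists(path) else None
```

The point of the design is that `find_dump` never lists a directory, so its cost does not grow with the number of stored entries.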
Socorro Delays
Over the last week, we've encountered some problems in our monitor and processor caused by a large number of pending jobs:
- The main monitor thread takes > 1 hr to complete a full scan of pending jobs on disk.
- Priority job processing depends on this thread.
To fix these delays, we have made priority job monitoring a separate thread from the main queue thread, and we are still working to reduce the delay back to 1-2 minutes (which is what we're used to!). Currently, we are blocked by issues with this new method related to filesystem scanning. This is blocking the archiving and ... Read More »
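The thread split described above can be sketched roughly as follows. This is a simplified illustration, not the monitor's real code: `start_priority_monitor`, the in-memory queue, and the `process` callback are hypothetical stand-ins for however priority jobs are actually discovered and handed off.

```python
import queue
import threading


def start_priority_monitor(priority_queue, process, stop):
    """Serve priority jobs on a dedicated thread until `stop` is set.

    Because this loop runs on its own thread, a slow full scan on the
    main thread no longer delays priority jobs.
    """
    def loop():
        while not stop.is_set():
            try:
                # Short timeout so the loop notices `stop` promptly.
                job = priority_queue.get(timeout=0.05)
            except queue.Empty:
                continue
            process(job)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

The main thread stays free to run its long scan of pending jobs, and shuts the priority thread down by setting the `stop` event.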
Socorro Database Updates
The Socorro database, which is the main database for Firefox 3 crashes, will be going through some maintenance upgrades this week. Starting tonight, this means:
- Report data prior to June 23rd will be temporarily unavailable.
- New reports will work as usual on our new database.
- Old data will be imported into the new database using an improved partitioning plan.
Reasons why we need to do this now:
- Firefox 3 crash throughput has been about triple the projected amount (due to the popularity of Firefox 3).
- The current database has a partition that is unmanageable without significant downtime (2-3 days).
- The Socorro reporter is not responsive because ... Read More »
Socorro Processor Updates
Last Friday we pushed some important updates to Socorro:
- Bug 426940 - Reduce or eliminate delay in collector to monitor hand-off
- Bug 426940 - Fix processor handling of error conditions
- Bug 428300 - status page too slow
This means:
- When you submit a crash report you won't have to wait longer than 30-60 seconds to view your report.
- The processor now has better handling of minidump_stackwalk fatal errors.
- There is an improved server status page where you can view stats on the current queue.
Thanks to Lars and Aravind for getting this out the door. The next couple of weeks will be spent ... Read More »
Crash Analysis: now in Open Source flavor
History can tell you that companies don't disclose crashes in their software. They keep a pretty close eye on what crashes and bugs are disclosed. Mozilla doesn't. Rather than being the exception, openness is the rule, and that is one of the coolest things about being a part of this. My job, my everyday tasks, they aren't secret, and they are not to drive profits. They are to drive the web. In that spirit, our crash reporting system (Socorro) is available to whoever wants to view it. Aside ... Read More »
Socorro Updates
We've pushed some important updates in the last couple of days:
- refactor of processor code, which is 1/3 of the breakpad server architecture
- update of reporter to allow for instant queuing of requested reports
This means:
- If you submit a crash, going to that crash page will:
  - Show you a "haven't queued it yet" page, instead of a 404 page, that will update in < 10 min
  - Once queued, you'll see a "report pending" page that will redirect to the finished report in < 21 seconds
- Wait time for reports from testers is reduced to 10 min max, sometimes 21 seconds best-case
We are working on eliminating the 10 min portion ... Read More »
Socorro Updates
Socorro has had a few improvements over the last week. bsmedberg, ispiked and luser worked hard to bring you:
- Graphs to show crash population over time
- Crash reports by operating system
- Crash reports by build and operating system
There are plenty of fun graphs that show trends over time. For example, see this graph that shows a common crash in gfxFont::Draw. Up next, we'll be finishing up milestone 2 items and working towards milestone 3. For more about the project and what we're working on: See the project home page... Read More »
Partitioning Fun in PostgreSQL
Last week I learned a few things (the hard way) about PostgreSQL (pgsql) partitioning:
- You really have to read the "caveats" part of the manual (scroll down, very bottom).
- Server config matters (SET constraint_exclusion = on;) if you want to avoid unnecessarily checking out-of-bounds partitions.
- Child tables don't inherit permissions, so owning the parent table won't automagically grant you access to child tables -- which in this case prevented me from setting up my explicit indexes.
- Foreign key constraints on a parent table are not inherited. This is actually a bug in pgsql. They plan on changing "CREATE ... Read More »
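As a rough sketch of what those caveats mean in practice, the helper below generates the statements each monthly child table would need when using inheritance-based partitioning. The table name `reports`, the column `date_processed`, the role `processor`, and the helper itself are assumptions made for illustration; they are not taken from the actual Socorro schema.

```python
from datetime import date

# Session setting mentioned in the post; lets the planner skip child tables
# whose CHECK constraint rules them out.
ENABLE_EXCLUSION = "SET constraint_exclusion = on;"


def month_partition_ddl(parent, column, year, month, role):
    """Return the DDL for one monthly child table of `parent`.

    Grants and indexes are issued per child because, as noted above,
    neither is inherited from the parent table.
    """
    start = date(year, month, 1)
    # First day of the following month (rolls over to January of next year).
    end = date(year + (month == 12), month % 12 + 1, 1)
    child = f"{parent}_{start:%Y%m}"
    return [
        f"CREATE TABLE {child} (CHECK ({column} >= DATE '{start}' "
        f"AND {column} < DATE '{end}')) INHERITS ({parent});",
        f"GRANT SELECT, INSERT ON {child} TO {role};",
        f"CREATE INDEX {child}_{column}_idx ON {child} ({column});",
    ]
```

For example, `month_partition_ddl("reports", "date_processed", 2008, 6, "processor")` yields a `reports_200806` child with its own CHECK range, grant, and index. Any foreign keys would likewise have to be declared on each child.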
