<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Mozilla Web Development &#187; Socorro</title>
	<atom:link href="http://blog.mozilla.com/webdev/category/soccoro/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.mozilla.com/webdev</link>
	<description>Everybody Likes Ninjas</description>
	<lastBuildDate>Wed, 01 Feb 2012 16:41:24 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Moving Socorro to HBase</title>
		<link>http://blog.mozilla.com/webdev/2010/07/26/moving-socorro-to-hbase/</link>
		<comments>http://blog.mozilla.com/webdev/2010/07/26/moving-socorro-to-hbase/#comments</comments>
		<pubDate>Mon, 26 Jul 2010 21:33:04 +0000</pubDate>
		<dc:creator>Laura Thomson</dc:creator>
				<category><![CDATA[Socorro]]></category>
		<category><![CDATA[Web Development]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hbase]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/webdev/?p=1230</guid>
		<description><![CDATA[We&#8217;ve been incredibly busy over on the Socorro project, and I have been remiss in blogging. Over the next week or so I&#8217;ll be catching up on what we&#8217;ve been doing in a series of blog posts. If you&#8217;re not familiar with Socorro, it is the crash reporting system that catches, processes, and presents crash [...]]]></description>
			<content:encoded><![CDATA[<p>We&#8217;ve been incredibly busy over on the Socorro project, and I have been remiss in blogging.  Over the next week or so I&#8217;ll be catching up on what we&#8217;ve been doing in a series of blog posts.  If you&#8217;re not familiar with Socorro, it is the crash reporting system that catches, processes, and presents crash data for Firefox, Thunderbird, Fennec, Camino, and SeaMonkey.  You can see the output of the system at <a href="http://crash-stats.mozilla.com">http://crash-stats.mozilla.com</a>.   The project&#8217;s code is also being used by people outside Mozilla: most recently <a href="http://www.vigilgames.com/">Vigil Games</a> are using it to catch crashes from <a href="http://www.vigilgames.com/videos/dark-millennium-online">Warhammer 40,000: Dark Millennium Online</a>.</p>
<p>Back in June we launched Socorro 1.7, and we&#8217;re now approaching the release of 1.8.  In this post, I&#8217;ll review what each of these releases represents on our roadmap.</p>
<p>First, a bit of history on data storage in Socorro.  Until recently, when crashes were submitted, the collector placed them into storage in the file system (NFS).  Because of capacity constraints, the collector follows a set of throttling rules in its configuration file to decide where to route each crash.   Most crashes go to deferred storage and are not processed unless specifically requested.  However, some crashes are queued into standard storage for processing.  Generally this has been all crashes from alpha, beta, release candidate and other “special” versions;  all crashes with a user comment; all crashes from low volume products such as Thunderbird and Camino; and a specified percentage of all other crashes.  (Recently this has been between ten and fifteen percent.)</p>
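<p>The throttling decision described above can be sketched roughly as follows.  This is only an illustration, not the collector&#8217;s actual code: the rule order, the field names (product, version, comments), the version markers and the fifteen percent sample rate are all assumptions made for the example.</p>

```python
import random

# Hypothetical rules for illustration; the real collector reads its
# throttling rules from a configuration file.
SPECIAL_VERSION_MARKERS = ("a", "b", "rc", "pre")  # alpha, beta, release candidate, pre-release
LOW_VOLUME_PRODUCTS = {"Thunderbird", "Camino"}
SAMPLE_PERCENT = 15  # assumed share of all other crashes that gets processed


def choose_storage(crash, rng=random):
    """Return "standard" (queued for processing) or "deferred"."""
    version = crash.get("version", "")
    if any(marker in version for marker in SPECIAL_VERSION_MARKERS):
        return "standard"  # alpha, beta, rc and other special versions
    if crash.get("comments"):
        return "standard"  # the user left a comment
    if crash.get("product") in LOW_VOLUME_PRODUCTS:
        return "standard"  # low volume products are always processed
    # Everything else: keep a fixed percentage, defer the rest.
    if rng.randrange(100) in range(SAMPLE_PERCENT):
        return "standard"
    return "deferred"
```

<p>Passing the random number generator in as a parameter is a design choice for the sketch: it makes the sampling rule easy to test with a fixed value.</p>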
<p>The monitor process watched standard storage and assigned jobs to processors. A processor would pick up crashes from standard storage, process them, and write them to two places: our PostgreSQL database, and back into file system storage.  We had been using PostgreSQL for serving data to the webapp, and the file system storage for serving up the full processed crash.</p>
<p>For some time prior to 1.7, we&#8217;d been storing all crashes in HBase in parallel with writing them into NFS.  The main goal of 1.7 was to make HBase our chief storage mechanism.  This involved rewriting the collector and processor to write into HBase.  The monitor also needed to be rewritten to look in HBase rather than NFS for crashes awaiting processing.  Finally, we have a web service that allows users to pull the full crash, and this also needed to pull crashes from HBase rather than NFS.</p>
<p>Not long before code freeze, we decided we should add a configuration option to the processor to continue storing crashes in NFS as a fallback, in case we had any problems with the release.  This would allow us to do a staged switchover, putting processed crashes in both places until we were confident that HBase was working as intended.</p>
<p>During the maintenance window for 1.7 we also took the opportunity to upgrade HBase to the latest version.  We are now using Cloudera&#8217;s CDH2 Hadoop distribution and HBase 0.20.5.</p>
<p>The release went fairly smoothly, and three days later we were able to turn off the NFS fallback.</p>
<p>We&#8217;re now in the final throes of 1.8.  While we now have crashes stored in HBase, we are still capacity constrained by the number of processors available.  In 1.8, the processors and their associated minidump_stackwalk processes will be daemonized and move to run on the Hadoop nodes.  This means that we will be able to horizontally scale the number of processors with the size of the data.  Right now we are running fifteen Hadoop nodes in production and this is planned to increase over the rest of the year.</p>
<p>Some of the associated changes in 1.8 are also really exciting.  We are introducing a new component to the system, called the registrar.  This process will track heartbeats for each of the processors.  Also in this version, we have added an introspection API for the processors.  The registrar will act as a proxy, allowing us to request status and statistical information for each of the processors.  We will need to rebuild the status page (visible at http://crash-stats.mozilla.com/status) to use this new API, but we will have much better information about what each processor is doing.</p>
<p>We will freeze on 1.8 later this week, and expect release in about two weeks&#8217; time.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/webdev/2010/07/26/moving-socorro-to-hbase/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Socorro: Mozilla&#8217;s Crash Reporting System</title>
		<link>http://blog.mozilla.com/webdev/2010/05/19/socorro-mozilla-crash-reports/</link>
		<comments>http://blog.mozilla.com/webdev/2010/05/19/socorro-mozilla-crash-reports/#comments</comments>
		<pubDate>Wed, 19 May 2010 19:49:40 +0000</pubDate>
		<dc:creator>Laura Thomson</dc:creator>
				<category><![CDATA[Socorro]]></category>
		<category><![CDATA[Web Development]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/webdev/?p=1014</guid>
		<description><![CDATA[Recently, we&#8217;ve been working on planning out the future of Socorro.  If you&#8217;re not familiar with it, Socorro is Mozilla&#8217;s crash reporting system. You may have noticed that Firefox has become a lot less crashy recently &#8211; we&#8217;ve seen a 40% improvement over the last five months.  The data from crash reports enables our engineers [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, we&#8217;ve been working on planning out the future of Socorro.  If you&#8217;re not familiar with it, Socorro is Mozilla&#8217;s crash reporting system.</p>
<p>You may have noticed that Firefox has become a lot less crashy recently &#8211;  <a href="http://blog.mozilla.com/metrics/2010/04/08/dramatic-stability-improvements-in-firefox/">we&#8217;ve  seen a 40% improvement over the last five months</a>.  The data from  crash reports enables our engineers to find, diagnose, and fix the most  common crashes, so crash reporting is critical to these improvements.</p>
<p>On our peak day each week we receive <strong>2.5 million crash reports</strong>,  and process 15% of those, for a total of 50 GB.  In total, <strong>we receive around 320 GB each day</strong>!  Right now we are handicapped  by the limitations of our file system storage (NFS) and our database&#8217;s  ability to handle really large tables.   However, we are in the process of moving to Hadoop, and currently all our crashes are also being  written to HBase.  Soon this will become our main data storage, and  we&#8217;ll be able to do a lot more interesting things with the data.  We&#8217;ll  also be able to process 100% of crashes.  We want to do this because the  long tail of crashes is increasingly interesting, and we may be able to  get insights from the data that were not previously possible.</p>
<p>I&#8217;ll start by taking a look at how things have worked to date.</p>
<h2>History of Crash Reporting</h2>
<p><a href="http://blog.mozilla.com/webdev/files/2010/05/Socorro.OSCON_.1.png"><img class="alignnone size-medium wp-image-1019" title="Current Architecture" src="http://blog.mozilla.com/webdev/files/2010/05/Socorro.OSCON_.1-300x150.png" alt="Current Socorro Architecture" width="300" height="150" /></a></p>
<p>The data flows as follows:</p>
<ul>
<li>When Firefox crashes, the crash is submitted to Mozilla by a part of the browser known as Breakpad.  At Mozilla&#8217;s end, this is where Socorro comes into play.</li>
<li>Crashes are submitted to the collector, which writes them to storage.</li>
<li> The monitor watches for crashes arriving, and queues some of them for processing.  Right now, we throttle the system to only process 15% of crashes due to capacity issues.  (We also pick up and transform other crashes on demand as users request them.)</li>
<li>Processors pick up crashes and process them.  A processor gets its next job from a queue in our database, invokes minidump_stackwalk (a part of Breakpad) which combines the crash with symbols, where available.  The results are written back into the database.   Some further processing to generate reports (such as top crashes) is done nightly by a set of cron jobs.</li>
<li>Finally, the data is available to Firefox and Platform engineers (and anyone else that is interested) via the webui, at <a href="http://crash-stats.mozilla.com">http://crash-stats.mozilla.com</a></li>
</ul>
<h4>Implementation Details</h4>
<ul>
<li>The collector, processor, monitor and cron jobs are all written in Python.</li>
<li>Crashes are currently stored in NFS, and processed crash information in a PostgreSQL database.</li>
<li>The web app is written in PHP (using the Kohana framework) and draws data both from Postgres and from a Pythonic web service.</li>
</ul>
<h2>Roadmap</h2>
<p>Future Socorro releases are a joint project between Webdev, Metrics, and IT.  Some of our milestones focus on infrastructure improvements, others on code changes, and still others on UI improvements.  Features generally work their way through to users in this order.</p>
<ul>
<li>
<h3>1.6 &#8211; 1.6.3 (in production)</h3>
<p style="padding-left: 30px;">The current production version is 1.6.3, which was released last Wednesday.  We don&#8217;t usually do second dot point releases but we did 1.6.1, 1.6.2, and 1.6.3 to get Out Of Process Plugin (OOPP) support out to engineers as it was implemented.</p>
<p style="padding-left: 30px;">When an OOPP becomes unresponsive, a pair of twin crashes are generated: one for the plugin process and one for the browser process.  For beta and pre-release products, both of these crashes are available for inspection via Socorro.  Unfortunately, Socorro throttles crash submissions from released products due to capacity constraints.  This means one or the other of the twins may not be available for inspection.  This limitation will vanish with the release of Socorro 1.8.</p>
<p style="padding-left: 30px;">You can now see whether a given crash signature is a hang or a crash, and whether it was plugin or browser related.  In the signature tables, if you see a stop sign symbol, that&#8217;s a hang.  A window means it is crash report information from the browser, and a small blue brick means it is crash report information from the plugin.</p>
<p style="padding-left: 30px;">If you are viewing one half of a hang pair for a pre-release or beta product, you&#8217;ll find a link to the other half at the top right of the report.</p>
<p style="padding-left: 30px;">You can also limit your searches (using the Advanced Search Filters) to look just at hangs or just at crashes, or to filter by whether a report is browser or plugin related.</p>
</li>
<li>
<h3>1.7 (Q2)</h3>
<p style="padding-left: 30px;">We are in the process of baking 1.7.  The key feature of this release is that we will no longer be relying on NFS in production. All crash report submissions are already stored in HBase, but with Socorro 1.7, we will retrieve the data from HBase for processing and store the processed result back into HBase.</p>
</li>
<li>
<h3>1.8 (Q2)</h3>
<p style="padding-left: 30px;">In 1.8, we will migrate the processors and minidump_stackwalk instances to run on our Hadoop nodes, further distributing our architecture.  This will give us the ability to scale up to the amount of data we have as it grows over time. You can see how this will simplify our architecture in the following diagram.</p>
<p style="padding-left: 30px;"><a href="http://blog.mozilla.com/webdev/files/2010/05/diagram.11.png"><img class="alignnone size-medium wp-image-1020" title="New Socorro Architecture" src="http://blog.mozilla.com/webdev/files/2010/05/diagram.11-300x142.png" alt="New Socorro Architecture" width="300" height="142" /></a></p>
<p style="padding-left: 30px;">With this release, the 15% throttling of Firefox release channel crashes goes away entirely.</p>
</li>
<li>
<h3>2.0 (Q3 2010)</h3>
<p style="padding-left: 30px;">You may have noticed 1.9 is missing.  In 2.0 we will be making the power of HBase available to the end user, so expect some significant UI changes.</p>
<p style="padding-left: 30px;">Right now we are in the process of specifying the PRD for 2.0.  This involves interviewing a lot of people on the Firefox, Platform, and QA teams.  If we haven&#8217;t scheduled you for an interview and you think we ought to talk to you, please let us know.</p>
</li>
</ul>
<h2>Features under consideration</h2>
<ul>
<li>Full text search of crashes</li>
<li>Faceted search: start by finding crashes that match a particular signature, and then drill down into them by category.<br />
Which of these crashes involved a particular extension or plugin?  Which ones occurred within a short time after startup?</li>
<li>The ability to write and run your own Map/Reduce jobs (training will be provided)</li>
<li> Detection of &#8220;explosive crashes&#8221; that appear quickly</li>
<li>Viewing crashes by &#8220;build time&#8221; instead of clock time</li>
<li> Classification of crashes by component</li>
</ul>
<p>This is a big list, obviously.  We need your feedback &#8211; what should we work on first?</p>
<p>One thing that we&#8217;ve learned so far through the interviews is that people are not familiar with the existing features of Socorro, so expect further blog posts with more information on how best to use it!</p>
<h2>How to get involved</h2>
<p>As always, we welcome feedback and input on our plans.</p>
<p>You can contact the team at socorro-dev@mozilla.com, or me personally at laura@mozilla.com.</p>
<p>In addition, we always welcome contributions.  You can find our code repository at<br />
<a href="http://code.google.com/p/socorro/">http://code.google.com/p/socorro/</a></p>
<p>We hold project meetings on a Wednesday afternoon &#8211; details and agendas are here<br />
<a href="https://wiki.mozilla.org/Breakpad/Status_Meetings">https://wiki.mozilla.org/Breakpad/Status_Meetings</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/webdev/2010/05/19/socorro-mozilla-crash-reports/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Socorro Moves to New Hardware</title>
		<link>http://blog.mozilla.com/webdev/2009/05/15/socorro-moves-to-new-hardware/</link>
		<comments>http://blog.mozilla.com/webdev/2009/05/15/socorro-moves-to-new-hardware/#comments</comments>
		<pubDate>Fri, 15 May 2009 20:49:58 +0000</pubDate>
		<dc:creator>lars</dc:creator>
				<category><![CDATA[Socorro]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/webdev/?p=414</guid>
		<description><![CDATA[What has two quad core 3GHz 64bit CPUs, sixteen gigs of RAM and makes the Socorro users happy? That would be the new hardware that the Socorro system moved to during a six hour operation on Thursday night. The new hardware was recommended by the folks from the aptly named PostgreSQL Experts, Inc after an [...]]]></description>
			<content:encoded><![CDATA[<p>What has two quad core 3GHz 64bit CPUs, sixteen gigs of RAM and makes the Socorro users happy?  That would be the new hardware that the Socorro system moved to during a six hour operation on Thursday night.  The new hardware was recommended by the folks from the aptly named <a href="http://pgexperts.com">PostgreSQL Experts, Inc</a> after an intense week of consultation and analysis in March earlier this year.  After auditing our existing system of hardware and software, it was apparent that we were woefully underpowered for what we were trying to do.  While simply tuning PostgreSQL helped in the interim, a more powerful platform was clearly in order.</p>
<p>Before we deployed the new hardware, we had to take several steps to tame our voracious use of disk space.  In the previous week, we removed the archived dumps from the database.  They were rarely ever accessed but took up the lion&#8217;s share of our disk space.  By migrating them to file system storage, we turned a three hundred gig database migration onto new hardware into a migration of only sixty gig.</p>
<p>While there may be a need for tuning over the next week, Socorro users should have a much accelerated experience using the Socorro Web site.</p>
<p>Many thanks to <em>aravind</em> for shepherding this project through IT, <em>chizu</em> in IT for his db cloning/replication scripting/tweaking and <em>jberkus</em> from PostgreSQL Experts for his superior navigation skills and a steady hand at the PostgreSQL tiller.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/webdev/2009/05/15/socorro-moves-to-new-hardware/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Socorro Dumps Wave Good-bye to the Relational Database</title>
		<link>http://blog.mozilla.com/webdev/2009/04/20/socorro-dumps-wave-good-bye-to-the-relational-database/</link>
		<comments>http://blog.mozilla.com/webdev/2009/04/20/socorro-dumps-wave-good-bye-to-the-relational-database/#comments</comments>
		<pubDate>Mon, 20 Apr 2009 23:54:02 +0000</pubDate>
		<dc:creator>lars</dc:creator>
				<category><![CDATA[Socorro]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/webdev/?p=365</guid>
		<description><![CDATA[Let&#8217;s say we&#8217;ve got some twenty-five million chunks of data ranging in size from one K to several meg. Let&#8217;s also say that we only rarely ever need to access this data, but when we do, we need it fast. Would your first choice be to save this data in a relational database? That&#8217;s the [...]]]></description>
			<content:encoded><![CDATA[<p>Let&#8217;s say we&#8217;ve got some twenty-five million chunks of data ranging in size from one K to several meg.  Let&#8217;s also say that we only rarely ever need to access this data, but when we do, we need it fast.  Would your first choice be to save this data in a relational database?</p>
<p>That&#8217;s the situation that we&#8217;ve got in Socorro right now.  Each time we catch a crash coming in from the field, we process it and save a &#8220;cooked&#8221; version of the dump in the database.  We also save some details about the crash in other tables so that we can generate some aggregate statistics. </p>
<p>It&#8217;s that cooked dump that&#8217;s causing some concern.  The only time that we ever access that data is when someone requests that specific crash using the Socorro UI.  Considering that these cooked crashes take up nearly three quarters of the storage needs of our database, there&#8217;s not a lot of value there for the effort.  They inflate the hardware requirements for our database, make backups take too long and complicate any future database replication plans that we might consider.</p>
<p>We&#8217;re about to migrate our instance of Socorro to new shiny 64bit hardware.  Moving these great drifts of cooked dumps would take hours and necessitate potentially more than a day of down time for  production.  We don&#8217;t want that.</p>
<p>It&#8217;s time for a great migration.  All those dumps are going to leave the database.  We&#8217;re spooling them out into a file system storage scheme.  At the same time, we&#8217;re reformatting them into JSON.  In the next version of Socorro, when a user requests their dump by UUID, it will be served by Apache directly from a file system as a compressed JSON file.  The client will decompress it and through javascript magic give the same display that we&#8217;ve got now.</p>
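<p>As a rough sketch of what storing a cooked dump as a compressed JSON file keyed by UUID could look like: the directory layout (two levels taken from the start of the UUID), the file extension and the function names below are hypothetical and are not taken from the Socorro code.</p>

```python
import gzip
import json
import os


def dump_path(root, uuid):
    """Build the file path for a crash.

    The two-level fan-out by UUID prefix is an assumption; it only
    keeps any single directory from growing too large.
    """
    return os.path.join(root, uuid[:2], uuid[2:4], uuid + ".jsonz")


def save_dump(root, uuid, cooked):
    """Write the cooked dump as gzip-compressed JSON and return its path."""
    path = dump_path(root, uuid)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(cooked, handle)
    return path


def load_dump(root, uuid):
    """Read a cooked dump back, decompressing it on the way."""
    with gzip.open(dump_path(root, uuid), "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

<p>In the design described above the web server would hand the compressed file straight to the client; the load function is only here to show that the round trip is lossless.</p>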
<p>There are some future benefits to moving this data into a file system format.  Think about all of this data sitting there in a Hadoop friendly format waiting for a future data mining project.  We&#8217;ve nothing specific planned, but we&#8217;ve got the first step done.</p>
<p>We&#8217;re hoping to get the data migration done within the week.  New versions of the processing programs will have to be deployed as well as the changes to the Web application.  Once that&#8217;s done, we can proceed to the deployment of our fancy new hardware.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/webdev/2009/04/20/socorro-dumps-wave-good-bye-to-the-relational-database/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Crash Reporter Homepage Reskin</title>
		<link>http://blog.mozilla.com/webdev/2009/03/02/crash-reporter-homepage-reskin/</link>
		<comments>http://blog.mozilla.com/webdev/2009/03/02/crash-reporter-homepage-reskin/#comments</comments>
		<pubDate>Mon, 02 Mar 2009 20:02:50 +0000</pubDate>
		<dc:creator>Austin King</dc:creator>
				<category><![CDATA[Socorro]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/webdev/?p=309</guid>
		<description><![CDATA[The crash reporter has been given a new look, and the homepage has a new Dashboard. Our UX Engineer Neil Lee has applied some simplifications to the query form. This redesign was focused on the homepage and global navigation. Another new feature is that MTBF and Top Crashers By Signature can be exported in CSV [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://crash-stats.mozilla.com/">crash reporter</a> has been given a new look, and the homepage has a new <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=465660">Dashboard</a>.</p>
<p><img src="http://blog.mozilla.com/webdev/files/2009/03/crash_reporter_homepage.png" alt="Screenshot of Crash Reporter homepage" width="500" height="269" /></p>
<p>Our UX Engineer Neil Lee has applied some simplifications to the query form. This redesign was focused on the homepage and global navigation.</p>
<p>Another new feature is that MTBF and Top Crashers By Signature can be <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=478305">exported in CSV format</a>. In the future, as we want to slice and dice different reports, it should be trivial to add this feature to other reports.</p>
<p>In addition I&#8217;ve fixed a handful of issues:</p>
<ul>
<li><a href="https://bugzilla.mozilla.org/show_bug.cgi?id=428110">428110</a> &#8211; Quick and dirty changes to speed up crash analysis</li>
<li><a href="https://bugzilla.mozilla.org/show_bug.cgi?id=478043">478043</a> &#8211; Make &#8216;is exactly&#8217; the default choice</li>
<li><a href="https://bugzilla.mozilla.org/show_bug.cgi?id=479256">479256</a> &#8211; Clarify labels to be Date Processed</li>
<li><a href="https://bugzilla.mozilla.org/show_bug.cgi?id=470524">470524</a> &#8211; Crash signatures not indented</li>
<li><a href="https://bugzilla.mozilla.org/show_bug.cgi?id=479460">479460</a> &#8211; Bad Unicode in User Comments</li>
<li><a href="https://bugzilla.mozilla.org/show_bug.cgi?id=479447">479447</a> &#8211; report/list with no results has JS error</li>
</ul>
<p>We would love your feedback. Check out some <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=480617">recently filed bugs</a> or send us some feedback and <a href="https://bugzilla.mozilla.org/enter_bug.cgi?product=Webtools&amp;component=Socorro">file a new Socorro bug.</a></p>
<p>We&#8217;ve had some known issues around MTBF and Top Crashes by Signature in the last month and are working on fixing these issues. The upside is that <a href="http://crash-stats.mozilla.com/mtbf/of/SeaMonkey/development">SeaMonkey is now in MTBF</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/webdev/2009/03/02/crash-reporter-homepage-reskin/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Socorro Partitioning Rolled Back</title>
		<link>http://blog.mozilla.com/webdev/2009/02/02/socorro-partitioning-rolled-back/</link>
		<comments>http://blog.mozilla.com/webdev/2009/02/02/socorro-partitioning-rolled-back/#comments</comments>
		<pubDate>Mon, 02 Feb 2009 18:29:15 +0000</pubDate>
		<dc:creator>Mike Morgan</dc:creator>
				<category><![CDATA[Socorro]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/webdev/?p=179</guid>
		<description><![CDATA[This Thursday and Friday we attempted to push updates to re-partition our crash report database and optimize the reporting tool to take advantage of it.  This was the deployment of bug 432450 and a fix for bug 444749, among others. Our first attempt suffered from a network timeout, which required an eleven hour restore and [...]]]></description>
			<content:encoded><![CDATA[<p>This Thursday and Friday we attempted to push <a href="http://blog.mozilla.com/webdev/2009/01/20/socorro-database-partitioning-is-coming/">updates to re-partition our crash report database</a> and optimize the reporting tool to take advantage of it.  This was the deployment of <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=432450">bug 432450</a> and a fix for <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=444749">bug 444749</a>, among others.</p>
<p>Our first attempt suffered from a network timeout, which required an eleven hour restore and re-run.  The re-run, done Friday, was done using a socket connection but would have required an additional 1-3 days of downtime, which was well outside our originally announced window.  Consequently, the database was rolled back to its contents as of 6:55PM PDT, January 29.  Reports have since resumed processing.</p>
<p>We plan on doing the following:</p>
<ul>
<li>Set up a complete replica of production to test this process end-to-end.  Our dry runs were done on a staging database that was roughly 1/5 the size.  We anticipated a scaling of O(n), but in practice on the production server, we got performance more in line with O(n^2).   So we did not expect the full extent of timeouts or how much downtime would be needed.  This will be avoided in future updates and we are setting up a stage database from a recent dump (once we gather the hardware for it).</li>
<li>Push a now+ partitioning script.  The work done in <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=432450">bug 432450</a>, on top of a complex migration script for old data, has logic for handling new partitions automatically which benefits new reports.  Since we don&#8217;t want to keep adding to our old database schema, we will push these updates so that new reports are properly partitioned.  Pros &#8211; in a week or two, things will be speedy and we aren&#8217;t going to struggle with timeouts.  Cons &#8211; we aren&#8217;t migrating the last 4 weeks.  We will not see a performance increase when querying data older than the date of the repartitioning.</li>
</ul>
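<p>To make the gap between the two scaling assumptions concrete, here is a small back-of-the-envelope calculation.  The one hour staging run is a made-up figure used only for illustration; the factor of five comes from the staging database being roughly 1/5 the size of production.</p>

```python
def scaled_runtime(staging_hours, size_ratio, exponent):
    """Extrapolate a runtime measured on staging to a larger database.

    exponent=1 models O(n) scaling, exponent=2 models O(n^2).
    """
    return staging_hours * size_ratio ** exponent


STAGING_HOURS = 1.0  # hypothetical dry-run duration
SIZE_RATIO = 5       # production is roughly five times the staging size

linear_estimate = scaled_runtime(STAGING_HOURS, SIZE_RATIO, 1)     # 5.0 hours
quadratic_estimate = scaled_runtime(STAGING_HOURS, SIZE_RATIO, 2)  # 25.0 hours
```

<p>Under these assumptions a linear estimate is off by a further factor of five once the real behaviour is quadratic, which is why a full size replica matters for dry runs.</p>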
<p>We would like to push the partitioning script (without migration of old data) on Thursday.  We will announce when it will be as soon as we know.</p>
<p>Long term, we are already in the process of seeking additional resources to help examine our database configuration and systems architecture.  We will have more updates on that process in the future.</p>
<p>Our team wants this work deployed as much as everyone else.  Thanks to everyone for their patience as we work through these issues.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/webdev/2009/02/02/socorro-partitioning-rolled-back/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Socorro Database Partitioning is Coming</title>
		<link>http://blog.mozilla.com/webdev/2009/01/20/socorro-database-partitioning-is-coming/</link>
		<comments>http://blog.mozilla.com/webdev/2009/01/20/socorro-database-partitioning-is-coming/#comments</comments>
		<pubDate>Tue, 20 Jan 2009 21:20:20 +0000</pubDate>
		<dc:creator>lars</dc:creator>
				<category><![CDATA[Socorro]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/webdev/?p=154</guid>
		<description><![CDATA[How big can a table in a database get? Well, the answer varies by database, but for most modern databases, the answer is &#8220;really huge&#8221;. That&#8217;s what we&#8217;ve got going in Socorro, some honking big tables. Queries can get slow on big tables. Sure you can add indexing to prevent having to scan the whole thing for [...]]]></description>
			<content:encoded><![CDATA[<p>How big can a table in a database get?  Well, the answer varies by database, but for most modern databases, the answer is &#8220;really huge&#8221;.  That&#8217;s what we&#8217;ve got going in Socorro, some honking big tables.  Queries can get slow on big tables.  Sure you can add indexing to prevent having to scan the whole thing for the most common queries, but you can&#8217;t index every column without slowing performance elsewhere and using up ever more disk space.  Indexes can get really huge, too.</p>
<ul>
<li>tables divided into sub tables</li>
<li>queries optimized to use smaller subtables</li>
<li>ad hoc queries sped up</li>
<li>summary queries ignore irrelevant data</li>
<li>smaller indexes</li>
<li>simplified data retirement</li>
<li><em>fresh lemon scent</em>*</li>
</ul>
<p>Partitioning, a feature offered by many RDBMSs, is a trick to help manage titanic tables.  You take a table and break it up into smaller tables, each containing a part of the whole.   Say we have a data set called &#8216;reports&#8217;.  Rather than storing them all in one table, we store them in several smaller tables.  The data in each smaller table share a common trait.  For example, reports from Week 1, Week 2, and Week 3 would each have their own table.  The master table &#8216;reports&#8217; physically has no data in it at all.  The database knows that when we reference &#8216;reports&#8217; it means the union of all the smaller tables.  If we do a query on the &#8216;reports&#8217; table and we ask for reports from January, the database is clever enough to just look to the weekly sub-tables for January instead of looking at all the sub-tables.</p>
<p><img src="http://people.mozilla.com/~lars/socorro/images/db.partitioning.png" alt="the reports table divided into sub tables by week" /></p>
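<p>The idea of routing a report to its weekly sub-table, and of looking only at the sub-tables that overlap a date range, can be sketched like this.  The naming scheme (reports_ followed by the Monday that starts the week) is hypothetical, and in a real partitioned database the pruning is done by the database itself rather than by application code.</p>

```python
import datetime


def week_start(day):
    """Return the Monday of the week that contains the given date."""
    return day - datetime.timedelta(days=day.weekday())


def partition_for(report_date):
    """Name of the weekly sub-table that holds a report (assumed naming)."""
    return "reports_" + week_start(report_date).strftime("%Y%m%d")


def partitions_between(start, end):
    """Names of all weekly sub-tables that overlap the range start..end."""
    first = week_start(start)
    weeks = (end - first).days // 7 + 1
    return [
        partition_for(first + datetime.timedelta(weeks=offset))
        for offset in range(weeks)
    ]
```

<p>With this scheme a query for January 2009 touches five weekly sub-tables, starting with the one for the week of December 29, 2008, and ignores every other partition.</p>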
<p>We&#8217;re currently converting the Socorro database into a partitioned database (okay, technically it was already partitioned, but its partitioning was degenerate and didn&#8217;t work properly).  We&#8217;re testing a Python script that is going to take the &#8216;reports&#8217; table, along with the associated &#8216;frames&#8217; and &#8216;dumps&#8217; tables, and start breaking them into little one week chunks. Unfortunately, because of the massive size of the tables, we cannot afford to have two copies of the data in the database at the same time.  The chunking of the data will be destructive.  After a week of data is copied into a new partition, that corresponding week of data will be deleted from the original table.</p>
<p>The database, the Socorro Web App, breakpad crash processing and aggregate analysis will be down during the migration process.  However, <strong>data collection will not be down: we&#8217;re not going to lose new crash data</strong>.</p>
<p>If we were to chunk the entire dataset, we estimate that the migration process would take more than twenty hours.  As a compromise, we&#8217;re going to chunk only the most recent four weeks of data and leave the rest as a single oversize partition.  This will significantly reduce the time that migration takes and therefore reduce the down time.  We can get away with this because most of the aggregate reports only look at the most recent few weeks of data anyway.</p>
<p>Another advantage to partitioning is in the retirement of old data.  In the future, we&#8217;re probably only going to keep at most one hundred twenty days of history.  Any more than that and our storage needs would require its own building.  To get rid of the old data, all we need to do is delete the oldest partitions.  That action is fast because it doesn&#8217;t even require looking at indexes or scanning tables.</p>
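<p>Choosing which partitions to retire is simple date arithmetic.  The sketch below assumes weekly partitions identified by their start date and a 120-day retention window; the function name and sample dates are made up for illustration.</p>

```python
from datetime import date, timedelta

def partitions_to_retire(partition_starts, today, keep_days=120):
    """Return the start dates of weekly partitions that hold only expired data.

    A weekly partition covers [start, start + 6 days]; it can be dropped once
    even its last day is older than the retention cutoff.
    """
    cutoff = today - timedelta(days=keep_days)
    return [s for s in sorted(partition_starts) if s + timedelta(days=6) < cutoff]

starts = [date(2008, 9, 1), date(2008, 9, 8), date(2008, 12, 29), date(2009, 1, 5)]
expired = partitions_to_retire(starts, today=date(2009, 1, 20))
print(expired)
```

<p>Each date returned corresponds to a whole sub-table that can be dropped in one step, with no row-by-row delete.</p>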
<p>Partitioning is going to allow Socorro to scale much more smoothly.  At the same time, it will make our aggregate reporting much more efficient.</p>
<p>This repartitioning process will happen within the next week.  We will announce the scheduled down time in advance.  And be assured, because of the file system changes that we made during our last big Socorro update, data collection will <strong>not</strong> be down while we&#8217;re repartitioning.</p>
<p>* <em>also available in fresh wintergreen and sparkling pumpkin</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/webdev/2009/01/20/socorro-database-partitioning-is-coming/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Top Crashers By Url and MTBF</title>
		<link>http://blog.mozilla.com/webdev/2008/12/30/top-crashers-by-url-and-mtbf/</link>
		<comments>http://blog.mozilla.com/webdev/2008/12/30/top-crashers-by-url-and-mtbf/#comments</comments>
		<pubDate>Tue, 30 Dec 2008 23:06:03 +0000</pubDate>
		<dc:creator>Austin King</dc:creator>
				<category><![CDATA[Socorro]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/webdev/?p=133</guid>
		<description><![CDATA[  Working with ss and chofman, we&#8217;ve created 2 new types of reports: a Top Crashers by Url and a Mean Time Before Failure (MTBF).   Given the current state of performance of the non-report parts of Socorro&#8217;s webapp, most of the thought and time have gone into the backend piece of these reports. You can read about the ReportDatabaseDesign [...]]]></description>
			<content:encoded><![CDATA[<p> </p>
<div class="mceTemp">
<dl id="attachment_138" class="wp-caption alignleft" style="width: 160px;">
<dt class="wp-caption-dt"><a href="http://blog.mozilla.com/webdev/files/2008/12/crashreporter-512.png"><img class="size-thumbnail wp-image-138 " title="Crash Reporter Icon" src="http://blog.mozilla.com/webdev/files/2008/12/crashreporter-512-150x150.png" alt="Sofa's Crash Reporter Icon via Alex Faaborg" width="150" height="150" /></a></dt>
</dl>
</div>
<p>Working with ss and chofman, we&#8217;ve created 2 new types of reports: a Top Crashers by Url and a Mean Time Before Failure (MTBF).</p>
<p>Given the current state of performance of the non-report parts of Socorro&#8217;s webapp, most of the thought and time have gone into the backend piece of these reports. You can read about the <a title="More info on Report Database Design on wiki" href="http://code.google.com/p/socorro/wiki/ReportDatabaseDesign">ReportDatabaseDesign</a> on the project&#8217;s wiki.</p>
<h3>Top Crashers by URL</h3>
<blockquote><p>On which websites do our browser builds crash the most? Which curses do our users hurl at us when this happens?</p></blockquote>
<p>This report uses the optional url field of a crash report to answer this question. It has two modes: <strong>byurl</strong> and <strong>bydomain</strong>. You can read more about the details on <a title="Details of Top Crashers By Url report" href="http://code.google.com/p/socorro/wiki/TopCrashersByUrl">TopCrashersByUrl</a>. For crashes that have a comment, the report includes the comment and a link to the actual crash report. Don&#8217;t worry: personal details have been removed, and we don&#8217;t tie a specific user to a specific url.</p>
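<p>To give a feel for the difference between the two modes, here is a small Python sketch that ranks crashes either by full URL or by domain. It is not the report&#8217;s actual backend; the function name, the <code>limit</code> parameter, and the sample URLs are assumptions for illustration.</p>

```python
from collections import Counter
from urllib.parse import urlparse

def top_crashers(crash_urls, mode="byurl", limit=3):
    """Rank crash URLs by frequency; mode is 'byurl' or 'bydomain'."""
    if mode == "bydomain":
        keys = [urlparse(u).netloc for u in crash_urls if u]
    elif mode == "byurl":
        keys = [u for u in crash_urls if u]
    else:
        raise ValueError("mode must be 'byurl' or 'bydomain'")
    return Counter(keys).most_common(limit)

# The url field is optional, so missing values (None) are skipped.
sample = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/",
    None,
    "http://example.com/a",
]
print(top_crashers(sample, "byurl"))
print(top_crashers(sample, "bydomain"))
```

<p>Grouping by domain folds every page on a site into one row, which is why the same data can produce a different ranking in each mode.</p>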
<p>We will be putting links into Socorro to these new reports, with the work neilio is doing, but for now here are various links.</p>
<p>We&#8217;ve enabled top crashers by URL for Firefox <a href="http://crash-stats.mozilla.com/topcrasher/byurl/Firefox/3.0.5">3.0.5</a>, <a href="http://crash-stats.mozilla.com/topcrasher/byurl/Firefox/3.1b2">3.1b2</a>, <a href="http://crash-stats.mozilla.com/topcrasher/byurl/Firefox/3.1b3pre">3.1b3pre</a>, and <a href="http://crash-stats.mozilla.com/topcrasher/byurl/Firefox/3.0.6pre">3.0.6pre</a>. Each of these links to a &#8220;by domain&#8221; breakdown, so 3.0.6pre has a link to this <a href="http://crash-stats.mozilla.com/topcrasher/bydomain/Firefox/3.0.6pre">by domains</a> view.</p>
<h3>MTBF</h3>
<blockquote><p>Is this new release more crashy than previous releases?</p></blockquote>
<p>Squeaking in before New Year&#8217;s Eve&#8217;s MFBT comes the MTBF report. It is a graph of the average number of seconds a release runs before crashing. Details are at <a href="http://code.google.com/p/socorro/wiki/MeanTimeBeforeFailure">MeanTimeBeforeFailure</a> on the wiki.</p>
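<p>The underlying calculation is just an average. As a minimal sketch, assuming each crash report carries the number of seconds the product ran before crashing (the function name and sample values below are hypothetical):</p>

```python
def mean_time_before_failure(uptimes_seconds):
    """Average number of seconds a release ran before each reported crash."""
    if not uptimes_seconds:
        return None  # no crashes reported, so there is nothing to average
    return sum(uptimes_seconds) / len(uptimes_seconds)

# Hypothetical uptimes, in seconds, for two releases.
old_release = [600, 1200, 1800]
new_release = [300, 450, 750]
print(mean_time_before_failure(old_release))
print(mean_time_before_failure(new_release))
```

<p>In this made-up data the newer release has the lower average, so it would show up on the graph as the more crashy of the two.</p>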
<p>We&#8217;re running MTBF reports for 14 releases:</p>
<p>Firefox <a href="http://crash-stats.mozilla.com/mtbf/of/Firefox/major">major</a>, <a href="http://crash-stats.mozilla.com/mtbf/of/Firefox/milestone">milestone</a>, and <a href="http://crash-stats.mozilla.com/mtbf/of/Firefox/development">development</a> releases.</p>
<p>Thunderbird <a href="http://crash-stats.mozilla.com/mtbf/of/Thunderbird/milestone">milestone</a>, and <a href="http://crash-stats.mozilla.com/mtbf/of/Thunderbird/development">development</a> releases. (No Milestone releases in Socorro yet)</p>
<p>Coming Soon: <strong>SeaMonkey</strong></p>
<p>These reports are for a release in general as well as stats for Mac and Win, allowing for drilling down into OS. Several frontend enhancements to this report are coming.</p>
<p>Products and versions in these reports include:</p>
<ul>
<li>Firefox 3.0.4</li>
<li>Firefox 3.0.5  </li>
<li>Firefox 3.1a2</li>
<li>Firefox 3.1b1</li>
<li>Firefox 3.1b2 </li>
<li>Firefox 3.0.4pre</li>
<li>Firefox 3.0.5pre</li>
<li>Firefox 3.0.6pre</li>
<li>Firefox 3.1b3pre</li>
<li>Firefox 3.1b2pre</li>
<li>Thunderbird 3.0a3</li>
<li>Thunderbird 3.0b1</li>
<li>Thunderbird 3.0b1pre</li>
<li>Thunderbird 3.0b2pre</li>
</ul>
<p>I&#8217;ve gotten a good dose of feedback on tweaks to make and bugs to fix, but hopefully you&#8217;ll find these new reports useful. Tomcat has already mentioned augmenting his list of urls to populate his test automation for 3.1 (using spider to test most popular urls) with the urls in these reports.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/webdev/2008/12/30/top-crashers-by-url-and-mtbf/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Socorro wireframes</title>
		<link>http://blog.mozilla.com/webdev/2008/12/04/socorro-wireframes/</link>
		<comments>http://blog.mozilla.com/webdev/2008/12/04/socorro-wireframes/#comments</comments>
		<pubDate>Thu, 04 Dec 2008 17:52:13 +0000</pubDate>
		<dc:creator>Neil Lee</dc:creator>
				<category><![CDATA[Socorro]]></category>
		<category><![CDATA[Design]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/webdev/?p=122</guid>
		<description><![CDATA[As part of our ongoing work on the Mozilla crash reporting system (codenamed &#8220;Socorro&#8221;) a redesign of the entire interface is in the works, and I have some preliminary wireframes to share for feedback and discussion. My personal goals with this redesign are to make working with crash data more efficient, and to make each [...]]]></description>
			<content:encoded><![CDATA[<p>As part of our ongoing work on the Mozilla crash reporting system (codenamed &#8220;Socorro&#8221;) a redesign of the entire interface is in the works, and I have some preliminary wireframes to share for feedback and discussion.</p>
<p>My personal goals with this redesign are to make working with crash data more efficient, and to make each screen as useful and as intuitive as possible. This project is an interesting challenge, however, as there are not a lot of publicly-available examples of crash reporting systems to use as a baseline.</p>
<h3>&#8220;Home&#8221; page</h3>
<p>Current page: <a href="http://crash-stats.mozilla.com/">http://crash-stats.mozilla.com/</a></p>
<p>Currently when you go to the <a href="http://crash-stats.mozilla.com/">Socorro home page</a> you&#8217;re dumped right into search. This makes a lot of sense, but there could be a lot more useful data right up front. This wireframe tries to incorporate more of a &#8220;dashboard&#8221; approach with top crashers for release versions and other pertinent information.</p>
<div style="text-align:center"><p><a href="http://blog.mozilla.com/webdev/files/2008/12/search-basic-full.png"><img src="http://blog.mozilla.com/webdev/files/2008/12/search-basic1.png" alt="search-basic.png" border="0" width="450" /></a></p>
<p><strong>Home page, default configuration</strong></p></div>
<div style="text-align:center"><p><a href="http://blog.mozilla.com/webdev/files/2008/12/search-full-full.png"><img src="http://blog.mozilla.com/webdev/files/2008/12/search-full1.png" alt="search-full.png" border="0" width="450" /></a></p>
<p><strong>Home page with advanced filters visible</strong></p>
</div>
<p>Some key changes in this wireframe:</p>
<ol>
<li>Search is now called &#8220;Filter&#8221; as that&#8217;s really what you&#8217;re doing.</li>
<li>The filter options have been cleaned up a bit, and more specific filters are now grouped under <em>Advanced</em> and hidden by default.</li>
<li>The Advanced Filters toggle will remember if it was opened or closed, so if you regularly access any of these options they will be easily accessible.</li>
<li>There is a new &#8220;top crashers&#8221; widget that shows the top 3-5 reported crashes for current release versions.</li>
<li>The boxes underneath the filter / top crashers widget are for other chunks of data such as top crashers for development versions, mean time before failure, top URLs that cause crashes, etc.</li>
<li>The <strong>versions</strong> filter auto-fills with just the versions available for the selected product, to help keep the number of options down.</li>
<li>The &#8220;Mozilla Developers&#8221; button at the top right-hand corner opens a jump menu that lists all of the Mozilla developer web sites / tools. I think it&#8217;s kind of silly the various Mozilla developer sites aren&#8217;t linked together and this navigational tool addresses that deficiency.</li>
</ol>
<h3>Top Crashers</h3>
<p>Current page: <a href="http://crash-stats.mozilla.com/topcrasher">http://crash-stats.mozilla.com/topcrasher</a></p>
<div style="text-align:center"><a href="http://blog.mozilla.com/webdev/files/2008/12/topcrashers-full.png"><img src="http://blog.mozilla.com/webdev/files/2008/12/topcrashers.png" alt="topcrashers.png" border="0" width="450" /></a></div>
<p>At the moment the existing page is nothing more than a jumping point to link you to the various product versions. The redesigned wireframe tries to float up crash report information for more commonly used product versions while whittling the number of versions down to a more sensible number.</p>
<h3>Individual Crash Signatures</h3>
<p>Example: <a href="http://crash-stats.mozilla.com/report/list?product=Firefox&amp;version=Firefox%3A3.1b2pre&amp;query_search=signature&amp;query_type=contains&amp;query=&amp;date=&amp;range_value=1&amp;range_unit=weeks&amp;do_query=1&amp;signature=mozcrt19.dll%400x1838a">Crash reports in mozcrt19.dll@0x1838a</a></p>
<div style="text-align:center"><a href="http://blog.mozilla.com/webdev/files/2008/12/individual-signature-full.png"><img src="http://blog.mozilla.com/webdev/files/2008/12/individual-signature2.png" alt="individual-signature.png" border="0" width="450" /></a></div>
<p>The redesign does away with the tabbed interface currently in use and brings everything onto one page for quicker access. The top left-hand box displays either the number of crashes by operating system (for release versions) or the number of crashes by build (for development versions).</p>
<h3>Give us your comments, your feedback, your huddled criticism</h3>
<p>As I mentioned these are quite preliminary and I&#8217;m very interested to hear your comments and whether you feel these wireframes are headed in the right direction.</p>
<p>Redesigning a tool such as Socorro is rather challenging as there aren&#8217;t many existing examples of good design in this area (or many publicly available examples at all for that matter). I also had to keep in mind that whatever is created needs to work for both Mozilla&#8217;s specific requirements as well as be generic enough to be adaptable by others, which makes this redesign even more tricky.</p>
<p>I spent quite a bit of time looking at similar or parallel systems such as network and web hosting dashboards to get some ideas for how this information could be displayed and these wireframes incorporate some of my discoveries.</p>
<p>Beefs, bouquets, comments, suggestions?</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/webdev/2008/12/04/socorro-wireframes/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Three Weeks with the New Socorro File System</title>
		<link>http://blog.mozilla.com/webdev/2008/12/01/three-weeks-with-the-new-socorro-file-system/</link>
		<comments>http://blog.mozilla.com/webdev/2008/12/01/three-weeks-with-the-new-socorro-file-system/#comments</comments>
		<pubDate>Mon, 01 Dec 2008 17:12:11 +0000</pubDate>
		<dc:creator>lars</dc:creator>
				<category><![CDATA[Breakpad]]></category>
		<category><![CDATA[Socorro]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/webdev/?p=107</guid>
		<description><![CDATA[Three weeks ago today, we deployed the new Socorro file system into production. It was the first in a series of engineered improvements to the Socorro codebase. By “engineered”, I mean that it was the first major improvement to the code that wasn&#8217;t done during an emergency with a gun to our heads. For [...]]]></description>
			<content:encoded><![CDATA[<p>Three weeks ago today, we deployed the new Socorro file system into production.  It was the first in a series of engineered improvements to the Socorro codebase.  By “engineered”, I mean that it was the first major improvement to the code that wasn&#8217;t done during an emergency with a gun to our heads.  For the previous half year, we&#8217;ve been reactive instead of proactive. </p>
<p>The new file system has performed quite well.  The most outward expression of this improvement is the speed at which priority jobs are processed.  </p>
<p>A priority job is any submitted crash for which someone has requested a report.  There can be a backlog of submitted crashes and it might take from several minutes to several hours for the processing programs to get around to a particular job.  If someone requests a particular crash, we&#8217;ve got a way for that job to jump the queue for immediate processing.   Prior to the new file system, the biggest hurdle to processing a job quickly was simply finding it.   There was no index to assist in finding a job quickly.</p>
<p>The new file system changed that.  All entries are indexed as they&#8217;re inserted.  To see how it&#8217;s done, see my previous blog posting.  This gives us very fast access to any crash dump which translates to response times of thirty to ninety seconds for priority job requests.  Try it.  Considering the volume of crashes we get, it&#8217;s amazing that we can zero in and process a crash so quickly.</p>
<p>The last two weeks haven&#8217;t all been champagne and fireworks.  We had a scare about forty-eight hours after deployment.  The automatic indexing scheme uses a radix algorithm to spread crash dumps evenly through a branching file system structure.  During design, we chose to make this structure four levels deep.  Each level did a 256-way split of the directory tree.  That translates into 256^4 possible directories or about 4.3 billion.  Once a directory was created, we never retired it, thinking that it would be faster to reuse old directories than to bother destroying and creating them all the time.  At the rate that we received new files, we calculated that it would take years to clog up the file system.  We banked on the assumption that we had at least 4.3 billion inodes available in the file system.</p>
<p>It was a bad assumption.  It turns out that we&#8217;re using some sort of black box storage system with variable sized inodes.  We didn&#8217;t have 4.3G inodes available; we had only 64M.  Back in reactive-coding-as-performance-art mode, we took twenty-four hours to brainstorm, code, and deploy a solution.  Changing the number of levels from four to three was an obvious way to reduce our footprint: 256^3 is only 16M.  The number of levels of our radix directory structure is now a configuration option.  The trick was making four days of data stored with four levels compatible with new data being collected with fewer levels.  I managed that by encoding the number of levels into the uuid of each crash.</p>
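<p>The arithmetic behind the scare is easy to check.  The sketch below counts the bottom-level directories a fully populated tree would need at each depth and compares that with the inode budget.  Reading &#8220;64M&#8221; as 64 * 2^20 is my assumption, and the function names are invented for the example.</p>

```python
INODE_BUDGET = 64 * 1024 ** 2  # assumption: "64M" read as 64 * 2**20

def directory_count(depth, fanout=256):
    """Upper bound on bottom-level directories in a radix tree `depth` levels deep."""
    return fanout ** depth

def fits_inode_budget(depth, budget=INODE_BUDGET):
    """True when even a fully populated bottom level stays within the budget."""
    return directory_count(depth) <= budget

for depth in (4, 3, 2):
    print(depth, directory_count(depth), fits_inode_budget(depth))
```

<p>This counts only the bottom-level directories; the intermediate directories and the crash files themselves also consume inodes, so the real headroom is smaller than the comparison suggests.</p>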
<p>Next time you see a crash uuid, take a look at the digits.  The seventh digit from the right end will tell you how deep your crash is stored in the file system.  If it&#8217;s &#8216;0&#8217; then you&#8217;re stored four levels deep.  Any other digit is to be taken literally: &#8216;2&#8217; – two levels, &#8216;3&#8217; – three levels.  This crazy scheme lets the depth be switchable at run time.  If directories are getting too crowded, we can raise the depth.  Or if we start running out of inodes, we can lower the depth.</p>
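<p>Decoding the depth is a one-character lookup.  Here is a minimal sketch of the rule described above; the sample uuids are invented, and the way <code>radix_path</code> names each level after successive pairs of characters is an assumption for illustration, not necessarily how Socorro lays out its directories.</p>

```python
def storage_depth(crash_uuid):
    """Read the directory depth from the seventh character from the right."""
    marker = crash_uuid[-7]
    if marker == "0":
        return 4  # '0' marks the original four-level layout
    return int(marker)  # any other digit is the depth itself

def radix_path(crash_uuid):
    """Hypothetical layout: each level is named by the next two characters."""
    digits = crash_uuid.replace("-", "")
    depth = storage_depth(crash_uuid)
    return "/".join(digits[2 * i:2 * i + 2] for i in range(depth))

# Invented uuids that differ only in the depth marker.
old_style = "0bba929f-8721-460c-8e70-a43c20081201"
new_style = "0bba929f-8721-460c-8e70-a43c23081201"
print(storage_depth(old_style), radix_path(old_style))
print(storage_depth(new_style), radix_path(new_style))
```

<p>Because the depth travels with the uuid, crashes stored under the old four-level layout and crashes stored under a shallower one can be found by the same lookup code.</p>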
<p>Great thanks to Frank Griswold for the coding and to Aravind for not throwing knives at me.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/webdev/2008/12/01/three-weeks-with-the-new-socorro-file-system/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

