<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Mozilla Web Development &#187; Socorro</title>
	<atom:link href="http://blog.mozilla.com/webdev/tag/soccoro/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.mozilla.com/webdev</link>
	<description>Everybody Likes Ninjas</description>
	<lastBuildDate>Wed, 01 Feb 2012 16:41:24 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>The future of crash reporting</title>
		<link>http://blog.mozilla.com/webdev/2010/08/05/the-future-of-crash-reporting/</link>
		<comments>http://blog.mozilla.com/webdev/2010/08/05/the-future-of-crash-reporting/#comments</comments>
		<pubDate>Thu, 05 Aug 2010 15:48:45 +0000</pubDate>
		<dc:creator>Laura Thomson</dc:creator>
				<category><![CDATA[Web Development]]></category>
		<category><![CDATA[hbase]]></category>
		<category><![CDATA[Socorro]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/webdev/?p=1245</guid>
		<description><![CDATA[In recent blog posts I&#8217;ve talked about our plans for Socorro and our move to HBase. Today, I&#8217;d like to invite community feedback on the draft of our plans for Socorro 2.0.  In summary, we have been moving our data into HBase, the Hadoop database.  In 1.7 we began exclusively using HBase for crash storage.  [...]]]></description>
			<content:encoded><![CDATA[<p>In recent blog posts I&#8217;ve talked about <a href="http://blog.mozilla.com/webdev/2010/05/19/socorro-mozilla-crash-reports/">our plans</a> for <a href="http://code.google.com/p/socorro/">Socorro</a> and our <a href="http://blog.mozilla.com/webdev/2010/07/26/moving-socorro-to-hbase/">move to HBase</a>.</p>
<p>Today, I&#8217;d like to invite community feedback on the <a href="https://wiki.mozilla.org/Socorro:PRD_2.x">draft of our plans for Socorro 2.0</a>.  In summary, we have been moving our data into HBase, the Hadoop database.  In 1.7 we began exclusively using HBase for crash storage.  In 1.8 we will move the processors and minidump_stackwalk to Hadoop.</p>
<h3>Here comes the future</h3>
<p>In 1.9, we will enable pulling data from HBase for the webapp via a web services layer.  This layer is also known as  &#8220;the pythonic middleware layer&#8221;.  (Nominations for a catchier name are open.  My suggestion of calling it &#8220;hoopsnake&#8221; was not well received.)</p>
<p>In 2.0 we will expose HBase functionality to the end user.  We also have a number of other improvements planned for the 2.x releases, including:</p>
<ul>
<li>Full text search of crashes</li>
<li>Faceted search</li>
<li>Ability for users to run MapReduce jobs from the webapp</li>
<li>Better visibility for explosive and critical crashes</li>
<li>Better post-crash user engagement via email</li>
</ul>
<p>Full details can be found in the <a href="https://wiki.mozilla.org/Socorro:PRD_2.x">draft PRD</a>.  If you prefer the visual approach you can read the <a href="https://wiki.mozilla.org/File:Socorro.next_summit.key.gz">slides</a> I presented at the <a href="https://wiki.mozilla.org/Summit2010">Mozilla Summit</a> last month.</p>
<h3>Give us feedback!</h3>
<p>We welcome all feedback from the community of users &#8211; please take a look and let us know what we&#8217;re missing.  We&#8217;re also really interested in feedback about the best order in which to implement the planned features.</p>
<p>You can send your feedback to laura at mozilla dot com &#8211; I look forward to reading it.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/webdev/2010/08/05/the-future-of-crash-reporting/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Moving Socorro to HBase</title>
		<link>http://blog.mozilla.com/webdev/2010/07/26/moving-socorro-to-hbase/</link>
		<comments>http://blog.mozilla.com/webdev/2010/07/26/moving-socorro-to-hbase/#comments</comments>
		<pubDate>Mon, 26 Jul 2010 21:33:04 +0000</pubDate>
		<dc:creator>Laura Thomson</dc:creator>
				<category><![CDATA[Socorro]]></category>
		<category><![CDATA[Web Development]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hbase]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/webdev/?p=1230</guid>
		<description><![CDATA[We&#8217;ve been incredibly busy over on the Socorro project, and I have been remiss in blogging. Over the next week or so I&#8217;ll be catching up on what we&#8217;ve been doing in a series of blog posts. If you&#8217;re not familiar with Socorro, it is the crash reporting system that catches, processes, and presents crash [...]]]></description>
			<content:encoded><![CDATA[<p>We&#8217;ve been incredibly busy over on the Socorro project, and I have been remiss in blogging.  Over the next week or so I&#8217;ll be catching up on what we&#8217;ve been doing in a series of blog posts.  If you&#8217;re not familiar with Socorro, it is the crash reporting system that catches, processes, and presents crash data for Firefox, Thunderbird, Fennec, Camino, and SeaMonkey.  You can see the output of the system at <a href="http://crash-stats.mozilla.com">http://crash-stats.mozilla.com</a>.  The project&#8217;s code is also being used by people outside Mozilla: most recently <a href="http://www.vigilgames.com/">Vigil Games</a> are using it to catch crashes from <a href="http://www.vigilgames.com/videos/dark-millennium-online ">Warhammer 40,000: Dark Millennium Online</a>.</p>
<p>Back in June we launched Socorro 1.7, and we&#8217;re now approaching the release of 1.8.  In this post, I&#8217;ll review what each of these releases represents on our roadmap.</p>
<p>First, a bit of history on data storage in Socorro.  Until recently, when crashes were submitted, the collector placed them into storage in the file system (NFS).  Because of capacity constraints, the collector follows a set of throttling rules in its configuration file to decide where each crash is stored.  Most crashes go to deferred storage and are not processed unless specifically requested.  However, some crashes are queued into standard storage for processing.  Generally this has been all crashes from alpha, beta, release candidate and other “special” versions; all crashes with a user comment; all crashes from low volume products such as Thunderbird and Camino; and a specified percentage of all other crashes.  (Recently this has been between ten and fifteen percent.)</p>
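<p>As a rough illustration, the throttling decision described above might look something like the sketch below.  The field names, channel labels and the <code>should_process</code> function are assumptions made up for this example; the collector&#8217;s real configuration format is not shown in this post.</p>
<pre>
```python
import random

# Hypothetical field names and values; the real collector configuration is not shown in this post.
SPECIAL_CHANNELS = {"alpha", "beta", "rc"}
LOW_VOLUME_PRODUCTS = {"Thunderbird", "Camino"}


def should_process(crash, sample_rate=0.15, rng=random.random):
    """Return True for standard storage (processed), False for deferred storage."""
    if crash.get("channel") in SPECIAL_CHANNELS:
        return True  # alpha, beta, release candidate and other special versions
    if crash.get("comment"):
        return True  # the user left a comment
    if crash.get("product") in LOW_VOLUME_PRODUCTS:
        return True  # low volume products are never throttled
    # Everything else: keep only the configured fraction.
    return sample_rate > rng()
```
</pre>
<p>Passing the random source in as <code>rng</code> is only a convenience for this sketch: it makes the sampling branch easy to exercise with a fixed value.</p>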
<p>The monitor process watched standard storage and assigned jobs to processors. A processor would pick up crashes from standard storage, process them, and write them to two places: our PostgreSQL database, and back into file system storage.  We had been using PostgreSQL for serving data to the webapp, and the file system storage for serving up the full processed crash.</p>
<p>For some time prior to 1.7, we&#8217;d been storing all crashes in HBase in parallel with writing them into NFS.  The main goal of 1.7 was to make HBase our chief storage mechanism.  This involved rewriting the collector and processor to write into HBase.  The monitor also needed to be rewritten to look in HBase rather than NFS for crashes awaiting processing.  Finally, we have a web service that allows users to pull the full crash, and this also needed to pull crashes from HBase rather than NFS.</p>
<p>Not long before code freeze, we decided we should add a configuration option to the processor to continue storing crashes in NFS as a fallback, in case we had any problems with the release.  This would allow us to do a staged switchover, putting processed crashes in both places until we were confident that HBase was working as intended.</p>
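<p>The staged switchover can be pictured as a dual write guarded by a flag.  This is a minimal sketch under stated assumptions: <code>DictStore</code>, <code>store_processed</code> and the <code>fallback_enabled</code> option are hypothetical names, and plain dictionaries stand in for HBase and NFS.</p>
<pre>
```python
class DictStore:
    """In-memory stand-in for a crash store such as HBase or NFS (illustrative only)."""

    def __init__(self):
        self.data = {}

    def save(self, crash_id, processed):
        self.data[crash_id] = processed


def store_processed(crash_id, processed, primary, fallback=None, fallback_enabled=False):
    """Write to the primary store, and also to the fallback while the option is on."""
    primary.save(crash_id, processed)
    if fallback_enabled and fallback is not None:
        fallback.save(crash_id, processed)
```
</pre>
<p>Turning the fallback off is then a configuration change rather than a code change, which is what makes a staged switchover practical.</p>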
<p>During the maintenance window for 1.7 we also took the opportunity to upgrade HBase to the latest version.  We are now using Cloudera&#8217;s CDH2 Hadoop distribution and HBase 0.20.5.</p>
<p>The release went fairly smoothly, and three days later we were able to turn off the NFS fallback.</p>
<p>We&#8217;re now in the final throes of 1.8.  While we now have crashes stored in HBase, we are still capacity constrained by the number of processors available.  In 1.8, the processors and their associated minidump_stackwalk processes will be daemonized and moved onto the Hadoop nodes.  This means that we will be able to horizontally scale the number of processors with the size of the data.  Right now we are running fifteen Hadoop nodes in production and this is planned to increase over the rest of the year.</p>
<p>Some of the associated changes in 1.8 are also really exciting.  We are introducing a new component to the system, called the registrar.  This process will track heartbeats for each of the processors.  Also in this version, we have added an introspection API for the processors.  The registrar will act as a proxy, allowing us to request status and statistical information for each of the processors.  We will need to rebuild the status page (visible at http://crash-stats.mozilla.com/status) to use this new API, but we will have much better information about what each processor is doing.</p>
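<p>To give a feel for heartbeat tracking, here is a toy version of what a registrar could do.  The class, its method names and the sixty second timeout are all assumptions for illustration, not the actual registrar or its API.</p>
<pre>
```python
import time


class Registrar:
    """Toy heartbeat tracker; class and method names are hypothetical."""

    def __init__(self, timeout_seconds=60.0):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # processor id -> time of the most recent heartbeat

    def heartbeat(self, processor_id, now=None):
        self.last_seen[processor_id] = time.time() if now is None else now

    def status(self, now=None):
        """Report each known processor as 'alive' or 'missing'."""
        now = time.time() if now is None else now
        return {
            pid: "alive" if self.timeout_seconds >= now - seen else "missing"
            for pid, seen in self.last_seen.items()
        }
```
</pre>
<p>A status page could call something like <code>status()</code> periodically and flag any processor reported as missing.</p>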
<p>We will freeze on 1.8 later this week, and expect release in about two weeks&#8217; time.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/webdev/2010/07/26/moving-socorro-to-hbase/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Socorro: Mozilla&#8217;s Crash Reporting System</title>
		<link>http://blog.mozilla.com/webdev/2010/05/19/socorro-mozilla-crash-reports/</link>
		<comments>http://blog.mozilla.com/webdev/2010/05/19/socorro-mozilla-crash-reports/#comments</comments>
		<pubDate>Wed, 19 May 2010 19:49:40 +0000</pubDate>
		<dc:creator>Laura Thomson</dc:creator>
				<category><![CDATA[Socorro]]></category>
		<category><![CDATA[Web Development]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/webdev/?p=1014</guid>
		<description><![CDATA[Recently, we&#8217;ve been working on planning out the future of Socorro.  If you&#8217;re not familiar with it, Socorro is Mozilla&#8217;s crash reporting system. You may have noticed that Firefox has become a lot less crashy recently &#8211; we&#8217;ve seen a 40% improvement over the last five months.  The data from crash reports enables our engineers [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, we&#8217;ve been working on planning out the future of Socorro.  If you&#8217;re not familiar with it, Socorro is Mozilla&#8217;s crash reporting system.</p>
<p>You may have noticed that Firefox has become a lot less crashy recently &#8211;  <a href="http://blog.mozilla.com/metrics/2010/04/08/dramatic-stability-improvements-in-firefox/">we&#8217;ve  seen a 40% improvement over the last five months</a>.  The data from  crash reports enables our engineers to find, diagnose, and fix the most  common crashes, so crash reporting is critical to these improvements.</p>
<p>On our peak day each week we receive <strong>2.5 million crash reports</strong> and process 15% of those, for a total of 50 GB.  In total, <strong>we receive around 320 GB each day</strong>!  Right now we are handicapped by the limitations of our file system storage (NFS) and our database&#8217;s ability to handle really large tables.  However, we are in the process of moving to Hadoop, and currently all our crashes are also being written to HBase.  Soon this will become our main data storage, and we&#8217;ll be able to do a lot more interesting things with the data.  We&#8217;ll also be able to process 100% of crashes.  We want to do this because the long tail of crashes is increasingly interesting, and we may be able to get insights from the data that were not previously possible.</p>
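<p>For a sense of scale, the figures above can be turned into rough per-report sizes.  This back-of-the-envelope sketch assumes both sizes are gigabytes of 10**9 bytes, that the 50 figure covers the processed 15%, and that the 320 figure covers all reports on the same peak day; none of that is stated precisely here.</p>
<pre>
```python
# Figures quoted in the post (peak day); treating a gigabyte as 10**9 bytes is an assumption.
reports_per_peak_day = 2_500_000
processed_fraction = 0.15
processed_bytes = 50 * 10**9
received_bytes = 320 * 10**9

processed_reports = round(reports_per_peak_day * processed_fraction)
avg_processed_kb = processed_bytes / processed_reports / 1000
avg_received_kb = received_bytes / reports_per_peak_day / 1000

print(processed_reports)        # 375000
print(round(avg_processed_kb))  # 133
print(round(avg_received_kb))   # 128
```
</pre>
<p>Under those assumptions the two figures are roughly consistent with each other, at a little over a hundred kilobytes per report.</p>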
<p>I&#8217;ll start by taking a look at how things have worked to date.</p>
<h2>History of Crash Reporting</h2>
<p><a href="http://blog.mozilla.com/webdev/files/2010/05/Socorro.OSCON_.1.png"><img class="alignnone size-medium wp-image-1019" title="Current Architecture" src="http://blog.mozilla.com/webdev/files/2010/05/Socorro.OSCON_.1-300x150.png" alt="Current Socorro Architecture" width="300" height="150" /></a></p>
<p>The data flows as follows:</p>
<ul>
<li>When Firefox crashes, the crash is submitted to Mozilla by a part of the browser known as Breakpad.  At Mozilla&#8217;s end, this is where Socorro comes into play.</li>
<li>Crashes are submitted to the collector, which writes them to storage.</li>
<li> The monitor watches for crashes arriving, and queues some of them for processing.  Right now, we throttle the system to only process 15% of crashes due to capacity issues.  (We also pick up and transform other crashes on demand as users request them.)</li>
<li>Processors pick up crashes and process them.  A processor gets its next job from a queue in our database, invokes minidump_stackwalk (a part of Breakpad) which combines the crash with symbols, where available.  The results are written back into the database.   Some further processing to generate reports (such as top crashes) is done nightly by a set of cron jobs.</li>
<li>Finally, the data is available to Firefox and Platform engineers (and anyone else that is interested) via the webui, at <a href="http://crash-stats.mozilla.com">http://crash-stats.mozilla.com</a></li>
</ul>
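<p>The flow above can be sketched in a few lines.  This is an in-memory toy, not Socorro code: the function names are invented, dictionaries stand in for NFS and PostgreSQL, and <code>stackwalk</code> is a placeholder for minidump_stackwalk.</p>
<pre>
```python
from collections import deque

raw_storage = {}     # crash id -> raw crash, as written by the collector
job_queue = deque()  # crash ids waiting for a processor
processed_db = {}    # crash id -> processed result


def collect(crash_id, raw_crash):
    raw_storage[crash_id] = raw_crash


def monitor(selected):
    """Queue crashes the throttle selected that are not yet queued or processed."""
    for crash_id in raw_storage:
        if selected(crash_id) and crash_id not in job_queue and crash_id not in processed_db:
            job_queue.append(crash_id)


def stackwalk(raw_crash):
    # Placeholder for minidump_stackwalk; the real tool combines a minidump with symbols.
    return {"signature": raw_crash["top_frame"]}


def process_next():
    crash_id = job_queue.popleft()
    processed_db[crash_id] = stackwalk(raw_storage[crash_id])
    return crash_id
```
</pre>
<p>Crashes that the monitor does not select simply stay in raw storage, which mirrors the on-demand processing mentioned above.</p>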
<h4>Implementation Details</h4>
<ul>
<li>The collector, processor, monitor and cron jobs are all written in Python.</li>
<li>Crashes are currently stored in NFS, and processed crash information in a PostgreSQL database.</li>
<li>The web app is written in PHP (using the Kohana framework) and draws data both from Postgres and from a Pythonic web service.</li>
</ul>
<h2>Roadmap</h2>
<p>Future Socorro releases are a joint project between Webdev, Metrics, and IT.  Some of our milestones focus on infrastructure improvements, others on code changes, and still others on UI improvements.  Features generally work their way through to users in this order.</p>
<ul>
<li>
<h3>1.6 &#8211; 1.6.3 (in production)</h3>
<p style="padding-left: 30px;">The current production version is 1.6.3, which was released last Wednesday.  We don&#8217;t usually do second dot point releases but we did 1.6.1, 1.6.2, and 1.6.3 to get Out Of Process Plugin (OOPP) support out to engineers as it was implemented.</p>
<p style="padding-left: 30px;">When an OOPP becomes unresponsive, a pair of twin crashes is generated: one for the plugin process and one for the browser process.  For beta and pre-release products, both of these crashes are available for inspection via Socorro.  Unfortunately, Socorro throttles crash submissions from released products due to capacity constraints.  This means one or the other of the twins may not be available for inspection.  This limitation will vanish with the release of Socorro 1.8.</p>
<p style="padding-left: 30px;">You can now see whether a given crash signature is a hang or a crash, and whether it was plugin or browser related.  In the signature tables, if you see a stop sign symbol, that&#8217;s a hang.  A window means it is crash report information from the browser, and a small blue brick means it is crash report information from the plugin.</p>
<p style="padding-left: 30px;">If you are viewing one half of a hang pair for a pre-release or beta product, you&#8217;ll find a link to the other half at the top right of the report.</p>
<p style="padding-left: 30px;">You can also limit your searches (using the Advanced Search Filters) to look just at hangs or just at crashes, or to filter by whether a report is browser or plugin related.</p>
</li>
<li>
<h3>1.7 (Q2)</h3>
<p style="padding-left: 30px;">We are in the process of baking 1.7.  The key feature of this release is that we will no longer be relying on NFS in production. All crash report submissions are already stored in HBase, but with Socorro 1.7, we will retrieve the data from HBase for processing and store the processed result back into HBase.</p>
</li>
<li>
<h3>1.8 (Q2)</h3>
<p style="padding-left: 30px;">In 1.8, we will migrate the processors and minidump_stackwalk instances to run on our Hadoop nodes, further distributing our architecture.  This will give us the ability to scale up to the amount of data we have as it grows over time. You can see how this will simplify our architecture in the following diagram.</p>
<p style="padding-left: 30px;"><a href="http://blog.mozilla.com/webdev/files/2010/05/diagram.11.png"><img class="alignnone size-medium wp-image-1020" title="New Socorro Architecture" src="http://blog.mozilla.com/webdev/files/2010/05/diagram.11-300x142.png" alt="New Socorro Architecture" width="300" height="142" /></a></p>
<p style="padding-left: 30px;">With this release, the 15% throttling of Firefox release channel crashes goes away entirely.</p>
</li>
<li>
<h3>2.0 (Q3 2010)</h3>
<p style="padding-left: 30px;">You may have noticed 1.9 is missing.  In 2.0 we will be making the power of HBase available to the end user, so expect some significant UI changes.</p>
<p style="padding-left: 30px;">Right now we are in the process of specifying the PRD for 2.0.  This involves interviewing a lot of people on the Firefox, Platform, and QA teams.  If we haven&#8217;t scheduled you for an interview and you think we ought to talk to you, please let us know.</p>
</li>
</ul>
<h2>Features under consideration</h2>
<ul>
<li>Full text search of crashes</li>
<li>Faceted search: start by finding crashes that match a particular signature, and then drill down into them by category.<br />
Which of these crashes involved a particular extension or plugin?  Which ones occurred within a short time after startup?</li>
<li>The ability to write and run your own Map/Reduce jobs (training will be provided)</li>
<li> Detection of &#8220;explosive crashes&#8221; that appear quickly</li>
<li>Viewing crashes by &#8220;build time&#8221; instead of clock time</li>
<li> Classification of crashes by component</li>
</ul>
<p>This is a big list, obviously.  We need your feedback &#8211; what should we work on first?</p>
<p>One thing that we&#8217;ve learned so far through the interviews is that people are not familiar with the existing features of Socorro, so expect further blog posts with more information on how best to use it!</p>
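<p>To make the faceted search idea from the feature list concrete, here is a small sketch of drilling down from a signature.  The crash records, field names and the sixty second startup window are invented for illustration, and this says nothing about how the feature would actually be built on HBase.</p>
<pre>
```python
from collections import Counter

# Made-up crash records; field names are illustrative only.
crashes = [
    {"signature": "nsFoo::Bar", "extension": "adblock", "uptime_seconds": 4},
    {"signature": "nsFoo::Bar", "extension": "adblock", "uptime_seconds": 900},
    {"signature": "nsFoo::Bar", "extension": None, "uptime_seconds": 7},
    {"signature": "js::Baz", "extension": "adblock", "uptime_seconds": 3},
]


def facet(records, field):
    """Count how many records fall under each value of one field."""
    return Counter(record[field] for record in records)


# Start from one signature, then drill down by category.
matching = [c for c in crashes if c["signature"] == "nsFoo::Bar"]
by_extension = facet(matching, "extension")
soon_after_startup = [c for c in matching if 60 >= c["uptime_seconds"]]
```
</pre>
<p>Each facet is just a count per value over the current result set, so further filters can be applied in any order.</p>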
<h2>How to get involved</h2>
<p>As always, we welcome feedback and input on our plans.</p>
<p>You can contact the team at socorro-dev@mozilla.com, or me personally at laura@mozilla.com.</p>
<p>In addition, we always welcome contributions.  You can find our code repository at<br />
<a href="http://code.google.com/p/socorro/">http://code.google.com/p/socorro/</a></p>
<p>We hold project meetings on a Wednesday afternoon &#8211; details and agendas are here<br />
<a href="https://wiki.mozilla.org/Breakpad/Status_Meetings">https://wiki.mozilla.org/Breakpad/Status_Meetings</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/webdev/2010/05/19/socorro-mozilla-crash-reports/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Socorro Moves to New Hardware</title>
		<link>http://blog.mozilla.com/webdev/2009/05/15/socorro-moves-to-new-hardware/</link>
		<comments>http://blog.mozilla.com/webdev/2009/05/15/socorro-moves-to-new-hardware/#comments</comments>
		<pubDate>Fri, 15 May 2009 20:49:58 +0000</pubDate>
		<dc:creator>lars</dc:creator>
				<category><![CDATA[Socorro]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/webdev/?p=414</guid>
		<description><![CDATA[What has two quad core 3GHz 64bit CPUs, sixteen gigs of RAM and makes the Socorro users happy? That would be the new hardware that the Socorro system moved to during a six hour operation on Thursday night. The new hardware was recommended by the folks from the aptly named PostgreSQL Experts, Inc after an [...]]]></description>
			<content:encoded><![CDATA[<p>What has two quad core 3GHz 64-bit CPUs, sixteen gigs of RAM and makes the Socorro users happy?  That would be the new hardware that the Socorro system moved to during a six hour operation on Thursday night.  The new hardware was recommended by the folks from the aptly named <a href="http://pgexperts.com">PostgreSQL Experts, Inc</a> after an intense week of consultation and analysis in March of this year.  After auditing our existing system of hardware and software, it was apparent that we were woefully underpowered for what we were trying to do.  While simply tuning PostgreSQL helped in the interim, a more powerful platform was clearly in order.</p>
<p>Before we deployed the new hardware, we had to take several steps to tame our voracious use of disk space.  In the previous week, we removed the archived dumps from the database.  They were rarely accessed but took up the lion&#8217;s share of our disk space.  By migrating them to file system storage, we turned a three hundred gig database migration onto new hardware into a migration of only sixty gig.</p>
<p>While there may be a need for tuning over the next week, Socorro users should have a much faster experience using the Socorro web site.</p>
<p>Many thanks to <em>aravind</em> for shepherding this project through IT, <em>chizu</em> in IT for his db cloning/replication scripting/tweaking and <em>jberkus</em> from PostgreSQL Experts for his superior navigation skills and a steady hand at the PostgreSQL tiller.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/webdev/2009/05/15/socorro-moves-to-new-hardware/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

