<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Blog of Data &#187; ETL</title>
	<atom:link href="http://blog.mozilla.com/data/tag/etl/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.mozilla.com/data</link>
	<description>Mozilla metrics team technical articles</description>
	<lastBuildDate>Thu, 01 Sep 2011 21:30:55 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Pentaho Hadoop integration</title>
		<link>http://blog.mozilla.com/data/2010/05/19/pentaho-hadoop-integration/</link>
		<comments>http://blog.mozilla.com/data/2010/05/19/pentaho-hadoop-integration/#comments</comments>
		<pubDate>Wed, 19 May 2010 16:14:10 +0000</pubDate>
		<dc:creator>deinspanjer</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Kettle]]></category>
		<category><![CDATA[Pentaho]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/data/?p=194</guid>
		<description><![CDATA[Pentaho announced this morning that they were going to be adding some features to Pentaho Data Integration (Kettle) and to their BI suite to make it easy for people to use Kettle to retrieve, manipulate, and store data in Hadoop, and to integrate Hadoop communication into the reporting and analysis layer. They posted a nice [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.pentaho.com/news/releases/20100519_pentaho_harnesses_apache_hadoop_to_deliver_big_data_analytics.php" target="_blank">Pentaho announced</a> this morning that they were going to be adding some features to Pentaho Data Integration (Kettle) and to their BI suite to make it easy for people to use Kettle to retrieve, manipulate, and store data in Hadoop, and to integrate Hadoop communication into the reporting and analysis layer.</p>
<p>They posted a nice five-minute screencast on their <a href="http://www.pentaho.com/hadoop/" target="_blank">Hadoop landing page</a> demonstrating a couple of pieces of Hive integration.  In it, they retrieve data using Hive, and they also use a Hive user-defined function that is implemented as an embedded Kettle transformation.</p>
<p>I&#8217;m very excited to see this announcement.  Besides the significant work we&#8217;ve been doing on the Metrics team to integrate HBase into the Socorro project, we also have major plans for our Hadoop clusters for general data storage and processing.</p>
<p>Right now, we have Kettle jobs and transformations that manipulate gigabytes of data per hour, loading it into our data warehouse.  One of the things I love about Kettle is the ability to quickly and easily define, review, and extend complex jobs such as our end-of-day data aggregation:</p>
<p><a href="http://blog.mozilla.com/data/files/2010/05/2010-05-19_1155.png"><img class="aligncenter size-medium wp-image-195" title="EOD Job" src="http://blog.mozilla.com/data/files/2010/05/2010-05-19_1155-300x87.png" alt="" width="300" height="87" /></a></p>
<p>In the future, as we have more data stored in Hadoop, I want to be able to run transformations on that data.  Sometimes, if the transformations involve lots of RDBMS work, I&#8217;ll want to be streaming the data out of HDFS.  For other types of transformations that involve mostly business logic and text transformations, being able to run that code directly in a Hadoop Map Reduce job will be a fantastic feature.</p>
<p>My personal feeling is that people in the Hadoop community really need something visual and flexible like the Kettle interface for defining and manipulating this type of business logic.  Great strides have been made with projects such as Cascading, but it is still raw code, and I feel that excludes a lot of people who could be getting work done faster and better if they had a good tool to help them adapt to the world of Map Reduce.</p>
<p>Currently, someone can start up Kettle&#8217;s GUI and start constructing jobs and transformations simply by piecing together steps of work such as reading a set of text files, performing a regex on them, doing some value lookups, then aggregating the data.  If they could then save that transformation and execute it as a Hadoop Map Reduce job, I think it would be revolutionary for both the ETL and Hadoop worlds.</p>
<p>When Mozilla Metrics starts tackling some of the Hadoop data processing jobs that we have scheduled, we&#8217;ll be making significant open source contributions to both communities to realize this vision, and I really hope that it will help widen the accessibility of Hadoop to a new group of potential users.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/data/2010/05/19/pentaho-hadoop-integration/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Update: Bugzilla SQR</title>
		<link>http://blog.mozilla.com/data/2009/08/14/update-bugzilla-sqr/</link>
		<comments>http://blog.mozilla.com/data/2009/08/14/update-bugzilla-sqr/#comments</comments>
		<pubDate>Sat, 15 Aug 2009 03:27:35 +0000</pubDate>
		<dc:creator>skrueger</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Bugzilla]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[Kettle]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/data/?p=140</guid>
		<description><![CDATA[I have had the chance to improve the Bugzilla SQR in many ways. I have improved the overall run time of the ETL (both in Kettle and in a Python script), fixed a few bugs (a major one was causing problems with the Open Bug Count), added new dimensions, and constructed a few [...]]]></description>
			<content:encoded><![CDATA[<p>I have had the chance to improve the Bugzilla SQR in many ways.  I have improved the overall run time of the ETL (both in Kettle and in a Python script), fixed a few bugs (a major one was causing problems with the Open Bug Count), added new dimensions, and constructed a few dashboards.  All my changes can be found on <a href="http://sourceforge.net/projects/qareports/">SourceForge</a>.</p>
<p>I added a bug severity dimension, added a component level to the product dimension, added a team level to the person dimension, and added a days dimension to track bugs over a distribution.</p>
<p>I have made some mock-up dashboards, and posted below are a few snapshots of the charts in them.  Tell me what you think and what you would find useful!</p>

<a href='http://blog.mozilla.com/data/2009/08/14/update-bugzilla-sqr/stacked/' title='stacked'><img width="150" height="150" src="http://blog.mozilla.com/data/files/2009/08/stacked-150x150.jpg" class="attachment-thumbnail" alt="stacked" title="stacked" /></a>
<a href='http://blog.mozilla.com/data/2009/08/14/update-bugzilla-sqr/open_bugs/' title='open_bugs'><img width="150" height="150" src="http://blog.mozilla.com/data/files/2009/08/open_bugs-150x150.png" class="attachment-thumbnail" alt="open_bugs" title="open_bugs" /></a>
<a href='http://blog.mozilla.com/data/2009/08/14/update-bugzilla-sqr/net_open/' title='net_open'><img width="150" height="150" src="http://blog.mozilla.com/data/files/2009/08/net_open-150x150.png" class="attachment-thumbnail" alt="net_open" title="net_open" /></a>
<a href='http://blog.mozilla.com/data/2009/08/14/update-bugzilla-sqr/distinct_issue/' title='distinct_issue'><img width="150" height="150" src="http://blog.mozilla.com/data/files/2009/08/distinct_issue-150x150.png" class="attachment-thumbnail" alt="distinct_issue" title="distinct_issue" /></a>
<a href='http://blog.mozilla.com/data/2009/08/14/update-bugzilla-sqr/components/' title='components'><img width="150" height="150" src="http://blog.mozilla.com/data/files/2009/08/components-150x150.png" class="attachment-thumbnail" alt="components" title="components" /></a>
<a href='http://blog.mozilla.com/data/2009/08/14/update-bugzilla-sqr/close_bugs/' title='close_bugs'><img width="150" height="150" src="http://blog.mozilla.com/data/files/2009/08/close_bugs-150x150.png" class="attachment-thumbnail" alt="close_bugs" title="close_bugs" /></a>

]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/data/2009/08/14/update-bugzilla-sqr/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Shell script analytics</title>
		<link>http://blog.mozilla.com/data/2009/07/29/shell-script-analytics/</link>
		<comments>http://blog.mozilla.com/data/2009/07/29/shell-script-analytics/#comments</comments>
		<pubDate>Thu, 30 Jul 2009 02:58:15 +0000</pubDate>
		<dc:creator>deinspanjer</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Bash]]></category>
		<category><![CDATA[ETL]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/data/?p=99</guid>
		<description><![CDATA[Recently, I was asked if I could provide a breakdown of Firefox users on the Macintosh platform by whether they were using the Intel or PPC chipset.  For anyone who only cares about seeing that data and not the "how" behind it, look no further than this link:
<a href="http://manyeyes.alphaworks.ibm.com/wikified/mozilla/FirefoxMacUsageBreakdownTrend">Firefox on Macintosh processor breakdown trends in Many Eyes</a>

For anyone else, what follows is a detailed post about the volume of some of the data we parse, and some helpful AWK scripts that I use to parse it at times.]]></description>
			<content:encoded><![CDATA[<p>Recently, I was asked if I could provide a breakdown of Firefox users on the Macintosh platform by whether they were using the Intel or PPC chipset.  For anyone who only cares about seeing that data and not the &#8220;how&#8221; behind it, look no further than this link:<br />
<a href="http://manyeyes.alphaworks.ibm.com/wikified/mozilla/FirefoxMacUsageBreakdownTrend">Firefox on Macintosh processor breakdown trends in Many Eyes</a></p>
<p>For anyone else, follow along...<span id="more-99"></span><br />
While my standard processing libraries parse out a lot of information from our daily traffic logs, the user agent string is not something that I parse out of the main log data because the current processing engine I have for user agent strings is a little slow and it takes a lot of CPU.</p>
<p>Sometimes, especially when dealing with log files as your source data, it just takes too much time to build up a big system to do one-off processing.  Way back when, Unix was founded on the idea of chaining together small, specific tools to do larger jobs efficiently.  That is something that I find very relevant in my day-to-day work.</p>
<p>So, in order to provide the numbers requested above, I turned to my tool chain of command line utilities.  One of my favorite tools in that tool chain is AWK.  It may not be quite as powerful as Perl or Python, but it is quick and simple and it is quite good at getting the job done.</p>
<p>Now, our automatic update web log traffic is a medium-sized source of data I have to churn through.  Here are a few data points:</p>
<ul>
<li>We generate about four to six gigabytes of log data per day for download.mozilla.org.  It spikes to as high as thirty gigabytes on release days, and during release weeks the median is much higher, somewhere between ten and fifteen gigabytes.</li>
<li>We generate about thirty gigabytes of log data per day for aus.mozilla.org during the week.  It dips as low as twenty-two gigabytes on Saturdays.</li>
<li>We generate more than two hundred and sixty gigabytes of log data per day for addons.mozilla.org and versioncheck.addons.mozilla.org during the week.  It dips to just over two hundred on the weekends.</li>
</ul>
<p>Update:  I don&#8217;t know if anyone noticed, but I had a typo in the file glob for the versioncheck histogram. That is why there was a spike in the middle instead of the larger volume of traffic for the whole time period.</p>
<p>How did I get those numbers, you ask? Why, with a shell script:</p>
<div style="overflow:scroll;">
<pre><code>[me@etl01 download.mozilla.org]$ gzip -l access_2009-07*.gz | awk -f <a title="Source awk script" href="http://people.mozilla.com/~deinspanjer/histogram_sum.awk" target="_blank">~/bin/histogram_sum.awk</a> scale=500000000 column=8 width=10</code>
2009-07-08 -   6.44GB: **************
2009-07-09 -   6.19GB: **************
2009-07-10 -   5.68GB: *************
2009-07-11 -   4.59GB: **********
2009-07-12 -   4.90GB: ***********
2009-07-13 -   6.02GB: *************
2009-07-14 -   5.85GB: *************
2009-07-15 -   5.81GB: *************
2009-07-16 -   6.13GB: **************
2009-07-17 -   9.37GB: *********************
2009-07-18 -   6.45GB: **************
2009-07-19 -   6.00GB: *************
2009-07-20 -   7.12GB: ****************
2009-07-21 -  12.72GB: ****************************
2009-07-22 -  33.78GB: *************************************************************************
2009-07-23 -  23.09GB: **************************************************
2009-07-24 -  15.41GB: **********************************
2009-07-25 -  10.95GB: ************************
2009-07-26 -  10.38GB: ***********************
2009-07-27 -  11.75GB: **************************
2009-07-28 -  10.39GB: ***********************
2009-07-29 -   7.32GB: ****************
Total counts = 216.93GB

<code>[me@etl01 aus.mozilla.org]$ gzip -l access_2009-{06-30,07}*.gz | awk -f <a href="http://people.mozilla.com/~deinspanjer/histogram_sum.awk" target="_blank">~/bin/histogram_sum.awk</a> scale=500000000 column=8 width=10</code>
2009-06-30 -  30.55GB: ******************************************************************
2009-07-01 -  29.88GB: *****************************************************************
2009-07-02 -  29.21GB: ***************************************************************
2009-07-03 -  26.12GB: *********************************************************
2009-07-04 -  21.97GB: ************************************************
2009-07-05 -  25.63GB: ********************************************************
2009-07-06 -  30.20GB: *****************************************************************
2009-07-07 -  29.88GB: *****************************************************************
2009-07-08 -  29.78GB: ****************************************************************
2009-07-09 -  29.22GB: ***************************************************************
2009-07-10 -  27.06GB: ***********************************************************
2009-07-11 -  22.74GB: *************************************************
2009-07-12 -  25.69GB: ********************************************************
2009-07-13 -  29.81GB: *****************************************************************
2009-07-14 -  29.49GB: ****************************************************************
2009-07-15 -  29.34GB: ****************************************************************
2009-07-16 -  28.79GB: **************************************************************
2009-07-17 -  26.98GB: **********************************************************
2009-07-18 -  22.44GB: *************************************************
2009-07-19 -  25.11GB: ******************************************************
2009-07-20 -  29.50GB: ****************************************************************
2009-07-21 -  29.32GB: ***************************************************************
2009-07-22 -  28.15GB: *************************************************************
2009-07-23 -  25.53GB: *******************************************************
2009-07-24 -  24.34GB: *****************************************************
2009-07-25 -  20.89GB: *********************************************
2009-07-26 -  23.84GB: ****************************************************
2009-07-27 -  28.23GB: *************************************************************
2009-07-28 -  28.34GB: *************************************************************
2009-07-29 -  22.12GB: ************************************************
Total counts = 810.15GB

<code>[me@etl01 logs]$ cat \
 &lt;(cd addons.mozilla.org; gzip -l access_2009-{06-30,07}*.gz | tail -n +2) \
 &lt;(cd versioncheck.addons.mozilla.org; gzip -l access_2009-{06-30,07}*.gz | tail -n +2) \
 | sort -k 4 | awk -f <a href="http://people.mozilla.com/~deinspanjer/histogram_sum.awk" target="_self">~/bin/histogram_sum.awk</a> scale=5000000000 column=8 width=10</code>
2009-06-30 - 284.10GB: **************************************************************
2009-07-01 - 285.41GB: **************************************************************
2009-07-02 - 259.99GB: ********************************************************
2009-07-03 - 227.03GB: *************************************************
2009-07-04 - 192.92GB: ******************************************
2009-07-05 - 216.33GB: ***********************************************
2009-07-06 - 247.53GB: ******************************************************
2009-07-07 - 244.36GB: *****************************************************
2009-07-08 - 243.49GB: *****************************************************
2009-07-09 - 235.46GB: ***************************************************
2009-07-10 - 218.33GB: ***********************************************
2009-07-11 - 190.16GB: *****************************************
2009-07-12 - 207.19GB: *********************************************
2009-07-13 - 231.92GB: **************************************************
2009-07-14 - 232.91GB: ***************************************************
2009-07-15 - 230.37GB: **************************************************
2009-07-16 - 228.75GB: **************************************************
2009-07-17 - 236.00GB: ***************************************************
2009-07-18 - 203.75GB: ********************************************
2009-07-19 - 214.33GB: ***********************************************
2009-07-20 - 238.31GB: ****************************************************
2009-07-21 - 242.12GB: ****************************************************
2009-07-22 - 277.91GB: ************************************************************
2009-07-23 - 272.68GB: ***********************************************************
2009-07-24 - 235.63GB: ***************************************************
2009-07-25 - 202.74GB: ********************************************
2009-07-26 - 216.85GB: ***********************************************
2009-07-27 - 241.81GB: ****************************************************
2009-07-28 - 238.79GB: ****************************************************
2009-07-29 - 232.97GB: ***************************************************
Total counts = 7099.05GB</pre>
</div>
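<p>The linked histogram_sum.awk script is not reproduced in this post. As a rough sketch of the same idea in Python, and not the actual script, the following sums the uncompressed sizes reported by <code>gzip -l</code> per day and draws one star per <code>scale</code> bytes. The function name, the assumption that every file name contains a YYYY-MM-DD date, and the choice of 2^30 bytes per GB are illustrative guesses:</p>

```python
import re
from collections import defaultdict

# Assumption: each log file name embeds its date as YYYY-MM-DD.
DATE_RE = re.compile(r"(\d{4}-\d{2}-\d{2})")


def histogram(gzip_listing, scale=500_000_000):
    """Sum uncompressed bytes per date and render one text row per day."""
    totals = defaultdict(int)
    for line in gzip_listing.splitlines():
        fields = line.split()
        # gzip -l rows are: compressed, uncompressed, ratio, name
        if len(fields) != 4 or not fields[1].isdigit():
            continue  # skips the header row
        match = DATE_RE.search(fields[3])
        if match:  # the trailing "(totals)" row has no date and is ignored
            totals[match.group(1)] += int(fields[1])
    rows = []
    for day in sorted(totals):
        size = totals[day]
        stars = "*" * round(size / scale)
        rows.append("%s - %6.2fGB: %s" % (day, size / 2**30, stars))
    return rows
```

<p>Passing the text output of <code>gzip -l</code> to <code>histogram()</code> and printing the returned rows gives a chart in the same spirit as the ones above, although the real script may round or scale differently.</p>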
<p>The command I ran to generate the stats driving the Many Eyes site above was this:</p>
<div style="overflow:scroll;">
<pre>pv access_2009-{06-30,07}*.gz |
 gzip -cd | grep -F "Macintosh" |
 sed -rne '/"GET \/update\/[1-3]\/Firefox\/.*?Macintosh/s/^.*?\[([0-9]{2}\/[a-zA-Z]{3}\/[0-9]{4}).*?(Intel|PPC) Mac OS X.*/\1\t\2/p' |
 awk -f <a title="Source awk script" href="http://people.mozilla.com/~deinspanjer/tabulate_fields.awk" target="_blank">~/bin/tabulate_fields.awk</a> kf=1 cf=2</pre>
</div>
<p>The interesting parts of this command are:</p>
<p><strong><span style="color: #003300;">pv access_2009-{06-30,07}*.gz</span></strong> &#8212; <a href="http://www.ivarch.com/programs/pv.shtml" target="_blank">Pipe Viewer</a>, a utility for monitoring the progress of data through a pipeline.  When the argument to pv is a file, it stats the file to determine how large it is so that it can calculate an ETA for finishing it.  This is why I do the otherwise foolish trick of cat&#8217;ing to gzip -cd instead of passing the file list directly to gzip or zgrep.  Here, I&#8217;m passing in a file glob of all the download access logs from June 30 to July 28th (the day I ran it).</p>
<p><strong><span style="color: #003300;">grep -F &#8220;Macintosh&#8221;</span></strong> &#8212; I go ahead and pay the penalty for an extra process here because grep -F is much more efficient at throwing away the 98% of the log data that I&#8217;m not interested in than the much more complicated sed regex that follows.  The best part of <strong>pv</strong> is that it lets you actually test optimizations like this.  When I ran without the grep, I was processing about 1.5 MB/s.  When I put in the simple grep filter, that number jumped up to 5.5 MB/s.</p>
<p><span style="color: #003300;"><strong>sed -rne</strong></span> &#8230; &#8212; I won&#8217;t go into the gritty details about the regex for now.  Suffice it to say that this regex looks for Firefox update checks that have the string Macintosh later in the request (hopefully in the user agent data).  For those matches, it then pulls out the request date and a single word, &#8220;Intel&#8221; or &#8220;PPC&#8221;, if it is immediately followed by &#8220;Mac OS X&#8221;.</p>
<p><strong><span style="color: #003300;">awk -f <a href="http://people.mozilla.com/~deinspanjer/tabulate_fields.awk" target="_blank">~/bin/tabulate_fields.awk</a> kf=1 cf=2</span></strong> &#8212; This is my pièce de résistance.  Given many different types of tabular data, it can count occurrences of values for a given key value, and it can also sum a numeric column for the counted occurrences. This is a fairly simple use case.  It just keys off of the date in the first column of the output of <strong>sed</strong> and counts the occurrences of &#8220;Intel&#8221; or &#8220;PPC&#8221; for each day.</p>
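<p>For comparison, here is a hedged Python sketch of the filter, extract, and count stages of that pipeline. It is not a port of tabulate_fields.awk, whose source is only linked here; the regular expression is a loose adaptation of the sed pattern, and the function name and the assumed access-log layout (date in square brackets, then the quoted request, then the user agent) are illustrative:</p>

```python
import re
from collections import Counter

# Loose adaptation of the sed pattern: request date, a Firefox update check,
# then "Intel" or "PPC" immediately followed by "Mac OS X".
LINE_RE = re.compile(
    r'\[(\d{2}/[A-Za-z]{3}/\d{4})'
    r'.*"GET /update/[1-3]/Firefox/.*Macintosh'
    r'.*\b(Intel|PPC) Mac OS X'
)


def tabulate(lines):
    """Count (date, chipset) pairs across an iterable of access-log lines."""
    counts = Counter()
    for line in lines:
        if "Macintosh" not in line:  # cheap substring pre-filter, like grep -F
            continue
        match = LINE_RE.search(line)
        if match:
            counts[match.groups()] += 1
    return dict(counts)
```

<p>The substring test plays the same role as the <strong>grep -F</strong> stage: it discards most lines before the more expensive regex ever runs.</p>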
<p>Hope you enjoyed this geek session.  If you would like to take the time to show me how I can do these things much more elegantly or quickly using Perl or Python, I&#8217;d actually love to hear it!  I really should learn to stop avoiding these wonderful languages just because I&#8217;m more familiar with an older tool. ;)</p>
<p>-Daniel</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/data/2009/07/29/shell-script-analytics/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Processing web access logs with a Kettle cluster</title>
		<link>http://blog.mozilla.com/data/2009/06/26/processing-web-access-logs-with-a-kettle-cluster/</link>
		<comments>http://blog.mozilla.com/data/2009/06/26/processing-web-access-logs-with-a-kettle-cluster/#comments</comments>
		<pubDate>Sat, 27 Jun 2009 06:57:00 +0000</pubDate>
		<dc:creator>deinspanjer</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[Kettle]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/data/?p=90</guid>
		<description><![CDATA[There is a lot to be said on the above topic, but for the moment, I just wanted to drop a quick note about some ad-hoc work I did today: I ran an analysis on a year and a half of FTP log files, filtering for some specific requests, and filtering out but summarizing uninteresting [...]]]></description>
			<content:encoded><![CDATA[<p>There is a <strong>lot</strong> to be said on the above topic, but for the moment, I just wanted to drop a quick note about some ad-hoc work I did today:</p>
<p>I ran an analysis on a year and a half of FTP log files, filtering for some specific requests, and filtering out but summarizing uninteresting traffic from our heartbeat monitors and such.</p>
<p>I put together this simple Kettle transformation, and ran it with a cluster consisting of 32 fairly low-powered slaves.  The results were pleasing, especially considering I didn&#8217;t even go through a tuning process to determine the optimal number of step copies or row set sizes.</p>
<p>The job covered 8782 files containing 432 million records.  Processing completed in 47 minutes, giving a throughput of about 156 thousand rows per second.</p>
<p>There are a couple of screenshots after the cut.</p>
<p><span id="more-90"></span></p>
<div style="float: left;">
<div id="attachment_93" class="wp-caption alignright" style="width: 160px"><a href="http://blog.mozilla.com/data/files/2009/06/2009-06-27_0226.png"><img class="size-thumbnail wp-image-93" title="parse_ftp_logs" src="http://blog.mozilla.com/data/files/2009/06/2009-06-27_0226-150x150.png" alt="Transformation flow" width="150" height="150" /></a><p class="wp-caption-text">Transformation flow</p></div>
</div>
<div style="float: right;">
<div id="attachment_92" class="wp-caption alignleft" style="width: 160px"><a href="http://blog.mozilla.com/data/files/2009/06/2009-06-27_0218.png"><img class="size-thumbnail wp-image-92" title="trans_exec_log" src="http://blog.mozilla.com/data/files/2009/06/2009-06-27_0218-150x150.png" alt="Transformation log" width="150" height="150" /></a><p class="wp-caption-text">Transformation log</p></div>
</div>
<div style="clear: both;">
<hr />-Daniel</div>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/data/2009/06/26/processing-web-access-logs-with-a-kettle-cluster/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Creating a sample bugzilla database using kettle</title>
		<link>http://blog.mozilla.com/data/2009/06/05/creating-a-sample-bugzilla-database-using-kettle/</link>
		<comments>http://blog.mozilla.com/data/2009/06/05/creating-a-sample-bugzilla-database-using-kettle/#comments</comments>
		<pubDate>Fri, 05 Jun 2009 22:04:55 +0000</pubDate>
		<dc:creator>skrueger</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Bugzilla]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[Kettle]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/data/?p=13</guid>
		<description><![CDATA[Mozilla&#8217;s Bugzilla database contains approx. 480,000 bugs and approx. 5,000,000 entries in the bugs_activity table, which is too large for the initial development that I am doing. I want to construct a smaller sample Bugzilla database that I can use to develop and run tests for my project more efficiently. To [...]]]></description>
			<content:encoded><![CDATA[<p>
<a href="https://bugzilla.mozilla.org/">Mozilla&#8217;s Bugzilla</a> database contains approx. 480,000 bugs and approx. 5,000,000 entries in the bugs_activity table, which is too large for the initial development that I am doing.  I want to construct a smaller sample Bugzilla database that I can use to develop and run tests for my <a href="http://blog.mozilla.com/data/2009/06/04/software-quality-reports-bugzilla-analysis/">project</a> more efficiently.  To construct this sample database, I first want to prune it down to only the tables that are required.  To figure out which tables are necessary, I went through the SQR ETL job in Spoon and wrote down the tables that were being used.  The 8 necessary tables I found were:
</p>
<ul>
<li>versions</li>
<li>products</li>
<li>resolution</li>
<li>bug_status</li>
<li>priority</li>
<li>bugs</li>
<li>bugs_activity</li>
<li>profiles</li>
</ul>
<p>
The tables below could simply be transferred whole because of their small size: they contain only meta/attribute data about bugs and have no direct relation to bugs.bug_id:
</p>
<ul>
<li>versions</li>
<li>products</li>
<li>resolution</li>
<li>bug_status</li>
<li>priority</li>
</ul>
<p>
For the bugs table I had to construct a statistical sample and, based on that sample, gather the corresponding entries in the bugs_activity and profiles tables.  I selected data from the bugs table and passed it through a reservoir sampling step that pulls a fixed-size random sample of rows from the returned result set.  I decided to pull a random sample of 1,000 bugs.  After that, a new job matches the bug_ids pulled from the reservoir sample with the bugs_activity table, and then matches the assigned_to and reporter fields from the sample with the userid in the profiles table.
</p>
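<p>
To make the sampling idea concrete, here is a minimal Python sketch of classic reservoir sampling (Algorithm R), followed by the matching of related rows.  It is not what the Kettle step does internally, which this post does not show; the function name and the made-up rows standing in for the bugs and bugs_activity tables are purely illustrative:
</p>

```python
import random


def reservoir_sample(rows, k, rng=random):
    """Return up to k rows chosen uniformly at random from an iterable."""
    sample = []
    for index, row in enumerate(rows):
        if index < k:
            sample.append(row)            # fill the reservoir first
        else:
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                sample[slot] = row        # kept with probability k / (index + 1)
    return sample


# Made-up stand-ins for the bugs and bugs_activity tables.
bugs = [{"bug_id": n, "assigned_to": n % 7, "reporter": n % 5} for n in range(1, 5001)]
activity = [{"bug_id": n % 5000 + 1, "fieldid": n % 3} for n in range(20000)]

sampled = reservoir_sample(bugs, 1000)
sampled_ids = {bug["bug_id"] for bug in sampled}

# Keep only the activity rows and user ids that belong to the sampled bugs.
sampled_activity = [row for row in activity if row["bug_id"] in sampled_ids]
user_ids = {bug["assigned_to"] for bug in sampled} | {bug["reporter"] for bug in sampled}
```

<p>
The appeal of the reservoir approach is that it needs only one pass over the rows and never has to know the total row count in advance.
</p>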
<p>
Creating the Bugzilla sample database inside Spoon consists of 3 main jobs:</p>
<ol>
<li>Resetting the Sample Bugzilla Database</li>
<li>Transferring the Non-bugs tables</li>
<li>Transferring the Bugs tables</li>
</ol>
<p>The first step resets the sample Bugzilla database, clearing any information stored in it.  The step follows the pseudocode below.<br />
<code><br />
For Table t in the set of Tables T {<br />
&nbsp;&nbsp;&nbsp;If t exists {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Drop t<br />
&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;Create t<br />
}<br />
</code>
</p>
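<p>
The pseudocode above maps fairly directly onto SQL.  A minimal sketch in Python, using the standard library&#8217;s sqlite3 module as a stand-in for MySQL so that it runs anywhere; the reset_tables helper and the cut-down column definitions are hypothetical and far simpler than the real Bugzilla schema:
</p>

```python
import sqlite3

# Hypothetical, heavily simplified column definitions (not the real Bugzilla schema).
SCHEMAS = {
    "bugs": "bug_id INTEGER PRIMARY KEY, assigned_to INTEGER, reporter INTEGER",
    "bugs_activity": "bug_id INTEGER, who INTEGER, bug_when TEXT",
    "profiles": "userid INTEGER PRIMARY KEY, login_name TEXT",
}


def reset_tables(conn, schemas):
    """Drop each table if it exists, then create it empty."""
    for table, columns in schemas.items():
        conn.execute("DROP TABLE IF EXISTS %s" % table)
        conn.execute("CREATE TABLE %s (%s)" % (table, columns))
    conn.commit()


conn = sqlite3.connect(":memory:")
reset_tables(conn, SCHEMAS)
conn.execute("INSERT INTO bugs VALUES (1, 10, 11)")
reset_tables(conn, SCHEMAS)  # a second run clears the row inserted above
```

<p>
Because every table is dropped before it is created, the reset can be run repeatedly and always leaves the same empty set of tables.
</p>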
<p>
The second step will transfer the trivial tables mentioned above that do not involve the bugs table.
</p>
<p>
The third step will collect a sampling from the bugs table, output the entries into the new bugs table, select the bugs_activity entries based on bug_id and the profiles entries based on assigned_to and reporter, and output these into the new bugs_activity and profiles tables.
</p>
<p>
To begin, you will need access to a MySQL database where you can create, drop, and insert.  The name of the database doesn&#8217;t matter, but mine was called bugs_sample.  Once you have access to a MySQL database, open up Spoon, create a job, and set up the database connections as described below.</p>
<ol>
<li>Create a new job by going to <code>File &gt; New &gt; Job</code>.</li>
<li>Right-click in the Staging area and select <code>Job settings</code> from the menu that appears.  This will open up the Job properties dialog window.</li>
<li>Change the <code>Job name:</code> field to something like &#8220;Create Bugzilla Sample Database&#8221;.</li>
<li>In the top left of Spoon, click on the magnifying glass with the label &#8220;View&#8221; under it.</li>
<li>Right-click on the label Database connections.  From the drop-down menu select <code>New</code>.  This will open up the Database connection dialog box.</li>
<li>The first database connection to create will be for the real Bugzilla database.  Fill in the credentials to log into the Bugzilla database.  An example is below.<br />
<a href="http://blog.mozilla.com/data/files/2009/06/bugzilladatabaseconnection.png"><img src="http://blog.mozilla.com/data/files/2009/06/bugzilladatabaseconnection-300x214.png" alt="bugzilladatabaseconnection" title="bugzilladatabaseconnection" width="300" height="214" class="aligncenter size-medium wp-image-26" /></a></li>
<li>Click Test and you should get a message like the one below but with your credentials.<br /><a href="http://blog.mozilla.com/data/files/2009/06/bugzilladatabaseconnectionmessageok.png"><img src="http://blog.mozilla.com/data/files/2009/06/bugzilladatabaseconnectionmessageok-300x148.png" alt="bugzilladatabaseconnectionmessageok" title="bugzilladatabaseconnectionmessageok" width="300" height="148" class="aligncenter size-medium wp-image-27" /></a><br />
If you didn&#8217;t receive this message, check your MySQL connections and credentials.</li>
<li>Repeat the previous two steps to create a database connection for the new Bugzilla Sample database.</li>
<li>After the two database connections are set up, it is time to create a new job for the first step in the process.  Inside the job, check whether each table exists; if it does, drop it, then create the table, as described earlier.  Follow this <a href="http://screencast.com/t/pzAlGJCnR">video</a> to create the first part of the job.<br />
The schema for each table can be found <a href="http://www.ravenbrook.com/project/p4dti/tool/cgi/bugzilla-schema/">here</a>.</li>
<li>Repeat the last part of the previous step for each table mentioned above.  The end result should look something like the image below.<br /><a href="http://blog.mozilla.com/data/files/2009/06/createsamplebugzilladatabaseendresult.png"><img src="http://blog.mozilla.com/data/files/2009/06/createsamplebugzilladatabaseendresult-300x127.png" alt="createsamplebugzilladatabaseendresult" title="createsamplebugzilladatabaseendresult" width="300" height="127" class="aligncenter size-medium wp-image-31" /></a></li>
<li>Now we will create a new job for the second part of the process, where we transfer all tables that don&#8217;t directly relate to the bugs table over to the new Sample Database.  Create a new job entitled Transfer Non-bugs.  This job will be composed of multiple transformations, each of which transfers one table from the Bugzilla database to our new Bugzilla Sample database.  To construct such a transformation, follow this <a href="http://screencast.com/t/bsoupWtAv">video</a>.  Create a new transformation for each of the tables listed above that is not related to bugs.</li>
<li>Once a transformation for each table not related to bugs has been completed and saved, create a job to connect them together like the image below.<br /><a href="http://blog.mozilla.com/data/files/2009/06/transfernonbugstables.png"><img src="http://blog.mozilla.com/data/files/2009/06/transfernonbugstables-300x134.png" alt="transfernonbugstables" title="transfernonbugstables" width="300" height="134" class="aligncenter size-medium wp-image-34" /></a></li>
<li>Now it is time to create the third and last part of the transformation, where we collect a sample from the bugs table and, based on this sample, select the corresponding bugs_activity and profiles entries.  This is composed of two parts.  In the first part we gather bugs and then run a reservoir sampling on them.  Then we store the data in the new table, and finally we save some values to variables for use in a later query.  This <a href="http://www.screencast.com/users/SimonKrueger/folders/Jing/media/7c5e944d-9d81-45fd-b649-0cee3b899bad">video</a> describes how to construct the first part.
<p>NOTE: It was necessary to assign these values to a variable.  I tried feeding the output fields from the group by CSV directly into a table input step query, but the JDBC driver will only accept the first entry in the CSV when using a <code>WHERE IN (?)</code> statement.  Using the variable in a <code>WHERE IN (${VARIABLE})</code> query eliminated this problem.</p>
</li>
<li>The second part is to take the variables that were just assigned and use them in the WHERE clause of a query to select the data we need from the Bugzilla database, and then dump that data into our new Bugzilla Sample database.  This <a href="http://www.screencast.com/users/SimonKrueger/folders/Jing/media/e60454c6-4b7a-47f6-8204-97ecde4a0531">video</a> describes this second part of the transformation.</li>
<li>Finally, we want to construct jobs that tie everything together.  This <a href="http://www.screencast.com/users/SimonKrueger/folders/Jing/media/82ede206-30f6-4b9f-809e-828cb91dcdc8">video</a> shows the final job with everything connected.</li>
</ol>
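<p>To illustrate the NOTE in the steps above, here is a minimal Python sketch of the two ways to get a list of ids into an <code>IN</code> clause: substituting the values into the query text, as the <code>${VARIABLE}</code> approach does, or binding one placeholder per value.  It is only an illustration, not part of the Kettle job: it uses an in-memory SQLite database instead of MySQL, and the variable name <code>SAMPLED_BUG_IDS</code>, the helper functions, and the sample ids are all made up for this example.</p>

```python
import sqlite3

# Hypothetical stand-in for the Kettle variable that holds the sampled bug ids.
VARIABLE_NAME = "SAMPLED_BUG_IDS"


def substitute_variable(query, name, values):
    """Replace ${name} in the query text with a CSV of integer ids."""
    csv = ", ".join(str(int(value)) for value in values)
    return query.replace("${" + name + "}", csv)


def select_with_placeholders(conn, ids):
    """Alternative: bind one placeholder per id instead of editing the text."""
    marks = ", ".join("?" for _ in ids)
    query = (
        "SELECT bug_id FROM bugs_activity WHERE bug_id IN (" + marks + ") "
        "ORDER BY bug_id"
    )
    return [row[0] for row in conn.execute(query, list(ids))]


# Tiny in-memory table standing in for the real bugs_activity table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bugs_activity (bug_id INTEGER)")
conn.executemany(
    "INSERT INTO bugs_activity VALUES (?)", [(i,) for i in range(1, 11)]
)

sampled_ids = [2, 5, 7]  # made-up sample, for illustration only
template = (
    "SELECT bug_id FROM bugs_activity "
    "WHERE bug_id IN (${SAMPLED_BUG_IDS}) ORDER BY bug_id"
)
substituted = substitute_variable(template, VARIABLE_NAME, sampled_ids)
rows_from_text = [row[0] for row in conn.execute(substituted)]
rows_from_binds = select_with_placeholders(conn, sampled_ids)

print(substituted)
print(rows_from_text, rows_from_binds)
```

<p>Both queries select the same three rows.  The substitution helper converts every value with <code>int()</code> before pasting it into the SQL text, which is a reasonable precaution whenever values are inserted into a query as text rather than bound as parameters.</p>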
<p>If you would like to change your sample size, change it in the reservoir sampling step.  I would even recommend using a variable there so the sample size is easy to change when running the job.  Also change the seed value inside the reservoir sampling step on each run to get a new sample.</p>
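<p>For readers who have not met reservoir sampling before, the sketch below shows one common variant (often called Algorithm R) in Python, with the sample size and seed as parameters.  It is an assumption-laden illustration of the idea, not the code behind Kettle&#8217;s reservoir sampling step, which may work differently; the function name and the example ids are invented here.</p>

```python
import random


def reservoir_sample(rows, sample_size, seed):
    """Return sample_size rows chosen uniformly from an iterable of any length."""
    rng = random.Random(seed)  # fixed seed: the same input gives the same sample
    reservoir = []
    for index, row in enumerate(rows):
        if index in range(sample_size):
            # Fill the reservoir with the first sample_size rows.
            reservoir.append(row)
        else:
            # Pick a slot in 0..index; the row is kept only if the slot
            # falls inside the reservoir, replacing what was there.
            slot = rng.randint(0, index)
            if slot in range(sample_size):
                reservoir[slot] = row
    return reservoir


bug_ids = range(1, 1001)  # made-up ids standing in for rows of the bugs table
first = reservoir_sample(bug_ids, 10, seed=1)
again = reservoir_sample(bug_ids, 10, seed=1)
other = reservoir_sample(bug_ids, 10, seed=2)

print(first)
print(first == again)  # same seed, same sample
print(other)
```

<p>Because the rows are read once and only <code>sample_size</code> of them are held at a time, the total number of rows does not need to be known in advance.  Reusing a seed reproduces a sample exactly, which is why the seed needs to change when a fresh sample is wanted.</p>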
<p>I have included the source files in a <a href="http://people.mozilla.org/~skrueger/CreateSampleDatabase.zip">compressed file</a> for anyone interested.  Read the README after unzipping.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/data/2009/06/05/creating-a-sample-bugzilla-database-using-kettle/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Software Quality Reports: Bugzilla Analysis</title>
		<link>http://blog.mozilla.com/data/2009/06/04/software-quality-reports-bugzilla-analysis/</link>
		<comments>http://blog.mozilla.com/data/2009/06/04/software-quality-reports-bugzilla-analysis/#comments</comments>
		<pubDate>Thu, 04 Jun 2009 17:43:25 +0000</pubDate>
		<dc:creator>skrueger</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Bugzilla]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[Kettle]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/data/?p=3</guid>
		<description><![CDATA[I am working on a project this summer that analyzes Bugzilla. The basis of this project has been started by Nick Goodman and he entitled it Software Quality Reports (SQR). Software Quality Reports gives product managers, project managers, development managers, and software engineers more information on things like bug burn down rate by product, issues [...]]]></description>
			<content:encoded><![CDATA[<p>I am working on a project this summer that analyzes Bugzilla.  The project builds on work started by Nick Goodman, who named it Software Quality Reports (SQR).  Software Quality Reports gives product managers, project managers, development managers, and software engineers more information on things like bug burn-down rate by product, issues by status and product, average days to resolution by priority and product, and open vs. closed trend by product.  I am going to take Nick Goodman&#8217;s SQR and improve it by making it more scalable and adding features that don&#8217;t currently exist.</p>
<p>
A large part of this project involves running an ETL (Extract, Transform, Load) process on the Bugzilla database to load it into a star schema inside a data warehouse. To design and run the ETL process I am using Pentaho Data Integration (PDI, formerly known as Kettle), a program from the open source Pentaho BI (Business Intelligence) Suite, along with Spoon, the graphical tool used to design and test every PDI process. Once the data is loaded into the star schema, I will use the Pentaho BI server to create graphs and charts to visualize and drill down into the data.</p>
<p>
I will be reporting my work throughout this blog, and I hope that it will allow anyone interested to participate, learn, or contribute.  To get started right now, you can read up on <a href="http://community.pentaho.com/">Pentaho</a> and <a href="http://kettle.pentaho.org/">Pentaho Data Integration (formerly known as Kettle)</a>.  Nick Goodman&#8217;s Software Quality Reports can be found on <a href="http://sourceforge.net/projects/qareports/">SourceForge</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/data/2009/06/04/software-quality-reports-bugzilla-analysis/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
	</channel>
</rss>

