<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Blog of Data &#187; Kettle</title>
	<atom:link href="http://blog.mozilla.com/data/tag/kettle/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.mozilla.com/data</link>
	<description>Mozilla metrics team technical articles</description>
	<lastBuildDate>Thu, 01 Sep 2011 21:30:55 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Pentaho Hadoop integration</title>
		<link>http://blog.mozilla.com/data/2010/05/19/pentaho-hadoop-integration/</link>
		<comments>http://blog.mozilla.com/data/2010/05/19/pentaho-hadoop-integration/#comments</comments>
		<pubDate>Wed, 19 May 2010 16:14:10 +0000</pubDate>
		<dc:creator>deinspanjer</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Kettle]]></category>
		<category><![CDATA[Pentaho]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/data/?p=194</guid>
		<description><![CDATA[Pentaho announced this morning that they were going to be adding some features to Pentaho Data Integration (Kettle) and to their BI suite to make it easy for people to use Kettle to retrieve, manipulate, and store data in Hadoop, and to integrate Hadoop communication into the reporting and analysis layer. They posted a nice [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.pentaho.com/news/releases/20100519_pentaho_harnesses_apache_hadoop_to_deliver_big_data_analytics.php" target="_blank">Pentaho announced</a> this morning that they were going to be adding some features to Pentaho Data Integration (Kettle) and to their BI suite to make it easy for people to use Kettle to retrieve, manipulate, and store data in Hadoop, and to integrate Hadoop communication into the reporting and analysis layer.</p>
<p>They posted a nice five-minute screencast on their <a href="http://www.pentaho.com/hadoop/" target="_blank">Hadoop landing page</a> demonstrating a couple of pieces of Hive integration.  In it, they retrieve data using Hive, and they also use a Hive user-defined function that is implemented as an embedded Kettle transformation.</p>
<p>I&#8217;m very excited to see this announcement.  Besides the significant work we&#8217;ve been doing on the Metrics team to integrate HBase into the Socorro project, we also have major plans for our Hadoop clusters for general data storage and processing.</p>
<p>Right now, we have Kettle jobs and transformations that manipulate gigabytes of data per hour, loading it into our data warehouse.  One of the things I love about Kettle is the ability to quickly and easily define, review, and extend complex jobs such as our end-of-day data aggregation:</p>
<p><a href="http://blog.mozilla.com/data/files/2010/05/2010-05-19_1155.png"><img class="aligncenter size-medium wp-image-195" title="EOD Job" src="http://blog.mozilla.com/data/files/2010/05/2010-05-19_1155-300x87.png" alt="" width="300" height="87" /></a></p>
<p>In the future, as we have more data stored in Hadoop, I want to be able to run transformations on that data.  Sometimes, if the transformations involve lots of RDBMS work, I&#8217;ll want to stream the data out of HDFS.  For other types of transformations that involve mostly business logic and text transformations, being able to run that code directly in a Hadoop MapReduce job will be a fantastic feature.</p>
<p>My personal feeling is that people in the Hadoop community really need something visual and flexible like the Kettle interface for defining and manipulating this type of business logic.  Great strides have been made with projects such as Cascading, but it is still raw code, and I feel that excludes a lot of people who could be getting work done faster and better if they had a good tool to help them adapt to the world of MapReduce.</p>
<p>Currently, someone can start up Kettle&#8217;s GUI and start constructing jobs and transformations simply by piecing together steps of work such as reading a set of text files, performing a regex on them, doing some value lookups, then aggregating the data.  If they could then save that transformation and execute it as a Hadoop MapReduce job, I think it would be revolutionary for both the ETL and Hadoop worlds.</p>
<p>When Mozilla Metrics starts tackling some of the Hadoop data processing jobs that we have scheduled, we&#8217;ll be making significant open source contributions to both communities to realize this vision, and I really hope that it will help widen the accessibility of Hadoop to a new group of potential users.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/data/2010/05/19/pentaho-hadoop-integration/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Update: Bugzilla SQR</title>
		<link>http://blog.mozilla.com/data/2009/08/14/update-bugzilla-sqr/</link>
		<comments>http://blog.mozilla.com/data/2009/08/14/update-bugzilla-sqr/#comments</comments>
		<pubDate>Sat, 15 Aug 2009 03:27:35 +0000</pubDate>
		<dc:creator>skrueger</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Bugzilla]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[Kettle]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/data/?p=140</guid>
		<description><![CDATA[I have had the chance to improve the Bugzilla SQR in many ways. I have improved the overall run time of the ETL (both in Kettle and in a Python script), fixed a few bugs (including a major one that was causing problems with the Open Bug Count), added new dimensions, and constructed a few [...]]]></description>
			<content:encoded><![CDATA[<p>I have had the chance to improve the Bugzilla SQR in many ways.  I have improved the overall run time of the ETL (both in Kettle and in a Python script), fixed a few bugs (including a major one that was causing problems with the Open Bug Count), added new dimensions, and constructed a few dashboards.  All my changes will be available on <a href="http://sourceforge.net/projects/qareports/">SourceForge</a>.</p>
<p>I added a bug severity dimension, a component level on the product dimension, a team level on the person dimension, and a days dimension to track bugs over a distribution.</p>
<p>I have made some mock-up dashboards, and a few snapshots of their charts are posted below.  Tell me what you think and what you would find useful!</p>

<a href='http://blog.mozilla.com/data/2009/08/14/update-bugzilla-sqr/stacked/' title='stacked'><img width="150" height="150" src="http://blog.mozilla.com/data/files/2009/08/stacked-150x150.jpg" class="attachment-thumbnail" alt="stacked" title="stacked" /></a>
<a href='http://blog.mozilla.com/data/2009/08/14/update-bugzilla-sqr/open_bugs/' title='open_bugs'><img width="150" height="150" src="http://blog.mozilla.com/data/files/2009/08/open_bugs-150x150.png" class="attachment-thumbnail" alt="open_bugs" title="open_bugs" /></a>
<a href='http://blog.mozilla.com/data/2009/08/14/update-bugzilla-sqr/net_open/' title='net_open'><img width="150" height="150" src="http://blog.mozilla.com/data/files/2009/08/net_open-150x150.png" class="attachment-thumbnail" alt="net_open" title="net_open" /></a>
<a href='http://blog.mozilla.com/data/2009/08/14/update-bugzilla-sqr/distinct_issue/' title='distinct_issue'><img width="150" height="150" src="http://blog.mozilla.com/data/files/2009/08/distinct_issue-150x150.png" class="attachment-thumbnail" alt="distinct_issue" title="distinct_issue" /></a>
<a href='http://blog.mozilla.com/data/2009/08/14/update-bugzilla-sqr/components/' title='components'><img width="150" height="150" src="http://blog.mozilla.com/data/files/2009/08/components-150x150.png" class="attachment-thumbnail" alt="components" title="components" /></a>
<a href='http://blog.mozilla.com/data/2009/08/14/update-bugzilla-sqr/close_bugs/' title='close_bugs'><img width="150" height="150" src="http://blog.mozilla.com/data/files/2009/08/close_bugs-150x150.png" class="attachment-thumbnail" alt="close_bugs" title="close_bugs" /></a>

]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/data/2009/08/14/update-bugzilla-sqr/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Processing web access logs with a Kettle cluster</title>
		<link>http://blog.mozilla.com/data/2009/06/26/processing-web-access-logs-with-a-kettle-cluster/</link>
		<comments>http://blog.mozilla.com/data/2009/06/26/processing-web-access-logs-with-a-kettle-cluster/#comments</comments>
		<pubDate>Sat, 27 Jun 2009 06:57:00 +0000</pubDate>
		<dc:creator>deinspanjer</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[Kettle]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/data/?p=90</guid>
		<description><![CDATA[There is a lot to be said on the above topic, but for the moment, I just wanted to drop a quick note about some ad-hoc work I did today: I ran an analysis on a year and a half of FTP log files, filtering for some specific requests, and filtering out but summarizing uninteresting [...]]]></description>
			<content:encoded><![CDATA[<p>There is a <strong>lot</strong> to be said on the above topic, but for the moment, I just wanted to drop a quick note about some ad-hoc work I did today:</p>
<p>I ran an analysis on a year and a half of FTP log files, filtering for some specific requests, and filtering out but summarizing uninteresting traffic from our heartbeat monitors and such.</p>
<p>I put together this simple Kettle transformation and ran it with a cluster consisting of 32 fairly low-powered slaves.  The results were pleasing, especially considering I didn&#8217;t even go through a tuning process to determine the optimal number of step copies or row set sizes.</p>
<p>The input was 8782 files containing 432 million records.  The processing was completed in 47 minutes, giving a throughput of about 156 thousand rows per second.</p>
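<p>As a rough sanity check, the throughput can be recomputed from the rounded figures quoted above. The sketch below assumes exactly 432,000,000 records and exactly 47 minutes, which are approximations rather than exact values from the run; it lands at roughly 153 thousand rows per second, in line with the figure quoted, with the small gap plausibly coming from rounding of the inputs.</p>

```python
# Rounded figures quoted in the post; the exact values are not given.
records = 432_000_000
minutes = 47

# Throughput in rows per second over the whole run.
rows_per_second = records / (minutes * 60)
print(f"{rows_per_second:,.0f} rows per second")
```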
<p>There are a couple of screenshots after the cut.</p>
<p><span id="more-90"></span></p>
<div style="float: left;">
<div id="attachment_93" class="wp-caption alignright" style="width: 160px"><a href="http://blog.mozilla.com/data/files/2009/06/2009-06-27_0226.png"><img class="size-thumbnail wp-image-93" title="parse_ftp_logs" src="http://blog.mozilla.com/data/files/2009/06/2009-06-27_0226-150x150.png" alt="Transformation flow" width="150" height="150" /></a><p class="wp-caption-text">Transformation flow</p></div>
</div>
<div style="float: right;">
<div id="attachment_92" class="wp-caption alignleft" style="width: 160px"><a href="http://blog.mozilla.com/data/files/2009/06/2009-06-27_0218.png"><img class="size-thumbnail wp-image-92" title="trans_exec_log" src="http://blog.mozilla.com/data/files/2009/06/2009-06-27_0218-150x150.png" alt="Transformation log" width="150" height="150" /></a><p class="wp-caption-text">Transformation log</p></div>
</div>
<div style="clear: both;">
<hr />-Daniel</div>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/data/2009/06/26/processing-web-access-logs-with-a-kettle-cluster/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Creating a sample bugzilla database using kettle</title>
		<link>http://blog.mozilla.com/data/2009/06/05/creating-a-sample-bugzilla-database-using-kettle/</link>
		<comments>http://blog.mozilla.com/data/2009/06/05/creating-a-sample-bugzilla-database-using-kettle/#comments</comments>
		<pubDate>Fri, 05 Jun 2009 22:04:55 +0000</pubDate>
		<dc:creator>skrueger</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Bugzilla]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[Kettle]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/data/?p=13</guid>
		<description><![CDATA[Mozilla&#8217;s Bugzilla database contains approximately 480,000 bugs and approximately 5,000,000 entries in the bugs_activity table, which is too large for the initial development that I am doing. I want to construct a smaller sample Bugzilla database that I can use to develop and run tests for my project more efficiently. To [...]]]></description>
			<content:encoded><![CDATA[<p>
<a href="https://bugzilla.mozilla.org/">Mozilla&#8217;s Bugzilla</a> database contains approximately 480,000 bugs and approximately 5,000,000 entries in the bugs_activity table, which is too large for the initial development that I am doing.  I want to construct a smaller sample Bugzilla database that I can use to develop and run tests for my <a href="http://blog.mozilla.com/data/2009/06/04/software-quality-reports-bugzilla-analysis/">project</a> more efficiently. To construct this sample database, I first want to prune it down to only the tables that are required. To figure out which tables are necessary, I went through the SQR ETL job in Spoon and wrote down the tables that were being used.  The 8 necessary tables I found were:
</p>
<ul>
<li>versions</li>
<li>products</li>
<li>resolution</li>
<li>bug_status</li>
<li>priority</li>
<li>bugs</li>
<li>bugs_activity</li>
<li>profiles</li>
</ul>
<p>
The tables below could simply be transferred whole because of their small size: they contain only meta/attribute data about the bugs and hold nothing that relates to bugs.bug_id:
</p>
<ul>
<li>versions</li>
<li>products</li>
<li>resolution</li>
<li>bug_status</li>
<li>priority</li>
</ul>
<p>
For the bugs table I had to construct a statistical sample and, based on that sample, gather the corresponding entries in the bugs_activity and profiles tables.  I selected data from the bugs table and passed it through a reservoir sampling step, which pulls a fixed-size random sample of rows from the returned result set. I decided to pull a random sample of 1,000 bugs.  After that, a new job matches the bug_ids pulled from the reservoir sample against the bugs_activity table, and then matches the assigned_to and reporter fields from the random sample against the userid in the profiles table.
</p>
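<p>To illustrate the idea behind reservoir sampling, here is a minimal Python sketch of one common variant (often called Algorithm R). It is an illustration only, not the code of the Kettle step; the function name <code>reservoir_sample</code> and its parameters are hypothetical. It keeps a fixed-size sample while reading rows one at a time, so the full result set never has to be held in memory.</p>

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    reservoir = []
    for index, row in enumerate(rows):
        if k > index:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Pick a slot in [0, index]; the row is kept with probability k / (index + 1).
            slot = rng.randint(0, index)
            if k > slot:
                reservoir[slot] = row
    return reservoir

# Stand-in for 480,000 bug ids; sample 1,000 of them.
sample = reservoir_sample(range(480_000), 1000, seed=42)
print(len(sample))
```

<p>Passing a different <code>seed</code> on each run produces a different sample, while reusing a seed reproduces the same one.</p>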
<p>
Creating the Bugzilla Sample Database inside Spoon consists of 3 main jobs:</p>
<ol>
<li>Resetting the Sample Bugzilla Database</li>
<li>Transferring the Non-bugs tables</li>
<li>Transferring the Bugs tables</li>
</ol>
<p>The first step resets the sample Bugzilla database, clearing any information already stored in it.  The step follows the pseudocode below.<br />
<code><br />
For Table t in the set of Tables T {<br />
&nbsp;&nbsp;&nbsp;If t exists {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Drop t<br />
&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;Create t<br />
}<br />
</code>
</p>
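<p>The same reset logic can be sketched outside Spoon. The example below is a minimal sketch that uses Python with an in-memory SQLite database as a stand-in for the MySQL sample database; the <code>SCHEMAS</code> dictionary holds hypothetical, heavily simplified table definitions, and <code>reset_tables</code> is an illustrative helper name.</p>

```python
import sqlite3

# Hypothetical, heavily simplified schemas; the real Bugzilla tables have many more columns.
SCHEMAS = {
    "versions": "CREATE TABLE versions (value TEXT, product_id INTEGER)",
    "products": "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)",
    "bugs": "CREATE TABLE bugs (bug_id INTEGER PRIMARY KEY, assigned_to INTEGER, reporter INTEGER)",
}

def reset_tables(conn, schemas):
    """Drop each table if it exists, then create it again empty."""
    for name, create_sql in schemas.items():
        conn.execute(f"DROP TABLE IF EXISTS {name}")
        conn.execute(create_sql)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the sample database
reset_tables(conn, SCHEMAS)
conn.execute("INSERT INTO products (name) VALUES ('Firefox')")
reset_tables(conn, SCHEMAS)  # running it again clears the earlier row
product_count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(product_count)
```

<p>Because every table is dropped and recreated, the reset can be run repeatedly and always leaves the sample database empty.</p>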
<p>
The second step will transfer the trivial tables mentioned above that do not involve the bugs table.
</p>
<p>
The third step will collect a sample from the bugs table, output the entries into the new bugs table, select the bugs_activity entries based on bug_id and the profiles entries based on assigned_to and reporter, and output these into the new bugs_activity and profiles tables.
</p>
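<p>The selection logic of the third step can be sketched in a few lines of Python. The rows below are hypothetical stand-ins for the sampled bugs and the source tables, and only the columns needed for the matching are shown.</p>

```python
# Hypothetical rows standing in for the sampled bugs and the source tables.
sampled_bugs = [
    {"bug_id": 101, "assigned_to": 1, "reporter": 2},
    {"bug_id": 205, "assigned_to": 3, "reporter": 1},
]
bugs_activity = [
    {"bug_id": 101, "who": 2, "fieldid": 8},
    {"bug_id": 150, "who": 4, "fieldid": 8},
    {"bug_id": 205, "who": 3, "fieldid": 9},
]
profiles = [{"userid": uid} for uid in (1, 2, 3, 4, 5)]

# Keys collected from the sample.
bug_ids = {bug["bug_id"] for bug in sampled_bugs}
user_ids = {bug["assigned_to"] for bug in sampled_bugs} | {bug["reporter"] for bug in sampled_bugs}

# Keep only the rows that belong to the sampled bugs.
sample_activity = [row for row in bugs_activity if row["bug_id"] in bug_ids]
sample_profiles = [row for row in profiles if row["userid"] in user_ids]

print(len(sample_activity), len(sample_profiles))
```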
<p>
To begin, you will need access to a MySQL database where you can create, drop, and insert.  The name of the database doesn&#8217;t matter, but mine was called bugs_sample.  Once you have access to a MySQL database, open up Spoon, create a job, and set up the database connections as described below.</p>
<ol>
<li>Create a new job by going to <code>File &gt; New &gt; Job</code>.</li>
<li>Right-click in the Staging area and select <code>Job settings</code> from the menu that appears.  This will open up the Job properties dialog window.</li>
<li>Change the <code>Job name:</code> field to something like &#8220;Create Bugzilla Sample Database&#8221;.</li>
<li>In the top left of Spoon, click on the magnifying glass labeled &#8220;View&#8221;.</li>
<li>Right-click on the Database connections label.  From the drop-down menu, select <code>New</code>.  This will open up the Database connection dialog box.</li>
<li>The first database connection to create will be for the real Bugzilla database.  Fill in the credentials to log into the Bugzilla database.  An example is below.<br />
<a href="http://blog.mozilla.com/data/files/2009/06/bugzilladatabaseconnection.png"><img src="http://blog.mozilla.com/data/files/2009/06/bugzilladatabaseconnection-300x214.png" alt="bugzilladatabaseconnection" title="bugzilladatabaseconnection" width="300" height="214" class="aligncenter size-medium wp-image-26" /></a></li>
<li>Click Test and you should get a message like the one below but with your credentials.<br /><a href="http://blog.mozilla.com/data/files/2009/06/bugzilladatabaseconnectionmessageok.png"><img src="http://blog.mozilla.com/data/files/2009/06/bugzilladatabaseconnectionmessageok-300x148.png" alt="bugzilladatabaseconnectionmessageok" title="bugzilladatabaseconnectionmessageok" width="300" height="148" class="aligncenter size-medium wp-image-27" /></a><br />
If you didn&#8217;t receive this message, check your MySQL connections and credentials.</li>
<li>Repeat the previous two steps to create a database connection for the new Bugzilla Sample database.</li>
<li>Once the two database connections are set up, it is time to create a new job for the first step in the process.  Inside the job, check whether each table exists; if it does, drop it, and then create the table, as described earlier.  Follow this <a href="http://screencast.com/t/pzAlGJCnR">video</a> to create the first part of the job.<br />
The schema for each table can be found <a href="http://www.ravenbrook.com/project/p4dti/tool/cgi/bugzilla-schema/">here</a>.</li>
<li>Repeat the last part of the previous step for each table mentioned above.  The end result should look something like the image below.<br /><a href="http://blog.mozilla.com/data/files/2009/06/createsamplebugzilladatabaseendresult.png"><img src="http://blog.mozilla.com/data/files/2009/06/createsamplebugzilladatabaseendresult-300x127.png" alt="createsamplebugzilladatabaseendresult" title="createsamplebugzilladatabaseendresult" width="300" height="127" class="aligncenter size-medium wp-image-31" /></a></li>
<li>Now we will create a new job for the second part of the process, where we transfer all tables that don&#8217;t directly relate to the bugs table over to the new Sample Database.  Create a new job entitled Transfer Non-bugs.  This job will be composed of multiple transformations, each of which transfers one table from the Bugzilla database to our new Bugzilla Sample database.  To construct such a transformation, follow this <a href="http://screencast.com/t/bsoupWtAv">video</a>.  Create a new transformation for each of the non-bugs tables listed above.</li>
<li>Once a transformation for each non-bugs table has been completed and saved, create a job to connect them together as in the image below.<br /><a href="http://blog.mozilla.com/data/files/2009/06/transfernonbugstables.png"><img src="http://blog.mozilla.com/data/files/2009/06/transfernonbugstables-300x134.png" alt="transfernonbugstables" title="transfernonbugstables" width="300" height="134" class="aligncenter size-medium wp-image-34" /></a></li>
<li>Now it is time to create the third and last part of the transformation, where we collect a sample from the bugs table and, based on this sample, select the corresponding bugs_activity and profiles entries.  This is composed of two parts.  In the first part, we gather bugs and run a reservoir sampling on them.  Then we store the data in the new table, and finally we save some values to variables for use in a later query.  This <a href="http://www.screencast.com/users/SimonKrueger/folders/Jing/media/7c5e944d-9d81-45fd-b649-0cee3b899bad">video</a> describes how to construct the first part.
<p>NOTE: It was necessary to assign these values to a variable.  I tried feeding the comma-separated output field from the group by step directly into a table input step query, but the JDBC driver will only accept the first entry in the list when using a <code>WHERE IN (?)</code> clause.  Using the variable in a <code>WHERE IN (${VARIABLE})</code> query eliminated this problem.</p>
</li>
<li>The second part is to take the variables that were just assigned and use them in the WHERE clause of a query to select the data that we need from the Bugzilla database, and then dump the data into our new Bugzilla Sample database.  This <a href="http://www.screencast.com/users/SimonKrueger/folders/Jing/media/e60454c6-4b7a-47f6-8204-97ecde4a0531">video</a> describes this second part of the transformation.</li>
<li>Finally, we want to construct jobs that tie everything together.  This <a href="http://www.screencast.com/users/SimonKrueger/folders/Jing/media/82ede206-30f6-4b9f-809e-828cb91dcdc8">video</a> shows the final job with everything connected.</li>
</ol>
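<p>The placeholder issue described in the note above is not specific to Kettle: a single <code>?</code> is bound as one value, so a comma-separated list passed through it is never expanded into separate entries. The sketch below illustrates this with Python and an in-memory SQLite database as a stand-in (an assumption; the original setup used MySQL through JDBC, where the first entry was matched, while SQLite here simply fails to match the whole list). It also shows a different workaround from the variable substitution used in the transformation: generating one placeholder per value.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bugs_activity (bug_id INTEGER, fieldid INTEGER)")
conn.executemany(
    "INSERT INTO bugs_activity VALUES (?, ?)",
    [(101, 8), (150, 8), (205, 9), (307, 9)],
)

bug_ids = [101, 205, 307]

# A single placeholder is bound as ONE value (here the string "101,205,307"),
# so it is never expanded into three list entries.
csv_param = ",".join(str(b) for b in bug_ids)
single = conn.execute(
    "SELECT bug_id FROM bugs_activity WHERE bug_id IN (?)", (csv_param,)
).fetchall()

# One placeholder per value lets the driver bind every entry separately.
placeholders = ", ".join("?" for _ in bug_ids)
multi = conn.execute(
    f"SELECT bug_id FROM bugs_activity WHERE bug_id IN ({placeholders}) ORDER BY bug_id",
    bug_ids,
).fetchall()

print(len(single), len(multi))
```

<p>Only the placeholder string is built dynamically; the values themselves are still bound by the driver, which avoids pasting raw data into the SQL text.</p>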
<p>If you would like to change your sample size, change it in the reservoir sampling step.  I would even recommend using a variable there so the sample size can easily be changed when running the job.  Also change the seed value in the reservoir sampling step each time to get a new sample.</p>
<p>I have attached and included the source files in a <a href="http://people.mozilla.org/~skrueger/CreateSampleDatabase.zip">compressed file</a> for anyone interested.  Read the README after unzipping.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/data/2009/06/05/creating-a-sample-bugzilla-database-using-kettle/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Software Quality Reports: Bugzilla Analysis</title>
		<link>http://blog.mozilla.com/data/2009/06/04/software-quality-reports-bugzilla-analysis/</link>
		<comments>http://blog.mozilla.com/data/2009/06/04/software-quality-reports-bugzilla-analysis/#comments</comments>
		<pubDate>Thu, 04 Jun 2009 17:43:25 +0000</pubDate>
		<dc:creator>skrueger</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Bugzilla]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[Kettle]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/data/?p=3</guid>
		<description><![CDATA[I am working on a project this summer that analyzes Bugzilla. The project was started by Nick Goodman, who named it Software Quality Reports (SQR). Software Quality Reports gives product managers, project managers, development managers, and software engineers more information on things like bug burn down rate by product, issues [...]]]></description>
			<content:encoded><![CDATA[<p>I am working on a project this summer that analyzes Bugzilla.  The project was started by Nick Goodman, who named it Software Quality Reports (SQR).  Software Quality Reports gives product managers, project managers, development managers, and software engineers more information on things like bug burn down rate by product, issues by status and product, average days to resolution by priority and product, open vs. closed trend by product, etc.  I am going to take Nick Goodman&#8217;s SQR and improve it by making it more scalable and adding additional features that don&#8217;t currently exist.</p>
<p>
A large part of this project involves doing an ETL (Extract, Transform, Load) of the Bugzilla database into a star schema inside a data warehouse. To design and run the ETL process, I am using Pentaho Data Integration (PDI, formerly known as Kettle), a program from the open source Pentaho BI (Business Intelligence) Suite, along with Spoon, the graphical tool used to design and test every PDI process. Once the data is loaded into the star schema, I will use the Pentaho BI server to create graphs and charts to visualize and drill down into the data.</p>
<p>
I will be reporting my work throughout this blog, and I hope that it will allow anyone interested to participate, learn, or contribute.  To get started right now, you can read up on <a href="http://community.pentaho.com/">Pentaho</a> and <a href="http://kettle.pentaho.org/">Pentaho Data Integration (formerly known as Kettle)</a>; Nick Goodman&#8217;s Software Quality Reports can be found on <a href="http://sourceforge.net/projects/qareports/">SourceForge</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/data/2009/06/04/software-quality-reports-bugzilla-analysis/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
	</channel>
</rss>

