<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Blog of Data &#187; kettle</title>
	<atom:link href="http://blog.mozilla.com/data/tag/kettle/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.mozilla.com/data</link>
	<description>Mozilla metrics team&#039;s technical articles</description>
	<lastBuildDate>Sun, 07 Mar 2010 04:12:43 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Processing web access logs with a Kettle cluster</title>
		<link>http://blog.mozilla.com/data/2009/06/26/processing-web-access-logs-with-a-kettle-cluster/</link>
		<comments>http://blog.mozilla.com/data/2009/06/26/processing-web-access-logs-with-a-kettle-cluster/#comments</comments>
		<pubDate>Sat, 27 Jun 2009 06:57:00 +0000</pubDate>
		<dc:creator>deinspanjer</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[etl]]></category>
		<category><![CDATA[kettle]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/data/?p=90</guid>
		<description><![CDATA[There is a lot to be said on the above topic, but for the moment, I just wanted to drop a quick note about some ad-hoc work I did today:
I ran an analysis on a year and a half of FTP log files, filtering for some specific requests, and filtering out but summarizing uninteresting traffic [...]]]></description>
			<content:encoded><![CDATA[<p>There is a <strong>lot</strong> to be said on the above topic, but for the moment, I just wanted to drop a quick note about some ad-hoc work I did today:</p>
<p>I ran an analysis on a year and a half of FTP log files, filtering for some specific requests while filtering out, but still summarizing, uninteresting traffic from our heartbeat monitors and such.</p>
<p>I put together this simple Kettle transformation and ran it on a cluster of 32 fairly low-powered slaves. The results were pleasing, especially considering I didn&#8217;t even go through a tuning process to determine the optimal number of step copies or row set sizes.</p>
<p>The input was 8782 files containing 432 million records. Processing completed in 47 minutes, giving a throughput of about 156 thousand rows per second.</p>
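<p>As a rough sanity check on those numbers, here is a minimal Python sketch that uses only the rounded figures quoted above. The function name and the even split across the 32 slaves are assumptions made for illustration, not something taken from the transformation itself. With the rounded inputs the result is roughly 153 thousand rows per second, a little below the quoted figure, which presumably reflects the exact record count and run time.</p>

```python
def rows_per_second(records: int, minutes: float) -> float:
    """Average throughput in rows per second over the whole run."""
    return records / (minutes * 60)


# Rounded figures quoted in the post.
FILES = 8782
RECORDS = 432_000_000
MINUTES = 47
SLAVES = 32

overall = rows_per_second(RECORDS, MINUTES)

# Assumption: work is spread evenly across the slaves, which a real
# cluster run will only approximate.
per_slave = overall / SLAVES

# Average number of records in each input file.
per_file = RECORDS / FILES

print(f"overall:   {overall:,.0f} rows/s")
print(f"per slave: {per_slave:,.0f} rows/s")
print(f"per file:  {per_file:,.0f} records")
```

<p>Dividing by the slave count gives only an average; the actual load on each slave depends on how the files were distributed.</p>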
<p>There are a couple of screenshots after the cut.</p>
<p><span id="more-90"></span></p>
<div style="float: left;">
<div id="attachment_93" class="wp-caption alignright" style="width: 160px"><a href="http://blog.mozilla.com/data/files/2009/06/2009-06-27_0226.png"><img class="size-thumbnail wp-image-93" title="parse_ftp_logs" src="http://blog.mozilla.com/data/files/2009/06/2009-06-27_0226-150x150.png" alt="Transformation flow" width="150" height="150" /></a><p class="wp-caption-text">Transformation flow</p></div>
</div>
<div style="float: right;">
<div id="attachment_92" class="wp-caption alignleft" style="width: 160px"><a href="http://blog.mozilla.com/data/files/2009/06/2009-06-27_0218.png"><img class="size-thumbnail wp-image-92" title="trans_exec_log" src="http://blog.mozilla.com/data/files/2009/06/2009-06-27_0218-150x150.png" alt="Transformation log" width="150" height="150" /></a><p class="wp-caption-text">Transformation log</p></div>
</div>
<div style="clear: both;">
<hr />-Daniel</div>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/data/2009/06/26/processing-web-access-logs-with-a-kettle-cluster/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
