Posts Tagged ‘kettle’

Processing web access logs with a Kettle cluster

Friday, June 26th, 2009

There is a lot to say on this topic, but for the moment I just want to drop a quick note about some ad-hoc work I did today:

I ran an analysis on a year and a half of FTP log files, filtering for some specific requests while filtering out, but still summarizing, uninteresting traffic from our heartbeat monitors and the like.

I put together this simple Kettle transformation and ran it on a cluster of 32 fairly low-powered slaves. The results were pleasing, especially considering that I didn’t even go through a tuning process to determine the optimal number of step copies or row set sizes.

The run covered 8782 files containing 432 million records. Processing completed in 47 minutes, giving a throughput of about 156 thousand rows per second.
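For anyone who wants to check the numbers, here is a rough back-of-the-envelope sketch in Python using the rounded figures quoted above. With those rounded inputs the overall rate works out to roughly 153 thousand rows per second, in the same ballpark as the quoted figure. The per-slave rate assumes the rows were spread evenly across the 32 slaves, which is my simplification for illustration and not something I measured.

```python
# Rounded figures quoted in the post.
files = 8782
records = 432_000_000
minutes = 47
slaves = 32

seconds = minutes * 60

# Overall throughput across the whole cluster.
rows_per_second = records / seconds

# Assumption: rows were distributed evenly across all slaves.
rows_per_second_per_slave = rows_per_second / slaves

# Average file size in records.
records_per_file = records / files

print(f"overall:   {rows_per_second:,.0f} rows/s")
print(f"per slave: {rows_per_second_per_slave:,.0f} rows/s (even split assumed)")
print(f"per file:  {records_per_file:,.0f} records on average")
```

Seen that way, each slave only has to sustain a few thousand rows per second, which fits with low-powered machines doing fine without any tuning.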

There are a couple of screenshots after the cut.

(more…)