Processing web access logs with a Kettle cluster

There is a lot to be said on the above topic, but for the moment, I just wanted to drop a quick note about some ad-hoc work I did today:

I ran an analysis on a year and a half of FTP log files, filtering for some specific requests, and filtering out (but still summarizing) uninteresting traffic from our heartbeat monitors and the like.

I put together this simple Kettle transformation and ran it on a cluster of 32 fairly low-powered slaves. The results were pleasing, especially considering I didn’t even go through a tuning process to determine the optimal number of step copies or row set sizes.

The input was 8782 files containing 432 million records. Processing completed in 47 minutes, giving a throughput of about 156 thousand rows per second.
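For anyone who wants to sanity-check those numbers, here is a quick back-of-the-envelope sketch in Python. It uses only the rounded figures quoted above; the variable names are mine, and the even split across slaves is an assumption for illustration, not something I measured. With these rounded inputs the overall rate comes out at roughly 153 thousand rows per second, in the same ballpark as the figure above; the small gap is presumably down to rounding of the record count and run time.

```python
# Rounded figures from the run described above.
files = 8782
records = 432_000_000
minutes = 47
slaves = 32

seconds = minutes * 60

# Overall throughput of the cluster.
rows_per_second = records / seconds

# Assumption: work is spread evenly across the slaves (not measured).
rows_per_second_per_slave = rows_per_second / slaves

# Average size of an input file, in records.
records_per_file = records / files

print(f"overall:   {rows_per_second:,.0f} rows/s")
print(f"per slave: {rows_per_second_per_slave:,.0f} rows/s")
print(f"per file:  {records_per_file:,.0f} records on average")
```

Under that even-split assumption, each slave handles a little under five thousand rows per second, which is modest for a single node and fits with the hardware being fairly low powered.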

There are a couple of screenshots after the cut.

Transformation flow

Transformation log


-Daniel
