There is a lot to be said on the above topic, but for the moment, I just wanted to drop a quick note about some ad-hoc work I did today:
I ran an analysis on a year and a half of FTP log files, filtering for some specific requests while filtering out, but still summarizing, uninteresting traffic from our heartbeat monitors and such.
I put together this simple Kettle transformation and ran it on a cluster of 32 fairly low-powered slaves. The results were pleasing, especially considering I didn’t even go through a tuning process to determine the optimal number of step copies or row set sizes.
The input was 8782 files containing 432 million records. Processing completed in 47 minutes, for a throughput of about 156 thousand rows per second.
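For anyone who wants to sanity-check the throughput, here is a back-of-the-envelope sketch in Python. It uses only the rounded figures quoted above, so it lands at roughly 153 thousand rows per second rather than exactly 156 thousand; the gap is presumably down to rounding in the record count and run time, though that is an assumption. The per-slave and per-file averages are derived numbers for illustration, not measurements, and the variable and function names are made up for this example.

```python
# Rough throughput check using the rounded figures from the post.
# Assumption: work was spread evenly across slaves, which is only an
# approximation for a real cluster run.

RECORDS = 432_000_000  # total rows processed (rounded)
MINUTES = 47           # wall-clock run time (rounded)
SLAVES = 32            # number of slave nodes in the cluster
FILES = 8782           # number of FTP log files read


def rows_per_second(rows: int, minutes: float) -> float:
    """Average rows processed per second over the whole run."""
    return rows / (minutes * 60)


total_rps = rows_per_second(RECORDS, MINUTES)
per_slave_rps = total_rps / SLAVES   # average share per slave
rows_per_file = RECORDS / FILES      # average file size in rows

print(f"cluster:   {total_rps:,.0f} rows/s")
print(f"per slave: {per_slave_rps:,.0f} rows/s")
print(f"per file:  {rows_per_file:,.0f} rows on average")
```

Seen this way, each slave only had to sustain a few thousand rows per second, which fits with the hardware being fairly modest.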
There are a couple of screenshots after the cut.
