The following is an example of parsing an exim_mainlog using Hadoop Streaming. I’ve implemented both the mapper and the reducer in Python. Neither script handles all of Exim’s log formats yet, but both are easy to extend if you end up using the output (this is just an example).
The following is the EximMapper.py file.
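#!/usr/bin/python
import sys, re

# date, time, 16-character message ID, rest of the line
date_time_id_rest = r'([0-9-]{10})\s([0-9:-]{8})\s([a-zA-Z0-9-]{16})\s(.*)'

def eximSplit(line):
    matches = re.match(date_time_id_rest, line)
    if matches is not None:
        key = matches.group(3)   # message (transaction) ID
        val = matches.group(0)   # the whole matched log line
        # Emit key<TAB>value; Hadoop Streaming splits on the first tab
        print(key + '\t' + val)

for line in sys.stdin:
    eximSplit(line)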
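To see what the mapper's pattern accepts and rejects, here is a minimal sketch that applies the same regular expression to two sample lines. The sample lines, the addresses in them, and the helper name to_key_value are made up for illustration; real exim_mainlog entries vary.

```python
import re

# Same pattern as the mapper, written as a raw string.
date_time_id_rest = r'([0-9-]{10})\s([0-9:-]{8})\s([a-zA-Z0-9-]{16})\s(.*)'

def to_key_value(line):
    """Return (transaction_id, full_line), or None if the line has no ID."""
    matches = re.match(date_time_id_rest, line)
    if matches is None:
        return None
    return matches.group(3), matches.group(0)

# Hypothetical log lines, for illustration only.
with_id = '2010-06-01 12:00:01 1OJabc-0001Xy-Ab <= alice@example.com S=1024'
without_id = '2010-06-01 12:00:02 Start queue run: pid=1234'

print(to_key_value(with_id))     # a (key, value) pair
print(to_key_value(without_id))  # None: no 16-character ID after the time
```

Lines that carry no transaction ID, such as queue-runner messages, are silently dropped by the mapper. That is one of the places where the script would need extending to cover more of Exim's log formats.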
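The following is the EximReducer.py file. It may take a while to understand this one. The main intricacy is that the reducer must remember state between input lines: Hadoop Streaming feeds it a sorted stream of key/value lines rather than calling it once per key, so the script has to detect for itself when the key changes.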
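#!/usr/bin/python
import sys

(lastKey, lastVal) = (None, '')

for line in sys.stdin:
    # Split on the first tab only, in case the log text itself contains tabs
    (key, val) = line.rstrip('\n').split('\t', 1)
    if lastKey is None:
        # First line of input: start the first transaction
        (lastKey, lastVal) = (key, val)
    elif lastKey != key:
        # The key changed, so the previous transaction is complete
        print(lastKey + '\t' + lastVal)
        (lastKey, lastVal) = (key, val)
    else:
        # Same transaction: prepend this line to the accumulated value
        lastVal = val + '\n' + lastVal

# Flush the final transaction
if lastKey is not None:
    print(lastKey + '\t' + lastVal)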
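The "remember the last key" bookkeeping can also be expressed with itertools.groupby from the standard library, which groups adjacent items that share a key. The following is only a sketch of that alternative, not a drop-in replacement for EximReducer.py. The function name reduce_sorted and the sample lines are hypothetical, and unlike the reducer above it keeps values in arrival order instead of prepending them.

```python
from itertools import groupby

def reduce_sorted(lines):
    """Yield (key, [values]) for each run of adjacent lines sharing a key."""
    # Split each line on the first tab only: key on the left, value on the right.
    pairs = (line.rstrip('\n').split('\t', 1) for line in lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, [val for _, val in group]

# Hypothetical, already-sorted mapper output (IDs and text are made up).
sorted_lines = [
    'idA\tfirst line of A\n',
    'idA\tsecond line of A\n',
    'idB\tonly line of B\n',
]

for key, values in reduce_sorted(sorted_lines):
    print(key + '\t' + ' | '.join(values))
```

Like the hand-written version, this only works when lines with the same key arrive next to each other, which is what the sort between the map and reduce steps provides.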
That is all the code we need. Now we can run our mapper and reducer under bash to check that they work. This is an easy way of visualizing a Hadoop work unit for those who are familiar with pipes in bash.
cat exim_mainlog | ./EximMapper.py | sort | ./EximReducer.py
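The sort in the middle of the pipe stands in for the sorting Hadoop does between the map and reduce phases, and the reducer depends on it. As a rough illustration, the sketch below counts how many separate runs of keys a reducer would see with and without sorting. The function name count_groups and the sample lines are made up.

```python
from itertools import groupby

def count_groups(lines):
    """Count runs of adjacent lines that share a key (text before the first tab)."""
    return sum(1 for _ in groupby(lines, key=lambda l: l.split('\t', 1)[0]))

# Hypothetical mapper output in log order: transaction A is interleaved with B.
mapped = [
    'idA\tline 1 of A',
    'idB\tline 1 of B',
    'idA\tline 2 of A',
]

print(count_groups(mapped))          # 3: transaction A is seen as two fragments
print(count_groups(sorted(mapped)))  # 2: one run per transaction
```

Without the sort, a reducer that only compares each key with the previous one would emit transaction A twice, once for each fragment.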
Ideally, the bash pipeline will organize all the entries in the exim_mainlog into transactions, grouped by transaction ID. If it does, we can move on to running this under Hadoop. We’ll want to grab our hadoop-streaming.jar file and pass it to the hadoop jar command like so. You’ll have to update the paths I used to match your environment.
$HADOOP/bin/hadoop jar hadoop-0.20.2-streaming.jar \
    -input exim_mainlog \
    -output StreamingOutputDirectory \
    -mapper EximMapper.py \
    -reducer EximReducer.py
