The MapReduce paradigm on the shell

 

Hadoop is HDFS + MapReduce; MapReduce is a programming paradigm that everyone can use, so let's get the idea behind it using a simple approach.

Let's get a text file to process on the shell; for example we can use War & Peace:

http://onlinebooks.library.upenn.edu/webbin/gutbook/lookup?num=2600

One approach to get some sort of count on the text is the well known grep command, which counts the lines containing a specified word:

grep "Napoleon " 2600.txt | wc -l


The concept of map is a function that works on key-value pairs; in our case we can consider the key to be the file row number (or its offset from the beginning of the file) and the value to be the row content itself.

This simple script (map1.sh) reads the rows and emits a "Napoleon, 1" pair for every occurrence of the word:

#!/bin/bash
# map1.sh: read rows from stdin and emit a "Napoleon, 1" pair
# for every word that starts with "Napoleon" (so "Napoleon's" counts too)
while read -r line ; do
    for word in $line ; do
        if [[ "$word" == Napoleon* ]] ; then
            echo "Napoleon, 1"
        fi
    done
done

N> don't forget to make the script executable with chmod u+rwx map1.sh
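
To try it out we can just pipe the book through the script; here head only keeps the sample output short (2600.txt is the same file used with grep above):

cat 2600.txt | ./map1.sh | head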

Getting something like this:

Napoleon, 1
Napoleon, 1
Napoleon, 1
...

Note that each occurrence is emitted with a 1; this is quite common practice: when we find a specific value and we want to know how many times it appears (a sum), emitting a 1 per hit is the way to think about it.
Now let's define the reducer (red1.sh) that will sum up what has been emitted by the mapper:

#!/bin/bash
# red1.sh: sum up the "Napoleon, 1" pairs emitted by the mapper
kcount=0
while read -r line ; do
    if [ "$line" = "Napoleon, 1" ] ; then
        let kcount=kcount+1
    fi
done

echo "Napoleon, $kcount"
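
Wiring the two scripts together is then just another pipe (again assuming both scripts are executable and the book is in 2600.txt):

cat 2600.txt | ./map1.sh | ./red1.sh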

From this we get the total number of occurrences of the word, emitted as a single key-value line (Napoleon, <count>).

N> the value is slightly bigger because grep counted lines, while here we are counting occurrences (some lines probably contain more than one)

Parallelism
The important idea is that the problem can be decomposed and parallelized without affecting the result: finding tokens in a file, row by row, is a computation that can be done independently; we can process each row separately and then sum up all the occurrences found, and the result will still be correct.

On Hadoop we can slice the input file across several datanodes (assuming the input is very big, as it usually happens in big-data problems) and the algorithm will still work; this way we can scale out just by adding nodes to the cluster.
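
We can mimic that slicing on a single machine too; the snippet below is just an illustrative sketch, assuming GNU split (for the -n l/4 option) and the map1.sh / red1.sh scripts from above:

split -n l/4 2600.txt chunk_        # slice the book into 4 row-aligned subsets
for f in chunk_* ; do
    ./map1.sh < "$f" > "out_$f" &   # run one mapper per slice, in parallel
done
wait                                # let every mapper finish (the "map phase")
cat out_chunk_* | ./red1.sh         # a single reducer sums every emitted pair

The result is the same as the single pipeline above, only the row scanning is spread over four processes; Hadoop does the same thing across datanodes at a much larger scale.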

In our case we used a simple mapper looking for a single word; the natural generalization is to run X mappers and allocate N reducers, one per key (word) we want to count. The X mappers split the input into different subsets and process it in roughly size/X time, and each reducer then receives the mapper output for one of the 1..N different keys and produces the final count.
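
As a rough sketch of that generalization (only an illustration: here the shell sort stands in for the shuffle phase that routes each key to its reducer), the mapper can simply emit a pair for every word it sees:

#!/bin/bash
# mapN.sh: emit a "<word>, 1" pair for every token in the input rows
while read -r line ; do
    for word in $line ; do
        echo "$word, 1"
    done
done

Piping cat 2600.txt | ./mapN.sh | sort groups identical keys together, so each of the N reducers (one per word) can consume its own slice of the sorted stream and produce its count.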

