MapReduce WordCount visual example

When you start playing with Hadoop, the core concepts can be hard to grasp, so let’s try to simplify them. The HelloWorld you probably wrote in your first programming language is back: in Hadoop (and in the MapReduce framework generally) it is called WordCount. Let’s take a quick look at the main steps of the framework to see how the problem is solved in the distributed version.

At a very high level, MapReduce (MaR) works this way:

  • Iterate over a large number of input records, splitting them across the nodes
  • Extract something interesting from the local data [map]
  • Sort the intermediate results
    • Distribute them to the reducers
  • Aggregate the intermediate results [reduce]
  • Generate the final output from the reduced values
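The steps above can be sketched in a few lines of plain Python. This is only a single-process simulation, not how Hadoop is implemented: `run_mapreduce`, `num_nodes`, and the toy parity job are hypothetical names made up for illustration, and the "nodes" are just slices of a list.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, mapper, reducer, num_nodes=2):
    """Simulate the steps above in one process (no real cluster involved)."""
    # 1. Split the input records across the (simulated) nodes.
    splits = [records[i::num_nodes] for i in range(num_nodes)]

    # 2. Map: each node extracts (key, value) pairs from its local records.
    intermediate = []
    for split in splits:
        for record in split:
            intermediate.extend(mapper(record))

    # 3. Sort the intermediate pairs by key, so equal keys become adjacent.
    intermediate.sort(key=itemgetter(0))

    # 4. Distribute: hand each key, with all of its values, to a reducer.
    # 5. Reduce: aggregate the values of each key.
    output = {}
    for key, pairs in groupby(intermediate, key=itemgetter(0)):
        values = [value for _, value in pairs]
        output[key] = reducer(key, values)

    # 6. The final output is built from the reduced values.
    return output


# Hypothetical toy job: sum the even and the odd numbers separately.
def parity_mapper(n):
    return [("even" if n % 2 == 0 else "odd", n)]


def sum_reducer(key, values):
    return sum(values)


result = run_mapreduce([1, 2, 3, 4, 5, 6], parity_mapper, sum_reducer)
print(result)  # {'even': 12, 'odd': 9}
```

The sort step matters here: `groupby` only merges adjacent items, which mirrors why the framework sorts intermediate results before handing them to the reducers.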

The user defines the map and reduce ‘functions’, keeping in mind that:

  • In map, a pair (K1, V1) generates a list of new pairs, list(K2, V2)
  • In reduce, all the values associated with the same key, (K2, list(V2)), are processed to generate a list(V2)
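For WordCount, those two signatures could look roughly like the sketch below. It is plain Python rather than the Hadoop Java API, and the names `wc_map` and `wc_reduce` are hypothetical. It assumes K1 is the position of a line in the input, V1 is the line of text, K2 is a word, and V2 is a count; lowercasing the words is also just a choice made for this example.

```python
from typing import List, Tuple


def wc_map(offset: int, line: str) -> List[Tuple[str, int]]:
    """map: (K1, V1) -> list((K2, V2)). Emits (word, 1) for every word."""
    # The key (offset) is ignored: only the text of the line matters here.
    return [(word.lower(), 1) for word in line.split()]


def wc_reduce(word: str, counts: List[int]) -> List[int]:
    """reduce: (K2, list(V2)) -> list(V2). Sums the counts of one word."""
    return [sum(counts)]


print(wc_map(0, "Hello Hadoop hello"))  # [('hello', 1), ('hadoop', 1), ('hello', 1)]
print(wc_reduce("hello", [1, 1]))       # [2]
```

Note that map never counts anything itself: it only emits a 1 per word, and the counting happens in reduce once all the values for the same key have been brought together.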

So let’s apply it to an example to visualize the data processing in the framework:
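As a text-only walk-through, the sketch below prints what the data looks like after each stage. The two input lines are made up for illustration, each imagined as living on a different node, and everything runs in a single Python process rather than on a Hadoop cluster.

```python
from collections import defaultdict

# Hypothetical input: two lines, each imagined as living on a different node.
lines = ["the quick fox", "the lazy dog the end"]

# Map: every line independently emits a (word, 1) pair per word.
mapped = [[(word, 1) for word in line.split()] for line in lines]

# Sort: merge the per-node outputs and order the pairs by key.
sorted_pairs = sorted(pair for node_output in mapped for pair in node_output)

# Distribute: collect all the values of one key for a single reducer call.
grouped = defaultdict(list)
for word, one in sorted_pairs:
    grouped[word].append(one)

# Reduce: sum the values of each key to get the final counts.
counts = {word: sum(ones) for word, ones in grouped.items()}

# Print the data as it looks after each stage.
for stage, data in [("map", mapped), ("sort", sorted_pairs),
                    ("group", dict(grouped)), ("reduce", counts)]:
    print(f"{stage:>6}: {data}")
```

The word "the" appears on both simulated nodes, so its three (the, 1) pairs only meet after the sort and distribute stages, where the reducer turns them into a count of 3.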

This is my first post and I am new to Hadoop, so I expect to revise it and write more posts in the coming days…
