Moving code to data, the MapReduce way

MapReduce suits problems that scale horizontally: the cardinality of the data does not change the way we define the solution to the problem itself.
Let’s take a simple example to appreciate how this differs from the centralized way of thinking. Our problem is to find the tallest scholars by age in a small town. In the end it’s a simple problem: you can write a small program that loops through a list of name, age, height and school of origin and finds the answer easily. The only necessary step is to move the data from each school to the one place where all of it is processed. This is called moving data to code, the opposite of what is natural in a distributed environment. That’s fine when the cardinality of the data is small, but it stops being easy as the cardinality grows. Imagine now that instead of a small town we need to process all the schools of the country, or more! We would have to move all the data around and compute it centrally. On top of that, a lot of time is wasted comparing the values of one school with the max values already processed for other schools (imagine we process the data as it arrives from the N schools).
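To make the centralized version concrete, here is a minimal sketch in Python. The `Scholar` record, the sample names and the heights are invented for illustration; the point is only that every record has to be shipped to one place and scanned in a single loop.

```python
from typing import NamedTuple


class Scholar(NamedTuple):
    # Hypothetical record layout, following the fields named in the text.
    name: str
    age: int
    height_cm: int
    school: str


def tallest_by_age(records):
    """Scan every record once and keep the tallest scholar seen for each age."""
    best = {}
    for r in records:
        current = best.get(r.age)
        if current is None or r.height_cm > current.height_cm:
            best[r.age] = r
    return best


# All the data from every school has been moved to this single list (invented sample data).
all_records = [
    Scholar("Ann", 10, 140, "North"),
    Scholar("Bob", 10, 145, "North"),
    Scholar("Cid", 11, 150, "North"),
    Scholar("Dee", 10, 148, "South"),
    Scholar("Eli", 11, 149, "South"),
]

central_result = tallest_by_age(all_records)
for age, scholar in sorted(central_result.items()):
    print(age, scholar.name, scholar.height_cm, scholar.school)
```

The loop itself is trivial; the cost hides in building `all_records`, which requires every school to send all of its data to the central node.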
With MapReduce we move the computation logic (the code) to the N schools themselves (note that shipping a piece of code is usually far less expensive than sending the data across the network). This step basically maps the code onto the nodes, so that each school finds its own tallest scholars by age (we can imagine a list of pairs of data). Still, this data is not enough to answer the original problem. The next step is the so-called reduction phase: the local max values found by the schools are processed together to get the final answer.
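The two phases can be sketched in plain Python as follows. This is a single-process simulation under invented assumptions (the `schools` dictionary, the function names and the sample data are all hypothetical), not the Hadoop API: in a real cluster each call to `map_school` would run on the node that holds that school's data, and only the small local results would travel over the network.

```python
from functools import reduce

# Hypothetical data: each school holds its own (name, age, height in cm) records.
schools = {
    "North": [("Ann", 10, 140), ("Bob", 10, 145), ("Cid", 11, 150)],
    "South": [("Dee", 10, 148), ("Eli", 11, 149)],
}


def map_school(records):
    """Map phase: runs where the data lives and returns the local max per age."""
    local = {}
    for name, age, height in records:
        if age not in local or height > local[age][1]:
            local[age] = (name, height)
    return local


def reduce_maxima(left, right):
    """Reduce phase: merge two partial results, keeping the taller scholar per age."""
    merged = dict(left)
    for age, (name, height) in right.items():
        if age not in merged or height > merged[age][1]:
            merged[age] = (name, height)
    return merged


# Each school produces a small result; only these results need to be combined.
local_results = [map_school(records) for records in schools.values()]
tallest = reduce(reduce_maxima, local_results, {})
print(tallest)
```

Because taking a maximum is associative, the partial results can be merged in any order, which is what lets the reduction work no matter how many schools take part.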
From this simple problem we can see why, for some problems and data cardinalities, it makes more sense to go for a distributed approach rather than a centralized solution. With Hadoop and MapReduce we can leverage a framework in which we only have to write the code for the map and reduce functions; all the rest is managed by the framework.
