MapReduce WordCount visual example

22 May

When you start playing with Hadoop, the core concepts can be hard to grasp, so let’s try to simplify them… The HelloWorld you wrote in your first programming language is back: in Hadoop (and in the MapReduce framework generally) it is called WordCount. Let’s take a quick look at the main steps of the framework to see how the problem is solved in the distributed version…

At a very high level, MapReduce (MaR) works this way:

  • Iterate over a large number of input records, splitting them across the nodes
  • Extract something of interest [map] from the local data
  • Sort the intermediate results
    • Distribute them to the reducers
  • Aggregate the intermediate results [reduce]
  • Generate the final output from the reduced values

The users define the map and reduce ‘functions’, keeping in mind that:

  • In map, a pair (K1, V1) generates a list of new pairs, list(K2, V2)
  • In reduce, all the values associated with the same key, (K2, list(V2)), are processed to generate a list(V2)

So let’s apply it to an example to visualize the data processing in the framework:
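As a rough illustration, here is a minimal single-process sketch of WordCount in Python. It is not Hadoop code: the function names (`map_phase`, `shuffle_and_sort`, `reduce_phase`) and the sample lines are made up for this example, and everything runs in memory on one machine. It only mirrors the steps listed above.

```python
from collections import defaultdict
from itertools import chain


def map_phase(offset, line):
    """Map: one (K1, V1) pair (line offset, line text) -> list of (K2, V2) pairs (word, 1)."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Group the values by key and sort by key, the step between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(word, counts):
    """Reduce: (K2, list(V2)) -> list(V2), here a single total per word."""
    return [sum(counts)]


def word_count(lines):
    """Run map, shuffle/sort and reduce over the input lines."""
    mapped = chain.from_iterable(map_phase(i, line) for i, line in enumerate(lines))
    return {word: reduce_phase(word, counts)[0]
            for word, counts in shuffle_and_sort(mapped)}


if __name__ == "__main__":
    sample = ["Hello Hadoop", "Hello MapReduce", "Goodbye Hadoop"]
    print(word_count(sample))
```

In a real cluster the map calls run on the nodes that hold the data, and the grouped keys are distributed across several reducers; here a dictionary plays both roles.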

This is my first post and I am new to Hadoop, so I expect to review it and write more posts in the coming days…

Have a good worry!

29 Oct

Simple techniques are the best to get results… that’s for sure.

How much time do we spend thinking about “what if… then… else” and worrying about it?! What a shame, isn’t it?

A simple technique I’ve read about, and expanded on, can save you time and pain…

Set a 10-minute alarm near your seat…

Use a piece of paper divided into 2 areas, or better a diary, in this way:

I worried about…   I might need to worry about…
  1. xxx             1. xxx

I suggest using 2/3 for what you have worried about in the last 24 hours and 1/2 for what you might worry about in the next 24 hours.

Use these 10 minutes to worry about them properly. Keep going until you hear the alarm; if you start thinking about something nice, be strong and drag your mind back to the worries.

Do it again tomorrow, and any time something to worry about comes to mind, add it to the list and keep it locked there.

With practice you will be able to get your worrying down to a few minutes a day…

Another, optional, suggestion is to give each worry 1 to 3 stars to mark its importance and, of course, to draw a nice straight line through the ones that are no longer worrying you.

Spdrdr

8 Oct

I would like to share my experience from last weekend: I went to my first speed-reading workshop in Dublin.

Why invest money in this type of course, you may ask… Because of my job I have to update my skills and competencies all the time, so it makes sense to invest in techniques for reading quickly and efficiently… doesn’t it?

The weekend was full of information; the main resource is http://www.spdrdng.com/. The best I can do now is give a 40% overview of the contents (no more: I paid, so you have to as well…).

Still reading? Good. Let’s look at some of the learnings I got from the weekend:

Learning n 1: Reading state

The state of your body influences how you read and the quality of your actions, so:

First, before starting:

  1. take a deep breath
  2. exhale and smile
  3. open your peripheral vision
    1. Move your hands near your ears and then behind them, start wiggling your fingers and move your hands forward; stop as soon as you can see both hands with the fingers moving, then notice what you can see

Peripheral vision helps you use all your resources in any experience.

  4. take your awareness to your concentration point

The concentration point is about 15 cm above and slightly behind the top of your head.

Learning n 2: always stick with a 20/25 mins chunk

Work in slots of 20 minutes, 25 at most, so that anything you are going to do gets done in 20 minutes. Why? A task tends to fill all the time you have available, so it’s better to allocate a fixed amount of time to it every time.

Learning n 3: SMART purpose

Define your SMART purpose: what do you expect from your 20 minutes?

S = specific: say in detail what you want to reach

M = measurable: you have to be able to tell objectively when you have achieved it

A = achievable: once achieved, what does it give you? (WIIFM = what’s in it for me)

R = real: it can realistically be done in 20 minutes

T = timely: you must set a time limit that you decide (between 20 and 25 minutes)

Learning n 4: Previewing

Start by reading the summaries, table of contents and index. To understand the contents, your brain needs a big picture; this acts as a guide to which you add details as you read the text.

Learning n 5: the Pareto principle

The Pareto principle, applied to this context, says you can get 80% of what is relevant in a book from 20% of its contents and, more importantly, that the rest of the contents will give you only the remaining 20%. So it’s always better to use your time and speed-reading techniques to read more books on the same topic, catching the 80% in each one…

So, doing some maths:

1 book with slow reading = 100% of the time, maybe 100% of the book content (usually less)

1st book with fast reading = 80% of the contents in 20% of the time

2nd book with fast reading = 80% of the contents in 20% of the time

3rd book with fast reading = 80% of the contents in 20% of the time

4th book with fast reading = 80% of the contents in 20% of the time

5th book with fast reading = 80% of the contents in 20% of the time

===

400% of the topic in 100% of the time: you can become almost an expert in the same time ;)
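The same sum can be written as a tiny Python sketch. The function name and its defaults are invented for this example, and it simply adds the per-book shares as the list above does, without accounting for overlap between books.

```python
def fast_reading_totals(books, content_pct=80, time_pct=20):
    """Return (content %, time %) totals, measured against one slow-read book."""
    return books * content_pct, books * time_pct


content, time_spent = fast_reading_totals(5)
print(f"{content}% of the topic in {time_spent}% of the time")
```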

Learning n 6: Find information

Change your thinking:

from the quantity of reading

to the quality of information

Another way to say it is: read the message, not the words. This opens your mind to another concept, one that focuses on the information acquired while reading the text rather than on reading the words (which is what we were taught at school).

Learning n 7: Use speed-reading eye patterns

Look around the page for hot spots, where the relevant information is kept; these patterns break the old habit of reading sequentially…

  • dipping = read quickly looking for hot spots, dip into the relevant material, then speed up again
  • pacer = use a pencil as a guide while reading the text
  • headings = always focus on titles, headings and notes
  • horizontal underlining = underline the text (fast) with the pacer, using your peripheral vision to get the message
  • super reading = scan the text vertically looking for hot spots
  • capital I = read the first 2 or 3 lines of a page, super-read the middle and read the last 2
  • skittering = zig-zag across the text

(this is what you usually find in speed-reading books)

Learning n 8: Use mindmaps and rhizomapping

At the end of the 20-minute session, map out your knowledge: what you have understood that is relevant to the purpose for which you are reading.

Learning n 9: Review information on maps regularly

Create a mind map of the content from memory and then look for what you have missed, at these time intervals:

  • the day after
  • a week after
  • a month after

Learning n 10: Use syntopic processing

Have a 75-minute session with 4 books rather than one (on the same topic). Build an incremental map, adding the information found and the related reference for each book. In this case you can define 11 points for the SMART purpose.

Learning n 11: Direct reading

This technique uses the unconscious mind to read the book: the day before, turn the pages one at a time at a rate of one page per second, making sure you can see all 4 corners.

Learning n 12: Set high expectations

High expectations are closely connected with high results, so think big about what you can do.

Learning n 13: Read more

The more you read, the more information your brain retains, and what you already know helps you acquire new information.

 

This is an example of the rhizomap from the first day:

2012-10-07 15.37.10

I hope you can invest your time in something similar. This post is meant more as inspiration than as a way of teaching you speed-reading techniques.

SSIS TooManyTasks on GitHub

4 Jun

I’ve just started a project on GitHub: an SSIS solution for practising with each SSIS control flow and data flow task. I think you need to know all the tools in your belt to deliver an efficient and agile solution. Every BI developer should remember that a BI project by nature has a very short development time and a very long maintenance time, so different people are likely to work on the same project at different stages; full knowledge of the toolset is therefore necessary.

Here is the readme with the current implementation

The solution is currently based on SQL Server 2008, but I plan to fork it for 2008 and 2012 and to check the relevant differences between them.

The solution structure is shown below; I will update it from time to time to cover everything… but it will take a bit of time 8-P

image

Cloning dataset definition in SSRS and reusing it…

28 May

Very often in SSRS you need to define separate datasets and use them to build a complex report: instead of defining a monolithic dataset that grabs all the data you need, it is better to define separate, smaller datasets.

Unfortunately, SSRS does not support copy and paste to replicate the structure of a dataset and then make changes to the new one.

image

This is very annoying, especially when a dataset is parameterized.

image

Usually the parameters are logically shared among different datasets on the same page, so for the new dataset you would have to redefine the parameter names, the hierarchy levels, etc.

A faster way is to open the RDL markup [View Code] and copy and paste the dataset definition under a new name; after that, you only need to change the measures/dimensions in the new dataset…

Steps

  • open the RDL with View Code
  • find <DataSets>
  • copy and paste the <DataSet> XML content you want to clone, changing the name (it has to be unique, of course)

 

image

  • close and reopen the report and you’ll see the new cloned dataset

image

And now it’s ready to be edited in the query designer…

Example:

image
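If you prefer to script the copy instead of editing the markup by hand, the same clone step can be sketched in Python with the standard library. This is only a sketch under assumptions: the RDL fragment below is a simplified, made-up sample (a real report declares an XML namespace and has many more elements), and the names `clone_dataset`, `Sales` and `SalesByRegion` are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

# Simplified, hypothetical RDL fragment used only for this example.
SAMPLE_RDL = """<Report>
  <DataSets>
    <DataSet Name="Sales">
      <Query>
        <DataSourceName>Cube</DataSourceName>
        <CommandText>-- query text here</CommandText>
      </Query>
    </DataSet>
  </DataSets>
</Report>"""


def local_name(tag):
    """Strip a '{namespace}' prefix so the lookup works with or without one."""
    return tag.rsplit("}", 1)[-1]


def clone_dataset(root, source_name, new_name):
    """Append a deep copy of the named DataSet to <DataSets> under a new, unique name."""
    datasets = next(el for el in root.iter() if local_name(el.tag) == "DataSets")
    if new_name in [ds.get("Name") for ds in datasets]:
        raise ValueError(f"dataset name {new_name!r} is already used")
    source = next(ds for ds in datasets if ds.get("Name") == source_name)
    clone = copy.deepcopy(source)
    clone.set("Name", new_name)
    datasets.append(clone)
    return clone


root = ET.fromstring(SAMPLE_RDL)
clone_dataset(root, "Sales", "SalesByRegion")
print([ds.get("Name") for ds in root.iter("DataSet")])
```

On a real .rdl file, work on a backup copy: rewriting the XML with a script can change formatting and namespace prefixes, so check that the report still opens in the designer.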

I hope it helps, and you’ll like the post.

SSIS Framework package

9 May

A very interesting book on real-world SSIS design solutions is Microsoft SQL Server 2008 Integration Services: Problem, Design, Solution by Erik Veerman, Jessica M. Moss, Brian Knight and Jay Hackney.

One of the first chapters is particularly interesting: the authors show how to develop an SSIS framework capable of logging package execution at task level. In practice, some tasks are run on key events such as OnExecute, OnError, etc. to gather relevant information and save it in a support database.

image

The second good piece of advice is to use package configurations based on environment variables, so that deployment on different machines is hassle-free…

image

What I did was add an extra task to clear the package logs, and map a variable in the configuration to enable or disable the task (through an expression).

You can find the solution with the SSIS packages here

and you can copy the template package in

C:\Program Files (x86)\Microsoft Visual Studio 9.0\Common7\IDE\PrivateAssemblies\ProjectItems\DataTransformationProject\DataTransformationItems

This way, when you add a new item to an SSIS project, you can choose the new template.

The extra code (compared to the code in the book) handles the log tables.

By changing the value of the CleraTabLogs entry you can decide whether or not to reset the log tables.

image

You can check that in SQL Profiler if you like.

image
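To make the idea concrete outside SSIS, here is a rough Python analogy, not the package itself: an in-memory SQLite table stands in for the support database, and a flag read from an environment variable decides whether the log table is reset at the start of a run. The table name `task_log`, the variable name `CLEAR_TAB_LOGS` and all function names are assumptions made up for this sketch.

```python
import os
import sqlite3

LOG_DDL = ("CREATE TABLE IF NOT EXISTS task_log "
           "(package TEXT, task TEXT, event TEXT, message TEXT)")


def clear_logs_enabled(environ=os.environ):
    """Read the on/off flag from an environment variable (hypothetical name)."""
    return environ.get("CLEAR_TAB_LOGS", "0") == "1"


def start_run(conn, clear_logs):
    """Make sure the log table exists and reset it only when the flag is on."""
    conn.execute(LOG_DDL)
    if clear_logs:
        conn.execute("DELETE FROM task_log")


def log_event(conn, package, task, event, message=""):
    """Record one task-level event, as an event handler would."""
    conn.execute("INSERT INTO task_log VALUES (?, ?, ?, ?)",
                 (package, task, event, message))


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM task_log").fetchone()[0]


conn = sqlite3.connect(":memory:")
start_run(conn, clear_logs=False)
log_event(conn, "LoadSales", "Truncate staging", "OnError", "table not found")
start_run(conn, clear_logs_enabled({"CLEAR_TAB_LOGS": "1"}))
print(row_count(conn))
```

Keeping the flag in the environment means the same code can keep its logs on one machine and reset them on another without being edited.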

I advise reading the book first to understand the flow.

ps

CRS is a project I am working on, and for which I created this code… sorry, no time to clean up the names…

Visually BI: OLAP Modelling Concepts

23 Mar

http://mamatucci.wordpress.com/2012/03/19/visually-bi-ssas-part-1/

In this 4th part, the topic covered is the star schema and related concepts.

 

The map is here

#DW.O.1

http://dl.dropbox.com/u/63221860/BI.Blog/Visually%20OLAP%20Modeling%20Concepts/%23DW.O.1.png

 

 

types of table

fact table

types of columns

keys

foreign key (FK) values that relate rows in the fact table to rows in the dimension tables

measures

measures are the values as stored and displayed in an OLAP cube

facts

Facts are individual values stored in rows

numeric values

business metric

degenerate dimension

Fact tables can also contain columns that are neither keys nor facts.

Aggregation Type

additive

semiadditive

Nonadditive

significant storage space

justify every column added to any fact table

Fact prefix

deep and narrow

dimension table

they give context or meaning to the facts

business entity

denormalized source data

attributes

default member

convert nulls to value

allow duplicates

inclusive

when choosing which attributes to add to the dimension, being “inclusive” is preferred in dimensional modelling…

types of columns

newly generated primary key (PK)

surrogate key

when loading from disparate sources, this guarantees uniqueness

original PK from the source system

business key

additional columns

to describe the business entity

Dim Prefix

wide and not very deep

SCD

when modeling dimensional data, review the business requirements for the desired outcome when dimension member data is updated or deleted

Last change wins

Type 1

overwriting previous value

Changing Attribute

Retain some history

Type 2

Historical Attribute

adding a new record (or row value) when the dimension member value changes

Retain all history

Type 3

adding additional attributes (or column values) when the dimension member value changes

SSIS Slowly Changing Dimension wizard.
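The three update styles can be illustrated with a small Python sketch over an in-memory dimension table. This is only an illustration of the descriptions above, not what the SSIS wizard generates; the column names (`surrogate_key`, `business_key`, `valid_from`, `valid_to`) and the sample customer are assumptions made up for the example.

```python
from datetime import date


def new_dim():
    """A one-row customer dimension with a surrogate key and a business key."""
    return [{"surrogate_key": 1, "business_key": "C042", "city": "Dublin",
             "valid_from": date(2011, 1, 1), "valid_to": None}]


def apply_type1(rows, business_key, attribute, new_value):
    """Type 1: overwrite the value in place; the previous value is lost."""
    for row in rows:
        if row["business_key"] == business_key:
            row[attribute] = new_value


def apply_type2(rows, business_key, attribute, new_value, changed_on):
    """Type 2: close the current row and add a new row with a new surrogate key."""
    current = next(r for r in rows
                   if r["business_key"] == business_key and r["valid_to"] is None)
    current["valid_to"] = changed_on
    new_row = dict(current, valid_from=changed_on, valid_to=None)
    new_row[attribute] = new_value
    new_row["surrogate_key"] = max(r["surrogate_key"] for r in rows) + 1
    rows.append(new_row)


def apply_type3(rows, business_key, attribute, new_value):
    """Type 3: keep the old value in an extra column next to the new one."""
    for row in rows:
        if row["business_key"] == business_key:
            row["previous_" + attribute] = row[attribute]
            row[attribute] = new_value


dim = new_dim()
apply_type2(dim, "C042", "city", "Cork", date(2012, 3, 23))
for row in dim:
    print(row["surrogate_key"], row["city"], row["valid_from"], row["valid_to"])
```

Note how Type 2 relies on the surrogate key: both rows share the business key C042, so only the generated key tells the versions apart.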

star schema

advantages

work best in the BIDS

cleansing and validation

during the ETL process

grain statements

to determine how to create the model

key metrics for the business

the numeric values related to the business

by what factors evaluate the key metrics

by what

multiple fact tables

in case of different granularity for the same key metric

by what level of granularity

how often are the key metrics evaluated by the factors? … by day, by hour

sign-off procedure

during the requirements phase, to define them and get formal approval

SME

subject matter experts (SMEs)

snowflake

 

the difference is that a dimension is based on more than one source relational table

Dimension Usage grid

Regular

or star type: there is a zero or one-to-many relationship between the rows in the fact table and the dimension table

Referenced

or snowflake dimension: you must select the intermediate dimension table and define the relationship between the two dimension tables by selecting the appropriate key columns

intermediate tables

Materialize

it improves dimension query performance

FK between tables of the same Dim

to define nature of relationships between dims and facts and granularity level

tools

database modelling tool

to generate script

views

for etl process

source control

to save docs and diagrams too

iterative process

1. you’ll simply create the skeleton tables

2. you’ll refine the model by adding details

customer’s business terminology

 

Please like this article if you found it useful.

Mario
