Using Hackpad as an unstructured daily todo list

I have been using Todoist a lot to keep all my todo tasks for a while; the only thing I don't like about this type of tool is that they have been designed following a strict, structured approach…

The reality is that my tasks, your tasks, are never so well organised, so trying to add too many boundaries and structure might end up being counterproductive… things change, you need to add notes and ideas, mark what's done and focus on what to do next…

I've been using Hackpad for a few weeks now and I really like the Markdown syntax (especially if you come from a console world)…

So how can you use it to keep things effective with very little effort? I thought the best way is to give you a visual example and then let you find your own way…

Here are some screenshots of the steps to build it…


View Unstructured TODO List on Hackpad.

Plan Big, an agile pomodoro spreadsheet


I've seen too many colleagues starting their day without spending 10 minutes thinking upfront about what to do during the day, ending up with loads of wasted time and confusion… honestly, it's much better to start your day with a thought about what you've done, what you need to do today and what you plan to achieve tomorrow; this is essential in a team to have a scope and the feeling of working for something meaningful… I cannot get why these simple steps are still not a dogma in so many companies… are we supposed to be slaves of incompetent PMs/team leaders forever, making projects fail just for a lack of common sense?

You can use this Google doc here to help yourself; it takes two minutes of your precious time and it will help you to be more focused and to think about what you are working on…

To plan your daily tasks:

1st, you clone it into your own Google doc.
2nd, you update it with the daily tasks and how long you think each is going to take, considering


1 = 25 min commitment (a pomodoro).
Then, for each task, you update the DONE column.
Note the time is updated every minute and the state of Estimated Done changes when you change the DONE column; in this way you can see in real time how long it is going to take before you finish your daily tasks.
Last but not least, you might save the daily sheet to keep track of your week: just add a new sheet for each new week and copy and paste the columns at the end of the day, so you can switch off your brain and enjoy your free time.
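The sheet's estimate logic can be sketched on the shell too (the task names and numbers below are made up; the real sheet does this with formulas): the time left is just the remaining pomodoros times 25 minutes.

```shell
#!/bin/bash
# sketch of the "Estimated Done" idea: each unit is one pomodoro (25 min)
# task list: "name estimated completed" (completed = pomodoros already done)
tasks="write-report 3 1
fix-bug 2 0
review-pr 1 1"

remaining=0
while read -r name est completed ; do
    remaining=$(( remaining + est - completed ))
done <<< "$tasks"

# here: (3-1)+(2-0)+(1-1) = 4 pomodoros -> 100 minutes
echo "$(( remaining * 25 )) minutes of work left"
```

The spreadsheet recomputes this every minute; the sketch just shows the arithmetic behind the Estimated Done column.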

The MapReduce paradigm on the shell


Hadoop is hdfs + m/r; MapReduce is a programming paradigm that everyone can use, so let's get the idea behind the concept using a simple approach.

Let's get some text file to process on the shell; for example we can use War & Peace.

One approach to get some sort of counting on the text is using the well-known grep command, counting the lines that contain a specified word:

grep "Napoleon " 2600.txt | wc -l


The concept of a map is a function that works on key-value pairs; in our case we can consider the key to be the file row (or the offset from the beginning of the file) and the value the file row itself.

This simple script (the mapper) will emit a "Napoleon,1" pair for every occurrence of the word:

#!/bin/bash
# mapper: emit "Napoleon,1" for each occurrence of the word in the input
while read -r line ; do
    for word in $line ; do
        if [ "$word" = "Napoleon" ] ; then
            echo "Napoleon,1"
        fi
    done
done

N> don't forget to make the scripts executable with chmod u+x

Getting something like this:


Note that each occurrence is emitted with a 1; this is quite common practice: when a specific value is found and we are looking for its total number (a sum), we can emit pairs in this way.
Now let's define the reducer, that will sum up what has been emitted by the mapper:

#!/bin/bash
# reducer: sum up the "Napoleon,1" pairs emitted by the mapper
kcount=0
while read -r line ; do
    if [ "$line" = "Napoleon,1" ] ; then
        kcount=$((kcount+1))
    fi
done
echo "Napoleon,$kcount"

From this we will get the number of occurrences of the word:


N> the value is slightly bigger than before, as grep counted the lines while here we are counting the occurrences (probably some lines had more than one occurrence in them)
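The difference in the note above is easy to verify on a tiny file of our own (sample.txt here is made up): grep -c counts matching lines, while grep -o prints each match on its own line, so piping it to wc -l counts occurrences, which is what our map/reduce does.

```shell
#!/bin/bash
# lines vs occurrences: the second line below contains the word twice
printf 'Napoleon met Napoleon\nno match here\nNapoleon\n' > sample.txt

lines=$(grep -c "Napoleon" sample.txt)                # matching lines: 2
occurrences=$(grep -o "Napoleon" sample.txt | wc -l)  # occurrences: 3

echo "lines=$lines occurrences=$occurrences"
rm -f sample.txt
```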

The important idea is that the problem can be decomposed and parallelized without affecting the result: finding tokens in a file, specifically in its rows, is a computation that can be done independently; we can process each row separately and then sum up all the occurrences found… and the result will still be correct.
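We can simulate this decomposition on a single machine (file names and the one-line chunk size are arbitrary choices of this sketch): split the input, run one mapper per chunk as a background job, then reduce over the merged output; the total is the same as processing the file sequentially.

```shell
#!/bin/bash
# simulate map/reduce parallelism: split, map each chunk independently, reduce
rm -f chunk_* input.txt
printf 'Napoleon won\nnothing here\nNapoleon and Napoleon\n' > input.txt
split -l 1 input.txt chunk_            # one "datanode slice" per line

for f in chunk_* ; do
    # mapper: emit "Napoleon,1" per occurrence, independently per chunk
    grep -o "Napoleon" "$f" | sed 's/.*/Napoleon,1/' > "$f.out" &
done
wait                                   # all mappers finished

# reducer: sum the emitted pairs from every chunk
count=$(cat chunk_*.out | grep -c "Napoleon,1")
echo "Napoleon,$count"
rm -f chunk_* input.txt
```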

On Hadoop we can slice the input file across several datanodes (assuming the input is very big, as usually happens in big-data problems) and the algorithm will still work; in this way we can scale out just by adding nodes to the cluster.

In our case we used a simple mapper looking for a single word; the natural generalization is to generate X mappers for the N different keys we want to compute and allocate N reducers to do the counting on the mapper output: the X mappers split the input into different subsets and process it in (roughly) size/X time, then each reducer receives the mapper output for one of the 1..N different keys (words) we want to count and produces the output result.
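This generalization is the classic word count, and the shell has a nice analogue (the input file here is invented): the mapper emits "word,1" for every token, sort plays the role of Hadoop's shuffle by grouping equal keys together, and uniq -c is the reducer summing each group.

```shell
#!/bin/bash
# word count for every key at once:
#   map:     emit "<word>,1" per token
#   shuffle: sort groups identical keys (what Hadoop does between map and reduce)
#   reduce:  uniq -c sums each group
printf 'war and peace and war\n' > text.txt

result=$(tr -s ' ' '\n' < text.txt | sed 's/$/,1/' | sort | uniq -c)
echo "$result"
rm -f text.txt
```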


A closer look at hdfs


I am keeping on up-skilling on Hadoop & NoSql in general, so this post might be useful if you're doing the same…

Hdfs is optimized for streaming operations, to achieve high performance when processing huge amounts of data; the trade-off is that it's poor at random seeks, but because of the way data is meant to be processed, this is not a real limitation.
The structure is intrinsically write-once, so data cannot be deleted or changed; because there is no real re-read of the same blocks (in the streaming processing) no local cache is needed. Failure handling is a core feature: for example, even if one rack (of machines) has communication issues, hadoop is still able to keep m/r jobs going (with degraded performance, as data probably needs to be re-replicated).

Namenode and datanode

The cluster can supply different services; consider its nodes as machines that can take on different responsibilities along the time-line. Hdfs uses a master-slave model: the namenode is a traffic coordinator, providing a single namespace for the whole file-system (which of course is spread across the cluster), while the datanodes are where the data is located; the namenode looks after any datanode that fails (that is, when it's considered unreachable or down).
Let's see a diagram where a client uses hdfs: it contacts the namenode, the namenode checks the status of the datanodes and then they start to stream the data requested by the client.


Note the secondary namenode is not a failover node: it takes checkpoints of the namenode filesystem, so if the namenode goes down it can be used to restart the process; it usually runs on a separate server.

Filesystem Namespace

Hdfs supports the majority of the unix fs commands (create, move, cp etc.); these operations happen through the namenode, never by accessing the datanodes directly. Some features such as hard/soft links are not supported, as they are not relevant in the hdfs world.
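For example, the everyday operations look like their unix counterparts, just issued through the hadoop fs command (the paths below are hypothetical, and of course this fragment needs a running cluster):

```shell
# everyday hdfs operations, all coordinated by the namenode (paths are made up)
hadoop fs -mkdir /books                 # create a directory in the namespace
hadoop fs -put 2600.txt /books/         # write a local file into hdfs
hadoop fs -ls /books                    # list the directory
hadoop fs -cp /books/2600.txt /tmp/     # copy within hdfs
hadoop fs -cat /books/2600.txt | head   # stream the file back to the client
```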

Block replication
Because of the streaming nature of hdfs and the quantity of data, each data block is 64 MB.

Suppose we have a file of this size


Hdfs will break this into 5 parts (assuming the size is 64*4 + X); each block is replicated 3 times (by default), so the namenode will spread the 5 blocks over the available datanodes
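The arithmetic is simple and worth making explicit (the 270 MB file size below is a made-up example of the 64*4 + X case; 64 MB blocks and a replication factor of 3 are the defaults mentioned above):

```shell
#!/bin/bash
# how many blocks (and physical replicas) a file occupies on hdfs
filesize=270   # MB, hypothetical file of size 64*4 + X
blocksize=64   # MB, default block size
replication=3  # default replication factor

blocks=$(( (filesize + blocksize - 1) / blocksize ))  # ceiling division -> 5
total=$(( blocks * replication ))                     # 15 physical copies

echo "$blocks blocks, $total replicas across the datanodes"
```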


The namenode itself maintains an internal table of where the data is replicated, so in case of failure it will instruct a free datanode to replicate the data from the other 2 available copies.

Reading hdfs


Let's see the steps when a client reads from hdfs.
Step 1
the client calls the distributed file-system, which calls the namenode to figure out on which datanodes in the cluster the blocks (we want to read data from) are located. Don't forget that data is replicated, so we expect to get a list of datanodes holding copies of the data, from which we can read;
Step 2
this list of datanodes is sorted by proximity to the client; this makes sense when looking to minimize wait time; the namenode gains this information as the cluster is built (adding/removing nodes)
Step 3
a FSDataInputStream is returned to the client; this is an object that abstracts the process
Step 4
the client can now read the data from the datanodes; data can be streamed to the client
Step 5
connections are opened/closed during this process to read data from the cluster; don't forget it is a distributed system
Writing to hdfs
the writing is basically the mirror image; the main differences are that we have a pipeline of nodes, as the data is replicated across (3) datanodes, that we use a FSDataOutputStream, and that there is an acknowledge phase so the namenode can update its internal references once it is sure the data is correctly replicated.



Exploring the world of Hadoop Q/A

In this post some Q/A useful to explore the Hadoop world; this can be used even for a first interview, make use of it…

??? What does H(adoop) allow us to do, at a high level ??? to manage huge volumes of data, allowing big problems to be broken down into smaller elements so the parallel analysis can be done quickly and cost effectively, scaling out when necessary.

??? Which are the 2 main components of H ??? we have the H distributed filesystem (storage cluster) and the MapReduce engine to implement data processing

??? Can you add new nodes on the fly ??? yes, H can manage the scale-out dynamically, so when a node is down it can be replaced transparently to the user; the machines are called commodity hardware as they have no specialised hw.

??? Can you update data in hdfs ??? No, data is written once and read many times; that's one reason hdfs is not POSIX compliant. You can append data (as the HBASE db does)

??? what's a block ??? when a file is too big (based on the current configuration) it's split in blocks across the DataNodes (DNs)

??? how do blocks relate to DNs ??? DNs are servers that contain blocks for a given set of files; they perform data validation

??? what does the NameNode do (regarding the DataNode) ??? it manages file access, reads, writes, replication of data blocks across DataNodes.

??? what is a filesystem namespace ??? the complete collection of all the files in a cluster

??? who is smarter, the DataNode or the NameNode ??? the NameNode is the one that communicates with the DataNodes, checking their status, looking for issues and managing their access. It's usually provided with loads of ram and replication.

??? what does it mean that DataNodes are resilient ??? data blocks are replicated across multiple data nodes; this replication and all the related mechanisms provide high availability of data

??? what is a Rack ID and what is it used for ??? hdfs is rack-aware: to ensure replication efficiency the NN uses an id to keep track of where the DNs are physically located (it's faster to copy data within the same rack than between 2 different ones)

??? what are heartbeats ??? they are messages used by the NN to check the DNs' health; a DN that does not send a heartbeat will be removed for a while by the NN from the working DNs list; all of it is transparent to the user (the DN might come back later if communication is re-established)

??? how does hdfs ensure integrity ??? it uses transaction logs (to keep track of the operations, necessary for rebuilding data) and checksum validation (to ensure the data is valid and not corrupted)

??? what is the blocks metadata and where is it stored ??? it's a detailed description of when the file was created and accessed … where the blocks are stored, how many DNs and files are on the cluster, transaction log info; it's stored in the NN and loaded in RAM

??? what is a data pipeline ??? it’s a connection between multiple data nodes to support data transfer; the data in the block is forwarded to different DNs to ensure replicas

??? what is the rebalancer ??? it's an hdfs service that balances the DNs and avoids traffic congestion

??? can you draw the workflow for a 2 cluster system ???

files from hdfs –>

input to InputFormat –> (Split)* –> (RR)* –> (Map)* –> Partitioner –>

shuffle process communicating with the 2nd cluster to collect data (if any) –> Sort –> Reduce –> OutputFormat –>

files are written to hdfs

??? what is the functionality of the InputFormat and RecordReader functions ??? they convert the input file of an m/r program (we want to run) into something that can be processed

??? what is an InputSplit ??? the InputFormat processes the input file and can decide to split it into pieces; it then assigns a RR so the data can be processed by the map function (specifically as key-value pairs)

??? in an m/r job do we have a relation between key-value pairs and map ??? a map instance is defined for each key-value pair

??? what does the OutputCollector do ??? it collects the output from the independent map instances created to process the input

??? what happens after all the map tasks are completed ??? the intermediate results are gathered in partitions; a shuffling phase occurs, sorting the output so the reducers can process it efficiently

??? can we start the Reduce phase before all the mapping is done ??? No

??? what does the RecordWriter do ??? it writes the reducers' output (through the OutputFormat) to hdfs

Simple Job search on GitHub & LiVe


When looking for a new role it's always frustrating and time-wasting to check job boards and recruiter websites. A cool approach is scraping, which means downloading all the job details from a website and collecting them; this is usually a tool used in web marketing/web campaigns etc; but you need to know about the structure of the page and how to pull the relevant html tags from it.

As I need to keep an eye on the Irish market (and sometimes abroad) I spent a few hours putting together a simple ws project you can download here

It's a dummy ASP.NET page that has a list of options (job sites) and an IFRAME where the actual web page is loaded, so you can reload them in sequence with a click on the list on the right, rather than going to each site and typing the keyword and (optionally) the region/city where to search.

Usage examples

The usage is easy: select the 2 options, What and Where, pick one job site on the right column and click Reload

Then click on another web site on the right and the page will reload it with the same What and Where selection… saving you a lot of time…

(note the Region is kept in the address in this example as the ws supports it in the search query)

Open in a new page  careerjet looking for [ETL,Ireland]

How to add a new site?

The procedure is very easy; let’s do an example for

Put XXX in the keyword (it will be mapped to _what_ in the 1st screen) and YYY in the region (mapped to _Where_), something like this

Now copy and paste it and use the string to update the job panel at the bottom

Cwjobs    |   |UK|true


Note> substitute the & with &amp; as this is an xml doc, so you need to escape it…

QueryString Note

The websites that use session state rather than the query string will not work, as you cannot use this approach with them…

Dump pane

A time-saver function is the Dump panel, used to open all the websites in one go…

and scrolling down

N> I updated the ws list in the previous panel to show results only for two of them…

Update AppHarbor

I’ve found this interesting platform to deploy pages on the cloud

So I deployed the project on it…


You can use it on the web without needing to install it ;)

I hope it helps!


The CAP theorem

In NoSql solutions there is always a tradeoff between consistency, availability and partition tolerance. Thinking of relational architectures, we have the possibility to relax isolation levels to reach acceptable performance: for example with read uncommitted isolation, where data not yet committed can be read; or using the read committed transaction level to eliminate read-write conflicts; or the serializable isolation, the strictest one.

But what happens in the NoSql world, where the scenario is much different from a single-server system (where most relational db systems live)? We have a high number of clusters that communicate with each other and operate on the same data (replicated across the network); the goal (ideally) is to have all the clusters in sync with the same data, so, in a way, the clusters should be tolerant to the network partitions that might happen in case of communication issues: the lack of "communication" leads to a trade-off of consistency versus availability (if a cluster cannot communicate with the others, its data might not be consistent, so I might decide to limit the data processing and hence limit the availability of my system to reply to requests)

Let's see some examples to dig into the problem and understand the approach:

Let's suppose we have a hotel booking system, and this is our biz scenario:

Mario is booking a Holiday package (last on the left) and he's connected to a server in Dublin (where the rooms and offers are kept); in the meantime Jack is looking at the same offer and he's connected to another server in NY. To ensure consistency, both servers need to agree on who is going to get the special offer and who is not. But if the network link breaks (between Dublin and NY) we have to sacrifice availability, which means nobody can book. This is the idea of the tradeoff in practice: if the network is partitioned (line broken) you cannot have both consistency and availability…

In real-life implementations some extra bookings are allowed, or some extra rooms (in hotels) are kept to accommodate the overbooking case; this is necessary because sometimes it is not possible to implement the ideal scenario where all the data is consistent, as is possible with ACID transactions in non big-data solutions; we can rely on the aggregate-oriented transactional nature of NoSql, as those dbs support it, so the biz requirements are relevant in the design of how we build the aggregates… we can think of availability as the maximum latency the system can tolerate: when it gets too high, we give up and treat data as unavailable…