I am continuing to up-skill on Hadoop and NoSQL in general, so this post might be useful if you’re doing the same…
HDFS is optimized for streaming operations, to achieve high performance when processing huge amounts of data. The trade-off is that it performs poorly on random seeks, but because of the way the data is meant to be processed, this is not a real limitation.
The structure is intrinsically write-once, so data cannot be deleted or changed. Because the same blocks are not re-read during streaming processing, no local cache is needed. Failure handling is a core feature: for example, even if one rack of machines has communication issues, Hadoop is still able to keep MapReduce jobs going (with degraded performance, as data probably needs to be re-replicated).
Namenode and datanode
The cluster nodes can supply different services; consider them as nodes that can take on different responsibilities over time. HDFS uses a master-slave model: the namenode is a traffic coordinator that provides a single namespace for the whole file-system (which is, of course, spread across the cluster), while the datanodes are where the data is actually located. The namenode watches for any datanode that fails (a failed datanode is considered unreachable, or down).
Let’s see a diagram of a client using HDFS: the client contacts the namenode, the namenode checks the status of the datanodes, and then the datanodes start streaming the data requested by the client.
Note that the secondary namenode is not a failover node: it takes checkpoints of the namenode’s file-system metadata, so if the namenode goes down the checkpoint can be used to restart the process. It usually runs on a separate server.
HDFS supports the majority of the Unix file-system commands (create, move, copy, etc.). These operations happen through the namenode; the manipulation never goes directly to the datanodes. Some features such as hard/soft links are not supported, as they are not relevant in the HDFS world.
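To make the “everything goes through the namenode” point concrete, here is a minimal toy sketch (not the real Hadoop API; all the class and block names are made up). Notice how a rename is a pure metadata operation: only the namespace table changes, and the datanodes holding the blocks are never contacted.

```python
# Toy model of the namenode's namespace: file-system commands are
# metadata operations, handled without touching the datanodes.

class ToyNamenode:
    def __init__(self):
        # namespace: path -> list of block ids (metadata only)
        self.namespace = {}

    def create(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def mv(self, src, dst):
        # rename: only the namespace entry changes, the blocks stay put
        self.namespace[dst] = self.namespace.pop(src)

nn = ToyNamenode()
nn.create("/logs/a.txt", ["blk_1", "blk_2"])
nn.mv("/logs/a.txt", "/logs/b.txt")
print(sorted(nn.namespace))           # ['/logs/b.txt']
print(nn.namespace["/logs/b.txt"])    # ['blk_1', 'blk_2']
```

(In the real system a copy, unlike a rename, does move data; the sketch only covers the metadata side.)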
Because of the streaming nature of HDFS and the quantity of data involved, each data block is 64 MB by default.
If we have a file of this size, HDFS will break it into 5 parts (assuming the size is 64*4 + X MB). Each block is replicated 3 times (by default), so the namenode will spread the 5 blocks across the available datanodes.
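The arithmetic above can be checked with a few lines (a sketch, assuming the default 64 MB block size and replication factor 3; X is arbitrarily taken as 10 MB here):

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB default block size
REPLICATION = 3                 # default replication factor

def num_blocks(file_size):
    # HDFS splits a file into fixed-size blocks; the last one may be partial
    return math.ceil(file_size / BLOCK_SIZE)

# a file of 64*4 + X MB (here X = 10) -> 5 blocks, 15 replicas in total
size = (64 * 4 + 10) * 1024 * 1024
print(num_blocks(size))                  # 5
print(num_blocks(size) * REPLICATION)    # 15
```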
The namenode itself maintains an in-memory table of where each block is replicated, so in case of a failure it will instruct a free datanode to replicate the data from the other 2 available copies.
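A toy sketch of that table and of the re-replication step (all datanode and block names are invented; the real namenode logic is of course far richer, with rack awareness and so on):

```python
# Toy model of the namenode's block map: when a datanode dies, every
# block it held is re-replicated to a free datanode, copying from one
# of the surviving replicas.

block_map = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}
datanodes = {"dn1", "dn2", "dn3", "dn4", "dn5"}

def handle_failure(dead, block_map, datanodes, replication=3):
    datanodes.discard(dead)
    for blk, holders in block_map.items():
        holders.discard(dead)
        while len(holders) < replication:
            # pick a datanode that doesn't already hold this block;
            # the data is copied from any of the remaining holders
            free = min(datanodes - holders)
            holders.add(free)

handle_failure("dn2", block_map, datanodes)
print(sorted(block_map["blk_1"]))   # ['dn1', 'dn3', 'dn4']
```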
Let’s see the steps when a client reads from HDFS.
The client calls the distributed file-system, which asks the namenode which datanodes in the cluster hold the block we want to read. Don’t forget that the data is replicated, so we expect a list of datanodes holding copies of the data, from which we can read.
This list of datanodes is sorted by proximity to the client, which makes sense when looking to minimize wait time. The namenode learns this topology information as the cluster is built (and as nodes are added or removed).
An FSDataInputStream is returned to the client; this is an object that abstracts the process, so the client can read the data from the datanodes: the data is streamed to the client.
Connections are opened and closed during this process to read data from the cluster; don’t forget this is a distributed system.
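The read steps above can be sketched like this (a toy simulation, not the Hadoop client; datanode names, distances, and the storage dictionary are all assumptions for illustration):

```python
# Toy model of the read path: ask the "namenode" for replica locations
# sorted by proximity, then stream from the closest reachable datanode.

replicas = {"blk_1": ["dnA", "dnB", "dnC"]}   # namenode's block map
distance = {"dnA": 2, "dnB": 0, "dnC": 1}     # network distance to the client

def get_block_locations(block_id):
    # the namenode returns the replica holders sorted by proximity
    return sorted(replicas[block_id], key=distance.__getitem__)

def read_block(block_id, storage):
    # FSDataInputStream-like behaviour: try the closest datanode first,
    # falling through to the next replica if one is unreachable
    for dn in get_block_locations(block_id):
        data = storage.get((dn, block_id))
        if data is not None:
            return data
    raise IOError("no reachable replica for " + block_id)

storage = {("dnB", "blk_1"): b"hello hdfs"}   # only dnB actually has the block
print(read_block("blk_1", storage))           # b'hello hdfs'
```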
Writing to HDFS
Writing is basically the mirror image of reading. The main differences are that we have a pipeline of datanodes, since each block is replicated across (by default) 3 of them, and that we use an FSDataOutputStream plus an acknowledgement phase, so the namenode can update its internal references once it is sure the data is correctly replicated.
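The write pipeline can be sketched in the same toy style (names are invented; the real protocol forwards packets datanode to datanode and streams acks back along the pipeline):

```python
# Toy model of the write path: data is forwarded along the pipeline of
# datanodes, and only when every datanode has acknowledged the block
# does the "namenode" consider it safely replicated.

def write_block(block_id, data, pipeline, storage):
    # forward the data along the pipeline of (by default 3) datanodes
    for dn in pipeline:
        storage[(dn, block_id)] = data
    # ack phase: every datanode confirms it stored the block
    acks = [(dn, block_id) in storage for dn in pipeline]
    return all(acks)   # only then does the namenode update its references

storage = {}
ok = write_block("blk_9", b"payload", ["dn1", "dn2", "dn3"], storage)
print(ok)              # True
print(len(storage))    # 3
```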