This post collects some Q/A useful for exploring the Hadoop world; it can even be used to prepare for a first interview. Make use of it…
??? What does H(adoop) allow you to do at a high level ??? it manages huge volumes of data by breaking big problems down into smaller elements, so that the analysis can be done in parallel, quickly and cost effectively, scaling out when necessary.
??? Which are the 2 main components of H ??? the H distributed filesystem (hdfs, the storage cluster) and the MapReduce engine, which implements the data processing
??? Can you add new nodes on the fly ??? yes, H manages scale-out dynamically, so when a node is down it can be replaced transparently to the user; the machines are called commodity hardware because they have no specialised hw.
??? Can you update data in hdfs ??? No, data is written once and read many times, which is one reason hdfs is not POSIX compliant; you can only append data (HBase db)
??? what’s a block ??? when a file is too big (based on the current configuration) it’s split into blocks, which are spread across the DataNodes (DNs)
??? how do blocks relate to DNs ??? DNs are servers that contain blocks for a given set of files; they perform data validation
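To make the block idea concrete, here is a minimal sketch of how a file could be cut into fixed-size blocks. The 128 MB block size and the 300 MB file are assumptions chosen for illustration (the real value comes from the cluster configuration), and `split_into_blocks` is a hypothetical helper, not an hdfs API.

```python
# Assumption: 128 MB blocks; the real size depends on the cluster configuration.
BLOCK_SIZE = 128 * 1024 * 1024


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the (offset, length) of every block of a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block only takes the bytes that are left.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


# A hypothetical 300 MB file ends up in 3 blocks: 128 MB, 128 MB and 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
for index, (offset, length) in enumerate(blocks):
    print(f"block {index}: offset={offset} length={length}")
```

Each of these blocks can then be stored (and replicated) on a different DN.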
??? what does the NameNode do (regarding the DataNode) ??? it manages file access, reads, writes and the replication of data blocks across DataNodes.
??? what is a filesystem namespace ??? the complete collection of all the files in a cluster
??? which is the smarter, DataNode or NameNode ??? the NameNode: it communicates with the DataNodes, checking their status, looking for issues and managing access to them. It’s usually provided with loads of RAM and with replication.
??? what does it mean that DataNodes are resilient ??? data blocks are replicated across multiple DataNodes; this replication, together with the mechanisms around it, provides high availability of data
??? what is a Rack ID and what is it used for ??? hdfs is rack-aware: to ensure replication efficiency the NN uses an id to keep track of where the DNs are physically located (it’s faster to copy data within the same rack than between 2 different ones)
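A rough sketch of how rack ids can drive replica placement. It assumes 3 replicas and a policy of "first copy on the writer's node, the other two on a single remote rack", so that only one transfer crosses racks; the rack names, DN names and the `choose_replica_nodes` helper are hypothetical, and a real NN would also randomise its choice and look at load and free space.

```python
def choose_replica_nodes(writer, topology):
    """Pick 3 DNs for a block, given a map of rack id -> list of DNs."""
    rack_of = {dn: rack for rack, dns in topology.items() for dn in dns}
    local_rack = rack_of[writer]

    # First replica: the node where the writer runs (no network hop at all).
    chosen = [writer]

    # Second replica: a node on another rack, so a whole-rack failure
    # cannot destroy every copy. Picked deterministically here.
    remote_rack = next(
        rack for rack in sorted(topology)
        if rack != local_rack and len(topology[rack]) >= 2
    )
    chosen.append(topology[remote_rack][0])

    # Third replica: a different node on that same remote rack, so this
    # copy travels inside one rack instead of crossing racks again.
    chosen.append(topology[remote_rack][1])
    return chosen


topology = {"/rack1": ["dn1", "dn2"], "/rack2": ["dn3", "dn4"]}
print(choose_replica_nodes("dn1", topology))  # ['dn1', 'dn3', 'dn4']
```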
??? what are heartbeats ??? they are messages used by the NN to check the DNs’ health; a DN that does not send a heartbeat will be removed for a while by the NN from the list of working DNs; all of this is transparent to the user (the DN might come back later if communication is re-established)
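The heartbeat bookkeeping can be sketched in a few lines. `HeartbeatMonitor` is a hypothetical class and the 30 second timeout is an arbitrary value for the example, not an hdfs setting; time is passed in explicitly so the behaviour is easy to follow.

```python
class HeartbeatMonitor:
    """Toy version of the NN side: remember when each DN was last heard from."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}

    def heartbeat(self, datanode, now):
        # Called every time a heartbeat from a DN arrives.
        self.last_seen[datanode] = now

    def live_nodes(self, now):
        # A DN stays in the working list only if it was heard from recently.
        return sorted(
            dn for dn, seen in self.last_seen.items()
            if now - seen <= self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=30)
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
monitor.heartbeat("dn1", now=25)
print(monitor.live_nodes(now=40))  # ['dn1'] : dn2 has been silent for too long

monitor.heartbeat("dn2", now=45)   # communication is re-established
print(monitor.live_nodes(now=50))  # ['dn1', 'dn2']
```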
??? how does hdfs ensure integrity ??? it uses transaction logs (to keep track of the operations, and necessary for rebuilding data) and checksum validation (to ensure the data is valid and not corrupted)
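Checksum validation can be illustrated with a small sketch: checksums are computed per chunk when the data is written and compared again when it is read. The 512 byte chunk size and the use of plain CRC32 from `zlib` are assumptions for the example; the algorithm and chunk size actually used by hdfs may differ.

```python
import zlib

CHUNK_SIZE = 512  # assumption: one checksum per 512 bytes of data


def checksums(data, chunk_size=CHUNK_SIZE):
    """Compute one CRC32 value per chunk of the block."""
    return [
        zlib.crc32(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def find_corrupt_chunks(data, expected, chunk_size=CHUNK_SIZE):
    """Return the indexes of the chunks whose checksum no longer matches."""
    actual = checksums(data, chunk_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


block = bytes(range(256)) * 8      # 2048 bytes, i.e. 4 chunks
stored = checksums(block)          # saved next to the data at write time

damaged = bytearray(block)
damaged[600] ^= 0xFF               # corrupt one byte inside chunk 1

print(find_corrupt_chunks(block, stored))           # []
print(find_corrupt_chunks(bytes(damaged), stored))  # [1]
```

When a chunk fails the check, the reader can fall back to another replica of the same block.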
??? what is the block metadata and where is it stored ??? it’s a detailed description of when the file was created, accessed … where the blocks are stored, how many DNs and files are on the cluster, transaction log info; it’s stored in the NN and loaded in RAM
??? what is a data pipeline ??? it’s a connection between multiple DataNodes to support data transfer; the data in the block is forwarded from one DN to the next to create the replicas
??? what is the rebalancer ??? it’s an hdfs service that balances the data across the DNs and avoids traffic congestion
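A toy model of the data pipeline: the client hands the block to the first DN only, and each DN stores its copy and forwards the data to the next one. The `DataNode` class, `write_block` and the block id are hypothetical names for this sketch; acknowledgements and failure handling are left out.

```python
class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def receive(self, block_id, data, downstream):
        # Store the local replica, then pass the data along the pipeline.
        self.blocks[block_id] = data
        if downstream:
            downstream[0].receive(block_id, data, downstream[1:])


def write_block(block_id, data, pipeline):
    """The client only talks to the first DN of the pipeline."""
    pipeline[0].receive(block_id, data, pipeline[1:])


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
write_block("blk_0001", b"some bytes", nodes)
print([dn.name for dn in nodes if "blk_0001" in dn.blocks])
# ['dn1', 'dn2', 'dn3']
```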
??? can you draw the workflow for a 2 node system ???
files from hdfs –>
input to InputFormat –> (Split)* –> (RR)* –> (Map)* –> Partitioner –>
shuffle process communicating with the 2nd node to collect data (if any) –> Sort –> Reduce –> OutputFormat –>
files are written to hdfs
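The same workflow can be imitated in a single Python process with a word count. This is only a sketch of the stages, not Hadoop code: the function names, the two sample splits and the CRC32 based partitioner are assumptions made for the example, and everything that really happens across machines (shuffle over the network, writing to hdfs) is replaced by in-memory lists.

```python
import zlib
from collections import defaultdict


def record_reader(split):
    """RR: turn a split (here a list of lines) into key-value records."""
    for line_number, line in enumerate(split):
        yield line_number, line


def map_fn(key, value):
    """Map: emit (word, 1) for every word of the line."""
    for word in value.split():
        yield word.lower(), 1


def partition(key, num_reducers):
    """Partitioner: a stable hash, so a key always goes to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reduce_fn(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def run_job(splits, num_reducers=2):
    # Map side: one map task per split, output routed by the partitioner.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for key, value in record_reader(split):
            for out_key, out_value in map_fn(key, value):
                target = partition(out_key, num_reducers)
                partitions[target][out_key].append(out_value)

    # Reduce side: each reducer sees its keys in sorted order.
    output = []
    for grouped in partitions:
        output.append([reduce_fn(key, grouped[key]) for key in sorted(grouped)])
    return output


splits = [["the quick brown fox", "the lazy dog"], ["The dog barks"]]
for reducer_id, records in enumerate(run_job(splits)):
    print(reducer_id, records)
```

Each reducer ends up with a disjoint, sorted set of words, which is what allows the reducers to work independently.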
??? what is the functionality of the InputFormat and RecordReader (RR) functions ??? they convert the input file of the m/r program (we want to run) into something that can be processed
??? what is an InputSplit ??? the InputFormat processes the input file and can decide to split it in pieces (the InputSplits); it then assigns an RR to each piece, so the data can be turned into key-value pairs and processed by the map function
??? in a m/r job do we have a relation between key-value pair and map ??? the map function is invoked once for each key-value pair
??? what does the OutputCollector do ??? it collects the output from the independent map instances created to process the input
??? what happens after all the map tasks are completed ??? the intermediate results are gathered in partitions; a shuffling phase occurs, sorting the output so the reducers can process it efficiently
??? can we start the Reduce phase before all the mapping is done ??? No, a reducer needs every value for its keys, and a map task that is still running may emit more of them
??? what does the RecordWriter do ??? it writes the data coming from the reducers, through the OutputFormat, to hdfs
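A sketch of the split plus RR idea for a text file. The splits are plain byte ranges, so a line can straddle two of them; the convention assumed here is that a split other than the first skips its first (possibly partial) line and every split reads on until the line that starts at or before its end is complete. The key is the byte offset of the line and the value is its text. `make_splits` and `read_records` are hypothetical names and the 8 byte split size is only there to force lines across boundaries.

```python
def make_splits(total_size, split_size):
    """Cut the file into (start, length) byte ranges."""
    return [
        (start, min(split_size, total_size - start))
        for start in range(0, total_size, split_size)
    ]


def read_records(data, start, length):
    """RR for one split: yield (byte offset, line) pairs."""
    end = start + length
    pos = start
    if start != 0:
        # The first line here is read by the previous split: skip it.
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        yield pos, data[pos:line_end].decode("utf-8")
        pos = line_end + 1


data = b"alpha\nbeta\ngamma\ndelta\n"
for start, length in make_splits(len(data), split_size=8):
    print((start, length), list(read_records(data, start, length)))
# (0, 8) [(0, 'alpha'), (6, 'beta')]
# (8, 8) [(11, 'gamma')]
# (16, 7) [(17, 'delta')]
```

Every line is delivered exactly once, even though none of the split boundaries falls on a line break.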