Why huge block-sizes?
Let's say HDFS is storing a 1000 MB file.
With a 4 KB block size, 256,000 requests are required to fetch that file (one request per block).
In HDFS, those requests go across a network and come with a lot of overhead.
Additionally, each request is processed by the NameNode to figure out the block's physical location.
With 64 MB blocks, the number of requests drops to 16, which is far more efficient for network traffic.
It also reduces the load on the NameNode and shrinks the metadata for the entire file, allowing that metadata to be kept in memory.
Thus, for large files, a bigger block size in HDFS is a boon.
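The arithmetic above can be checked with a few lines of code. This is just the block-count calculation, not an HDFS API; the helper name is made up for this example:

```python
import math

def block_requests(file_size_bytes, block_size_bytes):
    # One request per block, so round the block count up.
    return math.ceil(file_size_bytes / block_size_bytes)

MB = 1024 * 1024
file_size = 1000 * MB  # the 1000 MB file from the example

print(block_requests(file_size, 4 * 1024))  # 4 KB blocks  -> 256000
print(block_requests(file_size, 64 * MB))   # 64 MB blocks -> 16
```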
Map-Reduce
Conceptually, map-reduce functions look like:
map (key1, value1) ----> list<key2, value2>
reduce (key2, list<value2>) ----> list<key3, value3>
i.e. map takes a key/value as an input and emits a list of key-value pairs.
Hadoop collects all these emitted key-value pairs, groups them by key and calls reduce for each group.
That's why the input to the "reduce" function is one key but multiple values.
The reduce function is free to emit whatever it wants, as its output is simply written to HDFS.
Each map or reduce unit of work is called a Task, and all the tasks for one map-reduce run together make up a Job.