IMPROVING DATA LOCALITY AND AVAILABILITY IN HBASE...

Hadoop MapReduce is effective because the processing logic is brought close to the data, rather than the data being moved to the logic. Each input file is split into smaller units called blocks, and the default block size in HDFS is deliberately large: 64 MB in older releases, 128 MB in newer ones. An even larger block size can be configured when the data is known to be large enough to justify it. Each map task is assigned to one block in HDFS, so fewer blocks automatically means fewer mappers. The NameNode maintains central metadata recording the location of every block of every file, and the scheduler uses this metadata to place logic close to the data. This is how Hadoop preserves data locality.

A natural question follows: if HDFS can preserve data locality by bringing logic closer to the data, how does HBase manage to do the same? Note that HBase persists its data in HDFS: the actual data is stored in HFiles, and its write-ahead log (WAL) files are likewise stored in HDFS. HBase uses FileSystem.create(Path path) to create these files in HDFS at the desired location. Two access patterns are relevant here:

i. Direct random access to data blocks in HBase.
ii. Table scans by MapReduce, where one may ask whether care has been taken to keep the HDFS blocks close to where HBase reads them.

The kinds of events that disrupt data locality in HBase are:

1. The HBase balancer...

[... middle of paper ...]

1. ...the Assignment Manager is started by the master when the cluster is restarted.
2. The Assignment Manager scans the META table for existing region assignments.
3. If an assignment is still valid (the assigned region server is alive), the Assignment Manager keeps it as-is.
4. Regions that are not assigned to a live region server are handed to the LoadBalancerFactory, which assigns them randomly across region servers.
5. Once a region is assigned to a region server, the META table is updated accordingly.

If a region server goes down at any point, its regions become unavailable because they can no longer be accessed. The master detects and reports the region server failure, and the regions from the failed server are reassigned among the live region servers.
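The one-map-task-per-block relationship described above can be sketched as follows. This is a simplified model, not Hadoop's actual FileInputFormat split logic (which also considers split-size settings and the last partial block); the class and method names are illustrative:

```java
// Sketch (assumption: simplified model of MapReduce input splitting):
// with one map task per HDFS block, the mapper count is roughly
// ceil(fileSize / blockSize), so fewer blocks means fewer mappers.
public class MapperCountSketch {

    static long mapperCount(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes == 0) {
            return 0;
        }
        // Ceiling division without floating point.
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 1 GB file with 128 MB blocks spans 8 blocks, hence ~8 map tasks.
        System.out.println(mapperCount(1024 * mb, 128 * mb)); // 8
        // The same file with 64 MB blocks spans 16 blocks, hence ~16 map tasks.
        System.out.println(mapperCount(1024 * mb, 64 * mb));  // 16
    }
}
```

This also shows why the larger default block size matters: doubling the block size halves the number of map tasks over the same file, trading scheduling overhead against parallelism.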