
June 03, 2012

Hadoop Basics - Part I


This post is to get started with Hadoop basics. These are my notes on a key Hadoop feature - Fault Tolerance.

HDFS (Hadoop Distributed File System) 
  • One of the key features is Fault Tolerance
  • Inbuilt capability to handle data failure issues. Multiple copies of the same dataset are managed by the system
  • When a particular copy becomes inaccessible, the system can switch to another accessible copy of the same dataset
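As a rough illustration of this fallback, here is a toy Python sketch (not HDFS code; the block and node names are made up). A block is stored on several DataNodes, and a read simply tries each replica until it finds a reachable one:

```python
# Toy sketch of replica fallback (illustrative only, not real HDFS internals).
# Each block is stored on multiple DataNodes; a read tries replicas in turn.

replicas = {"block_0001": ["datanode1", "datanode2", "datanode3"]}  # 3 copies
unreachable = {"datanode1"}  # pretend this node is down

def read_block(block_id):
    """Return the first reachable DataNode holding the block."""
    for node in replicas[block_id]:
        if node not in unreachable:
            return node
    raise IOError("no accessible replica for " + block_id)

print(read_block("block_0001"))  # falls back to datanode2
```

The key point: as long as at least one replica survives, the read succeeds without the client noticing the failure.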
How is this achieved?

HDFS is based on Master - Slave Architecture

Master - NameNode (Manages the Metadata)
  • 1-to-many relationship: one NameNode manages many DataNodes
  • NameNode manages the metadata - how data is stored in DataNodes and how data is replicated between DataNodes is managed by the NameNode
  • The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster
  • Namenode uses a transaction log called the EditLog to record every change that occurs to the filesystem metadata
  • EditLog is stored in the Namenode’s local filesystem
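A minimal sketch of the EditLog idea, assuming a toy in-memory list (real HDFS persists the log to a file on the NameNode's local filesystem, and the record format here is invented):

```python
# Toy sketch of an append-only transaction log for metadata changes
# (illustrative only; not the real HDFS EditLog format).

edit_log = []

def record(op, path):
    """Append a metadata change to the transaction log."""
    edit_log.append((op, path))

record("CREATE", "/user/data/file1")
record("DELETE", "/tmp/old")
print(edit_log)  # every metadata change is recorded in order
```

Because the log is append-only and replayed in order, the NameNode can rebuild its metadata state after a restart.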

Slave - DataNode (Stores the Data)
  • A file is split into one or more blocks, and the set of blocks is stored across DataNodes (each file is a sequence of blocks)
  • DataNodes serve read and write requests, and perform block creation, deletion, and replication upon instruction from the NameNode
  • A BlockReport contains a list of all blocks on a DataNode (source - Link1, Link2)
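The block-splitting arithmetic can be sketched quickly (assuming the classic 64 MB default block size from the Hadoop 1.x era; newer versions default to 128 MB):

```python
# Toy sketch: how many blocks a file is split into in HDFS.
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic HDFS default block size

def num_blocks(file_size_bytes):
    """Number of blocks a file occupies (the last block may be partial)."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

print(num_blocks(200 * 1024 * 1024))  # a 200 MB file -> 4 blocks
```

Each of those blocks is then replicated independently across DataNodes, which is what makes block-level recovery possible.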
If there is an issue with data access, a missed Heartbeat is the indicator of the problem. In such cases the NameNode identifies the failed DataNode, serves the data from replicated copies on other DataNodes, and re-replicates the affected blocks so the replication factor is restored.
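The failure-detection step above can be sketched as follows (a toy model with a hypothetical timeout; real HDFS uses a much longer threshold before declaring a DataNode dead):

```python
# Toy sketch: NameNode marks DataNodes dead after missed heartbeats
# (illustrative only; the timeout and node names are made up).

HEARTBEAT_TIMEOUT = 10.0  # seconds - hypothetical threshold

# Last heartbeat timestamp received from each DataNode.
last_heartbeat = {"datanode1": 0.0, "datanode2": 95.0, "datanode3": 98.0}

def dead_nodes(now):
    """DataNodes whose heartbeat is older than the timeout."""
    return [n for n, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]

print(dead_nodes(now=100.0))  # ['datanode1'] - its blocks get re-replicated
```

Once a node lands in that list, the NameNode would schedule re-replication of its blocks from the surviving copies.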
 
 
Happy Learning!!!
