"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

June 29, 2012

My Big Data Notes - HDFS, MapReduce

Data - Contains / Represents meaningful information
Big Data - Data that challenges current limits in terms of volume and speed of data generation

Big Data refers to the large volumes of data generated from social media, e-commerce sites, mobile devices, sensors, etc. Both the volume of data generated and the rate at which it grows are huge.
Big Data Technology refers to technologies that process large volumes of data in an economically viable way using commodity hardware.

The Hadoop ecosystem is the key to Big Data Technology. Hadoop technologies are inspired by Google's infrastructure:
  • Processing - MapReduce, inspired by Google's MapReduce
  • File Storage - HDFS, inspired by the Google File System
  • Database - HBase, inspired by Bigtable
  • Pig, Hive - for Analytics

Hadoop's two core layers:
  • Storage Layer - HDFS
  • Processing Layer - MapReduce
HDFS (Hadoop Distributed File System) 

  • File system written in Java
  • High throughput; effective for reading large chunks of data
  • Runs on commodity hardware
  • Fault Tolerant (data is replicated; inaccessible data is served from replica copies, making it a highly available system)
  • Scalable (hardware can be added or removed, both for file system storage and for processing MapReduce jobs)
  • Runs on diverse environments
HDFS - Split, Scatter, Replicate and manage data across servers
HDFS Internals 
  • Based on a Master-Slave architecture
  • One NameNode and multiple DataNodes
  • The NameNode manages the filesystem metadata - storage allocation and replication
  • DataNodes perform read/write operations upon instruction from the NameNode (a minimal client sketch follows this list)
  • Each DataNode reports its status to the NameNode via heartbeats
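
To make the NameNode/DataNode interaction concrete, below is a minimal HDFS client sketch in Java, assuming the Hadoop client library is on the classpath; the NameNode address and file path are illustrative, not from any real cluster. The client asks the NameNode for metadata and streams block data to and from the DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address - replace with your cluster's
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/demo/notes.txt");   // illustrative path

        // Write: the NameNode allocates blocks, data streams to DataNodes
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hdfs");
        }

        // Read: the NameNode returns block locations, data is then read
        // sequentially from the DataNodes holding the blocks
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}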
Map Reduce 
  • Parallel data processing framework
  • Designed to execute jobs in parallel
  • Two phases - Mapper and Reducer
  • The Mapper phase splits the input into independent chunks and processes them in parallel
  • The Reducer consolidates the intermediate results produced by the individual map tasks
  • The Mapper phase must complete before the Reduce phase can start
  • Computation is performed on raw data, where the data is available (data locality)
  • Moving code to where the data is located is the cheaper, more efficient approach for large data volumes
Map Reduce - Distributed, parallel data processing framework. A word-count sketch follows.
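
To make the two phases concrete, here is a hedged sketch of the classic word-count job using the Hadoop Java MapReduce API; the class names are illustrative, and input/output paths come from the command line. The Mapper emits a (word, 1) pair for every token in its input split; after all mappers finish, the Reducer receives each word with its list of counts and sums them.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: one task per input split, run in parallel near the data
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit (word, 1)
                }
            }
        }
    }

    // Reducer: starts after the map phase, consolidates counts per word
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}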


HBase 

  • NoSQL database hosted on top of HDFS
  • Column-oriented database
  • Targeted for random reads and real-time query processing
  • HBase uses HDFS as its data storage layer, which takes care of the fault tolerance and scalability aspects (a minimal client sketch follows)
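
For a feel of random reads and writes in HBase, below is a minimal put/get sketch using the Connection-based Java client API (HBase 1.0+); the table name 'notes' and column family 'cf' are hypothetical and would need to exist on the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("notes"))) {

            // Write one cell: row key -> column family:qualifier -> value
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("msg"),
                          Bytes.toBytes("hello hbase"));
            table.put(put);

            // Random read by row key - HBase's sweet spot
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("cf"),
                                           Bytes.toBytes("msg"));
            System.out.println(Bytes.toString(value));
        }
    }
}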
Hive
  • Targeted for analytics
  • Hive is the natural choice for SQL developers
Pig
  • Scripting language for analytics queries
In the next set of posts we will look at HBase, Hive, and Pig in detail.
 
Happy Learning!!!
  

June 22, 2012

Hadoop Quick Bytes

 
I reviewed a couple of YouTube sessions on Hadoop basics. Listed below are short one-liners and fundamentals of the Hadoop framework.

Hadoop - Open-source framework, targeted for batch/offline data processing and data- and I/O-intensive applications

HDFS - Split, Scatter, Replicate and Manage Data across nodes

Map Reduce - Divide tasks, Co-locate parts of data, and manage failure across nodes

Map Reduce is a paradigm shift
  • Operates on file splits
  • Operates on one block of a file
  • Operates on key-value pairs
  • Processing does not move the data
  • Code moves to where the data is available
  • Data locality is the key in the MapReduce programming approach (see the sketch after this list)
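
Data locality can be observed directly: the NameNode knows which DataNodes hold each block of a file, and the scheduler tries to run each map task on or near one of those hosts. A small hedged sketch in Java (the file path is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/demo/big-input.txt");  // illustrative path
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per block: offset, length, and the DataNode
        // hosts holding replicas - map tasks are scheduled near these
        for (BlockLocation loc :
                 fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                loc.getOffset(), loc.getLength(),
                String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}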
 
HDFS Features
  • Fault Tolerant - when nodes fail, the replicated and distributed data is leveraged to recover lost data
  • Self-Healing - when a task allocated to a node fails, the job is reallocated to another free node
  • Scalable - new nodes can store data and participate in executing MapReduce jobs

Key Strategy Shift - a MapReduce job is executed where the data is stored. This is in sharp contrast to the traditional ETL process, where data is loaded (delta pull) from production systems, cleansed, and loaded into a target system to refresh data marts.


Happy Learning!!!!
 

June 17, 2012

Columnar Database Query Execution Basics

This post looks at how query execution works in a columnar database.


Row-based storage - this is B-tree based storage. If a clustered index is defined on the table, the data is ordered based on the clustered index key. There are a bunch of posts on MSDN and in blogs to explore this further. Early materialization happens in the case of row-based storage.


Coming to columnar storage - how does it work? From the link Query execution in columnar database:

For column-based storage
  • Late Materialization (Load all required columns and then perform required join / filters)
  • Invisible join is another interesting concept
This link was very useful for understanding columnar databases; its tutorial slides are very good on fetching only the required columns and applying the required filters in a column store. A small sketch of the idea follows.
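
To make the early vs. late materialization distinction concrete, here is a small hedged Java sketch over toy column arrays (not any real engine's API). The late strategy scans a single column, collects the matching row positions, and only then materializes the surviving rows from the other columns:

public class MaterializationSketch {
    public static void main(String[] args) {
        // Toy column store: each column is stored contiguously
        int[]    price   = { 5, 42, 17, 99, 3 };
        String[] product = { "pen", "lamp", "mug", "desk", "clip" };

        // Late materialization: scan only the price column and collect
        // matching row positions; other columns are untouched so far
        java.util.List<Integer> positions = new java.util.ArrayList<>();
        for (int row = 0; row < price.length; row++) {
            if (price[row] > 10) {          // predicate on one column
                positions.add(row);
            }
        }

        // Materialize tuples only for the surviving positions
        for (int row : positions) {
            System.out.println(product[row] + " -> " + price[row]);
        }
        // Early materialization (typical of row stores) would instead
        // stitch full rows together first, then filter row by row
    }
}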


Happy Learning!!!

June 09, 2012

TSQL Tip for a Day

Today's post is a learning from my question. Below is a sample example.
 
Step 1 - Create Table and Load Sample Data 
CREATE TABLE TestC
(Comments VARCHAR(100)) -- VARCHAR avoids CHAR padding leaking into the pivoted column names

INSERT INTO TestC VALUES('A'),('B'),('C'),('D'),('E'),('F'),('G')

SELECT * FROM TestC

Step 2 - Pivot Query

DECLARE @Query VARCHAR(MAX)
DECLARE @cols VARCHAR(MAX)

-- Build the pivot column list, e.g. [A],[B],[C],...
SET @cols = STUFF((SELECT DISTINCT ',' + QUOTENAME(Comments)
            FROM [dbo].[TestC]
            FOR XML PATH(''), TYPE
           ).value('.', 'NVARCHAR(MAX)')
        ,1,1,'')

-- The column is duplicated in the derived table because PIVOT cannot
-- aggregate the same column it pivots on
SET @Query = '
SELECT * FROM ( SELECT Comments, Comments AS Val
    FROM TestC ) AS P
PIVOT( MIN(Val) FOR Comments IN (' + @cols + ') ) pvt'

EXEC(@Query)
 
Table Entries - one row each with the values A, B, C, D, E, F and G.

Query Result - a single pivoted row with columns [A] through [G], each holding its matching value.

A couple of interesting TSQL questions and answers from the DBA Stack Exchange site.

What is the difference between SELECT COUNT(*) and SELECT COUNT(any_non_null_column)?

  • COUNT(*) counts all rows, including those where the column is NULL
  • COUNT(column_or_expression) counts only non-NULL values - for example, over the values (1, NULL, 3), COUNT(*) returns 3 while COUNT(col) returns 2
Next Set of Questions List
Happy Learning!!!

June 03, 2012

Hadoop Basics - Part II

[You may also like - Hadoop Basics - Part I]

This post is about Streaming Data Access in HDFS.
Summarising from the link:
  • HDFS is targeted for batch processing
  • Emphasis is high throughput for Data Reads
This looks fine, but how is it achieved? This question was useful for understanding it - What is meant by “streaming data access” in HDFS?

The answer is easy to interpret and understand. The key point: since data is stored sequentially and read sequentially, the cost of random reads and the time spent locating DataNodes is minimal.

Another beautiful explanation comes from the Google research paper on the Google File System.


Please feel free to add your comments.
Happy Learning!!!

Hadoop Basics - Part I


This post is to get started with Hadoop basics; it contains my notes on the Hadoop feature Fault Tolerance.

HDFS (Hadoop Distributed File System) 
  • One of the key features is Fault Tolerance
  • Inbuilt capability to handle data failures - multiple copies of the same dataset are managed by the system
  • When a particular dataset becomes inaccessible, the system can switch to another accessible copy of the same dataset
How is this achieved?

HDFS is based on a Master-Slave architecture

Master - NameNode (Manages the Data)
  • One NameNode manages many DataNodes (a many-to-one relationship between DataNodes and the NameNode)
  • The NameNode manages the data - how it is stored in the DataNodes and how it is replicated between DataNodes
  • The NameNode receives a Heartbeat and a BlockReport from each DataNode in the cluster
  • The NameNode uses a transaction log called the EditLog to record every change that occurs to the filesystem metadata
  • The EditLog is stored in the NameNode's local filesystem

Slave - DataNode (Stores the Data)
  • A file is split into one or more blocks, and the set of blocks is stored across DataNodes (each file is a sequence of blocks)
  • DataNodes serve read/write requests and perform block creation, deletion, and replication upon instruction from the NameNode
  • A BlockReport contains a list of all the blocks on a DataNode (source - Link1, Link2)
If a DataNode becomes inaccessible, missed heartbeats are the indicator of the problem. The NameNode then serves the affected data from the replicated copies available on other DataNodes, as an alternative to the inaccessible DataNode (a sketch of inspecting replication follows).
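
As a small aside, the replication that makes this recovery possible can be inspected and tuned per file through the HDFS client API. A hedged Java sketch, with an illustrative file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/demo/notes.txt");   // illustrative path

        // Current replication factor for this file (cluster default is 3)
        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication = " + status.getReplication());

        // Ask the NameNode to keep more copies of this file's blocks;
        // re-replication happens in the background
        fs.setReplication(file, (short) 5);
        fs.close();
    }
}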
 
 
Happy Learning!!!