"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

June 29, 2012

My Big Data Notes - HDFS, MapReduce

Data - Contains / Represents meaningful information
Big Data - Data that challenges current limits in terms of volume and speed of data generation

Big Data refers to the large volumes of data generated from social media, e-commerce sites, mobile devices, sensors, etc. Both the volume of data generated and the rate at which it grows are huge.
Big Data Technology refers to technologies that process large volumes of data in an economically viable way using commodity hardware.

The Hadoop ecosystem is the key to Big Data Technology. Hadoop technologies are inspired by Google's infrastructure:
  • Processing - MapReduce, inspired by Google's MapReduce
  • File Storage - HDFS, inspired by the Google File System
  • Database - HBase, inspired by Bigtable
  • Pig, Hive - for Analytics

Hadoop's two core layers:
  • Storage Layer - HDFS
  • Processing Layer - MapReduce
HDFS (Hadoop Distributed File System) 

  • File system written in Java
  • High throughput; effective for reading large chunks of data
  • Runs on commodity hardware
  • Fault Tolerant (data is replicated; inaccessible data is served from replica copies, making it a highly available system)
  • Scalable (hardware can be added or removed, both for file system storage and for processing MapReduce jobs)
  • Runs on diverse environments
HDFS - Split, Scatter, Replicate and manage data across servers
HDFS Internals 
  • Based on a Master-Slave architecture
  • One NameNode and multiple DataNodes
  • The NameNode manages the filesystem metadata - storage allocation and replication
  • DataNodes perform read/write operations upon instruction from the NameNode (a minimal client sketch follows this list)
  • Each DataNode reports its status to the NameNode via heartbeats
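
To make the NameNode/DataNode interaction concrete, below is a minimal HDFS client sketch in Java, assuming the Hadoop client library is on the classpath; the NameNode address and file path are illustrative, not from any real cluster. The client asks the NameNode for metadata and streams block data to and from the DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address - replace with your cluster's
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/demo/notes.txt");   // illustrative path

        // Write: the NameNode allocates blocks, data streams to DataNodes
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hdfs");
        }

        // Read: the NameNode returns block locations, data is then read
        // sequentially from the DataNodes holding the blocks
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}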
Map Reduce 
  • Parallel data processing framework
  • Designed to execute jobs in parallel
  • Two phases - Mapper and Reducer
  • The Mapper phase splits the input into independent chunks and processes them in parallel
  • The Reducer consolidates the intermediate results produced by the individual map tasks
  • The Mapper phase must complete before the Reduce phase can start
  • Computation is performed on raw data, where the data is available (data locality)
  • Moving code to where the data is located is the cheaper, more efficient approach for large data volumes
Map Reduce - Distributed, parallel data processing framework. A word-count sketch follows.
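
To make the two phases concrete, here is a hedged sketch of the classic word-count job using the Hadoop Java MapReduce API; the class names are illustrative, and input/output paths come from the command line. The Mapper emits a (word, 1) pair for every token in its input split; after all mappers finish, the Reducer receives each word with its list of counts and sums them.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: one task per input split, run in parallel near the data
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit (word, 1)
                }
            }
        }
    }

    // Reducer: starts after the map phase, consolidates counts per word
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}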


HBase 

  • NoSQL database hosted on top of HDFS
  • Column-oriented database
  • Targeted for random reads and real-time query processing
  • HBase uses HDFS as its data storage layer, which takes care of the fault tolerance and scalability aspects (a minimal client sketch follows)
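
For a feel of random reads and writes in HBase, below is a minimal put/get sketch using the Connection-based Java client API (HBase 1.0+); the table name 'notes' and column family 'cf' are hypothetical and would need to exist on the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("notes"))) {

            // Write one cell: row key -> column family:qualifier -> value
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("msg"),
                          Bytes.toBytes("hello hbase"));
            table.put(put);

            // Random read by row key - HBase's sweet spot
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("cf"),
                                           Bytes.toBytes("msg"));
            System.out.println(Bytes.toString(value));
        }
    }
}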
Hive
  • Targeted for analytics
  • Hive is the natural choice for SQL developers
Pig
  • Scripting language for analytics queries
In the next set of posts we will look at HBase, Hive, and Pig in detail.
 
Happy Learning!!!
  

June 22, 2012

Hadoop Quick Bytes

 
I reviewed a couple of YouTube sessions on Hadoop basics. Listed below are short one-liners and fundamentals of the Hadoop framework.

Hadoop - Open-source framework, targeted for batch/offline data processing and data- and I/O-intensive applications

HDFS - Split, Scatter, Replicate and Manage Data across nodes

Map Reduce - Divide tasks, Co-locate parts of data, and manage failure across nodes

Map Reduce is a paradigm shift
  • Operates on file splits
  • Operates on one block of a file
  • Operates on key-value pairs
  • Processing does not move the data
  • Code moves to where the data is available
  • Data locality is the key in the MapReduce programming approach (see the sketch after this list)
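
Data locality can be observed directly: the NameNode knows which DataNodes hold each block of a file, and the scheduler tries to run each map task on or near one of those hosts. A small hedged sketch in Java (the file path is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/demo/big-input.txt");  // illustrative path
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per block: offset, length, and the DataNode
        // hosts holding replicas - map tasks are scheduled near these
        for (BlockLocation loc :
                 fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                loc.getOffset(), loc.getLength(),
                String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}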
 
HDFS Features
  • Fault Tolerant - when nodes fail, the replicated and distributed data is leveraged to recover lost data
  • Self-Healing - when a task allocated to a node fails, the job is reallocated to another free node
  • Scalable - new nodes can store data and participate in executing MapReduce jobs

Key Strategy Shift - a MapReduce job is executed where the data is stored. This is in sharp contrast to the traditional ETL process, where data is loaded (delta pull) from production systems, cleansed, and loaded into a target system to refresh data marts.


Happy Learning!!!!
 

June 17, 2012

Columnar Database Query Execution Basics

This post looks at how query execution works in a columnar database.


Row-based storage - this is B-tree based storage. If a clustered index is defined on the table, the data is ordered based on the clustered index key. There are a bunch of posts on MSDN and in blogs to explore this further. Early materialization happens in the case of row-based storage.


Coming to columnar storage - how does it work? From the link Query execution in columnar database:

For column-based storage
  • Late Materialization (Load all required columns and then perform required join / filters)
  • Invisible join is another interesting concept
This link was very useful for understanding columnar databases; its tutorial slides are very good on fetching only the required columns and applying the required filters in a column store. A small sketch of the idea follows.
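
To make the early vs. late materialization distinction concrete, here is a small hedged Java sketch over toy column arrays (not any real engine's API). The late strategy scans a single column, collects the matching row positions, and only then materializes the surviving rows from the other columns:

public class MaterializationSketch {
    public static void main(String[] args) {
        // Toy column store: each column is stored contiguously
        int[]    price   = { 5, 42, 17, 99, 3 };
        String[] product = { "pen", "lamp", "mug", "desk", "clip" };

        // Late materialization: scan only the price column and collect
        // matching row positions; other columns are untouched so far
        java.util.List<Integer> positions = new java.util.ArrayList<>();
        for (int row = 0; row < price.length; row++) {
            if (price[row] > 10) {          // predicate on one column
                positions.add(row);
            }
        }

        // Materialize tuples only for the surviving positions
        for (int row : positions) {
            System.out.println(product[row] + " -> " + price[row]);
        }
        // Early materialization (typical of row stores) would instead
        // stitch full rows together first, then filter row by row
    }
}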


Happy Learning!!!

June 09, 2012

TSQL Tip for a Day

Today's post is a learning from my question. Below is a sample example.
 
Step 1 - Create Table and Load Sample Data 
CREATE TABLE TestC
(Comments VARCHAR(100)) -- VARCHAR avoids CHAR padding leaking into the pivoted column names

INSERT INTO TestC VALUES('A'),('B'),('C'),('D'),('E'),('F'),('G')

SELECT * FROM TestC

Step 2 - Pivot Query

DECLARE @Query VARCHAR(MAX)
DECLARE @cols VARCHAR(MAX)

-- Build the pivot column list, e.g. [A],[B],[C],...
SET @cols = STUFF((SELECT DISTINCT ',' + QUOTENAME(Comments)
            FROM [dbo].[TestC]
            FOR XML PATH(''), TYPE
           ).value('.', 'NVARCHAR(MAX)')
        ,1,1,'')

-- The column is duplicated in the derived table because PIVOT cannot
-- aggregate the same column it pivots on
SET @Query = '
SELECT * FROM ( SELECT Comments, Comments AS Val
    FROM TestC ) AS P
PIVOT( MIN(Val) FOR Comments IN (' + @cols + ') ) pvt'

EXEC(@Query)
 
Table Entries - one row each with the values A, B, C, D, E, F and G.

Query Result - a single pivoted row with columns [A] through [G], each holding its matching value.

A couple of interesting TSQL questions and answers from the DBA Stack Exchange site.

What is the difference between SELECT COUNT(*) and SELECT COUNT(any_non_null_column)?

  • COUNT(*) counts all rows, including those where the column is NULL
  • COUNT(column_or_expression) counts only non-NULL values - for example, over the values (1, NULL, 3), COUNT(*) returns 3 while COUNT(col) returns 2
Next Set of Questions List
Happy Learning!!!

June 03, 2012

Hadoop Basics - Part II

[You may also like - Hadoop Basics - Part I]

This post is about Streaming Data Access in HDFS.
Summarising from the link:
  • HDFS is targeted for batch processing
  • Emphasis is high throughput for Data Reads
This looks fine, but how is it achieved? This question was useful for understanding it - What is meant by “streaming data access” in HDFS?

The answer is easy to interpret and understand. The key point: since data is stored sequentially and read sequentially, the cost of random reads and the time spent locating DataNodes is minimal.

Another beautiful explanation comes from the Google research paper on the Google File System.


Please feel free to add your comments.
Happy Learning!!!

Hadoop Basics - Part I


This post is to get started with Hadoop basics; it contains my notes on the Hadoop feature Fault Tolerance.

HDFS (Hadoop Distributed File System) 
  • One of the key features is Fault Tolerance
  • Inbuilt capability to handle data failures - multiple copies of the same dataset are managed by the system
  • When a particular dataset becomes inaccessible, the system can switch to another accessible copy of the same dataset
How is this achieved?

HDFS is based on a Master-Slave architecture

Master - NameNode (Manages the Data)
  • One NameNode manages many DataNodes (a many-to-one relationship between DataNodes and the NameNode)
  • The NameNode manages the data - how it is stored in the DataNodes and how it is replicated between DataNodes
  • The NameNode receives a Heartbeat and a BlockReport from each DataNode in the cluster
  • The NameNode uses a transaction log called the EditLog to record every change that occurs to the filesystem metadata
  • The EditLog is stored in the NameNode's local filesystem

Slave - DataNode (Stores the Data)
  • A file is split into one or more blocks, and the set of blocks is stored across DataNodes (each file is a sequence of blocks)
  • DataNodes serve read/write requests and perform block creation, deletion, and replication upon instruction from the NameNode
  • A BlockReport contains a list of all the blocks on a DataNode (source - Link1, Link2)
If a DataNode becomes inaccessible, missed heartbeats are the indicator of the problem. The NameNode then serves the affected data from the replicated copies available on other DataNodes, as an alternative to the inaccessible DataNode (a sketch of inspecting replication follows).
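
As a small aside, the replication that makes this recovery possible can be inspected and tuned per file through the HDFS client API. A hedged Java sketch, with an illustrative file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/demo/notes.txt");   // illustrative path

        // Current replication factor for this file (cluster default is 3)
        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication = " + status.getReplication());

        // Ask the NameNode to keep more copies of this file's blocks;
        // re-replication happens in the background
        fs.setReplication(file, (short) 5);
        fs.close();
    }
}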
 
 
Happy Learning!!!