"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

July 31, 2012

Big Data Conference Notes - Part III

#6. Fifth Elephant Conference – Big Data Analytics @ InMobi

I would rate this as the best session of the conference. InMobi's journey in managing growing data volumes and providing real-time analytics is impressive.
Gaurav gave a complete walkthrough of their journey from Perl to Hadoop and Pig, ending with a custom analytics platform built on top of Hadoop.
Scale of Data @ InMobi

  • 3 billion impressions per day
  • 100 primary dimensions and 300 derived dimensions
  • 50 measures
Data Characteristics
  • Highly dynamic data and analytics needs
  • Frequent addition of new dimensions
  • Dynamic query patterns
  • Canned and ad hoc reports
  • Different kinds of customers (Sales, Analysts, Executives)
  • Canned reports – day-in and day-out reports that run without any change
Journey of Analytics @ InMobi
Beginning of Analytics
  • Initially Perl scripts
  • Logs summarised using Perl
  • Perl could not handle increasing data volumes (Q2 2010)
Hadoop Adoption
 
  • Map Reduce jobs written to aggregate logs and populate Database
  • 3 machine Hadoop cluster setup was done
  • The challenge was that writing MapReduce jobs took a lot of time
  • With an increasing number of DB views, it became harder to keep up by writing custom MR jobs (a minimal sketch of such a job follows this list)
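To give a feel for the kind of job being hand-written at this stage, here is a minimal sketch of a Hadoop Streaming aggregation in Python. This is my own illustration, not InMobi's code; the log format, field positions and script name are assumptions.

    #!/usr/bin/env python
    # aggregate.py - hypothetical Hadoop Streaming job that sums
    # impressions per (country, ad_id). Assumed log format:
    # timestamp <tab> country <tab> ad_id <tab> impressions
    import sys

    def mapper():
        for line in sys.stdin:
            fields = line.strip().split('\t')
            if len(fields) == 4:
                _, country, ad_id, impressions = fields
                # Emit (dimension key, measure) pairs
                print('%s,%s\t%s' % (country, ad_id, impressions))

    def reducer():
        # Hadoop sorts mapper output by key before the reducer runs
        current_key, total = None, 0
        for line in sys.stdin:
            key, value = line.strip().split('\t')
            if key != current_key:
                if current_key is not None:
                    print('%s\t%d' % (current_key, total))
                current_key, total = key, 0
            total += int(value)
        if current_key is not None:
            print('%s\t%d' % (current_key, total))

    if __name__ == '__main__':
        mapper() if sys.argv[1:] == ['map'] else reducer()

Every new DB view meant another script like this (launched via the hadoop-streaming jar), which is exactly the overhead being described.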
Hadoop, Map Reduce and Pig
  • Pig was adopted to aggregate logs and push data into the database
  • For Complex operations custom MR jobs were written
More Analytics, More Data, Growing Measures
  • Analytics was becoming increasingly complex
  • The DB suffered ‘limited angle view’ problems
  • Hive was not mature when they tried it out: it was resource-hungry, did not create optimal jobs, and data transfer between mapper/reducer was not scalable
  • Pig jobs were written for each new requirement to fetch and load data into the DB
Realisation

He highlighted the challenges of adopting open source frameworks:
  • Too much customization and constant fine-tuning required
  • Difficult to absorb business changes while customizing the platform
  • Different open source frameworks at different parts of the stack are difficult to integrate and maintain
  • Pig is not suited for business users

YODA (InMobi Inhouse Analytics Framework)
 
  • Complete stack was custom built (ETL, Query Processor, Query builder, Visualization)
  • Built on top of Hadoop
  • SQL-like operations (sum, select, min, max; UDFs supported)
  • Optimized for storage and queries for data model
  • Protobuf was used for message exchange

Please view the session if you get a chance; it is an amazing, very informative session.

#7. Fifth Elephant Conference – Messaging Architecture @ Facebook

Facebook's principle is – “Choose the best design, not implementation, but get things done fast”
LSM Trees
  • Stores data in a set of trees
  • High write throughput
  • Recent data clustered
  • Inherently snapshotted
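The bullet points map nicely to a toy implementation. Below is a minimal sketch of the LSM idea (my own illustration, not Facebook's or HBase's code): writes land in an in-memory memtable, which is periodically flushed as an immutable sorted run, so recent data stays clustered and writes stay sequential.

    import bisect

    class TinyLSM:
        # Toy LSM tree: a mutable memtable plus immutable sorted runs
        def __init__(self, memtable_limit=4):
            self.memtable = {}        # recent writes, in memory
            self.runs = []            # immutable sorted runs (oldest first)
            self.memtable_limit = memtable_limit

        def put(self, key, value):
            self.memtable[key] = value
            if len(self.memtable) >= self.memtable_limit:
                # Flush: sort once and append as an immutable snapshot
                self.runs.append(sorted(self.memtable.items()))
                self.memtable = {}

        def get(self, key):
            if key in self.memtable:            # newest data first
                return self.memtable[key]
            for run in reversed(self.runs):     # then runs, newest to oldest
                keys = [k for k, _ in run]
                i = bisect.bisect_left(keys, key)
                if i < len(keys) and keys[i] == key:
                    return run[i][1]
            return None

    db = TinyLSM()
    for i in range(10):
        db.put('user%d' % i, i)
    assert db.get('user3') == 3

A real system also compacts runs in the background and persists them to disk; the flushed runs here are the in-memory analogue of SSTables.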
Cassandra vs HBase
  • HBase worked out better for their messaging workload
  • Cassandra – distributed database
  • HDFS – distributed storage (HBase runs on top of it)
#8. Fifth Elephant Conference – Recommendation Engine @ Flipkart
  • Built on top of Hadoop, Cassandra, Redis, Memcache
  • Cassandra is used for storing logs
  • MapReduce jobs run to identify user browse history and common patterns
  • The identified data is stored in Redis (a key-value store)
  • Caching is done using memcache (see the sketch below)
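Here is a minimal sketch of the serving path described above (my own illustration; the key names and data shapes are assumptions, and it expects a local Redis server):

    import json
    import redis  # redis-py client

    r = redis.Redis(host='localhost', port=6379)
    cache = {}  # in-process stand-in for memcache

    def store_recommendations(user_id, product_ids):
        # Written by the batch (MapReduce) pipeline
        r.set('recs:%s' % user_id, json.dumps(product_ids))

    def get_recommendations(user_id):
        key = 'recs:%s' % user_id
        if key in cache:                  # 1. check the cache first
            return cache[key]
        raw = r.get(key)                  # 2. fall back to Redis
        recs = json.loads(raw) if raw else []
        cache[key] = recs                 # 3. populate the cache
        return recs

    store_recommendations('u42', ['p1', 'p7', 'p9'])
    print(get_recommendations('u42'))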

Happy Learning!!!

Big Data Conference Notes - Part II

In continuation of the previous post.

#4. Fifth Elephant Conference – Cloud Story for Big Data by AWS Evangelist Joe Ziegler


This was a beginner session; there was no in-depth discussion of tools or architecture. Amazon is the undisputed leader right now, and its underlying tools/techniques are now being applied by competitors such as Azure, VMware Cloud and Google Cloud. After Google, Amazon harnessed the power of Hadoop and MapReduce (Elastic MapReduce) along with S3 storage, and provides them on the AWS platform.


Data becomes so large that you need to innovate to store and process it. Bigger data is harder data: multiple sources and multiple different formats. By the end of 2012, 2.7 zettabytes of data will have been generated, and 90% of it is unstructured.


Why Cloud ?
  • Elastic (spin up machines on a need basis)
  • Pay per use
  • No capital investment
  • Faster time to market
  • Focus on core competency
Cloud benefits 
  • Reusable – take a snapshot of a deployment and use it to deploy again later
  • Managed Services – Managed hosted Hadoop environment @ Amazon
  • Scale 
  • Innovation
Cloud reduces cost of experimentation

S3 – Simple storage service
A beginner-level session overall: big data and cloud are best friends, with the cloud providing the infrastructure to host and run big data workloads. AWS offerings and customer case studies were highlighted.
#5. Fifth Elephant Conference – Real-Time Analytics @ Flipkart
They explained custom real-time analytics for supply chain orders, procurements, etc. Both log files and the database are read; the data is processed and reflected in visual graphs.
A lot of custom tools were developed for automating log collection across servers and for a custom replication setup (multi-threaded approach).
I have presented a rough architecture; you can find more on these tools by searching the net.
All of it is built from open source frameworks on the Linux platform:
  • DB – MySQL
  • ElasticSearch – open source text-based indexing (similar to a DB)
  • StatsD – network daemon built on Node.js
  • StatsD layer – for RegEx patterns, aggregates, deviations (a minimal emit sketch follows this list)
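StatsD itself is simple: it listens on UDP and accepts plain-text metrics of the form name:value|type. Here is a minimal sketch of how application events could be emitted to it (my own illustration; the host, port and metric names are assumptions):

    import socket
    import time

    STATSD_ADDR = ('localhost', 8125)  # StatsD's default UDP port
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(metric):
        # Counter, e.g. "orders.created:1|c"
        sock.sendto(('%s:1|c' % metric).encode('ascii'), STATSD_ADDR)

    def timing(metric, millis):
        # Timer, e.g. "orders.process_time:25|ms"
        sock.sendto(('%s:%d|ms' % (metric, millis)).encode('ascii'), STATSD_ADDR)

    start = time.time()
    # ... process a supply-chain order ...
    incr('orders.created')
    timing('orders.process_time', int((time.time() - start) * 1000))

Because the transport is fire-and-forget UDP, instrumenting the application this way adds almost no overhead.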
Overall this is a very interesting approach: looking at both database and application events ensures data is loaded and monitored on both ends.
It could perhaps be simplified by storing everything in a NoSQL database and querying on top of it, though I am not sure that would actually simplify their approach; it depends on the production scenario and usage. All of this architecture can be implemented using .NET / Java / Python.
Happy Learning!!!

Big Data Conference Notes - Part I


This post is primarily notes taken during Big Data Conference - The Fifth Elephant.

#1. Fifth Elephant Conference - Crunching Big Data, Google Scale by Rahul Kulkarni

The first session was ‘Scaling Data, Google Scale’ by Google employee Rahul Kulkarni. Captured below are notes from the session.
The session covered Google App Engine, Google Compute Engine and how Google manages processing of huge data volumes. The two primary factors around data processing are compute at scale and ad hoc querying on large volumes of data.
Google App Engine 

  • PaaS (Platform as a Service)
  • Stats on data processing volumes – 7.5B hits per day and 2 trillion transactions per month
Google Compute Engine

  • IaaS (Infrastructure as a service)
  • Analytics workload targeted
  • Supports Deploying your own cluster
  • An example of genome processing (large data sets) was shared; GCE reduced the computation time significantly
Google White Papers

Google whitepapers to check out:
  • Dremel (2010)
  • Dapper (2010) – for tracing
  • FlumeJava (2010) – data pipelines
  • Protocol Buffers (2008)
  • Chubby (2006)
I have shared other interesting white papers in my earlier posts.
Google’s Approach for Data Processing (Adhoc Queries)

  • BigQuery approach – uses column-oriented storage (a small illustration follows this list)
  • Supports MapReduce jobs as well (3 phases: mapper, shuffler, reducer)
  • BigQuery supports small joins; when joining, the required data is moved to where the column data is located
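A tiny illustration of why column-oriented storage suits analytics (my own example, not BigQuery's implementation): an aggregate only has to touch one column's values instead of every field of every row.

    # The same three records, stored two ways
    rows = [
        {'country': 'IN', 'ad_id': 'a1', 'clicks': 3},
        {'country': 'US', 'ad_id': 'a2', 'clicks': 5},
        {'country': 'IN', 'ad_id': 'a3', 'clicks': 2},
    ]

    columns = {
        'country': ['IN', 'US', 'IN'],
        'ad_id':   ['a1', 'a2', 'a3'],
        'clicks':  [3, 5, 2],
    }

    # Row-oriented: SUM(clicks) walks whole rows
    row_sum = sum(row['clicks'] for row in rows)

    # Column-oriented: only the 'clicks' column is read
    col_sum = sum(columns['clicks'])

    assert row_sum == col_sum == 10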
Google Cloud based Solution for Data

  • App Engine (Front End)
  • Big Query (Process Data)
  • Cloud Storage (Data Storage)


Links Provided – developer.google.com

Key Learnings
  • The Google cloud platform can be used for prototypes involving big data
  • Columnar databases are gaining market share for analytics (Hadapt, Vertica, etc.)
  • I learnt about a bunch of new whitepapers from the session
#2. Fifth Elephant Conference – In Data We Believe Session Notes

Session by Harish Pillay from Red Hat; it briefly covered big data characteristics, opportunities and Red Hat's offerings for big data.

What is Data? 1’s and 0’s organized in a manner that provides meaning when interpreted

Structured Data Characteristics – Schema available, normalized, predictable, known

Unstructured Data Characteristics – semi-structured (like log files), unorganized, no fixed schema
Red Hat's offerings for cloud and big data were discussed; Red Hat Linux, JBoss, Red Hat Storage and OpenShift were highlighted.


#3. Fifth Elephant Conference – Hadoop ecosystem overview Session Notes
Session by Vinayak Hegde from InMobi on how they manage big data processing and which tools and frameworks they rely on.

Introductory slides covered data generated in large volumes from mobile, social networks, financial systems, tweets, blogs, etc.

He listed a dozen open source projects for the different layers involved in data processing; the Data Stack slide, showing the tools used at each layer, was a very good one. Unfortunately, the presentation was cut short as it exceeded the allowed duration. The tools list is a good starter kit to begin exploring.


Key Learnings
  • Open source tools can be leveraged for custom Hadoop-based cluster setup and management; they are a good place to get started for large-scale Hadoop installations
Happy Learning!!! 

July 29, 2012

Protobuf and LSM Trees

This post is based on learnings from The Fifth Elephant – Big Data Conference.

Protobuf – many sessions emphasised the adoption of Protobuf, Google's data interchange format: it outperforms XML in performance and is simple and easy to use. (Link) A minimal Python sketch follows the links below.
Protobuf tutorial - Link1, Link2
Using Protocol Buffers on .Net platform (Part I)
Protocol Buffers and WCF
Some internals of protobuf-net
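For a feel of the API, here is a minimal Python sketch. The message name, fields and generated module are hypothetical, and it assumes protoc has been run on the schema first:

    # Given a schema like:
    #   message Impression {
    #     required string ad_id   = 1;
    #     optional string country = 2;
    #     optional int64  count   = 3;
    #   }
    # `protoc --python_out=. impression.proto` generates impression_pb2.
    import impression_pb2  # hypothetical generated module

    msg = impression_pb2.Impression()
    msg.ad_id = 'a42'
    msg.country = 'IN'
    msg.count = 3

    data = msg.SerializeToString()       # compact binary wire format

    decoded = impression_pb2.Impression()
    decoded.ParseFromString(data)        # round-trip
    assert decoded.ad_id == 'a42'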


LSM (Log-Structured Merge) Trees – these were highlighted during the Facebook messaging architecture session. An academic paper on the LSM-tree is available in the link.



From Quora - What is the expanded form of the "SSTable" abbreviation used by Google's BigTable?

An SSTable provides a persistent, ordered, immutable map from keys to values [1], where both keys and values are arbitrary byte strings. In Apache HBase, the implementation of the SSTable is called HFile.
Where can I get a paper about SSTables?

Log-Structured Merge-Tree – writes are always fast regardless of the size of the dataset (append-only), and random reads are either served from memory or require a quick disk seek (see the sketch below).

From Link
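A toy sketch of the SSTable idea in Python (my own illustration): keys are written out sorted along with a small in-memory index of file offsets, so a read is one binary search plus one seek.

    import bisect

    def write_sstable(path, mapping):
        index = []                        # (key, byte offset) pairs
        with open(path, 'w') as f:
            for key in sorted(mapping):
                index.append((key, f.tell()))
                f.write('%s\t%s\n' % (key, mapping[key]))
        return index                      # kept in memory

    def read_sstable(path, index, key):
        keys = [k for k, _ in index]
        i = bisect.bisect_left(keys, key)
        if i == len(keys) or keys[i] != key:
            return None
        with open(path) as f:
            f.seek(index[i][1])           # one quick disk seek
            return f.readline().split('\t', 1)[1].rstrip('\n')

    idx = write_sstable('data.sst', {'b': '2', 'a': '1', 'c': '3'})
    assert read_sstable('data.sst', idx, 'b') == '2'

An LSM store keeps many such immutable files plus an in-memory table, merging them in the background.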


Before going into the topic, here is a refresher on B and B+ trees.

Happy Learning!!!

July 28, 2012

Big Data Conference - The Fifth Elephant

Over the last two days I attended the big data conference The Fifth Elephant. The sessions listed below were very good:
  • Scaling Data Google Scale
  • InMobi Real-Time Analytics Architecture
  • A Herd of Elephants - Navigating the Hadoop Ecosystem
  • Messaging architecture at Facebook
Some notable quotes from Big Data Conference

“Data is the new raw material for any business, on par with capital, people and labour”

“We are in the age of Data Revolution”

“Big Data is all about gaining useful insights from data while keeping cost of retrieval and storage low”

“If all you have is a hammer, everything that you look at will look like a nail” (the traditional BI approach won’t fit all problems)

Unstructured data rules the world!!


I will be posting my notes from the sessions in coming posts. You can also check out videos of the sessions from the link.


Happy Learning!!!

July 22, 2012

Windows Azure Internals - Webinar Notes


Excellent webinar with in-depth discussion on Azure design, application deployment, underlying data centre and application management architecture.




Webinar notes on Windows Azure Internals
  • Data centres are currently running in 8 locations
  • New data centres are coming up to meet growing demand
  • Data centre operating system – the Fabric Controller (the equivalent of an OS for a single machine)
  • Runs on blade servers
  • The Fabric Controller manages the hardware: provisioning it, deploying and managing applications
  • 1000 machines constitute one cluster; each cluster is managed by a Fabric Controller
  • A primary and multiple secondary Fabric Controllers run for high availability
  • 5 Fabric Controllers running on a cluster were demoed in this webinar
  • The primary Fabric Controller maintains the current state of the system and syncs with the other FCs
  • DLA architecture (old network topology); Quantum10 architecture – a bisected graph – (new network topology)
  • The Quantum10 architecture was adopted from the Bing architecture
  • Software load balancer instead of a hardware load balancer; the software load balancer also performs port mapping
  • The Fabric Controller has OS images + Host OS (Azure OS) + Maintenance OS (used for first-time boot) + Role
Sequence of How app is Deployed on individual Node
  • The Maintenance OS runs on first-time boot
  • The Windows Azure OS runs in the next step
  • The Fabric Controller agent starts and connects to the Fabric Controller
  • The machine is ready to deploy apps
RDFE (Red Dog Front End) Concepts
  • Manages subscriptions, billing, user access & service management
  • Responsible for picking clusters and deploying services (apps)

Fabric Controller Service Deployment Steps
  • Process Service model files
  • Allocate compute and node resources
  • Prepare nodes
  • Configure network attributes (load balancer; virtual IP addresses mapped to the dynamic IPs of blades)
Based on the DeploymentId, you can identify the servers running the application to troubleshoot issues.

The primary Fabric Controller connects to the FC agent, and the agent executes commands on the VM (node).
  • Best practice: use the latest version of the Windows OS for best performance

On a PaaS role instance there are 3 disks:
  • Resource disk (size of disk = size of VM)
  • Differencing disk (ensures other systems are not affected; a reset option is available)
  • Role image
  • New VMs are created for every new version of an application
  • The old role is removed and a new role image is created
IaaS Operation
  • No cached image
  • Blob storage offered
  • Only one disk (the OS disk) by default
  • IaaS offers different disk types
  • OS disk, data disk (random reads), striped data disk and temp disk (local disk spindles can be shared)
Health Maintenance
  • Overview of crash management, timeouts and server issues
  • Restart and automatic recovery logic run before finally requesting human intervention for failed recoveries
Looking forward to more Windows Azure sessions...

Happy Learning!!!

July 02, 2012

Google Cloud Platform

Google Cloud Platform offerings....

  • Google BigQuery – big data analytics (PaaS, targeted at data processing)
  • Google App Engine – an application hosting and development platform (PaaS)
  • Google Compute Engine – run large-scale computing workloads (IaaS)
  • Google Cloud Storage – store and manage data (IaaS)

More Reads
What is (isn't) Google App Engine?
Google Cloud Storage Introduction


Happy Learning!!