"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

July 31, 2012

Big Data Conference Notes - Part III

#6. Fifth Elephant Conference – Big Data Analytics @ InMobi

I would rate this as the best session of the conference. InMobi's journey in managing growing data volumes and providing real-time analytics is impressive.
Gaurav gave a complete walkthrough of their journey from Perl to Hadoop and Pig, ending with a custom analytics platform built on top of Hadoop.
Scale of Data @ InMobi

  • 3 billion impressions per day
  • 100 primary dimensions and 300 derived dimensions
  • 50 measures
Data Characteristics
  • Highly dynamic data and analytics needs
  • Frequent addition of new dimensions
  • Dynamic query patterns
  • Canned and ad hoc reports
  • Different kinds of customers (Sales, Analysts, Executives)
  • Canned reports – day-in and day-out reports that run without any change
Journey of Analytics @ InMobi
Beginning of Analytics
  • Initially Perl scripts
  • Logs summarised using Perl
  • Perl could not handle increasing data volumes (Q2 2010)
Hadoop Adoption
 
  • Map Reduce jobs written to aggregate logs and populate Database
  • 3 machine Hadoop cluster setup was done
  • The challenge was that writing MapReduce jobs took a lot of time
  • With an increasing number of DB views, it became harder to keep up by writing custom MR jobs (a minimal sketch of such a job follows this list)
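To give a feel for the kind of job being hand-written at this stage, here is a minimal sketch of a Hadoop Streaming aggregation in Python. This is my own illustration, not InMobi's code; the log format, field positions and script name are assumptions.

    #!/usr/bin/env python
    # aggregate.py - hypothetical Hadoop Streaming job that sums
    # impressions per (country, ad_id). Assumed log format:
    # timestamp <tab> country <tab> ad_id <tab> impressions
    import sys

    def mapper():
        for line in sys.stdin:
            fields = line.strip().split('\t')
            if len(fields) == 4:
                _, country, ad_id, impressions = fields
                # Emit (dimension key, measure) pairs
                print('%s,%s\t%s' % (country, ad_id, impressions))

    def reducer():
        # Hadoop sorts mapper output by key before the reducer runs
        current_key, total = None, 0
        for line in sys.stdin:
            key, value = line.strip().split('\t')
            if key != current_key:
                if current_key is not None:
                    print('%s\t%d' % (current_key, total))
                current_key, total = key, 0
            total += int(value)
        if current_key is not None:
            print('%s\t%d' % (current_key, total))

    if __name__ == '__main__':
        mapper() if sys.argv[1:] == ['map'] else reducer()

Every new DB view meant another script like this (launched via the hadoop-streaming jar), which is exactly the overhead being described.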
Hadoop, Map Reduce and Pig
  • Pig was adopted to aggregate logs and push data into the database
  • For Complex operations custom MR jobs were written
More Analytics, More Data, Growing Measures
  • Analytics was becoming increasingly complex
  • The DB suffered ‘limited angle view’ problems
  • Hive was not mature when they tried it out: it was resource-hungry, did not create optimal jobs, and data transfer between mapper/reducer was not scalable
  • Pig jobs were written for each new requirement to fetch and load data into the DB
Realisation

He highlighted the challenges of adopting open source frameworks:
  • Too much customization and constant fine-tuning required
  • Difficult to absorb business changes while customizing the platform
  • Different open source frameworks at different parts of the stack are difficult to integrate and maintain
  • Pig is not suited for business users

YODA (InMobi Inhouse Analytics Framework)
 
  • Complete stack was custom built (ETL, Query Processor, Query builder, Visualization)
  • Built on top of Hadoop
  • SQL-like operations (sum, select, min, max; UDFs supported)
  • Optimized for storage and queries for data model
  • Protobuf was used for message exchange

Please view the session if you get a chance; it is an amazing, very informative session.

#7. Fifth Elephant Conference – Messaging Architecture @ Facebook

Facebook's principle is – “Choose the best design, not implementation, but get things done fast”
LSM Trees
  • Stores data in a set of trees
  • High write throughput
  • Recent data clustered
  • Inherently snapshotted
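The bullet points map nicely to a toy implementation. Below is a minimal sketch of the LSM idea (my own illustration, not Facebook's or HBase's code): writes land in an in-memory memtable, which is periodically flushed as an immutable sorted run, so recent data stays clustered and writes stay sequential.

    import bisect

    class TinyLSM:
        # Toy LSM tree: a mutable memtable plus immutable sorted runs
        def __init__(self, memtable_limit=4):
            self.memtable = {}        # recent writes, in memory
            self.runs = []            # immutable sorted runs (oldest first)
            self.memtable_limit = memtable_limit

        def put(self, key, value):
            self.memtable[key] = value
            if len(self.memtable) >= self.memtable_limit:
                # Flush: sort once and append as an immutable snapshot
                self.runs.append(sorted(self.memtable.items()))
                self.memtable = {}

        def get(self, key):
            if key in self.memtable:            # newest data first
                return self.memtable[key]
            for run in reversed(self.runs):     # then runs, newest to oldest
                keys = [k for k, _ in run]
                i = bisect.bisect_left(keys, key)
                if i < len(keys) and keys[i] == key:
                    return run[i][1]
            return None

    db = TinyLSM()
    for i in range(10):
        db.put('user%d' % i, i)
    assert db.get('user3') == 3

A real system also compacts runs in the background and persists them to disk; the flushed runs here are the in-memory analogue of SSTables.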
Cassandra vs HBase
  • HBase worked out better for their messaging workload
  • Cassandra – distributed database
  • HDFS – distributed storage (HBase runs on top of it)
#8. Fifth Elephant Conference – Recommendation Engine @ Flipkart
  • Built on top of Hadoop, Cassandra, Redis, Memcache
  • Cassandra is used for storing logs
  • MapReduce jobs run to identify user browse history and common patterns
  • The identified data is stored in Redis (a key-value store)
  • Caching is done using memcache (see the sketch below)
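Here is a minimal sketch of the serving path described above (my own illustration; the key names and data shapes are assumptions, and it expects a local Redis server):

    import json
    import redis  # redis-py client

    r = redis.Redis(host='localhost', port=6379)
    cache = {}  # in-process stand-in for memcache

    def store_recommendations(user_id, product_ids):
        # Written by the batch (MapReduce) pipeline
        r.set('recs:%s' % user_id, json.dumps(product_ids))

    def get_recommendations(user_id):
        key = 'recs:%s' % user_id
        if key in cache:                  # 1. check the cache first
            return cache[key]
        raw = r.get(key)                  # 2. fall back to Redis
        recs = json.loads(raw) if raw else []
        cache[key] = recs                 # 3. populate the cache
        return recs

    store_recommendations('u42', ['p1', 'p7', 'p9'])
    print(get_recommendations('u42'))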

Happy Learning!!!

Big Data Conference Notes - Part II

In continuation of the previous post.

#4. Fifth Elephant Conference – Cloud Story for Big Data by AWS Evangelist Joe Ziegler


This was a beginner session; there was no in-depth discussion of tools or architecture. Amazon is the undisputed leader right now, and its underlying tools/techniques are now being applied by competitors such as Azure, VMware Cloud and Google Cloud. After Google, Amazon harnessed the power of Hadoop and MapReduce (Elastic MapReduce) along with S3 storage, and provides them on the AWS platform.


Data becomes so large that you need to innovate to store and process it. Bigger data is harder data: multiple sources and multiple different formats. By the end of 2012, 2.7 zettabytes of data will have been generated, and 90% of it is unstructured.


Why Cloud ?
  • Elastic (spin up machines on a need basis)
  • Pay per use
  • No capital investment
  • Faster time to market
  • Focus on core competency
Cloud benefits 
  • Reusable – take a snapshot of a deployment and use it to deploy again later
  • Managed Services – Managed hosted Hadoop environment @ Amazon
  • Scale 
  • Innovation
Cloud reduces cost of experimentation

S3 – Simple storage service
A beginner-level session overall: big data and cloud are best friends, with the cloud providing the infrastructure to host and run big data workloads. AWS offerings and customer case studies were highlighted.
#5. Fifth Elephant Conference – Real-Time Analytics @ Flipkart
They explained custom real-time analytics for supply chain orders, procurements, etc. Both log files and the database are read; the data is processed and reflected in visual graphs.
A lot of custom tools were developed for automating log collection across servers and for a custom replication setup (multi-threaded approach).
I have presented a rough architecture; you can find more on these tools by searching the net.
All of it is built from open source frameworks on the Linux platform:
  • DB – MySQL
  • ElasticSearch – open source text-based indexing (similar to a DB)
  • StatsD – network daemon built on Node.js
  • StatsD layer – for RegEx patterns, aggregates, deviations (a minimal emit sketch follows this list)
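StatsD itself is simple: it listens on UDP and accepts plain-text metrics of the form name:value|type. Here is a minimal sketch of how application events could be emitted to it (my own illustration; the host, port and metric names are assumptions):

    import socket
    import time

    STATSD_ADDR = ('localhost', 8125)  # StatsD's default UDP port
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(metric):
        # Counter, e.g. "orders.created:1|c"
        sock.sendto(('%s:1|c' % metric).encode('ascii'), STATSD_ADDR)

    def timing(metric, millis):
        # Timer, e.g. "orders.process_time:25|ms"
        sock.sendto(('%s:%d|ms' % (metric, millis)).encode('ascii'), STATSD_ADDR)

    start = time.time()
    # ... process a supply-chain order ...
    incr('orders.created')
    timing('orders.process_time', int((time.time() - start) * 1000))

Because the transport is fire-and-forget UDP, instrumenting the application this way adds almost no overhead.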
Overall this is a very interesting approach: looking at both database and application events ensures data is loaded and monitored on both ends.
It could perhaps be simplified by storing everything in a NoSQL database and querying on top of it, though I am not sure that would actually simplify their approach; it depends on the production scenario and usage. All of this architecture can be implemented using .NET / Java / Python.
Happy Learning!!!

Big Data Conference Notes - Part I


This post is primarily notes taken during Big Data Conference - The Fifth Elephant.

#1. Fifth Elephant Conference - Crunching Big Data, Google Scale by Rahul Kulkarni

The first session was ‘Scaling Data, Google Scale’ by Google employee Rahul Kulkarni. Captured below are notes from the session.
The session covered Google App Engine, Google Compute Engine and how Google manages processing of huge data volumes. The two primary factors around data processing are compute at scale and ad hoc querying on large volumes of data.
Google App Engine 

  • PaaS (Platform as a Service)
  • Stats on data processing volumes – 7.5B hits per day and 2 trillion transactions per month
Google Compute Engine

  • IaaS (Infrastructure as a service)
  • Analytics workload targeted
  • Supports Deploying your own cluster
  • An example of genome processing (large data sets) was shared; GCE reduced the computation time significantly
Google White Papers

Google whitepapers to check out:
  • Dremel (2010)
  • Dapper (2010) – for tracing
  • FlumeJava (2010) – data pipelines
  • Protocol Buffers (2008)
  • Chubby (2006)
I have shared other interesting white papers in my earlier posts.
Google’s Approach for Data Processing (Adhoc Queries)

  • BigQuery approach – uses column-oriented storage (a small illustration follows this list)
  • Supports MapReduce jobs as well (3 phases: mapper, shuffler, reducer)
  • BigQuery supports small joins; when joining, the required data is moved to where the column data is located
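A tiny illustration of why column-oriented storage suits analytics (my own example, not BigQuery's implementation): an aggregate only has to touch one column's values instead of every field of every row.

    # The same three records, stored two ways
    rows = [
        {'country': 'IN', 'ad_id': 'a1', 'clicks': 3},
        {'country': 'US', 'ad_id': 'a2', 'clicks': 5},
        {'country': 'IN', 'ad_id': 'a3', 'clicks': 2},
    ]

    columns = {
        'country': ['IN', 'US', 'IN'],
        'ad_id':   ['a1', 'a2', 'a3'],
        'clicks':  [3, 5, 2],
    }

    # Row-oriented: SUM(clicks) walks whole rows
    row_sum = sum(row['clicks'] for row in rows)

    # Column-oriented: only the 'clicks' column is read
    col_sum = sum(columns['clicks'])

    assert row_sum == col_sum == 10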
Google Cloud based Solution for Data

  • App Engine (Front End)
  • Big Query (Process Data)
  • Cloud Storage (Data Storage)


Links Provided – developer.google.com

Key Learnings
  • The Google cloud platform can be used for prototypes involving big data
  • Columnar databases are gaining market share for analytics (Hadapt, Vertica, etc.)
  • I learnt about a bunch of new whitepapers from the session
#2. Fifth Elephant Conference – In Data We Believe Session Notes

Session by Harish Pillay from Red Hat; it briefly covered big data characteristics, opportunities and Red Hat's offerings for big data.

What is Data? 1’s and 0’s organized in a manner that provides meaning when interpreted

Structured Data Characteristics – Schema available, normalized, predictable, known

Unstructured Data Characteristics – semi-structured (like log files), unorganized, no fixed schema
Red Hat's offerings for cloud and big data were discussed; Red Hat Linux, JBoss, Red Hat Storage and OpenShift were highlighted.


#3. Fifth Elephant Conference – Hadoop ecosystem overview Session Notes
Session by Vinayak Hegde from InMobi on how they manage big data processing and which tools and frameworks they rely on.

Introductory slides covered data generated in large volumes from mobile, social networks, financial systems, tweets, blogs, etc.

He listed a dozen open source projects for the different layers involved in data processing; the Data Stack slide, showing the tools used at each layer, was a very good one. Unfortunately, the presentation was cut short as it exceeded the allowed duration. The tools list is a good starter kit to begin exploring.


Key Learnings
  • Open source tools can be leveraged for custom Hadoop-based cluster setup and management; they are a good place to get started for large-scale Hadoop installations
Happy Learning!!! 

July 29, 2012

Protobuf and LSM Trees

This post is based on learnings from The Fifth Elephant – Big Data Conference.

Protobuf – many sessions emphasised the adoption of Protobuf, Google's data interchange format: it outperforms XML in performance and is simple and easy to use. (Link) A minimal Python sketch follows the links below.
Protobuf tutorial - Link1, Link2
Using Protocol Buffers on .Net platform (Part I)
Protocol Buffers and WCF
Some internals of protobuf-net
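For a feel of the API, here is a minimal Python sketch. The message name, fields and generated module are hypothetical, and it assumes protoc has been run on the schema first:

    # Given a schema like:
    #   message Impression {
    #     required string ad_id   = 1;
    #     optional string country = 2;
    #     optional int64  count   = 3;
    #   }
    # `protoc --python_out=. impression.proto` generates impression_pb2.
    import impression_pb2  # hypothetical generated module

    msg = impression_pb2.Impression()
    msg.ad_id = 'a42'
    msg.country = 'IN'
    msg.count = 3

    data = msg.SerializeToString()       # compact binary wire format

    decoded = impression_pb2.Impression()
    decoded.ParseFromString(data)        # round-trip
    assert decoded.ad_id == 'a42'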


LSM (Log-Structured Merge) Trees – these were highlighted during the Facebook messaging architecture session. An academic paper on the LSM-tree is available in the link.



From Quora - What is the expanded form of the "SSTable" abbreviation used by Google's BigTable?

An SSTable provides a persistent, ordered, immutable map from keys to values [1], where both keys and values are arbitrary byte strings. In Apache HBase, the implementation of the SSTable is called HFile.
Where can I get a paper about SSTables?

Log-Structured Merge-Tree – writes are always fast regardless of the size of the dataset (append-only), and random reads are either served from memory or require a quick disk seek (see the sketch below).

From Link
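A toy sketch of the SSTable idea in Python (my own illustration): keys are written out sorted along with a small in-memory index of file offsets, so a read is one binary search plus one seek.

    import bisect

    def write_sstable(path, mapping):
        index = []                        # (key, byte offset) pairs
        with open(path, 'w') as f:
            for key in sorted(mapping):
                index.append((key, f.tell()))
                f.write('%s\t%s\n' % (key, mapping[key]))
        return index                      # kept in memory

    def read_sstable(path, index, key):
        keys = [k for k, _ in index]
        i = bisect.bisect_left(keys, key)
        if i == len(keys) or keys[i] != key:
            return None
        with open(path) as f:
            f.seek(index[i][1])           # one quick disk seek
            return f.readline().split('\t', 1)[1].rstrip('\n')

    idx = write_sstable('data.sst', {'b': '2', 'a': '1', 'c': '3'})
    assert read_sstable('data.sst', idx, 'b') == '2'

An LSM store keeps many such immutable files plus an in-memory table, merging them in the background.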


Before going into the topic, here is a refresher on B and B+ trees.

Happy Learning!!!

July 28, 2012

Big Data Conference - The Fifth Elephant

Over the last two days I attended the big data conference The Fifth Elephant. The sessions listed below were very good:
  • Scaling Data Google Scale
  • InMobi Real-Time Analytics Architecture
  • A Herd of Elephants - Navigating the Hadoop Ecosystem
  • Messaging architecture at Facebook
Some notable quotes from Big Data Conference

“Data is the new raw material for any business, on par with capital, people and labour”

“We are in the age of Data Revolution”

“Big Data is all about gaining useful insights from data while keeping cost of retrieval and storage low”

“If all you have is a hammer, everything that you look at will look like a nail” (the traditional BI approach won’t fit all problems)

Unstructured data rules the world!!


I will be posting my notes from the sessions in coming posts. You can also check out videos of the sessions from the link.


Happy Learning!!!

July 22, 2012

Windows Azure Internals - Webinar Notes


Excellent webinar with in-depth discussion on Azure design, application deployment, underlying data centre and application management architecture.




Webinar notes on Windows Azure Internals
  • Data centres are currently running in 8 locations
  • New data centres are coming up to meet growing demand
  • Data centre operating system – the Fabric Controller (the equivalent of an OS for a single machine)
  • Runs on blade servers
  • The Fabric Controller manages the hardware: provisioning it, deploying and managing applications
  • 1000 machines constitute one cluster; each cluster is managed by a Fabric Controller
  • A primary and multiple secondary Fabric Controllers run for high availability
  • 5 Fabric Controllers running on a cluster were demoed in this webinar
  • The primary Fabric Controller maintains the current state of the system and syncs with the other FCs
  • DLA architecture (old network topology); Quantum10 architecture – a bisected graph – (new network topology)
  • The Quantum10 architecture was adopted from the Bing architecture
  • Software load balancer instead of a hardware load balancer; the software load balancer also performs port mapping
  • The Fabric Controller has OS images + Host OS (Azure OS) + Maintenance OS (used for first-time boot) + Role
Sequence of How app is Deployed on individual Node
  • The Maintenance OS runs on first-time boot
  • The Windows Azure OS runs in the next step
  • The Fabric Controller agent starts and connects to the Fabric Controller
  • The machine is ready to deploy apps
RDFE (Red Dog Front End) Concepts
  • Manages subscriptions, billing, user access & service management
  • Responsible for picking clusters and deploying services (apps)

Fabric Controller Service Deployment Steps
  • Process Service model files
  • Allocate compute and node resources
  • Prepare nodes
  • Configure network attributes (load balancer; virtual IP addresses mapped to the dynamic IPs of blades)
Based on the DeploymentId, you can identify the servers running the application to troubleshoot issues.

The primary Fabric Controller connects to the FC agent, and the agent executes commands on the VM (node).
  • Best practice: use the latest version of the Windows OS for best performance

On a PaaS role instance there are 3 disks:
  • Resource disk (size of disk = size of VM)
  • Differencing disk (ensures other systems are not affected; a reset option is available)
  • Role image
  • New VMs are created for every new version of an application
  • The old role is removed and a new role image is created
IaaS Operation
  • No cached image
  • Blob storage offered
  • Only one disk (the OS disk) by default
  • IaaS offers different disk types
  • OS disk, data disk (random reads), striped data disk and temp disk (local disk spindles can be shared)
Health Maintenance
  • Overview of crash management, timeouts and server issues
  • Restart and automatic recovery logic run before finally requesting human intervention for failed recoveries
Looking forward to more Windows Azure sessions...

Happy Learning!!!

July 02, 2012

Google Cloud Platform

Google Cloud Platform offerings....

  • Google BigQuery – big data analytics (PaaS, targeted at data processing)
  • Google App Engine – an application hosting and development platform (PaaS)
  • Google Compute Engine – run large-scale computing workloads (IaaS)
  • Google Cloud Storage – store and manage data (IaaS)

More Reads
What is (isn't) Google App Engine?
Google Cloud Storage Introduction


Happy Learning!!