"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

July 31, 2012

Big Data Conference Notes - Part I


This post is primarily notes taken during the Big Data Conference - The Fifth Elephant.

#1. Fifth Elephant Conference - Crunching Big Data, Google Scale by Rahul Kulkarni

The first session was 'Scaling Data Google Scale' by Rahul Kulkarni from Google. Captured below are notes from the session.
The session covered Google App Engine, Google Compute Engine, and how Google manages processing of huge volumes of data. The two primary factors around data processing are compute at scale and ad hoc querying on large volumes of data.
Google App Engine 

  • PaaS (Platform as a Service)
  • Stats on data processing volumes – 7.5B hits per day and 2 trillion transactions per month
Google Compute Engine

  • IaaS (Infrastructure as a Service)
  • Targeted at analytics workloads
  • Supports deploying your own cluster
  • An example of genome processing (large data sets) was shared; GCE reduced the computation time for genome processing significantly
Google White Papers

  • Google whitepapers to check out:
  • Dremel (2010)
  • Dapper (2010) – for tracing
  • Flume (2010) – data pipelines
  • Protocol Buffers (2008)
  • Chubby (2006)
Other interesting whitepapers were shared in my earlier posts.
Google's Approach to Data Processing (Ad hoc Queries)

  • BigQuery approach – uses column-oriented storage
  • Supports MapReduce jobs as well (3 phases: Mapper, Shuffler, Reducer) – a minimal sketch follows this list
  • BigQuery supports small joins; in the case of joins, the required data is moved to where the column data is located
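To make the three phases concrete, here is a minimal single-process Python sketch of my own (a word-count example, not Google's implementation); the function names and the sample input are assumptions made only for illustration.

```python
from collections import defaultdict

# Minimal word-count illustration of the three MapReduce phases
# (mapper, shuffler, reducer). Single-process sketch, not a
# distributed implementation.

def mapper(line):
    """Emit (word, 1) pairs for every word in a line."""
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Group all values by key, as the shuffle phase does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reducer(key, values):
    """Sum the counts for one key."""
    return key, sum(values)

if __name__ == "__main__":
    lines = ["big data at google scale", "big data big compute"]
    mapped = (pair for line in lines for pair in mapper(line))
    results = [reducer(key, values) for key, values in shuffle(mapped)]
    print(sorted(results))
```
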
Google Cloud based Solution for Data

  • App Engine (front end)
  • BigQuery (data processing) – a minimal query sketch follows this list
  • Cloud Storage (data storage)
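As a rough sketch of how these pieces fit together, the snippet below shows a front end submitting an ad hoc query to BigQuery. This is only my illustration of the flow, assuming the google-api-python-client library; the project ID, dataset/table names, and query are hypothetical, and authentication setup is omitted.

```python
from googleapiclient.discovery import build  # assumes google-api-python-client is installed

# Hypothetical project and table names, used only for illustration.
PROJECT_ID = "my-prototype-project"

def run_adhoc_query(service, sql):
    """Submit a synchronous ad hoc query to BigQuery and return the result rows."""
    response = service.jobs().query(
        projectId=PROJECT_ID,
        body={"query": sql},
    ).execute()
    return response.get("rows", [])

# In a real App Engine front end this would be wired to a request handler;
# the raw data itself would live in Cloud Storage / BigQuery tables.
service = build("bigquery", "v2")  # credentials/auth setup omitted for brevity
rows = run_adhoc_query(
    service,
    "SELECT word, COUNT(*) AS c FROM [my_dataset.word_counts] GROUP BY word",
)
```
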


Links Provided – developer.google.com

Key Learnings
  • The Google Cloud Platform can be used for prototypes involving big data
  • Columnar databases are gaining market share for analytics (Hadapt, Vertica, etc.) – see the small illustration below
  • Learnt about a bunch of new whitepapers from the session
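A quick illustration of why column-oriented storage suits analytics (my own toy example, not tied to any particular product): an aggregate over one column only has to touch that column, instead of scanning every field of every row.

```python
# Row-oriented layout: every record carries all of its fields.
rows = [
    {"user": "a", "country": "IN", "revenue": 10.0},
    {"user": "b", "country": "US", "revenue": 25.0},
    {"user": "c", "country": "IN", "revenue": 5.0},
]

# Column-oriented layout: one array per column.
columns = {
    "user":    ["a", "b", "c"],
    "country": ["IN", "US", "IN"],
    "revenue": [10.0, 25.0, 5.0],
}

# Analytic query: total revenue.
# Row store: scan every record and pick one field out of each.
total_row_store = sum(record["revenue"] for record in rows)

# Column store: read just the one column the query needs.
total_column_store = sum(columns["revenue"])

assert total_row_store == total_column_store == 40.0
```
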
#2. Fifth Elephant Conference – In Data We Believe Session Notes

Session by Harish Pillay from Red Hat. It briefly covered big data characteristics, opportunities, and Red Hat's offerings for big data.

What is data? 1s and 0s organized in a manner that provides meaning when interpreted.

Structured data characteristics – schema available, normalized, predictable, known

Unstructured data characteristics – semi-structured (like log files), unorganized, no fixed schema
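To make the distinction concrete, here is a small sketch of my own (not from the talk) that imposes a schema at read time on a semi-structured, Apache-style access-log line using a regular expression; the log format and field names are assumptions for illustration.

```python
import re

# A semi-structured log line: there is a pattern, but no declared schema.
log_line = '127.0.0.1 - - [31/Jul/2012:10:15:32 +0530] "GET /index.html HTTP/1.1" 200 1043'

# The schema is imposed at read time, unlike a normalized table.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<bytes>\d+)'
)

match = LOG_PATTERN.match(log_line)
if match:
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["bytes"] = int(record["bytes"])
    print(record)
```
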
Red Hat's offerings for cloud and big data were discussed; Red Hat Linux, JBoss, Red Hat Storage, and OpenShift products were highlighted.


#3. Fifth Elephant Conference – Hadoop Ecosystem Overview Session Notes
Session by Vinayak Hegde from InMobi on how InMobi manages big data processing and the tools and frameworks they rely on.

Introductory slides covered data generated in large volumes from mobile devices, social networks, financial systems, tweets, blogs, etc.

He listed a dozen open source projects for the different layers involved in data processing; the "Data Stack" slide mapping projects to layers was very good.

The session was full of tools used at each layer; unfortunately, the presentation was cut short as it exceeded the allowed duration. The tools list is a good starter kit to start exploring.
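As a starting point in that spirit, below is the classic word-count job written for Hadoop Streaming in Python (my own hedged example, not from the talk). The input/output paths are placeholders, and the exact streaming jar location varies by Hadoop distribution and version.

```python
#!/usr/bin/env python
# wordcount.py - used as both mapper and reducer with Hadoop Streaming.
# Both modes read stdin and write tab-separated key/value pairs to stdout,
# which is all Hadoop Streaming expects. Run roughly as (jar path varies
# by distribution/version, paths are placeholders):
#   hadoop jar hadoop-streaming.jar \
#       -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" \
#       -file wordcount.py -input /path/to/input -output /path/to/output
import sys

def mapper():
    """Emit 'word<TAB>1' for every word on stdin."""
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word.lower(), 1))

def reducer():
    """Sum counts per word; Hadoop delivers keys to the reducer sorted."""
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    mapper() if mode == "map" else reducer()
```
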


Key Learnings
  • Open source tools can be leveraged for custom Hadoop-based cluster setup and management; these tools are a good place to get started for large-scale Hadoop installations
Happy Learning!!! 
