"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

July 31, 2012

Big Data Conference Notes - Part I


This post is primarily notes taken during the Big Data Conference - The Fifth Elephant.

#1. Fifth Elephant Conference - Crunching Big Data, Google Scale by Rahul Kulkarni

The first session was 'Scaling Data Google Scale' by Rahul Kulkarni from Google. Captured below are notes from the session.
The session covered Google App Engine, Google Compute Engine, and how Google manages processing of huge volumes of data. The two primary factors around data processing are compute at scale and ad hoc querying on large volumes of data.
Google App Engine 

  • PaaS (Platform as a Service)
  • Stats on data processing volumes – 7.5B hits per day and 2 trillion transactions per month
Google Compute Engine

  • IaaS (Infrastructure as a Service)
  • Targeted at analytics workloads
  • Supports deploying your own cluster
  • An example of genome processing (large data sets) was shared; GCE reduced the computation time for genome processing significantly
Google White Papers

  • Google whitepapers to check out:
  • Dremel (2010)
  • Dapper (2010) – for tracing
  • Flume (2010) – data pipelines
  • Protocol Buffers (2008)
  • Chubby (2006)
Other interesting whitepapers were shared in my earlier posts.
Google's Approach to Data Processing (Ad hoc Queries)

  • BigQuery approach – uses column-oriented storage
  • Supports MapReduce jobs as well (3 phases: Mapper, Shuffler, Reducer) – a minimal sketch follows this list
  • BigQuery supports small joins; in the case of joins, the required data is moved to where the column data is located
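To make the three phases concrete, here is a minimal single-process Python sketch of my own (a word-count example, not Google's implementation); the function names and the sample input are assumptions made only for illustration.

```python
from collections import defaultdict

# Minimal word-count illustration of the three MapReduce phases
# (mapper, shuffler, reducer). Single-process sketch, not a
# distributed implementation.

def mapper(line):
    """Emit (word, 1) pairs for every word in a line."""
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Group all values by key, as the shuffle phase does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reducer(key, values):
    """Sum the counts for one key."""
    return key, sum(values)

if __name__ == "__main__":
    lines = ["big data at google scale", "big data big compute"]
    mapped = (pair for line in lines for pair in mapper(line))
    results = [reducer(key, values) for key, values in shuffle(mapped)]
    print(sorted(results))
```
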
Google Cloud based Solution for Data

  • App Engine (front end)
  • BigQuery (data processing) – a minimal query sketch follows this list
  • Cloud Storage (data storage)
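As a rough sketch of how these pieces fit together, the snippet below shows a front end submitting an ad hoc query to BigQuery. This is only my illustration of the flow, assuming the google-api-python-client library; the project ID, dataset/table names, and query are hypothetical, and authentication setup is omitted.

```python
from googleapiclient.discovery import build  # assumes google-api-python-client is installed

# Hypothetical project and table names, used only for illustration.
PROJECT_ID = "my-prototype-project"

def run_adhoc_query(service, sql):
    """Submit a synchronous ad hoc query to BigQuery and return the result rows."""
    response = service.jobs().query(
        projectId=PROJECT_ID,
        body={"query": sql},
    ).execute()
    return response.get("rows", [])

# In a real App Engine front end this would be wired to a request handler;
# the raw data itself would live in Cloud Storage / BigQuery tables.
service = build("bigquery", "v2")  # credentials/auth setup omitted for brevity
rows = run_adhoc_query(
    service,
    "SELECT word, COUNT(*) AS c FROM [my_dataset.word_counts] GROUP BY word",
)
```
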


Links Provided – developer.google.com

Key Learnings
  • The Google Cloud Platform can be used for prototypes involving big data
  • Columnar databases are gaining market share for analytics (Hadapt, Vertica, etc.) – see the small illustration below
  • Learnt about a bunch of new whitepapers from the session
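A quick illustration of why column-oriented storage suits analytics (my own toy example, not tied to any particular product): an aggregate over one column only has to touch that column, instead of scanning every field of every row.

```python
# Row-oriented layout: every record carries all of its fields.
rows = [
    {"user": "a", "country": "IN", "revenue": 10.0},
    {"user": "b", "country": "US", "revenue": 25.0},
    {"user": "c", "country": "IN", "revenue": 5.0},
]

# Column-oriented layout: one array per column.
columns = {
    "user":    ["a", "b", "c"],
    "country": ["IN", "US", "IN"],
    "revenue": [10.0, 25.0, 5.0],
}

# Analytic query: total revenue.
# Row store: scan every record and pick one field out of each.
total_row_store = sum(record["revenue"] for record in rows)

# Column store: read just the one column the query needs.
total_column_store = sum(columns["revenue"])

assert total_row_store == total_column_store == 40.0
```
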
#2. Fifth Elephant Conference – In Data We Believe Session Notes

Session by Harish Pillay from Red Hat. It briefly covered big data characteristics, opportunities, and Red Hat's offerings for big data.

What is data? 1s and 0s organized in a manner that provides meaning when interpreted.

Structured data characteristics – schema available, normalized, predictable, known

Unstructured data characteristics – semi-structured (like log files), unorganized, no fixed schema
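To make the distinction concrete, here is a small sketch of my own (not from the talk) that imposes a schema at read time on a semi-structured, Apache-style access-log line using a regular expression; the log format and field names are assumptions for illustration.

```python
import re

# A semi-structured log line: there is a pattern, but no declared schema.
log_line = '127.0.0.1 - - [31/Jul/2012:10:15:32 +0530] "GET /index.html HTTP/1.1" 200 1043'

# The schema is imposed at read time, unlike a normalized table.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<bytes>\d+)'
)

match = LOG_PATTERN.match(log_line)
if match:
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["bytes"] = int(record["bytes"])
    print(record)
```
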
Red Hat's offerings for cloud and big data were discussed; Red Hat Linux, JBoss, Red Hat Storage, and OpenShift products were highlighted.


#3. Fifth Elephant Conference – Hadoop Ecosystem Overview Session Notes
Session by Vinayak Hegde from InMobi on how InMobi manages big data processing and the tools and frameworks they rely on.

Introductory slides covered data generated in large volumes from mobile devices, social networks, financial systems, tweets, blogs, etc.

He listed a dozen open source projects for the different layers involved in data processing; the "Data Stack" slide mapping projects to layers was very good.

The session was full of tools used at each layer; unfortunately, the presentation was cut short as it exceeded the allowed duration. The tools list is a good starter kit to start exploring.
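As a starting point in that spirit, below is the classic word-count job written for Hadoop Streaming in Python (my own hedged example, not from the talk). The input/output paths are placeholders, and the exact streaming jar location varies by Hadoop distribution and version.

```python
#!/usr/bin/env python
# wordcount.py - used as both mapper and reducer with Hadoop Streaming.
# Both modes read stdin and write tab-separated key/value pairs to stdout,
# which is all Hadoop Streaming expects. Run roughly as (jar path varies
# by distribution/version, paths are placeholders):
#   hadoop jar hadoop-streaming.jar \
#       -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" \
#       -file wordcount.py -input /path/to/input -output /path/to/output
import sys

def mapper():
    """Emit 'word<TAB>1' for every word on stdin."""
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word.lower(), 1))

def reducer():
    """Sum counts per word; Hadoop delivers keys to the reducer sorted."""
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    mapper() if mode == "map" else reducer()
```
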


Key Learnings
  • Open source tools can be leveraged for custom Hadoop-based cluster setup and management; these tools are a good place to get started for large-scale Hadoop installations
Happy Learning!!! 
