"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

February 29, 2016

Naive Bayes Classifier

Naive Bayes Classifier Notes and Examples

  • Works on the assumption that the occurrence of word i does not depend on the occurrence of word i+1
  • In practice, a sentence carries context only when words occur with the appropriate terms and in the appropriate positions, so the independence assumption is a simplification
  • For example purposes, two classes and a test document to classify are listed below
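
The idea above can be sketched in a few lines of Python. This is a minimal illustration, not the example from the session: the training sentences, the class names and the helper names `train` and `classify` are all invented here, and add-one (Laplace) smoothing is an assumption added so that unseen word/class pairs do not zero out a score.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (class_label, text). Returns priors, per-class word counts, vocabulary."""
    class_docs = Counter(label for label, _ in docs)
    word_counts = {label: Counter() for label in class_docs}
    for label, text in docs:
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    priors = {label: n / len(docs) for label, n in class_docs.items()}
    return priors, word_counts, vocab

def classify(text, priors, word_counts, vocab):
    """Pick the class with the highest log posterior, using add-one smoothing."""
    scores = {}
    for label, prior in priors.items():
        total = sum(word_counts[label].values())
        score = math.log(prior)
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Each word contributes independently: this is the "naive" assumption.
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Invented toy data: two classes and one test document.
training = [
    ("sports", "great match great goal"),
    ("sports", "team won the match"),
    ("politics", "election vote result"),
    ("politics", "the minister won the vote"),
]
model = train(training)
predicted = classify("the team scored a goal", *model)
print(predicted)
```

Word order plays no part in the score, which is exactly the simplification described in the first bullet.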













Ref - Link

Happy Learning!!!

February 23, 2016

Hierarchical Clustering


  • Compute the distance between every pair of clusters
  • Merge the nearest pair, and repeat until the number of clusters equals the number of clusters needed
  • The entire process can be represented as a dendrogram
  • At the end of the algorithm the dendrogram is plotted
Measuring Distance between clusters
  • Single (minimum distance between two points, one from each cluster)
  • Complete (maximum distance between two points, one from each cluster)
  • Average (average distance over all possible pairs of points, one from each cluster)
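
The merge loop and the three linkage rules can be sketched as follows. This is a rough, unoptimised illustration using only the standard library; the function names `linkage_distance` and `agglomerative` and the sample points are made up for this sketch, and the merge history stands in for the information a dendrogram would draw.

```python
import math
from itertools import combinations

def linkage_distance(a, b, method):
    """Distance between clusters a and b (lists of points) under a linkage rule."""
    dists = [math.dist(p, q) for p in a for q in b]
    if method == "single":
        return min(dists)
    if method == "complete":
        return max(dists)
    if method == "average":
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {method}")

def agglomerative(points, k, method="single"):
    """Merge the two nearest clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    merges = []                       # merge history, in the order the merges happened
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], method),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters, merges

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = agglomerative(points, k=3)
print(clusters)
```

Changing `method` to "complete" or "average" changes only how the distance between two clusters is measured; the merge loop stays the same.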

Happy Learning!!!

K-medoids, K-means

Great learning; a lot of revision is needed to really deep dive and understand the fundamentals.

K-means
  • Sensitive to outliers (squared Euclidean distance gives greater weight to more distant points)
  • Can't handle categorical data
  • Works with Euclidean distance only
K-Medoids
  • Centres are restricted to actual data points (the medoids)
  • The same sum-of-squares cost function is used, but the distance need not be Euclidean
  • Use your own custom distance function when both numerical and categorical variables are involved
  • Example: a variable with 25 languages is encoded as 24 indicator columns, and M/F/N as 2 columns; compute your own custom distance function over them. It is one column fewer because the all-zeros combination is itself treated as one category
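
A small sketch of both ideas, under stated assumptions. `dummy_encode` shows why k categories need only k-1 columns. `mixed_distance` is a hypothetical custom distance (the age scale of 50 and the 0/1 mismatch penalty are arbitrary choices for illustration), and `k_medoids` is a brute-force search over every possible set of medoids rather than the PAM procedure, so it only suits tiny data sets. The cost here sums plain distances; square them to match the sum-of-squares cost mentioned above.

```python
from itertools import combinations

def dummy_encode(value, categories):
    """k categories -> k-1 indicator columns; the first category is the all-zeros row."""
    return [1 if value == c else 0 for c in categories[1:]]

def mixed_distance(a, b):
    """Hypothetical distance for records of (age, language, gender)."""
    age_part = abs(a[0] - b[0]) / 50.0            # assumed scale for the numeric part
    lang_part = 0.0 if a[1] == b[1] else 1.0      # simple mismatch for a category
    gender_part = 0.0 if a[2] == b[2] else 1.0
    return age_part + lang_part + gender_part

def total_cost(points, medoids, dist):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def k_medoids(points, k, dist):
    """Try every k-subset of the data as medoids and keep the cheapest one."""
    return min(combinations(points, k), key=lambda ms: total_cost(points, ms, dist))

# Invented records: (age, language, gender)
records = [
    (25, "Tamil", "F"), (27, "Tamil", "F"), (26, "Tamil", "M"),
    (60, "Hindi", "M"), (62, "Hindi", "M"), (58, "Hindi", "F"),
]
medoids = k_medoids(records, k=2, dist=mixed_distance)
print(dummy_encode("N", ["M", "F", "N"]))
print(medoids)
```

Because the centres are always rows of `records`, no mean of a categorical column is ever needed, which is what lets K-medoids work with categorical data.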
Distance measure for numerical variables
  • Euclidean based distance
  • Correlation based distance
  • Mahalanobis distance
Distance measure for category variables
  • Matching coefficient and Jaccard's coefficient
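
For binary (0/1) attribute vectors the two coefficients can be sketched as below. The function names are chosen for this illustration, and returning 1.0 when neither vector contains a 1 is an assumed convention, since the Jaccard ratio is undefined in that case.

```python
def matching_coefficient(a, b):
    """Share of positions where two binary vectors agree (1-1 and 0-0 both count)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    agree = sum(1 for x, y in zip(a, b) if x == y)
    return agree / len(a)

def jaccard_coefficient(a, b):
    """1-1 matches divided by positions where at least one vector has a 1 (0-0 ignored)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 1.0  # assumed convention for two all-zero vectors

a = [1, 0, 0, 1, 0]
b = [1, 0, 0, 0, 0]
print(matching_coefficient(a, b), jaccard_coefficient(a, b))
```

The two differ in how shared zeros are treated: the matching coefficient counts them as agreement, while Jaccard ignores them, which matters when most attributes are absent in most records.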

Happy Learning!!!

February 22, 2016

R and SQL Server

This post is an example of querying SQL Server and visualizing data using twitter. The package used is RODBC. A sample walk-through code snippet is provided.

Happy Learning!!!

February 19, 2016

February 18, 2016

Cluster Analytics - Deep Dive on K Means

Had a good session on K-means clustering. Code snippet and notes are in this post.

Clustering - Assignment of observations into subsets whose members are similar in some sense

K-Means Clustering
  • Widely used algorithm
  • You need to decide on the number of groups (K) up front
How it works?
  • Start with a random guess of the cluster centres
  • Go through every point and compute its distance to each cluster centre (for example C1 and C2)
  • Assign each point to its nearest centre, recompute each centre as the mean of its points, and repeat until the assignments stop changing
How to measure good clustering?
  • Intra-cluster distances minimized
  • Inter-cluster distances maximized
Cost function
  • The sum of squared distances from each point to its cluster centre should be minimum
  • From iteration to iteration the cost function never increases
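
The loop and the cost function can be sketched as follows. This is a bare-bones illustration, not the code from the session: the function names and sample points are invented, and the starting centres are passed in explicitly (instead of guessed at random) so that the run is repeatable.

```python
import math

def assign(points, centres):
    """Index of the nearest centre for every point."""
    return [min(range(len(centres)), key=lambda c: math.dist(p, centres[c])) for p in points]

def cost(points, centres, labels):
    """Sum of squared distances from each point to its assigned centre."""
    return sum(math.dist(p, centres[l]) ** 2 for p, l in zip(points, labels))

def k_means(points, centres, max_iter=100):
    """Alternate assignment and centre update; returns centres, labels, cost history."""
    history = []
    for _ in range(max_iter):
        labels = assign(points, centres)
        history.append(cost(points, centres, labels))
        new_centres = []
        for c in range(len(centres)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                # New centre = coordinate-wise mean of the cluster's points
                new_centres.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centres.append(centres[c])  # keep an empty cluster's centre where it is
        if new_centres == centres:
            break  # centres stopped moving, so assignments will not change either
        centres = new_centres
    return centres, labels, history

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centres, labels, history = k_means(points, centres=[(0.0, 0.0), (5.0, 5.0)])
print(labels)
print(history)
```

Printing `history` shows the cost falling and then flattening, which is the "keeps decreasing" behaviour noted above. Different starting centres can end in different final clusterings.
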
Learning Points
  • While measuring clustering quality, both a global centre (the centre of all points) and local centres (one per cluster) were identified
  • For all points, the sum of squares was computed against the global cluster centre
Mathematical Learnings 
What is the sum-of-squared-distances method?
What is Euclidean distance?
  • The distance between two points in the plane with coordinates (x, y) and (a, b) is given by $\sqrt{(x - a)^2 + (y - b)^2}$
  • Link - ref
What are local optima?
  • Local optima are the best solutions relative to a neighbouring set of solutions, as opposed to the best solution overall.
How to choose the value of K?

Elbow method - Plot the cost function against the number of clusters. The point where the sharp decline levels off (the elbow) denotes the optimum number of clusters.


Elbow method - Plot for the elbow method. At centres = 3 there is a steep fall, which means 3 is the optimum number.
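
The numbers behind such a plot can be produced with a sketch like the one below. It is an illustration under assumptions: the data are invented 1-D values with three obvious groups, `k_means_cost` is a hypothetical helper, and the starting centres are picked at evenly spaced positions in the sorted data (rather than at random) so the result is repeatable.

```python
def k_means_cost(values, k, iterations=50):
    """Run a basic 1-D k-means and return the final sum of squared distances."""
    ordered = sorted(values)
    # Deterministic start (an assumption for repeatability): evenly spaced picks.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for v in values:
            nearest = min(range(k), key=lambda c: (v - centres[c]) ** 2)
            groups[nearest].append(v)
        # Move each centre to the mean of its group; leave empty groups untouched.
        centres = [sum(g) / len(g) if g else centres[i] for i, g in enumerate(groups)]
    return sum(min((v - c) ** 2 for c in centres) for v in values)

values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7, 9.0, 9.2, 8.8]
costs = {k: k_means_cost(values, k) for k in range(1, 7)}
drops = {k: costs[k - 1] - costs[k] for k in range(2, 7)}  # improvement gained by adding cluster k

for k in sorted(costs):
    print(k, round(costs[k], 2))
```

For this data the cost falls sharply up to k = 3 and barely moves afterwards, so the elbow sits at 3. On real data the bend is often less clear-cut and the choice involves some judgment.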

  • Hard Clustering - Each object belongs to exactly one cluster.
  • Soft Clustering - An object can belong to more than one cluster, with a probability expressing how well it fits each cluster.
Other Techniques
  • Remove correlation before computing distances
  • Mahalanobis distance measure
  • Correlation-based distance (1 - correlation coefficient)
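
Both measures can be written out by hand for small cases. The sketch below is limited to two-dimensional points so the 2x2 covariance matrix can be inverted explicitly; the function names and sample data are invented for illustration, and the covariance uses the sample (n - 1) estimate.

```python
import math

def correlation_distance(x, y):
    """1 - Pearson correlation: 0 for perfectly correlated series, 2 for perfectly opposed."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)

def mahalanobis_2d(p, q, data):
    """Mahalanobis distance between 2-D points p and q, with covariance estimated from data."""
    n = len(data)
    mx = sum(d[0] for d in data) / n
    my = sum(d[1] for d in data) / n
    sxx = sum((d[0] - mx) ** 2 for d in data) / (n - 1)
    syy = sum((d[1] - my) ** 2 for d in data) / (n - 1)
    sxy = sum((d[0] - mx) * (d[1] - my) for d in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = p[0] - q[0], p[1] - q[1]
    # (dx, dy) * inverse(covariance) * (dx, dy)^T with the 2x2 inverse written out
    squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(squared)

data = [(0, 0), (1, 1), (2, 2), (3, 2), (4, 4)]  # positively correlated sample
along = mahalanobis_2d((0, 0), (1, 1), data)     # step in the direction of the correlation
across = mahalanobis_2d((0, 0), (1, -1), data)   # step against it
print(along, across)
```

Both steps have the same Euclidean length, yet the step against the correlation comes out much larger. That is the sense in which Mahalanobis distance removes correlation before measuring distance.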
More Reads

K Means Clustering in R Example
K Means Clustering by Hand / Excel
K means Clustering in R example Iris Data
Linear Regression Example in R using lm() Function
Linear Regression by Hand and in Excel



Happy Learning!!!

February 16, 2016

DBMS Session Three

Relational Model Classification

Key - A subset of attributes
Super Key - A set of attributes sufficient to identify a tuple uniquely
Considerations for primary key
  • No null values
  • Few attributes
  • Key often used in data access clauses
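
Whether a set of attributes identifies tuples uniquely can be checked on sample data with a sketch like this. The rows and the function names are invented for illustration. A check on one table instance can only show that a key does not hold; a real key is a rule about every valid instance. The minimality test (no smaller subset is also a super key) is the usual candidate-key condition, added here as an assumption beyond the notes.

```python
from itertools import combinations

def is_superkey(rows, attributes):
    """True if no two rows share the same values on the given attributes."""
    seen = set()
    for row in rows:
        key = tuple(row[a] for a in attributes)
        if key in seen:
            return False
        seen.add(key)
    return True

def is_candidate_key(rows, attributes):
    """A super key from which no attribute can be dropped."""
    if not is_superkey(rows, attributes):
        return False
    for size in range(1, len(attributes)):
        for subset in combinations(attributes, size):
            if is_superkey(rows, subset):
                return False  # a smaller set already identifies every row
    return True

# Invented sample relation
students = [
    {"roll": 1, "name": "Asha", "dept": "CS"},
    {"roll": 2, "name": "Ravi", "dept": "CS"},
    {"roll": 3, "name": "Asha", "dept": "EE"},
]
print(is_superkey(students, ["roll", "name"]))       # unique, but not minimal
print(is_candidate_key(students, ["roll", "name"]))
print(is_candidate_key(students, ["roll"]))
```
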
Relational Operators
  • Selection (rows that satisfy a filter condition)
  • Projection (only the explicitly specified columns)
  • Cartesian product
  • Union
  • Difference
  • Intersection
  • Join
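
The operators above can be mimicked on small in-memory relations, here represented as lists of dictionaries. This is only a teaching sketch with invented data and function names. A database engine implements these very differently, and the join shown is a natural join, which is one of several join variants.

```python
def select(relation, predicate):
    """Selection: keep the rows that satisfy the filter."""
    return [row for row in relation if predicate(row)]

def project(relation, columns):
    """Projection: keep only the named columns, dropping duplicate rows."""
    result = []
    for row in relation:
        reduced = {c: row[c] for c in columns}
        if reduced not in result:
            result.append(reduced)
    return result

def cartesian_product(r, s):
    """Every row of r paired with every row of s (assumes no shared column names)."""
    return [{**a, **b} for a in r for b in s]

def union(r, s):
    return r + [row for row in s if row not in r]

def difference(r, s):
    return [row for row in r if row not in s]

def intersection(r, s):
    return [row for row in r if row in s]

def natural_join(r, s):
    """Pair rows that agree on every column name the two relations share."""
    joined = []
    for a in r:
        for b in s:
            shared = a.keys() & b.keys()
            if all(a[c] == b[c] for c in shared):
                joined.append({**a, **b})
    return joined

# Invented sample relations
employees = [
    {"emp": "Asha", "dept_id": 1},
    {"emp": "Ravi", "dept_id": 2},
    {"emp": "Meena", "dept_id": 1},
]
departments = [{"dept_id": 1, "dept": "Sales"}, {"dept_id": 2, "dept": "HR"}]

print(select(employees, lambda row: row["dept_id"] == 1))
print(project(employees, ["dept_id"]))
print(natural_join(employees, departments))
```

In SQL terms, `select` corresponds to the WHERE clause and `project` to the column list after SELECT, which is easy to mix up given the names.
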
SQL Refreshers
  • SELECT, WHERE, Aggregate (Group by, Having), JOINs, String operations
  • SET Operations, Handling NULL values, Subqueries
  • DELETION, NOT EXISTS, Conditional Updates
  • Views, Materialized views
  • Authentication & Authorization (Roles & Permissions)
Happy Learning!!!




February 11, 2016

Data Models

Captured Notes from Session #2 - Data Models

Hierarchical Data Models
  • Tree like structures
  • Used in Windows Registry
  • Frequently cited example: IMS
  • DL/1 is the programming language for IMS
  • Difficult to reorganize
Graph / Network Model
  • Organize collection of records in form of directed graph
  • Three-way relationships can't be maintained
ER Model
  • Defined in terms of entities and relationships
  • Never caught on as a physical model
Object Oriented Database model
  • Difficult mapping programming objects to database objects
Relational Model
  • Better physical data independence
  • Better logical independence
  • Won because of relational algebra
Happy Learning!!!

February 01, 2016

World of Data Science

My second semester classes have started. The first session was very interesting and a great introduction to the world of data science. I have read and re-read the same type of definitions and introductory articles on data science. Prof. Manish Singh's session gave a whole new analogy and interesting examples to correlate with.

For big data I have always referred back to the 4 Vs: Volume, Veracity, Velocity and Variety. Along the same lines, the definition was presented as
  • Internet of Content - YouTube, Ebooks, Wikipedia, News Feeds
  • Internet of People - Email, Facebook, LinkedIn etc.
  • Internet of Things - Devices with a unique ID communicating with / managing infrastructure
  • Internet of Location - Spatial data related analysis
This Internet of * is a good representation of the different forms / flows of information behind the four Vs

Big Data = Crude Oil

"Big data is about extracting the ‘crude oil’, transporting it in ‘mega-tankers’, siphoning it through ‘pipelines’ and storing it in massive ‘silos’"

Data Science – Data science is an interdisciplinary field for extracting knowledge from data.

The data science workflow involves data visualization, data analysis, data processing and data storage tasks. Some of the tools used in each layer are listed below.


Tools available

Data Visualization
Ambrose, Tableau, GWT, D3 / Infovis, R/Python, Gephi, Chaco (Graph partitioning tool)

Data Analysis
Mahout, Piggybank, Hive, Pegasus, Giraph, Pig, AllReduce, MR

Data Processing

Scheduler – Azkaban, Oozie, Ivory
Cluster Monitoring – (Ganglia + Nagios), Chukwa, Zookeeper

Data Storage
HDFS, HSFTP (HDFS over HTTP), S3, KFS (Kosmos File System)
Data Movement – SQOOP, Flume, Scribe, Kafka, MessageQueue
Columnar Storage – Zebra
Key Value - HBase

The key ingredients of Data Science are
Data Management System
Data Mining
  • Computational process to identify patterns in large data sets
  • Use techniques at intersection of multiple disciplines (AI, Stats, Machine Learning, Computer Networks)
  • Data Classification, Clustering, regression and association rule finding and anomaly detection
Process Mining
  • Aim to discover, monitor, improve real time processes (eg logs, events, alerts, rules)
Information Visualization
  • Visualization techniques for large data sets, Interactive Information Visualization, How to really visualize big data


Databases Vs Data Science

              Databases                                    Data Science
Data Value    Precious                                     Cheap
Data Volume   Modest                                       Massive
Structured    Strongly (Schema)                            Weakly or none (text)
Priorities    Consistency, Error Recovery, Auditability    Speed, Availability, Query richness
Base          Relational algebra                           Linear algebra

PS: My professor had provided references to the examples; I am sharing this post based on notes / slides from my session.  

Happy Learning!!!