"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

November 16, 2017

Day #91- Retail Analytics - Data Mining / Analytics

Running a successful #Retail store involves many Data Mining / Analytics challenges, where decisions have to be made based on data. Some interesting Retail Data Mining / Analytics problems are:
  • What sells best in each store, down to item-level detail?
  • What are the shopping times / routines for a particular store?
  • Using web data, what is the relevance of the shopping district / retail environment?
  • What are the money-making items in the store (quantity vs price)?
  • What is the Sales / Stock ratio?
  • What is the forecast value of minimum orders for items in each store based on sales / traffic trends?
  • What is the correlation between loss items, shopping days / periods and people movement?
  • What retail price points can be identified based on End of Season Sales?
Forecasts / Predictions come as the next steps after Data Analysis.
Happy Analytics!!!

November 15, 2017

Day #90 - Regression Metrics Optimization

RMSE, MSE, R-Squared (Sometimes called L2 Loss)
Tree-Based
  • XGBoost, LightGBM
  • sklearn.RandomForestRegressor
Linear Models
  • sklearn.<>Regression
  • sklearn.SGDRegressor
Neural Networks
  • PyTorch
  • Keras
MAE (L1 Loss, Median Regression)
Tree-Based
  • LightGBM
  • sklearn.RandomForestRegressor
MSPE, MAPE
  • MSPE is a weighted version of MSE
  • MAPE is a weighted version of MAE
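A minimal sketch of computing these metrics (numpy and scikit-learn assumed; the y_true / y_pred arrays are just illustrative):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical targets and predictions for illustration
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mse = mean_squared_error(y_true, y_pred)       # L2-style loss
rmse = np.sqrt(mse)                            # same minimizer as MSE
r2 = r2_score(y_true, y_pred)                  # 1.0 means perfect predictions
mae = mean_absolute_error(y_true, y_pred)      # L1 / median regression loss
print(mse, rmse, r2, mae)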
Happy coding and learning!!!

November 14, 2017

Day #89 - Capsule networks

Key lessons
  • Instead of adding layers, it nests layers inside a layer
  • We apply non-linearity to grouped neurons (a capsule)
  • Dynamic routing - replaces the scalar-output feature detectors of CNNs with routing-by-agreement based on the capsule outputs
CNN History
  • Latest paper on capsule networks
  • Offers state-of-the-art performance on the MNIST dataset
  • Convolutional networks - learn a mapping from input data to output labels
  • Convolution layer - a series of matrix multiplication and summation operations; the output is a feature map (a bunch of features learned from the image)
  • ReLU - applies non-linearity (so the network can learn both linear and non-linear functions) and helps with the vanishing gradient problem (as the gradient backpropagates it gets smaller and smaller; ReLU mitigates this)
  • Pooling - splits the feature map into sections and takes the maximum value from each section
  • Each line of code corresponds to a layer in the network
  • Dropout - neurons are randomly turned off during training to prevent overfitting (a regularization technique)
  • For handling rotations - AlexNet augmented training data with rotated images to generalize to different rotations
  • Deeper networks improved classification accuracy
  • VGGNet - added more layers
  • GoogLeNet - convolutions of different sizes applied to the same input, with several such modules stacked together
  • ResNet - instead of simply stacking layers, skip (add) connections alleviated the vanishing gradient problem
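The building blocks above (convolution, ReLU, pooling, dropout) map almost one-to-one to layers in code. A minimal Keras sketch, with made-up layer sizes (not the actual AlexNet / VGG / ResNet architectures):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential([
    # Convolution layer: learns a feature map from the input image
    Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    # Pooling: keep the maximum value from each section of the feature map
    MaxPooling2D(pool_size=(2, 2)),
    # Dropout: randomly turn neurons OFF during training (regularization)
    Dropout(0.25),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax'),   # e.g., 10 classes for MNIST
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])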

Convolutional Network Challenges
  • As we go up the hierarchy, each learned feature becomes more complex
  • A hierarchy forms with each layer
  • Sub-sampling loses spatial relationships
  • Spatial correlations are missed in sub-sampling and pooling
  • Bad for rotated images (Invariance issues)
Capsule Networks
  • Basic idea - the human brain achieves translational invariance in a better way; instead of adding layers, nest layers inside a layer
  • The nested layer is called a capsule, a group of neurons
  • CNNs route by pooling
  • Deeper in terms of nesting
Layer based squashing
  • Instead of applying non-linearity per output neuron, it is applied per layer
  • We apply the non-linearity to the grouped neurons (the capsule) - see the sketch below
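A small numpy sketch of the squashing non-linearity from the capsule networks paper, applied to a capsule's output vector s (the vector values are illustrative):

import numpy as np

def squash(s, eps=1e-8):
    # Keep the vector's direction, shrink its length into [0, 1)
    # so the length can act as the probability that the entity exists
    norm_sq = np.sum(s ** 2)
    norm = np.sqrt(norm_sq) + eps
    return (norm_sq / (1.0 + norm_sq)) * (s / norm)

v = squash(np.array([0.5, 2.0, -1.0]))
print(np.linalg.norm(v))   # always < 1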
Dynamic routing
  • Replace the scalar output with routing by agreement
  • Hierarchy tree of nested layers
Key difference - several routing iterations are used to compute the output, and the operations are applied for every nested capsule
Happy coding and learning!!!

Day #88 - Metrics Optimization

Loss vs Metric
  • Metric - the function we want to use to evaluate the model, e.g., accuracy for classification
  • Optimization loss - the function our model actually optimizes; chosen to be easy to optimize for the given model, e.g., MSE, LogLoss
  • Preprocess the train set and optimize another metric - MSPE, MAPE, RMSLE
  • Optimize another metric and postprocess the predictions - Accuracy, Kappa
  • Early stopping - stop training when the model starts to overfit
Custom loss functions (a hedged sketch below)
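A hedged sketch of a custom loss plus early stopping with XGBoost's low-level API (the squared-error gradients and the X_train / y_train / X_valid / y_valid data are assumptions for illustration):

import numpy as np
import xgboost as xgb

def custom_squared_error(preds, dtrain):
    # Custom objective: return gradient and hessian of the loss w.r.t. predictions
    labels = dtrain.get_label()
    grad = preds - labels          # d(0.5 * (pred - y)^2) / d pred
    hess = np.ones_like(preds)
    return grad, hess

dtrain = xgb.DMatrix(X_train, label=y_train)   # assumed training data
dvalid = xgb.DMatrix(X_valid, label=y_valid)   # assumed validation data

model = xgb.train(
    {'max_depth': 4, 'eta': 0.1},
    dtrain,
    num_boost_round=1000,
    obj=custom_squared_error,
    evals=[(dvalid, 'valid')],
    early_stopping_rounds=50,      # stop training when the model starts to overfit
)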

Happy Coding and Learning!!!

November 10, 2017

Day #87 - Classification Metrics

  • Accuracy (Essential for classification), Weighted Accuracy = Weighted Kappa
  • Logarithmic Loss (Depends on soft predictions probabilities)
  • Area under the Receiver Operating Characteristic curve (considers the ordering of objects; tries all thresholds to convert soft predictions into hard labels)
  • Kappa (Similar to R Squared)
Notations
N - number of objects
L - number of classes
y - ground truth
ŷ - predictions
[a = b] - indicator function
  • Soft labels (soft predictions) are the classifier's scores - probabilities of the classes for an object
  • Hard labels (hard predictions) - argmax_i f_i(x), or [f(x) > b] with threshold b for binary classification; the predicted label is the class with the maximum soft prediction. A function of the soft labels
Accuracy Score
  • Most referred measure of classifier quality
  • Higher is better
  • Need hard predictions
  • Number of correctly guessed objects
  • Argmax of soft predictions
Logloss
  • Work with soft predictions
  • Make classifier output posterior probabilities
  • Penalises wrong answers, especially confident ones
  • The best constant predictions are the frequencies of each class
Area Under Curve
  • For a given threshold, decide which objects fall above / below the threshold
  • The metric tries all possible thresholds and aggregates the scores
  • Depends only on the ordering of the objects
AUC - ROC
  • Compute TruePositive, FalsePositive
  • AUC max value 1
  • Fraction of correctly ordered pairs
AUC = (# correctly ordered pairs) / (total # of pairs)
    = 1 - (# incorrectly ordered pairs) / (total # of pairs)

Cohen's Kappa
  • Score = 1- ((1-accuracy)/(1-baseline))
  • The baseline is different for each dataset
  • Similar to R squared
  • Here the baseline is the accuracy expected from random predictions that follow the class frequencies
  • Error = (1 - Accuracy)
  • Weighted error score = elementwise product of the confusion matrix and a weight matrix, summed
  • Weighted Kappa = 1 - ((weighted error) / (weighted baseline error))
  • Useful for medical applications
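A minimal scikit-learn sketch computing these metrics on made-up labels and scores:

import numpy as np
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score, cohen_kappa_score

y_true = np.array([0, 0, 1, 1, 1, 0])
y_soft = np.array([0.1, 0.4, 0.8, 0.35, 0.9, 0.2])   # soft predictions (probabilities)
y_hard = (y_soft > 0.5).astype(int)                   # hard predictions via a threshold

print(accuracy_score(y_true, y_hard))      # needs hard labels
print(log_loss(y_true, y_soft))            # works with soft predictions
print(roc_auc_score(y_true, y_soft))       # depends only on the ordering of objects
print(cohen_kappa_score(y_true, y_hard))   # 1 - (1 - accuracy) / (1 - baseline)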
Happy Learning and Coding!!!

November 09, 2017

Day #86 - Regression Metrics

  • Sometimes relative errors matter more to us than absolute errors
  • MSE, MAE work with absolute errors, not relative errors
  • MSPE (mean square percentage error)
  • MAPE (mean absolute percentage error) - a weighted version of MAE
  • RMSLE (root mean square logarithmic error) - RMSE calculated in logarithmic scale - cares about relative errors
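A small numpy sketch of these relative-error metrics (the arrays are illustrative; RMSLE uses log1p to match the "calculated in logarithmic scale" definition):

import numpy as np

y_true = np.array([10.0, 100.0, 1000.0])
y_pred = np.array([12.0, 90.0, 1100.0])

mspe = np.mean(((y_true - y_pred) / y_true) ** 2)     # mean square percentage error
mape = np.mean(np.abs(y_true - y_pred) / y_true)      # mean absolute percentage error
rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))   # RMSE in log scale
print(mspe, mape, rmsle)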
Happy Coding and Learning!!!

November 07, 2017

Day #85 - Regression Metrics Optimization

Metrics
  • Metrics are used to evaluate submissions
  • The best result is obtained by optimizing the target metric (e.g., finding the optimal hyperplane for a linear model)
  • Do exploratory metric analysis along with data analysis
  • Each metric is its own way of measuring the effectiveness of an algorithm
Regression - Metrics
  • Mean Square Error
  • RMSE
  • R Squared
  • Same from optimization perspective
Classification
  • Accuracy
  • LogLoss
  • AUC
  • Cohen's Kappa
Regression Metrics
N - number of samples
y - target values
ŷ - predicted values
yᵢ - target value for the i-th object
ŷᵢ - prediction for the i-th object

Mean Square Error
MSE = (1/N) * Σᵢ (yᵢ - ŷᵢ)²
- Average of the squared differences between predictions and targets

RMSE - Root Mean Square Error = sqrt(MSE)

  • Same as scale of target
  • RMSE vs MSE
  • Similar in terms of minimizers
  • Every RMSE minimizer is MSE minimizer
  • MSE(a) > MSE(b) <=> RMSE(a) > RMSE(b)
  • MSE orders in same way as RMSE
  • MSE easier to work with
  • There is a slight difference for gradient-based models (the gradients differ by a scaling factor)
  • So they may not be interchangeable for such learning methods (e.g., the learning rate needs adjusting)
R Squared
  • Measures how much better the model is than a constant baseline
  • R² = 1 means perfect predictions
  • When MSE is 0, R² = 1; when MSE equals the constant baseline's MSE, R² = 0
  • All reasonable models score between 0 and 1
MAE - Mean Absolute Error
  • Average of the absolute differences between targets and predictions
  • Widely used in Finance
  • A $10 error is twice as bad as a $5 error
  • MAE is easier to justify
  • The best constant prediction for MAE is the median of the target values
  • The MAE gradient is a step function: -1 when the prediction is smaller than the target, +1 when it is greater
  • MAE is not differentiable when the prediction equals the target
MAE vs MSE
  • If there are outliers in the data - use MAE
  • If the large values are unexpected but still normal (not outliers) - use MSE
  • MAE is robust to outliers (see the sketch below)
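A quick numpy check of the "best constant" intuition: the mean minimizes MSE, the median minimizes MAE, which is also why MAE is robust to outliers (the outlier value 1000 is made up):

import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])   # one outlier

candidates = np.linspace(0, 1000, 100001)
mse = ((y[None, :] - candidates[:, None]) ** 2).mean(axis=1)
mae = np.abs(y[None, :] - candidates[:, None]).mean(axis=1)

print(candidates[mse.argmin()], y.mean())       # 202.0 - pulled toward the outlier
print(candidates[mae.argmin()], np.median(y))   # 3.0 - robust to the outlier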
Happy Learning and Coding!!!

November 05, 2017

Day #84 - Data Leaks and Validations

  • Mimic the train / test split when building the validation set
  • Perform K-Fold validation (a hedged sketch below)
  • Choose the best parameters for the models
  • Submission stage (you can't mimic the exact train / test split)
  • Calculate the mean and standard deviation of the leaderboard scores
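A hedged sketch of K-Fold validation with mean / standard deviation of the fold scores (the model choice and the X / y numpy arrays are placeholders):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, valid_idx in kf.split(X):               # X, y assumed numpy arrays
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[valid_idx])
    scores.append(np.sqrt(mean_squared_error(y[valid_idx], preds)))

# Mean and standard deviation of validation scores, to compare against leaderboard scores
print(np.mean(scores), np.std(scores))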
Data Leaks
  • Unexpected information in data that lets you make good predictions
  • Unusable in real world
  • Results of unintentional error
Time Series
  • Incorrect time splits may still exist
  • Check public and private splits
  • Missing feature columns can themselves be data leaks
Unexpected Information
  • File creation dates and other metadata can leak information
  • Organizers may resize images / change creation dates to remove such leaks
  • IDs usually make no sense to include in a model, but they may still encode a leak
Happy Learning and Coding!!!

October 31, 2017

Day #83 - Data Splitting Strategies

  • Time-based splits
  • Set up validation to mimic the train / test split
  • Time-based trends can differ significantly, and time-based patterns are important
Different splitting strategies can differ significantly
  • In generated features
  • In a way model will rely on that features
  • In Some kind of target leak
 Split Categories
  • Random split (split randomly by rows; assumes rows are independent of each other), row-wise
  • Devise special features for cases where rows are dependent
  • Time-wise - data before a particular date as training, data after that date as testing; enables useful features based on past target values
  • Moving-window validation
  • By Id - (e.g., cluster pictures, group them, and then build features per group)
  • Combined (e.g., split by date for each shop independently)
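A hedged scikit-learn / pandas sketch of these split categories (df and the 'date' / 'user_id' column names are made up):

from sklearn.model_selection import KFold, TimeSeriesSplit, GroupKFold

# df is an assumed pandas DataFrame with 'date' and 'user_id' columns
# 1. Random row-wise split
row_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# 2. Time-wise / moving-window split: earlier dates for training, later for validation
df = df.sort_values('date')
time_cv = TimeSeriesSplit(n_splits=5)

# 3. Id-based split: all rows of one user / group stay in the same fold
group_cv = GroupKFold(n_splits=5)
folds = group_cv.split(df, groups=df['user_id'])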
Summary
  • In most cases split by Rownumber, Time, Id
  • Logic for feature generation depends on data splitting strategy
  • Set up your validation to mimic the train / test split of competition
Happy Learning and Coding!!!

Day #82 - Validation and Overfitting


  • Train Data (Past), Unseen Test Data (Future)
  • Divide into three parts - Train (Past), Validation (Past), Test (Future)
  • Underfitting (high error on both training and validation)
  • Overfitting (doesn't generalize to unseen data: low error on train, high error on validation)
  • Ideal (lowest error on both training and validation data)
Validation Strategies
  • Holdout (divide data into training / validation with no overlap) - used on shuffled data
  • K-Fold (like a repeated holdout, because we split the data into K folds: train on K-1 folds, validate on the remaining one) - a good choice for a medium amount of data - used on shuffled data
  • Leave-one-out: ngroups = len(train) - for very little data (a special case of K-Fold with K = number of samples)
  • Stratification - Similar target distribution over different folds
Stratification useful for
  • Small datasets (where purely random splits may not preserve the target distribution)
  • Unbalanced datasets
  • Multiclass classification
 Stratification preserves the target distribution over different folds
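A minimal sketch of stratified K-Fold preserving the target distribution (the tiny unbalanced dataset is made up):

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(24).reshape(12, 2)
y = np.array([0] * 8 + [1] * 4)    # unbalanced target, 2:1 ratio

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
for train_idx, valid_idx in skf.split(X, y):
    print(y[valid_idx])            # each validation fold keeps the same 2:1 class ratio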

Happy Coding and Learning!!!

October 30, 2017

Day #81 - Dataset Cleaning


Dataset cleaning
  • Constant features (remove features whose value remains constant in both training and testing data; if a value is constant in training but changes in testing, it is also better to remove the feature - this can happen when only a fraction of the data is supplied)
  • Duplicated features (completely identical columns slow down training; remove the duplicate columns)
  • Duplicated categorical features (encode the categorical features and then compare them)
Other things to check
  • Duplicated rows (duplicated rows with different targets could be the result of a mistake; understand why they are duplicated and handle them to score well on the test set)
  • Check for common rows in train and test sets (labels for test rows found in the training set can be set manually)
  • Check if the dataset is shuffled (plot a feature or the target versus row index; if shuffled, you should see oscillation around the mean)
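A hedged pandas sketch of these cleaning checks (train is an assumed DataFrame; the column handling is illustrative):

import pandas as pd

# 1. Constant features: a single unique value
constant_cols = [c for c in train.columns if train[c].nunique() == 1]
train = train.drop(columns=constant_cols)

# 2. Duplicated (identical) columns
dup_cols = []
cols = list(train.columns)
for i, c1 in enumerate(cols):
    for c2 in cols[i + 1:]:
        if train[c1].equals(train[c2]):
            dup_cols.append(c2)
train = train.drop(columns=list(set(dup_cols)))

# 3. Duplicated categorical features: compare them after encoding, e.g. via factorize
# 4. Duplicated rows
print(train.duplicated().sum())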
EDA Checklist
  • Get Domain Knowledge
  • Check How data is generated
  • Explore individual feature
  • Explore pairs and groups
  • Clean features
Happy Learning and Coding!!!

October 29, 2017

Day #80 - Visualizations

EDA is an art, and visualizations are its tools. Use several different plots to verify a hypothesis.

Visualization Tools
  • Histograms (split values into bins and count how many points fall in each bin; vary the number of bins) - plt.hist(x)
  • Histograms can also reveal hidden missing values (e.g., filled with the mean); making them explicit NaNs can benefit XGBoost
  • Plots - index versus value, plt.plot(x, '.'); check for randomness over the indices
  • Statistics
Explore Feature Relations
  • Scatter plots (plot one feature versus another); check that the train and test data are distributed similarly
  • Correlation plots (run K-means clustering to reorder the features) - show how similar features are
  • Plot (index vs feature statistics)
Feature Groups
  • Generate new features based on groups
Pairs
  • ScatterPlot, Scatter matrix
  • Correlation Plot (Corrplot)
 Groups
  •  Corrplot + Clustering
  •  Plot (Index vs feature statistics)
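A minimal matplotlib / pandas sketch of these plots (df is an assumed DataFrame with numeric feature columns; the names are illustrative):

import matplotlib.pyplot as plt

plt.hist(df['feature_1'], bins=50)              # histogram: vary the number of bins
plt.show()

plt.plot(df['feature_1'].values, '.')           # index vs value: check randomness over indices
plt.show()

plt.scatter(df['feature_1'], df['feature_2'])   # scatter plot: one feature vs another
plt.show()

plt.matshow(df.corr())                          # correlation plot: how similar features are
plt.show()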
Happy Learning and Coding!!!

Day #79 - Exploratory Data Analysis (EDA)

EDA
  • Looking at the data, understanding the data
  • A complete understanding of the data is required to build accurate models
  • Generate hypotheses / apply intuition
  • Top solutions use advanced and aggressive modelling
  • Find insights and "magic" features; start with EDA before hardcore modelling
Visualization
  • Identify Patterns (Visualization to idea)
  • Use patterns to find better models (Idea to visualization, Hypothesis testing)
EDA Steps
  • Domain Knowledge (Google, Wikipedia understand data)
  • Check data is Intuitive (Values in data validate based on acquired domain knowledge, Manual correction of error, Mark incorrect rows and label them for model to leverage it)
  • Understand how data is generated (Test set / Training set generated by the Same Algorithm ? / Need to know underlying data generation Process / Visualize Training / Test set plots)
Exploring Anonymized and Encrypted Data
Anonymized Data
  • Data replaced with encrypted / hashed text (this need not hurt the model)
  • No meaningful column names
  • Find the unique values of a feature, sort them, and look at the differences between them
  • The distance between consecutive unique values and its pattern can reveal the original scaling
Explore Individual Features
  • Guess the meaning of the columns
  • Guess the types of the column (Categorical, Boolean, Numeric etc..)
Explore Feature Relations
  • Find relation between pairs
  • Find feature groups
Useful Python functions
  • df.dtypes
  • df.info()
  • x.value_counts()
  • x.isnull()
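A tiny usage sketch of these functions on a made-up DataFrame:

import pandas as pd

df = pd.DataFrame({'col': ['a', 'b', 'a', None], 'num': [1, 2, 3, 4]})

print(df.dtypes)                  # guess the types of each column
df.info()                         # non-null counts and memory usage
print(df['col'].value_counts())   # frequency of each unique value
print(df['col'].isnull().sum())   # number of missing values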
Happy Learning and Coding!!!

Day #78 - Image Processing - Kaggle Lessons

  • Use a model trained on similar data
  • Train a network from scratch
  • Use a pretrained model and fine-tune it later
VGGNet16 Architecture
  • Replace the last layer with a new one (e.g., of size 4 for a 4-class problem)
  • Retrain the model
  • Benefit from a model trained on a similar dataset
Image Augmentation
  • Increase number of training samples
  • Image rotations
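A hedged Keras sketch of this approach: take a pretrained VGG16 base, replace the top with a new head, and augment the images (the class count of 4 and the augmentation parameters are illustrative, and X_train / y_train are assumed to exist):

from keras.applications import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense
from keras.preprocessing.image import ImageDataGenerator

# Pretrained convolutional base, without the original classification head
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False        # freeze first, fine-tune later

x = Flatten()(base.output)
out = Dense(4, activation='softmax')(x)   # new last layer, e.g. 4 target classes
model = Model(inputs=base.input, outputs=out)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Image augmentation: rotations etc. to increase the number of training samples
datagen = ImageDataGenerator(rotation_range=20, horizontal_flip=True)
# model.fit(datagen.flow(X_train, y_train, batch_size=32), epochs=5)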
Happy Learning and Coding!!!

Day #77 - Quick Summary - Kaggle Lessons - Features, Dates, Text

  • For Features - One Hot Encoding, Label Encoding, Frequency Encoding, Ranking, MinMaxScaler, StandardScaler
  • For Dates - Periodicity - Year, Date, Week, Time Slice - Time past since particular moment (before / after), Difference in Dates (Datetime_feature1 - Datetime_feature2), Boolean binary indicating date is holiday or not
  • For Text - Preprocessing - Lowercase, Stemming, Lemmatization, stopwords removal, Ngrams can help use local context, Postprocessing - TFiDF,  Use BOW for Ngrams
Happy Coding and Learning!!!

Day #76 - Text Processing - Kaggle Lessons


Bag of Words
  • Create a new column for each unique word in the data
  • Count occurrences in each document
  • sklearn.feature_extraction.text.CountVectorizer
  • Counts become more comparable across documents by using Term Frequency normalization
  • tf = 1 / x.sum(axis=1)[:, None]
  • x = x * tf
  • Inverse Document Frequency down-weights words that appear in many documents
  • idf = np.log(x.shape[0] / (x > 0).sum(0))
  • N-grams
  • Bag of Words (each row represents a text, each column a unique word)
  • Classifying document

For N = 1, This is a sentence
Unigrams are - This, is, a, sentence

For N = 2, This is a sentence
bigrams are - This is, is a, a sentence

For N = 3, This is a sentence
Trigrams are - This is a, is a sentence

sklearn.feature_extraction.text.CountVectorizer: ngram_range, analyzer

Text Preprocessing steps
  • Lower case
  • Lemmatization (uses knowledge of vocabulary and morphological analysis of words)
  • democracy, democratic and democratization -> democracy (lemmatization)
  • Stemming (chops off the endings of words)
  • democracy, democratic and democratization -> democr (stemming)
  • Stop words (words that do not contain important information)
sklearn.feature_extraction.text.CountVectorizer: the max_df parameter helps remove corpus-specific stop words

I have done all of this in my assignment work; the code is in my GitHub repository.

For Applying Bag of words
  • Preprocessing - Lowercase, Stemming, Lemmatization, stopwords removal
  • Ngrams can help use local context
  • Postprocessing - TFiDF
  • Use BOW for Ngrams
BOW example
  • Sentence - The dog is on the table
  • Representation         - are, cat, dog, is, now, on, the, table
  • BOW representation  - 0,    0,    1,    1,     0,      1,    1,    1
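A minimal scikit-learn sketch reproducing the binary bag-of-words representation above (the fixed vocabulary matches the columns listed):

from sklearn.feature_extraction.text import CountVectorizer

vocab = ['are', 'cat', 'dog', 'is', 'now', 'on', 'the', 'table']
vectorizer = CountVectorizer(vocabulary=vocab, binary=True)   # binary=True -> 0/1 instead of counts
bow = vectorizer.transform(['The dog is on the table'])
print(bow.toarray())   # [[0 0 1 1 0 1 1 1]]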
Word to Vectors
  • Get vector representation of words and texts
  • Each word converted to vector
  • Uses nearby words (context)
  • Different words used in similar contexts end up with similar vector representations
  • Basic arithmetic operations can be applied to the vectors (e.g., king - man + woman ≈ queen)
  • Words - Word2Vec, Glove, FastText
  • Sentences - Doc2Vec
  • There are pretrained models
Bag of Words
  • Very large vectors
  • Meaning of each value in vector is unknown
Word2Vec
  • Relatively small vectors
  • Values of vector can be interpreted only in some cases
  • The words with similar meaning often have similar embeddings
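A small gensim sketch (assuming gensim >= 4.0, where the parameter is vector_size; the toy corpus is made up):

from gensim.models import Word2Vec

sentences = [
    ['the', 'dog', 'is', 'on', 'the', 'table'],
    ['the', 'cat', 'is', 'on', 'the', 'mat'],
]

# Train small word vectors; words used in similar contexts get similar embeddings
model = Word2Vec(sentences, vector_size=32, window=3, min_count=1, epochs=50)

print(model.wv['dog'].shape)           # (32,)
print(model.wv.most_similar('dog'))    # similar-context words tend to rank highly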
Happy Learning, Happy Coding!!!

October 27, 2017

Day #75 - Missing Values

  • Reasons for Missing Values
  • How to Engineer them effectively
  • Hidden Missing Values
  • Plot distribution of values and find from histogram
Filling missing values
  • -999, -1, etc. (fill with a constant outside the normal range) - gives missing values their own category, but performance of linear models / neural networks can suffer
  • mean, median
  • Reconstruct the value
  • Add an "isnull" indicator column (see the sketch below)
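A hedged pandas sketch of these filling options (df and the 'temp' column are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({'temp': [21.0, np.nan, 23.0, np.nan, 25.0]})

# Indicator column so the model knows the value was missing
df['temp_isnull'] = df['temp'].isnull().astype(int)

df['temp_fill_const'] = df['temp'].fillna(-999)                 # out-of-range constant
df['temp_fill_mean'] = df['temp'].fillna(df['temp'].mean())     # mean ignores missing values
df['temp_interp'] = df['temp'].interpolate()                    # reconstruct from the increase / decrease pattern
print(df)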
Reconstruction
  • Missing values in a time series
  • e.g., temperature values missing for some days of the month
  • Interpolate based on the increase / decrease pattern of nearby values
  • Ignore missing values while calculating the mean
  • Change categories to frequencies
  • XGBoost can handle NaN natively
Happy Learning and Coding!!!

Day #74 - Feature Generation - DateTime and Coordinates

DateTime
  • Datetime features differ significantly from numeric and categorical features
  • Periodicity - Year, Date, Week
  • Time Slice - Time past since particular moment (before / after), Time moments in period
  • Difference in Dates (Datetime_feature1 - Datetime_feature2)
  • Special Time period (Medication every 3 days)
  • Sales Predictions (Days since last holiday, Days since weekend, Since last sales campaign)
  • Boolean binary indicating date is holiday or not
  • Sales Context Churn Prediction
  •     (Date Since user registration) - DateDiff
  •     (Date Since last purchase) - DateDiff
  •     (Date Since calling customer service) - DateDiff
  • Periodicity - Day number in week, month, season, year, second, minute, hour
  • Time Slice, Difference between dates
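A hedged pandas sketch of these datetime features (the DataFrame, column names and holiday list are made up):

import pandas as pd

df = pd.DataFrame({'purchase_date': pd.to_datetime(['2017-11-03', '2017-11-14']),
                   'registration_date': pd.to_datetime(['2017-01-10', '2017-06-01'])})

# Periodicity
df['day_of_week'] = df['purchase_date'].dt.dayofweek
df['month'] = df['purchase_date'].dt.month

# Time since a particular moment (difference between dates)
df['days_since_registration'] = (df['purchase_date'] - df['registration_date']).dt.days

# Boolean flag: is the date a holiday?
holidays = pd.to_datetime(['2017-11-23'])
df['is_holiday'] = df['purchase_date'].isin(holidays).astype(int)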
Coordinates
  • This can be used for churn prediction (Likelihood customer will return)
  • In Real Estate Scenario for predictions on Prices
  •     (Distance from School)
  •     (Distance from Airport)
  •     (Flats around particular point)
  • Alternatively, distance from the most expensive flat
  • Cluster the data points and use distances from the cluster centres
  • Aggregated Statistics for surrounding data
Happy Learning and Coding!!!

Day #73 - Feature Generation - Categorical and ordinal features

  • Label Encoding - Based on Sort Order, Order of Appearance
  • Frequency Encoding - based on the percentage of occurrence
Categorical Features
  • Sex, Cabin, Embarked
  • One Hot Encoding
  • pandas.get_dummies
  • sklearn.preprocessing.OneHotEncoder
  • Works well for linear methods (minimum is zero, maximum is 1)
  • Tree methods find it harder to use one-hot encoded features efficiently
  • Store only the non-zero elements (sparse matrices)
  • Create combinations of features to get better results
  • Concatenate the strings from both columns
  • One-hot encode the concatenation; the model can then find an optimal coefficient for every interaction
pclass,sex,pclass_sex
3,male,3male
1,female,1female
3,female,3female
1,female,1female

pclass_sex ==
1male,1female,2male,2female,3male,3female
0,0,0,0,1,0
0,1,0,0,0,0
0,0,0,0,0,1
0,1,0,0,0,0

Ordinal Features
  • Ordered categorical feature
  • First class most expensive, second less, third least expensive
  • Driver's license types A, B, C, D
  • Level of education (sorted in order of increasing complexity)
  • Label encoding, map to numbers (tree-based models can use this)
  • Non-tree models can't use it effectively
Label Encoding
1. Alphabetical (sorted) [S,C,Q] -> [3,1,2]
 - sklearn.preprocessing.LabelEncoder

2. Order of appearance
[S,C,Q] -> [1,2,3]
 - pandas.factorize

Frequency Encoding (Depending on Percentage of Occurences)
[S,C,Q] -> [0.5,0.3,0.2]
encoding = titanic.groupby('Embarked').size()
encoding = encoding/len(titanic)
titanic['enc'] = titanic.Embarked.map(encoding)

Frequency encoding can help linear models (if frequency is correlated with the target value, a linear model can use that dependency). It also preserves information about the value distribution.
  • If several categories have equal frequency they become indistinguishable; apply a rank transformation to break ties
  • from scipy.stats import rankdata
Summary
  • Ordinal is a special case of a categorical feature
  • Label encoding maps categories to numbers
  • Frequency encoding maps categories to their frequencies
  • Label and frequency encoding are often used for tree-based models
  • One-hot encoding is often used for non-tree-based models
  • Interactions of categorical features can help linear models and KNN
Happy Coding and Learning!!!

Day #72- Feature Generation - Numeric Features

Feature Generation
  • Example: predict apple sales that follow a linear trend
  • Adding a feature indicating the week number helps; a GBDT can then use the value calculated for each week instead of approximating the trend with many splits
  • This shapes the generated trees
Numeric Features - Preprocessing
  • Tree based Methods (Decision Tree)
  • Non Tree based Methods (NN, Linear Model, KNN)
Technique #1 - Scaling of values
  • Regularization is applied to all feature coefficients in equal amounts
  • So do proper scaling of the features
MinMax Scaler
  • To [0,1]
  • sklearn.preprocessing.MinMaxScaler
  • X = (X-X.min())/(X.max()-X.min())
Standard Scaler
  • To mean = 0, std = 1
  • sklearn.preprocessing.StandardScaler
  • X = (X-X.mean())/X.std()
Preprocessing (scaling) should be applied to all features, not just a few, so that the initial impact of every feature on the model is roughly similar.
Preprocessing Outliers
  • Clip values to lower and upper bounds (e.g., percentiles) to handle outliers
  • Rank transformation
  • A better option than MinMax scaling when outliers are present
Ranking, Transformations
  • scipy.stats.rankdata
  • Log transformation  - np.log(1+x)
  • Raising to power < 1 - np.sqrt(x+2/3)
Feature Generation (Based on Feature Knowledge, Exploratory Data Analysis)
  • Creating new features
  • Engineer using prior knowledge and logic
  • Example: add price per square foot if the price and the plot size are provided (a tiny sketch below)
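A tiny pandas sketch of that generated feature (the column names are made up):

import pandas as pd

df = pd.DataFrame({'price': [300000, 450000], 'area_sqft': [1500, 2000]})

# Engineer a new feature from prior knowledge: price per square foot
df['price_per_sqft'] = df['price'] / df['area_sqft']
print(df)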
Summary
  • Tree based methods don't depend on scaling
  • Non-Tree methods hugely depend on scaling
Most often used preprocessing
  • MinMaxScaler - to [0,1]
  • StandardScaler - to mean==0, std==1
  • Rank - sets spaces between sorted values to be equal
  • np.log(1+x) and np.sqrt(1+x)
 Happy Learning and Coding!!!