"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

January 27, 2018

Day #100 - Ensemble Methods

It took more than a year to reach 100 posts. This is a significant milestone. Hoping to reach 200 soon.
  • Combining different machine learning models  for more powerful prediction
  • Averaging or blending
  • Weighted averaging
  • Conditional averaging
  • Bagging
  • Boosting
  • Stacking
  • Stacknet
Averaging ensemble methods
  • Combine two results with simple averaging
  • (model1+model2)/2
  • Considerable improvements can be achieved with averaging
  • Models that are only modest individually can perform much better when combined
  • Weighted average - (model1*0.7 + model2*0.3)
  • Conditional average - if feature < 50 use model1, else model2 (all three schemes are sketched below)
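A minimal sketch of the three averaging schemes, assuming pred1 and pred2 are predictions from two already-fitted models and "feature" is a hypothetical column driving the conditional rule:

import numpy as np

pred1 = np.array([42.0, 55.0, 61.0])    # toy predictions from model 1
pred2 = np.array([40.0, 58.0, 70.0])    # toy predictions from model 2
feature = np.array([30.0, 48.0, 95.0])  # hypothetical feature for the conditional rule

simple_avg = (pred1 + pred2) / 2                 # (model1 + model2) / 2
weighted_avg = 0.7 * pred1 + 0.3 * pred2         # weighted average
cond_avg = np.where(feature < 50, pred1, pred2)  # conditional average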
Bagging
  • Averaging slightly different versions of same model to improve accuracy
  • Example - Random Forest
  • Underfitting - Error in bias
  • Overfitting - Errors in variance
  • Parameters that control bagging - Seed, Subsampling or Bootstrapping, Shuffling, Column Subsampling, Model specific parameters, bags (number of models), More bags better results, parallelism
  • BaggingClassifier and BaggingRegressor from sklearn
  • Bags are independent of each other, so they can be trained in parallel (see the sklearn sketch below)
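A minimal sklearn bagging sketch on toy data, showing where the knobs above (seed, subsampling, column subsampling, number of bags, parallelism) live:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)  # toy data

bag = BaggingClassifier(
    DecisionTreeClassifier(),  # base model; each bag sees a slightly different dataset
    n_estimators=100,          # number of bags; more bags usually give better results
    max_samples=0.8,           # row subsampling (bootstrapping)
    max_features=0.8,          # column subsampling
    random_state=42,           # seed
    n_jobs=-1,                 # bags are independent, so they can be trained in parallel
)
bag.fit(X, y)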
Boosting
  • Weight based boosting
  • Form of weighted averaging of models where each model is built sequentially, taking into account past model performance
  • Each new model is added sequentially based on how well the previous models have done
Weight based boosting
  • Weights reflect how many times a certain row (effectively) appears in the data
  • Each row's contribution to the error is used to recalculate the weights
  • Parameters - Learning rate / shrinkage (how much each model is trusted), number of estimators
  • Implementations - AdaBoost (sklearn - Python), LogitBoost (Weka - Java)
Residual based boosting
  • The dominant form of boosting in practice
  • Calculate the error of the predictions / the direction of the error
  • Make the error the new target variable
  • Parameters - Learning rate (also called shrinkage or eta)
  • Number of estimators
  • Row sub sampling
  • Column sub sampling
  • Sub boosting type - Fully gradient based, Dart
  • XGboost
  • Lightgbm
  • H2O GBM (handles categorical variables out of the box)
  • Catboost
  • Sklearn's GBM (a sketch using it follows this list)
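A minimal residual-based boosting sketch using sklearn's GBM on toy data; the other libraries above expose similar knobs:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, random_state=42)  # toy data

gbm = GradientBoostingRegressor(
    n_estimators=300,    # number of estimators
    learning_rate=0.05,  # learning rate / shrinkage / eta
    subsample=0.8,       # row subsampling
    max_features=0.8,    # column subsampling
    random_state=42,
)
gbm.fit(X, y)  # each tree fits the residual (error) left by the previous trees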
Stacking
  • Make predictions with a number of models on a hold-out set, then train a different meta model on those predictions
  • Stacking predictions
  • Splitting training set into two disjoint sets
  • Train several base learners on the first part
  • Make predictions with the base learners on the second (validation) part
  • Using predictions from (3) as the input to train a higher level learner
  • Train Algo 0 on A, make predictions for B and C, and save them to B1, C1
  • Train Algo 1 on A, make predictions for B and C, and save them to B1, C1
  • Train Algo 2 on A, make predictions for B and C, and save them to B1, C1
  • Train Algo 3 (the meta model) on B1 and make predictions for C1 (sketched below)
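A minimal stacking sketch on toy data; A/B/C and B1/C1 follow the naming in the steps above:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=600, random_state=0)  # toy data

# A = training part for base learners, B = hold-out part for the meta model, C = test set
X_ab, X_c, y_ab, y_c = train_test_split(X, y, test_size=0.25, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_ab, y_ab, test_size=0.5, random_state=0)

base_models = [RandomForestRegressor(random_state=0), Ridge()]
B1 = np.column_stack([m.fit(X_a, y_a).predict(X_b) for m in base_models])  # predictions on B
C1 = np.column_stack([m.predict(X_c) for m in base_models])                # predictions on C

meta = LinearRegression().fit(B1, y_b)  # higher level learner trained on B1
test_pred = meta.predict(C1)            # final predictions for C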

Happy Learning!!!

Day #99 - Statistics and distance based features

Stats
  • Calculate statistics of derived features from neighborhood analysis
  • User_id / Page_id / Ad_price / Ad_position
  • Use label encoder
  • Even without explicit groups, nearby data points can be treated as implicit groups
  • Add the lowest and highest price for the position of the ad
  • maximum and minimum price values
  • Pages user visited
  • Standard deviation of prices
  • Most visited page
  • Many more such features can be derived (a group-by sketch follows this list)
  • They introduce new information into the model
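A small pandas group-by sketch for such statistics; the column names are only illustrative:

import pandas as pd

df = pd.DataFrame({
    "user_id":     [1, 1, 2, 2, 2],
    "page_id":     [10, 11, 10, 12, 12],
    "ad_price":    [5.0, 7.5, 3.0, 4.0, 9.0],
    "ad_position": [1, 2, 1, 3, 3],
})

# Lowest / highest / standard deviation of price per ad position
pos_stats = df.groupby("ad_position")["ad_price"].agg(["min", "max", "std"]).reset_index()
df = df.merge(pos_stats, on="ad_position", how="left")

# Number of distinct pages each user visited
df["pages_visited"] = df.groupby("user_id")["page_id"].transform("nunique")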
Neighbors
  • Number of houses in 500m, 1000m
  • Average price per sq.m
  • Number of schools / supermarkets / parking lots in 500m / 1000m
  • Distance to closest substation
  • Embrace both group-by and nearest neighbor methods
Matrix Factorizations
  • Approach for feature extraction
  • User / Items mapping matrix
  • User - Attributes matrix
  • U x M ≈ R (the product of the factor matrices approximates the original matrix)
  • Row and column related features
  • BoW gives a large sparse vector
  • Document represented by small dense vector (Dimensionality reduction)
  • Matrix Factorizations
  • SVD, PCA, TruncatedSVD for sparse matrices
  • NMF (Non-Negative Matrix Factorization) - Zero or Positive Number
  • NMF makes data suitable for decision trees
  • Used for Dimensionality reduction
Example Code
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=5)                   # choose the number of components to keep
x_all = np.concatenate([x_train, x_test])   # fit on train and test together
pca.fit(x_all)
x_train_pca = pca.transform(x_train)
x_test_pca = pca.transform(x_test)

Happy Learning!!!

January 25, 2018

Day #98 - Advanced Hyperparameter tuning

Neural Network Libraries
  • Keras (Easy to learn)
  • TensorFlow (commonly used in production)
  • MxNet
  • PyTorch (Popular in community)
  • sklearn's MLP
Neural Nets
  • Number of neurons per layer
  • Number of layers
  • Optimizers
  • SGD + momentum
  • Adam / Adadelta / Adagrad (In practice lead to more overfitting)
  • Batch size (Huge batch size leads to overfitting)
  • Epochs impact
  • Learning rate - not too high not too low, Rate where network converges
  • Regularization
    • L2/L1 for weights
    • Dropout / Dropconnect
    • Static dropconnect
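A minimal Keras sketch showing where these knobs (optimizer, batch size, epochs, L2, dropout) live; the layer sizes and input width are only illustrative:

from tensorflow import keras
from tensorflow.keras import layers, regularizers

n_features = 20  # hypothetical input width

model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),  # L2 on weights
    layers.Dropout(0.5),                                     # dropout regularization
    layers.Dense(1),
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),  # SGD + momentum
              loss="mse")
# model.fit(X_train, y_train, batch_size=32, epochs=20)  # batch size and epochs tuned here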
Linear Models (Scikit-learn)
  • SVC / SVR
  • Sklearn wraps libLinear and libSVM
  • Compile yourself for multicore support
  • LogisticRegression / LinearRegression + regularizers
  • SGDClassifier / SGDRegressor
  • Vowpal Wabbit
  • Regularization parameter (C, alpha, lambda)
  • Start with very small value and increase it
  • SVC starts to work slower as C increases
  • Regularization type
    • L1/L2/L1+L2 - try each
    • L1 can be used for feature selection
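A minimal sketch of sweeping the regularization parameter C on toy data, starting with a very small value and increasing it:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)  # toy data

for C in [1e-3, 1e-2, 1e-1, 1, 10]:
    clf = LogisticRegression(C=C, penalty="l2", max_iter=1000)
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"C={C}: CV accuracy={score:.3f}")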
Happy Learning!!!

January 24, 2018

Day #97 - Hyperparameter tuning

How to tune hyper parameters ?
  • Which parameters affect most
  • Observe impact of change of value of parameter
  • Examine and iterate to find change of impacts
Automatic Hyper-parameter tuning libraries
  • Hyperopt
  • Scikit-optimize
  • Spearmint
  • GPyOpt
  • RoBO
  • SMAC3
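A minimal hyperopt sketch on toy data; the search space and model are only illustrative:

from hyperopt import fmin, tpe, hp
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)  # toy data

def objective(params):
    model = RandomForestClassifier(
        n_estimators=int(params["n_estimators"]),
        max_depth=int(params["max_depth"]),
        random_state=0,
    )
    return -cross_val_score(model, X, y, cv=3).mean()  # hyperopt minimizes, so negate accuracy

space = {
    "n_estimators": hp.quniform("n_estimators", 50, 300, 10),
    "max_depth": hp.quniform("max_depth", 2, 10, 1),
}
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20)
print(best)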
Hyper parameter tuning
  • Tree Based Models (Gradient Boosted Decision Trees - XGBoost, LightGBM, CatBoost)
  • RandomForest / ExtraTrees
Neural Nets
  • Pytorch, Tensorflow, Keras
Linear Models
  • SVM, Logistic Regression
  • Vowpal Wabbit, FTRL
Approach
  • Define function that will run our model
  • Specify range of hyper parameter
  • Adequate range for search
Results
  • Underfitting
  • Overfitting
  • Good Fit and Generalization
Tree based Models
  • GBDT - XGBoost, LightGBM, CatBoost
  • RandomForest, ExtraTrees - Scikit-learn
  • Others - RGF(baidu / fast_rgf)
GBDT
  • XGBoost - max_depth, subsample, colsample_bytree, colsample_bylevel, min_child_weight, lambda, alpha, eta, num_round, seed (a sample parameter dictionary is sketched below)
  • LightGBM - max_depth / num_leaves, bagging_fraction, feature_fraction, min_data_in_leaf, lambda_l1, lambda_l2, learning_rate, num_iterations, seed
  • sklearn.RandomForest/ExtraTrees - n_estimators, max_depth, max_features, min_samples_leaf, n_jobs, random_state
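A sample XGBoost parameter dictionary wired into the native training API on toy data; the values are only starting points, not recommendations:

import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, random_state=0)  # toy data
dtrain = xgb.DMatrix(X, label=y)

params = {
    "max_depth": 6,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "colsample_bylevel": 0.8,
    "min_child_weight": 1,
    "lambda": 1.0,   # L2 regularization
    "alpha": 0.0,    # L1 regularization
    "eta": 0.05,     # learning rate
    "seed": 0,
}
booster = xgb.train(params, dtrain, num_boost_round=300)  # num_round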

Happy Learning!!!

January 22, 2018

Day #96 - Mean Encoding - Extensions and Generalizations

  • Compact transformation of categorical variables
  • Powerful basis of feature engineering
Using the target variable in different tasks: regression, multi-class
  • More stats - Percentiles, std, distribution bins
  • Introducing new information from one vs all classifiers in multi-class tasks (N Different encodings)
Domains with many-to-many relationships
  • User to Apps relationships
  • Row for user-app relationship
  • Vector for each app
Time-series
  • Rolling statistics such as the mean of the previous day, previous week, etc.
  • Based on data create more complicated features
Encoding interactions and numerical features
  • model structure, analyzing trees
  • Extract from decision trees (If they are in neighboring nodes)
  • xgboost, row features
  • Use split points to identify new features
  • Manually add more mean encoded interactions
  • Evaluate variable interactions involving categorical variables
Correct validation reminder
Local experiments
  • Estimate encodings on X_tr
  • Map them to X_tr and X_val
  • Regularize on X_tr
  • Validate model on the X_tr / X_val split
Submission
  • Estimate Encoding on whole Train data
  • Map them to Train and Test
  • Regularize on Train
  • Fit on Train
Happy Learning!!!

December 31, 2017

December 08, 2017

Day #93 - Regularizations

Four methods of Regularization
  • Cross Validation inside training data
    • 4 to 5 folds of K-Fold Validations
    • Split into K non-intersecting subsets
    • Leave one out scheme
    • Target variable leakage is still present in K Fold Scheme
  • Smoothing based on size of category
    • Category big lot of data points
    • Formula = (mean(target)*nrows+globalmean*alpha)/(nrows+alpha)
    • alpha = the category size we can trust (a pandas sketch of this smoothing follows the list)
  • Add Random Noise
    • Unstable, Hard to make it work
    • Too much noise
    • LOO, Leave one out Regularization
  • Sorting and calculating mean on some type of data
    • Fix sorting order of data
    • Use rows 0 to n-1 to calculate the mean for row n (an expanding mean)
    • Least Leakage
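A minimal pandas sketch of the smoothing formula above, on toy data:

import pandas as pd

df = pd.DataFrame({"category": ["a", "a", "a", "b", "b", "c"],
                   "target":   [1, 0, 1, 0, 0, 1]})

alpha = 10                        # category size we are willing to trust
global_mean = df["target"].mean()
agg = df.groupby("category")["target"].agg(["mean", "size"])

# (mean(target)*nrows + globalmean*alpha) / (nrows + alpha)
smoothed = (agg["mean"] * agg["size"] + global_mean * alpha) / (agg["size"] + alpha)
df["category_enc"] = df["category"].map(smoothed)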
 Happy Learning!!!

November 28, 2017

Day #92 - Mean Encoding

Mean Encoding
  • Add new variables based on certain features
  • Label encoding is done usually
  • Mean encoding is done as variable count / distinct unique variables
  • The proportion of label encoding also is included in this step
  • Mean encoding combined with label encoding
  • Label encoding - No logical order
  • Mean encoding - Classes are separable
  • We can reach a better loss with shorter trees
  • Without mean encoding, trees need a huge number of splits
  • Model tries to treat all categories differently
Constructing Mean Encoding
  • Goods - Number of ones in a group
  • Bads - Number of zeros
Likelihood = Goods/(Goods + Bads) = mean(target)
Weight of Evidence = ln(Goods/Bads)*100
Count = Goods = sum(target)
Diff = Goods-Bads
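A small pandas sketch computing these quantities per category on toy data:

import numpy as np
import pandas as pd

df = pd.DataFrame({"category": ["a", "a", "b", "b", "b", "c", "c"],
                   "target":   [1, 0, 1, 1, 0, 0, 1]})

grp = df.groupby("category")["target"]
goods = grp.sum()           # number of ones in each group
bads = grp.count() - goods  # number of zeros in each group

likelihood = goods / (goods + bads)  # mean(target)
woe = np.log(goods / bads) * 100     # weight of evidence
count = goods                        # sum(target)
diff = goods - bads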


Happy Learning!!!

November 24, 2017

Database Sharding and Scalability Basics

Some Key considerations for NOSQL Vs RDBMS
  • Performance - Latency tolerance, How slow my queries can run for huge data sets
  • Durability - Tolerance for data loss when the database crashes (losing in-memory data or in-flight transactions)
  • Consistency - Weird results tolerance (Dirty data tolerance)
  • Availability - Downtime tolerance
Options for Scalability
  • Replication - Create copies of database, Application can talk to either database
  • Sharding - Choose a partition key; key-value stores partition data based on that key
  • Caching - Precomputed and stored, Manage cache expiration time and refresh logic
For streaming data we had already discussed Event Hubs and Apache Kafka. Now we also have KSQL (Kafka streaming SQL that runs on continuous data).

Great Session Talk

 

RDBMS VS NOSQL Considerations, Quick Summary
  • Performance - Latency tolerance
  • Durability - Data loss tolerance
  • Consistency - Weird results tolerance (Dirty data tolerance)
  • Availability - Downtime tolerance
Happy Learning!!!

November 16, 2017

Day #91- Retail Analytics - Data Mining / Analytics

Running a successful #Retail store involves a lot of data mining / analytics challenges to solve, with decisions made based on data. Some of the interesting retail data mining / analytics problems are:
  • What sells best in each store, down to item-level detail?
  • What are the shopping times/routines for a particular store?
  • Using web data, how relevant is the shopping district / retail environment?
  • What are the money-making items in the store (quantity vs price)?
  • What is the sales / stock ratio?
  • What is the forecast of minimum orders for items in each store based on sales/traffic trends?
  • What is the correlation between loss items, shopping days/periods, and people movements?
  • What retail price points can be identified based on end-of-season sales?
Forecasts / predictions come as the next step after data analysis.
Happy Analytics!!!

November 15, 2017

Day #90 - Regression Metrics Optimization

RMSE, MSE, R-Squared (Sometimes called L2 Loss)
Tree-Based
  • XGBoost, LightGBM
  • sklearn.RandomForestRegressor
Linear Models
  • sklearn.<>Regression
  • sklearn.SGDRegressor
Neural Networks
  • PyTorch
  • Keras
MAE (L1, Median Regression)
Tree-Based
  • LightGBM
  • sklearn.RandomForestRegressor
MSPE, MAPE
  • MSPE is weighted version of MSE
  • MAPE is weighted version of MAE
Happy coding and learning!!!

November 14, 2017

Day #89 - Capsule networks

Key lessons
  • Instead of adding layers it nests layers inside it
  • We apply a non-linearity to grouped neurons (a capsule)
  • Dynamic routing - Replace scalar output feature detector of CNN by routing by agreement based on output
CNN History
  • Latest paper on capsule networks
  • Offers state of art performance for MNIST dataset
  • Convolutional networks - Learn mapping for input data and output label
  • Convolution layer - Series of matrix multiplication and summation operation, Output feature map (bunch of learned features from image)
  • ReLU - Apply non-linearity (so the network can learn both linear and non-linear functions). Helps with the vanishing gradient problem (as the gradient is backpropagated it gets smaller and smaller; ReLU mitigates this)
  • Pooling - Creates sections and take maximum pixel value from each sections
  • Each line of code corresponds to layers in networks
  • Dropout - Neurons are randomly turned off during training to prevent overfitting (a regularization technique)
  • For handling rotations - AlexNet added different rotations to generalize to different rotations
  • Deeper networks improved classification accuracy
  • VGGnet adding more layers
  • GoogLeNet - Convolutions with different filter sizes processed on the same input, several of those stacked together
  • ResNet - Instead of simply stacking layers, skip (add) connections alleviate the vanishing gradient problem

Convolutional Network Challenges
  • As we go up the hierarchy each of features learnt will be more complex
  • Hierarchy happening with each layers
  • Sub-sampling loses spatial relationships
  • Spatial correlations are missed in sub-sampling and pooling
  • Bad for rotated images (Invariance issues)
Capsule Networks
  • Basic idea - The human brain attains translational invariance in a better way; instead of adding layers it nests layers inside them
  • Nested layer is called capsule, group of neurons
  • CNN route by pooling
  • Deeper in terms of nesting
Layer based squashing
  • Based on output neuron we apply non-linearity
  • We apply a non-linearity to grouped neurons (a capsule)
Dynamic routing
  • Replace scalar output by routing by agreement
  • Hierarchy tree of nested layers
Key difference - All iterations to compute output, For every capsule nested apply operations
Happy coding and learning!!!

Day #88 - Metrics Optimization

Loss vs Metric
  • Metric - Function which we want to use to evaluate the model. Maximum accuracy in classification
  • Optimization Loss - Easy to optimize for given model, Function our model optimizes. MSE, LogLoss
  • Preprocess train and optimize another metric - MSPE, MAPE, RMSLE
  • Optimize another metric, then postprocess predictions - Accuracy, Kappa
  • Early Stopping - Stop training when the model starts to overfit
 Custom loss functions

Happy Coding and Learning!!!

November 10, 2017

Day #87 - Classification Metrics

  • Accuracy (Essential for classification), Weighted Accuracy = Weighted Kappa
  • Logarithmic Loss (Depends on soft predictions probabilities)
  • Area under the Receiver Operating Characteristic curve (considers the ordering of objects; tries all thresholds to convert soft predictions to hard labels)
  • Kappa (Similar to R Squared)
Notations
N - Number of objects
L - Number of classes
y - Ground truth
yi - Predictions
[a = b] - indicator function
  • Soft labels (soft predictions) are classifier's scores - Probabilities of objects
  • Hard labels (hard predictions) - argmax f(x), or [f(x) > b] where b is the threshold for binary classification; the predicted label is the class with the maximum soft prediction. A function of the soft labels
Accuracy Score
  • Most referred measure of classifier quality
  • Higher is better
  • Need hard predictions
  • Fraction of correctly guessed objects
  • Argmax of soft predictions
Logloss
  • Work with soft predictions
  • Make classifier output posterior probabilities
  • Penalises for wrong answers
  • The best constant prediction is the frequency of each class
Area Under Curve
  • For a given threshold, predictions are split into above / below the threshold
  • Metric tries all possible ones and aggregate scores
  • Depends on order of objects
AUC - ROC
  • Compute TruePositive, FalsePositive
  • AUC max value 1
  • Fraction of correctly ordered pairs
AUC = Fraction of  correctly ordered pairs / total number of pairs
 = 1 - (Fraction of incorrectly ordered pairs / total number of pairs)

Cohen's Kappa
  • Score = 1- ((1-accuracy)/(1-baseline))
  • Baselines different for each data
  • Similar to R squared
  • Here the baseline is the accuracy expected from chance predictions that follow the class frequencies
  • Error = (1- Accuracy)
  • Weighted error score = multiply the confusion matrix by the weight matrix element-wise and sum the result
  • Weighted Kappa = 1 - ((weighted error)/(weighted baseline error))
  • Useful for medical applications
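A minimal sklearn sketch computing these classification metrics on toy soft and hard predictions:

import numpy as np
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score, cohen_kappa_score

y_true = np.array([0, 0, 1, 1, 1])
proba = np.array([0.1, 0.4, 0.35, 0.8, 0.9])  # soft predictions
y_pred = (proba > 0.5).astype(int)            # hard labels via a threshold

print("accuracy:", accuracy_score(y_true, y_pred))
print("log loss:", log_loss(y_true, proba))
print("AUC     :", roc_auc_score(y_true, proba))
print("kappa   :", cohen_kappa_score(y_true, y_pred))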
Happy Learning and Coding!!!

November 09, 2017

Day #86 - Regression Metrics

  • Relative Errors most important to us
  • MSE, MAE work with absolute errors, not relative errors
  • MSPE (mean square percentage error)
  • MAPE (mean absolute percentage error) - Weighted version of MAE
  • RMSLE (Root mean square logarithmic error) - RMSE calculated on a logarithmic scale - Cares about relative errors
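A small numpy sketch of the relative-error metrics on toy values:

import numpy as np

y_true = np.array([10.0, 100.0, 1000.0])
y_pred = np.array([12.0,  90.0, 1100.0])

mspe = np.mean(((y_true - y_pred) / y_true) ** 2)   # mean square percentage error
mape = np.mean(np.abs((y_true - y_pred) / y_true))  # mean absolute percentage error
rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))  # RMSE in log space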
Happy Coding and Learning!!!

November 07, 2017

Day #85 - Regression Metrics Optimization

Metrics
  • Metrics used to evaluate submissions
  • The best model (e.g. the optimal hyperplane) can be different for different metrics
  • Exploratory metric analysis along with data analysis
  • Own ways to measure effectiveness of algorithms
Regression - Metrics
  • Mean Square Error
  • RMSE
  • R Squared
  • Same from optimization perspective
Classification
  • Accuracy
  • LogLoss
  • AUC
  • Cohen's Kappa
Regression Metrics
N - Samples
y - target values
y~ - target Predictions
yi - target ith value
yi~ - prediction ith object

Mean Square Error
MSE = (1/N) * Σ (yi - yi~)^2
- Average of the squared differences between targets and predictions

RMSE - Root Mean square Error = Sqrt(MSE)

  • Same as scale of target
  • RMSE vs MSE
  • Similar in terms of minimizers
  • Every RMSE minimizer is MSE minimizer
  • MSE(a) > MSE(b) <=> RMSE(a) > RMSE(b)
  • MSE orders in same way as RMSE
  • MSE easier to work with
  • Bit of difference in gradient based model
  • They may not be interchangeable for some learning methods (learning rate)
R Squared
  • How much model is better than constant baseline
  • R Squared = 1 means perfect predictions
  • When MSE is 0, R Squared = 1; predicting the constant mean gives R Squared = 0
  • All reasonable models score between 0 and 1
MAE - Mean Absolute Error
  • Avg of absolute difference value between target and predictions
  • Widely used in Finance
  • A $10 error is twice as bad as a $5 error
  • MAE easier to justify
  • The best constant prediction for MAE is the median of the target values
  • MAE gradient is a step function: -1 when the prediction is smaller than the target, +1 when it is greater
  • MAE is not differentiable where the prediction equals the target
MAE vs MSE
  • If the outliers really are outliers (noise), use MAE
  • If they are unexpected but still normal values, use MSE
  • MAE is robust to outliers
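A small sketch computing MSE, RMSE, MAE and R Squared with sklearn on toy values:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.0, 8.0, 12.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # same scale as the target, same ordering as MSE
mae = mean_absolute_error(y_true, y_pred)  # more robust to outliers
r2 = r2_score(y_true, y_pred)              # 1.0 means perfect predictions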
Happy Learning and Coding!!!

November 05, 2017

Day #84 - Data Leaks and Validations

  • Mimic the train / test split of the test data
  • Perform KFold Validations
  • Choose best parameters for models
  • Submission Stage (Can't mimic exact train / test split)
  • Calculate mean and standard deviations of leader board scores
Data Leaks
  • Unexpected information in data that lets you make good predictions
  • Unusable in real world
  • Results of unintentional error
Time Series
  • Incorrect time splits still exist
  • Check public and private splits
  • Missing feature columns are data leaks
Unexpected Information
  • Use File creation dates
  • Resize features / change creation date
  • ID's no sense to include in model
Happy Learning and Coding!!!

October 31, 2017

Day #83 - Data Splitting Strategies

  • Time based splits
  • Validation should mimic the train / test split
  • Time based trend - differs significantly, Time based patterns important
Different splitting strategies can differ significantly
  • In generated features
  • In a way model will rely on that features
  • In Some kind of target leak
 Split Categories
  •  Random Split (Split randomly by rows, Rows independent of each other), Row wise
  • Devise special features for dependency cases
  • Timewise - Before particular date as training, After date as testing data. Useful features based on target
  • Moving window validation
  • By Id - (By Clustering pictures, grouping them and then finding features)
  • Combined (Split date for each shop independently)
Summary
  • In most cases split by Rownumber, Time, Id
  • Logic for feature generation depends on data splitting strategy
  • Set up your validation to mimic the train / test split of competition
Happy Learning and Coding!!!

Day #82 - Validation and Overfitting


  • Train Data (Past), Unseen Test Data (Future)
  • Divide into three parts - Train (Past), Validation (Past), Test (Future)
  • Underfitting (High Error on Both Training and Validation)
  • Overfitting (Doesn't generalize to test data, Low Error on Train, High Error on Validation)
  • Ideal (Lowest error on both training and validation data)
Validation Strategies
  • Hold Out (divide data into training / testing, No overlap between training / testing data ) - Used on Shuffle Data
  • K-Fold (Repeated hold out because we split our data) - Good Choice for medium amount of data, K- 1 training, one subset - Used on Shuffle Data
  • Leave one out : ngroups = len(train) - Too Little data (Special case of K fold, K = number of samples)
  • Stratification - Similar target distribution over different folds
Stratification useful for
  • Small datasets (Do Random Splits)
  • Unbalanced datasets
  • Multiclass classification
 Stratification preserves the target distribution over different folds
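A minimal sklearn sketch of K-Fold and its stratified version on a toy unbalanced dataset:

from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold

X, y = make_classification(n_samples=100, weights=[0.9, 0.1], random_state=0)  # unbalanced toy data

kf = KFold(n_splits=5, shuffle=True, random_state=0)             # repeated hold-out over K folds
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # preserves target distribution per fold

for train_idx, val_idx in skf.split(X, y):
    X_tr, X_val = X[train_idx], X[val_idx]
    y_tr, y_val = y[train_idx], y[val_idx]
    # fit the model on (X_tr, y_tr) and evaluate on (X_val, y_val)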

Happy Coding and Learning!!!