"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

October 31, 2017

Day #83 - Data Splitting Strategies

  • Time-based splits
  • Set up validation to mimic the train / test split
  • Time-based trends can differ significantly, so time-based patterns are important
Splitting strategies can differ significantly:
  • In the features that are generated
  • In the way the model relies on those features
  • In whether they introduce some kind of target leak
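To make the target leak point concrete, here is a minimal sketch of a target-based feature built for a time-based split. The row layout (period, shop_id, target) and the function name past_mean_feature are assumptions made for illustration, and the sketch assumes one row per shop per period. The idea is that each row only sees targets from earlier rows of the same shop.

```python
from collections import defaultdict

# Hypothetical rows: (period, shop_id, target). Names and values are illustrative only.
rows = [
    (1, "a", 10.0), (1, "b", 4.0),
    (2, "a", 12.0), (2, "b", 6.0),
    (3, "a", 14.0), (3, "b", 8.0),
]

def past_mean_feature(rows):
    """Mean target of the same shop over its earlier rows (None if no history).

    Features are returned in time-sorted row order.
    Assumes at most one row per shop per period.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    features = []
    for period, shop, target in sorted(rows, key=lambda r: r[0]):
        features.append(totals[shop] / counts[shop] if counts[shop] else None)
        # Update running statistics only after the feature is emitted,
        # so a row's own target never leaks into its own feature.
        totals[shop] += target
        counts[shop] += 1
    return features

features = past_mean_feature(rows)
print(features)
```

With a random row-wise split, a plain per-shop mean over all training rows might be acceptable. With a time-based split, the same feature would quietly use future targets, which is why the feature logic has to follow the split.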
Split Categories
  • Random split (row-wise): split randomly by rows, assuming rows are independent of each other
  • Devise special features for cases where rows depend on each other
  • Time-wise: rows before a particular date are training data, rows after it are testing data. Useful features can be based on the target
  • Moving window validation
  • By ID (for example, clustering pictures, grouping them, and then finding features)
  • Combined (for example, split by date for each shop independently)
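The split categories above can be sketched in plain Python. This is only an illustration under stated assumptions: the toy rows, the field names (shop_id, day, sales), and all four function names are hypothetical and not taken from any library.

```python
import random
from datetime import date

# Hypothetical toy data: 3 shops x 10 days. Field names are illustrative assumptions.
rows = [
    {"shop_id": s, "day": date(2017, 10, d), "sales": 10 * s + d}
    for s in (1, 2, 3)
    for d in range(1, 11)
]

def random_split(rows, test_fraction=0.2, seed=0):
    """Row-wise split: shuffle rows and hold out a fraction (assumes independent rows)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def time_split(rows, cutoff):
    """Time-wise split: rows before the cutoff train, rows on or after it validate."""
    train = [r for r in rows if r["day"] < cutoff]
    valid = [r for r in rows if r["day"] >= cutoff]
    return train, valid

def id_split(rows, valid_ids):
    """Split by ID: all rows of a given ID land on one side only."""
    train = [r for r in rows if r["shop_id"] not in valid_ids]
    valid = [r for r in rows if r["shop_id"] in valid_ids]
    return train, valid

def combined_split(rows, cutoff_by_shop):
    """Combined split: apply a separate date cutoff to each shop."""
    train = [r for r in rows if r["day"] < cutoff_by_shop[r["shop_id"]]]
    valid = [r for r in rows if r["day"] >= cutoff_by_shop[r["shop_id"]]]
    return train, valid

train_r, valid_r = random_split(rows)
train_t, valid_t = time_split(rows, cutoff=date(2017, 10, 8))
train_i, valid_i = id_split(rows, valid_ids={3})
train_c, valid_c = combined_split(
    rows, {1: date(2017, 10, 6), 2: date(2017, 10, 8), 3: date(2017, 10, 10)}
)
```

In practice a library such as scikit-learn offers ready-made splitters (for example train_test_split, GroupKFold, TimeSeriesSplit); the hand-written versions here are only meant to show what each category does to the rows.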
Summary
  • In most cases, split by row number, time, or ID
  • The logic for feature generation depends on the data splitting strategy
  • Set up your validation to mimic the train / test split of the competition
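When the competition split is time based, one way to mimic it is the moving window validation listed under the split categories. Below is a minimal sketch; the function name moving_window_folds and its parameters are hypothetical, and periods are assumed to be numbered 0, 1, 2, ... in time order.

```python
def moving_window_folds(n_periods, train_size, valid_size, step=1):
    """Yield (train_periods, valid_periods); validation always follows training in time."""
    start = 0
    while start + train_size + valid_size <= n_periods:
        train = list(range(start, start + train_size))
        valid = list(range(start + train_size, start + train_size + valid_size))
        yield train, valid
        start += step  # slide the whole window forward

folds = list(moving_window_folds(n_periods=6, train_size=3, valid_size=1))
for train, valid in folds:
    print("train periods", train, "-> validate on", valid)
```

Each fold trains on a fixed-length window and validates on the period right after it, which resembles a train / test split made by date.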
Happy Learning and Coding!!!
