"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

October 30, 2017

Day #81 - Dataset Cleaning


Dataset cleaning
  • Constant features (Remove features whose value stays constant in both the training and test data. If the value is constant in training but changes in test, it is still better to remove the feature, since the model never sees it vary. Constant features can appear when only a fraction of the full data is supplied.)
  • Duplicated features (Completely identical columns only slow down training, so remove the duplicate columns)
  • Duplicated categorical features (Columns can be identical up to a renaming of their levels, so encode the categorical features first and then compare them)
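
The three checks above can be sketched with pandas. This is only a minimal sketch, not a definitive recipe: the function names and the toy DataFrame are hypothetical, and it assumes the data fits in memory as a DataFrame. Categorical columns are encoded by order of first appearance with pd.factorize, so two columns that differ only in level names receive the same codes.

```python
import pandas as pd

def constant_columns(train):
    # At most one distinct value in training (NaN counted as a value).
    return [c for c in train.columns if train[c].nunique(dropna=False) <= 1]

def duplicated_columns(df):
    # Map each duplicate column to the first column it is identical to.
    # Series.equals also requires matching dtypes.
    dupes = {}
    cols = list(df.columns)
    for i, a in enumerate(cols):
        if a in dupes:
            continue
        for b in cols[i + 1:]:
            if b not in dupes and df[a].equals(df[b]):
                dupes[b] = a
    return dupes

def duplicated_categoricals(df, cat_cols):
    # Encode levels by order of first appearance, then compare the codes.
    encoded = pd.DataFrame({c: pd.factorize(df[c])[0] for c in cat_cols})
    return duplicated_columns(encoded)

# Hypothetical toy data for illustration only.
train = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1, 2, 3, 4],           # exact copy of "a"
    "c": [7, 7, 7, 7],           # constant
    "f1": ["x", "y", "x", "z"],
    "f2": ["p", "q", "p", "r"],  # same as "f1" up to level names
})

to_drop = set(constant_columns(train))
to_drop |= set(duplicated_columns(train))
to_drop |= set(duplicated_categoricals(train, ["f1", "f2"]))
cleaned = train.drop(columns=sorted(to_drop))
```

The pairwise comparison is quadratic in the number of columns, which is fine for a few hundred features. For very wide data, hashing each column first would likely be faster.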
Other things to check
  • Duplicated rows (Duplicated rows with different targets could be the result of a mistake; remove those rows to get a better score on the test set)
  • Check for common rows in train and test sets (For test rows that also appear in the training set, the labels can be set manually from the training data)
  • Check if dataset is shuffled (Plot the target against the row index; if the data is shuffled, only oscillations around the mean would be observed)
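
A rough pandas sketch of these three checks follows. The function names, the column names and the window size are all hypothetical, and the shuffle check swaps the plot for a single number: the largest gap between the rolling mean of the target and its overall mean. A gap that is small relative to the spread of the target is consistent with shuffled data.

```python
import pandas as pd

def conflicting_duplicates(train, feature_cols, target):
    # Rows whose features are identical but whose targets disagree.
    # Rows with NaN in feature_cols are not grouped by default.
    n_targets = train.groupby(feature_cols)[target].transform("nunique")
    return train[n_targets > 1]

def known_test_labels(train, test, feature_cols, target):
    # Left-join test onto train features; the target is NaN where the
    # test row never appears in train. Assumes test has no target column.
    known = train[feature_cols + [target]].drop_duplicates(subset=feature_cols)
    return test.merge(known, on=feature_cols, how="left")

def rolling_mean_deviation(values, window):
    # Largest absolute gap between the rolling mean and the overall mean.
    s = pd.Series(values, dtype="float64")
    rolling = s.rolling(window).mean().dropna()
    return (rolling - s.mean()).abs().max()

# Hypothetical toy data for illustration only.
train = pd.DataFrame({
    "x": [1, 1, 2, 2, 3],
    "y": ["a", "a", "b", "b", "c"],
    "target": [0, 1, 1, 1, 0],
})
test = pd.DataFrame({"x": [2, 9], "y": ["b", "z"]})

conflicts = conflicting_duplicates(train, ["x", "y"], "target")
clean_train = train.drop(conflicts.index)
labeled_test = known_test_labels(clean_train, test, ["x", "y"], "target")
```

Whether conflicting duplicates should be dropped or kept depends on how the data was generated, so it is worth inspecting them before removing anything.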
EDA Checklist
  • Get Domain Knowledge
  • Check how the data is generated
  • Explore individual features
  • Explore pairs and groups
  • Clean features
Happy Learning and Coding!!!
