"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

October 27, 2017

Day #73 - Feature Generation - Categorical and ordinal features

  • Label Encoding - based on sort order or order of appearance
  • Frequency Encoding - based on percentage of occurrence
Categorical Features
  • Sex, Cabin, Embarked
  • One Hot Encoding
  • pandas.get_dummies
  • sklearn.preprocessing.OneHotEncoder
  • Works well for linear methods (minimum value is 0, maximum is 1)
  • Harder for tree-based methods to use efficiently
  • Store only the non-zero elements (sparse matrices)
  • Create combinations of features to get better results
  • Concatenate the strings from both columns
  • One-hot encode the concatenation; a linear model can then find an optimal coefficient for every interaction (example and code sketch below)
Example: pclass and sex concatenated into pclass_sex
pclass,sex,pclass_sex
3,male,3male
1,female,1female
3,female,3female
1,female,1female

One-hot encoded pclass_sex:
1male,1female,2male,2female,3male,3female
0,0,0,0,1,0
0,1,0,0,0,0
0,0,0,0,0,1
0,1,0,0,0,0
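A minimal sketch of the interaction example above, assuming a small pandas DataFrame with pclass and sex columns (pandas.get_dummies only creates columns for the combinations that actually appear in the data):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame matching the example rows above
df = pd.DataFrame({'pclass': [3, 1, 3, 1],
                   'sex': ['male', 'female', 'female', 'female']})

# Concatenate the two columns into a single interaction feature
df['pclass_sex'] = df['pclass'].astype(str) + df['sex']

# One-hot encode the interaction; a linear model can then fit a separate
# coefficient for every pclass/sex combination present in the data
dummies = pd.get_dummies(df['pclass_sex'])
print(dummies)

# sklearn's OneHotEncoder returns a sparse matrix by default,
# storing only the non-zero elements
sparse = OneHotEncoder().fit_transform(df[['pclass_sex']])
print(sparse.shape)       # (4, 3)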

Ordinal Features
  • Ordered categorical feature
  • First class is the most expensive, second class less, third class the least expensive
  • Driver's license type: A, B, C, D
  • Level of education (sorted in order of increasing complexity)
  • Label encoding: map categories to numbers (works for tree-based models)
  • Non-tree models can't use label-encoded ordinals effectively
Label Encoding
1. Alphabetically sorted: [S,C,Q] -> [3,1,2]
 - sklearn.preprocessing.LabelEncoder

2. Order of Appearance
[S,C,Q] -> [1,2,3]
 - pandas.factorize
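A minimal sketch of both mappings on a toy Embarked series (note that LabelEncoder and pandas.factorize actually return zero-based codes, while the lists above use 1-based numbers for illustration):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

embarked = pd.Series(['S', 'C', 'Q', 'S'])

# Alphabetical: classes are sorted (C, Q, S), so C -> 0, Q -> 1, S -> 2
alpha_codes = LabelEncoder().fit_transform(embarked)
print(alpha_codes)        # [2 0 1 2]

# Order of appearance: first value seen gets 0, the next gets 1, ...
appearance_codes, uniques = pd.factorize(embarked)
print(appearance_codes)   # [0 1 2 0]
print(list(uniques))      # ['S', 'C', 'Q']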

Frequency Encoding (based on percentage of occurrences)
[S,C,Q] -> [0.5,0.3,0.2]
encoding = titanic.groupby('Embarked').size()
encoding = encoding / len(titanic)
titanic['enc'] = titanic.Embarked.map(encoding)

Frequency encoding can help linear models: if the frequency of a category is correlated with the target, a linear model will pick up that dependency. It also preserves information about the value distribution.
  • If several categories share the same frequency they become indistinguishable; apply a rank transform to break the ties (sketch below)
  • from scipy.stats import rankdata
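The list above only names rankdata; here is a rough sketch of one way the tie-breaking could look, assuming a toy Embarked column (using rankdata with method='ordinal' to force distinct ranks is my assumption, not a prescribed recipe):

import pandas as pd
from scipy.stats import rankdata

titanic = pd.DataFrame({'Embarked': ['S', 'S', 'C', 'C', 'Q']})

# Frequency encoding: S and C both map to 0.4, so they tie
freq = titanic.groupby('Embarked').size() / len(titanic)

# Rank the per-category frequencies; method='ordinal' forces distinct
# ranks even for ties, so tied categories stay distinguishable
rank_encoding = pd.Series(rankdata(freq, method='ordinal'), index=freq.index)
titanic['enc'] = titanic['Embarked'].map(rank_encoding)
print(titanic)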
Summary
  • Ordinal features are a special case of categorical features
  • Label encoding maps categories to numbers
  • Frequency encoding maps categories to their frequencies
  • Label and frequency encoding are used for tree-based models
  • One-hot encoding is used for non-tree-based models
  • Interactions of categorical features can help linear models and KNN

Happy Coding and Learning!!!
