"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

October 27, 2017

Day #73 - Feature Generation - Categorical and ordinal features

  • Label Encoding - based on sort order or order of appearance
  • Frequency Encoding - based on percentage of occurrence
Categorical Features
  • Sex, Cabin, Embarked
  • One Hot Encoding
  • pandas.get_dummies
  • sklearn.preprocessing.OneHotEncoder
  • Works well for linear methods (minimum value is 0, maximum is 1)
  • Harder for tree-based methods to use efficiently
  • Store only the non-zero elements (sparse matrices)
  • Create combinations of features to get better results
  • Concatenate the strings from both columns
  • One-hot encode the concatenation; a linear model can then find an optimal coefficient for every interaction (example and code sketch below)
Example: pclass and sex concatenated into pclass_sex
pclass,sex,pclass_sex
3,male,3male
1,female,1female
3,female,3female
1,female,1female

One-hot encoded pclass_sex:
1male,1female,2male,2female,3male,3female
0,0,0,0,1,0
0,1,0,0,0,0
0,0,0,0,0,1
0,1,0,0,0,0
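A minimal sketch of the interaction example above, assuming a small pandas DataFrame with pclass and sex columns (pandas.get_dummies only creates columns for the combinations that actually appear in the data):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame matching the example rows above
df = pd.DataFrame({'pclass': [3, 1, 3, 1],
                   'sex': ['male', 'female', 'female', 'female']})

# Concatenate the two columns into a single interaction feature
df['pclass_sex'] = df['pclass'].astype(str) + df['sex']

# One-hot encode the interaction; a linear model can then fit a separate
# coefficient for every pclass/sex combination present in the data
dummies = pd.get_dummies(df['pclass_sex'])
print(dummies)

# sklearn's OneHotEncoder returns a sparse matrix by default,
# storing only the non-zero elements
sparse = OneHotEncoder().fit_transform(df[['pclass_sex']])
print(sparse.shape)       # (4, 3)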

Ordinal Features
  • Ordered categorical feature
  • First class is the most expensive, second class less, third class the least expensive
  • Driver's license type: A, B, C, D
  • Level of education (sorted in order of increasing complexity)
  • Label encoding: map categories to numbers (works for tree-based models)
  • Non-tree models can't use label-encoded ordinals effectively
Label Encoding
1. Alphabetically sorted: [S,C,Q] -> [3,1,2]
 - sklearn.preprocessing.LabelEncoder

2. Order of Appearance
[S,C,Q] -> [1,2,3]
 - pandas.factorize
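A minimal sketch of both mappings on a toy Embarked series (note that LabelEncoder and pandas.factorize actually return zero-based codes, while the lists above use 1-based numbers for illustration):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

embarked = pd.Series(['S', 'C', 'Q', 'S'])

# Alphabetical: classes are sorted (C, Q, S), so C -> 0, Q -> 1, S -> 2
alpha_codes = LabelEncoder().fit_transform(embarked)
print(alpha_codes)        # [2 0 1 2]

# Order of appearance: first value seen gets 0, the next gets 1, ...
appearance_codes, uniques = pd.factorize(embarked)
print(appearance_codes)   # [0 1 2 0]
print(list(uniques))      # ['S', 'C', 'Q']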

Frequency Encoding (based on percentage of occurrences)
[S,C,Q] -> [0.5,0.3,0.2]
encoding = titanic.groupby('Embarked').size()
encoding = encoding / len(titanic)
titanic['enc'] = titanic.Embarked.map(encoding)

Frequency encoding can help linear models: if the frequency of a category is correlated with the target, a linear model will pick up that dependency. It also preserves information about the value distribution.
  • If several categories share the same frequency they become indistinguishable; apply a rank transform to break the ties (sketch below)
  • from scipy.stats import rankdata
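The list above only names rankdata; here is a rough sketch of one way the tie-breaking could look, assuming a toy Embarked column (using rankdata with method='ordinal' to force distinct ranks is my assumption, not a prescribed recipe):

import pandas as pd
from scipy.stats import rankdata

titanic = pd.DataFrame({'Embarked': ['S', 'S', 'C', 'C', 'Q']})

# Frequency encoding: S and C both map to 0.4, so they tie
freq = titanic.groupby('Embarked').size() / len(titanic)

# Rank the per-category frequencies; method='ordinal' forces distinct
# ranks even for ties, so tied categories stay distinguishable
rank_encoding = pd.Series(rankdata(freq, method='ordinal'), index=freq.index)
titanic['enc'] = titanic['Embarked'].map(rank_encoding)
print(titanic)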
Summary
  • Ordinal features are a special case of categorical features
  • Label encoding maps categories to numbers
  • Frequency encoding maps categories to their frequencies
  • Label and frequency encoding are used for tree-based models
  • One-hot encoding is used for non-tree-based models
  • Interactions of categorical features can help linear models and KNN

Happy Coding and Learning!!!
