"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

October 29, 2017

Day #76 - Text Processing - Kaggle Lessons


Bag of Words
  • Create a new column for each unique word in the data
  • Count word occurrences in each document
  • sklearn.feature_extraction.text.CountVectorizer
  • Make counts comparable across documents using Term Frequency (row normalization)
  • tf = 1 / x.sum(axis=1)[:,None]
  • x = x * tf
  • Inverse Document Frequency (down-weights words that appear in many documents)
  • idf = np.log(x.shape[0] / (x > 0).sum(0)) (see the sketch after this list)
  • N-grams (examples below)
  • Bag of Words matrix: each row represents a text, each column a unique word
  • Useful features for classifying documents
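
A minimal sketch of the TF/IDF scaling above, assuming x is the dense document-term count matrix from CountVectorizer (the toy corpus is my own):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["this is a sentence", "this is another sentence"]
x = CountVectorizer().fit_transform(docs).toarray().astype(float)

# Term Frequency: divide each row by its total count so documents
# of different lengths become comparable
tf = 1 / x.sum(axis=1)[:, None]
x = x * tf

# Inverse Document Frequency: log(N / document frequency) down-weights
# words that appear in many documents
idf = np.log(x.shape[0] / (x > 0).sum(0))
x = x * idf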

For N = 1, "This is a sentence"
Unigrams are - This, is, a, sentence

For N = 2, "This is a sentence"
Bigrams are - This is, is a, a sentence

For N = 3, "This is a sentence"
Trigrams are - This is a, is a sentence

sklearn.feature_extraction.text.CountVectorizer: ngram_range, analyzer
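
A quick sketch of those two parameters (note that CountVectorizer's default token pattern drops one-character tokens such as "a"):

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps unigrams and bigrams; (2, 2) would keep bigrams only
vec = CountVectorizer(ngram_range=(1, 2))
vec.fit(["This is a sentence"])
print(vec.get_feature_names_out())  # get_feature_names() on older sklearn

# analyzer='char' switches from word n-grams to character n-grams
char_vec = CountVectorizer(analyzer='char', ngram_range=(3, 3))
char_vec.fit(["This is a sentence"])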

Text Preprocessing steps
  • Lower case
  • Lemmatization (uses knowledge of the vocabulary and morphological analysis of words)
  • democracy, democratic and democratization -> democracy (Lemmatization)
  • Stemming (chops off the endings of words)
  • democracy, democratic and democratization -> democr (Stemming)
  • Stop words (words that do not carry important information)
sklearn.feature_extraction.text.CountVectorizer: the stop_words and max_df parameters help remove stop words
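
A minimal sketch of these steps, assuming NLTK for stemming/lemmatization (real tools will not reproduce the idealized outputs above exactly):

from nltk.stem import PorterStemmer, WordNetLemmatizer
# WordNetLemmatizer needs the wordnet data: nltk.download('wordnet')

words = ["Democracy", "Democratic", "Democratization"]
lowered = [w.lower() for w in words]                          # lower case

stems = [PorterStemmer().stem(w) for w in lowered]            # chop endings heuristically
lemmas = [WordNetLemmatizer().lemmatize(w) for w in lowered]  # vocabulary-based

# Stop words can be handled inside CountVectorizer itself:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(lowercase=True,        # on by default
                      stop_words='english',  # built-in stop word list
                      max_df=0.9)            # also drop corpus-specific too-frequent words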

I have done all of this in my assignment work; the code is in my GitHub repo.

For applying Bag of Words
  • Preprocessing - lowercasing, stemming, lemmatization, stop word removal
  • N-grams can help to use local context
  • Postprocessing - TF-IDF (see the sketch after this list)
  • Use BOW over N-grams as well
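
In practice sklearn bundles BOW plus the TF-IDF postprocessing into a single class; a short sketch (its IDF formula adds smoothing, so results differ slightly from the manual version above):

from sklearn.feature_extraction.text import TfidfVectorizer

# Preprocessing + n-grams + TF-IDF weighting in one step
vec = TfidfVectorizer(lowercase=True, stop_words='english', ngram_range=(1, 2))
X = vec.fit_transform(["The dog is on the table",
                       "The food was good, not bad at all"])
print(X.shape)  # (n_documents, n_unique_ngrams)
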
BOW example
  • Sentence - "The dog is on the table"
  • Vocabulary         - are, cat, dog, is, now, on, the, table
  • BOW representation -   0,   0,   1,  1,   0,  1,   2,     1 ("the" appears twice)
BOW Issue

The food was good, not bad at all
The food was bad, not good at all

Both representations are identical, yet the meanings are opposite :)
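
A quick check with CountVectorizer: both sentences contain exactly the same words, so unigram BOW cannot tell them apart.

from sklearn.feature_extraction.text import CountVectorizer

sents = ["The food was good, not bad at all",
         "The food was bad, not good at all"]

X = CountVectorizer().fit_transform(sents).toarray()
print((X[0] == X[1]).all())  # True: identical vectors, opposite meanings

With ngram_range=(1, 2) the two sentences get different features ("not bad" vs "not good"), which is one reason n-grams help.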

Word Vectors
  • Get vector representations of words and texts
  • Each word is converted to a vector
  • Training uses nearby words (context)
  • Different words used in the same context get similar vector representations
  • Basic arithmetic operations can be applied to the vectors
  • Words - Word2Vec, GloVe, FastText
  • Sentences - Doc2Vec
  • Pretrained models are available (see the sketch below)
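
A sketch using gensim's downloader API, assuming the "glove-wiki-gigaword-100" bundle (any pretrained Word2Vec/GloVe/FastText vectors load the same way):

import gensim.downloader as api

# Downloads the pretrained vectors on first use
model = api.load("glove-wiki-gigaword-100")

print(model.most_similar("dog", topn=3))             # nearby words in the vector space
print(model.most_similar(positive=["king", "woman"],
                         negative=["man"], topn=1))  # vector arithmetic, typically ~ "queen"
vec = model["dog"]                                   # the raw 100-dimensional vector
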
Bag of Words
  • Very large vectors
  • Meaning of each value in the vector is known (the count of a specific word)
Word2Vec
  • Relatively small vectors
  • Values of the vector can be interpreted only in some cases
  • Words with similar meanings often have similar embeddings
Happy Learning, Happy Coding!!!
