"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

October 29, 2017

Day #76 - Text Processing - Kaggle Lessons


Bag of Words
  • Create a new column for each unique word in the data
  • Count word occurrences in each document
  • sklearn.feature_extraction.text.CountVectorizer
  • Make counts comparable across documents using Term Frequency (row normalization)
  • tf = 1 / x.sum(axis=1)[:,None]
  • x = x * tf
  • Inverse Document Frequency (down-weights words that appear in many documents)
  • idf = np.log(x.shape[0] / (x > 0).sum(0)) (see the sketch after this list)
  • N-grams (examples below)
  • Bag of Words matrix: each row represents a text, each column a unique word
  • Useful features for classifying documents
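
A minimal sketch of the TF/IDF scaling above, assuming x is the dense document-term count matrix from CountVectorizer (the toy corpus is my own):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["this is a sentence", "this is another sentence"]
x = CountVectorizer().fit_transform(docs).toarray().astype(float)

# Term Frequency: divide each row by its total count so documents
# of different lengths become comparable
tf = 1 / x.sum(axis=1)[:, None]
x = x * tf

# Inverse Document Frequency: log(N / document frequency) down-weights
# words that appear in many documents
idf = np.log(x.shape[0] / (x > 0).sum(0))
x = x * idf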

For N = 1, "This is a sentence"
Unigrams are - This, is, a, sentence

For N = 2, "This is a sentence"
Bigrams are - This is, is a, a sentence

For N = 3, "This is a sentence"
Trigrams are - This is a, is a sentence

sklearn.feature_extraction.text.CountVectorizer: ngram_range, analyzer
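
A quick sketch of those two parameters (note that CountVectorizer's default token pattern drops one-character tokens such as "a"):

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps unigrams and bigrams; (2, 2) would keep bigrams only
vec = CountVectorizer(ngram_range=(1, 2))
vec.fit(["This is a sentence"])
print(vec.get_feature_names_out())  # get_feature_names() on older sklearn

# analyzer='char' switches from word n-grams to character n-grams
char_vec = CountVectorizer(analyzer='char', ngram_range=(3, 3))
char_vec.fit(["This is a sentence"])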

Text Preprocessing steps
  • Lower case
  • Lemmatization (uses knowledge of the vocabulary and morphological analysis of words)
  • democracy, democratic and democratization -> democracy (Lemmatization)
  • Stemming (chops off the endings of words)
  • democracy, democratic and democratization -> democr (Stemming)
  • Stop words (words that do not carry important information)
sklearn.feature_extraction.text.CountVectorizer: the stop_words and max_df parameters help remove stop words
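
A minimal sketch of these steps, assuming NLTK for stemming/lemmatization (real tools will not reproduce the idealized outputs above exactly):

from nltk.stem import PorterStemmer, WordNetLemmatizer
# WordNetLemmatizer needs the wordnet data: nltk.download('wordnet')

words = ["Democracy", "Democratic", "Democratization"]
lowered = [w.lower() for w in words]                          # lower case

stems = [PorterStemmer().stem(w) for w in lowered]            # chop endings heuristically
lemmas = [WordNetLemmatizer().lemmatize(w) for w in lowered]  # vocabulary-based

# Stop words can be handled inside CountVectorizer itself:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(lowercase=True,        # on by default
                      stop_words='english',  # built-in stop word list
                      max_df=0.9)            # also drop corpus-specific too-frequent words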

I have done all of this in my assignment work; the code is in my GitHub repo.

For applying Bag of Words
  • Preprocessing - lowercasing, stemming, lemmatization, stop word removal
  • N-grams can help to use local context
  • Postprocessing - TF-IDF (see the sketch after this list)
  • Use BOW over N-grams as well
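
In practice sklearn bundles BOW plus the TF-IDF postprocessing into a single class; a short sketch (its IDF formula adds smoothing, so results differ slightly from the manual version above):

from sklearn.feature_extraction.text import TfidfVectorizer

# Preprocessing + n-grams + TF-IDF weighting in one step
vec = TfidfVectorizer(lowercase=True, stop_words='english', ngram_range=(1, 2))
X = vec.fit_transform(["The dog is on the table",
                       "The food was good, not bad at all"])
print(X.shape)  # (n_documents, n_unique_ngrams)
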
BOW example
  • Sentence - "The dog is on the table"
  • Vocabulary         - are, cat, dog, is, now, on, the, table
  • BOW representation -   0,   0,   1,  1,   0,  1,   2,     1 ("the" appears twice)
BOW Issue

The food was good, not bad at all
The food was bad, not good at all

Both representations are identical, yet the meanings are opposite :)
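
A quick check with CountVectorizer: both sentences contain exactly the same words, so unigram BOW cannot tell them apart.

from sklearn.feature_extraction.text import CountVectorizer

sents = ["The food was good, not bad at all",
         "The food was bad, not good at all"]

X = CountVectorizer().fit_transform(sents).toarray()
print((X[0] == X[1]).all())  # True: identical vectors, opposite meanings

With ngram_range=(1, 2) the two sentences get different features ("not bad" vs "not good"), which is one reason n-grams help.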

Word Vectors
  • Get vector representations of words and texts
  • Each word is converted to a vector
  • Training uses nearby words (context)
  • Different words used in the same context get similar vector representations
  • Basic arithmetic operations can be applied to the vectors
  • Words - Word2Vec, GloVe, FastText
  • Sentences - Doc2Vec
  • Pretrained models are available (see the sketch below)
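
A sketch using gensim's downloader API, assuming the "glove-wiki-gigaword-100" bundle (any pretrained Word2Vec/GloVe/FastText vectors load the same way):

import gensim.downloader as api

# Downloads the pretrained vectors on first use
model = api.load("glove-wiki-gigaword-100")

print(model.most_similar("dog", topn=3))             # nearby words in the vector space
print(model.most_similar(positive=["king", "woman"],
                         negative=["man"], topn=1))  # vector arithmetic, typically ~ "queen"
vec = model["dog"]                                   # the raw 100-dimensional vector
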
Bag of Words
  • Very large vectors
  • Meaning of each value in the vector is known (the count of a specific word)
Word2Vec
  • Relatively small vectors
  • Values of the vector can be interpreted only in some cases
  • Words with similar meanings often have similar embeddings
Happy Learning, Happy Coding!!!
