"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

January 27, 2018

Day #99 - Statistics and distance based features

Stats
  • Calculate statistics of derived features from neighborhood analysis
  • User_id / Page_id / Ad_price / Ad_position
  • Use label encoder
  • Treat data points implicitly
  • Add lowest and highest price for the position of the ad
  • Maximum and minimum price values
  • Pages user visited
  • Standard deviation of prices
  • Most visited page
  • Many more features
  • Introduce new information
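A minimal sketch of the group-by statistics listed above, using pandas. The DataFrame, its column names and its values are made up for illustration; only the idea of aggregating prices and pages per user comes from the notes.

```python
import pandas as pd

# Toy ad log; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "page_id": [10, 10, 11, 12, 12],
    "ad_price": [5.0, 9.0, 7.0, 4.0, 6.0],
    "ad_position": [1, 2, 1, 1, 2],
})

# Price statistics over all ads shown to the same user on the same page.
group = df.groupby(["user_id", "page_id"])["ad_price"]
df["min_price"] = group.transform("min")
df["max_price"] = group.transform("max")
df["std_price"] = group.transform("std")  # NaN when a group has a single row

# Number of distinct pages each user visited.
df["pages_visited"] = df.groupby("user_id")["page_id"].transform("nunique")

# Page each user visited most often, mapped back onto every row.
most_visited = df.groupby("user_id")["page_id"].agg(
    lambda s: s.value_counts().idxmax()
)
df["most_visited_page"] = df["user_id"].map(most_visited)
```

Using `transform` keeps the result aligned with the original rows, so each new column can be fed to a model without an extra merge.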
Neighbors
  • Number of houses in 500m, 1000m
  • Average price per sq.m
  • Number of schools / supermarkets / parking lots in 500m / 1000m
  • Distance to closest substation
  • Embrace both group-by and nearest neighbor methods
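The neighbor features above can be sketched with plain NumPy. This is a brute-force version that assumes coordinates are already planar and measured in meters; the arrays and feature names are invented for the example, and a spatial index would be a better fit for large data.

```python
import numpy as np

# Toy planar coordinates in meters and price per sq.m; all values are made up.
houses_xy = np.array([[0.0, 0.0], [300.0, 0.0], [0.0, 800.0], [2000.0, 2000.0]])
price_sqm = np.array([1000.0, 1200.0, 900.0, 1500.0])
schools_xy = np.array([[100.0, 0.0], [0.0, 900.0]])
substations_xy = np.array([[0.0, 400.0]])

def pairwise_dist(a, b):
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

d_houses = pairwise_dist(houses_xy, houses_xy)
np.fill_diagonal(d_houses, np.inf)  # a house is not its own neighbor
d_schools = pairwise_dist(houses_xy, schools_xy)

features = {}
for radius in (500, 1000):
    near = d_houses <= radius
    counts = near.sum(axis=1)
    features[f"houses_{radius}m"] = counts
    # Mean price of neighboring houses; NaN when the radius is empty.
    sums = (near * price_sqm[None, :]).sum(axis=1)
    features[f"avg_price_sqm_{radius}m"] = np.where(
        counts > 0, sums / np.maximum(counts, 1), np.nan
    )
    features[f"schools_{radius}m"] = (d_schools <= radius).sum(axis=1)

# Distance from each house to its closest substation.
features["dist_closest_substation"] = pairwise_dist(houses_xy, substations_xy).min(axis=1)
```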
Matrix Factorizations
  • Approach for feature extraction
  • User / Items mapping matrix
  • User - Attributes matrix
  • U × M ≈ R (the product of the two factor matrices approximates R)
  • Row and column related features
  • BOW represents each document as a large sparse vector
  • After factorization, each document is represented by a small dense vector (dimensionality reduction)
  • Matrix Factorizations
  • SVD, PCA, TruncatedSVD for sparse matrices
  • NMF (Non-Negative Matrix Factorization) - all values are zero or positive
  • NMF makes data suitable for decision trees
  • Used for Dimensionality reduction
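To make the bag-of-words point concrete, here is a small sketch with scikit-learn that turns sparse BOW counts into short dense vectors using TruncatedSVD and NMF. The four documents and the choice of two components are assumptions for the example only.

```python
from sklearn.decomposition import NMF, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration.
docs = [
    "cheap house near school",
    "house with parking near supermarket",
    "ad price and ad position on page",
    "user visited page with cheap ad",
]

# Bag of words: one large sparse count vector per document.
bow = CountVectorizer().fit_transform(docs)

# TruncatedSVD works directly on sparse input and returns small dense vectors.
svd = TruncatedSVD(n_components=2, random_state=0)
docs_svd = svd.fit_transform(bow)

# NMF needs non-negative input (counts qualify) and yields non-negative factors.
nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
docs_nmf = nmf.fit_transform(bow)  # document factors, shape (n_docs, 2)
terms_nmf = nmf.components_        # term factors, shape (2, n_terms)
```

Here `docs_nmf @ terms_nmf` approximates the original count matrix, and the rows of `docs_svd` or `docs_nmf` can be used as extra features for each document.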
Example Code
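import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components=5)  # assumed value; x_train and x_test are your feature arrays
x_all = np.concatenate([x_train, x_test])  # fit on train and test together
pca.fit(x_all)
x_train_pca = pca.transform(x_train)
x_test_pca = pca.transform(x_test)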

Happy Learning!!!
