"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

September 01, 2017

Exploring Analytics in Microsoft Azure

I am working on Microsoft Azure platform on a BI cloud solution. Some of the key components I worked recently are
  • Azure Data Factory
  • Azure Data Lake
  • Azure SQL Data warehouse
  • Power BI on top of Data warehouse for reporting
I had earlier compared different stacks Microsoft / Google / Amazon.

The high level workflow for cloud based BI Solution and key components are

Step #1 - Moving Data from In premises to Cloud
Here data management gateway is installed on the in-premises machines, Pipelines are created in Azure Data factory to move data from In-premises to Azure Data lake

Step #2 - Azure Data Factory
ADF provides platform for data ingestion, Consuming high volumes of data. This experience setting up pipelines has some similarities and differences compared to SSIS. The key differences are
  • Everything is JSON based
  • Setting up Connections
  • Defining input and output data formats in datsets
  • Input and Output datasets also define the storage locations
  • Defining Pipeline logic which includes, logic, input, output datasets, scheduling for pipeline
  • This is bit straight forward but there is some learning with the tool, configuration properties
Step #3 - Azure Data lake
Azure Datalake is for storing data (RDBMS / No-RDBMS) data, If we have to integrate data from MSSQL, MYSQL for a realtime processing from two sources, We can leverage data lake to store and consolidate it later. The data stored in Datalake are referenced as external tables in AZURE Sql Datawarehouse

Step #4 - Web Application
All the references of data movement from Datalake and connectivity to Datawarehouse is managed by Access control leveraged with a Azure web app. The security aspect is well managed in Azure infrastructure

Step #5 - Data Consolidation into SQL Datawarehouse
The external tables referenced in Datalake can be referenced, queried in TSQL format and data loaded in Azure Datawarehouse tables. This is the location of fact and dimension tables that would power our datawarehouse. This could be done by stored procedures.

Step #6 - Power BI reporting
We have completed Data loading, data consolidation. The next is Power BI. PowerBI has the most power offering for web / mobile platforms. This is convenient and easy to use. The extended Analytics / R Support / Machine Library support also makes it suitable to run both Business Intelligence / Machine Learning solutions.

Security aspects of this architecture is well handled with Firewall, IAM access as needed. This seems very stable even some of the components are constantly updated. This is high level architecture explanation, We will look into To-do exercises in coming weeks.

Happy Learning!!!

July 19, 2017

Day #70 - Machine Learning - Deep Learning Fundamentals

Picture is worth 1000 words, Few examples listed in the book are very precise, clear on Machine Learning fundamentals. Below are few of the images on Machine learning / Deep Learning Concepts

Figure #1

  • How machine learning, AL and Deep Learning are inter-related, The subset representation clearly represents the knowledge boundaries
  • Deep Learning frameworks allow developers to iterate quickly, Making algos accessible to practitioners. Deep learning frameworks help to scale machine learning code for millions of users
  • Its important to note fundamentals of Machine Learning is important to work with Deep Learning

Figure #2

  • In Machine learning, historical data is used to derive learning's / rules from it and apply it for future data predictions
  • From the data we need to identify (relevant features / variables), In this process we use different techniques like PCA, Correlation techniques, Derived features to identify relevant feature attributes for model creation
  • From the vast amount of data we collect through enterprise applications / systems we need to identify / extract relevant data to build models and validate them. Setting up the data pipeline, training with required dataset becomes key for better / high accuracy models
Figure #3

  • High level perspective of Deep Learning, How the nodes are defined, weights computed
  • The loss part for each iteration is compared with predictions and sent back to perform weight updates, This iterations we call it as back propagation
  • Deep Learning term is because the network are 'deep' - multiple hidden layers involved in computation
Figure #4

  • SVM Wide street approach, line that separates two classes
  • Allow non-linear decision boundaries
  • Each dimension represents feature
  • Goal of SVN - Train a model that assigns unseen objects into particular category
  • Advantage - High Dimensionality, Memory Efficiency, Versatility
Happy Learning!!!

May 16, 2017

Day #69 - TSQL Skills for Data Pipeline and Cleanup Work

Pivot is a key thing when it comes to data preparation tasks, MSSQL pivot without aggregation does need a bit of workaround. Two things we will see in this post

Learning #1 - Script for Insert Data generation from MSSQL tables using SSMS (Hidden Gem in MSSQL)

Step 1 -  Database -> Tasks -> Generate Scripts

Step 2 -  Generate the Database objects (Tables as needed)

Step 3 -  Specify Save to Location, Data only option. After you specify options next step script runs and generates insert statement as needed.

Learning #2- Pivot for Data Preparation scenario
For a given scenario of customer/orders, Pivoting the data for next level of tasks

Happy Learning!!!

May 14, 2017

Weekend Seminar - Deep learning in production at Facebook

Good Talk - Deep learning in production at Facebook https://lnkd.in/fX7BZif

Notes from Session
Deep Learning Use Cases
  • Event Prediction - Listing top relevant stories for the user, predicting relevance - Approach - Logistics regression + Deep Neural Networks
  • Machine Translation - Automatically machine translated posts generated for users - Approach - Encoder - Decoder Architecture, Using RNN
  • Natural Language Processing - Understand Context of text - Deep Text - Approach - CNN for words + RNN for sequences
  • Computer Vision - Understand pics - Approach - CNN @ massive scale. Understand different aspects of pictures - Classification, Detection, Segmentation 
Scaling the models
  • Computer faster - Tweaks in FFT, TiledFFT, Winograd to reduce convolution computations, NNPack, CuDNN for CPUs
  • Memory Usage - GPU + Activations Memory released and reallocated during different layers of processing in Deep Networks
  • Compress models - Exploit redundancy in model designs, prune them
Good Insights!!!

Kaggle Vs Enterprise Machine Learning Adoption - Two sides of coin

Reposting Summary from Quora Answer with my perspective added

What you don't learn in Kaggle Competitions
  • Determining business problem to solve with data
  • Real world data imbalance, Accuracy issues, Maintaining Models
  • Miss the challenges of data engineering (What features to select, causational vs correlation in domain context) 
What you learn by experimenting real world data science applications in Production
  • Identifying / Reusing Existing data for first level models 
  • Identifying pipelines to build for more relevant variables
  • ETL / Data Consolidation / Aggregation, Eliminating outliers / Redundant Data
Today's systems have enough Transactional Reporting  / BI Reports in place. The challenge is evolving from the current system, leveraging current data, build a basic model, slowly build pipelines and extend other machine learning use cases.

Happy Learning!!!

April 29, 2017

Day #68 - CNN / RNN and Language Modelling Notes

At the end of every class, I have a feeling there is a lot to learn. People in the industry know things only at the application level. The depth of topics, mathematics discussed in class is very extensive. I always have a feeling of guilt "need to learn more". Every learning needs the breakpoint to correlate/understand end to end, to see the concept in a more familiar perspective. Always Keep Learning and Keep growing.

CNN Notes
  • In a CNN lower layers learn generic features like edges, shapes and feed it to higher layers
  • Earlier layer - Generic features
  • Later layer - Features specific to the corresponding problem
  • For any related problems we can leverage existing network VGG16, VGG19, Alexnet and modify the higher layers based on our need
  • Relu only passed those in Activation function where its > 0
  • Vanishing gradient problem - Weights will stagnate over a period of time
  • 6E/6W - Gradient Error with respect to weights
  • 6E/6I - Gradient Error with respect to Image
  • Main things is weights same across RNN
  • Weights between successive layers same
  • Document Classification, Data Generation, Chatbot, Time series - RNN can be used
LSTM - Long short term memory

Topics from Language Modelling class

Happy Learning!!!

April 28, 2017

Day #67 - Exploring Tableau Visualization

Canadian Car sales data visualization examples. The interpretation varies based on representation presented below. The data has all the details. Exploring same data in different visualization perspective will provide a different interpretation of same data.

Visualization #1 - This representation would help us figure out which month has usually high sales numbers

  • Three months of year (Dec-Jan-Feb) has relatively weak sales figures compared to rest of year
  • March-August trend shows good demand from customers resulting in increased sales
  • Last few months of the year shows decreased demand. This could be seasonal factor/holidays/travel. This need to be validated
Visualization #2 - Consolidated snapshot of comparison of yearly performance of sales numbers, Across several years and across all months (This one is a good big picture)

  • January is the lowest period of sales
  • Sales trend is increasing YOY (year over year)
  • May month consistently tops high sales for many years
The data format looks like below in Visualization #1

Visualization #3 - Data in simple table format

  • Six years total sales data is represented
  • Partial data is available for the year 2016
Happy Learning!!!

April 27, 2017

Day #66 - Maths behind backpropagation

Today it's mathematical learning for neural network fundamentals.
  • In Neural Network, Network forward propagates activation to produce output and it back propagates error to determine weight changes
  • Partial Derivative - Derivative of one of the variables holding the rest constant
  • Backpropagation uses gradient descent method, one needs to calculate the derivative of squared error function with respect to the weights of the network.
Happy Learning!!!

April 26, 2017

Keep Learning - Good Motivation Note

Interesting Slide from presentation - Dev @ 40

Happy Learning!!!

April 23, 2017

Smart Farming

Product #1 - Automated Farming + Design Layout + Soil Monitoring + Solar powered = "Smart Farming"

Product #2 - Counting Fruits + Finding Weeds + Cattle monitoring

Happy Farming!!!

April 20, 2017

Data Science - Find your Winning use case

I observe a lot of technologies discussed in Data Science roles. It covers Big Data, Open Source, and Commercial Tools, R, Python, MapR, Spark, Azure, Various cloud providers etc...

"Identifying relevant domain/product related use case that helps improve business/numbers is the key"

This LinkedIn post provides a great clarity on focus on relevant use cases, small wins, and scale success.

Happy Analytics!!!

April 17, 2017

Day #65 - Python Package Installation commands - Windows

Had an issue running a code, Tried different options, Uninstalling existing version of keras and reinstalling it worked. Bookmarking commands

Happy Learning!!!

April 13, 2017

Day #64 - ETL for Data and Delta Data Management

Custom SSIS example sample for ETL setup for Data Extraction and Update

  • Two Databases (Source and Target)
  • Example with Test Table with few columns
  • Ability to get New Data
  • Ability get Delta Data (Updates)
Step in SSIS Project

Step 1 - Create a Data Flow Task

Step 2 - Add connection managers for Source and Target Databases

Step 3 - The operators and layout is (Source Data -> Lookup in Target Database -> Insert / Update TargetDatabase)

Step 4 - OLEDB Data Source Settings

Step 5 -  Lookup to map for data

Step 6 - Lookup Mapping

Step 7 - Match Non-Matching for Insert / Updates

Step 8 - Match Destination Settings

Step 9 - Non Match Update Query

Step 10 - Non Match Update Params

Reference table script

CREATE TABLE [dbo].[Table_1](
[Col1] [int] NULL,
[Col2] [int] NULL,
[Col3] [int] NULL

Happy Learning!!!!

April 08, 2017

Day #63 - Notes from Text processing and Parallel Programming

Quick Summary notes for future reference

Text Processing - Word Sense Disambiguation
  • Rely on leveraging wordnet (Knowledge sources)
  • from nltk.corpus import wordnet - leverage it
  • Leverage Machine readable dictionary
Lesks Algorithm
  • Sense bag (ambigious word)
  • Context bag (different definitions to context word)
  • Close match will be picked
Walkers Algorithm for word sense disambiguation
  • Use Thesaurus to find scores in context
  • Highest score will be picked up for context relevance
  • Thesaurus Library pywordnet, now part of NLTK
  • Polysemy - many possible meanings for a word or phrase.
  • Homonym - same spelling or pronunciation but different meanings
Parallel Programming
  • Filter locks
  • Bakery Algorithm
Example Implementation - link

Memory Consistency
  • Strict Consistency 
  • Sequentially consistent
  • Relaxed(Weak) consistent
Linearization Point
From Stackoverflow

Coarse Grained Vs Fine Grained
From Stackoverflow

Petersons Algorithm

More Reads - Link

Happy Learning!!!

April 07, 2017

TSQL Code formatting tool

Free tool for TSQL code formatting. Added to SSMS

Happy Formatting!!!

April 06, 2017

April 02, 2017

Fundamentals Again - Day #61 - Hypothesis Testing

  • Alternative Hypothesis - There is difference between groups
  • Null Hypothesis - There is no difference between groups
  • Binomial distribution - Two possible outputs
  • Sampling distributions, Mode, Median, Mean, Variability in distribution (Standard Deviation), Chi Square Distribution 
  • Conduct T-Test, Check the P-value to know Significance
Ref - Coursera

Happy Learning!!!

March 31, 2017

Day #60 - TSQL Profiling - Expressprofiler

Way better and Less complicated than SQL profiler
  • Profiler by DB Name
  • Profile by login account name
These two options are good enough to nail down most of issues. For blocking / deadlock we can hop on to Profile. Basic Checks this tool beats the need

Link - Download

Jasper Report passing parameter between datasets

Happy Learning!!!