
ANALYTIKOS

AN ANALYTICS PRIMER TO ACE SUMMER INTERVIEWS

Contents: Basics of Analytics • Analytics in Practice • Analytical Tools: Excel & SQL

WHAT IS ANALYTICS?
Analytics is the process of discovering, interpreting, and communicating
significant patterns in data. Quite simply, analytics helps us see insights and
meaningful data that we might not otherwise detect. Business analytics focuses
on using insights derived from data to make more informed decisions that will
help businesses increase sales, reduce costs, and make business improvements.

There are 4 broad types of Analytics:


• Descriptive Analytics answers the question “What happened?”. This simple
form of analytics uses basic math, such as averages and percent changes, to
show what has already happened in a business. Descriptive analytics, also
called traditional business intelligence (BI), is the first step in the analytics
process, creating a jumping-off point for further investigation.
• Diagnostic Analytics answers the question “Why did something happen?”. It
takes descriptive analytics a step further, using techniques such as data
discovery, drill-down, and correlations to dive deeper into data and identify the
root causes of events and behaviors.
• Predictive Analytics answers the question “What is likely to happen in the
future?”. This branch of advanced analytics uses findings from descriptive and
diagnostic analytics – along with sophisticated predictive modeling, machine
learning, and deep learning techniques – to predict what will happen next.
• Prescriptive Analytics answers the question “What action should we take?”.
This state-of-the-art type of analytics builds on findings from descriptive,
diagnostic, and predictive analytics and uses highly advanced tools and
techniques to assess the consequences of possible decisions and determine
the best course of action in a scenario.

Source: Link • Link



ANALYTICS IN FINANCE
A top consumer bank in Asia enjoyed a significant market share but lagged its
competitors in products per customer. It used advanced analytics to explore
several big data sets: customer demographics and key characteristics, products
held, credit-card statements, transaction and point-of-sale data, online and mobile
transfers and payments, and credit-bureau data. The bank discovered
unsuspected similarities that allowed it to define 15,000 microsegments in its
customer base. It then built a next-product-to-buy model that tripled the
likelihood of purchase. This is one of many examples of how analytics has
changed the way the finance industry functions.
Analytics helped modernize financial processes and information standards and
put core finance data in one place, freeing companies from many legacy
inefficiencies. It also helped build machine learning and artificial intelligence
models for businesses using practical drivers, supporting better strategic
decisions and fairer resource allocation. Portfolio management is one such
activity, where ML models actively manage funds and generate returns that
outperform the index. Advanced statistical and visualization methods such as
cognitive frameworks, interactivity, and storytelling helped companies avoid
unpleasant surprises and incorporate risk-mitigating strategies into their
day-to-day functioning. The past decade has also seen the growth of
blockchain-based financial applications, enabling more open, inclusive, and
secure business networks, shared operating models, reduced costs, and new
products and services. JP Morgan is an early adopter of this technology: it
created JPM Coin to facilitate real-time value movement and solve the hurdles
of cross-border transactions.
Financial specialists often work with structured, semi-structured, and
unstructured data. Data science tools such as Apache Spark, SAS, and
scikit-learn are used to process and handle these data sets.
This data helps firms gain customer insights and understand or predict customer
behaviour. Meaningful insights can be generated using tools such as text
analytics, data mining, and Natural Language Processing (NLP). A thorough
analysis of customer data is conducted using machine learning algorithms to
analyze changes and trends in financial markets and values. Based on these
customer insights, services can be delivered in a personalized way, and a
personalized relationship with customers is built through increased engagement.
A simple application is suggesting schemes or funds to a customer based on
behavioral patterns.
These were some of the many uses of big data & analytics in finance.

ANALYTICS IN
OPERATIONS
Supply chain analytics uses data and quantitative methods to improve decision
making across the supply chain. It first expands the data set available for
analysis and then applies statistical tools to draw meaningful insights. Several
use cases are described below:

Sales, Inventory & Operations Planning


Planning is a data-driven process built on existing systems such as Enterprise
Resource Planning (ERP) and Supply Chain Management (SCM) planning tools.
Big data gives increased visibility across the various nodes of the supply chain.
Retailers and manufacturers have wider access to point-of-sale data, inventory
data, and supply data. These can be analyzed in real time to identify
supply-demand mismatches and then adjust prices or inventory (think of
supply-demand curves in economics) to increase ROI.

Sourcing
Manufacturers use analytics to deal with two broad issues. Supply risk is
modeled through predictive risk management, incorporating the necessary
internal and external factors. The cost of supply is handled by building 'clean
cost sheets' that declutter vast pricing information and give enhanced bargaining
power.

Manufacturing
Machine breakdowns are costly and recurrent in the manufacturing industry.
Predictive maintenance helps operators monitor the health of a machine,
reducing the risk of breakdowns. Another unavoidable cost is rework/reject: the
main problem is that identifying the cause of a defect is very time-consuming,
which leads to a line of defective items in the interim. Collecting data across the
assembly line and running automatic diagnostics (checking the dimensions of
the item at multiple 3D points) on the defective item helps identify both the
defect and its root cause quickly.

Warehousing
Long shipping times lead to high inventories, which has driven an increase in
both the size and number of warehouses. Reducing shipping times reduces
warehousing costs manifold. Analytics (using data from sensors) allows
businesses to track shipments and better manage warehouses.

Transportation
Data Analytics helps track fleet performance and suggest maintenance schedules
by gathering data from on-fleet sensors to enable cost savings and better fleet
management.
Point of Sale
Shelf space management and Product placement are the two key drivers in a
retail store. Managing the shelf space & bundling relevant products will lead to a
significant rise in the store's revenue. Chaotic sales data can be mined to decide
which products to place where and which products should be put together.
Detecting stockouts is another major concern, as manual inspection of inventory
levels is often costly. Data analytics tools monitor the sales pattern and alert the
inventory department when a high-selling item suddenly stops appearing at the
POS; the inventory team then checks whether that item is out of stock on the
shelf. Analytics also identifies high-sale products during seasonal or erratic
demand spikes and recommends placing them in strategic locations. This leads
to better supply/demand management and thereby an increase in revenues.

ANALYTICS IN SALES & MARKETING
Marketing analytics comprises the processes and technologies that enable
marketers to evaluate the success of their marketing initiatives. It considers all
marketing efforts across all channels over a span of time – which is essential for
sound decision making and effective, efficient program execution.
On the sales front, well-designed analytics programs deliver significant top-line
and margin growth by guiding sales teams to better decisions.
Some specific use cases here are: Designing targeted promotional campaigns,
Lead generation, lead scoring, coverage planning & field productivity for
salespeople, product pipeline management, demand forecasting, reducing
customer churn, setting up cross-selling & up-selling opportunities, dynamic
pricing, dynamic deal scoring and A/B price testing.
FMCG firms have their own data analytics divisions, which take the help of
external data provisioned to them by market research organizations.
Nielsen captures POS sales data (through retailers), using metrics like secondary
sales volume in currency, sales quantity, average display price, and average
promotional price, plus distribution and coverage metrics like numeric
distribution and weighted distribution.
Kantar IMRB captures household penetration data, which is very useful in
understanding a brand's adoption in a particular geography.
Millward Brown captures data on brand equity, spontaneous awareness,
top-of-mind awareness, and anything to do with the intangible impact of a brand
on consumers, which can affect sales as well.

Some real-world applications are:


Sales forecasting and planning – Managers take decisions using results from
advanced algorithms like Time Series Forecasting and Bayesian Regression, run
on the secondary sales variable as the dependent metric.
Reporting – Analytics divisions use visualization tools like Tableau (very
popular), Power BI, Spotfire and Qlikview to derive insights from large volumes of
data and make real-time reports.
Text Analytics – Customer feedback is important in understanding the place a
brand or a service holds in consumers’ minds. Sentiment analysis is rising in
popularity and usefulness, be it using text or image.
Investment planning – Brand managers take decisions on which products and
leads to invest in to gain more returns.
Web Marketing Analytics – This is a very important sector in digital marketing.
Web traffic measurement, keyword analysis, and insights into consumer
preferences and trends are some facets of this emerging area.

Source: Link • Link



ANALYTICS IN PRODUCT
MANAGEMENT
Product analytics is the process of analyzing how users engage with a product or
service. It enables product teams to track, visualize, and analyze user
engagement and behaviour data. Teams use this data to improve and optimize a
product or service. Product Analytics helps in understanding engagement and
retention while Marketing Analytics helps in understanding traffic and acquisition.
To get a quantitative understanding of what users are doing with the product, the
first step is instrumenting it with product analytics. The idea is to fire an event for
every action a user can take in the product, giving an aggregated view of how
many users use a feature and how often they use it. For example, if a PM wants
to track the number of times a user clicks a specific button, he/she might fire an
event called "big-red-button.click". From there a PM can see which features need
work and which are most important, and use that information to prioritize
changes.
User actions are commonly called Events. Events include clicks, slides, gestures
(for mobiles), play commands (for audio and video), downloads, page loads, and
text field fills. The event includes the type of element, the name of the element,
and the action the user took. Generic examples of events include Create Account,
Add to List, Submit Feedback, Share Dashboard, Select Option, Play Tutorial,
Change View, and Complete Onboarding.
Event properties capture the specific attributes of the tracked interactions, and
give PMs the context that distinguishes activity from impact when analyzed
longitudinally. Event properties can include
details like time, duration, count, device, software version, geography, user
demographic, account firmographic, element characteristics (like colour, size,
shape), Boolean (like login: yes/no), and custom attributes (like
basic/pro/enterprise).
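
As an illustration, here is a minimal Python sketch of what firing such an event with properties might look like. The function, event name, and fields are hypothetical, not any specific analytics vendor's API:

import json
import time

def track(event_name: str, properties: dict) -> None:
    # In a real product this would be sent to an analytics backend;
    # here we just build and print the payload.
    payload = {"event": event_name, "timestamp": time.time(), "properties": properties}
    print(json.dumps(payload))

track("big-red-button.click", {
    "device": "mobile",
    "software_version": "2.4.1",
    "plan": "pro",          # custom attribute (basic/pro/enterprise)
    "logged_in": True,      # Boolean property
})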
There are 3 underlying units of measurement that help in quantifying product
analytics data: Goals, Key Performance Indicators (KPIs), and Metrics. Goals are
a company’s highest-level priority, such as driving revenue; KPIs measure
progress toward goals; and metrics measure progress toward KPIs.
Product management metrics are usually divided into categories based on what
part of the user lifecycle they’re intended to measure:
• Engagement: How frequently do users engage with the product? This could
be the number of key actions taken or minutes of video watched.
• Retention: What proportion of users come back to use the product? This is
usually measured across 7-, 30-, or 90-day periods.
• Activation: How quickly do users reach a point that helps them realize value?
This could be making a first purchase or listening to three podcasts.
• Acquisition: How broad of a user base does the product have over a recent
time period? This can be the number of paid accounts in the past year, or the
number of users who made a purchase in the past three months.
• Monetization: How many, or how often, are transactions made? How well is
usage of the product translating to revenue for the business? Transaction
metrics can cover in-app purchases to revenue and ad conversions.
• Business-specific: Metrics specific to the business model. For example, an e-
commerce company might want to track average order value.

Various types of analysis are possible through Product Analytics:


• Segmentation: In-depth analysis of data done by dividing users by the
characteristics they share, such as behaviour, signup date, or marketing
source. Data-driven product teams can compare the metrics and KPIs of
different groups and draw distinctions between them. For instance, users who
came from a particular marketing source might be 3x more valuable than the
average user.
• Cohort analysis: A cohort is a segment of users, usually grouped based on
their behavioral or demographic attributes, that has been named and saved
for future comparison. For example, a news site might save two cohorts: one
for its paying subscribers and one for its free visitors. Each has different
behaviors and interests, and the team can cater to each without offending the
other. There are two types of cohorts: absolute and relative. Absolute cohorts
track a fixed group of users, such as those who signed up during the week of
a particular conference. Relative cohorts track a shifting group of users, such
as those who signed up within the past 30 days. Tracking the growth of key
relative cohorts over time is a great way to measure product health.
• Retention analysis: Retention analysis measures how well digital products
keep users coming back. Every company measures retention differently, but
it’s typically tied to repeat actions. A free social media app might define
retention as any user coming back to like a piece of content within seven
days. An enterprise security software might define it as a user renewing their
subscription after one year.
• Funnel analysis: Funnels measure a series of steps users take toward a
desired outcome such as a purchase. Funnels help reveal the health of
processes like onboarding and show where users drop off or get lost.

Source: Link • Link • Link • Link



ANALYTICS IN
CONSULTING
The word Analytics is now ubiquitous. Whether it is IT, BFSI, Manufacturing,
Supply Chain or the Automotive industry, data analytics has spread its wings in
every industry. Two things have led to this unprecedented rise in the usage of
data analytics: the amount and variety of data generated in the world today, and
the technological advancements in Big Data and Artificial Intelligence for
capturing and analyzing this data.
By leveraging this technological disruption, consulting firms can help clients
identify and capture the most value and meaningful insights from their data and
turn them into competitive advantages. Analytics can help companies eliminate
strategies that will not work, or identify a company's strengths and weaknesses
and build a strategy around them. The big three management consultancies
have started placing a lot of emphasis on using data analytics to create "Change
that Matters". Firms can also advise clients on leveraging analytics to enhance
their businesses. The bottom line here is that in the era of digital disruption,
analytics is essential for survival.
However, consultants should also learn new skill sets to stay ahead of this
disruption. Tools such as Excel, Tableau, and Microsoft Power BI allow instant
analysis of big data and help generate reports and dashboards. These tools help
discover hidden business insights; consultants can use these reports to make a
preliminary analysis of a client's problems, then delve deeper into key pain
points and strategize solutions around them.
Analytics can give consulting firms an edge by accomplishing the following:
Reduce bias in decisions and provide clarity: By using data analytics, consultants
can add a data-oriented view to their strategy. This helps in evaluating the odds
of a strategy's success before allocating resources to it. Data-driven decision
making provides clarity, helps in making better decisions, and yields fewer errors
and more positive results.
Extract new growth opportunities: Analytics can also enhance strategic planning
by unearthing growth opportunities that would otherwise be hard to spot, be they
attractive industry segments and acquisition targets, ideas for new products or
services, or even new applications for existing offerings. With NLP (Natural
Language Processing) gaining traction, parsing and extracting patterns from
unstructured and structured textual data has become effective, helping firms
extract hidden insights and pursue a broader range of growth opportunities. Text
analytics solutions, chatbot solutions, and sentiment and intent analysis are
some examples where firms can use NLP to stand out.
Identify early-stage trends: Machine learning can analyze, in real time, publicly
available data spanning millions of web pages, social media content, news
sources, patent filings, and more. Discovering patterns from these data sources
can help businesses identify emerging trends. For example, suppose Ola
Electric is trying to decide which electric-vehicle battery technology to invest in.
This is not a minor decision: the capital expenditure alone would run into billions
of dollars, and it would also lock the manufacturer into a specific technology for
many years. Senior management in this company would benefit from knowing
how associated trends are evolving and when a specific technology is likely to
have a clear advantage. They could gain these insights through near real-time
tracking of patent and academic-publication momentum, announcements, and
investments across different technologies. They could also track government
regulatory changes, such as zero-emission-vehicle mandates that stimulate
demand for electric vehicles, or local ownership rules in countries where lithium
supply is concentrated. By using analytics to track emerging trends, consultancy
firms can help their clients make smart decisions before their competitors do.

Source: Link • Link



BASICS OF STATISTICS
SAMPLING
Sampling is the process of choosing a small number of observations from a large
dataset. The biggest challenge is to choose a subset that accurately represents
the original dataset with limited datapoints.

PROBABILITY BASED SAMPLING

One of the most common families of sampling techniques used in everyday life,
where all elements supposedly have an equal chance of occurrence. Since every
element is equally likely, the sample is an ideal representation of the population.

Simple Random Sampling - All elements of the sample are randomly selected,
preserving the quality of the data.
Example – Choosing a number from 1-10

Stratified Random Sampling - We divide the population into subgroups, or strata,
and then members from each subgroup are selected randomly. This technique is
used when the population is not homogeneous.
Example – Age, socioeconomic divisions, nationality, educational achievements
and other such classifications.

Systematic Sampling - Members occurring after a fixed interval are selected. The
member occurring after the fixed interval is known as the Kth element.
Example – Selecting every 100th person to ask about consumer preferences
Cluster Sampling - We divide the population into random clusters and then pick
elements from them. The difference between cluster and stratified random
sampling is that in stratified random sampling the population is divided based on
shared characteristics, while in cluster sampling we divide the population
randomly and pick samples from each cluster.
Example – An NGO creating a sample of girls from 5 neighboring districts to
provide education
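
A minimal Python sketch of the probability-based techniques above, assuming a hypothetical pandas DataFrame with a 'district' column; all names and sizes are illustrative:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"member_id": range(1000),
                   "district": rng.choice(["A", "B", "C", "D", "E"], size=1000)})

simple = df.sample(n=50, random_state=0)              # simple random: every member equally likely
stratified = df.groupby("district", group_keys=False).sample(frac=0.05, random_state=0)  # sample within each stratum
systematic = df.iloc[::100]                           # systematic: every 100th member (k = 100)
picked = rng.choice(df["district"].unique(), size=2, replace=False)
cluster = df[df["district"].isin(picked)]             # cluster: randomly pick whole districts, take their members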

NON–PROBABILITY BASED SAMPLING

Non-probability sampling is a type of sampling where each member of the
population has an unknown probability of being selected in the sample. It is
adopted when each member of the population cannot be selected, or when the
researcher deliberately wants to choose members selectively.

Purposive Sampling - Members for the sample are selected according to the
purpose of the study. For example, if a researcher wants to study the impact of
drug abuse on health, not every member of society is a suitable respondent:
only drug addicts have undergone the impacts of drug abuse on their health and
can provide real data for the study. Hence, the researcher deliberately selects
only drug addicts as respondents.

Convenience Sampling - A type of sampling where the members of the sample
are selected based on their convenient accessibility. For example, a researcher
may visit a college or a university and get questionnaires filled in by volunteer
students. Similarly, a researcher may stand in a market and interview volunteers.

Snow-ball Sampling - Also called chain sampling, this is a type of sampling where
one respondent identifies other respondents (from among friends or relatives) for
the study. Snow-ball sampling is adopted in situations where it is difficult to
identify the members of the sample.
Quota Sampling - Members are selected according to some specific
characteristics chosen by the researcher; these characteristics serve as a quota
for selecting the members of the sample.

HYPOTHESIS TESTING
Hypothesis testing is an elemental form of statistical testing used to compare the
characteristics of 2 or more samples. It begins with a Null Hypothesis: the
assumption that 2 features in 2 samples are not statistically different. We
associate a tolerance level with this line of argument, called Alpha. Alpha is the
probability of incorrectly rejecting the null hypothesis, i.e., rejecting it even
though it is true. Using a statistical test (discussed below) we calculate a
p-value, which is the probability of observing data at least as extreme as ours if
the null hypothesis were true. If the p-value is less than alpha, we reject the null
hypothesis and conclude that the feature in our sample is statistically different: a
result this extreme would be too rare under the null hypothesis to attribute to
chance, so it reflects a real trend in the data.

TESTING METHODS

• Z Test: The most basic and common type of test. It is used to compare a
similar feature in samples from two different populations. E.g.: Comparing the
effect of a new drug on the BMR of young and old people.
• T Test: The most common test used in clinical trials. It is used to detect a
statistical difference between samples from the same population under the
effect of an external agent. One sample is the control (no effect of the agent)
and the other is the experiment (with the effect of the agent); the test thus
compares before-and-after effects of the agent. E.g.: Effect of chamomile tea
on youth BMR.
• ANOVA – Analysis of Variance: ANOVA helps to compare more than 2
samples of the same population (which is the limitation of the t Test).
• Chi Square Test: The statistical test for qualitative data. It is used to compare
the dependence of 2 features on each other. E.g.: The concurrence of male
gender and taller height in children of age 5.
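
A minimal sketch of a paired t test in Python with SciPy, assuming alpha = 0.05; the BMR readings are made-up placeholder numbers:

import numpy as np
from scipy import stats

before = np.array([68, 72, 75, 70, 69, 74, 71, 73])   # control: readings before the agent
after  = np.array([66, 70, 74, 67, 68, 72, 69, 70])   # experiment: readings after the agent

t_stat, p_value = stats.ttest_rel(before, after)        # paired t-test on before/after samples
alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis")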
CORRELATION

Correlation measures the co-occurrence of 2 events or variables. If 2 variables
demonstrate similar trends in the data, they are said to be highly correlated.
However, correlation does not indicate causality: just because 2 variables are
very highly correlated does not mean that one causes the other.
Correlation can be of 2 types, positive and negative. A high positive correlation
means the two variables exhibit high values simultaneously. A high negative
correlation means the variables move in opposite directions: when one is high,
the other is low.

BASIC TYPES

• Pearson Correlation: The simplest correlation. It assumes a linear relation
between the two variables and calculates the correlation value. However, this
often yields inaccurate results, as it considers only linear relations.
• Spearman Correlation: This measures the monotonic relation between two
variables; unlike Pearson correlation, it does not assume a linear relation. It is
computed using rank correlation: the variables are ranked by value, and the
pairwise correlation is calculated on these assigned ranks.
• Distance Correlation: Here the pairwise distances between the datapoints of
each variable are computed, and the correlation is calculated on these
distance matrices. This captures non-linear, non-monotonic dependence and
can give more accurate results than Spearman in such cases.
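
A minimal sketch comparing the measures in Python with SciPy on a monotonic but non-linear relation; the dcor package mentioned in the comment is a third-party library for distance correlation, assumed to be installed separately:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = x ** 3 + rng.normal(0, 25, 100)          # monotonic but strongly non-linear

print(stats.pearsonr(x, y))                  # assumes linearity, so it understates the relation
print(stats.spearmanr(x, y))                 # rank-based, so it is close to 1 for monotonic data
# Distance correlation is available in the (assumed) third-party package 'dcor':
# import dcor; print(dcor.distance_correlation(x, y))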

SUPERVISED & UNSUPERVISED LEARNING
SUPERVISED LEARNING

Supervised learning is a machine learning approach that's defined by its use of
labeled datasets. These datasets are designed to train algorithms to classify
data or predict outcomes accurately. Using labeled inputs and outputs, the
model can measure its accuracy and learn over time.

TYPES OF PROBLEMS

• Classification problems use an algorithm to accurately assign test data into
specific categories, such as separating apples from oranges. Or, in the real
world, supervised learning algorithms can be used to classify spam into a
separate folder from your inbox. Linear classifiers, support vector machines,
decision trees, and random forests are all common types of classification
algorithms.

• Regression is another type of supervised learning method that uses an
algorithm to understand the relationship between dependent and independent
variables. Regression models are helpful for predicting numerical values
based on different data points, such as sales revenue projections for a given
business. Some popular regression algorithms are linear regression, logistic
regression, and polynomial regression.
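
A minimal scikit-learn sketch of both problem types on synthetic data; the dataset sizes and model choices are illustrative:

from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Classification: predict a discrete class label
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print("classification accuracy:", clf.score(X_te, y_te))

# Regression: predict a numeric value
X, y = make_regression(n_samples=200, n_features=4, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_tr, y_tr)
print("regression R^2:", reg.score(X_te, y_te))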

UNSUPERVISED LEARNING

Unsupervised learning uses machine learning algorithms to analyze and cluster
unlabeled data sets. These algorithms discover hidden patterns in data without
the need for human intervention.
TYPES OF TASKS HANDLED

• Clustering is a data mining technique for grouping unlabeled data based on
their similarities or differences. For example, K-means clustering algorithms
assign similar data points into groups, where the K value represents the size
of the grouping and granularity. This technique is helpful for market
segmentation, image compression, etc.
• Association is another type of unsupervised learning method that uses
different rules to find relationships between variables in a given dataset.
These methods are frequently used for market basket analysis and
recommendation engines, along the lines of “Customers Who Bought This
Item Also Bought” recommendations.
• Dimensionality reduction is a learning technique used when the number of
features (or dimensions) in a given dataset is too high. It reduces the number
of data inputs to a manageable size while also preserving the data integrity.
Often, this technique is used in the preprocessing data stage, such as when
autoencoders remove noise from visual data to improve picture quality.
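
A minimal scikit-learn sketch of clustering and dimensionality reduction on synthetic data; association-rule mining is usually done with separate libraries (e.g., the third-party mlxtend package) and is omitted here:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=300, n_features=6, centers=4, random_state=0)  # unlabeled data
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)    # clustering: K = 4 groups
X_2d = PCA(n_components=2).fit_transform(X)                                # dimensionality reduction: 6 -> 2
print(labels[:10], X_2d.shape)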

Source: Link

DECISION TREES
Decision Trees (DTs) are a non-parametric supervised learning method used
for classification and regression. The goal is to create a model that predicts the
value of a target variable by learning simple decision rules inferred from the data
features. A tree can be seen as a piecewise constant approximation.

COMMON TERMINOLOGIES

• Root Node: Root node is from where the decision tree starts. It represents the
entire dataset, which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be
segregated further after a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into
sub-nodes according to the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted branches from the
tree.
• Parent/Child node: The root node of the tree is called the parent node, and
the other nodes are called child nodes.
• Attributes: The specific features of a data set. For a dataset of employees,
we can consider salary, working hours, rating, etc. as the attributes.

DECISION TREE PROCESS

Step–1: Begin the tree with the root node, say S, which contains the complete
dataset.
Step–2: Find the best attribute in the dataset using an Attribute Selection
Measure (ASM). Two popular ASM techniques are Information Gain and the Gini
Index.
Step–3: Divide S into subsets containing possible values for the best attribute.
Step–4: Generate the decision tree node that contains the best attribute.
Step–5: Recursively make new decision trees using the subsets of the dataset
created in Step–3. Continue this process until a stage is reached where the
nodes cannot be classified further; these final nodes are leaf nodes.
INFORMATION GAIN & GINI INDEX

Information Gain: Information gain is the measurement of the change in entropy
after segmenting a dataset based on an attribute. It calculates how much
information a feature provides about a class. We split nodes according to the
value of information gain: a decision tree algorithm always tries to maximize
information gain, and the node/attribute having the highest information gain is
split first.
Gini Index: The Gini index is a measure of impurity (or purity) used while
creating a decision tree in the CART (Classification and Regression Tree)
algorithm. An attribute with a low Gini index should be preferred over one with a
high Gini index.
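
For reference, the standard formulas behind these measures, where pᵢ is the proportion of class i in a node S and Sᵥ are the subsets created by splitting S on attribute A:

Entropy(S) = − Σᵢ pᵢ log₂ pᵢ
Information Gain(S, A) = Entropy(S) − Σᵥ (|Sᵥ| / |S|) × Entropy(Sᵥ)
Gini(S) = 1 − Σᵢ pᵢ²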

EXAMPLE

Problem Statement: We are trying to classify flowers into classes like Versicolor,
Virginica, and Setosa based on attributes like petal length, petal width, sepal
width, and sepal length.
ASM Used: Gini Index
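
A minimal scikit-learn sketch of this exact setup (iris flower data, Gini criterion); max_depth=3 is an illustrative choice:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", tree.score(X_te, y_te))
# Print the learned decision rules (the tree's splits on the four attributes)
print(export_text(tree, feature_names=["sepal length", "sepal width", "petal length", "petal width"]))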

FACTOR ANALYSIS & CLUSTERING
FACTOR ANALYSIS

Factor Analysis is the name of a class of multivariate procedures used for data
reduction and summarization. It is used to identify the underlying set of
variables, or factors, that explain the correlation among a set of variables.

(Figure: a records × variables data matrix is reduced from 'p' variables to 2 factors.)

UNDERSTANDING FACTOR ANALYSIS

Imagine three variables, X1, X2 and X3, that are correlated.

When plotted, the points for a particular pair of variables form a sort of ellipse:
the points along the major axis are the most correlated, while the points along
the minor axis are the least correlated. When plotted in 3D space, the points
show up as spheroids.

Let Z1 = W11·X1 + W12·X2 + W13·X3
Let Z2 = W21·X1 + W22·X2 + W23·X3
Let Z3 = W31·X1 + W32·X2 + W33·X3
This is the COMPONENT MATRIX. The values of the component matrix (also
known as factor loadings) are the weights that relate the variables to the
principal components.

Every factor in the component matrix has an associated eigenvalue and a
corresponding % variance of the dataset that the factor explains. The
eigenvalues essentially tell us how much of the variation in the variables (3 here)
a particular factor explains.

HOW TO SELECT FACTORS

Factors are selected to reduce the number of variables used for further
processing. The tradeoff is to minimize the information lost by reducing the
initial variables.
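
A minimal Python sketch using scikit-learn's PCA, a standard way to extract such components; the "keep eigenvalues > 1" selection rule shown (the Kaiser criterion) is one common convention, assumed here rather than taken from the original:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Three correlated variables X1, X2, X3 built from one shared underlying factor
X = np.hstack([base + rng.normal(scale=0.3, size=(200, 1)) for _ in range(3)])

Z = StandardScaler().fit_transform(X)
pca = PCA().fit(Z)
print("eigenvalues:", pca.explained_variance_)            # one eigenvalue per factor
print("% variance explained:", pca.explained_variance_ratio_)
print("component weights:", pca.components_)              # loadings are these weights scaled by sqrt(eigenvalue)
n_keep = (pca.explained_variance_ > 1).sum()              # Kaiser criterion: keep factors with eigenvalue > 1
print("factors kept:", n_keep)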
CLUSTERING
Clustering is a multivariate statistical procedure which attempts to reorganize
the dataset into relatively homogeneous groups.
There are 2 main objectives of clustering:
• Reduce the intra-group distances (improves homogeneity)
• Increase the inter-group distances (improves the distinction between groups)

MEASURING SIMILARITY – DISTANCE

2 basic distance measures are employed:
• Euclidean Distance: Square root of the sum of squared differences in values
for each variable
• City block or Manhattan Distance: Sum of absolute differences in values for
each variable

5 types of distance measures are used for assigning a point to a cluster (the
lowest measure is selected):
• Inter-Point Distance: The Euclidean distance between a benchmark point and
the point we want to classify is calculated, and clusters are formed accordingly
• Single Linkage: We consider the distance between the point we want to
classify and the closest member of each available cluster
• Complete Linkage: We consider the distance between the point we want to
classify and the farthest member of each available cluster
• Average Linkage: We consider the intragroup distance within each available
cluster, assuming the point we want to classify is part of it
• Centroid Method: We consider the distance between the point we want to
classify and the centroid of each available cluster

TYPES OF CLUSTERING

HIERARCHICAL CLUSTERING
• A clustering procedure characterized by the development of a hierarchy or
tree-like structure
• Search for the 2 points that are most similar and group them, then iteratively
merge the next most similar points
• Consists of 2 distinct methods: Agglomerative, where we assimilate from
single-point clusters up to the whole dataset, and Divisive, where we break
the whole dataset down to single-point clusters

NON–HIERARCHICAL CLUSTERING
• The number of clusters is pre-defined, and clusters are built around it
• The algorithm finds 'k' points farthest from one another and then starts
grouping other points based on their distance from these 'k' seeds (K-Means
Clustering)
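
A minimal Python sketch of both families, using SciPy for agglomerative (hierarchical) clustering and scikit-learn for K-Means; the 'average' linkage choice is illustrative and can be swapped for 'single', 'complete', or 'centroid' to match the measures above:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # two loose groups of points

# Hierarchical (agglomerative): build the tree, then cut it into 2 clusters
Z = linkage(X, method="average", metric="euclidean")
hier_labels = fcluster(Z, t=2, criterion="maxclust")

# Non-hierarchical: number of clusters k is pre-defined
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(hier_labels[:10], km_labels[:10])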

WHAT’S NEW IN EXCEL?


LET FUNCTION
The LET function assigns names to calculation results. This allows storing
intermediate calculations, values, or defined names inside a formula. To use
the LET function in Excel, you define pairs of names and associated values,
and a calculation that uses them all. You must define at least one
name/value pair (a variable), and LET supports up to 126.

=LET(name1, name_value1, calculation_or_name2, [name_value2,


calculation_or_name3...])

name1 (Required): The first name to assign. Must start with a letter. Cannot be
the output of a formula or conflict with range syntax.

name_value1 (Required): The value that is assigned to name1.

calculation_or_name2 (Required): One of the following:
• A calculation that uses all names within the LET function. This must be the
last argument in the LET function.
• A second name to assign to a second name_value. If a name is specified,
name_value2 and calculation_or_name3 become required.

name_value2 (Optional): The value that is assigned to calculation_or_name2.

calculation_or_name3 (Optional): One of the following:
• A calculation that uses all names within the LET function. The last argument
in the LET function must be a calculation.
• A third name to assign to a third name_value. If a name is specified,
name_value3 and calculation_or_name4 become required.

=LET(x,1,x+1) → Output = 2
=LET(x,1,y,1,x+y) → Output = 2

Source: Link
XLOOKUP FUNCTION
With XLOOKUP, you can look in one column for a search term and return a
result from the same row in another column, regardless of which side the
return column is on. The XLOOKUP function searches a range or an array,
and then returns the item corresponding to the first match it finds. If no
match exists, then XLOOKUP can return the closest (approximate) match.

=XLOOKUP(lookup_value, lookup_array, return_array,


[if_not_found], [match_mode], [search_mode])

lookup_value (Required): The value to search for.

lookup_array (Required): The array or range to search.

return_array (Required): The array or range to return.

[if_not_found] (Optional): Where a valid match is not found, return the text you
supply. If a valid match is not found and [if_not_found] is missing, #N/A is
returned.

[match_mode] (Optional): Specify the match type:
• 0 - Exact match. If none found, return #N/A. This is the default.
• -1 - Exact match. If none found, return the next smaller item.
• 1 - Exact match. If none found, return the next larger item.
• 2 - A wildcard match where *, ?, and ~ have special meaning.

[search_mode] (Optional): Specify the search mode to use:
• 1 - Perform a search starting at the first item. This is the default.
• -1 - Perform a reverse search starting at the last item.
• 2 - Perform a binary search that relies on lookup_array being sorted in
ascending order. If not sorted, invalid results will be returned.
• -2 - Perform a binary search that relies on lookup_array being sorted in
descending order. If not sorted, invalid results will be returned.

Source: Link

LAMBDA FUNCTION
LAMBDA function allows the user to create custom, reusable functions and
call them at will.

=LAMBDA([parameter1, parameter2, …,] calculation)

parameter: A value that you want to pass to the function, such as a cell
reference, string or number. You can enter up to 253 parameters. This argument
is optional.

calculation: The formula you want to execute and return as the result of the
function. It must be the last argument and it must return a result. This argument
is required.

=LAMBDA(x,y,x+y) saved in the Name Manager under the name mylambda
=mylambda(1,3) → Output = 4

Source: Link
There are 2 steps to creating a LAMBDA function:
• Create the LAMBDA in a cell:
=LAMBDA([parameter1, parameter2, ...], calculation)(function call)
Ex: =LAMBDA(number, number + 1)(1) gives 2 as an output
• Add the LAMBDA to the Name Manager:
Formulas > Name Manager (Windows), then select New and fill according to
the table below
Formulas > Define Name (Mac), then fill according to the table below

Name: Enter the name for the LAMBDA function.

Scope: Workbook is the default. Individual sheets are also available.

Comment: Optional, but highly recommended. Enter up to 255 characters.
Briefly describe the purpose of the function and the correct number and type of
arguments. Displays in the Insert Function dialog box and as a tooltip (along
with the Calculation argument) when you type a formula and use Formula
Autocomplete (also called Intellisense).

Refers to: Enter the LAMBDA function. Press F2 to edit the text and prevent
automatic cell reference insertion.

Source: Link
SWITCH FUNCTION
The SWITCH function evaluates one value (called the expression) against a
list of values, and returns the result corresponding to the first matching
value. If there is no match, an optional default value may be returned.

=SWITCH(Value to switch, Value to match1...[2-126], Value to


return if there's a match1...[2-126], Value to return if there's no
match)
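
For instance, an illustrative formula in the same style as the earlier examples (cell A2 is assumed to hold a date):

=SWITCH(WEEKDAY(A2), 1, "Sunday", 7, "Saturday", "Weekday") → Output = Sunday (when A2 falls on a Sunday)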

TEXTJOIN FUNCTION
The TEXTJOIN function combines the text from multiple ranges and/or
strings, and includes a delimiter you specify between each text value that
will be combined. If the delimiter is an empty text string, this function will
effectively concatenate the ranges.

TEXTJOIN(delimiter, ignore_empty, text1, [text2], …)

delimiter (Required): A text string, either empty, or one or more characters
enclosed by double quotes, or a reference to a valid text string. If a number is
supplied, it will be treated as text.

ignore_empty (Required): If TRUE, ignores empty cells.

text1 (Required): Text item to be joined. A text string, or array of strings, such
as a range of cells.

[text2, ...] (Optional): Additional text items to be joined. There can be a
maximum of 252 text arguments for the text items, including text1. Each can be
a text string, or array of strings, such as a range of cells.

=TEXTJOIN(" ",TRUE, "The", "sun", "will", "come", "up", "tomorrow.")
→ Output = The sun will come up tomorrow.

Source: Link

EXCEL SHORTCUTS
FUNCTION WINDOWS MAC
Display the Paste Special dialog box Ctrl + Alt + V ⌘+⌃+V

Display find and replace, replace selected Ctrl + H ⌃+H

Find next match Shift + F4 ⌘+G

Find previous match Ctrl + Shift + F4 ⌘+⇧+G

Create embedded chart Alt + F1 Fn + ⌥ + F1

Create chart in new worksheet F11 Fn + F11

Insert table Ctrl + T ⌃+T

Toggle Autofilter Ctrl + Shift + L ⌘+⇧+F

Select table row Shift + Space ⇧ + Space

Select table column Ctrl + Space ⌃ + Space

Clear slicer filter Alt + C ⌥+C

Toggle table total row Ctrl + Shift + T ⌘+⇧+T

Move to right edge of data region Ctrl + → ⌃+→

Move to left edge of data region Ctrl + ← ⌃+←

Move to top edge of data region Ctrl + ↑ ⌃+↑

Move to bottom edge of data region Ctrl + ↓ ⌃+↓

Move to last cell in worksheet Ctrl + End Fn + ⌃ + →

Move to first cell in worksheet Ctrl + Home Fn + ⌃ + ←

Move right between non-adjacent selections Ctrl + Alt + → ⌃+⌥+→

Move left between non-adjacent selections Ctrl + Alt + ← ⌃+⌥+←

Select active cell only Shift + Backspace ⇧ + Delete

Show the active cell on worksheet Ctrl + Backspace ⌘+G

Move active cell clockwise in selection Ctrl + . ⌃+.

Extend the selection to the last cell right Shift + → ⇧+→

Extend the selection to the last cell left Shift + ← ⇧+←

Extend the selection to the last cell up Shift + ↑ ⇧+↑

Extend the selection to the last cell down Shift + ↓ ⇧+↓

Extend the selection to the last cell right Ctrl + Shift + → ⌃+⇧+→

Extend the selection to the last cell left Ctrl + Shift + ← ⌃+⇧+←

Extend the selection to the last cell up Ctrl + Shift + ↑ ⌃+⇧+↑

Extend the selection to the last cell down Ctrl + Shift + ↓ ⌃+⇧+↓

Extend selection to first cell in worksheet Ctrl + Shift + Home Fn + ⌃ + ⇧ + ←

Extend selection to last cell in worksheet Ctrl + Shift + End Fn + ⌃ + ⇧ + →

Display 'Go To' dialog box Ctrl + G ⌃+G

Select cells with comments Ctrl + Shift + O Fn + ⌃ + ⇧ + O

Select current array Ctrl + / ⌃+/

Select row differences Ctrl + \ ⌃+\

Select column differences Ctrl + Shift + | ⌃+⇧+|

Select direct precedents Ctrl + [ ⌃+[

Select all precedents Ctrl + Shift + { ⌃+⇧+{

Select direct dependents Ctrl + ] ⌃+]

Select all dependents Ctrl + Shift + } ⌃+⇧+}

Edit the active cell F2 ⌃+U

Insert or edit comment Shift + F2 Fn + ⇧ + F2

Start a new line in the same cell Alt + Enter ⌃ + ⌥ + Return

Flash fill Ctrl + E

Enter and move up Shift + Enter ⇧ + Return

Enter and move right Tab Tab

Enter and move left Shift + Tab ⇧ + Tab

Complete entry and stay in same cell Ctrl + Enter ⌃ + Return

Insert current date Ctrl + ; ⌃+;

Insert current time Ctrl + Shift + ; ⌘+;

Fill down from cell above Ctrl + D ⌃+D

Fill right from cell left Ctrl + R ⌃+R

Copy formula from cell above Ctrl + ‘ ⌃+‘

Copy value from cell above Ctrl + Shift + “ ⌃+⇧+“

Add hyperlink Ctrl + K ⌘+K

Display AutoComplete list Alt + ↓ ⌥+↓

Basic Formatting Ctrl + 1 ⌘+1

Display Format Cells with Font tab selected Ctrl + Shift + F ⌃+⇧+F

Apply or remove strikethrough formatting Ctrl + 5 ⌘+⇧+X

Align center Alt + H + A + C ⌘+E

Align left Alt + H + A + L ⌘+L

Align right Alt + H + A + R ⌘+R

Increase font size one step Alt + H + FG ⌘+⇧+>

Decrease font size one step Alt + H + FK ⌘+⇧+<

Apply general format Ctrl + Shift + ~ ⌃+⇧+~

Apply currency format Ctrl + Shift + $ ⌃+⇧+$

Apply percentage format Ctrl + Shift + % ⌃+⇧+%

Apply scientific format Ctrl + Shift + ^ ⌃+⇧+^

Apply date format Ctrl + Shift + # ⌃+⇧+#

Apply time format Ctrl + Shift + @ ⌃+⇧+@

Apply number format Ctrl + Shift + ! ⌃+⇧+!

Add border outline Ctrl + Shift + & ⌘+⌥+0

Add or remove border right Alt + R ⌘+⌥ +→

Add or remove border left Alt + L ⌘+⌥ +←

Add or remove border top Alt + T ⌘+⌥ +↑

Add or remove border bottom Alt + B ⌘+⌥ +↓

Add or remove border upward diagonal Alt + D

Add or remove border horizontal interior Alt + H

Add or remove border vertical interior Alt + V

Remove borders Ctrl + Shift + _ ⌘+⌥ +_

Toggle absolute and relative references F4 ⌘+T

Open the Insert Function Dialog Box Shift + F3 Fn + ⇧ + F3

Autosum selected cells Alt + = ⌘+⇧+T

Toggle formulas on and off Ctrl + ` ⌃+`

Enter array formula Ctrl + Shift + Enter ⌃ + ⇧ + Return

Calculate worksheets F9 Fn + F9

Calculate active worksheet Shift + F9 Fn + ⇧ + F9

Evaluate part of a formula F9 Fn + F9

Open the Name Manager Ctrl + F3 Fn + ⌃ + F3

Define name using row and column labels Ctrl + Shift + F3 Fn + ⌃ + ⇧ + F3

Display Insert Dialog box Ctrl + Shift + + ⌘+⇧++

Display Delete dialog box Ctrl + – ⌘+ –

Hide columns Ctrl + 0 ⌃+0

Hide rows Ctrl + 9 ⌃+9

Unhide rows Ctrl + Shift + 9 ⌃+⇧+9

Unhide columns Ctrl + Shift + 0 ⌃+⇧+0

Group rows or columns Alt + Shift + → ⌘+⇧+K

Ungroup rows or columns Alt + Shift + ← ⌘+⇧+J

Create pivot chart on same worksheet Alt + F1

Create pivot chart on new worksheet F11 Fn + F11

Open pivot table wizard Alt + D + P ⌘+⌥+P

Insert new worksheet Shift + F11 Fn + ⇧ + F11

Go to next worksheet Ctrl + PgDn Fn + ⌃ + ↓

Go to previous worksheet Ctrl + PgUp Fn + ⌃ + ↑

Go to next workbook Ctrl + Alt + ← ⌃ + Tab

Go to previous workbook Ctrl + Shift + Tab ⌃ + ⇧ + Tab

Minimize current workbook window Ctrl + F9 ⌘+M

Maximize current workbook window Ctrl + F10 Fn + ⌃ + F10

Source: Link

BASICS OF SQL
DATA WAREHOUSING

A data warehouse is a centralized repository of integrated data from one or
more disparate sources. Data warehouses store current and historical data and
are used for reporting and analysis of the data.
To move data into a data warehouse, data is periodically extracted from
various sources that contain important business information. As the data is
moved, it can be formatted, cleaned, validated, summarized, and
reorganized. Alternatively, the data can be stored in the lowest level of
detail, with aggregated views provided in the warehouse for reporting. In
either case, the data warehouse becomes a permanent data store for
reporting, analysis, and business intelligence (BI).
Reporting tools don't compete with the transactional systems for query
processing cycles. A data warehouse allows the transactional system to
focus on handling writes, while the data warehouse satisfies the majority of
read requests.

OLTP & OLAP
Online transaction processing (OLTP) captures, stores, and processes data
from transactions in real time. It uses 3NF normalisation to reduce data
redundancy and keep the system fast. Its write speeds are very fast, and in
most cases it does not contain historical data.
Online analytical processing (OLAP) uses complex queries to analyse
aggregated historical data from OLTP systems.

PRIMARY & FOREIGN KEYS

The Primary Key (PK) of a table is a column or a set of columns that can
uniquely identify any row of the said table.
A Foreign Key (FK) is an implied link, a reference to a table containing the
corresponding PK.

STAR SCHEMA

The most basic unit of a star schema consists of a set of dimension tables and a
fact table.
The dimension tables contain the core attributes of the data, e.g. date, product,
place. Each contains a PK that uniquely identifies a row in that dimension
(sometimes surrogate keys are used to add this unique identifier).
The fact table contains the business measures, KPIs, and values. It contains FK
references to the PK values of the dimension tables. The dimensionality of the
fact table is the number of FK references it contains. The granularity of the data
is determined by how deep we can drill down from the broadest level of the
data; it can also be understood as the last level at which the data can be viewed
meaningfully.
Generally, dimension tables contain a relatively small number of rows. Fact
tables, on the other hand, can contain a very large number of rows and continue
to grow over time.

Dimension tables support filtering and grouping; fact tables support
summarization.

This type of model is extensively used in a BI framework. Each Power BI report
visual generates a query that is sent to the Power BI model (dataset). These
queries are used to filter, group, and summarize model data. A well-designed
model, then, is one that provides tables for filtering and grouping, and tables for
summarizing.
A model relationship establishes a filter propagation path between two tables,
and it's the cardinality property of the relationship that determines the table
type. A common relationship cardinality is one-to-many or its inverse
many-to-one. The "one" side is always a dimension-type table while the "many"
side is a fact-type table.
SNOWFLAKE SCHEMA

The snowflake schema is an extension of the star schema. The only variation is
that here the dimension side is normalized, which is not the case in a star
schema. Essentially, a dimension from a star schema is broken down into
normalized entities to give a snowflake schema.
Snowflake schemas reduce the size of the dimensions by reducing
redundancies, but they can add query time because multiple joins must be made
to reach a filtering criterion.

SLOWLY CHANGING DIMENSIONS

A Slowly Changing Dimension (SCD) is a dimension that stores and manages
both current and historical data over time in a data warehouse. Handling SCDs
is considered one of the most critical ETL tasks in tracking the history of
dimension records.

Type 1 SCDs - Overwriting

In a Type 1 SCD the new data overwrites the existing data. Thus the existing
data is lost as it is not stored anywhere else. This is the default type of
dimension you create. You do not need to specify any additional
information to create a Type 1 SCD.

Type 2 SCDs - Creating another dimension record


A Type 2 SCD retains the full history of values. When the value of a chosen
attribute changes, the current record is closed. A new record is created with the
changed data values and this new record becomes the current record. Each
record contains the effective time and expiration time to identify the time period
between which the record was active.

Type 3 SCDs - Creating a current value field

A Type 3 SCD stores two versions of values for certain selected level attributes.
Each record stores the previous value and the current value of the selected
attribute. When the value of any of the selected attributes changes, the current
value is stored as the old value and the new value becomes the current value.

DATA NORMALISATION

Normalisation is done to meet four broad goals:
• Arranging data into logical groupings such that each group describes a small
part of the whole
• Minimising the amount of duplicate data stored in a database
• Organising the data such that, when you modify it, you make the change in
only one place
• Building a database in which you can access and manipulate the data quickly
and efficiently without compromising the integrity of the data in storage

FIRST NORMAL FORM (1NF)

For data to be in 1NF you need to ensure that the data is atomic. A data item is
atomic if only one item is in each cell of a table. For example, a Phone column
holding two numbers in one cell is not atomic; each number should get its own
row.

SECOND NORMAL FORM (2NF)

For a table to be in 2NF, it has to be in 1NF and the non-PK data should have
full functional dependency on the PK of the table.

THIRD NORMAL FORM (3NF)

For a table to be in 3NF, it needs to be in 2NF and should not have any
transitive dependencies.

A transitive dependency is a functional dependency in which two entities are
indirectly dependent on each other through another entity: X determines Z
because X determines Y and Y in turn determines Z.

BOYCE CODD NORMAL FORM (BCNF)

Also known as 3.5NF, BCNF is a stricter version of 3NF. The table should be in
3NF, and in addition no non-PK attribute may determine any PK attribute.
TYPES OF STATEMENTS
• Data Definition Language (DDL) statements define data structures. Use
these statements to create, alter, or drop data structures in a database.
These include ALTER, CREATE, DROP, RENAME, and UPDATE STATISTICS.
• Data Manipulation Language (DML) statements affect the information
stored in the database. Use these statements to insert, update, and
change the rows in the database. These include DELETE, INSERT,
SELECT, UPDATE, MERGE, and TRUNCATE TABLE.
• Data Control Language (DCL) statements define control over the data in a
database. These include GRANT and REVOKE.
• Transaction Control Language (TCL) statements are used to manage
transactions in a database, that is, the changes made by DML statements.
TCL also allows statements to be grouped together into logical transactions.
These include COMMIT and ROLLBACK.

BASIC SQL QUERIES
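
The query screenshots here did not survive extraction. As a stand-in, a minimal sketch of the core SELECT clauses, run through Python's built-in sqlite3 with an illustrative employees table:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL);
INSERT INTO employees VALUES (1,'Asha','Sales',50000), (2,'Ravi','Sales',62000),
                             (3,'Meera','HR',48000), (4,'John','HR',55000);
""")
# WHERE filters rows, GROUP BY aggregates, HAVING filters groups, ORDER BY sorts
query = """
SELECT dept, COUNT(*) AS headcount, AVG(salary) AS avg_salary
FROM employees
WHERE salary > 45000
GROUP BY dept
HAVING COUNT(*) >= 2
ORDER BY avg_salary DESC;
"""
for row in con.execute(query):
    print(row)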


SUBQUERIES

A subquery is a query that is nested inside a SELECT, INSERT, UPDATE, or
DELETE statement, or inside another subquery. A subquery can be used
anywhere an expression is allowed.
A subquery introduced with an unmodified comparison operator must return a
single value, so it cannot include GROUP BY and HAVING clauses.
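
A minimal sqlite3 sketch of a scalar subquery used with a comparison operator (table and values are illustrative):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, salary REAL);
INSERT INTO employees VALUES (1,'Asha',50000), (2,'Ravi',62000), (3,'Meera',48000);
""")
# The subquery returns a single value (the average salary) to compare against
for row in con.execute("""
    SELECT name, salary
    FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees);"""):
    print(row)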

DDL & DML STATEMENTS
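
The statement screenshots here did not survive extraction. As a stand-in, a minimal sqlite3 sketch of one statement from each family (table and values are illustrative):

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, name TEXT)")  # DDL: CREATE
cur.execute("ALTER TABLE dept ADD COLUMN city TEXT")                       # DDL: ALTER
cur.execute("INSERT INTO dept VALUES (1, 'Sales', 'Pune')")                # DML: INSERT
cur.execute("UPDATE dept SET city = 'Mumbai' WHERE dept_id = 1")           # DML: UPDATE
cur.execute("DELETE FROM dept WHERE dept_id = 1")                          # DML: DELETE
con.commit()                                                               # TCL: COMMIT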



VIEWS
A view is a virtual table whose contents are defined by a query. Like a table,
a view consists of a set of named columns and rows of data. Unless
indexed, a view does not exist as a stored set of data values in a database.
The rows and columns of data come from tables referenced in the query
defining the view and are produced dynamically when the view is
referenced.
A view acts as a filter on the underlying tables referenced in the view. The
query that defines the view can be from one or more tables or from other
views in the current or other databases.
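
A minimal sqlite3 sketch of creating and querying a view (names are illustrative):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
INSERT INTO orders VALUES (1,'North',100), (2,'South',250), (3,'North',175);
CREATE VIEW north_orders AS                 -- virtual table defined by a query
    SELECT order_id, amount FROM orders WHERE region = 'North';
""")
for row in con.execute("SELECT * FROM north_orders"):   # queried like a table
    print(row)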

JOINS

BASIC JOIN
LEFT JOIN (OR LEFT OUTER JOIN)

FULL OUTER JOIN
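
The join examples here were screenshots that did not survive extraction. A minimal sqlite3 sketch of the three join types named above (FULL OUTER JOIN requires SQLite 3.39+; tables are illustrative):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, cust_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1,'Asha'), (2,'Ravi'), (3,'Meera');
INSERT INTO orders VALUES (10,1,100), (11,1,50), (12,2,75), (13,9,20);
""")
# Basic (inner) join: only rows with a match on both sides
print(con.execute("SELECT name, amount FROM customers c JOIN orders o ON o.cust_id = c.cust_id").fetchall())
# Left join: every customer, with NULL where no order exists
print(con.execute("SELECT name, amount FROM customers c LEFT JOIN orders o ON o.cust_id = c.cust_id").fetchall())
# Full outer join: unmatched rows from both sides are kept
print(con.execute("SELECT name, amount FROM customers c FULL OUTER JOIN orders o ON o.cust_id = c.cust_id").fetchall())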


EXCEPT, UNION AND UNION ALL

EXCEPT

UNION AND UNION ALL
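
A minimal sqlite3 sketch of the three set operations (tables are illustrative):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE list_a (city TEXT);
CREATE TABLE list_b (city TEXT);
INSERT INTO list_a VALUES ('Mumbai'), ('Delhi'), ('Pune');
INSERT INTO list_b VALUES ('Delhi'), ('Chennai');
""")
print(con.execute("SELECT city FROM list_a EXCEPT SELECT city FROM list_b").fetchall())     # rows in A not in B
print(con.execute("SELECT city FROM list_a UNION SELECT city FROM list_b").fetchall())      # distinct rows from both
print(con.execute("SELECT city FROM list_a UNION ALL SELECT city FROM list_b").fetchall())  # keeps duplicates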


OTHER QUERIES

CASE

DATEADD

CAST & CONVERT

SUBSTRING
COALESCE
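
The examples here were screenshots. As a stand-in, a minimal sqlite3 sketch of each function; note that DATEADD and CONVERT are T-SQL, so the sketch substitutes SQLite's date() modifier and CAST:

import sqlite3

con = sqlite3.connect(":memory:")
row = con.execute("""
SELECT
    CASE WHEN 75 >= 60 THEN 'pass' ELSE 'fail' END,   -- CASE: conditional value
    date('2024-01-15', '+7 days'),                    -- DATEADD equivalent in SQLite
    CAST('42' AS INTEGER),                            -- CAST (CONVERT is the T-SQL variant)
    SUBSTR('analytics', 1, 4),                        -- SUBSTRING is SUBSTR in SQLite
    COALESCE(NULL, NULL, 'fallback')                  -- first non-NULL argument
""").fetchone()
print(row)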

ROW NUMBER, PARTITION & ORDER BY

ROW_NUMBER() assigns a unique sequential number to each row, based on the
parameters specified (generally a combination of row elements).

The PARTITION BY clause divides the result set into partitions (another term for
groups of rows). The ROW_NUMBER() function is applied to each partition
separately and reinitializes the row number for each partition. The PARTITION
BY clause is optional; if you skip it, the ROW_NUMBER() function treats the
whole result set as a single partition.
The ORDER BY clause defines the logical order of the rows within each
partition of the result set; it determines the order in which the numbers are
assigned.
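
Since the query and its output were screenshots in the original, here is a minimal, self-contained sqlite3 sketch instead (window functions require SQLite 3.25+; data is illustrative):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (rep TEXT, region TEXT, amount REAL);
INSERT INTO sales VALUES ('Asha','North',100), ('Ravi','North',250),
                         ('Meera','South',175), ('John','South',80);
""")
# Number rows within each region, highest amount first; numbering restarts per partition
for row in con.execute("""
    SELECT region, rep, amount,
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS rn
    FROM sales"""):
    print(row)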

CONTENT &
DESIGN TEAM

Vinayak Nadir

Kota Sai Santhosh

Ashwin Palnitkar

Srinivasan G
