
BASIC ANALYTICAL CONCEPTS

This document aims to increase awareness of the basic concepts in analytics that may be
asked about during an interview. More information about each of the topics can be found in the
references. Quantitative Methods and Operations Research are two subjects that candidates
should be familiar with. Questions related to the tools (R, Python, Tableau) mentioned in your
resume can be expected, especially about the libraries you are familiar with in these languages.
Team ABC would like to extend our heartfelt gratitude to Keshav Arora, Abhishek Terdalkar,
Sruthy Vempa, Debdutta Barman, Kushal Madke, Mohith P, Ganesh Rathod, Chakshu Chawla,
Rahul Bontada, Abhinay Garg, Abhishek Parate and Bhakti Netke for their valuable
contributions in helping collate this document.

❖ ANALYTICS
Analytics is the process of discovering, interpreting, and communicating significant
patterns in data, and of using tools to empower the entire organisation. It aids in
optimisation, cost savings, and customer engagement. It uses data and mathematics to answer
business questions, discover relationships, predict unknown outcomes and automate
decisions. Consumers are now well informed about products, notably prices and
promotions, which makes prediction complex. Analytics in any organisation helps in
• Directing R&D investment
• Improving the effectiveness of marketing
• Maximising Supply Chain efficiencies

References

1. sas.com
2. oracle.com
3. pwc.in

❖ MEASURES OF CENTRAL TENDENCY, THEIR APPLICABILITY AND DRAWBACKS
A measure of central tendency is a single value that attempts to describe a set of data by
identifying the central position within that set of data. The mean, median and mode are
all valid measures of central tendency, but under different conditions, some measures of
central tendency become more appropriate to use than others.

• The mean has one main disadvantage: it is particularly susceptible to the influence of
outliers. As the data becomes skewed, the mean loses its ability to provide the best
central location for the data because the skewed data is dragging it away from the
typical value.

• The median is the middle score for a set of data that has been arranged in order of
magnitude. The median is less affected by outliers and skewed data.
• The mode is the most frequent score in our data set and can be considered the most
popular option. It is usually used for categorical data, to find the most common
category. It is not useful for continuous data, because no single value is likely to
occur more frequently than the others. Another problem with the mode is that it does
not provide a good measure of central tendency when the most common value is far from
the rest of the data set.
Each measure of central tendency is suitable in different scenarios:

• The mean is usually the best measure of central tendency to use when your data
distribution is continuous and symmetrical.
• The median is usually preferred to other measures of central tendency when your data
set is skewed.
• The mode is the least used of the measures of central tendency and can only be used
when dealing with nominal data.
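
As a quick illustration, here is a minimal Python sketch using only the standard
library (the salary values are made up for the example), showing how a single outlier
drags the mean but not the median or the mode:

    import statistics

    # Made-up salary data (in thousands); 200 is an outlier.
    data = [30, 32, 33, 35, 35, 38, 200]

    print(statistics.mean(data))    # ~57.6 -- dragged upward by the outlier
    print(statistics.median(data))  # 35 -- unaffected by the outlier
    print(statistics.mode(data))    # 35 -- the most frequent value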
References
1. statistics.laerd.com

❖ HYPOTHESIS TESTING
A hypothesis is an educated guess about something in the world around us. It should be
testable, either by experiment or observation. Hypothesis testing is a way to test the
results of a survey or experiment to see whether they are meaningful, by checking the
odds that they arose purely by chance. Statistical significance in this context implies
that the relationship between variables is not due to chance.
Null hypothesis (Ho) - The hypothesis that there is no statistically significant
relationship between the variables (any observed relationship is due to chance).
Alternative hypothesis (Ha) - The hypothesis that there is a statistically significant
relationship between the variables (the relationship is not due to chance).
Level of significance - The threshold at which we reject the null hypothesis. Since
100% certainty in accepting or rejecting a hypothesis is not possible, we select a level
of significance, usually 5%.
Type I error is the rejection of the null hypothesis when it is true, i.e. concluding
there is a statistically significant relationship between the variables when actually
there isn't. It is also known as alpha error.
Type II error is failing to reject the null hypothesis when it is false, i.e. concluding
there is no statistically significant relationship between the variables when there is.
It is also known as beta error.

The chances of committing these two types of errors trade off against each other: for a
fixed sample size, decreasing the Type I error rate increases the Type II error rate,
and vice versa.
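
As a quick illustration, the sketch below runs a one-sample t-test with SciPy (assumed
to be installed; the sample values are made up) and compares the p-value against the
usual 5% level of significance:

    from scipy import stats

    # Made-up sample; H0: the population mean is 50.
    sample = [51.2, 49.8, 52.4, 50.9, 48.7, 53.1, 51.5, 50.2]
    t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

    alpha = 0.05  # the usual 5% level of significance
    if p_value < alpha:
        print(f"p = {p_value:.3f} < {alpha}: reject H0")
    else:
        print(f"p = {p_value:.3f} >= {alpha}: fail to reject H0")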

[Figure omitted. Source: simplypsychology.org]

References

1. benchmarksixsigma.org
2. simplypsychology.org
3. statisticssolutions.com

❖ MACHINE LEARNING - SUPERVISED AND UNSUPERVISED LEARNING


Machine learning is an application of artificial intelligence (AI) that provides systems with
the ability to automatically learn and improve from experience without being explicitly
programmed. Machine learning algorithms are classified as supervised or unsupervised.
Supervised learning is learning in which we teach or train the machine using
well-labelled data, meaning each example is already tagged with the correct answer. The
machine is then provided with a new set of data, and the supervised learning algorithm
uses what it learned from the labelled training data to produce the correct outcome. In
short, a supervised learning algorithm learns from labelled training data and helps you to
predict outcomes for unforeseen data. Supervised learning can be used for two types of
problems:
Classification - Classification models are used for problems where the output variable can
be categorised, such as "Yes" or "No", or "Pass" or "Fail." They are used to predict the
category of the data.
Regression - Regression models are used for problems where the output variable is a real
value. It is most often used to predict numerical values based on previous observations.

Unsupervised learning is the training of a machine using information that is neither
classified nor labelled, allowing the algorithm to act on that information without
guidance. Here the task of the machine is to group unsorted information according to
similarities, patterns and differences, without any prior training. Unsupervised
learning can be used for two types of problems:

• Clustering: The method of clustering involves organising unlabelled data into similar
groups called clusters. A clustering problem is where you want to discover the
inherent groupings in the data, such as grouping customers by purchasing behaviour.
• Association: An association rule learning problem is where you want to discover rules
that describe large portions of your data, such as people that buy X also tend to buy
Y.
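
A minimal scikit-learn sketch contrasting the two paradigms (the library and the toy
data points are assumptions for illustration): a classifier is trained on labelled
points, while k-means discovers groupings in the same points without any labels.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.tree import DecisionTreeClassifier

    # Six toy points forming two obvious groups.
    X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)

    # Supervised: labels are given, and the model learns the mapping.
    y = np.array([0, 0, 0, 1, 1, 1])
    clf = DecisionTreeClassifier().fit(X, y)
    print(clf.predict([[2, 2], [9, 9]]))  # -> [0 1]

    # Unsupervised: no labels; k-means discovers the two groupings itself.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)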
References

1. guru99.com
2. geeksforgeeks.org
3. javatpoint.com
4. analyticsvidhya.com

❖ LINEAR AND LOGISTIC REGRESSION

Linear and logistic regression are the most basic and most commonly used forms of
regression. Both algorithms are supervised in nature, so they use labelled datasets to
make predictions. The main difference between them is that linear regression is used for
solving regression problems, whereas logistic regression is used for solving
classification problems; for the same reason, logistic regression is used when the
dependent variable is binary in nature.
In linear regression the dependent variable is continuous, and the independent variables
can be continuous or discrete. Using a best-fit straight line, linear regression sets up
a relationship between the dependent variable (Y) and one or more independent
variables (X).
In logistic regression the dependent variable takes binary values (0 or 1, true or
false, yes or no), meaning the outcome can only take one of two forms.
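
The contrast is easy to see in code. Here is a minimal sketch with scikit-learn
(assumed to be installed; the numbers are made up): linear regression predicts a real
value, while logistic regression predicts a class and its probability.

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    X = np.array([[1], [2], [3], [4], [5]], dtype=float)

    # Linear regression: continuous dependent variable.
    y_cont = np.array([2.1, 4.3, 6.0, 7.9, 10.2])
    lin = LinearRegression().fit(X, y_cont)
    print(lin.predict([[6.0]]))            # a real-valued prediction

    # Logistic regression: binary dependent variable (0 or 1).
    y_bin = np.array([0, 0, 0, 1, 1])
    log = LogisticRegression().fit(X, y_bin)
    print(log.predict([[2.5], [4.5]]))     # predicted classes
    print(log.predict_proba([[4.5]]))      # class probabilities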

[Figure: Linear regression vs. Logistic regression. Source: techdifferences.com]

References

1. analyticsvidhya.com
2. javatpoint.com
3. techdifferences.com

❖ REGRESSION AND CORRELATION


• Correlation measures the degree of relationship between two variables. Regression
analysis is about how one variable affects another, or what changes in one variable
are triggered by changes in the other.
• Correlation doesn't capture causality, only the degree of interrelation between the two
variables. Regression indicates the impact of a change in the independent variable on
the dependent variable. Even though it shows one variable's effect on the other, it
does not always imply causation.
• Interchanging the axes gives different results for regression, whereas the result is
the same for correlation.
• Depending on whether the correlation is positive or negative, the regression slope
will also be positive or negative.
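
A small NumPy sketch (made-up data) demonstrating the last two points: correlation is
symmetric in the two variables, while the regression slope depends on which variable is
treated as dependent, and both slopes share the sign of the correlation.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

    # Correlation is symmetric: swapping the variables changes nothing.
    print(np.corrcoef(x, y)[0, 1], np.corrcoef(y, x)[0, 1])

    # Regression is not: the y-on-x slope differs from the x-on-y slope,
    # though both carry the same sign as the correlation.
    print(np.polyfit(x, y, 1)[0])  # slope regressing y on x
    print(np.polyfit(y, x, 1)[0])  # slope regressing x on y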
References

1. vedantu.com
2. 365datascience.com
3. graphpad.com

❖ FIT OF A REGRESSION MODEL


• The use of unnecessary independent variables leads to overfitting of the model, i.e.
the model works exceedingly well on the training set but is unable to perform on the
test sets. Increasing the data is one way to avoid overfitting: as you add more data,
the model becomes unable to overfit all the samples and is forced to generalise to
make progress.
• When the model works so poorly that it doesn't even fit the training set, it is called
underfitting. Increasing the number of parameters is one of the ways to avoid
underfitting.
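
A minimal NumPy sketch of both failure modes (the data is synthetic, generated just for
this example): polynomials of too-low and too-high degree are fitted to noisy linear
data; the overfit model has near-zero training error but a worse test error.

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.linspace(0, 1, 10)
    y_train = 2 * x_train + rng.normal(0, 0.1, 10)  # linear signal + noise
    x_test = np.linspace(0, 1, 100)
    y_test = 2 * x_test                             # noise-free truth

    for degree in (0, 1, 7):  # underfit, good fit, overfit
        coeffs = np.polyfit(x_train, y_train, degree)
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")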

[Figure omitted. Source: docs.aws.amazon.com]

References

1. hackernoon.com
2. docs.aws.amazon.com

❖ SIGNIFICANCE OF R², ADJUSTED R² AND PREDICTED R²


• R-squared value gives the proportion of variation in the target variable explained by
the linear regression model. It always lies between 0 and 1. A higher R-squared value
indicates a higher amount of variability being explained by our model and vice-versa.
• The Adjusted R-squared takes into account the number of independent variables
used for predicting the target variable. In doing so, we can determine whether adding
new variables to the model increases the model fit. This is useful in the case of multiple
linear regression.
• Predicted R-squared helps you determine whether you are overfitting a regression
model. A predicted R-squared that is distinctly smaller than R-squared is a warning
sign that you are overfitting the model. We should try reducing the number of terms
in that case.
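
Here is a sketch of how all three quantities can be computed by hand with NumPy. The
R-squared and adjusted R-squared formulas are standard; predicted R-squared is computed
here via the leave-one-out PRESS statistic, using leverages from the hat matrix.

    import numpy as np

    def r_squared_measures(X, y):
        # X is (n, p) without the constant column; an intercept is added here.
        n, p = X.shape
        Xc = np.column_stack([np.ones(n), X])
        beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
        resid = y - Xc @ beta
        ss_res = np.sum(resid ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)

        r2 = 1 - ss_res / ss_tot
        adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

        # Predicted R-squared via PRESS: each residual is rescaled by its
        # leverage h_ii, as if that observation had been left out of the fit.
        h = np.diag(Xc @ np.linalg.inv(Xc.T @ Xc) @ Xc.T)
        press = np.sum((resid / (1 - h)) ** 2)
        pred_r2 = 1 - press / ss_tot
        return r2, adj_r2, pred_r2

    rng = np.random.default_rng(1)
    X = rng.normal(size=(30, 2))
    y = 1.5 * X[:, 0] + rng.normal(size=30)   # only the first column matters
    print(r_squared_measures(X, y))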
References

1. analyticsvidhya.com

❖ SIGNIFICANCE OF P VALUE
In a regression model, the p-value for each independent variable tests the null
hypothesis that there is no correlation between that independent variable and the
dependent variable. So, if the p-value is less than the significance level (usually
0.05), the null hypothesis is rejected: the sample data provides enough evidence to
reject the null hypothesis for the entire population, and the data favours the
hypothesis that there is a correlation. On the other hand, a p-value greater than the
significance level indicates that we cannot reject the null hypothesis: there is
insufficient evidence in the sample to conclude that a correlation exists.
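
A minimal sketch with statsmodels (assumed to be installed; the data is synthetic)
showing per-variable p-values: one regressor truly drives y, the other is pure noise,
and their p-values typically reflect that.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=100)              # truly related to y
    x2 = rng.normal(size=100)              # pure noise
    y = 3 * x1 + rng.normal(size=100)

    X = sm.add_constant(np.column_stack([x1, x2]))
    results = sm.OLS(y, X).fit()
    print(results.pvalues)  # x1's p-value is tiny; x2's is usually > 0.05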
References

1. medium.com
2. statisticsbyjim.com

❖ ASSOCIATION ANALYSIS
• Association analysis is a technique used to understand customer buying patterns. It
aims to find relationships and establish patterns across purchases, commonly referred
to as market basket analysis.
• These relationships can be used to increase profitability through cross-selling,
recommendations, promotions, or even the placement of items on a menu or in a
store.
• Amazon presents users with related products, under the headings of "Frequently
bought together" and "Customers who bought this item also bought."
• For example, people who buy bread and peanut butter may also buy jelly, and people
who buy shampoo might buy conditioner too.
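
The two standard measures behind such rules are support (how often an itemset appears
across baskets) and confidence (how often the rule holds when its antecedent appears).
A dependency-free Python sketch over made-up baskets, checking the bread and peanut
butter example above:

    # Minimal sketch of support and confidence for one candidate rule
    # ("bread & peanut butter -> jelly") over made-up baskets.
    baskets = [
        {"bread", "peanut butter", "jelly"},
        {"bread", "peanut butter"},
        {"bread", "jelly"},
        {"shampoo", "conditioner"},
        {"bread", "peanut butter", "jelly"},
    ]

    def support(itemset):
        # Fraction of baskets containing every item in the itemset.
        return sum(itemset <= b for b in baskets) / len(baskets)

    antecedent = {"bread", "peanut butter"}
    consequent = {"jelly"}

    supp = support(antecedent | consequent)   # 2/5 = 0.40
    conf = supp / support(antecedent)         # 0.40 / 0.60 = 0.67
    print(f"support={supp:.2f}, confidence={conf:.2f}")

In practice, libraries such as mlxtend automate the search for all frequent itemsets
and rules above chosen support and confidence thresholds.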
References

1. smartbridge.com
2. searchcustomerexperience.techtarget.com

❖ BIG DATA AND ITS CHARACTERISTICS


Big data refers to larger, more complex data sets, coming from new data sources with the
advent of digitisation. These data sets are so huge that traditional data processing
software is unable to manage them. But it's not just the amount of data that's important:
these massive volumes of data can be used to address business problems that we
wouldn't have been able to tackle before. Big data can be analysed for insights that lead
to better decisions and strategic business moves. These are the 4 Vs of Big Data:

• Volume – This refers to the amount of data. For some organisations, this might be
tens of terabytes of data; for others, it may be hundreds of petabytes. Storage used
to be a major issue, but cloud storage has eased that burden.
• Velocity – This refers to the speed at which data is now generated. With more IoT
and smart devices, velocity has increased drastically.
• Variety – This refers to the different types of data generated. Traditionally data was
structured, but now the major chunk of it is unstructured and semi-structured data,
which includes text, images, audio and video. All these add complexity and require
additional processing to derive meaning.
• Veracity – This refers to the quality of the data and signifies how meaningful it is.
The non-valuable portion of these data sets is referred to as noise. An example of a
high-veracity data set would be data from a medical experiment or trial.

[Figure omitted. Source: ibmbigdatahub.com]

References

1. ibmbigdatahub.com
2. oracle.com
3. sas.com

❖ ARTIFICIAL INTELLIGENCE
AI refers to the simulation of human intelligence in machines that are programmed to
think like humans and mimic their actions. The goals of artificial intelligence include
learning, reasoning, and perception. Most AI examples that we hear about, from
chess-playing computers to self-driving cars, rely heavily on deep learning and natural
language processing.

• Deep learning is a subset of machine learning in which artificial neural networks,
algorithms inspired by the human brain, learn from large amounts of data.
• Natural Language Processing, commonly referred to as NLP, interprets raw,
arbitrary written text and transforms it into something a computer can understand.
Artificial intelligence can further be divided into two different categories:

• Weak AI - These are systems designed to carry out a single task, e.g. personal
assistants like Alexa and Siri.
• Strong AI - These are more complicated and complex systems that are more human-like,
e.g. self-driving cars.

[Figure omitted. Source: athenatech.tech]

References

1. sas.com
2. iodinesoftware.com
3. forbes.com
4. athenatech.tech
5. investopedia.com
6. expert.ai

❖ AI AND AUTOMATION
The terms Artificial Intelligence and Automation are often used
interchangeably. However, there are pretty significant differences in the complexity of
the two kinds of systems. Automation is making hardware or software that is capable of
doing things automatically, without human intervention. Artificial Intelligence,
however, is the science and engineering of making intelligent machines. Automation may
or may not be based on Artificial Intelligence. The key differences are:
• AI mimics human intelligence, decisions and actions, while automation focuses on
streamlining repetitive, instructive tasks.
• AI involves learning from past data and evolving from it, whereas automation is
pre-set to do particular tasks.
References

1. becominghuman.ai
2. geeksforgeeks.org

❖ BUSINESS INTELLIGENCE

The terms business intelligence and business analytics are often used interchangeably.
The term business analytics refers to examining data to find trends and insights. When
used together, business intelligence and business analytics have a broader meaning that
includes every aspect of gathering, analysing, and interpreting data. Business intelligence
(BI) combines business analytics, data mining, data visualisation, data tools and
infrastructure, and best practices to help organisations make more data-driven
decisions. It is a set of technologies, procedures and applications that enable us to
convert raw data into meaningful information that can be used for decision making. It
relies predominantly on data extraction and data warehousing techniques. The main
intention of business intelligence is to analyse data and predict the future from past
data. Some ways in which companies can leverage BI are:

• Creating KPIs (Key Performance Indicators) based on historical data.
• Identifying and setting benchmarks for varied processes.
• With BI systems, organisations can identify market trends and spot business
problems that need to be addressed.
• BI helps with data visualisation, which enhances data quality and thereby the
quality of decision making.
• Capturing insights about employees and prospects to optimise HR processes
and recruitment.
References

1. geeksforgeeks.org
2. tableau.com
3. oracle.com

❖ PUZZLES
Puzzles may be asked as part of testing your analytical thinking. The sites below
provide some of the most commonly asked puzzles:
• https://analyticsindiamag.com/10-commonly-asked-puzzles-in-a-data-science-interview/
• https://analyticsindiamag.com/10-standard-puzzles-asked-analytics-interviews/
• https://www.analyticsvidhya.com/blog/2016/07/20-challenging-job-interview-puzzles-which-every-analyst-solve-atleast/

PROMISING INDIAN ANALYTICS FIRMS

• Mu Sigma – Mu Sigma has worked with more than 140 Fortune 500 companies and
provided substantial ROI to businesses. Mu Sigma helps businesses drive intelligent
decision making through its best-in-class, accurate data analytics solutions. Serving
various industries, Mu Sigma specialises in demand analytics, marketing analytics,
network planning and optimisation, and risk analytics, among other services.

• Fractal Analytics – Fractal Analytics provides a range of analytics solutions for its
varied clientele. The services include image and video analytics, text analytics,
forecasting solutions, and advanced artificial intelligence solutions. They have
solutions for customer experience, supply chain, and behavioural science.

• Manthan – Manthan’s Business Analytics intends to seamlessly align technology and
customer experience. It provides big data and AI-aided decision-making solutions for
restaurants, fashion and apparel, e-commerce, food and grocery, and convenience
stores.

• Tiger Analytics – Tiger Analytics offers a range of services including marketing
analytics, customer analytics, risk analytics, and operations and planning services.
They help in setting up KPI reporting and dashboards for tracking and alerting on
pre-defined performance indicators. They also facilitate insight discovery through a
structured and iterative approach.

RECOMMENDED CERTIFICATIONS
MS Excel

• Problem Solving with Excel by PwC | Coursera
• Data Visualization with Advanced Excel by PwC | Coursera
• Excel Skills for Business Specialisation by Macquarie University | Coursera
• Excel Beginner to Advanced in 4 Hours by Adrian Pumeriega | Udemy
• Microsoft Excel - Excel from Beginner to Advanced by Kyle Pew, Office Newb LLC | Udemy
Python

• Python for Everybody Specialisation by University of Michigan | Coursera
• Crash Course on Python by Google | Coursera
• Introduction to Data Science by University of Michigan | Coursera
• Python for Data Science and AI by IBM | Coursera
Tableau

• Data Visualization and Communication with Tableau by Duke University | Coursera
• Tableau 2020 A-Z: Hands-On Tableau Training for Data Science by Kirill Eremenko, SuperDataScience Team | Udemy
R Programming

• R Programming by Johns Hopkins University | Coursera
• R Programming A-Z: R for Data Science with Real Exercises by Kirill Eremenko, SuperDataScience Team | Udemy
SQL

• Databases and SQL for Data Science by IBM | Coursera
• SQL for Data Science by University of California | Coursera
Google Analytics

• Google Analytics for Beginners | Google
• Advanced Google Analytics | Google
• Ultimate Google Analytics Course by Pavel Brecik | Udemy
Others

• Business Statistics and Analysis Specialisation | Coursera
• Marketing Analytics by University of Virginia | Coursera
• Machine Learning A-Z: Hands-On Python & R in Data Science by Kirill Eremenko, Hadelin de Ponteves, SuperDataScience Team | Udemy
