This document aims to raise awareness of the basic analytics concepts that may be
asked about during an interview. More information about each topic can be found in the
references. Quantitative Methods and Operations Research are two subjects you should be
familiar with. Expect questions about the tools (R, Python, Tableau) mentioned in your
resume, especially about the libraries you have used in these languages.
Team ABC would like to extend our heartfelt gratitude to Keshav Arora, Abhishek Terdalkar,
Sruthy Vempa, Debdutta Barman, Kushal Madke, Mohith P, Ganesh Rathod, Chakshu Chawla,
Rahul Bontada, Abhinay Garg, Abhishek Parate and Bhakti Netke for their valuable
contributions in helping collate this document.
❖ ANALYTICS
Analytics is the process of discovering, interpreting, and communicating significant
patterns in data, and of using tools to empower the entire organisation. It aids in
optimisation, cost savings, and customer engagement. It uses data and mathematics to answer
business questions, discover relationships, predict unknown outcomes and automate
decisions. Consumers today are well informed about products, notably prices and
promotions, which makes prediction complex. Analytics in any organisation helps in
• Directing R&D investment
• Improving the effectiveness of marketing
• Maximising Supply Chain efficiencies
References
1. sas.com
2. oracle.com
3. pwc.in
❖ MEASURES OF CENTRAL TENDENCY
• The mean has one main disadvantage: it is particularly susceptible to the influence of
outliers. As the data becomes skewed, the mean loses its ability to indicate the
central location of the data, because the skewed values drag it away from the
typical value.
• The median is the middle score for a set of data that has been arranged in order of
magnitude. The median is less affected by outliers and skewed data.
• The mode is the most frequent score in the data set and can be thought of as the
most popular option. It is usually used with categorical data to find the
most common category. It is not useful for continuous data, where no single value
is likely to occur more often than the others. Another problem with the mode is
that it does not provide a good measure of central tendency when the most common
value is far from the rest of the data set.
Each of the different measures of central tendency is suitable in different scenarios:
• The mean is usually the best measure of central tendency to use when your data
distribution is continuous and symmetrical.
• The median is usually preferred to other measures of central tendency when your data
set is skewed.
• The mode is the least used of the measures of central tendency and is typically
used only with nominal data.
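A quick Python sketch (the numbers are made up for illustration) of how a single outlier drags the mean away from the typical value while the median stays put, and of the mode on categorical data:

```python
import statistics

# Hypothetical salaries (in thousands); the last value is an extreme outlier.
salaries = [30, 32, 35, 36, 38, 40, 400]

print(statistics.mean(salaries))    # about 87.3 - dragged up by the outlier
print(statistics.median(salaries))  # 36 - still close to the typical value

# The mode suits categorical data: the most common purchase channel here.
channels = ["online", "store", "online", "phone", "online", "store"]
print(statistics.mode(channels))    # 'online'
```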
References
1. statistics.laerd.com
❖ HYPOTHESIS TESTING
A hypothesis is an educated guess about something in the world around us. It should be
testable, either by experiment or observation. Hypothesis testing is a way to test the
results of a survey or experiment to see whether they are meaningful, by checking the
odds that they arose purely by chance. Statistical significance in this context implies
that the relationship between variables is not due to chance.
Null hypothesis (Ho) - It is a hypothesis that says there is no statistical significance
between the variables in the hypothesis (relationship is by chance).
Alternative Hypothesis (Ha) - Hypothesis which states that there is statistical significance
between the variables in the hypothesis (relationship not by chance).
Level of significance - The threshold at which we decide to reject the null hypothesis.
Since 100% certainty in accepting or rejecting a hypothesis is not possible, we select a
level of significance, usually 5%.
Type I Error is the rejection of the null hypothesis when it is true, implying there is
statistical significance between the variables when there actually isn't. This is also
known as the alpha error.
Type II Error is failing to reject the null hypothesis when it is false, implying there is
no statistical significance between the variables when there is. This is also known as the
beta error.
The two types of errors trade off against each other: decreasing the Type I error rate
increases the Type II error rate, and vice versa.
Source: simplypsychology.org
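As an illustration, a two-tailed one-sample z-test in Python, using hypothetical sample values and an assumed known population standard deviation:

```python
from statistics import NormalDist, mean

# Hypothetical data: Ho says the population mean is 50; sigma is assumed known.
sample = [53.1, 52.8, 54.0, 51.2, 55.4, 52.9, 53.8, 52.5]
mu0, sigma, alpha = 50.0, 4.0, 0.05

n = len(sample)
z = (mean(sample) - mu0) / (sigma / n ** 0.5)   # standardised test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-tailed p-value

# Reject Ho only if such a result would be unlikely under chance alone.
print(p_value < alpha)  # True here: p is roughly 0.02, below the 5% level
```

In practice, with an unknown population variance and a small sample, a t-test (for example `scipy.stats.ttest_1samp`) would be used instead.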
References
1. benchmarksixsigma.org
2. simplypsychology.org
3. statisticssolutions.com
❖ UNSUPERVISED LEARNING
Unsupervised learning is the training of a machine on data that is neither classified
nor labelled, allowing the algorithm to act on that information without guidance. Here
the task of the machine is to group unsorted information according to similarities,
patterns and differences, without any prior training on the data. Unsupervised
learning can be used for two types of problems:
• Clustering: The method of clustering involves organising unlabelled data into similar
groups called clusters. A clustering problem is where you want to discover the
inherent groupings in the data, such as grouping customers by purchasing behaviour.
• Association: An association rule learning problem is where you want to discover rules
that describe large portions of your data, such as people that buy X also tend to buy
Y.
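As a clustering sketch, here is a tiny 1-D k-means written from scratch on hypothetical customer-spend figures; in practice a library implementation such as scikit-learn's KMeans would be used:

```python
import statistics

def kmeans_1d(values, k=2, iters=20):
    # Seed centroids with spread-out values from the sorted data.
    srt = sorted(values)
    centroids = [srt[i * (len(srt) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            # Assign each point to its nearest centroid...
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # ...then move each centroid to the mean of its cluster.
        centroids = [statistics.mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

# Hypothetical monthly spend per customer: two natural groups emerge.
spend = [12, 15, 14, 11, 95, 102, 99, 13, 97]
low, high = sorted(kmeans_1d(spend, k=2), key=statistics.mean)
print(low)   # the low spenders
print(high)  # the high spenders
```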
References
1. guru99.com
2. geeksforgeeks.org
3. javatpoint.com
4. analyticsvidhya.com
❖ LINEAR AND LOGISTIC REGRESSION
Linear and logistic regression are the most basic and most commonly used forms of
regression. Both algorithms are supervised, so they use a labelled dataset to make
predictions. The main difference between them is that linear regression is used for
solving regression problems, whereas logistic regression is used for solving
classification problems. For this reason, logistic regression is used when the
dependent variable is binary in nature.
Linear regression involves a continuous dependent variable, while the independent
variables can be continuous or discrete. Using a best-fit straight line, linear
regression sets up a relationship between the dependent variable (Y) and one or more
independent variables (X).
Logistic regression involves a dependent variable that takes binary values (0 or 1,
true or false, yes or no), meaning the outcome can only take one of two forms.
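The contrast can be sketched in Python with toy data: a closed-form least-squares line for a continuous target, and a logistic model fitted by plain gradient descent for a binary target. Both are written from scratch for illustration; a library such as scikit-learn would normally be used:

```python
import math

# Linear regression (closed-form least squares) on a continuous target.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]                    # roughly y = 2x
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
intercept = ybar - slope * xbar

def predict_linear(x):
    return intercept + slope * x                   # unbounded continuous output

# Logistic regression (gradient descent) on a binary target.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 0, 0, 1, 1, 1, 1]                  # 1 = passed the exam
w, b, lr = 0.0, 0.0, 0.1
for _ in range(5000):
    for x, y in zip(hours, passed):
        p = 1 / (1 + math.exp(-(w * x + b)))       # sigmoid squashes to (0, 1)
        w -= lr * (p - y) * x
        b -= lr * (p - y)

def predict_prob(x):
    return 1 / (1 + math.exp(-(w * x + b)))        # a probability, not a raw value

print(round(predict_linear(6), 1))                 # about 12.0
print(predict_prob(8) > 0.5, predict_prob(1) < 0.5)
```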
Figure: Linear regression vs Logistic regression
Source: techdifferences.com
References
1. analyticsvidhya.com
2. javatpoint.com
3. techdifferences.com
❖ OVERFITTING AND UNDERFITTING
• When a model fits the training data so closely that it captures noise rather than the
underlying pattern, it is called overfitting. As you add more data, the model becomes
unable to overfit all the samples and is forced to generalise to make progress.
• When the model works so poorly that it doesn't even fit the training set, it is called
underfitting. Increasing the number of parameters is one way to avoid underfitting.
Source: docs.aws.amazon.com
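The contrast can be illustrated with a toy experiment: a lookup table that memorises the training points (overfitting), a single constant (underfitting), and a least-squares line, compared by mean squared error on held-out data. All numbers here are made up:

```python
import statistics

# Toy data following roughly y = 2x; train on part, evaluate on the rest.
train = [(1, 2.1), (2, 3.8), (3, 6.3), (4, 8.1), (5, 9.7)]
test = [(6, 12.2), (7, 13.8)]

def mse(model, data):
    return statistics.mean((model(x) - y) ** 2 for x, y in data)

# Underfitting: a single constant is too simple even for the training set.
const = statistics.mean(y for _, y in train)
underfit = lambda x: const

# Overfitting: memorise every training point; useless on anything unseen.
table = dict(train)
overfit = lambda x: table.get(x, 0.0)

# A reasonable fit: a least-squares line (two parameters).
xb = statistics.mean(x for x, _ in train)
yb = statistics.mean(y for _, y in train)
slope = (sum((x - xb) * (y - yb) for x, y in train)
         / sum((x - xb) ** 2 for x, _ in train))
line = lambda x: yb + slope * (x - xb)

print(mse(overfit, train), mse(overfit, test))    # zero on train, huge on test
print(mse(underfit, train), mse(underfit, test))  # poor on both
print(mse(line, train), mse(line, test))          # small on both
```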
References
1. hackernoon.com
2. docs.aws.amazon.com
3. analyticsvidhya.com
❖ SIGNIFICANCE OF P VALUE
In a regression model, the p-value for each independent variable tests the null
hypothesis that there is no correlation between that independent variable and the
dependent variable. If the p-value is less than the significance level (usually 0.05),
the null hypothesis is rejected: the sample provides enough evidence to conclude that a
correlation exists in the population, and the variable is a meaningful addition to the
model. On the other hand, a p-value greater than the significance level indicates that
we cannot reject the null hypothesis, meaning there is insufficient evidence in the
sample to conclude that a correlation exists.
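One intuitive way to see what a p-value measures is a permutation test on toy data: shuffle one variable to destroy any real association, and count how often chance alone produces a correlation as strong as the one observed. The spend/sales figures below are hypothetical:

```python
import random

def corr(xs, ys):
    # Pearson correlation coefficient, written out for illustration.
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xb) * (y - yb) for x, y in zip(xs, ys))
    sxx = sum((x - xb) ** 2 for x in xs)
    syy = sum((y - yb) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical ad-spend vs sales figures with a genuine positive relationship.
spend = [1, 2, 3, 4, 5, 6, 7, 8]
sales = [3, 5, 4, 7, 8, 9, 8, 11]

random.seed(0)
observed = abs(corr(spend, sales))
shuffled, extreme, trials = list(sales), 0, 10_000
for _ in range(trials):
    random.shuffle(shuffled)                 # destroy any real association
    if abs(corr(spend, shuffled)) >= observed:
        extreme += 1
p_value = extreme / trials                   # chance of a correlation this strong by luck
print(p_value < 0.05)                        # True: reject "no correlation"
```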
References
1. medium.com
2. statisticsbyjim.com
❖ ASSOCIATION ANALYSIS
• Association analysis is a technique used to understand customer buying patterns. It
aims to find relationships and establish patterns across purchases, commonly referred
to as market basket analysis.
• These relationships can be used to increase profitability through cross-selling,
recommendations, promotions, or even the placement of items on a menu or in a
store.
• Amazon presents users with related products, under the headings of "Frequently
bought together" and "Customers who bought this item also bought."
• People who buy bread and peanut butter also buy jelly. Or people who buy shampoo
might buy a conditioner too.
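A minimal market basket sketch in Python, with hypothetical baskets, computing the support and confidence of the rule "bread and peanut butter → jelly":

```python
# Hypothetical shopping baskets (one set of items per transaction).
baskets = [
    {"bread", "peanut butter", "jelly"},
    {"bread", "peanut butter", "jelly", "milk"},
    {"bread", "peanut butter"},
    {"shampoo", "conditioner"},
    {"bread", "milk"},
    {"shampoo", "conditioner", "soap"},
]

def support(items):
    # Fraction of baskets that contain every item in the set.
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    # Of the baskets containing the antecedent, how many also hold the consequent?
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "peanut butter", "jelly"}))       # 2 of the 6 baskets
print(confidence({"bread", "peanut butter"}, {"jelly"}))  # 2 of the 3 qualifying baskets
```

Full association analysis (e.g. the Apriori algorithm) enumerates all itemsets above a minimum support and keeps the rules above a minimum confidence.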
References
1. smartbridge.com
2. searchcustomerexperience.techtarget.com
❖ BIG DATA
• Volume - This refers to the amount of data. For some organisations this might be
tens of terabytes; for others, hundreds of petabytes. Storage was earlier a major
issue, but cloud storage has eased that burden.
• Velocity – The speed at which data is being generated now is the key aspect here.
With more IOT devices and smart devices the velocity has increased drastically.
• Variety – This refers to the different types of data generated. Traditionally data was
structured, but now the major chunk of it is unstructured and semi-structured data,
including text, images, audio and video. All of these add complexity and require
additional processing to derive meaning.
• Veracity – This refers to the quality of the data and signifies how meaningful it is.
The non-valuable portion of such data sets is referred to as noise. An example of a
high-veracity data set would be data from a medical experiment or trial.
Source: ibmbigdatahub.com
References
1. ibmbigdatahub.com
2. oracle.com
3. sas.com
❖ ARTIFICIAL INTELLIGENCE
AI refers to the simulation of human intelligence in machines that are programmed to
think like humans and mimic their actions. The goals of artificial intelligence include
learning, reasoning, and perception. Most AI examples that we hear about, from
chess-playing computers to self-driving cars, rely heavily on deep learning and natural
language processing.
• Weak AI - These are systems designed to carry out a single task, e.g., personal
assistants like Alexa and Siri.
• Strong AI - These are more complicated and complex systems that are more
human-like, e.g., self-driving cars.
Source: athenatech.tech
References
1. sas.com
2. iodinesoftware.com
3. forbes.com
4. athenatech.tech
5. investopedia.com
6. expert.ai
❖ AI AND AUTOMATION
The terms Artificial Intelligence and Automation are often used
interchangeably. However, there are significant differences in the complexity of the
two kinds of systems. Automation means building hardware or software that can do things
automatically, without human intervention. Artificial Intelligence, by contrast, is the
science and engineering of making intelligent machines. Automation may or may not be
based on Artificial Intelligence. The key differences are that:
• AI mimics human intelligence in decisions and actions, while automation focuses on
streamlining repetitive, rule-based tasks.
• AI involves learning from past data and evolving from it, whereas automation is
pre-set to do particular tasks.
References
1. becominghuman.ai
2. geeksforgeeks.org
❖ BUSINESS INTELLIGENCE
The terms business intelligence and business analytics are often used interchangeably.
Business analytics refers to examining data to find trends and insights. Business
intelligence (BI) has a broader meaning: it covers every aspect of gathering, analysing,
and interpreting data, combining business analytics, data mining, data visualisation,
data tools and infrastructure, and best practices to help organisations make more
data-driven decisions. It is a set of technologies, procedures and applications that
enable us to convert raw data into meaningful information that can be used for decision
making. It is predominantly used in data extraction and data warehousing. The main
intention of business intelligence is to analyse data and predict the future from past
data.
References
1. geeksforgeeks.org
2. tableau.com
3. oracle.com
❖ PUZZLES
Puzzles may be asked as part of testing your analytical thinking. The sites below
provide some of the most commonly asked puzzles:
• https://analyticsindiamag.com/10-commonly-asked-puzzles-in-a-data-science-interview/
• https://analyticsindiamag.com/10-standard-puzzles-asked-analytics-interviews/
• https://www.analyticsvidhya.com/blog/2016/07/20-challenging-job-interview-puzzles-which-every-analyst-solve-atleast/
❖ PROMISING INDIAN ANALYTICS FIRMS
• Mu Sigma – Mu Sigma has worked with more than 140 Fortune 500 companies and
provided substantial ROI to businesses. Mu Sigma helps businesses drive intelligent
decision making through its best-in-class, accurate data analytics solutions. Serving
various industries, Mu Sigma specialises in demand analytics, marketing analytics,
network planning and optimisation, and risk analytics, among other services.
• Fractal Analytics - Fractal Analytics provides a range of Analytics Solutions for its
varied clientele. The services include image and video analytics, text analytics,
forecasting solutions, and advanced artificial intelligence solutions. They have
solutions for customer experience, supply chain, and behavioural science.
• Tiger Analytics - Tiger Analytics has a range of services including Marketing Analytics,
Customer Analytics, Risk Analytics, and operations and planning services. They help in
setting up the KPI reporting and dashboard for tracking and alerting on pre-defined
performance indicators. They also facilitate insights discovery through a structured
and iterative approach.
❖ RECOMMENDED CERTIFICATIONS
MS Excel