Q.1. What is data? Why do we need data, and how is it useful?
A.1. Anything that is recorded is data. Observations and facts are data; anecdotes and
opinions are also data, of a different kind. Data can be numbers, like the record of daily
weather, or daily sales. Data can be alphanumeric, such as the name of the employees and
customers.
Business activities are recorded on paper or using electronic media, and then these
records become data. Further data comes from customers’ responses and from the industry
as a whole. All this data can be analyzed and mined using special tools and techniques to
generate patterns and intelligence, which reflect how the business is functioning. These
ideas can then be fed back into the business so that it can evolve to become more
effective and efficient in serving customer needs; and the cycle goes on.
[Figure: Business Intelligence and Data Mining cycle]
1. Data can come from any number of sources – from operational records inside an
organization, or from records compiled by industry bodies and government
agencies. Data can come from individuals telling stories from memory and from
people’s interaction in social contexts, or from machines reporting their own status
or from logs of web usage.
2. Data can come in many ways – it may come as paper reports, or as a file stored on
a computer. It may be words spoken over the phone. It may be e-mail or chat on
the Internet, or it may come as movies and songs on DVDs, and so on.
3. There is also data about data that is called metadata. For example, people regularly
upload videos on YouTube. The format of the video file (whether high resolution
or lower resolution) is metadata. The information about the time of uploading is
metadata. The account from which it was uploaded is also metadata. The record of
downloads of the video is also metadata.
Data serves as the primary source of information for analysis. It contains facts,
observations, measurements, and records related to a particular subject or problem.
Analyzing the data allows us to extract valuable information and gain a deeper
understanding of the phenomenon being studied.
Data analysis plays a crucial role in identifying and resolving problems or
inefficiencies. By examining data, analysts can identify areas of concern, detect
anomalies or outliers, and troubleshoot issues. This allows organizations to
improve processes, minimize errors, and enhance performance.
Through data analysis, historical data can be leveraged to make predictions and
forecasts about future trends, events, or outcomes. Statistical models, machine
learning algorithms, and other techniques can be applied to analyze patterns and
make reliable predictions. This helps organizations anticipate changes, plan ahead,
and make strategic decisions.
Data analysis helps validate and verify hypotheses, theories, or assumptions. By
subjecting data to rigorous analysis, researchers can test the validity of their claims
or hypotheses and determine whether the evidence supports or refutes them. This
scientific approach enhances the credibility and reliability of findings.
b. Ratio Scales: Ratio scales are similar to interval scales in that scores are
distributed in equal units. Yet, unlike interval scales, a distribution of scores on a
ratio scale has a true zero. This is an ideal scale in behavioral research because any
mathematical operation can be performed on the values that are measured.
Common examples of ratio scales include counts and measures of length, height,
weight, and time. For scores on a ratio scale, order is informative. For example, a
person who is 30 years old is older than another who is 20. Differences are also
informative. For example, the difference between 70 and 60 seconds is the same as
the difference between 30 and 20 seconds (the difference is 10 seconds). Ratios are
also informative on this scale because a true zero is defined: a value of zero truly means none of the quantity.
Hence, it is meaningful to state that 60 pounds is twice as heavy as 30 pounds.
d. Ordinal Scales: An ordinal scale of measurement is one that conveys only that
some value is greater or less than another value i.e., order. Examples of ordinal
scales include finishing order in a competition, education level, and rankings.
These scales only indicate that one value is greater than or less than another, so
differences between ranks do not have meaning.
An example of an ordinal scale is given in the table below:
Rank  College Name                                 Actual Score
1     Stanford University                          4.8
2     University of California, Berkeley           4.7
2     University of California, Los Angeles        4.7
4     Harvard University                           4.6
4     University of Michigan, Ann Arbor            4.6
4     Yale University                              4.6
7     University of Illinois at Urbana-Champaign   4.5
7     Princeton University                         4.5
9     University of Minnesota, Twin Cities         4.4
9     University of Wisconsin-Madison              4.4
9     Massachusetts Institute of Technology        4.4
12    University of Pennsylvania                   4.3
The example below shows what a data frame looks like.
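As a minimal sketch, assuming pandas is installed, the college ranking table from the ordinal-scale section can be held in a data frame. The column names (college, score, rank) are illustrative choices, not names from the original notes. The ranks are recomputed from the scores so that tied colleges share the lowest rank in the tie, which is how the table appears to assign them.

```python
import pandas as pd

# Names and scores copied from the ranking table; column names are illustrative.
colleges = pd.DataFrame({
    "college": [
        "Stanford University",
        "University of California, Berkeley",
        "University of California, Los Angeles",
        "Harvard University",
        "University of Michigan, Ann Arbor",
        "Yale University",
        "University of Illinois at Urbana-Champaign",
        "Princeton University",
        "University of Minnesota, Twin Cities",
        "University of Wisconsin-Madison",
        "Massachusetts Institute of Technology",
        "University of Pennsylvania",
    ],
    "score": [4.8, 4.7, 4.7, 4.6, 4.6, 4.6, 4.5, 4.5, 4.4, 4.4, 4.4, 4.3],
})

# method="min" gives every row in a tie the lowest rank of that tie,
# and ascending=False makes the highest score rank 1.
colleges["rank"] = colleges["score"].rank(method="min", ascending=False).astype(int)

print(colleges)
```

Each row is one record and each column is one variable. Note that the rank column is ordinal: the gap between ranks 1 and 2 says nothing about the size of the gap in scores.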
Estimation of location
Variables with measured or count data might have thousands of distinct values. A
basic step in exploring your data is getting a “typical value” for each feature
(variable): an estimate of where most of the data is located (i.e., its central
tendency).
1. Mean: The most basic estimate of location is the mean, or average value. The
mean is the sum of all values divided by the number of values. Consider the
following set of numbers: {3 5 1 2}. The mean is (3 + 5 + 1 + 2) / 4 = 11 / 4 =
2.75. You will encounter the symbol x̄ (pronounced “x-bar”) being used to
represent the mean of a sample from a population. The formula to compute the
mean for a set of n values x1, x2, ..., xn is:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
2. Median: The median is the middle number on a sorted list of the data. If there
is an even number of data values, the middle value is one that is not actually in
the data set, but rather the average of the two values that divide the sorted data
into upper and lower halves. Compared to the mean, which uses all
observations, the median depends only on the values in the center of the sorted
data. While this might seem to be a disadvantage, since the mean is much more
sensitive to the data, there are many instances in which the median is a better
metric for location. Let’s say we want to look at typical household incomes in
neighborhoods around Lake Washington in Seattle. In comparing the Medina
neighborhood to the Windermere neighborhood, using the mean would produce
very different results because Bill Gates lives in Medina. If we use the median,
it won’t matter how rich Bill Gates is—the position of the middle observation
will remain the same.
3. Mode: The mode is the value—or values in case of a tie—that appears most
often in the data. For example, the mode of the cause of delay at Dallas/Fort
Worth airport is “Inbound.” As another example, in most parts of the United
States, the mode for religious preference would be Christian. The mode is a
simple summary statistic for categorical data, and it is generally not used for
numeric data.
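The three estimates of location can be sketched with Python's standard statistics module. The numeric set {3, 5, 1, 2} is the one used in the text; the income figures and the list of delay causes are made-up values for illustration only.

```python
from statistics import mean, median, multimode

values = [3, 5, 1, 2]            # the set used in the text
print(mean(values))              # 2.75
print(median(values))            # 2.5, the average of the two middle values 2 and 3

# Hypothetical incomes (in thousands): one extreme value moves the mean
# a long way but leaves the median where it was.
incomes = [40, 50, 60, 70, 80]
incomes_with_outlier = [40, 50, 60, 70, 80_000]
print(mean(incomes), median(incomes))
print(mean(incomes_with_outlier), median(incomes_with_outlier))

# Hypothetical categorical data: the mode is the most frequent category.
delay_causes = ["Inbound", "Carrier", "Inbound", "Weather", "ATC", "Inbound"]
print(multimode(delay_causes))   # ['Inbound']
```

multimode returns a list so that ties are reported as several values, which matches the definition of the mode given above.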
Estimates of variability
Standard Deviation and Related Estimates:
The most widely used estimates of variation are based on the differences, or deviations,
between the estimate of location and the observed data. For a set of data {1, 4, 4}, the
mean is 3 and the median is 4. The deviations from the mean are the differences: 1 – 3 =
–2, 4 – 3 = 1, 4 – 3 = 1. These deviations tell us how dispersed the data is around the
central value.
$$\text{Mean absolute deviation} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$$
The best-known estimates of variability are the variance and the standard deviation,
which are based on squared deviations. The variance is an average of the squared
deviations, and the standard deviation is the square root of the variance:
$$\text{Variance} = s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
$$\text{Standard deviation} = s = \sqrt{v}$$
where v = s^2 is the variance.
Median absolute deviation = Median(|x1 - m|, |x2 - m|, ..., |xN - m|), where m is the median
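The estimates of variability can be worked through in plain Python for the data set {1, 4, 4} used in the text. This is a sketch with illustrative variable names. It also computes the median absolute deviation, meaning the median of the absolute deviations from the median.

```python
from math import sqrt
from statistics import median

data = [1, 4, 4]
n = len(data)
x_bar = sum(data) / n                                  # mean = 3.0

deviations = [x - x_bar for x in data]                 # [-2.0, 1.0, 1.0]

# Mean absolute deviation: average size of the deviations, ignoring sign.
mean_abs_dev = sum(abs(d) for d in deviations) / n     # 4 / 3

# Sample variance divides the sum of squared deviations by n - 1.
variance = sum(d ** 2 for d in deviations) / (n - 1)   # (4 + 1 + 1) / 2 = 3.0
std_dev = sqrt(variance)                               # square root of the variance

# Median absolute deviation from the median m.
m = median(data)                                       # 4
median_abs_dev = median(abs(x - m) for x in data)      # median of [3, 0, 0] = 0

print(mean_abs_dev, variance, std_dev, median_abs_dev)
```

The single low value 1 drives the variance and standard deviation, while the median absolute deviation is 0 because most values sit exactly on the median.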
The median is 4 murders per 100,000 people, although there is quite a bit of variability:
the 5th percentile is only 1.6 and the 95th percentile is 6.51.
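Percentiles like these can be computed with the pandas Series.quantile method. The sketch below assumes pandas is installed and uses made-up murder rates, not the data set behind the figures quoted above, so its results will differ from those figures.

```python
import pandas as pd

# Hypothetical murder rates per 100,000 people (illustrative values only).
rates = pd.Series([0.9, 1.6, 2.4, 3.1, 4.0, 4.0, 4.8, 5.3, 6.5, 10.3])

# quantile takes fractions between 0 and 1, so 0.05 is the 5th percentile.
pcts = rates.quantile([0.05, 0.25, 0.5, 0.75, 0.95])
print(pcts)

# The interquartile range is the distance between the 75th and 25th percentiles.
iqr = pcts[0.75] - pcts[0.25]
print(iqr)
```

The 50th percentile is the median, and the 25th and 75th percentiles are the values that form the bottom and top of the box in a boxplot.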
To build a boxplot in pandas, we use the following commands:
# state is assumed to be a pandas DataFrame with a 'Population' column
ax = (state['Population'] / 1_000_000).plot.box()
ax.set_ylabel('Population (millions)')
From this boxplot we can immediately see that the median state population is about 5
million, half the states fall between about 2 million and about 7 million, and there are
some high population outliers. The top and bottom of the box are the 75th and 25th
percentiles, respectively. The median is shown by the horizontal line in the box. The
dashed lines, referred to as whiskers, extend from the top and bottom of the box to
indicate the range for the bulk of the data.
pandas supports histograms for data frames with the DataFrame.plot.hist method.
Use the keyword argument bins to define the number of bins. The various plot
methods return an axis object that allows further fine-tuning of the visualization
using Matplotlib:
ax = (state['Population'] / 1_000_000).plot.hist(figsize=(4, 4))
ax.set_xlabel('Population (millions)')
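To make the role of the bins argument concrete, here is a plain-Python sketch of equal-width binning, which is what a histogram with an integer number of bins typically does. The function name histogram_counts and the population figures are hypothetical, chosen only for illustration; this is not how pandas implements it internally.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value would land in bin number `bins`, so clamp it
        # into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical state populations in millions (illustrative values only).
populations = [0.6, 1.3, 2.9, 4.5, 5.0, 6.7, 9.9, 12.8, 19.4, 37.3]
print(histogram_counts(populations, bins=4))
```

Changing bins changes only the width of the intervals: more bins show finer detail, while fewer bins give a smoother but coarser picture.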
Correlation:
Exploratory data analysis in many modeling projects (whether in data science or in
research) involves examining correlation among predictors, and between predictors and a
target variable. Variables X and Y (each with measured data) are said to be positively
correlated if high values of X go with high values of Y, and low values of X go with low
values of Y. If high values of X go with low values of Y, and vice versa, the variables are
negatively correlated.
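The text does not name a specific measure, so the sketch below assumes Pearson's correlation coefficient, a common choice that ranges from -1 to +1. The function name pearson_r and the small data sets are illustrative only.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of paired deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cross / (spread_x * spread_y)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 6]      # tends to rise with x
y_down = [9, 7, 6, 4, 1]    # tends to fall as x rises

print(pearson_r(x, y_up))    # positive: high x goes with high y
print(pearson_r(x, y_down))  # negative: high x goes with low y
```

A value near +1 or -1 indicates a strong linear association, and a value near 0 indicates little or no linear association.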