
Note: This transcription document is a text version of the upGrad videos present in this
session. It is not meant to be read independently, but can be used to complement your video
watching experience.

Speaker: Himanshu Manroa

Okay. So, let's start understanding the principles of central tendency, the measures of central
tendency, and three very important concepts here that I'm sure you have heard before. And we'll
all keep hearing them, you know, whenever we talk about statistics and averages and what
constitutes the data that we are talking about.
So, it's the concept of mean, median and mode. Very simple terms, but often I've seen, you know,
certain misperceptions develop about these, and people are not very sure.

So, let's get this straight. So, when do we use a mean, when do we use a median and when a mode?

Let's take an example of a telecom company. You know, it is selling its various plans, its mobile
plans, its data plans, to its consumers. And we as consumers have been purchasing them.

The most important aspect that any telecom company would want to understand is, you know,
what's the average revenue that I'm generating per user.
So, you know, the terminology in telecom is ARPU, A R P U, average revenue per user.
It's a very simple thing, you know, so you might have, you know, one lakh customers spread across
a particular location or circle as we call it in telecom, say the Mumbai circle or Maharashtra circle.

But you want to know, you know, what is my average revenue across each of these one lakh
customers? And that's where the concept of mean comes in handy. It's a very simple thing, you know:
you take your total sales across your entire customer base and your total number of customers.
Simple division.

You know, you divide your total sales by your total number of customers. What you get is average
revenue per user.
Of course, individual customers differ. There could be one who is giving you bills worth, say,
Rs. 5,000 a month. There is somebody else who is a very conservative customer and is giving you
bills worth, say, less than Rs. 100 per month on one of those basic plans.

But for you, it's about understanding, you know, the larger dimensions of the Maharashtra circle:
what is the Maharashtra circle yielding me in average revenue? That's where the mean comes in handy.
You know, it sums up your data with a single number. That's the size of the pie of the Maharashtra
circle. You know, that's the average revenue.
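
As a rough illustration of that division, here is a minimal Python sketch. The function name and the revenue figure are assumptions made up for the example; only the "one lakh customers" comes from the talk.

```python
def average_revenue_per_user(total_revenue, customer_count):
    """Mean revenue per customer: total revenue divided by the number of customers."""
    if customer_count <= 0:
        raise ValueError("customer_count must be positive")
    return total_revenue / customer_count


# Hypothetical circle: one lakh (100,000) customers billing Rs. 3 crore in a month.
arpu = average_revenue_per_user(total_revenue=30_000_000, customer_count=100_000)
print(arpu)  # 300.0
```

The single number hides the fact that one customer may bill Rs. 5,000 and another under Rs. 100, which is exactly the trade-off described above.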

And, you know, despite the fact that we know the customers are, you know, spread out fairly, you
know, extensively for us, mean does the job in conveying, you know, the strength of Maharashtra
circle.

Now we can take this example ahead. When do we use median? And why do we use median? You
know, how different is it from a mean?

So, mean gives you a very simplistic average value, a central value. A median allows us to look at a
more realistic value when we know the dispersion or, you know, the fluctuations within the data, the
ranges within the data are vast.

When do we use median, you know, when do they come handy?


Say I want to look at household income levels. There could be, again, thousands of households
that we are catering to, and they have their income levels. We want to look at the disposable
incomes that they have, you know, so that they can spend on more value-added services within our
telecom. And that's where we want to understand their income levels.

Now income levels could be fairly spread out.

You know, you could have within your customer base a multimillionaire, right, a kind of business
tycoon with income levels running into crores of rupees. He also buys your plan, right? And within
the same circle you have, you know, people with very basic income levels, much lower down.

Now, if we were to take a mean of this data, because of those one or few high net worth
individuals sitting at the upper end of your data, the averages would be highly distorted.
Because of those few high net worth individuals, your average would show much higher
income levels than what your larger customer base is actually sitting at.

So, it simply inflates your entire average, and the household income levels that you see
are distorted and skewed higher than what they are in the larger base.
And that's where median comes into play.
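
To see the distortion in numbers, here is a small Python sketch. The income figures are invented for illustration and are not from the video; the functions come from Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical monthly household incomes in rupees; the last entry is the tycoon.
incomes = [12_000, 15_000, 18_000, 20_000, 25_000, 10_000_000]

mean_income = mean(incomes)      # dragged upward by the single extreme value
median_income = median(incomes)  # average of the two middle values, 18,000 and 20,000

print(round(mean_income))  # 1681667
print(median_income)       # 19000.0
```

With these made-up numbers the mean is far above every household except the tycoon, while the median stays close to what the larger base earns.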

When you want to look at the realistic values of your data, you are not looking for one single number,
you know, that sums it up.
You are looking for something that eliminates this entire, you know, flaw coming from extreme
values, extreme outliers.

So, what do you do for the median? You simply arrange your data. You sort it in either ascending
or descending order. And you look at the center point of the data. So, imagine you had, say, 1001
customers. It's an odd number, so the center point would be the 501st customer.

And if it's an even number of customers that you have, you would have, you know, two central points.
Because the count is even, you won't have a single center point in that data. So, you would look at
these two center points and take an average of them.

So, you know, your median could be a number which is not an actual number sitting within your data.
Say your two center points are income levels of 10,000 and 15,000, and you want to draw out a
median from them. You come up with the average of these two numbers that are sitting right at the
centre of the data, which is 12,500.
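
The sort-and-pick-the-centre procedure can be sketched in Python as follows. The helper name `median_of` and the four-value income list are assumptions for illustration; the 10,000 and 15,000 centre points and the 1001-customer case are taken from the explanation above.

```python
def median_of(values):
    """Sort the data and return the centre point, or the average of the two centre points."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd count: a single centre point
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two centre points


# Even count: the two centre points are 10,000 and 15,000.
print(median_of([15_000, 8_000, 10_000, 40_000]))  # 12500.0

# Odd count: customers numbered 1 to 1001, the centre is the 501st.
print(median_of(range(1, 1002)))  # 501
```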

But you need to understand the context. The median income levels, you know, tell you the realistic
picture: hey, this is what your larger customer data set looks like. So, with the mean, you look for
summation.
With the median, you look for, you know, realistic values that eliminate the flaws of the outliers.
So, you know, mostly you would see that income levels or spend levels are usually
calculated or displayed as medians.

And finally, the mode. That's commonly known as the most frequent value in your entire data set.
Now that's easily calculated. It may not be central, but it tells you, hey, you know, this is the
value that occurs most often. This is the biggest component of my entire dataset.

So, taking the same telecom example, you know, if we are looking at the various plans that people
are subscribing to, you could have, say, eight or nine different plans.
The mode is something, you know, which allows you to pick out the one single plan which is the most
popular with my customers. So, that's the one with the highest count.

So, the highest frequency is what mode is all about. You don't need deep-dive statistics for it. You just
need to understand, you know, the various categories in your data and their frequencies, and the largest
frequency becomes the mode. So, that's about mean, median, mode, taking a simple telecom
example.
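
Counting frequencies and picking the largest one might look like this in Python. The plan names and subscriptions are hypothetical; `Counter` is from the standard library.

```python
from collections import Counter

# Hypothetical plan subscriptions, one entry per customer.
subscriptions = ["Plan A", "Plan C", "Plan B", "Plan C", "Plan A", "Plan C"]

counts = Counter(subscriptions)                    # frequency of each plan
mode_plan, mode_count = counts.most_common(1)[0]   # the plan with the highest frequency

print(mode_plan, mode_count)  # Plan C 3
```
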
Speaker: Mirza Rahim Baig

So far, we've been visualizing the data and trying to derive some insights from it. And we've seen how
to visualize different kinds of features as well as trends and so on. But now we will see how to use
statistics to summarize information and to derive insights from the data in a further manner.

And to begin, let's discuss the notion of central tendency through this data again. And we've talked of
mean, median, let's see how they are for this data set of ours.

Let's take some numerical columns. So, let's take the column age, or age variable. We take this variable.
And for this, let's see what the mean value is and what the median value is.

A mean is the average, median is a central value. So how do we do this? We have the data here in the
tab right next to us. So, we will simply just take an average function.

So, Excel has a handy AVERAGE function, and I go here, I select this entire column here. And
that's what gives me the mean.

So, the mean, or the average value here is 42.3. Let me increase the font size a bit. So, 42.38 is
the average. Now the median. Again, the median is also very handy in Excel. You just need to use the
MEDIAN function instead of AVERAGE.

And pretty much the same thing like we did earlier, I will select this entire data. And here it is. So, in this
case, 42.38 is the mean, and 42 is the median.
Speaker: S. Anand

One of the characteristic measures that you would put in for a column of numbers is the average or the
mean, another is the median. And a lot of people get confused about when they should be using what.

Think about this. Supposing you said, I am going to set the sales target for this particular group. I want
to know, how much sales we will make next year.

So, in that case, I have a bunch of salespeople. Let's say the first person has
brought in one crore. The second person has brought in two crores. The third person has brought in three
crores. And the fourth person has brought in three crores again.

So, one plus two plus three is six, plus three, nine crores is what these four people have brought in. So, if I
had to set a flat target for next year and assume that this target should be met by everybody, I will take
nine divided by four and say, each one of you should probably achieve nine by four, which is 2.25 crores.

And that seems like a fair target. Well, maybe people's abilities differ. So, even if that is not a flat target,
say I were to just split a pie equally. If I had to take this amount and give it to, let's say, four NGOs,
how would you split it up into roughly equal sizes? You'd say, just divide it equally across.

Now, that's what a mean does. It takes a quantity that it makes sense to aggregate and then says, this
is what it would look like if it were evenly distributed.
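
The arithmetic of that target can be checked with a few lines of Python. The four sales figures are the ones quoted above; the variable names are just illustrative.

```python
from statistics import mean

sales_in_crores = [1, 2, 3, 3]  # the four salespeople from the example

total = sum(sales_in_crores)         # what the group brought in together
flat_target = mean(sales_in_crores)  # the total spread evenly across four people

print(total, flat_target)  # 9 2.25
```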

So, if, for example, we had a room in which there were 10 middle-class IT sector employees, and one not
so middle-class IT sector employee, say Bill Gates, walked into the room, then clearly he's an outlier.

But if you pooled all of the wealth in the room and split it evenly
among the people, you would get the average.
Now the average is a great measure for looking at how things should be distributed. So, you say that,
look, it doesn't matter where things came from. Put it all in one pie and then make sure that it is chunked
out evenly.

Let's look at, for instance, when we would use median, when we would use mean, and when we would
use mode for that matter. So, let's say I have a room full of IT employees, one of whom happens to be
Bill Gates and the rest are not Bill Gates.

If I wanted to understand the typical problems of the typical IT employee, and I wanted to therefore find
the salary of such a typical IT employee, I would take the median.

I would make them all stand in a line. And on one end is the richest and on one end is the poorest from
a salary perspective. And I take the middle person and say, you seem to be reasonably typical and
representative of this group.

This allows me to ignore the outlier Bill Gates. He doesn't even come into my calculation. If there's one
guy who's really, really doing badly in the IT sector or has not been drawing a salary, then it tells me to
ignore him as well.

And I get to look at the difficult middle, which has spread across and is fairly large. So, a median gives
me a representative value. It's something that I can generalize to a reasonably large proportion of the
population.

But on the other hand, suppose I wanted to know how much tax I can collect from this room, and I took the
median guy and put in a tax saying this room should therefore give me, let's
say, 10 times 5% of the median guy's salary, if there are 10 people in the room.

That doesn't work, because I should be counting Bill Gates's salary, which would be several tens of
thousands of times all of these people's salaries. And that would be an order of magnitude more.
So, then I'd say, give me your average salary, the mean, and then I would say 5% of that is my tax and
multiply that by the total number of people.

So, when you can take a quantity and redistribute it, or each person is giving you a certain amount,
then you use the mean.

On the other hand, if you are focusing on what is representative of a dataset, give me one example that
characterizes the majority of a group, then you take the median.

If you want the typical employee, then go for median. If you want the average contribution across
employees, go for mean. So, that's roughly the split as to when you pick the mean and when you
pick the median.
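
Here is a small Python sketch of the tax example. The salary figures are invented for illustration (nine ordinary salaries and one extreme one); only the 5% rate and the room of 10 people come from the talk.

```python
from statistics import mean, median

# Hypothetical annual salaries for 10 people, in lakhs of rupees; the last is the outlier.
salaries = [8, 9, 10, 10, 11, 12, 12, 13, 15, 100_000]
tax_percent = 5
people = len(salaries)

actual_tax = tax_percent * sum(salaries) / 100                   # what the room really owes
tax_from_mean = tax_percent * mean(salaries) * people / 100      # mean scaled by headcount
tax_from_median = tax_percent * median(salaries) * people / 100  # median scaled by headcount

print(actual_tax)       # 5005.0
print(tax_from_mean)    # 5005.0
print(tax_from_median)  # 5.75
```

Scaling the mean by the headcount recovers the true total, while the median, which describes the typical employee well, badly underestimates what the room can contribute.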

As a simple rule of thumb, if somebody says they are picking the median, they're usually right. If
somebody says they're picking the mean, question them. Why am I saying that? Because the default
assumption that everybody has is go for the mean. That's what we do as a reflex action. So, it happens
without thought.

So, whenever somebody takes a mean, I look a little carefully at it to make sure that they aren't really
confusing the typical response for the mean response.

But if somebody says I'm going to take the median, then usually I don't give it that much scrutiny.
People don't often go for the median. It's a very futuristic, but works effectively in a lot of scenarios.

Mode on the other hand, you would only use for categorical data. It doesn't even make sense for
quantitative data. So, there's no confusion there. I have a column which has a certain frequency, the
most frequent item, that's what we take.

So, there's rarely any confusion about mode. Now, what we have in mean or median is one value
that summarizes the entire column. That's a pretty useful thing. You've summarized it across the entire
data set. So, it gives you a feel for the typical value or the average contribution that it can provide.

Speaker: Himanshu Manroa

Right. So, now having understood the principles of the measures of central tendency, let's
try to understand, you know, what we mean by measures of dispersion. So, often, guys, the
measures of central tendency, the mean, median, mode, are not enough to allow you to take a
call about the consistency of the data.
All they provide you with is a central view, a central lens on how the data is behaving. But they never
inform you about the extreme outliers, and never do they inform you about the fluctuations within a
single data set. And that's where the measures of dispersion come in handy.

Now, you know, it starts with the basic principle of variance. Variance,
you know, is about how different your individual data points are from the mean value of that
particular data set. Now the mean, you know, is an average of all the values.

Variance tells you how individual data points differ from the mean, how far
they are from the mean. They could be higher, they could be lower. You know, it squares all these
differences and then averages them out and tells you, hey, this is the variance within the data.
So, the higher the variance, the more you can be sure that, hey, this is highly volatile data which
can't be relied on for its consistency. Now variance as a feature, you know, is highly prevalent
when we want to look at various stocks and their performance over the past few years.

This allows us to calculate the volatility of a stock so that, you know, you can decide whether to
invest in it with a more informed view. You know, if stocks are volatile, they tend to sway wildly with
even the mildest of tremors in the market. So, take an informed decision before investing in them.
So, that plays a big role, right?

It matters when you want to look at the consistency of your data points.

It can also be applied to, say, the performances of individual batsmen within, say, an IPL
tournament. How consistent a batsman is, you know, will always weigh more than what average he
has scored in a particular tournament.

You know, you would always pay more for consistency. Hey, give me 40-50 runs, but on a
consistent basis across the matches, rather than giving me a century once in six-odd matches. You
know, that's what matters more, consistency.
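
The "square the differences, then average them" recipe can be sketched in Python like this. The two sets of scores are hypothetical, chosen so that both batsmen have the same average; the function divides by the number of values, which is the population form of variance.

```python
def variance(values):
    """Population variance: the average of squared differences from the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)


# Hypothetical runs across six matches for two batsmen.
steady = [45, 50, 40, 48, 42, 45]
streaky = [5, 10, 0, 150, 100, 5]

print(sum(steady) / len(steady), sum(streaky) / len(streaky))  # 45.0 45.0
print(round(variance(steady), 2))   # 11.33
print(round(variance(streaky), 2))  # 3416.67
```

Both average 45, so the mean alone cannot separate them; the variance is what flags the second batsman as far less consistent.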

Secondly, you know, we come to the standard deviation part of it. Now that's an offshoot of
variance. Variance takes the square of, you know, the differences and then takes a mean out
of them. Standard deviation is simply the square root of your variance. You know, which means it
brings you a value which is much more relatable when pitched against the mean,
you know, because taking the square root undoes the squaring and returns you to the original units.

So, it puts your standard deviation on the same scale as the mean, and standard deviation
figures are usually read as plus-minus, you know, which means it's a range around the mean. So, if
for an average of 80 your standard deviation is plus-minus five, what does it tell you? It tells you
that your data points typically fluctuate by about five above or below the mean.

So, it tells you, you know, how reliable your data is. A standard deviation of plus-minus five
on a mean of 80, I would say, is okay. So, which means your data is fairly consistent. So,
that's a call that you can take from these, you know, terminologies of variance and standard
deviation.
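
As a minimal sketch of "square root of the variance", assuming a made-up set of four scores chosen so that the mean is 80 and the standard deviation comes out to exactly 5:

```python
import math


def standard_deviation(values):
    """Population standard deviation: square root of the average squared difference."""
    m = sum(values) / len(values)
    variance = sum((x - m) ** 2 for x in values) / len(values)
    return math.sqrt(variance)


scores = [75, 85, 75, 85]  # hypothetical data
mean_score = sum(scores) / len(scores)
spread = standard_deviation(scores)

print(mean_score, spread)                        # 80.0 5.0
print(mean_score - spread, mean_score + spread)  # 75.0 85.0
```

The variance here is 25, in squared units, while the standard deviation of 5 is in the same units as the scores, which is why it reads naturally as "80 plus-minus 5".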

Speaker: Himanshu Manroa


Right. So, let's take some interesting examples of where you would use variance and standard
deviation in your daily, you know, management decisions that you want to take out of these
numbers.

Imagine you are a store manager for a particular retail store, and you want to look at your daily
sales on this, or daily footfalls and sales in this particular retail store.

And that's where you would come up with your average numbers, you know, based on your
daily sales over a month. Now, the average would give you one
number, you know, across the entire month, a central value.

Now you also want to look at variance to see the level of fluctuation within these daily
sales, you know, across the entire month. And when you look at the variance
numbers, if they are high, very high, it sets off an alarm within
you, right?
So, okay, my averages look pretty healthy, but the variance is high. Let's dive deeper into this data
to understand what it is that's driving this high variance, and what you can
decode once you get into this data.

You might find out, right, hey, my numbers are clearly high on certain days of the week. Say, you
know, on the weekends, Saturday and Sunday, my numbers are really high. On
weekdays, my numbers are really very, very low.

You know, that gives you important insight into your sales numbers. That gives you the
ability, you know, to tune your supply chain, your inventory management, the way you are
stocking up your store. You don't really want to stock your store to the brim across the weekdays
with certain perishable items, knowing full well that you would be, you know, encountering low sales
numbers or low footfalls during the week. And it's only for the weekend that you really want to
amp up your inventory of some of these perishable items.

So, that's the way, you know, in a very simple example, you could come up with important
business decisions.

Speaker: Harvinder Singh

Suppose that you want to study the performance of the mobile stores, retail stores in your city. Let's say
there are four stores in total. So, the most natural data you would want to study is the number of units
sold at each store.

Let us say store one sells 1200 units. Store two sells 1000 units. Store three sells 1400 units and store four
sells 1600 units. Now the mean can be calculated using the Excel function AVERAGE, selecting the data
for these stores. It comes to 1300 units.

Now, instead of squaring this data to calculate the variance, you might decide to add up the differences
with the mean, of course, without squaring them. This can be done by subtracting the mean value from
the unit sold for each store.

To arrive at the newly defined variance, these values can be summed up. Now a variance of zero
means that there is not any variation in the number of units sold in each store, which is not true as can
be seen from the table.

You can see from the table that a difference of negative 100 in store one is cancelling out the
difference of positive 100 for store three.

Similarly, negative 300 for store two is cancelled by the positive 300 for store four. This is how a
variance of zero leads to the misleading conclusion that there is no variation in the data.

However, if you squared these differences and then sum them up to calculate the variance, you would
arrive at a nonzero positive result for the mobile store.
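
The cancellation, and how squaring avoids it, can be reproduced with the four store figures quoted above. This is a sketch in Python rather than the Excel sheet used in the video, and the final step divides by the number of stores (the population form), which is an assumption about the convention.

```python
units_sold = [1200, 1000, 1400, 1600]  # stores one to four

mean_units = sum(units_sold) / len(units_sold)
differences = [u - mean_units for u in units_sold]

sum_of_differences = sum(differences)              # positives and negatives cancel
sum_of_squares = sum(d ** 2 for d in differences)  # squaring removes the signs
variance = sum_of_squares / len(units_sold)        # population form: divide by n

print(mean_units)          # 1300.0
print(differences)         # [-100.0, -300.0, 100.0, 300.0]
print(sum_of_differences)  # 0.0
print(sum_of_squares)      # 200000.0
print(variance)            # 50000.0
```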

This shows that there is finite variation within the sample. However, even with this metric, there is
scope for improvement. Can you think of it?

Although squaring ensures that all the variations are captured in the metric, it also increases the
magnitude of the variation. It's certainly good to be cautious, but reporting variance for your sample can
be massively misleading because of this mismatch of magnitude.

The unit of variance is also different from that of the original data. So, the variance doesn't give the
right sense of variation about the mean.

Also, if a deviation is below one, squaring it will decrease the value further. So, in order to bring this
magnitude and its unit into coherence with the variable of your sample data, standard deviation is used.
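
To make the mismatch of magnitude concrete, here is a short Python sketch on the same four store figures, using the population functions from the standard `statistics` module (the choice of population rather than sample formulas is an assumption).

```python
from statistics import pstdev, pvariance

units_sold = [1200, 1000, 1400, 1600]  # stores one to four

variance = pvariance(units_sold)  # measured in squared units
std_dev = pstdev(units_sold)      # measured in units, like the data itself

print(float(variance))    # 50000.0
print(round(std_dev, 1))  # 223.6
```

A variance of 50,000 looks enormous next to sales of 1,000 to 1,600 units, whereas a standard deviation of about 224 units is directly comparable to the mean of 1,300.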

Speaker: Mirza Rahim Baig

So, in this case, we need to identify the partner, which we expect to deliver faster than the other. And
that needs to be within 10 days as well. So, how do we go about assessing this?

We have historical values. We have about 50 orders, past orders, and the delivery times for them, for
each partner. So, partner one, partner two, and we have, you know, how much time they took earlier.

And one way to assess this very simply would be for partner one and partner two, I can simply get the
average values. So, I can say, on average, this is the time they've taken.

And for partner one, applying the AVERAGE formula to all the values for partner one, it
comes out to 7.4. Likewise, I can calculate the average for partner two. And partner two on average
delivers in 7.28 days, whereas partner one delivers in 7.4 days.

So, the winner is clear, right? You'd go with partner two in this case. Well, not really. You don't have the
complete picture yet.

So, what you also need to have along with the mean here is how far the regular point deviates
from the mean. We haven't considered the spread. And to see the value of spread, let's
calculate a measure of spread, which is the standard deviation, for this one as well as this one.

So, let's get the standard deviation here and we'll see how we can use it. So, the standard deviation
is STDEV.S, since it's a sample here. We calculate the standard deviation. For the first one, the standard
deviation is 1.829.

For the second one, again, STDEV.S is a function in Excel, and we choose the range we want to
give. And we have a standard deviation of 3.5.

For the second one, the standard deviation is almost twice the standard deviation of the first. Now how
do we use this? Standard deviation, in some way, tells us how far the regular point is from the
average.

It also tells us the kind of variation from the mean that we can normally expect from a data
point in the dataset. What it means is, 7.3 here, let's say, plus 3.5 exceeds 10, right?
What we're saying here is that we can reasonably expect a regular point to be as high as around
10.7. And that's probably not the best for us.

Because if we are making the order right now, we not only want the average value to be low. We also
want that, you know, we can expect this order to arrive within 10 days with certain confidence. We need
to be able to do that.

Now for the first case, if I take the average and add the standard deviation, you know, average plus
standard deviation, this comes out to 9.2 something. This is within 10.

So, with the mean and a standard deviation considered, the first one now, it seems that, you know, it's
much more likely that I will get my order within 10 days. And as for the other, I can see that, you know,
the general point can exceed 10 very easily.
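
The comparison boils down to "mean plus one standard deviation versus the deadline". A minimal Python sketch, using the rounded means and standard deviations quoted in the talk (so the sums can differ slightly from the spreadsheet values) and a hypothetical helper name:

```python
def typical_upper_bound(mean_days, std_days):
    """Mean plus one standard deviation: a rough 'regular but slow' delivery time."""
    return mean_days + std_days


deadline_days = 10
partners = {
    "partner one": (7.4, 1.829),  # (mean, standard deviation) as quoted
    "partner two": (7.28, 3.5),
}

for name, (mean_days, std_days) in partners.items():
    upper = typical_upper_bound(mean_days, std_days)
    print(name, round(upper, 2), upper <= deadline_days)
# partner one 9.23 True
# partner two 10.78 False
```

This is only a rule-of-thumb check, not a probability statement; it simply shows why the partner with the slightly higher average can still be the safer choice.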

To confirm this: we said that partner one has lower spread around the mean, right? Values are tighter
around it. Partner two has more spread. And that means that there will be plenty of values that are
much higher than the mean, and so on.

Just to confirm that, we also have made the histogram here in a separate sheet. And we have partner
one, we have partner two, and we can very clearly see that.

For partner one, the values are largely between 4 and 10. For partner two, the spread is more; the
values go from two all the way up to 12, right? And anything above 10 is problematic for you.

So, for partner two, there is a good chance that the values will exceed 10. For partner one, the chances
of that happening are lower. The spread is lower.

And this notion that the spread around the mean is lower, has been quantified by the standard
deviation and the standard deviation therefore is lower in the first partner. And this is how you can use
standard deviation along with the mean to take such decisions.
Speaker: Mirza Rahim Baig

Now to understand practically how standard deviation works and what are the drawbacks, let's take the
same example we had earlier. This time, the data is slightly different.

So, you see the values here. This is data for partner one, the same 50 values. But at a couple of places,
what had happened was maybe there was some political event because of which delivery got delayed
significantly for just two orders, not for the others, but only for these two orders.

Now let's calculate the spread again, as we did earlier, using standard deviation.

This time again, using the same formula, STDEV.S, I'm going to calculate this. The standard
deviation comes out to be 4.108, roughly 4.1.

Now if you remember, in the previous data, without these two high values, the standard
deviation was 1.829. So, what happened here?

What has happened is that these two values, just these two values have really skewed the metric
significantly. Standard deviation is very sensitive to these two values.

And the reason again is that these deviations, 27 minus the mean, which was around 7, these
deviations are large and when they were squared, they had a very high impact on the overall standard
deviation.
Standard deviation as a metric is very sensitive to these extreme values. And therefore, we also need
to think of other metrics which are more robust.
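
The same effect can be reproduced on a small made-up dataset. The ten delivery times below are hypothetical, not the 50 orders from the video; `stdev` is the sample standard deviation from Python's `statistics` module, which uses the same n minus 1 convention as Excel's STDEV.S.

```python
from statistics import stdev

# Hypothetical delivery times in days.
regular = [6, 7, 8, 7, 6, 8, 7, 9, 5, 7]

# Same orders, but the last two were badly delayed.
delayed = regular[:-2] + [27, 30]

print(round(stdev(regular), 2))  # 1.15
print(round(stdev(delayed), 2))  # 9.03
```

Only two of the ten values changed, yet the standard deviation grew several times over, because the large deviations are squared before being averaged.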

Speaker: S. Anand

There are various measures of spread. The most impractical measure that I've seen is standard
deviation. It tells you something that you have no idea how to explain to a layman.

So, what do you do with standard deviation? Use it, feel proud about it, but otherwise, try not to
showcase it anywhere there's any business discussion happening.

Far more useful measures are the interquartile values, Q1 and Q3. Basically, you say, I'm going to take all of
the data and ask, what is the value of the midpoint? That's the median. What is the value of the 75th
percentile or the 25th percentile? Those mark off the top quarter and the bottom quarter. Those values are far more
useful indicators.

So, let us say that you make people stand in height order. What is the height of the middle person?
That's the median.

Now take the right half. What is the height of the middle person in this? He's a little bit taller. What is the
height of the middle person on the left half? He's a little bit shorter.

Now the difference between these heights, that gives me a sense of the spread. Now suppose I do this
for two different sections in class eight.
8A has a huge variation in spread. Class 8B has a much smaller variation in spread. So, that's useful.
So, I know that generally the height difference in class 8A is a lot more.

So, I'll probably find my basketball players, some of the taller ones in 8A. But 8B has a pretty decent
average balance. So, I'm not likely to find so much variation there. So, I could get random people to
play basketball in 8B and they won't do so bad.

So, this interquartile distance is the difference between the right guy and the left guy's height so to
speak. The values themselves are useful. The difference in the values is also useful.
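
The "middle person of each half" procedure can be sketched in Python as below. The heights are invented, and the helper name is hypothetical. Note that several conventions exist for computing quartiles; this sketch follows the halves method described above and assumes at least two values.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, leave the middle person out of both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


# Hypothetical heights in cm for two sections.
heights_8a = [130, 135, 140, 150, 155, 165, 170, 175]
heights_8b = [148, 150, 151, 152, 153, 154, 155, 157]

q1_a, med_a, q3_a = quartiles_by_halves(heights_8a)
q1_b, med_b, q3_b = quartiles_by_halves(heights_8b)

print(q1_a, med_a, q3_a)         # 137.5 152.5 167.5
print(q3_a - q1_a, q3_b - q1_b)  # 30.0 4.0
```

Both sections have the same median here, but the interquartile distance shows that 8A is far more spread out than 8B.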

Sometimes instead of taking the quartiles, people take the 90th percentile and the 10th percentile.
That's fine as well. It gives you a sense of the wider spread. It tells you where the middle 80% of the data lies.

And these are robust measures. What I mean by robust measures is, some measures are skewed
heavily by outliers.

Take Bill Gates again. If there was a room and you computed the standard deviation of the salary of the
room, the presence of Bill Gates will just completely rip apart the standard deviation. It'll be huge value.

On the other hand, if you took the interquartile distance, Bill Gates won't even come within the quartiles.
Even if you took the 90th percentile, Bill Gates wouldn't come within the 90th percentile. So, you get
what is representative of the middle 80% or 50% of the entire dataset. That's pretty useful.
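
A quick way to see what "robust" means is to add one extreme value and watch which measure moves. This Python sketch uses invented salaries and the standard library's `quantiles` function with its default method; the exact quartile values depend on that convention.

```python
from statistics import quantiles, stdev


def iqr(values):
    """Interquartile distance: 75th percentile minus 25th percentile."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


# Hypothetical salaries, then the same room with one extreme earner added.
room = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
room_with_outlier = room + [1_000_000]

print(round(stdev(room), 1), round(stdev(room_with_outlier), 1))
print(iqr(room), iqr(room_with_outlier))
```

The standard deviation jumps by orders of magnitude, while the interquartile distance barely changes.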

In fact, there is a plot that summarizes all of these statistics quite well, which is the box plot. So, on the
box plot, what we have is a variable whose median is shown, and that is highlighted by a thick line
here.

The top quartile and the bottom quartile are shown as well as the largest point and least point. So, I get
a sense of the full variation of this data.
So, if I compare these two variables, then I find that the median for the variable on the right is lower
than on the left. The variation on the right is also lower than on the left.

50% of the data is squeezed into a much smaller volume. Almost a hundred percent of the data is also
squeezing into a much smaller volume. So, the values are lower, the spread is also lower. It gives me a
reasonably quick sense of the distribution of the data.
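
The numbers a box plot draws can be computed directly. The sketch below builds the five-number summary (least point, bottom quartile, median, top quartile, largest point) for two hypothetical variables; the data and the function name are assumptions, and the `method="inclusive"` option of `statistics.quantiles` is one of several quartile conventions.

```python
from statistics import quantiles


def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum: the values a box plot displays."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


# Hypothetical variables standing in for the left and right boxes.
left = [20, 30, 40, 50, 60, 70, 80, 90, 100]
right = [10, 15, 20, 25, 30, 35, 40, 45, 50]

left_summary = five_number_summary(left)
right_summary = five_number_summary(right)

print(left_summary)
print(right_summary)
```

With this data the variable on the right has both a lower median and a narrower box, which matches the kind of comparison described above.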

Speaker: Himanshu Manroa

Now, let's try to understand, again, the applications or business implementation possibilities of
these interquartile ranges that we have spoken about. So, we have spoken about median in an
earlier context, right, which simply sorts out the data and gives you the central point within
that data.

So, if you were to break your data, say, into quartiles, as we would say, you know,
four equal parts, our median would sit right at the centre. You know, there
would be two parts on the right of it, two parts on the left of it. So, that's what median does.
You know, it gives you a central point.

Let's take an example of a telecom company, which says, you know, okay, so this is the
median household income level, one single point for my entire consumer base. But now you
want to attach much more consistency to this particular statement.

You want to segment your customers. You don't just want to look at their median
income level, which is one single point. You want to look at your larger consumer base and
the range of their income levels, you know, the upper limit and the
lower limit of your larger consumer base, and what all you can pitch to them in terms of your
offerings, in terms of your technical offerings.
Mean, median, and mode can only give you a single point of reference, while the interquartile
range gives you a range of reference.

So, that's where the concept of inter-quartile ranges comes into place. So, while median is a
central point, you know, the 50% mark of your data, you know, the inter-quartile range takes the
difference between the 25th percentile and the 75th percentile.

So, which means, you know, a quarter of the data on the right and a quarter of the data on the left of the
median, you know, that's what you want to consider as a larger range between 75th and 25th.
Because you know, this assumes that you are considering, you know, a large chunk of
your data. You know, and taking off the extremes, which is from 0 to 25 and 75 to 100. But
this large chunk of central data, you know, about half of your consumer base, lies within this, you
know, these 25th and 75th percentiles.
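
The 25th to 75th percentile range described here can be sketched in a few lines of Python. This is only an illustration: the income figures are made up, and the "inclusive" quantile method is one of several conventions, so a spreadsheet may give slightly different cut points.

```python
import statistics

# Hypothetical monthly household incomes in INR (made-up data for illustration).
incomes = [9000, 12000, 14000, 15000, 15500, 16000, 18000, 21000, 24000, 60000]

# quantiles(..., n=4) returns the three cut points: Q1, Q2 (the median) and Q3.
q1, q2, q3 = statistics.quantiles(incomes, n=4, method="inclusive")

# The inter-quartile range is the width of the middle 50% of the data.
iqr = q3 - q1

print(f"Median income: {q2}")
print(f"Middle 50% of customers: {q1} to {q3} (IQR = {iqr})")
```

Note that the one very large income (60000) does not move the 25th to 75th percentile range, which is why this range describes the larger consumer base better than the extremes do.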

So, that gives you a range. You know, for this telecom company, you know, instead of going
ahead with a single figure of say, you know, 15,000 INR is my, you know, household income
level or something, you know, you have a range, you know, which tells you that say from
14,000 to 21,000, you know, income levels per month is what my consumer base, my large
consumer base is all about.
Now, this tells you, you know, the larger income profile of your customer as a segmentation
basis, and what all can you pitch to them. So, a bigger range allows you a bigger headroom to
play with and devise your marketing strategies, your positioning strategies and your similar
product offerings that can go out with it.
Speaker: Mirza Rahim Baig

So, in this case, 42.38 is a mean, and 42 is the median. And the mean, and median seem to be
very, very close. That's a pretty good sign.

And what it also means is that the data points are somewhere closer to the centre, meaning that
we don't seem to have, you know, very lopsided or very skewed data set where you have very
high values and so on.

So, how do you verify that? To verify this, or to have an insightful visual on this, let's create the
box plot for the age column, just to see what it looks like in this case.

So, I go to the data here and I need to create a box plot. So, let me select this entire column and
I need to go to insert, I need to go to these charts here.

So, these charts here are under this category of miscellaneous in a way where you have multiple
kinds of charts. And within this, you need to look at stock charts.

So, most stock charts are what we are looking at. And if I click on this, you see towards the end,
I have a box and whisker. So, this is the plot we are interested in. If I click on box and whisker,
and if I click on okay, you see I have this plot here.

So, let me just move this to a stat sheet. And this is the box plot for age. And as you can see very
clearly here, okay, what do you see here first? So, what are the different values we have over here
and what is the information in the box plot.

We have here, as a cross, if it is visible clearly, the cross here represents the mean value. The
horizontal line in the middle of this box, this blue box. The horizontal line here represents the
median value. So, median value here is 42.

So, the median 42 and mean of 42.38 are both represented here. Here is the upper quartile. This
is the third quartile, which is the 75th percentile. And here is the 25th percentile. So, 25th percentile
is 36 and the 75th percentile is 50.

So, you see that around the median value, the values are somewhat in a way symmetrical, the
box on the left and the box on the right or let's say the box above and below, the median have
similar sizes. And the whiskers, they extend. They extend to very similar extents again.

And you have these couple of points, which have been marked as outliers here, which is just
beyond the expected range over here, which is 1.5 times interquartile range. But it's not like
they're very far away. So, they wouldn't really be outliers in the true sense.
So, this box plot for age just confirms an understanding of the distribution of the data and also
gives a nice visual summary of a field like this. So, this is how you make a box plot.
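
The 1.5 times interquartile range rule mentioned for the age box plot can be worked out by hand from the two quartiles read off the chart (36 and 50). The sketch below assumes the common convention of measuring the 1.5 times IQR distance from the quartiles themselves; the helper name `is_flagged` is made up for this illustration.

```python
# Quartiles read off the age box plot in the walkthrough.
q1, q3 = 36, 50

iqr = q3 - q1                 # interquartile range
lower_fence = q1 - 1.5 * iqr  # points below this are flagged as outliers
upper_fence = q3 + 1.5 * iqr  # points above this are flagged as outliers


def is_flagged(age):
    """Return True if the value falls outside the 1.5 * IQR fences."""
    return age < lower_fence or age > upper_fence


print(iqr, lower_fence, upper_fence)  # 14 15.0 71.0
```

Under this assumption, an age just above 71 would be drawn as an outlier point even though it is not far from the rest of the data, which matches the remark that these points are not outliers in the true sense.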

Let's now repeat this exercise for another variable, which is balance. We have balance as well,
and let's see if things are different and let's see if the box plot helps us identify that, or even
these statistics here.

So, let's create the mean and the median value for balance. Let me take an average here. I go to
this sheet here; you see balance is column G. So, I take this entire column pretty much, and let
me just copy this formula here and replace average by median. And we have this.

And if I copy the format as well, you see that mean is 1354, whereas median is 446. Now this is a
big difference. The mean is almost three times the value of the median.

See the central value is 446. The average value is far higher. This should make us
suspect that there are some very high values and this is completely skewed.
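
To see why a few very high values pull the mean far above the median, here is a small sketch. The balances are made up for illustration and are not the data set used in the video.

```python
import statistics

# Hypothetical account balances (made up): mostly small, a few very large.
balances = [0, 120, 250, 400, 450, 500, 700, 900, 15000, 60000]

mean_balance = statistics.mean(balances)      # pulled up by the large values
median_balance = statistics.median(balances)  # stays with the bulk of the data

print(f"Mean:   {mean_balance}")
print(f"Median: {median_balance}")
print(f"Mean is {mean_balance / median_balance:.1f} times the median")
```

A large gap like this between the two numbers is a quick hint of skew, before any chart is drawn.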

How do we confirm that? Let's again make a box plot for balance and the way to do that is I go
here again, I select this entire data, insert these stock charts and then a box and whisker. And I
get this box plot. I'll copy this to a sheet. And this would be the box plot for balance.

Let me get it in a similar kind of formatting. So, this is the box plot for balance, and there are
some things which you can see upfront.

First of all, you can see that the box is pretty much a line right now. There is no well-spread
box, first of all. And the box here is very, very close here. It's very close to zero on this scale.

What it's really showing me is that, you know, while the majority of the values are here somewhere,
right, say below 1416, this is where most of the values are, there are many, many individuals with
very high values.

You see median being 446, but highest value is somewhere around a hundred thousand. So, this
is very evident that there is a lot of skew in the data, and you have many individuals with very
high values.

Speaker: Siddhartha Roy

The next thing we are going to talk about is the mean, variance and standard deviation of a
discrete random variable.

Now let's assume, or rather let's consider a random variable, slightly different this time. I mean,
we denote it by X, but the example we're going to take is slightly different.

We're going to say that X is the number of chocolates, which Rohan eats in a day. And let's say
that, you know, there are only three chocolates which are available every day.
So, Rohan can technically eat zero chocolates, one chocolate, two chocolates, three chocolates,
nothing more than that. And let's also assume that the chocolates can't be broken or anything. So,
these are all whole numbers.

Zero, one, two, three; these are the four values which, you know, this particular random variable
can take. So, what I'm going to do here is, I'm going to list it down here. So, this is another way
in which you write down probabilities.

Here are the outcomes, and I'm going to write P of X, which is the probability that X assumes
these values. So, this is 0.3, this is 0.1, this is half, this is 0.1.

Now just notice this, that the addition of this has to be equal to one, because a net probability
can't be greater than or less than one. It has to be exactly equal to one.
Now 0.3 plus 0.1, 0.4, 0.4 plus 0.5, 0.9, 0.9 plus 0.1, 1. So, this is our random variable and these
are the probabilities with which the random variable takes these values.

So, three concepts, first is mean, okay. Let me just write it down, mean. Mean means, it is also
sometimes called as average. But generally, in statistics, this is how you denote it, Mu of X.

Now the definition of Mu of X is this. So, it's sum product of your outcomes and probabilities. So,
it is zero into 0.3, plus one into 0.1, plus two into 0.5, plus 3 into 0.1. Fair.

So, basically the sum product of this. Now, if you just multiply this or rather, let me just break it
down, it's zero into 0.3 is 0, 1 into 0.1 is 0.1, 2 into 0.5 is 1, 3 into 0.1 is 0.3. We add this, this is 1.4.
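
The sum product for the mean can be checked with a short sketch. The variable names (`pmf`, `mean`) are chosen just for this illustration; the outcomes and probabilities are the ones from the chocolate example.

```python
# Probability table from the chocolate example: outcome -> P(X = outcome).
pmf = {0: 0.3, 1: 0.1, 2: 0.5, 3: 0.1}

# The probabilities must add up to one (allowing for floating-point rounding).
assert abs(sum(pmf.values()) - 1.0) < 1e-9

# Mean = sum of (outcome * probability) over all outcomes.
mean = sum(x * p for x, p in pmf.items())

print(round(mean, 2))  # 1.4
```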

So, what we are saying is the mean outcome or the mean number of chocolates which Rohan is
going to eat is going to be 1.4. Now you would ask me that, you know, at the start, you said
these chocolates can't be broken. So, how is it that he is eating 1.4 chocolates?

Now, the thing is, when you say mean, it doesn't mean he is actually going to eat 1.4 chocolates
on a particular day. What it means is if you're going to repeat this experiment N number of times,
on an average, one day he might eat zero, one day he might eat one, one day he might eat three.

But given the probabilities on an average, he eats 1.4. That's what mean essentially means here.
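
The idea of repeating the experiment many times can be tried out with a small simulation. This is only a sketch: the number of days and the random seed are arbitrary choices, and the exact average will vary a little with them.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

outcomes = [0, 1, 2, 3]
probabilities = [0.3, 0.1, 0.5, 0.1]

# Simulate many days; on each day Rohan eats a whole number of chocolates.
days = 100_000
eaten = random.choices(outcomes, weights=probabilities, k=days)

# The long-run average settles near 1.4, although no single day equals 1.4.
long_run_average = sum(eaten) / days
print(long_run_average)
```
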
The next thing we will look at is called as the variance, VARIANCE, also sometimes just denoted
as V of X, variance of X.

What variance of X essentially means is, or rather what it denotes is, how your different
outcomes are spread around the mean value. What is the difference or distance
between the different outcomes and the mean value?

So mathematically, the way it's denoted is, just think about it logically. So, this is your mean, this is
your outcome. So, we're just going to say, zero minus 1.4. See, distance, right, difference from
mean, whole square into probability of 0, which is 0.3, plus one minus 1.4 whole square into 0.1.

The second outcome is one, the mean is 1.4. I am whole squaring this and multiplying it by this
probability, which is 0.1.

Just logically try to think, you're saying variance, try and understand if the mathematical formula
is actually denoting variance. Is it actually the distance of the different values from the mean?

Don't just mug these values up, try and understand this. You know, you'll always remember it. So,
two minus 1.4, whole squaring it, into, what was the probability, half, or 0.5, plus three minus
1.4. Again, squaring it. And what is the probability, 0.1.
So, let's write down the actual numbers. We're saying zero minus 1.4 is minus 1.4, and 1.4 whole
square is 1.96, into 0.3. Let me put it in parenthesis. Again, one minus 1.4 is minus 0.4. 0.4 whole
square is 0.16, into 0.1, plus two minus 1.4 is 0.6. 0.6 whole square is 0.36, into 0.5.

And three minus 1.4 is 1.6. And 1.6 whole square is 2.56. 2.56 into 0.1. So, if you just do this
using a calculator, the variance you will get, or the result of this you get is 1.04. 1.04 is essentially
your variance.

The next thing we are going to talk about is standard deviation. So, standard deviation, it's
generally written as STD, STD standard deviation. It's denoted by Sigma X, because we are
talking about X here.

So, Sigma X is nothing but square root of variance. So, I will not write variance, let me just write
V of X, we denote it, which was 1.04. If you just do the square root of 1.04, you will get it as
1.02.
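
The variance and standard deviation worked out above can be verified with a short sketch. As before, the names `pmf`, `variance` and `std_dev` are just labels picked for this illustration.

```python
import math

# Probability table from the chocolate example: outcome -> P(X = outcome).
pmf = {0: 0.3, 1: 0.1, 2: 0.5, 3: 0.1}

# Mean: sum of outcome * probability.
mean = sum(x * p for x, p in pmf.items())

# Variance: sum of (outcome - mean) squared * probability.
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(round(variance, 2), round(std_dev, 2))  # 1.04 1.02
```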

So, this is your standard deviation. So, three concepts, Mu, mean, variance V, standard deviation,
Sigma X. Again, three concepts which will keep coming on in this course, as we move forward.
And in general, across probability and statistics.

Disclaimer: All content and material on the upGrad website is copyrighted, either belonging to
upGrad or its bonafide contributors and is purely for the dissemination of education. You are
permitted to access, print and download extracts from this site purely for your own education
only and on the following basis:
● You can download this document from the website for self-use only.
● Any copies of this document, in part or full, saved to disk or to any other storage
medium, may only be used for subsequent, self-viewing purposes or to print an
individual extract or copy for non-commercial personal use only.
● Any further dissemination, distribution, reproduction, copying of the content of the
document herein or the uploading thereof on other websites, or use of the content for
any other commercial/unauthorised purposes in any way which could infringe the
intellectual property rights of upGrad or its contributors, is strictly prohibited.
● No graphics, images or photographs from any accompanying text in this document
will be used separately for unauthorised purposes.
● No material in this document will be modified, adapted or altered in any way.
● No part of this document or upGrad content may be reproduced or stored in any
other website or included in any public or private electronic retrieval system or
service without upGrad’s prior written permission.
● Any right not expressly granted in these terms is reserved.
