DSBDL 3

1 Introduction
In this session, we discuss the most common methods for obtaining summary
statistics of a data set: count, min, max, mean, median, mode, and standard
deviation. The basic meanings of some of these are:

1. Mean – Average value of given values

2. Median – Middle value

3. Mode – Most repeated value

4. Standard Deviation – Subtract the mean from each value and square the
result, add up the squares and divide by the number of values, then take
the square root

To start the practical, open JupyterLab and launch a Jupyter notebook.

Import pandas, read the CSV file “car_sales.csv” into a data frame, and display
the data frame as shown in figure 1.

Figure 1: Reading the csv file
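Since the figure itself is not reproduced here, the following is a minimal sketch of the reading step. The column order and all values except 2884, 2007, 12090 and the make Volvo are invented for illustration, and an in-memory string stands in for the real file so that the cell runs anywhere.

```python
import io
import pandas as pd

# Invented stand-in for the contents of car_sales.csv (illustrative values only).
csv_text = """Make,Quantity,Pct,Year,Price
Acura,2884,0.8,2007,12090
BMW,1500,0.4,2008,25500
Volvo,3200,0.9,2009,18750
"""

# With the real file on disk this would be: car_sales = pd.read_csv("car_sales.csv")
car_sales = pd.read_csv(io.StringIO(csv_text))

# In a notebook, ending a cell with the frame name displays it as a table.
print(car_sales)
```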


2 Count

To find the number of records in the data set, the count() function can be
used. It is called on the data frame.

2.1  Getting the count of a particular column


The number of records in a particular column can be printed by specifying the data
frame and the column name, followed by the count function, as shown in figure 2.
Here, the count of the records in the Quantity column is printed.

Figure 2: Getting the count of records of a particular column


Please note that the count function does not take null values into account. To
demonstrate this, delete a value in the Quantity column (here, the first cell of
the column is deleted) and re-import the file. Then execute the code again.
As shown in figure 3, the count is now 9, because the first cell is a null value.

Figure 3: Getting the count of a column that has a null cell
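The effect of a null cell on count() can be sketched as follows. The ten Quantity values are hypothetical (only 2884 comes from the text), and the deleted cell is simulated with None rather than by editing and re-importing the file.

```python
import pandas as pd

# Hypothetical Quantity values: ten rows as in the text; only 2884 is taken from it.
full = pd.DataFrame({"Quantity": [2884, 1500, 3200, 2100, 1750,
                                  2600, 1900, 2300, 2750, 3050]})
count_full = full["Quantity"].count()

# The same column with the first cell empty; pandas stores the gap as NaN.
missing = pd.DataFrame({"Quantity": [None, 1500, 3200, 2100, 1750,
                                     2600, 1900, 2300, 2750, 3050]})
count_missing = missing["Quantity"].count()

print(count_full, count_missing)
# len() still counts every row, including the one with the null cell.
print(len(missing))
```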

2.2  Getting the count of all columns

Please note that in the previous section we specified the column name. If we
do not specify it, a column-wise record count is obtained. In figure 4, the
count function is called directly on the data frame.

Figure 4: Getting the count of all columns
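A minimal sketch of the column-wise count, using a hypothetical five-row frame in which one Quantity cell is empty (the makes other than Volvo and most values are invented):

```python
import pandas as pd

# Hypothetical five-row stand-in for the data set; one Quantity cell is left empty.
car_sales = pd.DataFrame({
    "Make": ["Acura", "BMW", "Ford", "Toyota", "Volvo"],
    "Quantity": [None, 1500, 3200, 2100, 1750],
    "Price": [12090, 25500, 18750, 12090, 31000],
})

# Called on the frame, count() returns one non-null count per column.
counts = car_sales.count()
print(counts)
```

The result is a Series indexed by column name, so a single count can be read with, for example, counts["Quantity"].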

3 Getting the maximum or minimum

3.1  Maximum

The max() function can be used to find out the maximum value in a column.

1. First, let’s find the maximum value in the Quantity column. (Please
change the first cell value back to 2884, as it was set to null in the
previous section.) Specify the data frame and then the column name,
followed by the max function, as shown in figure 5.

2. To find the column-wise maximum, remove the column name from the
code and execute it. As you can see in figure 5, it gives the maximum
value of each column. The maximum of the Make column is Volvo because
strings are compared alphabetically, and V is the highest initial letter
in that column. Refer to figure 6 to observe the data set.

Figure 5: Use of the max function
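Both steps can be sketched on a small hypothetical frame (values invented apart from 2884, 12090 and the make Volvo):

```python
import pandas as pd

# Hypothetical stand-in for the car_sales data frame.
car_sales = pd.DataFrame({
    "Make": ["Acura", "BMW", "Ford", "Toyota", "Volvo"],
    "Quantity": [2884, 1500, 3200, 2100, 1750],
    "Price": [12090, 25500, 18750, 12090, 31000],
})

# Maximum of a single column.
max_quantity = car_sales["Quantity"].max()

# Column-wise maximum; the string column is compared alphabetically.
column_max = car_sales.max()

print(max_quantity)
print(column_max)
```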
3.2 Minimum

The minimum can be obtained in the same way as the maximum, using the min()
function instead. Figure 7 shows the minimum of all the columns. Refer to
figure 6 to verify that the printed values are correct.

Figure 6

Figure 7: Getting the minimum
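The same pattern with min(), again as a sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical stand-in for the car_sales data frame.
car_sales = pd.DataFrame({
    "Make": ["Acura", "BMW", "Ford", "Toyota", "Volvo"],
    "Quantity": [2884, 1500, 3200, 2100, 1750],
    "Price": [12090, 25500, 18750, 12090, 31000],
})

# Minimum of one column, then of every column.
min_quantity = car_sales["Quantity"].min()
column_min = car_sales.min()

print(min_quantity)
print(column_min)
```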

4 Mean

The mean is the average value of a given set of values. It can be calculated
using the mean() function. Like the functions discussed previously, this
function can be used to get the mean of a particular column or of all the columns.

1. Assume that we need to calculate the mean of the Quantity column.
First, specify the data frame (car_sales), then the column name
(Quantity). Then use the mean function as shown in figure 8.

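Figure 8: Getting the mean of the Quantity column

To get the column-wise mean, remove the column name from the above code.
Then execute it as shown in figure 9. Observe that the mean of the Make column is
not shown, because pandas detects that the column contains strings. This
describes older pandas releases, which skip string columns automatically; from
pandas 2.0 onward the call raises a TypeError for such a frame unless
numeric_only=True is passed to mean().

Figure 9: Getting the mean of the columns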
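A minimal sketch of both calls on hypothetical data. Passing numeric_only=True is an addition not shown in the original figures; it keeps the column-wise call working on pandas 2.0 and later, where string columns are no longer skipped automatically.

```python
import pandas as pd

# Hypothetical stand-in for the car_sales data frame.
car_sales = pd.DataFrame({
    "Make": ["Acura", "BMW", "Ford", "Toyota", "Volvo"],
    "Quantity": [2884, 1500, 3200, 2100, 1750],
    "Price": [12090, 25500, 18750, 12090, 31000],
})

# Mean of a single column.
mean_quantity = car_sales["Quantity"].mean()

# Column-wise mean, restricted to numeric columns so that Make is left out.
column_mean = car_sales.mean(numeric_only=True)

print(mean_quantity)
print(column_mean)
```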

5 Median

The median is the middle value of a given data set. It can be calculated
using the median() function: specify the data frame and then call the median
function. As in the sections above, this function can be used to find the
median of a particular column or of all the columns (figure 10).

Figure 10: Getting the median of all columns
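The median calls can be sketched the same way. The data is hypothetical, and numeric_only=True is again an assumption added so that the string column does not cause an error on recent pandas versions.

```python
import pandas as pd

# Hypothetical stand-in for the car_sales data frame.
car_sales = pd.DataFrame({
    "Make": ["Acura", "BMW", "Ford", "Toyota", "Volvo"],
    "Quantity": [2884, 1500, 3200, 2100, 1750],
    "Price": [12090, 25500, 18750, 12090, 31000],
})

# Median of one column: the middle of the sorted values.
median_quantity = car_sales["Quantity"].median()

# Column-wise median over the numeric columns.
column_median = car_sales.median(numeric_only=True)

print(median_quantity)
print(column_median)
```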

6 Mode

The mode is the most repeated value of a given data set. The mode can be obtained
using the mode() function. This can be used for a particular column.

To make the mode easy to see, let’s first change multiple cell values to 2884 as
shown in figure 11.

Figure 11: Changing multiple cell values to 2884

Then let’s find the mode of the Quantity column. First specify the data frame,
then the column, and finally the mode() function, as shown in figure 12, and
execute it. As you can see, the mode is shown as 2884.

Figure 12: Getting the mode of the Quantity column
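A sketch of the single-column case, with several hypothetical Quantity cells set to 2884 as described above. Note that mode() returns a Series rather than a single number, because a column can have more than one mode.

```python
import pandas as pd

# Hypothetical Quantity column in which 2884 has been entered several times.
car_sales = pd.DataFrame({"Quantity": [2884, 1500, 2884, 2100, 2884]})

# mode() returns a Series holding every most-frequent value.
quantity_mode = car_sales["Quantity"].mode()
print(quantity_mode)

# When a single value is expected, take the first entry.
print(quantity_mode.iloc[0])
```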
For demonstration purposes, let’s now apply the mode() function to all the
columns, as shown in figure 13, and execute it. For the Year column, the mode is
2007; there is no other mode, so the remaining cells show NaN. For both the Pct
and the Quantity column, there are no repeated values, so all the values are
shown. The Price column mode is 12090, so it is shown in the first cell and the
other cells in that column are NaN.

Figure 13: Trying to find the mode of all the columns
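The NaN padding described above can be reproduced with a small hypothetical frame (only 2007 and 12090 come from the text). The result has as many rows as the column with the most modes, and shorter columns are padded with NaN.

```python
import pandas as pd

# Hypothetical data: Year and Price each have one repeated value, Pct has none.
car_sales = pd.DataFrame({
    "Year": [2007, 2007, 2008, 2009],
    "Pct": [0.8, 0.4, 0.9, 0.6],
    "Price": [12090, 12090, 18750, 25500],
})

# Column-wise mode: one column of modes per original column, padded with NaN.
modes = car_sales.mode()
print(modes)
```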
Let’s assume that there are no repeated values in the Quantity column, and then
execute the code to calculate the mode of that column. As shown in figure 14,
it outputs all the values in the column, because no value occurs more often
than any other.

Figure 14: Getting the mode of the Quantity column

As another example, execute the mode for the Pct column; it returns all the
values in that column as well. As shown in figure 15, there is no mode.

Figure 15: Getting the mode of the Pct column
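A sketch of the no-repeats case with hypothetical Pct values. The is_unique check is an addition for illustration: it offers one way to tell whether the output of mode() is a real mode or simply every value in the column.

```python
import pandas as pd

# Hypothetical Pct values with no repeats.
pct = pd.Series([0.8, 0.4, 0.9, 0.6], name="Pct")

# With no repeated value, every value ties for "most frequent",
# so mode() returns all of them in sorted order.
pct_mode = pct.mode()
print(pct_mode)

# True when no value repeats, i.e. the column has no meaningful mode.
print(pct.is_unique)
```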
7 Standard Deviation

As with the other functions, the std() function can be used to find the
standard deviation.

To calculate the standard deviation of the Quantity column, specify the data
frame, then the column name, and lastly the std function, as shown in
figure 16.

Figure 16: Getting the std of the Quantity column

The column-wise std can be obtained by removing the column name. This is shown
in figure 17.

Figure 17: Obtaining the column-wise std
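A sketch of the std calls on hypothetical data. One detail worth knowing: by default pandas divides by the number of values minus one (ddof=1, the sample standard deviation), whereas the definition in the introduction divides by the number of values. Passing ddof=0 reproduces that definition, which the manual calculation below checks. numeric_only=True is an added assumption so that the string column is skipped on recent pandas versions.

```python
import math
import pandas as pd

# Hypothetical stand-in for the car_sales data frame.
car_sales = pd.DataFrame({
    "Make": ["Acura", "BMW", "Ford", "Toyota", "Volvo"],
    "Quantity": [2884, 1500, 3200, 2100, 1750],
})
values = car_sales["Quantity"]

# Default: sample standard deviation (divides by n - 1).
sample_std = values.std()

# ddof=0: population standard deviation (divides by n).
population_std = values.std(ddof=0)

# Manual version of the definition: square root of the mean squared deviation.
manual_std = math.sqrt(((values - values.mean()) ** 2).sum() / len(values))

# Column-wise std over the numeric columns.
column_std = car_sales.std(numeric_only=True)

print(sample_std, population_std, manual_std)
print(column_std)
```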

8 Getting the information

There are two functions that can be used to obtain the statistical or concise
summary of the data frame. They are the describe() and info() functions.

8.1  Describe function

Rather than obtaining the mean, std, etc. column by column using the relevant
functions, the describe() function can be used. It gives a summary of the
count, mean, std, min, max, and percentile values of each numeric column, as
shown in figure 18. The 50th percentile is the median; the mode is not included.

Figure 18
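A minimal sketch of describe() on hypothetical data, showing which rows the summary contains and that the string column is left out by default:

```python
import pandas as pd

# Hypothetical stand-in for the car_sales data frame.
car_sales = pd.DataFrame({
    "Make": ["Acura", "BMW", "Ford", "Toyota", "Volvo"],
    "Quantity": [2884, 1500, 3200, 2100, 1750],
    "Price": [12090, 25500, 18750, 12090, 31000],
})

# One column of summary statistics per numeric column.
summary = car_sales.describe()
print(summary)

# The row labels of the summary table.
print(list(summary.index))
```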

8.2  Info function
The info() function can be used to get a summary of index and column data types,
non-null values, and the memory usage as shown in figure 19.
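A sketch of info() on hypothetical data. In a notebook, calling car_sales.info() prints the summary directly; here the output is captured in a string through the buf parameter only so that it can be inspected in code.

```python
import io
import pandas as pd

# Hypothetical stand-in for the car_sales data frame; one Quantity cell is empty.
car_sales = pd.DataFrame({
    "Make": ["Acura", "BMW", "Ford", "Toyota", "Volvo"],
    "Quantity": [None, 1500, 3200, 2100, 1750],
    "Price": [12090, 25500, 18750, 12090, 31000],
})

# info() writes its report to a text buffer instead of returning a value.
buffer = io.StringIO()
car_sales.info(buf=buffer)
info_text = buffer.getvalue()

print(info_text)
```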
