Analysis of population using sample data and descriptive statistics, graphical representation and their comparison
<Student name>

Project report for GENG Spring 2020


Abstract: This study calculates the mean, median, mode, first quartile and third quartile of the population. Graphs such as the histogram, boxplot and time sequence diagram were plotted. A sample of size 20 was drawn from the population, and the same descriptive statistics and graphs were calculated and plotted for it. The sample was drawn using systematic sampling so as to get the sample points at equal intervals. The population and the sample were then compared with respect to these descriptive statistics and graphs. After comparing the population with the sample drawn from it, we conclude that the samples drawn from Income, Theatre and Theatre_ly are good representatives of the population, while the sample drawn from Culture is not a good representative of the population.

I. INTRODUCTION
Probability and statistics are branches of mathematics. Probability has its origin in the study of gambling and insurance in the 17th century, and it is now an indispensable tool of both social and natural sciences. Statistics may be said to have its origin in census counts taken thousands of years ago; as a distinct scientific discipline, however, it was developed in the early 19th century as the study of populations, economies, and moral actions and later in that century as the mathematical tool for analyzing such numbers.
Nearly every day we use probability to plan around the weather. Meteorologists can't predict exactly what the weather will be, so they use tools and instruments to determine the likelihood that it will rain, snow or hail. Probability and statistics could help to shape effective monetary and fiscal policies and to develop pricing models for financial assets such as equities, bonds, currencies, and derivative securities. Statistics is a crucial process behind how we make discoveries in science, make decisions based on data, and make predictions.
Companies now deal with large data sets and rely on statistics and probability to extract relevant results that are useful to the company. Probability and statistics form the basis of data science.
The objective of the study is to calculate the mean, median, mode, first quartile and third quartile of the population. A sample of size 20 is then drawn from the population and the same descriptive statistics are calculated for it. The histogram, time sequence diagram and boxplot are plotted for the population and for the sample. Finally, the population and the sample are compared on the basis of these descriptive statistics and graphs.

II. SYSTEM DESCRIPTION
A. Data description overview
Source of the data: Introduction to Statistics and Data Analysis with Exercises, Solutions and Applications in R, by Christian Heumann, Michael Schomaker, Shalabh.
This data summarizes a survey conducted on 699 participants in a big Swiss city. The survey participants are all frequent visitors to a local theatre and were asked about their age, sex (gender, female=1, male=0), annual income (in 1000 SFR), general expenditure on cultural activities (“Culture”, in SFR per month), expenditure on theatre visits (in SFR per month), and their estimated expenditure on theatre visits in the year before the survey was done (in SFR per month).
The variable names assigned in the database while performing the analysis:
Age, Sex, Income, Culture, Theatre, Theatre_ly.
A snapshot of the database is given below:

Age  Sex  Income  Culture  Theatre  Theatre_ly
31   1    90.5    181      104      150
54   0    73      234      116      140
56   1    74.3    289      276      125
36   1    73.6    185      75       130
24   1    109     191      172      140
25   0    93.1    273      168      130
61   1    63.9    184      119      195
50   0    46.1    155      97       110
53   1    75      253      152      155

The data has been analyzed in R software.
B. Mathematical and Analytical Representations
Representation and analysis of the population data with R code:
1. mu1=mean(theatre$Income)
mu1
[1] 71.67926
The mean of the variable Income is 71.67926, which implies that participants who visited the local theatre have, on average, an annual income of 71.67926 (in 1000 SFR).

2. mu2=mean(theatre$Culture)
mu2
[1] 219.8555
The mean of the variable Culture is 219.8555, which implies that participants who visited the local theatre have, on average, a general expenditure on cultural activities of 219.8555 SFR per month.

3. mu3=mean(theatre$Theatre)
mu3
[1] 139.6795
The mean of the variable Theatre is 139.6795, which implies that participants who visited the local theatre have, on average, an expenditure on theatre visits of 139.6795 SFR per month.

4. mu4=mean(theatre$Theatre_ly)
mu4
[1] 136.5665
This implies that participants estimated their expenditure on theatre visits in the year before the survey (Theatre_ly) at around 136.5665 SFR per month.
The difference between the mean actual expenditure and the mean estimated expenditure is 3.113, which is small, and hence it can be concluded that the estimated value is good enough.

5. vr1=var(theatre$Income)
vr1
[1] 173.0155
The variance of the variable Income is 173.0155, which measures how much variation exists in the data. It corresponds to a standard deviation of about 13.2 (in 1000 SFR), so the variation in annual income is considerable but moderate relative to the mean. We can also say that people with broadly similar annual incomes visit the local theatre.

6. vr2=var(theatre$Culture)
vr2
[1] 2683.45
The variance of the variable Culture is 2683.45, which measures how much variation exists in the data. The variation in the general expenditure on cultural activities is very high, which suggests that a number of factors contribute to it.

7. vr3=var(theatre$Theatre)
vr3
[1] 5855.083
The variance of the variable Theatre is 5855.083, which tells us the amount of variation in the expenditure on theatre visits. This variation is very high, which suggests that a number of factors contribute to it.

8. vr4=var(theatre$Theatre_ly)
vr4
[1] 510.6156
The variance of the variable Theatre_ly is 510.6156, which is considerable. It can be said that the variation in the estimated theatre expenditure is much smaller than the variation in the actual expenditure on theatre visits.

9. Creating function to find mode:
# Return the most frequent value of v (the first one in case of ties).
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Calculate the mode using the user function.

9.1 result1= getmode(theatre$Income)
print(result1)
[1] 75
The most frequently occurring annual income (Income) among the theatre visitors is 75 (in 1000 SFR); more visitors report this value than any other single value.

9.2 result2= getmode(theatre$Culture)
print(result2)
[1] 215
The most frequently reported general expenditure on cultural activities (Culture) is 215 SFR per month.

9.3 result3= getmode(theatre$Theatre)
print(result3)
[1] 135
The most frequently occurring expenditure on theatre visits (Theatre) is 135 SFR per month.

9.4 result4= getmode(theatre$Theatre_ly)
print(result4)
[1] 120
The mode of the estimated expenditure on theatre visits (Theatre_ly) is 120, while the mode of the actual expenditure on theatre visits is 135. The two are fairly close to each other, hence a good estimate has been made of the expenditure on theatre visits.

10. quantile(theatre$Income)
   0%   25%   50%   75%  100%
 33.0  63.5  70.0  79.0 122.4
The first quartile of the variable Income is 63.5, that is, 25% of the observations lie at or below this value. The median, or second quartile, is 70, and the third quartile, below which 75% of the observations lie, is 79.0. The maximum value is 122.4.

11. quantile(theatre$Culture)
  0%  25%  50%  75% 100%
  64  185  215  249  618
The first quartile of the variable Culture is 185. The median, or second quartile, is 215, and the third quartile, below which 75% of the observations lie, is 249. The maximum value is 618.

12. quantile(theatre$Theatre)
  0%  25%  50%  75% 100%
  23   86  121  171  463
For the variable Theatre, the first quartile is 86, the median is 121 and the third quartile is 171.

13. quantile(theatre$Theatre_ly)
  0%  25%  50%  75% 100%
  90  120  130  150  250
The variable Theatre_ly has a first quartile of 120, a median of 130 and a third quartile of 150.

14. The histogram plot for the dataset:

par(mfrow=c(2,2))
hist(theatre$Income,main="histogram for annual income")
hist(theatre$Culture,main="histogram for culture")
hist(theatre$Theatre,main="histogram for Theatre")
hist(theatre$Theatre_ly,main="histogram for Theatre_ly")

A histogram is the most commonly used graph to show frequency distributions. It allows the inspection of the data for its underlying distribution (e.g., normal distribution) and is a good way to summarize the data set.
From the histograms, we see that Income is approximately normally distributed and roughly symmetric, while Culture, Theatre and Theatre_ly are positively skewed. All the variables are unimodal, and all of them have some outliers.

15. The boxplot for the dataset:

par(mfrow=c(2,2))
boxplot(theatre$Income,main="boxplot for income")
boxplot(theatre$Culture,main="boxplot for culture")
boxplot(theatre$Theatre,main="boxplot for Theatre")
boxplot(theatre$Theatre_ly,main="boxplot for Theatre_ly")

A box plot or boxplot is a method for graphically depicting groups of numerical data through their quartiles.
From the boxplots, we observe that Income, Culture and Theatre have a large number of outliers, while Theatre_ly has fewer. We can also see that the dispersion in the data set is small, and smallest for the variable Culture. The lines extending from the boxes (the whiskers) indicate large variation outside the lower and upper quartiles, which is high for all the variables. All the variables except Income are asymmetric, as the whiskers on the two sides of the box are not equal, while Income has whiskers of approximately equal length on both sides of the box.

Representation and analysis of the sample data with R code:
1. Code for sample selection:
N=length(theatre$S.no)   # population size
N
[1] 699
n=20                     # sample size
k=ceiling(N/n)           # sampling interval
k
[1] 35

1.1 Selecting sample for the variable ‘Income’:

sample(theatre$Income[1:k],1,replace=FALSE)
[1] 59.3
s1=c(59.3,84.3,69.1,74,89.3,80,90,68,95.5,75,100,67.5,65,79.9,74,75,77.6,58,80.5,61)
s_mu1=mean(s1)
s_mu1
[1] 76.15
s_vr1=var(s1)
s_vr1
[1] 136.5447
s_result1=getmode(s1)
print(s_result1)
[1] 74
quantile(s1)
     0%     25%     50%     75%    100%
 58.000  67.875  75.000  81.450 100.000
The mean for the sample is 76.15, the mode is 74, and the variance is 136.5447. The first quartile is 67.875, the second quartile or median is 75 and the third quartile is 81.450.

1.2 Selecting sample for the variable Culture:

sample(theatre$Culture[1:k],1,replace=FALSE)
[1] 215
s2=c(215,238,292,240,245,292,287,236,233,280,262,198,249,235,240,268,258,258,158,203)
s_mu2=mean(s2)
s_mu2
[1] 244.35
s_vr2=var(s2)
s_vr2
[1] 1128.239
s_result2=getmode(s2)
print(s_result2)
[1] 292
quantile(s2)
   0%   25%   50%   75%  100%
158.0 234.5 242.5 263.5 292.0
The mean for the sample is 244.35, the mode is 292, and the variance is 1128.239. The first quartile is 234.5, the second quartile or median is 242.5 and the third quartile is 263.5.

1.3 Selecting sample for the variable Theatre:

sample(theatre$Theatre[1:k],1,replace=FALSE)
[1] 166
s3=c(166,92,255,65,133,129,456,49,235,72,130,149,76,50,409,58,89,201,78,107)
s_mu3=mean(s3)
s_mu3
[1] 149.95
s_vr3=var(s3)
s_vr3
[1] 12869.63
s_result3=getmode(s3)
print(s_result3)
[1] 166
quantile(s3)
    0%    25%    50%    75%   100%
 49.00  75.00 118.00 174.75 456.00
The mean for the sample is 149.95, the mode is 166, and the variance is 12869.63. The first quartile is 75, the second quartile or median is 118 and the third quartile is 174.75.

1.4 Selecting sample for the variable Theatre_ly:

sample(theatre$Theatre_ly[1:k],1,replace=FALSE)
[1] 120
s4=c(120,200,150,130,120,110,130,105,130,140,170,150,120,170,115,140,130,135,130,130)
s_mu4=mean(s4)
s_mu4
[1] 136.25
s_vr4=var(s4)
s_vr4
[1] 520.7237
s_result4=getmode(s4)
print(s_result4)
[1] 130
quantile(s4)
   0%   25%   50%   75%  100%
105.0 120.0 130.0 142.5 200.0
The mean for the sample is 136.25, the mode is 130, and the variance is 520.7237. The first quartile is 120, the second quartile or median is 130 and the third quartile is 142.5.

2. The histogram plot of the samples:

par(mfrow=c(2,2))
hist(s1,main="histogram for sampled annual income")
hist(s2,main="histogram for sampled Culture")
hist(s3,main="histogram for sampled Theatre")
hist(s4,main="histogram for sampled Theatre_ly")

The histogram of annual income illustrates that the sample follows a normal distribution, while Culture is negatively skewed and Theatre and Theatre_ly are positively skewed.

3. The boxplot of the samples:

par(mfrow=c(2,2))
boxplot(s1,main="boxplot for the sampled annual income")
boxplot(s2,main="boxplot for the sampled culture")
boxplot(s3,main="boxplot for the sampled Theatre")
boxplot(s4,main="boxplot for the sampled Theatre_ly")
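The sample-selection steps 1 to 1.4 above can be condensed into a few lines. The following is a minimal sketch, assuming that systematic sampling here means one random start among the first k records followed by every k-th record; the helper name systematic_index and the seed are illustrative and not part of the original analysis.

```r
# Hypothetical helper (not from the original analysis): row indices of a
# systematic sample of roughly n units from a population of size N.
systematic_index <- function(N, n) {
  k <- ceiling(N / n)             # sampling interval, 35 for N = 699, n = 20
  start <- sample(seq_len(k), 1)  # random start among the first k records
  seq(from = start, to = N, by = k)
}

set.seed(1)                       # arbitrary seed, only for reproducibility
idx <- systematic_index(699, 20)

# With the survey data loaded as `theatre`, the Income sample would be
# theatre$Income[idx]; here only the indices are inspected.
print(idx)
print(diff(idx))                  # every gap equals the interval k
```

Depending on the random start, this gives 19 or 20 indices for N = 699, because 20 intervals of length 35 slightly exceed the population size. Using a single index vector for all four variables would keep the sampled rows aligned, whereas the report draws a separate random start for each variable.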
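The population and sample summaries reported above also allow a quick numerical check of how close the means are. The sketch below is one plausible form of such a check, not necessarily the test used in this report: it treats each population mean as a fixed reference value, computes a one-sample t statistic from the reported sample mean and variance, and assumes a two-sided 5% level. The vector names are illustrative.

```r
# Summary values copied from the report; nothing here reads the raw data.
pop_mean <- c(Income = 71.67926, Culture = 219.8555,
              Theatre = 139.6795, Theatre_ly = 136.5665)
smp_mean <- c(Income = 76.15, Culture = 244.35,
              Theatre = 149.95, Theatre_ly = 136.25)
smp_var  <- c(Income = 136.5447, Culture = 1128.239,
              Theatre = 12869.63, Theatre_ly = 520.7237)
n <- 20

# One-sample t statistic: (sample mean - population mean) / standard error
t_stat <- (smp_mean - pop_mean) / sqrt(smp_var / n)

# Two-sided 5% critical value with n - 1 degrees of freedom (assumed level)
t_crit <- qt(0.975, df = n - 1)

print(round(t_stat, 3))
print(abs(t_stat) > t_crit)   # TRUE flags a mean that differs at the 5% level
```

Under these assumptions only Culture exceeds the critical value, which agrees with the conclusion that the Culture sample is not a good representative of the population.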

The boxplot shows that the sample of annual income has no outliers, while Culture has an outlier to its left and Theatre and Theatre_ly have outliers to their extreme right.

III. POPULATION AND ITS SAMPLED VERSION
A. We use t-statistic to compare population and sample mean:
1. Income: The sample mean is representative of the population and can be used to draw inferences about the population.
2. Culture: The sample mean is not representative of the population and cannot be used to draw inferences about the population.
3. Theatre: The sample mean is representative of the population and can be used to draw inferences about the population.
4. Theatre_ly: The sample mean is representative of the population and can be used to draw inferences about the population.

B. We use F-statistic to compare population and sample variance:
1. Income: The sample and population variances can be regarded as equal under the F-test. So, we can say that the variation in the population can be explained by the variation in the sample.
2. Culture: The sample and population variances cannot be regarded as equal under the F-test. So, we can say that the variation in the population is not explained by the variation in the sample.
3. Theatre: The sample and population variances can be regarded as equal under the F-test. So, we can say that the variation in the population can be explained by the variation in the sample.
4. Theatre_ly: The sample and population variances can be regarded as equal under the F-test. So, we can say that the variation in the population can be explained by the variation in the sample.

C. The quartiles:
1. Income: The first quartile, the median and the third quartile are all shifted to the right in the sample.
2. Culture: The quartiles of the sample are shifted to the right as compared to the population.
3. Theatre: The first quartile and median are shifted to the left as compared to the population, while the third quartile is shifted to the right in the sample.
4. Theatre_ly: The first quartile and median of the sample are equal to those of the population, while the third quartile is shifted to the left as compared to the population.

D. The histogram:
1. Income: The histogram forms a bell-shaped curve for the sample as well as for the population. The sample and population both follow a normal distribution. The mode of the sample is 74 and that of the population is 75, which are approximately equal.
2. Culture: The sample is skewed in the opposite direction to the population. The mode of the population is 215 while that of the sample is 292, which is a considerable difference.
3. Theatre: The sample and population are skewed in the same direction. The mode of the population is 135 while that of the sample is 166, which is a noticeable difference but still acceptable.
4. Theatre_ly: The sample and population are skewed in the same direction. The mode of the population is 120 and that of the sample is 130, which is a good representation.

E. The boxplot:
1. Income: The sample is more dispersed as compared to the population but has no outliers, unlike the population. The median lies approximately in the middle, as in the population.
2. Culture: The sample is more dispersed as compared to the population and has one outlier to the right, same as the population.
3. Theatre: The sample is more dispersed as compared to the population and has its median approximately in the middle, while in the population the median is shifted a little to the left.
4. Theatre_ly: The sample is more dispersed as compared to the population and has potential outliers to the right, like the population.

IV. SIMULATION RESULTS
We use a linear regression model to fit the data set in R software. We take Theatre as the dependent variable and Age, Sex, Income and Culture as the independent variables. We have the following code:
r=lm(Theatre~ Age+ Sex+ Income+ Culture, data=theatre)
r
Call:
lm(formula = Theatre ~ Age + Sex + Income + Culture, data = theatre)

Coefficients:
(Intercept)          Age          Sex       Income      Culture
  -111.9195       0.4665      21.8951       1.4203       0.5382

The constant in the model is negative while the coefficients of Age, Sex, Income and Culture are positive, which means that the dependent variable moves in the same direction as each independent variable.

summary(r)
Call:
lm(formula = Theatre ~ Age + Sex + Income + Culture, data = theatre)

Residuals:
    Min      1Q  Median      3Q     Max
-139.25  -43.00  -12.78   29.92  307.29

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -111.91945   16.08247  -6.959 7.94e-12 ***
Age            0.46653    0.19137   2.438    0.015 *
Sex           21.89508    5.22657   4.189 3.16e-05 ***
Income         1.42035    0.20378    6.97 7.39e-12 ***
Culture        0.53817    0.05056  10.644  < 2e-16 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 64.46 on 694 degrees of freedom
Multiple R-squared: 0.2944, Adjusted R-squared: 0.29
F-statistic: 72.38 on 4 and 694 DF, p-value: < 2.2e-16

We can see that all independent variables except Age are highly significant, while Age is significant at the 5% level, which is just enough to keep it in the model. The R-squared and adjusted R-squared are approximately 29%, which is the share of the variation in the dependent variable explained by the independent variables.
The coefficient standard error measures how much a coefficient estimate would be expected to vary from sample to sample; it is small relative to the estimate for all the variables. The residual standard error is 64.46, which is a measure of the quality of the linear regression fit. The F-statistic is a good indicator of whether there is a relationship between the predictors and the response variable: the further the F-statistic is from 1, the better. Here, the F-statistic is very high, so it can be said that there is a clear relationship between the predictors and the response variable.

V. PRACTICAL PROBLEMS
It was a dilemma whether to choose SPSS or R software for the analysis of the database. To make the analysis more elaborate and to obtain good-quality graphical representations, R software was chosen for the analysis.
While building the function for the mode, errors occurred several times, so the idea was finally taken from a Google search, which gave working code for the mode.
It was also a dilemma which model should be selected for the data. As linear regression can be considered a standard model for fitting data, and after a bit of research, a linear model was selected to fit the data.
The snapshot of the error that occurred while building the function for mode:

VI. CONCLUSION
The study dealt with the calculation of descriptive statistics such as the mean, variance, mode, and quartiles of the population data set. A graphical representation of the data set was produced to get a rough idea of the data. Then a sample of size 20 was selected using systematic sampling. The same procedure was performed on the sample as on the population. Finally, the population and the sample were compared using the t-test, the F-test and theoretical knowledge.
It is not always possible to study the whole population, so a sample from the same population is drawn to make inferences about the population. It may sometimes be difficult to get a sample which is truly random. Most samples therefore tend to be biased. So, it becomes necessary to study the bias and test whether the sample is a good representative of the population or not. The study revolves around this fact, and so the sample is compared with the population.
The paper would be relevant for gaining a brief knowledge of how descriptive statistics and graphical representation can be used to compare the population and sample. A better model could be chosen which fits the data very well. Additional factors, beyond those used in the study, which affect the visitors of the local theatre could be identified. Prediction can be performed on the data set for the analysis of future outcomes. Principal component analysis can be applied to the data to identify which age group prefers which type of movie in the theatre, or the analysis can be performed with other factors. Also, classification could be performed with respect to factors such as income, age or gender.
Hence, various types of analysis can be performed on the data set as an extension and may prove to be more informative.

REFERENCES
[1] Christian Heumann, Michael Schomaker, Shalabh, Introduction to Statistics and Data Analysis with Exercises, Solutions and Applications in R.
[2] W.-K. Chen, Linear Networks and Systems (Book style). Belmont, CA: Wadsworth, 1993, pp. 123–135.
[3] Link used for building mode function: https://www.rdocumentation.org/packages/DescTools/versions/0.99.19/topics/Mode
[4] B. Smith, “An approach to graphs of linear forms (Unpublished work style),” unpublished.
[5] Link used as a reference while fitting regression model: https://onlinelibrary.wiley.com/doi/10.1002/9781118596289.ch3#:~:text=In%20the%20standard%20linear%20regression,the%20estimation%20of%20regression%20models.&text=The%20R%20function%20is%20used%20to%20fit%20linear%20(regression)%20models.
