
2022

Exploring how sampling techniques affect the relationship between literacy rates and fertility rates
SHAAN SUVARNA
Introduction

The two variables I have chosen for my investigation are literacy rates and fertility rates. The literacy rate is the percentage of adults (people aged 15 and above) who are literate. The fertility rate is defined as the total number of children that would be born to each woman if she were to live to the end of her child-bearing years and give birth to children in alignment with the prevailing age-specific fertility rates.¹

Fertility rates are decreasing across the world, and one of the main contributing factors is the increase in literacy rates in many countries, resulting from urbanisation and greater access to education. Although the two variables may seem unrelated at first, educated women are less likely to have many children, either because they prioritise their careers or because they want all of their children to receive an education just as they did (which may not be feasible with more children). Many factors influence the effect of literacy rates on fertility rates, so I have carried out this investigation firstly to identify whether there is a correlation between literacy rates and fertility rates, and secondly to see how different sampling techniques (random, systematic and convenience sampling) affect that correlation. For example, some countries may manipulate the data through biased sampling techniques in order to increase the apparent correlation between literacy and fertility rates and promote an agenda, such as the anti-natalist policy in China, as they could argue that a fall in fertility rates would increase literacy rates within the country. Other countries, such as South Korea, may manipulate this information to further increase the correlation so that they can more effectively highlight the importance of having women in education. This would allow them to experience falling fertility rates (due to educated women having fewer children), subsequently leading to a demographic dividend, as the proportion of young dependants falls whilst the proportion of the working population stays the same. Therefore, in my investigation, I will first investigate whether the two variables are correlated and then identify whether different sampling techniques affect this correlation.

I will do this by first conducting a chi-squared test of independence to see whether literacy rates and fertility rates are independent of each other. If they are found not to be independent, I will use the Pearson product-moment correlation coefficient to investigate the strength of the correlation between the variables. To compare the sampling techniques, I will create a data set using each method of sampling (random sampling, systematic sampling, convenience sampling and biased sampling) and calculate a Pearson value for each sample, keeping the sample sizes constant.

Each method of sampling has its own advantages and disadvantages. Random sampling is seen as the most effective of the three standard methods (random, systematic and convenience) at negating the effects of bias, as the values in the sample are chosen randomly and there can be no influence over which values appear in the final sample. However, generating random numbers with a random number generator to build the sample can be time-consuming, which makes this the most difficult of the three methods to carry out. Systematic sampling is a faster way of creating a sample: the first value is selected randomly and then every nth value after it is taken until the final sample size is obtained. For example, in my sample the randomly selected starting point was the fifth value, so I eliminated it and then every fifth piece of data after it until I reached my final sample size. Compared with random sampling, however, systematic sampling can introduce some bias, because the data could be arranged to fit the systematic sampling process. For example, I could have arranged my data so that every fifth piece of data that gets selected was whatever value I wanted it to be. A similar problem arises with convenience sampling, as the person arranging the data could order it so that the first 80 pieces of data are whatever they want them to be. An advantage of convenience sampling, however, is that it is the most time-effective of the three methods, as it involves simply selecting the first 80 pieces of data that are visible and using them as the sample.

¹ https://data.oecd.org/pop/fertility-rates.html
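The three sampling methods described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 100 labelled items, the function names and the seed are all hypothetical, chosen so that dropping every fifth item leaves exactly 80. It is not the spreadsheet procedure used in the investigation.

```python
import random

def random_sample(data, n, seed=None):
    """Pick n items so that every item has an equal chance of selection."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(data, n)

def systematic_elimination(data, step, start):
    """Drop the item at index `start` and every `step`-th item after it."""
    return [item for i, item in enumerate(data)
            if i < start or (i - start) % step != 0]

def convenience_sample(data, n):
    """Take the first n items exactly as they are ordered."""
    return data[:n]

# Hypothetical population of 100 labelled items (not the real data set).
population = [f"country_{i}" for i in range(1, 101)]

random_80 = random_sample(population, 80, seed=1)
# start=4 is the fifth item, so items 5, 10, 15, ... are eliminated.
systematic_80 = systematic_elimination(population, step=5, start=4)
convenience_80 = convenience_sample(population, 80)

print(len(random_80), len(systematic_80), len(convenience_80))  # 80 80 80
```

Note that the systematic and convenience samples depend entirely on how the list is ordered, which is the weakness discussed above: whoever orders the data controls what ends up in the sample.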

I have chosen to source my data from the World Bank's data bank,² since it is highly reliable and is updated on a regular basis, with primary data sources from over 80 countries. It works closely with agencies of the United Nations (UN), the Organisation for Economic Co-operation and Development (OECD) and the International Monetary Fund (IMF) in order to display the most accurate information possible. This makes it one of the most reliable sources of literacy and fertility rate data across the world.

Chi squared test of independence


The chi-squared test of independence was used to identify whether the two variables I was investigating, literacy rates and fertility rates, are independent of each other (meaning that there is no association between them) or are not independent. By working out a chi-squared value for the data set, we can compare this number with the critical value to determine whether or not to reject the null hypothesis. The null hypothesis for my test is that literacy rates and fertility rates are independent of each other, and my alternative hypothesis is that they are not independent.

The significance level of a chi-squared test is the probability of rejecting the null hypothesis when it is in fact true. For example, a significance level of 5% means that, if the two variables really were independent, there would be a 5% chance of the test wrongly concluding that they are not. Therefore, to reduce the chance of wrongly rejecting the null hypothesis, I have chosen a significance level of 1% for my test.

The first stage of my chi-squared test involved creating an observed frequency table. First, I imported the values from my raw data set into Google Sheets and arranged my x values (literacy rates) in ascending order, using the sort range function, and used this to determine the class intervals for my observed frequency table. I then repeated the same process for my y values to create the class intervals for the fertility rates. After creating both sets of class intervals, I created a 3x3 table and recorded the frequencies from my data set in it, having first colour-coded the values to speed up the process of counting them. After inputting my values, I worked out the totals for each row and each column using the SUM function in Google Sheets.

Here is the full observed frequency table for my data set:

To work out the expected frequencies for my table, I had the option of either using my GDC or using the formula (row total * column total) / overall total. For example, for the cell where the fertility rate is in the class 1.25 < y < 2.10 and the literacy rate is in the class 78.79 < x < 94.56 (the observed frequency is 10, as stated in the table above), the expected frequency given by the formula is (34 * 35) / 104 = 11.44230769.

² https://data.worldbank.org
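The expected-frequency formula can be checked with a short Python sketch. The function name is an assumption made for illustration; the numbers 34, 35 and 104 are the row total, column total and overall total quoted in the worked example.

```python
def expected_frequency(row_total, column_total, overall_total):
    """Expected count for one cell if the two variables were independent."""
    return row_total * column_total / overall_total

# Worked example from the text: row total 34, column total 35, overall 104.
cell = expected_frequency(34, 35, 104)
print(round(cell, 8))  # 11.44230769
```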

Here is the full expected frequency table for my data set:

Finally, to work out the chi-squared contribution for each cell of the frequency tables, we use the formula (O - E)² / E, where O is the observed frequency and E is the expected frequency for that cell.

Using the values from the observed frequency table and the expected frequency table, I created a table of chi-squared contributions. Adding all of these up, using the SUM function, gave the final chi-squared value, which I then compared with the critical value to draw my conclusion.
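As a rough sketch of the whole calculation, the Python below builds the expected table from an observed table and sums the (observed - expected)² / expected contributions. The 3x3 table here is hypothetical and chosen only so the arithmetic is easy to follow; it is not the observed frequency table from this investigation, and the function names are assumptions.

```python
def expected_table(observed):
    """Expected counts: (row total * column total) / overall total per cell."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(col) for col in zip(*observed)]
    overall = sum(row_totals)
    return [[r * c / overall for c in column_totals] for r in row_totals]

def chi_squared(observed):
    """Sum of (O - E)^2 / E over every cell of the table."""
    expected = expected_table(observed)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))

# Hypothetical observed frequencies (every row and column totals 60).
hypothetical_observed = [
    [10, 20, 30],
    [20, 20, 20],
    [30, 20, 10],
]

statistic = chi_squared(hypothetical_observed)
print(statistic)  # 20.0
```

In this made-up table every expected count is 20, so only the four corner cells contribute (5 each), giving a total of 20.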

Here is my table for chi squared:

The degrees of freedom for this test were worked out using the formula (rows - 1) * (columns - 1), so (3 - 1) * (3 - 1) = 4. The chi-squared critical value for 4 degrees of freedom at a 1% significance level is 13.277. Since the chi-squared value derived from my test, 56.2 (3 s.f.), is greater than the critical value, we must reject the null hypothesis, meaning that the two variables are not independent. Therefore, the overall conclusion of my chi-squared test is that literacy and fertility rates are associated with each other. Now that it is evident that the variables are not independent, we can calculate the Pearson product-moment correlation coefficient to determine the strength of the correlation between them.
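The comparison with the critical value can be sanity-checked without statistical tables. The sketch below relies on the fact that, for 4 degrees of freedom specifically, the upper-tail probability of the chi-squared distribution has the closed form exp(-x/2) * (1 + x/2). The function names are assumptions, and this shortcut does not apply to other degrees of freedom.

```python
import math

def degrees_of_freedom(rows, columns):
    """Degrees of freedom for a test of independence on a rows x columns table."""
    return (rows - 1) * (columns - 1)

def chi2_upper_tail_df4(x):
    """P(X > x) for a chi-squared variable with exactly 4 degrees of freedom."""
    return math.exp(-x / 2) * (1 + x / 2)

df = degrees_of_freedom(3, 3)
print(df)  # 4

# The upper-tail probability at the 1% critical value should be close to 0.01.
print(round(chi2_upper_tail_df4(13.277), 4))

# The test statistic reported in the text, 56.2, lies far beyond that point.
p_value = chi2_upper_tail_df4(56.2)
print(p_value < 0.01)  # True
```

Because the tail probability for 56.2 is far below 0.01, the sketch agrees with the decision to reject the null hypothesis at the 1% level.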

Whole population Pearson’s Product Moment Correlation Coefficient


The Pearson product-moment correlation coefficient measures the strength and direction of the linear correlation between two variables. In effect, it describes how closely the points of the data set lie to a line of best fit drawn through them. The value lies between -1 and 1: values close to -1 or 1 indicate a strong negative or positive correlation, and values close to 0 indicate little or no linear correlation. I predict that my data set will have a negative Pearson value, and that its magnitude will be relatively large, because the scatter graph shows a strong negative correlation.

The first stage in working out the Pearson value involved importing the whole population data set and
calculating the totals and averages for both the literacy and the fertility rates.

^Means and totals for each of the X and Y variables for the raw data

Having worked out the means and totals, we can use them to work out the values of (x - xmean)² and (y - ymean)² for each individual data point (using a separate column in Google Sheets for each). Finally, in another column, we can work out the value of (x - xmean)*(y - ymean) for each data point in the same way. After finding the totals of the (x - xmean)², (y - ymean)² and (x - xmean)*(y - ymean) columns, using the SUM function, we can substitute these totals into the formula for the Pearson value.
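The column-by-column calculation described above can be mirrored in Python. This is a minimal sketch: the five data points are hypothetical and are not taken from the World Bank data set, and the function name is an assumption.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from the three sums of deviations about the means."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)  # total of (x - xmean)^2
    syy = sum((y - y_mean) ** 2 for y in ys)  # total of (y - ymean)^2
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical literacy rates (%) and fertility rates, for illustration only.
literacy = [40, 55, 70, 85, 99]
fertility = [6.0, 5.1, 3.9, 2.8, 1.7]

r = pearson_r(literacy, fertility)
print(round(r, 3))
```

For these made-up points, fertility falls steadily as literacy rises, so r comes out negative and close to -1, the same kind of result that the CORREL check in the spreadsheet would confirm.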

Pearson product-moment correlation coefficient formula:

r = Σ(x - xmean)*(y - ymean) / √[Σ(x - xmean)² * Σ(y - ymean)²]

Pearson value for my data set

The Pearson value I worked out for my data fits my hypothesis, as it is negative and relatively large in magnitude, which is consistent with the strong negative correlation seen in the scatter graph. I verified the Pearson value I obtained from the formula by checking that it matched the output of the CORREL function in Google Sheets, applied to my entire data set. The two values matched, which increases my confidence that the value is accurate.

Therefore, overall, from the Pearson value we can determine that there is a strong negative correlation between literacy rates and fertility rates: lower literacy rates are associated with higher fertility rates. Contextually, this makes sense, as lower literacy rates imply lower levels of education and development within a country. If a country is less developed, infant mortality rates are likely to be higher as a result of poorer healthcare systems, so women have more children to compensate for the chance of their children dying. As a result, fertility rates are higher.

Calculation of the regression line


The regression line is the straight line that best describes the linear relationship between the two variables being investigated. To calculate the regression line for my data set, I used two methods. The first method involved drawing a scatter graph of the data in Google Sheets and using the trendline function to obtain the equation of the regression line.

The second method involved calculating the regression line using the formula:

y - ymean = m*(x - xmean), where m = Σ(x - xmean)*(y - ymean) / Σ(x - xmean)²

^And arranging it into the form y=mx+c

To work out the value of m, I divided the sum of my (x - xmean)*(y - ymean) values by the sum of my (x - xmean)² values, which gave me the gradient of the regression line. To work out the value of c, I first rearranged the equation so that y was the subject and then expanded the brackets, multiplying the gradient by x and by -xmean. Adding ymean then gave the equation of the regression line, so that c = ymean - m*xmean.
F107 is the sum of (x - xmean)*(y - ymean) and D107 is the sum of (x - xmean)².

K34 is the gradient of the regression line, which was worked out above, and B109 is the mean of x. It has a minus sign because in the equation it appears as gradient*(x - xmean). C109 is the mean of y.
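The same gradient and intercept calculation can be sketched in Python. The data points here are hypothetical and lie exactly on a straight line so that the result is easy to verify by hand; the function name is an assumption.

```python
def regression_line(xs, ys):
    """Return (gradient, intercept) of the least-squares line y = mx + c."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    gradient = sxy / sxx
    # Expanding y - ymean = m(x - xmean) gives c = ymean - m * xmean.
    intercept = y_mean - gradient * x_mean
    return gradient, intercept

# Hypothetical points lying exactly on y = -0.05x + 7.
xs = [20, 40, 60, 80, 100]
ys = [6, 5, 4, 3, 2]

m, c = regression_line(xs, ys)
print(f"y = {m:.2f}x + {c:.2f}")  # y = -0.05x + 7.00
```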

R values for every different method of sampling

To work out the Pearson value for each method of sampling, I applied the same method that I used for the whole population to the sample produced by each method.

Random sampling

Systematic sampling
Convenience sampling

Biased sample

Despite random sampling being the most effective at eliminating bias from a sample, it was in fact systematic sampling that had the Pearson value closest to that of the whole population (being -0.813 to 3 s.f.). Random sampling had a Pearson value lower than that of the whole population, and convenience sampling had a value higher than that of the whole population. Therefore, despite convenience sampling being the least effective at removing bias, it still proved to be accurate in the Pearson value calculation, as it produced results similar to those of random sampling. However, since the correlation was higher in the convenience sample, this could indicate a subtle element of bias, and a country could exploit this method of sampling by conveniently arranging its data to increase the correlation between the two variables, perhaps in order to promote a policy (such as China's one-child policy). Furthermore, in my biased sample, where I handpicked the data that best fit the correlation, the Pearson value was inflated substantially as a result of my bias. Therefore, if a country such as China were to do this when creating its sample, it would not be at all difficult to manipulate the correlation between the two variables, allowing it to promote its policy more easily and establish its control over society. If I could so easily increase the correlation between literacy and fertility rates within my sample, then what is to stop them?
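The way hand-picking inflates a correlation can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the data are artificial, and "hand-picking" is modelled as keeping the points that lie closest to the regression line, which is one plausible way of choosing the data that best fit the correlation, not necessarily the procedure used for the biased sample above.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from sums of deviations about the means."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def fit_line(xs, ys):
    """Least-squares gradient and intercept."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    gradient = sxy / sxx
    return gradient, y_mean - gradient * x_mean

def hand_picked_sample(xs, ys, n):
    """Keep the n points with the smallest vertical distance to the line."""
    m, c = fit_line(xs, ys)
    ranked = sorted(zip(xs, ys), key=lambda p: abs(p[1] - (m * p[0] + c)))
    chosen = ranked[:n]
    return [p[0] for p in chosen], [p[1] for p in chosen]

# Artificial data: a downward trend with alternating large and small scatter.
scatter = [3.0, -3.0, 0.5, -0.5]
xs = [float(i) for i in range(20)]
ys = [10 - 0.4 * x + scatter[i % 4] for i, x in enumerate(xs)]

r_full = pearson_r(xs, ys)
biased_x, biased_y = hand_picked_sample(xs, ys, 10)
r_biased = pearson_r(biased_x, biased_y)

print(round(r_full, 3), round(r_biased, 3))
```

In this artificial example the hand-picked half of the data gives a correlation noticeably closer to -1 than the full data set, even though no value was altered; only the selection changed.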

Conclusion

From my exploration I have firstly learnt, through my chi-squared test of independence, that literacy rates and fertility rates are not independent. After calculating the Pearson value for each of my samples, I have concluded that systematic sampling was the best method of sampling in this investigation: it was the most accurate, since its r value was closest to the r value for the whole population, and it was also extremely time-effective in comparison to random sampling. Convenience sampling was also effective in the sense that it was very quick to create a sample and the results were as accurate as those of random sampling. Overall, all of the methods of sampling (except the biased sample) produced relatively accurate r values that were extremely close to that of the whole population, so I can conclude that these methods of sampling do in fact limit the effects of bias. For my biased sample, however (where I purposefully chose the data that would feature in my sample), I managed to drastically increase the Pearson value and achieve a far greater correlation between the two variables than in the whole population. This demonstrates that if someone were able to select the sample of their own accord, they could manipulate the correlation between two variables in any way they wanted. In terms of literacy and fertility rates, a country may, for instance, want to weaken the apparent relationship between these two variables in order to keep fertility rates high for religious reasons or beliefs.
