
THE LANGUAGE OF MATHEMATICS

o Mathematics has a peculiar language in which symbols occupy the most important position.

Algebra is written in a symbolic language that is designed to express mathematical thoughts.

“Mathematics is the language of physical sciences and certainly no more marvelous language was ever created by the mind of man.” – Lindsay

o Language of Mathematics – highly structured and rule-bound.

Characteristics:
1. Precise (very fine distinctions)
2. Concise (talk briefly)
3. Powerful (express complex thought w/ relative ease)

Mathematics – synonymous with rigor and universality.
Noun (expression) – names mathematical objects of interest.
Sentences – state complete mathematical thoughts.
Verbs – equal sign (=)

                      ENGLISH                              MATHEMATICS
Name given to an      NOUN                                 EXPRESSION
object of interest    Examples: Carol, Idaho, book         Examples: 5, 2 + 3, ½
A complete thought    SENTENCE                             SENTENCE
                      Examples:                            Examples:
                      The capital of Idaho is Boise.       3 + 4 = 7
                      The capital of Idaho is Pocatello.   3 + 4 = 8

PROBLEM SOLVING STRATEGIES

1. Trial and Improvement
2. Draw a Diagram
3. Look for a Pattern
4. Act it out
5. Draw a table
6. Simplify the Problem
7. Use equation
8. Work Backwards
9. Eliminate Possibilities

DATA MANAGEMENT

Data – a collection of facts, such as numbers, words, measurements, observations, or just descriptions of things.

Data Management – involves collecting, storing, organizing, protecting, verifying, and processing essential data and making it available to your organization.

Data Collection Methods

• Direct observation – simplest way to collect data
• Survey
• Case Study
• Interview
• Observation
• Group Assessment
• Reviews (Portfolio, Expert/Peer, Document)
• Testimonials
• Tests (Questionnaire)
• Diaries, Journals, Logs
• Photographs, Videotapes, Slides

Common Methods of Determining the Sample

• Random – by chance
  Ex. Students in a class write their names on strips of paper and put them in a hat. The teacher draws five names.
• Systematic – according to a rule or formula
  Ex. An exit poll is taken of every tenth voter.
• Stratified – at random from a randomly chosen subgroup
  Ex. In a statewide survey, five counties are randomly chosen and 100 people are randomly chosen from each county.

Data Organization – organizing collected factual material commonly accepted in the scientific community as necessary to validate research findings.

Various Modes of Data Presentation

1. Textual Presentation – data are presented in the form of texts, phrases/paragraphs.
   Ex. Newspaper reports
2. Tabular Presentation – reliable and effective way of showing relationships or comparisons of data through tables.
3. Graphical Presentation – most effective way of presenting data through statistical graphs.

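The three sampling methods listed under Common Methods of Determining the Sample can be sketched in Python. This is only an illustration following the descriptions in these notes. The function names, the fixed seed, and the made-up students, voters, and counties are all assumptions.

```python
import random


def simple_random_sample(population, k, rng):
    """Random: draw k members purely by chance (names from a hat)."""
    return rng.sample(population, k)


def systematic_sample(population, step, start=0):
    """Systematic: take every `step`-th member, beginning at index `start`."""
    return population[start::step]


def subgroup_sample(groups, n_groups, n_per_group, rng):
    """Stratified (as described in the notes): randomly choose subgroups,
    then sample at random inside each chosen subgroup."""
    chosen = rng.sample(sorted(groups), n_groups)
    return {name: rng.sample(groups[name], n_per_group) for name in chosen}


rng = random.Random(42)  # fixed seed (an assumption) so runs are repeatable

# Random: the teacher draws five names.
students = [f"student_{i}" for i in range(1, 31)]
print(simple_random_sample(students, 5, rng))

# Systematic: every tenth voter in an exit poll.
voters = [f"voter_{i}" for i in range(1, 101)]
print(systematic_sample(voters, 10, start=9))

# Stratified: five counties, then 100 people from each.
counties = {
    f"county_{c}": [f"{c}_resident_{i}" for i in range(500)]
    for c in "ABCDEFGH"
}
picked = subgroup_sample(counties, 5, 100, rng)
print({name: len(people) for name, people in picked.items()})
```

Passing the random generator in as `rng` keeps the draws repeatable, which helps when checking work by hand.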
Types of Graphs

A. Bar chart – composed of discrete bars that represent different categories of data. It is used to compare values across categories.
B. Pie chart – circular chart used to compare parts of the whole.
C. Line chart – displays the relationship between two types of information.
D. Histogram – has connected bars that display the frequency or proportion of cases that fall within defined intervals.
E. Cumulative frequency – allows us to find the median, the lower and upper quartiles, and the interquartile range.
F. Box Plots – display data for comparison.
   Five values are needed:
   1. Minimum value
   2. Lower quartile (Q1)
   3. Median (Q2)
   4. Upper quartile (Q3)
   5. Maximum value

Measures of Central Tendency (CT)

o A statistical measure that determines a single value that accurately describes the center of the distribution and represents the entire distribution of scores.
o Aims to identify the single value that is the best representative for the entire set of data.
o Allows researchers to summarize or condense a large set of data in a very simplified, concise form through identifying the ‘average score’ (CT).
o A descriptive statistic

1. Mean

o Most commonly used measure of CT. However, it does not work in every situation.
o Requires scores that are numerical values measured on an interval or ratio scale.
o Obtained by computing the sum/total for the entire set of scores, then dividing the sum/total by the number of scores.

2. Median

o Defined as the midpoint of the list (in a distribution of scores ordered from smallest to largest).
o It divides the scores, thus 50% of scores in the distribution have values that are equal to or less than the median.
o Requires scores that can be placed in rank order (small - large) and are measured on an ordinal, interval or ratio scale.
o Median can be identified by a simple counting procedure.
o It is relatively unaffected by extreme scores; hence it tends to stay in the ‘center’ of the distribution.
o It serves as an alternative to the mean.

3. Mode

o Most frequently occurring category or score in the distribution.
o In a frequency distribution graph, the mode is the score corresponding to the peak or high point of the distribution.
o It can be determined for data measured on any scale of measurement: nominal, ordinal, interval, or ratio.

Primary value
(I) The only measure of CT that can be used for data measured on a nominal scale.
(II) Used as a supplemental measure of central tendency that is reported along with mean or median.

Central Tendency & Shape of Distribution

• In a symmetrical distribution, the mean and median will always be equal. If a symmetrical distribution has only one (1) mode, the mode, mean, and median will always have the same value.
• In a skewed distribution, the mode will be located at the peak on one side and the mean usually will be placed toward the tail on the other side.
• The median is usually located between the mean and the mode.

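The three measures of central tendency above can be checked with Python's standard `statistics` module. This is a minimal sketch. The list of scores is an illustrative assumption, and the extra score of 100 is added only to show how an extreme value affects the mean and the median differently.

```python
import statistics

scores = [4, 8, 1, 6, 6, 2, 9, 3, 6, 9]  # illustrative scores (an assumption)

mean = statistics.mean(scores)        # sum of the scores / number of scores
median = statistics.median(scores)    # midpoint of the scores in rank order
modes = statistics.multimode(scores)  # most frequent score(s); may be several

print("mean:", mean, "median:", median, "mode(s):", modes)

# One extreme score pulls the mean far more than the median.
with_extreme = scores + [100]
print("mean with extreme score:", statistics.mean(with_extreme))
print("median with extreme score:", statistics.median(with_extreme))
```

`multimode` is used rather than `mode` because a distribution can have more than one most frequent score.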
Changing the Mean

If a constant value is added to every score in a distribution, then the same constant value is added to the mean. Also, if every score is multiplied by a constant value, then the mean is also multiplied by the same constant value.

Measure of Dispersion

o Descriptive statistics that describe how similar a set of scores are to each other.
  • The more similar the scores are to each other, the lower the measure of dispersion will be.
  • The less similar the scores are to each other, the higher the measure of dispersion will be.
  • Generally, the more spread out a distribution is, the larger the measure of dispersion will be.

Three Main Measures of Dispersion

1. Range – difference between the largest score in the set of data and the smallest score in the set of data.

   R = XL – XS
   Ex. Find the range: 4 8 1 6 6 2 9 3 6 9
   XL – XS = 9 – 1 = 8

   o It is used when
     - the data are ordinal
     - presenting results to people with little/no knowledge of statistics.
   o It is rarely used in scientific work as it is fairly insensitive
     - It depends on only two scores in the set of data, XL and XS.
     - Two very different sets of data can have the same range.

2. Semi-Interquartile Range (SIR) – the difference between the third and first quartiles, divided by two.
   1st quartile – 25th percentile
   3rd quartile – 75th percentile

   SIR = (Q3 – Q1)/2

   o SIR is often used with skewed data as it is insensitive to the extreme scores.

3. Variance (V) – the average of the squared deviations:

   V = Σ(X – μ)² / N

What does the Variance Formula Mean?

• It says to subtract the mean from each of the scores; the result is called a deviate or a deviation score. A deviate tells how far a given score is from the typical, average, score. Hence, it is a measure of dispersion for a given score.
• One of the definitions of the mean was that it always made the sum of the scores minus the mean equal to 0. Thus, the average of the deviates must be 0 since the sum of the deviates must equal 0. To avoid this problem, the deviate scores are squared, so that none of the squared scores is negative.
• Variance is the mean of the squared deviation scores.

The larger the variance is, the more the scores deviate, on average, away from the mean.
The smaller the variance is, the less the scores deviate, on average, from the mean.

Standard Deviation

• When the deviate scores are squared in variance, their unit of measure is squared as well.
  E.g. If people’s weights are measured in pounds, then the variance of the weights would be expressed in pounds² (squared pounds).
• Since squared units of measure are often awkward to deal with, the square root of variance is often used instead. The standard deviation (SD) is the square root of variance.

  SD = √variance
  V = (standard deviation)²

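The Changing the Mean rule and the three measures of dispersion can be worked through in Python, using the scores from the range example. This is a sketch, not a definitive method. Quartile conventions differ between textbooks and software, so the SIR from `statistics.quantiles` (default settings, an assumption here) may not match a hand calculation exactly. The constants 3 and 2 are arbitrary.

```python
import math
import statistics

scores = [4, 8, 1, 6, 6, 2, 9, 3, 6, 9]  # data from the range example

# Changing the mean: adding or multiplying by a constant.
mean = statistics.mean(scores)
shifted_mean = statistics.mean([x + 3 for x in scores])  # mean + 3
scaled_mean = statistics.mean([x * 2 for x in scores])   # mean * 2

# 1. Range: largest score minus smallest score.
data_range = max(scores) - min(scores)

# 2. Semi-interquartile range: (Q3 - Q1) / 2.
q1, q2, q3 = statistics.quantiles(scores, n=4)
sir = (q3 - q1) / 2

# 3. Variance: the mean of the squared deviates.
deviates = [x - mean for x in scores]  # these sum to (about) zero
variance = sum(d ** 2 for d in deviates) / len(scores)

# Standard deviation: the square root of the variance.
sd = math.sqrt(variance)

print("mean:", mean, "shifted:", shifted_mean, "scaled:", scaled_mean)
print("range:", data_range, "SIR:", sir)
print("variance:", variance, "SD:", sd)
```

The variance here divides by the number of scores N, matching "the average of the squared deviations" in these notes. Many packages divide by N – 1 by default when the data are a sample.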
Normal (Gaussian) Distribution

Distinguishing Features

o Descriptive model that describes real world situations.
o A continuous frequency distribution of infinite range (can take any values, not just integers as in the case of the binomial and Poisson distributions).
o The most important probability distribution in statistics and an important tool in the analysis of epidemiological data and management science.

Characteristics of Normal Distribution

• Links frequency distribution to probability distribution
• It has a bell-shaped curve
• It is symmetric around the mean: the two halves of the curve are the same (mirror images). Hence Mean = Median
• The total area under the curve is 1 (or 100%)
• It has the same shape as the Standard Normal Distribution.
• In a Standard Normal Distribution: the mean (μ) = 0 and the standard deviation (σ) = 1

Z - Score (Standard Score)

Z = (X – μ) / σ

o Z indicates how many standard deviations away from the mean the point X lies.
o Z score is calculated to 2 decimal places

Diagram of Normal Distribution Curve (Z score)

o The mean ± 1 standard deviation covers about 68% of the area under the curve
o The mean ± 2 standard deviations covers about 95% of the area under the curve
o The mean ± 3 standard deviations covers about 99.7% of the area under the curve

CORRELATION AND REGRESSION

Correlation and Regression – used when we are interested in the relationship between two variables.

Correlation is used to:

o describe the strength of a relationship between two variables. This is the ‘r value’ and may vary from -1.0 to 1.0
o determine the probability that two unrelated variables would produce a relationship this strong, just by chance. This is the ‘p value’.

NOTE:
Correlation does not imply causation - the variables are related, but that alone does not show that one causes the second.
The variables are both dependent variables in the experiment. Hence, it is incorrect to think of one variable as ‘causing’ the other.

Parametric Test – the Pearson Correlation coefficient.
If the data are normally distributed, then you can use a parametric test to determine the correlation coefficient - the Pearson correlation coefficient.

Pearson’s Correlation

Assumptions of the Test

o Random sample from the populations
o Both variables are approximately normally distributed
o Measurement of both variables is on an interval or ratio scale
o The relationship between the 2 variables, if it exists, is linear.
  - Thus, before doing any correlation, plot the relationship to see if it’s linear!

To calculate the Pearson’s correlation coefficient, we use the formula below, where n = sample size.

r = [nΣxy – (Σx)(Σy)] / √{[nΣx² – (Σx)²] [nΣy² – (Σy)²]}

t = r √[(n – 2) / (1 – r²)]

Testing ‘r’

o Calculate t using above formula
o Compare to tabled t-value with n-2 df
o Reject null if calculated value > table value
o But SPSS will do all this for you, so you don’t need to!

Example: The heights and arm spans of 10 adult males were measured in cm. Is there a correlation between these two measurements?

Height (cm)   Arm Span (cm)
171           173
195           193
180           188
182           185
190           186
175           178
177           182
178           182
192           198
202           202

Solution:

Step 1. Plot data
Step 2. Calculate the correlation coefficient
- r = 0.932
Step 3. Test the significance of the relationship
- p = 0.0001

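The height and arm span example can be checked with a short Python sketch. It assumes the common computational form of Pearson's r, r = [nΣxy – (Σx)(Σy)] / √{[nΣx² – (Σx)²][nΣy² – (Σy)²]}, and the usual test statistic t = r√[(n – 2)/(1 – r²)] with n – 2 degrees of freedom. The function names are hypothetical. The critical value 2.306 (two-tailed, α = 0.05, 8 df) is taken from a standard t table and is also an assumption about the significance level wanted.

```python
import math

# Heights and arm spans (cm) of the 10 adult males in the example.
heights = [171, 195, 180, 182, 190, 175, 177, 178, 192, 202]
arm_spans = [173, 193, 188, 185, 186, 178, 182, 182, 198, 202]


def pearson_r(x, y):
    """Pearson correlation coefficient from raw sums (n = sample size)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


def t_statistic(r, n):
    """t value for testing r against zero; undefined when r is exactly +/-1."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


r = pearson_r(heights, arm_spans)
t = t_statistic(r, len(heights))
critical_t = 2.306  # tabled value, two-tailed, alpha = 0.05, df = 8

print(f"r = {r:.3f}, t = {t:.2f}, df = {len(heights) - 2}")
print("reject null" if abs(t) > critical_t else "do not reject null")
```

The computed r should be close to the 0.932 reported in Step 2. A statistics package would be needed to get the exact p value.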
Spearman’s Test

Like most non-parametric tests, the data are first ranked from smallest to largest.
In this case, each column is ranked independently of the other.

Then:

(1) subtract each rank from the other
(2) square the difference
(3) sum the values, and
(4) plug into the following formula to calculate the Spearman correlation coefficient.

Calculating Spearman’s correlation coefficient

ρ = 1 – 6Σd² / [n(n² – 1)]

where d = the difference between the two ranks of an observation and n = sample size.

Testing ‘r’

o The null hypothesis for a Spearman’s correlation test is also that:
  ρ = 0; i.e., H0: ρ = 0; HA: ρ ≠ 0
o When we reject the null hypothesis we can accept the alternative hypothesis that there is a correlation, or relationship, between the two variables.
o Calculate t using above formula
o Compare to tabled t-value with n-2 df
o Reject null if calculated value > table value
o But SPSS will do all this for you, so you don’t need to!

Example: The mass (in grams) of 13 adult male tuataras and the size of their territories (in square meters) were measured. Are territory size and the size of the adult male tuatara related?

Solution:

Observation number   Mass (g)   Territory size (m²)
1                    510        6.9
2                    773        20.6
3                    840        17.2
4                    505        6.7
5                    765        20
6                    780        24.1
7                    235        1.5
8                    790        13.8
9                    440        1.7
10                   435        2.1
11                   815        20.2
12                   460        3.0
13                   697        10.3

Step 1. Plot the data
NOTE: not very linear
Step 2. Calculate the correlation coefficient
Step 3. Test the significance of the relationship
ρ = 0.835, p = 0.001

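The four ranking steps can be followed in Python on the tuatara data. This sketch assumes the usual rank-difference formula ρ = 1 – 6Σd² / [n(n² – 1)], which is exact only when there are no tied values (true for this table). The function names are hypothetical. With the table as transcribed here, the sketch gives a value near 0.86, close to but not identical to the ρ = 0.835 reported in the solution; the difference may come from the original data or from the software used.

```python
def ranks(values):
    """Rank from smallest (1) to largest; assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(x, y):
    """Spearman correlation from rank differences (no ties assumed)."""
    n = len(x)
    # Steps (1)-(3): difference of ranks, squared, then summed.
    sum_d_squared = sum(
        (rank_x - rank_y) ** 2 for rank_x, rank_y in zip(ranks(x), ranks(y))
    )
    # Step (4): plug into the formula.
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


# Mass (g) and territory size (square meters) of the 13 tuataras.
mass = [510, 773, 840, 505, 765, 780, 235, 790, 440, 435, 815, 460, 697]
territory = [6.9, 20.6, 17.2, 6.7, 20, 24.1, 1.5, 13.8, 1.7, 2.1, 20.2, 3.0, 10.3]

rho = spearman_rho(mass, territory)
print(f"rho = {rho:.3f}")
```

Each column is ranked independently, as the notes describe, before the ranks are paired up by observation.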
Linear Regression

o Here we are testing a causal relationship between the two variables.
o We are hypothesizing a functional relationship between the two variables that allows us to predict a value of the dependent variable, y, corresponding to a given value of the independent variable, x.

Regression

o Unlike correlation, regression does assume a causal direction, from x to y.
o An independent and a dependent variable can be identified in this situation.
  - This is most often seen in experiments, where you experimentally assign the independent variable, and measure the response as the dependent variable.
o Thus, the independent variable is not normally distributed (indeed, it has no variance associated with it!) - as it is usually selected by the investigator.

For a linear regression, this can be written as:

o μy = α + βx (or y = mx + b)
o where μy = population mean value of y at any value of x
o α = the population (y) intercept, and
o β = population slope.

You can use this equation to make predictions - although of course these are usually estimated by sample statistics rather than population parameters.

Linear Regression Assumptions

1. The independent variable (X) is fixed and measured without error – no variance.
2. For any value of the independent variable (X), the dependent variable (Y) is normally distributed, and the population mean of these values of y, μy, is:

   μy = α + βx

3. For any value of x, any particular value of y is:
   i. yi = α + βxi + ei
   ii. where e, the residual, is the amount by which any observed value of y differs from the mean value of y (analogous to “random error”)
   iii. Residuals will follow a normal distribution with a mean of 0
4. The variances of the y variable for all values of x are equal
5. Observations are independent – each individual is measured only once.

Estimating the Regression Function and Line

o A regression line always passes through the point: “mean x, mean y”.

Estimating the Regression Function and Line

o To measure total error, you want to sum the residuals (here, the differences between each y and the mean of y)… but they will cancel out… so you must square the differences, then sum.
o Now we have the TOTAL SUM OF SQUARES (SSt)
o The sum of squares of the residuals is thus:

  SSt = Σ(yi – ȳ)²

o Thus, you see a lot of variance in y when x is not taken into account. How much of the variance in y can be attributed to the relationship with x?

o This “line of best fit” minimizes the y sum of squares, and accounts for how x, the independent variable, influences y, the dependent variable.
o The differences between the observed values and this “line of best fit” are the residuals – the “error” left over when the relationship is included.
o The sum of squares of these regression residuals is now:

  SSe = Σ(yi – ŷi)²

  where ŷi is the value of y that the line predicts at xi.

o This is equivalent to the ERROR SS (SSe); it is the variance “left over” after the relationship with x has been included.
o To get this best fit line, based on the principles we just went over, you can calculate the slope and the intercept of the best fit line.

Sums of Squares

o SSt - this is the value for sums of squares for y when x is not considered (the total sums of squares)
o SSe - this is the value for the sums of squares of the residuals - in other words, it represents the variance in y that is still present even when x is considered (the error sums of squares)
o SSr – this is the variation in y accounted for by the relationship with x. It can be calculated two ways:
  - by subtraction (SSt – SSe)
  - directly using formula

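The slope, intercept, and three sums of squares can be computed with the standard least-squares estimates, sketched below in Python. The estimates used are slope b = Σ(x – x̄)(y – ȳ) / Σ(x – x̄)² and intercept a = ȳ – b·x̄; they are the usual sample estimates of β and α, assumed here rather than taken from these notes. The data values and function names are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope and intercept (sample estimates of beta, alpha)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # forces the line through the means
    return slope, intercept


def sums_of_squares(x, y, slope, intercept):
    """Return (SSt, SSe, SSr) for the fitted line."""
    mean_y = sum(y) / len(y)
    predicted = [intercept + slope * a for a in x]
    ss_total = sum((b - mean_y) ** 2 for b in y)                # x ignored
    ss_error = sum((b - p) ** 2 for b, p in zip(y, predicted))  # residuals
    ss_regression = sum((p - mean_y) ** 2 for p in predicted)   # direct formula
    return ss_total, ss_error, ss_regression


# Hypothetical data: x is set by the investigator, y is the measured response.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(x, y)
ss_t, ss_e, ss_r = sums_of_squares(x, y, slope, intercept)

print(f"y = {intercept:.3f} + {slope:.3f}x")
print(f"SSt = {ss_t:.3f}, SSe = {ss_e:.3f}, SSr = {ss_r:.3f}")
print(f"SSt - SSe = {ss_t - ss_e:.3f}")  # matches SSr, the subtraction route
```

Both routes to SSr (by subtraction and by the direct formula) agree for a least-squares line, and the fitted line passes through the point (mean x, mean y).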
