
STATISTICAL ANALYSIS AND SOFTWARE APPLICATION

Statistics - is the science of collecting, organizing, summarizing, and analyzing information to draw conclusions or answer
questions. In addition, statistics is about providing a measure of confidence in any conclusions.

NOTES: The first part states that statistics involves the collection of information. The second refers to the
organization and summarization of information. The third states that the information is analyzed to draw
conclusions or answer specific questions. The fourth part states that results should be reported using some measure
that represents how convinced we are that our conclusions reflect reality.

USES OF STATISTICS
• Statistics is important because it enables people to make decisions based on empirical evidence.

• Statistics provides us with tools needed to convert massive data into pertinent information that can be used in
decision making.

• Statistics can provide us information that we can use to make sensible decisions.

(1) Statistics helps in providing a better understanding and an accurate description of nature's phenomena.
(2) Statistics helps in the proper and efficient planning of a statistical inquiry in any field of study.
(3) Statistics helps in collecting appropriate quantitative data.
(4) Statistics helps in presenting complex data in a suitable tabular, diagrammatic and graphic form for an easy and clear comprehension of the data.
(5) Statistics helps in understanding the nature and pattern of variability of a phenomenon through quantitative observations.
(6) Statistics helps in drawing valid inferences, along with a measure of their reliability, about the population parameters from the sample data.
- https://www.emathzone.com/tutorials/basic-statistics/functions-or-uses-of-statistics.html#:~:text=of%20Statistics%20%7C%20eMathZone-,Functions%20or%20Uses%20of%20Statistics,in%20collecting%20appropriate%20quantitative%20data.

2 FIELDS OF STATISTICS
A. Mathematical Statistics- The study and development of statistical theory and methods in the abstract.

B. Applied Statistics- The application of statistical methods to solve real problems involving randomly generated data
and the development of new statistical methodology motivated by real problems. Example branches of Applied
Statistics: psychometrics, econometrics, and biostatistics.

2 MAJOR FIELDS OF STATISTICS


 Descriptive statistics - which describes the properties of sample and population data.
 Inferential statistics - which uses those properties to test hypotheses and draw conclusions.
- https://www.investopedia.com/terms/s/statistics.asp

Data - According to the Merriam Webster dictionary, data are “factual information used as a basis for reasoning,
discussion, or calculation”. Data can be numerical, as in height, or non-numerical, as in gender. In either case, data
describe characteristics of an individual. (a set of any kind of information, either numerical or non-numerical)

TYPES OF DATA

1. Categorical Data – represents characteristics, such as a person's gender or language. Categorical data can also take on numerical values (example: 1 for female and 0 for male), but note that those numbers don't have mathematical meaning.

- summarized through frequencies, proportions and percentages (to visualize it we can use pie and bar charts).

2. Nominal Data (qualitative) - represent discrete units and are used to label variables that have no quantitative value. Just think of them as "labels". Note that nominal data have no order; if you change the order of their values, the meaning does not change. We can use frequencies, proportions and the mode to summarise nominal data.

3. Ordinal Data (qualitative) - represent discrete and ordered units. It is therefore nearly the same as nominal data, except that its ordering matters. We can additionally use percentiles, the median and the interquartile range to summarise ordinal data.
4. Numerical Data:
A. Discrete Data – values are distinct and separate; we speak of discrete data if the data can only take on certain values. This type of data can't be measured but it can be counted. It basically represents information that can be categorized into a classification. An example is the number of heads in 100 coin flips.

B. Continuous Data - represent measurements; their values can't be counted but they can be measured. An example would be the height of a person, which you can describe by using intervals on the real number line. To visualize it we can use box plots and histograms.

5. Interval Data (quantitative) – represent ordered units that have the same, known difference between them. We speak of interval data when we have a variable that contains numeric values that are ordered and where we know the exact differences between the values. An example would be a feature that contains the temperature of a given place.

We can add and subtract interval values, but we cannot multiply, divide or calculate ratios. Because there is no true zero, a lot of descriptive and inferential statistics can't be applied.

6. Ratio Data (quantitative) - ordered units that have the same difference, with the addition of an absolute zero. Ratio values are the same as interval values, except that they do have an absolute zero. Good examples are height, weight, length etc.
- https://towardsdatascience.com/data-types-in-statistics-347e152e8bee
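The interval-versus-ratio distinction above can be seen with a little arithmetic. The sketch below uses hypothetical values: height ratios survive a change of units, but temperature ratios do not, because the Celsius zero point is arbitrary.

```python
def c_to_f(celsius):
    """Convert Celsius to Fahrenheit (both are interval scales)."""
    return celsius * 9 / 5 + 32

# Height is ratio data: 200 cm really is twice 100 cm, in any unit.
print(200 / 100)                     # 2.0 in centimetres
print((200 / 2.54) / (100 / 2.54))   # still 2.0 after converting to inches

# Temperature is interval data: "20 degrees is twice 10 degrees"
# breaks as soon as we change units, because zero is arbitrary.
print(20 / 10)                       # 2.0 in Celsius
print(c_to_f(20) / c_to_f(10))       # 68/50 = 1.36 in Fahrenheit
```

The same check explains why means and differences are meaningful for interval data while ratios are reserved for ratio data.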

SOURCES OF DATA

1. Primary Sources - Provide a first-hand account of an event or time period and are considered to be authoritative.
Represent original thinking, reports on discoveries or events and share new information. Often these sources are
created at the time the events occurred but they can also include sources that are created later. They are usually the
first formal appearance of original research.

Primary Data - are data documented by the primary source. The data collectors documented the data
themselves.

2. Secondary Sources - offer an analysis, interpretation or a restatement of primary sources and are considered to be
persuasive. They often involve generalization, synthesis, interpretation, commentary or evaluation in an attempt to
convince the reader of the creator's argument. They often attempt to describe or explain primary sources.

Secondary Data - are data documented by a secondary source.

SCALES OF MEASUREMENT

1. Nominal Scale. Nominal variables (also called categorical variables) can be placed into categories. They don’t have a
numeric value and so cannot be added, subtracted, divided or multiplied. They also have no order; if they appear to
have an order then you probably have ordinal variables instead. (ex. Type of school (private, public))

2. Ordinal Scale. The ordinal scale contains things that you can place in order. For example, hottest to coldest, lightest
to heaviest, richest to poorest. Basically, if you can rank data by 1st, 2nd, 3rd place (and so on), then you have data
that’s on an ordinal scale. (ex. Food preferences, stage of disease, socioeconomic class (upper, middle, lower))

3. Interval Scale. An interval scale has ordered numbers with meaningful divisions. Temperature is on the interval scale:
a difference of 10 degrees between 90 and 100 means the same as 10 degrees between 150 and 160. Compare that to
high school ranking (which is ordinal), where the difference between 1st and 2nd might be .01 and between 10th and
11th .5. If you have meaningful divisions, you have something on the interval scale. (Ex. IQ level, high IQ, average IQ,
lower IQ)

4. Ratio Scale. The ratio scale is exactly the same as the interval scale with one major difference: zero is meaningful.
For example, a height of zero is meaningful (it means you don’t exist). Compare that to a temperature of zero, which
while it exists, it doesn’t mean anything in particular (although admittedly, in the Celsius scale it’s the freezing point for
water). (Ex. Height & weight, time)

- https://www.statisticshowto.com/probability-and-statistics/descriptive-statistics/scales-of-measurement/

METHODS OF GATHERING DATA

Steps in Data Gathering:

1. Set the objectives for collecting data.
2. Determine the data needed based on the set objectives.
3. Determine the method to be used in data gathering and define the comprehensive data collection points.
4. Design data gathering forms to be used.
5. Collect data.

Top six data collection methods:

1. Interviews
2. Questionnaires and surveys
3. Observations
4. Documents and records
5. Focus groups
6. Oral histories

- https://www.jotform.com/data-collection-methods/

BASIC TERMINOLOGIES OF STATISTICS

1. Sample - used in statistical testing when population sizes are too large for the test to include all possible members or observations.
2. Population - the pool of individuals from which a statistical sample is drawn for a study. Thus, any selection of individuals grouped together by a common feature can be said to be a population.
3. Parameter - a number that summarizes data for an entire population.
4. Variables - a characteristic of a unit being observed that may assume more than one of a set of values to which a numerical measure or a category from a classification can be assigned (e.g. income, age, weight, etc., and "occupation", "industry", "disease", etc.).
5. Probability - Probability is a mathematical tool used to study randomness. It deals with the chance (the
likelihood) of an event occurring.
6. Data - factual information used as a basis for reasoning, discussion, or calculation.

METHODS IN PRESENTING DATA


1. Tabulation - Tables are devices for presenting data simply from masses of statistical data. Tabulation is the first step
before data is used for analysis. Tabulation can be in form of Simple Tables or Frequency distribution table (i.e., data is
split into convenient groups).
2. Charts and Diagrams - They are useful methods in presenting simple statistical data. Diagrams are better retained in
the memory than statistical tables.
The methods used are:
(a) Bar Charts - They are merely a way of presenting a set of numbers by the length of a bar. The bar chart can be
simple, multiple or component type.
(b) Histogram - It is a pictorial diagram of frequency distribution. It consists of a series of blocks. The class intervals are
given along the horizontal axis and the frequencies along the vertical axis.
(c) Frequency Polygon - A frequency distribution may be represented diagrammatically by the frequency polygon. It is obtained by joining the mid-points of the histogram blocks.
(d) Line Diagram - Line diagrams are used to show the trend of events with the passage of time.
(e) Pie Charts - Instead of comparing the length of a bar, the areas of segments of a circle are compared. The area of
each segment depends upon the angle.
(f) Pictogram - The pictogram is a popular method of presenting data to the “man in the street” and to those who cannot understand orthodox charts. Small pictures or symbols are used to present the data.
3. Statistical Maps - When statistical data refer to geographic or administrative areas, it is presented either as “Shaded
Maps” or “Dot Maps” according to suitability.
4. Statistical Averages - The term “average” implies a value in the distribution, around which the other values are
distributed. It gives a mental picture of the central value.
The types of averages used are:
(i) The Mean (Arithmetic Mean):
To obtain the mean, the individual observations are first added together (summation, denoted by ‘Σ’) and then divided by the number of observations (n). The mean is denoted by the sign X̅ (called “X bar”):

X̅ = Σx / n

(ii) The Median:
It is an average of a different kind, which does not depend upon the total and number of items. To obtain the median, the data are first arranged in ascending or descending order of magnitude, and then the value of the middle observation is located.

(iii) The Mode:
It is the most frequent item in series of observations.
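The three averages described above can be computed directly with Python's standard statistics module; the observations below are hypothetical.

```python
import statistics

observations = [4, 7, 7, 2, 9, 7, 5]  # hypothetical sample

mean = statistics.mean(observations)      # sum of values / number of values
median = statistics.median(observations)  # middle value after sorting
mode = statistics.mode(observations)      # most frequent value

print(mean, median, mode)
```

Note how the median (7) ignores how extreme the largest and smallest values are, while the mean uses every observation.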

5. Measures of Dispersion:
(a) The Range - The range is by far the simplest measure of dispersion. It is defined as the difference between the
highest and lowest figures in a given sample. If we have grouped data, the range is taken as the difference between the
midpoints of the extreme categories.
(b) The Mean Deviation - It is the average of the deviations from the arithmetic mean.
(c) The Standard Deviation - It is the most frequently used measure of dispersion. In simple terms, it is defined as the “Root-Mean-Square Deviation”. It is denoted by the Greek letter σ (sigma).
It is calculated by the formula:

σ = √( Σ(x − X̅)² / n )

When the sample size is more than 30, the above basic formula may be used without modification. For smaller samples, the above formula tends to underestimate the standard deviation, and therefore needs correction, i.e., use n − 1 instead of n. The meaning of the standard deviation can only be appreciated fully when we study it with reference to the “normal curve”. The larger the standard deviation, the greater the dispersion of values about the mean.
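The n versus n − 1 correction described above can be sketched in Python; the small sample below is hypothetical.

```python
import math
import statistics

sample = [12, 15, 11, 18, 14]  # hypothetical small sample (n < 30)

n = len(sample)
mean = sum(sample) / n

# Population formula: divide by n (adequate when n is large, e.g. > 30).
sd_n = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)

# Sample formula: divide by n - 1 to correct the underestimate.
sd_n1 = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))

print(sd_n, sd_n1)
# The statistics module implements both conventions:
print(statistics.pstdev(sample), statistics.stdev(sample))
```

The corrected value is always slightly larger, and the gap shrinks as n grows.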
(d) Normal Distribution - The normal distribution or normal curve is an important concept in statistical theory. The
shape of the curve will depend upon the mean and standard deviation which in turn will depend upon the number and
nature of observation.
It is important to note that:
i. The area between one S.D. on either side of the mean will include approximately 68% of the values in the distribution.

ii. The area between two S.D. on either side of the mean will cover most of the values i.e., approximately 95% of the
values.

iii. The area between three S.D. on either side of the mean will include approximately 99.7% of the values. These limits on either side of the mean are called “confidence limits”.

(e) Standard Normal Curve - Although there is an infinite number of normal curves depending upon the mean and S.D., there is only one standard normal curve. It is a smooth, bell-shaped, perfectly symmetrical curve based on an infinitely large number of observations. The total area of the curve is 1, its mean is 0 and its S.D. is 1. The mean, median and mode all coincide. The distance of a value (x) from the mean (X̅) of the curve in units of S.D. is called the “relative deviate or standard normal variate” and is usually denoted by Z.
The standard normal deviate or Z is given by the formula:

Z = (x − X̅) / σ
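The Z calculation and the 68-95-99.7 areas above can be checked numerically. The sketch below uses Python's math.erf for the area under the standard normal curve; the IQ figures (mean 100, S.D. 15) are illustrative assumptions.

```python
import math

def z_score(x, mean, sd):
    """Standard normal deviate: distance from the mean in S.D. units."""
    return (x - mean) / sd

def area_within(z):
    """P(-z < Z < z) for the standard normal curve, via the error function."""
    return math.erf(z / math.sqrt(2))

print(z_score(130, 100, 15))     # an IQ of 130 is 2.0 S.D. above the mean
print(round(area_within(1), 3))  # 0.683 -> the "68%" rule
print(round(area_within(2), 3))  # 0.954 -> the "95%" rule
print(round(area_within(3), 3))  # 0.997 -> the "99.7%" rule
```

This is why the three S.D. limits in the list above are quoted as approximately 68%, 95% and 99.7%.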
6. Sampling - When the population is too large for every individual, item or unit to be studied, we take a sample.
The commonly used sampling methods are:
(a) Simple Random Sample - This is done by assigning a number to each of the units (the individuals) in the sampling frame. Random numbers are a haphazard collection of certain numbers, arranged in a running manner to eliminate personal selection or unconscious bias in taking out the sample. This technique provides the greatest number of possible samples.
(b) Systematic Random Sample - This is done by picking every 5th or 10th unit at regular intervals. By this method, each unit in the sampling frame would have the same chance of being selected, but the number of possible samples is greatly reduced.
(c) Stratified Random Sample - The sample is deliberately drawn in a systematic way so that each portion of the sample
represents a corresponding strata of the universe. This method is particularly useful where one is interested in analysing
the data by a certain characteristic of the population e.g., Hindus, age-groups etc.
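The three sampling methods above can be sketched with Python's random module. The sampling frame of 100 numbered units and the three age strata are hypothetical.

```python
import random

random.seed(42)  # fixed seed so the example is reproducible
frame = list(range(1, 101))  # hypothetical sampling frame of 100 units

# (a) Simple random sample: every unit has an equal chance of selection.
simple = random.sample(frame, 10)

# (b) Systematic random sample: a random start, then every k-th unit.
k = 10
start = random.randrange(k)
systematic = frame[start::k]

# (c) Stratified random sample: sample within each stratum separately.
strata = {"young": frame[:40], "middle": frame[40:80], "old": frame[80:]}
stratified = {name: random.sample(units, 3) for name, units in strata.items()}

print(simple)
print(systematic)
print(stratified)
```

Note how the systematic sample is fully determined once the starting point is drawn, which is why the number of possible samples is so much smaller than with simple random sampling.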
Sampling Errors - If we take repeated samples from the same population or universe, the results obtained from one
sample will differ to some extent from the results of another sample. This type of variation from one sample to another
is called sampling error. It occurs because data were gathered from a sample rather than from the entire population of
concern. The factors which influence the sampling error are size and variability of individual readings.
Non-Sampling Errors - Errors may occur due to inadequately calibrated instruments, observer variation, incomplete coverage achieved in examining the subjects selected, and conceptual errors.
Standard Error - If we take a random sample from the population, and similar samples over and over again, we will find that every sample has a different mean. The S.D. of these means is a measure of the sampling error and is given by the formula σ/√n, which is called the standard error or the standard error of the mean.
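The claim that the S.D. of repeated sample means approaches σ/√n can be checked by simulation. The population below is hypothetical (normally distributed with mean 50 and S.D. 10, so the predicted standard error for n = 25 is 10/√25 = 2).

```python
import random
import statistics

random.seed(0)
# Hypothetical universe: 100,000 values with mean ~50 and S.D. ~10.
population = [random.gauss(50, 10) for _ in range(100_000)]

n = 25
sample_means = [
    statistics.mean(random.sample(population, n)) for _ in range(1000)
]

# The S.D. of the sample means approximates sigma / sqrt(n).
observed_se = statistics.stdev(sample_means)
predicted_se = statistics.pstdev(population) / n ** 0.5

print(round(observed_se, 2), round(predicted_se, 2))  # both close to 2
```

Larger samples (bigger n) shrink the standard error, which is the formal reason bigger studies give more precise estimates of the mean.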
7. Tests of Significance:
These are:
(a) Standard Error of the Mean - The S.D. of the distribution of sample means about the true mean of the universe is called the standard error of the mean. It is used to set confidence limits and find out the level of significance.
(b) Standard Error of Proportion - Here, instead of a mean, the proportion observed in a sample is used.
It is calculated by the formula:

S.E. (p) = √(pq / n)

where p and q are the proportions (q = 1 − p) and n = size of the sample.

(c) Standard Error of Difference Between Two Means - To compare the results between two groups (e.g., control group and experimental group), the difference between the means of the two groups is tested to indicate whether the samples represent two different universes. This is done by calculating the standard error of the difference between the two means.
The formula is:

S.E. (d) = √( SD1²/n1 + SD2²/n2 )
(d) Standard Error of Difference Between Proportions - Instead of means, sometimes one has to test the significance of the difference between two proportions or ratios, to find out if the difference has occurred by chance.
In this case, we calculate the standard error of the difference between two proportions:

S.E. (p1 − p2) = √( p1q1/n1 + p2q2/n2 )
(e) Chi-Square Test - The chi-square (χ²) test offers an alternative method of testing the significance of the difference between two proportions. It has the advantage that it can be used when more than two groups are to be compared.
(i) Test the ‘Null Hypothesis’ - First, one has to set up a hypothesis, called the Null Hypothesis that there was no
difference between the findings of the two groups.
(ii) Applying the χ² Test - the test statistic is χ² = Σ (O − E)² / E, where O is the observed and E the expected frequency in each cell.
(iii) Finding the Degree of Freedom:
d.f. = (c − 1)(r − 1)

(c = number of columns, r = number of rows)

(iv) Probability Tables - Knowing the χ² and d.f. values, the probability is found from published tables.
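Steps (i) to (iii) can be worked by hand for a hypothetical 2 × 2 table (recovery yes/no in two treatment groups), computing the expected frequencies under the null hypothesis and summing (O − E)²/E over the cells.

```python
# Hypothetical 2x2 contingency table: recovered / not recovered.
observed = [[30, 20],   # group A
            [18, 32]]   # group B

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected frequency under the null hypothesis of no difference.
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_sq += (obs - expected) ** 2 / expected

df = (len(col_totals) - 1) * (len(row_totals) - 1)  # (c-1)(r-1)
print(round(chi_sq, 2), df)  # compare chi_sq with the table value for this df
```

Step (iv) is then a table lookup: the computed χ² is compared with the published critical value for the given degrees of freedom to decide whether to reject the null hypothesis.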
(f) Correlation and Regression:
Inferential Statistics:
These assess the meaning of the data e.g.,:
i. Correlation Coefficient - Measures the statistical relationship between two sets of variables, without assuming that
either is dependent or independent. C.C. of 1.0 implies exact similarity and C.C. of 0.0 means no relationship.
ii. Regression Coefficient - Measures relationship between two sets of variables but assumes that one is dependent and
the other is independent.
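The two coefficients above can be computed from first principles with the usual sums of squares and cross-products; the paired observations below are hypothetical.

```python
import math

# Hypothetical paired observations (x, y).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)

# Correlation coefficient: symmetric, neither variable is "dependent".
r = sxy / math.sqrt(sxx * syy)

# Regression coefficient (slope): assumes y depends on x.
b = sxy / sxx
a = mean_y - b * mean_x  # intercept of the line y = a + b*x

print(round(r, 3), round(b, 3), round(a, 3))
```

The asymmetry described in the text shows up in the formulas: r treats x and y identically, while the slope b divides only by the variability of the independent variable x.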
iii. Parametric Statistics - Assume a normal distribution (e.g., Student’s t-test). Non-parametric statistics use data which are not normally distributed (e.g., the chi-square test).
iv. Factor Analysis - Looks for the minimum number of dimensions which can be used to define a group. This will generate dimensions (e.g., psychotic-neurotic). Factors are an expression of the relationship between attributes, not between individuals.
v. Cluster Analysis - Can only generate clusters not dimensions.
Reliability - The extent to which there is repeatability of an individual’s score or other test result.
It is of following types:
(a) Test Retest Reliability - High correlation between scores on the same test given on two occasions.
(b) Alternate Form Reliability - High correlation between two forms of the same test.
(c) Split Half Reliability - High correlation between two halves of the same test.
(d) Inter Rater Reliability - High correlation between results of two or more raters of the same test.
Validity:
The extent to which a test measures what it is designed to measure:
(a) Predictive Validity - Ability of the test to predict outcome.
(b) Content Validity - Whether the test selects a representative sample of the total tests for that variable.
(c) Construct Validity - How well the experiment tests the hypothesis underlying it.
Reliability Paradox - A very reliable test may have low validity precisely because its results do not change i.e., it does not
measure true changes.
Measurement in Psychiatric Research:
Aims may be:
i. To identify psychiatric cases

ii. To diagnose psychiatric disorder accurately

iii. To assess severity and change in severity.

Information is collected by document studies (case notes, journal articles, census etc.), mail questionnaires (cheap and easy but low response rate, hence sample bias), self-rating questionnaires (may be answered inaccurately) and observer-rated interviews (structured, semi-structured or informal; these allow great flexibility and accuracy but are expensive and need training).

Source of Errors:
(a) Response Set - Subject always tends either to agree or disagree with questions.
(b) Bias towards Centre - Subject tends to choose the middle response and shun extremes.
(c) Social Acceptability - Subject chooses the acceptable answer rather than the true one.
(d) Halo Effect - Answers are chosen to ‘fit’ with previously chosen answers; responses become what is expected by the
observer.
(e) Hawthorne Effect - Researchers alter the situation by their presence.
Variables:
Variables are any constructs or events which research studies.

These are of following types:


(a) Independent Variable - The antecedent condition manipulated by the experimenter (e.g., drug levels).
(b) Dependent Variable - The variable used to measure the effect of the independent variable (e.g., recovery rates).
(c) Confounding variable - Any extraneous variable whose potential influence on the dependent variable has not been
controlled for. A source of error (e.g., age and sex imbalance).
(d) Controlled Variable - A variable whose influence is kept constant by the experiment (e.g., other medications).
(e) Uncontrolled Variable - A variable which is not manipulated or held constant, though it may be measured (e.g., life
events).
