Introduction:
Module Framework:
https://learn.sun.ac.za/pluginfile.php/2776931/mod_resource/content/3/SDS188%20-
%20Information%20Document%20-%202021.pdf
Table of Contents
CHAPTER ONE: VARIABLES AND SURVEYS
CHAPTER TWO: ORGANISING AND VISUALIZING VARIABLES
CHAPTER THREE: DESCRIPTIVE SUMMARIES
CHAPTER FOUR: PROBABILITY
CHAPTER FIVE: PROBABILITY DISTRIBUTION
CHAPTER SIX: NORMAL DISTRIBUTION AND OTHER CONTINUOUS DISTRIBUTIONS
CHAPTER ONE:
VARIABLES AND SURVEYS
What is Statistics?
• Statistics is the collection of methods that allow one to work with data effectively.
• Statistics is a TOOL to obtain INFORMATION from DATA.
• It provides us with a formal basis to summarise and visualise data, reach conclusions
about the data, make reliable predictions about business activities, and improve the
business process.
Big Data:
- A collection of data that cannot be easily browsed or analysed using traditional
methods.
- Big data are data being collected in massive volumes, at very fast rates (real time),
and in a variety of forms.
- Might refer to large datasets of structured data stored in files or worksheets.
- Big data might be unstructured such that the data have an irregular pattern and
contain values that are not comprehensible without further interpretation.
o Unstructured data could be text, pictures, videos, or audio.
Variable:
• A variable defines a characteristic or property of an item that can vary among the
occurrences of those items.
• Using this definition, data is a set of values associated with one or more variables.
• Notice that each value for a variable is a single fact, not a list of facts.
Statistics:
• Statistics can be defined as the methods that analyse the data of the variables of
interest.
o Descriptive statistics are the methods of organising, summarising, and
presenting data in an informative and convenient way.
o Inferential statistics are the methods used to make a conclusion about a
characteristic of a population, based on a smaller sample of the population.
Sources Of Data:
• Primary
o The person who collects the data is also the one analysing it
• Secondary
o The person analysing the data did not collect it themselves
§ E.g. the data are taken from open-source sites
Variable Classification:
Categorical:
• Qualitative
• Variables take categories as their values, such as "yes", "no", or "blue", "brown",
"green".
Numerical:
• Quantitative
• Variables have values that represent a counted or measured quantity.
o Discrete variables arise from a counting process. Values are countable over a
finite range.
o Continuous variables arise from a measuring process. Values are
uncountable over a finite range.
Scales:
CATEGORICAL: (qualitative)
• Nominal scale
o Classifies categorical data into distinct categories in which no ranking is
implied.
• Ordinal scale
o Classifies categorical data into distinct categories in which ranking is implied.
NUMERICAL: (quantitative)
Numerical variables use an interval scale or ratio scale.
• Interval scale
o is an ordered scale in which the difference between measurements is a
meaningful quantity but the measurements do not have a true zero point.
• Ratio scale
o is an ordered scale in which the difference between the measurements is a
meaningful quantity and the measurements have a true zero point.
POPULATION VS SAMPLE:
Parameters are numbers that summarise data for an entire population. Statistics are
numbers that summarise data from a sample.
- A population parameter summarises the value of a specific variable for a population.
- A sample statistic summarises the value of a specific variable for sample data.
- Sample statistics are used to estimate population parameters.
Population:
• The entire group
Sample:
• A portion of the entire group
Pros of sampling:
• Less time consuming
• Less costly
• More practical
SAMPLING:
SAMPLING FRAMES:
- The sampling frame is a listing of items that make up the population.
- Frames are data sources such as population lists, directories, or maps.
- Inaccurate or biased results can result if a frame excludes certain groups or portions
of the population.
- Using different frames to generate data can lead to dissimilar conclusions
- The frame is the data set from which your samples are to be drawn
- It needs to be a true representation of the population
o Beware of sample bias
N = Population size
n = Sample size
NONPROBABILITY SAMPLE:
- Items included are chosen without regard to their probability of occurring
o Convenience sampling:
§ Items are chosen because they are easy or convenient to sample
o Judgement sampling:
§ Experts from a field give their opinions
PROBABILITY SAMPLES:
- Items are chosen on the basis of known probabilities
o Simple Random Sample:
§ Easiest to do, convenient, most commonly used
§ Randomly selected samples from within the frame/population
§ Every individual or item from the frame has an equal chance of being
selected.
• Selection may be with replacement (selected individual is
returned to frame for possible reselection) or without
replacement (selected individual is not returned to the frame).
• E.g. samples obtained from a table of random numbers or
computer random number generators.
o Systematic Sample:
§ Decide on the sample size, n
§ Compute k = N/n
§ Split the N items in the frame into n groups of k items each
§ Randomly select one item from the first group, then select every kth
item thereafter
o Stratified Sample:
§ N can be divided into sub-groups (strata) according to a common
characteristic
• Minimum number of strata = 2
§ A simple random sample is selected from each subgroup
• with sample sizes proportional to strata sizes
§ Samples from the strata are combined into one group
• A common technique for sampling voters: stratifying across
socio-economic or provincial lines
• Fewer sampled items are needed when using stratified samples
o Cluster Sample:
§ The population is divided into clusters, and a simple random sample of clusters is selected
§ E.g. randomly selected polling stations for exit polls
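The four probability sampling methods above can be sketched in Python. This is a minimal illustration, not a prescribed method from the module: the frame of 40 numbered items, the strata and cluster names, and all function names are hypothetical, and the systematic sample assumes N is at least n.

```python
import random

def simple_random_sample(frame, n, replace=False, rng=random):
    """Every item in the frame has an equal chance of being selected."""
    if replace:
        # With replacement: a selected item can be picked again.
        return [rng.choice(frame) for _ in range(n)]
    # Without replacement: each item appears at most once.
    return rng.sample(frame, n)

def systematic_sample(frame, n, rng=random):
    """Random start within the first k items, then every k-th item."""
    k = len(frame) // n                 # k = N / n, rounded down
    start = rng.randrange(k)
    return [frame[start + i * k] for i in range(n)]

def stratified_sample(strata, n, rng=random):
    """Simple random sample from each stratum, proportional to its size."""
    total = sum(len(items) for items in strata.values())
    sample = []
    for items in strata.values():
        # Rounding means the sizes may not always add up to exactly n.
        size = round(n * len(items) / total)
        sample.extend(rng.sample(items, size))
    return sample

def cluster_sample(clusters, n_clusters, rng=random):
    """Randomly select whole clusters and keep every item inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]

rng = random.Random(188)                       # fixed seed for repeatability
frame = list(range(1, 41))                     # hypothetical frame, N = 40
strata = {"urban": frame[:30], "rural": frame[30:]}
clusters = {"north": frame[:10], "south": frame[10:20],
            "east": frame[20:30], "west": frame[30:]}

srs = simple_random_sample(frame, 8, rng=rng)
systematic = systematic_sample(frame, 8, rng=rng)
stratified = stratified_sample(strata, 8, rng=rng)
clustered = cluster_sample(clusters, 2, rng=rng)
```

With N = 40 and n = 8, the systematic sample uses k = 5, and the stratified sample draws 6 urban and 2 rural items because the strata hold 30 and 10 items.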
BENEFITS OF EACH:
Simple random sample and Systematic sample:
- Simple to use.
- May not be a good representation of the population's underlying characteristics.
Stratified sample:
- Ensures representation of individuals across the entire population.
Cluster sample:
- More cost effective.
- Less efficient (need larger sample to acquire the same level of precision).
DATA CLEANING:
Data cleaning corrects irregularities in the data:
- Invalid variable values, including:
o Non-numerical data for numerical variable.
o Invalid categorical values for a categorical variable.
o Numeric values outside a defined range.
- Coding errors, including:
o Inconsistent categorical values.
o Inconsistent case for categorical values.
o Unrelated / Unwanted characters.
- Data integration errors, including:
o Redundant columns.
o Duplicated rows.
o Differing column lengths.
o Different units of measure or scale for numerical variables
SEMI-AUTOMATIC CLEANING:
- Invalid variable values can be identified by simple scanning techniques, for example:
o Non-numeric entries for numerical variables.
o Values for categorical variables that don't match a pre-defined category.
o Values for a numeric variable outside a pre-defined explicit range.
o Features exist in Excel to assist in these tasks.
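The notes point to Excel for these scans; as a rough equivalent, the same three checks can be sketched in Python. The column names ("category", "amount"), the allowed categories, and the range 0 to 100 are hypothetical choices for illustration only.

```python
def find_invalid(rows, valid_categories, low, high):
    """Return (row_index, column, value) for every value that fails a check."""
    problems = []
    for i, row in enumerate(rows):
        # Compare in lower case so inconsistent case is not flagged here.
        category = row["category"].strip().lower()
        if category not in valid_categories:
            problems.append((i, "category", row["category"]))
        try:
            amount = float(row["amount"])
        except ValueError:
            # Non-numeric entry for a numerical variable.
            problems.append((i, "amount", row["amount"]))
            continue
        if not low <= amount <= high:
            # Numeric value outside the pre-defined range.
            problems.append((i, "amount", row["amount"]))
    return problems

rows = [
    {"category": "Yes", "amount": "12.5"},
    {"category": "maybe", "amount": "7"},
    {"category": "no", "amount": "abc"},
    {"category": "no", "amount": "250"},
]
problems = find_invalid(rows, {"yes", "no"}, low=0, high=100)
```

The scan only flags suspect values; deciding how to correct them still needs human judgement.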
PROBLEMS:
- Coding errors
- Data integration errors from combining two different computerised data sources
- Missing values or values not collected for a variable
o Hence, data cleaning can never be fully automated
SURVEY WORTHINESS:
- Purpose
- Is it from a probability sample?
- Coverage error, appropriate frame?
- Nonresponse error: follow up
- Measurement error
o Good questions elicit good responses
- Sampling error
o Always exists
SURVEY ERRORS:
- Coverage or Selection Bias:
o If someone/thing is excluded and has no chance of selection
- Nonresponse Error or Bias:
o People who do respond may differ from those who choose not to
- Sampling Error:
o Variation from sample to sample will always exist
- Measurement Error:
o Due to weaknesses in question design and/or respondent error
ETHICAL ISSUES:
- Coverage error and nonresponse error can be leveraged by survey designers
o To purposely bias results
- Sampling error can be an ethical issue
o if the findings are purposely not reported with the associated margin of error.
- Measurement error can be an ethical issue:
o Survey sponsor chooses leading questions.
o Interviewer purposely leads respondents in a particular direction.
o Respondent(s) wilfully provide false information.
EXCEL SAMPLING:
Simple Random Sample with Replacement:
1. Make sure the Data Analysis tool is installed.
2. Click the Data ribbon and select Data Analysis.
3. Select Sampling and click OK.
4. Set the Input Range equal to A3:A8.
5. Since the input range does not contain a column name,
a. make sure Labels is unchecked.
6. Set the Sampling Method to Random,
a. and type 4 in the Number of Samples box.
7. Under Output options,
a. click the Output Range option and type E3.
8. Click OK.
9. Then work with the index function to return the chosen samples
a. E.g., To return the 340 in cell F3,
b. Type the following in cell F3: =INDEX($B$3:$B$8, E3)
Systematic Sampling:
1. Go to Data -> Data Analysis -> Sampling and click OK.
2. Choose k = 2 for a systematic sample
a. by selecting Periodic under Sampling Method and setting the Period equal to 2.
3. In the Output Range box, click cell A10.
a. This produces a systematic sample of the row indices of the population data
frame.
4. As before, we can return the corresponding values of the variables of interest using INDEX().
CHAPTER TWO:
ORGANISING AND VISUALIZING VARIABLES
Summary Table:
- Tallies the frequencies or percentages of items in a set of categories so differences can
be identified
Contingency Table:
- To study patterns between two or more categorical variables
- Cross tabulates tallies
- Looks at joint distributions
- For two variables the tallies for one variable are located in the rows and the tallies for
the second variable are located in the columns.
Categorical Data:
- For one variable:
o Bar Chart
§ Has gaps (vs no gaps in a histogram)
§ Is easiest to understand for the average person
o Pareto Chart
§ To display categorical data on a nominal scale
§ A vertical bar chart
• Categories are shown in descending order of frequency
§ A cumulative polygon is shown in the same graph
• To separate the "vital few" from the "trivial many"
o Pie/Doughnut Chart
§ Percentage of the whole data set = size of pie slice
§ Avoid using where possible
- For two variables (contingency tables):
§ Variable x in the rows
§ Variable y in the columns
o Side-by-side Bar Chart
o Doughnut Chart
Numerical Data:
- Ordered array:
o A sequence of data in rank order
o From smallest to largest value
o Shows range
o Can help identify outliers
- Frequency Distribution:
o Summary table where items are arranged numerically in ordered classes
o Choose appropriate number of class groupings (or bins)
§ establishing boundaries to stop group overlaps
o Usually between 5-15 bins
o Width of class interval = range/ number of bins desired
§ Condense raw data into useful form
§ Fast visual interpretation
§ Easy determination of major info/idea
o Different class boundaries may provide different pictures
o Shifts in data concentration may show up when different class boundaries are
chosen.
o As the size of the data set increases, the impact of alterations in the selection
of class boundaries is greatly reduced.
- When comparing two or more groups with different sample sizes:
o You must use relative frequency or a percentage distribution
§ So that differences in sample size do not create a biased or skewed
perception of the data
- Percentage Polygon:
o is formed by having the midpoint of each class represent the data in that
class
§ then connecting the sequence of midpoints at their respective class
percentages.
o A cumulative percentage polygon, or ogive, displays the variable of interest
along the X axis
§ and the cumulative percentages along the Y axis.
o Useful when there are two or more groups to compare
- Time-Series Plot:
o Used to study patterns in a numeric variable over time
o Numeric variables on the Y-axis (vertical)
§ Time period on the X-axis
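The frequency distribution rule above (class width = range / number of bins) can be sketched in Python. The ten data values and the choice of 4 bins are hypothetical, and the sketch uses the exact width rather than rounding it up to a convenient number as is often done by hand.

```python
def frequency_distribution(values, n_bins):
    """Return (lower, upper, count, percentage) for each class."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins      # class width = range / number of bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land just past the last class, so clamp it.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    total = len(values)
    return [
        (low + i * width, low + (i + 1) * width, c, 100 * c / total)
        for i, c in enumerate(counts)
    ]

values = [12, 15, 17, 21, 22, 25, 28, 30, 31, 32]   # hypothetical data
table = frequency_distribution(values, 4)
counts = [row[2] for row in table]
percentages = [row[3] for row in table]
```

The percentage column is what allows groups with different sample sizes to be compared on the same footing.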
Displays:
- More useful than multidimensional contingency tables
o The data can be shown for many variables
- Multiple numerical variables can be presented in one summarization
- Visualizations can reveal patterns
o That are harder to see in tables
- Bubble Charts:
o Use the size of points to represent the value of an
additional variable
§ In Excel this variable must be numerical
§ In Tableau can be numerical/categorical
- Pivot Charts:
o Visualizes specific categories from a PivotTable Summary
- Tree Maps:
o Visual representations of contingency tables
- Spark Lines:
o A compact Time-Series visualization
o Of Numerical Variables
Best Practices:
- Use simplest possible visualization
- Title and label all axes
- Include a scale
- Begin vertical axis at 0 and use a constant scale
- Avoid 3D or exploded graphics
- Use consistent colouring in graphs that will be compared
- Avoid uncommon charts such as:
o Radar, surface, cone and pyramid charts
CHAPTER THREE:
DESCRIPTIVE SUMMARIES
GENERAL NOTES:
- The more the data are spread out, the greater the range, variance, and standard
deviation.
- The more the data are concentrated, the smaller the range, variance, and standard
deviation.
- If the values are all the same (no variation), all these measures will be zero.
- None of these measures are ever negative
CENTRAL TENDENCY:
- Is the extent to which the values of a numerical variable group around a typical or
central value.
- Mean:
o Most commonly used
- Median:
o Less sensitive to extreme values
- Mode:
o Value that appears most often
o Not affected by extreme values
o There may be no mode
§ Or several
- Geometric Mean:
o Geometric Mean rate of return
o Use for growth rates
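A short Python sketch can make the differences between these measures concrete. The data values and the two yearly returns are hypothetical, and the geometric mean rate of return is computed with the standard formula [(1 + R1)(1 + R2)...(1 + Rn)]^(1/n) - 1, which the notes name but do not write out.

```python
import math
import statistics

values = [3, 5, 5, 6, 8, 9, 41]          # hypothetical data; 41 is an extreme value

mean = statistics.mean(values)            # pulled upwards by the extreme value
median = statistics.median(values)        # middle value of the ordered array
modes = statistics.multimode(values)      # list, because there may be several modes

def geometric_mean_rate(returns):
    """Geometric mean rate of return for a list of period returns."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1

returns = [0.5, -0.5]                     # hypothetical: +50% then -50%
arithmetic_rate = statistics.mean(returns)
geometric_rate = geometric_mean_rate(returns)
```

Here the arithmetic mean of the returns is zero even though the investment has lost a quarter of its value, which is why the geometric mean is preferred for growth rates.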
MEASURES OF VARIATION:
- Range:
o Can be misleading:
§ Does not account for data distribution
§ Is sensitive to outliers
- Variance:
o Approximately the average of the squared deviations of the values from the mean
o Variation is the amount of dispersion, or scattering, of the values of a
numerical variable away from a central value
- Standard Deviation:
o Most commonly used
o Shows variation around the mean
o Is the square root of the variance
o Has the same units as the original data
- Steps for Computing Standard Deviation:
1. Compute the difference between each value and the mean.
2. Square each difference.
3. Add the squared differences.
4. Divide this total by n-1 to get the sample variance.
5. Take the square root of the sample variance to get the sample standard deviation.
- Coefficient of Variation:
o Measures relative variation.
o Always in percentage (%).
o Shows variation relative to mean.
o Can be used to compare the variability of two or more sets of data measured
in different units.
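The five steps for the sample standard deviation, followed by the coefficient of variation, can be traced in Python. The five data values are hypothetical; the coefficient of variation is taken as (standard deviation / mean) x 100%, the usual definition.

```python
import math
import statistics

values = [4, 8, 6, 5, 7]                      # hypothetical sample
n = len(values)
mean = sum(values) / n

deviations = [v - mean for v in values]       # step 1: value minus mean
squared = [d ** 2 for d in deviations]        # step 2: square each difference
total = sum(squared)                          # step 3: add the squared differences
sample_variance = total / (n - 1)             # step 4: divide by n - 1
sample_sd = math.sqrt(sample_variance)        # step 5: take the square root

# Coefficient of variation: variation relative to the mean, as a percentage.
cv = sample_sd / mean * 100

# Cross-check against the standard library's sample standard deviation.
library_sd = statistics.stdev(values)
```

Because the coefficient of variation is unit-free, it can be compared across data sets measured in different units.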
SHAPE OF DISTRIBUTION:
- Describes how data is distributed
- Useful related statistics:
o Skewness:
§ Measures the extent to which data values are not symmetrical
§ Right-skewed: positive
§ Left-skewed: negative
o Kurtosis:
§ Measures the peakedness of the curve of the distribution
• how sharply the curve rises approaching the centre of the
distribution.
Population Variance
- Average of squared deviations of values from
the mean
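Chebyshev's Rule:
- Regardless of how the data are distributed
o At least (1 - 1/k^2) x 100% of the values
§ Will fall within k standard deviations of the mean (for k > 1)
§ E.g. for k = 2, at least 75% of the values lie within 2 standard deviations of the mean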
- The Covariance
o Measures strength of linear relationships
§ Between two numerical variables (X and Y)
o Only concerned with the strength of relationship
o No causal effect is implied
o It is not possible to determine relative strength of a relationship
§ From the size of the covariance
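Two of the ideas above can be sketched in Python: the Chebyshev bound, and why the size of the covariance says little about relative strength. The heights and weights are hypothetical, and the sample covariance formula used (sum of cross-products of deviations divided by n - 1) is the standard one, assumed here because the notes do not write it out.

```python
def chebyshev_bound(k):
    """Minimum percentage of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("the rule applies only for k > 1")
    return (1 - 1 / k ** 2) * 100

def sample_covariance(xs, ys):
    """Sum of cross-products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n - 1)

heights_cm = [160, 170, 180]                  # hypothetical data
heights_m = [h / 100 for h in heights_cm]     # same heights in metres
weights_kg = [55, 65, 75]

cov_cm = sample_covariance(heights_cm, weights_kg)
cov_m = sample_covariance(heights_m, weights_kg)
```

The relationship between height and weight is identical in both cases, yet the covariance is 100 times larger when height is recorded in centimetres, so its size depends on the units chosen.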
CHAPTER FOUR:
PROBABILITY
Probability Principles:
- Probability distributions
- Mathematical expectation
- Binomial and Poisson distributions
Sample Space: S
- S = { x, y, z }
- All possible events
Events:
Simple Event:
- Single characteristic
- A day in January from all days in 2021
Joint Event:
- Two or more characteristics
- A day in January that is also a Wednesday from all days in 2021
Complement of an Event A (denoted A')
- All events that are not part of event A
- All days in 2021 that are not in January
Probability:
- Numerical value representing the likelihood of a certain event occurring
Impossible Event:
- An event that has no chance of occurring
- Probability = 0
Certain Event:
- An event that is going to happen
- Probability = 1
Collectively Exhaustive:
- One of the events must occur
- A set of events covers the entire sample space
Visualizing Probability:
- Venn Diagram:
o A ∪ B
§ A or B
o A ∩ B
§ A and B
- Contingency Tables:
o Alternative: Decision Tree
- Tree Diagrams:
Visualizing Events:
- Collectively Exhaustive:
o Entire sample space exhausted
o Total P = 1
- Mutually Exclusive:
o Either A or B can occur
§ Never both
Simple Probability:
- Probability of a simple event
Joint Probability:
- The probability of the occurrence of
o two or more events
Marginal Probability:
- An unconditional probability
Conditional Probability:
- Probability of an event
o Given that the other event has already occurred
Independent Events:
- The probability of A is not affected if B happens
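These definitions can be illustrated with a small contingency table in Python. The counts are hypothetical, and exact fractions are used so the independence check is not affected by rounding.

```python
from fractions import Fraction

# Hypothetical counts: rows are A / A', columns are B / B'.
counts = {
    ("A", "B"): 30, ("A", "B'"): 20,
    ("A'", "B"): 10, ("A'", "B'"): 40,
}
total = sum(counts.values())

def joint(a, b):
    """Joint probability: both characteristics occur together."""
    return Fraction(counts[(a, b)], total)

# Marginal (simple) probabilities: sum the joint probabilities across a row or column.
p_a = joint("A", "B") + joint("A", "B'")
p_b = joint("A", "B") + joint("A'", "B")

# Conditional probability: P(A | B) = P(A and B) / P(B).
p_a_given_b = joint("A", "B") / p_b

# A and B are independent only if knowing B leaves P(A) unchanged.
independent = p_a_given_b == p_a
```

In this table P(A) is 1/2 but P(A | B) is 3/4, so the two events are not independent.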
Bayes Theorem:
- Used to revise previously calculated probabilities based on new information
- Is an extension of conditional probability
- Where:
o Bi = ith event of k mutually exclusive and collectively exhaustive events
o A = new event that might affect P(Bi)
- Example:
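As a hedged illustration, Bayes' theorem in its standard form is P(Bi | A) = P(A | Bi) P(Bi) / [P(A | B1) P(B1) + ... + P(A | Bk) P(Bk)]. The sketch below applies it in Python; the prior and conditional probabilities are hypothetical numbers, not taken from the module.

```python
def bayes(priors, likelihoods):
    """Revise P(Bi) into P(Bi | A), given P(A | Bi) for each event Bi."""
    # Joint probabilities P(A and Bi) = P(A | Bi) * P(Bi).
    joints = [p * l for p, l in zip(priors, likelihoods)]
    # P(A) is the sum of the joints, because the Bi are mutually
    # exclusive and collectively exhaustive.
    p_a = sum(joints)
    return [j / p_a for j in joints]

priors = [0.4, 0.6]          # hypothetical P(B1), P(B2)
likelihoods = [0.8, 0.3]     # hypothetical P(A | B1), P(A | B2)
posteriors = bayes(priors, likelihoods)
```

Observing A raises the probability of B1 from 0.40 to about 0.64, since A is much more likely under B1 than under B2.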
Counting Rules:
- Rules for counting the number of possible outcomes
RULE ONE:
- If any one of k different mutually exclusive and collectively exhaustive events can occur on each
of n trials
o Number of possible outcomes: k^n
§ E.g. a die rolled 3 times: 6^3 = 216
RULE TWO:
- If there are k1 events on the first trial, k2 on the second, and so on up to kn on the nth trial
o Number of possible outcomes: (k1)(k2)...(kn)
§ E.g. I go to a park, then dinner, then a movie
§ There are 3 parks, 2 restaurants and 6 movies to
choose from: 3 x 2 x 6 = 36 possible outcomes
RULE THREE:
- The number of ways that n items can be arranged
o Number of ways: n! = (n)(n-1)...(1)
§ Eg. 5 books to put away (no replacement)
§ 5! = 120 different ways
RULE FOUR:
- Permutations: the number of ways you can arrange x objects selected from n objects
- (in order)
o Number of ways: nPx = n!/(n - x)!
§ Eg I have 5 books, will put 3 away
§ 5!/(5-3)! = 60 ways
RULE FIVE:
- Combinations: the number of ways of selecting x objects from n objects
- (irrespective of order)
o Number of ways: nCx = n!/(x!(n-x)!)
§ E.g. I will read 3 of my 5 books, and the order
I read them in doesn't matter
§ 5!/(3!(5-3)!) = 10 possible combinations
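The worked examples for the five counting rules can be checked with Python's standard library (math.perm and math.comb need Python 3.8 or later). The variable names are illustrative only.

```python
import math

rule_one = 6 ** 3                  # a die rolled 3 times: k^n
rule_two = 3 * 2 * 6               # parks x restaurants x movies
rule_three = math.factorial(5)     # ways to arrange 5 books: n!
rule_four = math.perm(5, 3)        # ordered selections of 3 from 5: n!/(n - x)!
rule_five = math.comb(5, 3)        # unordered selections of 3 from 5: n!/(x!(n - x)!)

# Each combination of 3 books can be ordered in 3! ways,
# which links rule five back to rule four.
orderings_per_combination = rule_four // rule_five
```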
CHAPTER FIVE:
PROBABILITY DISTRIBUTION
Random Variable:
- A function that assigns a number to each outcome of an experiment
- Alternatively, the value of a random variable is a numerical event
Discrete Variable:
- Produce outcomes that come from a counting process
o Number of classes you take
- A probability distribution is a mutually exclusive list of all possible numerical outcomes for that
variable and a probability of occurrence associated with each outcome.
- Often represented graphically
- MUST satisfy two conditions:
o 0 ≤ P(x) ≤ 1 for all x
o Σ P(x) = 1, summed over all x
- Expected value (mean):
- Variance of a discrete variable:
- Standard deviation of a discrete variable
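The formulas for these three measures did not survive in the notes; the standard definitions are E(X) = Σ x P(x), Var(X) = Σ (x - E(X))^2 P(x), and the standard deviation is the square root of the variance. A minimal Python sketch under that assumption, using a hypothetical distribution for the number of classes taken:

```python
import math

def check_distribution(dist, tol=1e-9):
    """True when every P(x) lies in [0, 1] and the probabilities sum to 1."""
    in_range = all(0 <= p <= 1 for p in dist.values())
    return in_range and abs(sum(dist.values()) - 1) < tol

def expected_value(dist):
    """Each outcome weighted by its probability."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Squared deviations from the mean, weighted by probability."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

classes = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}    # hypothetical P(x) for x classes
valid = check_distribution(classes)
mu = expected_value(classes)
var = variance(classes)
sd = math.sqrt(var)
```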
Continuous Variable:
- Produce outcomes that come from a measurement
o Your weight
Rules of Expectations:
c = a constant
Rule One:
- E[c] = c
Rule Two:
- E[X +c] = E[X] + c
Rule Three:
- E[c . X] = c . E[X]
Rules of Variance:
c = a constant
Rule One:
- Var(c) = 0
Rule Two:
- Var(X + c) = Var(X)
Rule Three:
- Var(c . X) = c^2 . Var(X)
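The rules of expectation and variance can be verified numerically. The sketch below uses a fair six-sided die and c = 3 as hypothetical inputs, with exact fractions so the equalities hold without rounding error.

```python
from fractions import Fraction

def expectation(dist):
    return sum(x * p for x, p in dist.items())

def variance(dist):
    mu = expectation(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

def transform(dist, f):
    """Distribution of f(X); assumes f is one-to-one so no outcomes merge."""
    return {f(x): p for x, p in dist.items()}

die = {x: Fraction(1, 6) for x in range(1, 7)}   # fair die (hypothetical example)
c = 3
constant = {c: Fraction(1)}                      # a "variable" that is always c

shifted = transform(die, lambda x: x + c)        # X + c
scaled = transform(die, lambda x: c * x)         # c . X

rules_hold = (
    expectation(constant) == c
    and variance(constant) == 0
    and expectation(shifted) == expectation(die) + c
    and variance(shifted) == variance(die)
    and expectation(scaled) == c * expectation(die)
    and variance(scaled) == c ** 2 * variance(die)
)
```

Adding a constant shifts the centre without changing the spread, while multiplying by c stretches the spread by a factor of c^2 in variance terms.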
Probability Distributions:
Counting Techniques:
- The number of combinations of selecting x objects out of n objects:
Binomial Distribution:
- Finite population with replacement
- OR infinite population without replacement
Cumulative Probabilities:
- The probability of X is less than or equal to x
o P(X <= x)
- The binomial table calculates cumulative probabilities for P(X <= k)
o Can be used to compute the marginal and excess probabilities
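A cumulative binomial table can be mimicked in Python. The sketch assumes the standard binomial formula P(X = x) = nCx p^x (1 - p)^(n - x), which the notes do not write out, and it reads "marginal" as P(X = k) and "excess" as P(X > k); both readings are assumptions. The values n = 4 and p = 0.5 are hypothetical.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x): x items of interest in n independent trials."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomial_cdf(k, n, p):
    """P(X <= k): the value a cumulative binomial table would list."""
    return sum(binomial_pmf(x, n, p) for x in range(k + 1))

n, p = 4, 0.5                                             # hypothetical values

p_at_most_2 = binomial_cdf(2, n, p)
p_exactly_2 = binomial_cdf(2, n, p) - binomial_cdf(1, n, p)   # P(X = 2)
p_more_than_2 = 1 - binomial_cdf(2, n, p)                     # P(X > 2)
```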
Hypergeometric Distribution:
- Applicable when selecting from a finite population without replacement
- n trials in a sample from a finite population N
- Outcomes of trials ARE dependent
- Finding the probability of x items of interest in the sample
o where there are E items of interest in the population
- Finite Population Correction Factor:
- Excel formula =HYPGEOM.DIST(x; n; E; N; FALSE)
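The Excel call above can be mirrored in Python with the same argument order (x, n, E, N). The sketch assumes the standard hypergeometric formula P(X = x) = [ECx][(N-E)C(n-x)] / [NCn] and the standard finite population correction factor sqrt((N - n)/(N - 1)), neither of which is written out in the notes. The population of 10 with 4 items of interest is hypothetical.

```python
import math

def hypergeom_pmf(x, n, E, N):
    """P(X = x) when drawing n items without replacement from N, E of interest."""
    return math.comb(E, x) * math.comb(N - E, n - x) / math.comb(N, n)

def hypergeom_mean(n, E, N):
    return n * E / N

def hypergeom_sd(n, E, N):
    # Finite population correction factor: shrinks the spread because
    # sampling without replacement uses up part of the population.
    fpc = math.sqrt((N - n) / (N - 1))
    return math.sqrt(n * E * (N - E) / N ** 2) * fpc

N, E, n = 10, 4, 3                     # hypothetical population and sample
p_two = hypergeom_pmf(2, n, E, N)      # probability of exactly 2 items of interest
mean = hypergeom_mean(n, E, N)
sd = hypergeom_sd(n, E, N)
total = sum(hypergeom_pmf(x, n, E, N) for x in range(min(n, E) + 1))
```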
Marginal Probability:
- Summing across rows and down columns
o To determine the probability of x and y individually
o Table with arrows alongside
- Marginal probabilities can be used to work out the mean and standard
deviation of an individual variable in a discrete bivariate
distribution
Covariance (2 VAR):
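- Measures the strength of a linear relationship between x and y
- Covariance > 0
o Positive relationship
- Covariance < 0
o Negative relationship
- Formula alongside
- Covariance = 0
o No linear relationship; independent variables always have zero covariance,
but zero covariance alone does not prove independence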
Applications in Finance:
CHAPTER SIX:
NORMAL DISTRIBUTION AND OTHER CONTINUOUS DISTRIBUTIONS
Continuous Variable:
- Can assume any value on a continuum
o An uncountable number of values
- E.g., thickness, time to do a task, height
o Can potentially be any value depending on ability to accurately and precisely
measure
Evaluating Normality:
- Not all continuous distributions are normal
- It is important to evaluate how well the data set is approximated by a normal distribution
- Normally distributed data should approximate the theoretical normal distribution:
o The normal distribution is bell shaped (symmetrical) where the mean is equal to the
median
o The empirical rule applies to the normal distribution.
§ Approximately 68% of the observations should lie within (μ ± σ)
§ Approximately 95% of the observations should lie within (μ ± 2σ)
§ Approximately 99.7% of the observations should lie within (μ ± 3σ)
- Construct Charts or Graphs:
o For small/moderate sized data sets:
§ Stem-and-leaf or boxplot - assesses symmetry
o For large data sets:
§ Histogram or polygon - assesses how bell-shaped the distribution is
- Evaluate Normal Probability Plot:
o Is the normal probability plot approximately linear (that is, a straight line) with
positive slope?
- Compute Descriptive Summary Measures:
o Do the mean, median and mode have similar values?
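Two of these checks (the empirical rule and the closeness of the mean and median) can be sketched in Python. The data are simulated from a normal distribution with a hypothetical mean of 50 and standard deviation of 10, so the shares should land close to 68%, 95%, and 99.7%; real data would replace the simulated list.

```python
import random
import statistics

def empirical_rule_shares(values):
    """Percentage of values within 1, 2 and 3 standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    shares = {}
    for k in (1, 2, 3):
        inside = sum(1 for v in values if abs(v - mean) <= k * sd)
        shares[k] = 100 * inside / len(values)
    return shares

rng = random.Random(6)                                   # fixed seed for repeatability
values = [rng.gauss(50, 10) for _ in range(10_000)]      # simulated, hypothetical data

shares = empirical_rule_shares(values)
mean_median_gap = abs(statistics.mean(values) - statistics.median(values))
```

Shares far from the empirical rule, or a large gap between mean and median, would suggest the data are not well approximated by a normal distribution.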
Uniform Distribution
- Rectangular Distribution
- Symmetrical
- Every value between the smallest and largest
o Is equally possible
Exponential Distribution
- Right Skewed:
- Mean > Median
- Ranges from 0 (zero) to positive infinity
- A continuous probability distribution that is positively skewed
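- In Excel, =EXPONDIST(x, lambda, TRUE) returns P(X <= x), where x is the number of periods of concern and lambda is the number of events per period; =1 - EXPONDIST(x, lambda, TRUE) returns P(X > x)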
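The Excel calculation can be reproduced in Python from the exponential cumulative distribution function P(X <= x) = 1 - e^(-lambda x). The rate of 2 events per period and the half-period of concern are hypothetical values.

```python
import math

def exponential_cdf(x, rate):
    """P(X <= x) for an exponential distribution with `rate` events per period."""
    return 1 - math.exp(-rate * x)

rate = 2                     # hypothetical: 2 events per period
x = 0.5                      # hypothetical: half a period

p_within = exponential_cdf(x, rate)     # next event arrives within x periods
p_beyond = 1 - p_within                 # next event takes longer than x periods

# The right skew shows up as mean > median.
mean = 1 / rate
median = math.log(2) / rate
```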