Statistics and Data Science 188 Y1 s1

Statistics and Data Science (Universiteit Stellenbosch)
Statistics and Data Science 188

Group 7 (Kotze)
20073712@sun.ac.za
Introduction:
Module Framework:
https://learn.sun.ac.za/pluginfile.php/2776931/mod_resource/content/3/SDS188%20-%20Information%20Document%20-%202021.pdf
Textbooks and Resources:

¥ Statistics for Managers Using Microsoft Excel, 9th Edition.
o The textbook is not compulsory since course notes will also be provided by
lecturers on SUNLearn.
¥ Microsoft Excel
¥ Financial calculator: HP 10bII+
Assessment Dates for Semester One:
Table of Contents
CHAPTER ONE
CHAPTER TWO
CHAPTER THREE
CHAPTER FOUR
CHAPTER FIVE
CHAPTER SIX
CROOKES, COURTNEY [25093908@SUN.AC.ZA] 1

CHAPTER ONE:
VARIABLES AND SURVEYS
In this chapter the following will be covered:

¥ To understand issues that arise when defining variables.
¥ How to define variables.
¥ To understand the different measurement scales.
¥ How to collect data.
¥ To identify different ways to collect a sample.
¥ To understand the issues involved in data preparation.
¥ To understand the types of survey errors.
What is Statistics?
¥ Statistics is the collection of methods that allow one to work with data effectively.
¥ Statistics is a TOOL to obtain INFORMATION from DATA.
¥ It provides us with a formal basis to summarise and visualise data, reach conclusions
about the data, make reliable predictions about business activities, and improve the
business process.
A Framework for Statistics:

- To minimise errors, we use the DCOVA framework that organises a set of tasks to
apply statistics correctly.
1. Define the data that you want to study to meet an objective.
2. Collect the data from appropriate sources.
3. Organise the data collected by developing tables.
4. Visualise the data by developing charts.
5. Analyse the data collected to reach conclusions and present the results.
Note that the Define and Collect steps must be done before the others. The remaining three
are done in varying orders.
Big Data:
- A collection of data that cannot be easily browsed or analysed using traditional
methods.
- Big data are data being collected in massive volumes, at very fast rates (real time),
and in a variety of forms.
- Might refer to large datasets of structured data stored in files or worksheets.
- Big data might be unstructured such that the data have an irregular pattern and
contain values that are not comprehensible without further interpretation.
o Unstructured data could be text, pictures, videos, or audio.

Variable:
¥ A variable defines a characteristic or property of an item that can vary among the
occurrences of those items.
¥ Using this definition, data is a set of values associated with one or more variables.
¥ Notice that each value for a variable is a single fact, not a list of facts.
Statistics:
¥ Statistics can be defined as the methods that analyse the data of the variables of
interest.
o Descriptive statistics are the methods of organising, summarising, and
presenting data in an informative and convenient way.
o Inferential statistics are the methods used to make a conclusion about a
characteristic of a population, based on a smaller sample of the population.
Descriptive Statistics:
- are the methods of organising, summarising, and presenting data in an informative and
convenient way.
Inferential Statistics:
- are the methods used to make a conclusion about a characteristic of a population, based
on a smaller sample of the population.
Sources Of Data:
¥ Primary
o The data collector is also the one analysing the data
¥ Secondary
o The person analysing the data did not collect the data themselves
§ E.g. the data are obtained from open-source sites etc.
Variable Classification:
Categorical:
¥ Qualitative
¥ Variables take categories as their values, such as "yes", "no", or "blue", "brown",
"green".
Numerical:
¥ Quantitative
¥ Variables have values that represent a counted or measured quantity.
o Discrete variables arise from a counting process. Values are countable over a
finite range.
o Continuous variables arise from a measuring process. Values are
uncountable over a finite range.

Scales:
CATEGORICAL: (qualitative)
¥ Nominal scale
o Classifies categorical data into distinct categories in which no ranking is
implied.
¥ Ordinal Scale
o Classifies categorical data into distinct categories in which ranking is implied.
NUMERICAL: (quantitative)
Numerical variables use an interval scale or ratio scale.
¥ Interval scale
o is an ordered scale in which the difference between measurements is a
meaningful quantity but the measurements do not have a true zero point.
¥ Ratio scale
o is an ordered scale in which the difference between the measurements is a
meaningful quantity and the measurements have a true zero point.

POPULATION VS SAMPLE:
Parameters are numbers that summarize data for an entire population. Statistics are
numbers that summarize data from a sample
- A population parameter summarises the value of a specific variable for a population.
- A sample statistic summarises the value of a specific variable for sample data.
- Sample statistics are used to estimate population parameters.
Population:
¥ The entire group
Sample:
¥ A portion of the entire group
Pros of using a sample:
¥ Less time consuming
¥ Less costly
¥ More practical
SAMPLING:
SAMPLING FRAMES:
- The sampling frame is a listing of items that make up the population.
- Frames are data sources such as population lists, directories, or maps.
- Inaccurate or biased results can result if a frame excludes certain groups or portions
of the population.
- Using different frames to generate data can lead to dissimilar conclusions
- Your data set
- The frame from which your samples are to be drawn
- Needs to be a true representation of the population
o Beware of sample bias
N = Population size
n = Sample size
NONPROBABILITY SAMPLE:
- Items included are chosen without regard to their probability of occurring
o Convenience sampling:
§ Items are chosen because they are easy or convenient to sample
o Judgement sampling:
§ Experts from a field give their opinions

PROBABILITY SAMPLES:
- Items are chosen on the basis of known probabilities
o Simple Random Sample:
§ Easiest to do, convenient, most commonly used
§ Randomly selected samples from within the frame/population
§ Every individual or item from the frame has an equal chance of being
selected.
¥ Selection may be with replacement (selected individual is
returned to frame for possible reselection) or without
replacement (selected individual is not returned to the frame).
¥ Eg. Samples obtained from table of random numbers or
computer random number generators.
o Systematic Sample:
§ Decide on sample size = n
§ k = N/n
§ Split the N items in the frame into n groups of k items each
§ Randomly select one item from the first group, then select every kth item
thereafter
o Stratified Sample:
§ N can be divided into sub-groups (strata) according to a common
characteristic
¥ Minimum strata =2
§ A simple random sample is selected from each subgroup
¥ with sample sizes proportional to strata sizes
§ Samples from strata are combined back into one group
¥ A common technique for sampling voters: stratifying across
socio-economic or provincial lines
¥ Fewer sample items are needed when using stratified samples
o Cluster Sample:
§ The population is divided into clusters and a simple random sample of clusters is selected
§ E.g. Randomly selected polling station for exit-polls
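The three probability sampling schemes described above can be sketched in Python. This is only an illustration under stated assumptions: the frame of 40 numbered items, the two strata "A" and "B", the seed and the function names are all made up for the example and do not come from the course material.

```python
import random

def simple_random_sample(frame, n, rng, replace=False):
    """Every item in the frame has an equal chance of being selected."""
    if replace:
        return [rng.choice(frame) for _ in range(n)]   # item returned to frame
    return rng.sample(frame, n)                        # item not returned

def systematic_sample(frame, n, rng):
    """Random start within the first k items, then every kth item thereafter."""
    k = len(frame) // n            # k = N / n
    start = rng.randrange(k)       # random position inside the first group
    return [frame[start + i * k] for i in range(n)]

def stratified_sample(strata, n, rng):
    """Simple random sample per stratum, sized in proportion to the stratum."""
    total = sum(len(items) for items in strata.values())
    sample = []
    for items in strata.values():
        share = round(n * len(items) / total)
        sample.extend(rng.sample(items, share))
    return sample

rng = random.Random(188)                      # fixed seed so runs are repeatable
frame = list(range(1, 41))                    # hypothetical frame, N = 40
srs = simple_random_sample(frame, 8, rng)
sys_sample = systematic_sample(frame, 8, rng)
strata = {"A": frame[:30], "B": frame[30:]}   # hypothetical strata of 30 and 10 items
strat = stratified_sample(strata, 8, rng)
```

With N = 40 and n = 8 the systematic sample takes every 5th item, and the stratified sample draws 6 items from stratum A and 2 from stratum B.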

BENEFITS OF EACH:
Simple random sample and Systematic sample:
- Simple to use.
- May not be a good representation of the population's underlying characteristics.
Stratified sample:
- Ensures representation of individuals across the entire population.
Cluster sample:
- More cost effective.
- Less efficient (need larger sample to acquire the same level of precision).
Selection Probability Proportionate to Size: (PPS)

- Putting a larger weighting on more important transactions/samples
o E.g. More weight on sample selection of a R1 million invoice vs a R100 invoice
- The same invoice cannot be selected more than once
- In such a case a selection process that takes the magnitude of monetary values on each
invoice into account, is preferred.
o We refer to this type of selection process as selection probability
proportional to size, where size refers to the monetary value on each
invoice.
o Suppose several invoices must be selected from N invoices via the probability
proportional to size (PPS) selection process.

DATA CLEANING:
Data cleaning corrects irregularities in the data:
- Invalid variable values, including:
o Non-numerical data for numerical variable.
o Invalid categorical values for a categorical variable.
o Numeric values outside a defined range.
- Coding errors, including:
o Inconsistent categorical values.
o Inconsistent case for categorical values.
o Unrelated / Unwanted characters.
- Data integration errors, including:
o Redundant columns.
o Duplicated rows.
o Differing column lengths.
o Different units of measure or scale for numerical variables
SEMI-AUTOMATIC CLEANING:
- Invalid variable values can be identified by simple scanning techniques, for example:
o Non-numeric entries for numerical variables.
o Values for categorical variables that don't match a pre-defined category.
o Values for a numeric variable outside a pre-defined explicit range.
o Features exist in Excel to assist in these tasks.
PROBLEMS:
- Coding errors
- Data integration errors from combining two different computerised data sources
- Missing values or values not collected for a variable
o Hence, data cleaning can never be fully automated
OTHER DATA PREPROCESSING TASKS:

Data Formatting
- Rearranging data structure or changing electronic encoding of the data or both.
Stacking and Unstacking Data
- Analysis of a numerical variable may require subdividing that data into two or more
groups.
- Unstacking involves creating separate numerical variables for the different groups.
- Stacking involves pairing the one numerical variable with a second, categorical
variable that identifies the group.
Recoding Variables
- Redefining categories for a categorical variable.
- Transforming a numerical variable into a categorical variable.
- Recoding a variable can either supplement or replace the original variable.
- Recoding a categorical variable involves redefining categories.
- Recoding a numerical variable involves changing this variable into a categorical variable.
- When recoding, be sure that the new categories are:
o Mutually exclusive: categories do not overlap
o Collectively exhaustive: categories cover all possible values
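As a small illustration of recoding a numerical variable into a categorical one, the sketch below uses hypothetical age cut-offs (18 and 65) that are not taken from the notes. Each value falls into exactly one category, and every valid value is covered.

```python
def recode_age(age):
    """Recode a numerical age into one of three categories (hypothetical cut-offs)."""
    if age < 0:
        raise ValueError("age cannot be negative")   # invalid value, needs cleaning
    if age < 18:
        return "under 18"
    if age < 65:
        return "18 to 64"
    return "65 and over"

ages = [4, 17, 18, 40, 64, 65, 90]          # made-up values
recoded = [recode_age(a) for a in ages]     # new variable supplements the original
```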

SURVEY WORTHINESS:
- Purpose
- Is it from a probability sample?
- Coverage error, appropriate frame?
- Nonresponse error: follow up
- Measurement error
o Good questions elicit good responses
- Sampling error
o Always exists
SURVEY ERRORS:
- Coverage or Selection Bias:
o If someone/thing is excluded and has no chance of selection
- Nonresponse Error or Bias:
o People who do respond may differ from those who choose not to
- Sampling Error:
o Variation from sample to sample will always exist
- Measurement Error:
o Due to weaknesses in question design and/or respondent error
ETHICAL ISSUES:
- Coverage error and nonresponse error can be leveraged by survey designers
o To purposely bias results
- Sampling error can be an ethical issue
o if the findings are purposely not reported with the associated margin of error.
- Measurement error can be an ethical issue:
o Survey sponsor chooses leading questions.
o Interviewer purposely leads respondents in a particular direction.
o Respondent(s) wilfully provide false information.

EXCEL SAMPLING:
Simple Random Sample with Replacement:
1. Make sure the Data Analysis tool is installed.
2. Click the Data ribbon and select Data Analysis.
3. Select Sampling and click OK.
4. Set the Input Range equal to A3:A8.
5. Since the input range does not contain a column name,
a. make sure Labels is unchecked.
6. Set the Sampling Method to Random,
a. and type 4 in the Number of Samples box.
7. Under Output options,
a. click the Output Range option and type E3.
8. Click OK.
9. Then work with the index function to return the chosen samples
a. E.g., To return the 340 in cell F3,
b. Type the following in cell F3: =INDEX($B$3:$B$8, E3)
Systematic Sampling:
1. Go to Data -> Data Analysis -> Sampling and click OK.
2. To select every 2nd item in a systematic sample,
a. select Periodic under Sampling Method and set the Period equal to 2.
3. In the Output Range dialog click cell A10.
a. This produces a systematic sample of the row indices of the population data
frame.
4. As before, we can return the corresponding values of the variables of interest using INDEX().

CHAPTER TWO:
ORGANISING AND VISUALIZING VARIABLES
Tabular and Visual Summaries:

- Summaries guide further exploration and facilitate decision
making
- Visual summaries enable rapid review of larger amounts of data
o And show possible patterns
- Often the Organise and Visualize stages occur concurrently
Summary Table:
- Tallies the frequencies or percentages of items in a set of categories so differences can
be identified
Contingency Table:
- To study patterns between two or more categorical variables
- Cross tabulates tallies
- Looks at joint distributions
- For two variables the tallies for one variable are located in the rows and the tallies for
the second variable are located in the columns.
Categorical Data:
- For one variable:
o Bar Chart
§ Has gaps (vs no gaps in a histogram)
§ Is easiest to understand for the average
person
o Pareto Chart
§ To display categorical data on a nominal scale
§ A vertical bar chart
¥ Categories are shown in descending order of frequency
§ A cumulative polygon is shown in the same graph
¥ To separate the "vital few" from the "trivial many"
o Pie/Doughnut Chart
§ Percentage of the whole data set = size of pie slice
§ Avoid using where possible
- For two variables (contingency tables):
§ Variable x in the rows
§ Variable y in the columns
o Side-by-side Bar Chart
o Doughnut Chart

Numerical Data:
- Ordered array:
o A sequence of data in rank order
o From smallest to largest value
o Shows range
o Can help identify outliers
- Frequency Distribution:
o Summary table where items are arranged numerically in ordered classes
o Choose appropriate number of class groupings (or bins)
§ establishing boundaries to stop group overlaps
o Usually between 5-15 bins
o Width of class interval = range/ number of bins desired
§ Condense raw data into useful form
§ Fast visual interpretation
§ Easy determination of major info/idea
o Different class boundaries may provide different pictures
o Shifts in data concentration may show up when different class boundaries are
chosen.
o As the size of the data set increases, the impact of alterations in the selection
of class boundaries is greatly reduced.
- When comparing two or more groups with different sample sizes:
o You must use relative frequency or a percentage distribution
§ So that sample size does not create a bias/skewed perception of the
data
- Stem and Leaf:

o A stem-and-leaf display organises data into groups (called stems)
o so that the values within each group (the leaves) branch out to the right on
each row.
- Histogram:
o Vertical bar chart of frequency distribution
o No gaps between adjacent bars
o Class boundaries (or class midpoints) are shown on the
horizontal axis
o Vertical axis =
§ Frequency, relative frequency, percentage

- Percentage Polygon:
o is formed by having the midpoint of each class represent the data in that
class
§ then connecting the sequence of midpoints at their respective class
percentages.
o A cumulative percentage polygon, or ogive, displays the variable of interest
along the X axis
§ and the cumulative percentages along the Y axis.
o Useful when there are two or more groups to compare
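The frequency distribution, percentage distribution and cumulative percentages described above can be sketched as follows. The data values and the choice of 5 bins are made up for illustration; the only rule taken from the notes is that the class width equals the range divided by the number of bins.

```python
def frequency_distribution(values, bins):
    """Tally values into equal-width classes; width = range / number of bins."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last class so classes do not overlap.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return width, counts

values = [10, 14, 19, 22, 25, 27, 31, 33, 38, 41, 47, 60]   # made-up data
width, counts = frequency_distribution(values, bins=5)

# Percentages make groups with different sample sizes comparable.
percentages = [100 * c / len(values) for c in counts]

# Running totals give the heights of the cumulative percentage polygon (ogive).
cumulative = []
running = 0.0
for p in percentages:
    running += p
    cumulative.append(running)
```

Changing `bins` shifts the class boundaries, which is one way to see how different boundaries can give different pictures of the same data.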
Visualizing Two Numerical Variables Using Graphical Displays:

- Scatter Plot:
o Numerical data consisting of paired observations from
two numerical variables
o One variable on X-axis
§ The other on the Y-axis
o Used to examine the relationship between two numerical variables
- Time-Series Plot:
o Used to study patterns in a numeric variable over time
o Numeric variables on the Y-axis (vertical)
§ Time period on the X-axis

Organising a Mix of Variables:

- Multidimensional Contingency Table:
o Tallied responses of three or more variables
o Pivot Table
o As a practical rule:
§ Avoid having 4+ variables
o They:
§ Extend contingency tables to two or more row/column variables
§ Replace the frequencies found in contingency tables
¥ With summary information about a numeric variable
Displays:
- More useful than multidimensional contingency tables
o The data can be shown for many variables
- Multiple numerical variables can be presented in one
summarization
- Visualizations can reveal patterns
o That are harder to see in tables
- Coloured Scatter Plots:

o Visualize both numerical and categorical variables
- Bubble Charts:
o Use the size of points to represent the value of an
additional variable
§ In Excel this variable must be numerical
§ In Tableau can be numerical/categorical
- Pivot Charts:
o Visualizes specific categories from a PivotTable Summary
- Tree Maps:
o Visual representations of contingency tables
- Spark Lines:
o A compact Time-Series visualization
o Of Numerical Variables

Filtering and Querying Data:

- Associated with preparing tabular/visual summaries
- Data filtering selects rows of data that match criteria;
o specified value(s) for specific variable(s).
- Data Querying is similar but may not select all of the columns of the matching rows.
- Excel has various filtering & query features
- Excel Slicers Filter & Query Data from a Pivot Table:
o A slicer is a panel of clickable buttons superimposed over a worksheet
o Each button represents a specific value found in the source data
§ Of the pivot table
o By clicking these buttons you query the data
The Pitfalls of Organising and Visualizing Variables:

- Be aware of:
o The limits of others' ability to understand and comprehend
o Presentation issues
§ Can undercut the usefulness of methods from this chapter
- It is easy to:
o Create summaries that;
§ Obscure the data
¥ Information overload
§ Create false impressions
¥ Selective summarization
¥ Improperly constructed charts
§ Contain Chart junk
- Graphical errors:
o Having no relative basis
o Compressing the vertical axis
o Having no 0-point on the vertical axis
Best Practices:
- Use simplest possible visualization
- Title and label all axes
- Include a scale
- Begin vertical axis at 0 and use a constant scale
- Avoid 3D or exploded graphics
- Use consistent colouring in graphs that will be compared
- Avoid uncommon charts such as:
o Radar, surface, cone and pyramid charts

CHAPTER THREE:
DESCRIPTIVE SUMMARIES
GENERAL NOTES:
- The more the data are spread out, the greater the range, variance, and standard
deviation.
- The more the data are concentrated, the smaller the range, variance, and standard
deviation.
- If the values are all the same (no variation), all these measures will be zero.
- None of these measures are ever negative
CENTRAL TENDENCY:
- Is the extent to which the values of a numerical variable group around a typical or
central value.
- Mean:
o Most commonly used
- Median:
o Less sensitive to extreme values
- Mode:
o Value that appears most often
o Not affected by extreme values
o There may be no mode
§ Or several
- Geometric Mean:
o Geometric Mean rate of return
o Use for growth rates
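The measures of central tendency listed above can be computed with Python's standard `statistics` module. The data values and the two yearly returns are made up; the geometric mean rate of return is taken here as the nth root of the product of the growth factors (1 + R), minus 1.

```python
import statistics

values = [2, 3, 3, 5, 7, 40]              # made-up data with one extreme value

mean = statistics.mean(values)            # pulled upwards by the extreme value 40
median = statistics.median(values)        # less sensitive to the extreme value
modes = statistics.multimode(values)      # list of the most frequent value(s)

# Geometric mean rate of return for made-up yearly returns of +50% and -50%.
returns = [0.50, -0.50]
growth = 1.0
for r in returns:
    growth *= 1 + r                       # compound the growth factors
geometric_mean_return = growth ** (1 / len(returns)) - 1
arithmetic_mean_return = sum(returns) / len(returns)
```

Here the arithmetic mean return is 0 even though the investment lost value, while the geometric mean return is negative, which is why it is preferred for growth rates.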

MEASURES OF VARIATION:
- Range:
o Can be misleading:
§ Does not account for data distribution
§ Is sensitive to outliers
- Variance:
o Average (approx.) of squared deviations of values from the mean
o The variation is the amount of dispersion or scattering away from a central
value that the values of a numerical variable show
- Standard Deviation:
o Most commonly used
o Shows variation around the mean
o Is the square root of the variance
o Has the same units as the original data
- Steps for Computing Standard Deviation:
1. Compute the difference between each value and the mean.
2. Square each difference.
3. Add the squared differences.
4. Divide this total by n-1 to get the sample variance.
5. Take the square root of the sample variance to get the sample standard deviation.
- Coefficient of Variation:
o Measures relative variation.
o Always in percentage (%).
o Shows variation relative to mean.
o Can be used to compare the variability of two or more sets of data measured
in different units.
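The five steps for computing the sample standard deviation, followed by the coefficient of variation, can be sketched as below. The data are made up, and the coefficient of variation is taken as the standard deviation divided by the mean, times 100%.

```python
import math

values = [2, 4, 4, 4, 5, 5, 7, 9]                 # made-up sample
n = len(values)
mean = sum(values) / n

differences = [v - mean for v in values]          # step 1: value minus mean
squared = [d ** 2 for d in differences]           # step 2: square each difference
total = sum(squared)                              # step 3: add the squared differences
sample_variance = total / (n - 1)                 # step 4: divide by n - 1
sample_sd = math.sqrt(sample_variance)            # step 5: take the square root

# Coefficient of variation: variation relative to the mean, as a percentage.
cv_percent = sample_sd / mean * 100
```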
LOCATING EXTREME OUTLIERS:

Z-Score:
- Subtract the mean and divide by the standard deviation
- Is the number of standard deviations a data value is away from the mean
- A data value is considered an extreme outlier if its Z-score is less than -3.0 or greater
than +3.0.
o The larger the absolute Z-score, the further the value is from the mean.
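A short sketch of the Z-score rule, using made-up data in which one value lies far from the rest. It assumes the sample mean and sample standard deviation are used.

```python
import statistics

def z_scores(values):
    """Z = (value - mean) / standard deviation, for every value."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)          # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]

values = [10] * 19 + [50]                  # made-up data: nineteen 10s and one 50
scores = z_scores(values)

# Flag values whose Z-score is below -3.0 or above +3.0.
extreme_outliers = [v for v, z in zip(values, scores) if abs(z) > 3.0]
```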

SHAPE OF DISTRIBUTION:
- Describes how data is distributed
- Useful related statistics:
o Skewness:
§ Measures the extent to which
data values are not symmetrical
§ Right-skewed: positive skewness
§ Left-skewed: negative skewness
o Kurtosis:
§ measures the peakedness of the curve of the distribution
¥ how sharply the curve rises approaching the centre of the
distribution.
Distribution of the Values For a Numerical Variable:

The Quartiles:
- Quartiles split ranked data into four segments
o With an equal number of values per segment
- Where n is the number of observed values
- When calculating the ranked position use the following rules:
o If the result is a whole number
§ then it is the ranked position to use.
o If the result is a fractional half (e.g. 2.5, 7.5, 8.5, etc.)
§ then average the two corresponding data values.
o If the result is not a whole number or a fractional half
§ then round the result to the nearest integer to find the ranked
position.
- Interquartile Range:
o Measures the spread in the middle 50%
of data
o Is a measure of variability that is not
influenced by outliers or extreme values
o IQR = Q3 - Q1
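The three ranked-position rules can be sketched in code. The position formulas are an assumption here, since they are not written out in these notes: Q1 is taken at ranked position (n + 1)/4, the median at (n + 1)/2 and Q3 at 3(n + 1)/4. The data values are made up.

```python
def ranked_value(ordered, position):
    """Return the value at a 1-based ranked position using the three rules."""
    whole = int(position)
    fraction = position - whole
    if fraction == 0:                                  # rule 1: whole number
        return ordered[whole - 1]
    if fraction == 0.5:                                # rule 2: fractional half
        return (ordered[whole - 1] + ordered[whole]) / 2
    return ordered[int(position + 0.5) - 1]            # rule 3: nearest integer

def quartiles(values):
    ordered = sorted(values)                           # quartiles need ranked data
    n = len(ordered)
    q1 = ranked_value(ordered, (n + 1) / 4)            # assumed position for Q1
    q2 = ranked_value(ordered, (n + 1) / 2)            # assumed position for the median
    q3 = ranked_value(ordered, 3 * (n + 1) / 4)        # assumed position for Q3
    return q1, q2, q3

values = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]      # made-up data, n = 10
q1, q2, q3 = quartiles(values)
iqr = q3 - q1
five_number_summary = (min(values), q1, q2, q3, max(values))
```

With n = 10 the Q1 position is 2.75, which rounds to the 3rd ranked value, and the Q3 position is 8.25, which rounds to the 8th ranked value.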
The Five-Number Summary:
- Numbers that describe the centre, spread and shape of data
o Min, Q1, Median (Q2), Q3, Max
Boxplot:
- A visual representation of the Five-Number Summary
- Box-and-whisker diagram

DESCRIPTIVE MEASURES FOR A POPULATION:

Population Mean:
- Sum of values in the population divided by
the population size, N
Population Variance
- Average of squared deviations of values from
the mean
Population Standard Deviation:

- Most commonly used measure of variation
- Shows variation around the mean
- Is the square root of the population variance
- Has the same units as the original data
Sample Statistics vs Population Parameters:
The Empirical Rule:

- Approximates the variation of data in a symmetrical mound-shaped distribution
- Approximately 68% of the data in a symmetric mound-shaped distribution lies
o within one standard deviation of the mean
o or µ ± 1σ
- Approximately 95% of the data in a symmetric mound-shaped
distribution lies
o within two standard deviations of the mean
o or µ ± 2σ.
- Approximately 99.7% of the data in a symmetric mound-shaped
distribution lies
o within three standard deviations of the mean
o or µ ± 3σ.
Chebyshev's Rule:
- Regardless of how the data are distributed
o At least (1 - 1/k^2) x 100% of the values
§ Will fall within k standard deviations of the mean (for k > 1)
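A quick calculation of the Chebyshev minimum for a few values of k. The function name is made up for the example.

```python
def chebyshev_minimum_percent(k):
    """Minimum percentage of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the rule only applies for k > 1")
    return (1 - 1 / k ** 2) * 100

minimums = {k: chebyshev_minimum_percent(k) for k in (2, 3, 4)}
```

For k = 2 the rule guarantees at least 75%, which is a weaker statement than the empirical rule's approximate 95% but holds for any shape of distribution.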

Relationships Between Two Numerical Variables:

- Scatter Plots
o Allow a visual examination of relationships
- The Covariance
o Measures strength of linear relationships
§ Between two numerical variables (X and Y)
o Only concerned with the strength of relationship
o No causal effect is implied
o It is not possible to determine relative strength of a relationship
§ From the size of the covariance
- The Coefficient of Correlation

o Measures the relative strength of the linear
relationship between two numerical variables
o The population coefficient of correlation is ρ (rho)
o The sample coefficient of correlation is r
§ Both ρ and r have the following features:
¥ unit free
¥ -1 ≤ ρ (or r) ≤ 1
¥ The closer to -1
o The stronger the negative relationship
¥ The closer to 1
o The stronger the positive relationship
¥ The closer to 0
o The weaker the linear relationship
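A sketch of the sample covariance and the sample coefficient of correlation, using made-up paired data. It assumes the usual sample formulas: the covariance divides the sum of cross-products of deviations by n - 1, and r divides the covariance by the product of the two sample standard deviations.

```python
import math

def sample_covariance(x, y):
    """Sum of cross-products of deviations from the means, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """r = cov(X, Y) / (S_X * S_Y); unit free and between -1 and 1."""
    sd_x = math.sqrt(sample_covariance(x, x))   # cov(X, X) is the sample variance
    sd_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (sd_x * sd_y)

x = [1, 2, 3, 4, 5]                 # made-up paired observations
y = [2, 4, 5, 4, 5]
cov_xy = sample_covariance(x, y)
r_xy = sample_correlation(x, y)

# Re-expressing y in different units changes the covariance but not r.
y_scaled = [100 * v for v in y]
cov_scaled = sample_covariance(x, y_scaled)
r_scaled = sample_correlation(x, y_scaled)
```

The rescaled covariance is 100 times larger while r is unchanged, which is why the size of the covariance cannot be used to judge relative strength.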
SEE PAGE 74 FOR EXCEL FORMULAS RELATING TO THIS CHAPTER:

https://learn.sun.ac.za/pluginfile.php/2790063/mod_resource/content/1/chapter3_2021.pdf

CHAPTER FOUR:
PROBABILITY
Probability Principles:
- Probability distributions
- Mathematical expectation
- Binomial and Poisson distributions
Sample Space: S
- S = { x, y, z }
- All possible events
Events:
Simple Event:
- Single characteristic
- A day in January from all days in 2021
Joint Event:
- Two or more characteristics
- A day in January that is also a Wednesday from all days in 2021
Complement of an Event A (denoted A')
- All events that are not part of event A
- All days in 2021 that are not in January
Probability:
- Numerical value representing the likelihood of a certain event occurring
Impossible Event:
- An event that has no chance of occurring
- Probability = 0
Certain Event:
- An event that is going to happen
- Probability = 1
Mutually Exclusive Events:

- Events that cannot occur simultaneously
Collectively Exhaustive:
- One of the events must occur
- A set of events covers the entire sample space
Visualizing Probability:
- Venn Diagram:
o A ∪ B
§ A or B
o A ∩ B
§ A and B
- Contingency Tables:
o Alternative: Decision Tree
- Tree Diagrams:

Visualizing Events:
- Collectively Exhaustive:
o Entire sample space exhausted
o Total P = 1
- Mutually Exclusive:
o Can only do A or B
§ Never both
Approaches to Assessing the Probability of Events:

- A Priori:
o Based on prior knowledge
o P = X/T, where X is the number of ways the event occurs and T is the total number of
possible outcomes
o Assuming all outcomes are
equally likely
- Empirical Probability:
o Based on observed data
o P = X/T
o Assuming all outcomes are equally likely
- Subjective Probability:
o Based on a combination of an individual's past experience, personal opinion and
analysis of a situation
§ Differs from person-to-person
Simple Probability:
- Probability of a simple event
Joint Probability:
- The probability of the occurrence of
o two or more events
Marginal Probability:
- An unconditional probability
Conditional Probability:
- Probability of an event
o Given that the other event has already occurred

General Addition Rule:

- P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
o If A and B are mutually exclusive, P(A ∩ B) = 0, so:
o P(A ∪ B) = P(A) + P(B)
Independent Events:
- The probability of A is not affected if B happens: P(A|B) = P(A)
Multiplication Rules for Two Events:

- General Multiplication Rule
o P(A ∩ B) = P(A|B) P(B)
o This follows from the conditional probability P(A|B) = P(A ∩ B)/P(B)
- For Marginal Probability for event A:

o P(A) = P(A|B1)P(B1) + P(A|B2)P(B2) + ... + P(A|Bk)P(Bk)
§ Where B1, B2, ..., Bk
§ Are mutually exclusive and
collectively exhaustive
Bayes Theorem:
- Used to revise previously calculated probabilities based on new information
- Is an extension of conditional probability
- P(Bi|A) = P(A|Bi)P(Bi) / [P(A|B1)P(B1) + P(A|B2)P(B2) + ... + P(A|Bk)P(Bk)]
- Where:
o Bi = ith event of k mutually
exclusive and collectively
exhaustive events
o A = new event that might affect P(Bi)
- Example:
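A worked Bayes calculation can be sketched as follows. All numbers are hypothetical and chosen only for illustration: B1 and B2 are two mutually exclusive and collectively exhaustive events with prior probabilities 0.03 and 0.97, and the new event A occurs with probability 0.90 given B1 and 0.02 given B2.

```python
# Prior probabilities P(Bi): hypothetical values that sum to 1.
priors = {"B1": 0.03, "B2": 0.97}

# Conditional probabilities P(A|Bi) of the new event A: also hypothetical.
likelihoods = {"B1": 0.90, "B2": 0.02}

# Marginal probability: P(A) = P(A|B1)P(B1) + P(A|B2)P(B2)
p_a = sum(likelihoods[b] * priors[b] for b in priors)

# Revised probabilities: P(Bi|A) = P(A|Bi)P(Bi) / P(A)
posteriors = {b: likelihoods[b] * priors[b] / p_a for b in priors}
```

Under these assumed numbers, observing A raises the probability of B1 from 0.03 to roughly 0.58.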

Counting Rules:
- Rules for counting the number of possible outcomes
RULE ONE:
- If any one of k different mutually exclusive and collectively exhaustive events can occur on each
of n trials
o Number of possible outcomes: k^n
§ E.g. a die rolled 3 times = 6^3 = 216
RULE TWO:
- If there are k1 events on the first trial, k2 on the second, ..., and kn on the nth trial
o Number of possible outcomes: (k1)(k2)...(kn)
§ E.g. I go to a park, then dinner, then a movie
§ There are 3 parks, 2 restaurants and 6 movies to
choose from = 3 x 2 x 6 = 36 possible outcomes
RULE THREE:
- The number of ways that n items can be arranged
o Number of ways: n! = (n)(n-1)...(1)
§ Eg. 5 books to put away (no replacement)
§ 5! = 120 different ways
RULE FOUR:
- Permutations: the number of ways you can arrange x objects selected from n objects
- (in order)
o Number of ways: nPx = n!/(n - x)!
§ Eg I have 5 books, will put 3 away
§ 5!/(5-3)! = 60 ways
RULE FIVE:
- Combinations: the number of ways of selecting x objects from n objects
- (irrespective of order)
o Number of ways: nCx = n!/[x!(n - x)!]
§ E.g. I will read 3 of my 5 books, the order
I read them doesn't matter
§ 5!/[3!(5 - 3)!] = 10 possible combinations
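The five counting rules can be checked with Python's `math` module (`math.perm` and `math.comb` need Python 3.8 or later). The variable names are made up; the numbers are the examples from the rules above.

```python
from math import comb, factorial, perm

dice_outcomes = 6 ** 3            # rule one: k^n, a die rolled 3 times
evening_plans = 3 * 2 * 6         # rule two: (k1)(k2)(k3), parks x restaurants x movies
book_orderings = factorial(5)     # rule three: n!, ways to arrange 5 books
ordered_picks = perm(5, 3)        # rule four: 5P3 = 5!/(5 - 3)!, order matters
unordered_picks = comb(5, 3)      # rule five: 5C3 = 5!/[3!(5 - 3)!], order ignored
```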

CHAPTER FIVE:
PROBABILITY DISTRIBUTION
Random Variable:
- A function that assigns a number to each outcome of an experiment
- Alternatively, the value of a random variable is a numerical event
Discrete Variable:
- Produce outcomes that come from a counting process
o Number of classes you take
- A probability distribution is a mutually exclusive list of all possible numerical outcomes for that
variable and a probability of occurrence associated with each outcome.
- Often represented graphically
- MUST satisfy two conditions:
o 0 ≤ P(x) ≤ 1 for all x
o Σ P(x) = 1, summed over all x
- Expected value (mean): µ = E(X) = Σ x P(x)
- Variance of a discrete variable: σ^2 = Σ [x - E(X)]^2 P(x)
- Standard deviation of a discrete variable: σ = square root of σ^2
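A sketch of these calculations for a made-up discrete probability distribution. It uses the standard definitions: the expected value is the sum of x P(x), the variance is the sum of [x - E(X)]^2 P(x), and the standard deviation is the square root of the variance.

```python
import math

# Hypothetical distribution: outcome x -> probability P(x)
distribution = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}

# The two conditions every discrete probability distribution must satisfy.
each_between_0_and_1 = all(0 <= p <= 1 for p in distribution.values())
total_probability = sum(distribution.values())

expected_value = sum(x * p for x, p in distribution.items())
variance = sum((x - expected_value) ** 2 * p for x, p in distribution.items())
standard_deviation = math.sqrt(variance)
```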
Continuous Variable:
- Produce outcomes that come from a measurement
o Your weight

Rules of Expectations:
c = a constant
Rule One:
- E[c] = c
Rule Two:
- E[X +c] = E[X] + c
Rule Three:
- E[c . X] = c . E[X]
Rules of Variance:
c = a constant
Rule One:
- Var(c) = 0
Rule Two:
- Var(X + c) = Var(X)
Rule Three:
- Var(c . X) = c^2 . Var(X)
Probability Distributions:
Binomial Probability Distribution:

- A fixed number of observations, n
- Each observation is classified into one of two mutually exclusive and collectively exhaustive
categories
- The probability of an observation being classified as the event of interest is constant from
observation to observation
- The value of any observation is independent of the value of any other observation
Counting Techniques:
- The number of combinations of selecting x objects out of n objects: nCx = n!/[x!(n - x)!]

Binomial Distribution:
- Finite population with replacement
- OR infinite population without replacement
Binomial Distribution Characteristics:
Cumulative Probabilities:
- The probability of X is less than or equal to x
o P(X <= x)
- The binomial table calculates cumulative probabilities for P(X <= k)
o Can be used to compute the marginal and excess probabilities
- For P(X >= k): use the complement, P(X >= k) = 1 - P(X <= k - 1)
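A sketch of binomial probabilities built from the combinations rule. The symbol p for the probability of the event of interest, the function names and the values n = 4 and p = 0.1 are assumptions made for this example. The mean and standard deviation use the standard binomial results np and the square root of np(1 - p).

```python
from math import comb, sqrt

def binomial_pmf(x, n, p):
    """P(X = x): nCx * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomial_cdf(k, n, p):
    """Cumulative probability P(X <= k)."""
    return sum(binomial_pmf(x, n, p) for x in range(k + 1))

n, p = 4, 0.1                                   # hypothetical values
p_exactly_3 = binomial_pmf(3, n, p)
p_at_most_1 = binomial_cdf(1, n, p)
p_at_least_2 = 1 - binomial_cdf(1, n, p)        # complement of P(X <= 1)

mean = n * p
standard_deviation = sqrt(n * p * (1 - p))
```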

The Poisson Distribution:

- Used when counting the number of times an event occurs in a given area of opportunity
o Area of Opportunity:
§ A continuous unit or interval of time, volume or such area where more than
one occurrence of an event can happen
- E.g., number of mosquito bites on a person
o Number of scratches on a car etc.
- Applied when (image below)
- The average number of events per unit = λ (lambda)
Poisson Distribution Characteristics:

- Where λ = expected number of events
- In Excel: =POISSON.DIST(x; λ; FALSE)
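The same probability can be computed by hand with the standard Poisson formula P(X = x) = e^(-λ) λ^x / x!. The value λ = 3 and the function name are made up for this sketch.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson variable with expected number of events lam."""
    return exp(-lam) * lam ** x / factorial(x)

lam = 3.0                                    # hypothetical average events per unit
p_exactly_2 = poisson_pmf(2, lam)            # non-cumulative, like FALSE in Excel
p_at_most_2 = sum(poisson_pmf(x, lam) for x in range(3))   # cumulative P(X <= 2)
```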
Hypergeometric Distribution:
- Applicable when selecting from a finite population without replacement
- n trials in a sample from a finite population N
- Outcomes of trials ARE dependent
- Finding the probability of x items of interest in the sample
o where there are E items of interest in the population
- Finite Population Correction Factor: sqrt((N - n) / (N - 1))
- Excel formula =HYPGEOM.DIST(x; n; E; N; FALSE)
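A minimal sketch of the hypergeometric probability, using the same argument order as the Excel formula (x, n, E, N). It relies on the standard formula P(X = x) = C(E, x) . C(N - E, n - x) / C(N, n); the function name hypergeom_pmf and the example numbers are assumptions.

```python
import math


def hypergeom_pmf(x, n, E, N):
    """P(X = x): x items of interest in a sample of n, drawn without replacement
    from a population of N that contains E items of interest."""
    if x < 0 or x > n or x > E or n - x > N - E:
        return 0.0  # impossible outcomes
    return math.comb(E, x) * math.comb(N - E, n - x) / math.comb(N, n)


# Illustrative values only: population of 10 with 4 items of interest, sample of 3.
N, E, n = 10, 4, 3

p_two = hypergeom_pmf(2, n, E, N)
mean = n * E / N
fpc = math.sqrt((N - n) / (N - 1))                 # finite population correction factor
std_dev = math.sqrt(n * E * (N - E) / N ** 2) * fpc
print(p_two, mean, std_dev)
```

The correction factor shrinks the standard deviation relative to sampling with replacement, because each draw removes an item from the finite population.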

Discrete Bivariate Distributions:

- Probabilities of combinations of two variables
- Joint Probability Distributions
- Requirements: every joint probability lies between 0 and 1, and all joint probabilities sum to 1
- Any combination of events can be calculated
o Sum of the probabilities of those outcomes
Marginal Probability:
- Summing across rows and down columns
o To determine the probability of x and y individually
o Table with arrows alongside
- Marginal probabilities can be used to work out the mean, variance and
standard deviation of an individual variable in a discrete bivariate
distribution
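Summing across rows and down columns can be sketched as follows. The joint table is a made-up example, and the helper names marginal and mean_and_sd are hypothetical.

```python
import math

# Hypothetical joint probability table P(X = x, Y = y); the numbers are illustrative only.
joint = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.40,
}


def marginal(joint_probs, axis):
    """Sum the joint probabilities over the other variable (axis 0 = X, axis 1 = Y)."""
    totals = {}
    for pair, p in joint_probs.items():
        key = pair[axis]
        totals[key] = totals.get(key, 0.0) + p
    return totals


def mean_and_sd(dist):
    """Mean and standard deviation of a single discrete variable."""
    mean = sum(v * p for v, p in dist.items())
    var = sum((v - mean) ** 2 * p for v, p in dist.items())
    return mean, math.sqrt(var)


marginal_x = marginal(joint, 0)  # row totals
marginal_y = marginal(joint, 1)  # column totals
mean_x, sd_x = mean_and_sd(marginal_x)
mean_y, sd_y = mean_and_sd(marginal_y)
print(marginal_x, marginal_y, mean_x, sd_x, mean_y, sd_y)
```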
Covariance (2 VAR):
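- Measures the direction and strength of a linear relationship between x and y
- Formula: Cov(X, Y) = sum over all pairs (x, y) of [x - E(X)] . [y - E(Y)] . P(x, y)
- Covariance > 0
o Positive relationship
- Covariance < 0
o Negative relationship
- Covariance = 0
o No linear relationship (independent variables have a covariance of 0, but a covariance of 0
does not on its own prove independence)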
Coefficient of Correlation (2 VAR):

- Same as for one variable (Chapter Three)
- If Covariance = 0
o Then, Coefficient of Correlation = 0
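The covariance and coefficient of correlation of a discrete bivariate distribution can be sketched as below, using the standard definitions Cov(X, Y) = sum of [x - E(X)] . [y - E(Y)] . P(x, y) and correlation = Cov(X, Y) / (σx . σy). Both joint tables are made-up examples and the function name is hypothetical.

```python
import math


def covariance_and_correlation(joint_probs):
    """Return (covariance, coefficient of correlation) for a joint table
    given as {(x, y): probability}."""
    mean_x = sum(x * p for (x, y), p in joint_probs.items())
    mean_y = sum(y * p for (x, y), p in joint_probs.items())
    var_x = sum((x - mean_x) ** 2 * p for (x, y), p in joint_probs.items())
    var_y = sum((y - mean_y) ** 2 * p for (x, y), p in joint_probs.items())
    cov = sum((x - mean_x) * (y - mean_y) * p for (x, y), p in joint_probs.items())
    return cov, cov / math.sqrt(var_x * var_y)


# Illustrative joint table with a weak negative relationship.
joint = {(0, 0): 0.10, (0, 1): 0.20, (1, 0): 0.30, (1, 1): 0.40}

# Illustrative table built as the product of its marginals, so X and Y are independent.
joint_independent = {(0, 0): 0.12, (0, 1): 0.18, (1, 0): 0.28, (1, 1): 0.42}

cov, rho = covariance_and_correlation(joint)
cov_ind, rho_ind = covariance_and_correlation(joint_independent)
print(cov, rho, cov_ind, rho_ind)
```

Dividing by the two standard deviations removes the units, which is why the coefficient of correlation always lies between -1 and +1 while the covariance does not.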
Rules for Summing Two Random Variables:
Applications in Finance:

CHAPTER SIX:
NORMAL DISTRIBUTION AND OTHER CONTINUOUS DISTRIBUTIONS
Continuous Variable:
- Can assume any value on a continuum
o An uncountable number of values
- E.g., thickness, time to do a task, height
o Can potentially be any value depending on ability to accurately and precisely
measure
Continuous Probability Distributions (Normal, Uniform, Exponential):

Normal Distribution
- Bell-Shaped
- Symmetrical
- Ranges from negative to positive infinity
- Mean = Median = Mode
- Location is determined by the mean
- Spread is determined by the standard deviation
- The random variable has an infinite theoretical range
o -∞ < X < +∞
The Normal Distribution Density Function:
The Standardised Normal:

- Any normal distribution (with any mean and standard deviation combination) can be
transformed into the standardised normal distribution (Z).
- To compute normal probabilities, one needs to transform X units into Z units.
- The standardised normal distribution (Z) has a mean of 0 and a standard deviation of 1
Translation to the Standardised Normal Distribution:
- Translate from X to the standardised normal (the "Z" distribution) by subtracting the mean of X
and dividing by its standard deviation:
o Z = (X - μ) / σ
- The Z Distribution always has mean = 0
o And standard deviation = 1
- Values above the mean have positive Z values
- Values below the mean have negative Z values
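The translation from X units to Z units can be sketched in a few lines. The function name z_score and the example mean of 100 and standard deviation of 50 are assumptions for illustration.

```python
def z_score(x, mean, std_dev):
    """Translate an X value into Z units: subtract the mean, divide by the standard deviation."""
    return (x - mean) / std_dev


# Illustrative normal variable with mean 100 and standard deviation 50.
z_above = z_score(200, 100, 50)   # a value above the mean
z_below = z_score(50, 100, 50)    # a value below the mean
z_at_mean = z_score(100, 100, 50)
print(z_above, z_below, z_at_mean)
```

A Z value states how many standard deviations an observation lies from the mean, so values above the mean give positive Z and values below give negative Z.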

The Standardised Normal Probability Density Distribution:
Finding Normal Probabilities:

- Measured as the area under a curve
o The probability of any individual value = 0
- The total area under the curve = 1 (an event will occur)
o Because the normal curve is symmetrical, 0.5 of the area lies above the mean and 0.5 below.
Cumulative Standardised Normal Table:

- Gives the probability that Z is less than the desired value
o From negative infinity to Z
- Column:
o Gives the value of Z to the second decimal point
- Row:
o Gives the value of Z to the first decimal point
- The corresponding value:
o The probability from Z = -∞ up to the desired Z value
General Procedure for Finding Normal Probabilities:

P(a < X < b) when X is distributed normally:
- Draw the normal curve for the problem in terms of X
- Translate the X values to Z values
- Use the Cumulative Standardised Normal Table
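The procedure above can be sketched in Python, with statistics.NormalDist (Python 3.8 or later) standing in for the Cumulative Standardised Normal Table. The function name normal_prob_between and the example values are assumptions.

```python
from statistics import NormalDist


def normal_prob_between(a, b, mean, std_dev):
    """P(a < X < b) for a normally distributed X, found via Z values."""
    z_a = (a - mean) / std_dev     # translate the lower X value to Z units
    z_b = (b - mean) / std_dev     # translate the upper X value to Z units
    standard_normal = NormalDist() # mean 0, standard deviation 1
    # Cumulative probability up to z_b minus cumulative probability up to z_a.
    return standard_normal.cdf(z_b) - standard_normal.cdf(z_a)


# Illustrative values only: X is normal with mean 100 and standard deviation 50.
p_between = normal_prob_between(50, 150, 100, 50)
print(p_between)
```

Subtracting the two cumulative values mirrors reading two entries from the table and taking their difference.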
Calculating the Normal Percentile:

- Given a normal probability, find the X value
o Find the Z value for the known probability
o Convert to X units using the formula X = μ + Z . σ
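Finding a percentile reverses the usual lookup: start from the probability, find Z, then convert to X with X = μ + Z . σ. The sketch below uses statistics.NormalDist.inv_cdf (Python 3.8 or later) in place of a reverse table lookup; the function name normal_percentile and the example values are assumptions.

```python
from statistics import NormalDist


def normal_percentile(p, mean, std_dev):
    """Return the X value with cumulative probability p under a normal distribution."""
    z = NormalDist().inv_cdf(p)   # Z value for the known probability
    return mean + z * std_dev     # convert Z units back to X units


# Illustrative values only: 90th percentile of a normal variable with mean 10, sd 2.
x_90 = normal_percentile(0.90, 10, 2)
x_10 = normal_percentile(0.10, 10, 2)
print(x_90, x_10)
```

Percentiles below 50% give negative Z values, so the resulting X falls below the mean.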
Calculating Cumulative Normal Probability on Excel:

Evaluating Normality:
- Not all continuous distributions are normal
- It is important to evaluate how well the data set is approximated by a normal distribution
- Normally distributed data should approximate the theoretical normal distribution:
o The normal distribution is bell shaped (symmetrical) where the mean is equal to the
median
o The empirical rule applies to the normal distribution.
§ Approximately 68% of the observations should lie within (μ ± σ)
§ Approximately 95% of the observations should lie within (μ ± 2σ)
§ Approximately 99.7% of the observations should lie within (μ ± 3σ)
- Construct Charts or Graphs:
o For small/moderate sized data sets:
§ Stem-and-leaf or boxplot - assesses symmetry
o For large data sets:
§ Histogram or polygon - assesses the bell-shapedness
- Evaluate Normal Probability Plot:
o Is the normal probability plot approximately linear (that is, a straight line) with
positive slope?
- Compute Descriptive Summary Measures:
o Do the mean, median and mode have similar values?
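Two of the checks above, the empirical rule and the comparison of mean and median, can be sketched numerically. The data here are simulated with a fixed seed purely for illustration (an assumption, not course data), and share_within is a hypothetical helper name.

```python
import random
import statistics


def share_within(data, k):
    """Proportion of observations within k standard deviations of the mean."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    inside = sum(1 for x in data if abs(x - mean) <= k * sd)
    return inside / len(data)


# Simulated sample from a normal distribution (mean 50, standard deviation 10).
random.seed(188)
sample = [random.gauss(50, 10) for _ in range(10_000)]

# Compare these shares with roughly 68%, 95% and 99.7%.
shares = {k: share_within(sample, k) for k in (1, 2, 3)}

# For approximately normal data the mean and median should be close.
mean_median_gap = abs(statistics.mean(sample) - statistics.median(sample))
print(shares, mean_median_gap)
```

With real data, shares far from the empirical-rule percentages or a large gap between mean and median suggest the normal distribution is a poor approximation.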
Constructing Normal Probability Plots:

1. Arrange variable data (X = x1, x2, ..., xn) into ordered array,
a. Where n is the number of observations in X
2. Create an index variable: j = 1, 2, ..., n
3. Generate probability values: p1, p2, ..., pn
a. such that pj = j/(n+1)
4. Find the corresponding standardised normal quantile values (Z Values)
5. Plot the pairs with the observed values (X) on the vertical axis
a. And the standardised normal quantile values (Z) on the horizontal
6. Evaluate this graph for linearity
a. The data are approximately normally distributed if the points form a straight line
b. A boxplot would also be symmetrical
c. Can be checked using the Empirical Rule
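Steps 1 to 5 above can be sketched as a function that returns the (Z, X) pairs to plot. It assumes the probability values are j/(n + 1) and uses statistics.NormalDist.inv_cdf (Python 3.8 or later) for the standardised normal quantiles; the function name and the sample data are hypothetical, and the plotting itself is left out.

```python
from statistics import NormalDist


def normal_probability_plot_points(data):
    """Return (Z quantile, observed X) pairs for a normal probability plot."""
    ordered = sorted(data)                  # step 1: ordered array
    n = len(ordered)
    points = []
    for j, x in enumerate(ordered, start=1):  # step 2: index j = 1, 2, ..., n
        p_j = j / (n + 1)                     # step 3: probability value
        z_j = NormalDist().inv_cdf(p_j)       # step 4: standardised normal quantile
        points.append((z_j, x))               # step 5: Z horizontal, X vertical
    return points


# Illustrative observations only.
points = normal_probability_plot_points([12.1, 9.8, 10.4, 11.0, 10.9, 9.5, 10.2])
for z, x in points:
    print(round(z, 3), x)
```

Plotting these pairs in any charting tool and checking whether they fall close to a straight line completes step 6.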

Uniform Distribution
- Rectangular Distribution
- Symmetrical
- Every value between the smallest and largest
o Is equally possible
Continuous Uniform Distribution:
Properties of a Uniform Distribution:

- Mean: (a + b) / 2
- Standard Deviation: sqrt((b - a)^2 / 12)
- Cumulative Uniform Distribution:
o Area/Range
o t is the width of the area
o a is the height of the area
a and b are values on the horizontal axis
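A minimal sketch of the continuous uniform distribution on the interval from a to b, using the standard results mean = (a + b)/2, standard deviation = sqrt((b - a)^2 / 12) and probability = width of the interval times the constant height 1/(b - a). The function names and the example interval 0 to 10 are assumptions.

```python
import math


def uniform_mean(a, b):
    """Mean of a uniform distribution between a and b."""
    return (a + b) / 2


def uniform_sd(a, b):
    """Standard deviation of a uniform distribution between a and b."""
    return math.sqrt((b - a) ** 2 / 12)


def uniform_prob_between(x1, x2, a, b):
    """P(x1 < X < x2): rectangle area = width of the overlap times height 1 / (b - a)."""
    low, high = max(x1, a), min(x2, b)  # keep only the part inside [a, b]
    if high <= low:
        return 0.0
    height = 1 / (b - a)
    return (high - low) * height


# Illustrative values only: X is uniform between 0 and 10.
a, b = 0, 10
print(uniform_mean(a, b), uniform_sd(a, b), uniform_prob_between(2, 5, a, b))
```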
Normal Approximation to Binomial Distribution:

- When the number of trials (n) is large
o Analysis with binomial distribution is tedious
o So we use normal approximation instead
- If n is such that nπ and n(1 - π) are both greater than 5,
o Binomial(n, π) may be approximated with a Normal(nπ, nπ[1 - π]) distribution.
Approximate Binomial Estimation:

Continuity Correction:
- Because the binomial distribution is discrete while the normal distribution is continuous:
- You have to apply a continuity correction
o To approximate binomial probabilities with the normal distribution
- E.g., below
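The effect of the continuity correction can be sketched by comparing the exact binomial probability with the normal approximation, with and without the 0.5 adjustment. The values n = 100, π = 0.4 and k = 45 are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def binomial_cdf(k, n, pi):
    """Exact P(X <= k) for a binomial variable."""
    return sum(math.comb(n, x) * pi ** x * (1 - pi) ** (n - x) for x in range(k + 1))


def normal_approx_cdf(k, n, pi, continuity_correction=True):
    """Approximate P(X <= k) with a normal distribution (mean n.pi, variance n.pi.(1 - pi))."""
    if not (n * pi > 5 and n * (1 - pi) > 5):
        raise ValueError("n*pi and n*(1 - pi) should both be greater than 5")
    mean = n * pi
    std_dev = math.sqrt(n * pi * (1 - pi))
    # With the correction, the discrete value k is treated as the interval up to k + 0.5.
    upper = k + 0.5 if continuity_correction else k
    return NormalDist(mean, std_dev).cdf(upper)


n, pi, k = 100, 0.4, 45  # illustrative values only

exact = binomial_cdf(k, n, pi)
approx_corrected = normal_approx_cdf(k, n, pi)
approx_uncorrected = normal_approx_cdf(k, n, pi, continuity_correction=False)
print(exact, approx_corrected, approx_uncorrected)
```

For P(X >= k) the adjustment goes the other way: the normal probability is taken from k - 0.5 upwards.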

Exponential Distribution
- Right Skewed
- Mean > Median
- Ranges from 0 (zero) to positive infinity
- A continuous probability distribution that is positively skewed
- In Excel, P(X > x): =1 - EXPONDIST(periods of concern; events per period; TRUE)
Properties of an Exponential Distribution:

- Mean: 1 / λ (where λ = average number of events per period)
- Variance: 1 / λ^2
- Probability: P(X <= x) = 1 - e^(-λ . x)
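The exponential probabilities can be sketched from the standard cumulative formula P(X <= x) = 1 - e^(-λ . x), where λ is the average number of events per period. The function names and the example rate of 20 events per hour are assumptions for illustration.

```python
import math


def exponential_cdf(x, lam):
    """P(X <= x): probability that the next event occurs within x periods."""
    return 1 - math.exp(-lam * x)


def exponential_survival(x, lam):
    """P(X > x): probability that the next event takes longer than x periods."""
    return math.exp(-lam * x)


lam = 20   # illustrative rate: 20 events per hour
x = 0.1    # 0.1 hours, i.e. 6 minutes

p_within = exponential_cdf(x, lam)
p_longer = exponential_survival(x, lam)
mean_time = 1 / lam
median_time = math.log(2) / lam  # the value of x where P(X <= x) = 0.5
print(p_within, p_longer, mean_time, median_time)
```

The median comes out smaller than the mean, which is consistent with the distribution being right skewed.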
Exponential Distribution Density Function:
