
Sampling should be random and the sample size should be big

When N approaches infinity, we solve for n (the sample size) as a limit

S² = p(1-p): the margin of error reaches its maximum when p = 0.5

Systematic sampling: I want a sample of size n

k = N/n; randomly select d from 1 to k, then form the sample of units d, d+k, d+2k, …
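
A minimal R sketch of systematic sampling; the numbers N = 1000 and n = 100 are made up for illustration:

# Systematic sampling: interval k = N/n, random start d in 1..k, then d, d+k, d+2k, ...
N <- 1000                                   # assumed population size
n <- 100                                    # assumed sample size
k <- floor(N / n)                           # sampling interval
d <- sample(1:k, 1)                         # random start between 1 and k
selected <- seq(from = d, by = k, length.out = n)
head(selected)                              # indices of the selected units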

Biased sample (e.g. when you collect your statistics only on Thursdays, week after week)

Stratified sampling: fix from the beginning the must-haves in each stratum. It is easy to implement; knowing the criteria that affect the result is the hard part.

Criteria that affect the results: gender, education level, region, social status, age

Proportional allocation (APP), f = 6%: we take 6% from each stratum, i.e. nh/Nh = 6%, so each stratum keeps its population share nh/n = Nh/N

Disproportionate allocation (DISP): you can take different rates from different strata; we say there is an imbalance ("déséquilibre"), which is corrected with the weight Nh/N

The sum of all weights = 1
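
A small R sketch of proportional allocation and the stratum weights; the stratum sizes Nh are invented for the example:

# Proportional allocation: same rate f in every stratum; the weights Nh/N sum to 1
Nh <- c(men = 4200, women = 5800)   # assumed stratum sizes in the population
N  <- sum(Nh)
f  <- 0.06                          # overall sampling rate (the 6% of the example)
nh <- round(f * Nh)                 # sample size per stratum, nh/Nh = f
w  <- Nh / N                        # stratum weights
sum(w)                              # = 1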

CI at level 1-alpha for theta = [theta_hat ± E] = [theta_hat ± z_(1-alpha/2) * sqrt(Var(theta_hat))]
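
A sketch of that interval in R for a proportion; p_hat = 0.42 and n = 400 are assumed values, not course data:

# CI = [theta_hat ± z_(1-alpha/2) * sqrt(Var(theta_hat))], here theta is a proportion p
p_hat <- 0.42
n     <- 400
alpha <- 0.05
z     <- qnorm(1 - alpha / 2)                 # z_(1-alpha/2), about 1.96 for alpha = 5%
E     <- z * sqrt(p_hat * (1 - p_hat) / n)    # margin of error
c(lower = p_hat - E, upper = p_hat + E)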

Sampling rate in stratum h: nh/Nh

If you don't have a sampling frame, go to non-probability sampling

If you have a frame but not enough money: stratified sampling

Cluster sampling: we do a census (a complete study of all the elements) within each selected cluster

Stratified sampling adds to the cost

If I have to collect data, I should follow these steps:

- look for a frame: buy it, build it, or deal with a call center

- calculate the sample size using the proportion formula with p = 0.5 to get an idea of (an upper bound on) the sample size (see the sketch after this list)

- How do we know the percentages for the strata? The INS website helps
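
A sketch of the sample-size step with p = 0.5 (the worst case, since p(1-p) is maximal there); the margin of error E = 5% and alpha = 5% are assumptions for illustration:

# n = z^2 * p(1-p) / E^2 with p = 0.5 as an upper bound
p     <- 0.5
E     <- 0.05                       # desired margin of error (assumed)
alpha <- 0.05
z     <- qnorm(1 - alpha / 2)
n     <- ceiling(z^2 * p * (1 - p) / E^2)
n                                   # about 385 respondents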

Quota sampling: it is like stratified sampling (looks like it, yet not exactly the same) but harder: you call people at random until you collect the sample sizes you need per quota; you may have already filled a quota and still reach people who fit it, while other quotas are still missing data

Route sampling: you define a sampling path (route) and select units along it


Judgmental sampling: we choose a location and take the sample from it, based on the judgment that "it contains all kinds of people"

Sample adjustment: for non-representative samples, we correct the sample

Over-sampling: we add to the under-represented group to rebalance

Under-sampling: from the group that came out too large, we keep only what we need

CHAPTER II : Bivariate Analysis

We always assume that the sample is representative (drawn at random and large enough); otherwise you can't do any hypothesis testing or build any confidence intervals

In this chapter we will determine the relationship between two features, which can be:

- 2 numerical (Pearson correlation test)

- a mixture of qualitative and quantitative (test of comparison of two means / ANOVA test)

- 2 categorical (chi-square test / cross table; l = number of rows, c = number of columns) (see the sketch after this list)
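
A minimal R sketch of the two-categorical case: a cross table plus chisq.test; the gender/smoker variables and the simulated data are made up:

# Cross table (l rows x c columns) + chi-square test of independence
set.seed(123)
gender <- sample(c("F", "M"), 200, replace = TRUE)
smoker <- sample(c("yes", "no"), 200, replace = TRUE)
tab <- table(gender, smoker)        # the cross table
tab
chisq.test(tab)                     # H0: no relationship between the two variables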

For testing there is always:

H0: Null hypothesis: absence of relationship, difference, effect, correlation

H1: Alternative hypothesis: their existence

There is always a test statistic, and the decision is made according to the p-value

The decision is always made by comparing the p-value with alpha:

p-value < alpha → we reject H0

p-value = the probability of rejecting the null hypothesis by error

For Pearson correlation:


r around 0.8/0.9 or -0.8/-0.9 → there is clearly a relationship, no need to run a test
In practice there is no r exactly equal to 0
r can be small (near 0) but still significant
alpha is 5% or 10% in the business area
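
A small R sketch of the Pearson correlation test on simulated data (the variables x and y are invented):

# Two numerical features: Pearson correlation test
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)           # y built to be correlated with x
cor.test(x, y, method = "pearson")  # gives r, the test statistic and the p-value
# decision: p-value < alpha -> reject H0 (no correlation)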

e-06 means ×10 to the power of -6

For more than 2 populations (ANOVA):

H0 = all the means are the same
H1 = at least one is different
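
A sketch of the more-than-2-populations case with a one-way ANOVA in R; the groups and scores are simulated, not course data:

# H0: all group means are equal / H1: at least one differs
set.seed(42)
group <- factor(rep(c("A", "B", "C"), each = 30))
score <- rnorm(90, mean = c(10, 10, 12)[as.integer(group)])
summary(aov(score ~ group))         # read the p-value of the F test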

Chapter 3: Principal component analysis

To find the best projection of the data (project while conserving the maximum amount of information).
PCA: reduce the dimension while keeping the maximum amount of information, clearly dispersed.
PCA is designed for numerical data.
PCA simplifies things (moving from a 10-dimensional space to a 2-dimensional one, for example).

Inertia: sum of the squared distances between the data points and the center of gravity.
Inertia after projection / inertia before projection = a rate (%); the bigger the better.
We choose the projection with the highest rate and we say that this projection explains rate% of the data variability.

We will see: Variable space and individual space

Vector, subspace, dimensions, basis

We calculate the correlation matrix, then its eigenvectors and eigenvalues (do some research on this)

det(A - lambda·I) = 0

Find lambda from this equation: the solutions lambda are the eigenvalues
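
A sketch in R: eigen() solves det(A - lambda·I) = 0 numerically for a correlation matrix (the data here is simulated):

# Correlation matrix, then its eigenvalues (lambda) and eigenvectors
set.seed(2)
X <- matrix(rnorm(50 * 3), ncol = 3)   # 50 individuals, 3 variables (made up)
A <- cor(X)                            # correlation matrix
e <- eigen(A)
e$values                               # the lambdas, already sorted in decreasing order
e$vectors                              # one eigenvector per eigenvalue (columns)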


The data we work on here is the statistical data

The data is presented as a matrix


Each row represents an individual and the columns are the features/variables
X^j: a vector representing variable j
G: the center of gravity: the average of each column
In the data example we compute the averages using the (Rensis) Likert scale

Objectives: PCA can help in market segmentation: identifying groups of homogeneous individuals.
Perception studies: important in marketing: how the consumer sees the product, what makes it different, why he is loyal to this product.
Evolution studies: statistics about the balance sheet over the years; the individuals here are the years.

Before PCA we have to fix the scale: normalization (needed when one feature dominates the distance, distance = (Xa - Xb)² + (Ya - Yb)²).
To normalize / to standardize (so that the features carry the same weight in the distance).
Scaling:
Z = (X - Xmin) / (Xmax - Xmin) → 0 ≤ Z ≤ 1
Z = (X - Xbar) / standard deviation of X
→ We unify the unit of measure of all the features.
→ We can choose not to scale when all the variables already share the same unit of measure.

Before normalization the data is called "X", after it is called "Z".
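
A short R sketch of the two scalings above, on a made-up feature x:

# Min-max scaling and standardization (z-score)
x <- c(12, 18, 25, 40, 33)
z_minmax <- (x - min(x)) / (max(x) - min(x))   # 0 <= Z <= 1
z_std    <- (x - mean(x)) / sd(x)              # mean 0, standard deviation 1
# scale() standardizes a whole data matrix column by column:
Z <- scale(data.frame(x1 = x, x2 = x^2))
colMeans(Z)                                    # ~0 for every column
apply(Z, 2, sd)                                # 1 for every column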


Symmetric matrix: a_ij = a_ji
Var(Zj) = 1
Zj_bar = 0
After the transformation the data is centered at 0 (the average of each variable) → the origin of the space
(We are talking about Zj = (Xj - Xj_bar) / sigma_Xj)

Data standardization
Calculate the correlation matrix
Calculate the eigenvalues

→ Each eigenvalue has its own eigenvector; we sort them in decreasing order.
p eigenvalues → p dimensions.
How do we move to a smaller-dimensional space?
We want to move from a p-dimensional space to an n-dimensional one, with n < p.
We take the n highest eigenvalues and we work in the space formed by their corresponding eigenvectors.
We put these vectors in a matrix "U" and we multiply it by "Z".
Any pair of eigenvectors forms a factorial plane (see the sketch after this list).
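
A sketch of the whole recipe by hand in R, on simulated data, keeping n = 2 axes (all the numbers are invented):

# Standardize, correlation matrix, eigen decomposition, project with C = Z U
set.seed(3)
X <- matrix(rnorm(100 * 5), ncol = 5)   # 100 individuals, p = 5 variables
Z <- scale(X)                           # standardized data
R <- cor(Z)                             # correlation matrix (p x p)
e <- eigen(R)                           # eigenvalues already sorted in decreasing order
U <- e$vectors[, 1:2]                   # eigenvectors of the 2 largest eigenvalues
C <- Z %*% U                            # coordinates of the individuals after projection
sum(e$values[1:2]) / sum(e$values)      # inertia rate kept by the first plane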

- Business data must be normalized


- The more variability in the data, the better the data.
- The best projection is the one that conserves the variability of the original data.
- Variability: we are talking about variances; the sum of the variances of all variables = the inertia.
- Correlation matrix R = transpose(Z) * Z (up to the 1/n factor) → Pearson correlations (after standardization; computed before standardization it gives the covariances).
- From the correlation matrix, we calculate the eigenvectors and eigenvalues and order them in decreasing order.
- det(R - lambda·I) = 0: solve it to find lambda 1, lambda 2, lambda 3, … the eigenvalues.
- All eigenvectors are orthogonal; there is no collinearity between eigenvectors.
- The best plane projection is the one on the plane defined by vector 1 and vector 2, associated with the first eigenvalues lambda 1 and lambda 2 (the one that best preserves the picture of the original data).
- The more axes we add, the better the inertia (closer to the original one), so the higher the inertia rate.
- ZU = C: the coordinates of the individuals in the new basis defined by the eigenvectors with the highest eigenvalues (after projection).
- Each combination of vectors defining a plane gives new information.
- The worst PCA is when it gives the same result for every projection; this happens when the eigenvalues are roughly equal (close to each other, with no jump between them).
- The extreme points contribute a lot to the clarity of the projection (like the fingertips in the projection of a hand); for this we calculate the absolute contribution (CTR).
- The bigger cos² (the nearer to 1), the smaller the angle, the better the projection.
- The sum of the two cos² gives the cos² with respect to the plane.
- cos² = 0.8 → we lost 0.2 = 20% of the distance.
- The smaller the distance OH, the worse the situation.

- We had individuals and their features; now we flip them so that the features have their individuals (the variable space).
Circle, sphere, hypersphere.

For the correlation circle:

V1 = age: going from the origin O toward v1 we find older people; in the opposite direction we find younger people.
The nearer a point is to the center, the nearer it is to the average point, so it is not very informative.

Number of axes to retain:

- Inertia criterion: I keep adding axes until I reach 70% as the inertia ratio (as a rule of thumb).
- Kaiser criterion: I keep the axes with eigenvalue > the average of the eigenvalues = 1 (sometimes we even take 0.999),
because sum of lambda_i = p, so lambda_bar = sum/p = 1, and t_i = lambda_i / sum of the lambdas = the % of inertia.
- Elbow criterion: we take the elbow; we draw a scree plot (see the sketch after this section).
We keep the projection that gives the better results and the better interpretations.
The first component represents/explains 32.71% of the data.
The first 2 components explain 50% of the data variation.
If my threshold is 70%, I should do a projection onto 4 dimensions using the 4 eigenvectors with the highest eigenvalues.
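
A sketch of the three criteria in R; the eigenvalues below are illustrative (the first two are chosen to reproduce the 32.71% and 50% figures above, the rest are made up, with p = 10 so the average is 1):

# Inertia, Kaiser and elbow criteria on a vector of eigenvalues
lambda <- c(3.271, 1.729, 1.05, 0.95, 0.88, 0.74, 0.60, 0.45, 0.20, 0.13)
cumsum(lambda) / sum(lambda)        # inertia criterion: keep adding axes until >= 70%
which(lambda > mean(lambda))        # Kaiser criterion: eigenvalues above the average (= 1)
plot(lambda, type = "b", xlab = "Component", ylab = "Eigenvalue",
     main = "Scree plot")           # elbow criterion: look for the bend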

Communality(high jump) = projection quality on the principal factor map (the map defined by comp 1 and comp 2)
= cos²(high jump, 1) + cos²(high jump, 2)
= the projection quality
= the sum of the two cos²
Application:
Communality(high jump) = 0.32 + 0.12 = 0.44: bad projection quality
Sum > 0.6: acceptable projection quality

Which variable contributes most to the definition of the 1st component?


- 100m, 110m hurdle, long jump (we can choose the variables with a contribution > the average).
This is the contribution.
Dimension 4 represents pole vault and javelin (this is the interpretation of the contribution).

- We look at what each dimension represents most; then, for each point close to a dimension, we can conclude that it is good on those dimensions.
The relative and the absolute contributions give the same result.
- There is a command to add dimensions to the factor map.
The best package for PCA in R: install.packages("packagename"); install FactoMineR (see the sketch below).
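
A sketch with FactoMineR on its built-in decathlon dataset, which looks like the course example (100m, 110m hurdle, long jump, high jump, pole vault, javeline...); the exact course data may differ:

# install.packages("FactoMineR")               # run once if the package is missing
library(FactoMineR)
data(decathlon)
res <- PCA(decathlon[, 1:10], graph = FALSE)   # the 10 events; variables standardized by default
res$eig                                        # eigenvalues, % of variance, cumulative %
round(rowSums(res$var$cos2[, 1:2]), 2)         # communality of each variable on the first factor map
round(res$var$contrib[, 1], 1)                 # contribution of each variable to the 1st component
plot(res, choix = "var")                       # correlation circle
plot(res, choix = "ind")                       # individuals factor map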
