
Descriptive statistics 2023/2024

Introduction :

The collection of numerical information dates back to the earliest times in human
history. For example, the first shepherds used stones or twigs to count and control
the number of animals in their flocks entering and leaving the sheepfold.

But it was with the sedentarization of man and the development of agricultural activity,
which gave rise to great civilizations, that the need for information on the population and its
wealth became a necessity for the survival of these states. Indeed, these great organized
states (Sumerians, Chinese, Egyptians, Persians, Greeks, Romans, etc.) used headcounts for
fiscal purposes (taxes on harvests, trade in goods, etc.), to distribute agricultural land, or to
mobilize the armed forces.

With the advent of Islam, there was a remarkable development in the techniques used by
the public services of the caliphate to count and evaluate goods and wealth for the payment
of the ZAKAT tax. During the caliphate of OMAR EL KHATTAB, a census was taken of
livestock, fruit trees and other assets.

Thus, an unprecedented shift took place in the aims of data collection: no longer to
increase the opulence, wealth and domination of those in power, but to improve the lot of
the underprivileged and strengthen social cohesion.

The beneficial effects of this form of income redistribution (or social solidarity tax)
led to a peace and social equilibrium never seen before.

Later on, Muslim scholars encouraged the use of statistics in scientific research, helped by the
enormous progress made in mathematics. IBN KHALDOUN (1332-1406) used and recommended
statistics in research into history, sociology, etc. He sometimes made
estimates for certain balance sheets.

The Middle Ages, spanning roughly from the 5th to the 15th century, were characterized
by a lack of systematic statistical development compared to later periods. The Middle Ages
were marked by different intellectual and social priorities, and statistical methods as we
understand them today were not a prominent part of the scholarly landscape during this
time. However, there were some rudimentary data collection and analysis practices that can
be considered precursors to modern statistics:

Economic Data Collection: In medieval Europe, some rudimentary forms of economic
data collection existed. This was often done for the purposes of taxation or trade. Local
authorities might collect information about the population, agricultural production, or
market activities.

Mr : SAHNOUN.A.Y Page 1

Demographic Records: Church records, such as parish registers, sometimes contained
information about births, deaths, and marriages. These records were not systematically
collected for statistical analysis but could be used retrospectively for demographic studies.

The development of statistics in the 17th century marked an important step in the
evolution of this field. While statistics as we know it today didn't fully take shape until later
centuries, there were several key developments and figures in the 17th century that laid the
groundwork for the discipline. Here's an overview of some of the significant developments
in statistics during this period:

John Graunt (1620-1674): John Graunt, an Englishman, is often considered one of the
pioneers of modern statistics. In 1662, he published a book titled "Natural and Political
Observations Made upon the Bills of Mortality." In this work, Graunt analyzed data on births
and deaths in London, creating the first known life tables and mortality statistics. He is
credited with introducing the concept of the life expectancy and is often referred to as the
father of demography.

William Petty (1623-1687): William Petty, an English economist and philosopher, made
significant contributions to the development of statistics. He applied statistical methods to
economic and social data. Petty is known for his work on political arithmetic, where he used
quantitative data to analyze various aspects of society, including population, wealth, and
resources.

Early Census Efforts: In the 17th century, there were initial attempts to conduct
censuses and collect data on population and economic activities in various countries. These
early census efforts laid the groundwork for more systematic data collection and analysis in
the centuries that followed.

Probability Theory: Probability theory, a fundamental branch of statistics, began to take
shape during this period. Blaise Pascal (1623-1662) and Pierre de Fermat (1601-1665)
corresponded about probability problems, leading to the development of probability
theory. This laid the foundation for statistical inference and the understanding of
randomness and uncertainty.

Scientific Revolution: The 17th century was a time of great scientific advancement, and
many of the scientific thinkers of this era, such as Galileo Galilei and Johannes Kepler, made
contributions that would later have implications for statistics. The scientific method, which
emphasizes systematic data collection and analysis, became more widely adopted during
this period.

Role in Statecraft: Statistics began to play a role in statecraft and governance.
Governments and rulers recognized the value of data in making informed decisions,
especially in matters related to taxation, public health, and military planning.

It's important to note that while these developments were crucial for the emergence of
statistics as a field, the mathematical and theoretical foundations of modern statistics were
still evolving. The 17th century laid the groundwork for subsequent advancements in
statistics, which continued to develop in the centuries that followed, particularly during the
18th and 19th centuries. The field of statistics as we understand it today, with concepts like
probability theory, hypothesis testing, and sampling theory, was more fully developed in the
18th and 19th centuries by figures like Carl Friedrich Gauss, Pierre-Simon Laplace, and Sir
Francis Galton.

Statistics

Definition:

«Statistics is the set of methods and techniques for processing numerical data
associated with a situation or phenomenon, with the aim of reporting reality, presenting
and analyzing data, and drawing conclusions and making decisions».

Remark:

Statistics can be divided into two types:

- Mathematical or inductive statistics (استقرائية)

- Descriptive or deductive statistics (الاستنتاجية)

Descriptive statistics

The result of an observation or a measurement is generally not equal to the theoretical value
calculated or expected by the engineer; repeating the same measurement
under seemingly identical conditions does not always lead to the same result. These
fluctuations, due to numerous causes, known or unknown, controlled or uncontrolled,
create difficulties for engineers and scientists alike.

Definition:

A set of methods for describing and analyzing, in a quantified way, phenomena
identified by numerous elements of the same nature, which can be counted and classified.

The aim of descriptive statistics is to describe and analyze, in a quantified way, phenomena
identified by numerous elements. To describe means to build tables, draw graphs and
calculate averages in order to highlight the significant features of the data.

Chapter I: Single-character statistical series

A statistical series is the sequence of values taken by a variable X over units of observation.
A single statistical variable, X, is considered here. The aim is to explain the elementary tools,
adapted to the nature of X, that enable us to present this variable in a synthetic way, to
make an appropriate graphical representation and to summarize its main characteristics.

The number of observation units is denoted n.

The values of the variable X are denoted $x_1, \dots, x_i, \dots, x_n$.

I.1 Data collection


Before collecting data, you need to
- Set a research problem and define the objectives to be achieved, either to explore and
describe a phenomenon, or to explain a relationship, or to forecast and anticipate.
- Define the study's target population and its components.

There are two types of survey:

I.1.1 Exhaustive surveys, or census :


Definition :
A census is an operation in which all the individuals in a population are covered by
observation.

Comments:
Do not confuse "enumeration" with "census":
- Enumeration: counting individuals in a population
- Census: quantifying data according to several parameters

Example:
Study of Algerian demographics over a specific period of time.

I.1.2 Partial surveys, or sampling:

A survey is a non-exhaustive data collection technique based on a portion
(or subset) of the population, called a sample. This sample, chosen at random, tells us about
the "parent" population with a certain degree of precision.
We sometimes find ourselves in situations where we are obliged to forgo the census in
favor of more practical surveys, for example when quality control destroys the product: lamps
(service life), matches, candles, etc.
Example:
To analyze the blood composition of a human being, it is enough to analyze a sample of a
few milliliters and generalize the result to the whole quantity.

I.2 The statistical apparatus in Algeria:

The backbone of Algeria's information system is the National Statistics Office (ONS). This
organization is responsible for the collection of administrative statistics, the general
population census

Example:
In a company, several surveys can be established, such as:
labor survey, employment and salary survey, expenditure and consumption survey,
industrial survey, municipal survey, building and public works survey.


I.3 Statistical vocabulary

Figure 01: Statistical vocabulary.

- Statistical observation: the elementary act of recording all
possible information of interest to the statistician, in accordance with a written or
oral questionnaire prepared in advance.

- Statistical test:
Descriptive statistics aims to study the characteristics of a set of observations, such as
the measurements obtained in an experiment. The experiment is the preliminary stage
in any statistical study.

Definition:
The statistical test is an experiment that we provoke.

- Population: the set on which our statistical study is based. This set is
denoted Ω.

Example: ENP students; employees of a company; a fleet of cars...

- Statistical units (individuals): the elements of the population on which the
observation focuses. An individual is denoted ω (ω ∈ Ω).

Example: student; employee; car.

- Sample: a subset of individuals taken from the population.

Example: students under 20; young employees; Renault cars.

Figure 02: Statistical distribution

- Character (statistical variable): we call character (or statistical variable,
denoted S.V.) any mapping
$X : \Omega \to C$
where C is the set of values of the character X (the value measured or observed on individuals).

Example:
For individuals: gender, socio-professional category (CSP), age, salary, etc.
For companies: number of employees, business sector, etc.
For geographical locations: altitude, vegetation type, etc.
For dates: share price, temperature, daily sales, etc.

- Modality: the modalities are the different situations in which an individual can be
found; each character studied can present two or more modalities.
Each individual in the population presents one and only one of the modalities of the character
under consideration.

Example:
The number of modalities for a character varies according to the degree of detail. For
example, the marital status character can have, depending on the case:
- Two modalities: married, unmarried
- Three modalities: single, married, widowed or divorced
- Four modalities: single, married, widowed, divorced
- Five modalities: single, married, widowed, divorced, undeclared

I.4 Types of characters:

Figure 03: Types of characters

I.4.1. Qualitative Variable:


A statistical variable is said to be qualitative if its modalities are not measurable, such as
gender, profession, marital status...
There are two types of qualitative variable:

I.4.1.1 Nominal qualitative variables:

These are variables whose modalities cannot be ranked or hierarchized. Nominal qualitative variables
cannot be measured. However, their modalities can be coded. The order and origin of the
coding are arbitrary, and the codes may be numeric, alphabetic or alphanumeric.

Coding: the process of assigning numerical values to responses expressed in numerical or
textual form.
Example: for the character 'gender of students', the female and male modalities cannot be
ranked or hierarchized.

Coding a nominal qualitative variable

The following table shows the different categories of the nominal variable "Professions
and socio-professional categories" (CSP):
Code   Category
01     Farmers
02     Artisans, shopkeepers
03     Executives and higher intellectual professions
04     Intermediate professions
05     Employees
06     Manual workers
07     Retirees
08     Other
Table 01: Example of coding.

In this example, there is no natural order between the eight categories, or modalities,
which are simply labels; the qualitative variable "CSP" is defined on a nominal scale.

I.4.1.2 Ordinal qualitative variables:


These are variables whose modalities can be ranked or hierarchized: we can order all the categories, from the
smallest to the largest (or, conversely, from the largest to the smallest). In contrast to a
nominal scale, expressions such as "greater than", "precedes", "follows", etc., are meaningful.

Example: for the character 'baccalaureate grade' (mention du baccalauréat), the modalities are ordered in
ascending order as follows: Fair, Fairly good, Good, Very good, Excellent.

I.4.2 Quantitative variables:

Any variable that is not quantitative can only be qualitative. The different modalities of a
quantitative variable constitute the set of numerical values that the variable can take.

Definition: A statistical variable is said to be quantitative if its modalities can be measured.

The modalities of a quantitative variable are numbers linked to the chosen unit, which must
always be specified.
There are two types of quantitative variable: discrete and continuous.

I.4.2.1 Discrete or discontinuous quantitative variables:

Definition: A quantitative statistical variable is said to be discrete if its set of modalities is finite
or countable. Thus, the set of modalities can be given in the form of a list of numbers.
In the rest of this chapter, we consider the following situation:
$X : \Omega \to \{x_1, x_2, \dots, x_k\}$
with Card(Ω) := N the number of individuals in our study.

Example: number of children per household; number of hours worked per day by a
company's employees; weight of a person recorded in whole kilograms.

I.4.2.2 Continuous quantitative variables:


A continuous character can take an infinite number of values within its interval of
definition. These values can be grouped into classes.

Definition: A continuous S.V. (or continuous character) is any real-valued mapping on Ω
that takes a "significant" number of values (an
infinite number of modalities).
Example: number of millimeters of rainfall recorded in a region during the different months of
the year.

I.5 Representation of a statistical series :


I.5.1 Statistical tables:
One of the aims of descriptive statistics is to summarize the raw data collected on a
population in statistical tables, in order to present the data in a readable way.

Example: Survey of a sample of 60 families in the city ..... on the number of children per
household. The raw results of the number of children are:

214220123045
254262642132
133311132332
425233151526
252312201431

It should be noted that the raw data are not legible, hence the need to group them
together in a table for easier analysis.
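
The grouping step can be sketched in a few lines of Python. This is only an illustration: it assumes that each digit of the five raw rows above is the answer of one household, and the names `rows`, `children` and `counts` are chosen here for the example.

```python
from collections import Counter

# Raw survey rows exactly as printed above: one digit per household.
rows = [
    "214220123045",
    "254262642132",
    "133311132332",
    "425233151526",
    "252312201431",
]

# Flatten the rows into one list of 60 observations.
children = [int(digit) for row in rows for digit in row]

# Count how many households report each number of children.
counts = Counter(children)

print("xi   ni")
for x in sorted(counts):
    print(f"{x:>2}  {counts[x]:>3}")
print("Total", len(children))
```

Counting the raw values in this way gives the partial headcounts that the statistical tables of this section are built from.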

Presentation of a statistical table

The presentation of a statistical table must respect certain general principles:
- The table must include clearly defined row and column headings, and specify the units
used.
- The table must display a title specifying its content, and the source of the information
when the data are borrowed from a publication or organization.

The components of a statistical table:

- The first column lists the different modalities (xi) taken by the character under
study.
- The second column gives the number of individuals (ni) corresponding to each
modality (xi) of the character.
Let us consider a statistical population of N individuals described according to the
character X whose k modalities are x1, x2, ..., xi, ..., xk.
Modality (xi)   Headcount (ni)
x1              n1
x2              n2
x3              n3
.               .
xi              ni
.               .
xk              nk
Total           N
Table 02: Statistical table.

- ni: the number of individuals presenting the modality xi, called the "partial headcount".
- N: the sum of the partial headcounts ni, called the "total headcount" of the population.

Reminder:
Let Ω be a set. We call the cardinal of Ω, denoted Card(Ω), the number of elements of Ω.
Card(Ω) := number of elements of Ω = N.

Example:
Number of children observed in a sample of households in the region Z
Number of children per household (xi)   Number of households (ni)
0                                       3
1                                       12
2                                       18
3                                       12
4                                       6
5                                       6
6                                       3
Total                                   60

- Partial headcount:

Definition:
For each value xi, we define
$n_i = \mathrm{Card}\{\omega \in \Omega : X(\omega) = x_i\}$
ni is the number of individuals taking the same value xi; it is called the partial headcount of xi.

Figure 04: The number of individuals taking the value xi.

Example:
In the example above, the number of families with three children is 12:

Number of children per household (xi)   …   3    …
Number of households (ni)               …   12   …
- Cumulative headcount:

Definition:
For each value xi, we define
$N_i = n_1 + n_2 + \dots + n_i = \sum_{j=1}^{i} n_j$
The cumulative headcount Ni of a value is the sum of the headcount of this value and the headcounts
of all the preceding values.

Example:
In the previous example, N4 = 45 families have three children or fewer.

Number of children per household (xi)   0   1    2    3    4    5    6
Cumulative number of households (Ni)    3   15   33   45   51   57   60

- Partial frequency:
The relative (or partial) frequency fi is the proportion of individuals with the same modality in the
population. fi is obtained by dividing each headcount ni by the total headcount N:
$f_i = \frac{n_i}{N}$

Note:
fi can be replaced by fi × 100, which then represents a percentage.

Example:
Applying the notion of partial frequency to the previous example gives us:

Number of children per household (xi)   Number of households (ni)   Frequency of households (%)
0                                       3                           5
1                                       12                          20
2                                       18                          30
3                                       12                          20
4                                       6                           10
5                                       6                           10
6                                       3                           5
Total                                   60                          100

Proposition:

If fi is defined as the partial frequency, then
$\sum_{i=1}^{k} f_i = 1$

Demonstration. Recall that
$f_i = \frac{n_i}{N}$
Then
$\sum_{i=1}^{k} f_i = \sum_{i=1}^{k} \frac{n_i}{N} = \frac{1}{N} \sum_{i=1}^{k} n_i = \frac{N}{N} = 1$

- Cumulative frequency:

Definition:
For each value xi, we define
$F_i = f_1 + f_2 + \dots + f_i$
The quantity Fi is called the cumulative frequency of xi.

Calculating the cumulative headcounts Ni and the cumulative frequencies Fi helps us to answer
questions about the distribution.
The calculation is made by summing the headcounts and relative frequencies down a table
column:

Figure 05: Calculation diagram.

Example:
Using the same example, answer the following questions:
- How many families have fewer than four children?
- How many families have at least four children?
- What proportion of families have at most four children?
- What proportion of families have more than four children?
In the table below, xi is the number of children per household, ni the number of households
and fi the household frequency.
xi      ni   fi     Increasing Ni   Decreasing Ni   Increasing Fi   Decreasing Fi
0       3    0.05   3               60              0.05            1
1       12   0.20   15              57              0.25            0.95
2       18   0.30   33              45              0.55            0.75
3       12   0.20   45              27              0.75            0.45
4       6    0.10   51              15              0.85            0.25
5       6    0.10   57              9               0.95            0.15
6       3    0.05   60              3               1               0.05
Total   60   1
According to the table:
- 45 households have fewer than 4 children.
- 15 households have at least 4 children.
- 85% of households have no more than 4 children.
- 15% of households have more than 4 children.
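
As a rough check of these four answers, the cumulative columns can be recomputed with a short Python sketch. The variable names (`cum_up`, `cum_down`, and so on) are illustrative choices, not notation from the course.

```python
from itertools import accumulate

values = [0, 1, 2, 3, 4, 5, 6]          # xi: children per household
headcounts = [3, 12, 18, 12, 6, 6, 3]   # ni: households
N = sum(headcounts)                     # total headcount

# Increasing cumulative headcounts: households with at most xi children.
cum_up = list(accumulate(headcounts))

# Decreasing cumulative headcounts: households with at least xi children.
cum_down = [N - up + n for up, n in zip(cum_up, headcounts)]

fewer_than_4 = cum_up[values.index(3)]        # at most 3 children
at_least_4 = cum_down[values.index(4)]
at_most_4 = cum_up[values.index(4)] / N       # proportion
more_than_4 = cum_down[values.index(5)] / N   # at least 5 children

print(fewer_than_4, at_least_4, at_most_4, more_than_4)
```

Reading "fewer than 4" as "at most 3" and "more than 4" as "at least 5" is what lets each answer be taken directly from one cell of a cumulative column.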

I.5.1.1 Different types of statistical tables:

The form of the table depends on the nature of the statistical variable.
The following questionnaire is used as an example to obtain the different tables.
Last name:            First name:
Age (in years):       Place of birth:       Height (cm):
Opinion on ENP:
Poor   Average   Good   Very good   Excellent

I.5.1.1.1 Statistical table of a nominal qualitative character:


The table below shows the distribution of students surveyed according to their place
of birth:
Place of birth: Number of students
A 98
B 87
C 22
E 13
Z 30

I.5.1.1.2 Statistical table of a qualitative ordinal character:


The distribution of students' opinions on the quality of studies at ENP is summarized
in the following table:
Opinion Headcount
Excellent 38
Very good 84
Good 75
Average 37
Poor 16

I.5.1.1.3 Statistical table of a discrete quantitative character:


The age of the students questioned is shown in the statistical table below.
Age Headcount
< 18 8
18 82
19 70
20 28
21 20
22 14
23 16
> 24 12


I.5.1.1.4 Statistical table of a continuous quantitative characteristic:


Student height distribution (in cm):
height (cm) Headcount
[130;150[ 2
[150;160[ 30
[160;165[ 60
[165;170[ 62
[170;175[ 44
[175;180[ 28
[180;190[ 16
[190;220[ 8

I.5.1.2 Determining the number and range of classes:

Various empirical formulas can be used to establish the number of classes for a sample of
size n.

Sturges' rule:

$\text{Number of classes} = 1 + 3.3 \log_{10} n$

Yule's rule:

$\text{Number of classes} = 2.5 \sqrt[4]{n}$

Class amplitude:

Definition:
The number $e = x_{max} - x_{min}$
is called the extent (range) of X. The class amplitude (step) can then be defined by:

$a = \frac{\text{extent}}{\text{number of classes}}$

where $x_{max}$ and $x_{min}$ are respectively the largest and smallest values of X in the statistical series.

Example:
We weigh the 50 students in a section and obtain the following results:
43,43,43,47,48,48,48,48,49,49,49,50,50,51,51,52,53,53,53,54,54,56,56,56,57,59,59,59,62,
62,63,63,65,65,67,67,68,70,70,70,72,72,73,77,77,81,83,86,91,91
Create a summary table.

Solution:
Sturges' rule: k = 1 + 3.3 log10(n) = 1 + 3.3 log10(50)
The number of classes equals 6.6.
Yule's rule: k = 2.5 × 50^(1/4)
The number of classes equals approximately 6.65.
If the fractional part of the number is between 0.0 and 0.5, the number is rounded down.
If the fractional part is between 0.51 and 0.99, the number is rounded up.
We accept k = 7.
- Calculation of the class interval:
a = (xmax − xmin) / k
a: class interval
xmax: maximum value of the statistical series.
xmin: minimum value of the statistical series.
k: number of classes.
So a = (91 − 43)/7 ≈ 6.86.
6.86 is between 6.51 and 6.99.
We take the value a = 7.
Weight      Headcount   Amplitude « ai »
[43, 50[    11          7
[50, 57[    13          7
[57, 64[    8           7
[64, 71[    8           7
[71, 78[    5           7
[78, 85[    2           7
[85, 92[    3           7
TOTAL       50
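
The whole procedure (number of classes, class interval, grouped table) can be sketched in Python as follows. Two assumptions are made here: Yule's coefficient is taken as 2.5, which reproduces the value of about 6.64 quoted in the solution, and Python's built-in `round` stands in for the rounding rule stated above. All variable names are illustrative.

```python
import math

weights = [
    43, 43, 43, 47, 48, 48, 48, 48, 49, 49, 49, 50, 50, 51, 51, 52, 53, 53,
    53, 54, 54, 56, 56, 56, 57, 59, 59, 59, 62, 62, 63, 63, 65, 65, 67, 67,
    68, 70, 70, 70, 72, 72, 73, 77, 77, 81, 83, 86, 91, 91,
]
n = len(weights)

sturges = 1 + 3.3 * math.log10(n)   # Sturges' rule
yule = 2.5 * n ** 0.25              # Yule's rule (coefficient 2.5 assumed)
k = round(sturges)                  # number of classes retained

extent = max(weights) - min(weights)
a = round(extent / k)               # class interval

# Build the classes [lo, hi[ starting at the minimum and count each one.
lower = min(weights)
table = []
for j in range(k):
    lo, hi = lower + j * a, lower + (j + 1) * a
    count = sum(lo <= w < hi for w in weights)
    table.append((lo, hi, count))

for lo, hi, count in table:
    print(f"[{lo}, {hi}[  {count}")
print("Total", sum(count for _, _, count in table))
```

Each class is closed on the left and open on the right, matching the [43, 50[ notation of the table, so a value equal to a boundary is counted only once.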
Class amplitude (raw table)
This is the difference between the upper and lower bounds of a class.
The amplitude "ai" of a class i is given by the following formula:

$a_i = e_i^{sup} - e_i^{inf}$

$a_i$: the amplitude of the class.
$e_i^{sup}$: the upper bound of the class.
$e_i^{inf}$: the lower bound of the class.

Case of equal amplitude or uniformity


Example:

Height      Headcount   Amplitude « ai »


[130,140[ 2 10
[140,150[ 30 10
[150,160[ 60 10
[160,170[ 62 10
[170,180[ 44 10
[180,190[ 28 10

Cases of unequal amplitudes

Example:
Height      Headcount   Amplitude « ai »
[130,150[ 2 20
[150,160[ 30 10
[160,165[ 60 5
[165,170[ 62 5
[170,175[ 44 5
[175,180[ 28 5
[180,190[ 16 10
[190,220[ 8 30
A histogram is made up of a series of rectangles, whose bases coincide with the classes
dividing the range of variation of the variable, and whose heights are such that the headcounts
(or frequencies) are expressed by the areas of the rectangles.

When the classes have unequal amplitudes, the histogram can no longer be constructed in exactly the same way:
frequencies (or headcounts) relating to classes of unequal amplitude are no longer comparable.
In this case, a correction must be made to take account of the amplitude differences.
Generally, the correction is made by calculating the frequencies (or headcounts) per amplitude
unit.

Corrected headcount:
$\text{corrected headcount per amplitude unit} = \frac{n_i}{a_i} \times a_{min}$
Corrected frequency:
$\text{corrected frequency per amplitude unit} = \frac{f_i}{a_i} \times a_{min}$

Class boundaries are plotted on the x-axis.
The y-axis shows the corrected frequencies (or headcounts) corresponding to each class.

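Example:
The human resources manager of a company has drawn up a statistical distribution of the
seniority of the company's managers, expressed in years:

Classes     [6,5 ; 9,5[   [9,5 ; 11[   [11 ; 12,5[   [12,5 ; 14[   [14 ; 17[
Amplitude   3             1,5          1,5           1,5           3
Headcount   11            12           19            9             9
Height h    5,5           12           19            9             4,5

The heights h are the corrected headcounts, obtained by multiplying each headcount by the
smallest amplitude (1,5) and dividing by the amplitude of its class.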
Figure 06: Graphical representation of salaries.

The center of a class:

This is the average of the class boundaries.

Definition:
The center "ci" of a class i is given by the following formula:
$c_i = \frac{e_i^{sup} + e_i^{inf}}{2}$

Example:

Height      Headcount   Amplitude « ai »   Center « ci »
[130,150[   2           20                 140
[150,160[   30          10                 155
[160,165[   60          5                  162,5
[165,170[   62          5                  167,5
[170,175[   44          5                  172,5
[175,180[   28          5                  177,5
[180,190[   16          10                 185
[190,220[   8           30                 205

Each class is characterized by :

- Lower boundary
- Upper boundary
- Amplitude (ai)
- Center (ci)

Figure 07: Class components.

I.5.2 Graphic representations


Graphs provide a visual summary of the distribution of a variable and highlight certain
information given in the table.
Graphical representations are specific to each type of variable or character (qualitative,
discrete quantitative or continuous quantitative).

I.5.2.1 Representations of qualitative characters


Qualitative variables can be represented graphically in a variety of ways.
The most commonly used diagrams are the band diagram (or organ pipe diagram) and the
circular sector diagram.

A) Line diagrams: this chart is applicable when the statistical units are few in number,
individually known and not repeated.
Example 1: Consider the series of numbers:
{8, 2, 3, 7, 4}

Figure 08: Line graphical representation

If some data are repeated, as in the example below, you need to switch to a grouped data
representation.
Example 2: Consider the series of digits where 7 and 2 are each repeated twice:

{8, 2, 3, 7, 4, 7, 2}

Figure 09: Line graphical representation

B) The "stem and leaf" graph:

This graphic consists of stacking units, while retaining their identification (a number,
a name, etc.). In this way, no initial data is missing from the graph, and each unit can
be easily located.

Example: Consider 18 people, each identified by a number between 1 and 20 and given a score
from 0 to 5.

Notes = {{0, 12}, {0, 14}, {1, 7}, {1, 9}, {1, 13}, {1, 18}, {2, 4}, {2, 8}, {2, 11}, {2, 15}, {2,
16}, {3, 17}, {3, 10}, {4, 5}, {4, 6}, {4, 20}, {5, 3}, {5, 19}}

In each data pair, the first number is the score (from 0 to 5), i.e. the
"stem", and the second identifies the person by a number between 1 and 20,
i.e. the "leaf".
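
A text version of this stacking can be sketched in Python: the pairs are grouped by score, and the identifiers of the people with that score are listed next to it. The names `notes` and `stems` are illustrative, and the layout is only one possible way of printing such a diagram.

```python
from collections import defaultdict

# (score, person identifier) pairs from the example above.
notes = [
    (0, 12), (0, 14), (1, 7), (1, 9), (1, 13), (1, 18), (2, 4), (2, 8),
    (2, 11), (2, 15), (2, 16), (3, 17), (3, 10), (4, 5), (4, 6), (4, 20),
    (5, 3), (5, 19),
]

# Group the identifiers (leaves) under their score (stem).
stems = defaultdict(list)
for score, person in notes:
    stems[score].append(person)

# One line per stem, with its leaves stacked to the right.
for score in sorted(stems):
    leaves = " ".join(str(p) for p in sorted(stems[score]))
    print(f"{score} | {leaves}")
```

The length of each printed line plays the role of the headcount, while every individual remains identifiable.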

Figure 10: Stem and leaf diagram

C) Bar charts: A bar chart is made up of a series of vertical or horizontal bars. Each
modality of the character is associated with a bar of length proportional to the
headcount or frequency of that modality.

Example: Distribution of company X employees by socio-professional category (CSP)

CSP                   Senior executives   Supervisors   Employees   Workers   Other categories
Number of employees   10                  5             20          40        5

Figure 11: Bar chart showing the distribution of company employees
according to CSP, by headcount.

Figure 12: Bar chart showing the distribution of company employees
according to CSP, by frequency.
D) Band charts:
In a band chart, a vertical band is associated with each modality. The width of each band
is the same, and its height is proportional to the headcount or frequency of the
corresponding modality. The distance between bands is constant. Above each band is a
label showing the headcount or frequency of the associated modality.

Example: the band chart corresponding to the number of employees in company X

Figure 13: Band chart showing the distribution of company employees
according to CSP, by headcount.

Sector diagrams (pie charts)

Circular or semicircular diagrams divide a disk or half-disk into slices, or sectors,
corresponding to the modalities observed and whose area is proportional to the headcount, or
frequency, of the modality.
For a full disk, the angle $\alpha_i$ (in degrees) of a sector is determined using the rule of three as follows:

$N \to 360^\circ$
$n_i \to \alpha_i$
N: total headcount
ni: partial headcount
So,
$\alpha_i = \frac{n_i \times 360^\circ}{N} = f_i \times 360^\circ$

Figure 14: Sector diagram
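
As an illustration of the rule of three for a full disk, the sector angles can be computed for the employee distribution of company X used earlier. The sketch below assumes that each sector angle is the headcount times 360 divided by the total headcount; the dictionary keys are the category labels of that example and the variable names are illustrative.

```python
# Headcount per socio-professional category (company X example).
headcounts = {
    "Senior executives": 10,
    "Supervisors": 5,
    "Employees": 20,
    "Workers": 40,
    "Other categories": 5,
}
N = sum(headcounts.values())   # total headcount

# Rule of three: N corresponds to 360 degrees, n_i to the sector angle.
angles = {label: n * 360 / N for label, n in headcounts.items()}

for label, angle in angles.items():
    print(f"{label}: {angle} degrees")
print("Sum of angles:", sum(angles.values()))
```

For a semicircular diagram the same computation would use 180 instead of 360.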

I.5.2.2 Representations of quantitative characteristics:

I.5.2.2.1 Graphical representation of discrete quantitative characteristics:

a) Representation of a frequency distribution: A distribution with a discrete quantitative
variable is shown as a bar chart.
Example: Number of children of a company's 40 employees:

Number of children   Headcount   Frequency (%)
0                    8           20
1                    7           17,5
2                    12          30
3                    6           15
4                    3           7,5
5                    4           10
Total                40          100

Figure 15: Frequency polygon.

I.5.2.2.2 Graphical representation of continuous quantitative characteristics:

a) Histogram:

Example: When studying the teenage population of a working-class neighborhood, their heights
may be distributed as follows:
Height (cm)   Headcount   Frequency (%)
[140,145[     1           2
[145,150[     1           2
[150,155[     9           18
[155,160[     17          34
[160,165[     16          32
[165,170[     3           6
[170,175[     3           6

The histogram of the headcounts of this series is shown in the following graph:

Figure 16: Histogram of headcount.

On the same graph as the histogram of headcounts (frequencies), we draw the polygon of headcounts
(frequencies). This polygon represents the distribution in the form of a curve, obtained by
joining the midpoints of the upper sides of the histogram rectangles with straight line
segments.

Figure 17: Frequency polygon.

b) Cumulative polygons: In an orthogonal Cartesian coordinate system, we plot the points whose
abscissas are the upper limits of the classes (except for the first point), and whose ordinates
are the corresponding increasing cumulative headcounts.
By joining these points with line segments, we obtain the increasing cumulative polygon of the given
distribution.
Example: The increasing cumulative headcounts are calculated:

Height (cm)   Headcount   Cumulative headcount
[140,145[     1           1
[145,150[     1           2
[150,155[     9           11
[155,160[     17          28
[160,165[     16          44
[165,170[     3           47
[170,175[     3           50

Figure 18: Cumulative polygon.

c) Representation of a cumulative frequency distribution (or of cumulative headcounts)

Cumulative frequencies help us to better exploit and interpret our results.
Example: Number of children among a company's 40 employees

Number of children   Headcount   Frequency (%)   Increasing Fi (%)   Decreasing Fi (%)
0                    8           20              20                  100
1                    7           17,5            37,5                80
2                    12          30              67,5                62,5
3                    6           15              82,5                32,5
4                    3           7,5             90                  17,5
5                    4           10              100                 10
Total                40          100

Figure 19: Increasing cumulative curve.

Figure 20: Decreasing cumulative curve.

I.6 Parameter of a series :


A statistical series is simply a list of values, not necessarily numbers.
A statistical series can be represented by a list of values, a table...

I.6.1 Position parameters:


I.6.1.1 The mode :

Definition :
The mode (or modal value), denoted Mo, is the value that the statistical variable takes most often
(the value with the highest headcount).
The mode can be determined for both qualitative and quantitative characteristics.

Example: Consider the series {8, 4, 4, 3, 4, 3, 8, 2, 5}.

The most frequent value in this series is 4 (it occurs 3 times). The mode is Mo = 4.

Comment:
A series can have a single mode (unimodal), several modes (multimodal), or no mode at all.

Example:
Consider the series S = {4, 0, 1, 1, 2, 2, 2, 3, 3, 4, 2, 3, 4, 5, 2, 1, 3, 3, 4, 5}.
"2" and "3" are the most frequently recurring values: 5 times each.
This series has two modes: 2 and 3.
Consider the series R = {8, 6, 5, 7, 3, 1}. Every value occurs exactly once, so we can say either
that all values are modal or that there is no mode.
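
As a minimal sketch, the three situations above (one mode, two modes, all values equally frequent) can be checked in Python. The function name `modes` is an illustrative choice, not part of the course.

```python
from collections import Counter

def modes(values):
    """Return every value that reaches the highest headcount, sorted."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

first = [8, 4, 4, 3, 4, 3, 8, 2, 5]
S = [4, 0, 1, 1, 2, 2, 2, 3, 3, 4, 2, 3, 4, 5, 2, 1, 3, 3, 4, 5]
R = [8, 6, 5, 7, 3, 1]

print(modes(first))  # single mode
print(modes(S))      # two modes
print(modes(R))      # every value tied: "all modal" or "no mode"
```

When the function returns every distinct value, as for R, the series is usually described as having no mode.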


a) Mode in the case of a qualitative variable:

The mode is the modality with the highest headcount.
Example: Distribution of the employees of company X by socio-professional category (SPC)

SPC category Executive Supervisor Employee Worker Other
Headcount 10 5 20 40 5

The mode of this character is "Worker".

b) Mode in the case of a discrete quantitative variable:

The mode is the value with the highest headcount.

Example: Number of children among a company's 40 employees

Number of children Headcount Frequency (%)
0 8 20
1 7 17.5
2 12 30
3 6 15
4 3 7.5
5 4 10
Total 40 100
The mode of this character is 2 children.

c) The modal class for a continuous quantitative variable:


If the classes all have the same amplitude, the modal class is the class with the highest headcount
or the highest frequency. If the classes do not all have the same amplitude, the modal class is the
class with the highest corrected headcount or corrected frequency.

Examples:
Consider the statistical distribution of a population of students according to their height (in cm):

Height (cm) <160 [160, 170[ [170, 180[ [180, 190[ > 190 Total
Headcount 6 7 8 2 1 24
Freq (%) 25 29.2 33.3 8.3 4.2 100

The highest headcount (or the highest frequency) indicates that the modal class is
[170, 180[.

Mode calculation
Consider the distribution of a population of students by weight (in kg). The corrected values are
obtained by dividing by the class amplitude and multiplying by the reference amplitude (here 5).

Weight (kg)  Headcount  Corrected headcount  Frequency  Amplitude  Corrected frequency
<55          2          2                    0.0833     5          (0.0833/5)×5 = 0.0833
[55,60[      3          3                    0.125      5          0.125
[60,70[      4          (4/10)×5 = 2         0.1667     10         (0.1667/10)×5 = 0.0835
[70,75[      5          5                    0.2083     5          0.2083
[75,85[      6          3                    0.25       10         0.125
>85          4          2                    0.1667     10         0.0835
Total        24                              1


The modal class, i.e. the class with the highest corrected frequency, is [70, 75[.

Rule:
The mode of a continuous quantitative characteristic can be calculated with the following
formula:
$$Mo = LI_{CMo} + A_{CMo} \times \frac{\Delta_1}{\Delta_1 + \Delta_2}$$

Where
$LI_{CMo}$ is the lower bound of the modal class
$A_{CMo}$ is the amplitude of the modal class
$\Delta_1 = f_{CMo} - f_{CMo-1}$: difference between the frequency of the modal class and the frequency of the
preceding class
$\Delta_2 = f_{CMo} - f_{CMo+1}$: difference between the frequency of the modal class and the frequency of the
next class
The same formula applies with the (corrected) headcounts in place of the frequencies. With the
corrected frequencies of the table above:
$$Mo = 70 + 5 \times \frac{0.2083 - 0.0835}{(0.2083 - 0.0835) + (0.2083 - 0.125)} \approx 72.99$$
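
The interpolation rule can be sketched as a small Python function. The function and parameter names are illustrative assumptions. The sketch uses the corrected headcounts of the weight table (5 for the modal class [70,75[, 2 for the preceding class, 3 for the next one), which gives exactly 73; the value 72.99 comes from the rounded corrected frequencies.

```python
def interpolated_mode(lower, amplitude, d_modal, d_prev, d_next):
    """Mode inside the modal class by linear interpolation.

    d_modal, d_prev and d_next are the (corrected) headcounts or frequencies
    of the modal class, the preceding class and the next class.
    """
    delta1 = d_modal - d_prev
    delta2 = d_modal - d_next
    return lower + amplitude * delta1 / (delta1 + delta2)

# Modal class [70, 75[ with corrected headcounts 2, 5 and 3 around it.
mo = interpolated_mode(lower=70, amplitude=5, d_modal=5, d_prev=2, d_next=3)
print(mo)
```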

Figure 21: Graphical representation or determination of the mode (continuous case).

I.6.1.2 The median


The median of a series is the value which divides the series, once sorted, into two
sub-series of equal size. The first contains the values below the median; the second
contains the values above the median.

1) From a statistical series:

A. Calculation of the median: odd number of values

Rule:
To find the median, you must:
- arrange the series in increasing order of values;
- locate the value that divides the series into two equal parts; its rank is given by the
formula (n+1)/2.

Example: Consider the following series of 5 numbers: {8, 13, 9, 5, 25}

In ascending order: {5, 8, 9, 13, 25}
The formula gives (5+1)/2 = 3. The third value in the series is 9, so Me = 9.

{5, 8} (values below the median) | 9 (median) | {13, 25} (values above the median)

Check that there are as many values below the median as there are above it: the series
is divided into two equal parts.

B – Median calculation: even number of values

When the number of values is even, the median is generally not a value in the series. It must be
calculated.
Example: Consider the following series of 8 values: {13, 1, 9, 10, 2, 4, 12, 7}
To find the median, you must:
a) Arrange the series in ascending order of values: {1, 2, 4, 7, 9, 10, 12, 13}
b) Apply the formula (n+1)/2, i.e. here (8+1)/2 = 4.5. This tells us that the median lies
between the 4th and 5th values. The median is therefore equal to the simple arithmetic
mean of these two values:
Me = (7+9)/2 = 8

{1, 2, 4, 7} (values below the median) | 8 (median) | {9, 10, 12, 13} (values above the median)
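
The two rules (odd and even number of values) can be combined in one short Python sketch. The function name `median` is an illustrative choice.

```python
def median(values):
    """Median of a raw series: middle value, or mean of the two middle values."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the value of rank (n + 1) / 2.
        return s[mid]
    # Even n: mean of the values of ranks n/2 and n/2 + 1.
    return (s[mid - 1] + s[mid]) / 2

print(median([8, 13, 9, 5, 25]))             # odd case
print(median([13, 1, 9, 10, 2, 4, 12, 7]))   # even case
```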

2) From a statistical table: discrete variable

The median is determined from the cumulative frequencies: it is the first value xi whose
cumulative frequency Fi reaches or exceeds 50%.

xi ni Ni fi(%) Fi(%)
0 3 3 5 5
1 4 7 7 12
2 8 15 15 27
3 7 22 13 40
4 14 36 25 65
5 9 45 16 81
6 6 51 11 92
7 2 53 4 96
8 1 54 2 98
9 1 55 2 100
Here Fi first exceeds 50% at xi = 4 (Fi = 65%), so Me = 4.

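3) Calculating the median: values grouped into classes

$$Me = L_i + a_i \left[ \frac{\frac{n}{2} - N(x_{i-1})}{n_i} \right]$$

$L_i$: lower limit of the median class.
$N(x_{i-1})$: cumulative headcount of the classes strictly below the median class.
$x_i$: median class.
$n_i$: headcount of the median class.
$a_i$: amplitude of the median class.
Example:
xi ni N(xi) fi (%) Fi (%)
[0,5[ 2 2 6.67 6.67
[5,10[ 7 9 23.33 30
[10,15[ 18 27 60 90
[15,20[ 3 30 10 100

Here n = 30, so n/2 = 15; the median class is [10,15[, the first class whose cumulative
headcount (27) reaches 15.
$$Me = 10 + 5 \left[ \frac{15 - 9}{18} \right] \approx 11.67$$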

I.6.1.3 The Quartiles :

The first quartile Q1 is the smallest value in the series such that at least 25% of the values are less
than or equal to Q1.
The second quartile Q2 (the median) is the smallest value in the series such that at least 50% of the
values are less than or equal to Q2.
The third quartile Q3 is the smallest value in the series such that at least 75% of the values are less
than or equal to Q3.
We can also define the quartiles Q1, Q2, Q3 as the values that divide an ordered
population into four groups, each containing the same number of elements.


Example: We carry out a statistical study of the 50 marks awarded by a board of examiners.
The results, with the marks sorted in ascending order (discrete variable), are as follows.

Marks Headcount Cumulative headcount Frequency (%) Cumulative frequency (%)
0 1 1 2 2
1 2 3 4 6
2 2 5 4 10
3 3 8 6 16
4 2 10 4 20
5 3 13 6 26 Q1
6 2 15 4 30
7 3 18 6 36
8 4 22 8 44
9 3 25 6 50 Q2
10 2 27 4 54
11 3 30 6 60
12 4 34 8 68
13 4 38 8 76 Q3
14 3 41 6 82
15 1 42 2 84
16 2 44 4 88
17 1 45 2 90
18 2 47 4 94
19 2 49 4 98
20 1 50 2 100

n/4 = 12.5 is not an integer, so the first quartile is the term of rank 13, i.e. Q1 = 5.
n/2 = 25 is an integer, so the second quartile is the term of rank 25, i.e. Q2 = 9.
3n/4 = 37.5 is not an integer, so the third quartile is the term of rank 38, i.e. Q3 = 13.

The first quartile Q1 = 5
The second quartile or median Q2 = 9
The third quartile Q3 = 13
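
The rank rule used above (take the term whose rank is k·n/4, rounded up when it is not an integer) can be sketched in Python. The helper name `quartile` is an assumption, and the list `headcounts` simply restates the headcount column of the marks table.

```python
import math

def quartile(sorted_values, k):
    """k-th quartile as the smallest value with at least k/4 of the data <= it."""
    n = len(sorted_values)
    rank = math.ceil(k * n / 4)  # 12.5 -> 13, 25 -> 25, 37.5 -> 38
    return sorted_values[rank - 1]

# Headcounts of the marks 0, 1, ..., 20 (50 marks in total).
headcounts = [1, 2, 2, 3, 2, 3, 2, 3, 4, 3, 2, 3, 4, 4, 3, 1, 2, 1, 2, 2, 1]
marks = [mark for mark, n in enumerate(headcounts) for _ in range(n)]

q1, q2, q3 = (quartile(marks, k) for k in (1, 2, 3))
print(q1, q2, q3)
```

Other conventions for quartiles exist (for instance interpolating between neighbouring ranks), so software packages may return slightly different values.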

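Continuous case

Formula:
The first quartile
$$Q_1 = L_{Q_1} + a_{Q_1} \left( \frac{\frac{N}{4} - N_{Q_1 - 1}}{n_{Q_1}} \right)$$
The third quartile
$$Q_3 = L_{Q_3} + a_{Q_3} \left( \frac{\frac{3N}{4} - N_{Q_3 - 1}}{n_{Q_3}} \right)$$
where, for the class containing the quartile, L is its lower limit, a its amplitude, n its
headcount, and $N_{Q-1}$ the cumulative headcount of the preceding classes.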


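Example:
Marks Headcount Cumulative headcount Frequency (%) Cumulative frequency (%)
[0 ; 5[ 10 10 20 20
[5 ; 8[ 8 18 16 36
[8 ; 12[ 12 30 24 60
[12 ; 15[ 11 41 22 82
[15 ; 20[ 9 50 18 100
Total 50

Solution:

Method 1:

$$Q_1 = 5 + 3 \left( \frac{\frac{50}{4} - 10}{8} \right) \approx 5.94$$

$$Q_2 = 8 + 4 \left( \frac{\frac{50}{2} - 18}{12} \right) \approx 10.33$$

$$Q_3 = 12 + 3 \left( \frac{\frac{3 \times 50}{4} - 30}{11} \right) \approx 14.05$$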
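
Method 1 can be sketched as one Python function that interpolates inside the class containing the required rank. The function name and the representation of classes as (lower, upper) pairs are assumptions made for this illustration; headcounts are assumed to be strictly positive.

```python
def grouped_quantile(classes, headcounts, p):
    """Quantile of order p (0 < p <= 1) for data grouped into classes."""
    n = sum(headcounts)
    target = p * n            # N/4, N/2 or 3N/4 for the quartiles
    cumulative = 0            # cumulative headcount of the preceding classes
    for (lower, upper), count in zip(classes, headcounts):
        if cumulative + count >= target:
            return lower + (upper - lower) * (target - cumulative) / count
        cumulative += count
    raise ValueError("p must lie between 0 and 1")

classes = [(0, 5), (5, 8), (8, 12), (12, 15), (15, 20)]
headcounts = [10, 8, 12, 11, 9]

q1 = grouped_quantile(classes, headcounts, 0.25)
q2 = grouped_quantile(classes, headcounts, 0.50)
q3 = grouped_quantile(classes, headcounts, 0.75)
print(round(q1, 2), round(q2, 2), round(q3, 2))
```

With p = 0.5 the same function reproduces the grouped-median formula of the previous section.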

Method 2:

The three quartiles can also be determined graphically, by linear interpolation on the
cumulative polygon.

Figure 22: Determining quartiles.


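The quartiles are read off the cumulative polygon using Thales' theorem (linear interpolation
between the bounds of the class concerned).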

Definition :
The difference between Q3 and Q1 is called the interquartile range.
The interquartile range is used to assess the dispersion of a series, either absolutely, or by
comparison with another series (provided the values of the other series are expressed in the
same unit). The Q1 and Q3 values delimit a range within which approximately 50% of the values
in the series are concentrated.

Box plot:

The median, as a position parameter, and the interquartile range, as a dispersion parameter, provide a
good description of a statistical series. Together with the extreme values, they are used to construct
the box plot of the series.
Example:
Consider a series of values summarized by:
- minimum Min = 8
- 1st quartile Q1 = 10.25
- median Me = Q2 = 12.75
- 3rd quartile Q3 = 18.75
- maximum Max = 33

These five values allow us to construct a box plot:


Figure 23: box plot


I.6.1.4 Deciles:
Deciles separate a statistical series into ten groups of equal size (to the nearest unit). One-tenth of
the values are below the first decile D1, and one-tenth of the values are above the ninth decile D9.

For a series of N values arranged in ascending order, we calculate N/10 and 9N/10.

D1 is the value of rank n = N/10 (rounded up when N/10 is not an integer, as for the quartiles).

D9 is the value of rank n' = 9N/10 (rounded up in the same way).

Example:

Let's take the values in ascending order:

1-3-3-3-5-5-6-7-7-8-8-8-9-9-10-10-10-10-11-11-12-12-13-13-13-13-14-15-16-19

There are N = 30 values, and 30 is divisible by 10: 30/10 = 3 is an integer.

n = N/10 = 3, so D1 is the 3rd value of the series arranged in ascending order: D1 = 3.

n' = 9N/10 = 27, so D9 is the 27th value of the series arranged in ascending order: D9 = 14.
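
A minimal Python sketch of the decile rule follows. The function name `decile` is illustrative, and rounding the rank up when k·N/10 is not an integer is an assumption carried over from the quartile rule.

```python
import math

def decile(sorted_values, k):
    """k-th decile: the value of rank k*N/10, rounded up to an integer rank."""
    n = len(sorted_values)
    rank = math.ceil(k * n / 10)
    return sorted_values[rank - 1]

values = [1, 3, 3, 3, 5, 5, 6, 7, 7, 8, 8, 8, 9, 9, 10, 10, 10, 10,
          11, 11, 12, 12, 13, 13, 13, 13, 14, 15, 16, 19]

d1 = decile(values, 1)
d9 = decile(values, 9)
print(d1, d9)
```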


I.6.1.5 Arithmetic mean ($\bar{x}$)

The mean is the simplest indicator for summarizing the information provided by a set of
statistical data.

Definition:
The (arithmetic) mean is the sum of the observed values divided by their number.
Let {x1, x2, ..., xn} be a series of numbers. The arithmetic mean of this series is
given by:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Example: Consider the series of numbers {8, 5, 9, 13, 25}. Its arithmetic mean
is calculated as follows:
$$\bar{x} = \frac{8 + 5 + 9 + 13 + 25}{5} = \frac{60}{5} = 12$$

Weighted arithmetic mean: (for a discrete variable)

Definition:
Let {x1, x2, ..., xk} be the distinct values of a series and {n1, n2, ..., nk} the corresponding
headcounts, with n = n1 + n2 + ... + nk.
The weighted arithmetic mean of this series is given by:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{k} n_i x_i$$

Example: The study of 20 families led to the following distribution of the number of children
per family:
Number of children (xi) 0 1 2 3 4 5
Number of families (ni) 5 3 6 1 3 2
fi (%) 25 15 30 5 15 10
The average number of children per family is:

$$\bar{x} = \frac{(0 \times 5) + (1 \times 3) + (2 \times 6) + (3 \times 1) + (4 \times 3) + (5 \times 2)}{5 + 3 + 6 + 1 + 3 + 2} = \frac{40}{20} = 2$$

or, using the frequencies:

$$\bar{x} = (0 \times 0.25) + (1 \times 0.15) + (2 \times 0.30) + (3 \times 0.05) + (4 \times 0.15) + (5 \times 0.10) = 2$$

Weighted arithmetic mean: (for a continuous variable)

Definition:
Let [ai, bi[ be the classes of a continuous variable and {n1, n2, ..., nk} the corresponding
headcounts.
ci is the center of class i.
The weighted arithmetic mean of this series is given by:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{k} n_i c_i$$

Example: The height distribution of 40 students is given in the table below.

Height (cm) [150,160[ [160,165[ [165,170[ [170,175[ [175,185[
Number of students 4 8 10 16 2
ci 155 162.5 167.5 172.5 180
Average student height:
$$\bar{x} = \frac{(155 \times 4) + (162.5 \times 8) + (167.5 \times 10) + (172.5 \times 16) + (180 \times 2)}{4 + 8 + 10 + 16 + 2} = \frac{6715}{40} = 167.875$$
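
Both weighted means (discrete values, then class centers) follow the same computation, sketched below in Python. The function name `weighted_mean` is an illustrative choice; the data are those of the two examples.

```python
def weighted_mean(values, headcounts):
    """Mean of values weighted by their headcounts: sum(n_i * x_i) / sum(n_i)."""
    total = sum(headcounts)
    return sum(n * x for x, n in zip(values, headcounts)) / total

# Discrete case: number of children in 20 families.
mean_children = weighted_mean([0, 1, 2, 3, 4, 5], [5, 3, 6, 1, 3, 2])

# Continuous case: each class is replaced by its center c_i = (a_i + b_i) / 2.
classes = [(150, 160), (160, 165), (165, 170), (170, 175), (175, 185)]
centers = [(a + b) / 2 for a, b in classes]
mean_height = weighted_mean(centers, [4, 8, 10, 16, 2])

print(mean_children, mean_height)
```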

I.7 Dispersion parameters :


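These parameters characterize the structure of the distribution and help explain its
internal composition and shape.

I.7.1 Moments:
Moments are algebraic quantities used to describe the characteristics of statistical distributions:
shape, symmetry, kurtosis, central tendency, dispersion.

a. Simple or non-centered moment of order r:

The moment of order r of n numbers x1, x2, ..., xi, ..., xn is defined by:

$$m_r = \frac{x_1^r + x_2^r + \dots + x_i^r + \dots + x_n^r}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i^r$$

If r = 1, the non-centered moment of order 1 is the arithmetic mean: $m_1 = \bar{x}$.

b. Centered moment of order r

The centered moment of order r with respect to a constant a is defined by:

$$\mu_r(a) = \frac{(x_1 - a)^r + \dots + (x_i - a)^r + \dots + (x_n - a)^r}{n} = \frac{1}{n} \sum_{i=1}^{n} (x_i - a)^r$$

In general, when a is not specified, $a = \bar{x}$ and $\mu_r = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^r$.

That is, $\mu_r = \overline{(x - \bar{x})^r}$: the mean of the r-th powers of the deviations from the mean.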
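
The two kinds of moments can be sketched in Python as follows. The function names are assumptions, and the series {8, 5, 9, 13, 25} from the arithmetic-mean example is reused only as illustrative data.

```python
def raw_moment(values, r):
    """Non-centered moment of order r: mean of the r-th powers."""
    return sum(x ** r for x in values) / len(values)

def centered_moment(values, r, a=None):
    """Centered moment of order r about a (about the mean when a is omitted)."""
    if a is None:
        a = raw_moment(values, 1)
    return sum((x - a) ** r for x in values) / len(values)

series = [8, 5, 9, 13, 25]
m1 = raw_moment(series, 1)        # the arithmetic mean
mu1 = centered_moment(series, 1)  # always 0 about the mean
mu2 = centered_moment(series, 2)  # the variance
print(m1, mu1, mu2)
```

The centered moment of order 2 about the mean is the variance defined in the next section.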


I.7.2 The variance σ² or V(x)

Definition:
The variance is an indicator of the dispersion of a series in relation to its mean.
1) The variance of a series is given by the following formula:
$$V(x) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
2) The variance of a discrete quantitative variable is expressed by:
$$V(x) = \frac{1}{n} \sum_{i=1}^{k} n_i (x_i - \bar{x})^2 \qquad (1)$$
if the size considered is that of a population.
3) The variance of a continuous quantitative variable is expressed by:
$$V(x) = \frac{1}{n} \sum_{i=1}^{k} n_i (c_i - \bar{x})^2 \qquad (2)$$
Where ci is the center of the class

Calculation with the "developed" formula

The variance can be calculated directly from formula (1). However, to simplify the
calculations, it is preferable to use the "developed" (expanded) formula.

Note:
The "expanded" formula for a discrete quantitative variable:
$$V(x) = \frac{1}{n} \sum_{i=1}^{k} n_i x_i^2 - (\bar{x})^2$$
The "expanded" formula for a continuous quantitative variable:
$$V(x) = \frac{1}{n} \sum_{i=1}^{k} n_i c_i^2 - (\bar{x})^2$$
Where ci is the center of the class

Variance properties:
- V(x+a) = V(x), so σ(x+a) = σ(x)
- V(ax) = a²V(x), so σ(ax) = |a|σ(x)

Proof:
$$V(x+a) = \frac{1}{n} \sum \left[ (x_i + a) - \overline{(x + a)} \right]^2$$
$$V(x+a) = \frac{1}{n} \sum (x_i + a - \bar{x} - a)^2 \quad \text{with the mean property } \overline{(x + a)} = \bar{x} + a$$
$$V(x+a) = \frac{1}{n} \sum (x_i - \bar{x})^2$$
$$V(x+a) = V(x)$$
$$V(ax) = \overline{(ax)^2} - \left( \overline{ax} \right)^2$$
$$V(ax) = a^2 \overline{x^2} - (a \bar{x})^2 \quad \text{with the mean property } \overline{ax} = a \bar{x}$$
$$V(ax) = a^2 \left( \frac{1}{n} \sum x_i^2 - (\bar{x})^2 \right)$$
$$V(ax) = a^2 V(x)$$

Example: The study of 20 families led to the following distribution of the number of children
per family:
Number of children xi 0 1 2 3 4 5
Number of families ni 5 3 6 1 3 2

We have already calculated $\bar{x} = 2$.

xi ni (xi)² ni(xi)²
0 5 0 0
1 3 1 3
2 6 4 24
3 1 9 9
4 3 16 48
5 2 25 50
Total 20 134
$$V(x) = \frac{1}{20}(134) - (2)^2 = 2.7$$

Example: The height distribution of 40 students is given in the following table

Height (cm) [150,160[ [160,165[ [165,170[ [170,175[ [175,185[
Number of students 4 8 10 16 2
We have already calculated $\bar{x} = 167.875$.

Height (cm) ni ci (ci)² ni(ci)²
[150,160[ 4 155 24025 96100
[160,165[ 8 162.5 26406.25 211250
[165,170[ 10 167.5 28056.25 280562.5
[170,175[ 16 172.5 29756.25 476100
[175,185[ 2 180 32400 64800
Total 40 1128812.5
$$V(x) = \frac{1}{40}(1128812.5) - (167.875)^2 \approx 38.30$$
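
A short Python sketch of the expanded formula, applied to both examples, is given below. The function name `variance` is an illustrative choice.

```python
def variance(values, headcounts):
    """Expanded formula: (1/n) * sum(n_i * x_i^2) - mean^2."""
    n = sum(headcounts)
    mean = sum(c * x for x, c in zip(values, headcounts)) / n
    mean_of_squares = sum(c * x * x for x, c in zip(values, headcounts)) / n
    return mean_of_squares - mean ** 2

# Discrete example: number of children in 20 families.
v_children = variance([0, 1, 2, 3, 4, 5], [5, 3, 6, 1, 3, 2])

# Continuous example: class centers of the students' heights.
v_height = variance([155, 162.5, 167.5, 172.5, 180], [4, 8, 10, 16, 2])

print(round(v_children, 2), round(v_height, 2))
```

The expanded formula subtracts two large, nearly equal numbers, so for data with a large mean and small spread the definition with squared deviations is numerically safer.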

I.7.3 Standard deviation :


Definition:
The standard deviation of a variable is the square root of its variance.
$$\sigma_x = \sqrt{V(x)}$$

Example:
Using the result of the last example, calculate the standard deviation:
$$\sigma_x = \sqrt{38.30} \approx 6.19 \text{ cm}$$

Remark:
The parameter σx measures the average distance between $\bar{x}$ and the values of X (see the next
figure). It is used to measure the dispersion of a statistical series around its mean.
– The smaller it is, the more the values are concentrated around the mean (the series is said to be
homogeneous).
– The larger it is, the more the values are scattered around the mean (the series is said to be
heterogeneous).

Figure 24: The dispersion of a statistical series around its mean

I.7.4 Asymmetry coefficient


I.7.5 Fisher's coefficient of asymmetry :
The statistician Ronald Fisher proposed a characteristic to measure the asymmetry of a distribution:

$$\gamma_1 = \frac{\mu_3}{\sigma^3} = \frac{\mu_3}{\mu_2^{3/2}}$$

$\mu_3$: centered moment of 3rd order;

$\mu_2$: centered moment of 2nd order;

$\sigma$: standard deviation.

Fisher's asymmetry coefficient is a dimensionless number, i.e. independent of the units of
measurement of xi.

If $\gamma_1 = 0$, the distribution is symmetrical around the mean.

If $\gamma_1 > 0$, the distribution is more spread to the right.

If $\gamma_1 < 0$, the distribution is more spread to the left.
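
As an illustration, Fisher's coefficient (third centered moment divided by the cube of the standard deviation) can be computed with the Python sketch below. The function name and the three small data sets are hypothetical; they are chosen only to show the three sign cases.

```python
def fisher_skewness(values):
    """Fisher's asymmetry coefficient: mu_3 / mu_2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    mu2 = sum((x - mean) ** 2 for x in values) / n
    mu3 = sum((x - mean) ** 3 for x in values) / n
    return mu3 / mu2 ** 1.5

symmetric = [1, 2, 3, 4, 5]      # hypothetical symmetric series
right_tail = [1, 1, 1, 2, 10]    # hypothetical series spread to the right
left_tail = [-x for x in right_tail]

print(fisher_skewness(symmetric))
print(fisher_skewness(right_tail) > 0, fisher_skewness(left_tail) < 0)
```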

Figure 25: Different forms of statistical series concentration.

Different forms of distribution:


A - Uniform distribution

Figure 26: Uniform distribution.

B - Symmetric distribution


Figure 27: Flattened symmetrical distribution

C - Unimodal symmetrical distribution

Figure 28: Unimodal symmetrical trend

D - Distribution spread to the right

Figure 29: concentrated on the left


E - Distribution spread on the left

Figure 30: concentrated on the right

I.7.5.1 Yule and Kendall coefficient of asymmetry


The interquartile asymmetry coefficient of a simple statistical distribution, discrete or continuous,
ungrouped, is the quantity defined by:

$$\frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1}$$

$Q_1$: the first quartile;

$Q_2$: the second quartile;

$Q_3$: the third quartile.

I.7.5.2 Pearson's coefficient of asymmetry

The Pearson asymmetry coefficient of a simple statistical distribution, discrete or continuous, is the
quantity defined by:
$$\frac{\bar{x} - Q_2}{\sigma}$$

$\bar{x}$: the arithmetic mean;

$Q_2$: the second quartile (median);

$\sigma$: standard deviation.

Some authors multiply the numerator by 3.

I.7.5.3 Flattening coefficient or kurtosis


$$\frac{\mu_4}{\sigma^4}$$

$\mu_4$: centered moment of order 4;

$\sigma$: standard deviation.

If $\mu_4 / \sigma^4 > 3$, the distribution is less flat (more peaked) than a Gaussian distribution,

If $\mu_4 / \sigma^4 < 3$, the distribution is flatter than a Gaussian distribution.
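
The flattening coefficient can be sketched in the same way. The function name and the sample series are hypothetical; the reference value 3 is the kurtosis of a Gaussian distribution.

```python
def kurtosis(values):
    """Flattening coefficient: mu_4 / sigma^4 = mu_4 / mu_2^2."""
    n = len(values)
    mean = sum(values) / n
    mu2 = sum((x - mean) ** 2 for x in values) / n
    mu4 = sum((x - mean) ** 4 for x in values) / n
    return mu4 / mu2 ** 2

sample = [1, 2, 3, 4, 5]  # hypothetical evenly spread series
k = kurtosis(sample)
print(k, "flatter than Gaussian" if k < 3 else "less flat than Gaussian")
```

For this evenly spread series, mu_2 = 2 and mu_4 = 6.8, so the coefficient is 1.7, below 3.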

Figure 31: Homogeneous concentration of a statistical series

Figure 31: Different forms of statistical series flattening


Example:

A survey of 1500 households in a rural area looked at the variable X, the household size, i.e. the
number of people in the household. The data collected are presented in the following bar chart.

Figure 32: Bar chart.

Determine the Fisher, Pearson and Yule asymmetry coefficients.

Solution :

4 4
The 3rd-order moment is : 2

4
The average ̅ 2

The variance ( ) − (2 ) 22

The standard deviation is equal to 22


Fisher's asymmetry coefficient


2
2
( )

Pearson's asymmetry coefficient


2 −2

Yule asymmetry coefficient


( − 2) − (2 − )

I.7.5.4 The coefficient of variation


$$CV = \frac{\sigma_x}{\bar{x}} \times 100\%$$
This coefficient reflects the relative dispersion of the statistical series and indicates its
homogeneity. If the coefficient of variation is less than or equal to 15%, the data are considered
homogeneous; conversely, if it is greater than 15%, the data are said to be heterogeneous.
Because it has no unit, it can be used to compare the dispersion of variables measured on
different scales.

Example
Let's assume that, following a statistical study of passenger weight x and baggage weight y, an
airline has obtained the following results:

Weight Passengers (kg) Bags (kg)
Average 70 15
Standard deviation 8 6

For the passenger series: CV(x) = 8/70 ≈ 11.43%.
For the baggage series: CV(y) = 6/15 = 40%.
Although the standard deviation of the passenger series is greater than that of the baggage series
(σX > σY), the baggage weight series is more dispersed than the passenger weight series, since
CV(y) > CV(x). This can be explained as follows:
passengers are close to one another in weight, whereas baggage weights vary much more.
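
The comparison can be reproduced with a few lines of Python. The function name and the use of the 15% threshold as a default parameter are illustrative choices based on the rule stated above.

```python
def coefficient_of_variation(sigma, mean):
    """Coefficient of variation in percent: 100 * sigma / mean."""
    return 100 * sigma / mean

def is_homogeneous(cv_percent, threshold=15):
    """Rule of thumb from the text: homogeneous when CV <= 15%."""
    return cv_percent <= threshold

cv_passengers = coefficient_of_variation(8, 70)
cv_bags = coefficient_of_variation(6, 15)

print(round(cv_passengers, 2), is_homogeneous(cv_passengers))
print(round(cv_bags, 2), is_homogeneous(cv_bags))
```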


II Chapter II: two-character statistical series

A bivariate statistical series is one in which two measurable characteristics are recorded for the
same population. It can be presented in the form of a table, in rows or columns.

Example: We want to study the relationship between men's height and their weight.

II.1 Tables with two-character


A statistical population can be described using two characters simultaneously. The
corresponding statistical tables are two-dimensional; they are called contingency tables,
cross-tabulations or double-entry tables.

Overview of contingency tables

Consider a statistical population described by two characters:

- a character X whose p modalities xi are x1, x2, ..., xp, and
- a character Y whose q modalities yj are y1, y2, ..., yq.

X \ Y   y1   y2   …   yj   …   yq   ni. (marginal headcount)
x1      n11  n12  …   n1j  …   n1q  n1.
x2      n21  n22  …   n2j  …   n2q  n2.
…
xi      ni1  ni2  …   nij  …   niq  ni.
…
xp      np1  np2  …   npj  …   npq  np.
n.j (marginal headcount)  n.1  n.2  …  n.j  …  n.q  n..

Figure 32: Contingency table representation.

ni. : sum of the headcounts in the ith row; the subscript j, which ranges from 1 to q, is
replaced by ".".

n.j : sum of the headcounts in the jth column (modality yj); the subscript i, which ranges from 1 to p,
is replaced by ".".

Remark:
1. The first column lists the p modalities x1, x2, ..., xi, ..., xp of characteristic X;
the first row lists the q modalities y1, y2, ..., yj, ..., yq of characteristic Y.
2. The number nij, at the intersection of row i and column j, is the number of individuals
in the population presenting both modality xi and modality yj.
3. For the marginal headcounts ni. and n.j, the index that varies is replaced by ".".
4. The marginal headcount of X is denoted "ni." and that of Y "n.j".
5. The total headcount of the table is "n..". This is the total number of individuals in the
population studied.

Properties of contingency tables:

The modalities xi and yj being mutually exclusive and exhaustive, we can write several series of
equalities.

$$n_{i.} = \sum_{j=1}^{q} n_{ij}$$

represents the number of individuals presenting modality xi of X, whatever the modality of Y.

$$n_{.j} = \sum_{i=1}^{p} n_{ij}$$

represents the number of individuals presenting modality yj of Y, whatever the modality of X.

The total number of individuals in the population, n.., appears at the intersection of the last row
and the last column. It is equal to the sum of the last row or of the last column:

$$n_{..} = \sum_{i=1}^{p} n_{i.} = \sum_{j=1}^{q} n_{.j}$$

Replacing ni. and n.j by the preceding expressions, we obtain:

$$n_{..} = \sum_{i=1}^{p} \sum_{j=1}^{q} n_{ij} = \sum_{j=1}^{q} \sum_{i=1}^{p} n_{ij}$$

Partial frequencies:

The partial frequency is the ratio of the partial headcount to the total headcount.

The partial frequency of the pair of modalities (xi, yj) is:

$$f_{ij} = \frac{n_{ij}}{n_{..}}$$

This is the proportion of individuals presenting both modality xi and modality yj.

Note:
The sum of the partial frequencies is 1:
$$\sum_{i=1}^{p} \sum_{j=1}^{q} f_{ij} = 1$$

II.1.1 Marginal distributions :


A contingency table has two marginal distributions: the marginal distribution of character X
and the marginal distribution of character Y.

The marginal distribution of character X:

It is made up of the modalities of character X and the corresponding headcounts, whatever the
modalities of character Y.

The marginal distribution of character X is given by the following table:

Character X Marginal headcount Marginal frequency
x1 n1. f1.
x2 n2. f2.
…
xi ni. fi.
…
xp np. fp.
Total n.. 1

The marginal frequency of modality xi is the ratio of the marginal headcount to the total
headcount:

$$f_{i.} = \frac{n_{i.}}{n_{..}}$$

The marginal distribution of character Y:

It is made up of the modalities of character Y and the corresponding headcounts,
whatever the modalities of character X. The marginal frequency of modality yj is equal to:

$$f_{.j} = \frac{n_{.j}}{n_{..}}$$

Character Y Marginal headcount Marginal frequency
y1 n.1 f.1
y2 n.2 f.2
…
yj n.j f.j
…
yq n.q f.q
Total n.. 1

Example: The contingency table below gives the number of rooms "X" per household
and the number of children "Y" per household.

X \ Y 1 2 3 4
1 12 4 5 11
2 18 16 11 3
3 10 4 20 6

Calculation of the headcounts:

X \ Y 1 2 3 4 ni.
1 12 4 5 11 32
2 18 16 11 3 48
3 10 4 20 6 40
n.j 40 24 36 20 120

The frequencies:

X \ Y 1 2 3 4 fi.
1 0.1000 0.0333 0.0417 0.0917 0.2667
2 0.1500 0.1333 0.0917 0.0250 0.4000
3 0.0833 0.0333 0.1667 0.0500 0.3333
f.j 0.3333 0.2000 0.3000 0.1667 1

Marginal distribution of X:
xi ni. fi.
1 32 0.2667
2 48 0.40
3 40 0.3333
Total 120 1

Marginal distribution of Y:

yj n.j f.j
1 40 0.3333
2 24 0.20
3 36 0.30
4 20 0.1667
Total 120 1
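
The marginal headcounts and frequencies of the rooms/children table can be recomputed with the Python sketch below. The variable names are illustrative; the table is stored as a list of rows (one row per modality of X).

```python
# Rows: X = 1, 2, 3 rooms; columns: Y = 1, 2, 3, 4 children.
table = [
    [12, 4, 5, 11],
    [18, 16, 11, 3],
    [10, 4, 20, 6],
]

row_totals = [sum(row) for row in table]        # n_i. (marginal of X)
col_totals = [sum(col) for col in zip(*table)]  # n_.j (marginal of Y)
total = sum(row_totals)                         # n..

partial_freq = [[n / total for n in row] for row in table]  # f_ij
f_rows = [n / total for n in row_totals]                    # f_i.
f_cols = [n / total for n in col_totals]                    # f_.j

print(row_totals, col_totals, total)
print([round(f, 4) for f in f_rows], [round(f, 4) for f in f_cols])
```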

II.1.2 Conditional distributions:

1. Conditional distributions of character X given yj

These consist of the modalities of X and the headcount of each of these modalities in the
sub-population presenting modality yj of Y.

Character X Headcount given yj Conditional frequency
x1 n1j f1/j
x2 n2j f2/j
…
xi nij fi/j
…
xp npj fp/j
Total n.j 1

The conditional frequency of modality xi of X, given that Y = yj, is:

$$f_{i/j} = \frac{n_{ij}}{n_{.j}}$$

2. Conditional distributions of character Y given xi

These consist of the modalities of Y and the headcount of each of these modalities in the
sub-population presenting modality xi of X.

Character Y Headcount given xi Conditional frequency
y1 ni1 f1/i
y2 ni2 f2/i
…
yj nij fj/i
…
yq niq fq/i
Total ni. 1

The conditional frequency of modality yj of Y, given that X = xi, is:

$$f_{j/i} = \frac{n_{ij}}{n_{i.}}$$

Example: For the previous example, the distribution of X given Y = 2 is:

xi ni2 fi/2
1 4 0.1667
2 16 0.6667
3 4 0.1667
Total 24 1

The distribution of Y given X = 1 is:

yj n1j fj/1
1 12 0.375
2 4 0.125
3 5 0.156
4 11 0.344
Total 32 1
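
The two conditional distributions of this example can be obtained by selecting one column or one row of the contingency table and dividing by its total. The function names below are hypothetical, and indices are zero-based (column 1 is Y = 2, row 0 is X = 1).

```python
table = [
    [12, 4, 5, 11],   # X = 1
    [18, 16, 11, 3],  # X = 2
    [10, 4, 20, 6],   # X = 3
]

def x_given_y(table, j):
    """Conditional frequencies of X in the sub-population of column j."""
    column = [row[j] for row in table]
    total = sum(column)  # n_.j
    return [n / total for n in column]

def y_given_x(table, i):
    """Conditional frequencies of Y in the sub-population of row i."""
    total = sum(table[i])  # n_i.
    return [n / total for n in table[i]]

print([round(f, 4) for f in x_given_y(table, 1)])  # X given Y = 2
print([round(f, 3) for f in y_given_x(table, 0)])  # Y given X = 1
```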

II.2 Graphical representation:

II.2.1 Qualitative characters
Example:

X \ Y working inactive retired
male 5 3 1
female 4 3 4

Figure 33: Column profiles (bar chart).

Figure 34: Grouped column profiles (bar chart, in %).

Figure 35: Row profiles (bar chart).

Figure 36: Grouped row profiles (bar chart, in %).

Figure 37: Column cylinder chart.

Figure: Row cylinder chart.

Figure 38: Plane (wall) profiles.


II.2.2 Quantitative characters:
II.2.2.1 Discrete case:
A discrete variable takes a finite number of values, which can be enumerated.

X \ Y 1 2 3 4
1 1 1 1 4
2 4 3 1 3
3 2 4 2 2

Figure 39: Bubble chart.

X \ Y 1 2 3 4
1 1 1 1 14
2 4 13 10 3
3 12 4 2 2

Figure 40: Bubble chart.

II.2.2.2 Continuous case:
A continuous variable can, in theory, take infinitely many values, forming a continuous
set.

WEIGHT [20,40[ [40,60[ [60,80[
[120,140[ 1 1 1
[140,160[ 6 3 1
[160,180[ 2 6 2

Figure 41: Column histogram profiles.

Figure 42: Row histogram profiles.

II.3 Numerical characteristics of distributions:

II.3.1 Numerical characteristics of marginal distributions:
Let X and Y be two discrete quantitative characters.

{(xi, ni.)} is the marginal headcount distribution of character X and {(yj, n.j)} is the
marginal headcount distribution of character Y.

These two distributions can be studied as in univariate statistics.
In particular, they can be characterized by their mean and variance.

The mean of character X:

$$\bar{x} = \frac{1}{n_{..}} \sum_{i=1}^{p} n_{i.} x_i$$

The mean of character Y:

$$\bar{y} = \frac{1}{n_{..}} \sum_{j=1}^{q} n_{.j} y_j$$

The variance of character X:

$$V(X) = \frac{1}{n_{..}} \sum_{i=1}^{p} n_{i.} (x_i - \bar{x})^2 = \frac{1}{n_{..}} \sum_{i=1}^{p} n_{i.} x_i^2 - \bar{x}^2$$

The variance of character Y:

$$V(Y) = \frac{1}{n_{..}} \sum_{j=1}^{q} n_{.j} (y_j - \bar{y})^2 = \frac{1}{n_{..}} \sum_{j=1}^{q} n_{.j} y_j^2 - \bar{y}^2$$

Remark: When one of the characters X or Y is continuous quantitative, the values xi (or yj) are
replaced in the mean and variance formulas by the centers ci of the classes of that character.

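Example:
X \ Y 1 2 3 4
1 12 4 5 11
2 18 16 11 3
3 10 4 20 6

The mean of X:

$$\bar{x} = \frac{1}{120} \sum_{i} n_{i.} x_i$$

xi ni. ni. × xi
1 32 32
2 48 96
3 40 120
Sum 120 248

$$\bar{x} = \frac{248}{120} \approx 2.07$$
The variance of X:

$$V(X) = \frac{1}{120} \sum_{i} n_{i.} x_i^2 - \bar{x}^2$$

xi ni. ni. × xi²
1 32 32
2 48 192
3 40 360
Sum 120 584

$$V(X) = \frac{584}{120} - \left( \frac{248}{120} \right)^2 \approx 0.596$$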
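
The marginal mean and variance of X can be checked with the following Python sketch. Variable names are illustrative; the marginal headcounts are obtained as row sums of the contingency table.

```python
table = [
    [12, 4, 5, 11],   # X = 1
    [18, 16, 11, 3],  # X = 2
    [10, 4, 20, 6],   # X = 3
]
x_values = [1, 2, 3]

row_totals = [sum(row) for row in table]  # n_i. = 32, 48, 40
n = sum(row_totals)                       # n.. = 120

mean_x = sum(ni * x for ni, x in zip(row_totals, x_values)) / n
var_x = sum(ni * x * x for ni, x in zip(row_totals, x_values)) / n - mean_x ** 2

print(round(mean_x, 3), round(var_x, 3))
```

The same computation applied to the column totals and the values of Y gives the marginal mean and variance of Y.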


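II.3.2 Numerical characteristics of conditional distributions:

Each conditional distribution can be studied as in univariate statistics.
We can define conditional means and conditional variances.

The conditional mean of character X given that Y = yj:

$$\bar{x}_j = \frac{1}{n_{.j}} \sum_{i=1}^{p} n_{ij} x_i$$

The conditional mean of character Y given that X = xi:

$$\bar{y}_i = \frac{1}{n_{i.}} \sum_{j=1}^{q} n_{ij} y_j$$

The variance of X given that Y = yj:

$$V_j(X) = \frac{1}{n_{.j}} \sum_{i=1}^{p} n_{ij} (x_i - \bar{x}_j)^2 = \frac{1}{n_{.j}} \sum_{i=1}^{p} n_{ij} x_i^2 - \bar{x}_j^2$$

The variance of Y given that X = xi:

$$V_i(Y) = \frac{1}{n_{i.}} \sum_{j=1}^{q} n_{ij} (y_j - \bar{y}_i)^2 = \frac{1}{n_{i.}} \sum_{j=1}^{q} n_{ij} y_j^2 - \bar{y}_i^2$$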

Example:
X \ Y 1 2 3 4
1 12 4 5 11
2 18 16 11 3
3 10 4 20 6

Calculate the conditional mean of character Y given that X = 2:

yj n2j n2j × yj
1 18 18
2 16 32
3 11 33
4 3 12
Sum 48 95

$$\bar{y}_2 = \frac{95}{48} \approx 1.98$$

Variance of character Y given that X = 2:

yj n2j n2j × yj²
1 18 18
2 16 64
3 11 99
4 3 48
Sum 48 229

$$V_2(Y) = \frac{229}{48} - \left( \frac{95}{48} \right)^2 \approx 0.854$$


II.3.3 Covariance
The aim is to define an index of association between the two variables considered. This index is the
linear correlation coefficient; it requires the covariance to be defined first.

Definition:
The covariance generalizes the notion of variance to two variables. It is defined by:

$$cov(x, y) = s_{XY} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \left[ \frac{1}{n} \sum_{i=1}^{n} x_i y_i \right] - \bar{x}\,\bar{y}$$

Proof:

Expanding the product gives
$\frac{1}{n} \sum x_i y_i - \bar{x} \cdot \frac{1}{n} \sum y_i - \bar{y} \cdot \frac{1}{n} \sum x_i + \bar{x}\,\bar{y} = \frac{1}{n} \sum x_i y_i - \bar{x}\,\bar{y}$.

The covariance is therefore the mean of the products of the deviations from the means (in each
product, each of the two deviations relates to one of the two variables considered). It can also
be remembered in the following form: the mean of the products minus the product
of the means. Like the variance, the covariance has no direct concrete interpretation. For
the variance, we must move to the standard deviation to obtain an interpretable indicator; for the
covariance, we must move to the linear correlation coefficient.

Properties of the covariance:

The covariance is a symmetric index: clearly, SXY = SYX (the two variables
play the same role in the definition of the covariance).

The covariance can take any real value (negative, zero or positive; "small" or "large"
in absolute value).
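
The two expressions of the covariance (mean of the products of deviations, and mean of the products minus product of the means) can be compared numerically. The Python sketch below uses hypothetical paired data; the function names are illustrative.

```python
def covariance(xs, ys):
    """Definition: mean of the products of the deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def covariance_expanded(xs, ys):
    """Expanded form: mean of the products minus product of the means."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) / n - (sum(xs) / n) * (sum(ys) / n)

xs = [1, 2, 3, 4]   # hypothetical paired observations
ys = [2, 4, 6, 8]

print(covariance(xs, ys), covariance_expanded(xs, ys))
print(covariance(ys, xs))  # symmetry: s_XY = s_YX
```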


II.3.4 The linear correlation coefficient:

Definition:
Correlation coefficients give a summary measure of the strength of the
relationship between two characters, and of its direction when the relationship is monotonic.
The Pearson correlation coefficient is used to analyze linear relationships:

$$r = corr(x, y) = \frac{cov(X, Y)}{\sigma(X)\,\sigma(Y)}$$

Figure 43: Different forms of correlation (good fit, poor fit).

Properties of the correlation coefficient:

corr(X, Y) ∈ [−1, 1]

corr(X, Y) = corr(Y, X)

corr(X, X) = 1

The correlation coefficient is dimensionless. It measures the presence and
the strength of the linear relationship between X and Y:

1. corr(X, Y) = 1: exact linear relationship Y = aX + b with a > 0;

2. corr(X, Y) = −1: exact linear relationship Y = aX + b with a < 0;

3. corr(X, Y) = 0: no correlation; independence is possible, but not certain;

4. corr(X, Y) > 0: partial relationship; X and Y tend to vary in the same direction;

5. corr(X, Y) < 0: partial relationship; X and Y tend to vary in opposite directions;

6. |corr(X, Y)| > 0.9: the linear relationship is considered strong.
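
A minimal Python sketch of the Pearson coefficient follows, with hypothetical data chosen to illustrate properties 1 and 2 (exact increasing and decreasing linear relationships). The function name is an illustrative choice.

```python
import math

def correlation(xs, ys):
    """Pearson coefficient: cov(X, Y) / (sigma(X) * sigma(Y))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / math.sqrt(var_x * var_y)

xs = [1, 2, 3, 4, 5]                 # hypothetical data
increasing = [2 * x + 1 for x in xs]  # Y = aX + b with a > 0
decreasing = [-3 * x + 4 for x in xs] # Y = aX + b with a < 0

print(correlation(xs, increasing), correlation(xs, decreasing))
```

The function divides by the product of the standard deviations, so it is undefined when one of the two series is constant.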

Example:
We consider 10 players and let:
– Y be the variable representing the number of games a player plays;
– X be the variable representing the gain or loss (+1 if the player wins 10 Da, −1 if the player
loses 10 Da, and 0 otherwise).
We have the following contingency table:

Calculate cov(X, Y).
Solution:

We have

̅ ∑ (− )+( )+( 2) −

and

̅ ∑ ( ) + (2 )+( )+( ) 2

We also have

( ) [ ∑ ] − ,̅ ̅-

( ) (− ) + (− 2) + (− 2 ) + (− 2 )
+( 2) + ( ) − (− 2 ) − 2

We then compute the standard deviations:

( ) ∑ − ̅

Hence
( ) √

( ) ∑ −̅

Hence
( ) √

− 2
( )
( ) −
II.4 Fitting:
Linear fitting consists in drawing a straight line that passes as close as possible to the
observations of a scatter plot.

Definition:
Let X and Y be two numerical statistical variables observed on n individuals. In an orthogonal
coordinate system $(O; \vec{i}, \vec{j})$, the set of n points with coordinates (xi, yi) forms the
scatter plot associated with this statistical series.

II.4.1 The fitting problem


The scatter plot associated with a two-variable statistical series immediately gives
information of a qualitative nature.
To extract more quantitative information, we must pose the fitting problem.
The plot makes it possible to "recognise" graphically a possible functional relationship
between the two observed quantities (here, rank and number of members).
Establishing a functional relationship between the two series is the fitting problem.

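II.4.2 Mean point (Mayer line)


Definition:
Consider a two-variable statistical series, X and Y, whose values are pairs (xi; yi). The
mean-point method consists in:
– sorting the data;
– splitting the scatter plot into two clouds N1 and N2 of equal size;
– determining the mean points G1 (centre of gravity) and G2 of the clouds N1 and N2;
– fitting the scatter plot with the line (G1G2), called the Mayer line.

The mean point of a cloud of k points has coordinates

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_k}{k}, \qquad \bar{y} = \frac{y_1 + y_2 + \dots + y_k}{k}$$

and the Mayer line has equation

$$y - \bar{y}_1 = a\,(x - \bar{x}_1)$$
with
$$a = \frac{\bar{y}_2 - \bar{y}_1}{\bar{x}_2 - \bar{x}_1}$$
$\bar{x}_1$: abscissa of G1
$\bar{y}_1$: ordinate of G1
$\bar{x}_2$: abscissa of G2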


$\bar{y}_2$: ordinate of G2

Example:

The following table gives the number of members of a tennis club from 2001 to 2006.
Year                    2001  2002  2003  2004  2005  2006
Rank xi                    1     2     3     4     5     6
Number of members yi      70    90   115   140   170   220

Determine the coordinates of the following mean points:

G1 for the years 2001 to 2003,

G2 for the years 2004 to 2006,

G, the mean point of the whole scatter plot.

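Solution:
Coordinates of G1:

$$\bar{x}_1 = \frac{1 + 2 + 3}{3} = 2, \qquad \bar{y}_1 = \frac{70 + 90 + 115}{3} \approx 91.7$$

so G1(2; 91.7).

Coordinates of G2:

$$\bar{x}_2 = \frac{4 + 5 + 6}{3} = 5, \qquad \bar{y}_2 = \frac{140 + 170 + 220}{3} \approx 176.7$$

so G2(5; 176.7).

Coordinates of G:

$$\bar{x} = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5, \qquad \bar{y} = \frac{70 + 90 + 115 + 140 + 170 + 220}{6} \approx 134.2$$

so G(3.5; 134.2).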


Figure 44: graphical representation of the regression

Determining the equation:

G1(2; 91.7)

G2(5; 176.7)

$$y - 91.7 = \frac{176.7 - 91.7}{5 - 2}\,(x - 2)$$

The slope is 85/3 ≈ 28.33 and the intercept is 91.67 − 28.33 × 2 ≈ 35, so

$$y = 28.33\,x + 35$$

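
The Mayer construction for this example can be reproduced with a short Python sketch. The helper names `mean_point` and `mayer_line` are hypothetical, and the handling of an odd number of points (the extra point goes to the second cloud) is an assumption, since the course only covers clouds of equal size.

```python
def mean_point(points):
    """Centre of gravity of a list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def mayer_line(points):
    """Slope a and intercept b of the line through G1 and G2 (y = a*x + b)."""
    ordered = sorted(points)      # sort the data by x
    half = len(ordered) // 2      # assumption: with an odd count, N2 gets the extra point
    g1 = mean_point(ordered[:half])
    g2 = mean_point(ordered[half:])
    a = (g2[1] - g1[1]) / (g2[0] - g1[0])
    b = g1[1] - a * g1[0]
    return a, b, g1, g2

# Tennis club data from the example: (rank, number of members)
members = [(1, 70), (2, 90), (3, 115), (4, 140), (5, 170), (6, 220)]
a, b, g1, g2 = mayer_line(members)
g = mean_point(members)
print(f"G1 = {g1}, G2 = {g2}, G = {g}")
print(f"y = {a:.2f} x + {b:.2f}")
```

With two clouds of equal size, G is the midpoint of G1 and G2, so the Mayer line also passes through G.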
II.4.3 Least squares method:
The normal, or Laplace-Gauss, distribution is also called the law of errors or of deviations,
because that is how it was introduced. The principle of the ordinary least squares (OLS) method
is to consider the statistical series of pairs (xi, yi).

Definition:
The regression line of Y on x, denoted DY/x and determined by the least squares method, is the
line of equation y = ax + b for which the sum of the squared residuals is minimal.
The residuals, or observed errors, are defined as the differences between the observed values
and the values estimated by a regression model.
If the model held exactly, the n observations (xi, yi) would satisfy yi = a xi + b.
In fact, this rarely happens. Most often there are deviations, denoted ei, which we introduce
into the model equation:

$$y_i = a\,x_i + b + e_i$$

Linear fitting by the least squares method consists in determining the line (also called the
regression line) such that the sum of the squares of the n values $y_i - \hat{y}_i$ is minimal
(hence the name of the method).

Figure 45: graphical representation of the regression, showing the residuals $y_i - \hat{y}_i$

We therefore want to minimise the quantity

$$q(a, b) = \sum_{i=1}^{n} \bigl(y_i - (a\,x_i + b)\bigr)^2$$
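
To make the quantity q(a, b) concrete, here is a minimal Python sketch that evaluates it for two candidate lines. The function name, the data points and the two candidate lines are all made up for illustration and do not come from the course.

```python
def sum_squared_residuals(a, b, xs, ys):
    """q(a, b) = sum over all observations of (y_i - (a*x_i + b))**2."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# Made-up observations lying close to the line y = 2x + 1
xs = [0, 1, 2, 3]
ys = [1.1, 2.9, 5.2, 6.8]

q_good = sum_squared_residuals(2.0, 1.0, xs, ys)  # candidate line close to the points
q_bad = sum_squared_residuals(0.5, 3.0, xs, ys)   # candidate line far from the points
print(q_good, q_bad)
```

The least squares line is the pair (a, b) that makes this quantity as small as possible among all lines.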

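Recall that a differentiable function reaches its minimum at a point where its derivative is
zero. To find a and b, we first compute the partial derivative of q with respect to a and set
it to zero:

$$\frac{\partial q}{\partial a} = -2 \sum x_i \bigl(y_i - (a\,x_i + b)\bigr) = 0$$

$$\sum x_i y_i = a \sum x_i^2 + b \sum x_i \qquad (1)$$

We now compute the partial derivative of q with respect to b:

$$\frac{\partial q}{\partial b} = -2 \sum (y_i - a\,x_i - b) = 0$$

$$\sum y_i = a \sum x_i + \sum b$$

$$\sum y_i = a \sum x_i + n\,b$$

Dividing through by n: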

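$$\bar{y} = a\,\bar{x} + b$$

$$b = \bar{y} - a\,\bar{x} \qquad (2)$$

This result shows that the line passes through the mean point $(\bar{x}, \bar{y})$.

Substituting (2) into (1) to find a:

$$\sum x_i y_i = a \sum x_i^2 + (\bar{y} - a\,\bar{x}) \sum x_i$$

$$\sum x_i y_i = a \sum x_i^2 + \bar{y} \sum x_i - a\,\bar{x} \sum x_i$$

$$\sum x_i y_i - \bar{y} \sum x_i = a \Bigl(\sum x_i^2 - \bar{x} \sum x_i\Bigr)$$

$$a = \frac{\sum x_i y_i - \bar{y} \sum x_i}{\sum x_i^2 - \bar{x} \sum x_i}$$

Since $\sum x_i = n\,\bar{x}$, dividing the numerator and the denominator by n gives

$$a = \frac{\frac{1}{n}\sum x_i y_i - \bar{x}\,\bar{y}}{\frac{1}{n}\sum x_i^2 - \bar{x}^2} = \frac{\mathrm{cov}(X, Y)}{V(X)}$$

The regression line of Y on x is therefore

$$D_{Y/x}:\ \hat{y} = a\,x + b \quad\text{with}\quad \begin{cases} a = \dfrac{\mathrm{cov}(X, Y)}{V(X)} \\ b = \bar{y} - a\,\bar{x} \end{cases}$$

and, symmetrically, the regression line of X on y is

$$D_{X/y}:\ \hat{x} = a'\,y + b' \quad\text{with}\quad \begin{cases} a' = \dfrac{\mathrm{cov}(X, Y)}{V(Y)} \\ b' = \bar{x} - a'\,\bar{y} \end{cases}$$

These two lines intersect at the mean point G.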
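
As a sketch of how the two regression lines can be computed and how their intersection at G can be checked, the Python code below applies the covariance and variance formulas (1/n convention) to the tennis club data from section II.4.2. The function name `regression_lines` and the variable names are hypothetical choices for this illustration.

```python
def regression_lines(xs, ys):
    """Return ((a, b), (a2, b2), (mean_x, mean_y)) for D_{Y/x}: y = a*x + b
    and D_{X/y}: x = a2*y + b2, using cov and variance with the 1/n convention."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    var_y = sum(y * y for y in ys) / n - mean_y ** 2
    a = cov / var_x
    b = mean_y - a * mean_x
    a2 = cov / var_y
    b2 = mean_x - a2 * mean_y
    return (a, b), (a2, b2), (mean_x, mean_y)

# Tennis club data: rank and number of members
ranks = [1, 2, 3, 4, 5, 6]
members = [70, 90, 115, 140, 170, 220]
(a, b), (a2, b2), (gx, gy) = regression_lines(ranks, members)

print(f"D_Y/x: y = {a:.2f} x + {b:.2f}")
print(f"D_X/y: x = {a2:.4f} y + {b2:.4f}")
# Both lines pass through the mean point G = (gx, gy)
print(a * gx + b, gy)
print(a2 * gy + b2, gx)
```

On these data the least squares slope and intercept come out close to those of the Mayer line, which is expected since both lines pass through G and follow the same cloud.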

Example:

Sales over the first 6 months were as follows:
Month (x)    January  February  March  April  May  June
Sales (y)        345       410    485    535  610   675

x      y      x − x̄    y − ȳ    (x − x̄)(y − ȳ)    (x − x̄)²    (y − ȳ)²
1      345    −2.5     −165     412.5             6.25        27225
2      410    −1.5     −100     150               2.25        10000
3      485    −0.5     −25      12.5              0.25        625
4      535    0.5      25       12.5              0.25        625
5      610    1.5      100      150               2.25        10000
6      675    2.5      165      412.5             6.25        27225
Sum:
21     3060   0        0        1150              17.5        75700

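$$\bar{x} = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = \frac{21}{6} = 3.5$$

$$\bar{y} = \frac{345 + 410 + 485 + 535 + 610 + 675}{6} = \frac{3060}{6} = 510$$

$$a = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{1150}{17.5} \approx 65.71$$

$$b = \bar{y} - a\,\bar{x} = 510 - (65.71 \times 3.5) \approx 280$$

The equation of the trend line is therefore:

$$y = 65.71\,x + 280$$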
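
The working table and the resulting coefficients can be reproduced with the following Python sketch. Coding the months as ranks 1 to 6 follows the table; the function `trend` and the July extrapolation are additions for illustration only and are not part of the course example.

```python
months = [1, 2, 3, 4, 5, 6]               # January..June coded as ranks
sales = [345, 410, 485, 535, 610, 675]

n = len(months)
mean_x = sum(months) / n
mean_y = sum(sales) / n

# Column sums of the working table
sum_dx_dy = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales))
sum_dx_sq = sum((x - mean_x) ** 2 for x in months)

a = sum_dx_dy / sum_dx_sq
b = mean_y - a * mean_x

def trend(x):
    """Value of the fitted trend line at rank x."""
    return a * x + b

# Illustrative extrapolation (not in the text): rank 7 would be July.
july_estimate = trend(7)
print(f"y = {a:.2f} x + {b:.2f}, July estimate = {july_estimate:.0f}")
```

Keeping a unrounded gives an intercept of 280; rounding a to 65.71 before computing b shifts it by about 0.02.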

Exponential trend:

An exponential trend curve is a curved line that is particularly useful when data values rise
or fall at an ever-increasing rate.

Exponential trends are common, for example when launching new activities or new products:
the pace of sales accelerates more and more strongly.

Writing the model as $y = b\,a^x$, we look for a linear trend by taking logarithms:

$$\log(y) = \log(b\,a^x) = \log(b) + \log(a^x) = x\,\log(a) + \log(b)$$

Base-10 logarithms thus turn the exponential curve into a straight line $Y = A\,x + B$, with

$$Y = \log(y)$$

$$A = \log(a)$$

$$B = \log(b)$$

The exponential function represents a quantity whose periodic growth rate is constant: each
period multiplies it by the same factor "a".

Example:
Period (x)     January  February  March  April   May  June
Quantity (y)       430       455    520    730  1140  1850


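Figure 45: Linearisation of a logarithmic trend

We compute the base-10 logarithm of y to linearise the trend:

xi     Yi = log(yi)           xi·Yi      xi²
1      log(430) = 2.633       2.633      1
2      log(455) = 2.658       5.316      4
3      2.716                  8.148      9
4      2.863                  11.453     16
5      3.057                  15.285     25
6      3.267                  19.603     36
Sum:
21     17.195                 62.438     91

$$\bar{x} = \frac{21}{6} = 3.5$$

$$\bar{Y} = \frac{17.195}{6} \approx 2.8658$$

$$A = \frac{\sum x_i Y_i - n\,\bar{x}\,\bar{Y}}{\sum x_i^2 - n\,\bar{x}^2} = \frac{62.438 - 6 \times 3.5 \times 2.8658}{91 - 6 \times (3.5)^2} \approx 0.1289$$

$$B = \bar{Y} - A\,\bar{x} = 2.8658 - (0.1289 \times 3.5) \approx 2.4146$$

$$a = 10^{A} \approx 1.346, \qquad b = 10^{B} \approx 259.8$$

The fitted function is $y \approx 259.8 \times 1.346^{x}$.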
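
The linearisation can be checked with a short Python sketch. It assumes the model is written y = b·a^x and fitted as Y = A·x + B with Y = log10(y), A = log10(a) and B = log10(b); these symbol names and the function `fitted` are choices made for this illustration.

```python
import math

periods = [1, 2, 3, 4, 5, 6]                  # January..June coded as ranks
quantities = [430, 455, 520, 730, 1140, 1850]

# Linearise: Y = log10(y), then fit Y = A*x + B by least squares.
logs = [math.log10(q) for q in quantities]
n = len(periods)
mean_x = sum(periods) / n
mean_log = sum(logs) / n

A = (sum(x * Y for x, Y in zip(periods, logs)) - n * mean_x * mean_log) / (
    sum(x * x for x in periods) - n * mean_x ** 2
)
B = mean_log - A * mean_x

# Back-transform to the exponential form y = b * a**x
a = 10 ** A
b = 10 ** B

def fitted(x):
    """Fitted exponential trend at rank x."""
    return b * a ** x

print(f"A = {A:.4f}, B = {B:.4f}, a = {a:.3f}, b = {b:.1f}")
```

Because the fit is done on the logarithms, it minimises squared errors in log(y) rather than in y, so the fitted curve can differ noticeably from the raw quantities at some periods.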
