
Chapter 1 - Introduction to Machine Learning and Statistical Measurement


Basic definitions:
Exploratory Data Analysis (EDA):
It refers to the critical initial investigation of the data in order to discover patterns,
spot anomalies, and check assumptions with the help of summary statistics and graphical
representations. It is good practice to gather the data first and then draw as many insights
from it as we can. EDA is all about making sense of the data we have in hand.
What is Statistics?
Statistics: The science of collecting, describing, and interpreting data.
Two areas of statistics:
Descriptive Statistics: collection, presentation, and description of sample data.
Examples:
Average rainfall in Manchester last year
Percentage of males in our class

Inferential Statistics: making decisions and drawing conclusions about populations.


Data from sample used to draw inferences about population
Generalising beyond actual observations
Generalise from a sample to a population
Population: A collection, or set, of individuals or objects or events whose properties are to
be analyzed.
Two kinds of populations: finite or infinite.
Sample: A subset of the population.
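The relationship between a population, a sample, and an inference can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population (the integers 1 to 1000), the sample size of 100, and the seed are all made up for the example and are not taken from these notes.

```python
import random
import statistics

# Hypothetical finite population: the integers 1..1000 (illustrative only).
population = list(range(1, 1001))

# Descriptive statistic of the whole population (usually unknown in practice).
population_mean = statistics.mean(population)

# Draw a random sample of 100 elements without replacement.
rng = random.Random(42)  # fixed seed so the run is repeatable
sample = rng.sample(population, 100)

# Inferential step: use the sample mean as an estimate of the population mean.
sample_mean = statistics.mean(sample)

print("population mean:", population_mean)
print("sample mean (estimate):", sample_mean)
```

The sample mean will generally be close to, but not exactly equal to, the population mean; quantifying that gap is the job of inferential statistics.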
Qualitative or Categorical Variable: A variable that categorizes or describes an element of
a population.
Note: Arithmetic operations, such as addition and averaging, are not meaningful for data
resulting from a qualitative variable.
Quantitative or Numerical Variable: A variable that quantifies an element of a population.
Note: Arithmetic operations such as addition and averaging, are meaningful for data resulting
from a quantitative variable.
Measurement scales:
• Measurements can be qualitative or quantitative and are measured using four different
scales
1. Nominal or categorical scale
– uses numbers, names or symbols to classify objects
– e.g. colour names: Red, Blue, Green, etc.
– e.g. branch names: CSE, Mech, ETC, etc.
2. Ordinal scale
Properties
– ranking scale
– objects are placed in order
– divisions or gaps between objects may not be equal
Example:
Student ranks in an exam: first, second, third, …
Officer ranks: Class 1, Class 2, Class 3, …
3. Interval scale
Properties
– equal intervals between adjacent points on the scale
– no true zero
Example: Temperature scales
Fahrenheit: Fahrenheit established 0°F as the stabilised temperature when equal amounts of
ice, water, and salt are mixed. He then defined 96°F as human body temperature.
Celsius: 0 and 100 are arbitrarily placed at the melting and boiling points of water.
4. Ratio scale
Properties
– an interval scale with a true zero
– the ratio of any two scale points is independent of the units of measurement
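The difference between the interval and ratio scales can be checked numerically. The sketch below is illustrative only: the helper names and the sample values (2 m, 4 m, 10 °C, 20 °C) are assumptions chosen for the example. Length has a true zero, so a ratio of lengths is the same in any unit; temperature in Celsius or Fahrenheit has an arbitrary zero, so the ratio changes with the unit.

```python
def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32


def metres_to_centimetres(m):
    """Convert a length from metres to centimetres."""
    return m * 100


# Ratio scale: length has a true zero, so ratios survive a change of units.
length_ratio_m = 4.0 / 2.0
length_ratio_cm = metres_to_centimetres(4.0) / metres_to_centimetres(2.0)

# Interval scale: the zero point is arbitrary, so the ratio depends on the unit.
temp_ratio_c = 20.0 / 10.0
temp_ratio_f = celsius_to_fahrenheit(20.0) / celsius_to_fahrenheit(10.0)

print(length_ratio_m, length_ratio_cm)  # 2.0 2.0
print(temp_ratio_c, temp_ratio_f)       # 2.0 1.36
```

This is why "20 °C is twice as hot as 10 °C" is not a meaningful statement, while "4 m is twice as long as 2 m" is.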
Discrete and Continuous data
• Discrete data: e.g. number of students in a class
• Continuous data: e.g. height of a person
Central Tendency:
• A measure of central tendency is a single value that attempts to describe a set of data
by identifying the central position within that set of data. Such measures are also
classed as summary statistics.
– mean, median and mode
• The Mean is a measure of central value
– What most people mean by “average”
– Sum of a set of numbers divided by the number of numbers in the set
The Median
• Middlemost or most central item in the set of ordered numbers; it separates the distribution into two
equal halves

• If odd n, middle value of sequence


– if X = [1,2,4,6,9,10,12,14,17]
– then 9 is the median
• If even n, average of 2 middle values
– if X = [1,2,4,6,9,10,11,12,14,17]
– then 9.5 is the median; i.e., (9+10)/2
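The mean and median rules above can be written out directly. This is a minimal sketch that assumes a non-empty list of numbers; the function names are chosen for the example, and in practice Python's built-in `statistics.mean` and `statistics.median` do the same job.

```python
def mean(values):
    """Sum of the values divided by how many values there are."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle value.
        return ordered[mid]
    # Even n: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_x = [1, 2, 4, 6, 9, 10, 12, 14, 17]
even_x = [1, 2, 4, 6, 9, 10, 11, 12, 14, 17]

print(median(odd_x))   # 9
print(median(even_x))  # 9.5
print(mean(odd_x))
```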
Mode:
• The mode is the most frequently occurring number in a distribution
– if X = [1,2,4,7,7,7,8,10,12,14,17]
– then 7 is the mode
• Easy to see in a simple frequency distribution
• Which measure of central tendency to use:
– Nominal data: mode
– Bimodal distribution: mode
– Ordinal data: median or mode
– A few extreme scores: median
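Finding the mode amounts to building a frequency distribution and picking the most frequent value. The sketch below is one possible way to do that; the `modes` helper and the `branches` list are hypothetical names and data for illustration. It returns a list so that a bimodal data set reports both modes, and it works on nominal data, where the mean and median are not defined.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)          # simple frequency distribution
    highest = max(counts.values())    # frequency of the most common value
    return [value for value, count in counts.items() if count == highest]


x = [1, 2, 4, 7, 7, 7, 8, 10, 12, 14, 17]
print(modes(x))  # [7]

# Hypothetical nominal data with two equally frequent categories (bimodal).
branches = ["CSE", "Mech", "CSE", "ETC", "Mech"]
print(modes(branches))  # ['CSE', 'Mech']
```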

Dispersion / Variability
• Dispersion
– How tightly clustered or how variable the values are in a data set.
• Example
– Data set 1: [0,25,50,75,100]
– Data set 2: [48,49,50,51,52]
– Both have a mean of 50, but data set 1 clearly has greater variability than data
set 2.
• Dispersion is the degree to which the data are spread around the central tendency.
• It is measured by the range, deviation, variance, standard deviation and standard error.
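The two example data sets can be compared numerically. The sketch below computes the mean, the range, and the standard deviation for each; the helper names are chosen for the example, and using the population standard deviation (`statistics.pstdev`) rather than the sample version is an assumption made here, not something the notes specify.

```python
import statistics

data_set_1 = [0, 25, 50, 75, 100]
data_set_2 = [48, 49, 50, 51, 52]


def value_range(values):
    """Range: difference between the largest and smallest value."""
    return max(values) - min(values)


def summarise(values):
    """Return (mean, range, population standard deviation) for a data set."""
    return (
        statistics.mean(values),
        value_range(values),
        statistics.pstdev(values),
    )


for name, data in (("Data set 1", data_set_1), ("Data set 2", data_set_2)):
    mean, spread, std_dev = summarise(data)
    print(f"{name}: mean={mean}, range={spread}, std dev={std_dev:.2f}")
```

Both data sets report the same mean, while the range and standard deviation are far larger for data set 1, which is exactly the difference the mean alone hides.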
Variance:

Population variance for a population of size N:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}$$

Sample variance for a sample of size N:

$$s^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}$$
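The two variance formulas differ only in the divisor (N versus N - 1). A minimal sketch, assuming a list with at least two numbers and using function names chosen for this example, applied to data set 2 from the dispersion section:

```python
def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


data = [48, 49, 50, 51, 52]

# Squared deviations from the mean of 50 are 4, 1, 0, 1, 4 (sum = 10).
print(population_variance(data))  # 2.0  (10 / 5)
print(sample_variance(data))      # 2.5  (10 / 4)
```

Python's standard library offers the same calculations as `statistics.pvariance` and `statistics.variance`.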
TYPES OF LEARNING:
