
KAZAKH BRITISH TECHNICAL UNIVERSITY

FACULTY OF INFORMATION TECHNOLOGY

Sub.: Industrial manufacturing

Lecture 1. Introduction to Data analysis


design
Lecturer: Professor, doctor Ph.D,
Samigulina Zarina Ildusovna

Almaty, 2022
DA, Associate professor, doctor Ph.D, Samigulina Z.I.

Statistics is the science of collecting, analysing and interpreting data. It is
used in many fields, such as industry, polls, medical studies, scientific research,
ice forecasts, etc.

STATISTICAL STUDY STEPS:

Research question

Experimental design

Data collection

Data analysis

Interpretation of results

Presentation of results and conclusion

STATISTICAL MODEL

DATA ANALYSIS:

- get an impression of the data
- validate the statistical model
- summarize the data (descriptive statistics)
- analyse (e.g. estimate/test parameters in the model)

INTERPRETATION OF RESULTS

This is not always straightforward.

PRESENTATION OF RESULTS AND CONCLUSION

Translate back to the experimental context.

DATA TYPES:

Data can be classified by the number of characteristics measured on each
individual:

- univariate (gender)
- bivariate (gender and level of education)
- multivariate (gender, level of education, shoe size, etc.)

Data are the quantified measurements of a study. Data are typically stored in
variables. A variable is a property of an individual or object that can be
measured.

UNIVARIATE SUMMARIES

A good summary shows at least:

Location, scale (variance)

Range, extremes

Holes, modes

Symmetry
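
The items above can be computed directly for a small data set. The following is a minimal sketch using Python's standard `statistics` module; the data values and the helper name `summarize` are made up for illustration and do not come from the lecture.

```python
import statistics

# Hypothetical sample: eight measurements with one large value.
data = [2.0, 3.0, 3.0, 4.0, 5.0, 5.0, 5.0, 13.0]

def summarize(values):
    """Return location, scale, range/extremes, modes and a rough symmetry hint."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    return {
        "mean": mean,                                 # location
        "median": median,                             # location, robust to extremes
        "variance": statistics.variance(values),      # scale (divides by n - 1)
        "min": min(values),                           # extremes
        "max": max(values),
        "range": max(values) - min(values),
        "modes": statistics.multimode(values),        # most frequent value(s)
        # A mean above the median hints at a longer right tail (asymmetry).
        "mean_minus_median": mean - median,
    }

summary = summarize(data)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the mean (5.0) lies above the median (4.5) because of the single large value 13.0, which is the kind of asymmetry a good summary should reveal.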

Additionally, it may answer the following questions:

Are data rounded?


Are data from a known distribution?
Do we need to divide the data into groups?
Is there influence of other variables, like time?
What is the relation between variables?

GRAPHICAL SUMMARIES
histogram

stem-leaf plot (numerical version of a histogram)

empirical distribution function

boxplot
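
Two of these summaries are simple enough to sketch without a plotting library. The example below builds a text stem-leaf plot and evaluates the empirical distribution function at one point; the data and the function names `stem_and_leaf` and `ecdf` are illustrative assumptions, not part of the lecture.

```python
from collections import defaultdict

# Hypothetical sample of two-digit integer measurements.
data = [12, 15, 21, 23, 23, 28, 31, 34, 35, 47]

def stem_and_leaf(values):
    """Group integers by tens digit (stem) and collect the units digits (leaves)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)

def ecdf(values, x):
    """Empirical distribution function: fraction of observations <= x."""
    return sum(1 for v in values if v <= x) / len(values)

# Each printed row is one bar of a sideways histogram.
for stem, leaves in stem_and_leaf(data).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")

print(ecdf(data, 23))  # half of the observations are <= 23
```

The row lengths of the stem-leaf plot show the same shape a histogram would, while keeping the individual values readable.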

MULTIVARIATE SUMMARIES

scatter plot time plot



General process for data analysis

https://www.rstudio.com/

Example of statistical analysis

KEY DEFINITIONS:

Population: the members of a group about which you want to draw a
conclusion, i.e. everything you are interested in.

Sample: a portion or subset of the population selected for analysis.

Parameter: a numerical measure that describes a characteristic of a
population, e.g. an average based on population data.

Statistic: a numerical measure that describes a characteristic of a
sample, e.g. an average based on a sample. Statistics are more practical to
obtain and are used more often than parameters.

Population vs. Sample

Measures used to describe a population are called parameters.

Measures computed from sample data are called statistics.

The size of the sample has a big impact on the result.

Analysing the sample allows you to say something about a larger group, but
there is always a chance of error.
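
The effect of sample size can be illustrated with a small simulation. This is only a sketch under assumed conditions: the population is synthetic (10,000 normally distributed values), and the helper `mean_abs_error` is a hypothetical name introduced here.

```python
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same numbers

# Hypothetical population: 10,000 measurements from a normal distribution.
population = [random.gauss(50.0, 10.0) for _ in range(10_000)]
mu = statistics.mean(population)  # parameter: the population mean

def mean_abs_error(sample_size, repeats=200):
    """Average distance between the sample mean (a statistic) and mu."""
    errors = []
    for _ in range(repeats):
        sample = random.sample(population, sample_size)  # without replacement
        errors.append(abs(statistics.mean(sample) - mu))
    return statistics.mean(errors)

small = mean_abs_error(10)
large = mean_abs_error(1000)
print(f"n = 10:   typical error {small:.3f}")
print(f"n = 1000: typical error {large:.3f}")
```

The sample mean rarely equals the population mean exactly, but the typical error shrinks noticeably as the sample grows.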

Population and Sample Data


A sample is a selection of measurements (a subset) from all
measurements (the population).

The purpose of analysing a sample is to make a statistical inference about
the population.

A note on notation:

- Greek letters (µ, σ) and the capital N are used for population data
- Roman letters (x̄, s, n) are used for sample data

*Note: each measure has two formulas, one for the population and one for the sample*
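
The variance is a common case where the population and sample formulas differ: the population version divides by N, the sample version by n - 1. The sketch below uses Python's `statistics` module on made-up values; the variable names mirror the notation above and are otherwise arbitrary.

```python
import statistics

# Hypothetical data, treated first as a whole population, then as a sample.
values = [4.0, 8.0, 6.0, 2.0]

# Population formulas: mean mu, variance sigma^2 = sum((x - mu)^2) / N.
mu = statistics.mean(values)
sigma_sq = statistics.pvariance(values)

# Sample formulas: mean x_bar, variance s^2 = sum((x - x_bar)^2) / (n - 1).
x_bar = statistics.mean(values)
s_sq = statistics.variance(values)

print(f"population: mu = {mu}, sigma^2 = {sigma_sq}")
print(f"sample:     x_bar = {x_bar}, s^2 = {s_sq:.4f}")
```

The mean is computed the same way in both cases, but the sample variance (20/3) is larger than the population variance (5.0) because of the smaller divisor.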



Types of Data

Data does not have to be numerical.

There are two types of data:

- categorical/qualitative
- numerical/quantitative

Numerical data is measured on a natural numerical scale.

Categorical data can only be named or categorised. It can be further
classified as:

- nominal: no natural/implied order
- ordinal: there is an implied order
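
The difference between nominal and ordinal data shows up in which summaries make sense. In the sketch below, all category names and responses are invented for illustration: a nominal variable supports only a mode, while an ordinal variable can also be sorted and given a middle (median) category.

```python
from collections import Counter

# Nominal: categories with no implied order (hypothetical eye colours).
eye_colour = ["brown", "blue", "brown", "green", "brown"]
mode_colour = Counter(eye_colour).most_common(1)[0][0]

# Ordinal: categories with an implied order (hypothetical satisfaction levels).
levels = ["low", "medium", "high"]  # the order is part of the definition
responses = ["high", "low", "medium", "medium", "high"]

# Sorting by position in `levels` is only meaningful because an order exists.
ranked = sorted(responses, key=levels.index)
median_level = ranked[len(ranked) // 2]  # middle value of an odd-sized sample

print(mode_colour)   # most frequent colour
print(median_level)  # middle satisfaction level
```

Sorting the eye colours would have no statistical meaning, since any order imposed on them would be arbitrary.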

Further Classifications of Numerical Data


Continuous or discrete

- continuous: can take on any real number; infinite number of possible values, e.g. time
- discrete: countable number of responses; finite number of possible values (looking
at integers)

*Note: whilst there cannot be half a person, there can be half a shoe size*

Interval or ratio

- interval: differences between measurements are meaningful, but there is no true 0,
e.g. temperature. It is untrue to say size 10 is double a size 5 (shoe sizes)

*Note: the shoe-size example uses discrete data*

- ratio: differences between measurements are meaningful and a true 0 exists. It is
true to say $100 is double $50

*Note: the dollar example uses continuous data*
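
Temperature gives a compact numerical check of the interval versus ratio distinction. The sketch below assumes temperatures in degrees Celsius (interval scale, arbitrary zero) and converts them to kelvin (ratio scale, true zero); the specific values are arbitrary.

```python
def celsius_to_kelvin(celsius):
    """Convert from Celsius (interval scale) to kelvin (ratio scale)."""
    return celsius + 273.15

cool_c, warm_c = 10.0, 20.0
cool_k, warm_k = celsius_to_kelvin(cool_c), celsius_to_kelvin(warm_c)

# Differences are meaningful on both scales and agree with each other.
diff_c = warm_c - cool_c
diff_k = warm_k - cool_k

# Ratios are only meaningful when the scale has a true zero.
ratio_c = warm_c / cool_c   # 2.0, but 20 °C is not "twice as hot" as 10 °C
ratio_k = warm_k / cool_k   # about 1.035 on the ratio scale

print(f"difference: {diff_c:.2f} °C vs {diff_k:.2f} K")
print(f"ratio:      {ratio_c:.3f} (Celsius) vs {ratio_k:.3f} (kelvin)")
```

The difference survives the change of scale, while the ratio changes completely, which is why doubling statements are reserved for ratio data.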



Time Series or Cross Sectional

- time series: data collected through time (look for trends)

- cross sectional: data collected at a certain point in time
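
The same table of records can be read either way. In this sketch the records (year, region, output) are hypothetical: fixing the region and following it through the years gives a time series, while fixing the year and comparing regions gives a cross-section.

```python
# Hypothetical records: (year, region, output).
records = [
    (2020, "A", 100), (2020, "B", 80),
    (2021, "A", 110), (2021, "B", 85),
    (2022, "A", 125), (2022, "B", 90),
]

# Time series: one entity (region A) followed through time.
series_a = [(year, out) for year, region, out in records if region == "A"]

# Cross-section: all entities observed at one point in time (2021).
cross_2021 = {region: out for year, region, out in records if year == 2021}

# Year-over-year changes are a simple way to look for a trend.
changes_a = [later[1] - earlier[1] for earlier, later in zip(series_a, series_a[1:])]

print(series_a)
print(cross_2021)
print(changes_a)
```

Positive and growing year-over-year changes, as in `changes_a`, are the kind of trend a time plot would make visible.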



Presenting and Describing Information


Variables: characteristics of items or individuals
Data: observed values of variables
Population: consists of all the members of a group about which you want
to draw a conclusion. Two factors need to be specified when defining a
population:
- the entity (e.g. people or vehicles)
- the boundary (e.g. those registered to vote or registered in QLD for road
use)
Sample: the portion of the population selected for analysis. The people
or vehicles in the sample represent a portion, or subset, of the people or
vehicles comprising the population.
Parameter: numerical measure that describes a characteristic of a
population
Statistic: numerical measure that describes a characteristic of a sample
Categorical variables: yield categorical responses, such as yes or no
answers. Categorical variables can also have more than two possible
responses.

Presenting and Describing Information


Continuous variables: produce numerical responses that arise from a
measuring process. The more precise the measuring device used, the
greater the likelihood of detecting small differences in measurements and
therefore of having more precise data.
Descriptive statistics: focuses on collecting, summarising and
presenting a set of data
Discrete variables: produce numerical responses that arise from a
counting process.
Focus group: a market research tool which is used to elicit unstructured
responses to open-ended questions.
Inferential statistics: uses sample data to draw conclusions about a
population.
Interval scale: ordered scale in which the difference between
measurements is a meaningful quantity but does not involve a true zero
point.
Nominal scale: classifies data into various distinct categories in which
no ranking is implied. Nominal scaling is the weakest form of
measurement because you cannot specify any ranking across the various
categories.

Presenting and Describing Information


Numerical variables: yield numerical responses, such as your height in
centimetres. There are two types of numerical variables: discrete and
continuous.

Ordinal scale: classifies data into distinct categories in which ranking is
implied, e.g. things are ranked in order of satisfaction level. Ordinal scaling
is a stronger form of measurement than nominal scaling because an
observed value classified into one category possesses more of a property
than does an observed value classified into another category. However,
ordinal scaling is still relatively weak because the scale does not account
for the amount of the differences between the categories. Ordering only
implies which category is greater or preferred, not by how much.

Operational definition: a universally accepted meaning that is clear to all
associated with an analysis.

Ratio scale: an ordered scale in which the difference between the
measurements involves a true zero point, e.g. weight, length, age or salary.

Basic Concepts of Statistics: Chapter Summary

Statistics examines ways to process and analyse data and provides
procedures to collect and transform data in ways that are useful to
business decision-makers.
Identifying the most appropriate source of data is a critical aspect of
statistical analysis.
Data from a categorical variable are measured on a nominal scale or on
an ordinal scale.
Data from a numerical variable are measured on an interval or ratio scale.
Data measured on an interval scale or on a ratio scale constitute the
highest levels of measurement.
They are stronger forms of measurement than an ordinal scale because
you can determine not only which observed value is the largest but also by
how much.

