Data Analysis

Dr. Hamid Mostofi


mostofidarbani@tu-berlin.de
Project Manager / Senior Researcher at TU Berlin
Dr. phil., Dipl.-Ing., MBA
Team member of GECI in the German Federal Ministry for Environment

Lecturer in the MBA program Sustainable Mobility Management at TU Berlin:
Mobility Data Mining; Macroeconomics and business models

Lecturer of system dynamics modeling at TU Berlin and HTW Berlin

Projects
• eHaul project: Project Manager. Electrification of long-haul heavy-duty
commercial vehicles with automated battery swapping stations, funded
by BMWK (Federal Ministry for Economic Affairs and Climate Action)

• SUMIC 2020: Project Manager. Academic cooperation for developing
smart urban mobility considering climate change mitigation solutions,
funded by BMBF (Federal Ministry of Education and Research)

• GECI project: Developing the academic curricula in the field of
renewable energies and smart mobility, funded by BMU (Federal
Ministry of Environment)

• System dynamic modelling of sustainable urban transportation in
New York, Cairo

Software and Programs
Data science is an interdisciplinary field that uses scientific methods, processes,
algorithms and systems to extract knowledge and insights from noisy, structured and
unstructured data, and apply knowledge from data across a broad range of
application domains.
Data science is the field of study that
combines domain expertise, programming
skills, and knowledge of mathematics and
statistics to extract meaningful insights
from data.
What is correlation?
Correlation

• Correlation is a statistical term describing the degree to which
two variables move in coordination with one another. If the two
variables move in the same direction, they are said to have a
positive correlation. If they move in opposite directions, they
have a negative correlation.
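
The slide does not name a specific measure of correlation. As a minimal sketch, assuming the Pearson correlation coefficient is the measure of interest, the following computes it for two small sets of values. The function name pearson_r and all numbers are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical illustrative values, not data from the lecture.
base = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # moves in the same direction as base
falling = [10, 8, 6, 4, 2]   # moves in the opposite direction

r_pos = pearson_r(base, rising)
r_neg = pearson_r(base, falling)
print(r_pos, r_neg)
```

A value near +1 indicates a strong positive correlation, a value near -1 a strong negative correlation, and a value near 0 little or no linear relationship.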
How do you interpret data?
Causality vs. Correlation
• A correlation between variables does not automatically mean that
the change in one variable is the cause of the change in the values of
the other variable.
• Causation indicates that one event is the result of the occurrence of
the other event; i.e. there is a causal relationship between the two
events.
• All causation involves correlation

• Not all correlation is causation

• If there is no correlation of any kind (linear or nonlinear),
there is no causation

• Correlation is a necessary condition for causation, but not a
sufficient one
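
One way to see why correlation alone does not establish causation is a simulated common cause. The sketch below is an illustration on synthetic data, not an example from the lecture: a hypothetical variable z drives both x and y, so x and y are strongly correlated although neither influences the other. The helper pearson_r, the seed, and the noise levels are all assumptions.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


random.seed(0)  # fixed seed so the run is repeatable

# z is a hypothetical common cause (confounder) of both x and y.
z = [random.gauss(0, 1) for _ in range(1000)]
# x and y each depend on z plus independent noise; x never enters y.
x = [zi + random.gauss(0, 0.5) for zi in z]
y = [zi + random.gauss(0, 0.5) for zi in z]

r_xy = pearson_r(x, y)
print(round(r_xy, 2))  # clearly positive, despite no causal link x -> y
```

In this setup, changing x directly would not change y at all. The correlation exists only because both follow z.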


Behavior Patterns
Behavior Patterns in Stock prices

https://www.youtube.com/watch?time_continue=29&v=2FnEGipMUNE
Why study statistics?

1. Data are everywhere

2. Statistical techniques are used to make many decisions that affect our lives

3. No matter what your career, you will make professional decisions that involve
data. An understanding of statistical methods will help you make these decisions
effectively.
Applications of statistical concepts in the
business world
• Finance – correlation and regression, index numbers, time series
analysis

• Marketing – hypothesis testing, chi-square tests, nonparametric
statistics

• Personnel – hypothesis testing, chi-square tests, nonparametric tests

• Operations management – hypothesis testing, estimation, analysis of
variance, time series analysis
Statistics
• The science of collecting, organizing, presenting, analyzing, and
interpreting data to assist in making more effective decisions

• Statistical analysis – used to manipulate, summarize, and
investigate data, so that useful decision-making information
results.
Statistical data
• The collection of data that are relevant to the problem being studied is commonly
the most difficult, expensive, and time-consuming part of the entire research
project.

• Statistical data are usually obtained by counting or measuring items.

• Primary data are collected specifically for the analysis desired.

• Secondary data have already been compiled and are available for statistical analysis.
Data
Statistical data are usually obtained by counting or measuring items. Most
data can be put into the following categories:

• Qualitative - data are measurements that each fall into one of several
categories (hair color, ethnic groups and other attributes of the population).

• Quantitative - data are observations that are measured on a numerical scale
(distance traveled to college, number of children in a family, etc.).
Qualitative data
Qualitative data are generally described by words or letters. They are not as
widely used as quantitative data because many numerical techniques do not apply
to qualitative data. For example, it does not make sense to find an average
hair color or blood type.

Qualitative data can be separated into two subgroups:

• dichotomic: the variable takes one of two options (e.g. gender: male or female)

• polynomic: the variable takes one of more than two options (e.g. education:
primary school, secondary school and university)


Quantitative data
Quantitative data are always numbers and are the result of counting or
measuring attributes of a population.

Quantitative data can be separated into two subgroups:

• discrete: the result of counting (the number of students of a given ethnic
group in a class, the number of books on a shelf, ...)

• continuous: the result of measuring (distance traveled, weight of luggage, …)


Types of variables

Variables
• Qualitative
  - Dichotomic: gender, marital status
  - Polynomic: brand of PC, hair color
• Quantitative
  - Discrete: children in family, strokes on a golf hole
  - Continuous: amount of income tax paid, weight of a student
Variable types:
• Nominal – consists of categories in each of which the number of respective observations is
recorded. The categories are in no logical order and have no particular relationship. The
categories are said to be mutually exclusive since an individual, object, or measurement can be
included in only one of them.

• Ordinal – contains more information. Consists of distinct categories in which order is implied.
Values in one category are larger or smaller than values in other categories (e.g. rating: excellent,
good, fair, poor).

• Interval/scale – a set of numerical measurements in which the distance between numbers is
known.
Data Analysis

1. Descriptive statistics - provide an overview of the attributes of a
data set. These include frequency distributions and histograms,
measures of central tendency (mean, median and mode) and measures
of dispersion (range, variance and standard deviation).

2. Inferential statistics - provide measures of how well your data
support your hypothesis and whether your data are generalizable
beyond what was tested (significance tests).
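
The descriptive measures listed above can be sketched with Python's standard statistics module. The data values are hypothetical. The sketch uses pvariance and pstdev, which treat the data as a whole population. Whether the population or the sample version is intended is not stated on the slide, so that choice is an assumption.

```python
import statistics

# Hypothetical values chosen for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Dispersion
value_range = max(data) - min(data)
variance = statistics.pvariance(data)  # population variance (divide by N)
std_dev = statistics.pstdev(data)      # population standard deviation

print(mean, median, mode)
print(value_range, variance, std_dev)
```

For data that are a sample rather than the full population, statistics.variance and statistics.stdev divide by N - 1 instead of N.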
Types of statistics
• Descriptive statistics – Methods of organizing, summarizing, and presenting
data in an informative way

• Inferential statistics – The methods used to determine something about a
population on the basis of a sample

• Population – The entire set of individuals or objects of interest or the measurements
obtained from all individuals or objects of interest

• Sample – A portion, or part, of the population of interest
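
The relationship between a population and a sample can be illustrated with a small simulation. This is a sketch on synthetic data: the population values, the seed, and the sample size of 200 are all assumptions made for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 synthetic measurements.
population = [random.gauss(50, 10) for _ in range(10_000)]

# A sample is a portion of the population, drawn here without replacement.
sample = random.sample(population, 200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

# The sample mean serves as an estimate of the population mean.
print(round(population_mean, 2), round(sample_mean, 2))
```

In practice the population mean is usually unknown, which is why inferential methods work from the sample alone.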


Measure of central tendency
• A measure of central tendency is a single value that attempts to describe the whole set
of data. There are three main measures of central tendency:

• Mean

• Median

• Mode

What do you “Mean”?

The “mean” of some data is the average score or value,
such as the average age of an MBA student or the average
income of people who regularly use public transport.

Mean: $\mu = \frac{\sum X}{N}$

where X denotes each individual value and N is the number of values.
Problem of being “mean”

• The main problem associated with the mean value of some data is that
it is sensitive to outliers.

• Example: the average income of people who regularly use public
transport might be strongly affected if one person had an annual
income of 500,000 euros.
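
The effect of a single outlier on the mean can be checked with a few lines of Python. The five baseline incomes are hypothetical values invented for this sketch. Only the 500,000 euro figure comes from the slide.

```python
import statistics

# Hypothetical annual incomes in euros (illustrative only).
incomes = [28_000, 30_000, 32_000, 35_000, 40_000]

# Add the single high earner mentioned in the example.
with_outlier = incomes + [500_000]

mean_before = statistics.mean(incomes)
mean_after = statistics.mean(with_outlier)

print(mean_before)           # 33000
print(round(mean_after, 2))  # 110833.33
```

One extreme value more than triples the mean here, although five of the six people earn 40,000 euros or less.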
The Median
• Because the mean can be sensitive to extreme values, the median
is sometimes a more useful and more representative measure.

• The median is simply the middle value among the rank-ordered scores
of a variable (there is no standard formula for its computation).
What is the Median?

Rank order the values and choose the middle value. If the number of values is
even, take the average of the two values in the middle.

Professor        Weight    Weight (rank-ordered)
Schmuggles       165       132
Bopsey           213       148
Pallitto         189       151
Homer            187       165
Schnickerson     165       165
Levin            148       187
Honkey-Doorey    251       189
Zingers          308       199
Boehmer          151       213
Queenie          132       227
Googles-Boop     199       251
Calzone          227       308
Mean             194.6
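
The rank-order procedure for the median can be written out directly. The sketch below applies it to the twelve professor weights from the table. The function name median is a label chosen here, not something taken from the lecture.

```python
# Professor weights from the lecture's table.
weights = [165, 213, 189, 187, 165, 148, 251, 308, 151, 132, 199, 227]


def median(values):
    """Middle value of the rank-ordered data (average of two if even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Even count: average the two values in the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2


med = median(weights)
mean = sum(weights) / len(weights)

print(med)             # 188.0
print(round(mean, 1))  # 194.6
```

With twelve values, the two middle values are 187 and 189, so the median is 188. It lies below the mean of 194.6 because the largest weight (308) pulls the mean upward.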
Percentiles
• If we know the median, then we can go up or down and rank the
data as being above or below certain thresholds.

• You may be familiar with standardized tests: scoring in the 90th
percentile means your score was higher than 90% of the scores in
the sample.
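
A percentile rank can be sketched as the share of values that lie below a given score. Several slightly different definitions of percentiles are in use. This sketch assumes the "strictly below" definition, and the function name percentile_rank is hypothetical. The data are the professor weights from the lecture's median example.

```python
# Professor weights from the lecture's median example.
weights = [165, 213, 189, 187, 165, 148, 251, 308, 151, 132, 199, 227]


def percentile_rank(values, score):
    """Percentage of values strictly below the given score."""
    below = sum(1 for v in values if v < score)
    return 100 * below / len(values)


rank_251 = percentile_rank(weights, 251)
print(round(rank_251, 1))  # 83.3: ten of the twelve weights are below 251
```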
The Mode
• The most frequent response or value for a variable.

• Multiple modes are possible: bimodal or multimodal.


Figuring the Mode
Professor        Weight
Schmuggles       165
Bopsey           213
Pallitto         189
Homer            187
Schnickerson     165
Levin            148
Honkey-Doorey    251
Zingers          308
Boehmer          151
Queenie          132
Googles-Boop     199
Calzone          227

What is the mode?

Answer: 165

Important descriptive information that may help inform your research
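
Finding the mode amounts to counting how often each value occurs. The sketch below does this for the professor weights using collections.Counter. It returns a list so that bimodal or multimodal data are handled as well.

```python
from collections import Counter

# Professor weights from the lecture's table.
weights = [165, 213, 189, 187, 165, 148, 251, 308, 151, 132, 199, 227]

# Count occurrences of each value.
counts = Counter(weights)
highest = max(counts.values())

# Keep every value that reaches the highest count (one or several modes).
modes = [value for value, count in counts.items() if count == highest]

print(modes)  # [165]
```

The value 165 occurs twice and every other weight occurs once, so 165 is the only mode.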
