
Introduction to Data Science

What we learned from the intake survey


• ~Half the students have a sufficient background in
Python; the other half does not and instead knows a variety
of other languages, mostly Java.
• → We have to start from scratch in the recitations regarding
coding in Python. Perhaps go a bit faster than usual.
• Everyone will learn something in this course.

What we will not do in this class
Math is extremely useful
• But it can be used for deliberate obfuscation:
We will focus on
conceptual relations,
not random, isolated
factoids
Theory of (data) science

10101001
Why does science matter?
• What is the point of science?
• Of course people had thoughts about the
world well before science…
• Philosophers
• Philosophy = Love of knowledge
• “Natural philosopher” = People concerned
with the nature of reality, e.g. Thales:
“Everything is water”.
• Mode of inquiry: Observation, Intuition, Logic.
Aristotle’s book of problems: On hair
• What is hair? Growing hair removes “gross fluids”
from the brain. Hair is the dried residue. So dry that
it can’t even be digested.
• Why do women have longer hair than men?
Women don’t have beards, it all drains on top.
• Why are women less hairy in general? Because
their periods drain bodily fluids → less for hair
growth.
• Why do men go bald? Male sexual activity expels
moisture, which deprives hair of nutriment.
Excessive sex accelerates hair loss, whereas eunuchs
don’t go bald.
What is missing?
Data!
• Data is missing.
• But what is data and how can it help to
understand reality better?
Data is very special. Coming up with the
concept was a radical, paradoxical step
Qualitative descriptions
Mathematics of the natural world

[Figure: the numbers 1, 17 and 3, alongside the description “The monsoon brings a moist heat”]
Quantification
Descriptions
Mathematics of the natural world

[Figure: the numbers 1, 17 and 3]
Quantification was a radical step
• Took until the 1250s to seriously consider the idea.
• Took another 750+ years to implement / realize.
Key concept: Operationalization
• Quantifying (measuring) theoretical
constructs.
• A theory links the measurement to the
construct and makes it meaningful.
• For instance: For a long time, physicists
measured temperature by the height of a
scaled mercury column (based on the
notion that metal expands when heated).
• That yields a number that is then
interpreted to correspond to a particular
physical state (temperature).
The “invention” of temperature
• We take it for granted, but it took physics hundreds of years to
come up with a usable temperature scale.
• The key issue was the “fixedness” of fixed points.
• Once one has stable fixed points (e.g. the freezing or boiling point
of water), it is trivial to make a scale.
• But how would one know what is fixed without having a
thermometer in the first place?
• People initially suggested all kinds of “fixed points” to calibrate the
scale, e.g. the temperature at which butter melts, body
temperature, etc.
• Even after converging on the boiling point of water as a stable
reference point, it turned out to be much more variable than initially
thought: superheating occurs, and the boiling point depends on the air
dissolved in the water, impurities in (and salt content of) the water,
air pressure, the vessel (metal or glass) and so on.
• Even inserting a thermometer itself changes the properties of the
water.
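The claim that a scale is trivial once two fixed points are stable can be sketched in Python. This is a minimal illustration, not a historical procedure: the function name `make_scale` and all column heights are hypothetical, and a linear relation between column height and temperature is assumed.

```python
def make_scale(height_at_freezing, height_at_boiling, low=0.0, high=100.0):
    """Build a reading function from two fixed points (assumes linear expansion)."""
    span = height_at_boiling - height_at_freezing
    if span == 0:
        raise ValueError("The two fixed points must give different column heights.")

    def to_temperature(height):
        # Linear interpolation between the two calibration points.
        return low + (height - height_at_freezing) * (high - low) / span

    return to_temperature


# Hypothetical mercury column heights in millimetres.
celsius = make_scale(height_at_freezing=20.0, height_at_boiling=220.0)
print(celsius(120.0))  # 50.0

# If the "fixed" boiling point drifts (here: column at 230 mm instead of 220 mm),
# the same column height is assigned a different temperature.
drifted = make_scale(height_at_freezing=20.0, height_at_boiling=230.0)
print(round(drifted(120.0), 2))  # 47.62
```

The second scale shows why the fixedness of the fixed points matters: the number is only meaningful if the calibration points do not move.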
Data is special in another way:
The other two realms

Things that exist | Things people make up

Are data born or made?


The birth/making of data

Things that exist | Things people make up

Data
The datification of the world

Ever growing
What scientific data are:
• Latin: “A given thing” (~1645) – no data in the ancient
world. A relatively modern concept.
• However, data are *not* just given, so the name is ironic.
What was meant is “quantitative facts”.
• Some people have proposed capta (“taken”) instead.
But scripta (“written down/recorded”) might be most
accurate.
• Born (AND made) as a result of a formal observation, a
measurement process.
• No measurement, no data. Quality of the data depends
on the quality of the measurement.
• Has to be interpreted in terms of a theory of
measurement (e.g. temperature).
• Usually plural (for our purposes). Singular: Datum
How is data related to science?
• What is science?
What is science?
• Derived from the Latin word “scientia”,
meaning knowledge.
• In this – broad – sense, any body of
knowledge qualifies as a science.
• However, there is a much more narrow sense,
in which we understand the term.
• In this sense, most bodies of knowledge are
decidedly *not* a science.
Characteristics of science
• A body of knowledge.
• Goal: A principled understanding of the natural world.
-Understanding = Description, Explanation, Prediction.
• Deductive insofar as principles are tested with new
observations.
• Inductive by formulating principles from observations.
• Falsification focus
• The more precise the observations, the better.
• Can deal with simple or complex subjects.
• Experiments help with a certain kind of understanding
(particularly for complex subjects) but are not strictly
necessary.
Deduction
• “Top-down” reasoning.
• Going from premises (axioms, assumptions) to logically
valid conclusions.
• Spelling out implications by logic. Classic syllogism:
• 1. All men are mortal [Major premise]
• 2. Socrates is a man [Minor premise]
• 3. Socrates is mortal [Conclusion]
• Advantage: If premises are true and reasoning is
logically sound, the conclusions are certain.
• Also, if they are general, the conclusions apply to all
minor premises that match the major premise.
• Disadvantage: Nothing new can be learned. Everything
that is true is already contained in the premises, it is
just being spelled out.
Induction
• “Bottom-up” reasoning.
• Going from individual observations to general rules,
laws or principles.
• Example: 2 4 6 8… What is the rule?
• Example: Everyone who has ever lived died before
they turned 123*.
• Rule: People die before they are 123 years of age.
(Dong et al., Nature, 2016, “Evidence for a limit to human lifespan”)
• Advantage: New knowledge can be generated.
• Disadvantage: One can be wrong. The conclusions
arrived at by induction are not certain.
Induction: Connecting the dots

Problem: There are many ways to do so
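A small Python sketch of this problem, using the 2 4 6 8 example from the previous slide. Both rules below are illustrative assumptions (the function names are hypothetical); they agree on every observation so far and disagree on the next one.

```python
observations = [2, 4, 6, 8]


def rule_even(n):
    # The "obvious" rule: the n-th term is 2n.
    return 2 * n


def rule_polynomial(n):
    # Also matches the first four terms: the extra product is zero for n = 1..4.
    return 2 * n + (n - 1) * (n - 2) * (n - 3) * (n - 4)


for rule in (rule_even, rule_polynomial):
    fitted = [rule(n) for n in range(1, 5)]
    print(rule.__name__, "fits the data:", fitted == observations, "| next term:", rule(5))
```

The observations alone cannot decide between the two rules; only a new observation (the fifth term) can rule one of them out.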


A fundamental problem with induction:
The turkey problem
• A turkey lives on a farm.
• Every morning, without fail,
the turkey is fed.
• The turkey starts to build a
model of the world where its
life purpose is to be fed.
• Every evening, the turkey
predicts that it will be fed in
the morning.
• For hundreds of days, the
turkey is always right and
boasts about the accuracy of
its model.
• Until one day…
The generalized turkey problem
• Knowledge gained by induction is inherently
uncertain (whether it is true knowledge).
• It must be tested under new conditions, or
remain uncertain.
• Tests under the same conditions can give the
illusion of “prediction” or “things are working”.
• The problem is that the uncertainty is hidden –
the observer is not aware of its existence.
• This typically leads to catastrophic failure.
• For instance, the 2008 financial crisis was the end
result of inappropriate use of data in this regard.
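The hidden uncertainty can be made concrete with a toy Python sketch of the turkey's model. Everything here is a hypothetical illustration: the history is made up (300 fed days, then one day without feeding), and the "model" simply predicts that tomorrow looks like every day seen so far.

```python
def turkey_history(days_fed):
    """Made-up history: fed on every day, then not fed on the final day."""
    return [True] * days_fed + [False]


def predict_fed(past):
    # Induction: if feeding happened on every past day, expect it again.
    return all(past)


history = turkey_history(300)

# Each evening, predict the next morning from all days seen so far.
hits = []
for day in range(1, len(history)):
    prediction = predict_fed(history[:day])
    hits.append(prediction == history[day])

accuracy_before_last_day = sum(hits[:-1]) / len(hits[:-1])
print("Accuracy before the final day:", accuracy_before_last_day)  # 1.0
print("Prediction correct on the final day:", hits[-1])  # False
```

Testing under the same conditions yields perfect accuracy right up to the failure, so the track record says nothing about the one day that matters.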
Example: The Aztec cult of Huitzilopochtli
• Huitzilopochtli is the Aztec
sun god.
• The Aztecs realized that life
on earth critically
depended on sunlight.
• To entice the sun (god) to
return each day, they
sacrificed a prisoner to this
god each and every day, by
offering his beating heart.
• This worked.
• The sun rose each day,
without fail.
• Until one day….
Black Swans: Cygnus atratus

They exist! There is no proof in science. The fact
that there are proofs in math establishes that
math is not (/can’t be) a science
Doing it wrong: Attempting confirmation
The fundamental asymmetry between
verification/confirmation and falsification
• Confirming the rule (model/theory) doesn’t add
anything.
• It only breeds false confidence, as seen in the
turkey problem and the issue of black swans.
• One only learns from falsification.
• Corollary: A scientific investigation focuses on
falsification. The conclusions can’t be foregone.
• People are not inclined to do this. The natural
tendency, in almost all human affairs, is to confirm
with evidence what one already suspects.
Rule: If there is a vowel on one side,
there is an even number on the other.

Which cards need to be turned over to test the
rule (whether it is true or false)?
Rule:
If one is indoors, one has to wear a mask

Indoors | Outdoors | Wears a mask | Does not wear a mask

Who has to be checked to see whether the rule was violated?
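One way to reason about this is to ask, for each card, whether its hidden side could reveal a violation of "if indoors, then mask". The Python sketch below encodes that check; the function name `must_check` and the tuple encoding of the cards are assumptions made for illustration.

```python
def must_check(visible_side):
    """Return True if the hidden side of this card could falsify the rule."""
    kind, value = visible_side
    if kind == "location":
        # Only someone indoors can violate the rule (by not wearing a mask).
        return value == "indoors"
    if kind == "mask":
        # Only someone without a mask can violate the rule (by being indoors).
        return value is False
    raise ValueError(f"Unknown card type: {kind}")


cards = {
    "Indoors": ("location", "indoors"),
    "Outdoors": ("location", "outdoors"),
    "Wears a mask": ("mask", True),
    "Does not wear a mask": ("mask", False),
}

to_check = [name for name, side in cards.items() if must_check(side)]
print(to_check)  # ['Indoors', 'Does not wear a mask']
```

The "Wears a mask" card can only confirm the rule, never break it, so checking it teaches nothing. By the same logic, the vowel rule is tested with the vowel card and the odd-number card, not the even-number card.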


Is it now clear why falsification is
central in science?

One can only learn something from falsification,
not confirmation (to test hypotheses).
So: Data are fuel
• For the inducto-deductive engine of science.

Realm of ideas

Deduction*
Induction

External physical world (Reality)


In regular science, data flows like water
What about data science – science
applied to data itself?
• A new kind of science.
• What makes it so special?
Data science as a relevant and independent
field is less than a decade old
A genuinely 21st century enterprise
What took so long?
