You are on page 1of 11

ST2195 Programming for Data Science

1a. Introduction to Data Science

Note: Content extracted from ST2195 Subject Guides (available via UOL VLE)
1
What is Data?

“Data is a set of values of subjects with respect to qualitative or quantitative


variables.” -- Wikipedia

Data can be summarized in the form of


 Vector or matrix
 Tensor (high-order matrix)
 Image, or text

2
What is Data Science?

Many definitions exist none of which is widely acceptable. Here’s a reasonable one…

Data science is the application of computational and statistical techniques on data to address or
gain insight into some problem in the real world.

We can also think of data science as the union of the various techniques that are required to accomplish
the above…

Data science = Data collection + Data (pre-)processing + Big data + Scientific hypotheses +
Business Insights + Visualisation + Machine learning + Statistics + (etc)

In this course we will explore the programming aspect of all the above areas.

3
What Data Science is Not…
Data Science Is Not (Just) Machine Learning
Predictions can be an important part of data science, but truly hard elements involve also
 Collecting the data
 Defining the problem you are trying to solve (and frequently, re-defining it many times)
 Interpreting and understanding the results, and knowing what actions to take based upon this

Data Science Is Not (Just) Statistics


At least two distinctions that are worth making between data science and statistics
 Academic field of statistics has tended more towards the theoretical aspects of data analysis than the
practical aspects.
 Historically, data science has evolved from computer science as much as it has from statistics: topics like
data scraping, and data processing more generally, are core to data science, typically are steeped more
in the historical context of computer science, and are unlikely to appear in many statistics courses.

Data Science Is Not (Just) Data Nor (Just) Big Data


A substantial proportion of the data science workload (if not the majority) consists of data collection and
management, but data science consists of all the other tasks mentioned above.

4
Data vs Information

Data
 Raw, unorganized facts that need to be processed
 Unusable until it is organized

Information
 Created when data is processed, organized, and structured
 Needs to be put in an appropriate context in order to become useful

5
Examples of Data Science Tasks
Example 1: Email Spam
 4601 email messages were stored and labelled as spam or not.
 The relative frequency of the 57 most common words are available. See table below for some
of them.
 Aim is to design automatic spam detector that could filter out spam before clogging the users
mailboxes.

Spam email example. Table 1.1 from Hastie et al. (2001)

6
Examples of Data Science Tasks
Example 2: Real Estate
 Determine neighbourhood
characteristics that drive house
prices.
 Data is from the Boston Housing
dataset available from the “scikit
learn” Python library

Matrix scatter plot of the Boston Housing dataset variables.


7
Examples of Data Science Tasks

Example 3: Handwritten Digits


 Construct a machine that identifies
the numbers in a handwritten ZIP
code from digitised images.
 Data are summarised in the image
shown on the right, taken from
Hastie et al. (2001) textbook

Images of digits in Figure 1.2 from Hastie et al. (2001)

8
Computer Programming and Data Science
How do we actually do data science? Some of the tasks we need to undertake…
 Data collection
 Data processing (wrangling)
 Data visualisation
 Train and apply algorithms from fields such as machine learning, statistics, data mining, optimisation,
image processing, etc.

Tasks require the use of computers and in particular programming.


 (Computer) programming: Process of producing an executable computer program that performs a
specific task, like the ones above.
 Programming Language: Source code of a program is written in one or more languages that are
intelligible to humans, rather than machine code, which is directly executed by the CPU.

We will explore programming languages R and Python to perform data science tasks.
 Both R and Python are open source, i.e. free software which the user can modify and distribute within
the terms of a licence

9
Programming for Data Science Tools
Integrated Development Environments (IDEs)
 Software applications that facilitate computer programming and software development. Examples
include RStudio, Spyder, Microsoft Visual Studio, etc

Version Control via Git


 Git: A version control system that allows for complete history of changes, branching, staging areas, and
flexible and distributed workflows

GitHub
 Code hosting platform for version control and collaboration that is based on Git
 Largest host of source code in the world

Markdown (and Other Markup Languages)


 Idea of a “markup” language: HTML, XML, LaTeX
 Created by John Gruber as a simple way for non-programming types to write in an easy-to-read format
that could be converted directly into HTML. Uses plain text, and can be read when not rendered

10
Useful Links, Resources, and References

Useful Links and Resources


• 50 years of data science: Article by David Donoho
• The joy of Stats: Video by Hans Rosling
• R web page
• RStudio web page
• Python
• Anaconda web page
• GitHub web page
• Git for non-developers - Blog by Anita Cheng

References
• Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical
learning. Springer.

11

You might also like