
Introduction to data science:

What is data science?

In a nutshell, data science is solving problems with data.

What is in data science?

Machine learning: Using and developing algorithms that allow computers to learn from data and make
predictions

Applied statistics: Hypothesis testing, mathematical modeling, experimental design

Operations research: Optimizing processes, resources, decision making, etc. within a business

Information theory: Quantification, storage and communication of information

Data engineering: Building and maintaining robust infrastructure for data pipelines (feeding and
analyzing large volumes of data)

Types of ML

The three most common types of ML are:

Regression (supervised)
Classification (supervised)
Clustering (unsupervised)

Regression:

Classification:

Clustering:

Basic statistics to describe data:

- When describing numerical data, we can use descriptive statistics
- Descriptive statistics provide summary information on the characteristics and distributions of
values within one or more datasets

Descriptive statistics:
- There are three main areas:
- Distribution: how frequently each value occurs within the data
- Central tendency: the center, or typical value, of the data
- Variability: how spread out the values are around the central tendency

Distribution:

- A dataset is made up of values, and we can summarize how often each possible value occurs
using counts or percentages. This is usually done with a frequency table.
- The simple frequency table lists each value or category together with its count. This makes
it easy to identify the most common group.
- The grouped frequency table bins numerical values into ranges, here the number of visits each
person made to the library. It reveals further information on the distribution of values,
i.e. here most people visit the library between 9 and 12 times.
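
As a rough illustration of the two kinds of frequency table, here is a minimal Python sketch. The `visits` list is made-up data (the original table is not reproduced in these notes), and the helper names `simple_frequency` and `grouped_frequency` are hypothetical. The grouping assumes positive whole numbers and bins that start at 1.

```python
from collections import Counter

# Hypothetical library visit counts per person (illustrative data only).
visits = [2, 5, 9, 10, 11, 12, 9, 4, 10, 15, 1, 11]


def simple_frequency(values):
    """Count how often each distinct value occurs, ordered by value."""
    return dict(sorted(Counter(values).items()))


def grouped_frequency(values, width):
    """Count values per bin of the given width; bins are 1-4, 5-8, ... for width 4."""
    counts = Counter((v - 1) // width for v in values)
    return {
        f"{b * width + 1}-{(b + 1) * width}": n
        for b, n in sorted(counts.items())
    }


simple = simple_frequency(visits)
grouped = grouped_frequency(visits, 4)

# The most common group is the bin with the highest count.
most_common_group = max(grouped, key=grouped.get)
print(grouped, most_common_group)
```

With this sample data the 9-12 bin holds the most people, which mirrors the kind of conclusion a grouped frequency table supports.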

Central tendency:
- Central tendency represents the center, or average, of the dataset
- Mean, median and mode are the most commonly used measures of the average
- Mean: add up all values and divide by the number of values
- Median: the middle value once the data is sorted (with an even number of values, the mean of
the two middle values)
- Mode: the most commonly found value
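
The three measures above can be sketched with Python's standard `statistics` module. The `values` list is invented for illustration; the manual mean is included only to show the "add up and divide" definition.

```python
import statistics

# Hypothetical sample data (illustrative only).
values = [7, 2, 12, 7, 3, 4, 7]

# Mean: add up all values and divide by the number of values.
manual_mean = sum(values) / len(values)
mean = statistics.mean(values)

# Median: the middle value of the sorted data.
median = statistics.median(values)

# Mode: the most commonly found value.
mode = statistics.mode(values)

print(mean, median, mode)
```

Note that the mean (6) and the median (7) differ here because the single large value 12 and the small values pull them in different ways; comparing the two is a quick check for skew.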

Variability:

- Variability tells us how spread out the values within the dataset are
- Range, standard deviation and variance are the most common metrics of variability
- Range: largest value minus smallest value
- Standard deviation: roughly the typical distance of values from the mean. High SD means high
variability, low SD means low variability
- Variance: the average of the squared deviations from the mean, equal to the standard
deviation squared
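
A minimal sketch of these three metrics, again with invented data. The notes do not say whether the population or the sample convention is meant, so this assumes the population versions (`pvariance`, `pstdev`), which divide by the number of values; `statistics.variance` and `statistics.stdev` divide by one less than that and would give slightly larger results.

```python
import statistics

# Hypothetical sample data (illustrative only).
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: largest value minus smallest value.
value_range = max(values) - min(values)

# Population variance: average squared deviation from the mean.
variance = statistics.pvariance(values)

# Population standard deviation: square root of the variance.
std_dev = statistics.pstdev(values)

print(value_range, variance, std_dev)
```

For this data the mean is 5, the variance is 4 and the standard deviation is 2, so squaring the standard deviation recovers the variance.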

Standard deviation:

- Standard deviation represents the dispersion of values from the mean
- The flatter and wider the distribution curve, the higher the standard deviation
- The smaller the spread (a narrower, taller curve), the lower the standard deviation
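
To make the link between spread and standard deviation concrete, the sketch below compares two invented datasets that share the same mean. The names `tight` and `wide` are hypothetical, and the population standard deviation is assumed.

```python
import statistics

# Two hypothetical datasets with the same mean but different spread.
tight = [9, 10, 10, 10, 11]   # values cluster close to the mean
wide = [2, 6, 10, 14, 18]     # values sit far from the mean

mean_tight = statistics.mean(tight)
mean_wide = statistics.mean(wide)

sd_tight = statistics.pstdev(tight)
sd_wide = statistics.pstdev(wide)

print(mean_tight, mean_wide)
print(sd_tight, sd_wide)
```

Both means are 10, but the widely spread dataset has a much larger standard deviation, which corresponds to the flatter curve.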
