How to Become a (Good) Data Scientist – Beginner Guide

Tags: Beginner, BI, Data Scientist, Sciforce, Statistics

A guide covering the things you should learn to become a data scientist, including the basics of business intelligence, statistics, programming, and machine learning.

By Sciforce.

How simple is Data Science?

Sometimes, when you hear data scientists rattle off a dozen algorithms while discussing their experiments or go into the details of TensorFlow usage, you might think that there is no way a layperson can master Data Science. Big Data looks like another mystery of the Universe, locked away in an ivory tower with a handful of present-day alchemists and magicians. At the same time, you hear from everywhere about the urgent need to become data-driven.

The trick is, we used to have only limited and well-structured data. Now, with the global Internet, we are swimming in never-ending flows of structured, unstructured, and semi-structured data. It gives us more power to understand industrial, commercial, or social processes, but at the same time, it requires new tools and technologies.

Data Science is merely a 21st-century extension of the mathematics that people have been doing for centuries. In essence, it is the same skill of using the available information to gain insight and improve processes. Whether it’s a small Excel spreadsheet or 100 million records in a database, the goal is always the same: to find value. What makes Data Science different from traditional statistics is that it tries not only to explain values, but also to predict future trends.

In other words, we use Data Science for:

Data Science is a newly developed blend of machine learning algorithms, statistics, business intelligence, and programming. This blend helps us reveal hidden patterns in raw data, which in turn provide insights into business and manufacturing processes.

What should a data scientist know?

To go into Data Science, you need the skills of a business analyst, a statistician, a programmer, and a Machine Learning developer. Luckily, for the first dive into the world of data, you do not need to be an expert in any of these fields. Let’s see what you need and how you can teach yourself the necessary minimum.

Business Intelligence

When we first look at Data Science and Business Intelligence, we see the similarity: both focus on data to provide favorable outcomes, and both offer reliable decision-support systems. The difference is that while BI works with static and structured data, Data Science can handle high-speed and complex, multi-structured data from a wide variety of data sources. From the practical perspective, BI helps interpret past data for reporting (Descriptive Analytics), while Data Science analyzes past data to make future predictions (Predictive or Prescriptive Analytics).

Theories aside, to start a simple Data Science project, you do not need to be an expert Business Analyst. What you need is a clear idea of the following steps:

have a question or something you’re curious about;
find and collect relevant data that exists for your area of interest and might answer your question;
analyze your data with selected tools;
look at your analysis and try to interpret the findings.
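
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the question (are weekend orders larger than weekday orders?), the tiny inline data set, and the names load_orders and average_by_group are all invented for the example and do not come from the article.

```python
import csv
import io
from statistics import mean

# Step 1. Question (hypothetical): are weekend orders larger than weekday orders?

# Step 2. Collect data. A tiny made-up CSV stands in for a real export.
RAW = """day,order_value
Mon,20
Tue,35
Sat,50
Sun,70
Wed,25
Sat,60
"""


def load_orders(text):
    """Parse CSV text into a list of (day, value) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["day"], float(row["order_value"])) for row in reader]


def average_by_group(orders):
    """Step 3. Analyze: average order value for weekend vs. weekday."""
    weekend = [value for day, value in orders if day in ("Sat", "Sun")]
    weekday = [value for day, value in orders if day not in ("Sat", "Sun")]
    return {"weekend": mean(weekend), "weekday": mean(weekday)}


orders = load_orders(RAW)
averages = average_by_group(orders)

# Step 4. Interpret: compare the two averages. A real project would also
# check sample size and variability before drawing any conclusion.
print(averages)
```

With real data, only step 2 changes much: the inline string would be replaced by a file, a database query, or an API call.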

As you can see, at the very beginning of your journey, your curiosity and common sense might be sufficient from the BI point of view. In a more complex production environment, there will probably be separate Business Analysts to do the insightful interpreting. However, it is important to have at least a rough idea of BI tasks and strategies.

Resources

We recommend having a look at the following introductory resources to feel more confident in analytics:

Introduction To The Basic Business Intelligence Concepts - an insightful article giving an overview of the basic concepts in BI;

Business Intelligence for Dummies - step-by-step guidance through BI technologies;

Big Data & Business Intelligence - an online course for beginners;

Business Analytics Fundamentals - another introductory course teaching the basic concepts of BI.

Statistics and probability

Probability and statistics are the basis of Data Science. Statistics is, in simple terms, the use of mathematics to perform technical analysis of data. With the help of statistical methods, we make estimates for further analysis. Statistical methods themselves depend on the theory of probability, which allows us to make predictions. Both statistics and probability are separate and complicated fields of mathematics. However, as a beginner data scientist, you can start with 5 basic statistics concepts:

Statistical features like bias, variance, mean, median, percentiles, and many others are the first stats techniques you would apply when exploring a dataset. They are all fairly easy to understand and implement in code, even at the novice level.
Probability Distributions represent the probabilities of all possible values in an experiment. The most common in Data Science are the Uniform Distribution, which describes events that are equally likely to occur; the Gaussian, or Normal, Distribution, where most observations cluster around the central peak (the mean) and the probabilities for values further away taper off equally in both directions in a bell curve; and the Poisson Distribution, which models counts of events in a fixed interval and, unlike the Gaussian, is skewed when the average count is small.
Over and Under Sampling are techniques that help to balance datasets. If the majority class is overrepresented, undersampling selects only some of its data so that it has the same number of examples as the minority class. When data is insufficient, oversampling duplicates the minority class values so that it has the same number of examples as the majority class.
Dimensionality Reduction. The most common technique used for dimensionality reduction is PCA (Principal Component Analysis), which essentially creates new features (principal components) as combinations of the original, possibly correlated features and orders them by how much of the variance in the data each one captures, so the least informative ones can be dropped.
Bayesian Statistics. Finally, Bayesian statistics is an approach that applies probability to statistical problems. It provides us with mathematical tools to update our beliefs about random events in light of new data or evidence about those events.
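
To make three of these concepts more concrete, here is a minimal sketch using only the Python standard library. It covers statistical features, random over- and undersampling, and a single Bayesian update. The sample values, the function names (oversample, undersample, bayes_update) and the probabilities in the usage lines are assumptions made up for illustration. The resampling helpers also assume that the majority class really is the larger one.

```python
import random
import statistics

# Statistical features of a small made-up sample
values = [2, 4, 4, 4, 5, 5, 7, 9]
summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "variance": statistics.pvariance(values),  # population variance
}


def oversample(majority, minority, seed=0):
    """Duplicate randomly chosen minority examples until both classes
    have the same number of examples."""
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority, minority + extra


def undersample(majority, minority, seed=0):
    """Keep a random subset of the majority class that is as large as
    the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, k=len(minority)), minority


def bayes_update(prior, likelihood, false_positive_rate):
    """Posterior P(H | evidence) for a yes/no hypothesis H.

    prior               = P(H)
    likelihood          = P(evidence | H)
    false_positive_rate = P(evidence | not H)
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Usage: a rare event (1% prior) and a fairly reliable signal
posterior = bayes_update(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
print(summary)
print(round(posterior, 3))  # 0.154
```

Even with a signal that fires 90% of the time when the hypothesis is true, the posterior stays near 15% because the event is rare to begin with. This is the kind of belief update the Bayesian item describes.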


Resources

We have selected just a few books and courses that are practice-oriented and can give you a feel for statistical concepts from the beginning:

Practical Statistics for Data Scientists: 50 Essential Concepts - a solid practical book that introduces essential tools specifically for data science;

Naked Statistics: Stripping the Dread from the Data - an introduction to statistics in simple words;

Statistics and probability - an introductory online course;

Statistics for data science - a special course on statistics developed for data scientists.

Programming

Data Science is an exciting field to work in, as it combines advanced statistical and quantitative skills with real-world programming ability. Depending on your background, you are free to choose a programming language to your liking. The most popular in the Data Science community are, however, R, Python, and SQL.

R is a powerful language specifically designed for Data Science needs. It excels at a huge variety of statistical and data visualization applications, and, being open source, has an active community of contributors. In fact, 43 percent of data scientists are using R to solve statistical problems. However, it is difficult to learn, especially if you already mastered a programming language.
Python is another common language in Data Science. 40 percent of respondents surveyed by O’Reilly use Python as their major programming language. Because of its versatility, you can use Python for almost all steps of data analysis. It allows you to create datasets, and you can find almost any type of dataset you need on Google. Easy to learn and ideal for the entry level, Python remains exciting for Data Science and Machine Learning experts thanks to more sophisticated libraries such as Google’s TensorFlow.
SQL (Structured Query Language) is more useful as a data processing language than as an advanced analytical tool. It can help you carry out operations such as adding, deleting, and extracting data from a database, run analytical functions, and transform database structures. Even though NoSQL and Hadoop have become a large component of Data Science, a data scientist is still expected to be able to write and execute complex queries in SQL.
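
The SQL operations mentioned above (adding, deleting, and extracting data, an analytical function, and a structural change) can be tried without installing a database server. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The customers table, its columns, and the sample rows are hypothetical and exist only for this example.

```python
import sqlite3

# An in-memory database: nothing is written to disk
conn = sqlite3.connect(":memory:")

# Create a table (hypothetical schema)
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)

# Add data
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Ann", "Kyiv"), ("Bob", "Lviv"), ("Cid", "Kyiv"), ("Dee", "Odesa")],
)

# Delete data
conn.execute("DELETE FROM customers WHERE name = ?", ("Dee",))

# Extract data with an analytical (aggregate) function
rows = conn.execute(
    "SELECT city, COUNT(*) AS n FROM customers "
    "GROUP BY city ORDER BY n DESC, city"
).fetchall()

# Transform the database structure by adding a column
conn.execute("ALTER TABLE customers ADD COLUMN signup_year INTEGER")
columns = [row[1] for row in conn.execute("PRAGMA table_info(customers)")]

conn.close()

print(rows)     # [('Kyiv', 2), ('Lviv', 1)]
print(columns)  # ['id', 'name', 'city', 'signup_year']
```

The same statements work largely unchanged in other SQL databases, although PRAGMA table_info is specific to SQLite.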

Resources

There are plenty of resources for any programming language and every level of proficiency. We’d suggest visiting DataCamp to explore the basic programming skills needed for Data Science.

If you feel more comfortable with books, the vast collection of O’Reilly’s free programming ebooks will help you choose the language to master.


Machine Learning and AI

Although AI and Data Science usually go hand in hand, a large number of data scientists are not proficient in Machine Learning areas and techniques. However, Data Science involves working with large data sets, which requires mastering Machine Learning techniques such as supervised machine learning, decision trees, logistic regression, etc. These skills will help you solve different data science problems that are based on predictions of major organizational outcomes.

At the entry level, Machine Learning does not require much knowledge of math or programming, just interest and motivation. The basic thing you should know about ML is that at its core lies one of three main categories of algorithms: supervised learning, unsupervised learning, and reinforcement learning.

Supervised Learning is a branch of ML that works on labeled data; in other words, the information you are feeding to the model has a ready answer. Your software learns by making predictions about the output and then comparing them with the actual answers.
In unsupervised learning, data is not labeled, and the objective of the model is to create some structure from it. Unsupervised learning can be further divided into clustering and association. It is used to find patterns in data, which are especially useful in business intelligence for analyzing customer behavior.
Reinforcement learning is the closest to the way that humans learn, i.e., by trial and error. Here, a performance function is created to tell the model whether what it did brought it closer to its goal or took it the other way. Based on this feedback, the model learns and then makes another guess. This cycle repeats, and the guesses gradually improve.
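
As a small illustration of the supervised case, the sketch below fits a straight line to labeled examples with ordinary least squares and then scores its predictions against the known answers. The data (hours studied versus test score) and the function names fit_line and mean_squared_error are invented for this example. Real projects would normally use a library such as scikit-learn rather than hand-written formulas.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between predictions and the actual answers."""
    errors = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Labeled data: each input x comes with its known answer y (made-up values)
hours = [1, 2, 3, 4]
scores = [3, 5, 7, 9]

slope, intercept = fit_line(hours, scores)
error = mean_squared_error(hours, scores, slope, intercept)
prediction = slope * 5 + intercept  # predict the score for 5 hours

print(slope, intercept, error, prediction)
```

The comparison with the actual answers is what makes this supervised: the error tells the learner how far its predictions are from the labels.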

With these broad approaches in mind, you have a backbone for analyzing your data and can explore the specific algorithms and techniques that would suit you best.
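
One specific technique worth exploring early is clustering, the unsupervised case described above. The sketch below is a minimal one-dimensional k-means (Lloyd's algorithm) that groups unlabeled values around two centers. The spending figures, the starting centers, and the name kmeans_1d are assumptions made for illustration, and the fixed number of iterations is a simplification of a proper convergence check.

```python
def kmeans_1d(points, centers, iterations=10):
    """Group 1-D points around the given starting centers.

    Each round assigns every point to its nearest center and then moves
    each center to the mean of the points assigned to it.
    """
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Unlabeled data: monthly spend per customer (made-up values)
spend = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]

centers, clusters = kmeans_1d(spend, centers=[0.0, 5.0])
print(centers)   # [1.5, 10.0]
print(clusters)  # [[1.0, 1.5, 2.0], [9.0, 10.0, 11.0]]
```

No labels were given, yet the algorithm separates low spenders from high spenders. This is the kind of structure unsupervised learning is meant to find.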

Resources

As with programming, there are numerous books and courses in Machine Learning. Here are just a few of them:

The Deep Learning textbook by Ian Goodfellow, Yoshua Bengio, and Aaron Courville is a classic resource recommended for all students who want to master machine and deep learning.

The Machine Learning course by Andrew Ng is an absolute classic that leads you through the most popular algorithms in ML.

Machine Learning A-Z™: Hands-On Python & R In Data Science - a Udemy course specifically for novice data scientists that introduces basic ML concepts both in R and Python.

What skills should a data scientist possess?

Now you know the main prerequisites for Data Science. Does it make you a good data scientist? While there is no correct answer, there are several things to take into consideration:

Analytical Mindset: it is a general requirement for any person working with data. However, while common sense might suffice at the entry level, your analytical thinking should be further backed up by a statistical background and knowledge of data structures and machine learning algorithms.

Focus on Problem Solving: when you master a new technology, it is tempting to use it everywhere. However, while it is important to know recent trends and tools, the goal of Data Science is to solve specific problems by extracting knowledge from data. A good data scientist first understands the problem, then defines the requirements for the solution, and only then decides which tools and techniques are the best fit for the task. Don’t forget that stakeholders will never be captivated by the impressive tools you use, only by the effectiveness of your solution.

Domain Knowledge: data scientists need to understand the business problem and choose the appropriate model for the problem. They should be able to interpret the results of their models and iterate quickly to arrive at the final model. They need to have an eye for detail.

Communication Skills: there’s a lot of communication involved in understanding the problem and delivering constant feedback in simple language to the stakeholders. But this only scratches the surface of the importance of communication: a much more important element is asking the right questions. Besides, data scientists should be able to clearly document their approach so that it is easy for someone else to build on that work and, vice versa, to understand research work published in their area.

As you can see, it is the combination of various technical and soft skills that makes up a good data scientist.

Original. Reposted with permission.

Bio: SciForce is a Ukraine-based IT company specializing in the development of software solutions based on science-driven information technologies. We have wide-ranging expertise in many key AI technologies, including Data Mining, Digital Signal Processing, Natural Language Processing, Machine Learning, Image Processing and Computer Vision.


Comments

S Charlesworth, 8 months ago:

Your comment about R: 'However, it is difficult to learn, especially if you already mastered a programming language.' Um, no. Citation: currently working with somebody that picked it up in a flash. Likewise though I came from working w/ Java, Python, etc, I didn't have much trouble picking up R. Don't spread FUD. It's irresponsible.

Emilia Jazz, 7 months ago:

To become a specialist you need:

1. Machine learning – Retrieving accuracy in all theoretical data

2. Signal processing – Improving and analyzing the digital signals

3. Data mining – Finding data which can be used to create predictable solutions

vilkoos x, 8 months ago (edited):

A very nice intro to DS, kudos to you.

Some suggestions.

My starting point would be: DS is about doing useful things with data.
IMHO it is not that interesting to distinguish between data-craft and data-science (labeling DS as science is just sales talk).

Your starting point is: DS is about doing complicated things with data.
So data science is not about storing and retrieving facts in databases or doing business intelligence. We are doing that for 50 years now, we know how to do that, that is not complicated enough to be a "science". (corollary: according to you one needs a BSc in high energy physics, an MSc in econometrics or a PhD in CS to gain entry to DS)

Tip: include the simple/basic things in your definition of DS (and add useful)

THESIS FOR REFLECTION

The ancient Babylonian scribes were the first data-scientist (so DS is about 3000 years old).
They invented techniques to record and store facts (i.e. invented writing).
They invented ways to produce derived facts from elementary facts (i.e. invented arithmetic)

Erika, 8 months ago:

I agree that saying R is difficult to learn is inaccurate. I would also appreciate a definition of what the acronym PCA stands for (mentioned in the "Statistics and probability" section.) The intended readers for this article are beginners and it may not be fair to assume they are going to know what PCA is.

vilkoos x, 8 months ago:

You say: beginners should learn ("R") or ("Python") or ("SQL").

Would it not be better to say: beginners should learn ("R" or "Python") and ("SQL")?

R and Python are basically the same (mainly procedural and algorithm orientated).
SQL is really different (mainly declarative and data oriented).

PS Those who want alternatives for SQL could look at MongoDB or good-old Prolog.

Gregory Piatetsky, replying to vilkoos x, 8 months ago:

I would recommend beginners with programming background to learn Python and SQL. R is a very powerful platform, but harder for beginners.
