
Data Science


Introduction to Data Science with Python

---------------------------------------------------------------------------------------------------------------------------------------

As the world entered the era of big data over the last few decades, the need for better and more efficient data storage became a significant challenge. The main focus of businesses using big data was on building frameworks that could store large amounts of data. Frameworks like Hadoop were then created, which helped in storing massive amounts of data.

With the problem of storage solved, the focus then shifted to processing the data that is stored. This is
where data science came in as the future for processing and analyzing data. Now, data science has
become an integral part of all the businesses that deal with large amounts of data. Companies today
hire data scientists and professionals who take the data and turn it into a meaningful resource.

Let’s now dig deep into data science and how data science with Python is beneficial.

What is Data Science?

Let us begin our learning of Data Science with Python by first understanding what data science is. Data science is all about finding and exploring data in the real world and using that knowledge to solve business problems. Some examples of data science are:

Customer Prediction - A system can be trained on customer behavior patterns to predict the likelihood of a customer buying a product

Service Planning - Restaurants can predict how many customers will visit on the weekend and plan their food inventory to handle the demand

Now that you know what data science is, and before we get deeper into the topic of Data Science with Python, let's talk about Python.

Why Python?

When it comes to data science, we need some sort of programming language or tool, like Python.
Although there are other tools for data science, like R and SAS, we will focus on Python and how it is
beneficial for data science in this article.

Python as a programming language has become very popular in recent times. It has been used in data
science, IoT, AI, and other technologies, which has added to its popularity.

Python is used as a programming language for data science because it provides powerful tools from a mathematical and statistical perspective. That is one of the significant reasons why data scientists around the world use Python. If you track the trends over the past few years, you will notice that Python has become the programming language of choice, particularly for data science.
There are several other reasons why Python is one of the most used programming languages for data
science, including:

Speed - Python is relatively fast for development and prototyping compared with many other languages and tools used for analysis

Availability - There are a significant number of packages available that other users have developed,
which can be reused

Design goal - The syntax rules in Python are intuitive and easy to understand, thereby helping in building applications with a readable codebase

Python Libraries for Data Analysis

---------------------------------------------------------

Python is a simple programming language to learn, and there are some basic things you can do with it out of the box, like arithmetic and printing statements. However, if you want to perform data analysis, you need to import specific libraries. Some examples include:

 Pandas - Used for structured data operations
 NumPy - A powerful library that helps you create n-dimensional arrays
 SciPy - Provides scientific capabilities, like linear algebra and Fourier transforms
 Matplotlib - Primarily used for visualization purposes
 Scikit-learn - Used to perform all machine learning activities

In addition to these, there are other libraries as well, like:

 NetworkX & igraph
 TensorFlow
 BeautifulSoup
 os
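Assuming these packages are already installed (for example, via pip), here is a minimal, hedged sketch of how the core libraries are typically imported; the aliases np, pd, and plt are common community conventions, not requirements:

# Common imports for a data analysis script (aliases are conventions, not requirements)
import numpy as np               # n-dimensional arrays and numerical routines
import pandas as pd              # structured (tabular) data operations
import matplotlib.pyplot as plt  # visualization
from scipy import stats          # scientific routines, e.g., statistics
from sklearn.linear_model import LinearRegression  # machine learning estimators

# Quick check that the stack works together: describe a tiny table
data = pd.DataFrame({"x": np.arange(5), "y": np.arange(5) * 2.0})
print(data.describe())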

Let’s now take a look at some of the most important Python libraries in detail:

SciPy: As the name suggests, it is a scientific library that includes some special functions:

It currently supports special functions, integration, ordinary differential equation (ODE) solvers, gradient
optimization, and others

It has fully-featured versions of the linear algebra modules

It is built on top of NumPy
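As a small, hedged illustration of the integration and linear algebra capabilities listed above (the integrand and the matrix below are arbitrary examples, not from the original article):

import numpy as np
from scipy import integrate, linalg

# Numerically integrate f(x) = x**2 from 0 to 3 (the exact answer is 9)
area, error_estimate = integrate.quad(lambda x: x**2, 0, 3)
print(area)

# Solve the linear system A @ x = b with SciPy's linear algebra module
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)
print(x)  # expected: [2. 3.]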


NumPy: NumPy is the fundamental package for scientific computing with Python. It contains:

Powerful N-dimensional array objects

Tools for integrating C/C++ and Fortran code

Useful linear algebra, Fourier transform, and random number capabilities
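A brief sketch of these capabilities; the array values and the random seed are arbitrary choices for illustration:

import numpy as np

# Create an N-dimensional array object
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
print(matrix.shape)          # (2, 2)

# Linear algebra: matrix inverse
print(np.linalg.inv(matrix))

# Fourier transform of a short signal
signal = np.array([0.0, 1.0, 0.0, -1.0])
print(np.fft.fft(signal))

# Random number capabilities
rng = np.random.default_rng(seed=42)
print(rng.normal(loc=0.0, scale=1.0, size=3))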

Pandas: Pandas is used for structured data operations and manipulations.

The most useful data analysis library in Python

Instrumental in increasing the use of Python in the data science community

Used extensively for data munging and preparation
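A short, hedged sketch of the kind of structured data operations Pandas supports; the column names and values below are made up for illustration:

import pandas as pd

# Build a small DataFrame (structured, tabular data)
orders = pd.DataFrame({
    "customer": ["alice", "bob", "alice", "carol"],
    "amount":   [20.0, 35.5, 12.25, 40.0],
})

# Typical data preparation / munging steps
orders["large_order"] = orders["amount"] > 30          # derive a new column
summary = orders.groupby("customer")["amount"].sum()   # aggregate per customer
print(summary)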

Next, in our learning of Data Science with Python, let us look at the expertise a data scientist needs to put these libraries to work.

The Pillars of Data Science Expertise

--------------------------------------------------------------

While data scientists often come from many different educational and work experience backgrounds, most should be strong in, or ideally be experts in, four fundamental areas. In no particular order of priority or importance, these are:

1) Business/Domain

2) Mathematics (includes statistics and probability)

3) Computer science (e.g., software/data architecture and engineering)

4) Communication (both written and verbal)

There are other skills and expertise that are highly desirable as well, but these are the primary four in
my opinion. These will be referred to as the data scientist pillars for the rest of this article.

In reality, people are often strong in one or two of these pillars, but usually not equally strong in all four.
If you do happen to meet a data scientist that is truly an expert in all, then you’ve essentially found
yourself a unicorn.
Based on these pillars, my data scientist definition is a person who should be able to leverage existing
data sources, and create new ones as needed in order to extract meaningful information and actionable
insights. A data scientist does this through business domain expertise, effective communication and
results interpretation, and utilization of any and all relevant statistical techniques, programming
languages, software packages and libraries, and data infrastructure. The insights that data scientists
uncover should be used to drive business decisions and take actions intended to achieve business goals.

Data Science Venn Diagrams

One can find many different versions of the data scientist Venn diagram to help visualize these pillars (or
variations) and their relationships with one another. David Taylor wrote an excellent article on these
Venn diagrams entitled, Battle of the Data Science Venn Diagrams. I highly recommend reading it.

Here is one of my favorite data scientist Venn diagrams created by Stephan Kolassa. You’ll notice that
the primary ellipses in the diagram are very similar to the pillars given above.

Data Science Goals and Deliverables

-------------------------------------------------------------------

In order to understand the importance of these pillars, one must first understand the typical goals and
deliverables associated with data science initiatives, and also the data science process itself. Let’s first
discuss some common data science goals and deliverables.

Here is a short list of common data science deliverables (a brief classification sketch follows the list):


 Prediction (predict a value based on inputs)
 Classification (e.g., spam or not spam)
 Recommendations (e.g., Amazon and Netflix recommendations)
 Pattern detection and grouping (e.g., classification without known classes)
 Anomaly detection (e.g., fraud detection)
 Recognition (image, text, audio, video, facial, …)
 Actionable insights (via dashboards, reports, visualizations, …)
 Automated processes and decision-making (e.g., credit card approval)
 Scoring and ranking
 Segmentation (e.g., demographic-based marketing)
 Optimization (e.g., risk management)
 Forecasts (e.g., sales and revenue)
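To make one of these deliverables concrete, here is a minimal, hedged classification sketch using scikit-learn; the synthetic data and the choice of logistic regression are illustrative assumptions, not a prescribed approach:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic two-class data standing in for, e.g., spam vs. not spam
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a simple classifier and check how often it predicts the correct class
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
predictions = model.predict(X_test)
print(accuracy_score(y_test, predictions))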

The Data Science Process

-------------------------------------------------------------

Below is a diagram of the GABDO Process Model that I created and introduce in my book, AI for People
and Business. Data scientists usually follow a process similar to this, especially when creating models
using machine learning and related techniques.

The GABDO Process Model consists of five iterative phases (goals, acquire, build, deliver, optimize), hence the acronym GABDO. Each phase is iterative because any phase can loop back to one or more of the phases before it. Feel free to check out the book if you'd like to learn more about the process and its details.

One important thing to discuss is off-the-shelf data science platforms and APIs. One may be tempted to think that these can be used relatively easily, without significant expertise in the underlying fields, and therefore without a strong, well-rounded data scientist.
It’s true that many of these off-the-shelf products can be used relatively easily, and one can probably
obtain pretty decent results depending on the problem being solved, but there are many aspects of data
science where experience and chops are critically important.

Some of these include having the ability to:

Customize the approach and solution to the specific problem at hand in order to maximize results,
including the ability to write new algorithms and/or significantly modify the existing ones, as needed

Access and query many different databases and data sources (RDBMS, NoSQL, NewSQL), as well as
integrate the data into an analytics-driven data source (e.g., OLAP, warehouse, data lake, …)

Find and choose the optimal data sources and data features (variables), including creating new ones as needed through feature engineering (a brief sketch appears after this list)

Understand all statistical, programming, and library/package options available, and select the best ones for the problem at hand

Ensure data has high integrity (good data), quality (the right data), and is in optimal form and condition
to guarantee accurate, reliable, and statistically significant results

Avoid the issues associated with garbage in, garbage out

Select and implement the best tooling, algorithms, frameworks, languages, and technologies to
maximize results and scale as needed

Choose the correct performance metrics and apply the appropriate techniques in order to maximize
performance

Discover ways to leverage the data to achieve business goals without guidance and/or deliverables
being dictated from the top down, i.e., the data scientist as the idea person

Work cross-functionally, effectively, and in collaboration with all company departments and groups

Distinguish good from bad results, and thus mitigate the potential risks and financial losses that can
come from erroneous conclusions and subsequent decisions

Understand product (or service) customers and/or users, and create ideas and solutions with them in
mind.
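As a hedged illustration of the feature engineering mentioned in this list, here is a small Pandas sketch; the column names and derived features are illustrative assumptions, not the article's prescription:

import pandas as pd

# Hypothetical customer data; the columns are illustrative assumptions
customers = pd.DataFrame({
    "signup_date":   pd.to_datetime(["2021-01-05", "2021-03-20", "2021-06-01"]),
    "last_purchase": pd.to_datetime(["2021-02-10", "2021-07-15", "2021-06-20"]),
    "total_spend":   [120.0, 450.0, 60.0],
    "num_orders":    [3, 9, 1],
})

# Engineer new features (variables) from the raw columns
customers["days_active"] = (customers["last_purchase"]
                            - customers["signup_date"]).dt.days
customers["avg_order_value"] = customers["total_spend"] / customers["num_orders"]
print(customers[["days_active", "avg_order_value"]])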

Data Analyst / Data Scientist

----------------------------------------------------

Data analysts share many of the same skills and responsibilities as data scientists, and sometimes have a similar educational background as well. Some of these shared skills, a few of which are sketched below, include the ability to:

 Access and query (e.g., SQL) different data sources
 Process and clean data
 Summarize data
 Understand and use some statistics and mathematical techniques
 Prepare data visualizations and reports
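A minimal, hedged sketch of a few of these shared tasks, using Python's standard sqlite3 module together with Pandas; the table name and columns are illustrative assumptions:

import sqlite3
import pandas as pd

# A small in-memory database standing in for a real data source
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 100.0), ("south", 250.0), ("north", 75.0)])

# Access and query the data source with SQL, then clean and summarize it
sales = pd.read_sql_query("SELECT region, amount FROM sales", conn)
sales = sales.dropna()                          # basic cleaning
print(sales.groupby("region")["amount"].sum())  # summary per region
conn.close()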

Web scraping:

Web scraping is the process of collecting structured web data in an automated fashion. It’s also called
web data extraction. Some of the main use cases of web scraping include price monitoring, price
intelligence, news monitoring, lead generation and market research among many others.

In general, web data extraction is used by people and businesses who want to make use of the vast
amount of publicly available web data to make smarter decisions.

If you’ve ever copied and pasted information from a website, you’ve performed the same function as any web scraper, only on a microscopic, manual scale. Unlike the mundane, mind-numbing process of manually extracting data, web scraping uses intelligent automation to retrieve hundreds, millions, or even billions of data points from the internet’s seemingly endless frontier.

The basics of web scraping:

It’s extremely simple, in truth, and works by way of two parts: a web crawler and a web scraper. The
web crawler is the horse, and the scraper is the chariot. The crawler leads the scraper, as if by the hand,
through the internet, where it extracts the data requested.

The crawler: A web crawler, which we generally call a “spider,” is an automated program that browses the internet to index and search for content by following links and exploring, like a person with too much time on their hands. In many projects, you first “crawl” the web or one specific website to discover URLs, which you then pass on to your scraper.
The scraper: A web scraper is a specialized tool designed to accurately and quickly extract data from a web page. Web scrapers vary widely in design and complexity, depending on the project. An important part of every scraper is the data locators (or selectors) that are used to find the data you want to extract from the HTML file - usually XPath, CSS selectors, regular expressions, or a combination of them is applied.
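As a hedged illustration of such data locators, here is a small sketch using BeautifulSoup with CSS selectors; the HTML fragment and class names are made up for the example:

from bs4 import BeautifulSoup

# A made-up HTML fragment standing in for a downloaded page
html = """
<div class="product">
  <h2 class="title">Example Widget</h2>
  <span class="price">$19.99</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selectors act as data locators pointing at the values we want
title = soup.select_one("div.product h2.title").get_text(strip=True)
price = soup.select_one("div.product span.price").get_text(strip=True)
print(title, price)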

The web scraping process

 Identify target website
 Collect URLs of the pages where you want to extract data from
 Make a request to these URLs to get the HTML of the page
 Use locators to find the data in the HTML
 Save the data in a JSON or CSV file or some other structured format
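Putting these steps together, here is a minimal, hedged sketch using the requests and BeautifulSoup libraries; the URLs, the CSS selector, and the field names are placeholders, and any real project should respect the target site's terms of service and robots.txt:

import csv
import requests
from bs4 import BeautifulSoup

# Placeholder URLs of the pages we want to extract data from
urls = ["https://example.com/page1", "https://example.com/page2"]

rows = []
for url in urls:
    response = requests.get(url, timeout=10)   # request the page's HTML
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Use locators (CSS selectors here) to find the data in the HTML;
    # the selector and field are assumptions about the page structure
    for item in soup.select("h2.title"):
        rows.append({"url": url, "title": item.get_text(strip=True)})

# Save the data in a structured format (CSV)
with open("scraped_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "title"])
    writer.writeheader()
    writer.writerows(rows)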

What is web scraping used for?

Price intelligence

In our experience, price intelligence is the biggest use case for web scraping. Extracting product and pricing information from e-commerce websites and turning it into intelligence is an important part of how modern e-commerce companies make better pricing and marketing decisions based on data.

How web pricing data and price intelligence can be useful:

 Dynamic Pricing
 Revenue Optimization
 Competitor Monitoring
 Product Trend Monitoring
 Brand and MAP Compliance

Market research

Market research is critical – and should be driven by the most accurate information available. High
quality, high volume, and highly insightful web scraped data of every shape and size is fueling market
analysis and business intelligence across the globe.

 Market Trend Analysis
 Market Pricing
 Optimizing Point of Entry
 Research & Development
 Competitor Monitoring

Real Estate
The digital transformation of real estate in the past twenty years threatens to disrupt traditional firms
and create powerful new players in the industry. By incorporating web scraped product data into
everyday business, agents and brokerages can protect against top-down online competition and make
informed decisions within the market.

 Appraising Property Value
 Monitoring Vacancy Rates
 Estimating Rental Yields
 Understanding Market Direction

News & Content Monitoring

Modern media can create outstanding value or an existential threat to your business - in a single news
cycle. If you’re a company that depends on timely news analyses, or a company that frequently appears
in the news, web scraping news data is the ultimate solution for monitoring, aggregating and parsing the
most critical stories from your industry.

 Investment Decision Making
 Online Public Sentiment Analysis
 Competitor Monitoring
 Political Campaigns
 Sentiment Analysis
