
DATA SCIENCE

WHAT IS DATA SCIENCE?

Data science is a multidisciplinary blend of data inference, algorithm development, and technology in
order to solve analytically complex problems.

At the core is data: troves of raw information streaming in and stored in enterprise data warehouses. There is much to learn by mining it, and advanced capabilities we can build with it. Data science is ultimately about using this data in creative ways to generate business value.

HOW DO DATA SCIENTISTS MINE OUT INSIGHTS?

A data scientist should be able to leverage existing data sources, and create new ones as needed, in order to extract meaningful information and actionable insights. These insights can be used to
drive business decisions and changes intended to achieve business goals.

It starts with data exploration. When given a challenging question, data scientists become detectives.
They investigate leads and try to understand patterns and characteristics within the data. This requires a
big dose of analytical creativity.

Then, as needed, data scientists may apply quantitative techniques in order to get a level deeper – e.g.
inferential models, segmentation analysis, time series forecasting, synthetic control experiments, etc.
The intent is to scientifically piece together a forensic view of what the data is really saying.

DATA SCIENCE – DEVELOPMENT OF DATA PRODUCT

A "data product" is a technical asset that: (1) utilizes data as input, and (2) processes that data to return
algorithmically-generated results. The classic example of a data product is a recommendation engine,
which ingests user data, and makes personalized recommendations based on that data. Here are some
examples of data products:

 Amazon's recommendation engines suggest items for you to buy, determined by their
algorithms. Netflix recommends movies to you. Spotify recommends music to you.
 Gmail's spam filter is a data product – an algorithm behind the scenes processes incoming mail
and determines if a message is junk or not.
 Computer vision used for self-driving cars is also a data product – machine learning algorithms are
able to recognize traffic lights, other cars on the road, pedestrians, etc.

WHAT IS DATA SCIENCE – THE REQUISITE SKILL SET

Data science is a blend of skills in three major areas:


1. Mathematics Expertise
At the heart of mining data insights and building data products is the ability to view the data through a
quantitative lens. There are textures, dimensions, and correlations in data that can be expressed
mathematically.

Solutions to many business problems involve building analytic models grounded in the hard math, where
being able to understand the underlying mechanics of those models is key to success in building them.

Also, a common misconception is that data science is all about statistics. While statistics is important, it is not the
only type of math utilized. Overall, it is helpful for data scientists to have breadth and depth in their
knowledge of mathematics.

2. Technology and Hacking


First, let's clarify that we are not talking about hacking as in breaking into computers. We're referring
to the tech programmer subculture meaning of hacking – i.e., creativity and ingenuity in using technical
skills to build things and find clever solutions to problems.

Why is hacking ability important? Because data scientists utilize technology to wrangle enormous data sets and work with complex algorithms, and this requires tools far more sophisticated than
Excel. Data scientists need to be able to code — prototype quick solutions, as well as integrate with
complex data systems. Core languages associated with data science include SQL, Python, R, and SAS. On
the periphery are Java, Scala, Julia, and others. But it is not just knowing language fundamentals. A
hacker is a technical ninja, able to creatively navigate their way through technical challenges in order to
make their code work.

Along these lines, a data science hacker is a solid algorithmic thinker, having the ability to break down
messy problems and recompose them in ways that are solvable. This is critical because data scientists
operate within a lot of algorithmic complexity. They need to have a strong mental comprehension of
high-dimensional data and tricky data control flows. Full clarity on how all the pieces come together to
form a cohesive solution.

3. Strong Business Acumen


It is important for a data scientist to be a tactical business consultant. Working so closely with data,
data scientists are positioned to learn from data in ways no one else can. That creates the responsibility
to translate observations to shared knowledge, and contribute to strategy on how to solve core business
problems. This means a core competency of data science is using data to cogently tell a story. No data-puking – rather, present a cohesive narrative of problem and solution, using data insights as supporting pillars that lead to guidance.

SKILLS NEEDED TO BE A DATA SCIENTIST

PROGRAMMING

No matter what type of company or role you’re interviewing for, you’re likely going to be expected to
know how to use the tools of the trade. This means a statistical programming language, like R or
Python, and a database querying language like SQL.
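
For instance, a minimal sketch of the two working together might look like this. The database file and the customers table here are hypothetical, purely for illustration:

# A sketch: running a SQL query from Python via the standard-library sqlite3
# module. "example.db" and the "customers" table are hypothetical.
import sqlite3

conn = sqlite3.connect("example.db")
cursor = conn.execute(
    """
    SELECT region, COUNT(*) AS n_customers
    FROM customers
    GROUP BY region
    ORDER BY n_customers DESC
    """
)
for region, n_customers in cursor:  # iterate over result rows
    print(region, n_customers)
conn.close()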

DATA VISUALIZATION

Visualizing and communicating data is incredibly important, especially with young companies that are
making data-driven decisions for the first time, or companies where data scientists are viewed as
people who help others make data-driven decisions. When it comes to communicating, this means
describing your findings, or the way techniques work, to audiences both technical and non-technical.
Visualization-wise, it can be immensely helpful to be familiar with data visualization tools like
matplotlib, ggplot, or d3.js. Tableau has become a popular data visualization and dashboarding tool
as well. It is important to not just be familiar with the tools necessary to visualize data, but also the
principles behind visually encoding data and communicating information.
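
As a small illustration, here is a minimal matplotlib sketch of a labeled bar chart. The numbers are made up:

# A minimal matplotlib sketch: a labeled bar chart of made-up monthly sales.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 98, 150]  # illustrative numbers only

fig, ax = plt.subplots()
ax.bar(months, sales, color="steelblue")
ax.set_xlabel("Month")          # always label axes and units
ax.set_ylabel("Sales (units)")
ax.set_title("Monthly Sales")
plt.show()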

MACHINE LEARNING

If you’re at a large company with huge amounts of data, or working at a company where the product
itself is especially data-driven (e.g. Netflix, Google Maps, Uber), it may be the case that you’ll want to
be familiar with machine learning methods. This can mean things like k-nearest neighbors, random
forests, ensemble methods, and more. It’s true that a lot of these techniques can be implemented
using R or Python libraries—because of this, it's not necessary to become an expert on how the
algorithms work. More important is to understand the broad strokes and really understand when it is
appropriate to use different techniques.
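
To make that concrete, here is a short sketch of fitting two of the techniques mentioned above with scikit-learn in Python, using its bundled iris dataset for illustration:

# A sketch: k-nearest neighbors and a random forest on the iris dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (KNeighborsClassifier(n_neighbors=5),
              RandomForestClassifier(n_estimators=100, random_state=0)):
    model.fit(X_train, y_train)                       # learn from training data
    print(type(model).__name__, model.score(X_test, y_test))  # held-out accuracy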

WORKING WITH DATA

Companies want to see that you’re a data-driven problem-solver. At some point during the interview
process, you’ll probably be asked about some high level problem—for example, about a test the
company may want to run, or a data-driven product it may want to develop. It’s important to think
about what things are important, and what things aren’t. How should you, as the data scientist,
interact with the engineers and product managers? What methods should you use? When do
approximations make sense?

STATISTICAL MODELLING

A good understanding of statistics is vital as a data scientist. You should be familiar with statistical
tests, distributions, maximum likelihood estimators, etc. This will also be the case for machine
learning, but one of the more important aspects of your statistics knowledge will be understanding
when different techniques are (or aren’t) a valid approach. Statistics is important at all company
types, but especially data-driven companies where stakeholders will depend on your help to make
decisions and design / evaluate experiments. Understanding these concepts is most important at
companies where the product is defined by the data, and small improvements in predictive
performance or algorithm optimization can lead to huge wins for the company. In an interview for a
data science role, you may be asked to derive some of the machine learning or statistics results you
employ elsewhere. Or, your interviewer may ask you some basic multivariable calculus or linear
algebra questions, since they form the basis of a lot of these techniques. You may wonder why a data
scientist would need to understand this when there are so many out of the box implementations in
Python or R. The answer is that at a certain point, it can become worth it for a data science team to
build out their own implementations in house.
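
As one small example of that statistics toolkit in action, here is a sketch of a two-sample t-test with scipy, on simulated A/B-experiment data (the numbers are made up):

# A sketch: two-sample t-test comparing simulated control and variant groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=200)  # simulated control
group_b = rng.normal(loc=10.5, scale=2.0, size=200)  # simulated variant

t_stat, p_value = stats.ttest_ind(group_a, group_b)  # independent-samples test
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")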

HANDLING BIG DATA

Often, the data you’re analyzing is going to be messy and difficult to work with. Because of this, it’s
really important to know how to deal with imperfections in data. Some examples of data
imperfections include missing values, inconsistent string formatting (e.g., ‘New York’ versus ‘new york’
versus ‘ny’), and date formatting (‘2017-01-01’ vs. ‘01/01/2017’, UNIX time vs. timestamps, etc.). This
will be most important at small companies where you’re an early data hire, or data-driven companies
where the product is not data-related (particularly because the latter has often grown quickly with not
much attention to data cleanliness), but this skill is important for everyone to have.
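
As an illustration, here is a short pandas sketch that cleans exactly the kinds of imperfections described above. The data is made up:

# A sketch: normalizing inconsistent strings and mixed date formats with pandas.
import pandas as pd

df = pd.DataFrame({
    "city": ["New York", "new york", "ny", "Boston"],
    "signup": ["2017-01-01", "01/02/2017", "2017-01-03", "01/04/2017"],
})

# Normalize string variants to one canonical form.
df["city"] = df["city"].str.strip().str.lower().replace({"ny": "new york"})

# Parse each date individually, since the formats are mixed.
df["signup"] = df["signup"].apply(pd.to_datetime)

print(df)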

TOOLS AND PROGRAMMING LANGUAGES

1. TABLEAU

Tableau is a Business Intelligence tool used for data visualization. With Tableau, you can gain insights just by visualizing the data you already have and use them for the development of your business.

Now, let’s understand it in detail.

What is Tableau?

Well, Tableau is an interactive data visualization tool that enables you to create interactive and apt visualizations in the form of dashboards and worksheets to gain business insights for the better development of
your company. It allows non-technical users to easily create customized dashboards that provide insight
to a broad spectrum of information.

Once you start using Tableau, you will realize soon enough what an eye-opener it can be. Things which you might not even have thought about could actually be the solution to your business problems.

2. QLIKVIEW

QlikView is a leading Business Discovery Platform. It is unique in many ways compared to traditional BI platforms. As a data analysis tool, it always maintains the relationships between data, and these relationships can be seen visually using colors. It also shows the data that are not related. It provides both direct and indirect searches by using individual searches in the list boxes.

QlikView's core and patented technology has the feature of in-memory data processing, which gives superfast results to users. It calculates aggregations on the fly and compresses data to 10% of its original size. Neither users nor developers of QlikView applications need to manage the relationships between data; they are managed automatically.

Features of QlikView

QlikView has patented technology, which enables it to have many features that are useful in creating
advanced reports from multiple data sources quickly. Following is a list of features that make QlikView
unique.

 Data Association is maintained automatically − QlikView automatically recognizes the relationship between each piece of data that is present in a dataset. Users need not preconfigure the relationships between different data entities.

 Data is held in memory for multiple users, for a super-fast user experience − The structure, data and
calculations of a report are all held in the memory (RAM) of the server.

 Aggregations are calculated on the fly as needed − As the data is held in memory, calculations are done
on the fly. No need of storing pre-calculated aggregate values.

 Data is compressed to 10% of its original size − QlikView heavily uses a data dictionary. Only the essential bits of data are required in memory for any analysis. Hence, it compresses the original data to a very small size.

 Visual relationship using colors − The relationship between data is shown not by arrows or lines but by colors. Selecting a piece of data gives specific colors to the related data and another color to unrelated data.

 Direct and Indirect searches − Instead of entering the exact value they are looking for, users can input some related data and get the exact result because of the data association. Of course, they can also search for a value directly.

3. SAS VISUAL ANALYTICS

SAS Visual Analytics is data visualization software that helps you build and design interactive web dashboards. All you need to do is import a file and start preparing a dashboard with the drag-and-drop feature. You don't need to know the SAS programming language before using SAS Visual Analytics.

4. MICROSTRATEGY

MicroStrategy is a Business Intelligence software, which offers a wide range of data analytics
capabilities. As a suite of applications, it offers Data Discovery, Advanced Analytics, Data Visualizations,
Embedded BI, and Banded Reports and Statements. It can connect to data warehouses, relational
systems, flat files, web services and a host of other types of sources to pull data for analysis. Features
such as highly formatted reports, ad hoc queries, thresholds and alerts, and automated report distribution
make MicroStrategy an industry leader in the BI software space.

5. R LANGUAGE

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide
variety of UNIX platforms, Windows and MacOS.

6. PYTHON

Python is a general-purpose interpreted, interactive, object-oriented, high-level programming language.

7. SAS

SAS (previously "Statistical Analysis System") is a software suite developed by SAS Institute for
advanced analytics, multivariate analyses, business intelligence, data management, and predictive
analytics.

SAS is a software suite that can mine, alter, manage and retrieve data from a variety of sources and perform statistical analysis on it. SAS provides a graphical point-and-click user interface for non-technical users and more advanced options through the SAS language.

8. MATLAB

MATLAB (matrix laboratory) is a multi-paradigm numerical computing environment. A proprietary programming language developed by MathWorks, MATLAB allows matrix manipulations, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other languages, including C, C++, C#, Java, Fortran and Python.

9. SPSS
SPSS Statistics is a software package used for batched and non-batched (interactive) statistical analysis. The software name originally stood for Statistical Package for the Social Sciences (SPSS).

DATA SCIENCE PROCESS

Step 1: Frame the problem

The first thing you have to do before you solve a problem is to define exactly what it is. You need to be
able to translate data questions into something actionable.

You’ll often get ambiguous inputs from the people who have problems. You’ll have to develop the
intuition to turn scarce inputs into actionable outputs–and to ask the questions that nobody else is
asking.

Say you’re solving a problem for the VP Sales of your company. You should start by understanding their
goals and the underlying why behind their data questions. Before you can start thinking of solutions,
you’ll want to work with them to clearly define the problem.

A great way to do this is to ask the right questions.

You should then figure out what the sales process looks like, and who the customers are. You need as
much context as possible for your numbers to become insights.

You should ask questions like the following:

1. Who are the customers?
2. Why are they buying our product?
3. How do we predict if a customer is going to buy our product?
4. What is different between segments that are performing well and those performing below expectations?
5. How much money will we lose if we don't actively sell the product to these groups?

In response to your questions, the VP Sales might reveal that they want to understand why certain
segments of customers have bought less than expected. Their end goal might be to determine whether
to continue to invest in these segments, or de-prioritize them. You’ll want to tailor your analysis to that
problem, and unearth insights that can support either conclusion.

It’s important that at the end of this stage, you have all of the information and context you need to solve
this problem.

Step 2: Collect the raw data needed for your problem


Once you’ve defined the problem, you’ll need the data that can give you the insights to solve it. This part of the process involves thinking through what data you’ll need and finding ways to get that data, whether it’s querying internal databases or purchasing external datasets.

You might find out that your company stores all of its sales data in a CRM (customer relationship management) platform. You can export the CRM data as a CSV file for further analysis.
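
A minimal sketch of loading such an export with pandas and taking a first look might be (the filename and the columns it contains are hypothetical):

# A sketch: load a hypothetical CRM export and get a first overview.
import pandas as pd

sales = pd.read_csv("crm_export.csv")  # hypothetical exported file
print(sales.shape)       # number of rows and columns
print(sales.head())      # first few records
print(sales.describe())  # summary statistics for numeric columns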

Step 3: Process the data for analysis

Now that you have all of the raw data, you’ll need to process it before you can do any analysis.
Oftentimes, data can be quite messy, especially if it hasn’t been well-maintained. You’ll see errors that
will corrupt your analysis: values set to null though they really are zero, duplicate values, and missing
values. It’s up to you to go through and check your data to make sure you’ll get accurate insights.

You’ll want to check for the following common errors:

1. Missing values, perhaps customers without an initial contact date
2. Corrupted values, such as invalid entries
3. Time zone differences, perhaps your database doesn’t take into account the different time zones of your users
4. Date range errors, perhaps you’ll have dates that make no sense, such as data registered from before sales started

You’ll need to look through aggregates of your file rows and columns and sample some test values to
see if your values make sense. If you detect something that doesn’t make sense, you’ll need to remove
that data or replace it with a default value. You’ll need to use your intuition here: if a customer doesn’t
have an initial contact date, does it make sense to say that there was NO initial contact date? Or do you
have to hunt down the VP Sales and ask if anybody has data on the customer’s missing initial contact
dates?

Once you’re done working with those questions and cleaning your data, you’ll be ready for exploratory
data analysis (EDA).
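
Here is a small sketch of some of these checks in pandas, on a tiny made-up CRM extract. All column names and the launch date are hypothetical:

# A sketch: missing values, duplicates, and a date-range sanity check.
import pandas as pd

sales = pd.DataFrame({
    "customer": ["A", "B", "B", "C"],
    "first_contact": ["2016-05-01", None, None, "2014-12-20"],
    "sale_date": ["2016-06-01", "2016-07-15", "2016-07-15", "2014-12-25"],
})
sales["first_contact"] = pd.to_datetime(sales["first_contact"])
sales["sale_date"] = pd.to_datetime(sales["sale_date"])

print(sales.isna().sum())        # missing values per column
print(sales.duplicated().sum())  # fully duplicated rows

# Flag sales recorded before sales started (hypothetical launch date).
sales_start = pd.Timestamp("2015-01-01")
print(sales[sales["sale_date"] < sales_start])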

Step 4: Explore the data

When your data is clean, you should start playing with it!

The difficulty here isn’t coming up with ideas to test, it’s coming up with ideas that are likely to turn into
insights. You’ll have a fixed deadline for your data science project (your VP Sales is probably waiting on
your analysis eagerly!), so you’ll have to prioritize your questions.

You’ll have to look at some of the most interesting patterns that can help explain why sales are reduced
for this group. You might notice that they don’t tend to be very active on social media, with few of them
having Twitter or Facebook accounts. You might also notice that most of them are older than your
general audience. From that you can begin to trace patterns you can analyze more deeply.
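
A quick exploratory cut like the one described might look like this in pandas. The segments, ages, and social media flags below are made up:

# A sketch: compare the underperforming segment to the rest on two features.
import pandas as pd

customers = pd.DataFrame({
    "segment": ["under", "under", "other", "other", "other"],
    "age": [61, 58, 34, 29, 41],
    "has_social_account": [0, 0, 1, 1, 0],
})

print(customers.groupby("segment").agg(
    avg_age=("age", "mean"),               # mean age per segment
    social_share=("has_social_account", "mean"),  # share with social accounts
))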

Step 5: Perform in-depth analysis

This step of the process is where you’re going to have to apply your statistical, mathematical and
technological knowledge and leverage all of the data science tools at your disposal to crunch the data
and find every insight you can.

In this case, you might have to create a predictive model that compares your underperforming group
with your average customer. You might find out that the age and social media activity are significant
factors in predicting who will buy the product.
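
One way to sketch such a model is a logistic regression of purchase on age and social media activity; the data below is simulated purely for illustration:

# A sketch: logistic regression predicting purchase from age and social activity.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
age = rng.integers(18, 75, size=n)
social = rng.integers(0, 2, size=n)
# Simulate a purchase signal that depends on both features.
p = 1 / (1 + np.exp(-(-2.0 + 0.03 * (50 - age) + 1.2 * social)))
purchased = rng.binomial(1, p)

X = np.column_stack([age, social])
X_train, X_test, y_train, y_test = train_test_split(X, purchased, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

print("held-out accuracy:", model.score(X_test, y_test))
print("coefficients (age, social):", model.coef_[0])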

If you’d asked a lot of the right questions while framing your problem, you might realize that the
company has been concentrating heavily on social media marketing efforts, with messaging that is
aimed at younger audiences. You would know that certain demographics prefer being reached by
telephone rather than by social media. You begin to see how the way the product has been
marketed is significantly affecting sales: maybe this problem group isn’t a lost cause! A change in tactics
from social media marketing to more in-person interactions could change everything for the better. This
is something you’ll have to flag to your VP Sales.

You can now combine all of those qualitative insights with data from your quantitative analysis to craft a
story that moves people to action.

Step 6: Communicate results of the analysis

It’s important that the VP Sales understand why the insights you’ve uncovered are important.
Ultimately, you’ve been called upon to create a solution throughout the data science process. Proper
communication will mean the difference between action and inaction on your proposals.

You need to craft a compelling story here that ties your data with their knowledge. You start by
explaining the reasons behind the underperformance of the older demographic.

DATA SCIENCE PROCESS - REVISITED:

Stage 1: Ask a Question

 Skills: science, domain expertise, curiosity
 Tools: your brain, talking to experts, experience

Stage 2: Get the Data


 Skills: web scraping, data cleaning, querying databases, CS stuff
 Tools: Python, pandas

Stage 3: Explore the Data

 Skills: get to know the data, develop hypotheses, look for patterns and anomalies
 Tools: matplotlib, numpy, scipy, pandas, mrjob

Stage 4: Model the Data

 Skills: regression, machine learning, validation, big data
 Tools: scikit-learn, pandas, mrjob, MapReduce

Stage 5: Communicate the Data

 Skills: presentation, speaking, visuals, writing
 Tools: matplotlib, Adobe Illustrator, PowerPoint/Keynote

MACHINE LEARNING

1. Supervised Machine Learning


The majority of practical machine learning uses supervised learning.

Supervised learning is where you have input variables (x) and an output variable (Y) and you use an
algorithm to learn the mapping function from the input to the output.

Y = f(X)

The goal is to approximate the mapping function so well that when you have new input data (x), you can predict the output variables (Y) for that data.

It is called supervised learning because the process of an algorithm learning from the training dataset
can be thought of as a teacher supervising the learning process. We know the correct answers, the
algorithm iteratively makes predictions on the training data and is corrected by the teacher. Learning
stops when the algorithm achieves an acceptable level of performance.

Supervised learning problems can be further grouped into regression and classification problems.

 Classification: A classification problem is when the output variable is a category, such as “red” or
“blue” or “disease” and “no disease”.
 Regression: A regression problem is when the output variable is a real value, such as “dollars” or
“weight”.
Some common types of problems built on top of classification and regression include recommendation
and time series prediction respectively.
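
As a minimal illustration of learning the mapping Y = f(X) from labeled examples, here is a regression sketch with scikit-learn on simulated data:

# A sketch: approximate f from (X, Y) examples, then predict on new input.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
Y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=1.0, size=100)  # noisy f(X)

model = LinearRegression().fit(X, Y)  # approximate f from examples
print(model.predict([[4.0]]))         # predict Y for a new input x = 4.0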

2. Unsupervised Machine Learning

Unsupervised learning is where you only have input data (X) and no corresponding output variables.

The goal for unsupervised learning is to model the underlying structure or distribution in the data in
order to learn more about the data.

It is called unsupervised learning because, unlike the supervised learning above, there are no correct answers and there is no teacher. Algorithms are left to their own devices to discover and present the interesting structure in the data.

Unsupervised learning problems can be further grouped into clustering and association problems.

 Clustering: A clustering problem is where you want to discover the inherent groupings in the
data, such as grouping customers by purchasing behavior.
 Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people who buy X also tend to buy Y.

Some popular examples of unsupervised learning algorithms are:

 k-means for clustering problems.
 Apriori algorithm for association rule learning problems.
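
To make the clustering example concrete, here is a short k-means sketch with scikit-learn, grouping simulated customers by two behavioral features:

# A sketch: k-means on two simulated customer groups.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two simulated groups with different purchasing behavior.
frequent = rng.normal(loc=[10, 200], scale=2.0, size=(50, 2))
occasional = rng.normal(loc=[2, 40], scale=2.0, size=(50, 2))
X = np.vstack([frequent, occasional])  # columns: purchases/month, spend

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5])          # cluster assignment per customer
print(kmeans.cluster_centers_)     # discovered group centers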

APPLICATIONS / USES OF DATA SCIENCE

Using data science, companies have become intelligent enough to push and sell products based on customers' purchasing power and interests. Here's how they are ruling our hearts and minds:

Internet Search

When we speak of search, we think 'Google'. Right? But there are many other search engines like Yahoo, Bing, Ask, AOL, DuckDuckGo etc. All these search engines (including Google) make use of data science algorithms to deliver the best results for a searched query in a fraction of a second. Consider that Google processes more than 20 petabytes of data every day. Had there been no data science, Google wouldn't have been the 'Google' we know today.

Digital Advertisements (Targeted Advertising and re-targeting)


If you thought Search was the biggest application of data science and machine learning, here is a challenger – the entire digital marketing spectrum. From the display banners on various websites to the digital billboards at airports – almost all of them are decided by using data science algorithms.

This is the reason why digital ads have been able to get a much higher CTR (click-through rate) than traditional advertisements. They can be targeted based on a user's past behavior. This is why I see ads for analytics training while my friend sees ads for apparel in the same place at the same time.

Recommender Systems

Who can forget the suggestions about similar products on Amazon? They not only help you find relevant products from the billions available, but also add a lot to the user experience.

A lot of companies have fervidly used this engine/system to promote their products and suggestions in accordance with users' interests and the relevance of information. Internet giants like Amazon, Twitter, Google Play, Netflix, LinkedIn, IMDb and many more use this system to improve the user experience. The recommendations are made based on previous search results for a user.
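
One simple way such a system can work is item-item similarity. Here is a sketch on a tiny made-up rating matrix; a real engine would be far more sophisticated:

# A sketch: recommend by item-item cosine similarity on made-up ratings.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = items; 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
])

item_sim = cosine_similarity(ratings.T)  # similarity between item columns
most_similar_to_item0 = np.argsort(item_sim[0])[::-1][1]  # skip item 0 itself
print("recommend item", most_similar_to_item0, "to fans of item 0")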

Image Recognition

You upload an image with friends on Facebook and start getting suggestions to tag your friends. This automatic tag suggestion feature uses a face recognition algorithm. Similarly, to use WhatsApp Web, you scan a QR code in your web browser using your mobile phone. In addition, Google provides you the option to search for images by uploading them. It uses image recognition and provides related search results.

Speech Recognition

Some of the best examples of speech recognition products are Google Voice, Siri, Cortana etc. Using the speech recognition feature, even if you aren't in a position to type a message, your life wouldn't stop. Simply speak out the message and it will be converted to text. At times, though, you'll notice that speech recognition doesn't perform accurately.

Price Comparison Websites

At a basic level, these websites are driven by lots and lots of data, which is fetched using APIs and RSS feeds. If you have ever used these websites, you know the convenience of comparing the price of a product from multiple vendors in one place. PriceGrabber, PriceRunner, Junglee, Shopzilla and DealTime are some examples of price comparison websites. Nowadays, price comparison websites can be found in almost every domain, such as technology, hospitality, automobiles, durables, apparel etc.

Airline Route Planning

The airline industry across the world is known to bear heavy losses. Except for a few airline service providers, companies are struggling to maintain their occupancy ratios and operating profits. High rises in air fuel prices and the need to offer heavy discounts to customers have made the situation worse. It wasn't long before airline companies started using data science to identify strategic areas of improvement. Now, using data science, airline companies can:

 Predict flight delays
 Decide which class of airplanes to buy
 Decide whether to fly directly to the destination or take a halt in between (for example, a flight can fly directly from New Delhi to New York, or it can choose to halt in another country along the way)
 Effectively drive customer loyalty programs

Southwest Airlines and Alaska Airlines are among the top companies that have embraced data science to bring changes to their way of working.

Fraud and Risk Detection

One of the first applications of data science originated in the finance discipline. Companies were fed up with bad debts and losses every year. However, they had a lot of data that used to get collected during the initial paperwork while sanctioning loans. They decided to bring in data science practices to rescue themselves from those losses. Over the years, banking companies have learned to divide and conquer data via customer profiling, past expenditures and other essential variables to analyze the probabilities of risk and default. Moreover, it has also helped them push their banking products based on customers' purchasing power.

Delivery logistics

Who says data science has limited applications? Logistics companies like DHL, FedEx, UPS and Kuehne+Nagel have used data science to improve their operational efficiency. Using data science, these companies have discovered the best routes to ship, the best-suited times to deliver, and the best modes of transport to choose, leading to cost efficiency and much more. Furthermore, the data these companies generate from their installed GPS devices provides lots of possibilities to explore using data science.
