
Introduction to Data Analytics

Module 1 Syllabus
• Data-Information
• Characteristics of data
• Data munging
• Scraping
• Sampling
• Cleaning
• Importance of data analytics
Data  Data analysis is a process of
Analysis obtaining raw data and converting
it into information useful for
decision-making by users.

Data All Around
Lots of data is being collected and warehoused:
• Web data, e-commerce
• Financial transactions, bank/credit transactions
• Online trading and purchasing
• Social networks

• Google processes 20 PB a day (2008)
• Facebook has 60 TB of daily logs
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• 1000 Genomes Project: 200 TB

How Much Data Do We Have?
• Cost of 1 TB of disk: $35
• Time to read 1 TB disk: 3 hrs (100 MB/s)
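
The read-time figure above can be checked with a short calculation. This is a minimal sketch that assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a constant sequential read rate; the function name `read_time_hours` is illustrative.

```python
def read_time_hours(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to read size_bytes sequentially at a constant rate."""
    return size_bytes / rate_bytes_per_s / 3600


TB = 10**12  # assumption: decimal terabyte
MB = 10**6   # assumption: decimal megabyte

hours = read_time_hours(1 * TB, 100 * MB)
print(f"{hours:.2f} hours")  # about 2.78 hours, which the slide rounds to 3
```

The same function shows why volume matters: at this rate, scanning a petabyte on a single disk would take over a hundred days, which is one motivation for distributed processing.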

Big Data
Big Data is any data that is expensive to manage and hard to extract value from.

Volume
The size of the data.

Velocity
Velocity refers to the speed at which data is generated. High-velocity data is generated at such a pace that it requires distinct (distributed) processing techniques. Examples of data generated with high velocity are Twitter messages and Facebook posts.

Variety and Complexity
The diversity of sources, formats, and quality.

Types of Data We Have
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data: Social Network, Semantic Web (RDF)
• Streaming Data

What To Do With These Data?
• Aggregation and Statistics
• Data warehousing and OLAP
• Indexing, Searching, and Querying
• Keyword-based search
• Pattern matching (XML/RDF)
• Knowledge discovery
• Data Mining
• Statistical Modeling
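
Two of the operations listed above, aggregation and keyword-based search, can be illustrated with a small Python sketch that uses only the standard library. The sample records and the function names `total_by_customer` and `keyword_search` are made up for illustration; real systems would typically use a database or search index instead.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transaction records (illustrative data only)
orders = [
    {"customer": "alice", "item": "laptop stand", "amount": 30.0},
    {"customer": "bob", "item": "usb cable", "amount": 5.0},
    {"customer": "alice", "item": "usb hub", "amount": 20.0},
]


def total_by_customer(records):
    """Aggregation: sum the order amounts per customer."""
    totals = defaultdict(float)
    for record in records:
        totals[record["customer"]] += record["amount"]
    return dict(totals)


def keyword_search(records, keyword):
    """Keyword search: case-insensitive substring match on the item name."""
    needle = keyword.lower()
    return [r for r in records if needle in r["item"].lower()]


totals = total_by_customer(orders)
average_amount = mean(r["amount"] for r in orders)
usb_orders = keyword_search(orders, "usb")
print(totals, average_amount, len(usb_orders))
```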

What is Data Science?
• An area that manages, manipulates, extracts, and interprets knowledge from tremendous amounts of data
• Data science (DS) is a multidisciplinary field of study whose goal is to address the challenges in big data
• Data science principles apply to all data, big and small
Theories and techniques from many fields and disciplines are used to investigate and analyze large amounts of data (structured and unstructured) to help decision makers in many areas.
Data science is a field that comprises everything related to data cleansing, preparation, and analysis.
It also involves the creation of new algorithms.


Data Analytics
• Data analytics is the science of examining raw data with the purpose of drawing conclusions about that information.
• It is a component of data science.

Analysis vs Analytics
• Data analytics is a broader term and includes data analysis as a necessary subcomponent.
• Analytics defines the science behind the analysis.
• The science means understanding the cognitive processes an analyst uses to understand problems and explore data in meaningful ways.
• Ex: Questions and answers
• Analytics also includes data extract, transform, and load; specific tools, techniques, and methods; and how to successfully communicate results.
Ex: Very precise answers or outcomes for the given data at that moment

Data vs Information
• Data are the facts or details from which information is derived.
• Individual pieces of data are rarely useful alone.
• For data to become information, data needs to be put into context.

Characteristics of data
The seven characteristics that define data quality are:
1. Accuracy and Precision
2. Legitimacy and Validity
3. Reliability and Consistency
4. Timeliness and Relevance
5. Completeness and Comprehensiveness
6. Availability and Accessibility
7. Granularity and Uniqueness

Accuracy and Precision: This characteristic refers to the exactness of the data.
Ex: Records at the wrong level of precision (e.g., prices that were originally quoted at three decimal places but cut off and stored at two decimal places)
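
The loss of precision in this example can be made concrete with Python's `decimal` module. This is a minimal sketch; the price value and the helper `decimal_places` are hypothetical and chosen only for illustration.

```python
from decimal import Decimal, ROUND_DOWN

quoted = Decimal("19.999")  # hypothetical price quoted at three decimal places

# Cut the value off at two decimal places, as in the example above
stored = quoted.quantize(Decimal("0.01"), rounding=ROUND_DOWN)

# The information lost by storing at the wrong level of precision
lost = quoted - stored


def decimal_places(value: Decimal) -> int:
    """Number of digits after the decimal point of a finite Decimal."""
    return max(0, -value.as_tuple().exponent)


print(stored, lost, decimal_places(quoted), decimal_places(stored))
```

A check such as `decimal_places(value) == 3` could flag records stored at a coarser precision than the one in which they were quoted.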

Legitimacy and Validity: Requirements governing data set the boundaries of this characteristic.
Ex: On surveys, items such as gender, ethnicity, and nationality are typically limited to a set of options, and open answers are not permitted. Any answers other than these would not be considered valid or legitimate based on the survey's requirement.
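
A validity check of this kind amounts to testing each answer against the permitted set. The sketch below assumes a hypothetical list of allowed options and a case-sensitive match; both are illustrative choices, not requirements taken from any particular survey.

```python
# Hypothetical set of permitted answers for one survey item
ALLOWED_NATIONALITY = {"Indian", "American", "German", "Other"}


def is_valid(answer: str, allowed: set[str]) -> bool:
    """An answer is valid only if, after trimming whitespace, it is one of
    the permitted options (the match is case-sensitive by assumption)."""
    return answer.strip() in allowed


responses = ["Indian", "german", " American ", "Martian"]
invalid = [r for r in responses if not is_valid(r, ALLOWED_NATIONALITY)]
print(invalid)  # answers that fall outside the permitted options
```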

Reliability and Consistency: Regardless of what source collected the data or where it resides, it cannot contradict a value residing in a different source or collected by a different system.
Ex:
• Telephone numbers with commas vs. hyphens
• U.S. vs. European date formats
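
Both examples are usually handled by normalizing values to one canonical form before comparing sources. The following sketch assumes phone numbers can be compared by digits alone and that each source's date convention is known in advance; the function names are illustrative.

```python
import re
from datetime import datetime


def normalize_phone(raw: str) -> str:
    """Keep digits only, so '555,123,4567' and '555-123-4567' compare equal."""
    return re.sub(r"\D", "", raw)


def parse_date(raw: str, day_first: bool) -> str:
    """Parse a slash-separated date into ISO 8601 (YYYY-MM-DD).
    day_first=True means European order (DD/MM/YYYY),
    day_first=False means U.S. order (MM/DD/YYYY)."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    return datetime.strptime(raw, fmt).date().isoformat()


same_phone = normalize_phone("555,123,4567") == normalize_phone("555-123-4567")
us_date = parse_date("03/04/2021", day_first=False)
eu_date = parse_date("03/04/2021", day_first=True)
print(same_phone, us_date, eu_date)
```

The same string yields two different dates depending on the convention, which is why the format must be recorded alongside the data rather than guessed.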

Timeliness and Relevance: Data collected too soon or too late could misrepresent a situation and drive inaccurate decisions.
Ex:
• An issuance or corporate action not delivered when it was announced
• A credit rating change not updated on the day it was issued

Completeness and Comprehensiveness: Incomplete data is as dangerous as inaccurate data.
Ex: Missing data

Availability and Accessibility: This presumes that the data exists and is available for access to be granted.
Ex: This characteristic can be tricky at times due to legal and regulatory constraints. Regardless of the challenge, though, individuals need the right level of access to the data in order to perform their jobs.

Granularity and Uniqueness:
• The level of detail at which data is collected is important, because confusion and inaccurate decisions can otherwise occur.
• Aggregated, summarized, and manipulated collections of data could offer a different meaning than the data implied at a lower level.
• An appropriate level of granularity must be defined to provide sufficient uniqueness and for distinctive properties to become visible.
• This is a requirement for operations to function effectively.
Ex:
• Two instances of the same security with different identifiers or spellings
• A preferred share represented as both an equity and debt object in the same database
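
The first example, one security stored under different identifiers or spellings, can be detected by grouping records on a normalized name. This is a rough sketch with made-up records; the normalization rule in `name_key` (lowercase, alphanumeric characters only) is an assumption and would be too crude for production matching.

```python
from collections import defaultdict

# Hypothetical security master records (illustrative data only)
securities = [
    {"id": "A1", "name": "Acme Corp."},
    {"id": "A-1", "name": "ACME CORP"},
    {"id": "B7", "name": "Beta Ltd"},
]


def name_key(name: str) -> str:
    """Normalize a name: lowercase and keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


# Group identifiers by normalized name
groups = defaultdict(list)
for security in securities:
    groups[name_key(security["name"])].append(security["id"])

# Any normalized name with more than one identifier is a likely duplicate
duplicates = {key: ids for key, ids in groups.items() if len(ids) > 1}
print(duplicates)
```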

Active domains of data science
• Data science plays a role in virtually all aspects of our day-to-day lives and is used across nearly all industries.
• The adoption of data science was largely spurred by the successes of start-ups such as Uber, Airbnb, and Facebook, which rose rapidly and earned valuations of billions of dollars in a very short span of time.
• Companies leveraged the data generated by social media networks such as Facebook and Twitter and by search engines such as Google and Yahoo!, using various machine learning techniques to gain insights.

Finance
Data science has been used in finance, especially in trading, for many decades. Investment banks, especially trading desks, have employed complex models to analyse and make trading decisions. Some examples of data science as used in finance include:
• Credit risk management: Analyzing the creditworthiness of a user by analyzing the historical financial records, assets, and transactions of the user
• Loan fraud: Identifying applications for credit or loans that may be fraudulent by analyzing the loan and applicant's characteristics
• Market Basket Analysis: Understanding the correlation among stocks and other securities and formulating trading and hedging strategies
• High-frequency trading: Analyzing trades and quotes to discover pricing inefficiencies and arbitrage opportunities
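
As a small illustration of measuring correlation among securities, the sketch below computes the Pearson correlation coefficient between series of daily returns. The slides do not name a specific correlation measure, so Pearson is an assumption here, and the return series are invented for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical daily returns (illustrative data only)
stock_a = [0.01, -0.02, 0.015, 0.00, -0.005]
stock_b = [0.02, -0.04, 0.03, 0.00, -0.01]    # moves with stock_a (2x)
stock_c = [-0.01, 0.02, -0.015, 0.00, 0.005]  # moves against stock_a (-1x)

corr_ab = pearson(stock_a, stock_b)
corr_ac = pearson(stock_a, stock_c)
print(round(corr_ab, 6), round(corr_ac, 6))
```

A coefficient near +1 suggests two securities tend to move together, while a coefficient near -1 suggests one could serve as a hedge for the other.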

Health care and Pharmaceuticals
Patient journey and treatment pathways:
Sales field messaging: Using NLP, pharma companies analyse discussions between sales representatives and physicians during sales visits to improve their messaging content and better inform physicians on the potential risks and benefits of medications as needed.
Biomarker analysis: Machine learning for identifying biomarkers and their importance and/or relevance to diseases is used in clinical research such as cancer-related studies.

Government
Data science is used by state and national governments for a wide range of uses.
Climate change: ML techniques are being used throughout the globe to detect and understand the causes of climate change.
Cyber security: The use of extremely advanced machine learning techniques for national cyber security is evident and well known all over the world, ever since such practices were disclosed.

Manufacturing and retail
Price optimization: Generally related to the realm of linear programming, the challenge of price optimization, that is, pricing products, is now also being addressed with the help of machine learning. Dynamic pricing uses market conditions, user preferences, and other factors as inputs to assess the optimal pricing of products.
Retail sales: Retailers use algorithms to determine future sales forecasts, price discounts, and promotion sequences.
Production capacity and maintenance: In manufacturing, data science is being used to determine device maintenance requirements and equipment effectiveness, optimize production lines, and much more. Overall supply chain management is an area that has benefited and continues to profit from smart use of machine learning.

Web industry
• One of the earliest beneficiaries of data science was the web industry.
• Empowered by the collection of user-specific data from social networks, firms around the world employ algorithms to understand user behaviour and generate targeted ads.
• Google, one of the earliest proponents of targeted ad marketing, today earns most of its revenue from ads, more than $95 billion in 2017.
The use of data science for web-related businesses is ubiquitous today, and companies such as Uber, Airbnb, Netflix, and Amazon have successfully navigated and made full use of this complex ecosystem, generating not only huge profits but also adding millions of new jobs directly

Other industries
Many other industries benefit from data science today. It has become so common that it would be impractical to list them all, but at a high level, some of the others include the following:
• Oil and natural gas for oil production
• Meteorology for understanding weather patterns
• Space research for detecting and/or analyzing stars and galaxies
• Utilities for energy production and energy savings
• Biotechnology for research and finding new cures for diseases
In general, since data science or machine learning algorithms are not specific to any particular industry, it is entirely possible to apply algorithms to creative use cases and derive business benefits.