Introduction To Data Science and Big Data

Unit - 1 Introduction to Data Science and Big Data
Data science is the study of how data can be used to support better
decision-making and to solve complex problems more simply.
It processes huge amounts of structured, semi-structured, and
unstructured data to extract meaningful insights. From these insights,
patterns can be identified that help in making decisions about
seizing new business opportunities, improving products and services,
and ultimately growing the business.
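As a rough illustration of the three kinds of data mentioned above, the sketch below reads a small structured table (CSV), a semi-structured record (JSON), and a piece of unstructured free text. All sample values are made up for illustration and are not part of the notes.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields (hypothetical data).
structured = "id,name,amount\n1,Asha,250\n2,Ravi,400\n"
table = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing fields that may vary from record to record.
semi_structured = '{"id": 3, "name": "Meera", "tags": ["new", "mobile"]}'
record = json.loads(semi_structured)

# Unstructured: free text with no predefined fields; here we only split it into words.
unstructured = "Delivery was late but the product quality is great."
words = unstructured.lower().split()

print(table[0]["name"], record["tags"], len(words))
```

Structured data can be queried by column name straight away, while unstructured text needs further processing before any insight can be extracted from it.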
AI
• Artificial intelligence is a technology with which we can create
intelligent systems that simulate human intelligence.
• An artificial intelligence system does not need to be explicitly
pre-programmed for every task.
Machine Learning
Machine learning is about extracting knowledge from data.
It is a subfield of artificial intelligence that enables
machines to learn from past data or experience without being
explicitly programmed.
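A minimal sketch of "learning from past data": instead of hard-coding a rule, the program estimates a straight line from example pairs and then uses it to predict an unseen case. The data (hours studied and exam scores) and the helper names `fit_line` and `predict` are hypothetical, and least-squares line fitting is only one simple technique chosen for illustration.

```python
# Hypothetical "past data": hours studied -> exam score (made-up numbers).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# The "learning" step: parameters come from the data, not from the programmer.
slope, intercept = fit_line(hours, scores)


def predict(x):
    """Predict a score for a number of hours not seen in the past data."""
    return slope * x + intercept


print(predict(6.0))
```

If the past data changes, the fitted line and its predictions change with it, without any change to the program.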
Big Data
Big data is a collection of data sets so large and complex that it
becomes difficult to process with traditional DBMS tools.
"Big data" consists of very large volumes of heterogeneous data that
are often generated at high speed.
Applications of Data Science
Transport
Data science is also used in transport, for example in driverless cars,
which have the potential to reduce the number of accidents.
In a driverless car, training data is fed into an algorithm, and
data science techniques are used to analyze the data.
E-commerce
E-commerce websites such as Amazon and Flipkart use data science to
improve the user experience with personalized recommendations.
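One simple way to produce a personalized recommendation is to suggest items that were often bought together with the item a customer is viewing. The sketch below assumes a tiny, made-up list of orders and a hypothetical helper `recommend`. It is an illustration of the idea, not how any particular website implements it.

```python
from collections import Counter

# Hypothetical purchase histories: each set is one customer's order (made-up data).
orders = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"laptop", "mouse"},
    {"phone", "charger"},
    {"laptop", "mouse", "bag"},
]


def recommend(item, orders, top_n=2):
    """Return up to top_n items most often bought together with `item`."""
    counts = Counter()
    for basket in orders:
        if item in basket:
            counts.update(basket - {item})
    # Sort by count (descending), then by name so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


print(recommend("phone", orders))
print(recommend("laptop", orders))
```

With this sample data, a customer looking at a phone is shown the case and the charger, because those items co-occur with phones most often in past orders.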
Banking
It is one of the biggest applications of Data Science , banks can
manage their resources efficiently, furthermore, banks can make
smarter decisions through fraud detection, management of customer
data, risk modeling, real-time predictive analytics, customer
segmentation, etc.
Manufacturing
– Optimizing production
– Reducing costs
– Boosting profits
Data Explosion
A rapid increase in the amount of data generated and stored
in computing systems, to the point where data management
becomes difficult, is called a “data explosion”.
The key drivers of data growth are the following:
– Increase in storage capacities.
– Cheaper storage.
– Increase in the data processing capabilities of modern computing
devices.
– Data generated and made available by different sectors.
Five V’s of Big Data
Big data can be identified by a few characteristics that are
specific to it, known as the five V’s of big data.
• Volume
Refers to the amount of data that exists.
If the volume of data is large enough, it can be considered big data.
• Velocity
Refers to how quickly data is generated and how quickly that data
moves.
• Variety
Refers to the diversity of data types.
An organization might obtain data from a number of different
sources, which may vary in value. Data can come from sources both
inside and outside the enterprise.
• Veracity
Refers to the quality and accuracy of data. Gathered data could have
missing pieces, may be inaccurate, or may not be able to provide real,
valuable insight.
• Value
Refers to the value that big data can provide, and relates directly
to what organizations can do with the collected data.
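Veracity can be made concrete by measuring how much of the gathered data is missing. The sketch below assumes a few made-up customer records, with `None` standing for a missing value, and a hypothetical helper `missing_ratio`. It covers only the "missing pieces" aspect of data quality, not accuracy.

```python
# Hypothetical customer records; None marks a missing value (made-up data).
records = [
    {"name": "Asha", "age": 34, "city": "Pune"},
    {"name": "Ravi", "age": None, "city": "Mumbai"},
    {"name": None, "age": 29, "city": None},
    {"name": "Meera", "age": 41, "city": "Delhi"},
]


def missing_ratio(records):
    """Return, per field, the fraction of records in which the value is missing."""
    fields = records[0].keys()
    total = len(records)
    return {
        field: sum(1 for r in records if r.get(field) is None) / total
        for field in fields
    }


print(missing_ratio(records))
```

A field with a high missing ratio is a warning that insights drawn from it may not be reliable.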
Relation between DS and IS
DS is about the discovery of knowledge from data.
Data science is a field in which information and knowledge are extracted
from data using different algorithms and processes.
Data science is used in business functions such as strategy formation and
decision making.
IS is about design practices for storing and retrieving information.
IS is used in areas such as knowledge management and data
management.
Data Science Lifecycle
The data science lifecycle is a defined procedure with five
important steps.
Gathering/Collecting Data
Before creating any new product, organizations need to
collect data to research demand, customer preferences,
competitors, etc.
If this data is not collected in advance, the failure rate
for the new product can be 80 percent or even higher.
There are two main methods of data collection:
1. Primary Data Collection
• Interviews
• Observations
• Surveys and Questionnaires
• Focus Groups
• Oral Histories
2. Secondary Data Collection
Secondary data refers to data that has already been collected
by someone else.
• It is much less expensive and easier to collect than
primary data.
• While primary data collection provides more authentic and
original data, there are numerous instances where secondary
data collection provides great value to organizations.
Example: data published on the Internet.
Cleaning Data
Scrubbing and filtering of data.
Here we remove duplicate or irrelevant observations.
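The cleaning step can be sketched as follows, assuming a small made-up set of survey rows and a hypothetical helper `clean`. In this example, "irrelevant" is taken to mean rows from outside the country being studied, which is an assumption made only for illustration.

```python
# Hypothetical survey rows (made-up data).
rows = [
    {"id": 1, "country": "IN", "rating": 4},
    {"id": 2, "country": "IN", "rating": 5},
    {"id": 1, "country": "IN", "rating": 4},  # exact duplicate of the first row
    {"id": 3, "country": "US", "rating": 3},  # irrelevant if the study covers "IN" only
]


def clean(rows, country="IN"):
    """Drop exact duplicate rows and rows from outside the given country."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the whole row
        if key in seen:
            continue  # duplicate observation
        seen.add(key)
        if row["country"] != country:
            continue  # irrelevant observation
        cleaned.append(row)
    return cleaned


print(clean(rows))
```

Only the rows with `id` 1 and 2 survive: the repeated row is dropped as a duplicate and the "US" row is dropped as irrelevant to the study.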
Exploring Data
Modeling Data
Interpreting Data
