An Overview of Data Science
Data science is a multi-disciplinary field that uses scientific methods, processes,
algorithms, and systems to extract knowledge and insights from structured and
unstructured data.
• Example:
• Consider data involved in buying a box of KitKat from the store or supermarket:
• Your data here is the planned purchase, written down somewhere (say, on a
shopping list in your notebook).
• When you get to the store, you use that piece of data to remind yourself about
what you need to buy and pick it up and put it in your cart.
• At checkout, the cashier scans the barcode on your box and the cash register
logs the price.
• Back in the warehouse, a computer informs the stock manager that it is time to
order this item from the distributor, because your purchase took the last box in the
store.
• You may have a coupon for your purchase and the cashier scans that too, giving
you a predetermined discount.
• At the end of the week, a report of all the scanned
manufacturer coupons gets uploaded to the KitKat company
so they can issue a reimbursement to the grocery store for all
of the coupon discounts they have handed out to customers.
• Finally, at the end of the month, a store manager looks at a
colorful collection of pie charts showing all the different kinds
of KitKat that were sold and, on the basis of strong sales of
KitKat, decides to offer more varieties of these on the store’s
limited shelf space next month.
• So, the small piece of information in your notebook ended up in many different places,
notably on the desk of a manager as an aid to decision making.
• The data went through many transformations.
• In addition to the computers where the data might have stopped briefly or stayed for
the long term, lots of other pieces of hardware, such as the barcode scanner,
were involved in collecting, manipulating, transmitting, and storing the data.
• In addition, many different pieces of software were used to organize,
aggregate, visualize, and present the data.
• Finally, many different human systems were involved in working with the data.
• People decided which systems to buy and install, who should get access to
what kinds of data, and what would happen to the data after its immediate
purpose was fulfilled.
• Data science has evolved into one of the most promising and in-demand career
paths.
• Professionals in the field use advanced techniques to analyze large volumes of
data.
What are data and information?
Data can be defined as a representation of facts, concepts, or instructions in a
formalized manner, suitable for communication, interpretation, or processing by
humans or machines.
Information is the processed data on which decisions and actions are
based.
Knowledge: an appropriate collection of information.
• It creates relationships among concepts.
• It is used to answer 'how' questions.
• It is gained through much experience and information.
• It comes through understanding patterns.
Wisdom: a collection of very deep knowledge.
• It comes through understanding principles.
Together, data, information, knowledge, and wisdom form a hierarchical model of
human competency.
Data vs. Information …
Data vs. Information Examples Chart
• Seeing examples of data and information side-by-side in a chart can help you
better understand the differences between the two terms.
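To make the distinction concrete, here is a small illustrative sketch in Python (not from the slides; the temperature readings are invented): the raw numbers are data, and the processed, contextualized summary is information.

```python
# Raw data: unprocessed facts (hypothetical patient temperatures in deg C).
raw_data = [38.5, 37.2, 39.1, 36.8]

# Processing: aggregate and interpret the raw readings.
average = sum(raw_data) / len(raw_data)
fever_count = sum(t > 38.0 for t in raw_data)

# Information: processed data on which decisions and actions can be based.
information = f"Average temperature is {average:.1f} C; {fever_count} readings indicate fever."
print(information)
```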
Data Processing Cycle
Data processing is the re-structuring or re-ordering of data by processing functions;
its aim is to transform raw data into meaningful information.
Data Processing Cycle
Data processing consists of three basic steps: input, processing, and output.
Input − in this step, the input data is prepared in some convenient form for
processing.
Processing − in this step, the input data is transformed to produce data in a more
useful form.
Output − in this step, the result of processing is collected and presented.
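The three steps above can be sketched as a single toy function (an illustrative example; the grocery records are invented):

```python
def data_processing_cycle(purchases):
    # Input: prepare the raw records in a convenient form for processing.
    prepared = [(name.strip().lower(), price) for name, price in purchases]
    # Processing: transform the prepared data (here, total spend per item).
    totals = {}
    for name, price in prepared:
        totals[name] = totals.get(name, 0) + price
    # Output: return the processed result for presentation.
    return totals

print(data_processing_cycle([(" KitKat ", 1.5), ("kitkat", 1.5), ("Milk", 2.0)]))
```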
A. Structured
Structured data is data that adheres to a pre-defined data model and is
therefore straightforward to analyze.
B. Semi-structured
Semi-structured data does not conform to the formal structure of data models but
contains tags or markers to separate elements. JSON and XML are forms of
semi-structured data.
C. Unstructured
Unstructured data is information that either does not have a predefined data model or is
not organized in a pre-defined manner.
Unstructured information is typically text-heavy but may contain data such as dates,
numbers, and facts as well.
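A short sketch of the three categories using only the Python standard library (the records themselves are made up for illustration):

```python
import json
import xml.etree.ElementTree as ET

# Structured: adheres to a pre-defined data model (fixed fields per record).
structured = [{"id": 1, "item": "KitKat", "price": 1.5}]

# Semi-structured: self-describing formats such as JSON and XML.
record = json.loads('{"item": "KitKat", "tags": ["snack", "chocolate"]}')
xml_node = ET.fromstring("<item name='KitKat'><price>1.5</price></item>")

# Unstructured: free text with no predefined model, though it may
# still contain dates, numbers, and facts.
unstructured = "Customer said the KitKat was great, bought on 2024-01-05."

print(record["tags"], xml_node.find("price").text)
```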
Data Types from a Data Analytics Perspective
Metadata: metadata is data about data; it provides additional information about a
specific set of data.
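As a small illustration (the file name is hypothetical), a file's contents are data, while properties such as its size and modification time are metadata, i.e., data about that data:

```python
import os
import datetime

# Write some data to a file (hypothetical example file).
path = "example.txt"
with open(path, "w") as f:
    f.write("KitKat,1.50\n")

# Read the file's metadata: information about the data, not the data itself.
info = os.stat(path)
metadata = {
    "size_bytes": info.st_size,
    "modified": datetime.datetime.fromtimestamp(info.st_mtime).isoformat(),
}
print(metadata)
```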
Data Value Chain
The Data Value Chain describes the information flow within a big data
system as a series of steps needed to generate value and useful insights from
data.
It identifies the following key high-level activities:
A. Data Acquisition
B. Data Analysis
Data analysis involves exploring, transforming, and modeling data with the
goal of highlighting relevant data, synthesizing and extracting useful
hidden information with high potential from a business point of view.
C. Data Curation
It is the active management of data over its life cycle to ensure it meets the
necessary data quality requirements for its effective usage.
D. Data Storage
E. Data Usage
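The five activities can be sketched as a toy pipeline (illustrative only; the records and function names are invented, and a real system would use actual data sources and databases rather than in-memory lists):

```python
def acquire():
    # Data Acquisition: gather raw records from a source.
    return ["kitkat,1.5", "kitkat,1.5", "milk,2.0", "milk,BAD"]

def curate(rows):
    # Data Curation: enforce quality requirements (drop malformed rows).
    clean = []
    for row in rows:
        name, price = row.split(",")
        try:
            clean.append((name, float(price)))
        except ValueError:
            pass  # discard rows that fail the quality check
    return clean

def analyze(rows):
    # Data Analysis: extract useful information (revenue per item).
    revenue = {}
    for name, price in rows:
        revenue[name] = revenue.get(name, 0) + price
    return revenue

store = {}  # Data Storage: persist the result (an in-memory stand-in).
store["revenue"] = analyze(curate(acquire()))
print(store["revenue"])  # Data Usage: the stored insight supports decisions.
```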
Big data has four characteristic Vs:
1. Volume − large amounts of data (up to zettabytes).
2. Velocity − data is live-streaming or in motion.
3. Variety − data comes in different forms from different sources.
4. Veracity − can we trust the data? How accurate is it?
• Let's look at our smartphones: nowadays smartphones
generate a lot of data in the form of text, phone calls,
emails, photos, videos, searches, and music.
• Approximately 40 exabytes (10^18 bytes) of data are generated
every month by a single smartphone user; now consider
how much data 5 billion smartphone users will generate.
• That is mind-blowing; in fact, this amount of data is quite a lot
for traditional computing systems to handle. This massive
amount of data is called big data.
• Now let's have a look at the data generated per
minute on the internet:
• 2.1M snaps are shared on Snapchat,
• 3.8M search queries are made on Google,
• 1M people log in to Facebook,
• 4.5M videos are watched on YouTube, and
• 188M emails are sent.
Big Data Solutions: Clustered Computing
• Individual computers are often inadequate for handling big data at
most stages.
• Clustered computing is used to better address the high storage
and computational needs of big data.
• Clustered computing is a form of computing in which a group of
computers (often called nodes) are connected through a LAN
(local area network) so that they behave like a single machine.
• The set of computers is called a cluster.
• The resources from these computers are pooled so that the cluster
appears as a single, more powerful machine.
Clustered Computing
Big data clustering software combines the resources of many smaller machines,
seeking to provide a number of benefits:
I. Resource Pooling − combining the available storage space to hold data is a clear
benefit, but CPU and memory pooling are also extremely important; processing
large datasets requires large amounts of all three of these resources.
II. High Availability − clusters can provide varying levels of fault tolerance and
availability guarantees to prevent failures from affecting access to data.
III. Easy Scalability − clusters make it easy to scale horizontally by adding
additional machines to the group.
Cluster membership and resource allocation can be handled by software like
Hadoop's YARN (Yet Another Resource Negotiator), which lets the cluster support
different processing frameworks and programming models.
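As a loose analogy only (a process pool on one machine, not a real cluster), the resource-pooling idea can be sketched with Python's multiprocessing module: work is split across worker processes much as a cluster spreads it across nodes:

```python
from multiprocessing import Pool

def count_words(chunk):
    # Each worker handles one chunk of the data independently.
    return len(chunk.split())

if __name__ == "__main__":
    chunks = ["big data needs many machines", "clusters pool storage and cpu"]
    # The pool distributes chunks across workers and gathers the results,
    # analogous to how a cluster pools the CPUs of its nodes.
    with Pool(processes=2) as pool:
        counts = pool.map(count_words, chunks)
    print(sum(counts))
```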
Big Data Life Cycle with Hadoop
1. Ingesting data into the system
• The first stage of Big Data processing is to Ingest data into the
system.
• The data is ingested or transferred to Hadoop from various
sources such as relational databases, systems, or local files.
• Sqoop transfers data from RDBMS to HDFS, whereas Flume
transfers event data.
2. Processing the data in storage.
• The second stage is Processing.
• In this stage, the data is stored and processed.
• The data is stored in the distributed file system HDFS and in the
NoSQL distributed database HBase.
• Spark and MapReduce perform data processing.
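The MapReduce model mentioned above can be sketched in plain Python as a word count, the classic example (a real job would be submitted to the cluster via Hadoop MapReduce or Spark rather than run like this):

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map: emit a (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.lower().split()]

def reducer(pairs):
    # Reduce: sum the counts for each distinct word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["KitKat sales rose", "sales of KitKat rose again"]
print(reducer(chain.from_iterable(mapper(line) for line in lines)))
```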
3. Computing and analyzing data
• The third stage is to Analyze Data
• Here, the data is analyzed by processing frameworks such
as Pig, Hive, and Impala.
• Pig converts the data using map and reduce operations and then
analyzes it.
• Hive is also based on map and reduce programming and is most
suitable for structured data.
4. Visualizing the results
• The fourth stage is access, which is performed by tools
such as Sqoop, Hive, Hue and Cloudera Search.
• In this stage, the analyzed data can be accessed by users.