• Multi-disciplinary field that uses scientific methods, processes, algorithms & systems to
extract knowledge & insights from structured, semi-structured and unstructured data.
• More than simply analyzing data.
• Offers a whole range of new roles and requires a specific range of skill sets.
• Output: the results of the preceding processing step are collected. The particular form of the output data depends on
how the data will be used.
• Example: Output data may be the payroll for employees.
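As a minimal sketch of the output step, assume some hypothetical employee records holding hours worked and an hourly rate (the names and figures below are illustrative and do not come from the slides). The processed results are collected and rendered as a payroll listing:

```python
# Hypothetical input records; names, hours, and rates are illustrative only.
employees = [
    {"name": "Aisha", "hours": 160, "rate": 25.0},
    {"name": "Ben", "hours": 120, "rate": 30.0},
]


def process(records):
    """Processing step: compute gross pay for each employee."""
    return [
        {"name": r["name"], "gross_pay": r["hours"] * r["rate"]}
        for r in records
    ]


def format_payroll(results):
    """Output step: render the processed results as a payroll listing."""
    lines = [f"{r['name']:<10}{r['gross_pay']:>10.2f}" for r in results]
    return "\n".join(lines)


payroll = process(employees)
print(format_payroll(payroll))
```

The same processed results could just as well be written to a file or a report; the form of the output follows from how it will be used.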
Structured data:
• Commonly conforms to a tabular format with a relationship between the different rows & columns.
• Has structured rows and columns that can be sorted.
• Examples: Excel files & SQL databases.
Semi-structured data:
• Contains tags or other markers to separate semantic elements and enforce hierarchies of records & fields within the data.
• Also known as a self-describing structure.
• Examples: JSON files & XML files.
Unstructured data:
• Typically text-heavy, but may contain data such as dates, numbers, and facts as well.
• This results in irregularities and ambiguities that make it difficult to understand using traditional programs, as compared to data stored in structured databases.
• Examples: audio and video files, NoSQL databases.
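The difference between the three kinds of data shows up in how a program has to read them. The sketch below uses only Python's standard library; the sample records, field names, and the date pattern are assumptions made for illustration.

```python
import csv
import io
import json
import re

# Structured: tabular rows and columns that can be sorted by a column.
csv_text = "name,age\nCara,31\nAdam,45\nBea,27\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
rows_by_age = sorted(rows, key=lambda r: int(r["age"]))

# Semi-structured: keys act as markers and nesting expresses hierarchy.
json_text = (
    '{"name": "Cara", "contacts": '
    '{"email": "cara@example.com", "phones": ["555-0100"]}}'
)
record = json.loads(json_text)
email = record["contacts"]["email"]

# Unstructured: free text has no schema, so facts must be searched for.
note = "Cara called on 2024-03-05 about invoice 1182."
dates = re.findall(r"\d{4}-\d{2}-\d{2}", note)

print([r["name"] for r in rows_by_age])
print(email)
print(dates)
```

The CSV rows can be sorted directly because every row has the same columns, the JSON record describes its own hierarchy through its keys, and the free-text note needs pattern matching before any fact can be extracted from it.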
Economical:
Its systems are highly economical as ordinary computers can be used for data processing.
Reliable:
It is reliable as it stores copies of the data on different machines and is resistant to hardware failure.
Scalable:
Easily scalable, both horizontally and vertically. A few extra nodes help in scaling up the framework.
Flexible:
It is flexible: it can store as much structured and unstructured data as needed, and how to use that data can be decided later.
• Stage 2 – Processing
Data is stored & processed. Data is stored in a distributed
file system, HDFS, & in NoSQL distributed data stores.
• Stage 3 – Analyze
Data is analyzed by processing frameworks such as
Impala & Hive.
• Stage 4 – Access
Performed by tools such as Hue and Cloudera Search.
In this stage, the analyzed data can be accessed by
users.
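Hive and Impala both accept SQL-style queries, so the analyze stage often looks like an aggregation over stored data. Running either needs a cluster, so the sketch below swaps in Python's built-in sqlite3 module as a stand-in; it is not Hive or Impala, and the table name, columns, and values are hypothetical.

```python
import sqlite3

# Stand-in for the storage layer: an in-memory SQLite table, not HDFS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("home", 120), ("docs", 75), ("home", 30), ("blog", 40)],
)

# Analyze: aggregate the stored data with a SQL query.
query = """
    SELECT page, SUM(views) AS total_views
    FROM page_views
    GROUP BY page
    ORDER BY total_views DESC
"""
results = conn.execute(query).fetchall()

# Access: the analyzed result is what users then look at.
for page, total in results:
    print(page, total)

conn.close()
```

On a real cluster the query would run across many machines, but the shape of the work is the same: store the data, run the query, and expose the result to users.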
Q&A
CT060-3-3-EMTECH Chapter 2: ET – Data Science SLIDE 25