Chapter-Two
Data Science
Overview of Data Science
Data science is a multi-disciplinary field that draws on:
• data mining, data warehousing, data modeling, big data, and related areas.
It is used to create data-centric artifacts and applications that can address specific scientific, socio-political, business, or other issues.
Information:
• is processed data on which decisions are based and
• conveys a complete meaning.
Information is interpreted data, created from data that has been:
• organized,
• structured and
• processed in a particular context.
Data Processing Cycle
(Diagram: the stages of the cycle exchange data with storage.)
Input
• Data is fed in a format that depends on the purpose of processing and on the processing machine.
Process
• In this step the input data is further processed into a more useful form.
Output
• The output of a particular process is either the final information required, or may be used as input for another process.
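The three stages above can be sketched in Python; the stage functions and the doubling transform are illustrative assumptions, not from the slides:

```python
# A minimal sketch of the input -> process -> output cycle.

def read_input(raw):
    """Input: accept raw data in a format the machine can process."""
    return [line.strip() for line in raw.splitlines() if line.strip()]

def process(records):
    """Process: turn the input into a more useful form."""
    return [int(r) * 2 for r in records]

def produce_output(values):
    """Output: the result may be final, or input for another process."""
    return sum(values)

result = produce_output(process(read_input("1\n2\n3\n")))  # 12
```

Chaining the output of one stage into the next mirrors how one process's output can serve as another's input.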
Data: types and representations
Data types can be described from different perspectives:
• From computer programming: data types are attributes of data that tell the compiler or interpreter how the data is to be used.
• From data analytics: data types simply tell us how the data exists.
• Structured: obeys a pre-defined data model and is straightforward to interpret. E.g. tabular data.
• Semi-structured (self-describing): a form of structured data that does not conform to the formal structure of a data model, but contains tags or other markers expressing semantic relations. E.g. XML.
• Metadata (data about data): e.g. in a photograph, the size, location, time, etc. are metadata.
• Metadata is highly applicable in semantic webs, big data, etc.
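The contrast can be shown with the standard library; the tag and field names below are illustrative assumptions, not from the slides:

```python
# The same record, structured vs semi-structured.
import xml.etree.ElementTree as ET

# Structured: a pre-defined data model (fixed columns -> values).
row = {"id": 1, "title": "sunset.jpg", "size_kb": 420}

# Semi-structured: tags and attributes act as markers that carry
# the semantic relations, so the document describes itself.
doc = ET.fromstring(
    "<photo id='1'><title>sunset.jpg</title>"
    "<meta size_kb='420'/></photo>"
)
title = doc.find("title").text
size_kb = doc.find("meta").get("size_kb")
```

Note how the XML document can add or omit markers per record, while the structured row is bound to its schema.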
Data value chain
Big data is a set of strategies and technologies required to:
• gather,
• organize,
• process and
• gain insights from large datasets.
The data value chain describes the flow of information within a big data system.
Data acquisition
Data acquisition is the process of:
• gathering,
• filtering and
• cleaning data before it is put in a data warehouse or processed further.
Data acquisition is a major challenge in big data, because the infrastructure:
• should support low, predictable latency in both capturing data and executing queries,
• should support dynamic and flexible data structures, and
• should handle very high transaction volumes.
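The three acquisition steps can be sketched as a small in-memory pipeline; the sources and field names are illustrative assumptions:

```python
# Gather -> filter -> clean, on toy records.

def gather(sources):
    """Gather: pull raw records from every source."""
    for source in sources:
        yield from source

def keep(record):
    """Filter: drop records that are irrelevant or malformed."""
    return record.get("value") is not None

def clean(record):
    """Clean: normalize the record before it is stored."""
    return {"id": record["id"], "value": float(record["value"])}

sources = [
    [{"id": 1, "value": "3.5"}, {"id": 2, "value": None}],
    [{"id": 3, "value": "7"}],
]
acquired = [clean(r) for r in gather(sources) if keep(r)]
```

Using a generator for the gather step keeps latency low and memory flat even when the sources are large streams.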
Data analysis
Data analysis involves:
• exploring,
• transforming and
• modeling data in order to make the raw data amenable to decision making.
The goals of data analysis are:
• highlighting relevant data,
• synthesizing and
• extracting useful hidden information.
Areas related to data analysis include:
• data mining,
• business intelligence and
• machine learning.
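The explore, transform and model steps can be sketched on a toy series using only the standard library; the data and the least-squares model are illustrative assumptions:

```python
# Explore -> transform -> model on a made-up series.
from statistics import mean

raw = ["2", "4", "6", "8"]               # raw data arrives as strings

# Transform: convert the raw data into a form amenable to analysis.
y = [float(v) for v in raw]
x = list(range(len(y)))

# Explore: summarize the data.
x_mean, y_mean = mean(x), mean(y)

# Model: a least-squares slope extracts the hidden trend.
slope = (
    sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    / sum((xi - x_mean) ** 2 for xi in x)
)
```

Here the "hidden information" is the constant growth rate (slope of 2 per step) that is not visible in the raw strings.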
Data curation
Data curation refers to the active management of data to ensure its quality.
Data curation includes activities such as:
• content creation,
• selection,
• classification,
• transformation,
• validation and
• preservation of data.
Data curation is done by data curators.
Data curators are responsible for improving the accessibility and quality of data.
The goals of data curation are:
• ensuring trustworthiness,
• making data discoverable,
• easing accessibility,
• improving data reusability and
• making data fit for purpose.
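Two of the curation activities, validation and classification, can be sketched on toy records; the schema and the labeling rule are illustrative assumptions:

```python
# Validate records against a schema, then classify the survivors.

SCHEMA = {"id": int, "title": str}

def validate(record):
    """Validation: check the record against the expected schema."""
    return all(isinstance(record.get(k), t) for k, t in SCHEMA.items())

def classify(record):
    """Classification: tag the record so it is easier to discover."""
    return "long" if len(record["title"]) > 5 else "short"

records = [{"id": 1, "title": "report"}, {"id": "x", "title": "notes"}]
curated = [dict(r, label=classify(r)) for r in records if validate(r)]
```

Rejecting the record with the malformed `id` improves trustworthiness, and the added label makes the data easier to discover.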
Data storage
Data storage:
• is the persistence and management of data in a scalable way.
• It guarantees applications fast access to the data.
Relational Database Management Systems (RDBMS):
• RDBMS have been the main solution for data storage for almost 40 years.
• RDBMS guarantee a set of properties called ACID (atomicity, consistency, isolation and durability).
• Enforcing the ACID properties makes RDBMS inflexible with regard to schema changes, fault tolerance and growth in data volume (complexity).
• This lack of flexibility makes RDBMS unsuitable for big data science.
The ACID properties in a DBMS are:
• Atomicity: the entire transaction takes place at once or does not happen at all.
• Consistency: the database must be consistent before and after the transaction.
• Isolation: multiple transactions occur independently, without interference.
• Durability: changes made by a successful transaction persist even if a system failure happens.
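Atomicity can be demonstrated with `sqlite3` from the Python standard library; the accounts table and the simulated mid-transaction failure are illustrative assumptions:

```python
# A failed transaction is rolled back in its entirety.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
con.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
con.commit()

try:
    with con:  # opens a transaction; rolls back on any exception
        con.execute(
            "UPDATE accounts SET balance = balance - 50 WHERE name = 'a'"
        )
        raise RuntimeError("failure mid-transaction")
except RuntimeError:
    pass

# The partial debit was rolled back, so the balance is unchanged.
balance = con.execute(
    "SELECT balance FROM accounts WHERE name = 'a'"
).fetchone()[0]
```

The transaction either happens in full or not at all: despite the executed `UPDATE`, the account balance is still 100 after the rollback.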
NoSQL data storage technologies were designed as alternative data models to support flexibility and scalability in data storage.
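The schema flexibility that NoSQL document models allow can be sketched with plain dictionaries; the collection and its fields are illustrative assumptions, not any particular NoSQL API:

```python
# Documents in the same collection need not share a fixed schema.

collection = []

def insert(doc):
    """No pre-defined schema: any mapping can be stored as-is."""
    collection.append(dict(doc))

insert({"id": 1, "title": "sensor reading", "value": 21.5})
insert({"id": 2, "title": "photo", "tags": ["sunset", "beach"]})  # new field

# Queries must therefore tolerate missing fields.
tagged = [d for d in collection if "tags" in d]
```

Adding the `tags` field required no schema change, which is exactly the flexibility a fixed relational schema lacks.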
Data usage
Data usage covers the data-driven business activities that need:
• access to data,
• its analysis, and
• the tools to integrate the data analysis into the business activity.
Data usage in business decision making can enhance competitiveness through:
• reduction of costs,
• increased added value, or
• any other parameter that can be measured against existing performance criteria.
Big data: Basic concepts
Big data is a large and complex collection of data sets.
Why big data? Because:
• the volume of data has increased drastically over time, and
• the data sets in organizations have become so large that it is difficult (almost impossible) to process them using on-hand database management tools or traditional data processing applications.
Due to the advent of new technologies, devices, and means of communication such as social networking sites and the IoT, the amount of data produced by mankind is growing rapidly every year.
The amount of data produced by us:
• Before 2003: 5 billion GB in total
• In 2011: 5 billion GB every 2 days
• In 2013: 5 billion GB every 10 minutes
If this data were stored on disks and the disks piled up, they could fill an entire football field.
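A back-of-envelope calculation on the figures above shows how steep the acceleration is (treating 5 billion GB as 5e9 GB):

```python
# Implied daily production rates from the 2011 and 2013 figures.
FIVE_BILLION_GB = 5e9

per_day_2011 = FIVE_BILLION_GB / 2                 # 5B GB every 2 days
per_day_2013 = FIVE_BILLION_GB * (24 * 60 / 10)    # 5B GB every 10 minutes

speedup = per_day_2013 / per_day_2011              # 288x in two years
```

A 288-fold increase in the daily rate over just two years is why traditional tools cannot keep up.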
Big data is characterized by 3Vs and more:
• Volume: large amounts of data (zettabytes, massive datasets)
• Velocity: data in live streams or in motion
• Variety: data comes in many different forms from diverse sources
• Veracity: can we trust the data? How accurate is it? etc.
Clustered computing and Hadoop
Clustered computing:
• With big data, individual computers are inadequate for the computation.
• Therefore, clustering emerged to address the computational and storage needs of big data.
• Big data clustering software combines the resources of many smaller machines.
Advantages of clustered computing:
• Resource pooling: combining the available storage space, CPU and memory for processing large datasets.
• High availability: clustering provides a fault-tolerant and robust computing environment, increasing availability.
• Easy scalability: clusters are easily scalable horizontally, by adding additional resources to the cluster.
The Hadoop ecosystem evolved from its four core components.
Generally, the Hadoop ecosystem consists of:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: programming-based data processing
• Spark: in-memory data processing
• PIG, HIVE: query-based processing of data services
• HBase: NoSQL database
• Mahout, Spark MLlib: machine learning algorithm libraries
• Solr, Lucene: searching and indexing
• Zookeeper: managing the cluster
• Oozie: job scheduling
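The MapReduce model listed above can be sketched in-memory in Python; this is a toy illustration of the programming model (map emits key-value pairs, shuffle groups them by key, reduce aggregates), not the Hadoop API:

```python
# Word count, the canonical MapReduce example.
from collections import defaultdict

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in doc.split():
        yield word, 1

def reduce_phase(groups):
    """Reduce: aggregate the values collected under each key."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data", "big cluster"]

# Shuffle: group the mapped pairs by key.
groups = defaultdict(list)
for doc in docs:
    for word, count in map_phase(doc):
        groups[word].append(count)

word_counts = reduce_phase(groups)
```

On a real cluster, map tasks run on many machines in parallel and the shuffle moves pairs across the network, but the three phases are exactly these.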
Big data lifecycle with Hadoop
End of Chapter-Two
Reading assignment: List the AI applications that you have encountered in your life.