Introduction
● Data science is now one of the most influential fields today.
● Companies and enterprises are focusing heavily on attracting data
science talent, which in turn creates more viable roles in the data science
industry.
● Data science is a multi-disciplinary field that uses scientific
methods, processes, algorithms, and systems to extract knowledge
and insights from structured, semi-structured and unstructured data.
● Example: even a simple transaction, such as buying a box of cereal at a store or
supermarket, generates data.
Data Science vs. Data Scientist
• Data Science is defined as the extraction of actionable
knowledge directly from data through a process of
discovery, hypothesis formulation, and hypothesis
testing.
• It is the process of effectively producing, or helping to
produce, a tool, method, or other product that
derives intelligence from datasets too large or complex
to analyze by traditional means.
Data Science vs. Data Scientist
• A data scientist (a job title) is a person engaged in a
systematic activity to acquire knowledge from data.
• In a more restricted sense, a data scientist may refer to
an individual who uses the scientific method on
existing data.
• Data Scientists perform research toward a more
comprehensive understanding of products, systems, or
nature, including physical, mathematical and social
realms.
Role of a Data Scientist
• Advance skills in analyzing large amounts of data,
data mining, and programming.
• Processed and filtered data are handed to the data
scientist and then fed to various analytics programs and
machine learning models, together with statistical methods,
to generate results that are used in predictive analysis and
other fields.
• Explore the data for hidden patterns to derive proper
insights.
Data Science
• The scientific method requires data to begin iterating
towards a more convincing hypothesis.
• Science doesn’t exist without data.
• A data scientist possesses:
• a strong quantitative background in statistics and linear algebra
• programming knowledge with a focus on data warehousing,
mining, and modeling to build and analyze algorithms
Algorithms
• An algorithm is a set of instructions designed to
perform a specific task.
• This can be a simple process, such as multiplying two
numbers, or a complex operation, such as playing a
compressed video file.
• Search engines use proprietary algorithms to display
the most relevant results from their search index for
specific queries.
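As a minimal illustration (a sketch added here, not part of the original slides), the simple case mentioned above, multiplying two numbers, can be written as an explicit sequence of steps:

```python
def multiply(a, b):
    """Multiply two non-negative integers by repeated addition."""
    total = 0
    for _ in range(b):   # each iteration is one step of the algorithm
        total += a
    return total

print(multiply(6, 7))    # prints 42
```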
Data vs. Information
• Data
• Can be defined as a representation of facts, concepts, or instructions in a
formalized manner, which should be suitable for communication,
interpretation, or processing, by human or electronic machines.
• It can be described as unprocessed facts and figures
• It is represented with the help of characters such as letters (A-Z, a-z), digits
(0-9), or special characters (+, -, /, *, <, >, =, etc.)
• Information
• The processed data on which decisions and actions are based
• Information is interpreted data; created from organized, structured, and
processed data in a particular context
Data Processing Cycle
• Data processing is the conversion of raw data to meaningful
information through a process.
• Data is manipulated to produce results that lead to a resolution of a
problem or improvement of an existing situation.
• The process includes activities like data entry/input,
calculation/process, output and storage
• Input is the task where verified data is coded or converted into
machine-readable form so that it can be processed by a computer.
Data entry is done through the use of a keyboard, digitizer, scanner, or
data entry from an existing source.
Data Processing Cycle
• Processing is the stage where the data is subjected to various means and methods
of manipulation; it is the point at which a computer program is executed,
holding the program code and its current activity.
• Output and interpretation is the stage where the processed information is
transmitted to the user. Output is presented to users in various formats such as
printed reports, audio, video, or display on a monitor.
• Storage is the last stage in the data processing cycle, where data,
instruction and information are held for future use. The importance of
this cycle is that it allows quick access and retrieval of the processed
information, allowing it to be passed on to the next stage directly, when
needed.
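A minimal sketch of this cycle (illustrative only; the readings and file name are hypothetical), walking one small dataset through input, processing, output, and storage:

```python
import json

# Input: verified raw data is converted into machine-readable form.
raw_readings = ["21.5", "22.0", "19.8", "23.1"]   # hypothetical sensor readings
data = [float(r) for r in raw_readings]

# Processing: the data is manipulated to produce meaningful information.
average = sum(data) / len(data)

# Output and interpretation: the processed information is presented to the user.
print(f"Average reading: {average:.1f}")

# Storage: data and information are held for future use.
with open("readings.json", "w") as f:
    json.dump({"readings": data, "average": average}, f)
```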
Data types
• A data type is a way to tell the compiler which kind of data (integer, character, float, etc.)
is to be stored and, consequently, how much memory to allocate.
• A data type is also a way to tell the compiler that a cell x in memory may only hold bit
values from some range y; it restricts the compiler from storing anything outside that
value range.
• Common data types include
• Integers (int): used to store whole numbers, mathematically known as integers
• Booleans (bool): used to represent one of two values: true or false
• Characters (char): used to store a single character
• Floating-point numbers (float): used to store real numbers
• Alphanumeric strings (string): used to store a combination of characters and numbers
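The common data types above can be sketched in Python (a rough analogy only: Python assigns types to values at run time rather than instructing a compiler how much memory to allocate):

```python
count = 42          # integer (int): a whole number
is_valid = True     # boolean (bool): one of two values, True or False
grade = "A"         # character: a single character (Python stores it as a 1-character string)
price = 9.99        # floating-point number (float): a real number
user_id = "user42"  # alphanumeric string (str): a mix of characters and digits

for value in (count, is_valid, grade, price, user_id):
    print(type(value).__name__, value)
```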
Data representation
• Types are an abstraction that lets us model things in categories;
a type is largely a mental construct.
• All computers represent data as nothing more than strings of
ones and zeros.
• In order for those ones and zeros to convey any meaning,
they need to be contextualized.
• Data types provide that context.
• E.g. 01100001
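The bit pattern 01100001 illustrates this: interpreted as an unsigned integer it is 97, while interpreted as an ASCII character it is the letter 'a'. A small sketch:

```python
bits = "01100001"

as_int = int(bits, 2)   # read the bits as an unsigned integer -> 97
as_char = chr(as_int)   # read the same bits as an ASCII character -> 'a'

print(as_int, as_char)  # 97 a
```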
Data types from Data Analytics perspective
“Big Data” is data whose scale, diversity, and complexity require new
architecture, techniques, algorithms, and analytics to manage it and
extract value and hidden knowledge from it…
Big data
• Big data is the term for a collection of data sets so large and complex
that it becomes difficult to process using on-hand database
management tools or traditional data processing applications.
• In other words, data in the range of hundreds of terabytes or petabytes falls under
Big Data.
• However, what matters is not merely the amount of data, but what the
organization does with that data.
• Big Data is analyzed for insights that lead to better decisions.
Big data
• Big Data is associated with the concept of the 3 Vs, that is,
volume, velocity, and variety. Big data is characterized
by these 3 Vs and more:
• Volume: large amounts of data (zettabytes / massive datasets)
• Velocity: Data is live streaming or in motion
• Variety: data comes in many different forms from diverse
sources
• Veracity: can we trust the data? How accurate is it? etc.
Clustered Computing
• Cluster computing addresses the latest results in the fields that
support High-Performance Distributed Computing (HPDC).
• Clustering methods have been identified, such as HPC IaaS and HPC PaaS,
which are more expensive and difficult to set up and maintain than a
single computer.
• In HPDC environments, parallel and/or distributed computing
techniques are applied to the solution of computationally intensive
applications across networks of computers.
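As a small single-machine analogy of this idea (a sketch only, not a full HPDC setup), Python's multiprocessing module can spread a computationally intensive task across several worker processes:

```python
from multiprocessing import Pool

def heavy_task(n):
    """Stand-in for a computationally intensive job."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    workloads = [1_000_000, 2_000_000, 3_000_000, 4_000_000]
    # Distribute the workloads across four worker processes in parallel.
    with Pool(processes=4) as pool:
        results = pool.map(heavy_task, workloads)
    print(results)
```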
Clustered Computing
• “Computer cluster” basically refers to a set of connected
computers working together.
• The cluster represents one system and the objective is to
improve performance.
• The computers are generally connected in a LAN (Local Area
Network).
• So, when this cluster of computers works to perform some
tasks and gives the impression of being a single entity, it is
called “cluster computing”.
Clustered Computing
• Big data clustering software combines the resources of many
smaller machines, seeking to provide a number of benefits:
• Resource Pooling:
• Combining the available storage space to hold data is a clear benefit, but
CPU and memory pooling are also extremely important. Processing large
datasets requires large amounts of all three of these resources.
• Object pooling is a technique that stores a group of objects (called
a pool) in memory.
• Whenever a new object needs to be created, the pool is checked first;
if an object is available it is reused. In this way, object pooling promotes reuse of
objects and system resources and improves the scalability of a program (see the sketch below).
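A minimal object-pool sketch (the class and method names here are hypothetical, for illustration only): objects are taken from the pool and returned to it instead of being created anew each time.

```python
class ObjectPool:
    """Minimal object pool: reuse objects instead of recreating them."""

    def __init__(self, factory, size=3):
        self._factory = factory
        self._pool = [factory() for _ in range(size)]  # pre-created objects

    def acquire(self):
        # Reuse an object from the pool if available; otherwise create one.
        return self._pool.pop() if self._pool else self._factory()

    def release(self, obj):
        # Return the object to the pool so it can be reused later.
        self._pool.append(obj)


pool = ObjectPool(factory=dict, size=2)
obj = pool.acquire()   # reused from the pool, not newly created
pool.release(obj)      # made available for the next caller
```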
Clustered Computing
• High Availability: In computing, the term availability is used to describe the
period of time when a service is available, as well as the time required by a system
to respond to a request made by a user. High availability is a quality of a system or
component that assures a high level of operational performance for a given period
of time.
• Clusters can provide varying levels of fault tolerance and availability guarantees
to prevent hardware or software failures from affecting access to data and
processing. This becomes increasingly important as we continue to emphasize the
importance of real-time analytics.
• Easy Scalability: Clusters make it easy to scale horizontally by adding additional
machines to the group. This means the system can react to changes in resource
requirements without expanding the physical resources on a machine.
Hadoop and its Ecosystem
• Hadoop is an open-source framework intended to make interaction with big
data easier. It is a framework that allows for the distributed processing of large
datasets across clusters of computers using simple programming models.
• The four key characteristics of Hadoop are:
• Economical: Its systems are highly economical as ordinary computers can be used for
data processing.
• Reliable: It is reliable as it stores copies of the data on different machines and is
resistant to hardware failure.
• Scalable: It is easily scalable both horizontally and vertically. A few extra nodes help
in scaling up the framework.
• Flexible: It is flexible, and you can store as much structured and unstructured data as
you need and decide how to use it later.
Hadoop and its Ecosystem
● Hadoop has an ecosystem that has evolved from its four core components: data
management, access, processing, and storage.
● It is continuously growing to meet the needs of Big Data.
● It comprises the following components and many others:
○ PIG, HIVE: query-based processing of data services
■ The first stage of Big Data processing is Ingest. The data is ingested or transferred to
Hadoop from various sources such as relational databases, systems, or local files.
Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event data.
■ The second stage is Processing. In this stage, the data is stored and processed. The data
is stored in the distributed file system HDFS and in the NoSQL distributed database HBase.
Spark and MapReduce perform the data processing.
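As an illustration of the processing stage (a sketch, assuming a local PySpark installation; the input path is hypothetical), a MapReduce-style word count in Spark:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

counts = (
    spark.sparkContext.textFile("input.txt")   # read lines from HDFS or local storage
    .flatMap(lambda line: line.split())        # map: split each line into words
    .map(lambda word: (word, 1))               # map: emit (word, 1) pairs
    .reduceByKey(lambda a, b: a + b)           # reduce: sum the counts per word
)

for word, count in counts.collect():
    print(word, count)

spark.stop()
```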
Big Data Life Cycle with Hadoop