
Artificial Intelligence

Introduction to Artificial Intelligence: Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and act
like humans. It involves the development of algorithms and computer programs that can
perform tasks that typically require human intelligence such as visual perception,
speech recognition, decision-making, and language translation. AI has the potential to
revolutionize many industries and has a wide range of applications, from virtual
personal assistants to self-driving cars.
Before turning to the meaning of artificial intelligence, let us first understand what intelligence means.
Intelligence: The ability to learn and solve problems. This definition is taken from
Webster's Dictionary.

The most common answer one expects is “to make computers intelligent so that
they can act intelligently!”, but how intelligent, and how can one judge intelligence?

The usual benchmark is: as intelligent as humans. If computers can somehow solve real-world
problems by improving on their own from past experience, they would be called “intelligent”.
Thus, AI systems are more generic (rather than specific), can “think”, and are more
flexible.

Intelligence, as we know, is the ability to acquire and apply knowledge. Knowledge is
the information acquired through experience. Experience is the knowledge gained
through exposure (training). Summing the terms up, we get artificial intelligence as the
“copy of something natural (i.e., human beings) ‘WHO’ is capable of acquiring and
applying the information it has gained through exposure.”
The main focus of artificial intelligence is understanding human behavior and
performance. This can be done by creating computers with human-like intelligence and
capabilities, including natural language processing, facial analysis, and robotics. The
main applications of AI today are in the military, healthcare, and computing, and these
applications are expected to spread further and become part of our everyday lives.
Many theorists believe that computers will one day surpass human intelligence: they will
be able to learn faster, process information more effectively, and make decisions faster
than humans. However, this is still a work in progress, as there are many limitations to
what artificial intelligence has achieved. For example, computers do not perform well in
dangerous or extreme environments, and they still struggle with physical tasks such as
driving cars or operating heavy machinery. Even so, there are many exciting things ahead for
artificial intelligence!
Uses of Artificial Intelligence:
Artificial Intelligence has many practical applications across various industries and
domains, including:
1. Healthcare: AI is used for medical diagnosis, drug discovery, and predictive
analysis of diseases.
2. Finance: AI helps in credit scoring, fraud detection, and financial
forecasting.
3. Retail: AI is used for product recommendations, price optimization, and
supply chain management.
4. Manufacturing: AI helps in quality control, predictive maintenance, and
production optimization.
5. Transportation: AI is used for autonomous vehicles, traffic prediction, and
route optimization.
6. Customer service: AI-powered chatbots are used for customer support,
answering frequently asked questions, and handling simple requests.
7. Security: AI is used for facial recognition, intrusion detection, and
cybersecurity threat analysis.
8. Marketing: AI is used for targeted advertising, customer segmentation, and
sentiment analysis.
9. Education: AI is used for personalized learning, adaptive testing, and
intelligent tutoring systems.

Drawbacks of Artificial Intelligence:

1. Bias and unfairness: AI systems can perpetuate and amplify existing biases
in data and decision-making.
2. Lack of transparency and accountability: Complex AI systems can be
difficult to understand and interpret, making it challenging to determine how
decisions are being made.
3. Job displacement: AI has the potential to automate many jobs, leading to job
loss and a need for reskilling.
4. Security and privacy risks: AI systems can be vulnerable to hacking and
other security threats, and may also pose privacy risks by collecting and using
personal data.
5. Ethical concerns: AI raises important ethical questions about the use of
technology for decision-making, including issues related to autonomy,
accountability, and human dignity.
Technologies Based on Artificial Intelligence:
1. Machine Learning: A subfield of AI that uses algorithms to enable systems
to learn from data and make predictions or decisions without being explicitly
programmed.
2. Natural Language Processing (NLP): A branch of AI that focuses on
enabling computers to understand, interpret, and generate human language.
3. Computer Vision: A field of AI that deals with the processing and analysis
of visual information using computer algorithms.
4. Robotics: AI-powered robots and automation systems that can perform tasks
in manufacturing, healthcare, retail, and other industries.
5. Neural Networks: A type of machine learning algorithm modeled after the
structure and function of the human brain.
6. Expert Systems: AI systems that mimic the decision-making ability of a
human expert in a specific field.
7. Chatbots: AI-powered virtual assistants that can interact with users through
text-based or voice-based interfaces.

Definitions of AI: Artificial Intelligence (AI) refers to the simulation of
human intelligence in machines that are programmed to think and act like
humans. It encompasses a broad range of techniques and approaches
aimed at enabling computers to perform tasks that typically require human
intelligence.

Agents in Artificial Intelligence

An AI system can be defined as the study of the rational agent and its environment.
The agents sense the environment through sensors and act on their environment
through actuators. An AI agent can have mental properties such as knowledge,
belief, intention, etc.

What is an Agent?

An agent can be anything that perceives its environment through sensors and acts
upon that environment through actuators. An agent runs in the cycle
of perceiving, thinking, and acting. An agent can be:
o Human-Agent: A human agent has eyes, ears, and other organs which work as
sensors, and hands, legs, and a vocal tract which work as actuators.
o Robotic Agent: A robotic agent can have cameras, an infrared range finder, and
NLP for sensors, and various motors for actuators.
o Software Agent: A software agent can have keystrokes and file contents as
sensory input, act on those inputs, and display output on the screen.

Hence the world around us is full of agents such as thermostats, cellphones, and
cameras, and even we ourselves are agents.

Before moving forward, we should first know about sensors, effectors, and actuators.

Sensor: A sensor is a device which detects changes in the environment and sends
the information to other electronic devices. An agent observes its environment
through sensors.

Actuators: Actuators are the components of machines that convert energy into
motion. The actuators are responsible for moving and controlling a system. An
actuator can be an electric motor, gears, rails, etc.

Effectors: Effectors are the devices which affect the environment. Effectors can be
legs, wheels, arms, fingers, wings, fins, and display screens.

Intelligent Agents:
An intelligent agent is an autonomous entity which acts upon an environment
using sensors and actuators to achieve goals. An intelligent agent may learn from
the environment to achieve its goals. A thermostat is an example of an intelligent
agent.

Following are the main four rules for an AI agent:

o Rule 1: An AI agent must have the ability to perceive the environment.
o Rule 2: The observation must be used to make decisions.
o Rule 3: Decisions should result in an action.
o Rule 4: The action taken by an AI agent must be a rational action.

Structure of an AI Agent

The task of AI is to design an agent program which implements the agent function.
The structure of an intelligent agent is a combination of architecture and agent
program. It can be viewed as:

Agent = Architecture + Agent program

Following are the main three terms involved in the structure of an AI agent:

Architecture: Architecture is the machinery that the AI agent executes on.

Agent Function: The agent function maps a percept sequence to an action:

f : P* → A

Agent program: The agent program is an implementation of the agent function. An
agent program executes on the physical architecture to produce the function f.
There are many examples of agents in artificial
intelligence. Here are a few:
 Intelligent personal assistants: These are agents that are designed to help
users with various tasks, such as scheduling appointments, sending messages,
and setting reminders. Examples of intelligent personal assistants include Siri,
Alexa, and Google Assistant.
 Autonomous robots: These are agents that are designed to operate
autonomously in the physical world. They can perform tasks such as cleaning,
sorting, and delivering goods. Examples of autonomous robots include the
Roomba vacuum cleaner and the Amazon delivery robot.
 Gaming agents: These are agents that are designed to play games, either
against human opponents or other agents. Examples of gaming agents include
chess-playing agents and poker-playing agents.
 Fraud detection agents: These are agents that are designed to detect
fraudulent behavior in financial transactions. They can analyze patterns of
behavior to identify suspicious activity and alert authorities. Examples of fraud
detection agents include those used by banks and credit card companies.
 Traffic management agents: These are agents that are designed to manage
traffic flow in cities. They can monitor traffic patterns, adjust traffic lights, and
reroute vehicles to minimize congestion. Examples of traffic management
agents include those used in smart cities around the world.
 A software agent has keystrokes, file contents, and received network packets
acting as sensors, and displays on the screen, files, and sent network packets
acting as actuators.
 A human agent has eyes, ears, and other organs which act as sensors, and
hands, legs, mouth, and other body parts which act as actuators.
 A robotic agent has cameras and infrared range finders which act as sensors,
and various motors which act as actuators.

Types of Agents
Agents can be grouped into the following classes based on their degree of perceived
intelligence and capability:
 Simple Reflex Agents
 Model-Based Reflex Agents
 Goal-Based Agents
 Utility-Based Agents
 Learning Agent
 Multi-agent systems
 Hierarchical agents
Simple Reflex Agents

Simple reflex agents ignore the rest of the percept history and act only on the
basis of the current percept. Percept history is the history of all that an agent has
perceived to date. The agent function is based on the condition-action rule. A
condition-action rule is a rule that maps a state i.e., a condition to an action. If the
condition is true, then the action is taken, else not. This agent function only
succeeds when the environment is fully observable. For simple reflex agents
operating in partially observable environments, infinite loops are often
unavoidable. It may be possible to escape from infinite loops if the agent can
randomize its actions.
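As a small illustration of the condition-action rule idea (a sketch with assumed percepts and actions, not taken from the text above), a simple reflex agent for a two-location vacuum world could look like this in Python:

# Simple reflex agent: acts only on the current percept using
# condition-action rules. The percept format (location, status) is assumed.

def simple_reflex_vacuum_agent(percept):
    location, status = percept      # current percept only, no percept history
    if status == "Dirty":           # rule: if dirty then suck
        return "Suck"
    elif location == "A":           # rule: if clean at A then move right
        return "Right"
    else:                           # rule: if clean at B then move left
        return "Left"

print(simple_reflex_vacuum_agent(("A", "Dirty")))   # Suck
print(simple_reflex_vacuum_agent(("B", "Clean")))   # Left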

Rational Agent:

A rational agent is an agent which has clear preferences, models uncertainty, and
acts in a way that maximizes its performance measure over all possible actions.

A rational agent is said to do the right thing. AI is about creating rational agents
for use in game theory and decision theory for various real-world scenarios.

For an AI agent, rational action is most important because, in reinforcement
learning, the agent gets a positive reward for each best possible action and a
negative reward for each wrong action.

Model-Based Reflex Agents

It works by finding a rule whose condition matches the current situation. A
model-based agent can handle partially observable environments through the use of
a model of the world. The agent has to keep track of an internal state which is
adjusted by each percept and depends on the percept history. The current state is
stored inside the agent, which maintains some kind of structure describing the part
of the world which cannot be seen.
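A minimal sketch, with an invented world model and percept format, of how an internal state lets a model-based reflex agent act on information it can no longer observe directly:

# Model-based reflex agent sketch: the internal state is updated from each
# percept, and the condition-action rules are applied to that state.

class ModelBasedReflexAgent:
    def __init__(self):
        self.state = {}                      # internal model: location -> last status

    def update_state(self, percept):
        location, status = percept
        self.state[location] = status        # remember what was observed

    def choose_action(self):
        for location, status in self.state.items():
            if status == "Dirty":            # rule: a known-dirty location needs cleaning
                return "GoTo(" + location + ") and Suck"
        return "Explore"

    def program(self, percept):
        self.update_state(percept)
        return self.choose_action()

agent = ModelBasedReflexAgent()
print(agent.program(("A", "Dirty")))   # GoTo(A) and Suck
print(agent.program(("B", "Clean")))   # still remembers that A was dirty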

Goal-Based Agents

These kinds of agents take decisions based on how far they are currently from
their goal(description of desirable situations). Their every action is intended to
reduce their distance from the goal. This allows the agent a way to choose among
multiple possibilities, selecting the one which reaches a goal state. The
knowledge that supports its decisions is represented explicitly and can be
modified, which makes these agents more flexible. They usually require search
and planning. The goal-based agent’s behavior can easily be changed.

Utility-Based Agents
Agents which are developed with their end use (utility) as the building block are
called utility-based agents. When there are multiple possible alternatives,
utility-based agents are used to decide which one is best. They choose actions based
on a preference (utility) for each state. Sometimes achieving the desired goal is
not enough. We may look for a quicker, safer, cheaper trip to reach a destination.
Agent happiness should be taken into consideration. Utility describes
how “happy” the agent is. Because of the uncertainty in the world, a utility agent
chooses the action that maximizes the expected utility. A utility function maps a
state onto a real number which describes the associated degree of happiness.
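A minimal sketch, assuming a hand-written utility function and a known probabilistic outcome model, of how a utility-based agent picks the action with the highest expected utility:

# Expected-utility maximization sketch. The outcome model and utility
# values below are invented purely for illustration.

def utility(state):
    # Maps a state onto a real number describing its degree of "happiness".
    return {"arrived_fast": 10.0, "arrived_slow": 6.0, "accident": -100.0}[state]

# For each action, a list of (probability, resulting state) pairs.
outcomes = {
    "drive_fast": [(0.90, "arrived_fast"), (0.10, "accident")],
    "drive_slow": [(0.99, "arrived_slow"), (0.01, "accident")],
}

def expected_utility(action):
    return sum(p * utility(s) for p, s in outcomes[action])

best = max(outcomes, key=expected_utility)
print(best, expected_utility(best))   # drive_slow 4.94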

Learning Agent

A learning agent in AI is the type of agent that can learn from its past experiences
or it has learning capabilities. It starts to act with basic knowledge and then is
able to act and adapt automatically through learning. A learning agent has mainly
four conceptual components, which are:
1. Learning element: It is responsible for making improvements by learning
from the environment.
2. Critic: The learning element takes feedback from critics which describes how
well the agent is doing with respect to a fixed performance standard.
3. Performance element: It is responsible for selecting external action.
4. Problem Generator: This component is responsible for suggesting actions
that will lead to new and informative experiences.

Multi-Agent Systems

These agents interact with other agents to achieve a common goal. They may
have to coordinate their actions and communicate with each other to achieve their
objective.
A multi-agent system (MAS) is a system composed of multiple interacting agents
that are designed to work together to achieve a common goal. These agents may
be autonomous or semi-autonomous and are capable of perceiving their
environment, making decisions, and taking action to achieve the common
objective.
MAS can be used in a variety of applications, including transportation systems,
robotics, and social networks. They can help improve efficiency, reduce costs,
and increase flexibility in complex systems.
MAS can be implemented using different techniques, such as game
theory, machine learning, and agent-based modeling. Game theory is used to
analyze strategic interactions between agents and predict their behavior. Machine
learning is used to train agents to improve their decision-making capabilities over
time. Agent-based modeling is used to simulate complex systems and study the
interactions between agents.
Overall, multi-agent systems are a powerful tool in artificial intelligence that can
help solve complex problems and improve efficiency in a variety of applications.

Hierarchical Agents
These agents are organized into a hierarchy, with high-level agents overseeing the
behavior of lower-level agents. The high-level agents provide goals and constraints,
while the low-level agents carry out specific tasks. This structure allows for more
efficient and organized decision-making in complex environments.
 Hierarchical agents can be implemented in a variety of applications, including
robotics, manufacturing, and transportation systems. They are particularly
useful in environments where there are many tasks and sub-tasks that need to
be coordinated and prioritized.
 In a hierarchical agent system, the high-level agents are responsible for setting
goals and constraints for the lower-level agents.
Overall, hierarchical agents are a powerful tool in
artificial intelligence that can help solve complex
problems and improve efficiency in a variety of
applications.
Uses of Agents
Agents are used in a wide range of applications in
artificial intelligence, including:
 Robotics: Agents can be used to control robots and automate tasks in
manufacturing, transportation, and other industries.
 Smart homes and buildings: Agents can be used to control heating, lighting,
and other systems in smart homes and buildings, optimizing energy use and
improving comfort.
 Transportation systems: Agents can be used to manage traffic flow, optimize
routes for autonomous vehicles, and improve logistics and supply chain
management.
 Healthcare: Agents can be used to monitor patients, provide personalized
treatment plans, and optimize healthcare resource allocation.
 Finance: Agents can be used for automated trading, fraud detection, and risk
management in the financial industry.
 Games: Agents can be used to create intelligent opponents in games and
simulations, providing a more challenging and realistic experience for players.
 Natural language processing: Agents can be used for language translation,
question answering, and chatbots that can communicate with users in natural
language.
 Cybersecurity: Agents can be used for intrusion detection, malware analysis,
and network security.
 Environmental monitoring: Agents can be used to monitor and manage
natural resources, track climate change, and improve environmental
sustainability.
 Social media: Agents can be used to analyze social media data, identify trends
and patterns, and provide personalized recommendations to users.

Problem Solving



A reflex agent in AI directly maps states to actions. When such agents fail to operate in
an environment because the mapping is too large to store or compute directly, the problem
is handed to a problem-solving agent, which breaks the large problem into smaller
sub-problems and solves them one by one. The final integrated sequence of actions produces
the desired outcome.
Depending on the problem and its working domain, different types of problem-solving agents
are defined and used at an atomic level, without any visible internal state, together with a
problem-solving algorithm. A problem-solving agent works by precisely defining the problem
and its possible solutions. So we can say that problem solving is the part of artificial
intelligence that encompasses a number of techniques, such as trees, B-trees, and heuristic
algorithms, to solve a problem.
We can also say that a problem-solving agent is a result-driven agent that always focuses
on satisfying its goals.
There are basically three types of problem in artificial intelligence:
1. Ignorable: In which solution steps can be ignored.
2. Recoverable: In which solution steps can be undone.
3. Irrecoverable: Solution steps cannot be undone.
Steps of problem solving in AI: The problems handled by AI are directly associated with
human nature and human activities, so we need a finite number of steps to solve a problem
in a way that makes human work easier.
The following steps are required to solve a problem:
 Problem definition: Detailed specification of inputs and acceptable system
solutions.
 Problem analysis: Analyse the problem thoroughly.
 Knowledge representation: Collect detailed information about the problem
and define all possible solution techniques.
 Problem solving: Selection of the best technique.
Components to formulate the associated problem:
 Initial State: The state from which the AI agent starts working towards the
specified goal.
 Actions: The set of all possible actions available to the agent, defined as
functions that can be applied in a given state, starting from the initial state.
 Transition model: This stage of problem formulation integrates the actual
effect of the action taken in the previous stage and produces the resulting
state, which is forwarded to the next stage.
 Goal test: This stage determines whether the state reached through the
transition model is the specified goal; once the goal is achieved, the agent
stops acting and moves on to determining the cost of achieving the goal.
 Path cost: This component assigns a numerical cost to achieving the goal; it
accounts for all hardware, software, and human working costs.

Reasoning and planning are fundamental components of artificial
intelligence (AI) systems, enabling them to make decisions and solve problems
in complex environments. Here's an overview of reasoning and planning in AI:
1. Reasoning: Reasoning refers to the process of drawing conclusions or
making inferences based on available information. In AI, reasoning
involves various techniques for processing information, including logical
reasoning, probabilistic reasoning, and symbolic reasoning.
 Logical Reasoning: In logical reasoning, AI systems use rules of
logic to derive new conclusions from existing knowledge. This can
involve deductive reasoning (drawing specific conclusions from
general principles) or inductive reasoning (generalizing from
specific observations).
 Probabilistic Reasoning: Probabilistic reasoning involves
reasoning under uncertainty, where AI systems make decisions
based on probabilities. Bayesian networks, Markov models, and
probabilistic graphical models are commonly used for
probabilistic reasoning (a small worked sketch appears after this list).
 Symbolic Reasoning: Symbolic reasoning involves manipulating
symbols and rules to perform tasks such as problem-solving and
planning. Techniques like rule-based systems, expert systems, and
theorem proving are used in symbolic reasoning.
2. Planning: Planning is the process of selecting and organizing actions to
achieve a goal in a given environment. In AI, planning involves
generating a sequence of actions that lead from an initial state to a
desired goal state while satisfying constraints and objectives.
 Search Algorithms: Planning often involves searching through a
space of possible actions and states to find a sequence of actions
that achieve the goal. Search algorithms like depth-first search,
breadth-first search, A* search, and heuristic search are commonly
used for planning.
 Automated Planning: Automated planning involves developing
algorithms and systems that can automatically generate plans to
achieve goals. Techniques like state-space search, classical
planning, and constraint satisfaction are used in automated
planning.
3. Integration of Reasoning and Planning: Reasoning and planning are
closely related in AI systems. Reasoning is used to analyze the current
state of the environment and determine the appropriate actions to take,
while planning involves selecting and organizing those actions to
achieve goals. Many AI systems integrate reasoning and planning
components to make decisions and solve problems effectively.
4. Applications: Reasoning and planning are applied in various AI
applications, including robotics, autonomous vehicles, natural language
understanding, game playing, scheduling, logistics, and resource
allocation.
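To make the probabilistic reasoning mentioned in point 1 concrete (the small worked sketch referenced above), here is Bayes' rule applied to a diagnosis question; all of the numbers are assumptions chosen for the example.

# Reasoning under uncertainty with Bayes' rule:
# P(disease | positive) = P(positive | disease) * P(disease) / P(positive).

p_disease = 0.01            # prior probability of the disease
p_pos_given_disease = 0.95  # test sensitivity
p_pos_given_healthy = 0.05  # false-positive rate

p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # about 0.161

Even with a fairly accurate test, the posterior probability stays low because the disease is rare; this is the kind of conclusion a probabilistic reasoner can draw that a purely logical one cannot.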

Overall, reasoning and planning are essential capabilities in AI systems,
enabling them to make intelligent decisions and solve complex problems in
diverse domains. Advances in reasoning and planning techniques continue to
drive progress in AI research and applications.

What is an Agent?

An agent can be anything that perceives its environment through sensors and acts
upon that environment through actuators. It runs in the cycle of
perceiving, thinking, and acting.

Logical agents in artificial intelligence

Logical agents are essential in artificial intelligence as they enable
machines to make informed decisions based on a set of
rules and reasoning. They also play a crucial role in
knowledge representation, inference, and planning,
which are crucial components of AI systems.

Fundamental components of logic-based agents

1. Knowledge Representation
2. Propositional Logic
3. First-Order Logic
4. Inference and Reasoning
5. Search and Planning

1) Knowledge Representation

Knowledge representation is the process of translating real-world knowledge and
information into a format that can be understood and processed by machines. It
involves the use of symbols, rules, and structures to capture and store information
for later retrieval and use by intelligent agents.

Types of knowledge representation
Several types of knowledge representation in artificial intelligence are:

Semantic networks: Use nodes and links to represent objects.
Frames: Use structures to represent objects and their attributes.
Rules: Use if-then statements to represent knowledge.
Logic-based representations: Use formal logic to represent and reason about knowledge.

Example of a logical agent using Knowledge Representation

An example of a logical agent using Knowledge Representation is a chatbot
designed to answer customer service questions for a retail company.

The agent represents knowledge about the company's
products, policies, and procedures using a knowledge
base, which is essentially a database of facts and rules
represented in a logical language.

2) Propositional Logic
* Propositional logic is a branch of symbolic logic that deals with propositions or
statements that are either true or false.
* It uses logical connectives such as AND, OR, and NOT to combine propositions
and form more complex statements.
* Propositional logic is used in artificial intelligence to represent and reason about
knowledge in a formal and systematic way.

Syntax and semantics of propositional logic

* The syntax of propositional logic refers to the formal rules and conventions for
constructing propositions and expressions using logical connectives such as AND,
OR, and NOT.
* The semantics of propositional logic, on the other hand, deals with the meaning
and interpretation of propositions and expressions, and how they relate to truth
values (True or False).

Logical connectives and truth tables

* Logical connectives are symbols used in propositional logic to connect
propositions or statements to form more complex expressions. Examples of logical
connectives include AND, OR, and NOT.
* Truth tables are tables used to specify the truth value of a complex proposition
based on the truth values of its constituent propositions.
Example of a logical agent using Propositional Logic
* A simple example of a logical agent using propositional logic could be a home
security system that uses sensors to detect motion or other suspicious activity. The
system can be represented using propositional variables such as "motion detected"
or "window opened". For example, if "motion detected" and "window opened" are
both true, the system can infer that someone has entered the house and trigger the
alarm.
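A minimal sketch, with Python booleans standing in for the propositional variables, of the home-security rule just described, together with its truth table:

# Propositional-logic sketch for the security example above.
# Rule: alarm = "motion detected" AND "window opened".

from itertools import product

def alarm(motion_detected, window_opened):
    return motion_detected and window_opened     # logical connective AND

# Truth table for the rule
for motion, window in product([True, False], repeat=2):
    print(motion, window, "->", alarm(motion, window))

# Inference for one concrete situation
if alarm(motion_detected=True, window_opened=True):
    print("Someone has entered the house: trigger the alarm")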

3) First-Order Logic
* First-order logic is a formal system used in artificial intelligence and logic
programming to represent and reason about objects and their properties,
relationships, and functions. It extends propositional logic by introducing
quantifiers, such as "for all" and "there exists", which allow for the formal
representation of complex and relational knowledge. First-order logic is also
known as predicate logic.

Quantifiers
* Quantifiers are logical symbols used in first-order logic to express the scope or
extent of a predicate over a domain of objects. The two main quantifiers are
"for all" (∀), which expresses universal quantification, and "there exists" (∃),
which expresses existential quantification.

Example of a logical agent using first-order logic

* An example of a logical agent using first-order logic could be a grocery list
generator. The system could represent grocery items using predicates, where each
predicate applies to a specific category of products, such as Fruit(X), Dairy(X), or
Meat(X). The system could use the universal quantifier "for all" to represent
general rules, such as "for all X, if X is a fruit, then X should be stored in the
refrigerator."
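A small Python sketch of the grocery-list rule above, with predicates modelled as simple sets; this only imitates the universal rule "for all X, if Fruit(X) then store X in the refrigerator" and is not a full first-order reasoner. The item names are invented.

# Sketch of applying the rule "for all X: Fruit(X) -> Refrigerate(X)".
# Predicates are modelled as set membership.

fruit = {"apple", "banana", "grapes"}
dairy = {"milk", "yogurt"}
grocery_list = ["apple", "milk", "bread", "grapes"]

def should_refrigerate(item):
    # The universal rule applied to one object: Fruit(item) or Dairy(item)
    return item in fruit or item in dairy

for item in grocery_list:
    print(item, "-> refrigerator" if should_refrigerate(item) else "-> pantry")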

4) Inference and Reasoning

Inference is the process of deriving new knowledge or conclusions based on
existing knowledge or evidence.
Reasoning is the process of using logical or probabilistic methods to draw valid
inferences or conclusions from a set of premises or assumptions. In artificial
intelligence, inference and reasoning are used to model human thinking and
decision-making, and to develop intelligent agents that can solve problems, make
predictions, and plan actions.

Example of a logical agent using inference and reasoning

* An example of a logical agent using inference and reasoning is a diagnostic
system for identifying faults in a car. The agent could use deductive reasoning to
derive new conclusions from existing knowledge about the car's parts and
symptoms.

5) Search and Planning

* Search and planning are fundamental components of artificial intelligence that
involve finding solutions to problems and generating plans to achieve goals in a
given domain. Search algorithms are used to explore possible problem states and
find a path to a desired goal, while planning algorithms create sequences of actions
to achieve a set of objectives.

Types of search and planning algorithms

* Uninformed search algorithms, such as breadth-first search and depth-first search.
* Informed search algorithms, like A* search and iterative deepening A* search.
* Local search algorithms, like hill climbing and simulated annealing.
* Classical planning algorithms, such as forward-chaining and backward-chaining.
* Heuristic search algorithms, like genetic algorithms and particle swarm optimization.

Example of a logical agent using Search and Planning

A common example of a logical agent using search and planning is a robot
navigating through a maze to reach a goal. The agent uses a search algorithm, such
as A* or breadth-first search, to explore the maze and plan its path to the goal. The
agent represents the maze as a graph, where each node represents a location in the
maze, and each edge represents a possible path between two locations.
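A minimal sketch of the maze navigation described above, assuming a small hand-written grid and using breadth-first search; A* would work the same way with a heuristic such as the Manhattan distance added.

# Breadth-first search over a grid maze, as in the navigation example above.
# The maze layout, start, and goal are assumptions made for illustration.

from collections import deque

maze = [
    "S.#.",
    ".##.",
    "...G",
]   # 'S' start, 'G' goal, '#' wall, '.' free cell

def neighbors(r, c):
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(maze) and 0 <= nc < len(maze[0]) and maze[nr][nc] != "#":
            yield nr, nc

def bfs(start, goal):
    frontier = deque([start])
    came_from = {start: None}            # also serves as the visited set
    while frontier:
        node = frontier.popleft()
        if node == goal:                 # reconstruct the path back to the start
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        for nxt in neighbors(*node):
            if nxt not in came_from:
                came_from[nxt] = node
                frontier.append(nxt)
    return None

print(bfs((0, 0), (2, 3)))   # shortest path from S to G, or None if unreachable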

Classical planning
Classical planning is planning in which an agent takes advantage of the problem
structure to construct complex plans of action. The agent performs three tasks in
classical planning:
* Planning: the agent plans after it knows what the problem is.
* Acting: it decides what action it has to take.
* Learning: the actions taken by the agent make it learn new things.

PDDL (Planning Domain Definition Language) is a language used to represent all
actions in one action schema.
PDDL describes the four basic things needed in a search problem:
* Initial state: the representation of each state as a conjunction of ground,
functionless atoms.
* Actions: defined by a set of action schemas which implicitly define the
ACTIONS() and RESULT() functions.
* Result: obtained from the set of actions used by the agent.
* Goal: the same as a precondition, i.e. a conjunction of literals (whose values
are either positive or negative).
There are various examples which make PDDL understandable:
* Air cargo transport
* The spare tire problem
* The blocks world, and many more.
Let's discuss one of them:
* Air cargo transport
This problem can be illustrated with the help of the following actions:
* Load: this action is taken to load cargo.
* Unload: this action is taken to unload the cargo when it reaches its destination.
* Fly: this action is taken to fly from one place to another.
Therefore, the air cargo problem is based on loading and unloading cargo and
flying it from one place to another.
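As a hedged sketch of how the Load action schema of the air cargo problem can be represented (written here as Python data rather than actual PDDL syntax; the predicate and object names are illustrative assumptions):

# A state is a set of ground atoms such as ("At", "C1", "SFO").
# Load(cargo, plane, airport):
#   precondition: At(cargo, airport) and At(plane, airport)
#   effect: In(cargo, plane) and not At(cargo, airport)

def load(state, cargo, plane, airport):
    if ("At", cargo, airport) in state and ("At", plane, airport) in state:
        new_state = set(state)
        new_state.remove(("At", cargo, airport))   # delete list
        new_state.add(("In", cargo, plane))        # add list
        return new_state
    return None   # precondition not satisfied, action not applicable

initial = {("At", "C1", "SFO"), ("At", "P1", "SFO")}
print(load(initial, "C1", "P1", "SFO"))
# {('At', 'P1', 'SFO'), ('In', 'C1', 'P1')}

Unload and Fly can be written the same way, and a planner then searches over sequences of such actions until the goal conjunction is satisfied.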

Robotics
Robotics is a domain in artificial intelligence that deals with the
study of creating intelligent and efficient robots.

What are Robots?

Robots are artificial agents acting in a real-world environment.

Objective

Robots are aimed at manipulating objects by perceiving, picking, moving, and
modifying the physical properties of an object, destroying it, or having some other
effect on it, thereby freeing manpower from repetitive functions without getting
bored, distracted, or exhausted.

What is Robotics?

Robotics is a branch of AI which combines Electrical Engineering, Mechanical
Engineering, and Computer Science for the design, construction, and application
of robots.

Aspects of Robotics

 The robots have a mechanical construction, form, or shape designed
to accomplish a particular task.
 They have electrical components which power and control the
machinery.
 They contain some level of computer program that determines
what, when and how a robot does something.
Difference between Robot Systems and Other AI Programs

Here is the difference between the two −

AI Programs:
 They usually operate in computer-simulated worlds.
 The input to an AI program is in symbols and rules.
 They need general-purpose computers to operate on.

Robots:
 They operate in the real physical world.
 Inputs to robots are analog signals in the form of speech waveforms or images.
 They need special hardware with sensors and effectors.

Robot Locomotion

Locomotion is the mechanism that makes a robot capable of moving in its
environment. There are various types of locomotion −

 Legged
 Wheeled
 Combination of Legged and Wheeled Locomotion
 Tracked slip/skid
Legged Locomotion

 This type of locomotion consumes more power while demonstrating walking,
jumping, trotting, hopping, climbing up or down, etc.
 It requires a greater number of motors to accomplish a movement. It is suited
for rough as well as smooth terrain, where an irregular or too smooth surface
would make wheeled locomotion consume more power. It is a little difficult to
implement because of stability issues.
 Legged robots come with one, two, four, or six legs. If a robot has multiple
legs, then leg coordination is necessary for locomotion.
The total number of possible gaits (a periodic sequence of lift and release events
for each of the total legs) a robot can travel depends upon the number of its legs.

If a robot has k legs, then the number of possible events is N = (2k - 1)!.

In the case of a two-legged robot (k = 2), the number of possible events is
N = (2k - 1)! = (2*2 - 1)! = 3! = 6.

Hence there are six possible different events −

 Lifting the left leg
 Releasing the left leg
 Lifting the right leg
 Releasing the right leg
 Lifting both legs together
 Releasing both legs together

In the case of k = 6 legs, there are (2*6 - 1)! = 11! = 39916800 possible events.
Hence the complexity of a robot's gait grows rapidly with the number of its legs.
Wheeled Locomotion

It requires fewer motors to accomplish a movement. It is a little easier to implement,
as there are fewer stability issues with a larger number of wheels. It is power
efficient as compared to legged locomotion.

 Standard wheel − Rotates around the wheel axle and around the contact point.
 Castor wheel − Rotates around the wheel axle and the offset steering joint.
 Swedish 45° and Swedish 90° wheels − Omni-wheels; rotate around the contact
point, around the wheel axle, and around the rollers.
 Ball or spherical wheel − Omnidirectional wheel, technically difficult to
implement.

Slip/Skid Locomotion
In this type, the vehicles use tracks, as in a tank. The robot is steered by moving the
tracks at different speeds in the same or opposite direction. It offers stability
because of the large contact area between the track and the ground.

Components of a Robot

Robots are constructed with the following −

 Power Supply − The robots are powered by batteries, solar power, hydraulic, or
pneumatic power sources.
 Actuators − They convert energy into movement.
 Electric motors (AC/DC) − They are required for rotational movement.
 Pneumatic Air Muscles − They contract by almost 40% when air is sucked into
them.
 Muscle Wires − They contract by 5% when electric current is passed through
them.
 Piezo Motors and Ultrasonic Motors − Best for industrial robots.
 Sensors − They provide real-time information about the task environment.
Robots are equipped with vision sensors to be able to compute the depth of the
environment. A tactile sensor imitates the mechanical properties of the touch
receptors of human fingertips.
Computer Vision

This is a technology of AI with which robots can see. Computer vision plays a vital
role in the domains of safety, security, health, access, and entertainment.
Computer vision automatically extracts, analyzes, and comprehends useful
information from a single image or an array of images. This process involves the
development of algorithms to accomplish automatic visual comprehension.

Hardware of Computer Vision System

This involves −

 Power supply
 Image acquisition device, such as a camera
 A processor
 Software
 A display device for monitoring the system
 Accessories such as camera stands, cables, and connectors
Tasks of Computer Vision
 OCR − Optical Character Reader: software that converts scanned documents
into editable text; it usually accompanies a scanner.
 Face Detection − Many state-of-the-art cameras come with this feature, which
enables them to read a face and take the picture at that perfect expression. It is
also used to let a user access software on a correct match.
 Object Recognition − Object recognition systems are installed in supermarkets,
cameras, and high-end cars such as BMW, GM, and Volvo.
 Estimating Position − Estimating the position of an object with respect to the
camera, as in the position of a tumor in a human body.
Application Domains of Computer Vision
 Agriculture
 Autonomous vehicles
 Biometrics
 Character recognition
 Forensics, security, and surveillance
 Industrial quality inspection
 Face recognition
 Gesture analysis
 Geoscience
 Medical imagery
 Pollution monitoring
 Process control
 Remote sensing
 Robotics
 Transport
Applications of Robotics

Robotics has been instrumental in various domains, such as −

 Industries − Robots are used for handling materials, cutting, welding, color
coating, drilling, polishing, etc.
 Military − Autonomous robots can reach inaccessible and hazardous zones
during war. A robot named Daksh, developed by the Defense Research and
Development Organization (DRDO), is used to destroy life-threatening objects
safely.
 Medicine − Robots are capable of carrying out hundreds of clinical tests
simultaneously, rehabilitating permanently disabled people, and assisting in
complex surgeries such as brain tumor removal.
 Exploration − Robot rock climbers used for space exploration and underwater
drones used for ocean exploration are a few examples.
 Entertainment − Disney's engineers have created hundreds of robots for movie
making.

1. Definition of AI Perception:

 Perception in AI involves the interpretation of data from sensors, cameras,
microphones, or other input devices to understand the environment.
 It mimics human perception by enabling machines to recognize objects, understand
speech, or interpret visual and auditory signals.

2. Key Components:

 Sensors: These are devices that capture data from the environment. Examples include
cameras, microphones, and other types of detectors.
 Data Processing: Once data is captured, it needs to be processed. This involves
techniques such as image recognition, natural language processing, and signal processing.
3. Types of Perception in AI:

 Computer Vision: Involves the interpretation of visual information from the world, often
using techniques like image recognition, object detection, and facial recognition.
 Speech Recognition: Understanding and interpreting spoken language.
 Natural Language Processing (NLP): Understanding and generating human language,
involving tasks like language translation and sentiment analysis.
 Sensor Fusion: Combining information from multiple sensors to form a more complete
understanding of the environment.

4. Challenges in AI Perception:

 Ambiguity: The real world is often ambiguous, and making sense of uncertain or
incomplete information is a significant challenge.
 Variability: Environments can change, and perception systems need to adapt to different
conditions and situations.
 Real-time Processing: Some applications, like autonomous vehicles, require instant
processing of perceptual data for quick decision-making.

5. Applications:

 Autonomous Vehicles: AI perception is crucial for self-driving cars to understand and
navigate the road environment.
 Healthcare: Applications include medical image analysis, where AI can help identify
and diagnose conditions from medical images.
 Smart Assistants: Devices like smart speakers use speech recognition to understand and
respond to user commands.
 Security and Surveillance: AI perception is used for facial recognition, object tracking,
and anomaly detection.

6. Technologies and Techniques:

 Machine Learning Algorithms: Supervised and unsupervised learning techniques are
often employed for training perception models.
 Deep Learning: Neural networks, especially convolutional neural networks (CNNs) and
recurrent neural networks (RNNs), are commonly used for tasks like image recognition
and natural language processing.
 Sensor Technologies: Advancements in sensor technologies, such as Lidar and Radar,
play a crucial role in enhancing perception capabilities.

7. Ethical Considerations:

 Privacy: Perception systems, especially those involving cameras and microphones, raise
concerns about privacy and data security.
 Bias: AI perception systems can inherit biases present in training data, leading to unfair
or discriminatory outcomes.

8. Future Trends:

 Multi-Modal Perception: Integration of information from multiple modalities (e.g.,
vision, speech, and touch) for a more comprehensive understanding.
 Edge Computing: Performing perception tasks on local devices to reduce latency and
enhance privacy.
 Explainable AI: Efforts to make AI systems more transparent and understandable,
especially in critical applications.

Understanding and improving AI perception is crucial for the development of advanced
AI systems that can interact effectively with the real world. It involves a combination of
cutting-edge technologies, interdisciplinary knowledge, and ethical considerations.

Robotic Sensing in Manufacturing:

1. Vision Systems:

 Description: Robotic vision systems use cameras and image processing algorithms to
visually perceive the manufacturing environment.
 Examples:
Quality Control: Cameras can inspect products for defects, ensuring that only
high-quality items proceed down the production line.
Pick-and-Place Operations: Vision systems enable robots to identify and accurately
pick up objects, facilitating automation in assembly processes.

2. Force and Tactile Sensors:

 Description: Sensors that measure force and pressure, allowing robots to interact with
their environment more intelligently.
 Examples:
Assembly Verification: Force sensors help robots determine whether components are
properly assembled by measuring the force exerted during the process.
Sensitive Gripping: Tactile feedback enables robots to grip fragile objects without
damaging them.

3. Lidar and Radar Technologies:

 Description: Lidar and radar provide depth perception and object detection capabilities,
enhancing robotic navigation and safety in manufacturing environments.
 Examples:
Obstacle Avoidance: Lidar sensors help robots navigate through crowded factory
floors by detecting obstacles and adjusting their paths.
Human-Robot Collaboration: Radar systems enhance safety by detecting the presence
of humans in the robot's vicinity, leading to collaborative workspaces.
4. Ultrasonic Sensors:

 Description: Ultrasonic sensors use sound waves to measure distances and detect objects.
 Examples:
Material Level Monitoring: In processes involving liquids or powders, ultrasonic
sensors can monitor material levels in containers to optimize production efficiency.
Collision Avoidance: Robots equipped with ultrasonic sensors can avoid collisions
with other objects or robots in their path.

Predictive Analysis in Manufacturing:

1. Predictive Maintenance:

 Description: AI-driven predictive maintenance uses historical and real-time data to
forecast when equipment is likely to fail, allowing for proactive maintenance.
 Examples:
Equipment Health Monitoring: Sensors on machinery collect data on factors like
temperature and vibration. Predictive models analyze this data to predict when
maintenance is needed, reducing downtime (a minimal code sketch appears after
this list).

2. Quality Prediction:

 Description: Predictive analysis can be applied to predict the quality of products based on
various parameters.
 Examples:
Defect Prediction: Machine learning models can analyze production data to predict
the likelihood of defects, allowing for corrective measures before products are
completed.

3. Supply Chain Optimization:

 Description: Predictive analytics helps optimize the supply chain by forecasting demand,
identifying potential disruptions, and improving overall efficiency.
 Examples:
Demand Forecasting: Machine learning models can analyze historical sales data,
market trends, and external factors to predict future demand for products,
optimizing inventory management.

4. Energy Consumption Optimization:

 Description: Predictive models can analyze historical energy consumption patterns to
optimize energy usage in manufacturing facilities.
 Examples:
Energy Consumption Forecasting: By analyzing historical data and considering
factors like production schedules, predictive models can forecast energy needs,
allowing for more efficient resource allocation.
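The following is the minimal code sketch referenced in the Predictive Maintenance item above. It assumes scikit-learn is installed, and the sensor readings and labels are fabricated for illustration; a real system would be trained on historical equipment data.

# Toy predictive-maintenance model: predict failure from sensor readings.

from sklearn.linear_model import LogisticRegression

# Features: [temperature_C, vibration_mm_s]; label: 1 = failure soon after reading
X = [[60, 2.0], [65, 2.2], [70, 2.5], [85, 5.5], [90, 6.0], [95, 7.1]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression().fit(X, y)

new_reading = [[88, 5.8]]
print(model.predict(new_reading))         # predicted class (1 = likely failure)
print(model.predict_proba(new_reading))   # class probabilities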

Conclusion:
The integration of AI perception, robotic sensing, and predictive analysis in
manufacturing leads to more efficient, flexible, and adaptive production processes. As
technology continues to advance, these systems will play an increasingly integral role in
creating smart and interconnected manufacturing environments. The combination of
accurate perception through advanced sensors and predictive analytics contributes to
improved decision-making, reduced downtime, and enhanced overall productivity in the
manufacturing industry.

SECTION B

Data Science Introduction

Data Science is a combination of multiple disciplines that uses statistics,
data analysis, and machine learning to analyze data and to extract
knowledge and insights from it.

What is Data Science?


Data Science is about data gathering, analysis and decision-making.

Data Science is about finding patterns in data, through analysis, and making
future predictions.

By using Data Science, companies are able to make:

 Better decisions (should we choose A or B?)
 Predictive analysis (what will happen next?)
 Pattern discoveries (find patterns, or maybe hidden information in the data)
Where is Data Science Needed?
Data Science is used in many industries in the world today, e.g. banking,
consultancy, healthcare, and manufacturing.

Examples of where Data Science is needed:

 For route planning: To discover the best routes to ship
 To foresee delays for flight/ship/train etc. (through predictive analysis)
 To create promotional offers
 To find the best suited time to deliver goods
 To forecast the next year's revenue for a company
 To analyze the health benefits of training
 To predict who will win elections

Data Science can be applied in nearly every part of a business where data is
available. Examples are:

 Consumer goods
 Stock markets
 Industry
 Politics
 Logistics companies
 E-commerce


How Does a Data Scientist Work?
A Data Scientist requires expertise in several backgrounds:
 Machine Learning
 Statistics
 Programming (Python or R)
 Mathematics
 Databases

A Data Scientist must find patterns within the data. Before he/she can find the
patterns, he/she must organize the data in a standard format.

Here is how a Data Scientist works:

1. Ask the right questions - To understand the business problem.
2. Explore and collect data - From databases, web logs, customer feedback, etc.
3. Extract the data - Transform the data to a standardized format.
4. Clean the data - Remove erroneous values from the data.
5. Find and replace missing values - Check for missing values and
replace them with a suitable value (e.g. an average value).
6. Normalize data - Scale the values to a practical range (e.g. 140 cm is
smaller than 1.8 m; however, the number 140 is larger than 1.8, so scaling is
important; a minimal scaling sketch appears after this list).
7. Analyze data, find patterns and make future predictions.
8. Represent the result - Present the result with useful insights in a way
the "company" can understand.

The data science landscape is constantly evolving, as new technologies and
techniques emerge. However, there are a number of core areas that remain important for data
scientists to have expertise in. These include:

 Statistics: Data scientists need to have a strong understanding of statistical concepts,
such as probability, sampling, and hypothesis testing. This allows them to draw
meaningful insights from data and to develop predictive models.
 Machine learning: Machine learning is a subfield of artificial intelligence that allows
computers to learn without being explicitly programmed. Data scientists use machine
learning to develop algorithms that can identify patterns in data and make predictions.
 Programming: Data scientists need to be able to program in order to develop and
implement algorithms, and to analyze and visualize data. Popular programming
languages used by data scientists include Python, R, and Scala.
 Cloud computing: Cloud computing platforms such as AWS, Azure, and GCP provide
data scientists with access to powerful computing resources and tools. This allows them
to scale their data science projects and to collaborate with other data scientists.
 Communication: Data scientists need to be able to communicate their findings to both
technical and non-technical audiences. This is important for ensuring that their work is
used to make informed decisions.

The data science landscape is a dynamic and rapidly evolving field that encompasses a wide
range of techniques, tools, and technologies for extracting insights and knowledge from data.
Here’s an overview of key components and trends within the data science landscape:

1. Data Collection and Storage:


 Data Sources: Data scientists gather data from various sources, including
databases, APIs, sensors, social media, and more.
 Data Warehouses: Data is often stored in data warehouses or data lakes,
allowing for centralized storage and efficient retrieval.
2. Data Preprocessing:
 Data Cleaning: This involves handling missing values, outliers, and
inconsistencies in the data.
 Feature Engineering: Creating new features or transforming existing ones to
improve model performance.
3. Exploratory Data Analysis (EDA):
 Data Visualization: EDA involves creating visualizations to understand data
patterns and relationships.
 Statistical Analysis: Data scientists use statistical methods to uncover insights
and correlations in the data.
4. Machine Learning and Modeling:
 Supervised Learning: Building models that make predictions based on labeled
data.
 Unsupervised Learning: Discovering patterns and structures in unlabeled data.
 Deep Learning: Leveraging neural networks for complex tasks like image and
natural language processing.
 Reinforcement Learning: Teaching agents to make sequential decisions through
trial and error.
5. Model Evaluation and Validation:
 Cross-Validation: Ensuring models generalize well to new data (a short
end-to-end sketch appears after this list).
 Hyperparameter Tuning: Optimizing model parameters for better performance.
 Bias and Fairness Analysis: Checking for biases and ensuring fairness in
models, especially in sensitive domains.
6. Deployment and Productionization:
 Model Deployment: Taking trained models and integrating them into production
systems.
 Monitoring: Continuously monitoring models for performance and drift.
 Scalability: Ensuring models can handle large-scale data and user traffic.
7. Big Data Technologies:
 Hadoop: Distributed storage and processing framework.
 Spark: In-memory, distributed data processing.
 NoSQL Databases: Storing and retrieving unstructured or semi-structured data.
8. Cloud Computing:
 Cloud platforms like AWS, Azure, and Google Cloud provide scalable
infrastructure for data storage, processing, and analytics.
9. Natural Language Processing (NLP):
 Analyzing and generating human language text, enabling chatbots, sentiment
analysis, and language translation.
10. Computer Vision:
 Using machine learning to interpret and understand images and videos, with
applications in object recognition, image classification, and autonomous vehicles.
11. AI Ethics and Responsible AI:
 Ensuring ethical use of AI and addressing issues related to bias, fairness,
transparency, and privacy.
12. Automated Machine Learning (AutoML):
 Tools and platforms that automate the process of selecting, training, and
deploying machine learning models.
13. IoT and Sensor Data:
 Analyzing data from Internet of Things (IoT) devices and sensors for applications
like predictive maintenance and smart cities.
14. Data Governance and Compliance:
 Managing data to ensure quality, security, and compliance with regulations like
GDPR.
15. Data Science Toolkits and Libraries:
 Python and R are popular programming languages for data science, and there are
numerous libraries like scikit-learn, TensorFlow, and PyTorch for machine
learning.
16. Data Science Team Roles:
 Data scientists, data engineers, machine learning engineers, and data analysts
collaborate to deliver data-driven solutions.
17. Education and Skill Development:
 Ongoing learning and development are essential in this rapidly changing field.
18. Interdisciplinary Applications:
 Data science is applied in various domains, including healthcare, finance,
e-commerce, marketing, and more.
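The short end-to-end sketch referenced in point 5 above ties preprocessing, supervised modeling, and cross-validated evaluation together; it assumes scikit-learn is installed and uses its bundled iris dataset.

# End-to-end sketch: scaling + logistic regression + 5-fold cross-validation.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

pipeline = make_pipeline(
    StandardScaler(),                    # preprocessing: feature scaling
    LogisticRegression(max_iter=1000),   # supervised learning model
)

scores = cross_val_score(pipeline, X, y, cv=5)   # cross-validation (point 5)
print(scores.mean())                             # average accuracy across folds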

Data Science Process





If you are in a technical domain, or a student with a technical background, then you have
certainly heard about Data Science from some source. It is one of the booming fields in
today's tech market, and it will keep growing as the world becomes more and more digital,
because data holds the capacity to shape the future. In this article, we will learn about
Data Science and the process involved in it.
What is Data Science?
Data can prove very fruitful if we know how to manipulate it to extract hidden patterns
from it. The logic behind the data, or the process behind this manipulation, is what is
known as Data Science. The Data Science process runs from formulating the problem
statement and collecting the data to extracting the required results from it, and the
professional who ensures that the whole process goes smoothly is known as a Data
Scientist. But there are other job roles in this domain as well, such as:
1. Data Engineers
2. Data Analysts
3. Data Architect
4. Machine Learning Engineer
5. Deep Learning Engineer
Data Science Process Life Cycle
There are some steps that are necessary for any of the tasks that are being done in the
field of data science to derive any fruitful results from the data at hand.
 Data Collection – After formulating the problem statement, the main task is to
collect data that can help us in our analysis and manipulation. Sometimes
data is collected by performing some kind of survey, and at other times it is
done by web scraping.
 Data Cleaning – Most of the real-world data is not structured and requires
cleaning and conversion into structured data before it can be used for any
analysis or modeling.
 Exploratory Data Analysis – This is the step in which we try to find the
hidden patterns in the data at hand. We also analyze the different factors
that affect the target variable and the extent to which they do so, how the
independent features are related to each other, and what can be done to achieve
the desired results. This also gives us a direction in which to work to get
started with the modeling process.
 Model Building – Different types of machine learning algorithms as well as
techniques have been developed which can easily identify complex patterns in
the data which will be a very tedious task to be done by a human.
 Model Deployment – After a model is developed and gives better results on
the holdout or the real-world dataset then we deploy it and monitor its
performance. This is the main part where we use our learning from the data to
be applied in real-world applications and use cases.

Components of Data Science Process


Data Science is a very vast field, and to get the best out of the data at hand one has to apply
multiple methodologies and use different tools to make sure the integrity of the data remains
intact throughout the process, keeping data privacy in mind. Machine Learning and Data Analysis
are the parts where we focus on the results that can be extracted from the data at hand, whereas
Data Engineering is the part whose main task is to ensure that the data is managed properly and
that proper data pipelines are created for smooth data flow. The main components of Data Science
are:
 Data Analysis – There are times when there is no need to apply advanced deep
learning or complex methods to the data at hand to derive patterns from it. For
this reason, before moving on to the modeling part, we first perform an
exploratory data analysis to get a basic idea of the data and the patterns available
in it; this gives us a direction to work in if we later want to apply more complex
analysis methods to our data.
 Statistics – Many real-life datasets follow a normal distribution, and when we
already know that a particular dataset follows some known distribution, most of
its properties can be analyzed at once. Descriptive statistics, as well as the
correlation and covariance between features of the dataset, also help us better
understand how one factor is related to another in our dataset.
 Data Engineering – When we deal with a large amount of data, we have to make
sure that the data is kept safe from online threats and that it is easy to retrieve
and modify. Data Engineers play a crucial role in ensuring that the data is used
efficiently.
 Advanced Computing
 Machine Learning – Machine Learning has opened new horizons and
helped us build advanced applications and methodologies, so that
machines become more efficient, provide a personalized experience to
each individual, and perform in an instant tasks that earlier required
heavy human labor and time.
 Deep Learning – This is also a part of Artificial Intelligence and
Machine Learning, but it is a bit more advanced than machine learning
itself. High computing power and huge corpora of data have led to the
emergence of this field in data science.
Knowledge and Skills for Data Science Professionals
As a Data Scientist, you’ll be responsible for jobs that span three domains of skills.
1. Statistical/mathematical reasoning
2. Business communication/leadership
3. Programming
1. Statistics: Wikipedia defines it as the study of the collection, analysis, interpretation,
presentation, and organization of data. Therefore, it shouldn’t be a surprise that data
scientists need to know statistics.
2. Programming Language R/Python: Python and R are two of the most widely used
languages among Data Scientists, primarily because of the number of packages available for
numeric and scientific computing.
3. Data Extraction, Transformation, and Loading: Suppose we have multiple data
sources like a MySQL database, MongoDB, and Google Analytics. You have to extract data from
such sources and then transform it into a proper format or structure for the purposes of
querying and analysis. Finally, you have to load the data into the data warehouse, where you
will analyze it. So, for people from an ETL (Extract, Transform, and Load) background, Data
Science can be a good career option.
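To make the ETL idea concrete, here is a minimal Python sketch using pandas and the built-in sqlite3 module. The file name sales.csv and its columns are assumptions made purely for illustration, not part of any particular project.

import sqlite3
import pandas as pd

# Extract: read raw data from an assumed CSV source (hypothetical file and columns).
raw = pd.read_csv("sales.csv")  # assumed columns: order_id, amount, order_date

# Transform: fix types, drop unusable rows, and derive a month column.
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
raw = raw.dropna(subset=["order_date"])
raw["amount"] = raw["amount"].astype(float)
raw["month"] = raw["order_date"].dt.to_period("M").astype(str)

# Load: write the transformed table into a local SQLite "warehouse" for querying.
with sqlite3.connect("warehouse.db") as conn:
    raw.to_sql("sales", conn, if_exists="replace", index=False)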
Steps for Data Science Processes:
Step 1: Defining research goals and creating a project charter
 Spend time understanding the goals and context of your research. Continue
asking questions and devising examples until you grasp the exact business
expectations, identify how your project fits into the bigger picture, appreciate
how your research is going to change the business, and understand how they'll
use your results.
Create a project charter
A project charter requires teamwork, and your input covers at least the following:
1. A clear research goal
2. The project mission and context
3. How you’re going to perform your analysis
4. What resources you expect to use
5. Proof that it’s an achievable project, or proof of concepts
6. Deliverables and a measure of success
7. A timeline
Step 2: Retrieving Data
Start with data stored within the company
 Finding data even within your own company can sometimes be a challenge.
 This data can be stored in official data repositories such as databases, data
marts, data warehouses, and data lakes maintained by a team of IT
professionals.
 Getting access to the data may take time and involve company policies.
Step 3: Cleansing, integrating, and transforming data-
Cleaning:
 Data cleansing is a subprocess of the data science process that focuses on
removing errors in your data so your data becomes a true and consistent
representation of the processes it originates from.
 The first type is the interpretation error, such as incorrect use of terminologies,
like saying that a person’s age is greater than 300 years.
 The second type of error points to inconsistencies between data sources or
against your company’s standardized values. An example of this class of errors
is putting “Female” in one table and “F” in another when they represent the
same thing: that the person is female.
Integrating:
 Combining Data from different Data Sources.
 Your data comes from several different places, and in this sub step we focus on
integrating these different sources.
 You can perform two operations to combine information from different data
sets. The first operation is joining and the second operation is appending or
stacking.
Joining Tables:
 Joining tables allows you to combine the information of one observation found
in one table with the information that you find in another table.
Appending Tables:
 Appending or stacking tables is effectively adding observations from one table
to another table.
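As a rough illustration of joining versus appending, the following pandas sketch uses two small made-up tables; the column names (customer_id, city, amount) are assumptions for illustration only.

import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "city": ["Pune", "Delhi"]})
orders_jan = pd.DataFrame({"customer_id": [1, 2], "amount": [250, 400]})
orders_feb = pd.DataFrame({"customer_id": [1, 1], "amount": [120, 90]})

# Joining: combine the columns of matching observations from two tables.
joined = orders_jan.merge(customers, on="customer_id", how="left")

# Appending (stacking): add the observations of one table below another.
stacked = pd.concat([orders_jan, orders_feb], ignore_index=True)

print(joined)
print(stacked)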
Transforming Data
 Certain models require their data to be in a certain shape.
Reducing the Number of Variables
 Sometimes you have too many variables and need to reduce the number
because they don’t add new information to the model.
 Having too many variables in your model makes the model difficult to handle,
and certain techniques don’t perform well when you overload them with too
many input variables.
 Dummy variables can only take two values: true (1) or false (0). They're used to
indicate the presence or absence of a categorical effect that may explain the observation.
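A small, hedged example of creating dummy variables with pandas; the color column is invented purely for illustration.

import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

# Each category becomes a 0/1 column indicating its presence for that row.
dummies = pd.get_dummies(df, columns=["color"], prefix="color")
print(dummies)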
Step 4: Exploratory Data Analysis
 During exploratory data analysis you take a deep dive into the data.
 Information becomes much easier to grasp when shown in a picture; therefore,
you mainly use graphical techniques to gain an understanding of your data and
the interactions between variables.
 Common techniques include the bar plot, line plot, scatter plot, multiple plots,
Pareto diagram, link-and-brush diagram, histogram, and box-and-whisker plot.
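A minimal sketch of how a few of these plots might be produced with pandas and Matplotlib; the toy dataset and column names below are placeholders, not real data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy data standing in for a real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 60, size=200),
    "income": rng.normal(50_000, 12_000, size=200),
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
df["age"].plot.hist(ax=axes[0], title="Histogram of age")
df.plot.scatter(x="age", y="income", ax=axes[1], title="Scatter plot")
df.boxplot(column="income", ax=axes[2])
axes[2].set_title("Box and whisker plot")
plt.tight_layout()
plt.show()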
Step 5: Build the Models
 Building the models is the next step, with the goal of making better predictions,
classifying objects, or gaining an understanding of the system being modeled.
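As a hedged illustration, the scikit-learn sketch below fits and checks a simple classifier on a built-in toy dataset; the dataset and model choice are for demonstration only, not a recommendation for any particular problem.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a simple baseline model on the training split.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on the held-out split before considering deployment.
print("Hold-out accuracy:", accuracy_score(y_test, model.predict(X_test)))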
Step 6: Presenting findings and building applications on top of them –
 The last stage of the data science process is where your soft skills will be most
useful, and yes, they’re extremely important.
 This involves presenting your results to the stakeholders and industrializing
your analysis process for repetitive reuse and integration with other tools.
Benefits and uses of data science and big data
 Governmental organizations are also aware of data’s value. A data scientist in
a governmental organization gets to work on diverse projects such as detecting
fraud and other criminal activity or optimizing project funding.
 Nongovernmental organizations (NGOs) are also no strangers to using data.
They use it to raise money and defend their causes. The World Wildlife Fund
(WWF), for instance, employs data scientists to increase the effectiveness of
their fundraising efforts.
 Universities use data science in their research but also to enhance the study
experience of their students, for example through MOOCs (Massive Open Online Courses).
Tools for Data Science Process
As time has passed, the tools used to perform different tasks in Data Science have evolved to a
great extent. Software such as Matlab and Power BI and programming languages such as Python
and R, which are among the most popular tools in this domain, provide many utility features that
help us complete even the most complex tasks efficiently and within a very limited time.

Usage of Data Science Process


The Data Science Process is a systematic approach to solving data-related problems and
consists of the following steps:
1. Problem Definition: Clearly defining the problem and identifying the goal of
the analysis.
2. Data Collection: Gathering and acquiring data from various sources, including
data cleaning and preparation.
3. Data Exploration: Exploring the data to gain insights and identify trends,
patterns, and relationships.
4. Data Modeling: Building mathematical models and algorithms to solve
problems and make predictions.
5. Evaluation: Evaluating the model’s performance and accuracy using
appropriate metrics.
6. Deployment: Deploying the model in a production environment to make
predictions or automate decision-making processes.
7. Monitoring and Maintenance: Monitoring the model’s performance over
time and making updates as needed to improve accuracy.
Issues of Data Science Process
1. Data Quality and Availability: Data quality can affect the accuracy of the
models developed and therefore, it is important to ensure that the data is
accurate, complete, and consistent. Data availability can also be an issue, as the
data required for analysis may not be readily available or accessible.
2. Bias in Data and Algorithms: Bias can exist in data due to sampling
techniques, measurement errors, or imbalanced datasets, which can affect the
accuracy of models. Algorithms can also perpetuate existing societal biases,
leading to unfair or discriminatory outcomes.
3. Model Overfitting and Underfitting: Overfitting occurs when a model is too
complex and fits the training data too well, but fails to generalize to new data.
On the other hand, underfitting occurs when a model is too simple and is not
able to capture the underlying relationships in the data.
4. Model Interpretability: Complex models can be difficult to interpret and
understand, making it challenging to explain the model’s decisions and
predictions. This can be an issue when it comes to making business decisions or
gaining stakeholder buy-in.
5. Privacy and Ethical Considerations: Data science often involves the
collection and analysis of sensitive personal information, leading to privacy
and ethical concerns. It is important to consider privacy implications and
ensure that data is used in a responsible and ethical manner.
6. Technical Challenges: Technical challenges can arise during the data science
process such as data storage and processing, algorithm selection, and
computational scalability.
Foundations of AI and Data Science

Foundations of AI and Data Science is a critical area of research at CAIDAS that explores the underlying principles
and techniques behind the development of AI and data science applications. This area encompasses several sub-
disciplines, each of which focuses on a specific aspect of AI and data science research. The sub-areas of
Foundations of AI and Data Science include Deep Learning, Representation Learning, Reinforcement Learning,
Statistical Relational Learning, Machine Learning for Complex Networks, Computer Vision, Natural Language
Processing, and Pattern Recognition.

Deep Learning focuses on the development of algorithms that enable artificial neural networks to learn from large
amounts of data. These algorithms allow AI systems to improve their performance over time and can be applied to
various applications, including image recognition, speech recognition, and natural language processing.
Representation Learning deals with the development of algorithms that can effectively represent complex data
structures. These algorithms are critical for building AI systems that can learn from structured and unstructured data
and can be used to improve the performance of applications such as computer vision and natural language
processing.
Reinforcement Learning focuses on the development of algorithms that allow AI systems to learn from experience.
These algorithms are used in applications that require the AI system to make decisions based on the consequences of
its actions, such as robotics, gaming, and autonomous driving.
Statistical Relational Learning deals with the development of algorithms that can effectively model relationships
between entities in data. These algorithms can be used to improve the performance of applications that require the
analysis of complex, relational data, such as knowledge graphs and social networks.
Machine Learning for Complex Networks focuses on the development of algorithms that can effectively analyze
complex networks of data, such as those found in social networks, transportation networks, and biological networks.
Computer Vision deals with the development of algorithms that can analyze and understand images and videos.
These algorithms can be used in applications such as object recognition, face recognition, and scene analysis.
Natural Language Processing deals with the development of algorithms that can analyze and understand human
language. These algorithms can be used in applications such as speech recognition, sentiment analysis, and machine
translation.
Pattern Recognition deals with the development of algorithms that can recognize patterns in data. These algorithms
can be used in applications such as audio data recognition in ecology, image classification, and speech recognition.
In conclusion, the area of Foundations of AI and Data Science at CAIDAS is a critical component of AI and data
science research. It encompasses a wide range of sub-disciplines that each contribute to the advancement of AI and
data science applications.

Difference between Structured data and Unstructured data

This article will be of particular interest to readers who work with Big Data. In it, we will discuss
two major types of Big Data, structured data and unstructured data, and the differences between
them. The aim is to give you sufficient information about structured data, unstructured data, and
their comparison, so without any delay, let's start.

Before discussing the types of Big Data, let's see the brief description of Data and Big Data.

What is Data?

In general, data is a distinct piece of information that is gathered and translated for some
purpose. Data can be available in different forms, such as bits and bytes stored in electronic
memory, numbers or text on pieces of paper, or facts stored in a person's mind.

What is Big Data?

Big Data is defined as data that is very large in size. Normally we work with data measured in
megabytes (Word documents, Excel files) or at most gigabytes (movies, code), but data measured in
petabytes, i.e., 10^15 bytes, is called Big Data. It is often stated that almost 90% of today's data
has been generated in the past few years. Big Data sources include telecom companies, weather
stations, e-commerce sites, the share market, and many more.

Big Data can be structured, unstructured, or semi-structured, and it is collected from many
different sources.

Now, let's discuss Structured Data and Unstructured Data.

Structured Data

Data that is to the point, factual, and highly organized is referred to as structured data. It is
quantitative in nature, i.e., it is related to quantities, which means it contains measurable numerical
values like numbers, dates, and times.
Structured data is easy to search and analyze, and it exists in a predefined format. A relational
database consisting of tables with rows and columns is one of the best examples of structured data.
Structured data generally exists in tables such as Excel files and Google Sheets. The programming
language SQL (Structured Query Language) is used for managing structured data. SQL was
developed by IBM in the 1970s and is mainly used to handle relational databases and data
warehouses.

Structured data is highly organized and understandable for machine language. Common
applications of relational databases with structured data include sales transactions, Airline
reservation systems, inventory control, and others.

Unstructured Data

Unstructured data includes all the unstructured files: log files, audio files, image files, and so on.
Some organizations have a great deal of data available, but they do not know how to derive value
from it because the data is raw.
Unstructured data is the data that lacks any predefined model or format. It requires a lot of
storage space, and it is hard to maintain security in it. It cannot be presented in a data model or
schema. That's why managing, analyzing, or searching for unstructured data is hard. It resides in
various formats like text, images, audio and video files, etc. It is qualitative in nature
and is sometimes stored in a non-relational (NoSQL) database.

It is not stored in relational databases, so it is hard for computers and humans to interpret it. The
limitations of unstructured data include the requirement of data science experts and specialized
tools to manipulate the data.

The amount of unstructured data is much more than the structured or semi-structured data.
Examples of human-generated unstructured data are Text files, Email, social media, media,
mobile data, business applications, and others. The machine-generated unstructured data includes
satellite images, scientific data, sensor data, digital surveillance, and many more.

Structured data v/s Unstructured data


Let's see the comparison chart between structured and unstructured data. Here, we are tabulating
the difference between both terms based on some characteristics.

Technology: Structured data is based on a relational database; unstructured data is based on
character and binary data.

Flexibility: Structured data is less flexible and schema-dependent; unstructured data has no
schema, so it is more flexible.

Scalability: It is hard to scale a structured database schema; unstructured data is more scalable.

Robustness: Structured data is very robust; unstructured data is less robust.

Performance: With structured data we can perform structured queries that allow complex joins,
so performance is higher; with unstructured data only textual queries are possible, and
performance is lower than for semi-structured and structured data.

Nature: Structured data is quantitative, i.e., it consists of hard numbers or things that can be
counted; unstructured data is qualitative, as it cannot be processed and analyzed using
conventional tools.

Format: Structured data has a predefined format; unstructured data comes in a variety of
formats, i.e., a variety of shapes and sizes.

Analysis: Structured data is easy to search; searching unstructured data is more difficult.
What is the difference between
qualitative and quantitative data?



Statistics is a subject that deals with the collection, analysis, and representation of
collected data. The analytical results derived from statistical methods are used in fields such as
geology, psychology, forecasting, etc. The process of preparing a datasheet involves steps such as
data collection, analysis, and summarization.

Types of Statistics

The article below is a study of subtopics of statistics that includes an explanation of the types of
statistics along with a detailed study of quantitative and qualitative data. Statistics is basically
divided into two types: descriptive statistics and inferential statistics.
 Descriptive statistics: Descriptive statistics is a method that summarizes data in
numerical, graphical, or tabular form after studying the dataset. It gives a
descriptive study of the population. The tools for its measurement are:

1. Measures of central tendency (mean/median/mode)

2. Measures of variability (standard deviation, range)

 Inferential statistics: Inferential statistics is a method that draws conclusions about the
population on the basis of results from a sample. It is driven by hypotheses and predictions, and
it relies on probabilities derived from sample data of the population. The tools for its
measurement are:

1. Hypothesis tests
2. Analysis of variance, etc.
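As a rough illustration of these two kinds of tools, the sketch below computes a few descriptive measures with Python's standard statistics module and runs a simple one-sample t-test with SciPy; the sample numbers are made up.

import statistics
from scipy import stats

sample = [12, 15, 11, 14, 15, 13, 16, 15, 12, 14]

# Descriptive statistics: summarize the sample itself.
print("mean:", statistics.mean(sample))
print("median:", statistics.median(sample))
print("mode:", statistics.mode(sample))
print("standard deviation:", statistics.stdev(sample))

# Inferential statistics: test a hypothesis about the population mean.
t_stat, p_value = stats.ttest_1samp(sample, popmean=13)
print("t statistic:", t_stat, "p-value:", p_value)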

 Quantitative data: The data collected on the basis of numerical variables is
quantitative data. Quantitative data is more objective and conclusive in nature. It
measures values and is expressed in numbers; the data collection is based on
"how much" of a quantity there is. Because the data in quantitative analysis is
expressed in numbers, it can be counted or measured. The data is extracted from
experiments, surveys, market reports, matrices, etc.
 Qualitative data: The data collected on the basis of categorical variables is
qualitative data. Qualitative data is more descriptive and conceptual in nature. It
measures the data on the basis of its type, collection, or category; the data
collection is based on what type of quality is given. Qualitative data is
categorized into different groups based on characteristics. The data obtained
from these kinds of analysis or research is used in theorization, perception
studies, and developing hypotheses. Such data is collected from texts, documents,
transcripts, audio and video recordings, etc.
Comparative table of qualitative and quantitative data:

1. Collection methods – Qualitative data uses methods like interviews, participant observation,
and focus groups to gain collective information; quantitative data uses methods such as
questionnaires, surveys, and structured observations.

2. Data format – The format of qualitative data is textual, and datasheets consist of audio or
video recordings and notes; the format of quantitative data is numerical, and datasheets are
obtained in the form of numerical values.

3. Questions answered – Qualitative data talks about the experience or quality and answers
questions like "why" and "how"; quantitative data talks about the quantity and answers
questions like "how much" and "how many".

4. Analysis – Qualitative data is analyzed by grouping it into different categories; quantitative
data is analyzed by statistical methods.

5. Nature – Qualitative data is subjective and open to further interpretation; quantitative data is
fixed and universal.

Nominal vs Ordinal Data





Data science revolves around the processing and analysis of data using a range of tools and
techniques. In today's data-driven world, we come across many types of data, each requiring its
own handling and interpretation. It is important to understand the different types of data for
proper data analysis and statistical interpretation, because the type of data determines the
appropriate statistical methods and operations. Different data types need different analysis and
interpretation methods to draw meaningful conclusions. In this article we explore the concept of
data and its significance, provide real-world examples, and guide you through ways to work with it.

Levels of Measurement

Before analyzing a dataset, it is crucial to identify the type of data it contains. Luckily, all data
can be grouped into one of four categories: nominal, ordinal, interval, or ratio data. Although
these are often referred to as “data types,” they are actually different levels of measurement. The
level of measurement reflects the accuracy with which a variable has been quantified, and it
determines the methods that can be used to extract insights from the data.

The four categories of data are not always straightforward to distinguish and instead belong to a
hierarchy, with each level building on the preceding one.
There are four types of data: categorical, which can be further divided into nominal and ordinal,
and numerical, which can be further divided into interval and ratio. The nominal and ordinal
scales are relatively imprecise, which makes them easier to analyze, but they offer less accurate
insights. On the other hand, the interval and ratio scales are more complex and difficult to
analyze, but they have the potential to provide much richer insights.

 Nominal Data – Nominal data is a basic data type that categorizes data by labeling or
naming values such as Gender, hair color, or types of animal. It does not have any
hierarchy.

 Ordinal Data – Ordinal data involves classifying data based on rank, such as social
status in categories like ‘wealthy’, ‘middle income’, or ‘poor’. However, there are no set
intervals between these categories.

 Interval Data – Interval data is a way of organizing and comparing data that includes
measured intervals. Temperature scales, like Celsius or Fahrenheit, are good examples of
interval data. However, interval data doesn’t have a true zero, meaning that a
measurement of “zero” can still represent a quantifiable measure (like zero degrees
Celsius, which is just another point on the scale and doesn’t actually mean there is no
temperature present).

 Ratio Data – The most intricate level of measurement is ratio data. Similar to interval
data, it categorizes and arranges data, utilizing measured intervals. But, unlike interval
data, ratio data includes a genuine zero. When a variable is zero, there is no presence of
that variable. A prime illustration of ratio data is height measurement, which cannot be
negative.
What is Nominal Data?

Categorical data, also known as nominal data, is a crucial type of information utilized in diverse
fields such as research, statistics, and data analysis. It comprises categories or labels that help
in classifying and arranging data. The essential feature of categorical data is that it does not
possess any inherent order or ranking among its categories. Instead, these categories are separate,
distinct, and mutually exclusive.

In other words, nominal data is used to classify information into distinct labels or categories
without any natural order or ranking. These labels or categories are represented using names or
terms. Nominal data is useful for the qualitative classification and organization of information,
enabling researchers and analysts to group data points based on specific attributes or
characteristics without implying any numerical relationships.

 Eye color categories like “blue” or “green” represent nominal data. Each category is
distinct, with no order or ranking.

 Smartphone brands like “iPhone” or “Samsung” are nominal data. There’s no hierarchy
among brands.

 Transportation modes like “car” or “bicycle” are nominal data. They are discrete
categories without inherent order.

Characteristics of Nominal Data


 Data that is classified as nominal is comprised of categories that are completely separate
and distinct from one another.

 Data that falls under the nominal category is distinguished by descriptive labels rather
than any numeric or quantitative value

 Nominal data cannot be ranked or ordered hierarchically, as no category is superior or
inferior to another.

Example

Here are a few examples of how nominal data is used to classify and categorize information into
distinct and non-ordered categories:

1. Colors of Car: Car colors are nominal data, with clear categories but no inherent order or
ranking. Each car falls under one color category, without any logical or numerical connection
between colors.

2. Types of Fruits: Fruit categories in a basket are nominal. Each fruit belongs to a specific
category with no hierarchy or order. All categories are distinct and discrete.

3. Movie Genres: Movie genres are nominal data since there’s no ranking among categories like
“action” or “comedy.” Each genre is unique, but we can’t say if one is better than another based
on this data alone.

What is Ordinal Data?

Ordinal data is a form of qualitative data that classifies variables into descriptive categories. It is
characterized by the fact that its categories are ranked on some sort of hierarchical scale, such as
from high to low. Ordinal data is the second level of measurement, after nominal data; although it
is more intricate than nominal data, which lacks any inherent order, it is still relatively simple.
Ordinal data is used to categorize items with a meaningful hierarchy or order. These categories
help us to compare and rank different achievements, positions, or performances, even if the
intervals between them are not equal. Ordinal data is useful for understanding ordered choices or
preferences and for assessing relative differences.

 School Grades: Grades like A, B, C are ordinal data, ranked by achievement, but intervals
between them vary.

 Education Level: Levels like high school, bachelor’s, master’s are ordinal data, ordered
by education, but gaps between levels differ.

 Seniority Level: Job levels like entry, mid, senior are ordinal data, indicating hierarchy,
but the gap varies by job and industry.

Characteristics of Ordinal Data

 Ordinal data falls under the category of non-numeric and categorical data, but it can still
make use of numerical values as labels.

 Ordinal data are always ranked in a hierarchy (hence the name ‘ordinal’).

 Ordinal data may be ranked, but their values are not evenly distributed.

 With ordinal data, you can calculate frequency distribution, mode, median, and range of
variables.
Example

Here are a few examples of how ordinal data is used in fields and domains:

1. Educational Levels: Ordinal data is commonly used to represent education levels, such as
"high school," "bachelor's degree," "master's degree," and "Ph.D." These levels have a natural order.

2. Customer Satisfaction Ratings: Another application of ordinal data is in customer satisfaction
surveys. These surveys often ask respondents to rate their experience on a scale from "poor" to
"excellent."

3. Economic Classes: Classes such as "lower class," "middle class," and "upper class" can be
classified as ordinal data based on their ranking.

These examples demonstrate the ways in which ordinal data is utilized across fields and
domains.

Nominal vs Ordinal Data

Nature of Categories – Nominal data: distinct and discrete; Ordinal data: distinct and discrete.

Order/Ranking – Nominal data: no inherent order; Ordinal data: has a clear order or ranking.

Numerical Values – Nominal data: no meaningful numerical values; Ordinal data: no meaningful
numerical values.

Analysis Techniques – Nominal data: frequency counts, percentages, bar charts; Ordinal data:
ranking, median, non-parametric tests, ordered bar charts, ordinal regression.

Example – Nominal data: colors, gender, types of animals; Ordinal data: school grades, education
level, seniority level.

Interpretation – Nominal data: used for classification and grouping based on category; Ordinal
data: used for assessing ordered preferences, hierarchy, or rankings.
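A brief pandas sketch of how nominal versus ordinal categories might be encoded; the category values here are illustrative only.

import pandas as pd

# Nominal: categories with no inherent order.
colors = pd.Categorical(["red", "blue", "green", "blue"])

# Ordinal: categories with an explicit ranking (C < B < A).
grades = pd.Categorical(["B", "A", "C", "B"], categories=["C", "B", "A"], ordered=True)

print(colors.ordered)              # False: no ranking among colors
print(grades.ordered)              # True: grades are ranked
print(grades.min(), grades.max())  # ordering enables comparisons such as min/max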
The Art of Data Cleaning: Transforming Messy Data
into Structured Insights
Introduction

In the realm of machine learning, data is the key ingredient. It fuels the
algorithms and dictates the quality of the predictions and insights
derived. Yet, more often than not, data scientists face the significant
challenge of dealing with messy, unstructured data. This article aims to
provide a comprehensive overview of data cleaning, a crucial yet
frequently overlooked aspect of the data preprocessing pipeline, which
turns messy data into tidy, structured information, ready to be
consumed by machine learning algorithms.

Understanding Messy Data

Messy data, also referred to as ‘dirty’ data, is data that is inconsistent,


mislabeled, incomplete, or improperly formatted. It could contain
mistakes, discrepancies, duplicates, and even irrelevant information.
These issues may stem from a multitude of sources, including human
error during data entry, system glitches, or inconsistent data collection
protocols.

Dealing with messy data is one of the first hurdles a data scientist must
overcome in any data science project, as poor data quality can
significantly impact the performance of machine learning algorithms.
Therefore, the process of cleaning and tidying this data becomes
essential.
The Importance of Tidy Data

In contrast to messy data, tidy data is clean, consistent, and structured


in a way that makes it easily analyzable by machine learning
algorithms. Each variable forms a column, each observation forms a
row, and each type of observational unit forms a table.

Tidy data offers several advantages:

1. Ease of Manipulation: Tidy data is easier to manipulate and


analyze, allowing data scientists to focus on the analysis rather than
dealing with the data’s structure.

2. Simplified Visualization: It makes data visualization simpler


and more intuitive, enabling better understanding and communication
of the data insights.

3. Better ML Model Performance: Machine learning algorithms


perform better with tidy data as they can learn more effectively from
accurate, consistent information.

Steps in Data Cleaning

The process of transforming messy data into tidy data, often known as
data cleaning or data wrangling, involves several key steps:
Data Auditing

The first step involves examining the dataset to identify any errors or
inconsistencies. This step is crucial as it helps establish the nature and
extent of the messiness in the data.
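In pandas, an initial audit might look like the following sketch; the file name raw_data.csv is an assumed placeholder for whatever raw data you are given.

import pandas as pd

df = pd.read_csv("raw_data.csv")    # assumed input file

df.info()                           # column types and non-null counts
print(df.isna().sum())              # missing values per column
print(df.duplicated().sum())        # number of duplicate rows
print(df.describe(include="all"))   # summary statistics to spot odd values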

Workflow Specification

Once the issues have been identified, the next step is to specify the
workflow or steps necessary to clean the data. This might involve
dealing with missing values, removing duplicates, or correcting
inconsistent entries.

Workflow Execution

The third step involves executing the specified workflow, which may
often require writing custom scripts or using specific data cleaning
tools.

Post-Processing Check

After the cleaning process, a post-processing check is performed to


ensure that no errors were introduced during the cleaning process and
that all identified issues have been appropriately addressed.

Data cleaning can be a time-consuming process, but it is a necessary


one. However, by automating as much of the process as possible and
using robust tools and techniques, it can be made more efficient and
less prone to error.
Tools and Techniques for Data
Cleaning

There are many tools and techniques available for data cleaning,
ranging from programming libraries in Python or R, to dedicated data
cleaning tools. The choice of tool often depends on the nature and scale
of the data, as well as the specific cleaning tasks that need to be
performed.

Some common data cleaning tasks include:

1. Handling Missing Values: Missing data can be handled in


several ways, including deleting the rows or columns with missing
data, filling in the missing values with a specified value or estimate, or
using methods like regression or machine learning to predict the
missing values.

2. Removing Duplicates: Duplicate entries can be easily identified


and removed using functions available in most data analysis libraries.

3. Outlier Detection: Outliers can be detected using various


statistical techniques and can either be removed or adjusted,
depending on the context.

4. Data Transformation: Sometimes, data may need to be


transformed to a different format or scale to be suitable for analysis.
5. Normalization and Standardization: Data normalization
(scaling values between 0 and 1) or standardization (scaling values to
have a mean of 0 and a standard deviation of 1) can be necessary for
certain algorithms to perform effectively.
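As a hedged sketch of a few of these tasks in pandas (the small DataFrame and its values are invented for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 31, 25, 40],
    "income": [50_000, 42_000, np.nan, 50_000, 61_000],
})

# Handling missing values: fill with simple estimates (one of several options).
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].mean())

# Removing duplicates: drop rows that are exact copies of earlier ones.
df = df.drop_duplicates()

# Normalization to [0, 1] and standardization to zero mean / unit variance.
df["income_norm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())
df["income_std"] = (df["income"] - df["income"].mean()) / df["income"].std()

print(df)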

Regardless of the tools or techniques used, the goal of data cleaning is


the same: to transform messy data into a tidy format that can be easily
and effectively analyzed.

Conclusion

Data cleaning is an essential step in the data preprocessing pipeline,


ensuring that machine learning algorithms are fed with high-quality,
structured data. Although it can be a time-consuming and complex
process, it is well worth the effort, as tidy data leads to better analysis,
more accurate models, and ultimately, more reliable predictions and
insights.

Prompts:

1. Explain the concept of messy data and why it is problematic in


machine learning.
2. What is tidy data, and how does it differ from messy data?
3. Discuss the advantages of having tidy data for machine learning
tasks.
4. What are the key steps involved in data cleaning?
5. Discuss the importance of data auditing in the data cleaning process.
6. How can data cleaning workflows be specified and executed?
7. What is the purpose of a post-processing check in the data cleaning
process?
8. What tools and techniques are commonly used for data cleaning?
9. Explain how missing values can be handled during data cleaning.
10. Discuss the process of detecting and removing duplicate entries in a
dataset.
11. How can outliers be identified and handled in data cleaning?
12. Explain the concept of data transformation in the context of data
cleaning.
13. Discuss the importance of normalization and standardization in
data cleaning.
14. How does data cleaning contribute to the overall effectiveness of
machine learning models?
15. What factors should be considered when choosing tools or
techniques for data cleaning?

Data cleaning is one of the important parts of machine learning. It plays a significant
part in building a model. In this article, we’ll understand Data cleaning, its significance
and Python implementation.
What is Data Cleaning?
Data cleaning is a crucial step in the machine learning (ML) pipeline, as it involves
identifying and removing any missing, duplicate, or irrelevant data. The goal of data
cleaning is to ensure that the data is accurate, consistent, and free of errors, as incorrect
or inconsistent data can negatively impact the performance of the ML model.
Professional data scientists usually invest a very large portion of their time in this step
because of the belief that “Better data beats fancier algorithms”.
Data cleaning, also known as data cleansing or data preprocessing, is a crucial step in
the data science pipeline that involves identifying and correcting or removing errors,
inconsistencies, and inaccuracies in the data to improve its quality and usability. Data
cleaning is essential because raw data is often noisy, incomplete, and inconsistent,
which can negatively impact the accuracy and reliability of the insights derived from it.
Why is Data Cleaning Important?
Data cleansing is a crucial step in the data preparation process, playing an important
role in ensuring the accuracy, reliability, and overall quality of a dataset. For decision-
making, the integrity of the conclusions drawn heavily relies on the cleanliness of the
underlying data. Without proper data cleaning, inaccuracies, outliers, missing values,
and inconsistencies can compromise the validity of analytical results. Moreover, clean
data facilitates more effective modeling and pattern recognition, as algorithms perform
optimally when fed high-quality, error-free input.
Additionally, clean datasets enhance the interpretability of findings, aiding in the
formulation of actionable insights.
Data Cleaning in Data Science
Data clean-up is an integral component of data science, playing a fundamental role in
ensuring the accuracy and reliability of datasets. In the field of data science, where
insights and predictions are drawn from vast and complex datasets, the quality of the
input data significantly influences the validity of analytical results. Data cleaning
involves the systematic identification and correction of errors, inconsistencies, and
inaccuracies within a dataset, encompassing tasks such as handling missing values,
removing duplicates, and addressing outliers. This meticulous process is essential for
enhancing the integrity of analyses, promoting more accurate modeling, and ultimately
facilitating informed decision-making based on trustworthy and high-quality data.
Steps to Perform Data Cleanliness
Performing data cleaning involves a systematic process to identify and rectify errors,
inconsistencies, and inaccuracies in a dataset. The following are essential steps to
perform data cleaning.
 Removal of Unwanted Observations: Identify and eliminate irrelevant or
redundant observations from the dataset. The step involves scrutinizing data
entries for duplicate records, irrelevant information, or data points that do not
contribute meaningfully to the analysis. Removing unwanted observations
streamlines the dataset, reducing noise and improving the overall quality.
 Fixing Structure errors: Address structural issues in the dataset, such as
inconsistencies in data formats, naming conventions, or variable types.
Standardize formats, correct naming discrepancies, and ensure uniformity in
data representation. Fixing structure errors enhances data consistency and
facilitates accurate analysis and interpretation.
 Managing Unwanted outliers: Identify and manage outliers, which are data
points significantly deviating from the norm. Depending on the context,
decide whether to remove outliers or transform them to minimize their impact
on analysis. Managing outliers is crucial for obtaining more accurate and
reliable insights from the data.
 Handling Missing Data: Devise strategies to handle missing data effectively.
This may involve imputing missing values based on statistical methods,
removing records with missing values, or employing advanced imputation
techniques. Handling missing data ensures a more complete dataset,
preventing biases and maintaining the integrity of analyses.
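A rough end-to-end sketch of these four steps in pandas; the column names, example values, and the 1.5*IQR outlier rule are assumptions chosen for illustration.

import pandas as pd

df = pd.DataFrame({
    "city": [" Pune", "pune", "Delhi ", "DELHI", None],
    "salary": [40_000, 42_000, 1_000_000, 45_000, 47_000],
})

# Removal of unwanted observations: drop exact duplicate rows.
df = df.drop_duplicates()

# Fixing structure errors: standardize text formatting so "pune" and " Pune" match.
df["city"] = df["city"].str.strip().str.title()

# Managing unwanted outliers: keep only salaries inside the IQR fences.
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Handling missing data: fill the remaining gaps with a simple placeholder.
df["city"] = df["city"].fillna("Unknown")

print(df)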

How to Perform Data Cleanliness


Performing data cleansing involves a systematic approach to enhance the quality and
reliability of a dataset. The process begins with a thorough understanding of the data,
inspecting its structure and identifying issues such as missing values, duplicates, and
outliers. Addressing missing data involves strategic decisions on imputation or removal,
while duplicates are systematically eliminated to reduce redundancy. Managing outliers
ensures that extreme values do not unduly influence analysis. Structural errors are
corrected to standardize formats and variable types, promoting consistency.
Throughout the process, documentation of changes is crucial for transparency and
reproducibility. Iterative validation and testing confirm the effectiveness of the data
cleansing steps, ultimately resulting in a refined dataset ready for meaningful analysis
and insights.
Data Cleansing Tools
Some data cleansing tools:
 OpenRefine
 Trifacta Wrangler
 TIBCO Clarity
 Cloudingo
 IBM Infosphere Quality Stage
Advantages of Data Cleaning in Machine Learning:
 Improved model performance: Removal of errors, inconsistencies, and
irrelevant data, helps the model to better learn from the data.
 Increased accuracy: Helps ensure that the data is accurate, consistent, and free
of errors.
 Better representation of the data: Data cleaning allows the data to be transformed
into a format that better represents the underlying relationships and patterns in
the data.
 Improved data quality: Improve the quality of the data, making it more reliable
and accurate.
 Improved data security: Helps to identify and remove sensitive or confidential
information that could compromise data security.
Disadvantages of Data Cleaning in Machine Learning
 Time-consuming: Time-Consuming task, especially for large and complex
datasets.
 Error-prone: Data cleaning can be error-prone, as it involves transforming and
cleaning the data, which can result in the loss of important information or the
introduction of new errors.
 Cost and resource-intensive: Resource-intensive process that requires significant
time, effort, and expertise. It can also require the use of specialized software
tools, which can add to the cost and complexity of data cleaning.
 Overfitting: Data cleaning can inadvertently contribute to overfitting by
removing too much data.
Conclusion
So, we have discussed four different steps in data cleaning to make the data more
reliable and to produce good results. After properly completing the Data Cleaning steps,
we’ll have a robust dataset that avoids many of the most common pitfalls. In summary,
data cleaning is a crucial step in the data science pipeline that involves identifying and
correcting errors, inconsistencies, and inaccuracies in the data to improve its quality and
usability.

Tabular Presentation of Data


A table represents a large amount of data in an arranged, organised, engaging, coordinated and
easy-to-read form; this is called the tabular presentation of data.

TABLE OF CONTENT
 Types of Classification in the tabular representation of data.
 What are the main parts of a presentation of data in tabular form?
 Disadvantages of a tabular representation of the data.
When a table is used to represent a large amount of data in an arranged, organised, engaging,
coordinated and easy to read form it is called the tabular representation of data. In tabular
representation of data, the given data set is presented in rows and columns. The rows and columns
method is one of the most popular forms of data representation as data tables are simple to prepare
and read. Tabular representation of data makes the representation of data more significant for more
additional statistical treatment and decision making.

Types of Classification in the tabular representation of data
The analysis used for the tabular representation of data is of four types. They are: Quantitative,
Qualitative, Temporal, and Spatial.

Quantitative Classification: In this analysis, the data is classified and distributed on the basis of
features that are quantitative in nature; in simpler terms, the features can be calculated by
estimating their quantitative value.
Qualitative Classification: When the data is classified and distributed according to traits such as
physical status, nationality, social status, etc., it is called qualitative classification.
Temporal Classification: In this type of classification, time becomes the basis for categorising and
distributing the variables of the data; time here could mean years, months, days, hours, etc.
Spatial Classification: In this type of classification, the data is categorised and distributed on the
basis of location; the location or place could be a country, state, district, block, village/town, etc.
The aim and objectives of a tabular representation of data are that they represent the complex set of
data in a simplified data form. Tabular representation of data brings out the essential features of data
and facilitates statistics. Using the tabular representation of data also saves space.

What are the main parts of a presentation of data in tabular form?
The main parts of a Table are table number, title, headnote, captions or column headings, stubs or
row headings, the body of the table, source note, and footnote.

Table number – the purpose of identification and an easy reference is provided in the table
number.
Title – it provides the basis of information adjacent to the number.
Column headings or captions – it is put up at the top columns of the table; the columns come with
specific figures within.
Footnote – it gives a scope or potential for the further explanation that might be required for any
item included in the table; the footnote is needed to clarify data.
Row heading and Stub – this provides specific issues mentioned in the horizontal rows. The stub is
provided on the left side of the table.
Information source – it is included at the bottom of the table. The information source tells us the
source related to the specific piece of information and the authenticity of the sources.

Disadvantages of a tabular representation of the data
There are, however, a few limitations of the presentation of data in tabular form. The first
limitation is the lack of description: the data in tabular form is represented only with figures, not
attributes, which ignores the qualitative aspect of the facts. The second limitation is that data in
tabular form cannot present individual items; it represents aggregate data. The third limitation is
that the tabular representation of data needs special knowledge to understand, and a layman
cannot easily use it.
Conclusion

A table is used to represent a large amount of data in an arranged, organised, engaging,
coordinated and easy-to-read form; this is called the tabular representation of data, in which the
given data set is presented in rows and columns. The main parts of a table are the table number,
title, headnote, captions or column headings, stubs or row headings, the body of the table, source
note, and footnote.

Different forms of data representation in today’s world

Overview:
Data can be anything that represents a specific result: a number, text, an image, audio, video, etc.
For example, for a human being, data such as name, personal ID, country, profession, and bank
account details are important data. Data can be divided into three categories: personal, public,
and private.
Forms of data representation :
At present Information comes in different forms such as follows.

1. Numbers
2. Text
3. Images
4. Audio
5. Video

Text –
Text is also represented as a bit pattern, i.e., a sequence of bits (such as 0001111). Different bit
patterns are assigned to represent text symbols; a code in which each number represents a
character can be used to convert text into binary.
Text File Formats –
.doc, .docx, .pdf, .rtf, .txt, etc.
Example:
The letter 'a' has the binary representation 0110 0001.
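This mapping can be checked with a short Python snippet:

# Each character maps to a numeric code, stored as a pattern of bits.
for ch in "ab":
    print(ch, ord(ch), format(ord(ch), "08b"))
# a 97 01100001
# b 98 01100010

# Whole strings are encoded to bytes (here using UTF-8).
print("data".encode("utf-8"))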

What is Graph Database –


Introduction



What is a Graph Database?
A graph database (GDB) is a database that uses graph structures for storing data. It uses
nodes, edges, and properties instead of tables or documents to represent and store data.
The edges represent relationships between the nodes. This helps in retrieving data more
easily and, in many cases, with a single operation. Graph databases are commonly referred to
as NoSQL databases.
Representation:
A graph database is based on graph theory. The data is stored in the nodes of the graph,
and the relationships between the data are represented by the edges between the nodes.

graph representation of data

When do we need Graph Database?

1. It solves many-to-many relationship problems

Relationships such as "friends of friends" are many-to-many relationships. A graph database is
useful when the equivalent query in a relational database would be very complex.

2. When relationships between data elements are more important

For example, a profile contains some specific information, but the major selling point is the
relationships between the different profiles, because that is how you get connected within a
network.
In the same way, a graph database may contain many user data elements, but it is the
relationships between these data elements that matter most for everything stored inside the
graph database.

3. Low latency with large scale data


When you add lots of relationships in a relational database, the data sets become huge, the
queries become more complex, and they take longer than usual to run. A graph database,
however, is designed specifically for this purpose, so one can query relationships with ease.

Why do graph databases matter? Because graphs are good at handling relationships, some
databases store data in the form of a graph.
Example: We have a social network in which five friends are all connected. These friends are
Anay, Bhagya, Chaitanya, Dilip, and Erica. A graph database that stores their personal
information may look something like this:

id  first name   last name   email                   phone
1   Anay         Agarwal     anay@example.net        555-111-5555
2   Bhagya       Kumar       bhagya@example.net      555-222-5555
3   Chaitanya    Nayak       chaitanya@example.net   555-333-5555
4   Dilip        Jain        dilip@example.net       555-444-5555
5   Erica        Emmanuel    erica@example.net       555-555-5555
Now, we will also need another table to capture the friendship/relationship between
users/friends. Our friendship table will look something like this:
user_id friend_id
1 2
1 3
1 4
1 5
2 1
2 3
2 4
2 5
3 1
3 2
3 4
3 5
4 1
4 2
4 3
4 5
5 1
5 2
5 3
5 4
We will avoid going deep into database theory (primary keys and foreign keys). Instead, just
assume that the friendship table uses the ids of both friends. Assume that our social network has
a feature that allows every user to see the personal information of his/her friends. So, if
Chaitanya were requesting information, she would need information about Anay, Bhagya, Dilip,
and Erica. We will first approach this problem the traditional way (relational database), so we
must first identify Chaitanya's id in the users table:

id  first name   last name   email                   phone
3   Chaitanya    Nayak       chaitanya@example.net   555-333-5555
Now, we’d look for all tuples in the friendship table where user_id is 3. The resulting relation
would be something like this:
user_id friend_id
3 1
3 2
3 4
3 5
Now, let’s analyse the time taken by this relational-database approach. Looking up a user takes
approximately log(N) time, where N is the number of tuples in the friendship table (i.e., the
number of relations), because the database maintains the rows in order of id. So, in general, for
M such queries we have a time complexity of M*log(N). Had we used a graph database approach
instead, the total time complexity would have been O(N), because once we’ve located Chaitanya
in the database, we only need a single step to find her friends.
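The actual query language depends on the graph database being used, but the idea of a single-step neighbour lookup can be sketched in plain Python with an adjacency map built from the friendship table above (this is an illustration of the graph idea, not a real graph database query).

# Adjacency map from the friendship table: user_id -> set of friend ids.
friends = {
    1: {2, 3, 4, 5},
    2: {1, 3, 4, 5},
    3: {1, 2, 4, 5},
    4: {1, 2, 3, 5},
    5: {1, 2, 3, 4},
}

users = {1: "Anay", 2: "Bhagya", 3: "Chaitanya", 4: "Dilip", 5: "Erica"}

# Once Chaitanya (id 3) is located, her friends are a single lookup away,
# instead of searching a separate friendship table for matching rows.
print(sorted(users[f] for f in friends[3]))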

Advantages: The graph model handles frequent schema changes, large volumes of data,
real-time query response-time requirements, and more intelligent data-activation requirements
well.
Disadvantages: Note that graph databases aren't always the best solution for an application. We
will need to assess the needs of the application before deciding on the architecture.
Limitations of Graph Databases:
 Graph databases may not offer a better choice than other NoSQL
variations.
 If the application needs to scale horizontally, this may lead to poor
performance.
 They are not very efficient when all nodes need to be updated with a given parameter.
The Types of Modern Databases

Overview
Today, databases are everywhere. Often hidden in plain sight, databases power online banking,
airline reservations, medical records, employment records, and personal transactions.

But what is a database? A database is a shared collection of related


data used to support the activities of a particular organization. It can be viewed as a repository
of data that is defined and accessed by various users (Database Design 2nd Ed, Watt & Eng).
This definition is inherently broad: in practice, databases differ by intended purpose—each is
dependent on the type of data stored and the type of transactions that occur.

Is data most frequently being read or written? Does data need to be accessed by row or column?
How can the database management system ensure control over data integrity,
avoid redundancy, and secure data while performing optimally? This article will present and
assess modern databases, assisting in the research and selection of an optimal system for
information storage and retrieval.
Selecting the Optimal Database
There are a plethora of characteristics to evaluate when selecting the ideal database for your
team/project. The “right” solution will be the one that best serves your needs and intended use.
Characteristics like business intelligence optimization, data structure, storage type,
data volume, query speed, and data model requirements play a role, but the best solution will be
one tailored to your use-case.

A good starting point for selecting the optimal database is to understand the different database
structures:

 Hierarchical: A hierarchical database utilizes a parent-child relationship to organize
and manage data in a tree-like structure. In this type of database, the schema of a
hierarchy has a single root.
 Network: A network database enables a child record to link
to several parent records, allowing multi-directional
relationships. This allows several records to link to the same owner file.
 Object-Oriented: Object-Oriented Databases (OODBs) are
used for complex data structures to reduce overhead and flatten the object for
storage. The structure of the objects determines the relationships
between objects.
 Relational: A relational database structures the data as a two-
dimensional array. The data is placed into tables and organized
by rows and columns. Relational databases use keys within a
column to order and create relationships to other tables.
 Non-relational: A non-relational database doesn’t use a tabular schema that
most database systems use. Instead, it utilizes numerous formats
for database design, enabling more flexibility and scalability in the
design to accommodate different data types.
Most organizations opt for a relational database leveraging SQL (RDBMS—relational database
management system) or the less-common non-relational database (like NoSQL).

Relational & Non-Relational Systems

Relational Database Systems (RDBMS)


A relational database (or SQL database) stores data in tables and rows, also referred to as
records. The term “relational database” is not new—it's been around since the 1970’s and was
coined by researchers at IBM. Some of the most popular relational database systems include
Postgres, MySQL, SQLite, and Microsoft SQL Server.
A relational database works by linking information from multiple tables through the use of keys,
which are identifiers assigned to a row of data. The unique identifier, a “primary” key, can
be attached to a corresponding record in another table (when the two records are related). This
attached record is then known as a “foreign” key. The primary key (PK), foreign key (FK)
relationship allows the two records to be “joined” using SQL.
An example relationship between primary and foreign keys

A significant advantage of the RDBMS is “referential integrity,” which refers to the
consistency and accuracy of data. Referential integrity is obtained through proper use of
primary and foreign keys.

Non-Relational Databases
Non-relational or NoSQL databases are also used to store data, but unlike relational databases,
there are no tables, rows, primary keys, or foreign keys. Instead, these data-stores use
models optimized for specific data types. The four most popular non-relational
types are document data stores, key-value stores, graph databases, and search engine stores.


Popular Modern Databases

Snowflake

Snowflake is an analytic data warehouse provided as Software-as-a-Service (SaaS). The
Snowflake data warehouse uses a new SQL database engine with a unique architecture
designed for the cloud.

To the user, Snowflake has many similarities to other enterprise data warehouses, but
contains additional functionality and unique capabilities. Snowflake is a turn-key
solution for managing data engineering and science, data warehouses and lakes, and the
creation of data applications—including sharing data internally and externally.

Snowflake provides one platform that can integrate, manage, and collaborate securely across
different workloads. It also scales with your business and can run seamlessly across
multiple clouds.

Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud.
Amazon Redshift is a cost-effective solution for:
 Running high-performance queries to improve business
intelligence.
 Generating real-time operational analytics on events,
applications, and systems.
 Sharing data securely inside and outside the organization.
 Creating predictive analytics of the data in your data warehouse.

PostgreSQL

PostgreSQL, usually referred to as Postgres, is a free, open-source, object-relational
database management system (ORDBMS) emphasizing standards compliance and
extensibility.

PostgreSQL has earned a strong reputation for its proven architecture, reliability, data integrity,
robust feature set, extensibility, and the dedication of the open-source community behind the
software to consistently deliver performant and innovative solutions.

There are many benefits to a PostgreSQL database, not limited to:

 Custom functions in a host of languages.


 A wide array of data types and the ability to construct new types.
 Robust access control, authentication, and privilege management.
 Support for views and materialized views.
MySQL

MySQL is also a free, open-source RDBMS. MySQL runs on virtually all platforms,
including Windows, UNIX, and Linux. A popular database used globally, MySQL
offers a multitude of benefits:
 Industry leading data security.
 On-demand flexibility that enables a smaller footprint or massive
warehouse.
 Distinct storage engine that facilitates high performance.
 24/7 uptime with a range of high availability solutions.
 Comprehensive support for transactions.

Microsoft SQL Server

Microsoft SQL Server is an RDBMS database that supports a wide variety of analytic
applications in corporate IT, transaction processing, and business intelligence.

Organizations can implement a Microsoft SQL Server on-premise or in the cloud, depending on
structural needs. It can also run on Windows, Linux, or Docker systems. As part of the Microsoft
toolkit, many of Microsoft's applications and software products integrate well with Microsoft SQL Server.

Distributed Database System





A distributed database is basically a database that is not limited to one system; it is spread
over different sites, i.e., over multiple computers or a network of computers. A
distributed database system is located on various sites that don't share physical
components. This may be required when a particular database needs to be accessed by
various users globally. It needs to be managed such that, to the users, it looks like one
single database.
Types:
1. Homogeneous Database:
In a homogeneous database, all sites store the database identically. The operating
system, database management system, and the data structures used are all the same at
every site. Hence, such systems are easy to manage.
2. Heterogeneous Database:
In a heterogeneous distributed database, different sites can use different schema and
software that can lead to problems in query processing and transactions. Also, a particular
site might be completely unaware of the other sites. Different computers may use a
different operating system, different database application. They may even use different
data models for the database. Hence, translations are required for different sites to
communicate.
Distributed Data Storage :
There are 2 ways in which data can be stored on different sites. These are:
1. Replication –
In this approach, an entire relation is stored redundantly at two or more sites. If the
entire database is available at all sites, it is a fully redundant database. Hence, in
replication, systems maintain copies of the data.
This is advantageous as it increases the availability of data at different sites, and
query requests can be processed in parallel.
However, it has certain disadvantages as well. Data needs to be kept constantly up to date:
any change made at one site needs to be recorded at every site where that relation is stored,
or else it may lead to inconsistency. This is a lot of overhead. Also, concurrency control
becomes much more complex, as concurrent access now needs to be checked across a
number of sites.
2. Fragmentation –
In this approach, the relations are fragmented (i.e., divided into smaller parts)
and each fragment is stored at the site where it is required. It must be
ensured that the fragments can be used to reconstruct the original
relation (i.e., there isn't any loss of data).
Fragmentation is advantageous as it doesn't create copies of data, so consistency is not a
problem.

Fragmentation of relations can be done in two ways (a small sketch in code follows this list):

 Horizontal fragmentation – Splitting by rows –
The relation is fragmented into groups of tuples so that each tuple is assigned
to at least one fragment.
 Vertical fragmentation – Splitting by columns –
The schema of the relation is divided into smaller schemas. Each fragment
must contain a common candidate key so as to ensure a lossless join.
In certain cases, an approach that is a hybrid of fragmentation and replication is used.
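As a rough sketch of the two fragmentation styles described above (the employees relation and
its columns below are invented for illustration, with a pandas DataFrame standing in for a
stored relation):

import pandas as pd

employees = pd.DataFrame({
    "emp_id": [1, 2, 3, 4],
    "name":   ["Asha", "Bilal", "Chen", "Dara"],
    "dept":   ["Sales", "Sales", "HR", "HR"],
    "salary": [50000, 52000, 48000, 51000],
})

# Horizontal fragmentation: split by rows; each site stores the tuples it needs.
site_sales = employees[employees["dept"] == "Sales"]
site_hr    = employees[employees["dept"] == "HR"]
reconstructed = pd.concat([site_sales, site_hr])          # union of fragments = original relation

# Vertical fragmentation: split by columns; each fragment keeps the key (emp_id)
# so the original relation can be rebuilt with a lossless join.
frag_personal = employees[["emp_id", "name", "dept"]]
frag_payroll  = employees[["emp_id", "salary"]]
rebuilt = frag_personal.merge(frag_payroll, on="emp_id")  # lossless join on the candidate key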
Applications of Distributed Database:
 It is used in corporate management information systems.
 It is used in multimedia applications.
 It is used in military control systems, hotel chains, etc.
 It is also used in manufacturing control systems.
A distributed database system is a type of database management system that stores
data across multiple computers or sites that are connected by a network. In a distributed
database system, each site has its own database, and the databases are connected to each
other to form a single, integrated system.

The main advantage of a distributed database system is that it can provide higher
availability and reliability than a centralized database system. Because the data is stored
across multiple sites, the system can continue to function even if one or more sites fail. In
addition, a distributed database system can provide better performance by distributing the
data and processing load across multiple sites.

There are several different architectures for distributed database systems, including:

Client-server architecture: In this architecture, clients connect to a central server, which
manages the distributed database system. The server is responsible for coordinating
transactions, managing data storage, and providing access control.

Peer-to-peer architecture: In this architecture, each site in the distributed database
system is connected to all other sites. Each site is responsible for managing its own data
and coordinating transactions with other sites.

Federated architecture: In this architecture, each site in the distributed database system
maintains its own independent database, but the databases are integrated through a
middleware layer that provides a common interface for accessing and querying the data.

Distributed database systems can be used in a variety of applications, including
e-commerce, financial services, and telecommunications. However, designing and
managing a distributed database system can be complex and requires careful
consideration of factors such as data distribution, replication, and consistency.

Advantages of Distributed Database System :

1) Data processing is fast, as several sites participate in request processing.
2) Reliability and availability of the system are high.
3) It has reduced operating costs.
4) It is easier to expand the system by adding more sites.
5) It has improved sharing ability and local autonomy.
Disadvantages of Distributed Database System :

1) The system becomes complex to manage and control.
2) Security issues must be carefully managed.
3) The system requires deadlock handling during transaction processing, otherwise
the entire system may end up in an inconsistent state.
4) Some standardization is needed for processing distributed database systems.

Difference between Spreadsheet and Database





1. Spreadsheet :
A file that consists of cells arranged in rows and columns, and that can help arrange,
calculate, and sort data, is known as a spreadsheet. A cell can hold a numeric value, text,
a formula, or a function. Its columns and rows keep the entered information legible and
simple to understand. It is, in effect, an electronic graph sheet.
Example –
Microsoft Excel, Lotus 1-2-3.
2. Database :
It is an organized collection of data arranged for ease and speed of search and retrieval. It
contains multiple tables. A database engine can sort, change or serve the information on
the database. Basically, it is a set of information which is held in a computer.
Example –
Banks use databases to keep track of customer accounts, balances and deposits.

Difference between Spreadsheet and Database :

Spreadsheet: It is accessed directly by the user.
Database: It is accessed by a user or by an application to enter or modify the data.

Spreadsheet: It can be classified by numeric relationships.
Database: It can be classified by the classification of data.

Spreadsheet: It is easy to start with.
Database: It is difficult to start with, since SQL must be learned.

Spreadsheet: It is easy for the user to learn.
Database: It is somewhat more difficult than a spreadsheet to learn.

Spreadsheet: It is an interactive computer application for the organization, analysis, and
storage of data in tabular form.
Database: It is an organized collection of data, generally stored and accessed electronically
from a computer system.

Spreadsheet: It stores less data.
Database: It stores more data.

Spreadsheet: It is used for accounting tasks.
Database: It is used in large enterprises to store data.

Spreadsheet: It produces tables and graphs for presentations.
Database: It can produce reports easily, but the final data would need to be copied to a
spreadsheet or other tool to make it look good.

Spreadsheet: It has more formatting features than a database.
Database: It automatically updates forms, reports, and queries when data is updated.

Most Used Data Science Tools





Data Science is the art of drawing out and visualizing useful insights from data. Basically, it
is the process of collecting, analyzing, and modeling data to solve real-world problems.
To carry out these operations we rely on tools that let us manipulate data and entities
without writing everything from scratch in a core programming language: they provide
pre-defined functions, algorithms, and a user-friendly Graphical User Interface (GUI).
Because Data Science workflows move quickly and span many stages, a single tool is
rarely enough, so several tools are used together.

1. Apache Hadoop

Apache Hadoop is a free, open-source framework from the Apache Software Foundation,
licensed under the Apache License 2.0, that can manage and store very large amounts of
data. It is used for high-level computations and data processing. Thanks to its parallel
processing nature, work can be spread across clusters of many nodes. It also helps solve
highly complex computational problems and data-intensive tasks.
 Hadoop offers standard libraries and functions for its subsystems.
 It scales large data effectively on thousands of Hadoop clusters.
 It can speed up disk-based performance by up to 10 times per project.
 It provides the functionalities of modules like Hadoop Common, Hadoop
YARN, and Hadoop MapReduce.

2. SAS (Statistical Analysis System)

SAS is a statistical tool developed by SAS Institute. It is a closed source proprietary software
that is used by large organizations to analyze data. It is one of the oldest tools developed for Data
Science. It is used in areas like Data Mining, Statistical Analysis, Business Intelligence
Applications, Clinical Trial Analysis, Econometrics & Time-Series Analysis.

Latest Version: SAS 9.4

 It is a suite of well-defined tools.


 It has a simple but highly effective GUI.
 It provides granular analysis of textual content.
 It is easy to learn and use, as there are many tutorials available with the
appropriate depth of coverage.
 It can make visually appealing reports, with seamless and dedicated technical
support.

3. Apache Spark

Apache Spark is a data science tool developed by the Apache Software Foundation for
analyzing and working on large-scale data. It is a unified analytics engine for large-scale data
processing, designed to handle both batch processing and stream processing. It allows
you to run programs over clusters of data, incorporating data parallelism and fault
tolerance. It inherits some features of the Hadoop ecosystem, such as YARN, MapReduce,
and HDFS.

Latest Version: Apache Spark 2.4.5

 It offers data cleansing, transformation, model building & evaluation.
 Its ability to work in-memory makes it extremely fast at processing data
and writing to disk.
 It provides many APIs that facilitate repeated access to data.

4. Data Robot

DataRobot, founded in 2012, is a leader in enterprise AI that aids in developing accurate
predictive models for the real-world problems of any organization. It provides an environment
to automate the end-to-end process of building, deploying, and maintaining AI.
DataRobot's Prediction Explanations help you understand the reasons behind your machine
learning model's results.

 Highly interpretable.
 It makes a model's predictions easy to explain to anyone.
 It is suitable for implementing the whole Data Science process at a large
scale.

5. Tableau

Tableau is one of the most popular data visualization tools on the market. The American
interactive data visualization software company behind it was founded in January 2003 and
was recently acquired by Salesforce. Tableau makes it possible to break raw, unformatted
data down into a processable and understandable format, and it can visualize geographical
data by plotting longitudes and latitudes on maps.

Latest Version: Tableau 2020.2

 It offers comprehensive end-to-end analytics.


 It is a well-protected system that reduces security risks to a minimum.
 It provides a responsive user interface that fits all types of devices and screen
dimensions.

6. BigML
BigML, founded in 2011, is a Data Science tool that provides a fully interactive, cloud-based
GUI environment for running complex Machine Learning algorithms. The
main goal of BigML is to make building and sharing datasets and models easy for
everyone. It provides an environment with just one framework, reducing dependencies.

Latest Version: BigML Winter 2020

 It specializes in predictive modeling.


 Its ability to export models via JSON PML and PMML makes for a seamless
transition from one platform to another.
 It provides an easy-to-use web interface and REST APIs.

7. TensorFlow

TensorFlow, developed by the Google Brain team, is a free and open-source software library for
dataflow and differentiable programming across a range of tasks. It provides an environment for
building and training models and deploying them on platforms such as computers, smartphones,
and servers, so as to achieve maximum potential with finite resources. It is one of the most
useful tools in the fields of Artificial Intelligence, Deep Learning, & Machine Learning.

Latest Version: TensorFlow 2.2.0

 It provides good performance and high computational abilities.


 Can run on both CPUs and GPUs.
 It provides features like easily trainable models and responsive constructs.

8. Jupyter

Jupyter, developed by Project Jupyter (launched in February 2015), provides open-source
software, open standards, and services for interactive computing across dozens of programming
languages. It is a web-based application running on a kernel, used for writing live code,
visualizations, and presentations. It is one of the best tools for beginner-level programmers and
data science aspirants, who can use it to easily learn and practice skills related to the Data
Science field.
Latest Version: Jupyter Notebook 6.0.3

 It provides an environment to perform data cleaning, statistical computation,
visualization, and the creation of predictive machine learning models.
 It can display plots as the output of running code cells.
 It is quite extensible, supports many programming languages, and is easily hosted on
almost any server.

Images –

Images are also represented as bit patterns. An image is composed of a matrix of pixels,
where each pixel is a dot with its own value. The size of the picture depends on its
resolution. Consider a simple black-and-white image: if 1 is black (or on) and 0 is white
(or off), then a simple black-and-white picture can be created using binary.

Image File Formats –

Image can be in the format of jpeg, PNG, TIFF, GIF, etc.

Image Data Representation -

one bit per pixel (0 or 1) - two possible colors

two bits per pixel (00 to 11) - four possible colors

three bits per pixel (000 to 111) - eight possible colors

four bits per pixel (0000 to 1111) - 16 possible colors
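A tiny Python sketch of the same idea, showing that n bits per pixel give 2**n possible colors
and that a 1-bit image is just a matrix of 0s and 1s (the 5x5 picture below is only an
illustration):

# Colors double with each additional bit per pixel
for bits in range(1, 5):
    print(bits, "bit(s) per pixel ->", 2 ** bits, "possible colors")

# A 5x5 black-and-white "image": 1 = black (on), 0 = white (off)
image = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0],
]
for row in image:
    print("".join("#" if pixel else "." for pixel in row))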

WHAT IS PREDICTIVE MODELING?


Predictive modeling is the process of creating, testing, and validating a model that best
predicts the probability of an outcome. A number of modeling methods from machine
learning, artificial intelligence, and statistics are available in predictive
analytics software solutions for this task.
The model is chosen on the basis of testing, validation, and evaluation, using detection
theory to estimate the probability of an outcome for a given amount of input
data. Models can use one or more classifiers in trying to determine the probability that a
set of data belongs to another set. The different models available in the modeling
portfolio of predictive analytics software enable users to derive new information about the
data and to develop predictive models.
Each model has its own strengths and weaknesses and is best suited to particular types of
problems. A model is reusable: it is created by training an algorithm on historical
data and saving the model for later reuse, so that the common business rules it captures
can be applied to similar data and results can be analyzed without the historical data,
simply by using the trained algorithm.

Most predictive modeling software solutions can export the model information to a local
file in the industry-standard Predictive Model Markup Language (PMML) format, for
sharing the model with other PMML-compliant applications that perform analysis on
similar data.
Business process on Predictive Modeling
1. Creating the model: Software solutions allow you to create a model by running one or
more algorithms on the data set.
2. Testing the model: Test the model on the data set. In some scenarios, testing is
done on past data to see how well the model predicts.
3. Validating the model: Validate the model run results using visualization tools and
an understanding of the business data.
4. Evaluating the model: Evaluate the models used and choose the one best
fitted to the data.
Predictive modeling process
The process involves running one or more algorithms on the data set on which prediction is
to be carried out. This is an iterative process and often involves training the
model, trying multiple models on the same data set, and finally arriving at the best-fit
model based on an understanding of the business data.
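A minimal scikit-learn sketch of this create / test / validate / evaluate loop (the synthetic
dataset and the two candidate models below are illustrative assumptions, not prescribed by the
text):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Stand-in for "historical data"
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)                         # create the model on training data
    predictions = model.predict(X_test)                 # test it on held-out data
    scores[name] = accuracy_score(y_test, predictions)  # validate / evaluate

best = max(scores, key=scores.get)                      # keep the best-fit model for reuse
print(scores, "->", best)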

Models Category
1. Predictive models: The models in this category analyze past performance
to make predictions about the future.
2. Descriptive models: The models in this category quantify the
relationships in data in a way that is often used to classify data sets into groups.
3. Decision models: Decision models describe the relationship between all the
elements of a decision in order to predict the results of decisions involving many
variables.
Algorithms
Algorithms perform data mining and statistical analysis in order to determine trends and
patterns in data. Predictive analytics software solutions have built-in algorithms such
as regressions, time series, outliers, decision trees, k-means, and neural networks for
doing this. Most of the software also provides integration with open-source R libraries.
1. Time Series algorithms perform time-based predictions. Example algorithms
are Single Exponential Smoothing, Double Exponential Smoothing, and Triple
Exponential Smoothing.
2. Regression algorithms predict continuous variables based on other variables
in the dataset. Example algorithms are Linear Regression, Exponential Regression,
Geometric Regression, Logarithmic Regression, and Multiple Linear Regression.
3. Association algorithms find frequent patterns in large transactional
datasets to generate association rules. An example algorithm is Apriori.
4. Clustering algorithms cluster observations into groups of similar observations.
Example algorithms are K-Means, Kohonen, and TwoStep.
5. Decision Tree algorithms classify and predict one or more discrete variables based
on other variables in the dataset. Example algorithms are C4.5 and CNR Tree.
6. Outlier Detection algorithms detect outlying values in the dataset. Example
algorithms are Inter-Quartile Range and Nearest Neighbour Outlier.
7. Neural Network algorithms perform forecasting, classification, and statistical
pattern recognition. Example algorithms are the NNet Neural Network and MONMLP
Neural Network.
8. Ensemble models are a form of Monte Carlo analysis in which multiple numerical
predictions are made using slightly different initial conditions.
9. Factor Analysis deals with variability among observed, correlated variables in terms
of a potentially lower number of unobserved variables called factors. An example
algorithm is the maximum likelihood algorithm.
10. Naive Bayes classifiers are probabilistic classifiers based on applying Bayes' theorem
with strong (naive) independence assumptions.
11. Support vector machines are supervised learning models with associated learning
algorithms that analyze data and recognize patterns, used for classification and
regression analysis.
12. Uplift modeling models the incremental impact of a treatment on an individual's
behavior.
13. Survival analysis is the analysis of time-to-event data.
Features in Predictive Modeling
1) Data analysis and manipulation: Tools to analyze data and to create, modify, combine,
categorize, merge, and filter data sets.
2) Visualization: Visualization features include interactive graphics and reports.
3) Statistics: Statistical tools to establish and confirm the relationships between variables
in the data. Statistics from different statistical software packages can be integrated into
some of the solutions.
4) Hypothesis testing: Creation of models, evaluation, and choosing of the right model.
What is a Generative Model?
A generative model is a type of machine learning model that aims to learn the underlying patterns or
distributions of data in order to generate new, similar data. In essence, it's like teaching a computer to
dream up its own data based on what it has seen before. The significance of this model lies in its ability to
create, which has vast implications in various fields, from art to science.

Generative Models Explained

Generative models are a cornerstone in the world of artificial intelligence (AI). Their
primary function is to understand and capture the underlying patterns or distributions
from a given set of data. Once these patterns are learned, the model can then generate
new data that shares similar characteristics with the original dataset.

Imagine you're teaching a child to draw animals. After showing them several pictures of
different animals, the child begins to understand the general features of each animal.
Given some time, the child might draw an animal they've never seen before, combining
features they've learned. This is analogous to how a generative model operates: it
learns from the data it's exposed to and then creates something new based on that
knowledge.

The distinction between generative and discriminative models is fundamental in


machine learning:

Generative models: These models focus on understanding how the data is generated.
They aim to learn the distribution of the data itself. For instance, if we're looking at
pictures of cats and dogs, a generative model would try to understand what makes a cat
look like a cat and a dog look like a dog. It would then be able to generate new images
that resemble either cats or dogs.

Discriminative models: These models, on the other hand, focus on distinguishing between
different types of data. They don't necessarily learn or understand how the data is generated;
instead, they learn the boundaries that separate one class of data from another. Using the same
example of cats and dogs, a discriminative model would learn to tell the difference between the
two, but it wouldn't necessarily be able to generate a new image of a cat or dog on its own.

In the realm of AI, generative models play a pivotal role in tasks that require the creation of new
content. This could be in the form of synthesizing realistic human faces, composing music, or
even generating textual content. Their ability to "dream up" new data makes them invaluable in
scenarios where original content is needed, or where the augmentation of existing datasets is
beneficial.

In essence, while discriminative models excel at classification tasks, generative models shine in
their ability to create. This creative prowess, combined with their deep understanding of data
distributions, positions generative models as a powerful tool in the AI toolkit.
Types of Generative Models

Generative models come in various forms, each with its unique approach to understanding and
generating data. Here's a more comprehensive list of some of the most prominent types:

 Bayesian networks. These are graphical models that represent the probabilistic
relationships among a set of variables. They're particularly useful in scenarios
where understanding causal relationships is crucial. For example, in medical
diagnosis, a Bayesian network might help determine the likelihood of a disease
given a set of symptoms.
 Diffusion models. These models describe how things spread or evolve over
time. They're often used in scenarios like understanding how a rumor spreads in
a network or predicting the spread of a virus in a population.
 Generative Adversarial Networks (GANs). GANs consist of two neural
networks, the generator and the discriminator, that are trained together. The
generator tries to produce data, while the discriminator attempts to distinguish
between real and generated data. Over time, the generator becomes so good that
the discriminator can't tell the difference. GANs are popular in image
generation tasks, such as creating realistic human faces or artworks.
 Variational Autoencoders (VAEs). VAEs are a type of autoencoder that
produces a compressed representation of input data, then decodes it to generate
new data. They're often used in tasks like image denoising or generating new
images that share characteristics with the input data.
 Restricted Boltzmann Machines (RBMs). RBMs are neural networks with
two layers that can learn a probability distribution over its set of inputs. They've
been used in recommendation systems, like suggesting movies on streaming
platforms based on user preferences.
 Pixel Recurrent Neural Networks (PixelRNNs). These models generate
images pixel by pixel, using the context of previous pixels to predict the next
one. They're particularly useful in tasks where the sequential generation of data
is crucial, like drawing an image line by line.
 Markov chains. These are models that predict future states based solely on the
current state, without considering the states that preceded it. They're often used
in text generation, where the next word in a sentence is predicted based on the
current word (a small sketch in code follows this list).
 Normalizing flows. These are a series of invertible transformations applied to
simple probability distributions to produce more complex distributions. They're
useful in tasks where understanding the transformation of data is crucial, like in
financial modeling.
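As a concrete (and deliberately tiny) illustration of the Markov-chain idea above, the sketch
below builds a word-level transition table from a toy sentence and samples new text from it;
the corpus and seed word are made up for the example:

import random

text = "the cat sat on the mat and the cat ran"
words = text.split()

# Transition table: current word -> list of words that have followed it
transitions = {}
for current, nxt in zip(words, words[1:]):
    transitions.setdefault(current, []).append(nxt)

random.seed(0)
word = "the"
generated = [word]
for _ in range(8):
    followers = transitions.get(word)
    if not followers:          # dead end: the last word never appeared mid-sentence
        break
    word = random.choice(followers)
    generated.append(word)
print(" ".join(generated))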

Real-World Use Cases of Generative Models

Generative models have penetrated mainstream consumption, revolutionizing the way we


interact with technology and experience content, for example:

 Art creation. Artists and musicians are using generative models to create new
pieces of art or compositions, based on styles they feed into the model. For
example, Midjourney is a very popular tool that is used to generate artwork.
 Drug discovery. Scientists can use generative models to predict molecular
structures for new potential drugs.
 Content creation. Website owners leverage generative models to speed up the
content creation process. For example, Hubspot's AI content writer helps
marketers generate blog posts, landing page copy and social media posts.
 Video games. Game designers use generative models to create diverse and
unpredictable game environments or characters.

What are the Benefits of Generative Models?

Generative models, with their unique ability to create and innovate, offer a plethora of
advantages that extend beyond mere data generation. Here's a deeper dive into the myriad
benefits they bring to the table:

 Data augmentation. In domains where data is scarce or expensive to obtain,


generative models can produce additional data to supplement the original set.
For instance, in medical imaging, where obtaining large datasets can be
challenging, these models can generate more images to aid in better training of
diagnostic tools.
 Anomaly detection. By gaining a deep understanding of what constitutes
"normal" data, generative models can efficiently identify anomalies or outliers.
This is particularly useful in sectors like finance, where spotting fraudulent
transactions quickly is paramount.
 Flexibility. Generative models are versatile and can be employed in a range of
learning scenarios, including unsupervised, semi-supervised, and supervised
learning. This adaptability makes them suitable for a wide array of tasks.
 Personalization. These models can be tailored to generate content based on
specific user preferences or inputs. For example, in the entertainment industry,
generative models can create personalized music playlists or movie
recommendations, enhancing user experience.
 Innovation in design. In fields like architecture or product design, generative
models can propose novel designs or structures, pushing the boundaries of
creativity and innovation.
 Cost efficiency. By automating the creation of content or solutions, generative
models can reduce the costs associated with manual production or research,
leading to more efficient processes in industries like manufacturing or
entertainment.

What are the Limitations of Generative Models?

While generative models are undeniably powerful and transformative, they are not without their
challenges. Here's an exploration of some of the constraints and challenges associated with these
models:

 Training complexity. Generative models, especially sophisticated ones like


GANs, require significant computational resources and time. Training them
demands powerful hardware and can be resource-intensive.
 Quality control. While they can produce vast amounts of data, ensuring the
quality and realism of the generated content can be challenging. For instance, a
model might generate an image that looks realistic at first glance but has subtle
anomalies upon closer inspection.
 Overfitting. There's a risk that generative models can become too attuned to
the training data, producing outputs that lack diversity or are too closely tied to
the input they've seen.
 Lack of interpretability. Many generative models, particularly deep learning-
based ones, are often seen as "black boxes." This means it can be challenging to
understand how they make decisions or why they produce specific outputs,
which can be a concern in critical applications like healthcare.
 Ethical concerns. The ability of generative models to produce realistic content
raises ethical issues, especially in the creation of deep fakes or counterfeit
content. Ensuring responsible use is paramount to prevent misuse or deception.
 Data dependency. The quality of the generated output is heavily dependent on
the quality of the training data. If the training data is biased or unrepresentative,
the model's outputs will reflect those biases.
 Mode collapse. Particularly in GANs, there's a phenomenon called mode
collapse where the generator produces limited varieties of samples, reducing
the diversity of the generated outputs.

How to use Generative Models for Data Science

Generative models like GPT-4 are transforming how data scientists approach their work. These
large language models can generate human-like text and code, allowing data scientists to be
more creative and productive. Here are some ways generative AI can be applied in data science.

Data Exploration
Generative models can summarize and explain complex data sets and results. By describing
charts, statistics, and findings in natural language, they help data scientists explore and
understand data faster. Models can also highlight insights and patterns that humans may miss.

Code Generation
For common data science tasks like data cleaning, feature engineering, and model building,
generative models can generate custom code. This automates repetitive coding work and allows
data scientists to iterate faster. Models can take high-level instructions and turn them into
functional Python, R, or SQL code.

Report Writing
Writing reports and presentations to explain analyses is time-consuming. Generative models
like GPT-4 can draft reports by summarizing findings, visualizations, and recommendations in
coherent narratives. Data scientists can provide bullets and results, and AI will generate an initial
draft. It can also help you write data analysis reports that include the actionable insights
a business needs to improve its revenue.

Synthetic Data Generation


Generative models can create synthetic training data for machine learning models. This helps
when real data is limited or imbalanced. The synthetic data matches the patterns and distributions
of real data, allowing models to be trained effectively.
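One very simple way to sketch this idea (an illustrative assumption, not a full generative
model): estimate the mean and covariance of a small "real" sample and draw new synthetic rows
from that fitted distribution with NumPy:

import numpy as np

rng = np.random.default_rng(0)

# Pretend these 30 rows are the real measurements for an under-represented class
real = rng.normal(loc=[5.0, 3.5], scale=[0.4, 0.3], size=(30, 2))

mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Synthetic samples follow the same estimated distribution as the real data
synthetic = rng.multivariate_normal(mean, cov, size=200)
print(real.shape, synthetic.shape)   # (30, 2) (200, 2)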
Data Visualization Explained: Benefits, Types &
Tools
Data visualization is the graphical representation of information and data using visual
elements like charts, graphs, and maps to provide an accessible way to see and
understand trends and patterns in data. It allows massive amounts of information to be
analyzed and data-driven decisions to be made. Data visualization tells a story by
removing noise from data and highlighting useful information. Common types include
charts, graphs, maps, and infographics, with tools ranging from simple online options
to more complex offline programs. The key is to focus on best practices and
developing a personal style when creating visualizations.
Data visualization: A detailed guide to visualizing
data in your presentation

"The greatest value of a picture is when it forces us to notice what we never expected
to see."

- John W. Tukey, mathematician and statistician

Visualization helps decipher or break down information that is challenging to understand
in text or numeric form. It's mostly used for data storytelling, as it is a great way to
simplify information and present it in a format that is understandable, insightful, and
actionable.
Whether you're a data analyst, a graphic designer, a content strategist or a
social media manager, expertise in data visualization can help you solve a
wide range of business challenges and tell impactful stories. In this blog post,
we will look at a step-by-step approach to using data visualizations in your
presentation.

What is data visualization?


Data visualization is the process of presenting data in a visual format, such as
a chart, graph, or map. It helps users identify patterns and trends in a data
set, making it easier to understand complex information. Visualizations can be
used to analyze data, make predictions, and even communicate ideas more
effectively.

Some examples of data visualizations include dashboards to track analytics,


infographics for storytelling, or even word clouds to highlight the crux of your
article or script.

Why do we have to visualize data?
In today's information-rich world, audiences are often bombarded with vast
amounts of data and complex information. This is where data visualization
comes into play—it transforms raw data into visually appealing and
comprehensible formats, allowing audiences to grasp key insights and trends
at a glance.
Consider, for example, a table displaying two categories of data next to a graph of the
same sales-growth figures. The chart is more insightful and makes it easier to identify
trends in the numbers.

A good visualization typically represents some form of collected data as a


picture, and can help with:

 Faster decision-making
 Identification of patterns and trends

 Presentation of an argument or story


Why is data visualization
important in presentations?
Whether it's a business pitch, a campaign report, or a research presentation,
data visualizations help you engage viewers on both rational and emotional
levels.

They can be used to evoke empathy, urgency, or excitement, making the


content more relatable and compelling. This is particularly crucial in decision-
making contexts, where data-driven insights can sway opinions, drive actions,
and guide strategic choices.

Ultimately, by incorporating data visualizations into presentations, you can


benefit in the following ways:
 Elevate communication and convey impactful, data-centric narratives.
 Tell your story using visuals in a clear and meaningful way.
 Foster a deeper understanding of your data to make a stronger impact on the
audience.
 Support idea generation and help derive business insights.
 Simplify data and business processes.
Step-by-step approach to data
visualizations in presentations:
There are several factors to consider before adding a data visualization to
your presentation. Here's a detailed guide:

Step 1: Define your purpose


The first step to visualizing data in your presentation is to determine your key
message and decide on the type of story you are going to tell. Whether you
plan to reveal trends, compare data, or explain a concept, a well-defined
purpose will guide your data selection and visualization design, ensuring your
visuals play a meaningful role in conveying your message.

Step 2: Understand your audience


Identify who your visualization is meant for and then make sure it fits their
needs. Tailor your approach to suit your audience's familiarity with the topic
and preferred level of detail. Knowing their expectations will help you fine-tune
the complexity and depth of your visualizations, ensuring your presentation
truly resonates with your audience.

Step 3: Choose your visualization type


Different data types and relationships call for different visualization formats.
Selecting the appropriate chart, graph, or diagram is essential for accurately
conveying your information. Here are some visualization types commonly
used in presentations (a small sketch in code follows this list):

 Tables: These consist of rows and columns and are used to compare variables
in a structured way. Tables display data as categorical objects and make
comparative data analysis easier. Example use: Pricing vs. feature comparison
table.
 Bar charts: Also known as column charts, these chart types use vertical or
horizontal bars to compare categorical data. They are mainly used for analyzing
value trends. Example use: Measure employee growth within a year.

 Pie charts: These graphs are divided into sections that represent parts of a
whole. They are used to compare the size of each component and are usually
used to determine a percentage of the whole. Example use: Display website
visitors by country.

 Area charts: These are similar to bar and line graphs and show the progress of
values over a period. These are mostly used to showcase data with a time-series
relationship, and can be used to gauge the degree of a change in values.
Example use: Show sales of different products in a financial year.

 Histograms: Similar to bar charts (but with no space in between), histograms


distribute numerical data. They are mainly used to plot the distribution of
numbers and analyze the largest frequencies within a particular range. Example
use: Measure app users by age.

 Scatter charts: Also known as scatter plots, these graphs present the relationship
between two variables. They are used to visualize large data sets, and show
trends, clusters, patterns, and outliers. Example use: Track performance of
different products in a suite.

 Heat maps: These are a graphical way to visualize data in the form of hot and
cold spots to identify user behavior. Example use: Present visitor behavior on
your webpage.

 Venn diagrams: These are best for showcasing similarities and differences
between two or more categories. They are incredibly versatile and great for
making comparisons, unions and intersections of different categories.

 Timelines: These are best used for presenting chronological data. This is the
most effective and efficient way to showcase events or time passage.

 Flowcharts: These types of charts are ideal for showcasing a process or a


workflow.

 Infographics: These are a visual representation of content or data in a graphic


format to make it more understandable at a glance.
Bonus: In addition to the above mentioned visualization types, you can use Gantt
charts, word clouds, and tree maps. Gantt charts are used in project management
presentations to demonstrate the work completed in a given period. Word clouds are
a graphical representation of word frequency that gives greater prominence to the
words that appear most within content. Tree maps display hierarchical data as a set
of nested shapes, typically in the shape of rectangles.
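For reference, here is a small matplotlib sketch (with made-up numbers) of three of the chart
types listed above: a bar chart, a pie chart, and a histogram:

import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Bar chart: compare categorical values
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 150, 90, 180]
axes[0].bar(months, sales)
axes[0].set_title("Monthly sales")

# Pie chart: parts of a whole
axes[1].pie([45, 30, 25], labels=["US", "India", "UK"], autopct="%1.0f%%")
axes[1].set_title("Visitors by country")

# Histogram: distribution of numeric values
ages = np.random.default_rng(3).normal(35, 10, 500)
axes[2].hist(ages, bins=20)
axes[2].set_title("App users by age")

plt.tight_layout()
plt.show()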
Step 4: Use an appropriate chart
Once you're familiar with the different chart types available, the next step is
to select the one that best conveys your key message. Knowing when and
how to use each chart type empowers you to represent your data accurately
and enhances the persuasiveness of your presentation. The best chart type
for your needs depends more on the kind of analysis you are targeting than
the type of data you've collected. Let's take a look at some of the most-used
data visualization approaches in presentations.

 Display changes over time: One of the most common applications of data
visualizations is to show changes that have occurred over time. Bar or line
charts are helpful in these instances.

 Illustrate a part-to-whole composition: There might be times when you need to


analyze the different components of a whole composition. Use pie, doughnut,
and stacked bar charts for these part-to-whole compositions.

 Visualize data distribution: Another important use of data visualization is to


show how data has been distributed. Scatter plots, bar charts, and histograms
help identify the outliers and demonstrate the range of information in the
values.

 Explore variable relationships: When you want to understand the relationship


between two variables, use scatter plots or bubble charts. These can help you
depict relationships between two variables, and observe trends and patterns
between them.

 Compare values between groups: Another common application of data


visualization is in comparing values between two distinct groups. Using a
grouped bar or line chart makes it easy to understand and compare trends.

Step 5: Pick the right visualization tool


Utilize visualization software or tools that align with your proficiency and
presentation needs. Factors such as ease of use, customization options, and
compatibility with your data source should influence your choice of tool,
enabling you to create impactful visualizations efficiently.

Zoho Show's charts are customizable, easy to use, and come with a wide range of
options to make your data visualization easier. Some of the other prominent
data visualization tools include Zoho Analytics, Tableau, Power BI, and Infogram.
These tools support a variety of visual styles and are capable of handling a
large volume of data.
Step 6: Follow design best practices
Applying design principles will help you make sure your visualization is both
aesthetically pleasing and easy to understand. You may apply these principles
by choosing appropriate font colors and styles, or by effectively labeling and
annotating your charts. By adhering to design best practices, you can create
polished visuals and amplify the impact of your data-driven narrative.

 Keep it simple: Data overload can quickly lead to confusion, so it’s important
to include only the important information and simplify complex data. As a rule
of thumb, don't crowd your slides with too much data, and avoid distracting
elements.

 Choose colors wisely: Use colors to differentiate and highlight information.


The best practice is to use contrasting colors. You can also use patterns or
texture to convey different types of information—but remember not to distort
the data by applying 3D or gradient effects.

 Add titles, labels, and annotations: Be sure to add a title, label, and description
to your chart so your audience knows what they are looking at. Remember to
keep it clear and concise.

 Use proper fonts and text sizes: Use proper font styles and sizes to label and
describe your charts. Your font choices may be playful, sophisticated,
attention-grabbing, or elegant. Just be sure to choose a font that is easy to read
and appropriate for your key message.

Closing thoughts
Human brains are naturally attuned to processing visual patterns and imagery. Using
visuals not only helps you simplify complex information, but also makes your
information more memorable. By leveraging charts and graphs, presenters can convey
information to their audiences in a highly comprehensible manner. This helps them offer
key insights and contribute to the decision-making process.

Ultimately, by incorporating data visualizations into presentations, presenters


can elevate their communication from mere data sharing to impactful
storytelling, fostering a deeper understanding of information among their
audiences.

What is a scatter plot?

A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two
different numeric variables. The position of each dot on the horizontal and vertical axis
indicates values for an individual data point. Scatter plots are used to observe
relationships between variables.
The example scatter plot above shows the diameters and heights for a
sample of fictional trees. Each dot represents a single tree; each point’s
horizontal position indicates that tree’s diameter (in centimeters) and the
vertical position indicates that tree’s height (in meters). From the plot, we
can see a generally tight positive correlation between a tree’s diameter
and its height. We can also observe an outlier point, a tree that has a
much larger diameter than the others. This tree appears fairly short for its
girth, which might warrant further investigation.
When you should use a scatter plot

Scatter plots’ primary uses are to observe and show relationships between two numeric
variables. The dots in a scatter plot not only report the values of individual data points,
but also patterns when the data are taken as a whole.

Identification of correlational relationships is common with scatter plots. In these
cases, we want to know, given a particular horizontal value, what a good
prediction would be for the vertical value. You will often see the variable on the
horizontal axis called the independent variable, and the variable on the vertical axis
the dependent variable. Relationships between variables can be described in many ways:
positive or negative, strong or weak, linear or nonlinear.
A scatter plot can also be useful for identifying other patterns in data. We
can divide data points into groups based on how closely sets of points
cluster together. Scatter plots can also show if there are any unexpected
gaps in the data and if there are any outlier points. This can be useful if
we want to segment the data into different parts, like in the development
of user personas.

Example of data structure


DIAMETER   HEIGHT
4.20       3.14
5.55       3.87
3.33       2.84
6.91       4.34
…          …

In order to create a scatter plot, we need to select two columns from a data table, one
for each dimension of the plot. Each row of the table will become a single dot in the plot
with position according to the column values.
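A minimal matplotlib sketch of exactly this step, using the diameter and height columns from
the example table above:

import matplotlib.pyplot as plt

diameter = [4.20, 5.55, 3.33, 6.91]   # horizontal axis values
height   = [3.14, 3.87, 2.84, 4.34]   # vertical axis values

plt.scatter(diameter, height)          # each (diameter, height) row becomes one dot
plt.xlabel("Diameter (cm)")
plt.ylabel("Height (m)")
plt.title("Tree diameter vs. height")
plt.show()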
Common issues when using scatter plots

Overplotting

When we have lots of data points to plot, we can run into the issue of overplotting.
Overplotting is the case where data points overlap to a degree where we have difficulty
seeing relationships between points and variables. It can be difficult to tell how densely-
packed data points are when many of them are in a small area.

There are a few common ways to alleviate this issue. One alternative is to sample only a
subset of data points: a random selection of points should still give the general idea of
the patterns in the full data. We can also change the form of the dots, adding
transparency to allow for overlaps to be visible, or reducing point size so that fewer
overlaps occur. As a third option, we might even choose a different chart type like
the heatmap, where color indicates the number of points in each bin. Heatmaps in this
use case are also known as 2-d histograms.
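The sketch below illustrates these remedies on a synthetic dataset (the data and figure layout
are assumptions made for the example): sampling a subset, adding transparency with smaller
points, and switching to a 2-d histogram:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(size=20000)
y = x + rng.normal(scale=0.5, size=20000)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# 1) Plot only a random subset of the points
idx = rng.choice(len(x), size=2000, replace=False)
axes[0].scatter(x[idx], y[idx], s=5)

# 2) Transparency plus smaller point size makes overlaps visible
axes[1].scatter(x, y, alpha=0.05, s=3)

# 3) Heatmap / 2-d histogram: color encodes the number of points per bin
axes[2].hist2d(x, y, bins=50)

plt.show()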
Interpreting correlation as causation

This is not so much an issue with creating a scatter plot as it is an issue with its
interpretation. Simply because we observe a relationship between two variables in a
scatter plot, it does not mean that changes in one variable are responsible for changes
in the other. This gives rise to the common phrase in statistics that correlation does not
imply causation. It is possible that the observed relationship is driven by some third
variable that affects both of the plotted variables, that the causal link is reversed, or that
the pattern is simply coincidental.
For example, it would be wrong to look at city statistics for the amount of green space
they have and the number of crimes committed and conclude that one causes the other.
That conclusion would ignore the fact that larger cities with more people tend to have more of
both, and that the two are simply correlated through that and other factors. If a causal link
needs to be established, then further analysis to control or account for other potential
variables' effects needs to be performed, in order to rule out other possible explanations.
Common scatter plot options

Add a trend line

When a scatter plot is used to look at a predictive or correlational relationship between
variables, it is common to add a trend line to the plot showing the mathematically best
fit to the data. This can provide an additional signal as to how strong the relationship
between the two variables is, and whether there are any unusual points affecting the
computation of the trend line.
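A short sketch (with synthetic data) of adding a simple linear trend line using numpy.polyfit:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=100)

slope, intercept = np.polyfit(x, y, deg=1)     # best-fit line y = slope * x + intercept

plt.scatter(x, y, alpha=0.6)
xs = np.linspace(x.min(), x.max(), 100)
plt.plot(xs, slope * xs + intercept, color="red")   # overlay the trend line
plt.show()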
Categorical third variable

A common modification of the basic scatter plot is the addition of a third variable.
Values of the third variable can be encoded by modifying how the points are plotted.
For a third variable that indicates categorical values (like geographical region or gender),
the most common encoding is through point color. Giving each point a distinct hue
makes it easy to show membership of each point to a respective group.

Coloring points by tree type shows that Fersons (yellow) are generally wider than
Miltons (blue), but also shorter for the same diameter.
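A small sketch of this color encoding, using made-up measurements for the two tree types
mentioned above:

import matplotlib.pyplot as plt

fersons = {"diameter": [5.1, 6.0, 6.8], "height": [3.0, 3.2, 3.5]}
miltons = {"diameter": [3.9, 4.4, 4.9], "height": [3.1, 3.6, 3.9]}

# One scatter call per category, each with its own hue and legend entry
plt.scatter(fersons["diameter"], fersons["height"], color="gold", label="Ferson")
plt.scatter(miltons["diameter"], miltons["height"], color="royalblue", label="Milton")
plt.xlabel("Diameter (cm)")
plt.ylabel("Height (m)")
plt.legend()
plt.show()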
One other option that is sometimes seen for third-variable encoding is that of shape.
One potential issue with shape is that different shapes can have different sizes and
surface areas, which can have an effect on how groups are perceived. However, in
certain cases where color cannot be used (like in print), shape may be the best option
for distinguishing between groups.

The shapes above have been scaled to use the same amount of ink.

Numeric third variable

For third variables that have numeric values, a common encoding comes from changing
the point size. A scatter plot with point size based on a third variable actually goes by a
distinct name, the bubble chart. Larger points indicate higher values. A more detailed
discussion of how bubble charts should be built can be read in its own article.

Hue can also be used to depict numeric values. Rather than using distinct colors for
points as in the categorical case, we want to use a continuous sequence of colors, so
that, for example, darker colors indicate higher values. Note that, for both size and
color, a legend is important for interpretation of the third variable, since our eyes are
much less precise at judging size and color than position.
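A minimal sketch of both encodings, assuming matplotlib and synthetic data, is shown below; the colorbar plays the role of the legend.

# Minimal sketch: encoding a numeric third variable with point size (bubble chart) and color.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x, y = rng.uniform(0, 10, 50), rng.uniform(0, 10, 50)
z = rng.uniform(1, 100, 50)                    # third, numeric variable

plt.scatter(x, y, s=z * 3, c=z, cmap="viridis", alpha=0.7)  # size and color both track z
plt.colorbar(label="z value")                  # a legend/colorbar is essential for interpretation
plt.show()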
Highlight using annotations and color

If you want to use a scatter plot to present insights, it can be good to highlight
particular points of interest through annotations and color. Desaturating the
unimportant points makes the highlighted points stand out and provides a reference
against which to compare them.
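A rough sketch of this highlighting pattern, assuming matplotlib and synthetic data, might look like this.

# Minimal sketch: desaturating most points and annotating one point of interest (matplotlib assumed).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x, y = rng.normal(size=100), rng.normal(size=100)

plt.scatter(x, y, color="lightgray")                  # de-emphasized background points
plt.scatter(x[0], y[0], color="crimson", zorder=3)    # highlighted point drawn on top
plt.annotate("point of interest", (x[0], y[0]),
             xytext=(10, 10), textcoords="offset points")
plt.show()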

Related plots

Scatter map
When the two variables in a scatter plot are geographical coordinates – latitude and
longitude – we can overlay the points on a map to get a scatter map (aka dot map). This
can be convenient when the geographic context is useful for drawing particular insights
and can be combined with other third-variable encodings like point size and color.

A famous example of a scatter map is John Snow's 1854 cholera outbreak map, which showed
that cholera cases (black bars) were centered around a particular water pump on Broad
Street (central dot). (Image source: Wikimedia Commons)
Heatmap

As noted above, a heatmap can be a good alternative to the scatter plot when there are
a lot of data points that need to be plotted and their density causes overplotting issues.
However, the heatmap can also be used in a similar fashion to show relationships
between variables when one or both variables are not continuous and numeric. If we try
to depict discrete values with a scatter plot, all of the points of a single level will be in a
straight line. Heatmaps can overcome this overplotting through their binning of values
into boxes of counts.
Connected scatter plot

If the third variable we want to add to a scatter plot indicates timestamps, then one
chart type we could choose is the connected scatter plot. Rather than modify the form
of the points to indicate date, we use line segments to connect observations in order.
This can make it easier to see how the two main variables not only relate to one another,
but how that relationship changes over time. If the horizontal axis also corresponds with
time, then all of the line segments will consistently connect points from left to right, and
we have a basic line chart.
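A minimal sketch of a connected scatter plot, assuming matplotlib and made-up yearly values, could look like the following.

# Minimal sketch: a connected scatter plot, joining observations in time order (matplotlib assumed).
import numpy as np
import matplotlib.pyplot as plt

years = np.arange(2010, 2021)
metric_x = np.array([5.2, 4.8, 4.5, 4.1, 3.9, 3.6, 3.8, 4.0, 4.3, 4.7, 5.1])  # made-up values
metric_y = np.array([2.1, 2.4, 2.6, 2.9, 3.1, 3.4, 3.2, 3.0, 2.8, 2.5, 2.2])  # made-up values

plt.plot(metric_x, metric_y, marker="o")   # line segments connect consecutive years
for xv, yv, yr in zip(metric_x, metric_y, years):
    plt.annotate(str(yr), (xv, yv), xytext=(4, 4), textcoords="offset points")
plt.show()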
Visualization tools

The scatter plot is a basic chart type that should be creatable by any visualization tool or
solution. Computation of a basic linear trend line is also a fairly common option, as is
coloring points according to levels of a third, categorical variable. Other options, like
non-linear trend lines and encoding third-variable values by shape, however, are not as
commonly seen. Even without these options, however, the scatter plot can be a valuable
chart type to use when you need to investigate the relationship between numeric
variables in your data.

What is a Time Series?

Time series refers to a sequence of data points that are collected, recorded,
or observed at regular intervals over a specific period of time. In a time
series, each data point is associated with a specific timestamp or time
period, which allows for the chronological organization of the data.

Time series data can be found in various domains and industries, including
finance, economics, meteorology, sales, stock markets, healthcare, and
more. It is used to analyze historical patterns, identify trends, forecast future
values, and understand the behavior of a phenomenon over time.

Time series data can be either continuous or discrete. Continuous time series data
represents measurements that can take any value within a range, such as
temperature readings or stock prices. Discrete time series data, on the other
hand, represents measurements that are limited to specific values or
categories, such as the number of sales per day or customer ratings. Both
kinds can easily be visualized using Python.
Analyzing and visualizing time series data plays a crucial role in gaining
insights, making predictions, and understanding the underlying dynamics of
a system or process over time.

Types of Time Series Data

Time series data in data visualization can be classified into two main types
based on the nature of the data: continuous and discrete.

A. Continuous Data

Continuous time series data refers to measurements or observations that
can take any value within a specified range. It is characterized by a
continuous and uninterrupted flow of data points over time. Continuous time
series data is commonly found in various domains, such as:

 Temperature Data: Continuous temperature recordings collected at regular intervals, such as hourly or
daily measurements.
 Stock Market Data: Continuous data representing the prices or values of stocks, which are recorded
throughout trading hours.
 Sensor Data: Measurements from sensors that record continuous variables like pressure, humidity, or air
quality at frequent intervals.
 Financial Data: Continuous data related to financial metrics like revenue, sales, or profit, which are
tracked over time.
 Environmental Data: Continuous data collected from environmental monitoring devices, such as weather
stations, to track variables like wind speed, rainfall, or pollution levels.
 Physiological Data: Continuous data capturing physiological parameters like heart rate, blood pressure, or
glucose levels recorded at regular intervals.
Continuous time series data is typically visualized using techniques such as
line plots, area charts, or smooth plots. These visualization methods allow
us to observe trends, fluctuations, and patterns in the data over time, aiding
in understanding the behavior and dynamics of the underlying phenomenon.
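As a small illustration (assuming pandas and matplotlib, with synthetic hourly temperature readings), a continuous series is most naturally drawn as a line plot.

# Minimal sketch: a line plot of continuous time series data (pandas + matplotlib assumed; data are synthetic).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

idx = pd.date_range("2024-01-01", periods=24 * 7, freq="h")      # one week of hourly timestamps
temp = (10
        + 5 * np.sin(np.arange(len(idx)) * 2 * np.pi / 24)        # daily cycle
        + np.random.default_rng(0).normal(scale=0.5, size=len(idx)))  # measurement noise

series = pd.Series(temp, index=idx, name="temperature")
series.plot()                     # pandas draws a line plot against the time index by default
plt.ylabel("temperature (°C)")
plt.show()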

B. Discrete Data
Discrete time series data refers to measurements or observations that are
limited to specific values or categories. Unlike continuous data, discrete data
does not have a continuous range of possible values but instead consists of
distinct and separate data points. Discrete time series data is commonly
encountered in various domains, including:

 Count Data: Data representing the number of occurrences or events within a specific time interval. Examples
include the number of daily sales, the number of customer inquiries per month, or the number of website
visits per hour.
 Categorical Data: Data that falls into distinct categories or classes. This can include variables such as
customer segmentation, product types, or survey responses with predefined response options.
 Binary Data: Data that has only two possible outcomes or states. For instance, a time series tracking
whether a machine is functioning (1) or not (0) at each time point.
 Rating Scales: Data obtained from surveys or feedback forms where respondents provide ratings on a
discrete scale, such as a Likert scale.
Discrete time series data is often visualized using techniques such as bar
charts, histograms, or stacked area charts. These visualizations help in
understanding the distribution, changes, and patterns within the discrete
data over time. By examining the frequency or proportion of different
categories or values, analysts can gain insights into trends and patterns
within the data.
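A minimal sketch, assuming pandas and matplotlib and made-up daily sales counts, shows how such discrete data can be drawn as a bar chart.

# Minimal sketch: a bar chart of discrete time series data (the daily sales counts are synthetic).
import pandas as pd
import matplotlib.pyplot as plt

days = pd.date_range("2024-03-01", periods=7, freq="D")
sales = pd.Series([12, 18, 9, 22, 17, 25, 14], index=days, name="daily_sales")

sales.plot(kind="bar")            # one bar per day; bar height is the count for that day
plt.ylabel("number of sales")
plt.tight_layout()
plt.show()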

Ways for Time Series Data Visualization

To effectively visualize time series data, various visualization techniques
can be employed. Let's explore some popular visualization methods:

1. Tabular Visualization: Tabular visualization presents time series data in a structured table format, with
each row representing a specific time period and columns representing different variables or measurements.
It provides a concise overview of the data but may not capture trends or patterns as effectively as graphical
visualizations.
2. 1D Plot of Measurement Times: This type of visualization represents the measurement times along a one-
dimensional axis, such as a timeline. It helps in understanding the temporal distribution of data points and
identifying any temporal patterns.
3. 1D Plot of Measurement Values: A 1D plot of measurement values displays the variation in data values
over time along a single axis. Line plots and step plots are commonly used techniques for visualizing
continuous time series data, while bar charts or dot plots can be used for discrete data.
4. 1D Color Plot of Measurement Values: In this visualization technique, the variation in measurement
values is represented using colors on a one-dimensional axis. It enables the quick identification of high or
low values and provides an intuitive overview of the data.
5. Bubble Plot: Bubble plots represent time series data using bubbles, where each bubble represents a data
point with its size or color encoding a specific measurement value. This visualization method allows the
simultaneous representation of multiple variables and their evolution over time.
6. Scatter Plot: Scatter plots display the relationship between two variables by plotting data points as
individual dots on a Cartesian plane. Time series data can be visualized by representing one variable on the
x-axis and another on the y-axis.
7. Linear Line Plot: Linear line plots connect consecutive data points with straight lines, emphasizing the
trend and continuity of the data over time.
8. Linear Step Plot: Linear step plots also connect consecutive data points, but with vertical and horizontal
lines, resulting in a stepped appearance. This visualization is useful when tracking changes that occur
instantaneously at specific time points.
9. Linear Smooth Plot: Linear smooth plots apply a smoothing algorithm to the data, resulting in a
continuous curve that captures the overall trend while reducing noise or fluctuations. This helps in
visualizing long-term patterns more clearly (see the Python sketch after this list, which contrasts line,
step, and smoothed plots).
10. Area Chart: Area charts fill the area between the line representing the data and the x-axis, emphasizing the
cumulative value or distribution over time. They are commonly used to visualize stacked time series data or
to show the composition of a variable over time.
11. Horizon Chart: Horizon charts condense time series data into a compact, horizontally layered
representation. They are particularly useful when comparing multiple time series data on a single chart,
optimizing screen space usage.
12. Bar Chart: Bar charts represent discrete time series data using rectangular bars, with the height of each bar
indicating the value of a specific measurement. They are effective in comparing values between different
time periods or categories.
13. Histogram: Histograms display the distribution of continuous or discrete time series data by dividing the
range of values into equal intervals (bins) and representing the frequency or count of data points falling
within each bin.
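To make the difference between items 7-9 concrete, the sketch below (assuming pandas and matplotlib, with a synthetic random-walk series) draws the same data as a line plot, a step plot, and a 7-day rolling-mean smooth.

# Minimal sketch: the same series as a line plot, a step plot, and a smoothed (rolling-mean) plot.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

idx = pd.date_range("2024-01-01", periods=90, freq="D")
values = pd.Series(np.cumsum(np.random.default_rng(1).normal(size=90)), index=idx)

fig, axes = plt.subplots(3, 1, sharex=True, figsize=(8, 7))
axes[0].plot(values.index, values)                    # linear line plot
axes[1].step(values.index, values, where="post")      # linear step plot: changes hold until the next point
axes[2].plot(values.index, values.rolling(7).mean())  # 7-day rolling mean as a simple smoother
plt.show()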
Best Platforms to Visualize Data

Several powerful platforms can aid in visualizing time series data effectively; the platforms
below are just a few examples, and many other tools and libraries are available depending on
your specific requirements and programming language preferences. Let us explore some of the
top platforms and how they support time series visualization:

1. Microsoft Power BI

Microsoft Power BI is a popular business intelligence platform that provides a
wide range of data visualization capabilities. It offers various visualizations
specific to time series data, such as line charts, area charts, scatter plots,
and custom visuals from the Power BI marketplace. Power BI allows users
to connect to different data sources, apply transformations, and create
interactive dashboards and reports.

How to Visualize Time Series Data in Microsoft Power BI?

To visualize time series data in Power BI, follow these steps:

 Import the time series data into Power BI.
 Choose the appropriate visualization type (e.g., line chart, area chart) and add the relevant fields to the
visualization.
 Configure axes, legends, and tooltips to provide meaningful insights.
 Apply filters, slicers, or drill-through functionalities to interact with the data dynamically.
 Customize the visual appearance and layout of the report or dashboard.
 Publish and share the visualizations with others.
Another widely used tool for presenting time series data visually is Tableau, described next.
2. Tableau
Tableau is a powerful data visualization and analytics platform widely used
for exploring and presenting data. It offers a comprehensive set of features
for visualizing time series data effectively, including line charts, area charts,
heatmaps, and maps, and it supports interactive filtering, drill-down, and
animation features to enhance the exploration of time-based trends.

How to Visualize Time Series in Tableau?

To visualize time series data in Tableau, follow these steps:

 Connect to the time series data source within Tableau.
 Drag and drop the desired fields onto the workspace.
 Choose the appropriate visualization type from the available options.
 Customize the visualization by adjusting axes, adding reference lines, or applying color schemes.
 Create interactive features such as filters, parameters, or actions to enable dynamic exploration of the data.
 Design a visually appealing dashboard or story to present the insights effectively.
 Share the visualizations with others using Tableau Server, Tableau Public, or other sharing options.
3. R

R is a popular open-source programming language and software environment used
mainly for statistical computing and graphics. It provides a wide range of tools and
libraries for data manipulation, analysis, and visualization, including packages such as
ggplot2 and plotly for creating interactive and publication-quality time series plots.
Here are some ways and techniques to visualize time series in R:
How to Visualize Time Series in R?

To visualize time series data in R, follow these steps:

 Import the time series data into R using appropriate data structures such as data frames or time series
objects.
 Install and load the required packages for time series visualization (e.g., ggplot2, plotly).
 Use functions from the chosen package to create the desired visualizations, such as line plots, area charts,
or interactive plots.
 Customize the visual appearance, labels, and annotations.
 Add interactivity, tooltips, or animations to enhance the exploration of the data.
 Export the visualizations to various formats or integrate them into reports or presentations.
4. Excel

Excel is a popular spreadsheet software and a widely used tool for data analysis. It
allows users to perform various tasks related to data organization, analysis, and
presentation, and it provides chart types such as line charts, bar charts, and scatter
plots that can effectively represent time series data. Here are some ways to visualize
time series in Excel:

How to Visualize Time Series in Excel?

To visualize time series data in Excel, follow these steps:

 Import the time series data into an Excel worksheet.
 Select the data range and choose the appropriate chart type from the Excel charting options.
 Customize the chart by adjusting axes, adding labels, or applying formatting options.
 Add additional series or data points to represent different variables or measurements.
 Apply filtering, sorting, or conditional formatting to interact with the data dynamically.
 Incorporate the chart into an Excel dashboard or report.
Time Series Data Visualization Examples

Let us explore some examples of time series data visualizations:

 Gantt Charts: Gantt charts are widely used to visualize project schedules or timelines. They display tasks
or events along a horizontal timeline, with bars representing the start and end dates of each task. Gantt
charts provide a clear overview of project progress, dependencies, and resource allocation over time.
 Line Graphs: Line graphs are effective for visualizing continuous time series data. They connect data
points with straight lines, allowing us to observe trends, seasonality, or irregularities over time.
 Heatmap: Heatmaps represent time series data using color intensity in a grid format. They are useful for
visualizing patterns, correlations, or anomalies in multi-dimensional time series data.
 Map: Maps can be employed to visualize time series data geographically. By plotting data points on a map,
we can observe spatial patterns or changes in variables over time.
 Stacked Area Charts: Stacked area charts display the cumulative value or proportion of different variables
over time. They are useful for visualizing the composition or contribution of each variable to the total (a
short sketch follows this list).
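As a small illustration of the last example (assuming pandas and matplotlib, with made-up monthly sales for three hypothetical product lines), a stacked area chart can be produced directly from a DataFrame.

# Minimal sketch: a stacked area chart showing the composition of synthetic monthly sales over time.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

months = pd.date_range("2023-01-01", periods=12, freq="MS")   # month-start timestamps
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "product_a": rng.integers(50, 100, 12),
    "product_b": rng.integers(30, 80, 12),
    "product_c": rng.integers(10, 50, 12),
}, index=months)

df.plot.area()                 # areas stack, showing each product's contribution to the total
plt.ylabel("units sold")
plt.show()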
Conclusion

Time series data visualization is crucial for gaining insights, identifying
patterns, and making informed decisions. Various visualization techniques,
such as line plots, bar charts, and heatmaps, can effectively represent time
series data. Platforms like Microsoft Power BI, Tableau, R, and Excel
provide powerful tools for creating interactive and visually appealing time
series visualizations. By leveraging these platforms and techniques,
analysts and data professionals can effectively communicate trends,
patterns, and anomalies hidden within time series data.

3D VISUALIZATION
DEFINITION
3D Visualization is the process of creating three-dimensional visual representations of
objects, environments, or concepts using 3D software. It allows users to view objects
or concepts more interactively and realistically. Typically, 3D visualization software
allows users to manipulate and interact with the object or environment, resulting in a
more immersive experience.
3D Visualization can be a useful capability in PLM and QMS solutions, providing a
more accurate and detailed representation of the product and enhancing the overall
efficiency and effectiveness of product development processes. Here are some
examples:
 3D Visualizations for Product Design: 3D visualizations can aid in the
design phase of a product by providing a more detailed and realistic
representation of the product. Before the product is manufactured, these
visualizations can help identify any design flaws or potential issues, saving
time and money in the long run.
 Quality Control and 3D Visualizations: 3D visualizations can be used in
quality control to ensure that products meet specific quality standards.
These visualizations can help identify product flaws or inconsistencies that
may have been missed by conventional inspection methods.
 3D Visualizations for Assembly and Manufacturing: 3D visualizations
can be used to improve assembly and manufacturing processes by providing
a more detailed and accurate product representation. These visualizations
can aid in identifying any potential problems or obstacles in the
manufacturing process, enabling adjustments to be made before the product
is manufactured.
 3D Visualizations for Training and Education: 3D visualizations can be
used to train employees and educate customers about the features and
functionality of a product. These visuals provide a more engaging and
interactive method of learning about the product, thereby enhancing
retention and comprehension.
 3D Visualizations for Product Documentation: 3D visualizations can be
used to develop product documentation, such as assembly instructions and
user manuals. These illustrations provide a more complete and accurate
depiction of the product, making it simpler for users to comprehend and
adhere to the instructions.
