The most common answer one expects is “to make computers intelligent so that
they can act intelligently!”, but the question is: how intelligent? How can one
judge intelligence?
The answer is: as intelligent as humans. If computers can, somehow, solve real-world problems
by improving on their own from past experience, they would be called “intelligent”.
Thus, AI systems are more generic (rather than specific), can “think” and are more
flexible.
Challenges of AI:
1. Bias and unfairness: AI systems can perpetuate and amplify existing biases
in data and decision-making.
2. Lack of transparency and accountability: Complex AI systems can be
difficult to understand and interpret, making it challenging to determine how
decisions are being made.
3. Job displacement: AI has the potential to automate many jobs, leading to job
loss and a need for reskilling.
4. Security and privacy risks: AI systems can be vulnerable to hacking and
other security threats, and may also pose privacy risks by collecting and using
personal data.
5. Ethical concerns: AI raises important ethical questions about the use of
technology for decision-making, including issues related to autonomy,
accountability, and human dignity.
Technologies Based on Artificial Intelligence:
1. Machine Learning: A subfield of AI that uses algorithms to enable systems
to learn from data and make predictions or decisions without being explicitly
programmed.
2. Natural Language Processing (NLP): A branch of AI that focuses on
enabling computers to understand, interpret, and generate human language.
3. Computer Vision: A field of AI that deals with the processing and analysis
of visual information using computer algorithms.
4. Robotics: AI-powered robots and automation systems that can perform tasks
in manufacturing, healthcare, retail, and other industries.
5. Neural Networks: A type of machine learning algorithm modeled after the
structure and function of the human brain.
6. Expert Systems: AI systems that mimic the decision-making ability of a
human expert in a specific field.
7. Chatbots: AI-powered virtual assistants that can interact with users through
text-based or voice-based interfaces.
What is an Agent?
Intelligent Agents:
An intelligent agent is an autonomous entity which acts
upon an environment using sensors and actuators to
achieve goals. An intelligent agent may learn from the
environment to achieve its goals. A thermostat is an
example of an intelligent agent.
Structure of an AI Agent
The behaviour of an agent can be described by the agent function, f: P* → A, which maps any sequence of percepts (P*) to an action (A).
Types of Agents
Agents can be grouped into the following classes based on their degree of perceived
intelligence and capability:
Simple Reflex Agents
Model-Based Reflex Agents
Goal-Based Agents
Utility-Based Agents
Learning Agent
Multi-agent systems
Hierarchical agents
Simple Reflex Agents
Simple reflex agents ignore the rest of the percept history and act only on the
basis of the current percept. Percept history is the history of all that an agent has
perceived to date. The agent function is based on the condition-action rule. A
condition-action rule is a rule that maps a state i.e., a condition to an action. If the
condition is true, then the action is taken, else not. This agent function only
succeeds when the environment is fully observable. For simple reflex agents
operating in partially observable environments, infinite loops are often
unavoidable. It may be possible to escape from infinite loops if the agent can
randomize its actions.
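A minimal sketch in Python of a simple reflex agent whose agent function maps only the current percept to an action through condition-action rules; the percepts and actions below are invented for illustration:

```python
# Minimal sketch of a simple reflex agent using condition-action rules.
# The percepts ("dirty", "at_left", ...) and actions are illustrative only.

def simple_reflex_agent(percept):
    """Map the current percept directly to an action (no percept history)."""
    if percept == "dirty":
        return "suck"          # condition true -> take the mapped action
    elif percept == "at_left":
        return "move_right"
    elif percept == "at_right":
        return "move_left"
    return "no_op"             # no condition matched

print(simple_reflex_agent("dirty"))    # suck
print(simple_reflex_agent("at_left"))  # move_right
```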
Rational Agent:
A rational agent is one that, for each possible percept sequence, selects the action expected to maximize its performance measure, given the evidence provided by the percept sequence and whatever built-in knowledge it has.
Goal-Based Agents
These kinds of agents take decisions based on how far they are currently from
their goal(description of desirable situations). Their every action is intended to
reduce their distance from the goal. This allows the agent a way to choose among
multiple possibilities, selecting the one which reaches a goal state. The
knowledge that supports its decisions is represented explicitly and can be
modified, which makes these agents more flexible. They usually require search
and planning. The goal-based agent’s behavior can easily be changed.
Utility-Based Agents
The agents whose choices are driven by a measure of utility are called utility-based
agents. When there are multiple possible alternatives, utility-based agents are used to
decide which one is best. They choose actions based
on a preference (utility) for each state. Sometimes achieving the desired goal is
not enough. We may look for a quicker, safer, cheaper trip to reach a destination.
Agent happiness should be taken into consideration. Utility describes
how “happy” the agent is. Because of the uncertainty in the world, a utility agent
chooses the action that maximizes the expected utility. A utility function maps a
state onto a real number which describes the associated degree of happiness.
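A minimal sketch of this idea, assuming an invented utility function and invented outcome probabilities: the agent computes the expected utility of each action and picks the action that maximizes it.

```python
# Minimal sketch of a utility-based agent. States, actions, utilities, and
# probabilities are illustrative, not from any specific system.

# Utility function: maps a state to a real number (degree of "happiness").
utility = {"fast_arrival": 10.0, "slow_arrival": 4.0, "accident": -100.0}

# For each action, the probability of reaching each outcome state.
outcomes = {
    "highway":  {"fast_arrival": 0.7, "slow_arrival": 0.25, "accident": 0.05},
    "backroad": {"fast_arrival": 0.3, "slow_arrival": 0.69, "accident": 0.01},
}

def expected_utility(action):
    return sum(p * utility[state] for state, p in outcomes[action].items())

best = max(outcomes, key=expected_utility)
print(best, round(expected_utility(best), 2))  # the action with highest expected utility
```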
Learning Agent
A learning agent in AI is an agent that can learn from its past experiences, i.e.,
it has learning capabilities. It starts by acting with basic knowledge and is then
able to act and adapt automatically through learning. A learning agent has mainly
four conceptual components, which are:
1. Learning element: It is responsible for making improvements by learning
from the environment.
2. Critic: The learning element takes feedback from critics which describes how
well the agent is doing with respect to a fixed performance standard.
3. Performance element: It is responsible for selecting external action.
4. Problem Generator: This component is responsible for suggesting actions
that will lead to new and informative experiences.
Multi-Agent Systems
These agents interact with other agents to achieve a common goal. They may
have to coordinate their actions and communicate with each other to achieve their
objective.
A multi-agent system (MAS) is a system composed of multiple interacting agents
that are designed to work together to achieve a common goal. These agents may
be autonomous or semi-autonomous and are capable of perceiving their
environment, making decisions, and taking action to achieve the common
objective.
MAS can be used in a variety of applications, including transportation systems,
robotics, and social networks. They can help improve efficiency, reduce costs,
and increase flexibility in complex systems.
MAS can be implemented using different techniques, such as game
theory, machine learning, and agent-based modeling. Game theory is used to
analyze strategic interactions between agents and predict their behavior. Machine
learning is used to train agents to improve their decision-making capabilities over
time. Agent-based modeling is used to simulate complex systems and study the
interactions between agents.
Overall, multi-agent systems are a powerful tool in artificial intelligence that can
help solve complex problems and improve efficiency in a variety of applications.
Hierarchical Agents
These agents are organized into a hierarchy, with high-
level agents overseeing the behavior of lower-level
agents. The high-level agents provide goals and
constraints, while the low-level agents carry out specific
tasks. Hierarchical agents are useful in complex
environments with many tasks and sub-tasks.
Hierarchical agents are agents that are organized into a hierarchy, with high-
level agents overseeing the behavior of lower-level agents. The high-level
agents provide goals and constraints, while the low-level agents carry out
specific tasks. This structure allows for more efficient and organized decision-
making in complex environments.
Hierarchical agents can be implemented in a variety of applications, including
robotics, manufacturing, and transportation systems. They are particularly
useful in environments where there are many tasks and sub-tasks that need to
be coordinated and prioritized.
In a hierarchical agent system, the high-level agents are responsible for setting
goals and constraints for the lower-level agents.
Overall, hierarchical agents are a powerful tool in
artificial intelligence that can help solve complex
problems and improve efficiency in a variety of
applications.
Uses of Agents
Agents are used in a wide range of applications in
artificial intelligence, including:
Robotics: Agents can be used to control robots and automate tasks in
manufacturing, transportation, and other industries.
Smart homes and buildings: Agents can be used to control heating, lighting,
and other systems in smart homes and buildings, optimizing energy use and
improving comfort.
Transportation systems: Agents can be used to manage traffic flow, optimize
routes for autonomous vehicles, and improve logistics and supply chain
management.
Healthcare: Agents can be used to monitor patients, provide personalized
treatment plans, and optimize healthcare resource allocation.
Finance: Agents can be used for automated trading, fraud detection, and risk
management in the financial industry.
Games: Agents can be used to create intelligent opponents in games and
simulations, providing a more challenging and realistic experience for players.
Natural language processing: Agents can be used for language translation,
question answering, and chatbots that can communicate with users in natural
language.
Cybersecurity: Agents can be used for intrusion detection, malware analysis,
and network security.
Environmental monitoring: Agents can be used to monitor and manage
natural resources, track climate change, and improve environmental
sustainability.
Social media: Agents can be used to analyze social media data, identify trends
and patterns, and provide personalized recommendations to users.
Problem Solving
A reflex agent in AI directly maps states to actions. When such an agent cannot operate
in an environment because the state-to-action mapping is too large to store or compute
directly, the problem is handed to a problem-solving agent, which breaks the large problem
into smaller sub-problems and resolves them one by one. The final, integrated sequence of
actions produces the desired outcome.
Depending on the problem and its working domain, different types of problem-solving
agents are defined and used at an atomic level, i.e., without any visible internal state,
together with a problem-solving algorithm. A problem-solving agent works by precisely
defining the problem and its possible solutions. So we can say that problem solving is a
part of artificial intelligence that encompasses a number of techniques, such as trees,
B-trees, and heuristic algorithms, to solve a problem.
We can also say that a problem-solving agent is a result-driven agent and always focuses
on satisfying the goals.
There are basically three types of problem in artificial intelligence:
1. Ignorable: In which solution steps can be ignored.
2. Recoverable: In which solution steps can be undone.
3. Irrecoverable: In which solution steps cannot be undone.
Steps of problem solving in AI: AI problems are directly associated with the nature
of humans and their activities, so we need a finite number of steps to solve a problem,
which makes the work easier for humans.
The following steps are required to solve a problem:
Problem definition: Detailed specification of inputs and acceptable system
solutions.
Problem analysis: Analyse the problem thoroughly.
Knowledge Representation: collect detailed information about the problem
and define all possible techniques.
Problem-solving: Selection of best techniques.
Components to formulate the associated problem:
Initial State: The state from which the AI agent starts working towards the
specified goal; problem-domain solving is initialized from this state.
Actions: This stage of problem formulation describes, as a function, all the
possible actions available from a given state, starting with the initial state.
Transition model: This stage of problem formulation describes what each action
does, i.e., the state that results from performing an action in a given state, and
forwards that state to the next stage.
Goal test: This stage determines whether the specified goal has been achieved by
the integrated transition model; once the goal is achieved, the agent stops acting
and moves to the next stage to determine the cost of achieving the goal.
Path costing: This component assigns a numerical cost to achieving the goal. It
accounts for all hardware, software, and human working costs.
A small illustrative sketch of these components follows.
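A minimal sketch of the five components for a hypothetical route-finding problem; the map, states, and costs are invented for illustration:

```python
# Minimal sketch of problem formulation: initial state, actions, transition
# model, goal test, and path cost. The road map below is illustrative.

road_map = {"A": {"B": 2, "C": 5}, "B": {"C": 1}, "C": {}}

class RouteProblem:
    def __init__(self, initial, goal):
        self.initial = initial           # initial state
        self.goal = goal

    def actions(self, state):            # actions available in a state
        return list(road_map[state])

    def result(self, state, action):     # transition model: resulting state
        return action

    def goal_test(self, state):          # goal test
        return state == self.goal

    def step_cost(self, state, action):  # path cost of one step
        return road_map[state][action]

problem = RouteProblem("A", "C")
print(problem.actions("A"), problem.goal_test("C"))  # ['B', 'C'] True
```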
What is an Agent?
An agent can be anything that perceives its environment through sensors and acts
upon that environment through actuators. It runs in a cycle of
perceiving, thinking, and acting.
1) Knowledge Representation
Knowledge representation
is the process of translating real-world knowledge and
information into a format that can be understood and
processed by machines. It involves the use of symbols,
rules, and structures to capture and store information for
later retrieval and use by intelligent agents.
Types of knowledge representation
Several types of knowledge representation in artificial
intelligence are:
Semantic networks:
Use nodes and links to represent objects
Frames:
Use structures to represent objects and their attributes
Rules:
Use if-then statements to represent knowledge
Logic-based representations:
Formal logic to represent and reason about knowledge.
2) Propositional Logic
*Propositional logic is a branch of symbolic logic that
deals with propositions or statements that are either true
or false.
*It uses logical connectives such as AND, OR, and NOT
to combine propositions and form more complex
statements.
*Propositional logic is used in artificial intelligence to
represent and reason about knowledge in a formal and
systematic way.
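A minimal sketch of propositional logic in Python, treating propositions as Boolean values and the connectives AND, OR, and NOT as Python's Boolean operators; the propositions themselves are invented for illustration:

```python
# Minimal sketch: propositions as Booleans, combined with AND, OR, NOT.

p = True    # "It is raining"
q = False   # "The ground is wet"

print(p and q)       # AND: true only if both are true   -> False
print(p or q)        # OR: true if at least one is true  -> True
print(not p)         # NOT: negation                     -> False
print((not p) or q)  # material implication p -> q       -> False
```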
3) First-Order Logic
*First-order logic is a formal system used in artificial
intelligence and logic programming to represent and
reason about objects and their properties, relationships,
and functions. It extends propositional logic by
introducing quantifiers, such as "for all" and "there
exists," which allow for the formal representation of
complex and relational knowledge. First-order logic is
also known as predicate logic
Quantifiers
*Quantifiers are logical symbols used in first-order logic
to express the scope or extent of a predicate over a
domain of objects. The two main quantifiers are
"for all" (∀), which expresses universal quantification,
and "there exists" (∃), which expresses existential
quantification.
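Over a finite domain, the two quantifiers can be illustrated with Python's all() and any(); the domain and predicate below are invented for illustration:

```python
# Minimal sketch: all() plays the role of "for all" (∀) over a finite
# domain, and any() plays the role of "there exists" (∃).

domain = [2, 4, 6, 7]

def is_even(x):
    return x % 2 == 0

print(all(is_even(x) for x in domain))  # ∀x even(x) -> False (7 is odd)
print(any(is_even(x) for x in domain))  # ∃x even(x) -> True
```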
Classical planning
Classical planning is planning where an agent takes
advantage of the problem structure to construct complex
plans of action. The agent performs three tasks in
classical planning:
* Planning: The agent plans after knowing what the
problem is.
* Acting: It decides what action it has to take.
* Learning: The actions taken by the agent make it
learn new things.
Robotics is a domain in artificial intelligence that deals with the
study of creating intelligent and efficient robots.
Objective
What is Robotics?
Aspects of Robotics
Difference between AI Programs and Robots:
AI programs usually operate in computer-simulated worlds; robots operate in the real physical world.
The input to an AI program is in symbols and rules; the input to a robot is an analog signal in the form of a speech waveform or images.
AI programs need general-purpose computers to operate on; robots need special hardware with sensors and effectors.
Robot Locomotion
Legged
Wheeled
Combination of Legged and Wheeled Locomotion
Tracked slip/skid
Legged Locomotion
If a robot has k legs, the number of possible gait events is N = (2k − 1)!. In the case of
k = 6 legs, there are 11! = 39,916,800 possible events. Hence the complexity of a legged
robot grows rapidly with the number of legs.
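A quick check of this count, assuming the standard formula N = (2k − 1)! for a robot with k legs:

```python
# Verify the gait-event count for k = 6 legs, assuming N = (2k - 1)!.
from math import factorial

k = 6
print(factorial(2 * k - 1))  # 39916800 possible events for k = 6 legs
```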
Wheeled Locomotion
Standard wheel − Rotates around the wheel axle and around the
contact point.
Castor wheel − Rotates around the wheel axle and the offset
steering joint.
Swedish 45° and Swedish 90° wheels − Omni-wheel, rotates around
the contact point, around the wheel axle, and around the
rollers.
Ball or spherical wheel − Omnidirectional wheel, technically
difficult to implement.
Slip/Skid Locomotion
In this type, the vehicles use tracks as in a tank. The robot is
steered by moving the tracks with different speeds in the same or
opposite direction. It offers stability because of large contact area of
track and ground.
Components of a Robot
This involves −
Power supply
Image acquisition device such as camera
A processor
Software
A display device for monitoring the system
Accessories such as camera stands, cables, and connectors
Tasks of Computer Vision
OCR − Optical Character Recognition: software, usually accompanying a
scanner, that converts scanned documents into editable text.
Face Detection − Many state-of-the-art cameras come with this
feature, which enables them to detect a face and take the picture at
the perfect expression. It is also used to let a user access
software on a correct match.
Object Recognition − Object-recognition systems are installed in supermarkets,
cameras, and high-end cars such as those from BMW, GM, and Volvo.
Estimating Position − Estimating the position of an object with
respect to the camera, as in the position of a tumor in a human body.
Application Domains of Computer Vision
Agriculture
Autonomous vehicles
Biometrics
Character recognition
Forensics, security, and surveillance
Industrial quality inspection
Face recognition
Gesture analysis
Geoscience
Medical imagery
Pollution monitoring
Process control
Remote sensing
Robotics
Transport
Applications of Robotics
1. Definition of AI Perception:
AI perception refers to the ability of an artificial system to acquire, interpret, and make sense of sensory data from its environment, such as images, sound, and other signals.
2. Key Components:
Sensors: These are devices that capture data from the environment. Examples include
cameras, microphones, and other types of detectors.
Data Processing: Once data is captured, it needs to be processed. This involves
techniques such as image recognition, natural language processing, and signal processing.
3. Types of Perception in AI:
Computer Vision: Involves the interpretation of visual information from the world, often
using techniques like image recognition, object detection, and facial recognition.
Speech Recognition: Understanding and interpreting spoken language.
Natural Language Processing (NLP): Understanding and generating human language,
involving tasks like language translation and sentiment analysis.
Sensor Fusion: Combining information from multiple sensors to form a more complete
understanding of the environment.
4. Challenges in AI Perception:
Ambiguity: The real world is often ambiguous, and making sense of uncertain or
incomplete information is a significant challenge.
Variability: Environments can change, and perception systems need to adapt to different
conditions and situations.
Real-time Processing: Some applications, like autonomous vehicles, require instant
processing of perceptual data for quick decision-making.
5. Applications:
7. Ethical Considerations:
Privacy: Perception systems, especially those involving cameras and microphones, raise
concerns about privacy and data security.
Bias: AI perception systems can inherit biases present in training data, leading to unfair
or discriminatory outcomes.
8. Future Trends:
Robotic Sensing Technologies in Manufacturing:
1. Vision Systems:
Description: Robotic vision systems use cameras and image processing algorithms to
visually perceive the manufacturing environment.
Examples:
Quality Control: Cameras can inspect products for defects, ensuring that only
high-quality items proceed down the production line.
Pick-and-Place Operations: Vision systems enable robots to identify and accurately
pick up objects, facilitating automation in assembly processes.
2. Force and Torque Sensors:
Description: Sensors that measure force and pressure, allowing robots to interact with
their environment more intelligently.
Examples:
Assembly Verification: Force sensors help robots determine whether components are
properly assembled by measuring the force exerted during the process.
Sensitive Gripping: Tactile feedback enables robots to grip fragile objects
without damaging them.
3. Lidar and Radar:
Description: Lidar and radar provide depth perception and object detection capabilities,
enhancing robotic navigation and safety in manufacturing environments.
Examples:
Obstacle Avoidance: Lidar sensors help robots navigate through crowded factory floors
by detecting obstacles and adjusting their paths.
Human-Robot Collaboration: Radar systems enhance safety by detecting the presence of
humans in the robot's vicinity, enabling collaborative workspaces.
4. Ultrasonic Sensors:
Description: Ultrasonic sensors use sound waves to measure distances and detect objects.
Examples:
Material Level Monitoring: In processes involving liquids or powders, ultrasonic
sensors can monitor material levels in containers to optimize production efficiency.
Collision Avoidance: Robots equipped with ultrasonic sensors can avoid collisions
with other objects or robots in their path.
6. Predictive Maintenance:
7. Quality Prediction:
Description: Predictive analysis can be applied to predict the quality of products based on
various parameters.
Examples:
Defect Prediction: Machine learning models can analyze production data to predict the
likelihood of defects, allowing for corrective measures before products are completed.
8. Supply Chain Optimization:
Description: Predictive analytics helps optimize the supply chain by forecasting demand,
identifying potential disruptions, and improving overall efficiency.
Examples:
Demand Forecasting: Machine learning models can analyze historical sales data, market
trends, and external factors to predict future demand for products, optimizing
inventory management.
Conclusion:
The integration of AI perception, robotic sensing, and predictive analysis in
manufacturing leads to more efficient, flexible, and adaptive production processes. As
technology continues to advance, these systems will play an increasingly integral role in
creating smart and interconnected manufacturing environments. The combination of
accurate perception through advanced sensors and predictive analytics contributes to
improved decision-making, reduced downtime, and enhanced overall productivity in the
manufacturing industry.
SECTION B
Data Science Introduction
Data Science is about finding patterns in data, through analysis, and making
future predictions.
Data Science can be applied in nearly every part of a business where data is
available. Examples are:
Consumer goods
Stock markets
Industry
Politics
Logistic companies
E-commerce
A Data Scientist must find patterns within the data. Before he/she can find the
patterns, he/she must organize the data in a standard format.
The data science landscape is a dynamic and rapidly evolving field that encompasses a wide
range of techniques, tools, and technologies for extracting insights and knowledge from data.
Here’s an overview of key components and trends within the data science landscape:
Foundations of AI and Data Science is a critical area of research at CAIDAS that explores the underlying principles
and techniques behind the development of AI and data science applications. This area encompasses several sub-
disciplines, each of which focuses on a specific aspect of AI and data science research. The sub-areas of
Foundations of AI and Data Science include Deep Learning, Representation Learning, Reinforcement Learning,
Statistical Relational Learning, Machine Learning for Complex Networks, Computer Vision, Natural Language
Processing, and Pattern Recognition.
Deep Learning focuses on the development of algorithms that enable artificial neural networks to learn from large
amounts of data. These algorithms allow AI systems to improve their performance over time and can be applied to
various applications, including image recognition, speech recognition, and natural language processing.
Representation Learning deals with the development of algorithms that can effectively represent complex data
structures. These algorithms are critical for building AI systems that can learn from structured and unstructured data
and can be used to improve the performance of applications such as computer vision and natural language
processing.
Reinforcement Learning focuses on the development of algorithms that allow AI systems to learn from experience.
These algorithms are used in applications that require the AI system to make decisions based on the consequences of
its actions, such as robotics, gaming, and autonomous driving.
Statistical Relational Learning deals with the development of algorithms that can effectively model relationships
between entities in data. These algorithms can be used to improve the performance of applications that require the
analysis of complex, relational data, such as knowledge graphs and social networks.
Machine Learning for Complex Networks focuses on the development of algorithms that can effectively analyze
complex networks of data, such as those found in social networks, transportation networks, and biological networks.
Computer Vision deals with the development of algorithms that can analyze and understand images and videos.
These algorithms can be used in applications such as object recognition, face recognition, and scene analysis.
Natural Language Processing deals with the development of algorithms that can analyze and understand human
language. These algorithms can be used in applications such as speech recognition, sentiment analysis, and machine
translation.
Pattern Recognition deals with the development of algorithms that can recognize patterns in data. These algorithms
can be used in applications such as audio data recognition in ecology, image classification, and speech recognition.
In conclusion, the area of Foundations of AI and Data Science at CAIDAS is a critical component of AI and data
science research. It encompasses a wide range of sub-disciplines that each contribute to the advancement of AI and
data science applications.
This article will be useful for readers interested in Big Data. In it, we
discuss two major types of Big Data, structured data and unstructured data, and the
differences between them.
We hope this article is informative and gives you sufficient information about structured
data, unstructured data, and their comparison. We have tried to keep it easy to read and
understand. So, without any delay, let's start our topic.
Before discussing the types of Big Data, let's see the brief description of Data and Big Data.
What is Data?
In general, data is a distinct piece of information that is gathered and translated for some
purpose. Data can be available in different forms, such as bits and bytes stored in electronic
memory, numbers or text on pieces of paper, or facts stored in a person's mind.
Big Data is defined as data that is very large in size. Normally, we work on data of size
MB (Word docs, Excel files) or at most GB (movies, code), but data in petabytes, i.e., 10^15 bytes
in size, is called Big Data. It is stated that almost 90% of today's data has been generated in the
past 3 years. Big Data sources include telecom companies, weather stations, e-commerce sites,
the share market, and many more.
Big Data can be structured, unstructured, or semi-structured, and it is collected from
different sources.
Structured Data
The data which is to the point, factual, and highly organized is referred to as structured data. It is
quantitative in nature, i.e., it is related to quantities that means it contains measurable numerical
values like numbers, dates, and times.
It is easy to search and analyze structured data. Structured data exists in a predefined format.
A relational database consisting of tables with rows and columns is one of the best examples of
structured data. Structured data generally exists in tables such as Excel files and Google Sheets
spreadsheets. The programming language SQL (Structured Query Language) is used for managing
structured data. SQL was developed by IBM in the 1970s and is mainly used to handle relational
databases and warehouses.
Structured data is highly organized and understandable for machine language. Common
applications of relational databases with structured data include sales transactions, Airline
reservation systems, inventory control, and others.
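A minimal sketch of structured data, assuming an invented sales table, using Python's built-in sqlite3 module to create rows and columns and query them with SQL:

```python
# Minimal sketch: structured data as a relational table, queried with SQL.
# The table name and values are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, item TEXT, amount REAL, sold_on TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [(1, "laptop", 899.0, "2024-01-15"), (2, "mouse", 19.5, "2024-01-16")],
)

# Structured data is easy to search and analyze with SQL.
for row in conn.execute("SELECT item, amount FROM sales WHERE amount > 100"):
    print(row)   # ('laptop', 899.0)
```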
Unstructured Data
All unstructured files, log files, audio files, and image files are included in unstructured
data. Some organizations have a lot of data available, but they do not know how to derive
value from it since the data is raw.
Unstructured data is data that lacks any predefined model or format. It requires a lot of
storage space, and it is hard to maintain its security. It cannot be presented in a data model or
schema, which is why managing, analyzing, or searching unstructured data is hard. It resides in
various formats like text, images, audio and video files, etc. It is qualitative in nature
and is sometimes stored in a non-relational (NoSQL) database.
It is not stored in relational databases, so it is hard for computers and humans to interpret it. The
limitations of unstructured data include the requirement of data science experts and specialized
tools to manipulate the data.
The amount of unstructured data is much more than the structured or semi-structured data.
Examples of human-generated unstructured data are Text files, Email, social media, media,
mobile data, business applications, and others. The machine-generated unstructured data includes
satellite images, scientific data, sensor data, digital surveillance, and many more.
Flexibility: Structured data is less flexible and schema-dependent, while unstructured data
has no schema and is therefore more flexible.
Performance: With structured data we can perform structured queries that allow complex joins,
so performance is higher; with unstructured data only textual queries are possible, so
performance is lower than with semi-structured and structured data.
Types of Statistics
Descriptive statistics: Descriptive statistics summarizes and organizes the characteristics
of a data set using measures such as the mean, median, mode, and standard deviation.
Inferential statistics: Inferential statistics is a method that works on the basis of
prediction and draws conclusions about a population from results obtained on a sample. It
is driven by hypotheses and predictions. This type of statistics is carried out using
probabilities derived from sample data of the population. The tools for its measurement
are:
1. Tests of hypotheses
2. Analysis of variance, etc.
Qualitative data sets consist of audio or video recordings and notes; quantitative data sets
are obtained in the form of numerical values.
Qualitative data talks about experience or quality and answers questions like ‘why’ and
‘how’; quantitative data talks about quantity and answers questions like ‘how much’ and
‘how many’.
Qualitative data are subjective and can be further open to interpretation; quantitative data
are fixed and universal.
Levels of Measurement
Before analyzing a dataset, it is crucial to identify the type of data it contains. Luckily, all data
can be grouped into one of four categories: nominal, ordinal, interval, or ratio data. Although
these are often referred to as “data types,” they are actually different levels of measurement. The
level of measurement reflects the accuracy with which a variable has been quantified, and it
determines the methods that can be used to extract insights from the data.
The four categories of data are not always straightforward to distinguish and instead belong to a
hierarchy, with each level building on the preceding one.
There are four types of data: categorical, which can be further divided into nominal and ordinal,
and numerical, which can be further divided into interval and ratio. The nominal and ordinal
scales are relatively imprecise, which makes them easier to analyze, but they offer less accurate
insights. On the other hand, the interval and ratio scales are more complex and difficult to
analyze, but they have the potential to provide much richer insights.
Nominal Data – Nominal data is a basic data type that categorizes data by labeling or
naming values such as Gender, hair color, or types of animal. It does not have any
hierarchy.
Ordinal Data – Ordinal data involves classifying data based on rank, such as social
status in categories like ‘wealthy’, ‘middle income’, or ‘poor’. However, there are no set
intervals between these categories.
Interval Data – Interval data is a way of organizing and comparing data that includes
measured intervals. Temperature scales, like Celsius or Fahrenheit, are good examples of
interval data. However, interval data doesn’t have a true zero, meaning that a
measurement of “zero” can still represent a quantifiable measure (like zero degrees
Celsius, which is just another point on the scale and doesn’t actually mean there is no
temperature present).
Ratio Data – The most intricate level of measurement is ratio data. Similar to interval
data, it categorizes and arranges data, utilizing measured intervals. But, unlike interval
data, ratio data includes a genuine zero. When a variable is zero, there is no presence of
that variable. A prime illustration of ratio data is height measurement, which cannot be
negative.
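A minimal pandas sketch of the four levels of measurement, with invented columns and values; the nominal and ordinal columns use (ordered) categorical dtypes, while the interval and ratio columns stay numeric:

```python
# Minimal sketch of nominal, ordinal, interval, and ratio data in pandas.
# Column names and values are illustrative.
import pandas as pd

df = pd.DataFrame({
    "eye_color": ["blue", "green", "brown"],   # nominal: labels, no order
    "grade":     ["C", "A", "B"],              # ordinal: ranked categories
    "temp_c":    [21.5, 19.0, 23.2],           # interval: no true zero
    "height_cm": [172.0, 165.5, 180.3],        # ratio: true zero exists
})

df["eye_color"] = df["eye_color"].astype("category")
df["grade"] = pd.Categorical(df["grade"], categories=["C", "B", "A"], ordered=True)

print(df["grade"].min())       # ordinal data can be ranked: lowest grade is 'C'
print(df["height_cm"].mean())  # ratio data supports the full range of statistics
```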
What is Nominal Data?
Categorical data, also known as nominal data, is a crucial type of information utilized in diverse
fields such as research, statistics, and data analysis. It comprises categories or labels that help
in classifying and arranging data. The essential feature of categorical data is that it does not
possess any inherent order or ranking among its categories. Instead, these categories are separate,
distinct, and mutually exclusive.
For example, Nominal data is used to classify information into distinct labels or categories
without any natural order or ranking. These labels or categories are represented using names or
terms, and there is no natural order or ranking among them. Nominal data is useful for
qualitative classification and organization of information, enabling researchers and analysts to
group data points based on specific attributes or characteristics without implying any numerical
relationships.
Eye color categories like “blue” or “green” represent nominal data. Each category is
distinct, with no order or ranking.
Smartphone brands like “iPhone” or “Samsung” are nominal data. There’s no hierarchy
among brands.
Transportation modes like “car” or “bicycle” are nominal data. They are discrete
categories without inherent order.
Data that falls under the nominal category is distinguished by descriptive labels rather
than any numeric or quantitative value
Example
Here are a few examples of how nominal data is used to classify and categorize information into
distinct and non-ordered categories:
1. Colors of Car: Car colors are nominal data, with clear categories but no inherent order or
ranking. Each car falls under one color category, without any logical or numerical connection
between colors.
2. Types of Fruits: Fruit categories in a basket are nominal. Each fruit belongs to a specific
category with no hierarchy or order. All categories are distinct and discrete.
3. Movie Genres: Movie genres are nominal data since there’s no ranking among categories like
“action” or “comedy.” Each genre is unique, but we can’t say if one is better than another based
on this data alone.
Ordinal data is a form of qualitative data that classifies variables into descriptive categories. It is
characterized by the fact that the categories it employs are ranked on some sort of hierarchical
scale, such as from high to low. Ordinal data is the second most complicated type of
measurement, following nominal data. Although it is more intricate than nominal data, which
lacks any inherent order, it is still relatively simplistic.
For example, Ordinal data is a type of data used to categorize items with a meaningful hierarchy
or order. These categories help us to compare and rank different achievements, positions, or
performance of students, even if the intervals between them are not equal. Ordinal data is useful
for understanding ordered choices or preferences and for assessing relative differences.
School Grades: Grades like A, B, C are ordinal data, ranked by achievement, but intervals
between them vary.
Education Level: Levels like high school, bachelor’s, master’s are ordinal data, ordered
by education, but gaps between levels differ.
Seniority Level: Job levels like entry, mid, senior are ordinal data, indicating hierarchy,
but the gap varies by job and industry.
Ordinal data falls under the category of non-numeric and categorical data, but it can still
make use of numerical values as labels.
Ordinal data are always ranked in a hierarchy (hence the name ‘ordinal’).
Ordinal data may be ranked, but their values are not evenly distributed.
With ordinal data, you can calculate frequency distribution, mode, median, and range of
variables.
Example
Here are a few examples of how ordinal data is used in fields and domains:
1. Educational Levels: Ordinal data is commonly used to represent education levels, such as
“high school,” “bachelor’s degree,” “master’s degree,” and “Ph.D.” These levels have an order.
2. Economic Classes: Classes including “lower class,” “middle class,” and “upper class” can be
classified as ordinal data based on their ranking.
These examples demonstrate the ways in which ordinal data is utilized across fields and
domains.
Nature of Categories: For both nominal and ordinal data, the categories are distinct and discrete.
In the realm of machine learning, data is the key ingredient. It fuels the
algorithms and dictates the quality of the predictions and insights
derived. Yet, more often than not, data scientists face the significant
challenge of dealing with messy, unstructured data. This article aims to
provide a comprehensive overview of data cleaning, a crucial yet
frequently overlooked aspect of the data preprocessing pipeline, which
turns messy data into tidy, structured information, ready to be
consumed by machine learning algorithms.
Dealing with messy data is one of the first hurdles a data scientist must
overcome in any data science project, as poor data quality can
significantly impact the performance of machine learning algorithms.
Therefore, the process of cleaning and tidying this data becomes
essential.
The Importance of Tidy Data
The process of transforming messy data into tidy data, often known as
data cleaning or data wrangling, involves several key steps:
Data Auditing
The first step involves examining the dataset to identify any errors or
inconsistencies. This step is crucial as it helps establish the nature and
extent of the messiness in the data.
Workflow Specification
Once the issues have been identified, the next step is to specify the
workflow or steps necessary to clean the data. This might involve
dealing with missing values, removing duplicates, or correcting
inconsistent entries.
Workflow Execution
The third step involves executing the specified workflow, which may
often require writing custom scripts or using specific data cleaning
tools.
Post-Processing Check
The final step is to re-inspect the cleaned dataset to confirm that the workflow was
executed correctly and that no new errors were introduced.
There are many tools and techniques available for data cleaning,
ranging from programming libraries in Python or R, to dedicated data
cleaning tools. The choice of tool often depends on the nature and scale
of the data, as well as the specific cleaning tasks that need to be
performed.
Data cleaning is one of the important parts of machine learning. It plays a significant
part in building a model. In this article, we’ll understand Data cleaning, its significance
and Python implementation.
What is Data Cleaning?
Data cleaning is a crucial step in the machine learning (ML) pipeline, as it involves
identifying and removing any missing, duplicate, or irrelevant data. The goal of data
cleaning is to ensure that the data is accurate, consistent, and free of errors, as incorrect
or inconsistent data can negatively impact the performance of the ML model.
Professional data scientists usually invest a very large portion of their time in this step
because of the belief that “Better data beats fancier algorithms”.
Data cleaning, also known as data cleansing or data preprocessing, is a crucial step in
the data science pipeline that involves identifying and correcting or removing errors,
inconsistencies, and inaccuracies in the data to improve its quality and usability. Data
cleaning is essential because raw data is often noisy, incomplete, and inconsistent,
which can negatively impact the accuracy and reliability of the insights derived from it.
Why is Data Cleaning Important?
Data cleansing is a crucial step in the data preparation process, playing an important
role in ensuring the accuracy, reliability, and overall quality of a dataset. For decision-
making, the integrity of the conclusions drawn heavily relies on the cleanliness of the
underlying data. Without proper data cleaning, inaccuracies, outliers, missing values,
and inconsistencies can compromise the validity of analytical results. Moreover, clean
data facilitates more effective modeling and pattern recognition, as algorithms perform
optimally when fed high-quality, error-free input.
Additionally, clean datasets enhance the interpretability of findings, aiding in the
formulation of actionable insights.
Data Cleaning in Data Science
Data clean-up is an integral component of data science, playing a fundamental role in
ensuring the accuracy and reliability of datasets. In the field of data science, where
insights and predictions are drawn from vast and complex datasets, the quality of the
input data significantly influences the validity of analytical results. Data cleaning
involves the systematic identification and correction of errors, inconsistencies, and
inaccuracies within a dataset, encompassing tasks such as handling missing values,
removing duplicates, and addressing outliers. This meticulous process is essential for
enhancing the integrity of analyses, promoting more accurate modeling, and ultimately
facilitating informed decision-making based on trustworthy and high-quality data.
Steps to Perform Data Cleaning
Performing data cleaning involves a systematic process to identify and rectify errors,
inconsistencies, and inaccuracies in a dataset. The following are the essential steps to
perform data cleaning; a short pandas sketch after the list illustrates them.
Removal of Unwanted Observations: Identify and eliminate irrelevant or
redundant observations from the dataset. The step involves scrutinizing data
entries for duplicate records, irrelevant information, or data points that do not
contribute meaningfully to the analysis. Removing unwanted observations
streamlines the dataset, reducing noise and improving the overall quality.
Fixing Structure errors: Address structural issues in the dataset, such as
inconsistencies in data formats, naming conventions, or variable types.
Standardize formats, correct naming discrepancies, and ensure uniformity in
data representation. Fixing structure errors enhances data consistency and
facilitates accurate analysis and interpretation.
Managing Unwanted outliers: Identify and manage outliers, which are data
points significantly deviating from the norm. Depending on the context,
decide whether to remove outliers or transform them to minimize their impact
on analysis. Managing outliers is crucial for obtaining more accurate and
reliable insights from the data.
Handling Missing Data: Devise strategies to handle missing data effectively.
This may involve imputing missing values based on statistical methods,
removing records with missing values, or employing advanced imputation
techniques. Handling missing data ensures a more complete dataset,
preventing biases and maintaining the integrity of analyses.
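A minimal pandas sketch of the four steps above, using an invented toy DataFrame; the column names, thresholds, and imputation choices are illustrative, not prescriptive:

```python
# Minimal sketch of the four data cleaning steps in pandas.
import pandas as pd

df = pd.DataFrame({
    "city": [" delhi", "Delhi ", "mumbai", None],
    "order_date": ["2024-01-05", "2024-01-05", "not a date", "2024-02-01"],
    "amount": [120.0, 120.0, 9999.0, None],
})

# 1. Removal of unwanted observations: drop exact duplicate rows.
df = df.drop_duplicates()

# 2. Fixing structure errors: standardize text and date formats.
df["city"] = df["city"].str.strip().str.title()
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# 3. Managing unwanted outliers: cap values outside the 5th-95th percentiles.
low, high = df["amount"].quantile([0.05, 0.95])
df["amount"] = df["amount"].clip(low, high)

# 4. Handling missing data: impute numeric gaps with the median.
df["amount"] = df["amount"].fillna(df["amount"].median())
print(df)
```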
Tabular Representation of Data
When a table is used to represent a large amount of data in an arranged, organised, engaging,
coordinated and easy to read form it is called the tabular representation of data. In tabular
representation of data, the given data set is presented in rows and columns. The rows and columns
method is one of the most popular forms of data representation as data tables are simple to prepare
and read. Tabular representation of data makes the data more suitable for further statistical
treatment and decision making.
Quantitative Classification: In this classification, the data is classified and distributed on the
basis of features that are quantitative in nature, i.e., features whose values can be measured or
estimated as quantities.
Qualitative Classification: When the data is classified and distributed according to traits such
as physical status, nationality, social status, etc., it is called qualitative classification.
Temporal Classification: In this type of classification, time becomes the basis for categorising
and distributing the data; time here could mean years, months, days, hours, etc.
Spatial Classification: In this type of classification, the data is categorised and distributed on the
basis of location; the location or place could be country, state, district, block, village/town, etc.
The aim and objective of a tabular representation of data is to represent a complex set of
data in a simplified form. Tabular representation brings out the essential features of the data
and facilitates statistical analysis. Using tabular representation also saves space.
Table number – provided for identification and easy reference to the table.
Title – describes the contents of the table and is placed adjacent to the table number.
Column headings or captions – placed at the top of the columns of the table; the columns
contain the specific figures.
Footnote – gives scope for any further explanation that might be required for an item
included in the table; the footnote is needed to clarify the data.
Row headings and stub – provide the specific items mentioned in the horizontal rows; the
stub is provided on the left side of the table.
Information source – included at the bottom of the table. The information source tells us the
source of the specific piece of information and the authenticity of the source.
Disadvantages of a tabular representation of the data
There are, however, a few limitations of the presentation of data in tabular form. The first
limitation is the lack of description: the data in tabular form is represented only with figures
and not with attributes, which ignores the qualitative aspect of the facts. The second limitation
is that data in tabular form cannot present individual items; it represents aggregated data. The
third limitation is that the tabular representation of data needs special knowledge to understand,
and a layman cannot easily use it.
Overview:
Data can be anything which represents a specific result: a number, text, image,
audio, video, etc. For example, for a human being, data such as name, personal ID,
country, profession, bank account details, etc. are the important data. Data can be
divided into three categories: personal, public, and private.
Forms of data representation :
At present Information comes in different forms such as follows.
1. Numbers
2. Text
3. Images
4. Audio
5. Video
Text –
Text is also represented as a bit pattern, or sequence of bits (such as 0001111). Different
bit patterns are assigned to represent text symbols; a code where each number
represents a character can be used to convert text into binary.
Text File Formats –
.doc,.docx, .pdf, .rtf, .txt, etc.
Example :
The letter ‘a’ has the binary number 0110 0001.
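A small sketch showing how characters map to bit patterns and back, consistent with the example of ‘a’ = 0110 0001:

```python
# Minimal sketch: converting characters to binary codes and back.
text = "ai"
bits = [format(ord(ch), "08b") for ch in text]
print(bits)                                   # ['01100001', '01101001']
print("".join(chr(int(b, 2)) for b in bits))  # 'ai'
```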
Graph Databases
If we have friends of friends and similar relationships, these are many-to-many
relationships. Graph databases are used when the equivalent query in a relational database
would be very complex.
For example, there is a profile, and the profile has some specific information in it, but
the major selling point is the relationship between these different profiles; that is how
you get connected within a network.
In the same way, if there is a data element such as a user data element inside a graph
database, there could be multiple user data elements, but the relationships are what tie
together all these data elements stored inside the graph database.
Why do Graph Databases matter? Because graphs are good at handling relationships,
some databases store data in the form of a graph.
Example: We have a social network in which five friends are all connected. These
friends are Anay, Bhagya, Chaitanya, Dilip, and Erica. A graph database that stores
their personal information may look something like the sketch below.
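Since the original figure is not reproduced here, the following is only an illustrative sketch: the five friends and their connections modeled as an adjacency list, the way a graph database stores nodes and relationships (the specific connections are invented):

```python
# Illustrative sketch only: nodes and relationships as an adjacency list.
friends = {
    "Anay":      ["Bhagya", "Chaitanya"],
    "Bhagya":    ["Anay", "Dilip"],
    "Chaitanya": ["Anay", "Erica"],
    "Dilip":     ["Bhagya", "Erica"],
    "Erica":     ["Chaitanya", "Dilip"],
}

# A "friends of friends" query follows relationships rather than joining tables.
person = "Anay"
fof = {f2 for f1 in friends[person] for f2 in friends[f1]} - {person} - set(friends[person])
print(fof)  # {'Dilip', 'Erica'}
```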
Overview
Today, databases are everywhere. Often hidden in plain sight, databases power online banking,
airline reservations, medical records, employment records, and personal transactions.
Is data most frequently being read or written? Does data need to be accessed by row or column?
How can the database management system ensure control over data integrity,
avoid redundancy, and secure data while performing optimally? This article will present and
assess modern databases, assisting in the research and selection of an optimal system for
information storage and retrieval.
Selecting the Optimal Database
There are a plethora of characteristics to evaluate when selecting the ideal database for your
team/project. The “right” solution will be the one that best serves your needs and intended use.
Characteristics like business intelligence optimization, data structure, storage type,
data volume, query speed, and data model requirements play a role, but the best solution will be
one tailored to your use-case.
A good starting point for selecting the optimal database is to understand the different database
structures:
Non-Relational Databases
Non-relational or NoSQL databases are also used to store data, but unlike relational databases,
there are no tables, rows, primary keys, or foreign keys. Instead, these data-stores use
models optimized for specific data types. The four most popular non-relational
types are document data stores, key-value stores, graph databases, and search engine stores.
Snowflake
To the user, Snowflake has many similarities to other enterprise data warehouses, but
contains additional functionality and unique capabilities. Snowflake is a turn-key
solution for managing data engineering and science, data warehouses and lakes, and the
creation of data applications—including sharing data internally and externally.
Snowflake provides one platform that can integrate, manage, and collaborate securely across
different workloads. It also scales with your business and can run seamlessly across
multiple clouds.
Amazon Redshift
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud.
Amazon Redshift is a cost-effective solution for:
Running high-performance queries to improve business
intelligence.
Generating real-time operational analytics on events,
applications, and systems.
Sharing data securely inside and outside the organization.
Creating predictive analytics of the data in your data warehouse.
PostgreSQL
PostgreSQL has earned a strong reputation for its proven architecture, reliability, data integrity,
robust feature set, extensibility, and the dedication of the open-source community behind the
software to consistently deliver performant and innovative solutions.
MySQL
MySQL is also a free, open-source RDBMS. MySQL runs on virtually all platforms,
including Windows, UNIX, and Linux. A popular database model that is used globally, MySQL
offers a multitude of benefits:
Industry leading data security.
On-demand flexibility that enables a smaller footprint or massive
warehouse.
Distinct storage engine that facilitates high performance.
24/7 uptime with a range of high availability solutions.
Comprehensive support for transactions.
Microsoft SQL Server
Microsoft SQL Server is an RDBMS database that supports a wide variety of analytic
applications in corporate IT, transaction processing, and business intelligence.
Organizations can implement Microsoft SQL Server on-premises or in the cloud, depending on
structural needs. It can also run on Windows, Linux, or Docker systems. As part of the Microsoft
toolkit, many of Microsoft’s applications and software integrate well with Microsoft SQL Server.
The main advantage of a distributed database system is that it can provide higher
availability and reliability than a centralized database system. Because the data is stored
across multiple sites, the system can continue to function even if one or more sites fail. In
addition, a distributed database system can provide better performance by distributing the
data and processing load across multiple sites.
Federated architecture: In this architecture, each site in the distributed database system
maintains its own independent database, but the databases are integrated through a
middleware layer that provides a common interface for accessing and querying the data.
Spreadsheet vs. Database
A spreadsheet has more formatting features than a database; a database automatically updates
forms, reports, and queries when the data is updated.
1. Apache Hadoop
2. SAS
SAS is a statistical tool developed by SAS Institute. It is a closed source proprietary software
that is used by large organizations to analyze data. It is one of the oldest tools developed for Data
Science. It is used in areas like Data Mining, Statistical Analysis, Business Intelligence
Applications, Clinical Trial Analysis, Econometrics & Time-Series Analysis.
3. Apache Spark
Apache Spark is a data science tool developed by the Apache Software Foundation, used for
analyzing and working on large-scale data. It is a unified analytics engine for large-scale data
processing, specially designed to handle batch processing and stream processing. It allows
you to run programs on clusters of data, incorporating data parallelism and fault tolerance.
It inherits some of the features of Hadoop, such as YARN, MapReduce, and HDFS.
4. Data Robot
DataRobot, founded in 2012, is a leader in enterprise AI that aids in developing accurate
predictive models for the real-world problems of any organization. It provides an environment
to automate the end-to-end process of building, deploying, and maintaining your AI.
DataRobot’s Prediction Explanations help you understand the reasons behind your machine
learning model’s results.
Highly interpretable: it makes the model’s predictions easy to explain to anyone.
It is suitable for implementing the whole Data Science process at a large
scale.
5. Tableau
Tableau is the most popular data visualization tool in the market. The company behind it, an
American interactive data visualization software company founded in January 2003, was recently
acquired by Salesforce. Tableau provides the facilities to break down raw, unformatted data into
a processable and understandable format. It has the ability to visualize geographical data and
to plot longitudes and latitudes on maps.
6. BigML
BigML, founded in 2011, is a Data Science tool that provides a fully interactable, cloud-based
GUI environment that you can use for processing Complex Machine Learning Algorithms. The
main goal of using BigML is to make building and sharing datasets and models easier for
everyone. It provides an environment with just one framework for reduced dependencies.
7. TensorFlow
TensorFlow, developed by the Google Brain team, is a free and open-source software library for
dataflow and differentiable programming across a range of tasks. It provides an environment for
building and training models and deploying them on platforms such as computers, smartphones, and
servers, to achieve maximum potential with finite resources. It is one of the most useful tools
in the fields of Artificial Intelligence, Deep Learning, and Machine Learning.
8. Jupyter
Images –
Images are also represented as bit patterns. An image is composed of a matrix of pixels, each
pixel (a dot) having its own value. The size of a picture depends on its resolution. Consider a
simple black and white image: if 1 is black (or on) and 0 is white (or off), then a simple black
and white picture can be created using binary.
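A minimal sketch of a black-and-white image as a matrix of binary pixel values (the pattern is invented):

```python
# Minimal sketch: a black-and-white image as a matrix of 0/1 pixels,
# where 1 is black (on) and 0 is white (off).
image = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]

for row in image:
    print("".join("#" if pixel else "." for pixel in row))
```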
Most predictive modeling software solutions have the capability to export the
model information into a local file in the industry-standard Predictive Model Markup
Language (PMML) format, for sharing the model with other PMML-compliant
applications that perform analysis on similar data.
Business process on Predictive Modeling
1. Creating the model: Software solutions allow you to create a model by running one or
more algorithms on the data set.
2. Testing the model: Test the model on the data set. In some scenarios, testing is
done on past data to see how well the model predicts.
3. Validating the model: Validate the model run results using visualization tools and
business data understanding.
4. Evaluating the model: Evaluating the best-fit model from the models used and
choosing the model best fitted for the data.
Predictive modeling process
The process involves running one or more algorithms on the data set on which prediction is
to be carried out. It is an iterative process and often involves training the model, trying
multiple models on the same data set, and finally arriving at the best-fit model based on
business understanding of the data.
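A minimal sketch of this compare-and-select loop is shown below; it assumes scikit-learn is available and uses synthetic data and arbitrary candidate models in place of real business data.

```python
# Sketch of the iterative predictive-modeling process: try several algorithms
# on the same data set and keep the one with the best cross-validated score.
# Data and model choices are illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((300, 3))
y = 2 * X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(300)

candidates = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=4),
    "knn": KNeighborsRegressor(n_neighbors=5),
}

scores = {}
for name, model in candidates.items():
    # 5-fold cross-validation; higher R^2 is better.
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

best = max(scores, key=scores.get)
print(scores)
print("Best-fit model:", best)
```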
Model Categories
1. Predictive models: These analyze past performance to make predictions about the future.
2. Descriptive models: These quantify relationships in data in a way that is often used to
classify data sets into groups.
3. Decision models: These describe the relationships between all the elements of a decision
in order to predict the outcomes of decisions involving many variables.
Algorithms
Algorithms perform data mining and statistical analysis in order to determine trends and
patterns in data. Predictive analytics software solutions have built-in algorithms such as
regression, time series, outlier detection, decision trees, k-means, and neural networks for
doing this. Most of the software also provides integration with open-source R libraries.
1. Time Series Algorithms perform time-based predictions. Example algorithms are Single
Exponential Smoothing, Double Exponential Smoothing, and Triple Exponential Smoothing
(a minimal sketch of single exponential smoothing appears after this list).
2. Regression Algorithms predict continuous variables based on other variables in the
dataset. Example algorithms are Linear Regression, Exponential Regression, Geometric
Regression, Logarithmic Regression, and Multiple Linear Regression.
3. Association Algorithms find frequent patterns in large transactional datasets to
generate association rules. An example algorithm is Apriori.
4. Clustering Algorithms cluster observations into groups of similar items. Example
algorithms are K-Means, Kohonen, and TwoStep.
5. Decision Tree Algorithms classify and predict one or more discrete variables based on
other variables in the dataset. Example algorithms are C4.5 and CNR Tree.
6. Outlier Detection Algorithms detect outlying values in the dataset. Example algorithms
are Inter-Quartile Range and Nearest Neighbour Outlier.
7. Neural Network Algorithms perform forecasting, classification, and statistical pattern
recognition. Example algorithms are the NNet and MONMLP neural networks.
8. Ensemble models are a form of Monte Carlo analysis in which multiple numerical
predictions are made using slightly different initial conditions.
9. Factor Analysis deals with variability among observed, correlated variables in terms of
a potentially lower number of unobserved variables called factors. An example algorithm is
the maximum-likelihood algorithm.
10. Naive Bayes is a probabilistic classifier based on applying Bayes' theorem with strong
(naive) independence assumptions.
11. Support Vector Machines are supervised learning models with associated learning
algorithms that analyze data and recognize patterns; they are used for classification and
regression analysis.
12. Uplift modeling models the incremental impact of a treatment on an individual's
behavior.
13. Survival analysis is the analysis of time-to-event data.
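As a concrete illustration of the first category above, here is a minimal, from-scratch sketch of single exponential smoothing; the series and smoothing factor are invented for the example.

```python
# Single exponential smoothing from scratch:
#   s[0] = x[0]
#   s[t] = alpha * x[t] + (1 - alpha) * s[t-1]
# The final smoothed value is often used as a one-step-ahead forecast.
def single_exponential_smoothing(series, alpha=0.3):
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

sales = [12, 15, 14, 18, 20, 19, 23, 25]  # invented monthly sales figures
print(single_exponential_smoothing(sales, alpha=0.3))
```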
Features in Predictive Modeling
1) Data analysis and manipulation: Tools for analyzing data and for creating, modifying,
combining, categorizing, merging, and filtering data sets.
2) Visualization: Visualization features include interactive graphics and reports.
3) Statistics: Statistical tools to create and confirm the relationships between variables
in the data. Statistics from other statistical software can be integrated into some of the
solutions.
4) Hypothesis testing: Creating models, evaluating them, and choosing the right one.
What is a Generative Model?
A generative model is a type of machine learning model that aims to learn the underlying patterns or
distributions of data in order to generate new, similar data. In essence, it's like teaching a computer to
dream up its own data based on what it has seen before. The significance of this model lies in its ability to
create, which has vast implications in various fields, from art to science.
Generative models are a cornerstone in the world of artificial intelligence (AI). Their
primary function is to understand and capture the underlying patterns or distributions
from a given set of data. Once these patterns are learned, the model can then generate
new data that shares similar characteristics with the original dataset.
Imagine you're teaching a child to draw animals. After showing them several pictures of
different animals, the child begins to understand the general features of each animal.
Given some time, the child might draw an animal they've never seen before, combining
features they've learned. This is analogous to how a generative model operates: it
learns from the data it's exposed to and then creates something new based on that
knowledge.
Generative models: These models focus on understanding how the data is generated.
They aim to learn the distribution of the data itself. For instance, if we're looking at
pictures of cats and dogs, a generative model would try to understand what makes a cat
look like a cat and a dog look like a dog. It would then be able to generate new images
that resemble either cats or dogs.
Discriminative models: These models, on the other hand, focus on distinguishing between
different types of data. They don't necessarily learn or understand how the data is generated;
instead, they learn the boundaries that separate one class of data from another. Using the same
example of cats and dogs, a discriminative model would learn to tell the difference between the
two, but it wouldn't necessarily be able to generate a new image of a cat or dog on its own.
In the realm of AI, generative models play a pivotal role in tasks that require the creation of new
content. This could be in the form of synthesizing realistic human faces, composing music, or
even generating textual content. Their ability to "dream up" new data makes them invaluable in
scenarios where original content is needed, or where the augmentation of existing datasets is
beneficial.
In essence, while discriminative models excel at classification tasks, generative models shine in
their ability to create. This creative prowess, combined with their deep understanding of data
distributions, positions generative models as a powerful tool in the AI toolkit.
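To ground the distinction, here is a small scikit-learn sketch on synthetic data: Gaussian Naive Bayes acts as a generative classifier (it models how each class generates its features), while logistic regression is discriminative (it only learns the boundary between classes). The data, class labels, and fitted-attribute names (which assume a recent scikit-learn release) are illustrative assumptions.

```python
# Generative vs. discriminative classifiers on the same synthetic data.
# Gaussian Naive Bayes models p(x | class); logistic regression models p(class | x).
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
# Two synthetic classes ("cats" and "dogs") drawn from different Gaussians.
cats = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
dogs = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(100, 2))
X = np.vstack([cats, dogs])
y = np.array([0] * 100 + [1] * 100)

generative = GaussianNB().fit(X, y)
discriminative = LogisticRegression().fit(X, y)

# Both can classify a new point...
point = np.array([[1.5, 1.5]])
print(generative.predict(point), discriminative.predict(point))

# ...but the generative model has also learned per-class feature distributions,
# so we can sample new, plausible "cat-like" points from it.
new_cats = rng.normal(loc=generative.theta_[0],
                      scale=np.sqrt(generative.var_[0]),
                      size=(3, 2))
print(new_cats)
```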
Types of Generative Models
Generative models come in various forms, each with its unique approach to understanding and
generating data. Here's a more comprehensive list of some of the most prominent types:
Bayesian networks. These are graphical models that represent the probabilistic
relationships among a set of variables. They're particularly useful in scenarios
where understanding causal relationships is crucial. For example, in medical
diagnosis, a Bayesian network might help determine the likelihood of a disease
given a set of symptoms.
Diffusion models. These models generate data by learning to reverse a gradual
noising process: noise is added to training examples step by step, and the model
learns to remove it, so new samples can be produced by denoising random noise.
They power many modern image generation systems.
Generative Adversarial Networks (GANs). GANs consist of two neural
networks, the generator and the discriminator, that are trained together. The
generator tries to produce data, while the discriminator attempts to distinguish
between real and generated data. Over time, the generator becomes so good that
the discriminator can't tell the difference. GANs are popular in image
generation tasks, such as creating realistic human faces or artworks.
Variational Autoencoders (VAEs). VAEs are a type of autoencoder that
produces a compressed representation of input data, then decodes it to generate
new data. They're often used in tasks like image denoising or generating new
images that share characteristics with the input data.
Restricted Boltzmann Machines (RBMs). RBMs are two-layer neural networks that
can learn a probability distribution over their inputs. They've been used in
recommendation systems, like suggesting movies on streaming platforms based on
user preferences.
Pixel Recurrent Neural Networks (PixelRNNs). These models generate
images pixel by pixel, using the context of previous pixels to predict the next
one. They're particularly useful in tasks where the sequential generation of data
is crucial, like drawing an image line by line.
Markov chains. These are models that predict future states based solely on the
current state, without considering the states that preceded it. They're often used
in text generation, where the next word in a sentence is predicted based on the
current word (a minimal sketch appears after this list).
Normalizing flows. These are a series of invertible transformations applied to
simple probability distributions to produce more complex distributions. They're
useful in tasks where understanding the transformation of data is crucial, like in
financial modeling.
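As a toy illustration of the Markov-chain idea mentioned above, the sketch below builds a first-order word-level chain from a small invented corpus and samples a new sentence from it; the corpus and lengths are assumptions made for the example.

```python
# Toy first-order Markov chain for text generation: the next word depends
# only on the current word.
import random
from collections import defaultdict

corpus = "the cat sat on the mat and the dog sat on the rug".split()

# Learn a transition table: word -> list of words that follow it in the corpus.
transitions = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    transitions[current_word].append(next_word)

# Generate new text by repeatedly sampling a successor of the current word.
random.seed(1)
word = "the"
generated = [word]
for _ in range(8):
    followers = transitions.get(word)
    if not followers:
        break
    word = random.choice(followers)
    generated.append(word)

print(" ".join(generated))
```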
Applications of Generative Models
Art creation. Artists and musicians are using generative models to create new
pieces of art or compositions based on styles they feed into the model. For
example, Midjourney is a very popular tool used to generate artwork.
Drug discovery. Scientists can use generative models to predict molecular
structures for new potential drugs.
Content creation. Website owners leverage generative models to speed up the
content creation process. For example, Hubspot's AI content writer helps
marketers generate blog posts, landing page copy and social media posts.
Video games. Game designers use generative models to create diverse and
unpredictable game environments or characters.
Generative models, with their unique ability to create and innovate, offer a range of
advantages that extend well beyond mere data generation. At the same time, while they are
undeniably powerful and transformative, they are not without their challenges and
constraints.
Generative models like GPT-4 are transforming how data scientists approach their work. These
large language models can generate human-like text and code, allowing data scientists to be
more creative and productive. Here are some ways generative AI can be applied in data science.
Data Exploration
Generative models can summarize and explain complex data sets and results. By describing
charts, statistics, and findings in natural language, they help data scientists explore and
understand data faster. Models can also highlight insights and patterns that humans may miss.
Code Generation
For common data science tasks like data cleaning, feature engineering, and model building,
generative models can generate custom code. This automates repetitive coding work and allows
data scientists to iterate faster. Models can take high-level instructions and turn them into
functional Python or R or SQL code.
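Purely as a sketch of that workflow, the snippet below asks a large language model to draft data-cleaning code. It assumes the `openai` Python package (v1.x client) with an API key in the OPENAI_API_KEY environment variable; the model name and prompt are assumptions for illustration, not part of this document.

```python
# Hedged sketch: asking a large language model to generate data-cleaning code.
# Assumes the `openai` package (v1.x client) and an API key in the environment.
from openai import OpenAI

client = OpenAI()

prompt = (
    "Write a short pandas function that drops duplicate rows, "
    "fills missing numeric values with the column median, "
    "and returns the cleaned DataFrame."
)

response = client.chat.completions.create(
    model="gpt-4",  # hypothetical choice; use whatever model you have access to
    messages=[{"role": "user", "content": prompt}],
)

# The generated code should always be reviewed and tested before use.
print(response.choices[0].message.content)
```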
Report Writing
Writing reports and presentations to explain analyses is time-consuming. Generative models
like GPT-4 can draft reports by summarizing findings, visualizations, and recommendations in
coherent narratives. Data scientists can provide bullet points and results, and the AI will
generate an initial draft. It can also help you write data-analysis reports that include the
actionable insights a business needs to improve its revenue.
"The greatest value of a picture is when it forces us to notice what we never expected
to see."
Faster decision-making
Identification of patterns and trends
Tables: These consist of rows and columns and are used to compare variables
in a structured way. Tables display data as categorical objects and make
comparative data analysis easier. Example use: Pricing vs. feature comparison
table.
Bar charts: Also known as column charts, these chart types use vertical or
horizontal bars to compare categorical data. They are mainly used for analyzing
value trends. Example use: Measure employee growth within a year.
Pie charts: These graphs are divided into sections that represent parts of a
whole. They are used to compare the size of each component and are usually
used to determine a percentage of the whole. Example use: Display website
visitors by country.
Area charts: These are similar to bar and line graphs and show the progress of
values over a period. These are mostly used to showcase data with a time-series
relationship, and can be used to gauge the degree of a change in values.
Example use: Show sales of different products in a financial year.
Scatter charts: Also known as scatter plots, these graphs present the relationship
between two variables. They are used to visualize large data sets and show
trends, clusters, patterns, and outliers. Example use: Track performance of
different products in a suite.
Heat maps: These are a graphical way to visualize data in the form of hot and
cold spots to identify user behavior. Example use: Present visitor behavior on
your webpage.
Venn diagrams: These are best for showcasing similarities and differences
between two or more categories. They are incredibly versatile and great for
making comparisons, unions and intersections of different categories.
Timelines: These are best used for presenting chronological data. This is the
most effective and efficient way to showcase events or time passage.
Display changes over time: One of the most common applications of data
visualizations is to show changes that have occurred over time. Bar or line
charts are helpful in these instances.
Zoho Show's charts are customizable, easy to use, and come with a wide range of
options to make your data visualization easier. Some of the other prominent
data visualization tools include Zoho Analytics, Tableau, Power BI, and Infogram.
These tools support a variety of visual styles and are capable of handling a
large volume of data.
Step 6: Follow design best practices
Applying design principles will help you make sure your visualization is both
aesthetically pleasing and easy to understand. You may apply these principles
by choosing appropriate font colors and styles, or by effectively labeling and
annotating your charts. By adhering to design best practices, you can create
polished visuals and amplify the impact of your data-driven narrative.
Keep it simple: Data overload can quickly lead to confusion, so it’s important
to include only the important information and simplify complex data. As a rule
of thumb, don't crowd your slides with too much data, and avoid distracting
elements.
Add titles, labels, and annotations: Be sure to add a title, label, and description
to your chart so your audience knows what they are looking at. Remember to
keep it clear and concise.
Use proper fonts and text sizes: Use proper font styles and sizes to label and
describe your charts. Your font choices may be playful, sophisticated,
attention-grabbing, or elegant. Just be sure to choose a font that is easy to read
and appropriate for your key message.
Closing thoughts
Human brains are naturally attuned to processing visual patterns and imagery.
Using visuals not only helps you simplify complex information, but
also makes your information more memorable. By leveraging charts and
graphs, presenters can convey information to their audiences in a highly
comprehensible manner. This helps them offer key insights and contribute to
the decision-making process.
A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two
different numeric variables. The position of each dot on the horizontal and vertical axis
indicates values for an individual data point. Scatter plots are used to observe
relationships between variables.
The example scatter plot above shows the diameters and heights for a
sample of fictional trees. Each dot represents a single tree; each point’s
horizontal position indicates that tree’s diameter (in centimeters) and the
vertical position indicates that tree’s height (in meters). From the plot, we
can see a generally tight positive correlation between a tree’s diameter
and its height. We can also observe an outlier point, a tree that has a
much larger diameter than the others. This tree appears fairly short for its
girth, which might warrant further investigation.
When you should use a scatter plot
Scatter plots’ primary uses are to observe and show relationships between two numeric
variables. The dots in a scatter plot not only report the values of individual data points,
but also patterns when the data are taken as a whole.
Diameter (cm)   Height (m)
4.20            3.14
5.55            3.87
3.33            2.84
6.91            4.34
…               …
In order to create a scatter plot, we need to select two columns from a data table, one
for each dimension of the plot. Each row of the table will become a single dot in the plot
with position according to the column values.
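A minimal matplotlib sketch of this step follows, using the sample rows from the table above; treating the first column as diameter and the second as height is an assumption made for the example.

```python
# Minimal scatter plot: each row of the two selected columns becomes one dot.
import matplotlib.pyplot as plt

diameter = [4.20, 5.55, 3.33, 6.91]   # horizontal axis values
height = [3.14, 3.87, 2.84, 4.34]     # vertical axis values

plt.scatter(diameter, height)
plt.xlabel("Diameter (cm)")
plt.ylabel("Height (m)")
plt.title("Tree diameter vs. height")
plt.show()
```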
Common issues when using scatter plots
Overplotting
When we have lots of data points to plot, this can run into the issue of overplotting.
Overplotting is the case where data points overlap to a degree where we have difficulty
seeing relationships between points and variables. It can be difficult to tell how densely-
packed data points are when many of them are in a small area.
There are a few common ways to alleviate this issue. One alternative is to sample only a
subset of data points: a random selection of points should still give the general idea of
the patterns in the full data. We can also change the form of the dots, adding
transparency to allow for overlaps to be visible, or reducing point size so that fewer
overlaps occur. As a third option, we might even choose a different chart type like
the heatmap, where color indicates the number of points in each bin. Heatmaps in this
use case are also known as 2-d histograms.
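The common mitigations above might be tried as in this sketch; the data is random and the sample size, alpha, and point-size values are arbitrary starting points, not recommendations.

```python
# Easing overplotting: sample a subset, add transparency, and shrink the points.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.standard_normal(50_000)
y = x + rng.standard_normal(50_000)

# Option 1: plot only a random subset of the points.
idx = rng.choice(len(x), size=5_000, replace=False)

# Options 2 and 3: transparency (alpha) and a smaller marker size (s).
plt.scatter(x[idx], y[idx], alpha=0.2, s=5)
plt.show()

# A 2-d histogram / heatmap is another alternative when density matters most.
plt.hist2d(x, y, bins=50)
plt.colorbar(label="points per bin")
plt.show()
```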
Interpreting correlation as causation
This is not so much an issue with creating a scatter plot as it is an issue with its
interpretation. Simply because we observe a relationship between two variables in a
scatter plot, it does not mean that changes in one variable are responsible for changes
in the other. This gives rise to the common phrase in statistics that correlation does not
imply causation. It is possible that the observed relationship is driven by some third
variable that affects both of the plotted variables, that the causal link is reversed, or that
the pattern is simply coincidental.
For example, it would be wrong to look at city statistics for the amount of green space
they have and the number of crimes committed and conclude that one causes the other.
This ignores the fact that larger cities with more people will tend to have more of
both, and that they are simply correlated through that and other factors. If a causal link
needs to be established, then further analysis to control or account for the effects of
other potential variables needs to be performed, in order to rule out other possible
explanations.
Common scatter plot options
A common modification of the basic scatter plot is the addition of a third variable.
Values of the third variable can be encoded by modifying how the points are plotted.
For a third variable that indicates categorical values (like geographical region or gender),
the most common encoding is through point color. Giving each point a distinct hue
makes it easy to show membership of each point to a respective group.
Coloring points by tree type shows that Fersons (yellow) are generally wider than
Miltons (blue), but also shorter for the same diameter.
One other option that is sometimes seen for third-variable encoding is that of shape.
One potential issue with shape is that different shapes can have different sizes and
surface areas, which can have an effect on how groups are perceived. However, in
certain cases where color cannot be used (like in print), shape may be the best option
for distinguishing between groups.
The shapes above have been scaled to use the same amount of ink.
For third variables that have numeric values, a common encoding comes from changing
the point size. A scatter plot with point size based on a third variable actually goes by a
distinct name, the bubble chart. Larger points indicate higher values. A more detailed
discussion of how bubble charts should be built can be read in its own article.
Hue can also be used to depict numeric values as another alternative. Rather than using
distinct colors for points like in the categorical case, we want to use a continuous
sequence of colors, so that, for example, darker colors indicate higher value. Note that,
for both size and color, a legend is important for interpretation of the third variable,
since our eyes are much less able to discern size and color as easily as position.
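Both numeric encodings can be sketched as follows (bubble-style point size plus a continuous color scale); the data and colormap choice are invented for illustration.

```python
# Encoding a numeric third variable by point size (bubble chart) and by hue.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.random(40)
y = x + 0.2 * rng.standard_normal(40)
third = rng.random(40) * 100          # invented third variable

# s= scales point area; c= with a sequential colormap maps value to hue.
plt.scatter(x, y, s=third, c=third, cmap="viridis")
plt.colorbar(label="third variable")  # a legend/colorbar aids interpretation
plt.show()
```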
Highlight using annotations and color
If you want to use a scatter plot to present insights, it can be good to highlight
particular points of interest through the use of annotations and color. Desaturating
unimportant points makes the remaining points stand out, and provides a reference to
compare the remaining points against.
Related plots
Scatter map
When the two variables in a scatter plot are geographical coordinates – latitude and
longitude – we can overlay the points on a map to get a scatter map (aka dot map). This
can be convenient when the geographic context is useful for drawing particular insights
and can be combined with other third-variable encodings like point size and color.
A famous example of a scatter map is John Snow’s 1854 cholera outbreak map, showing
that cholera cases (black bars) were centered around a particular water pump on Broad
Street (central dot). (Original image: Wikimedia Commons.)
Heatmap
As noted above, a heatmap can be a good alternative to the scatter plot when there are
a lot of data points that need to be plotted and their density causes overplotting issues.
However, the heatmap can also be used in a similar fashion to show relationships
between variables when one or both variables are not continuous and numeric. If we try
to depict discrete values with a scatter plot, all of the points of a single level will be in a
straight line. Heatmaps can overcome this overplotting through their binning of values
into boxes of counts.
Connected scatter plot
If the third variable we want to add to a scatter plot indicates timestamps, then one
chart type we could choose is the connected scatter plot. Rather than modify the form
of the points to indicate date, we use line segments to connect observations in order.
This can make it easier to see how the two main variables not only relate to one another,
but how that relationship changes over time. If the horizontal axis also corresponds with
time, then all of the line segments will consistently connect points from left to right, and
we have a basic line chart.
Visualization tools
The scatter plot is a basic chart type that should be creatable by any visualization tool or
solution. Computation of a basic linear trend line is also a fairly common option, as is
coloring points according to levels of a third, categorical variable. Other options, like
non-linear trend lines and encoding third-variable values by shape, however, are not as
commonly seen. Even without these options, however, the scatter plot can be a valuable
chart type to use when you need to investigate the relationship between numeric
variables in your data.
Time series refers to a sequence of data points that are collected, recorded,
or observed at regular intervals over a specific period of time. In a time
series, each data point is associated with a specific timestamp or time
period, which allows for the chronological organization of the data.
Time series data can be found in various domains and industries, including
finance, economics, meteorology, sales, stock markets, healthcare, and
more. It is used to analyze historical patterns, identify trends, forecast future
values, and understand the behavior of a phenomenon over time.
Time series data can be either continuous or discrete, and both kinds can easily
be visualized using Python. Continuous time series data represents measurements
that can take any value within a range, such as temperature readings or stock
prices. Discrete time series data, on the other hand, represents measurements
that are limited to specific values or categories, such as the number of sales
per day or customer ratings.
Analyzing and visualizing time series data plays a crucial role in gaining
insights, making predictions, and understanding the underlying dynamics of
a system or process over time.
Time series data in data visualization can be classified into two main types
based on the nature of the data: continuous and discrete.
A. Continuous Data
Temperature Data: Continuous temperature recordings collected at regular intervals, such as hourly or
daily measurements.
Stock Market Data: Continuous data representing the prices or values of stocks, which are recorded
throughout trading hours.
Sensor Data: Measurements from sensors that record continuous variables like pressure, humidity, or air
quality at frequent intervals.
Financial Data: Continuous data related to financial metrics like revenue, sales, or profit, which are
tracked over time.
Environmental Data: Continuous data collected from environmental monitoring devices, such as weather
stations, to track variables like wind speed, rainfall, or pollution levels.
Physiological Data: Continuous data capturing physiological parameters like heart rate, blood pressure, or
glucose levels recorded at regular intervals.
Continuous time series data is typically visualized using techniques such as
line plots, area charts, or smooth plots. These visualization methods allow
us to observe trends, fluctuations, and patterns in the data over time, aiding
in understanding the behavior and dynamics of the underlying phenomenon.
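For instance, a continuous series such as daily temperature might be drawn as a line plot with pandas and matplotlib; all values below are invented for the example.

```python
# Line plot of a continuous time series (invented daily temperature readings).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

dates = pd.date_range("2024-01-01", periods=90, freq="D")
rng = np.random.default_rng(7)
temperature = 10 + 8 * np.sin(np.arange(90) / 14) + rng.normal(0, 1, 90)

series = pd.Series(temperature, index=dates, name="temperature_c")
series.plot()                      # pandas draws a line plot against the dates
plt.ylabel("Temperature (°C)")
plt.title("Daily temperature (synthetic data)")
plt.show()
```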
B. Discrete Data
Discrete time series data refers to measurements or observations that are
limited to specific values or categories. Unlike continuous data, discrete data
does not have a continuous range of possible values but instead consists of
distinct and separate data points. Discrete time series data is commonly
encountered in various domains, including:
Count Data: Data representing the number of occurrences or events within a specific time period. Examples
include the number of daily sales, the number of customer inquiries per month, or the number of website
visits per hour.
Categorical Data: Data that falls into distinct categories or classes. This can include variables such as
customer segmentation, product types, or survey responses with predefined response options.
Binary Data: Data that has only two possible outcomes or states. For instance, a time series tracking
whether a machine is functioning (1) or not (0) at each time point.
Rating Scales: Data obtained from surveys or feedback forms where respondents provide ratings on a
discrete scale, such as a Likert scale.
Discrete time series data is often visualized using techniques such as bar
charts, histograms, or stacked area charts. These visualizations help in
understanding the distribution, changes, and patterns within the discrete
data over time. By examining the frequency or proportion of different
categories or values, analysts can gain insights into trends and patterns
within the data.
1. Tabular Visualization: Tabular visualization presents time series data in a structured table format, with
each row representing a specific time period and columns representing different variables or measurements.
It provides a concise overview of the data but may not capture trends or patterns as effectively as graphical
visualizations.
2. 1D Plot of Measurement Times: This type of visualization represents the measurement times along a one-
dimensional axis, such as a timeline. It helps in understanding the temporal distribution of data points and
identifying any temporal patterns.
3. 1D Plot of Measurement Values: A 1D plot of measurement values displays the variation in data values
over time along a single axis. Line plots and step plots are commonly used techniques for visualizing
continuous time series data, while bar charts or dot plots can be used for discrete data.
4. 1D Color Plot of Measurement Values: In this visualization technique, the variation in measurement
values is represented using colors on a one-dimensional axis. It enables the quick identification of high or
low values and provides an intuitive overview of the data.
5. Bubble Plot: Bubble plots represent time series data using bubbles, where each bubble represents a data
point with its size or color encoding a specific measurement value. This visualization method allows the
simultaneous representation of multiple variables and their evolution over time.
6. Scatter Plot: Scatter plots display the relationship between two variables by plotting data points as
individual dots on a Cartesian plane. Time series data can be visualized by representing one variable on the
x-axis and another on the y-axis.
7. Linear Line Plot: Linear line plots connect consecutive data points with straight lines, emphasizing the
trend and continuity of the data over time.
8. Linear Step Plot: Linear step plots also connect consecutive data points, but with vertical and horizontal
lines, resulting in a stepped appearance. This visualization is useful when tracking changes that occur
instantaneously at specific time points.
9. Linear Smooth Plot: Linear smooth plots apply a smoothing algorithm to the data, resulting in a
continuous curve that captures the overall trend while reducing noise or fluctuations. It helps in visualizing
long-term patterns more clearly.
10. Area Chart: Area charts fill the area between the line representing the data and the x-axis, emphasizing the
cumulative value or distribution over time. They are commonly used to visualize stacked time series data or
to show the composition of a variable over time.
11. Horizon Chart: Horizon charts condense time series data into a compact, horizontally layered
representation. They are particularly useful when comparing multiple time series data on a single chart,
optimizing screen space usage.
12. Bar Chart: Bar charts represent discrete time series data using rectangular bars, with the height of each bar
indicating the value of a specific measurement. They are effective in comparing values between different
time periods or categories.
13. Histogram: Histograms display the distribution of continuous or discrete time series data by dividing the
range of values into equal intervals (bins) and representing the frequency or count of data points falling
within each bin.
Best Platforms to Visualize Data
1. Microsoft Power BI
R: The R programming language is also widely used for time series visualization. A typical workflow:
Import the time series data into R using appropriate data structures such as data frames or time series
objects.
Install and load the required packages for time series visualization (e.g., ggplot2, plotly).
Use functions from the chosen package to create the desired visualizations, such as line plots, area charts,
or interactive plots.
Customize the visual appearance, labels, and annotations.
Add interactivity, tooltips, or animations to enhance the exploration of the data.
Export the visualizations to various formats or integrate them into reports or presentations.
Excel: Excel, a widely used spreadsheet software, also offers basic time series visualization capabilities.
While not as advanced as dedicated data visualization platforms or programming languages, Excel provides
various chart types that can effectively represent time series data, such as line charts, bar charts, and scatter
plots.
4. Excel
Excel is a popular spreadsheet software and a great tool for data analysis. It
allows users to perform various tasks related to data organization, analysis,
and presentation. Here are some ways to visualize Time series in Excel:
Gantt Charts: Gantt charts are widely used to visualize project schedules or timelines. They display tasks
or events along a horizontal timeline, with bars representing the start and end dates of each task. Gantt
charts provide a clear overview of project progress, dependencies, and resource allocation over time.
Line Graphs: Line graphs are effective for visualizing continuous time series data. They connect data
points with straight lines, allowing us to observe trends, seasonality, or irregularities over time.
Heatmap: Heatmaps represent time series data using color intensity in a grid format. They are useful for
visualizing patterns, correlations, or anomalies in multi-dimensional time series data.
Map: Maps can be employed to visualize time series data geographically. By plotting data points on a map,
we can observe spatial patterns or changes in variables over time.
Stacked Area Charts: Stacked area charts display the cumulative value or proportion of different variables
over time. They are useful for visualizing the composition or contribution of each variable to the total.
Conclusion
3D VISUALIZATION
DEFINITION
3D Visualization is the process of creating three-dimensional visual representations of
objects, environments, or concepts using 3D software. It allows users to view objects
or concepts more interactively and realistically. Typically, 3D visualization software
allows users to manipulate and interact with the object or environment, resulting in a
more immersive experience.
3D Visualization can be a useful capability in PLM and QMS solutions, providing a
more accurate and detailed representation of the product and enhancing the overall
efficiency and effectiveness of product development processes. Here are some
examples:
3D Visualizations for Product Design: 3D visualizations can aid in the