Ai Unit Wise Notes
- Formulating problems
Steps performed by Problem-solving agent
Goal Formulation: It is the first and simplest step in
problem-solving. It organizes the steps/sequence
required to formulate one goal out of multiple goals as
well as actions to achieve that goal. Goal formulation is
based on the current situation and the agent’s
performance measure (discussed below).
Problem Formulation: It is the most important step of
problem-solving which decides what actions should be taken to
achieve the formulated goal. The following five components are involved in problem formulation:
Initial State: It is the starting state or initial step of the
agent towards its goal.
Actions: It is the description of the possible actions
available to the agent.
Transition Model: It describes what each action does.
CS6659-Artificial Page
8-queens problem: The aim of this problem is to place
eight queens on a chessboard in an order where no
queen may attack another. A queen can attack other queens diagonally or along the same row or column.
From the following figure, we can understand the problem
as well as its correct solution.
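The "no queen may attack another" condition can be checked directly in code. Below is a minimal Python sketch; the board representation and the sample solution are illustrative assumptions, not taken from the figure:

```python
def attacks(q1, q2):
    """True if two queens at (row, col) positions attack each other:
    same row, same column, or same diagonal."""
    r1, c1 = q1
    r2, c2 = q2
    return (r1 == r2 or c1 == c2
            or abs(r1 - r2) == abs(c1 - c2))  # diagonal check

def is_solution(queens):
    """A placement solves 8-queens if no pair of queens attacks another."""
    return all(not attacks(queens[i], queens[j])
               for i in range(len(queens))
               for j in range(i + 1, len(queens)))

# One known 8-queens solution, written as (row, column) pairs
solution = [(0, 0), (1, 4), (2, 7), (3, 5), (4, 2), (5, 6), (6, 1), (7, 3)]
print(is_solution(solution))  # True
```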
Traveling salesperson problem(TSP): It is a touring
problem where the salesman can visit each city only
once. The objective is to find the shortest tour while selling goods in each city.
VLSI Layout problem: In this problem, millions of
components and connections are positioned on a chip
in order to minimize the area, circuit delays, and stray capacitances, and to maximize the manufacturing yield.
The layout problem is split into two parts:
Cell layout: Here, the primitive components of the
circuit are grouped into cells, each performing its
specific function. Each cell has a fixed shape and size.
The task is to place the cells on the chip without
overlapping each other.
Channel routing: It finds a specific route for each wire
through the gaps between the cells.
Protein Design: The objective is to find a sequence of amino acids that will fold into a three-dimensional protein with the properties needed to cure some disease.
Searching for solutions
We have seen many problems. Now, there is a need to
search for solutions to solve them.
Breadth-first search (BFS)
BFS expands the shallowest (i.e., least deep) node first, using FIFO (first in, first out) order. Thus, new nodes (i.e., the children of a parent node) remain in the queue, and the old unexpanded nodes, which are shallower than the new nodes, get expanded first.
In BFS, goal test (a test to check whether the current
state is a goal state or not) is applied to each node at the
time of its generation rather than when it is selected for expansion.
Breadth-first search tree
In the above figure, it is seen that the nodes are expanded
level by level starting from the root node A till the last
node I in the tree. Therefore, the BFS sequence followed
is: A->B->C->D->E->F->G->I.
BFS Algorithm
Set a variable NODE to the initial state, i.e., the root
node.
Set a variable GOAL which contains the value of
the goal state.
Loop over the nodes, traversing level by level, until the goal state is found.
While performing the looping, start removing the
elements from the queue in FIFO order.
If the goal state is found, return it; otherwise, continue the search.
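The looping described above can be sketched in Python. The adjacency dictionary below is a hypothetical tree used only for illustration, not the figure from the notes:

```python
from collections import deque

def bfs(graph, start, goal):
    """Breadth-first search: expand the shallowest node first (FIFO queue)."""
    queue = deque([[start]])      # queue of paths; popleft gives FIFO order
    visited = {start}
    while queue:
        path = queue.popleft()    # oldest (shallowest) path first
        node = path[-1]
        if node == goal:          # goal test
            return path
        for child in graph.get(node, []):
            if child not in visited:
                visited.add(child)
                queue.append(path + [child])
    return None                   # goal not reachable

# Hypothetical tree: level 0 is A, level 1 is B C, and so on
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"], "D": ["H", "I"]}
print(bfs(tree, "A", "I"))  # ['A', 'B', 'D', 'I']
```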
The performance measure of BFS is as follows:
Completeness: It is a complete strategy: if a goal state exists at a finite depth, BFS is guaranteed to find it.
Optimality: It gives an optimal solution if the cost of each step is the same.
Space Complexity: The space complexity of BFS is O(b^d), i.e., it requires a huge amount of memory. Here, b is the branching factor and d denotes the depth/level of the tree.
Time Complexity: The time complexity of BFS is O(b^d); it consumes much time to reach the goal node for large instances.
Disadvantages of BFS
The biggest disadvantage of BFS is that it requires a lot of memory space; it is therefore a memory-intensive strategy.
BFS is a time-consuming search strategy because it expands the nodes breadthwise.
Note: BFS expands the nodes level by level, i.e.,
breadthwise, therefore it is also known as a Level search
technique.
Example:
Consider the search problem below; we will traverse it using greedy best-first search. At each iteration, each node is expanded using the evaluation function f(n) = h(n), which is given in the table below.
Expand the nodes of S and put them in the CLOSED list.
Initialization: Open [A, B], Closed [S]
Iteration 1: Open [A], Closed [S, B]
Iteration 2: Open [E, F, A], Closed [S, B]
: Open [E, A], Closed [S, B, F]
Iteration 3: Open [I, G, E, A], Closed [S, B, F]
: Open [I, E, A], Closed [S, B, F, G]
Hence the final solution path will be: S -> B -> F -> G
Time Complexity: The worst-case time complexity of greedy best-first search is O(b^m).
Space Complexity: The worst-case space complexity of greedy best-first search is O(b^m).
Where, m is the maximum depth of the search
space.
Complete: Greedy best-first search is incomplete, even if the given state space is finite, because it can get stuck in loops.
Optimal: Greedy best first search algorithm is not
optimal.
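The OPEN/CLOSED bookkeeping above can be sketched with a priority queue ordered by h(n). The graph and heuristic values below are hypothetical, chosen only so that the run reproduces the S -> B -> F -> G trace:

```python
import heapq

def greedy_best_first(graph, h, start, goal):
    """Greedy best-first search: expand the node with the lowest h(n)."""
    open_list = [(h[start], [start])]   # priority queue ordered by h(n)
    closed = set()
    while open_list:
        _, path = heapq.heappop(open_list)
        node = path[-1]
        if node == goal:
            return path
        if node in closed:
            continue
        closed.add(node)                # move the expanded node to CLOSED
        for child in graph.get(node, []):
            if child not in closed:
                heapq.heappush(open_list, (h[child], path + [child]))
    return None

# Hypothetical graph and heuristic table (not the notes' values)
graph = {"S": ["A", "B"], "B": ["E", "F"], "F": ["I", "G"]}
h = {"S": 13, "A": 12, "B": 4, "E": 8, "F": 2, "I": 9, "G": 0}
print(greedy_best_first(graph, h, "S", "G"))  # ['S', 'B', 'F', 'G']
```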
Uniform-cost search
Uniform-cost search expands the node with the lowest path cost g(n) first, using a priority queue ordered by path cost. Step costs are assumed to be positive, so paths never get shorter when a new node is added in the search.
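Uniform-cost search can be sketched with a priority queue ordered by the path cost g(n). The weighted graph below is an illustrative assumption:

```python
import heapq

def uniform_cost_search(graph, start, goal):
    """Uniform-cost search: expand the node with the lowest path cost g(n)."""
    open_list = [(0, [start])]        # priority queue ordered by path cost
    best = {}                         # cheapest known cost to each node
    while open_list:
        cost, path = heapq.heappop(open_list)
        node = path[-1]
        if node == goal:
            return path, cost
        if node in best and best[node] <= cost:
            continue
        best[node] = cost
        for child, step in graph.get(node, []):
            # step costs are positive, so paths never get shorter
            heapq.heappush(open_list, (cost + step, path + [child]))
    return None, float("inf")

graph = {"S": [("A", 1), ("B", 5)], "A": [("B", 2), ("G", 9)], "B": [("G", 3)]}
print(uniform_cost_search(graph, "S", "G"))  # (['S', 'A', 'B', 'G'], 6)
```

Note how the direct edge S -> B (cost 5) loses to the cheaper path S -> A -> B (cost 3): the queue always surfaces the lowest-g path first.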
DFS search tree
In the above figure, DFS starts from the initial node A (the root node) and traverses deeply in one direction till node I, then backtracks to B, and so on. Therefore, the sequence will be A->B->D->I->E->C->F->G.
DFS Algorithm
Set a variable NODE to the initial state, i.e., the root
node.
Set a variable GOAL which contains the value of
the goal state.
Loop over the nodes, traversing deeply along one direction/path in search of the goal node.
While performing the looping, start removing the
elements from the stack in LIFO order.
If the goal state is found, return it; otherwise, backtrack and expand nodes in another direction.
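The stack-based looping above can be sketched as follows; the tree is a hypothetical example, not the figure from the notes:

```python
def dfs(graph, start, goal):
    """Depth-first search: expand the deepest node first (LIFO stack)."""
    stack = [[start]]                 # stack of paths; pop gives LIFO order
    visited = set()
    while stack:
        path = stack.pop()            # newest (deepest) path first
        node = path[-1]
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        # push children in reverse so the left-most child is expanded first
        for child in reversed(graph.get(node, [])):
            if child not in visited:
                stack.append(path + [child])
    return None

tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"], "D": ["I"]}
print(dfs(tree, "A", "G"))  # ['A', 'C', 'G']
```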
The performance measure of DFS
Completeness: DFS is not complete, as it does not guarantee reaching the goal state.
Optimality: It does not give an optimal solution as it
expands nodes in one direction deeply.
Space complexity: It needs to store only a single path from the root node to the leaf node. Therefore, DFS has O(bm) space complexity, where b is the branching factor (i.e., the total number of child nodes a parent node has) and m is the maximum length of any path.
Time complexity: DFS has O(b^m) time complexity.
Disadvantages of DFS
It may get trapped in an infinite loop.
Depth-limited search on a binary
tree
In the above figure, the depth limit is 1, so only levels 0 and 1 get expanded, giving the DFS sequence A->B->C starting from the root node A. This does not give a satisfactory result because we could not reach the goal node I.
Depth-limited search Algorithm
Set a variable NODE to the initial state, i.e., the root
node.
Set a variable GOAL which contains the value of
the goal state.
Set a variable LIMIT which carries a depth-limit value.
In the above figure, the goal node is H and the initial depth limit covers levels 0-1. So the search will expand levels 0 and 1 and terminate with the sequence A->B->C. If we then raise the depth limit to cover levels 0-3, the search again expands the nodes from level 0 to level 3 and terminates with the sequence A->B->D->F->E->H, where H is the desired goal node.
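Depth-limited search is plain DFS with a cutoff. A minimal recursive sketch, on a hypothetical tree where the goal H sits at depth 3:

```python
def depth_limited_search(graph, node, goal, limit, path=None):
    """DFS that refuses to expand nodes below the given depth limit."""
    if path is None:
        path = [node]
    if node == goal:
        return path
    if limit == 0:                    # cutoff reached: do not go deeper
        return None
    for child in graph.get(node, []):
        result = depth_limited_search(graph, child, goal,
                                      limit - 1, path + [child])
        if result is not None:
            return result
    return None

tree = {"A": ["B", "C"], "B": ["D", "E"], "D": ["F"], "E": ["H"]}
print(depth_limited_search(tree, "A", "H", 1))  # None: H lies below limit 1
print(depth_limited_search(tree, "A", "H", 3))  # ['A', 'B', 'E', 'H']
```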
Iterative deepening search Algorithm
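Iterative deepening repeats a depth-limited DFS with limits 0, 1, 2, ... until the goal is found, combining the low memory use of DFS with BFS-like completeness. A minimal sketch (the tree is illustrative):

```python
def iterative_deepening_search(graph, start, goal, max_depth=20):
    """Run depth-limited DFS with limits 0, 1, 2, ... until the goal is found."""
    def dls(node, path, limit):
        if node == goal:
            return path
        if limit == 0:
            return None
        for child in graph.get(node, []):
            found = dls(child, path + [child], limit - 1)
            if found:
                return found
        return None

    for limit in range(max_depth + 1):   # deepen the limit one level at a time
        result = dls(start, [start], limit)
        if result:
            return result
    return None

tree = {"A": ["B", "C"], "B": ["D", "E"], "E": ["H"]}
print(iterative_deepening_search(tree, "A", "H"))  # ['A', 'B', 'E', 'H']
```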
An informed search is more efficient than an uninformed search because, along with the current state information, some additional (heuristic) information is also present, which makes it easier to reach the goal state.
1. Create two empty lists: the OPEN list contains the nodes that have been generated but not yet expanded, and the CLOSED list contains visited as well as expanded nodes.
2. Initially, traverse the root node and visit its next
successor nodes and place them in the OPEN list in
ascending order of their heuristic value.
3. Select the first successor node from the OPEN list with
the lowest heuristic value and expand further.
4. Now, rearrange all the remaining unexpanded nodes in
the OPEN list and repeat above two steps.
5. If the goal node is reached, terminate the search, else
expand further.
In the above figure, the root node is A, and its next level
successor nodes are B and C with h(B)=2 and h(C)=4. Our
task is to explore that node which has the lowest h(n)
value. So, we will select node B and expand it further to
node D and E. Again, search out that node which has the
lowest h(n) value and explore it further.
The performance measure of Best-first search
Algorithm:
Completeness: Best-first search is incomplete, even in a finite state space.
Optimality: It does not provide an optimal solution.
Calculation of f(n) for node A: f(A) = (distance from node S to A) + h(A) = 2 + 12 = 14
Calculation of f(n) for node B: f(B) = (distance from node S to B) + h(B) = 3 + 14 = 17
Therefore, node A has the lowest f(n) value. Hence, node A
will be explored to its next level nodes C and D and again
calculate the lowest f(n) value. After calculating, the
sequence we get is S->A–>D->G with f(n)=13(lowest
value).
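A* orders the OPEN list by f(n) = g(n) + h(n), where g(n) is the actual cost from S. The sketch below uses hypothetical edge costs and heuristic values, chosen so the search reproduces an S -> A -> D -> G path with total cost 13; they are not the values from the notes' table:

```python
import heapq

def a_star(graph, h, start, goal):
    """A* search: expand the node with the lowest f(n) = g(n) + h(n)."""
    # Each queue entry is (f, g, path); g is the cost from start so far
    open_list = [(h[start], 0, [start])]
    best_g = {}
    while open_list:
        f, g, path = heapq.heappop(open_list)
        node = path[-1]
        if node == goal:
            return path, g
        if node in best_g and best_g[node] <= g:
            continue
        best_g[node] = g
        for child, cost in graph.get(node, []):
            heapq.heappush(open_list,
                           (g + cost + h[child], g + cost, path + [child]))
    return None, float("inf")

# Hypothetical weighted graph and admissible heuristic values
graph = {"S": [("A", 2), ("B", 3)], "A": [("C", 4), ("D", 5)], "D": [("G", 6)]}
h = {"S": 10, "A": 9, "B": 12, "C": 8, "D": 5, "G": 0}
print(a_star(graph, h, "S", "G"))  # (['S', 'A', 'D', 'G'], 13)
```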
How to make A* search admissible to get an optimized
solution?
Here, h(n) is the actual heuristic cost value and h’(n) is
the estimated heuristic cost value.
Note: An overestimated cost value may or may not lead to an optimized solution, but an underestimated cost value always leads to an optimized solution.
Let’s understand with the help of an example:
Here, the destination/ goal is to eat some food. We have
two ways, either order food from any restaurant or buy
some food ingredients and cook to eat food. Thus, we can
apply either of the two ways; the choice is ours. It is not guaranteed whether the order will be delivered on time, whether the food will be tasty, and so on. But if we purchase the ingredients and cook the food ourselves, we will be more satisfied.
Heuristic Functions
As we have already seen, an informed search makes use of heuristic functions in order to reach the goal node in a more direct way. There are several pathways in a search tree to reach the goal node from the current node, so the selection of a good heuristic function certainly matters. A good heuristic function is judged by its efficiency: the more information it encodes about the problem, the more processing time it takes, but the fewer nodes the search needs to expand.
Some toy problems, such as 8-puzzle, 8-queen, tic-tac-toe,
etc., can be solved more efficiently with the help of a
heuristic function. Let’s see how:
A heuristic function for the 8-puzzle problem is defined
below:
It is seen from the above state space tree that the goal state
is minimized from h(n)=3 to h(n)=0. However, we can create
and use several heuristic functions as per the requirement. It
is also clear from the above example that a heuristic
function h(n) can be defined as the information required to
solve a given problem more efficiently. The information can
be related to the nature of the state, cost of transforming
from one state to another, goal node
characteristics, etc., which is expressed as a heuristic
function.
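One commonly used heuristic of this kind, for the 8-puzzle, is the number of misplaced tiles. A minimal sketch, assuming a board state is a 9-tuple read row by row with 0 standing for the blank:

```python
def misplaced_tiles(state, goal):
    """h(n) = number of tiles not in their goal position (blank excluded)."""
    return sum(1 for tile, target in zip(state, goal)
               if tile != 0 and tile != target)

goal  = (1, 2, 3, 4, 5, 6, 7, 8, 0)   # 0 marks the blank square
state = (1, 2, 3, 4, 5, 6, 0, 7, 8)   # tiles 7 and 8 are out of place
print(misplaced_tiles(state, goal))   # 2
print(misplaced_tiles(goal, goal))    # 0: the goal state has h(n) = 0
```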
Properties of a Heuristic search Algorithm
Use of heuristic function in a heuristic search algorithm
leads to following properties of a heuristic search algorithm:
BFS: It uses a queue to store data in memory; the oldest unexpanded node is its first priority to explore.
DFS: It uses a stack to store data in memory; the nodes along an edge are explored first.
Unit 3
Knowledge representation and reasoning (KR², KR&R) is the field of artificial intelligence (AI)
dedicated to representing information about the world in a form that a computer system can utilize to
solve complex tasks such as diagnosing a medical condition or having a dialog in a natural language.
Knowledge representation incorporates findings from psychology about how humans solve problems
and represent knowledge in order to design formalisms that will make complex systems easier to
design and build. Knowledge representation and reasoning also incorporates findings from logic to
automate various kinds of reasoning, such as the application of rules or the relations
of sets and subsets.
Examples of knowledge representation formalisms include semantic nets, systems
architecture, frames, rules, and ontologies. Examples of automated reasoning engines
include inference engines, theorem provers, and classifiers.
Human beings are good at understanding, reasoning and interpreting knowledge. And
using this knowledge, they are able to perform various actions in the real world. But how
do machines perform the same? In this article, we will learn about Knowledge
Representation in AI and how it helps the machines perform reasoning and interpretation
using Artificial Intelligence in the following sequence:
Knowledge Representation and Reasoning (KR, KRR) represents information from the
real world for a computer to understand and then utilize this knowledge to solve complex
real-life problems like communicating with human beings in natural language.
Knowledge representation in AI is not just about storing data in a database, it allows a
machine to learn from that knowledge and behave intelligently like a human being.
Objects
Events
Performance
Facts
Meta-Knowledge
Knowledge-base
Now that you know about Knowledge representation in AI, let’s move on and know about
the different types of Knowledge.
These are the important types of Knowledge Representation in AI. Now, let’s have a look
at the cycle of knowledge representation and how it works.
Perception
Learning
Knowledge Representation & Reasoning
Planning
Execution
Here is an example to show the different components of the system and how it works:
Example
The above diagram shows the interaction of an AI system with the real world and
the components involved in showing intelligence.
The Perception component retrieves data or information from the environment. With the help of this component, you can retrieve data from the environment, find out the source of noises, and check if the AI was damaged by anything. It also defines how to respond when any sense has been detected.
Then, there is the Learning component, which learns from the data captured by the perception component. The goal is to build computers that can be taught instead of
programming them. Learning focuses on the process of self-improvement. In order
to learn new things, the system requires knowledge acquisition, inference,
acquisition of heuristics, faster searches, etc.
The main component in the cycle is Knowledge Representation and
Reasoning which shows the human-like intelligence in the machines. Knowledge
representation is all about understanding intelligence. Instead of trying to
understand or build brains from the bottom up, its goal is to understand and build
intelligent behavior from the top-down and focus on what an agent needs to know
in order to behave intelligently. Also, it defines how automated reasoning
procedures can make this knowledge available as needed.
The Planning and Execution components depend on the analysis of knowledge
representation and reasoning. Here, planning includes giving an initial state, finding
their preconditions and effects, and a sequence of actions to achieve a state in
which a particular goal holds. Now once the planning is completed, the final stage
is the execution of the entire process.
So, these are the different components of the cycle of Knowledge Representation in AI.
Now, let’s understand the relationship between knowledge and intelligence.
In this example, there is one decision-maker whose actions are justified by sensing the
environment and using knowledge. But, if we remove the knowledge part here, it will not
be able to display any intelligent behavior.
Reasoning:
Reasoning is the mental process of deriving logical conclusions and making predictions from available knowledge, facts, and beliefs. Or we can say, "Reasoning is a way to infer facts from existing data." It is a general process of thinking rationally to find valid conclusions.
In artificial intelligence, reasoning is essential so that the machine can also think rationally, as a human brain does, and can perform like a human.
Types of Reasoning
In artificial intelligence, reasoning can be divided into the following categories:
o Deductive reasoning
o Inductive reasoning
o Abductive reasoning
o Common Sense Reasoning
o Monotonic Reasoning
o Non-monotonic Reasoning
Note: Inductive and deductive reasoning are the forms of propositional logic.
1. Deductive reasoning:
Deductive reasoning is deducing new information from logically related known information. It
is the form of valid reasoning, which means the argument's conclusion must be true when the
premises are true.
Deductive reasoning is a type of propositional logic in AI, and it requires various rules and facts. It is sometimes referred to as top-down reasoning, in contrast to inductive reasoning.
In deductive reasoning, the truth of the premises guarantees the truth of the conclusion.
Deductive reasoning mostly starts from general premises and moves to a specific conclusion, as in the example below.
Example:
2. Inductive Reasoning:
Inductive reasoning is a form of reasoning that arrives at a conclusion using a limited set of facts by the process of generalization. It starts with a series of specific facts or data and reaches a general statement or conclusion.
In inductive reasoning, we use historical data or various premises to generate a generic rule,
for which premises support the conclusion.
In inductive reasoning, premises provide probable supports to the conclusion, so the truth of
premises does not guarantee the truth of the conclusion.
Example:
Premise: All of the pigeons we have seen in the zoo are white.
Conclusion: Therefore, we can expect all the pigeons to be white.
3. Abductive reasoning:
Abductive reasoning is a form of logical reasoning which starts with one or more observations and then seeks the most likely explanation or conclusion for them.
Example:
Conclusion: It is raining.
4. Common Sense Reasoning:
Common sense reasoning simulates the human ability to make presumptions about events that occur every day.
It relies on good judgment rather than exact logic and operates on heuristic
knowledge and heuristic rules.
Example:
The above two statements are the examples of common sense reasoning which a human mind
can easily understand and assume.
5. Monotonic Reasoning:
In monotonic reasoning, once a conclusion is drawn, it remains the same even if we add other information to the existing information in our knowledge base. In monotonic reasoning, adding knowledge does not decrease the set of propositions that can be derived.
To solve monotonic problems, we can derive the valid conclusion from the available facts only,
and it will not be affected by new facts.
Monotonic reasoning is not useful for the real-time systems, as in real time, facts get
changed, so we cannot use monotonic reasoning.
Example:
It is a true fact, and it cannot be changed even if we add another sentence in knowledge base
like, "The moon revolves around the earth" Or "Earth is not round," etc.
6. Non-monotonic Reasoning
In Non-monotonic reasoning, some conclusions may be invalidated if we add some more
information to our knowledge base.
A logic is said to be non-monotonic if some conclusions can be invalidated by adding more knowledge to the knowledge base.
"Human perception of various things in daily life" is a general example of non-monotonic reasoning.
Example: Suppose the knowledge base contains the following knowledge:
So from the above sentences, we can conclude that Pitty can fly.
However, if we add one more sentence to the knowledge base, "Pitty is a penguin", which concludes "Pitty cannot fly", it invalidates the above conclusion.
Substitution:
Equality:
First-order logic does not use only predicates and terms for making atomic sentences; it also provides another construct, equality. We can use the equality symbol to specify that two terms refer to the same object.
As in the above example, the object referred by the Brother (John) is similar to the object
referred by Smith. The equality symbol can also be used with negation to represent that two
terms are not the same objects.
o Universal Generalization
o Universal Instantiation
o Existential Instantiation
o Existential introduction
1. Universal Generalization:
o Universal generalization is a valid inference rule which states that if premise P(c) is
true for any arbitrary element c in the universe of discourse, then we can have a
conclusion as ∀ x P(x).
Example: Let's represent, P(c): "A byte contains 8 bits", so for ∀ x P(x) "All bytes
contain 8 bits.", it will also be true.
2. Universal Instantiation:
Example:1.
Example: 2.
"All kings who are greedy are Evil." So let our knowledge base contains this detail as in the
form of FOL:
So from this information, we can infer any of the following statements using Universal
Instantiation:
3. Existential Instantiation:
Example:
So we can infer: Crown(K) ∧ OnHead( K, John), as long as K does not appear in the
knowledge base.
o The K used above is a constant symbol, which is called a Skolem constant.
o The Existential instantiation is a special case of Skolemization process.
4. Existential introduction
Example:
a) It is Sunday.
b) The Sun rises from the West. (False proposition)
c) 3 + 3 = 7 (False proposition)
d) 5 is a prime number.
o A proposition formula which is always true is called tautology, and it is also called a
valid sentence.
o A proposition formula which is always false is called Contradiction.
o A proposition formula which has both true and false values is called a contingency.
o Statements which are questions, commands, or opinions, such as "Where is Rohini", "How are you", and "What is your name", are not propositions.
a. Atomic Propositions
b. Compound propositions
Example:
Example:
P = Rohan is intelligent, Q = Rohan is hardworking. → P ∧ Q.
3. Disjunction: A sentence which has the ∨ connective, such as P ∨ Q, is called a disjunction, where P and Q are propositions.
Example: "Ritika is a doctor or Engineer",
Here P = Ritika is a doctor and Q = Ritika is an engineer, so we can write it as P ∨ Q.
4. Implication: A sentence such as P → Q, is called an implication. Implications are also
known as if-then rules. It can be represented as
If it is raining, then the street is wet.
Let P= It is raining, and Q= Street is wet, so it is represented as P → Q
5. Biconditional: A sentence such as P ⇔ Q is a biconditional sentence. Example: "If I am breathing, then I am alive."
P = I am breathing, Q = I am alive; it can be represented as P ⇔ Q.
Truth Table:
In propositional logic, we need to know the truth values of propositions in all possible
scenarios. We can combine all the possible combinations of truth values with logical connectives, and the representation of these combinations in a tabular format is called a truth table. Following are the truth tables for all logical connectives:
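A truth table can be generated mechanically by enumerating every combination of truth values. A small Python sketch, here tabulating P → Q written as (¬P) ∨ Q (the function names are illustrative):

```python
from itertools import product

def truth_table(propositions, formula):
    """Tabulate a formula's value for every combination of truth values."""
    rows = []
    print(" | ".join(propositions + ["result"]))
    for values in product([True, False], repeat=len(propositions)):
        result = formula(*values)           # evaluate the formula on this row
        rows.append(values + (result,))
        print(" | ".join(str(v) for v in values + (result,)))
    return rows

# Truth table for P -> Q, written as (not P) or Q
rows = truth_table(["P", "Q"], lambda p, q: (not p) or q)
```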
Truth table with three propositions:
We can build a proposition composed of three propositions P, Q, and R. Its truth table is made up of 2³ = 8 rows, as we have taken three proposition symbols.
Precedence of connectives:
Just like arithmetic operators, there is a precedence order for propositional connectors or
logical operators. This order should be followed while evaluating a propositional problem.
Following is the list of the precedence order for operators:
Precedence Operators
Logical equivalence:
Logical equivalence is one of the features of propositional logic. Two propositions are said to
be logically equivalent if and only if the columns in the truth table are identical to each other.
Let's take two propositions A and B; for logical equivalence, we can write A ⇔ B. In the truth table below, we can see that the columns for ¬A ∨ B and A → B are identical; hence the two formulas are logically equivalent.
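This column-by-column comparison can be automated: two formulas are equivalent exactly when they agree on every row of the truth table. A sketch checking ¬A ∨ B against A → B (the two are deliberately implemented differently):

```python
from itertools import product

def equivalent(f, g, n_vars=2):
    """Two formulas are logically equivalent if their truth-table
    columns are identical, i.e. they agree on every assignment."""
    return all(f(*vals) == g(*vals)
               for vals in product([True, False], repeat=n_vars))

implies    = lambda a, b: b if a else True   # A -> B by case analysis
not_a_or_b = lambda a, b: (not a) or b       # the formula ¬A ∨ B

print(equivalent(implies, not_a_or_b))       # True
print(equivalent(lambda a, b: a and b,
                 lambda a, b: a or b))       # False: A ∧ B is not A ∨ B
```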
Properties of Operators:
o Commutativity:
o P∧ Q= Q ∧ P, or
o P ∨ Q = Q ∨ P.
o Associativity:
o (P ∧ Q) ∧ R= P ∧ (Q ∧ R),
o (P ∨ Q) ∨ R= P ∨ (Q ∨ R)
o Identity element:
o P ∧ True = P,
o P ∨ True= True.
o Distributive:
o P∧ (Q ∨ R) = (P ∧ Q) ∨ (P ∧ R).
o P ∨ (Q ∧ R) = (P ∨ Q) ∧ (P ∨ R).
o De Morgan's Law:
o ¬ (P ∧ Q) = (¬P) ∨ (¬Q)
o ¬ (P ∨ Q) = (¬ P) ∧ (¬Q).
o Double-negation elimination:
o ¬ (¬P) = P.
First-Order Logic in Artificial intelligence
In the topic of propositional logic, we have seen how to represent statements using propositional logic. But unfortunately, in propositional logic we can only represent facts, which are either true or false. PL is not sufficient to represent complex sentences or natural language statements; propositional logic has very limited expressive power.
Consider the following sentence, which we cannot represent using PL logic.
To represent such statements, PL logic is not sufficient, so we require a more powerful logic, such as first-order logic.
First-Order logic:
o First-order logic is another way of knowledge representation in artificial intelligence. It
is an extension to propositional logic.
o FOL is sufficiently expressive to represent the natural language statements in a concise
way.
o First-order logic is also known as predicate logic or first-order predicate logic. First-order logic is a powerful language that expresses information about the objects in a natural way and can also express the relationships between those objects.
o First-order logic (like natural language) does not only assume that the world contains
facts like propositional logic but also assumes the following things in the world:
o Objects: A, B, people, numbers, colors, wars, theories, squares, pits, wumpus,
......
o Relations: It can be a unary relation such as: red, round, is adjacent; or an n-ary relation such as: the sister of, brother of, has color, comes between
o Function: Father of, best friend, third inning of, end of, ......
o As a natural language, first-order logic also has two main parts:
a. Syntax
b. Semantics
The syntax of FOL determines which collection of symbols is a logical expression in first-order
logic. The basic syntactic elements of first-order logic are symbols. We write statements in
short-hand notation in FOL.
Variables: x, y, z, a, b, ...
Connectives: ∧, ∨, ¬, ⇒, ⇔
Equality: =
Quantifiers: ∀, ∃
Atomic sentences:
o Atomic sentences are the most basic sentences of first-order logic. These sentences are
formed from a predicate symbol followed by a parenthesis with a sequence of terms.
o We can represent atomic sentences as Predicate (term1, term2, ......, term n).
Complex Sentences:
o Complex sentences are made by combining atomic sentences using connectives.
Consider the statement: "x is an integer.", it consists of two parts, the first part x is the
subject of the statement and second part "is an integer," is known as a predicate.
Quantifiers in First-order logic:
o A quantifier is a language element which generates quantification, and quantification
specifies the quantity of specimen in the universe of discourse.
o These are the symbols that permit us to determine or identify the range and scope of a variable in a logical expression. There are two types of quantifiers:
Universal Quantifier:
Universal quantifier is a symbol of logical representation, which specifies that the statement
within its range is true for everything or every instance of a particular thing.
o For all x
o For each x
o For every x.
Example:
All men drink coffee.
Let x be a variable that refers to a man; then all x can be represented in UOD as below:
∀x man(x) → drink(x, coffee).
It will be read as: For all x, if x is a man, then x drinks coffee.
Existential Quantifier:
Existential quantifiers are the type of quantifiers, which express that the statement within its
scope is true for at least one instance of something.
It is denoted by the logical operator ∃, which resembles an inverted E. When it is used with a predicate variable, it is called an existential quantifier.
If x is a variable, then existential quantifier will be ∃x or ∃(x). And it will be read as:
Example:
Some boys are intelligent.
∃x: boys(x) ∧ intelligent(x)
It will be read as: There exists some x such that x is a boy and x is intelligent.
Points to remember:
o The main connective for universal quantifier ∀ is implication →.
o The main connective for existential quantifier ∃ is and ∧.
Properties of Quantifiers:
o In universal quantifier, ∀x∀y is similar to ∀y∀x.
o In Existential quantifier, ∃x∃y is similar to ∃y∃x.
o ∃x∀y is not similar to ∀y∃x.
Free Variable: A variable is said to be a free variable in a formula if it occurs outside the
scope of the quantifier.
Bound Variable: A variable is said to be a bound variable in a formula if it occurs within the
scope of the quantifier.
Inference engine:
The inference engine is the component of an intelligent system in artificial intelligence which applies logical rules to the knowledge base to infer new information from known facts. The first inference engine was part of an expert system. An inference engine commonly proceeds in two modes:
a. Forward chaining
b. Backward chaining
Horn clauses and definite clauses are forms of sentences which enable the knowledge base to use a more restricted and efficient inference algorithm. Logical inference algorithms use forward- and backward-chaining approaches, which require the KB to be in the form of first-order definite clauses.
Definite clause: A clause which is a disjunction of literals with exactly one positive literal is known as a definite clause or a strict Horn clause.
Horn clause: A clause which is a disjunction of literals with at most one positive literal is known as a Horn clause. Hence all definite clauses are Horn clauses.
For example, the Horn clause (¬p ∨ ¬q ∨ k) is equivalent to p ∧ q → k.
A. Forward Chaining
Forward chaining is also known as a forward deduction or forward reasoning method when
using an inference engine. Forward chaining is a form of reasoning which starts with atomic sentences in the knowledge base and applies inference rules (Modus Ponens) in the forward direction to extract more data until a goal is reached.
The forward-chaining algorithm starts from known facts, triggers all rules whose premises are satisfied, and adds their conclusions to the known facts. This process repeats until the problem is solved.
Properties of Forward-Chaining:
o It is a process of making a conclusion based on known facts or data, starting from the initial state and reaching the goal state.
o The forward-chaining approach is also called data-driven, as we reach the goal using available data.
o The forward-chaining approach is commonly used in expert systems, such as CLIPS, and in business and production rule systems.
Consider the following famous example which we will use in both approaches:
Example:
"As per the law, it is a crime for an American to sell weapons to hostile nations.
Country A, an enemy of America, has some missiles, and all the missiles were sold
to it by Robert, who is an American citizen."
To solve the above problem, first, we will convert all the above facts into first-order definite
clauses, and then we will use a forward-chaining algorithm to reach the goal.
Step-1:
In the first step, we will start with the known facts and choose the sentences which do not have implications, such as American(Robert), Enemy(A, America), Owns(A, T1), and Missile(T1). All these facts will be represented as below.
Step-2:
At the second step, we see which new facts can be inferred from the available facts
whose premises are satisfied.
Rule-(1) does not have its premises satisfied, so it will not be added in the first iteration.
Rule-(4) is satisfied with the substitution {p/T1}, so Sells(Robert, T1, A) is added, which
follows from the conjunction of Rule-(2) and Rule-(3).
Rule-(6) is satisfied with the substitution {p/A}, so Hostile(A) is added, which follows
from Rule-(7).
Step-3:
At step-3, we can check that Rule-(1) is satisfied with the substitution {p/Robert, q/T1,
r/A}, so we can add Criminal(Robert), which follows from all the available facts. Hence
we have reached our goal statement.
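The fire-rules-until-fixpoint loop above can be sketched in Python. The facts and rules of the crime example are grounded by hand here (our own encoding; the notes' first-order rules would need unification, which this sketch omits):

```python
# Facts and rules of the crime example, grounded by hand.
facts = {"American(Robert)", "Missile(T1)", "Owns(A, T1)", "Enemy(A, America)"}
rules = [
    ({"Missile(T1)"}, "Weapon(T1)"),                          # missiles are weapons
    ({"Missile(T1)", "Owns(A, T1)"}, "Sells(Robert, T1, A)"),
    ({"Enemy(A, America)"}, "Hostile(A)"),                    # enemies are hostile
    ({"American(Robert)", "Weapon(T1)", "Sells(Robert, T1, A)", "Hostile(A)"},
     "Criminal(Robert)"),
]

def forward_chain(facts, rules):
    """Fire every rule whose premises are satisfied (Modus Ponens), add the
    conclusion, and repeat until no new fact appears."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

print("Criminal(Robert)" in forward_chain(facts, rules))  # True
```

The first pass adds Weapon(T1), Sells(Robert, T1, A), and Hostile(A); the second pass then fires the crime rule, exactly as in Step-2 and Step-3 above.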
B. Backward Chaining:
Backward-chaining is also known as a backward deduction or backward reasoning method
when using an inference engine. A backward chaining algorithm is a form of reasoning,
which starts with the goal and works backward, chaining through rules to find known facts
that support the goal.
o The backward-chaining method mostly uses a depth-first search strategy for
proof.
Example:
In backward chaining, we will use the same example as above and rewrite all the rules.
Backward-Chaining proof:
In Backward chaining, we will start with our goal predicate, which is Criminal(Robert),
and then infer further rules.
Step-1:
At the first step, we take the goal fact. From the goal fact we infer other facts, and at
last we prove those facts true. Our goal fact is "Robert is a criminal," so the following is
its predicate.
Step-2:
At the second step, we infer other facts from the goal fact which satisfy the rules. As we
can see in Rule-1, the goal predicate Criminal(Robert) is present with substitution
{Robert/p}. So we add all the conjunctive facts below the first level and replace p with
Robert.
Step-4:
At step-4, we can infer the facts Missile(T1) and Owns(A, T1) from Sells(Robert, T1, r),
which satisfies Rule-4 with the substitution of A in place of r. So these two statements
are proved here.
Step-5:
At step-5, we can infer the fact Enemy(A, America) from Hostile(A), which satisfies
Rule-6. Hence all the statements are proved true using backward chaining.
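The goal-driven, depth-first search described above can likewise be sketched in Python, again with the crime example grounded by hand (our own encoding, since the notes' first-order rules would need unification):

```python
# Same crime example, grounded by hand for illustration.
facts = {"American(Robert)", "Missile(T1)", "Owns(A, T1)", "Enemy(A, America)"}
rules = [
    ({"Missile(T1)"}, "Weapon(T1)"),
    ({"Missile(T1)", "Owns(A, T1)"}, "Sells(Robert, T1, A)"),
    ({"Enemy(A, America)"}, "Hostile(A)"),
    ({"American(Robert)", "Weapon(T1)", "Sells(Robert, T1, A)", "Hostile(A)"},
     "Criminal(Robert)"),
]

def backward_chain(goal, facts, rules, depth=10):
    """Goal-driven proof: the goal holds if it is a known fact, or if some rule
    concludes it and all of that rule's premises can be proved recursively
    (a depth-first search, as the notes describe)."""
    if goal in facts:
        return True
    if depth == 0:
        return False
    return any(conclusion == goal and
               all(backward_chain(p, facts, rules, depth - 1) for p in premises)
               for premises, conclusion in rules)

print(backward_chain("Criminal(Robert)", facts, rules))  # True
```

Starting from Criminal(Robert), the search descends through the crime rule's premises exactly as in Steps 1-5 above, bottoming out at the known facts.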
UNIT 4
PLANNING AND MACHINE LEARNING
The agent first generates a goal to achieve and then constructs a plan to achieve it from
the current state.
Forward search
Backward search
Heuristic search
Solutions
Why Planning?
Intelligent agents must operate in the world. They are not simply passive reasoners
(knowledge representation, reasoning under uncertainty) or problem solvers (search); they
must also act on the world.
We want intelligent agents to act in "intelligent ways": taking purposeful actions, predicting
the expected effect of such actions, and composing actions together to achieve complex
goals. E.g., if we have a robot, we want the robot to decide what to do and how to act to
achieve our goals.
Planning Problem
Critical issue: we need to reason about what the world will be like after doing a few actions, not
just what it is like now
GOAL: Craig has coffee
CURRENTLY: robot in mailroom, has no coffee, coffee not made, Craig in office etc.
o Start step has the initial state description and its effect
Right Sock
Right Shoe
Left Sock
Left Shoe
Finish
Partial Order Plan Algorithm
4.2 Stanford Research Institute Problem Solver (STRIPS)
STRIPS is a classical planning language, representing plan components as states, goals, and
actions, allowing algorithms to parse the logical structure of the planning problem to provide a
solution.
The goal is also represented as a conjunction of positive, ground literals. A state satisfies a
goal if the state contains all of the conjoined literals in the goal; e.g., Stacked ∧ Ordered ∧
Purchased satisfies Ordered ∧ Stacked.
Actions (or operators) are defined by action schemas, each consisting of three parts: the
action name with its parameter list, a precondition, and an effect. The precondition is a
conjunction of positive literals that must hold before the action can be executed; all
variables in a precondition must appear in the action's parameter list. The effect is a
conjunction of literals describing how the state changes; any world state not explicitly
impacted by the action schema's effect is assumed to remain unchanged.
The following, simple action schema describes the action of moving a box from location x to
location y:
Action: MoveBox(x, y)
Precond: BoxAt(x)
Effect: BoxAt(y), ¬ BoxAt(x)
If an action is applied but the current state of the system does not meet the necessary
preconditions, then the action has no effect. But if an action is successfully applied, then
any positive literals in the effect are added to the current state of the world;
correspondingly, any negative literals in the effect result in the removal of the
corresponding positive literals from the state of the world.
For example, in the action schema above, the effect would result in the proposition BoxAt(y)
being added to the known state of the world, while BoxAt(x) would be removed from the known
state of the world. (Recall that the state only includes positive literals, so a negation effect
results in the removal of positive literals.) Note also that positive effects cannot be
duplicated in the state; likewise, negating a proposition that is not currently in the state is
simply ignored. For example, if Open(x) was not previously part of the state, ¬Open(x)
would have no effect.
A STRIPS problem includes the complete (but relevant) initial state of the world, the goal
state(s), and action schemas. A STRIPS algorithm should then be able to accept such a problem,
returning a solution. The solution is simply an action sequence that, when applied to the initial
state, results in a state which satisfies the goal.
STRIPS(A, s, g)
    p = empty plan
    loop:
        if s satisfies g then return p
        a = [an applicable action in A, relevant for g]
        if a = null then return failure
        p' = STRIPS(A, s, precond(a))
        if p' = failure then return failure
        s = apply p' to s
        s = apply a to s
        p = p + p' + a
In the above STRIPS algorithm, A represents all of the possible grounded actions (i.e.,
action schemas with variables replaced by values), while s is the current state and g is the
goal state.
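The recursion above can be illustrated in Python for fully grounded actions. This is a minimal sketch, not the full STRIPS system: the Act record, the choice of the first relevant action, and the depth bound standing in for proper failure handling are our own simplifications.

```python
from typing import NamedTuple

class Act(NamedTuple):
    name: str
    pre: frozenset   # positive literals that must hold before the action
    adds: frozenset  # positive literals added by the effect
    dels: frozenset  # positive literals removed by the effect

def strips(actions, state, goal, depth=8):
    """Recursive planner following the pseudocode: pick an action relevant to
    an unsatisfied goal literal, recursively achieve its precondition, then
    apply it. `depth` bounds the recursion so failure terminates."""
    state, goal, plan = set(state), set(goal), []
    while not goal <= state:              # until s satisfies g
        if depth == 0:
            return None                   # failure
        relevant = [a for a in actions if a.adds & (goal - state)]
        if not relevant:
            return None
        a = relevant[0]
        sub = strips(actions, state, a.pre, depth - 1)  # p' = STRIPS(A, s, precond(a))
        if sub is None:
            return None
        for step in sub:                  # s = apply p' to s
            state = (state - step.dels) | step.adds
        state = (state - a.dels) | a.adds  # s = apply a to s
        plan += sub + [a]                  # p = p + p' + a
    return plan

# The MoveBox schema from above, grounded for two locations x and y:
move = Act("MoveBox(x,y)", frozenset({"BoxAt(x)"}),
           frozenset({"BoxAt(y)"}), frozenset({"BoxAt(x)"}))
print([a.name for a in strips([move], {"BoxAt(x)"}, {"BoxAt(y)"})])  # ['MoveBox(x,y)']
```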
4.3 Explanation
Most expert systems have explanation facilities that allow the user to ask why and how the
system reached some conclusion.
The questions are answered by referring to the system goals, the rules being used, and the
existing problem data. The rules typically reflect empirical or "compiled" knowledge: they
are encodings of an expert's rules of thumb, not the expert's deeper understanding.
Example:
Dialog with an expert system designed to give advice on car problems.
User No.
User yes
User yes
User Why?
Note: The rule gives the correct advice for a flooded car and knows the questions to ask to
determine whether the car is flooded, but it does not contain the knowledge of what a
flooded car is or why waiting will help.
Types of Explanation
There are four types of explanations commonly used in expert systems.
4.4 Learning
Machine Learning
Unlike humans, who learn from past experiences, a computer does not have "experiences".
A computer system learns from data, which represent some “past experiences” of an
application domain.
Objective of machine learning: learn a target function that can be used to predict the
values of a discrete class attribute, e.g., approved or not-approved, and high-risk or
low-risk.
Supervised Learning
Supervised learning is a machine learning technique for learning a function from training data.
The training data consist of pairs of input objects (typically vectors), and desired outputs. The
output of the function can be a continuous value (called regression), or can predict a class label
of the input object (called classification). The task of the supervised learner is to predict the
value of the function for any valid input object after having seen a number of training examples
(i.e. pairs of input and target output). To achieve this, the learner has to generalize from the
presented data to unseen situations in a "reasonable" way.
Another term for supervised learning is classification. Classifier performance depends
greatly on the characteristics of the data to be classified. There is no single classifier that
works best on all given problems. Determining a suitable classifier for a given problem is,
however, still more an art than a science. The most widely used classifiers are the Neural
Network (Multi-layer Perceptron), Support Vector Machines, k-Nearest Neighbors,
Gaussian Mixture Model, Gaussian, Naive Bayes, Decision Tree and RBF classifiers.
Supervision: The data (observations, measurements, etc.) are labeled with pre-defined
classes, as if a "teacher" had given the classes (supervision).
Given a set of data, the task is to establish the existence of classes or clusters in
the data
Decision Tree
A decision tree takes as input an object or situation described by a set of attributes and
returns a “decision” – the predicted output value for the input.
A decision tree reaches its decision by performing a sequence of tests. Each internal node in the
tree corresponds to a test of the value of one of the properties, and the branches from the node
are labeled with the possible values of the test. Each leaf node in the tree specifies the value to be
returned if that leaf is reached. The decision tree representation seems to be very natural for
humans; indeed, many "How To" manuals (e.g., for car repair) are written entirely as a single
decision tree stretching over hundreds of pages.
A somewhat simpler example is provided by the problem of whether to wait for a table at a
restaurant. The aim here is to learn a definition for the goal predicate Will Wait. In setting this up
as a learning problem, we first have to state what attributes are available to describe examples in
the domain. we will see how to automate this task; for now, let's suppose we decide on the
following list of attributes:
2. Bar: whether the restaurant has a comfortable bar area to wait in.
5. Patrons: how many people are in the restaurant (values are None, Some, and Full).
10. Wait Estimate: the wait estimated by the host (0-10 minutes, 10-30, 30-60, >60).
Decision tree induction from examples
An example for a Boolean decision tree consists of a vector of input attributes, X, and a
single Boolean output value y. A set of examples (X1, y1), ..., (Xn, yn) is shown in the
figure. The positive examples are the ones in which the goal WillWait is true (X1, X3, ...);
the negative examples are the ones in which it is false (X2, X5, ...). The complete set of
examples is called the training set.
Examples
Robotics: Quadruped Gait Control, Ball Acquisition (Robocup)
Control: Helicopters
Economics/Finance: Trading
Markov decision process VS Reinforcement Learning
Markov decision process
Set of state S, set of actions A
Model-based: Learn transition and reward model, use it to get optimal policy
Passive Learning
Same as policy evaluation for known transition & reward models
Agent executes a sequence of trials:
The utilities of the states satisfy the Bellman equations.
Example:
The TD equation: U(s) ← U(s) + α(R(s) + γU(s') − U(s))
The update for s' will occur a T(s, π(s), s') fraction of the time.
Convergence is guaranteed if the learning rate α decreases with the number of times a
state has been visited.
TD is model free
TD Vs ADP
TD is model-free, as opposed to ADP, which is model-based.
Active Learning
Agent updates policy as it learns
Exploitation vs Exploration
The passive approach gives a greedy agent
Trade-off
Pure exploitation mostly gets stuck in bad policies.
Pure Exploration
Exploration
Greedy in the limit of infinite exploration (GLIE)
Simple GLIE
Q-Learning
The exploration function gives an active ADP agent.
A model-free TD method
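The Q-learning update can be demonstrated on a toy problem. Everything below (the corridor environment, rewards, and hyperparameters) is invented for illustration; only the update rule itself comes from the method described above.

```python
import random

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.2, n_states=4):
    """Tabular Q-learning on a toy corridor: states 0..3, actions move left (-1)
    or right (+1), reward 1 for reaching state 3 (terminal). Model-free: the
    agent learns only from (s, a, r, s') samples, never from T or R directly."""
    Q = {(s, a): 0.0 for s in range(n_states) for a in (-1, +1)}
    rng = random.Random(0)
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy: explore with probability eps, else exploit
            if rng.random() < eps:
                a = rng.choice((-1, +1))
            else:
                a = max((-1, +1), key=lambda act: Q[(s, act)])
            s2 = min(max(s + a, 0), n_states - 1)
            r = 1.0 if s2 == n_states - 1 else 0.0
            # Q-learning (off-policy TD) update:
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            best_next = 0.0 if s2 == n_states - 1 else max(Q[(s2, -1)], Q[(s2, +1)])
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

Q = q_learning()
# the learned greedy policy moves right in every non-terminal state
assert all(Q[(s, +1)] > Q[(s, -1)] for s in range(3))
```

The epsilon-greedy choice is the exploitation-vs-exploration trade-off discussed above: pure exploitation would stop at the first policy it stumbles on.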
PART- A
2. What is planning?
PART B
2. Explain Machine Learning.
3. Explain STRIPS.
UNIT 5
Uncertainty:
Till now, we have learned knowledge representation using first-order logic and
propositional logic with certainty, which means we were sure about the predicates. With
this knowledge representation we might write A→B, which means if A is true then B is
true. But consider a situation where we are not sure whether A is true or not; then we
cannot express this statement. This situation is called uncertainty.
So to represent uncertain knowledge, where we are not sure about the predicates, we need
uncertain reasoning or probabilistic reasoning.
Causes of uncertainty:
Following are some leading causes of uncertainty in the real world.
Probabilistic reasoning:
Probabilistic reasoning is a way of knowledge representation where we apply the concept of
probability to indicate the uncertainty in knowledge. In probabilistic reasoning, we combine
probability theory with logic to handle the uncertainty.
In the real world, there are lots of scenarios where the certainty of something is not
confirmed, such as "it will rain today," "the behavior of someone in some situation," or "a
match between two teams or two players." These are probable sentences: we can assume
they will happen but cannot be sure, so here we use probabilistic reasoning.
In probabilistic reasoning, there are two ways to solve problems with uncertain knowledge:
o Bayes' rule
o Bayesian Statistics
Probability: Probability can be defined as the chance that an uncertain event will occur. It
is the numerical measure of the likelihood that an event will occur. The value of a
probability always lies between 0 and 1, where 0 and 1 represent the ideal certainties
(impossible and certain, respectively).
We can find the probability of an uncertain event by using the below formula:
P(A) = (number of favourable outcomes) / (total number of outcomes)
Sample space: The collection of all possible events is called sample space.
Random variables: Random variables are used to represent the events and objects in the
real world.
Posterior Probability: The probability that is calculated after all evidence or information
has been taken into account. It is a combination of prior probability and new information.
Conditional probability:
Conditional probability is a probability of occurring an event when another event has already
happened.
Let's suppose we want to calculate the probability of event A when event B has already
occurred, "the probability of A under the conditions of B". It can be written as:
P(A|B) = P(A⋀B) / P(B)
Where P(A⋀B) = joint probability of A and B, and P(B) = marginal probability of B.
If instead the probability of A is given and we need to find the probability of B, it will be
given as:
P(B|A) = P(A⋀B) / P(A)
It can be explained by using the below Venn diagram. Since B has occurred, the sample
space is reduced to set B, and we can now calculate event A given event B by dividing
the probability P(A⋀B) by P(B).
Example:
In a class, 70% of the students like English and 40% of the students like both English
and mathematics. What percentage of the students who like English also like
mathematics?
Solution:
P(Mathematics | English) = P(English ⋀ Mathematics) / P(English) = 0.40 / 0.70 ≈ 57%
Hence, about 57% of the students who like English also like mathematics.
Bayes' theorem in Artificial intelligence
Bayes' theorem:
Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning, which
determines the probability of an event with uncertain knowledge.
In probability theory, it relates the conditional probability and marginal probabilities of two
random events.
Bayes' theorem was named after the British mathematician Thomas Bayes. The Bayesian
inference is an application of Bayes' theorem, which is fundamental to Bayesian statistics.
Bayes' theorem allows updating the probability prediction of an event by observing new
information of the real world.
Example: If the risk of cancer depends on a person's age, then by using Bayes' theorem
we can determine the probability of cancer more accurately with the help of age.
Bayes' theorem can be derived using the product rule and the conditional probability of
event A with known event B:
P(A|B) = P(B|A) · P(A) / P(B)        ...(a)
The above equation (a) is called Bayes' rule or Bayes' theorem. This equation is the basis
of most modern AI systems for probabilistic inference.
It shows the simple relationship between joint and conditional probabilities. Here,
P(A|B) is known as the posterior, which we need to calculate; it is read as the probability
of hypothesis A given that evidence B has occurred.
P(B|A) is called the likelihood: assuming the hypothesis is true, we calculate the
probability of the evidence.
P(A) is called the prior probability, probability of hypothesis before considering the
evidence
In equation (a), in general, we can write P(B) = Σi P(Ai) · P(B|Ai); hence Bayes' rule
can be written as:
P(Ai|B) = P(Ai) · P(B|Ai) / Σk P(Ak) · P(B|Ak)
Where A1, A2, A3, ..., An are a set of mutually exclusive and exhaustive events.
Example-1:
Question: What is the probability that a patient has the disease meningitis, given a stiff
neck?
Given Data:
A doctor is aware that disease meningitis causes a patient to have a stiff neck, and it occurs
80% of the time. He is also aware of some more facts, which are given as follows:
Let a be the proposition that the patient has a stiff neck and b be the proposition that
the patient has meningitis. Then we can calculate the following:
P(a|b) = 0.8
P(b) = 1/30000
P(a)= .02
P(b|a) = P(a|b) · P(b) / P(a) = (0.8 × 1/30000) / 0.02 = 1/750 ≈ 0.00133
Hence, we can assume that 1 patient out of 750 patients with a stiff neck has meningitis.
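The arithmetic of Example-1 can be checked directly with a few lines of Python (variable names are our own):

```python
# Bayes' rule applied to the given numbers
p_a_given_b = 0.8       # P(stiff neck | meningitis)
p_b = 1 / 30000         # P(meningitis)
p_a = 0.02              # P(stiff neck)

p_b_given_a = p_a_given_b * p_b / p_a   # P(meningitis | stiff neck)
print(p_b_given_a)  # 0.00133..., i.e. 1 in 750
```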
Example-2:
Question: From a standard deck of playing cards, a single card is drawn. The
probability that the card is king is 4/52, then calculate posterior probability
P(King|Face), which means the drawn face card is a king card.
Solution:
There are 12 face cards in a standard deck, and every king is a face card, so
P(Face|King) = 1 and P(Face) = 12/52. By Bayes' rule:
P(King|Face) = P(Face|King) × P(King) / P(Face) = 1 × (4/52) / (12/52) = 1/3
Following are some applications of Bayes' theorem in AI:
o It is used to calculate the next step of the robot when the already executed step is
given.
o Bayes' theorem is helpful in weather forecasting.
o It can solve the Monty Hall problem.
"A Bayesian network is a probabilistic graphical model which represents a set of variables
and their conditional dependencies using a directed acyclic graph."
Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.
Real world applications are probabilistic in nature, and to represent the relationship between
multiple events, we need a Bayesian network. It can also be used in various tasks
including prediction, anomaly detection, diagnostics, automated insight, reasoning,
time series prediction, and decision making under uncertainty.
Bayesian Network can be used for building models from data and experts opinions, and it
consists of two parts:
The generalized form of Bayesian network that represents and solve decision problems
under uncertain knowledge is known as an Influence diagram.
A Bayesian network graph is made up of nodes and Arcs (directed links), where:
o Each node corresponds to the random variables, and a variable can
be continuous or discrete.
o Arc or directed arrows represent the causal relationship or conditional probabilities
between random variables. These directed links or arrows connect the pair of nodes
in the graph.
These links represent that one node directly influences the other node; if there is no
directed link, the nodes are independent of each other.
o In the above diagram, A, B, C, and D are random variables
represented by the nodes of the network graph.
o If we are considering node B, which is connected with node A by a
directed arrow, then node A is called the parent of Node B.
o Node C is independent of node A.
The Bayesian network has mainly two components:
o Causal Component
o Actual numbers
Each node in the Bayesian network has a conditional probability distribution
P(Xi | Parent(Xi)), which determines the effect of the parent on that node.
A Bayesian network is based on the joint probability distribution and conditional
probability, so let's first understand the joint probability distribution:
If we have variables x1, x2, ..., xn, then their joint probability distribution
P[x1, x2, x3, ..., xn] can be written as the following product:
P[x1, x2, ..., xn] = P[x1 | x2, ..., xn] · P[x2 | x3, ..., xn] · ... · P[xn-1 | xn] · P[xn]
In general, for each variable Xi, we can write the equation as:
P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm
reliably responds at detecting a burglary but also responds for minor earthquakes. Harry
has two neighbors David and Sophia, who have taken a responsibility to inform Harry at
work when they hear the alarm. David always calls Harry when he hears the alarm, but
sometimes he got confused with the phone ringing and calls at that time too. On the other
hand, Sophia likes to listen to high music, so sometimes she misses to hear the alarm. Here
we would like to compute the probability of Burglary Alarm.
Problem:
Calculate the probability that alarm has sounded, but there is neither a burglary,
nor an earthquake occurred, and David and Sophia both called the Harry.
Solution:
o The Bayesian network for the above problem is given below. The network structure
shows that burglary and earthquake are the parent nodes of the alarm and directly
affect the probability of the alarm going off, whereas David's and Sophia's calls depend
on the alarm probability.
o The network represents our assumptions: the neighbors do not directly perceive the
burglary, do not notice the minor earthquake, and do not confer with each other before
calling.
o The conditional distributions for each node are given as conditional probabilities table
or CPT.
o Each row in the CPT must sum to 1 because all the entries in the table represent
an exhaustive set of cases for the variable.
o In a CPT, a boolean variable with k boolean parents contains 2^k probabilities. Hence,
if there are two parents, the CPT will contain 4 probability values.
List of all events occurring in this network:
o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)
We can write the events of the problem statement in the form of probability:
P[D, S, A, B, E]. The above probability statement can be rewritten using the joint
probability distribution:
P[D, S, A, B, E] = P[D | A] · P[S | A] · P[A | B, E] · P[B] · P[E]
By conditional probability, for example, P(B|E) = P(E|B) · P(B) / P(E).
Let's take the observed probabilities for the Burglary and Earthquake components:
P(B = True) = 0.002, which is the probability of a burglary.
P(B = False) = 0.998, which is the probability of no burglary.
P(E = True) = 0.001, which is the probability of a minor earthquake.
P(E = False) = 0.999, which is the probability that an earthquake has not occurred.
The Conditional probability of David that he will call depends on the probability of Alarm.
The Conditional probability of Sophia that she calls is depending on its Parent Node "Alarm."
From the formula of the joint distribution, we can write the problem statement in the
form of a probability distribution:
P(S, D, A, ¬B, ¬E) = P(S|A) · P(D|A) · P(A|¬B ⋀ ¬E) · P(¬B) · P(¬E)
= 0.75 × 0.91 × 0.001 × 0.998 × 0.999
= 0.00068045.
Hence, a Bayesian network can answer any query about the domain by using Joint
distribution.
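The joint-probability computation can be checked numerically. The CPT values below are the ones commonly used with this example and are assumed here, since the notes' probability tables did not survive extraction; P(E = False) = 0.999 matches the text above.

```python
# Assumed CPT values for the burglar-alarm network:
p_b = 0.002        # P(Burglary)
p_e = 0.001        # P(Earthquake)
p_a_nb_ne = 0.001  # P(Alarm | no Burglary, no Earthquake)
p_d_a = 0.91       # P(DavidCalls | Alarm)
p_s_a = 0.75       # P(SophiaCalls | Alarm)

# P(S, D, A, not B, not E) = P(S|A) * P(D|A) * P(A|not B, not E) * P(not B) * P(not E)
p = p_s_a * p_d_a * p_a_nb_ne * (1 - p_b) * (1 - p_e)
print(round(p, 8))  # 0.00068045
```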
There are two ways to understand the semantics of the Bayesian network, which is given
below:
1. To understand the network as a representation of the joint probability distribution;
this is helpful in understanding how to construct the network.
2. To understand the network as an encoding of a collection of conditional
independence statements; this is helpful in designing inference procedures.
The 5-level fuzzifier uses linguistic variables such as:
LP   x is Large Positive
MP   x is Medium Positive
S    x is Small
MN   x is Medium Negative
LN   x is Large Negative
The triangular membership function shapes are most common among various other
membership function shapes such as trapezoidal, singleton, and Gaussian.
Here, the input to 5-level fuzzifier varies from -10 volts to +10 volts. Hence the
corresponding output also changes.
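A triangular membership function is easy to state in code. The breakpoints below for the 5-level fuzzifier over -10 V to +10 V are hypothetical, since the notes do not give exact values.

```python
def triangular(x, a, b, c):
    """Membership grade for a triangular fuzzy set: rises from 0 at a to 1
    at the peak b, then falls back to 0 at c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Hypothetical breakpoints for the 5-level fuzzifier over -10 V .. +10 V:
def fuzzify(x):
    return {
        "LN": triangular(x, -15, -10, -5),
        "MN": triangular(x, -10, -5, 0),
        "S":  triangular(x, -5, 0, 5),
        "MP": triangular(x, 0, 5, 10),
        "LP": triangular(x, 5, 10, 15),
    }

print(fuzzify(2.5))  # S and MP are both partially active
```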
Algorithm
Example rule matrix (room temperature vs. target temperature → action):

RoomTemp./Target   Very_Cold   Cold   Warm   Hot    Very_Hot
Very_Hot           Cool        Cool   Cool   Cool   No_Change
Build a set of rules into the knowledge base in the form of IF-THEN-ELSE structures.
Hi-Fi Systems
Photocopiers
Still and Video Cameras
Television
Domestic Goods
Microwave Ovens
Refrigerators
Toasters
Vacuum Cleaners
Washing Machines
Environment Control
Air Conditioners/Dryers/Heaters
Humidifiers
Advantages of FLSs
Mathematical concepts within fuzzy reasoning are very simple.
You can modify a FLS by just adding or deleting rules due to flexibility of fuzzy
logic.
Fuzzy logic Systems can take imprecise, distorted, noisy input information.
FLSs are easy to construct and understand.
Fuzzy logic is a solution to complex problems in all fields of life, including
medicine, as it resembles human reasoning and decision making.
Disadvantages of FLSs
There is no systematic approach to fuzzy system designing.
They are understandable only when simple.
They are suitable for the problems which do not need high accuracy.
Preference Structure: Representation theorems are used to show that an agent with a
preference structure has a utility function as:
U(x1, . . . , xn) = F[f1(x1), . . . , fn(xn)],
where F indicates any arithmetic function such as an addition function.
Therefore, preference can be handled in two ways:
Preference without uncertainty: The preference where two attributes are preferentially
independent of a third attribute, because the preference between the outcomes of the
first two attributes does not depend on the third one.
Preference with uncertainty: This refers to the concept of a preference structure with
uncertainty. Here, utility independence extends preference independence: a set of
attributes X is utility independent of another set of attributes Y only if the value of the
attributes in X is independent of the values of the attributes in Y. A set is said to be
mutually utility independent (MUI) if each of its subsets is utility-independent of the
remaining attributes.
UNIT 6
Inductive Learning Algorithm
Inductive Learning Algorithm (ILA) is an iterative and inductive machine learning
algorithm used for generating a set of classification rules of the form "IF-THEN",
producing rules at each iteration and appending them to the rule set.
Basic Idea:
There are basically two methods for knowledge extraction: from domain experts, and
with machine learning.
For a very large amount of data, domain experts are not very useful or reliable, so we
move towards the machine learning approach for this work.
One way to use machine learning is to replicate the expert's logic in the form of
algorithms, but this work is very tedious, time-consuming, and expensive.
So we move towards inductive algorithms, which themselves generate the strategy for
performing a task and need not be instructed separately at each step.
Need of ILA in presence of other machine learning algorithms:
The ILA is a new algorithm which was needed even when other inductive learning
algorithms like ID3 and AQ were available.
The need was due to the pitfalls present in the previous algorithms; one of the major
pitfalls was the lack of generalisation of rules.
ID3 and AQ used the decision tree production method, which was too specific, difficult
to analyse, and very slow to perform for basic short classification problems.
The decision tree-based algorithms were unable to work for a new problem if some
attributes were missing.
The ILA uses the method of producing a general set of rules instead of decision trees,
which overcomes the above problems.
THE ILA ALGORITHM:
General requirements at start of the algorithm:-
1. list the examples in the form of a table ‘T’ where each row corresponds to an
example and each column contains an attribute value.
2. create a set of m training examples, each example composed of k attributes and a
class attribute with n possible decisions.
3. create a rule set, R, having the initial value false.
4. initially all rows in the table are unmarked.
Steps in the algorithm:-
Step 1:
divide the table ‘T’ containing m examples into n sub-tables (t1, t2,…..tn). One table for
each possible value of the class attribute. (repeat steps 2-8 for each sub-table)
Step 2:
Initialize the attribute combination count ‘ j ‘ = 1.
Step 3:
For the sub-table on which work is going on, divide the attribute list into distinct
combinations, each combination with ‘j ‘ distinct attributes.
Step 4:
For each combination of attributes, count the number of occurrences of attribute values
that appear under the same combination of attributes in unmarked rows of the sub-table
under consideration and, at the same time, do not appear under the same combination
of attributes of the other sub-tables. Call the first combination with the maximum
number of occurrences the max-combination 'MAX'.
Step 5:
If ‘MAX’ = = null , increase ‘ j ‘ by 1 and go to Step 3.
Step 6:
Mark all rows of the sub-table being worked on in which the values of 'MAX' appear as
classified.
Step 7:
Add a rule (IF attribute = “XYZ” –> THEN decision is YES/ NO) to R whose left-hand side
will have attribute names of the ‘MAX’ with their values separated by AND, and its right-
hand side contains the decision attribute value associated with the sub-table.
Step 8:
If all rows are marked as classified, then move on to process another sub-table and go
to Step 2; else, go to Step 4. If no sub-tables are available, exit with the set of rules
obtained till then.
An example showing the use of ILA
suppose an example set having attributes place type, weather, location, and decision,
with seven examples. Our task is to generate a set of rules that tell us, under what
conditions, what the decision is.
EXAMPLE NO. PLACE TYPE WEATHER LOCATION DECISION
subset 2
S.NO PLACE TYPE WEATHER LOCATION DECISION
step (2-8)
at iteration 1
rows 3 & 4, column weather is selected and rows 3 & 4 are marked.
the rule is added to R: IF weather is warm THEN the decision is yes.
at iteration 2
row 1, column place type is selected and row 1 is marked.
the rule is added to R: IF place type is hilly THEN the decision is yes.
at iteration 3
row 2, column location is selected and row 2 is marked.
the rule is added to R: IF location is Shimla THEN the decision is yes.
at iteration 4
rows 5 & 6, column location is selected and rows 5 & 6 are marked.
the rule is added to R: IF location is Mumbai THEN the decision is no.
at iteration 5
row 7, columns place type & weather are selected and row 7 is marked.
the rule is added to R: IF place type is beach AND weather is windy THEN the decision
is no.
finally we get the rule set :-
Rule Set
Rule 1: IF the weather is warm THEN the decision is yes.
Rule 2: IF place type is hilly THEN the decision is yes.
Rule 3: IF location is Shimla THEN the decision is yes.
Rule 4: IF location is Mumbai THEN the decision is no.
Rule 5: IF place type is beach AND the weather is windy THEN the decision is no.
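The rule set above can be applied to new examples with a simple first-match interpreter (a sketch; the dictionary encoding of attributes is our own):

```python
def classify(example, rules, default=None):
    """Apply the IF-THEN rules produced by ILA in order; the first rule whose
    conditions all match decides the class."""
    for conditions, decision in rules:
        if all(example.get(attr) == value for attr, value in conditions.items()):
            return decision
    return default

# The rule set derived above:
rules = [
    ({"weather": "warm"}, "yes"),                         # Rule 1
    ({"place type": "hilly"}, "yes"),                     # Rule 2
    ({"location": "Shimla"}, "yes"),                      # Rule 3
    ({"location": "Mumbai"}, "no"),                       # Rule 4
    ({"place type": "beach", "weather": "windy"}, "no"),  # Rule 5
]

print(classify({"place type": "beach", "weather": "windy", "location": "Goa"}, rules))
# -> no (matched by Rule 5)
```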
Decision Tree
o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision
rules and each leaf node represents the outcome.
o In a decision tree, there are two types of nodes: the Decision Node and the Leaf
Node. Decision nodes are used to make any decision and have multiple branches,
whereas leaf nodes are the outputs of those decisions and do not contain any further
branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands
for Classification and Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further
splits the tree into subtrees.
o Below diagram explains the general structure of a decision tree:
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
Why use Decision Trees?
There are various algorithms in Machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine learning
model. Below are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-
like structure.
For the next node, the algorithm again compares the attribute value with the other sub-
nodes and moves further. It continues this process until it reaches a leaf node of the tree.
The complete process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete
dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure
(ASM).
o Step-3: Divide S into subsets that contain the possible values of the best
attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset
created in Step-3. Continue this process until a stage is reached where the nodes
cannot be classified further; call such a final node a leaf node.
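The five steps above can be sketched as a recursive procedure. This is a minimal illustration rather than a full CART implementation: it uses information gain as the attribute selection measure and stops at pure or attribute-exhausted nodes; the toy dataset and attribute names in the usage are assumptions:

```python
from collections import Counter
import math

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Entropy before the split minus the weighted entropy of each subset.
    total = len(labels)
    gain = entropy(labels)
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:                 # Step-5 stop: pure node -> leaf
        return labels[0]
    if not attrs:                             # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))   # Step-2 (ASM)
    tree = {best: {}}
    for value in set(row[best] for row in rows):                  # Step-3: split S
        sub_rows = [r for r in rows if r[best] == value]
        sub_labels = [lab for r, lab in zip(rows, labels) if r[best] == value]
        rest = [a for a in attrs if a != best]
        tree[best][value] = build_tree(sub_rows, sub_labels, rest)  # Step-5 recurse
    return tree
```

For example, `build_tree([{"outlook": "sunny"}, {"outlook": "rain"}, {"outlook": "sunny"}], ["no", "yes", "no"], ["outlook"])` produces a one-level tree splitting on `outlook`.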
Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or Not. So, to solve this problem, the decision tree starts with the
root node (Salary attribute by ASM). The root node splits further into the next decision node
(distance from the office) and one leaf node based on the corresponding labels. The next
decision node further gets split into one decision node (Cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offers and Declined offer).
Consider the below diagram:
Attribute Selection Measures
While implementing a decision tree, the main issue is how to select the best
attribute for the root node and for sub-nodes. To solve such problems there is a
technique called the Attribute Selection Measure (ASM). By this measurement,
we can easily select the best attribute for the nodes of the tree. There are two popular
techniques for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation
of a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision
tree.
o A decision tree algorithm always tries to maximize the value of information gain, and
a node/attribute having the highest information gain is split first. It can be calculated
using the below formula:
Information Gain = Entropy(S) − Σv (|Sv| / |S|) × Entropy(Sv)
Where Sv is the subset of S for which the attribute takes value v, and
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no).
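As a worked sketch of the entropy and information-gain calculation, suppose a parent node has 9 "yes" and 5 "no" samples, and an attribute splits it into a weak branch (6 yes / 2 no) and a strong branch (3 yes / 3 no). These counts are the classic play-tennis illustration, not data from the notes:

```python
import math

def H(p_yes, p_no):
    """Binary entropy; a zero probability contributes zero."""
    def term(p):
        return -p * math.log2(p) if p > 0 else 0.0
    return term(p_yes) + term(p_no)

parent = H(9/14, 5/14)            # entropy before the split (~0.940)
weak, strong = H(6/8, 2/8), H(3/6, 3/6)
gain = parent - (8/14) * weak - (6/14) * strong
```

The resulting gain is small (about 0.048), which is why this attribute would lose to a more informative one when choosing where to split.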
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in
the CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini
index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
o Gini index can be calculated using the below formula:
Gini Index = 1 − Σj Pj²
Where Pj is the proportion of samples belonging to class j.
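The Gini impurity of a node can be computed directly from class counts, as a minimal sketch:

```python
def gini(labels):
    """Gini impurity: 1 - sum over classes of (class proportion)^2."""
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))
```

A pure node has impurity 0.0; a 50/50 binary split has the maximum binary impurity, 0.5.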
A too-large tree increases the risk of overfitting, and a small tree may not capture all the
important features of the dataset. Therefore, a technique that decreases the size of the
learning tree without reducing accuracy is known as Pruning. There are mainly two
types of tree pruning techniques used: Cost Complexity Pruning and Reduced Error Pruning.
Disadvantages of the Decision Tree
o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
o For more class labels, the computational complexity of the decision tree may
increase.
Supervised Learning
Supervised learning is commonly used in real world applications, such as face and speech recognition,
products or movie recommendations, and sales forecasting. Supervised learning can be further classified
into two types - Regression and Classification.
Regression trains on and predicts a continuous-valued response, for example predicting real estate
prices.
Classification attempts to find the appropriate class label, such as analyzing positive/negative sentiment,
male and female persons, benign and malignant tumors, secure and unsecure loans, etc.
• In supervised learning, learning data comes with description, labels, targets or desired outputs
and the objective is to find a general rule that maps inputs to outputs. This kind of learning data is
called labeled data. The learned rule is then used to label new data with unknown outputs.
• Supervised learning involves building a machine learning model that is based on labeled
samples. For example, if we build a system to estimate the price of a plot of land or a house
based on various features, such as size, location, and so on, we first need to create a database
and label it. We need to teach the algorithm what features correspond to what prices. Based on
this data, the algorithm will learn how to calculate the price of real estate using the values of the
input features
Unsupervised Learning
• Unsupervised learning is used to detect anomalies, outliers, such as fraud or defective
equipment, or to group customers with similar behaviors for a sales campaign. It is the opposite
of supervised learning. There is no labeled data here.
When learning data contains only some indications without any description or labels, it is up to the coder
or to the algorithm to find the structure of the underlying data, to discover hidden patterns, or to determine
how to describe the data. This kind of learning data is called unlabeled data.
Current-Best-Hypothesis Search
First described by John Stuart Mill in 1843. The algorithm is extremely simple; if a
new example is encountered that our hypothesis misclassifies, then change the
hypothesis as follows.
If it is a false positive, specialize the hypothesis not to cover it. This can be done by
dropping disjuncts or adding new terms.
If it is a false negative, generalize the hypothesis by adding disjuncts or dropping
terms.
(Think of the Russell and Norvig example of two sets, the positive instances inside the
negative ones, and extending or shrinking the positive set as more data comes in.)
The earliest machine learning system to use this approach was the arch-learning
program of [Winston, 1970]. However, it naturally suffers from inefficiency in large
search spaces; after every modification the past instances must be checked to ensure
they are still classified correctly, and it is difficult to find good search heuristics for
generalizing or specializing the definition.
Least-Commitment Search
Rather than backtracking, this approach keeps all hypotheses that are consistent with
the data seen so far. As more data becomes available, this version space shrinks. The
algorithm for doing this is called the candidate elimination learning algorithm or
the version space learning algorithm [Mitchell, 1977], and consists simply of
constraining the version space to be consistent with all data seen so far, updating it as
each new instance is seen.
The obvious problem is that this method potentially requires an enormous number of
hypotheses to record. The solution is to use boundary sets to circumscribe the space of
possible hypotheses. The two boundary sets are the G-set (the most general boundary)
and the S-set (the most specific boundary). Every member of the G-set is consistent,
and there are no more general consistent hypotheses; every member of the S-set is
consistent and there are no more specific hypotheses.
The algorithm works as follows. Initially, the G-set is simply True, and the S-
set False. For every new instance, there are four possible cases:
False positive for a hypothesis in S: the hypothesis is too general, and since there are
no consistent specializations of it, remove it from S.
False negative for a hypothesis in S: the hypothesis is too specific, so replace it with
its minimal generalizations.
False positive for a hypothesis in G: the hypothesis is too general, so replace it with
its minimal specializations.
False negative for a hypothesis in G: the hypothesis is too specific, and since there are
no consistent generalizations of it, remove it from G.
This process is repeated until one of three things happens. Eventually, either there is
only one hypothesis left in the version space (in which case we return it), the version
space collapses (either S or G becomes empty), meaning there is no consistent
hypothesis, or we run out of examples and our version space still has several
hypotheses, so we can use their collective evaluation (breaking disagreements with
majority vote).
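A minimal sketch of maintaining the S-boundary for conjunctive attribute-value hypotheses (a Find-S-style generalization; maintaining the G-set is analogous but omitted here, and the attribute tuples in the usage are illustrative assumptions):

```python
def generalize(h, example):
    """Minimally generalize hypothesis h so that it covers a positive example:
    any attribute where they disagree becomes a don't-care '?'."""
    return tuple(hv if hv == ev else "?" for hv, ev in zip(h, example))

def s_boundary(positives):
    """Most specific conjunctive hypothesis consistent with all positive
    examples -- the (singleton) S-set of the version space."""
    h = positives[0]
    for ex in positives[1:]:
        h = generalize(h, ex)
    return h
```

For instance, two positives that agree on the first two attributes but differ on the third yield a hypothesis with `"?"` in the third slot.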
This approach has two main problems. If there is any noise, or
insufficient attributes for exact classification, the version space will collapse. Also, if
unlimited disjunction is allowed in the hypothesis space, S will contain only the most-
specific hypothesis (the conjunction of the positive examples), and G will contain
only the most-general hypothesis (the negation of the disjunctions of the negative
examples). The latter problem can partially be solved by using a generalization
hierarchy.
Such learning systems cannot handle noisy data. One solution is to maintain
several S and G sets, consistent with decreasing numbers of training instances.
Computational Complexity
Strategy                          Time Complexity              Space Complexity
DFS (current-best hypothesis)     O(pn)                        O(p + n)
Version space                     O(sg(p + n) + s²p + g²n)     O(s + g)
In this table, p and n are the number of positive and negative training instances,
respectively, and s and g are the largest sizes of sets S and G.
The complexity of the depth-first approach stems from the need to reexamine old
instances after each revision to the hypothesis; in particular, each time a positive
instance forces the hypothesis to be changed, all past negative instances must be
examined. Similarly, revising the hypothesis in response to negative instances requires
reexamining all positive instances.
In the version space strategy, no training instances need be saved, so the space
complexity is just the largest sizes of S and G. Notice that for this strategy, processing
time grows linearly with the number of training instances. However, it grows as the
square of the sizes of the boundary sets.
PAC Learning
The argument is, roughly, that incorrect hypotheses will be discovered relatively
quickly because they will incorrectly identify instances. Any hypothesis that is
consistent with a large enough set of training examples is unlikely to be seriously
wrong. That is, it is probably approximately correct. PAC learning is the part of
computational learning theory that studies this idea.
The key assumption underlying this argument is that the training set and the test
set are randomly drawn from the same population of examples, and that they are
drawn using the same probability distribution. This idea (due to Valiant) is
the stationarity assumption. It is required to associate any future instances with the
ones seen so far.
In particular, we would like to bound the likelihood that our hypothesis is not within a
certain range of the correct hypothesis. Specifically, we define the error of a
hypothesis h as the probability that h misclassifies an instance drawn from the same
distribution as the training data: error(h) = P(h(x) ≠ f(x)). Call a hypothesis "bad" if
error(h) > e. The probability that a bad hypothesis agrees with any one training example
is at most 1 − e, so the probability that it is consistent with m independent examples is
at most (1 − e)^m. Letting Hbad denote the set of bad hypotheses,
P(Hbad contains a consistent hypothesis) <= |Hbad|(1 − e)^m <= |H|(1 − e)^m.
To bound this probability by a small d, it suffices to show
m >= (1/e)(ln(1/d) + ln |H|)
training examples to the system. Note that this approach assumes that the true
function f is somewhere in the hypothesis space. Roughly speaking, by theorem of
Blumer, the number of examples required for learning is proportional to the log of the
size of the hypothesis space.
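The sample-complexity bound m ≥ (1/e)(ln(1/d) + ln |H|), which is the standard PAC bound consistent with the discussion above, can be computed directly. The function name is an illustrative assumption:

```python
import math

def pac_sample_bound(epsilon, delta, hypothesis_space_size):
    """Number of examples m sufficient for any consistent hypothesis to have
    error at most epsilon with probability at least 1 - delta."""
    return math.ceil((1.0 / epsilon)
                     * (math.log(1.0 / delta) + math.log(hypothesis_space_size)))
```

Note how the bound grows only logarithmically in |H|, matching the remark about Blumer's theorem: even |H| = 2^16 hypotheses need only on the order of a hundred examples at e = 0.1, d = 0.05.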
Connectionist Learning
Neural Networks
More specifically, a neural network consists of a set of nodes (or units), links that
connect one node to another, and weights associated with each link. Some nodes
receive inputs via links; others directly from the environment, and some nodes send
outputs out of the network. Learning usually occurs by adjusting the weights on the
links.
Each unit has a set of weighted inputs, an activation level, and a way to compute its
activation level at the next time step. This is done by applying an activation
function to the weighted sum of the node's inputs. Generally, the weighted sum (also
called the input function) is a strictly linear sum, while the activation function may be
nonlinear. If the value of the activation function is above a threshold, the node "fires."
Generally, all nodes share the same activation function and threshold value, and only
the topology and weights change.
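The unit computation just described (a linear weighted sum passed through a threshold activation) can be sketched as follows; the function name and the 0/1 output convention are assumptions for illustration:

```python
def unit_output(inputs, weights, threshold=0.0):
    """A single unit: the input function is the weighted sum of the inputs;
    the unit 'fires' (outputs 1) if that sum exceeds the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0
```

With weights (0.6, 0.6) and threshold 1.0, this single unit computes logical AND of two binary inputs.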
Network Structures
The two fundamental types of network structure are feed-forward and recurrent. A
feed-forward network is a directed acyclic graph; information flows in one direction
only, and there are no cycles. Such networks cannot represent internal state.
Usually, neural networks are also layered, meaning that nodes are organized into
groups of layers, and links only go from nodes to nodes in adjacent layers.
Recurrent networks allow loops, and as a result can represent state, though they are
much more complex to analyze. Hopfield networks and Boltzmann machines are
examples of recurrent networks; Hopfield networks are the best understood. All
connections in Hopfield networks are bidirectional with symmetric weights, all units
have outputs of 1 or -1, and the activation function is the sign function. Also, all nodes
in a Hopfield network are both input and output nodes. Interestingly, it has been
shown that a Hopfield network can reliably recognize 0.138N training examples,
where N is the number of units in the network.
One problem in building neural networks is deciding on the initial topology, e.g., how
many nodes there are and how they are connected. Genetic algorithms have been used
to explore this problem, but it is a large search space and this is a computationally
intensive approach. The optimal brain damage method uses information theory to
determine whether weights can be removed from the network without loss of
performance, and possibly improving it. The alternative of making the network larger
has been tested with the tiling algorithm [Mezard and Nadal, 1989] which takes an
approach similar to induction on decision trees; it expands a unit by adding new ones
to cover instances it misclassified. Cross-validation techniques can be used to
determine when the network size is right.
Perceptrons
Perceptrons are single-layer, feed-forward networks that were first studied in the
1950's. They are only capable of learning linearly separable functions. That is, if we
view F features as defining an F-dimensional space, the network can recognize any
class that involves placing a single hyperplane between the instances of two classes.
So, for example, they can easily represent AND, OR, or NOT, but cannot
represent XOR.
Perceptrons learn by updating the weights on their links in response to the difference
between their output value and the correct output value. The updating rule (due to
Frank Rosenblatt, 1960) is as follows. Define Err as the difference between the correct
output and actual output. Then the learning rule for each weight is
Wj <- Wj + A x Ij x Err, where A is the learning rate and Ij is the jth input.
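The update rule can be sketched in a few lines; below, a perceptron with an extra bias weight (added for illustration) learns the linearly separable OR function. The learning rate and epoch count are illustrative assumptions:

```python
def train_perceptron(samples, alpha=0.1, epochs=25):
    """Rosenblatt update: Wj <- Wj + alpha * Ij * Err, with a bias weight."""
    w = [0.0, 0.0, 0.0]                      # [bias, w1, w2]
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = 1 if w[0] + w[1] * x1 + w[2] * x2 > 0 else 0
            err = target - out               # Err = correct - actual
            w[0] += alpha * err              # bias input is fixed at 1
            w[1] += alpha * err * x1
            w[2] += alpha * err * x2
    return w

def predict(w, x1, x2):
    return 1 if w[0] + w[1] * x1 + w[2] * x2 > 0 else 0
```

Training the same loop on XOR data never converges, illustrating the linear-separability limitation discussed below.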
Of course, this was too good to last, and in Perceptrons [Minsky and Papert, 1969] it
was observed how limited linearly separable functions were. Work on perceptrons
withered, and neural networks didn't come into vogue again until the 1980's, when
multi-layer networks became the focus.
The back-propagation rule is similar to the perceptron learning rule. If Erri is the error
at the output node, then the weight update for the link from unit j to unit i (the output
node) is
Wj,i <- Wj,i + A x aj x Erri x g'(ini)
where g' is the derivative of the activation function, and aj is the activation of
unit j. (Note that this means the activation function must have a derivative, so the
sigmoid function is usually used rather than the step function.) Define Di as Erri x
g'(ini).
This updates the weights leading to the output node. To update the weights on the
interior links, we use the idea that the hidden node j is responsible for part of the error
in each of the nodes to which it connects. Thus the error at the output is divided
according to the strength of the connection between the output node and the hidden
node, and propagated backward to previous layers. Specifically,
Dj = g'(inj) x Sum_i Wj,i x Di
Lastly, the weight updating rule for the weights from the input layer to the hidden
layer is
Wk,j <- Wk,j + A x Ik x Dj
where k is the input node and j the hidden node, and Ik is the input value of k.
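Taken together, the update rules can be sketched as a single training step for a tiny sigmoid network. The network shape (2 inputs, 2 hidden units, 1 output), the initial weights, and the learning rate in the usage below are illustrative assumptions, not values from the notes:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(x, target, w_hidden, w_out, alpha=0.1):
    """One backprop update for a 2-input, 2-hidden-unit, 1-output sigmoid net,
    following the output-layer, hidden-layer, and input-layer rules above."""
    # Forward pass.
    in_h = [sum(w * xi for w, xi in zip(ws, x)) for ws in w_hidden]
    a_h = [sigmoid(v) for v in in_h]
    in_o = sum(w * a for w, a in zip(w_out, a_h))
    a_o = sigmoid(in_o)
    # Backward pass: sigmoid derivative g'(in) = a * (1 - a).
    err = target - a_o
    delta_o = err * a_o * (1 - a_o)                  # Di = Erri * g'(ini)
    delta_h = [a * (1 - a) * w * delta_o             # Dj = g'(inj) * Wj,i * Di
               for a, w in zip(a_h, w_out)]
    # Weight updates (output layer, then input-to-hidden layer).
    w_out = [w + alpha * a * delta_o for w, a in zip(w_out, a_h)]
    w_hidden = [[w + alpha * xi * dh for w, xi in zip(ws, x)]
                for ws, dh in zip(w_hidden, delta_h)]
    return w_hidden, w_out, err
```

Applying the step twice to the same example shows the error shrinking, which is what a small gradient step should do.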
A neural network requires 2^n/n hidden units to represent all Boolean functions
of n inputs. For m training examples and W weights, each epoch in
the learning process takes O(mW) time; but in the worst case, the number
of epochs can be exponential in the number of inputs.
In general, if the number of hidden nodes is too large, the network may learn only the
training examples, while if the number is too small it may never converge on a set of
weights consistent with the training examples.
John Denker remarked that "neural networks are the second best way of doing just
about anything." They provide passable performance on a wide variety of problems
that are difficult to solve well using other methods.
NETtalk [Sejnowski and Rosenberg, 1987] was designed to learn how to pronounce
written text. Input was a seven-character window centered on the target character, and output
was a set of Booleans controlling the form of the sound to be produced. It learned
95% accuracy on its training set, but had only 78% accuracy on the test set. Not
spectacularly good, but important because it impressed many people with the potential
of neural networks.
Other applications include a ZIP code recognition [Le Cun et al., 1989] system that
achieves 99% accuracy on handwritten codes, and driving [Pomerleau, 1993] in the
ALVINN system at CMU. ALVINN controls the NavLab vehicles, and translates
inputs from a video image into steering control directions. ALVINN performs
exceptionally well on the particular road-type it learns, but poorly on other terrain
types. The extended MANIAC system [Jochem et al., 1993] has multiple ALVINN
subnets combined to handle different road types.
Bayesian learning maintains a number of hypotheses about the data, each one
weighted by its posterior probability when a prediction is made. The idea is that, rather
than keeping only one hypothesis, many are entertained, and weighted based on their
likelihoods.
By Bayes' rule, P(Hi|D) = P(D|Hi)P(Hi)/P(D).
Since P(D) is fixed across the hypotheses, we only need to maximize the numerator.
The first term represents the probability that this particular data set would be seen,
given Hi as the model of the world. The second is the prior probability assigned to the
model.
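The weighting just described (posterior proportional to likelihood times prior) can be sketched as a normalization over a finite set of hypotheses; the priors and likelihoods below are illustrative:

```python
def posterior(priors, likelihoods):
    """P(Hi|D) is proportional to P(D|Hi) * P(Hi); since P(D) is the same
    for every hypothesis, we just normalize the numerators."""
    numerators = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(numerators)
    return [n / total for n in numerators]
```

A prediction then averages each hypothesis's prediction, weighted by these posteriors, rather than committing to a single best hypothesis.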
known structure, fully observable -- In this case the only learnable part is the
conditional probability tables. These can be estimated directly using the
statistics of the sample data set.
Both kinds of network are attribute-based representations. Both can handle either
discrete or continuous output. The major difference is that belief networks are
localized representations, while neural networks are distributed. Belief network nodes
represent propositions with clearly defined semantics and relationships to other nodes.
In neural networks, nodes generally don't represent specific propositions, and the
calculations would not treat them in a semantically meaningful way. The effect is that
human beings can neither construct nor understand neural network representations,
whereas both can be done with belief networks.
Belief networks handle two kinds of activation, both in terms of the values a
proposition may take, and the probabilities assigned to each value. Neural
network outputs could be values or probabilities, but they cannot handle both
simultaneously.
Trained feed-forward neural network inference can execute in linear time, whereas in
belief networks inference is NP-hard. However, a neural network may have to be
exponentially larger to represent the same things that a belief network can.
As for learning, belief networks have the advantage of being easier to give prior
knowledge; also, since they represent propositions locally, it may be easier for them to
converge, since they are directly affected only by a small number of other
propositions.
Reinforcement Learning
The reason reinforcement learning is harder than supervised learning is that the agent
is never told what the right action is, only whether it is doing well or poorly, and in
some cases (such as chess) it may only receive feedback after a long string of actions.
There are two basic kinds of information an agent can try to learn.
utility function -- The agent learns the utility of being in various states, and
chooses actions to maximize the expected utility of their outcomes. This
requires that the agent keep a model of the environment.
action-value function -- The agent learns the expected utility of taking a given
action in a given state (a Q function). This requires no model of the environment.
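Choosing actions by expected utility can be sketched as follows. The state names, transition model, and utility values in the usage are illustrative assumptions:

```python
def best_action(actions, transition, utility):
    """Pick the action maximizing the expected utility of its outcome states.
    transition[a] is a list of (probability, state) pairs -- the agent's model."""
    def expected_utility(a):
        return sum(p * utility[s] for p, s in transition[a])
    return max(actions, key=expected_utility)
```

Here a risky action with a high-utility possible outcome can beat a safe one: with utilities A=1, B=0, C=4, going "right" (50/50 between B and C) has expected utility 2 versus 1 for "left".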
Introduction to Optimization
Optimization is the process of making something better. In any process, we have a
set of inputs and a set of outputs as shown in the following figure.
Optimization refers to finding the values of inputs in such a way that we get the “best”
output values. The definition of “best” varies from problem to problem, but in
mathematical terms, it refers to maximizing or minimizing one or more objective
functions, by varying the input parameters.
The set of all possible solutions or values which the inputs can take makes up the
search space. In this search space lies a point, or a set of points, which gives the
optimal solution. The aim of optimization is to find that point or set of points in the
search space.
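As a toy illustration of the idea, for a small discrete search space one can simply evaluate the objective at every point and keep the best one; the objective function below is an illustrative assumption:

```python
def optimize(objective, search_space):
    """Exhaustive search: return the input with the best (here, minimal) output."""
    return min(search_space, key=objective)
```

For example, minimizing (x − 3)² over the integers −10..10 returns 3. Real optimization methods (gradient descent, genetic algorithms, etc.) exist precisely because most search spaces are far too large to enumerate like this.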
Neural Networks
The term "Artificial neural network" refers to a biologically inspired sub-field of artificial
intelligence modeled after the brain. An artificial neural network is a computational
network based on biological neural networks, mirroring the structure of the human
brain. Just as the human brain has neurons interconnected with one another, artificial
neural networks have neurons linked to each other in various layers of the network.
These neurons are known as nodes.
The given figure illustrates the typical diagram of Biological Neural Network.
The typical Artificial Neural Network looks something like the given figure.
Dendrites from the biological neural network represent inputs in artificial neural networks,
the cell nucleus represents nodes, synapses represent weights, and the axon represents
the output:
Biological Neural Network    Artificial Neural Network
Dendrites                    Inputs
Cell nucleus                 Nodes
Synapse                      Weights
Axon                         Output
There are on the order of 100 billion neurons in the human brain, and each neuron has
somewhere between 1,000 and 100,000 connection points. In the human brain, data is
stored in a distributed manner, and we can extract more than one piece of this data
from memory in parallel when necessary. We can say that the human brain is an
incredibly powerful parallel processor.
We can understand the artificial neural network with an example. Consider a digital logic
"OR" gate, which takes two inputs and gives an output. If one or both inputs are "On,"
the output is "On"; if both inputs are "Off," the output is "Off." Here the output depends
directly on the inputs. Our brain does not perform the same task: the output-to-input
relationship keeps changing, because the neurons in our brain are "learning."
The architecture of an artificial neural network:
To understand the architecture of an artificial neural network, we have to understand
what a neural network consists of: a large number of artificial neurons, termed units,
arranged in a sequence of layers. Let us look at the various types of layers available in
an artificial neural network.
Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the
programmer.
Hidden Layer:
The hidden layer sits between the input and output layers. It performs all the
calculations needed to find hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which finally
results in output that is conveyed using this layer.
The artificial neural network takes input and computes the weighted sum of the inputs and
includes a bias. This computation is represented in the form of a transfer function.
The weighted total is then passed as an input to an activation function to produce the
output. Activation functions decide whether a node should fire or not; only the nodes
that fire pass their values on to the output layer. There are distinct activation functions
available, chosen according to the sort of task we are performing.
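The layer computation described above (weighted sum of the inputs plus a bias, passed through an activation function) can be sketched as a minimal example, using the sigmoid as the activation:

```python
import math

def layer(inputs, weights, biases):
    """One ANN layer: for each unit, take the weighted sum of the inputs,
    add the unit's bias, and apply a sigmoid activation."""
    return [1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(ws, inputs)) + b)))
            for ws, b in zip(weights, biases)]
```

With zero weights and zero bias, each unit outputs sigmoid(0) = 0.5, the midpoint of the activation's range.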
Artificial neural networks have parallel processing capability and can perform more than
one task simultaneously.
Unlike traditional programming, data is stored across the whole network rather than in a
database. The disappearance of a couple of pieces of data in one place does not prevent
the network from working.
After training, an ANN may produce output even with incomplete data. The loss of
performance here depends upon the significance of the missing data.
For an ANN to be able to adapt, it is important to determine suitable examples and to
train the network by demonstrating these examples together with the desired output.
The success of the network is directly proportional to the chosen instances; if the
problem cannot be shown to the network in all its aspects, it can produce false output.
Corruption of one or more cells of an ANN does not prevent it from generating output,
and this feature makes the network fault-tolerant.
There is no particular guideline for determining the structure of artificial neural networks.
The appropriate network structure is accomplished through experience, trial, and error.
Unrecognized behavior of the network:
This is the most significant issue with ANNs. When an ANN produces a solution, it does
not provide insight concerning why and how, which decreases trust in the network.
Hardware dependence:
Artificial neural networks need processors with parallel processing power, as per their
structure; their realization is therefore equipment-dependent.
ANNs can work only with numerical data, so problems must be converted into numerical
values before being introduced to the network. The representation mechanism chosen
here directly impacts the performance of the network and relies on the user's abilities.
UNIT 7
Natural Language Processing
Natural Language Processing (NLP) refers to the AI method of communicating with an
intelligent system using a natural language such as English.
Processing of natural language is required when you want an intelligent system like a
robot to perform as per your instructions, when you want to hear a decision from a
dialogue-based clinical expert system, and so on.
The field of NLP involves making computers perform useful tasks with the natural
languages humans use. The input and output of an NLP system can be −
Speech
Written Text
Components of NLP
There are two components of NLP, as given −
Natural Language Understanding (NLU)
Natural Language Generation (NLG)
Understanding involves the following tasks −
Mapping the given input in natural language into useful representations.
Analyzing different aspects of the language.
Difficulties in NLU
NL has an extremely rich form and structure.
It is very ambiguous. There can be different levels of ambiguity −
Lexical ambiguity − It is at a very primitive level, such as the word level.
For example, should the word “board” be treated as a noun or a verb?
Syntax Level ambiguity − A sentence can be parsed in different ways.
For example, “He lifted the beetle with red cap.” − Did he use the cap to lift the
beetle, or did he lift a beetle that had a red cap?
Referential ambiguity − Referring to something using pronouns. For example,
Rima went to Gauri. She said, “I am tired.” − Exactly who is tired?
One input can mean different meanings.
Many inputs can mean the same thing.
NLP Terminology
Phonology − It is the study of organizing sounds systematically.
Morphology − It is the study of the construction of words from primitive meaningful
units.
Morpheme − It is a primitive unit of meaning in a language.
Syntax − It refers to arranging words to make a sentence. It also involves
determining the structural role of words in the sentence and in phrases.
Semantics − It is concerned with the meaning of words and how to combine
words into meaningful phrases and sentences.
Pragmatics − It deals with using and understanding sentences in different
situations and how the interpretation of the sentence is affected.
Discourse − It deals with how the immediately preceding sentence can affect
the interpretation of the next sentence.
World Knowledge − It includes the general knowledge about the world.
Steps in NLP
There are five general steps −
Lexical Analysis − It involves identifying and analyzing the structure of words.
The lexicon of a language means the collection of words and phrases in that language.
Lexical analysis divides the whole chunk of text into paragraphs, sentences,
and words.
Syntactic Analysis (Parsing) − It involves analysis of the words in the sentence for
grammar, and arranging the words in a manner that shows the relationships among
them. A sentence such as “The school goes to boy” is rejected by an English
syntactic analyzer.
Semantic Analysis − It draws the exact meaning, or the dictionary meaning, from
the text. The text is checked for meaningfulness. This is done by mapping syntactic
structures to objects in the task domain. The semantic analyzer disregards
sentences such as “hot ice-cream”.
Discourse Integration − The meaning of any sentence depends upon the
meaning of the sentence just before it. It can also influence the meaning of the
immediately succeeding sentence.
Pragmatic Analysis − During this step, what was said is re-interpreted in terms of what
it actually meant. It involves deriving those aspects of language which require real-world
knowledge.
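The lexical-analysis step described above can be sketched in code; the splitting rules below are deliberately minimal and only illustrative.

```python
import re

# Sketch of the lexical-analysis step: splitting raw text into
# sentences, and each sentence into words.
def lexical_analysis(text):
    # Split into sentences on end punctuation, dropping empty pieces.
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    # Split each sentence into word tokens.
    return [re.findall(r"[A-Za-z']+", s) for s in sentences]

tokens = lexical_analysis("The bird pecks the grains. Birds fly.")
print(tokens)  # [['The', 'bird', 'pecks', 'the', 'grains'], ['Birds', 'fly']]
```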
Now consider the above rewrite rules. Since V can be replaced by both “peck” and
“pecks”, sentences such as “The bird peck the grains” can be wrongly permitted, i.e.,
the subject-verb agreement error is accepted as correct.
Merit − It is the simplest style of grammar, and therefore the most widely used.
Demerits −
They are not highly precise. For example, “The grains peck the bird” is
syntactically correct according to the parser, but even though it makes no sense,
the parser takes it as a correct sentence.
To bring out high precision, multiple sets of grammar need to be prepared. It
may require completely different sets of rules for parsing singular and plural
variations, passive sentences, etc., which can lead to the creation of a huge,
unmanageable set of rules.
Top-Down Parser
Here, the parser starts with the S symbol and attempts to rewrite it into a sequence
of terminal symbols that matches the classes of the words in the input sentence until it
consists entirely of terminal symbols.
These are then checked against the input sentence to see if they match. If not, the
process starts over again with a different set of rules. This is repeated until a
specific rule is found that describes the structure of the sentence.
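A minimal sketch of such a top-down parser, written as a recursive-descent recognizer; the toy grammar (S → NP VP, NP → Det N, VP → V NP | V) and the lexicon are illustrative assumptions, not from the notes.

```python
# Top-down (recursive-descent) recognizer for an assumed toy grammar:
#   S -> NP VP,  NP -> Det N,  VP -> V NP | V
# The lexicon is a made-up example.
LEXICON = {"Det": {"the"}, "N": {"bird", "grains"}, "V": {"pecks"}}

def match(words, i, cat):
    """Consume one word of category `cat`, returning the next index or None."""
    return i + 1 if i < len(words) and words[i] in LEXICON[cat] else None

def parse_np(words, i):               # NP -> Det N
    j = match(words, i, "Det")
    return match(words, j, "N") if j is not None else None

def parse_vp(words, i):               # VP -> V NP | V
    j = match(words, i, "V")
    if j is None:
        return None
    k = parse_np(words, j)
    return k if k is not None else j

def parse_s(words):                   # S -> NP VP, must consume all input
    j = parse_np(words, 0)
    if j is None:
        return False
    return parse_vp(words, j) == len(words)

print(parse_s("the bird pecks the grains".split()))  # True
print(parse_s("the school goes to boy".split()))     # False
```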
Merit − It is simple to implement.
Demerits −
It is inefficient, as the search process has to be repeated if an error occurs.
The rate of working is slow.
Characteristics of Expert Systems
Expert systems are −
High performance
Understandable
Reliable
Highly responsive
Capabilities of Expert Systems
The expert systems are capable of −
Advising
Instructing and assisting humans in decision making
Demonstrating
Deriving a solution
Diagnosing
Explaining
Interpreting input
Predicting results
Justifying the conclusion
Suggesting alternative options to a problem
They are incapable of −
Substituting human decision makers
Possessing human capabilities
Producing accurate output for an inadequate knowledge base
Refining their own knowledge
Components of Expert Systems
The components of an ES include −
Knowledge Base
Inference Engine
User Interface
Let us see them one by one briefly −
Knowledge Base
It contains domain-specific and high-quality knowledge.
Knowledge is required to exhibit intelligence. The success of any ES majorly depends
upon the collection of highly accurate and precise knowledge.
What is Knowledge?
Data is a collection of facts. Information is data organized as facts about the
task domain. Data, information, and past experience combined together are termed
knowledge.
Components of Knowledge Base
The knowledge base of an ES is a store of both, factual and heuristic knowledge.
Factual Knowledge − It is the information widely accepted by the Knowledge
Engineers and scholars in the task domain.
Heuristic Knowledge − It is about practice, accurate judgement, one's ability to
evaluate, and guessing.
Knowledge representation
It is the method used to organize and formalize the knowledge in the knowledge base.
It is in the form of IF-THEN-ELSE rules.
Knowledge Acquisition
The success of any expert system majorly depends on the quality, completeness, and
accuracy of the information stored in the knowledge base.
The knowledge base is formed by readings from various experts, scholars, and
the Knowledge Engineers. The knowledge engineer is a person with the qualities of
empathy, quick learning, and case analyzing skills.
He acquires information from the subject expert by recording, interviewing, and observing
him at work, etc. He then categorizes and organizes the information in a meaningful
way, in the form of IF-THEN-ELSE rules, to be used by the inference engine. The
knowledge engineer also monitors the development of the ES.
Inference Engine
The use of efficient procedures and rules by the Inference Engine is essential in
deducing a correct, flawless solution.
In case of knowledge-based ES, the Inference Engine acquires and manipulates the
knowledge from the knowledge base to arrive at a particular solution.
In case of rule-based ES, it −
Applies rules repeatedly to the facts, which are obtained from earlier rule
applications.
Adds new knowledge into the knowledge base if required.
Resolves rules conflict when multiple rules are applicable to a particular case.
To recommend a solution, the Inference Engine uses the following strategies −
Forward Chaining
Backward Chaining
Forward Chaining
It is a strategy of an expert system to answer the question, “What can happen next?”
Here, the Inference Engine follows the chain of conditions and derivations and finally
deduces the outcome. It considers all the facts and rules, and sorts them before
arriving at a solution.
This strategy is followed for working on conclusion, result, or effect. For example,
prediction of share market status as an effect of changes in interest rates.
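The forward-chaining loop described above can be sketched as follows; the rules and facts about interest rates are illustrative assumptions, not a real market model.

```python
# Sketch of forward chaining: repeatedly apply rules to known facts
# until no new facts emerge. Rules and facts are made-up examples.
rules = [
    ({"interest rates fall"}, "borrowing rises"),
    ({"borrowing rises"}, "company profits rise"),
    ({"company profits rise"}, "share market rises"),
]
facts = {"interest rates fall"}

changed = True
while changed:
    changed = False
    for premises, conclusion in rules:
        # Fire a rule when all its premises are known and it adds a new fact.
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)
            changed = True

print("share market rises" in facts)  # True
```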
Backward Chaining
With this strategy, an expert system finds the answer to the question, “Why did this
happen?”
On the basis of what has already happened, the Inference Engine tries to find out
which conditions could have happened in the past for this result. This strategy is
followed for finding out cause or reason. For example, diagnosis of blood cancer in
humans.
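Backward chaining can be sketched in the same style; the diagnosis rules and facts below are illustrative assumptions, not real medical knowledge.

```python
# Sketch of backward chaining: start from a goal and work back through
# rules to conditions supported by known facts. Rules are made-up examples.
rules = {
    "blood cancer": ["abnormal cell count", "marrow test positive"],
    "abnormal cell count": ["blood test done"],
}
facts = {"blood test done", "marrow test positive"}

def prove(goal):
    """Return True if the goal is a known fact or all its premises can be proved."""
    if goal in facts:
        return True
    premises = rules.get(goal)
    return premises is not None and all(prove(p) for p in premises)

print(prove("blood cancer"))  # True
```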
User Interface
The user interface provides interaction between the user of the ES and the ES itself. It
generally uses Natural Language Processing so that it can be used by a user who is well-
versed in the task domain. The user of the ES need not necessarily be an expert in
Artificial Intelligence.
It explains how the ES has arrived at a particular recommendation. The explanation
may appear in the following forms −
Natural language displayed on the screen
Verbal narrations
Listing of rule numbers displayed on the screen
Application Description
G = <V, N, P, S>
Where:
N describes a finite set of non-terminal symbols.
V describes a finite set of terminal symbols.
P describes a set of production rules.
S is the start symbol.
Example:
Production rules:
S → bR
R → aR
R → aB
B → b
Through these productions we can produce strings such as bab, baab, baaab, etc.
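The derivation above can be sketched in code: expanding R → aR n times before applying R → aB and then B → b yields the strings bab, baab, baaab, and so on.

```python
# Sketch: deriving strings from the productions S -> bR, R -> aR | aB, B -> b.
def derive(n_loops):
    s = "bR"                          # start from S -> bR
    for _ in range(n_loops):
        s = s.replace("R", "aR", 1)   # apply R -> aR n_loops times
    s = s.replace("R", "aB", 1)       # apply R -> aB
    return s.replace("B", "b", 1)     # apply B -> b

print([derive(n) for n in range(3)])  # ['bab', 'baab', 'baaab']
```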
Each production has the form leftside → definition, where leftside ∈ (Vn ∪ Vt)+ and
definition ∈ (Vn ∪ Vt)*. In BNF, the leftside contains exactly one non-terminal.
We can define several productions with the same leftside. The alternatives are
separated by a vertical bar symbol "|".
S → aSa
S → bSb
S → c
These three productions can be combined as:
S → aSa | bSb | c
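As a sketch, the language of this grammar (strings of the form w c reverse(w) over {a, b}) can be recognized recursively, mirroring the three productions.

```python
# Recursive recognizer for the grammar S -> aSa | bSb | c:
# accepted strings are w c reverse(w), where w is over {a, b}.
def accepts(s):
    if s == "c":                      # base case: S -> c
        return True
    if len(s) >= 3 and s[0] == s[-1] and s[0] in "ab":
        return accepts(s[1:-1])       # S -> aSa or S -> bSb
    return False

print(accepts("abcba"))  # True
print(accepts("abcab"))  # False
```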