Parallel Concurrent Processing
Mike Swing
TruTek
mswing@trutek.com
RMOUG 2009
1
Conclusions
• You don’t need RAC to use Parallel Concurrent
Processing (PCP)!
• If you have PCP enabled, secondary nodes
must be defined during the upgrade to R12
• Tuning of TCP, SQLNet and PMON
parameters can minimize PCP failover time.
• Implement Failover Sensitive Workshifts

2
Concurrent Processing Server
Allows scheduling of jobs – batch jobs, or Requests in
Oracle terms.
Processes concurrent programs as a Request.
Requests can be grouped together into Request Sets.
Different types of concurrent managers handle different
types of requests.
A concurrent program can be assigned to a responsibility,
and that responsibility can be assigned to users, allowing
them the permission to run the concurrent program.
Concurrent managers may have limits on the concurrent
programs that can be run, and the times that they can be
started. Requests have a priority, a status, and log and output
files.

3
Definitions
• CP => Concurrent Processing
• DCD => Dead Connection Detection
• ICM => Internal Concurrent Manager
• IM => Internal Monitor
• CRM => Conflict Resolution Manager
• PCP => Parallel Concurrent Processing
• PMON => Process Monitor for ICM

4
Concurrent Request

5
Phase and Status of Concurrent Requests
Phase | Status | Description - Action
Pending | Normal | The request is waiting to be picked up by the next available manager.
Pending | Standby | Waiting for the CRM to resolve a conflict. The CRM could be slow, or an incompatible program is running.
Running | Normal | The request is running normally.
Completed | Normal | The request has finished successfully.
Completed | Error | The request has finished with an error. Check the logs.
Completed | Warning | The request has finished with a warning. Check the logs.
Inactive | No Manager | The request won’t run without a manager. Specialization rules aren’t configured properly.

6
PCP Failover
[Diagram: DB node RH8 runs the database and the database listener. PCP nodes RH7, RH8 and RH9 each connect as SQL*Net clients, configured through sqlnet.ora.]
• TCP_KEEPALIVE takes 240 seconds before issuing DCD

7
Concurrent Managers

8
Concurrent Managers
Manager Type | Service Instance | Program
Internal Concurrent Manager | Internal Manager | FNDLIBR
Conflict Resolution Manager | Conflict Resolution Manager | FNDCRM
Internal Monitor | Internal Monitor:Node | FNDIMON
 | Service Manager: Node | FNDSM
Concurrent Manager | Standard Manager | FNDLIBR
Concurrent Manager | Inventory Manager | INVLIBR
Concurrent Manager | Session History Cleanup | FNDLIBR
Concurrent Manager | PA Streamline Manager | PALIBR
Transaction Manager | CRP Inquiry Manager | CYQLIB
Transaction Manager | FastFormula Transaction Manager | FFTM
Transaction Manager | PO Document Approval Manager | POXCON
Transaction Manager | Transaction Manager | FNDTMTST
 | Scheduler/Prerelease Manager | FNDSVC
 | OAM Generic Collection Service:Node | FNDSVC
9
Concurrent Processing
[Diagram: Web Browser (HTML interface and JInitiator Java interface), Web Server, Forms Server and Reports Server; the Concurrent Processing server with Internal Monitor (FNDIMON), ICM (FNDLIBR), Service Manager (FNDSM), Standard Manager (FNDLIBR), FNDCRM and Report Review Agent (.rdx); Requests, Log and Out files; SQL*Net connections.]
1. The Concurrent Processing server communicates with the database using Oracle SQL*Net.
2. The concurrent program log or output file from a request is passed back as a report to the Report Review Agent.
3. The Report Review Agent passes a file containing the entire report to the forms server.
4. The Forms Services component passes the report back to the user’s browser one
page at a time. Profile options can be used to control the size of the files and pages
passed, to suit report volume and available network capacity.

10
Internal Concurrent Manager
• The Internal Concurrent Manager (ICM) starts, sets the
number of active processes, monitors, and terminates all
other concurrent processes through requests made to
the Service Manager, including restarting any failed
processes.
• The ICM also starts, stops, and restarts the Service
Manager for each node.
• The ICM will perform process migration during an
instance or node failure.
• The ICM will be active on a single node.
• This is also true in a PCP environment, where the ICM
will be active on at least one node at all times.

11
Internal Concurrent Manager
• The ICM really does not have any scheduling
responsibilities. It has NOTHING to do with scheduling
requests, or deciding which manager will run a particular
request. The function of the ICM is to run 'queue control'
requests; requests to startup or shutdown other
managers.
• The ICM is responsible for startup and shutdown of the
whole concurrent processing facility, and it monitors the
other managers periodically, and restarts them if they
should go down. It can also take over the Conflict
Resolution manager's job, and resolve incompatibilities.
• If the ICM itself should go down, requests will continue to
run normally, except for 'queue control' requests. Restart
the ICM with 'startmgr'; no need to kill the other
managers first.
12
Internal Concurrent Manager

13
Service Manager
FNDSM process - Communicates with the Internal Concurrent
Manager, Concurrent Manager, and non-Manager Service
processes.
• The Service Manager (SM) spawns, and terminates manager and
service processes (these could be Forms, or Apache Listeners,
Metrics or Reports Server, and any other process controlled through
Generic Service Management).
• When the ICM terminates, the SM that resides on the same node
as the ICM will also terminate.
• The SM is “chained” to the ICM. The SM will only reinitialize after
termination when there is a function it needs to perform (start, or
stop a process), so there may be periods of time when the SM is not
active, and this would be normal.

14
Service Manager
• All processes initialized by the SM inherit the
same environment as the SM.
• The SM’s environment is set by the APPSORA.env
file and the gsmstart.sh script.
• The apps_<sid> listener must be active on each
CP node to support the SM connection to the
local instance.
• There should be a Service Manager active on
each node where a Concurrent or non-Manager
service process will reside.

15
FNDSM Failure
FNDSM failover as noted in the concurrent manager log:

Could not contact Service Manager FNDSM_RH8_VIS. The TNS
alias could not be located, the listener process on RH8 could not
be contacted, or the listener failed to spawn the Service
Manager process.
Found dead process: spid=(962754), cpid=(2259578), Service
Instance=(1045)
CONC-SM TNS FAIL
Call to PingProcess failed for WFMAILER
CONC-SM TNS FAIL
Call to StopProcess failed for WFMAILER
CONC-SM TNS FAIL
Call to PingProcess failed for FNDCPGSC

16
FNDSM Failover
Found dead process: spid=(716870), cpid=(2259580), Service Instance=(2009)
Found dead process: spid=(1442020), cpid=(2259579), Service Instance=(2010)

Starting WFMGSMD Concurrent Manager : 15-AUG-2008 13:28:56
Starting WFMGSMDB Concurrent Manager : 15-AUG-2008 13:28:56
Starting WFALSNRSVCB Concurrent Manager : 15-AUG-2008 13:28:57
Starting STANDARD Concurrent Manager : 15-AUG-2008 13:30:31
Starting Internal Concurrent Manager Concurrent Manager : 15-AUG-2008 13:30:32

17
Internal Monitor
(FNDIMON process) - Communicates with the Internal Concurrent
Manager.
• This manager/service is used to implement Parallel Concurrent
Processing.
• You do not need to run this manager/service unless you are using
Parallel Concurrent Processing.
• The Internal Monitor (IM) monitors the Internal Concurrent Manager,
and restarts any failed ICM on the local node. It monitors whether
the ICM is still running, and if the ICM crashes, it will restart it on
another node.
• During a node failure in a PCP environment the IM will restart the
ICM on a surviving node (multiple ICM's may be started on multiple
nodes, but only the first ICM started will eventually remain active, all
others will gracefully terminate).
• There should be an Internal Monitor defined on each node where
the ICM may migrate.

18
Standard Manager
• (FNDLIBR process) - Communicates with
the Service Manager and any client
application process.
• The Standard Manager is a worker
process that initiates, and executes client
requests on behalf of Applications batch,
and OLTP clients.

19
Standard Manager

20
Standard Manager - OAM

• The Standard Manager is active on RH9, even though no primary
node is defined.
• Since no secondary node is defined, the Standard Manager
will not fail over.
• “Failover Processes” in the Work Shifts definition
is the number of processes that will run (3)
when the Standard Manager fails over to the
secondary node.

21
Transaction Manager
A Transaction Manager communicates with the Service
Manager, and any user process initiated on behalf of
Forms, or a Standard Manager request.
A Transaction Manager:
• Supports synchronous processing of requests from a
client program.
• Receives a request from a client program to run a server-side
program synchronously.
• Returns a status/results to the client program.
• Is started at runtime, in the number of processes
defined.
• Doesn’t poll the concurrent request table for new requests.
• Is needed only once per database, not once
per instance.
22
Transaction Managers

Some of the Transaction Managers in R12

23
Configuring Transaction Managers
for RAC
• R11i Transaction Managers use DBMS_PIPE
– This does not work across RAC instances
– RAC users must perform additional configuration
• Requires complicated configuration or additional hardware
• R12 Transaction Managers use AQ
– Works across RAC Instances
– Simplifies configuration
– Reduces complexity
– Profile Option can switch between mechanisms
• DBMS_PIPE can be used for non-RAC users if performance
becomes an issue

24
Configuring Transaction Managers
for RAC
• Edit $ORACLE_HOME/dbs/<context_name>_ifile.ora and add
these parameters:
• _lm_global_posts=TRUE
• _immediate_commit_propagation=TRUE

• Change the profile option 'Concurrent: TM Transport Type' to
'QUEUE', and verify that the transaction manager works across
the RAC instances. ATG RUP3 (4334965) or higher provides an
option to use AQs in place of Pipes.
• Pipes are more efficient but require a Transaction Manager to be
running on each DB Instance.
• Navigate to Concurrent > Manager > Define screen, and set up
the primary and secondary node names for transaction managers.

25
Configuring Transaction Managers
for RAC
• Transaction Managers allow a client to make a request for a
program to be run on the server immediately. The client then waits
for the program to complete and can receive program results from
the server. As the client and server are two separate database
sessions, the communication between them has been handled using the
DBMS_PIPE package.
• Unfortunately the DBMS_PIPE package does not extend to
communications between sessions on different RAC instances. On
an Applications instance using RAC, the client and server are very
likely to be on different instances, causing transactions to time out
for long periods or fail completely. The current workaround is to
manually set up Transaction managers to connect to all RAC
instances, which not only takes up additional resources, it may
require additional middle-tier hardware or a complicated
configuration that is difficult to maintain.

26
R12 Transaction Managers
• In R12, the Transaction Managers use the AQ
mechanism, so they work on RAC when
connected to either instance.
• This greatly simplifies the configuration and
reduces the complexity for RAC administrators.
• A Profile Option has been introduced to allow
users to switch between the two transports,
DBMS_PIPE and AQ.

27
Concurrent:PCP Instance Check
• Concurrent processing provides database instance-
sensitive failover capabilities. When an instance is down,
all managers connecting to it switch to a secondary
middle-tier node.
• However, if you prefer to handle instance failover
separately from such middle-tier failover (for example,
using TNS connection-time failover mechanism instead),
use the profile option Concurrent:PCP Instance Check.
• When this profile option is set to OFF, Parallel
Concurrent Processing will not provide database
instance failover support; however, it will continue to
provide middle-tier node failover support when a node
goes down.
28
Conflict Resolution Manager
• Concurrent managers read requests to start concurrent programs.
The Conflict Resolution Manager checks concurrent program
definitions for incompatibility rules.
• If a program is identified as Run Alone, then the Conflict Resolution
Manager prevents the concurrent managers from starting other
programs in the same conflict domain.
• When a program lists other programs as being incompatible with it, the
Conflict Resolution Manager prevents the program from starting until
any incompatible programs in the same domain have completed
running.
• To enable/disable the Conflict Resolution Manager, use the system
profile option 'Concurrent: Use ICM'. Setting this to 'No' (the default) allows
the CRM to be started.
• Setting it to 'Yes' causes the CRM to be shut down, and the Internal
Manager (ICM) will take over the conflict resolution duties.
• If the CRM will not start (it is started automatically by the ICM), check
this profile option.

29
Conflict Resolution Manager

• Use the system profile option 'Concurrent:
Use ICM'. 'No' allows the CRM to be started.
• Setting it to 'Yes' causes the CRM to shut down.
The Internal Manager (ICM) will take over the
conflict resolution duties.
• Using the ICM to resolve conflicts is not
recommended.
• The CRM's sole purpose is to resolve conflicts,
while the ICM has other functions to perform as
well.
• Setting this option to 'YES' is not recommended.

30
Generic Service Management
• An E-Business Suite system depends on a variety of services, such
as Forms Listeners, HTTP Servers, Concurrent Managers, and
Workflow Mailers. These services are composed of one or more
processes. In the past, many of these processes had to be
individually started and monitored by system administrators.
• Management of these processes is complicated, since these
services can be distributed across multiple host machines.
• The introduction of Generic Service Management in Release 11i
helped simplify the management of these processes by providing a
fault tolerant service framework and a central management console
built into Oracle Applications Manager.
• Service Management is an extension of Concurrent Processing, and
provides a framework for managing processes on multiple host
machines. With Service Management, virtually any application tier
service can be integrated into this framework.
• Patch 2221688 introduces GSM.

31
GSM

32
Generic Services

33
GSM and Multiple Nodes
• GSM enables users to manage Applications
services across multiple middle-tier nodes.
• This includes services on Web/Forms nodes that
previously have had no concurrent processing
footprint.
• Users configuring GSM in a multiple-node
system should be sure to have followed the
instructions for Parallel Concurrent Processing.
• This includes setting the environment variable
APPLDCP=ON and assigning a primary node for
all defined managers and services (if not already
defined.)
34
Seeded GSM Services
When configuring GSM the following GSM
Services are seeded automatically:
– Forms Listener
– Metrics Server
– Metrics Client
– Reports Server
– Apache Listener

Linux users should not activate the Reports
Server under GSM.
35
Starting GSM
Apps Listener:
listener.ora
gsmstart.sh
exec FNDSM

36
adcmctl.sh
adcmctl.sh calls:
startmgr.sh
batchmgr.sh
CONCSUB
FNDSVCRG

37
FNDSVCRG – Service Controller Utility
• FNDSVCRG is an executable introduced as a
part of the Seeded GSM Services. It provides
improved coordination between the GSM
monitoring of these services and their
command-line control scripts.
• The $FND_TOP/bin/FNDSVCRG executable is
called from the adcmctl.sh control script before and
after the script starts or stops the service.
FNDSVCRG connects to the database using
JDBC and validates the configuration of the
Seeded GSM Service.
38
Verify GSM
• To verify GSM is working, start the concurrent
managers.
• Once GSM is enabled, the ICM uses Service
Managers to start all concurrent managers and
activated services.
• If the ICM is successfully starting the managers,
then GSM has been configured properly.
• If managers and/or services fail to start, errors
should appear in the ICM log file.

39
Service Manager Log
• Each Service Manager maintains its own
log file named FNDSMxxxx.mgr, located in
the same directory as concurrent manager
log files.
• If you cannot locate the Service Manager
log file, it is likely that the Service
Managers are not starting properly and
there is a configuration issue that needs
troubleshooting.

40
Test – Kill services and see if GSM restarts them
Kill FNDSM
applvis 9007 1 0 11:53 ? 00:00:00 FNDSM
applvis 9159 9155 0 11:55 ? 00:00:00 FNDLIBR
applvis 9161 5683 0 11:55 pts/3 00:00:00 grep FND

[applvis@rh9 scripts]$ kill -9 9007

[applvis@rh9 scripts]$ ps -ef |grep FND
applvis 9159 9155 0 11:55 ? 00:00:00 FNDLIBR
applvis 9169 1 0 11:55 ? 00:00:00 FNDSM
applvis 9249 5683 0 11:57 pts/3 00:00:00 grep FND

Kill FNDCRM
[applvis@rh9 scripts]$ ps -ef |grep FNDCRM
applvis 8886 1 0 11:52 ? 00:00:00 FNDCRM
APPS/ZGA13053E1E1B7BA773417089054DA88F194EAC0D687728CC2551870E6B78C4B439
EADB287342795115A88DBC85788CCB4 FND FNDCRM N 10 c LOCK Y RH9 1302318
[applvis@rh9 scripts]$ kill -9 8886

[applvis@rh9 scripts]$ ps -ef |grep FNDCRM
applvis 9457 9392 0 12:09 ? 00:00:00 FNDCRM
APPS/ZG26430816FA3570354BC57DE47FF105D145F8DE226EFE58CE04B416633DCB90126
7BFECFA7585114F7090060EFE1147BE FND FNDCRM N 10 c LOCK Y RH9 1302343

Both of these services were started before I could enter the grep command to find the corresponding
process.
41
11i - Defining PCP Details

In Release 11i, the Secondary Node doesn’t need to be filled in for failover to occur.

42
R12 PCP Details

In Release 12, failover won’t occur if there is no Secondary Node defined.

43
R12 PCP Setup

The only Standard Manager set up to fail over is the “Standard Manager”.

44
R12 Manager Failover

45
Parallel Concurrent Processing
• Parallel concurrent processing allows distribution of
concurrent managers across multiple nodes.
• Benefits are better performance, availability, and
scalability (load balancing).
• Parallel Concurrent Processing (PCP) is activated along
with Generic Service Management (GSM); it cannot be
activated independently of GSM.
• With parallel concurrent processing implemented with
GSM, the Internal Concurrent Manager (ICM) tries to
assign valid nodes for concurrent managers and other
service instances.

47
Parallel Concurrent Processing
• There should be only one ICM and CRM,
at any given time, although the ICM and
CRM could be configured to run on
several of the nodes.
• Concurrent Managers migrate to the
surviving node when one of the concurrent
nodes goes down.

48
Parallel Concurrent Processing
[Diagram: Web Browser (HTML interface and JInitiator Java interface), Web Server, Forms Server and Reports Server, plus two concurrent processing nodes. Each node shows an Internal Monitor (FNDIMON), ICM (FNDLIBR), Service Manager (FNDSM), Standard Manager (FNDLIBR), FNDCRM and Report Review Agent (.rdx) with Requests, Logs and Out files, connected through SQL*Net to the database.]

What’s wrong with this picture?

49
APPLDCP Profile Option
Starting with Release 11.5.10, FND.H, the APPLDCP environment
variable is ignored. R12 GSM requires the value of APPLDCP to be
set to “ON”. The value is hard-coded in afpcsq.lpc version 115.35,
thereby ignoring the value of APPLDCP.

As per ATG Development:
As of file "afpcsq.lpc" version 115.35 or higher, APPLDCP is internally
hard-coded to "ON" when the Generic Service Management (GSM) is
enabled--"keeping in mind, use of the GSM is required".
In short, at "afpcsq.lpc" version 115.35 or higher with the GSM enabled,
the setting of the APPLDCP environment variable is ignored--this is the
"default behavior on all R12 releases."
NOTE: As per ARU, "Patch 11i.FND.H" (3262159) and "Oracle
Applications Release 11.5.10" (3140000) contains "afpcsq.lpc" version
115.37.
From Note: 753678.1

50
PCP Failover Mechanisms
• TCP keepalive
• PMON – ICM Process Monitor
• Dead Connection Detection
• Connection Failure Recovery – R12
• 10g Timeout Parameters (untested)
– sqlnet.inbound_connect_timeout (server)
– sqlnet.send_timeout (client and/or server)
– sqlnet.recv_timeout (client and/or server)

51
11i PCP Failure
• TCP Failure
• ICM Lock is released, FNDIMON pings
ICM node, if ping fails, check PMON
• PMON detects a “dead process”, crashed
ICM
• reviver.sh
• DCD

52
R12 PCP Failure
• TCP Failure
• PMON detects a “dead process”
• ICM Shutdown
– Look for error messages ORA-3113, ORA-
3114 or ORA-1041
• reviver.sh
• DCD

53
Reviver
[Flowchart with two lanes, ICM and REVIVER. ICM lane: Start, Receive Shutdown?, Lost DB Connection?, Spawn Reviver, Starts to Shutdown, Exit. REVIVER lane: Attempt to Get DB Connection, Connection? (Sleep and retry on No), Kill Previous DB Session, ICM Started?, Start ICM, Exit.]
From the CM log file:
• The ICM has lost its database connection and is shutting down.
• Spawning reviver process to restart the ICM when the database becomes available again.
• Spawned reviver process 10910.

54
reviver.log
The ICM has lost its database connection
and is shutting down.
Spawning reviver process to restart the ICM
when the database becomes available
again.
Spawned reviver process 10910.

55
TCP
TCP/IP is a connection-oriented protocol; TCP
implements packet timeout and retransmission
in an effort to guarantee the safe and sequenced
order of data packets.
If a timely acknowledgement is not received in
response to the probe packet, the TCP/IP stack
will retransmit the packet some number of times
before timing out.
After TCP/IP gives up, SQL*Net receives
notification that the probe failed.

56
TCP Keepalive
At this time, client-side SQL*Net connections do not enable
keepalive for TCP connections by default.
However, it is possible to enable this by adding the
ENABLE=BROKEN parameter to the SQL*Net connect
string (the DESCRIPTION clause of the TNS alias).
**WARNING** Keepalive intervals are typically set to 2
hours or more (i.e., it can take more than 2 hours to
notice a dead server even if keepalive is enabled). To
make keepalive useful for PCP and TAF, the keepalive
interval needs to be reduced to a smaller value (such as
2 minutes).
If there are a lot of idle connections on your network, then
reducing the keepalive interval can increase network traffic
significantly.
57
ENABLE=BROKEN
Sample TNS alias to enable keepalive (notice the
ENABLE=BROKEN clause)

VIS_BALANCE =
  (DESCRIPTION =
    (ENABLE = BROKEN)
    (ADDRESS_LIST =
      (LOAD_BALANCE = ON)
      (FAILOVER = ON)
      (ADDRESS = (PROTOCOL = TCP)(HOST = rh8)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = rh6)(PORT = 1521))
    )
  )
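
As a rough illustration of what enabling keepalive means at the socket level, the sketch below turns on SO_KEEPALIVE for a plain TCP socket in Python. This is not Oracle Net code; the function name enable_keepalive and the 200/20/2 values are illustrative assumptions, and the per-socket TCP_KEEP* options are applied only where the platform exposes them.

```python
import socket


def enable_keepalive(sock, idle_secs=200, interval_secs=20, probes=2):
    """Turn on TCP keepalive for one socket and, where supported, tune it.

    Hypothetical helper for illustration; the defaults are assumed values.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides of the system-wide keepalive settings are
    # platform-specific, so apply only the options this platform defines.
    for name, value in (("TCP_KEEPIDLE", idle_secs),
                        ("TCP_KEEPINTVL", interval_secs),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    # Report whether keepalive is now enabled on the socket.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print(enable_keepalive(s))
```

Without per-socket overrides, a keepalive-enabled connection falls back to the operating system's keepalive settings, which is why the system-wide interval matters for failover time.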

58
TCP Keepalive
• **WARNING** Keepalive intervals are
typically set to 2 hours or more (ie: it can
take more than 2 hours to notice a dead
server even if keepalive is enabled).
• To make keepalive useful for TAF, the
keepalive interval would need to be
reduced to a smaller value (such as 2
minutes). Note: 249213.1

59
TCP KeepAlive Parameters for Linux
tcp_keepalive_time | the time between the last data packet sent and the first keepalive probe
tcp_keepalive_intvl | the time between keepalive probes
tcp_keepalive_probes | the number of probes to be sent before declaring the connection dead

Default Settings
tcp_keepalive_time = 7200 seconds
tcp_keepalive_intvl = 75 seconds
tcp_keepalive_probes = 9

A total of 7875 seconds (7200 + 9 x 75), or 2 hours 11 minutes and 15 seconds.

60
TCP Keepalive
Initial Settings
– tcp_keepalive_time = 200 secs
– tcp_keepalive_intvl = 20
– tcp_keepalive_probes = 2

• After 200 seconds of idle time, TCP sends
the first of 2 probes, 20 seconds apart.
• TCP notifies SQL*Net of the failure, and
SQL*Net removes the offending connection.
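
The keepalive arithmetic can be checked with a short sketch. It assumes the usual Linux behaviour: the connection is declared dead after tcp_keepalive_time of idle time plus tcp_keepalive_probes unanswered probes spaced tcp_keepalive_intvl apart. The function name is hypothetical.

```python
def keepalive_detection_seconds(keepalive_time, keepalive_intvl, keepalive_probes):
    """Worst-case seconds before TCP keepalive declares an idle connection dead."""
    return keepalive_time + keepalive_intvl * keepalive_probes


# Linux defaults: 7200 + 75 * 9
print(keepalive_detection_seconds(7200, 75, 9))   # 7875
# Initial test settings: 200 + 20 * 2
print(keepalive_detection_seconds(200, 20, 2))    # 240
```

The second figure is the 240 seconds quoted for TCP_KEEPALIVE on the PCP Failover slide.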

61
TCP Retries
• tcp_retries1 (default: 3) The number of times TCP will
attempt to retransmit a packet on an established
connection normally, without the extra effort of getting
the network layers involved.

• tcp_retries2 (default: 15) The maximum number of times
a TCP packet is retransmitted in established state before
giving up.

• tcp_syn_retries (default: 5) The maximum number of
times initial SYNs for an active TCP connection attempt
will be retransmitted. The default value of 5 corresponds
to approximately 180 seconds.

62
TCP Retries
Now let’s consider changing the following
TCP parameters from their default values:
tcp_retries1 = 2
tcp_retries2 = 2
tcp_syn_retries = 2
In this example, the time to initialize the PCP
failover was an average of 8 seconds after
changing these TCP parameters.
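
To see why lowering tcp_retries2 shortens failure detection so sharply, here is a rough estimate of how long TCP keeps retransmitting before it gives up. It is a sketch under stated assumptions, not a measurement: the retransmission timeout is assumed to start at Linux's 200 ms minimum, double on each retry, and cap at 120 seconds. Real connections with a higher round-trip time will take longer, and the function name is hypothetical.

```python
def retransmit_timeout_seconds(retries, rto_min=0.2, rto_max=120.0):
    """Estimate seconds spent retransmitting before TCP gives up.

    Sums the wait after the original transmission and after each of the
    `retries` retransmissions, doubling the timeout each time up to rto_max.
    """
    return sum(min(rto_min * 2 ** attempt, rto_max)
               for attempt in range(retries + 1))


print(round(retransmit_timeout_seconds(15), 1))  # 924.6 (default tcp_retries2)
print(round(retransmit_timeout_seconds(2), 1))   # 1.4   (tcp_retries2 = 2)
```

Under these assumptions the default allows roughly 15 minutes of retransmission, while a value of 2 gives up in under two seconds. The observed failover time also includes the steps that follow detection.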

63
Disconnect TCP Connection from RH9
From the ICM log:

The Internal Concurrent Manager has encountered an error.
Review concurrent manager log file for more detailed information. : 12-JAN-2009 15:22:55 -
Shutting down Internal Concurrent Manager : 12-JAN-2009 15:22:55
12-JAN-2009 15:22:55
The ICM has lost its database connection and is shutting down.
Spawning reviver process to restart the ICM when the database
becomes available again.
Spawned reviver process 1541.
The VIS_0112@VIS internal concurrent manager has terminated with
status 1 - giving up.
Found dead process: spid=(17963), cpid=(1302176), ORA pid=(26),
manager=(0/1)

64
PMON & fnd_concurrent_queues
PMON updates the work_start column in the
fnd_concurrent_queues table every 4 PMON cycles

fdpsrp() (running_processes correction):
ICM cannot obtain exclusive lock on
FND_CONCURRENT_QUEUES
Oracle error code returned: 1
This message is information and does not indicate a
problem with CP functionality.
remote call function (FNDIMON)
15-AUG-2008 10:06:02 - Function to call: PingProcess

65
PMON – ICM Lock – 11i
• If the “ICM lock” is not available, FNDIMON will
now ping the node of the ICM.
• If the ping succeeds, we conclude that the ICM is
fine. What????
• If the ping fails, we further check if it has been over
“quesiz” pmon cycles since the ICM updated the
work_start column of fnd_concurrent_queues.
• If it has been more than four pmon cycles we
conclude that the ICM is dead.

66
PMON “found dead process”

On RH9 the PMON found a dead process. The
PMON takes about 1 second to run, then sleeps for
2 minutes:

Process monitor session started : 18-JAN-2009 21:46:05
Found dead process: spid=(16977), cpid=(1321475), Service Instance=(36543)
Process monitor session ended : 18-JAN-2009 21:46:06
The Internal Concurrent Manager has encountered an error.
Review concurrent manager log file for more detailed
information. : 18-JAN-2009 22:02:01

67
PMON – node RH9 is down
From the ICM log:

Process monitor session started : 12-JAN-2009 15:18:27
Internal Concurrent Manager found node RH9 to
be down. Adding it to the list of unavailable
nodes.
CONC-SM TNS FAIL
Call to PingProcess failed for XDPCTRLS

68
PMON
Process monitor session started : 18-JAN-2009 22:38:57
CONC-SM TNS FAIL
Call to PingProcess failed for OAMGCS
18-JAN-2009 22:38:58 - Node:(RH7), Service
Manager:(FNDSM_RH7_VIS) currently unreachable by TNS
Found dead process: spid=(11234), cpid=(1321563), ORA pid=(167), manager=(0/4)
Process monitor session ended : 18-JAN-2009 22:38:58

69
PMON
Shutting down Internal Concurrent Manager : 18-JAN-2009 22:02:01
18-JAN-2009 22:02:01
The ICM has lost its database connection and is
shutting down.
Spawning reviver process to restart the ICM when
the database becomes available again.
Spawned reviver process 10910.

70
PMON runs every 2 minutes
Process monitor session ended : 18-JAN-2009 21:49:05

Process monitor session started : 18-JAN-2009 21:51:05
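
The two-minute gap can be read straight from the log timestamps. The sketch below does that and compares the result with the product of the ICM sleep and pmon settings. It assumes the pmon argument counts ICM sleep cycles between process monitor sessions and that sleep=30 and pmon=4 were in effect; both are assumptions here, as are the function names. Parsing the month abbreviation relies on an English locale.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%d-%b-%Y %H:%M:%S"


def seconds_between(ended, started):
    """Seconds from one PMON session ending to the next one starting."""
    t_end = datetime.strptime(ended, LOG_TIME_FORMAT)
    t_start = datetime.strptime(started, LOG_TIME_FORMAT)
    return (t_start - t_end).total_seconds()


def pmon_interval_seconds(sleep_seconds, pmon_iterations):
    """Assumed PMON interval: ICM sleep time multiplied by pmon iterations."""
    return sleep_seconds * pmon_iterations


print(seconds_between("18-JAN-2009 21:49:05", "18-JAN-2009 21:51:05"))  # 120.0
print(pmon_interval_seconds(30, 4))                                     # 120
```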

71
Edit ICM Runtime Parameters

72
Edit PMON Parameters

73
Edit PMON Parameters

ICM parameters are read from batchmgr.sh when
adcmctl.sh runs. Changing these parameters here does
not change batchmgr.sh!

74
$FND_TOP/bin/batchmgr.sh
Make sure the PMON changes are made in the $FND_TOP/bin/batchmgr.sh file.

FILENAME
# batchmgr
# DESCRIPTION
# fire up Internal Concurrent Manager process
# USAGE
# batchmgr arg1=val1 arg2=val2 ...
#
# Parameters may be sent via the environment.
#
# ARGUMENTS DEFAULT
# [appmgr|sysmgr]=username/password
# [sleep=sleep_seconds] 15
# [mgrname=manager_name] icm
# [logfile=log_filename] $FND_TOP/$APPLLOG/$mgrname.mgr
# [restart=N|mim minutes between restarts] N
# [mailto="user1 user2..."] current user
# [PRINTER=printer_name]
# [pmon=iterations] 4
# [quesiz=pmon_iterations] 1
# [diag=Y|N] N

75
reviver.log
reviver.sh starting up...
[ Mon Jan 12 20:02:15 MST 2009 ] - Read APPS username/password.
[ Mon Jan 12 20:02:45 MST 2009 ] - Attempting database connection...
[ Mon Jan 12 20:02:45 MST 2009 ] - Successful database connection.
[ Mon Jan 12 20:02:45 MST 2009 ] - Killing previous ICM session...
1 row updated.
Commit complete.
[ Mon Jan 12 20:02:45 MST 2009 ] - Looking for a running ICM process...
[ Mon Jan 12 20:02:45 MST 2009 ] - ICM now running, reviver.sh complete.

77
reviver.sh
reviver.sh – code summary
Sleep 30
Test_connection
Kill_old_icm
– Get session
– Alter system kill session
Check_running_icm
– Fnd_conc.ecm_alive
start_icm
– startmgr.sh
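
The control flow in this summary can be restated as a small Python sketch. This is not the reviver.sh source. The four callables stand in for the script's connection test, session kill, ICM check and ICM start, and every name here is hypothetical. It assumes the script sleeps 30 seconds before each connection attempt, as the reviver.log timestamps suggest.

```python
import time


def reviver(test_connection, kill_old_icm, icm_running, start_icm,
            sleep=time.sleep, wait_seconds=30):
    """Hypothetical restatement of the reviver flow; returns the steps taken."""
    steps = []
    # Wait, then try the database; repeat until a connection succeeds.
    while True:
        sleep(wait_seconds)
        if test_connection():
            break
        steps.append("no connection, retrying")
    # Remove the database session left behind by the failed ICM.
    kill_old_icm()
    steps.append("killed previous ICM session")
    # Start a new ICM only if one is not already running.
    if icm_running():
        steps.append("ICM already running")
    else:
        start_icm()
        steps.append("started ICM")
    return steps


if __name__ == "__main__":
    attempts = iter([False, True])  # first attempt fails, second succeeds
    print(reviver(lambda: next(attempts), lambda: None,
                  lambda: False, lambda: None, sleep=lambda s: None))
```

Passing the steps in as callables keeps the sketch runnable without a database; in the real script these are shell functions.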
78
Dead Connection Detection
• Dead Connection Detection (DCD) is a
feature of SQL*Net 2.1 and later, including
Oracle Net8. DCD detects when a partner
in a SQL*Net V2 client/server or
server/server connection has terminated
unexpectedly, and releases the resources
associated with it.

79
Implement DCD
• Implement by adding SQLNET.EXPIRE_TIME = 1 (minutes)
to the sqlnet.ora file.
• If the connection is idle for the time interval
specified in minutes by the
SQLNET.EXPIRE_TIME parameter, the server-side
process sends a small 10-byte packet to the
client. The packet is sent using TCP/IP.
80
DCD – ICM Lock
• ICM and IM can use the DCD functionality
of the Network (TCP sqlnet).
• ICM is a client process connected to a
DCD enabled DB dedicated server
process.
• ICM holds the named PL/SQL Lock, the
“ICM lock”.
• IM is continuously trying to check whether
it can get the same named PL/SQL Lock.
81
DCD – ICM Lock
• As soon as the “ICM lock” is released by the DB / DCD,
FNDIMON pings the ICM node, and the IM deduces that
the ICM has crashed.
– If the ping succeeds, we conclude that the ICM is fine.
• Obviously, the ICM can be down even if TCP is working, so this is bad
logic.
– If the ping fails, FNDIMON determines if it’s been over four
pmon cycles since the ICM updated the work_start column of
fnd_concurrent_queues.
– If it has been more than four pmon cycles, FNDIMON concludes
the ICM is dead.
• DCD comes into the picture here, after the ICM has crashed
and the DB needs to identify that the ICM is gone.
• The DB needs to clean up the dedicated server process
resource corresponding to the ICM client process.
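
The ping-then-cycles check described above, applied once the ICM lock has been released, can be written as one small decision function. This is a sketch of the logic as described on the slide, not FNDIMON's code. The function and parameter names are hypothetical, and the threshold of four cycles is taken from the text.

```python
def icm_presumed_dead(ping_succeeded, pmon_cycles_since_update, cycle_threshold=4):
    """Return True when the described checks would conclude the ICM is dead."""
    if ping_succeeded:
        # A successful ping of the ICM node is taken to mean the ICM is fine,
        # even though the ICM process itself may be down.
        return False
    # Ping failed: the ICM is considered dead only after more than
    # cycle_threshold pmon cycles without a work_start update.
    return pmon_cycles_since_update > cycle_threshold


print(icm_presumed_dead(True, 10))   # False: node answers, ICM assumed fine
print(icm_presumed_dead(False, 3))   # False: not enough cycles yet
print(icm_presumed_dead(False, 5))   # True
```

The first call shows the weakness noted above: a reachable node masks a crashed ICM no matter how stale work_start is.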
82
FNDIMON has the ICM Lock
Check if the ICM updated the work_start column of fnd_concurrent_queues.

Be aware that if a TCP failure is not detected, failover will not occur.
The following excerpt from a concurrent manager log shows this:
fdpsrp() (running_processes correction):
ICM cannot obtain exclusive lock on FND_CONCURRENT_QUEUES
Oracle error code returned: 1
This message is information and does not indicate a problem with CP
functionality.
remote call function (FNDIMON)
15-AUG-2008 10:06:02 - Function to call: PingProcess

The PingProcess continues until the CP processes resume, or until a TCP
failure is detected and failover begins.

83
11i PCP Failure
• TCP Failure
• ICM Lock is released, FNDIMON pings
ICM node, if ping fails, check PMON
• PMON detects a “dead process”, crashed
ICM
• reviver.sh
• DCD

84
R12 PCP Failure
• TCP Failure
• PMON detects a “dead process”
• ICM Shutdown
– Look for error messages ORA-3113, ORA-3114
or ORA-1041
• reviver.sh
• DCD
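Searching a manager log for the three error codes named above can be scripted. The sketch below is an assumption-laden example: the function name and the sample log text are invented, and the pattern accepts both the short form used on the slide (ORA-3113) and the zero-padded form (ORA-03113).

```python
import re

# Matches ORA-3113, ORA-3114 and ORA-1041, with or without leading zeros.
ORA_PATTERN = re.compile(r"ORA-0*(?:3113|3114|1041)\b")


def find_ora_errors(log_text):
    """Return (line_number, error_code) pairs for the errors of interest."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        for match in ORA_PATTERN.finditer(line):
            hits.append((number, match.group(0)))
    return hits


# Hypothetical log text, for illustration only.
sample_log = "\n".join([
    "Process monitor session started",
    "ORA-03113: end-of-file on communication channel",
    "Process monitor session ended",
    "ORA-3114 seen while reconnecting",
    "ORA-31130 is a different error and is ignored",
])

for line_number, code in find_ora_errors(sample_log):
    print(line_number, code)
```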

85
Test PCP Failover Parameters
• Test to explore effect of DCD, PMON and TCP
failover methods.
• Variables: sqlnet.expire_time, pmon sleep and
number of cycles, and the following TCP
Keepalive parameters:
• tcp_keepalive_time,
• tcp_keepalive_intvl,
• tcp_keepalive_probes
• tcp_retries1 (default: 3, new value 2)
• tcp_retries2 (default: 15, new value 2)
• tcp_syn_retries (default: 5, new value 2)
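The three keepalive parameters combine into a rough upper bound on how long an idle, broken connection can go unnoticed. The sketch below assumes the usual Linux meaning of the settings (idle time before the first probe, interval between probes, number of unanswered probes before the connection is dropped); the function name is made up, and the result is arithmetic, not a measured failover time.

```python
def keepalive_detection_secs(keepalive_time, keepalive_intvl, keepalive_probes):
    """Seconds until an idle dead peer is dropped, under the assumed semantics:
    wait keepalive_time, then send keepalive_probes probes keepalive_intvl apart."""
    return keepalive_time + keepalive_intvl * keepalive_probes


# The two keepalive combinations that appear in the test matrix.
print(keepalive_detection_secs(200, 20, 2))
print(keepalive_detection_secs(1000, 60, 10))
```

With the 200 / 20 / 2 settings the bound is 240 seconds, and with 1000 / 60 / 10 it is 1600 seconds. The tcp_retries and tcp_syn_retries settings affect connections with unacknowledged data or connection attempts, and are not covered by this calculation.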
86
Failover Test Results
Failover time /      Expire_time  PMON     PMON    tcp KA  tcp KA  tcp KA  tcp       tcp       tcp syn
Failback time                     Sleep    Cycles  time    intvl   probes  retries1  retries2  retries
241 secs /           1 minute     30 secs  4       200     20      2       3         15        5
250 secs / 50 secs   5 minutes    30 secs  4       200     20      2       3         15        5
262 secs / 100 secs  10 minutes   30 secs  4       200     20      2       3         15        5
300 secs / 75 secs   1 minute     15 secs  2       200     20      2       3         15        5
285 secs / 35 min    10 minutes   30 secs  4       1000    60      10      3         15        5
8 secs / 105 secs    1 minute     30 secs  4       1000    60      10      2         2         2
10 secs / 42 secs    1 minute     30 secs  4       200     20      2       2         2         2
7 secs / 40 secs     10 minutes   30 secs  4       200     20      2       2         2         2
6 secs / 34 secs     1 minute     15 secs  2       200     20      2       2         2         2

87
All Services are UP

88
Concurrent Managers

• Processes - Actual = 1 and Target = 1, manager is running


• Processes - Actual = 0 and Target = 1, manager is not running
89
Actual Processes = 0

Example of Actual Processes = 0:
in this example the CRM is not running

90
PCP Setup

PCP setup – this screen is continued on the next slide


91
Primary and Secondary Nodes
Any concurrent programs not assigned to the
Standard Manager will not fail over

The CRM, ICM and Standard Manager will fail over

92
TCP Failure

• TCP disconnected at 2:57:25


• 10 seconds after the TCP connection was pulled, OAM reported the status above.
• It took 10 seconds for OAM to register a failure of services on RH9.

93
CRM is DOWN

If any of the subordinate services fail,
the failure rolls up to the Dashboard

94
CRM Failure

CRM has failed, Actual Processes = 0

95
PCP Failover from RH9 to RH7

Adding Node:(RH9), to unavailable list


Found dead process: spid=(9696), cpid=(1321449), ORA pid=(80), manager=(0/0)
Found dead process: spid=(9784), cpid=(1321458), ORA pid=(114), manager=(0/0)
Found dead process: spid=(9783), cpid=(1321457), ORA pid=(104), manager=(0/0)
Found running request 4413565 attached to dead manager process.
Attempting to restart request.
Internal Concurrent Manager found node RH9 to be down. Adding it to the list of
unavailable nodes.
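The "Found dead process" lines have a regular shape, so they can be pulled out of the ICM log for a quick count of what died. The sketch below is one possible way to do that: the function name and the returned field names are assumptions, and the pattern is based only on the lines shown on this slide.

```python
import re

DEAD_PROCESS = re.compile(
    r"Found dead process: spid=\((\d+)\), cpid=\((\d+)\), "
    r"ORA pid=\((\d+)\), manager=\((\d+)/(\d+)\)"
)


def parse_dead_processes(log_text):
    """Return one dict per 'Found dead process' line in the log text."""
    found = []
    for match in DEAD_PROCESS.finditer(log_text):
        spid, cpid, ora_pid, mgr_a, mgr_b = match.groups()
        found.append({
            "spid": int(spid),        # operating system process id
            "cpid": int(cpid),        # concurrent process id
            "ora_pid": int(ora_pid),  # Oracle process id
            "manager": (int(mgr_a), int(mgr_b)),
        })
    return found


# Two of the lines from the slide above.
log_lines = """Found dead process: spid=(9696), cpid=(1321449), ORA pid=(80), manager=(0/0)
Found dead process: spid=(9784), cpid=(1321458), ORA pid=(114), manager=(0/0)
Found running request 4413565 attached to dead manager process."""

for entry in parse_dead_processes(log_lines):
    print(entry["spid"], entry["cpid"], entry["ora_pid"])
```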

96
GSM tries to restart the services
TCP and TNS are unavailable:
Starting STANDARD Concurrent Manager : 18-JAN-2009 21:43:42
CONC-SM TNS FAIL
Routine AFPEIM encountered an error while starting concurrent manager STANDARD
with library /d01/oracle/VIS/apps/apps_st/appl/fnd/12.0.0/bin/FNDLIBR.
Check that your system has enough resources to start a concurrent manager process.
Contac : 18-JAN-2009 21:43:42
Starting STANDARD Concurrent Manager : 18-JAN-2009 21:43:42
CONC-SM TNS FAIL
Routine AFPEIM encountered an error while starting concurrent manager STANDARD
with library /d01/oracle/VIS/apps/apps_st/appl/fnd/12.0.0/bin/FNDLIBR.
Check that your system has enough resources to start a concurrent manager process.
Contac : 18-JAN-2009 21:43:42
Starting STANDARD Concurrent Manager : 18-JAN-2009 21:43:42
CONC-SM TNS FAIL
Routine AFPEIM encountered an error while starting concurrent manager STANDARD
with library /d01/oracle/VIS/apps/apps_st/appl/fnd/12.0.0/bin/FNDLIBR.

97
ICM and CRM are DOWN

98
RH9 is DOWN

Not really down, just not on the network

99
PCP is DOWN

This is momentary as GSM figures out what to do

100
Failover to Secondary Node

The ICM and CRM failed over to RH7 in about
1 minute and 30 seconds

101
Failover from RH9 to RH7
Starting Internal Concurrent Manager Concurrent Manager : 18-JAN-2009 21:51:23
: Started ICM on Target RH7.
Process monitor session ended : 18-JAN-2009 21:52:53
: Migration of ICM has completed.
Shutting down Internal Concurrent Manager : 18-JAN-2009 21:53:23
The VIS_0118@VIS internal concurrent manager has terminated successfully - exiting.
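The timestamps in this log can be subtracted to time the migration. The sketch below is a minimal example: the helper name parse_stamp is an assumption, and a fixed month table is used so the parsing does not depend on the locale of the machine running it.

```python
import re
from datetime import datetime

MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}
STAMP = re.compile(r"(\d{2})-([A-Z]{3})-(\d{4}) (\d{2}):(\d{2}):(\d{2})")


def parse_stamp(log_line):
    """Extract the DD-MON-YYYY HH:MM:SS timestamp from a manager log line."""
    match = STAMP.search(log_line)
    if match is None:
        raise ValueError("no timestamp found in: " + log_line)
    day, month, year, hour, minute, second = match.groups()
    return datetime(int(year), MONTHS[month], int(day),
                    int(hour), int(minute), int(second))


started = parse_stamp(
    "Starting Internal Concurrent Manager Concurrent Manager : 18-JAN-2009 21:51:23")
ended = parse_stamp("Process monitor session ended : 18-JAN-2009 21:52:53")
elapsed = (ended - started).total_seconds()
print(elapsed)
```

The 90-second gap between the two log lines agrees with the "about 1 minute and 30 seconds" reported for the ICM and CRM failover.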
102
ICM Failover to RH7
103
RH9 not available

104
Request Failover

105
Standard Manager Failover
Configuration

• Note the Inventory Manager, MRP Manager and OAM
Metrics Collection Manager are not set up to fail over.
106
Managers with a Secondary Node

107
Failback

FAILBACK – tcp connected at 31:40

The host, RH9, becomes available on OAM about 2
minutes later.
108
RH9 available

109
ICM Failback

110
Concurrent Manager Log
Starting Internal Concurrent Manager Concurrent Manager : 18-JAN-2009 22:53:33
: Started ICM on Target RH9.
Process monitor session ended : 18-JAN-2009 22:55:03
: Migration of ICM has completed.
Shutting down Internal Concurrent Manager : 18-JAN-2009 22:55:33
The VIS_0118@VIS internal concurrent manager has terminated successfully - exiting.
111
112
Failback Complete

Total Failback Time 3 minutes and 45 seconds


113
Standard Manager before Failover

The Standard Manager has 3 Actual and Target
processes.

114
Standard Manager is DOWN

115
Standard Manager has 2
Processes on Failover

After 3 minutes and 30 seconds the Standard Manager started on RH7

116
Shutdown of CP

117
Concurrent Processing Load Balancing
Two types of Load Balancing

• Load Balancing with both nodes running – no failover
• Load Balancing during failover
118
PCP Load Balancing
• One of the benefits Parallel Concurrent
Processing provides:
– failover in case of node failure
• maintain throughput and keep the business running during
node failures.

• When a node fails, the processes that were
running on the failed node are restarted on
secondary nodes.
• However, a resource-intensive node may
overload the secondary node when it fails over.
119
PCP Load Balancing
• If too many processes are running on the secondary
node when the primary node fails over, the secondary
node may not have the capacity to process the requests
from additional concurrent managers.
• R12 introduces Failover Sensitive Workshifts. This
enhancement allows the System Administrator to
configure how many processes fail over for each
workshift. With this added control, System Administrators
can enjoy the benefits of PCP failover without risking
performance issues caused by overloaded resources.
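The effect of a failover process count can be estimated before a node is lost. The sketch below is a simplified model, not the PCP implementation: the class, the function, the manager names and all process counts are hypothetical, and it assumes that a manager whose primary node fails restarts on its secondary node with its failover process count.

```python
from dataclasses import dataclass


@dataclass
class ManagerShift:
    name: str
    primary_node: str
    secondary_node: str
    processes: int           # processes in the workshift during normal running
    failover_processes: int  # processes allowed after failing over


def processes_after_failure(shifts, failed_node):
    """Return the process count per surviving node once failed_node is lost."""
    load = {}
    for shift in shifts:
        if shift.primary_node == failed_node:
            node, count = shift.secondary_node, shift.failover_processes
        else:
            node, count = shift.primary_node, shift.processes
        load[node] = load.get(node, 0) + count
    return load


# Hypothetical setup: two managers on two nodes.
shifts = [
    ManagerShift("Manager A", "node1", "node2", processes=3, failover_processes=2),
    ManagerShift("Manager B", "node2", "node1", processes=4, failover_processes=2),
]

print(processes_after_failure(shifts, "node1"))
print(processes_after_failure(shifts, "node2"))
```

In this model only the manager that actually fails over is reduced; the manager already running on the surviving node keeps its normal process count.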

120
R12 Failover Sensitive Workshifts

121
Failover Sensitive Workshifts

122
Failover Sensitive Workshifts

• Conversely, if a failover occurs from node 1 to
node 2, we may want to reduce the failover
processes. However, this doesn’t work.
• The “failover processes” setting takes effect
only if the node fails.
123
Failover Processes

PO Document Approval Manager and the Standard Manager will reduce the number of
processes when RH7 fails. When RH9 fails, the number of failover processes for managers
that run on RH7 is not reduced.

124
Failover Sensitive Workshifts
It’s clear: to run an R11i or R12 system during
a failover, there are two choices:
• Run the servers at 35% or less utilization
• Reduce the number of processes that are
allowed during failover
For most businesses the second option is
the more practical.

125
References
• 249213.1 - Performance problems with Failover when TCP Network goes down
• 364171.1 - TAF Session Hangs, Select Fails To Complete W/ Loss Of NIC: Tune TCP Keepalive
• 211362.1 - Process Monitor Session Cycle Repeats Too Frequently
• 291201.1 - How To Remove a Dead Connection to the Target Database
• 362135.1 - Configuring Oracle Applications Release 11i with Oracle10g Release 2 Real Application Clusters and Automatic Storage Management
• Optimizing the E-Business Suite with Real Application Clusters (RAC) - Ahmed Alomari
• 240818.1 - Concurrent Processing: Transaction Manager Setup and Configuration Requirement in an 11i RAC Environment
• R12 ATG - Concurrent Processing Functional Overview - Aaron Weisberg
• 210062.1 - Generic Service Management (GSM) in Oracle Applications 11i
• 271090.1 - Parallel Concurrent Processing Failover/Failback Expectations
• 241370.1 - Concurrent Manager Setup and Configuration Requirements in an 11i RAC Environment
• 602899.1 - Some More Facts On How to Activate Parallel Concurrent Processing

126
