High Availability- Demystified

Aman Sharma

@amansharma81 http://blog.aristadba.com
Who Am I?
▪ Aman Sharma
▪ About 14+ years using Oracle Database
▪ Oracle ACE
▪ Frequent Contributor to OTN Database forum(Aman….)
▪ Board Member(AIOUG)
▪ Lead-North India Chapter
▪ Sun Certified

OTNYatra-2017
Confused?

Application Continuity
Overview : Oracle RAC Database Tier

• Software-based clustering using Grid Infrastructure software
• Cluster nodes contain only database and ASM instances
• Homogeneous configuration
• Dedicated access to the shared storage for the cluster nodes
• Applications/users connect via nodes outside the cluster
• Reflects the point-to-point model

Overview : Oracle RAC Application Tier

Application Tier

Database Tier

Solutions possible using Oracle RAC
• Load balancing
• The workload of different sessions connecting to the RAC cluster nodes can
be balanced
• Different options exist: Client-Side Connect-Time, Server-Side Connect-Time,
and Run-Time
• High Availability
• Database availability is maintained in the event of node and instance failures
• Performance
• RAC, using Cache Fusion, minimizes the physical I/O needed to access the
database
• With more nodes, the performance of certain database operations can
improve compared to a single instance

Oracle RAC based Load Balancing - Overview
• Different sessions can do different kinds of work across the RAC nodes
• Load balancing distributes the workload among the RAC nodes
• Workload balance is important to avoid “hot spots”
• Various methods exist: Client-Side Load Balancing, Server-Side Load
Balancing, and Run-Time Load Balancing

Oracle Net Failover Overview
• Oracle RAC is part of the Maximum Availability Architecture (MAA) and
provides a high-availability solution for node and instance crashes
• Failover is how a client handles the “exception” of being disconnected
from the server
• In the Oracle Net context, failover refers to switching to a different
instance or database
• Can happen at two different points in time
• When the initial connection request is made
• When an existing connection is terminated
• Implemented using techniques such as Transparent Application
Failover and Fast Connection Failover
Oracle Database Services
• Services present a single system image to the user in a multi-instance
environment
• For every service, a resource profile describes the relationship
between the service, the instance and the database
• In RAC, services are used to configure
• Load balancing
• Failover (using TAF)

Oracle Call Interface (OCI)
• Native C language interface for Oracle Database
• Works for both native and custom applications
• Used by Oracle tools such as SQL*Plus, Real Application Testing (RAT),
SQL*Loader and Data Pump
• Foundation for other language-specific interfaces such as Oracle JDBC-OCI,
Oracle Data Provider for .NET (ODP.NET), Oracle Precompilers, Oracle ODBC
and the Oracle C++ Call Interface (OCCI)
• Provides support for Transparent Application Failover (TAF)

Understanding Connect-Time Failover
• Connect-time failover applies at the time of the initial connection
request
• Connect-time failover is a default property of Oracle Net Services
• Can be disabled using the parameter FAILOVER=FALSE
• Can be implemented in both single-instance and RAC configurations

Understanding Connect-Time Failover(Contd..)
• Implemented implicitly
• Can be controlled using the parameter FAILOVER=TRUE/FALSE
• To implement in single instance, use:
• Multiple Listener addresses with a single connect descriptor
• Multiple connect descriptors
• To implement in RAC, use
• Virtual IPs (VIPs)

Connect Time Failover Example (Single Instance)
# Multiple listener addresses in a single connect descriptor
ORCL=
 (DESCRIPTION=
  (ADDRESS_LIST=
   (ADDRESS=(PROTOCOL=TCP)(HOST=Host2)(PORT=1521))
   (ADDRESS=(PROTOCOL=TCP)(HOST=Host1)(PORT=1521))
   (FAILOVER=TRUE)
  )
  (CONNECT_DATA=
   (SERVICE_NAME=orcl)
  )
 )

# Multiple connect descriptors
ORCL=
 (DESCRIPTION_LIST=
  (FAILOVER=true)
  (DESCRIPTION=
   (ADDRESS=(PROTOCOL=TCP)(HOST=Host1)(PORT=1521))
   (CONNECT_DATA=(SERVICE_NAME=orcl))
  )
  (DESCRIPTION=
   (ADDRESS=(PROTOCOL=TCP)(HOST=Host2)(PORT=1521))
   (CONNECT_DATA=(SERVICE_NAME=orcl_bkp))
  )
 )

Connect Time Failover Example (RAC)
ORCL =
(DESCRIPTION =
(ADDRESS_LIST =
(LOAD_BALANCE=ON)
(FAILOVER=ON)
(ADDRESS=(PROTOCOL=TCP)(HOST=vip1)(PORT=1521))
(ADDRESS=(PROTOCOL=TCP)(HOST=vip2)(PORT=1521))
)
(CONNECT_DATA=(SERVICE_NAME=orcl)))
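
The address-list behaviour above can be illustrated with a small Python sketch. It is only a model of the idea, not how Oracle Net is implemented: `connect_with_failover` and `fake_connect` are hypothetical names, and the fake connector simply pretends that only vip2 is reachable.

```python
import random


def connect_with_failover(addresses, try_connect, failover=True,
                          load_balance=False, rng=random):
    """Return (address, session) for the first address that accepts the connection."""
    candidates = list(addresses)
    if load_balance:
        rng.shuffle(candidates)       # LOAD_BALANCE=ON: try addresses in random order
    if not failover:
        candidates = candidates[:1]   # FAILOVER=OFF: only one address is tried
    last_error = None
    for address in candidates:
        try:
            return address, try_connect(address)
        except ConnectionError as exc:
            last_error = exc          # remember the failure, move to the next address
    raise ConnectionError(f"no listener reachable: {last_error}")


def fake_connect(address):
    """Hypothetical stand-in for a network connect: only vip2 answers."""
    host, port = address
    if host != "vip2":
        raise ConnectionError(f"{host}:{port} unreachable")
    return f"session on {host}"


ADDRESSES = [("vip1", 1521), ("vip2", 1521)]
address, session = connect_with_failover(ADDRESSES, fake_connect)
print(address, session)
```

With failover left on, the unreachable vip1 is skipped and the session lands on vip2. Passing `failover=False` makes the same call fail, because only the first address is attempted.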

Understanding Connect-Time Failover- Node is Down

[Figure: client session u/p@HR_PROD using OCI and the TNS entry HR (Host1=VIP1, Host2=VIP2, Service=HR_PROD) against Host01 and Host02; on node failure the VIP fails over to the surviving node (RE-ARP) and the client receives a NAK.]
Impact on User Connection
▪ User’s connection gets disconnected
▪ User’s session gets aborted
▪ User’s work is lost
▪ User receives an error
▪ User needs to re-connect explicitly
Understanding Connect-Time Failover- Instance is Down
[Figure: client session u/p@HR_PROD using OCI and the TNS entry HR (Host1=VIP1, Host2=VIP2, Service=HR_PROD) against Host01 and Host02; the instance is down, the HR service is not available, and the client receives an error.]
Impact on User Connection
▪ User’s connection gets disconnected
▪ User’s session gets aborted
▪ User’s work is lost
▪ User receives an error
▪ User needs to re-connect explicitly
Transparent Application Failover(TAF)
• Works as an exception handler
• Feature of OCI driver
• Enables the application to reconnect automatically
• Needs to be explicitly enabled on either the client
side or at the server side
• Two failover methods
• Basic
• Pre-connect
• Two failover types
• Session
• Select

Database Services & TAF
• TAF options need to be specified for a connection to use TAF
• With database services, implementing TAF becomes easier
• Defining a TAF policy for a service applies the settings to all
subsequent connections that use that service
• TAF settings defined on the database service override any
settings made on the client side
• To define a TAF policy for a service, the srvctl utility can be
used

srvctl modify service -db orcl -service SRV
-failovermethod BASIC -failovertype SELECT
-failoverretry 2 -failoverdelay 5

TAF & Node Failover
[Figure: client session u/p@HR_PROD using OCI and the TNS entry HR (Host1=VIP1, Host2=VIP2, Service=HR_PROD) against Host01 and Host02; on node failure the VIP fails over to the surviving node (RE-ARP), the client receives a NAK, and TAF handles the reconnection.]
Impact on User Connection
▪ User’s connection gets disconnected
▪ User’s session gets aborted
▪ User’s work is lost
▪ User receives an error
▪ User gets reconnected automatically without explicitly giving a connect request
Configuring TAF- Client Side

Service creation on the server side

srvctl add service -db ORCL -service SRV1 -r I1 -a I2

Client-side TNSNAMES.ora
SRV1 =
(DESCRIPTION =(FAILOVER=ON)
(ADDRESS=(PROTOCOL=TCP)(HOST=VIP1)(PORT=1521))
(ADDRESS=(PROTOCOL=TCP)(HOST=VIP2)(PORT=1521))
(CONNECT_DATA =
(SERVICE_NAME = SRV1)
(FAILOVER_MODE= (TYPE=select)
(METHOD=basic)
(RETRIES=5)
(DELAY=1))))
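
How RETRIES and DELAY interact can be pictured with a short Python sketch. This is a simplified model under stated assumptions, not the OCI implementation: `taf_reconnect` and `flaky_reconnect` are hypothetical names, and the sleep function is injectable so the example runs without real pauses.

```python
import time


def taf_reconnect(reconnect, retries=5, delay=1, sleep=time.sleep):
    """Try to re-establish a session up to `retries` times,
    pausing `delay` seconds between failed attempts."""
    for attempt in range(1, retries + 1):
        try:
            return reconnect(), attempt
        except ConnectionError:
            if attempt < retries:
                sleep(delay)          # wait before the next attempt
    raise ConnectionError(f"failover gave up after {retries} attempts")


attempts = {"count": 0}


def flaky_reconnect():
    """Hypothetical reconnect that only succeeds on the third call."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("surviving instance not ready yet")
    return "session on instance I2"


pauses = []                           # records the delays instead of sleeping
session, used = taf_reconnect(flaky_reconnect, retries=5, delay=1,
                              sleep=pauses.append)
print(session, used, pauses)
```

In this run the third attempt succeeds, so two one-second pauses are recorded. If every attempt failed, the error would surface to the application after the fifth try.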

Configuring TAF- Server Side

Service creation on the server side

srvctl add service -db ORCL -service SRV1
-failovermethod BASIC -failovertype SELECT
-failoverretry 5 -failoverdelay 1 -r I1 -a I2

Client-side TNSNAMES.ora
SRV1 =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = cluster01-scan)
(PORT = 1521))
(CONNECT_DATA =
(SERVICE_NAME = SRV1)))

Is Using TAF Enough?
• Applications can waste time in many critical ways:
• Waiting for TCP/IP timeouts when a node fails without
closing sockets, and for every subsequent connection
while that IP address is down.
• Attempting to connect when services are down.
• Not connecting when services resume.
• Processing the last result at the client when the server
goes down.
• Attempting to execute work on sub-optimal nodes.

A typical Example when TAF is not enough

• A node fails without closing all of its sockets
• All connected sessions that are blocked on an I/O wait must wait for
tcp_keepalive to expire
• Sessions that were still processing the last results are now in a
“hung state”
• A session is disconnected only when it requests the next data and
receives the interrupt

Fast Application Notification(FAN)

• FAN is a high-availability notification mechanism used for fast
detection of failures
• Oracle RAC uses FAN to notify other processes about configuration
and service level information
• Oracle client drivers and connection pools can subscribe to FAN
events and take immediate action

Why FAN?
• Client applications typically rely on connection timeouts or out-of-band polling
mechanisms
• Timeout-based mechanisms slow down failure detection
• FAN pushes high-availability events, e.g. node or service down, as
soon as they occur
• FAN events are pushed to the applications immediately, so they can decide on the course
of action without waiting for a timeout
• FAN events also help propagate run-time load balancing information for more efficient
workload distribution across the nodes
• FAN works with the following integrated Oracle clients :
• Oracle JDBC Universal Connection Pool
• ODP.NET connection pool
• OCI session pool
• Oracle WebLogic Server Active Gridlink for Oracle RAC
• OCI and ODP.NET clients

Let’s Implement FAN
Three possible ways
• Your application can use FAN without programmatic changes if you use an integrated
Oracle client.
• Applications can use FAN programmatically by using the JDBC and Oracle RAC FAN
application programming interface (API) or by using callbacks with OCI and ODP.NET to
subscribe to FAN events and to execute event handling actions upon the receipt of an
event.
• FAN with server-side callouts on your database tier.
• FAN events are broadcast to the different client applications by CRS using different
processes
• Oracle Notification Service (ONS)
• ONS OCI API
• ONS Java API
• JDBC
• Advanced Queuing (AQ)
• ODP.NET
• OCI API
• Event Manager Daemon (EMD)
• Enterprise Manager
Installing & Configuring Oracle RAC FAN API
Install the Oracle RAC FAN APIs by performing the following steps:
• Download the simplefan.jar file from the following link
http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-112010-090769.html
• Add the simplefan.jar file to the classpath
• Perform the following in your Java code:
▪ Get an instance of the FanManager class by using the getInstance method.
▪ Configure the event daemon using the configure method of the FanManager class. The configure
method sets the following properties:
• onsNodes: A comma separated list of host:port pairs of ONS daemons that the ONS runtime in
this Java VM should communicate with. The host in a host:port pair is the host name of a system
running the ONS daemon. The port is the local port configuration parameter for that daemon.
• onsWalletFile: The path name of the ONS wallet file. The wallet file is the path to a local wallet
file used by SSL to store SSL certificates. Same as wallet file configuration parameter to ONS
daemon.
• onsWalletPassword: The password for accessing the ONS wallet file.

Fast Connection Failover(FCF)
• FCF is a FAN client implemented through a connection pool
• FCF recovers a lost or interrupted connection automatically and
without delay
• FCF is implemented in JDBC- or ODP.NET-based connections
using ONS (either local or remote)
• With FCF enabled, client connections and workloads are
balanced using the Load Balancing Advisory
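
The idea of a connection pool acting on pushed events, rather than waiting for a timeout, can be sketched in Python as follows. Everything here is an illustrative assumption: `EventBus`, `Pool`, `Connection` and the event fields `status` and `instance` are hypothetical and do not reflect the real FAN payload or any Oracle API.

```python
from dataclasses import dataclass, field


@dataclass
class Connection:
    instance: str


@dataclass
class Pool:
    connections: list = field(default_factory=list)

    def on_event(self, event):
        """Drop connections to an instance as soon as a 'down' event arrives."""
        if event["status"] == "down":
            self.connections = [c for c in self.connections
                                if c.instance != event["instance"]]

    def borrow(self):
        """Hand out a connection that is still believed to be usable."""
        if not self.connections:
            raise ConnectionError("no valid connections left in the pool")
        return self.connections[0]


class EventBus:
    """Minimal publish/subscribe channel standing in for the notification service."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, event):
        for callback in self.subscribers:
            callback(event)           # push the event to every subscriber


bus = EventBus()
pool = Pool([Connection("orcl1"), Connection("orcl1"), Connection("orcl2")])
bus.subscribe(pool.on_event)

bus.publish({"status": "down", "instance": "orcl1"})
survivor = pool.borrow()
print(len(pool.connections), survivor.instance)
```

After the single event is published, the pool only holds the connection to orcl2, so the next borrower never touches a dead connection.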

TAF or FAN or FCF?

• FAN is event-based and therefore delivers status information to its
subscribers immediately
• TAF relies on timeouts, which delays detection and failover
• TAF is a database-level failover mechanism
• FCF is an application-level failover mechanism
• Using FAN with connection pools makes FCF possible
• FCF is meant for Java-based N-tier applications, especially those with short-lived
connections
• TAF is meant for client-server applications, especially those with
long-lived connections performing batch workloads
• If you are using FCF, there is no need to use TAF

Transaction Issues Before 12c
• TAF did not account for transactions
• An outage at the database or application level can cause
in-flight work to be lost
• A user’s reattempt of the transaction may lead to
logical errors, e.g. duplication of data
• Handling such exceptions at the application level is not
easy
[Figure: numbered steps 1 to 5 with the labels “In-doubt”, “Application error” and “Database error”.]
Solution: Transaction Guard & Appl. Continuity

• Transaction Guard
• Transaction Guard provides a generic protocol and API
for applications to use for at-most-once execution in
case of planned and unplanned outages and repeated
submissions
• Application Continuity
• Enables the replay of in-flight, recoverable transactions
following the outage of database

What Is Transaction Guard

• Part of both Standard and Flex clusters
• Returns the outcome of the last transaction after a recoverable
error using the Logical Transaction ID (LTXID)
• Used by Application Continuity (automatically enabled)
• Can also be used independently

Transaction Guard Key Concepts
• Database Request
• Unit of work submitted by SQL, PL/SQL etc.
• Recoverable Error
• Error due to any issue independent of application i.e. network,
node, database, storage errors
• Reliable Commit Outcome
• Outcome of the last transaction(preserved by TG using LTXID)
• Session State Consistency
• Describes how the application changes the non-transactional state
during a database request
• Mutable Functions
• Functions that change their state with every execution

Transaction Guard Key Concepts
• Logical Transaction ID(LTXID)
• Foundation of at-most-once executions
• Stored in the OCI session handle and in the connection objects for
JDBC thin and ODP.NET drivers
• At-Most-Once execution
• Transaction Guard uses LTXID to avoid duplications
• Is maintained by preserving the LTXID
• Using getLTXID API, application can retrieve the logical
transaction id that was used in the last dead session
• Transaction Guard supported drivers
• 12c JDBC type 4 driver
• 12c OCI and OCCI client drivers
• 12c Oracle Data Provider for .NET (ODP.NET) client driver
What Is Logical TX ID(LTXID)?
• Used to fetch the commit outcome of the last transaction
• Saved in SYS.LTXID_TRANS so it survives reboots, failovers and
disconnections (default retention = 24 hours)
• Queried through DBMS_APP_CONT.GET_LTXID_OUTCOME
• The client is supplied a unique LTXID at each authentication and on
each round trip from the client driver that commits
• Both client and database hold the LTXID
• Transaction Guard ensures that each LTXID is unique
• The LTXID is retained after the commit for the default retention period of 24
hours
• While the outcome is being obtained, the LTXID is blocked to ensure its
integrity
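
Why blocking the LTXID matters for at-most-once execution can be shown with a toy in-memory model in Python. It is a sketch under stated assumptions only: `TransactionHistory` is a hypothetical class, not the server-side table or the DBMS_APP_CONT package, and the LTXID values are made-up strings.

```python
class TransactionHistory:
    """Toy in-memory stand-in for the server-side record of LTXIDs."""

    def __init__(self):
        self._committed = set()
        self._blocked = set()

    def commit(self, ltxid):
        """Record a commit, unless the LTXID was blocked or already used."""
        if ltxid in self._blocked:
            raise RuntimeError(f"{ltxid} is blocked; commit rejected")
        if ltxid in self._committed:
            raise RuntimeError(f"{ltxid} was already committed")
        self._committed.add(ltxid)

    def get_outcome(self, ltxid):
        """Return True if the LTXID committed. If it did not, block it so that
        a late in-flight commit cannot succeed after the answer was given."""
        committed = ltxid in self._committed
        if not committed:
            self._blocked.add(ltxid)
        return committed


history = TransactionHistory()
history.commit("ltxid-1")

outcome_1 = history.get_outcome("ltxid-1")   # this transaction did commit
outcome_2 = history.get_outcome("ltxid-2")   # never committed, now blocked

try:
    history.commit("ltxid-2")                # a late commit arrives after the check
    late_commit_rejected = False
except RuntimeError:
    late_commit_rejected = True

print(outcome_1, outcome_2, late_commit_rejected)
```

Because the uncommitted LTXID is blocked at the moment its outcome is read, the answer "not committed" stays true, and the application can resubmit without risking a duplicate.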
Transaction Guard-Pseudo Workflow
Receive a FAN down event (or recoverable error)
FAN aborts the dead session
If recoverable error (new OCI_ATTRIBUTE for OCI, isRecoverable for JDBC)
  Get last LTXID from dead session using getLTXID or from your callback
  Obtain a new session
  Call GET_LTXID_OUTCOME with last LTXID to obtain COMMITTED and
  USER_CALL_COMPLETED status
  If COMMITTED and USER_CALL_COMPLETED
    Then return result
  ELSEIF COMMITTED and NOT USER_CALL_COMPLETED
    Then return result with a warning (that details such as out binds or row count were not
    returned)
  ELSEIF NOT COMMITTED
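
The branching at the end of the workflow can be written as a small Python function. This is a sketch, with `next_action` as a hypothetical name. The first two return values follow the workflow above. The action for the NOT COMMITTED branch is not given in the workflow, so the value used here is an assumption.

```python
def next_action(committed: bool, user_call_completed: bool) -> str:
    """Map the COMMITTED and USER_CALL_COMPLETED flags to the next step."""
    if committed and user_call_completed:
        return "return result"
    if committed and not user_call_completed:
        # Details such as out binds or row counts were not returned.
        return "return result with warning"
    # Assumption: the workflow above does not state the action for this branch.
    # Resubmitting is treated as safe here only because nothing was committed.
    return "safe to resubmit"


actions = {
    (committed, completed): next_action(committed, completed)
    for committed in (True, False)
    for completed in (True, False)
}
print(actions)
```

Writing out all four flag combinations makes it easy to check that resubmission is never suggested for a transaction that has already committed.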
Transaction Guard-(Un)Supported Transactions
• Supported
• Local Transactions
• Parallel Transactions
• Distributed & Remote Transactions
• DDL & DCL Transactions
• Auto-commit and commit-on-success
• PL/SQL with embedded COMMIT
• Unsupported
• Recursive transactions
• Autonomous transactions
• Active Data Guard with read/write DB links for forwarding transactions
• GoldenGate & Logical Standby
• API supported for
• 12c JDBC Type 4
• 12c OCI/OCCI client drivers
• 12c ODP.NET

Configuring Database for Transaction Guard

• Database release 12.1.0.1 or later
• GRANT EXECUTE ON DBMS_APP_CONT TO <user>;
• Configure Fast Application Notification (FAN)
• Locate and define the transaction history
table (LTXID_TRANS)
• Configure the following parameters for the service
• COMMIT_OUTCOME = TRUE
• FAILOVER_TYPE=TRANSACTION
• RETENTION_TIMEOUT=<value>

Sample Service Configuration for Transaction Guard

Adding an Admin-managed Service

srvctl add service -database orcl -service GOLD -prefer inst1 -available inst2 -commit_outcome TRUE -retention 604800

Modifying a Single Instance Service

DECLARE
  params dbms_service.svc_parameter_array;
BEGIN
  params('COMMIT_OUTCOME') := 'true';
  params('RETENTION_TIMEOUT') := 604800;
  dbms_service.modify_service('<service-name>', params);
END;
/

Application Continuity

• Masks outages from the applications
• Replays the in-flight transactions
• Uses Transaction Guard implicitly

Application Continuity-Resource Requirements

• For Java Client
• Increased memory for replay queues
• Additional CPU for garbage collection
• For Database Server
• Additional CPU for validation
• Transaction Guard
• Bundled with the kernel
• Minimal overhead

Take Away
• Failover is natively enabled in OCI clients
• TAF lets a client recover from a failed connection by failing over
either the session alone or the session together with its SELECT
queries
• FAN enables quick detection and resolution
of failures using HA events
• Transaction Guard and Application Continuity,
available from 12c, are the evolution of TAF

Thank You!

@amansharma81
http://blog.aristadba.com
amansharma@aristadba.com
