Document Revisions
Version   Date         Author   Comments
1.0       11/01/2004            Initial Draft. Portions taken from EMS Best Practices. Sun Cluster section taken from Sun Cluster Overview for Solaris.
Document Approvals
Name Signature Date
Document Owners
Name
This document contains information that is confidential to both COMPANY XYZ and TIBCO Software Inc.
2004 TIBCO Software Inc. All Rights Reserved. TIBCO Confidential and Proprietary
Copyright Notice
2004 TIBCO Software Inc. This document is unpublished and the foregoing notice is affixed to protect TIBCO Software [...] inadvertent publication. All rights reserved. No part of this document may be reproduced in any form, including [...] transmission electronically to any computer, without prior written consent of TIBCO Software Inc. The information [...] document is confidential and proprietary to TIBCO Software Inc. and may not be used or disclosed except as expressly [...] by TIBCO Software Inc. Copyright protection includes material generated from our software programs displayed on [...] icons, screen displays, and the like.
[...] described herein are either covered by existing patents or patent applications are in progress. All brand and product names [...] registered trademarks of their respective holders and are hereby acknowledged.
[...] this document is subject to change without notice. This document contains information that is confidential and [...] TIBCO Software Inc. and may not be copied, published, or disclosed to others, or used for any purposes other than review, [...] authorization of an officer of TIBCO Software Inc. Submission of this document does not represent a commitment to [...] of this specification in the products of the submitters.
[...] this document is subject to change without notice. THIS DOCUMENT IS PROVIDED "AS IS" AND TIBCO MAKES [...] EXPRESS, IMPLIED, OR STATUTORY, INCLUDING BUT NOT LIMITED TO ALL WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. TIBCO Software Inc. shall not be liable for errors contained [...] incidental or consequential damages in connection with the furnishing, performance or use of this material.
1 Introduction
This document is part of the Enterprise Integration Framework (EIF) for COMPANY XYZ. COMPANY XYZ have chosen TIBCO Enterprise Message Service (EMS) as the messaging backbone for all of their integration projects. To this end, it is imperative that EMS be implemented in such a way as to deliver the Level Of Service required by the business. In addition, COMPANY XYZ want to make best use of server hardware by avoiding standby machines that sit idle until a server fails. They are also current and reasonably experienced users of Sun Cluster Software.
1.1 Audience
The audience for this document is:
- Developers attempting to understand the rationale for a clustered deployment
- Administration staff involved in implementing or supporting such a deployment
1.2 Purpose
This document addresses the following questions:
- What do we mean by the terms Fault Tolerant and Highly Available in the context of EMS?
- What benefits does Sun Cluster provide?
- How do we install the EMS components into a Sun Cluster?
- What role will TIBCO Administrator play?
- What role will TIBCO Hawk play?
Although this document is not intended to be a study of Sun Cluster software, Sun Cluster achieves its objectives through certain principles that are common to other forms of cluster software and can therefore be treated as patterns or templates for re-use.
Sun Documentation for the following products: TIBCO Enterprise Message Service
2.1 Datastore
The TIBCO EMS Server requires storage for persistent messages, state metadata and configuration data. This storage is known as its datastore and consists of three disk files:
- meta.db: stores information required by the server, but stores no messages
- sync-msgs.db: stores data for queues or topics defined as failsafe
- async-msgs.db: stores data for queues or topics NOT defined as failsafe
A large amount of business data will pass through the EMS Server and be stored on disk. This places stringent requirements on the disk storage:
- performance: it must be fast, robust and reliable
- size: the storage allocated should be able to grow dynamically over time
- recovery: in the event of a disaster, some if not all of the data should be recoverable
SAN based storage is the best option due to its ability to deliver each of the above requirements.
The first option is achieved by running the Fault Tolerant pair of EMS servers simultaneously and configuring them to be aware of each other via a TCP connection. If the primary server fails, the backup server detects the failure and attempts to gain control of the datastore by locking it. This option works well when the datastore is local to the servers but is complicated when the datastore resides on a network device or on a SAN. Cheap network locking protocols such as NFS are notoriously unreliable, whereas commercial products that provide this functionality reliably are prohibitively expensive.
The second option is achieved through the use of TIBCO Hawk or clustering software. TIBCO Hawk can be used to ensure that only a single instance of a process is running, but clustering software has the added advantage that it can detect network malfunctions and can also guarantee, through the mounting and unmounting of disk partitions, that only a single server has access to the datastore.
NOTE: Even though the combination of the above features provides a reasonable level of Fault Tolerance, it cannot mitigate every possible failure mode. Failures involving both physical server nodes, or prolonged network outages, will eventually result in client disconnects. However, these situations can be handled by TIBCO Hawk rulebases running locally to the client, in conjunction with good process design, to shut down clients and restart them when the EMS Servers come back up, thus preventing many spurious error conditions.
reconnect_attempt_delay
While this can be done on an individual basis through code, the recommended method is to use a JNDI call to a Connection Factory object. This allows the above parameters to be retrieved from the server, thus centralizing control and administration. A Fault Tolerant JNDI Connection Factory URL takes the form:
tibjmsnaming://<server 1>:<port>,tibjmsnaming://<server 2>:<port>
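As a minimal sketch of the URL form above, the following shell function joins one tibjmsnaming entry per server with commas. The function name ft_jndi_url and the host names emsnode-a and emsnode-b are hypothetical and used only for illustration. The entries are written with no whitespace around the comma, as in the url entries of factories.conf later in this document.

```shell
# Build a fault-tolerant JNDI URL from a port and a list of hosts.
# ft_jndi_url is a hypothetical helper, not part of the EMS distribution.
ft_jndi_url() {
    port=$1
    shift
    url=""
    for host in "$@"; do
        # Prefix every entry after the first with a comma.
        url="${url:+$url,}tibjmsnaming://${host}:${port}"
    done
    printf '%s\n' "$url"
}

# Hypothetical host names for the two cluster nodes.
ft_jndi_url 7020 emsnode-a emsnode-b
```

Generating the URL from a single host list keeps client configuration aligned with the server pair if node names change.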
Where possible these files should reside on the SAN. This will allow access to diagnose issues in the event that the server node cannot be immediately recovered.
3.1 Introduction
A cluster is two or more systems, or nodes, that work together as a single, continuously available system to provide applications, system resources, and data to users. Each node on a cluster is a fully functional standalone system. However, in a clustered environment, the nodes are connected by an interconnect and work together as a single entity to provide increased availability and performance.
Figure 1.
Highly available clusters provide nearly continuous access to data and applications by keeping the cluster running through failures that would normally bring down a single server system. No single failure (hardware, software, or network) can cause a cluster to fail. By contrast, fault-tolerant hardware systems provide constant access to data and applications, but at a higher cost because of specialized hardware. Fault-tolerant systems usually have no provision for software failures.
An application is highly available if it survives any single software or hardware failure in the system. Failures that are caused by bugs or data corruption within the application itself are excluded. The following apply to highly available applications:
- Recovery is transparent to the applications that use a resource.
- Resource access is fully preserved across node failure.
- Applications cannot detect that the hosting node has been moved to another node.
- Failure of a single node is completely transparent to programs on remaining nodes that use the files, devices, and disk volumes attached to this node.
A failover service provides high availability through redundancy. When a failure occurs, you can configure an application that is running to either restart on the same node, or be moved to another node in the cluster, without user intervention.
The Sun Cluster system makes the path between users and data highly available by using multihost disks, multipathing, and a global file system. The Sun Cluster system monitors failures for the following:
- Applications: Most of the Sun Cluster data services supply a fault monitor that periodically probes the data service to determine its health. A fault monitor verifies that the application daemon or daemons are running and that clients are being served. Based on the information that is returned by probes, a predefined action such as restarting daemons or causing a failover can be initiated.
- Disk-Paths: Sun Cluster software supports disk-path monitoring (DPM). DPM improves the overall reliability of failover and switchover by reporting the failure of a secondary disk path.
- Internet Protocol (IP) Multipath: Solaris IP network multipathing software on Sun Cluster systems provides the basic mechanism for monitoring public network adapters. IP multipathing also enables failover of IP addresses from one adapter to another adapter when a fault is detected.
The following sections describe some of the key terms and definitions used when discussing clustering using Sun Cluster software.
The main function of the Cluster Membership Monitor (CMM) is to establish cluster membership, which requires a cluster-wide agreement on the set of nodes that participate in the cluster at any time. The CMM detects major cluster status changes on each node, such as loss of communication between one or more nodes. The CMM relies on the transport kernel module to generate heartbeats across the transport medium to other nodes in the cluster. When the CMM does not detect a heartbeat from a node within a defined time-out period, the CMM considers the node to have failed and initiates a cluster reconfiguration to renegotiate cluster membership.
To determine cluster membership and to ensure data integrity, the CMM performs the following tasks:
- Accounting for a change in cluster membership, such as a node joining or leaving the cluster
- Ensuring that an unhealthy node leaves the cluster
- Ensuring that an unhealthy node remains inactive until it is repaired
- Preventing the cluster from partitioning itself into subsets of nodes
3.6.1
Each Sun Cluster data service supplies a fault monitor that periodically probes the data service to determine its health. A fault monitor verifies that the application daemon or daemons are running and that clients are being served. Based on the information returned by probes, predefined actions, such as restarting daemons or causing a failover, can be initiated.
3.6.2 Disk-Path Monitoring
Sun Cluster software supports disk-path monitoring (DPM). DPM improves the overall reliability of failover and switchover by reporting the failure of a secondary disk-path.
3.6.3 IP Multipath Monitoring
Each cluster node has its own IP network multipathing configuration, which can differ from the configuration on other cluster nodes. IP network multipathing monitors the following network communication failures:
- The transmit and receive path of the network adapter has stopped transmitting packets.
- The attachment of the network adapter to the link is down.
- The port on the switch does not transmit or receive packets.
- The physical interface in a group is not present at system boot.
Split brain occurs when the cluster interconnect between nodes is lost and the cluster becomes partitioned into subclusters, each of which believes that it is the only partition. A subcluster that is not aware of the other subclusters could cause conflicts in shared resources, such as duplicate network addresses and data corruption.
Amnesia occurs if all the nodes leave the cluster in staggered groups. An example is a two-node cluster with nodes A and B. If node A goes down, the configuration data in the Cluster Configuration Repository (CCR) is updated on node B only, and not node A. If node B goes down at a later time, and if node A is rebooted, node A will be running with old contents of the CCR. This state is called amnesia and might lead to running a cluster with stale configuration information.
Sun Cluster avoids split brain and amnesia by giving each node one vote and mandating a majority of votes for an operational cluster. A partition with the majority of votes has a quorum and is enabled to operate. This majority vote mechanism works well if more than two nodes are in the cluster. In a two-node cluster, a majority is two. If such a cluster becomes partitioned, an external vote enables a partition to gain quorum. This external vote is provided by a quorum device. A quorum device can be any disk that is shared between the two nodes.
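The majority rule can be illustrated with a little arithmetic. The sketch below is not Sun Cluster code. The function names are hypothetical, and it assumes one vote per node and one vote for the quorum device in a two-node cluster.

```shell
# Votes needed for quorum: a strict majority of all configured votes.
quorum_needed() {
    total_votes=$1
    echo $(( total_votes / 2 + 1 ))
}

# Succeeds if a partition holding $1 of $2 total votes has quorum.
has_quorum() {
    [ "$1" -ge "$(quorum_needed "$2")" ]
}

quorum_needed 2   # two nodes, no quorum device: both votes are needed
quorum_needed 3   # two nodes plus one quorum device vote: two of three
```

Under these assumptions a surviving node that also holds the quorum device has two of three votes and can continue to operate, while an isolated node with only its own vote cannot.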
The configuration files of a data service define the properties of the resource that represents the application to the Resource Group Manager (RGM). The RGM controls the disposition of the failover and scalable data services in the cluster. The RGM is responsible for starting and stopping the data services on selected nodes of the cluster in response to cluster membership changes. The RGM enables data service applications to utilize the cluster framework. The RGM controls data services as resources. These implementations are either supplied by Sun or created by a developer who uses a generic data service template, the Data Service Development Library API (DSDL API), or the Resource Management API (RMAPI). The cluster administrator creates and manages resources in containers that are called resource groups. RGM and administrator actions cause resources and resource groups to move between online and offline states.
3.10.2 Resources
A resource is an instance of a resource type that is defined cluster wide. The resource type enables multiple instances of an application to be installed on the cluster. When you initialize a resource, the RGM assigns values to application-specific properties and the resource inherits any properties on the resource type level. Data services utilize several types of resources. Applications such as Apache Web Server or Sun Java System Web Server utilize network addresses (logical hostnames and shared addresses) on which the applications depend. Application and network resources form a basic unit that is managed by the RGM.
3.10.4.1 Failover Data Services
Failover is the process by which the cluster automatically relocates an application from a failed primary node to a designated redundant secondary node. Failover applications have the following characteristics:
- Capable of running on only one node of the cluster
- Not cluster-aware
- Dependent on the cluster framework for high availability
If the fault monitor detects an error, it either attempts to restart the instance on the same node, or to start the instance on another node (failover), depending on how the data service has been configured. Failover services use a failover resource group, which is a container for application instance resources and network resources (logical hostnames). Logical hostnames are IP addresses that can be configured up on one node, and later, automatically configured down on the original node and configured up on another node. Clients might have a brief interruption in service and might need to reconnect after the failover has finished. However, clients are not aware of the change in the physical server that is providing the service.
3.10.4.2 Scalable Data Services
The scalable data service enables application instances to run on multiple nodes simultaneously. Scalable services use two resource groups. The scalable resource group contains the application resources and the failover resource group contains the network resources (shared addresses) on which the scalable service depends. The scalable resource group can be online on multiple nodes, so multiple instances of the service can be running simultaneously. The failover resource group that hosts the shared address is online on only one node at a time. All nodes that host a scalable service use the same shared address to host the service.
The cluster receives service requests through a single network interface (the global interface). These requests are distributed to the nodes, based on one of several predefined algorithms that are set by the load-balancing policy. The cluster can use the load-balancing policy to balance the service load between several nodes.
3.10.4.3 Parallel Applications
Sun Cluster systems provide an environment that shares parallel execution of applications across all the nodes of the cluster by using parallel databases. Sun Cluster Support for Oracle Parallel Server/Real Application Clusters is a set of packages that, when installed, enables Oracle Parallel Server/Real Application Clusters to run on Sun Cluster nodes. This data service also enables Sun Cluster Support for Oracle Parallel Server/Real Application Clusters to be managed by using Sun Cluster commands. A parallel application has been instrumented to run in a cluster environment so that the application can be mastered by two or more nodes simultaneously. In an Oracle Parallel Server/Real Application Clusters environment, multiple Oracle instances cooperate to provide access to the same shared database. The Oracle clients can use any of the instances to access the database. Thus, if one or more instances have failed, clients can connect to a surviving instance and continue to access the database.
Figure 2. Conceptual Architecture (diagram labels: Cluster Node A, Cluster Node B, Datastore Resource, Application Resource)
Additionally, TIBCO Runtime Agent is installed on each server node in the cluster and bound to the physical name and IP address of each server. TIBCO Runtime Agent is NOT under cluster control and is started at system boot via the usual init.d mechanism.
4.2.1 Control Scripts
When an EMS server is registered into the TIBCO Administration Domain, a control shell script is created as follows:
$TIBCO_HOME/ems/bin/domain/<domain name>/TIBCOServers-E4JMS_<port number>.sh
When a second server is added anywhere in the domain with the same port number, the control script is created with a different name as follows:
$TIBCO_HOME/ems/bin/domain/<domain name>/TIBCOServers-E4JMS-1_<port number>.sh
The COMPANY XYZ EMS installation package creates a Unix shell script, tibco_ems.sh, that is the main script used to start, stop or check an EMS service. It takes the EMS listening port number as its argument, together with the action to perform, and utilizes whichever of the above shell scripts is present to start or stop the given EMS server. Using the tibco_ems.sh script whenever interacting with EMS at the command line ensures that a consistent state will always be reported in TIBCO Administrator. The script is also used by the Sun Cluster software to check whether EMS is running and to start or stop it as necessary. The contents of the tibco_ems.sh script are listed in Appendix 8.1.
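The idea of using "whichever of the above shell scripts is present" can be sketched as follows. This is an illustration only and is not the tibco_ems.sh listing from Appendix 8.1. The function name find_control_script is hypothetical, and the directory argument stands in for $TIBCO_HOME/ems/bin/domain/<domain name>.

```shell
# Print the path of whichever domain control script exists for a port.
# $1 = directory holding the control scripts, $2 = EMS listen port.
# Returns non-zero if neither naming variant is present.
find_control_script() {
    dir=$1
    port=$2
    for name in "TIBCOServers-E4JMS_${port}.sh" "TIBCOServers-E4JMS-1_${port}.sh"; do
        if [ -f "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    return 1
}
```

A wrapper built this way behaves identically on both cluster nodes, regardless of which node received the -1 suffix when it was added to the domain.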
4.2.2 Configuration Files
Configuration files are created in advance for each server and contain the Queue, Topic and ACL definitions modeled in lower environments and promoted through change management procedures. These files are originally located in the $CONFIG_ROOT root folder as designated in the tibco.sh environment control file. Under the ems sub-folder there is a folder for each individual server containing the set of configuration files required by EMS. This is the location specified in the Domain Utility when adding the EMS server to the TIBCO Administration Domain.
$CONFIG_ROOT
    ems
        7020
            tibemsd.conf
            factories.conf
            users.conf
        7030
        7040
    hawk
Figure 3.
When first installed on the server, the configuration files have the following important characteristics:
- A copy physically resides on each server
- They point to logfiles local to each server
- They point to a datastore local to each server
- They contain the same server name, which is the name of the Business Domain (e.g. EMSMERCH)
- They contain the listen parameter tcp://<port number>, which binds EMS to the default interface for the given server
- They do NOT contain any Fault-Tolerant setup parameters
This configuration allows each server to be registered into the domain and tested before it is placed under Sun Cluster control. Once the Sun Cluster configuration has been created and tested, the following modifications are made:
- A single copy of the configuration files is copied to the Sun Cluster partition
- A symbolic link is created from the original config folder to the above folder
- The central tibemsd.conf file is edited to place the datastore on the Sun Cluster partition
- The central tibemsd.conf file is edited to place logfiles on the Sun Cluster partition
- The central tibemsd.conf file is edited to configure the FT Connection Factories
Note that the servers are NOT configured to be aware of each other in the traditional Fault-Tolerant setup configuration. The two servers are never allowed to run simultaneously. This is controlled by the configuration of the Sun Cluster software. The centralized configuration file factories.conf that controls the Connection Factory parameters is modified to add the reconnect_attempt_count and reconnect_attempt_delay parameters as follows:
[FTTopicConnectionFactory]
type = topic
url = tcp://<server a>:<port number>,tcp://<server b>:<port number>
reconnect_attempt_count = 60
reconnect_attempt_delay = 5

[FTQueueConnectionFactory]
type = queue
url = tcp://<server a>:<port number>,tcp://<server b>:<port number>
reconnect_attempt_count = 60
reconnect_attempt_delay = 5
The settings above will allow for the client libraries to attempt to re-connect every 5 seconds for up to 5 minutes. These settings will be subject to change with further experience and testing.
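The 5 minute figure follows from multiplying the two parameters. The short sketch below makes the arithmetic explicit. It takes the delay to be in seconds, as this document states; that unit is an assumption to confirm against the documentation for the EMS release in use. The function name is hypothetical.

```shell
# Total time the client keeps retrying: attempts multiplied by delay.
# The delay is treated as seconds, following the text of this document.
reconnect_window_seconds() {
    count=$1
    delay=$2
    echo $(( count * delay ))
}

# 60 attempts, 5 seconds apart: 300 seconds, i.e. 5 minutes.
reconnect_window_seconds 60 5
```

The resulting window should comfortably exceed the time the cluster needs to detect a failure and bring the Server Instance up on the other node.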
- Datastore Resource: this is the disk partition that will be mounted on only the single active primary node for each EMS server
- Application Resource: this defines the application in terms of how to start/stop it and how to check its status
For example, the following Resource Groups were created to support the Merchandising and Supply Chain Business Domains:
Resource Group     Resources                Description
ctibco_merch_rg    lh_cert_tibcomerch       Logical Host resource for Merchandising
                   HA_ctibco_merch_store    Datastore resource for Merchandising
                   ctibco_merch_app         Application Resource for Merchandising
ctibco_suppch_rg   lh_cert_tibcosuppch      Logical Host resource for Supply Chain
                   HA_ctibco_suppch_store   Datastore resource for Supply Chain
                   ctibco_suppch_app        Application Resource for Supply Chain
Figure 4.
These resources are set up such that the ctibco_merch_rg items are active on node a and inactive on node b, and the ctibco_suppch_rg items are active on node b and inactive on node a. When creating an Application Resource, Sun Cluster requires three parameters:
- A script to start the Application
- A script to stop the Application
- A script to return 1 if the application is running correctly and 0 otherwise
These requirements are fulfilled by the tibco_ems_cluster.sh script, which provides all three services via a single command line argument which is either start, stop or check. It utilizes the tibco_ems.sh script and translates the output of that script into the format required by Sun Cluster. The tibco_ems_cluster.sh script is listed in full in Appendix 8.2 and is installed by the Unix Services team as part of the cluster Resource Groups creation. During the installation they will rename the script as appropriate, e.g. tibco_ems_merch.sh, and change the port number as required.
The cluster software is configured to automatically start the Application Resource items on their primary cluster nodes at startup. It is also configured to check them at regular intervals and attempt to restart them if they are not running. If the application will not restart after a given number of attempts, it is failed over to the other cluster node. The monitoring interval and number of restart attempts are configurable and were set to 15 seconds and 3 restart attempts during testing.
It is also important to note that the Application Resource is created as Non-network aware. This means that the cluster software will not attempt to assess the status of the EMS servers by connecting to a TCP port at regular intervals. Instead it relies on the information returned from the check script as configured above. A full listing of the Sun Cluster configuration as returned by the command scstat -p can be found in Appendix 8.3.
Note that the EMS listen port is not bound to the logical or virtual IP address of the logical host. Due to the limitations of the current EMS Administrator Plugin, the server must be listening on the default interface of the host on which it is running in order to be administered through TIBCO Administrator. This limitation may be removed in future releases.
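The restart-then-failover policy described above is implemented by the Sun Cluster software itself. The following sketch only illustrates the decision it makes after a failed check, using the limit of 3 restart attempts from testing. The function name next_action is hypothetical.

```shell
# Decide what happens after a failed check.
# $1 = restart attempts already made, $2 = configured maximum.
next_action() {
    attempts=$1
    max_restarts=$2
    if [ "$attempts" -lt "$max_restarts" ]; then
        # Still within the limit: try again on the same node.
        echo restart
    else
        # Limit reached: move the resource group to the other node.
        echo failover
    fi
}

next_action 1 3
next_action 3 3
```

With a 15 second monitoring interval, a lower restart limit shortens the time to failover at the cost of moving the service for faults that a further local restart might have cleared.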
Figure 5.
Figure 6. Display after failover of the 7020 server instance to its secondary cluster node
Note: For consistency, it is recommended that the proposed Primary Server node be added first and the secondary node afterwards, because TIBCO Administrator adds the -1 suffix to the second server. This ensures that whenever an operator sees a service with a -1 suffix running, they know that it is running on a secondary server. Should a failover occur, the TIBCO Administrator display will automatically update to show the correct status of the two servers. This ensures that operators always receive coherent information on the status of the servers regardless of whether they use TIBCO Administrator, the Sun Cluster Manager or the command line shell scripts.
4.5 Hawk
TIBCO Hawk is not used to control the lifecycle of the EMS servers as they are controlled by the cluster software.
However, Hawk can be and is used to monitor the health of the cluster nodes and to notify clients if for some reason the EMS Service becomes completely unavailable.
These rulebases are deployed into the TIBCO Runtime Agents running on each cluster node.
5 Installation Steps
The step-by-step instruction guide can be found in the Message Server Installation Guide. The following sections describe the reason for each step and the order in which they must be completed.
5.1 Pre-Requisites
5.1.1 Configuration Files
Prior to installation, a set of configuration files will be created which will control the installation of all the COMPANY XYZ Packages based on the server's purpose. These files are un-tarred into the $CONFIG_ROOT folder and will contain (amongst other things) a folder for each EMS Service to be created, denoted by its EMS port number. See section 4.2.2 for details.
5.1.2
Before the Message Server Package can be installed the Systems Management Package (consisting of TIBCO Runtime Agent and Domain Membership) must be installed. The Systems Management Package will create the TIBCO Runtime Agent for the server nodes based on the configuration files loaded on the machine in the previous step.
The password will be set to the current administrator password for the environment. This will be set in the EMS Servers in a subsequent step. It is imperative that this information, especially the password, be specified correctly as the only way of changing it is to remove the EMS Server from the domain and add it again.
2005 TIBCO Software Inc. All Rights Reserved. TIBCO Confidential and Proprietary.
Now stop the Server Instance from the command line as follows:
$ ./tibco_ems.sh <port number> stop
TIBCO Enterprise Messaging Server (<port number>) stopping
Confirm that the Administrator eventually shows the Server Instance as Stopped. Repeat this process for each Server Instance on each Cluster Node.
5.6.1
For a given Server Instance, on the currently active Cluster Node, move the configuration folder to the mounted folder for that Server Instance and replace it with a link to the folder's new location. For example, on the testing servers:
$ cd /opt/tibco/config/ems
$ mkdir /var/ems_data/ems_merch/config
$ mv 7020/* /var/ems_data/ems_merch/config/
$ rmdir 7020
$ ln -s /var/ems_data/ems_merch/config 7020
On the currently inactive Cluster Node, simply delete the existing configuration files and create a link to where the folder will be mounted. Even though the folder is not currently mounted, the link will be valid when the Cluster Software fails the Server Instance over along with its Data Store Resource.
$ cd /opt/tibco/config/ems
$ rm -rf 7020
$ ln -s /var/ems_data/ems_merch/config 7020
Repeat this process for each pair of EMS Server Instances using the correct port numbers and mount point folders.
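The active-node relocation above can be rehearsed safely before touching a cluster node. The following is a minimal sketch, not the project's installation script: the function name relocate_config and the temporary folder layout are assumptions made for illustration, and only the mkdir, mv, rmdir and ln -s steps come from the procedure itself.

```shell
#!/bin/sh
# Sketch only: rehearse the folder relocation inside a throwaway directory.
# relocate_config and the demo paths are hypothetical names, not part of the
# installation packages.

relocate_config() {
    # $1 = folder containing the per-port configuration folders
    # $2 = port number (name of the configuration folder)
    # $3 = destination folder on the mounted partition
    src="$1/$2"
    dest=$3

    # Nothing to do if the folder has already been replaced by a link.
    if [ -h "$src" ]; then
        return 0
    fi

    mkdir -p "$dest" || return 1
    if [ -d "$src" ]; then
        # Same effect as "mv 7020/* <dest>/" followed by "rmdir 7020".
        for f in "$src"/*; do
            if [ -e "$f" ]; then
                mv "$f" "$dest"/ || return 1
            fi
        done
        rmdir "$src" || return 1
    fi
    ln -s "$dest" "$src"
}

# Demo against temporary folders standing in for the real locations.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/config/ems/7020" "$demo_root/ems_data/ems_merch"
echo "listen = tcp://7020" > "$demo_root/config/ems/7020/tibemsd.conf"

relocate_config "$demo_root/config/ems" 7020 "$demo_root/ems_data/ems_merch/config"
ls -l "$demo_root/config/ems"
```

Because the function returns early when the folder is already a link, running it twice leaves the layout unchanged. Like the original commands, it does not move hidden files.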
5.6.2
On the currently active Cluster Node for a given Server Instance, edit the tibemsd.conf file and set the store parameter to point to the desired folder under the mounted partition.
########################################################################
# Persistent Storage.
store = /var/tibco_ems/ems_merch/datastore
EMS will create the datastore folder at startup if it does not already exist.
5.6.3
On the currently active Cluster Node for a given Server Instance, edit the tibemsd.conf file and set the logfile parameter to point to the desired folder under the mounted partition.
#######################################################################
# Log file name and tracing parameters.
logfile = /var/tibco_ems/ems_merch/logs
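The store and logfile edits described above can also be applied non-interactively. This is a minimal sketch under stated assumptions: the helper name set_conf_param is hypothetical, the starting values in the sample file are invented for the demonstration, and the file is assumed to use simple "name = value" lines.

```shell
#!/bin/sh
# Sketch only: set a "name = value" parameter in a tibemsd.conf-style file.
# set_conf_param is a hypothetical helper, not part of TIBCO EMS.

set_conf_param() {
    # $1 = configuration file, $2 = parameter name, $3 = new value
    conf_file=$1
    key=$2
    value=$3
    tmp_file="$conf_file.tmp.$$"

    if grep -q "^[[:space:]]*$key[[:space:]]*=" "$conf_file"; then
        # Rewrite the existing assignment and keep every other line as it is.
        sed "s|^[[:space:]]*$key[[:space:]]*=.*|$key = $value|" "$conf_file" > "$tmp_file" &&
            mv "$tmp_file" "$conf_file"
    else
        # Parameter not present yet: append it.
        echo "$key = $value" >> "$conf_file"
    fi
}

# Demo on a temporary file; the initial values are placeholders.
conf=$(mktemp)
cat > "$conf" <<'EOF'
# Persistent Storage.
store = datastore
# Log file name and tracing parameters.
logfile = logs/tibemsd.log
EOF

set_conf_param "$conf" store /var/tibco_ems/ems_merch/datastore
set_conf_param "$conf" logfile /var/tibco_ems/ems_merch/logs
cat "$conf"
```

The sed expression uses | as its delimiter so that the slashes in the folder paths need no escaping.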
5.6.4
Although the configuration files should be pre-built with a blank Admin password, it is worthwhile confirming this as follows. On the currently active Cluster Node for a given Server Instance, edit the users.conf file and identify the following line:
admin:<misc text>:"Administrator"
Remove any text between the two colons to leave the line as follows:
admin::"Administrator"
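The same check can be scripted. The sketch below assumes the users.conf line layout shown above (name, password, quoted description, separated by colons) and that the password text itself contains no colon; the helper name blank_admin_password and the second sample user are hypothetical.

```shell
#!/bin/sh
# Sketch only: blank the admin password field in a users.conf-style file.
# blank_admin_password is a hypothetical helper name.

blank_admin_password() {
    # $1 = path to the users.conf file
    users_file=$1
    tmp_file="$users_file.tmp.$$"
    # Drop whatever sits between the first two colons on the admin line.
    sed 's/^admin:[^:]*:/admin::/' "$users_file" > "$tmp_file" &&
        mv "$tmp_file" "$users_file"
}

# Demo on a temporary file with placeholder content.
users=$(mktemp)
cat > "$users" <<'EOF'
admin:misc-text:"Administrator"
operator:secret:"Operator"
EOF

blank_admin_password "$users"
cat "$users"
```

Only the line beginning with admin: is rewritten; other users keep their entries.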
5.6.5
Although the configuration files are pre-built, it is worthwhile confirming that the following settings are correct for each Server Instance:
The listen parameter in tibemsd.conf is set to tcp://<port number>
The Fault-tolerant Setup parameters in tibemsd.conf are all empty
The server parameter in tibemsd.conf is correct

Server Name on the General Page
Log File Name on the General Page
Server Store Directory on the Server Page

FTQueueConnectionFactory on the Resources Page
FTTopicConnectionFactory on the Resources Page
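The tibemsd.conf checks listed in this section lend themselves to a quick scripted confirmation. The sketch below is illustrative only: check_server_conf is a hypothetical helper, the sample server name is invented, and it assumes that the fault-tolerant parameters are the ones whose names begin with ft_ (for example ft_active).

```shell
#!/bin/sh
# Sketch only: confirm the listen and fault-tolerant settings of one instance.
# check_server_conf is a hypothetical helper; the ft_ prefix is an assumption.

check_server_conf() {
    # $1 = path to tibemsd.conf, $2 = expected port number
    conf_file=$1
    port=$2
    status=0

    listen=$(sed -n 's/^[[:space:]]*listen[[:space:]]*=[[:space:]]*//p' "$conf_file")
    if [ "$listen" != "tcp://$port" ]; then
        echo "listen is '$listen', expected 'tcp://$port'"
        status=1
    fi

    # Report any ft_* parameter that has a non-empty value.
    if grep -q '^[[:space:]]*ft_[a-z_]*[[:space:]]*=[[:space:]]*[^[:space:]]' "$conf_file"; then
        echo "fault-tolerant parameters are not empty"
        status=1
    fi
    return $status
}

# Demo: one file that passes and one that does not (placeholder values).
good=$(mktemp)
bad=$(mktemp)
printf 'server = EMS-7020\nlisten = tcp://7020\nft_active =\n' > "$good"
printf 'server = EMS-7020\nlisten = tcp://7222\nft_active = tcp://7021\n' > "$bad"

check_server_conf "$good" 7020 && echo "good: OK"
check_server_conf "$bad" 7020 || echo "bad: needs attention"
```

The server parameter is not checked here because the expected name depends on the naming convention used for each instance.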
6 Testing Results
6.1 Test Clients
6.1.1 Java Client (Connection Factory)
This client connected to the EMS Server Instance via a Connection Factory URL of the form:
tibjmsnaming://<server a>:<port>,tibjmsnaming://<server b>:<port>
The test program tibjmsFactoryQueueSender.java was created from the existing sample program tibjmsMsgProducer.java and was modified to obtain its queue connection from the connection factory as follows:
String providerContextFactory = "com.tibco.tibjms.naming.TibjmsInitialContextFactory";
String defaultTopicConnectionFactory = "FTTopicConnectionFactory";
String defaultQueueConnectionFactory = "FTQueueConnectionFactory";
String providerUrls = "tibjmsnaming://localhost:7222,tibjmsnaming://localhost:7222";

Hashtable env = new Hashtable ();
env.put ( Context.INITIAL_CONTEXT_FACTORY, providerContextFactory );
env.put ( Context.PROVIDER_URL, providerUrls );

InitialContext jndiContext = new InitialContext ( env );
QueueConnectionFactory factory =
    (QueueConnectionFactory)jndiContext.lookup ( defaultQueueConnectionFactory );
QueueConnection connection = factory.createQueueConnection ( userName, password );
In addition, the message sending code was modified to send the same test message 1000 times in a loop. The full listing is contained in Appendix 8.4. During testing, the program prints the following output:
Sent message(1):
Sent message(2):
Sent message(3):
During the failover from one Cluster Node to the other, the output pauses, then continues on uninterrupted.
6.1.2
This client connects to the EMS Server Instance via a Fault-Tolerant URL of the form:
tcp://<server a>:<port>,tcp://<server b>:<port>
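Both URL forms used by the test clients are comma-separated lists with one entry per Cluster Node. As a small illustration, the sketch below assembles such a list; the function name ft_url and the host names servera and serverb are hypothetical.

```shell
#!/bin/sh
# Sketch only: build a comma-separated URL list, one entry per cluster node.
# ft_url and the host names used below are hypothetical.

ft_url() {
    # $1 = scheme (tcp or tibjmsnaming), $2 = port, remaining args = hosts
    scheme=$1
    port=$2
    shift 2
    url=""
    for host in "$@"; do
        # Prefix a comma only when the list already has an entry.
        url="${url:+$url,}$scheme://$host:$port"
    done
    echo "$url"
}

ft_url tcp 7020 servera serverb
ft_url tibjmsnaming 7020 servera serverb
```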
The sample program tibjmsMsgProducer.java was modified only slightly, to incorporate the message sending loop described in the previous section and to increase the default Fault-Tolerant connection retry and timeout settings, as follows:
String reconnect = new String ( "60, 5000" );
Tibjms.setReconnectAttempts ( reconnect );
System.out.println ( "After change for reconnections: " + Tibjms.getReconnectAttempts () );
ConnectionFactory factory = new com.tibco.tibjms.TibjmsConnectionFactory ( serverUrl );
The full listing is contained in Appendix 8.5. This test client behaved identically to the one using the Connection Factory URL.
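To see why the raised reconnect setting comfortably covers a failover, the arithmetic can be worked through. This sketch assumes the string "60, 5000" is read as an attempt count followed by a delay in milliseconds; that interpretation is an assumption here, not something stated in this document.

```shell
#!/bin/sh
# Sketch only: estimate the longest reconnect window implied by "60, 5000".
# Assumes the format is "<attempts>, <delay in milliseconds>".

reconnect="60, 5000"
attempts=$(echo "$reconnect" | cut -d, -f1 | tr -d ' ')
delay_ms=$(echo "$reconnect" | cut -d, -f2 | tr -d ' ')

# Upper bound on the time a client keeps retrying, in seconds.
window_s=$(( attempts * delay_ms / 1000 ))
echo "Client keeps retrying for up to ${window_s}s"
```

Under that assumption the window is 300 seconds, well above the pauses reported in the failover tests.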
6.1.3 BW Client
This test client consisted of a BW process using a timer instance to create and send a JMS message to a test queue every second. A second process subscribed to the same queue and pulled off the message. The number of process instances created was monitored via TIBCO Administrator. No errors were seen during the failover testing.
6.2.2 Process Failure
In order to simulate a real-world problem, the executable permissions were removed from the $TIBCO_HOME/ems/bin/tibemsd file and the running process was terminated with a kill signal. After going through its retry loop, the Cluster Software failed the process over to the other Cluster Node. The test clients were paused for a longer period of time, approximately 60 seconds, which is three retry periods of 15 seconds plus the failover time of 15 seconds.
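The 60-second figure follows directly from the retry settings quoted above. A minimal sketch of the calculation, with variable names chosen here for illustration:

```shell
#!/bin/sh
# Sketch only: expected client pause for the process failure test.
retry_count=3        # retry periods before the cluster gives up locally
retry_period=15      # seconds per retry period
failover_time=15     # seconds to fail over to the other node

pause=$(( retry_count * retry_period + failover_time ))
echo "Expected client pause: ${pause}s"
```

Changing the retry count or period in the cluster configuration would shift the expected pause accordingly.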
6.2.3 Machine Failure
In an effort to simulate a catastrophic machine failure, the active Cluster Node for one of the EMS Server Instances was forcefully rebooted. The Cluster Software detected the failure after the configured timeout period and migrated the EMS Server Instance to the other Node. The test clients were paused for approximately 30 seconds.
7 Conclusions
7.1 Failover time
The low failover time (circa 15 seconds), in conjunction with the uninterrupted operation of clients, makes the use of more expensive distributed lock manager systems unnecessary at COMPANY XYZ. It is felt that this solution meets the business needs for COMPANY XYZ at the present time. Other parameters will affect the failover time, such as:
Size of datastore file system
Number of messages in datastore
Number of clients attempting to reconnect at failover
However, these factors are common to both a clustered and distributed locking solution and are therefore excluded from the decision making process.
SSL parameters required are unique to each physical server
A specific interface must be entered into the listen parameter
8 Appendices
8.1 tibco_ems.sh
#!/bin/sh
# For all Unix platforms
#
# #########################################################################
# Boot script for Unix platforms
# This script takes two arguments: the EMS port number and
# "start", "stop" or "check".
#
# #########################################################################
# Copyright 2004 TIBCO Software Inc. All rights reserved.

TIBCO_ROOT=/opt/tibco
export TIBCO_ROOT

# All environment variables are set in tibco.sh. Can't proceed further
# if file is missing
if [ -f $TIBCO_ROOT/tibco.sh ]; then
    . $TIBCO_ROOT/tibco.sh
else
    echo "File not found $TIBCO_ROOT/tibco.sh"
    exit 1
fi

# Check that the correct number of options have been passed
if [ $# -ne 2 ]
then
    echo "Usage: $0 [EMS Port] [start|stop|check]" 1>&2
    exit 1
fi

EMS_PORT=$1
EMS_BIN=$TIBCO_ROOT/ems/bin/domain/$TIBCO_DOMAIN_NAME

# Find the script that controls the server on the given Port Number
# Secondary servers have '-1', '-2' etc inserted in the script name
# (the pattern is quoted so that the shell does not expand it before find runs)
SCRIPT_FILE=`/usr/bin/find $EMS_BIN -name "TIBCOServers-E4JMS*_$EMS_PORT.sh" -print`

# Check that an EMS Server has been installed for this Port Number
if [ -z "$SCRIPT_FILE" ]; then
    echo "EMS Server for port $EMS_PORT not installed"
    exit 1
fi

NOHUP="nohup"
OS_TYPE=`uname -a | awk '{print $1}'`
case $OS_TYPE in
    'SunOS')
        ulimit -n 256
        ;;
    *)
        ;;
esac

# #########################################################################
#
# This function checks for a running process.
#
;; esac
case "$2" in
# #########################################################################
#
# Start TIBCO Enterprise Messaging Server
#
# #########################################################################
'start')
    procname="$EMS_PORT/tibemsd.conf"
    pid=`findPid "$procname"`
    if [ "$pid" != "" ]; then
        echo "TIBCO Enterprise Messaging Server ($EMS_PORT) already running"
    else
        cd $CONFIG_ROOT/ems
        if [ -x $SCRIPT_FILE ]; then
            echo "TIBCO Enterprise Messaging Server ($EMS_PORT) starting..."
            $NOHUP $SCRIPT_FILE >/dev/null 2>&1 &
            echo "Started TIBCO Enterprise Messaging Server ($EMS_PORT)"
        else
            echo "EMS Server for port $EMS_PORT not installed"
        fi
    fi
    ;;
# #########################################################################
#
# Stop TIBCO Enterprise Messaging Server
#
# #########################################################################
'stop')
    procname="$EMS_PORT/tibemsd.conf"
    pid=`findPid "$procname"`
    if [ "$pid" != "" ]; then
        echo "TIBCO Enterprise Messaging Server ($EMS_PORT) stopping."
        kill $pid
    else
        echo "TIBCO Enterprise Messaging Server ($EMS_PORT) not running"
    fi
    ;;
# #########################################################################
#
# Check if TIBCO Enterprise Messaging Server is running
#
# #########################################################################
'check')
8.2 tibco_ems_cluster.sh
#!/bin/sh
# Cluster control script for TIBCO EMS
# Greg Mabrito - Oct 25, 2004

TIBCO_HOME="/opt/tibco"
TIBCO_SCRIPTS="$TIBCO_HOME/scripts"
TIBCO_EMS_PORT="7020"

# process command line parameters, if any
case "$1" in
start)
    su - tibco -c "$TIBCO_SCRIPTS/tibco_ems.sh $TIBCO_EMS_PORT start"
    ;;
stop)
    su - tibco -c "$TIBCO_SCRIPTS/tibco_ems.sh $TIBCO_EMS_PORT stop"
    ;;
check)
    RET_VAL=`su - tibco -c "$TIBCO_SCRIPTS/tibco_ems.sh $TIBCO_EMS_PORT check" | grep "running with pid"`
    if [ -n "$RET_VAL" ] ; then
        exit 0
    else
        exit 1
    fi
    ;;
*)
    echo "Usage: $0 {start|stop|check}"
    exit 1
    ;;
esac
exit 0
-------------------------------------------------------------------
-- Cluster Transport Paths --

Endpoint           Endpoint           Status
--------           --------           ------
sys99115:qfe1      sys99116:qfe1      Path online
sys99115:eri0      sys99116:eri0      Path online

-------------------------------------------------------------------
-- Quorum Summary --

Quorum votes possible: 3
Quorum votes needed:   2
Quorum votes present:  3

-- Quorum Votes by Node --

Node Name      Present  Possible  Status
---------      -------  --------  ------
sys99115       1        1         Online
sys99116       1        1         Online

-- Quorum Votes by Device --

                 Device Name            Present  Possible  Status
                 -----------            -------  --------  ------
Device votes:    /dev/did/rdsk/d8s2     1        1         Online

-------------------------------------------------------------------
-- Device Group Servers --

Device Group               Primary      Secondary
------------               -------      ---------
tibco_ems_data_merch       sys99115     sys99116
tibco_ems_data_suppch      sys99116     sys99115

-------------------------------------------------------------------
-- Resource Groups and Resources --

             Group Name           Resources
             ----------           ---------
Resources:   ctibco_merch_rg      lh_cert_tibcomerch HA_ctibco_merch_store ctibco_merch_app
Resources:   ctibco_suppch_rg     lh_cert_tibcosuppch HA_ctibco_suppch_store ctibco_suppch_app

-- Resource Groups --

          Group Name           Node Name    State
          ----------           ---------    -----
Group:    ctibco_merch_rg      sys99115     Online
Group:    ctibco_merch_rg      sys99116     Offline
Group:    ctibco_suppch_rg     sys99116     Online
Group:    ctibco_suppch_rg     sys99115     Offline
Resource:   HA_ctibco_merch_store     sys99115
Resource:   HA_ctibco_merch_store     sys99116
Resource:   ctibco_merch_app          sys99115
Resource:   ctibco_merch_app          sys99116

Resource:   lh_cert_tibcosuppch       sys99116
Resource:   lh_cert_tibcosuppch       sys99115
Resource:   HA_ctibco_suppch_store    sys99116
Resource:   HA_ctibco_suppch_store    sys99115
Resource:   ctibco_suppch_app         sys99116
Resource:   ctibco_suppch_app         sys99115

-------------------------------------------------------------------
-- IPMP Groups --

               Node Name    Group      Status    Adapter   Status
               ---------    -----      ------    -------   ------
IPMP Group:    sys99115     ipmp827    Online    qfe0      Online
IPMP Group:    sys99116     ipmp827    Online    qfe0      Online
8.4 tibjmsFactoryQueueSender.java
import javax.jms.*;
import javax.naming.*;
import java.util.*;
import com.tibco.tibjms.Tibjms;

public class tibjmsFactoryQueueSender implements ExceptionListener
{
    String userName = null;
    String password = null;
    String queueName = "queue.sample";
    Vector data = new Vector ();

    static final String providerContextFactory = "com.tibco.tibjms.naming.TibjmsInitialContextFactory";
    static final String defaultProviderURLs = "tibjmsnaming://localhost:7222, tibjmsnaming://localhost:7222";
    static final String defaultTopicConnectionFactory = "FTTopicConnectionFactory";
    static final String defaultQueueConnectionFactory = "FTQueueConnectionFactory";

    String providerUrls = defaultProviderURLs;

    public tibjmsFactoryQueueSender ( String[] args )
    {
        parseArgs ( args );

        /* print parameters */
        System.out.println ( "\n------------------------------------------------------------------------" );
        System.out.println ( "tibjmsQueueSender SAMPLE" );
        System.out.println ( "------------------------------------------------------------------------" );
        System.out.println ( "Provider URL................. " + providerUrls );
        System.out.println ( "User......................... " + ( userName != null ? userName:"(null)" ) );
        System.out.println ( "Queue........................ " + queueName );
        System.out.println ( "------------------------------------------------------------------------\n" );

        if ( queueName == null ) {
            System.err.println ( "Error: must specify queue name" );
            usage ();
        }
        if ( 0 == data.size () ) {
            System.err.println ( "Error: must specify at least one message text" );
            usage ();
        }
        System.err.println ( "Publishing into queue: '" + queueName + "'\n" );

        try {
            /*
             * Init JNDI Context.
             */
            Hashtable env = new Hashtable ();
            env.put ( Context.INITIAL_CONTEXT_FACTORY, providerContextFactory );
            env.put ( Context.PROVIDER_URL, providerUrls );
            if ( null != userName ) {
                env.put ( Context.SECURITY_PRINCIPAL, userName );
                if ( null != password ) {
                    env.put ( Context.SECURITY_CREDENTIALS, password );
                }
            }

            InitialContext jndiContext = new InitialContext ( env );
            QueueConnectionFactory factory =
                (QueueConnectionFactory)jndiContext.lookup ( defaultQueueConnectionFactory );
            QueueConnection connection = factory.createQueueConnection ( userName, password );
            connection.setExceptionListener ( this );
            Tibjms.setExceptionOnFTSwitch ( true );
            QueueSession session = connection.createQueueSession ( false, javax.jms.Session.AUTO_ACKNOWLEDGE );

            /*
             * Use createQueue() to enable sending into dynamic queues.
             */
            javax.jms.Queue queue = session.createQueue ( queueName );
            QueueSender sender = session.createSender ( queue );

            javax.jms.TextMessage message = session.createTextMessage ();
            String text = (String)data.elementAt ( 0 );
            message.setText ( text );

            /* publish messages */
            for ( int i=0; i < 1000 ; i++ ) {
                sender.send ( message );
                System.err.println ( "Sent message(" + i + "): " + text );
                try {
                    Thread.sleep ( 1000 );
                } catch ( Exception e ) {
                }
            }
            connection.close ();
        } catch ( NamingException e ) {
            e.printStackTrace ();
            System.exit ( 0 );
        } catch ( JMSException e ) {
            e.printStackTrace ();
            System.exit ( 0 );
        }
    }

    public static void main ( String args[] )
    {
        tibjmsFactoryQueueSender t = new tibjmsFactoryQueueSender ( args );
    }

    void usage ()
    {
        System.err.println ( "\nUsage: java tibjmsQueueSender [options]" );
        System.err.println ( " <message-text1 ... message-textN>" );
        System.err.println ( "" );
        System.err.println ( " where options are:" );
    void parseArgs ( String[] args )
    {
        int i = 0;
        while ( i < args.length ) {
            if ( args[i].compareTo ( "-provider" ) == 0 ) {
                if ( (i+1) >= args.length ) usage ();
                providerUrls = args[i+1];
                i += 2;
            } else if ( args[i].compareTo ( "-queue" ) == 0 ) {
                if ( (i+1) >= args.length ) usage ();
                queueName = args[i+1];
                i += 2;
            } else if ( args[i].compareTo ( "-user" ) == 0 ) {
                if ( (i+1) >= args.length ) usage ();
                userName = args[i+1];
                i += 2;
            } else if ( args[i].compareTo ( "-password" ) == 0 ) {
                if ( (i + 1) >= args.length ) usage ();
                password = args[i+1];
                i += 2;
            } else if ( args[i].compareTo ( "-help" ) == 0 ) {
                usage ();
            } else if ( args[i].compareTo ( "-help-ssl" ) == 0 ) {
                tibjmsUtilities.sslUsage ();
            } else if ( args[i].startsWith ( "-ssl" ) ) {
                i += 2;
            } else {
                data.addElement ( args[i] );
                i++;
            }
        }
    }

    public void onException ( JMSException exception )
    {
        String strErrCode = exception.getErrorCode ();
        String strFTSwitch = "FT-SWITCH";
        if ( true == strErrCode.startsWith ( strFTSwitch ) ) {
            String strNewServer = strErrCode.substring ( strFTSwitch.length () + 2 );
            System.out.println ( "FT Connection switched to: " + strNewServer );
        } else {
            exception.printStackTrace ();
        }
    }
}
8.5 tibjmsMsgProducer.java
import java.util.Vector;
import javax.jms.*;
import javax.naming.*;
import com.tibco.tibjms.Tibjms;

public class tibjmsMsgProducer implements ExceptionListener
{
    /*-----------------------------------------------------------------------
     * Parameters
     *----------------------------------------------------------------------*/
    String serverUrl = null;
    String userName = null;
    String password = null;
    String name = "topic.sample";
    Vector data = new Vector();
    boolean useTopic = true;

    /*-----------------------------------------------------------------------
     * Variables
     *----------------------------------------------------------------------*/
    Connection connection = null;
    Session session = null;
    MessageProducer msgProducer = null;
    Destination destination = null;

    public tibjmsMsgProducer ( String[] args )
    {
        parseArgs ( args );

        try {
            tibjmsUtilities.initSSLParams ( serverUrl, args );
        } catch ( JMSSecurityException e ) {
            System.err.println ( "JMSSecurityException: "+e.getMessage ()+", provider=" + e.getErrorCode () );
            e.printStackTrace ();
            System.exit ( 0 );
        }

        /* print parameters */
        System.err.println ( "\n------------------------------------------------------------------------" );
        System.err.println ( "tibjmsMsgProducer SAMPLE" );
        System.err.println ( "------------------------------------------------------------------------" );
        System.err.println ( "Server....................... " +((serverUrl!=null)?serverUrl:"localhost" ) );
        System.err.println ( "User......................... " +((userName !=null)?userName : "(null)" ) );
        System.err.println ( "Destination.................. " + name );
        System.err.println ( "Message Text................. " );
        for ( int i = 0 ; i < data.size () ; i++ ) {
            System.err.println ( data.elementAt ( i ) );
        }
        System.err.println ( "------------------------------------------------------------------------\n" );

        try {
            if ( data.size () == 0 ) {
                System.err.println ( "***Error: must specify at least one message text\n" );
                usage ();
            }

            /* Increase FT Reconnection Settings */
            String reconnect = new String ( "60, 5000" );
            Tibjms.setReconnectAttempts ( reconnect );
            System.out.println ( "After change for reconnections: " + Tibjms.getReconnectAttempts () );

            System.err.println ( "Publishing to destination '" + name + "'\n" );

            ConnectionFactory factory = new com.tibco.tibjms.TibjmsConnectionFactory ( serverUrl );
            connection = factory.createConnection ( userName, password );
            connection.setExceptionListener ( this );
            Tibjms.setExceptionOnFTSwitch ( true );

            /* create the session */
            session = connection.createSession ( false, javax.jms.Session.AUTO_ACKNOWLEDGE );

            /* create the destination */
    /*-----------------------------------------------------------------------
     * usage
     *----------------------------------------------------------------------*/
    private void usage ()
    {
        System.err.println ( "\nUsage: java tibjmsMsgProducer [options] [ssl options]" );
        System.err.println ( " <message-text-1>" );
        System.err.println ( " [<message-text-2>] ..." );
        System.err.println ( "\n" );
        System.err.println ( " where options are:" );
        System.err.println ( "" );
        System.err.println ( " -server <server URL> - EMS server URL, default is local server" );
        System.err.println ( " -user <user name> - user name, default is null" );
        System.err.println ( " -password <password> - password, default is null" );
        System.err.println ( " -topic <topic-name> - topic name, default is \"topic.sample\"" );
        System.err.println ( " -queue <queue-name> - queue name, no default" );
        System.err.println ( " -help-ssl - help on ssl parameters" );
        System.exit ( 0 );
    }

    /*-----------------------------------------------------------------------
     * parseArgs
     *----------------------------------------------------------------------*/
    void parseArgs(String[] args)
    {
        int i=0;
        while(i < args.length) {
            if (args[i].compareTo("-server")==0) {
                if ((i+1) >= args.length) usage();
                serverUrl = args[i+1];
                i += 2;
            } else if (args[i].compareTo("-topic")==0) {
                if ((i+1) >= args.length) usage();
                name = args[i+1];
                i += 2;
            } else if (args[i].compareTo("-queue")==0) {
                if ((i+1) >= args.length) usage();
                name = args[i+1];
} }
    /*-----------------------------------------------------------------------
     * main
     *----------------------------------------------------------------------*/
    public static void main ( String[] args )
    {
        tibjmsMsgProducer t = new tibjmsMsgProducer ( args );
    }

    public void onException ( JMSException exception )
    {
        String strErrCode = exception.getErrorCode ();
        String strFTSwitch = "FT-SWITCH";
        if ( true == strErrCode.startsWith ( strFTSwitch ) ) {
            String strNewServer = strErrCode.substring ( strFTSwitch.length () + 2 );
            System.out.println ( "FT Connection switched to: " + strNewServer );
        } else {
            exception.printStackTrace ();
        }
    }
}