Performing a Takeover on the Secondary System

Objective

After completing this lesson, you will be able to perform a takeover on the secondary system.

Perform Takeover

During a takeover, you switch your active system from the current primary system to the secondary system.

If your primary data center is not available, for example because of a disaster or planned downtime, and a decision has been made to fail over to the secondary data center, you can perform a takeover on your secondary system.

In addition to the tools that may be used to monitor the overall system status when system replication is enabled, a script is provided with SAP HANA, which helps you decide when a takeover should be performed.

We recommend that you use third-party, external tools to check if hosts, the network, and the data center are still available.

In addition, a script called landscapeHostConfiguration.py is provided so that SAP HANA itself can communicate the status of the primary system. It can communicate the following statuses:

  • SAP HANA is OK.

  • SAP HANA will be OK after a host auto-failover, for example.

  • Not enough instances are started and a takeover would be useful.

A takeover is only recommended when the return code from the script is 1 (error).

Note

The script does not tell you if the secondary system is ready for a takeover.
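The return-code rule above can be sketched as a small wrapper. This is a minimal sketch, not SAP-delivered code: the function names are hypothetical, the script is assumed to be in the current directory, and only the meaning of return code 1 (error) is taken from this lesson.

```python
import subprocess

# Return code that this lesson associates with "takeover recommended".
RC_ERROR = 1


def takeover_recommended(return_code):
    """Return True only for return code 1 (error), per the rule above."""
    return return_code == RC_ERROR


def check_primary(run=subprocess.call):
    """Run landscapeHostConfiguration.py and apply the rule.

    Assumptions: the script is in the current directory and "python"
    resolves to the interpreter shipped with SAP HANA. The `run`
    parameter exists only so that the call can be replaced in tests.
    """
    return_code = run(["python", "landscapeHostConfiguration.py"])
    return takeover_recommended(return_code)
```

As the note above states, a positive result says nothing about whether the secondary system is ready, so this check can only be one input to the decision.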

If a takeover occurs, the secondary site finds the latest savepoint in the data disk area. This is the same starting point as for a normal database restart, but many large data packages (main indexes) are already preloaded in memory, as they were on the primary data center before the takeover. This preloading shortens the restart considerably. Based on this initial savepoint on the secondary data center, the log replay can start and roll the database forward to the latest point in time.

The primary system can fail due to disk, server, power, or external causes. When this failure is detected, a controlled takeover can start.

The following decision guideline can help you decide if a takeover is advisable.

Takeover Decision Guideline

There are three main questions involved in deciding whether or not a takeover will improve the situation.

  1. Can a takeover help at all?
    • No: Do not perform a takeover.

    • Yes: Proceed to question 2.

  2. Can a takeover reduce the downtime duration?
    • No: Do not perform a takeover.

    • Yes: Proceed to question 3.

  3. Can it be guaranteed that no data loss will result from the takeover?

    • No: Evaluate the risk of data loss in the case of a takeover against that of data loss in case of no takeover, and against the impact of a longer downtime to bring back the primary site instead.

    • Yes: Perform a takeover.
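The three questions can be read as a short decision function. The following is only an illustrative sketch: the function name, parameter names, and result strings are hypothetical, and answering each question still requires human judgment.

```python
def takeover_decision(can_help, reduces_downtime, no_data_loss):
    """Walk through the three guideline questions in order.

    Each argument is the yes/no answer (True/False) to one question.
    """
    if not can_help:            # Question 1
        return "do not perform a takeover"
    if not reduces_downtime:    # Question 2
        return "do not perform a takeover"
    if not no_data_loss:        # Question 3
        return "evaluate the risk of data loss and longer downtime"
    return "perform a takeover"
```

For example, `takeover_decision(True, True, False)` returns the risk-evaluation outcome, which is the one case in which the guideline does not give a direct yes or no.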

Note

For more information on how to answer these questions, see SAP Note: 2063657.

You can use the getTakeoverRecommendation.py script to get takeover recommendations.

Takeover Recommendations are Given by the Script:

  • getTakeoverRecommendation.py

  • Evaluates the status returned by the Python scripts:

    • landscapeHostConfiguration.py
    • systemReplicationStatus.py
  • These three possible states are returned:

    • Takeover required

    • Not decidable

    • Possible

When the getTakeoverRecommendation script is called, it shows the takeover recommendation based on the current system state. However, when the primary system faces any error situation, the system replication status can no longer be determined. Therefore, the previous state should be saved and compared against the current state.

Example

Primary Site

This is a sample implementation of a Python script that uses getTakeoverRecommendation to act as a minimalist cluster manager:

Code Snippet
import time
import subprocess
from getTakeoverRecommendation import TakeoverDecision

def main():
    # No takeover is advised until the systems have been seen in sync.
    wasInSync = False
    while True:
        # The exit code of the script is used as the recommendation.
        recommendation = subprocess.call(["python", "getTakeoverRecommendation.py", "--sapcontrol=1"])
        if not wasInSync and recommendation == TakeoverDecision.Required:
            print("Primary defect & no sync => NO TAKEOVER")
        if wasInSync and recommendation == TakeoverDecision.Required:
            print("Primary defect & sync => TAKEOVER")
        # Remember the sync state for the next iteration.
        nowInSync = recommendation == TakeoverDecision.Possible
        wasInSync = nowInSync

The output depends on the previous state combined with the result of the current call of getTakeoverRecommendation. If no sync state is reached, a takeover is not advised. But once the systems are in sync, the next error of the primary system will suggest a takeover. Any subsequent negative return value will reset the sync state, as it is no longer ensured that the replicated data is current.
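To make the state handling easier to follow, the comparison of previous and current state can be isolated from the script call. The sketch below is an assumption-laden illustration: the string constants stand in for the three states listed earlier and are not the values used by getTakeoverRecommendation.py, and the function names are hypothetical.

```python
# Hypothetical stand-ins for the three states returned by the script.
REQUIRED = "takeover required"
NOT_DECIDABLE = "not decidable"
POSSIBLE = "possible"


def step(was_in_sync, recommendation):
    """Return (advice, now_in_sync) for one polling cycle."""
    advice = None
    if recommendation == REQUIRED:
        # A takeover is only advised if the previous cycle was in sync.
        advice = "TAKEOVER" if was_in_sync else "NO TAKEOVER"
    # Only a "possible" result counts as in sync for the next cycle.
    now_in_sync = recommendation == POSSIBLE
    return advice, now_in_sync


def replay(recommendations):
    """Apply step() to a sequence of recommendations, starting out of sync."""
    in_sync = False
    advices = []
    for recommendation in recommendations:
        advice, in_sync = step(in_sync, recommendation)
        advices.append(advice)
    return advices
```

Calling `replay([REQUIRED, POSSIBLE, REQUIRED, REQUIRED])` walks through the cases described above: no takeover before sync is reached, a takeover on the first error after sync, and no takeover again once the sync state has been reset.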

Tools for Performing a Takeover

The takeover can be triggered using the following tools:

  • The SAP HANA Cockpit

  • SAP HANA Studio

  • hdbnsutil

The following steps are performed:

  1. Trigger a takeover to the secondary system in the event of a disaster.

  2. Register the former primary system as a new secondary when it becomes available again.

To initiate the controlled takeover in the secondary system, choose the takeover button in the system replication app.

Command Line Tool hdbnsutil

  1. Perform a takeover on the secondary site:

    Code Snippet
    hdbnsutil -sr_takeover

  2. When the former primary site is available again, it can be registered as the new secondary site:

    Code Snippet
    hdbnsutil -sr_register --remoteHost=<new primary hostname> --remoteInstance=<instance number> --replicationMode=<sync/syncmem/async> --operationMode=<delta_datashipping|logreplay> --name=<siteName>
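If these two commands are called from a script, it can help to build the argument lists in one place and to validate the modes first. The following is a minimal sketch: the function names and the example values are hypothetical, and only the options shown in the commands above are used.

```python
# Mode values as listed in the -sr_register command above.
REPLICATION_MODES = ("sync", "syncmem", "async")
OPERATION_MODES = ("delta_datashipping", "logreplay")


def takeover_command():
    """Argument list for the takeover on the secondary site."""
    return ["hdbnsutil", "-sr_takeover"]


def register_command(remote_host, remote_instance, replication_mode,
                     operation_mode, site_name):
    """Argument list for registering the former primary as secondary."""
    if replication_mode not in REPLICATION_MODES:
        raise ValueError("unknown replication mode: %s" % replication_mode)
    if operation_mode not in OPERATION_MODES:
        raise ValueError("unknown operation mode: %s" % operation_mode)
    return [
        "hdbnsutil", "-sr_register",
        "--remoteHost=" + remote_host,
        "--remoteInstance=" + remote_instance,
        "--replicationMode=" + replication_mode,
        "--operationMode=" + operation_mode,
        "--name=" + site_name,
    ]
```

The resulting lists can be passed to `subprocess.call`, which avoids shell quoting issues with host or site names.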

Note

External cluster management software can be used to perform the client reconnect after takeover. Some of SAP’s hardware partners offer an integration of SAP HANA high availability in their cluster management solutions.

Client Connection Recovery

In most cases, performing the takeover on the SAP HANA system alone is not enough. The client or application server must be able to reach the SAP HANA system continuously, regardless of which site is currently the primary.

Methods for Client Connection Recovery

  • IP redirection

    A virtual IP address is assigned to the virtual host name. In the case of a takeover, the virtual IP unbinds from the network adapter of the primary system and binds to the network adapter of the secondary system.

  • DNS redirection

    In this scenario, the IP for the host name in the DNS is changed from the address of the primary system to the address of the secondary system.

Both methods have their advantages, but the choice is mostly determined by IT policies and the existing configuration. If there are no existing constraints, IP redirection has the clear benefit that a script can process it faster than changes to DNS entries can be synchronized over a global network.

SAP HANA offers the so-called "HA/DR providers" that are capable of informing external entities about activities inside SAP HANA scale-out (such as host auto-failover) and SAP HANA system replication setups. In a Python script, actions can be executed before or after certain SAP HANA activities, such as startup, shutdown, failover, takeover, connection change, and so on. One example of these HA/DR providers, or "hooks", is moving virtual IP addresses after a takeover in SAP HANA system replication.

Additionally, external cluster management software can be used to perform the client reconnect after takeover.

Monitoring View Providing Information About Takeover History

The monitoring view M_SYSTEM_REPLICATION_TAKEOVER_HISTORY provides information about takeovers in SAP HANA system replication (HSR) and when HSR was activated or reactivated.

During a takeover, the content of the view is also moved to the system taking over, so that the complete takeover history is available.

Takeover History

Information provided by system view M_SYSTEM_REPLICATION_TAKEOVER_HISTORY

  • Execution end time for takeover of the transaction domain
  • Execution start time for takeover of the transaction domain
  • Master log position that has been reached by the takeover
  • Time that has been reached by the takeover
  • Master nameserver host at takeover time
  • Operation mode at takeover time
  • Replication mode at takeover time
  • Replication status at takeover time
  • Highest master log position that has been shipped before executing the takeover
  • Time of the last shipped log buffer before executing the takeover
  • Logical name provided by the site administrator at takeover time
  • Generated ID of the secondary site at takeover time
  • Source site master nameserver host at takeover time
  • Logical name for the source site provided by the site administrator at takeover time
  • Generated ID of the source site at takeover time
  • Source site SAP HANA version
  • End time of the takeover command
  • Start time of the takeover command
  • Indication of how the system went online (ONLINE: online takeover, OFFLINE: offline takeover, TIMETRAVEL: after time travel)
  • SAP HANA version for the site that is executing the takeover
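The view can be read like any other monitoring view with a SELECT statement. The sketch below shows one way to turn the result into dictionaries using a generic Python DB-API cursor. It is an illustration under assumptions: in practice the cursor would come from an SAP HANA client connection, whereas here an in-memory SQLite table with placeholder columns (DEMO_SITE, DEMO_MODE) stands in for the view so that the code can run anywhere. The placeholder columns are not the real column names of the view.

```python
import sqlite3

VIEW = "M_SYSTEM_REPLICATION_TAKEOVER_HISTORY"


def fetch_takeover_history(cursor):
    """Read all rows of the view and return them as a list of dicts."""
    cursor.execute("SELECT * FROM " + VIEW)
    # cursor.description holds one tuple per column; item 0 is the name.
    names = [column[0] for column in cursor.description]
    return [dict(zip(names, row)) for row in cursor.fetchall()]


def demo():
    """Run the query against a stand-in table (not a real SAP HANA view)."""
    connection = sqlite3.connect(":memory:")
    cursor = connection.cursor()
    # Placeholder columns, chosen only for this demonstration.
    cursor.execute("CREATE TABLE " + VIEW + " (DEMO_SITE TEXT, DEMO_MODE TEXT)")
    cursor.execute("INSERT INTO " + VIEW + " VALUES ('siteB', 'sync')")
    rows = fetch_takeover_history(cursor)
    connection.close()
    return rows
```

Because the column names are taken from the cursor at run time, the function does not need to know the columns of the view in advance.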

Implementing Takeover Hooks

Takeover Hooks

  • Takeover hooks are provided by SAP HANA in the form of a Python script template.

  • Pre- and post-takeover actions are implemented in this script and are then executed by the name server before or after the takeover.

  • For this purpose, the SAP HANA name server provides a Python-based API that is called at important points of the host auto-failover and the system replication takeover process.

  • There are a number of pre-takeover, post-takeover, and general hooks available.

These so-called "hooks" can be used for arbitrary operations that need to be executed. One of the most important uses of the failover hooks is moving around a virtual IP address (in conjunction with STONITH).

There are other purposes, such as starting tools and applications on certain hosts after failover, or even stopping DEV or QA SAP HANA instances on secondary sites before takeover. Multiple failover hooks can be installed and used in parallel with a defined execution order.

The failover hooks are included in SAP HANA. SAP HANA comes with its own Python interpreter, which is used for interpreting the user defined failover hooks. The failover hook API also has a version number.

You can adapt Python files delivered with SAP HANA to create your own HA/DR provider. This allows you to integrate, for example, SAP HANA failover mechanisms into your existing scripts.

To create your own HA/DR provider, use the HADRDummy.py script (located in the $DIR_SYSEXE/python_support/hdb_ha_dr directory) as a template for implementing SAP HANA failover mechanisms in your own scripts.

After implementation of the basic HA/DR provider, you can add the methods listed in the figure, Hook Methods, to your provider.

Hook Methods

Name                        Trigger
startup()                   Beginning of the nameserver’s start-up phase
shutdown()                  Just before the nameserver exits
failover()                  As soon as the nameserver has made a decision about the new role
stonith()                   As soon as the nameserver has made a decision about the new role
preTakeover()               As soon as the hdbnsutil -sr_takeover command is issued
postTakeover()              As soon as all services with a volume return from their assign-call (open SQL port)
srConnectionChanged()       As soon as one of the replicating services loses or (re-)establishes the system replication connection
srServiceStateChanged()     As soon as the nameserver has made a decision about the new state
srReadAccessInitialized()   As soon as a tenant database or the system database is ready to accept SQL read queries on a read-enabled secondary system
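A provider that uses two of these hook methods might look roughly like the following. Treat this as a sketch under assumptions rather than a working provider: the import path, the base class name HADRBase, the method parameters (isForce, rc), and the convention of returning 0 for success are assumptions that should be checked against the HADRDummy.py template, and further methods required by the template are omitted. The class name and the recorded actions are hypothetical. A stand-in base class is defined so that the sketch also runs outside an SAP HANA installation.

```python
try:
    # Assumed import path; available only inside an SAP HANA installation.
    from hdb_ha_dr.client import HADRBase
except ImportError:
    class HADRBase(object):
        """Stand-in base class so the sketch runs outside SAP HANA."""

        def __init__(self, *args, **kwargs):
            pass


class VirtualIpProvider(HADRBase):
    """Hypothetical provider that records what it would do."""

    def __init__(self, *args, **kwargs):
        super(VirtualIpProvider, self).__init__(*args, **kwargs)
        self.actions = []

    def preTakeover(self, isForce, **kwargs):
        # Called when hdbnsutil -sr_takeover is issued (see table above).
        self.actions.append("preTakeover")
        return 0  # Assumed convention: 0 signals success.

    def postTakeover(self, rc, **kwargs):
        # Called once the services are assigned (see table above).
        if rc == 0:
            # A real provider would move the virtual IP address here.
            self.actions.append("move virtual IP")
        return 0
```

Recording the actions instead of executing them keeps the sketch free of site-specific network commands, which differ between landscapes.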

As an example, the srServiceStateChanged() HA/DR Provider Hook reports changed service states. It notices that an SAP HANA service is currently stopping or crashing. This knowledge can be used to reduce the takeover (detection) time, especially in systems with huge index servers.

Note

The procedure for creating a HA/DR provider, and the available hook methods, are described in detail in the SAP HANA Administration Guide.

Takeover with Handshake

The takeover with handshake ensures that all of the sent redo log is written to disk on the secondary system.

During a planned takeover, it is important to ensure that no data gets lost (all primary updates must be available on the secondary system), and the former primary system is isolated to avoid a split-brain situation with multiple active primary systems.

The takeover with handshake is ideal for a safe planned takeover while the primary is still running. All new writing transactions on the primary system are suspended and the takeover is only executed when the redo log is available on the secondary system. When performing a takeover with handshake, it is not required to check the replication status or to stop the old primary before the takeover.

You can trigger a takeover with handshake using hdbnsutil -sr_takeover --suspendPrimary on the secondary system.

If a primary service cannot be accessed, or a service replication is not active or in sync, the takeover will be aborted and reported as an error. In this case, there is no impact on the system and the replication remains as it was. The suspended primary service can be unblocked using the hdbnsutil -sr_register command.
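A script that triggers the takeover with handshake should treat an aborted takeover as a normal outcome, because the replication remains unchanged in that case. The following is a minimal sketch: the function name and result strings are hypothetical, and it assumes that hdbnsutil signals the abort with a non-zero exit code.

```python
import subprocess

# Command from this lesson, to be run on the secondary system.
HANDSHAKE_COMMAND = ["hdbnsutil", "-sr_takeover", "--suspendPrimary"]


def takeover_with_handshake(run=subprocess.call):
    """Trigger the takeover and report the outcome as a string.

    The `run` parameter exists only so that the call can be replaced
    in tests. Assumption: a non-zero exit code means the takeover was
    aborted and the replication setup is unchanged.
    """
    exit_code = run(HANDSHAKE_COMMAND)
    if exit_code != 0:
        return "aborted, replication unchanged"
    return "takeover completed"
```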

Invisible Takeover

During an invisible takeover or a restart, the session's state needs to be recovered and restored to the new primary system.

During a standard takeover, you switch your active system from the current primary system to the secondary system. After a standard takeover, the primary system loses all connections to the client. Moreover, the secondary system is not aware of the previous connections that existed between the client and the primary system. This is different in an invisible takeover.

You can perform an invisible takeover to achieve an automatic recovery of your sessions after takeover to your new primary system. For dedicated client applications, this takeover is invisible. In contrast to a standard takeover, an invisible takeover ensures that the client reconnects to the primary system and the sessions are restored to the secondary system.

An invisible takeover has two functions:

  • Keep the physical connections between the client and the primary and secondary systems.

  • Restore the sessions to the secondary system.

This seamless recovery is possible also when restarting the system (for example, after a system crash).

The session's state needs to be recovered and restored to the new primary system in an invisible takeover scenario, or to the new system in a restart scenario. The cross-layer between the session and the client library makes the seamless recovery possible. This cross-layer feature called transparent session recovery recovers the current session's state and the physical connection.

Note

As a first step, the focus is on read SQL, while write transactions (including database cursors) still need to be restarted after takeover (similar to classical databases).

During an invisible takeover or a restart, the session's state is recovered and restored to the new primary system.

Note

The transparent session recovery is supported by SQLDBC for SAP HANA 2.0.

Up to SAP HANA 2.0 SPS 03, the implementation needs Active/Active system replication.

Configuration

The enable_session_recovery parameter controls the session recovery. The parameter is part of the indexserver.ini configuration file: indexserver.ini/session/enable_session_recovery. The default value is true, recovering all session variables and restoring the client connections from the primary system to the secondary system. This parameter is configurable online, but the changes can be applied only to the connections established after making the changes.
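Because the parameter can be changed online, it could be set with an SQL configuration statement. The helper below only builds the statement text. It is a sketch under assumptions: the function name is hypothetical, and the 'SYSTEM' layer is an assumed choice; the file, section, and parameter names are the ones given above.

```python
def session_recovery_statement(enabled):
    """Build the SQL text that sets enable_session_recovery.

    Assumption: the change is made on the SYSTEM layer.
    """
    value = "true" if enabled else "false"
    return (
        "ALTER SYSTEM ALTER CONFIGURATION ('indexserver.ini', 'SYSTEM') "
        "SET ('session', 'enable_session_recovery') = '%s' "
        "WITH RECONFIGURE" % value
    )
```

As noted above, a change made this way would affect only connections established afterwards.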

Limitations

In SAP HANA 2.0 SPS04, almost all session variables from the current session context can be recovered, with the following limitations:

  • Sessions that have created or updated a global temporary table with any DDL or DML commands will not be recovered. However, sessions which have created a local temporary table will be recovered without the table recovery.

  • Only read transactions are supported. Ongoing write transactions will be rolled back with an error, and the session can be recovered when an application restarts the failed transaction with no explicit reconnect trial from the application.

  • Almost all session variables from the current session context are recovered.

  • When a response for a request is not successfully sent from the server to the client, the session is not recovered. However, sessions are still recovered when an SQL command is not sent from the client to the server.
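The limitation on write transactions means that the application has to restart a transaction that was rolled back during the takeover. A generic retry wrapper illustrates the idea. This is a sketch with assumptions: the function and parameter names are hypothetical, and the exception type raised by the database client for a rolled-back transaction is not specified here, so the caller passes it in.

```python
def run_write_transaction(work, max_attempts=3, retry_on=(Exception,)):
    """Call work() and restart it if it fails with a retryable error.

    `work` must contain the complete transaction, so that a restart
    repeats every statement. No explicit reconnect is attempted.
    """
    last_error = None
    for _attempt in range(max_attempts):
        try:
            return work()
        except retry_on as error:
            # The transaction was rolled back; try it again as a whole.
            last_error = error
    raise last_error
```

Passing the whole transaction as one callable keeps the restart atomic from the point of view of the application: either all statements of an attempt run, or the attempt is repeated.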
