Describing the SAP HANA High Availability Features

Objective

After completing this lesson, you will be able to describe the different SAP HANA high availability features.

SAP HANA High Availability

Business Example

For your company’s SAP ERP and SAP Business Warehouse (SAP BW) systems, high availability and disaster tolerance are important requirements that need to be built into the landscape architecture.

Your SAP ERP and SAP BW systems are running on the SAP HANA database, which is why you are looking into the native high availability and disaster tolerance features of the SAP HANA database system. You want to learn how to incorporate these features into your company’s landscape architecture.

SAP HANA and High Availability

Events that affect uninterrupted system availability can be divided into schedulable (hardware and software maintenance) and unschedulable events (hardware, software and human error).

The SAP HANA database platform is designed with high availability and disaster tolerance in mind. SAP HANA supports a broad range of recovery scenarios, from simple software errors or hardware failures, to disasters that take out an entire site.

What is High Availability?

Availability is usually indicated as a percentage of the operational uptime of a system, measured over the course of a year. For example, if a system is designed to be available 99.99% of the time (sometimes called "four nines"), its downtime per year must not exceed 0.01%, which is about 52.56 minutes (roughly 52 minutes and 34 seconds) in a 365-day year.

That means less than an hour of downtime per year. This can be a very challenging target. To meet such challenging targets, high availability and disaster tolerance should be an integral part of the architectural design, that is, implemented on every layer of the infrastructure.
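
The arithmetic behind such availability targets can be checked with a short calculation. The following is a minimal sketch, assuming a 365-day year; the function name is illustrative only.

```python
# Allowed downtime per year for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the maximum downtime per year, in minutes."""
    return MINUTES_PER_YEAR * (100.0 - availability_percent) / 100.0


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability -> {minutes:.2f} minutes of downtime per year")
```

Each additional "nine" cuts the permitted downtime by a factor of ten, which is why higher targets become progressively harder to meet.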

Downtime is the consequence of outages, which may be planned (such as system upgrades or hardware replacements) or unplanned (such as software or hardware failures). Unplanned downtime can be triggered by equipment malfunction, software or network failures, or a major disaster such as a fire, an earthquake, a regional power loss, or a construction accident that takes the entire data center out of operation.

High Availability is a set of techniques, engineering practices, and design principles for business continuity. This is achieved by eliminating single points of failure (fault tolerance), and providing the ability to rapidly resume operations after a system outage with minimal business loss (fault resilience).

Fault Recovery is the process of recovering and resuming normal operations after an outage due to a fault.

Disaster Recovery is the process of recovering operations after an outage due to a prolonged data center or site failure. Preparing for disasters may require backing up data across longer distances, and may thus be more complex and costly.

Recovery - Key Performance Indicators (KPIs)

Customers commonly use two key measures to specify the recovery parameters of a system following an outage, the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO). The RPO and RTO of a system are illustrated in the figure, RPO and RTO.

Two important benchmarks during a recovery are the amount of data that may be lost (the Recovery Point Objective) and the maximum time a recovery may take (the Recovery Time Objective).
  • The RPO is the maximum permissible amount of time during which operational data may be lost without the ability to recover. It is the time between the last backup (data or log) and the crash. Almost every customer tries to achieve an RPO of 0, because the loss of business data is unacceptable.

  • The RTO is the maximum permissible amount of time it takes to recover the system, so that normal operations can resume. Many companies aim for a near-zero RTO, because during the RTO period, normal business is interrupted. Interrupted business leads to loss in revenue, which should be avoided as much as possible.
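
To make the two measures concrete, the following sketch computes the RPO and RTO actually achieved in a single incident. The timeline values and function names are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recoverable: datetime, outage_start: datetime) -> timedelta:
    """Time window whose changes cannot be recovered."""
    return outage_start - last_recoverable


def achieved_rto(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Time during which normal operations were interrupted."""
    return service_restored - outage_start


# Hypothetical incident timeline
last_log_backup = datetime(2024, 1, 15, 9, 45)
crash = datetime(2024, 1, 15, 10, 0)
restored = datetime(2024, 1, 15, 11, 30)

rpo = achieved_rpo(last_log_backup, crash)
rto = achieved_rto(crash, restored)
print(f"Achieved RPO: {rpo}, achieved RTO: {rto}")
```

Comparing these achieved values with the agreed objectives shows whether the recovery met its targets.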

Eliminating Single Points of Failure

The key to achieving fault tolerance is to eliminate single points of failure by introducing redundancy. SAP HANA hardware vendors deliver several levels of redundancy to avoid outage due to component failure.

Generally speaking, these techniques are transparent to SAP HANA’s operation. Nevertheless, they form a crucial line of defense against avoidable system outage, and therefore greatly contribute to business continuity.

Hardware Redundancy

SAP HANA hardware vendors design multiple layers of redundancy in their hardware components and subsystems. These include redundant and hot-swappable Power Supply Units (PSUs), fans, network interface cards, and enterprise-grade, error-correcting code memory.

These subsystems are designed in such a way that the redundant components can sustain operations of the system even when other components fail.

The storage system is particularly critical. Enterprise-grade storage systems combine multiple physical drives into logical units, with built-in standard Redundant Array of Independent Disks (RAID) techniques for redundancy and error recovery. These include mirroring (the writing of the same data to two different drives in parallel) and parity (the writing of extra bits to allow the detection and automatic correction of errors).
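
The idea behind parity can be illustrated with a simplified single-parity scheme based on XOR, similar in principle to what parity-based RAID levels use. This is a sketch only: real storage controllers stripe data across drives and work on much larger blocks, and the sample data is made up.

```python
from functools import reduce


def xor_parity(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data "drives" holding one block each (made-up content)
data = [b"HANA", b"DATA", b"LOGS"]
parity = xor_parity(data)

# Simulate losing the second drive and rebuild it from the survivors plus parity
survivors = [data[0], data[2], parity]
rebuilt = xor_parity(survivors)
print(rebuilt)
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields the missing block again.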

Network Redundancy

Redundant networks, network equipment, and network connectivity are required to avoid network failures affecting system availability. This is typically accomplished by deploying a completely redundant switch topology, using the Spanning Tree Protocol (STP) to avoid loops.

Routers can be configured with the Hot Standby Router Protocol (HSRP) for automatic failover. The Border Gateway Protocol (BGP) is commonly used to manage dual WAN connections.

Data Center Redundancy

Data centers that host SAP HANA solutions are equipped with Uninterruptible Power Supply (UPS) units and backup power generators, redundant cooling systems, and multi-sourced providers of network connectivity and electricity. This maintains operational availability in the presence of individual failures and significantly reduces the probability of a business-impacting outage. Some enterprises operate fully duplicated data centers, providing a high level of disaster tolerance.

SAP HANA High Availability Support

As an in-memory database, SAP HANA must not only maintain the reliability of its data in the event of failures. It must also resume operations as quickly as possible, with most of its data loaded back into memory.

The figure, SAP HANA RPO and RTO Support, shows the phases of SAP HANA High Availability support.

For SAP HANA, as with other databases, the Recovery Point Objective and the Recovery Time Objective are important. However, detection time and ramp-up time must also be included.

Prepare Phase

This first phase means being ready for disaster. During this time, the database is regularly backed up (data and log backups). The local or remote standby systems are operational and ready to take over.

This phase is often taken for granted, because everything is working within the defined parameters. Due to this relaxed attitude, some check procedures might be skipped. This is a disaster waiting to happen. Ensure that there are checks in place.

Detect Phase

Before a takeover can be initiated, a fault must be detected. This fault detection can be done automatically or manually. In both cases, false positives must be avoided, so a failure must be retested.

Try to keep the detect phase as short as possible to avoid the loss of revenue while systems are down.
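
Retesting before declaring a failure can be sketched as follows. This is a generic illustration, not an SAP HANA mechanism: `confirm_failure`, `make_probe`, and the probe results are hypothetical, and a real probe would check the actual system.

```python
import time
from typing import Callable


def confirm_failure(probe: Callable[[], bool], retries: int = 3,
                    delay_seconds: float = 0.0) -> bool:
    """Return True only if every probe attempt fails."""
    for attempt in range(retries):
        if probe():
            return False  # system responded, so earlier failures were transient
        if attempt < retries - 1:
            time.sleep(delay_seconds)
    return True


def make_probe(results: list[bool]) -> Callable[[], bool]:
    """Build a fake probe that replays scripted results (True = healthy)."""
    iterator = iter(results)
    return lambda: next(iterator)


transient = confirm_failure(make_probe([False, False, True]))
real_outage = confirm_failure(make_probe([False, False, False]))
print(f"Transient glitch confirmed as failure: {transient}")
print(f"Persistent outage confirmed as failure: {real_outage}")
```

More retries and longer delays reduce false positives but lengthen the detect phase, so the two settings have to be balanced against each other.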

Recover Phase

When a real failure has been detected, the takeover is triggered. Depending on the fault, different recovery processes can be triggered. The different recovery processes have different recovery runtimes. The runtime is also heavily dependent on the available hardware resources.

Ramp-up Phase

As soon as recovery is completed, the system is available in a ramp-up state. This is due to the fact that not all the data is loaded into memory yet and external interfaces might still be initializing.

To optimize this phase, you could investigate which data is needed most so that this data can be loaded first.
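
One simple way to derive a load order is to rank tables by how often they are accessed. The sketch below assumes that access statistics are already available; the table names and counts are hypothetical.

```python
# Hypothetical access statistics: table name -> number of observed accesses
access_counts = {
    "SALES_ORDERS": 12500,
    "ARCHIVE_2015": 40,
    "CUSTOMERS": 8300,
    "AUDIT_LOG": 900,
}


def load_order(counts: dict[str, int]) -> list[str]:
    """Return table names sorted from most to least frequently accessed."""
    return sorted(counts, key=counts.get, reverse=True)


print(load_order(access_counts))
```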

Failback Phase

When all of the other phases are complete, the fault needs to be repaired. This can be a hardware repair or software updates. Both take time and may require additional testing before being applied to the production system.

When these repairs are done, the system may need to fail back to the original data center and hardware. This can be triggered immediately or at the next data center maintenance window. Whether and when this failback is triggered is up to the customer, because it depends on contracts and service level agreements with third-party vendors and may involve additional costs.

SAP HANA Recovery Features

Different RPO and RTO values can be associated with different kinds of faults. Business-critical systems are expected to operate with an RPO of zero data loss in the case of local faults, and often even in the case of a disaster.

The challenges of disaster recovery are different for locally recoverable faults compared to total disasters. To achieve zero RPO and low RTO in a total disaster, data must be replicated synchronously over longer distances, which impacts regular system performance and may require more expensive standby and failover solutions.
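
The performance impact of distance can be estimated from signal propagation time alone. The sketch below assumes that light travels through optical fiber at roughly 200,000 km/s and that a synchronous commit waits for at least one network round trip; it ignores switching, protocol, and disk latency, so real values are higher.

```python
FIBER_SPEED_KM_PER_S = 200_000  # assumption: roughly two thirds of the speed of light


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound for one network round trip over the given distance."""
    return 2 * distance_km / FIBER_SPEED_KM_PER_S * 1000


for km in (10, 100, 1000):
    print(f"{km} km -> at least {min_round_trip_ms(km):.2f} ms added per synchronous commit")
```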

All of this leads to trade-off decisions around the attributes of fault recovery functionality, cost, and complexity. SAP offers complementary design options, including three levels of disaster recovery support and three levels of automatic fault recovery support. These are summarized in the figure, SAP HANA Recovery Features. More details on fault recovery and disaster recovery are provided in the next units of this course.

Fault Recovery Support

Local faults, such as hardware and software failures, can often be handled in the same data center and hardware. Possible solutions to repair the error include restarting a failing service on the same server, or switching to a new host in the same data center. Such solutions can be implemented at almost no extra cost, as they are often a default part of the software and hardware solution provided by hardware vendors.

SAP HANA Fault Recovery Features

Recovery feature         Cost involved    RPO (data loss)    RTO (time)
Service Auto-Restart     No costs         0                  Short
SAP HANA Auto-Restart    No costs         0                  Long
Host Auto-Failover       Medium costs     0                  Medium

Service Auto-Restart

In the event of a software failure of one of the configured SAP HANA services (Index Server, Name Server, and so on), the failing service is restarted by the SAP HANA service auto-restart watchdog function.

This watchdog function is provided by the SAP HANA daemon process, which automatically detects the failure and restarts the stopped service process. Upon restart, the service loads data into memory and resumes its function. While all data remains safe, the service recovery takes some time.
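
The general watchdog pattern can be sketched in a few lines. This is a generic illustration and not how the SAP HANA daemon is implemented: the `supervise` function, the restart limit, and the sample child command are all assumptions made for the example.

```python
import subprocess
import sys


def supervise(command: list[str], max_restarts: int = 3) -> tuple[int, int]:
    """Run a command and restart it while it exits with a failure code.

    Returns (number of restarts performed, last exit code).
    """
    restarts = 0
    while True:
        exit_code = subprocess.run(command).returncode
        if exit_code == 0 or restarts >= max_restarts:
            return restarts, exit_code
        restarts += 1  # failure detected, so start the process again


# Hypothetical "service" that always crashes with exit code 1
crashing_service = [sys.executable, "-c", "raise SystemExit(1)"]
print(supervise(crashing_service, max_restarts=2))
```

A restart limit like the one assumed here keeps a permanently broken process from being restarted forever.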

SAP HANA Auto-Restart

The SAP HANA database system can be configured in an auto-restart mode. This can be useful after a power failure. When the power returns and the Linux operating system has been started successfully, the SAP HANA database system automatically performs a start-up and recovery. The SAP HANA database system is available again for normal operations as soon as the start-up and recovery is finished.

Host Auto-Failover

This is a local fault recovery solution that can be used in addition to, or as an alternative measure to, system replication. One or more hosts are added to an SAP HANA database system. These additional hosts are configured to work in standby mode.

As long as they are in standby mode, the databases on these hosts do not contain any data and do not accept requests or queries. This means these additional standby hosts cannot be used for other purposes, such as quality or test systems.

Disaster Recovery Support

Disasters can be divided into natural and man-made. Natural disasters include earthquakes, volcanic eruptions, hurricanes, floods, and fires. Man-made disasters include, for example, accidental deletion of data files or table data, logical user errors, power outages, or loss of Internet connectivity in the primary data center.

Recovery from these man-made disasters is more complex and time-consuming. Switching to the still intact secondary data center is often the quickest solution to ensure that business continuity is restored as quickly as possible.

SAP HANA Disaster Recovery Features

Recovery feature                             Cost involved    RPO (data loss)    RTO (time)
Backups                                      Low costs        0                  Long
Storage Replication                          High costs       0                  Medium
System Replication                           High costs       0                  Short
System Replication – Active/Active           Medium costs     0                  Short
System Replication – without data preload    Low costs        0                  Medium

Backups

SAP HANA is an in-memory database, but all the data is persisted on disk as well. Data is persisted on disk by means of regular savepoints. These savepoints are performed by default every five minutes. In between these savepoints, all the changes are recorded in transaction redo logs.
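
How savepoints and redo logs work together during recovery can be shown with a toy model. This is a deliberately simplified sketch and not SAP HANA's actual persistence layer: the `ToyStore` class is hypothetical, and plain Python objects stand in for disk.

```python
class ToyStore:
    """Toy in-memory store with a savepoint and a redo log."""

    def __init__(self) -> None:
        self.memory: dict[str, int] = {}     # lost when the system crashes
        self.savepoint: dict[str, int] = {}  # last snapshot persisted on "disk"
        self.redo_log: list[tuple[str, int]] = []  # changes since the savepoint

    def write(self, key: str, value: int) -> None:
        self.redo_log.append((key, value))  # record the change first
        self.memory[key] = value

    def take_savepoint(self) -> None:
        self.savepoint = dict(self.memory)
        self.redo_log.clear()  # simplification: older log entries are no longer needed

    def crash(self) -> None:
        self.memory = {}

    def recover(self) -> None:
        self.memory = dict(self.savepoint)  # start from the last savepoint
        for key, value in self.redo_log:    # replay the changes made afterwards
            self.memory[key] = value


store = ToyStore()
store.write("a", 1)
store.take_savepoint()
store.write("b", 2)
store.write("a", 3)
store.crash()
store.recover()
print(store.memory)
```

The savepoint limits how much of the log has to be replayed, while the log covers the changes made since the last savepoint.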

To make sure that SAP HANA can recover from hardware failures, regular data and log backups must be performed. These data and log backups must be shipped to the secondary site to make sure that the system can recover from a total disaster.

Storage Replication

This is a method to continuously replicate all persisted data and log information to the secondary site. Several SAP HANA hardware partners offer a storage-level replication solution, which delivers a backup of the volumes or file-system to a remote, networked storage system.

System Replication

This is a native SAP HANA high availability solution that provides a continuously-replicated SAP HANA system on the secondary site. The data is already loaded into memory, so takeover times are short in comparison to backup and storage replication solutions.

System Replication Active/Active

This is a second native SAP HANA system replication solution that allows the data to be read from the secondary system. In this setup, the secondary system can be used to handle the reporting workload without disrupting the primary system.

System Replication without Data Preload

This is a third native SAP HANA scenario. In this solution, the secondary system does not preload data, and hence consumes very little memory. This allows the hosts of the secondary system to serve dual purposes, for example, development, unit testing, or QA with separate storage. Before takeover, these activities must be shut down. The trade-off in this scenario is a longer RTO in the case of failover.
