Business Example
For your company’s SAP ERP and SAP Business Warehouse (SAP BW) systems, high availability and disaster tolerance are important requirements that need to be built into the landscape architecture.
Your SAP ERP and SAP BW systems are running on the SAP HANA database, which is why you are looking into the native high availability and disaster tolerance features of the SAP HANA database system. You want to learn how to incorporate these features into your company’s landscape architecture.
SAP HANA and High Availability

The SAP HANA database platform is designed with high availability and disaster tolerance in mind. SAP HANA supports a broad range of recovery scenarios, from simple software errors or hardware failures, to disasters that take out an entire site.
What is High Availability?
Availability is usually indicated as a percentage of the operational uptime of a system, measured over the course of a year. For example, if a system is designed to be available 99.99% of the time (sometimes called "four nines"), its downtime per year must be less than 0.01%, or about 52.56 minutes (roughly 52 minutes and 34 seconds) in a 365-day year.
That means less than an hour of downtime per year. This can be a very challenging target. To meet such challenging targets, high availability and disaster tolerance should be an integral part of the architectural design, that is, implemented on every layer of the infrastructure.
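The downtime budget for an availability target follows directly from the percentage. The short Python sketch below is illustrative only: it assumes a 365-day year, and the function name `allowed_downtime_minutes` is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year (525,600 minutes)


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the maximum yearly downtime, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


# Print the yearly downtime budget for a few common targets.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% available -> at most {minutes:.2f} minutes of downtime per year")
```

Each additional "nine" divides the permitted downtime by ten, which is why every further step is disproportionately harder to reach.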
Downtime is the consequence of outages, which may be planned (such as system upgrades or hardware replacements) or unplanned (such as software or hardware failures). Unplanned downtime can be triggered by equipment malfunction, software or network failures, or a major disaster such as a fire, an earthquake, a regional power loss, or a construction accident, any of which may take the entire data center out of service.
High Availability is a set of techniques, engineering practices, and design principles for business continuity. This is achieved by eliminating single points of failure (fault tolerance), and providing the ability to rapidly resume operations after a system outage with minimal business loss (fault resilience).
Fault Recovery is the process of recovering and resuming normal operations after an outage due to a fault.
Disaster Recovery is the process of recovering operations after an outage due to a prolonged data center or site failure. Preparing for disasters may require backing up data across longer distances, and may thus be more complex and costly.
Recovery - Key Performance Indicators (KPIs)
Customers commonly use two key measures to specify the recovery parameters of a system following an outage: the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO). The RPO and RTO of a system are illustrated in the figure, RPO and RTO.

The RPO is the maximum permissible amount of time during which operational data may be lost without the ability to recover. It is the time between the last backup (data or log) and the crash. Almost every customer tries to achieve an RPO of 0, because the loss of business data is unacceptable.
The RTO is the maximum permissible amount of time it takes to recover the system, so that normal operations can resume. Many companies aim for a near-zero RTO, because during the RTO period, normal business is interrupted. Interrupted business leads to loss in revenue, which should be avoided as much as possible.
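To make the two measures concrete, the following Python sketch compares what happened in an outage with the agreed objectives. The timestamps, the objectives, and the function names are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta


def measure_outage(last_backup: datetime, failure: datetime, restored: datetime):
    """Return (data loss window, recovery time) for a single outage."""
    data_loss_window = failure - last_backup  # compared against the RPO
    recovery_time = restored - failure        # compared against the RTO
    return data_loss_window, recovery_time


def meets_objectives(data_loss_window, recovery_time, rpo, rto) -> bool:
    """True if the outage stayed within both the RPO and the RTO."""
    return data_loss_window <= rpo and recovery_time <= rto


# Hypothetical outage: last log backup at 10:00, crash at 10:12, service back at 10:50.
loss, downtime = measure_outage(
    datetime(2024, 1, 15, 10, 0),
    datetime(2024, 1, 15, 10, 12),
    datetime(2024, 1, 15, 10, 50),
)
# Hypothetical objectives: RPO of 15 minutes, RTO of 30 minutes.
ok = meets_objectives(loss, downtime, timedelta(minutes=15), timedelta(minutes=30))
print(f"data loss window: {loss}, recovery time: {downtime}, objectives met: {ok}")
```

In this example the data loss window is within the RPO, but the recovery took longer than the RTO, so the objectives are not met.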
Eliminating Single Points of Failure
The key to achieving fault tolerance is to eliminate single points of failure by introducing redundancy. SAP HANA hardware vendors deliver several levels of redundancy to avoid outage due to component failure.
Generally speaking, these techniques are transparent to SAP HANA’s operation. Nevertheless, they form a crucial line of defense against avoidable system outage, and therefore greatly contribute to business continuity.
- Hardware Redundancy
SAP HANA hardware vendors design multiple layers of redundancy in their hardware components and subsystems. These include redundant and hot-swappable Power Supply Units (PSUs), fans, network interface cards, and enterprise-grade, error-correcting code memory.
These subsystems are designed in such a way that the redundant components can sustain operations of the system even when other components fail.
The storage system is particularly critical. Enterprise-grade storage systems combine multiple physical drives into logical units, with built-in standard Redundant Array of Independent Disks (RAID) techniques for redundancy and error recovery. These include mirroring (the writing of the same data to two different drives in parallel) and parity (the writing of extra bits to allow the detection and automatic correction of errors).
- Network Redundancy
Redundant networks, network equipment, and network connectivity are required to avoid network failures affecting system availability. This is typically accomplished by deploying a completely redundant switch topology, using the Spanning Tree Protocol (STP) to avoid loops.
Routers can be configured with the Hot Standby Router Protocol (HSRP) for automatic failover. The Border Gateway Protocol (BGP) is commonly used to manage dual WAN connections.
- Data Center Redundancy
Data centers that host SAP HANA solutions are equipped with Uninterruptible Power Supply (UPS) units and backup power generators, redundant cooling systems, and multi-sourced providers of network connectivity and electricity. This keeps the data center operational in the presence of individual failures and significantly reduces the probability of a business-impacting outage. Some enterprises operate fully duplicated data centers, providing a high level of disaster tolerance.
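The parity technique mentioned under hardware redundancy can be illustrated with a simple XOR scheme. This is a minimal sketch of single-parity reconstruction, not a description of any particular storage vendor's implementation. The block contents and the function name `xor_blocks` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, each standing in for a stripe on a separate drive.
data_blocks = [b"SAP ", b"HANA", b"data"]

# The parity block is written to a further drive.
parity = xor_blocks(data_blocks)

# Simulate losing the second drive: XOR of the survivors and the parity
# block yields the missing block again.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)
```

Because XOR is its own inverse, any single missing block can be recomputed from the remaining blocks, at the cost of one extra block of storage per stripe.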
SAP HANA High Availability Support
As an in-memory database, SAP HANA must not only maintain the reliability of its data in the event of failures. It must also resume operations as quickly as possible, with most of its data loaded back into memory.
The figure, SAP HANA RPO and RTO Support, shows the phases of SAP HANA High Availability support.

- Prepare Phase
This first phase means being ready for disaster. During this time, the database is regularly backed up (data and log backups). The local or remote standby systems are operational and ready to take over.
This phase is often taken for granted, because everything is working within the defined parameters. Due to this relaxed attitude, some check procedures might be skipped. This is a disaster waiting to happen. Ensure that there are checks in place.
- Detect Phase
Before a takeover can be initiated, a fault must be detected. This fault detection can be done automatically or manually. In both cases, false positives must be avoided, so a failure must be retested.
Try to keep the detect phase as short as possible to avoid the loss of revenue while systems are down.
- Recover Phase
When a real failure has been detected, the takeover is triggered. Depending on the fault, different recovery processes can be triggered. The different recovery processes have different recovery runtimes. The runtime is also heavily dependent on the available hardware resources.
- Ramp-up Phase
As soon as recovery is completed, the system is available in a ramp-up state, because not all the data is loaded into memory yet and external interfaces might still be initializing.
To optimize this phase, you could investigate which data is needed most so that this data can be loaded first.
- Failback Phase
When all of the other phases are complete, the fault needs to be repaired. This can be a hardware repair or software updates. Both take time and may require additional testing before being applied to the production system.
When these repairs are done, the system may need to fail back to the original data center and hardware. The failback can be triggered immediately or at the next data center maintenance window. Whether and when to trigger the failback is up to the customer, because this depends on contracts and service level agreements with third-party vendors and may involve additional costs.
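The retesting described in the detect phase can be sketched as a check that confirms a failure only after several consecutive failed probes. This is an illustrative Python sketch, not an SAP HANA tool: the `probe` callable and the default of three attempts are assumptions, and a real monitor would also wait between attempts.

```python
from typing import Callable


def confirmed_failure(probe: Callable[[], bool], attempts: int = 3) -> bool:
    """Return True only if the probe reports a failure on every attempt.

    `probe` is a hypothetical health check that returns True when the
    system responds. A production monitor would pause between attempts;
    the pause is omitted here to keep the sketch short.
    """
    for _ in range(attempts):
        if probe():
            return False  # one healthy response cancels the alarm
    return True


# A transient glitch: the first probe fails, the retest succeeds.
flaky = iter([False, True])
print("takeover needed:", confirmed_failure(lambda: next(flaky)))

# A real outage: every probe fails.
print("takeover needed:", confirmed_failure(lambda: False))
```

More attempts reduce the risk of a false positive but lengthen the detect phase, so the number of retests is a trade-off against the RTO.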
SAP HANA Recovery Features
Different RPO and RTO values can be associated with different kinds of faults. Business-critical systems are expected to operate with an RPO of zero data loss in the case of local faults, and often even in the case of a disaster.
The challenges of disaster recovery are different for locally recoverable faults compared to total disasters. To achieve zero RPO and low RTO in a total disaster, data must be replicated synchronously over longer distances, which impacts regular system performance and may require more expensive standby and failover solutions.
All of this leads to trade-off decisions around the attributes of fault recovery functionality, cost, and complexity. SAP offers complementary design options, including three levels of disaster recovery support and three levels of automatic fault recovery support. These are summarized in the figure, SAP HANA Recovery Features. More details on fault recovery and disaster recovery are provided in the next units of this course.
Fault Recovery Support
Local faults, such as hardware and software failures, can often be handled in the same data center and hardware. Possible solutions to repair the error include restarting a failing service on the same server, or switching to a new host in the same data center. Such solutions can be implemented at almost no extra cost, as they are often a default part of the software and hardware solution provided by hardware vendors.
SAP HANA Fault Recovery Features
Recovery feature | Cost involved | RPO (data loss) | RTO (time) |
---|---|---|---|
Service Auto-Restart | No costs | 0 | Short |
SAP HANA Auto-Restart | No costs | 0 | Long |
Host Auto-Failover | Medium costs | 0 | Medium |
- Service Auto-Restart
In the event of a software failure of one of the configured SAP HANA services (Index Server, Name Server, and so on), the failing service is restarted by the SAP HANA service auto-restart watchdog function.
This watchdog function is provided by the SAP HANA daemon process, which automatically detects the failure and restarts the stopped service process. Upon restart, the service loads data into memory and resumes its function. While all data remains safe, the service recovery takes some time.
- SAP HANA Auto-Restart
The SAP HANA database system can be configured in an auto-restart mode. This can be useful after a power failure. When the power returns and the Linux operating system has been started successfully, the SAP HANA database system automatically performs a start-up and recovery. The SAP HANA database system is available again for normal operations as soon as the start-up and recovery is finished.
- Host Auto-Failover
This is a local fault recovery solution that can be used in addition to, or as an alternative measure to, system replication. One or more hosts are added to an SAP HANA database system. These additional hosts are configured to work in standby mode.
As long as they are in standby mode, the databases on these hosts do not contain any data and do not accept requests or queries. This means these additional standby hosts cannot be used for other purposes, such as quality or test systems.
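The watchdog idea behind service auto-restart can be illustrated with a small simulation. The sketch below is not the implementation of the SAP HANA daemon. The `Service` class and the `watchdog_pass` function are hypothetical and only model the detect-and-restart cycle.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical stand-in for a database service process."""
    name: str
    running: bool = True
    restarts: int = 0

    def start(self) -> None:
        self.running = True
        self.restarts += 1


def watchdog_pass(services):
    """Restart every stopped service once and return the names restarted."""
    restarted = []
    for service in services:
        if not service.running:
            service.start()
            restarted.append(service.name)
    return restarted


services = [Service("indexserver"), Service("nameserver")]
services[0].running = False  # simulate a crash of one service
print("restarted:", watchdog_pass(services))
print("restarted on next pass:", watchdog_pass(services))
```

A real watchdog runs such a pass continuously. After the restart, the service still has to load its data back into memory, which is why recovery takes some time even though no data is lost.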
Disaster Recovery Support
Disasters can be divided into natural and man-made. Natural disasters include earthquakes, volcanic eruptions, hurricanes, floods, and fires. Man-made disasters include, for example, accidental deletion of data files or table data, logical user errors, power outages, or loss of Internet connectivity in the primary data center.
Recovery from these man-made disasters is more complex and time-consuming. Switching to the still intact secondary data center is often the quickest solution to ensure that business continuity is restored as quickly as possible.
SAP HANA Disaster Recovery Features
Recovery feature | Cost involved | RPO (data loss) | RTO (time) |
---|---|---|---|
Backups | Low costs | 0 | Long |
Storage Replication | High costs | 0 | Medium |
System Replication | High costs | 0 | Short |
System Replication – Active/Active | Medium costs | 0 | Short |
System Replication – without data preload | Low costs | 0 | Medium |
- Backups
SAP HANA is an in-memory database, but all the data is persisted on disk as well. Data is persisted on disk by means of regular savepoints. These savepoints are performed by default every five minutes. In between these savepoints, all the changes are recorded in transaction redo logs.
To make sure that SAP HANA can recover from hardware failures, regular data and log backups must be performed. These data and log backups must be shipped to the secondary site to make sure that the system can recover from a total disaster.
- Storage Replication
This is a method to continuously replicate all persisted data and log information to the secondary site. Several SAP HANA hardware partners offer a storage-level replication solution, which delivers a backup of the volumes or file system to a remote, networked storage system.
- System Replication
This is a native SAP HANA high availability solution that provides a continuously replicated SAP HANA system on the secondary site. The data is already loaded into memory, so takeover times are short in comparison to backup and storage replication solutions.
- System Replication Active/Active
This is a second native SAP HANA system replication solution that allows the data to be read from the secondary system. In this setup, the secondary system can be used to handle the reporting workload without disrupting the primary system.
- System Replication without Data Preload
This is a third native SAP HANA scenario. In this solution, the secondary system does not preload data, and hence consumes very little memory. This allows the hosts of the secondary system to serve dual purposes, for example, development, unit testing, or QA with separate storage. Before takeover, these activities must be stopped. The trade-off in this scenario is a longer RTO in the case of failover.
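The interplay of savepoints and redo logs described under Backups can be illustrated with a toy key-value store. This is a deliberately simplified Python sketch, not SAP HANA's persistence layer. The class `TinyStore` is hypothetical, and clearing the log at every savepoint is an assumption made to keep the example short.

```python
class TinyStore:
    """Toy in-memory store with a savepoint and a redo log (illustration only)."""

    def __init__(self):
        self.memory = {}     # volatile: lost in a crash
        self.savepoint = {}  # persisted snapshot of the data
        self.redo_log = []   # persisted changes since the last savepoint

    def write(self, key, value):
        self.redo_log.append((key, value))  # record the change before applying it
        self.memory[key] = value

    def take_savepoint(self):
        self.savepoint = dict(self.memory)
        self.redo_log.clear()  # simplification: older log entries are no longer needed

    def crash(self):
        self.memory = {}  # only the in-memory state is lost

    def recover(self):
        self.memory = dict(self.savepoint)  # load the last savepoint
        for key, value in self.redo_log:    # replay the changes made after it
            self.memory[key] = value


store = TinyStore()
store.write("a", 1)
store.take_savepoint()
store.write("b", 2)  # made after the savepoint, only in memory and the redo log
store.write("a", 3)
store.crash()
store.recover()
print(store.memory)
```

After recovery, the store contains the changes made after the last savepoint as well, because they are replayed from the redo log. If the savepoint and the log are lost together with the site, only backups or a replica on the secondary site can restore the data.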