Handling System Hang but Reachable Situations

Objective

After completing this lesson, you will be able to handle system hanging but reachable situations.

Troubleshoot Unresponsive Systems

There are various reasons for a system to hang, or seem to be hanging from an end-user perspective. The database is said to be hanging when it no longer responds to queries that are executed against it.

The source of the system standstill might be related to any of the components involved, for example, the storage, OS and hardware, network, SAP HANA database or the application layer. For troubleshooting it is essential to collect information about the context of the active threads in the SAP HANA database.

The following list of issues can cause a system hang state, or a state where the system seems to hang from the end-user perspective:

Question: How can a system get into a hanging state?

  • Log volume full caused by either a full disk, a quota setting or failed log backups
  • Savepoint lock conflict with long-running update
  • Wrong configuration of transparent huge page or OS page cache
  • The Translation Lookaside Buffer (TLB) shootdown
  • High context switches caused by many SqlExecutor or JobExecutor threads
  • Huge Multiversion Concurrency Control (MVCC) versions
  • High system CPU usage caused by non-HANA applications
  • Frequent Out of Memory (OOM) situations that lead to a performance drop

Note

What does "Translation Lookaside Buffer (TLB) shootdown" mean?

A Translation Lookaside Buffer (TLB) is a cache of the translations from virtual memory addresses to physical memory addresses. When a processor changes the virtual-to-physical mapping of an address, it needs to tell the other processors to invalidate that mapping in their caches.

As SQL statements cannot usually be executed for analysis, you should perform the following steps if it is still possible to log on to the OS of the master host (for example, as the <sid>adm user). Also see SAP Note 1999020: SAP HANA: Troubleshooting when database is no longer reachable for further specific steps and guidance on proactive or reactive actions you can take.

Question: What and where can you check?

  • Use SAP HANASitter

  • Check if the SAP HANA file systems still have free space

  • Collect call stack and runtime information

  • Analyze the current Operating System information

  • Analyze an Unresponsive System in SAP HANA Cockpit

Use SAP HANASitter

SAP HANASitter is a tool developed by SAP Support that allows you to monitor SAP HANA and to automatically create dump files in certain scenarios. For more information, see SAP Note 2399979 - How-To: Configuring automatic SAP HANA Data Collection with SAP HANASitter

How to start the HANASitter.py script: 1. Download the script attached to SAP Note 2399979. 2. Using PuTTY, log on to the Linux server as <sis>adm. 3. Execute the command: cdpy python hanasitter.py-ng 1.

SAP HANASitter Features:

  • Database online check

  • CPU, ping, and critical feature check

  • Recording mode for RTE dumps, stack calls, kernel profiler trace, and GStack

  • Scale-out monitor

  • Critical session killer

The SAP HANAsitter checks by default once an hour, if SAP HANA is online and primary. If so, it starts to track. Tracking includes regularly (by default, every minute) checking if SAP HANA is responsive. If it is not, it starts to record.

Recording can include writing call stacks of all active threads, recording run time dumps, index server gstacks, and/or kernel profiler traces. By default, nothing is recorded.

If SAP HANA is responsive, it checks many of the critical features of SAP HANA. As standard, the script checks if there are more than 30 active threads. If there are more than 30 active threads, the script starts to record.

When the script has finished recording, it exits. The script can be configured to restart using the command line.

When the script has finished all the tests successfully, it sleeps for one hour, before it starts all the checks again.

Setup Steps Overview

  1. Create an SAP HANA user (for example, HANASITTER, but you can use a different name) and assign the CATALOG READ privilege.

  2. Create a user key (for example, SYSTEMKEY, but you can use a different name) in the hdbuserstore.

  3. Download the hanasitter.py script attached to SAP Note 2399979.

  4. Store the script in, for example, the python_support directory.

  5. As <sid>adm, change to the python_support directory with the command cdpy.

  6. Execute the script with the command python hanasitter.py -ng 1.

Check if the SAP HANA File Systems Still Have Free Space

In a system hanging situation the execution of SQL statements is probably not possible anymore. If you still can log on to the operating system, you should try to perform the following steps on the OS of the master host.

Check an unresponsive system: 1. Check if the system has enough free inodes: df -i | grep hana. 2. Check for deadlock using hdbcons: hdbcons 'deadlockdetector wg -w -0 <filename>.dot'. 3. Check backup.log for errors using: cat backup.log | grep ERROR. 4. Check if transparent hugepage is disabled: cat /sys/kernel/mm/transparent_hugepage/enabled

In cases where logs cannot be written, all DML statements will fall into wait status. This can lead to a failure of opening new connections because the system internally executes DML statements during the process. Typically, a full log volume is the cause for this.

The root cause of the "log volume full" situation is either caused by disk being full or hitting the quota setting. To investigate more deeply, perform the following steps:

  1. Check for the Internal Disk-Full Event (Alert 30) in the indexserver trace.
  2. Check if the system is running out of disk space using the command df -h on the OS ssh shell.
  3. Check if the system is running out of inodes (NFS) using the command df -i.
  4. Check the quota setting in file system.
  5. Check SAP Note 1679938 - Log Volume is full.
  6. Check SAP Note 2083715 - Analyzing log volume full situations.

Once you have resolved the issue (for example, freed up disk space), you may need to manually mark the internal event as handled. You can do this on the Overview tab of the Administration editor in the SAP HANA studio, or by executing the following SQL statements:

Code Snippet
12
ALTER SYSTEM SET EVENT ACKNOWLEDGED '<host>:<port>' <id> ALTER SYSTEM SET EVENT HANDLED '<host>:<port>' <id>

Log Backup Failure

The "log volume full" situation can also be caused by failing log backups. A SAP HANA log segment is freed for reuse as soon as a the segment is written by the backup process to the backup location. If the log backups fail, the log segments are not reused and new log segments are created. Eventually this can fill up the data volume file system resulting in the "log volume full" situation. To investigate more deeply, perform the following steps:

  1. Check the backup.log (located at /usr/sap/<SID>/HDB<Instance#>/<Host>/trace) to see whether it includes ERROR in the log backup. Check M_BACKUP_CATALOG and M_LOG_SEGMENTS.
  2. If log backup uses backint, check backint.log (located at /usr/sap/<SID>/HDB<Instance#>/<Host>/trace) to see whether it includes ERROR information, and contact backint vendor support.

Savepoint Lock Conflict with Long-running Update

With certain revision and conditions, the conflict between the savepoint lock and the DML lock blocks subsequent statements when long-running update/insert statements exist. To investigate more deeply, perform the following steps:

  1. Use SAP Note 813020: How to generate a runtime dump on SAP HANA to collect a runtime dump. In the generated dump, look for the following combination of call stacks in many threads.
    Code Snippet
    1234
    … DataAccess::SavepointLock::lockShared(…) DataAccess::SavepointSPI::lockSavepoint(…) …

    And one of the following call stacks that is in the savepoint phase.

    Code Snippet
    1234
    … DataAccess::SavepointLock::lockExclusive() DataAccess::SavepointImpl::enterCriticalPhase(…) …
  2. If you are running SAP HANA 1.0 (rev 97 or older) check whether the symptoms match the description in SAP Note 2214279 - Blocking situation caused by waiting writer holding consistent change lock. If so, apply the parameter cch_reopening_enabled as described in the SAP Note.

Collect Call Stack and Runtime Information

For accurate root cause analysis, it is very helpful to have call-stack and runtime information available from the time of the hang. For this, you can use the approaches outlined in SAP Note 2313619. It is useful to capture one or several runtime dumps (SAP Note 1813020) so that an accurate root cause analysis can be done later.

If you consider the possibility of an SAP HANA internal deadlock, you can also run the deadlock detector functionality of hdbcons (SAP Note 2222218): hdbcons 'deadlockdetector wg -w -o <file_name>.dot'.

Note

The generated DOT file can be converted to a PDF or a GIF file using the following commands:

  • To generate a PDF file: dot -Tpdf -o <pdffile> /usr/sap/<SID>/HDB00/work/HA215_<SID>_DeadlockCheck.dot
  • To generate a GIF file: dot -Tgif -o <giffile> /usr/sap/<SID>/HDB00/work/HA215_<SID>_DeadlockCheck.dot

Analyze the Current Operating System Information

Use OS commands like top to identify the amount of CPU consumption, the main CPU consuming processes, and the main CPU utilization component. As SAP HANA is a multithreading system it's more useful to have an overview of the consumed CPU per thread. To display the CPU used per thread, use the top -H command. Its output shows the CPU consumption per thread.

Check /proc/interrupts to understand if certain interrupts happen very often. Ensure that the SAP HANA-related file systems contain sufficient free space (for example, using tools like df or mmdf (for GPFS)) and that there is no quota defined that could result in a file system becoming full. Check if the file systems are still reachable, for example, using df /usr/sap/<sid>. Inspect OS log files, such as /var/log/messages, for suspicious error messages that are issued at the time of the problem (but not at times of normal operation).

Check the OS Configuration

On Linux, you must ensure that Transparent Huge Pages (THPs) are disabled. If THPs are disabled properly, the cat /sys/kernel/mm/transparent_hugepage/enabled command returns always madvise [never].

See further recommended OS, firmware, and hardware settings described in SAP Note 2000003.

Use the OS Command "sar" for Historic Analysis

Use operating system commands like sar to identify further relevant operating system information (for example, memory, network, interrupts, and further CPU and load details). For more information, see SAP Note 1999670.

Execute Hardware Checks

On IBM hardware, you can use the hardware check tool provided in SAP Note 1661146 to collect more information about the system environment.

For SAP HANA 1.0, use the SAP HANA hardware configuration check tool (SAP Note 1943937) to determine bottlenecks on the infrastructure side.

For SAP HANA 2.0, use the SAP HANA Hardware and Cloud Measurement Tools (SAP Note 2493172) to determine bottlenecks on the infrastructure side.

See SAP Note 1999020: SAP HANA: Troubleshooting, when the database is no longer reachable for further specific steps and guidance on the proactive or reactive actions you can take.

Analyze an Unresponsive System in SAP HANA Cockpit

The Troubleshoot Unresponsive Systems function uses the SAP host agent. It can collect the most important diagnosis information from the SAP HANA database, even when the system is stopped or cannot be reached by SQL due to performance problems.

Screenshot showing how to Troubleshoot an Unresponsive System, as described in the preceding and following text.

To troubleshoot a system in a hang state, there is a function in SAP HANA cockpit called Troubleshoot unresponsive systems. When using this function, information is collected through the SAP host agent. The communication between the web browser and the SAP host agent is always done over HTTPS, which requires that the SAP host agent has a Secure Sockets Layer (SSL) certificate (PSE) in its security directory.

The information is collected into a file named emergency_info_<SID>.zip by the Python script emergencyInfo.py. This script connects to the index server, using the hdbcons interface. The script tries to collect information about the open connections, running transactions, and threads. It also shows blocked transactions. If the index server is unavailable, no information is shown.

The information is collected by a Python script that connects to the index server, using the hdbcons interface. The script tries to collect information about the open connections, running transactions, and threads. It also shows blocked transactions. If the index server is unavailable, no information is shown.

Screenshot of the Troubleshoot Unresponsive System function, highlighting four tabs: Connections, Transactions, Blocked Transactions, and Threads, as described in the following text.

The Troubleshoot Unresponsive System function organizes information about the system by tab. You can diagnose the following:

  • Connections
  • Transactions
  • Blocked transactions
  • Threads

Connections Tab

Screenshot of the Troubleshoot Unresponsive System page, highlighting the Connections tab, as described in the following text.

Analyzing the sessions connected to your SAP HANA database helps you identify which applications, or which users, are currently connected to your system, as well as what they are doing in terms of SQL execution.

On the CONNECTIONS tab, you can see information about the current connections to the SAP HANA server. This information includes connection start time, ID, user name, and status. If there are many connections open to the server, it can lead to congestion and may result in the server becoming unresponsive.

On the CONNECTIONS tab, you can use the Cancel Connection button to stop a single connection. To do this, select the connection that you want to cancel, and choose Cancel Connection.

You can stop all the transactions that are currently running by choosing Cancel All Transactions.

Transactions Tab

Screenshot of the Troubleshoot Unresponsive System page, highlighting the Transactions tab, as described in the following text.

On the TRANSACTIONS tab, you can see information about the current transactions in the SAP HANA system. This information includes connection and transaction ID, allocated memory, and user name. The information shown in the transactions table gives you a good insight into current activity on the system.

Via the connection ID and the primary connection tab, you can even link the transaction to the corresponding connection.

You can stop all the transactions that are currently running by choosing Cancel All Transactions.

Blocked Transactions Tab

Screenshot of the Troubleshoot Unresponsive System page, highlighting the Blocked Transactions tab, as described in the following text.

On the BLOCKED TRANSACTIONS tab, you can investigate if there are blocked transactions in your system.

Blocked transactions are transactions that cannot be processed further because they need to acquire transactional locks (record or table locks) that are currently held by another transaction. Transactions can also be blocked while waiting for other resources, such as the network or disk access (database or metadata locks).

The type of lock held by the blocking transaction (record, table, or metadata) is indicated in the Lock Type column.

The lock mode is indicated in the Transactional Lock Type column.

Exclusive: Row-level locks prevent concurrent write operations on the same record. They are acquired implicitly by update and delete operations, or explicitly with the SELECT FOR UPDATE statement.

Table-level: Locks prevent operations on the content of a table from interfering with changes to the table definition (such as drop table or alter table). DML operations on the table content require an intentional exclusive lock, while changes to the table definition (DDL operations) require an exclusive table lock. There is also a LOCK TABLE statement for explicitly locking a table. Intentional exclusive locks can be acquired if no other transaction holds an exclusive lock for the same object. Exclusive locks require that no other transaction holds a lock for the same object (neither intentional exclusive nor exclusive).

For more detailed analysis of blocked transactions, information about low-level locks is available in the columns Lock Wait Name, Lock Wait Component, and Thread ID of Low-Level Lock Owner. Low-level locks are locks acquired at the thread level. They manage code-level access to a range of resources (for example, internal data structures, network, or disk). Lock wait components group low-level locks by engine component or resource.

By choosing Cancel All Transactions, you can stop all the current running transactions.

Because the Delta Table Merge needs to lock tables to proceed, it is a common cause of blocked transactions. Another job displayed by this monitor is the savepoint write, which needs to pull a global database lock in its critical phase. A common issue is a flaw in the application coding that does not commit a write transaction. Such a transaction will block any other transaction that needs to access the same database object. To remedy the situation, close the blocking transaction.

In the UI table, the blocked transactions are displayed directly beneath the blocking transaction.

First, you must determine whether there is only one, or a few transactions, blocking many other transactions. To do this, open the Blocked Transaction tab and check the amount of blocking transactions. If there are only a few blocking transactions, there is probably an issue on the application side. To resolve the problem, use the following techniques:

  1. If only one transaction is blocked, contact the application user and the developer. Firstly, ask the user to close the application and secondly, to check if there is a general issue with the application code.

    If you are not able to contact the user, you can kill the transaction or kill the client process that opened the session. The transaction is rolled back. The session cancellation may take some time to succeed. If it takes longer than 30 seconds, consider this as a bug and contact development support.

    If the session cancellation takes too long or does not complete at all, you can kill the client process that opened the session. This also terminates the blocking transaction. As a prerequisite, you must have access to the client machine.

  2. If a large amount of transactions are blocked, you must find out whether a specific access pattern is causing the issue. If multiple transactions are trying to access the same database objects with write operations, they block each other. To check if this is happening, open the Blocked Transaction Monitor and analyze the Waiting Schema Name, Waiting Object Name, and Waiting Record ID columns. If you find a lot of transactions that are blocking many other transactions, you must investigate whether you can do the following:

    1. Change the client application(s) to avoid the access pattern.
    2. If a background job that issues many write transactions (for example, a data load job) is running, reschedule it to a period with a low user load.
    3. Partition tables that are accessed frequently to avoid clashes. See the SAP HANA Administration Guide for more details on partitioning.
  3. If you cannot identify specific transactions or specific database objects that lead to transactions being blocked, you can assume that there is a problem with the database itself or with its configuration, an issue with the delta merge (for example, mass write operation on a column table), or a long savepoint duration.

Threads Tab

Screenshot of the Troubleshoot Unresponsive System page, highlighting the Threads tab, as described in the following text.

In the THREADS tab, you can identify which statements or procedures are being executed and at what stage they are, who else is connected to the system, and if there are any internal processes running as well. The information shown includes thread type, ID, and thread status. You can also find information of system user and waiting time. The information shown in the table helps you to identify transactions with high average wait times. With the user name, thread ID, and wait time columns, you can identify which thread is causing problems.

In this tab, you can identify long-running threads and those threads that are blocked for an inexplicable long period of time.

In the case of an emergency, choose Cancel All Transactions.

Log in to track your progress & complete quizzes