Friday, January 29, 2016

Oracle Hang Management

Hang Manager (HM) has existed since 10.2.0.1. Its main goal is to reliably detect hangs and, if hang resolution is enabled, resolve them in a timely manner. Over various releases, Hang Manager has been enhanced along with the wait-event infrastructure on which it relies. However, it is only in 11.2.0.2 that Hang Manager actually resolves hangs by terminating sessions and/or processes; this is the default behavior in 11.2.0.2. Hang Manager will not terminate an instance unless the resolution scope, which is controlled by the initialization parameter _HANG_RESOLUTION_SCOPE, is set to INSTANCE. By default, this parameter is set to PROCESS.

Hang Manager is only active when RAC is enabled, that is, CLUSTER_DATABASE = TRUE; it currently does not operate on a non-RAC instance. The parameters _HANG_DETECTION_ENABLED and _HANG_RESOLUTION_SCOPE control hang detection and hang resolution respectively.
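
As a quick sanity check, the current values of these underscore parameters can be read from the X$ fixed tables when connected as SYS. A minimal sketch (the exact parameter names can vary between versions, so treat the IN-list as an assumption):

    -- show the hidden hang-management parameters and their current values
    select i.ksppinm  parameter,
           v.ksppstvl value
      from x$ksppi  i,
           x$ksppcv v
     where i.indx = v.indx
       and i.ksppinm in ('_hang_detection_enabled', '_hang_resolution_scope');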

The DIA0 process is responsible for hang detection and deadlock resolution.

NOT TO BE CONFUSED WITH
the DIAG process: the diagnosability process, which performs diagnostic dumps and executes oradebug commands.
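
Both background processes, along with their descriptions, can be listed from V$BGPROCESS; a simple sketch:

    -- DIA0 (hang manager) vs DIAG (diagnostic dumps / oradebug)
    select name, description
      from v$bgprocess
     where name in ('DIA0', 'DIAG');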

When the DIA0 process kills a process for hang management,
ORA-32701 is thrown, which indicates that "Hang Management" is in use.

A hang can occur for a variety of reasons:
 - application related
 - resource related
 - internal blockage in the Oracle kernel
 - other reasons

'Resolvable Hangs' in the incident trace file indicates one or more hangs that were found and identifies the final blocking session and the instance on which it occurred. If the current hang resolution state is 'PROCESS',
any hangs requiring session or process termination are automatically resolved by the DIA0 process.

From 11.2.0.2 onwards, Hang Management will try to resolve the hang by terminating the ultimate blocker.

If ORA-32701 occurred, it is important to determine the cause of the hang.
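
While a hang is still in progress, the ultimate blocker reported by DIA0 can usually be cross-checked from GV$SESSION. A rough sketch, assuming the FINAL_BLOCKING_* columns available from 11.2 onwards:

    -- waiting sessions and the final (ultimate) blocker of their wait chains
    select inst_id, sid, serial#, event, seconds_in_wait,
           final_blocking_instance, final_blocking_session
      from gv$session
     where final_blocking_session_status = 'VALID';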

In the example below, DIA0 terminated session sid:3768 with serial # 19475 (ospid:73677),
the final blocker of the hang with ID = 86.


-----------------------------------------------------------------------------------------
ORA-32701: Possible hangs up to hang ID=16 detected
Incident details in: /ade/b/3518352270/oracle/log/diag/rdbms/mainrac/mainrac1/incident/incdir_81/mainrac1_dia0_20207_i81.trc
DIA0 requesting termination of session sid:44 with serial # 1 (ospid:5964) on instance 2
  due to a GLOBAL, HIGH confidence hang with ID=16.
  Hang Resolution Reason: Although hangs of this root type are typically
  self-resolving, the previously ignored hang was resolved.
DIA0 could not resolve a GLOBAL, HIGH confidence hang with ID=16.
-----------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------
Errors in file /u01/app/oracle/diag/rdbms/dm02cdb1/orcl2/trace/orcl2_dia0_87406.trc  (incident=31458) (PDBNAME=CDB$ROOT):
ORA-32701: Possible hangs up to hang ID=86 detected
Incident details in: /u01/app/oracle/diag/rdbms/dm02cdb1/orcl2/incident/incdir_31458/orcl2_dia0_87406_i31458.trc
Thu Dec 31 21:59:43 2015
Sweep [inc][31458]: completed
Sweep [inc2][31458]: completed
Thu Dec 31 21:59:44 2015
DIA0 terminating blocker (ospid: 73677 sid: 3768 ser#: 19475) of hang with ID = 86
    requested by master DIA0 process on instance 2
    Hang Resolution Reason: Previously ignored hang - hang did not self-resolve
   as was expected.  Automatic hang resolution will be performed.
    by terminating session sid:3768 with serial # 19475 (ospid:73677)
DIA0 successfully terminated session sid:3768 with serial # 19475 (ospid:73677) with status 0.
DIA0 successfully resolved a GLOBAL, HIGH confidence hang with ID=86.

-----------------------------------------------------------------------------------------
*** 2015-12-31 21:59:40.213
Resolvable Hangs in the System
                     Root       Chain Total               Hang              
  Hang Hang          Inst Root  #hung #hung  Hang   Hang  Resolution        
    ID Type Status   Num  Sess   Sess  Sess  Conf   Span  Action            
 ----- ---- -------- ---- ----- ----- ----- ------ ------ -------------------
    86 HANG RSLNPEND    2  3768     3     3   HIGH GLOBAL Terminate Process 
 Hang Resolution Reason: Previously ignored hang - hang did not self-resolve
    as was expected.  Automatic hang resolution will be performed.

  Inst  Sess   Ser             Proc  Wait    Wait
   Num    ID    Num      OSPID  Name Time(s) Event
  ----- ------ ----- --------- ----- ------- -----
        PDBID PDBNm          
        ----- ---------------
      1   4528 13156     28552    FG     704 library cache lock
           11 MODERNIZATION_DEV_12
      1   2770 52684     22975    FG      50 library cache pin
           11 MODERNIZATION_DEV_12
      2   3768 19475     73677    FG    1183 not in wait
           11 MODERNIZATION_DEV_12

Dumping process info of pid[65.73677] (sid:3768, ser#:19475)
    requested by master DIA0 process on instance 2.

-----------------------------------------------------------------------------------------

*** 2015-12-31 21:59:40.213
[TOC00004]
Process diagnostic dump for oracle@dm02dbadm02.ttiinc.com, OS id=73677,
pid: 65, proc_ser: 187, sid: 3768, sess_ser: 19475
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
  0.000000 secs at [ 22:59:39 ]
    NOTE: scheduling delay has not been sampled for 0.476107 secs
  0.000000 secs from [ 22:59:35 - 22:59:40 ], 5 sec avg
  0.000000 secs from [ 22:58:40 - 22:59:40 ], 1 min avg
  0.000000 secs from [ 22:54:40 - 22:59:40 ], 5 min avg

*** 2015-12-31 21:59:41.679
loadavg : 2.31 2.47 2.23
System user time: 0.03 sys time: 0.00 context switch: 17233
Memory (Avail / Total) = 75526.99M / 257965.57M
Swap (Avail / Total) = 23089.19M /  24576.00M
F S UID         PID   PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 R oracle    73677      1 99  80   0 - 19898980 ?   21:39 ?        00:19:45 oracleorcl2 (LOCAL=NO)





ORA-00445 Background Process 'xxx' did not start after 120 seconds

This error is raised when an Oracle background process is unable to spawn a child process within the stipulated time of 2 minutes; startup of that child process is therefore aborted.

The parent process starts this child process in a synchronous manner (the code path is essentially a sequence, with nothing to handle the case where the child is not able to start), and hence cannot continue with any of its tasks until the child process spawning is completed. Sometimes this may lead to an instance-wide hang.

Where to find details:
  - Alert log
  - incident trace file
     the incident trace file name would be mentioned in the alert log itself
        shows the instance/database-wide waits that the child process encountered when coming up
       "PROCESS STATE" -> "Current Wait Stack"
  - traditional trace file generated at the time of the issue
      shows details of the load on the machine
              * load average
              * memory consumption
              * output of ps (process state)
              * output of GDB (function stack trace)

Need to review both the traditional and the incident trace files; the query below shows one way to locate them.
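
If it is not obvious where these files live, V$DIAG_INFO (available from 11g onwards) lists the ADR locations for the current instance. A small sketch; the NAME values shown are the usual ones but may differ slightly by version:

    -- ADR locations of the alert log, trace and incident directories
    select name, value
      from v$diag_info
     where name in ('Diag Trace', 'Diag Alert', 'Diag Incident', 'Default Trace File');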

Stages of process startup
 - Queued
 - Forking
 - Execution
 - Initialization

The Forking and Execution phases are directly linked to the load on the system/resources.
The traditional trace file contains information about which phase the process was in:
"waited for process"
    .. "to be spawned" - Forking/Queued
    .. "to initialize" - Execution/Initialization

RCA:

The root cause falls into one of the following two categories:
 - Contention among processes
 - OS and network level issues (OS resource issues / network storage issues)

Known issues/potential solutions:

  1. Lack of OS resources and incorrect config
             memory or swap space may be insufficient to spawn a new process
            - check the OS error log around the time the error occurred
                       AIX - errpt -a command
            - run the HCVE report
                 verifies that the OS resources are set as recommended by Oracle ( RDA - Health Check / Validation Engine Guide (Doc ID 250262.1) )
            - run OSWatcher
            - check the defined ulimit settings on Unix
                    On UNIX systems, the ulimit command controls the limits on system resources, such as process data size, process virtual memory, and process file size.

         Solution:
                 - review the output of HCVE and allocate resources as recommended and feasible
                 - reset ulimit settings if not apt
                 (a database-side cross-check of host resources is sketched below)
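
In addition to the OS-level checks above, a rough picture of host CPU, memory and load as seen by the instance can be taken from V$OSSTAT. A minimal sketch; the statistic names listed are the common ones and may vary slightly across platforms and versions:

    -- host CPU count, physical memory and current load as reported to the instance
    select stat_name, value
      from v$osstat
     where stat_name in ('NUM_CPUS', 'PHYSICAL_MEMORY_BYTES', 'LOAD');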

    2. The ASLR (Address Space Layout Randomization) Linux feature is being used
           ASLR is designed to load shared memory segments at random addresses, whereas Oracle expects multiple processes to map a shared memory object at the same address across processes.
           When ASLR is turned on, Oracle cannot guarantee the address of these shared objects, hence this error.

         Mainly reported on RHEL5 and Oracle 11.2.0.2.
         To verify ASLR:
                 - /sbin/sysctl -a | grep randomize
                 kernel.randomize_va_space = 1

                If the parameter is set to anything other than 0, then ASLR is in use.

             Solution:
                      - disable ASLR by setting the following parameters in /etc/sysctl.conf
                                kernel.randomize_va_space = 0
                                kernel.exec-shield = 0

                      A reboot is required for kernel.exec-shield to take effect.

    3.  Incorrect database settings
                - PGA_AGGREGATE_TARGET set to TRUE/FALSE
                        this should be a numeric value
               - PRE_PAGE_SGA is set to TRUE
                            instructs Oracle to read the entire SGA into memory at instance startup
                            OS page entries are then prebuilt for each SGA page. This can increase the time for instance startup but decreases the amount of time required for Oracle to reach full
                            performance capacity after startup
                        It can also increase process startup duration, because every process that starts must access every page in the SGA; this can cause the PMON process to start late and exceed the timeout.

               - check the output of SQL> select * from V$RESOURCE_LIMIT;
                   provides details of resources such as sessions, processes, locks, etc.
                   it shows the initialization limit for each resource, the maximum value reached since the last database startup, and the current utilization.

               Solution:
                      - set PGA_AGGREGATE_TARGET to an appropriate numeric value
                      - set PRE_PAGE_SGA to FALSE
                      - check whether the limits were reached and increase them accordingly
                      (example queries are sketched below)
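
For example, the two parameters and the most commonly exhausted limits can be checked with the standard V$PARAMETER and V$RESOURCE_LIMIT views; a sketch:

    -- verify PGA_AGGREGATE_TARGET is a numeric value and PRE_PAGE_SGA is FALSE
    select name, value
      from v$parameter
     where name in ('pga_aggregate_target', 'pre_page_sga');

    -- how close the instance has come to its process/session limits
    select resource_name, current_utilization, max_utilization, limit_value
      from v$resource_limit
     where resource_name in ('processes', 'sessions');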

  4. Other causes/known issues

            - too much activity on the machine
            - NFS latency issues
            - disk latency issues (I/O latency)
            - network latency