Friday, January 29, 2016

Oracle Hang Management

Hang Manager (HM) has existed since 10.2.0.1. Its main goal is to reliably detect hangs and, if hang resolution is enabled, resolve them in a timely manner. Over various releases, Hang Manager has been enhanced along with the wait-event infrastructure on which it relies. However, it is only in 11.2.0.2 that Hang Manager actually resolves hangs by terminating sessions and/or processes; this is the default behavior in 11.2.0.2. Hang Manager will not terminate an instance unless the resolution scope, which is controlled by the initialization parameter _HANG_RESOLUTION_SCOPE, is set to INSTANCE. By default, this parameter is set to PROCESS.

Hang Manager is only active when RAC is enabled, that is, CLUSTER_DATABASE = TRUE; it currently does not operate on a non-RAC instance. The parameters _HANG_DETECTION_ENABLED and _HANG_RESOLUTION_SCOPE control hang detection and hang resolution respectively.
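
As a quick sanity check, the current values of these underscore parameters can be read from the X$ fixed tables when connected as SYS. A minimal sketch (the exact parameter names can vary between versions, so treat the IN-list as an assumption):

    -- show the hidden hang-management parameters and their current values
    select i.ksppinm  parameter,
           v.ksppstvl value
      from x$ksppi  i,
           x$ksppcv v
     where i.indx = v.indx
       and i.ksppinm in ('_hang_detection_enabled', '_hang_resolution_scope');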

The DIA0 process is responsible for hang detection and deadlock resolution.

NOT TO BE CONFUSED WITH
the DIAG process: the diagnosability process, which performs diagnostic dumps and executes oradebug commands.
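
Both background processes, along with their descriptions, can be listed from V$BGPROCESS; a simple sketch:

    -- DIA0 (hang manager) vs DIAG (diagnostic dumps / oradebug)
    select name, description
      from v$bgprocess
     where name in ('DIA0', 'DIAG');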

When the DIA0 process kills a process for hang management,
ORA-32701 is thrown, which indicates that "Hang Management" is in use.

A hang can occur for a variety of reasons:
 - application related
 - resource related
 - internal blockage in the Oracle kernel
 - other reasons

'Resolvable Hangs' in the incident trace file indicates one or more hangs that were found and identifies the final blocking session and the instance on which it occurred. If the current hang resolution state is 'PROCESS',
any hangs requiring session or process termination are automatically resolved by the DIA0 process.

From 11.2.0.2 onwards, Hang Management will try to resolve the hang by terminating the ultimate blocker.

If ORA-32701 occurred, it is important to determine the cause of the hang.
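
While a hang is still in progress, the ultimate blocker reported by DIA0 can usually be cross-checked from GV$SESSION. A rough sketch, assuming the FINAL_BLOCKING_* columns available from 11.2 onwards:

    -- waiting sessions and the final (ultimate) blocker of their wait chains
    select inst_id, sid, serial#, event, seconds_in_wait,
           final_blocking_instance, final_blocking_session
      from gv$session
     where final_blocking_session_status = 'VALID';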

In the example below, DIA0 terminated session sid:3768 with serial # 19475 (ospid:73677),
the final blocker of the hang with ID = 86.


-----------------------------------------------------------------------------------------
ORA-32701: Possible hangs up to hang ID=16 detected
Incident details in: /ade/b/3518352270/oracle/log/diag/rdbms/mainrac/mainrac1/incident/incdir_81/mainrac1_dia0_20207_i81.trc
DIA0 requesting termination of session sid:44 with serial # 1 (ospid:5964) on instance 2
  due to a GLOBAL, HIGH confidence hang with ID=16.
  Hang Resolution Reason: Although hangs of this root type are typically
  self-resolving, the previously ignored hang was resolved.
DIA0 could not resolve a GLOBAL, HIGH confidence hang with ID=16.
-----------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------
Errors in file /u01/app/oracle/diag/rdbms/dm02cdb1/orcl2/trace/orcl2_dia0_87406.trc  (incident=31458) (PDBNAME=CDB$ROOT):
ORA-32701: Possible hangs up to hang ID=86 detected
Incident details in: /u01/app/oracle/diag/rdbms/dm02cdb1/orcl2/incident/incdir_31458/orcl2_dia0_87406_i31458.trc
Thu Dec 31 21:59:43 2015
Sweep [inc][31458]: completed
Sweep [inc2][31458]: completed
Thu Dec 31 21:59:44 2015
DIA0 terminating blocker (ospid: 73677 sid: 3768 ser#: 19475) of hang with ID = 86
    requested by master DIA0 process on instance 2
    Hang Resolution Reason: Previously ignored hang - hang did not self-resolve
   as was expected.  Automatic hang resolution will be performed.
    by terminating session sid:3768 with serial # 19475 (ospid:73677)
DIA0 successfully terminated session sid:3768 with serial # 19475 (ospid:73677) with status 0.
DIA0 successfully resolved a GLOBAL, HIGH confidence hang with ID=86.

-----------------------------------------------------------------------------------------
*** 2015-12-31 21:59:40.213
Resolvable Hangs in the System
                     Root       Chain Total               Hang              
  Hang Hang          Inst Root  #hung #hung  Hang   Hang  Resolution        
    ID Type Status   Num  Sess   Sess  Sess  Conf   Span  Action            
 ----- ---- -------- ---- ----- ----- ----- ------ ------ -------------------
    86 HANG RSLNPEND    2  3768     3     3   HIGH GLOBAL Terminate Process 
 Hang Resolution Reason: Previously ignored hang - hang did not self-resolve
    as was expected.  Automatic hang resolution will be performed.

  Inst  Sess   Ser             Proc  Wait    Wait
   Num    ID    Num      OSPID  Name Time(s) Event
  ----- ------ ----- --------- ----- ------- -----
        PDBID PDBNm          
        ----- ---------------
      1   4528 13156     28552    FG     704 library cache lock
           11 MODERNIZATION_DEV_12
      1   2770 52684     22975    FG      50 library cache pin
           11 MODERNIZATION_DEV_12
      2   3768 19475     73677    FG    1183 not in wait
           11 MODERNIZATION_DEV_12

Dumping process info of pid[65.73677] (sid:3768, ser#:19475)
    requested by master DIA0 process on instance 2.

-----------------------------------------------------------------------------------------

*** 2015-12-31 21:59:40.213
[TOC00004]
Process diagnostic dump for oracle@dm02dbadm02.ttiinc.com, OS id=73677,
pid: 65, proc_ser: 187, sid: 3768, sess_ser: 19475
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
  0.000000 secs at [ 22:59:39 ]
    NOTE: scheduling delay has not been sampled for 0.476107 secs
  0.000000 secs from [ 22:59:35 - 22:59:40 ], 5 sec avg
  0.000000 secs from [ 22:58:40 - 22:59:40 ], 1 min avg
  0.000000 secs from [ 22:54:40 - 22:59:40 ], 5 min avg

*** 2015-12-31 21:59:41.679
loadavg : 2.31 2.47 2.23
System user time: 0.03 sys time: 0.00 context switch: 17233
Memory (Avail / Total) = 75526.99M / 257965.57M
Swap (Avail / Total) = 23089.19M /  24576.00M
F S UID         PID   PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 R oracle    73677      1 99  80   0 - 19898980 ?   21:39 ?        00:19:45 oracleorcl2 (LOCAL=NO)





ORA-00445 Background Process 'xxx' did not start after 120 seconds

This error is raised when an Oracle background process is unable to spawn a child process within the stipulated time of 2 minutes; startup of that child process is therefore aborted.

The parent process starts this child process in a synchronous manner (the code path is essentially a sequence, with nothing to handle the case where the child is not able to start), and hence cannot continue with any of its tasks until the child process spawning is completed. Sometimes this may lead to an instance-wide hang.

Where to find details:
  - Alert log
  - incident trace file
     the incident trace file name would be mentioned in the alert log itself
        shows the instance/database-wide waits that the child process encountered when coming up
       "PROCESS STATE" -> "Current Wait Stack"
  - traditional trace file generated at the time of the issue
      shows details of the load on the machine
              * load average
              * memory consumption
              * output of ps (process state)
              * output of GDB (function stack trace)

Need to review both the traditional and the incident trace files; the query below shows one way to locate them.
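
If it is not obvious where these files live, V$DIAG_INFO (available from 11g onwards) lists the ADR locations for the current instance. A small sketch; the NAME values shown are the usual ones but may differ slightly by version:

    -- ADR locations of the alert log, trace and incident directories
    select name, value
      from v$diag_info
     where name in ('Diag Trace', 'Diag Alert', 'Diag Incident', 'Default Trace File');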

Stages of process startup
 - Queued
 - Forking
 - Execution
 - Initialization

The Forking and Execution phases are directly linked to the load on the system/resources.
The traditional trace file contains information about which phase the process was in:
"waited for process"
    .. "to be spawned" - Forking/Queued
    .. "to initialize" - Execution/Initialization

RCA:

The root cause falls into one of the following two categories:
 - Contention among processes
 - OS and network level issues (OS resource issues / network storage issues)

Known issues/potential solutions:

  1. Lack of OS resources and incorrect config
             memory or swap space may be insufficient to spawn a new process
            - check the OS error log around the time the error occurred
                       AIX - errpt -a command
            - run the HCVE report
                 verifies that the OS resources are set as recommended by Oracle ( RDA - Health Check / Validation Engine Guide (Doc ID 250262.1) )
            - run OSWatcher
            - check the defined ulimit settings on Unix
                    On UNIX systems, the ulimit command controls the limits on system resources, such as process data size, process virtual memory, and process file size.

         Solution:
                 - review the output of HCVE and allocate resources as recommended and feasible
                 - reset ulimit settings if not apt
                 (a database-side cross-check of host resources is sketched below)
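
In addition to the OS-level checks above, a rough picture of host CPU, memory and load as seen by the instance can be taken from V$OSSTAT. A minimal sketch; the statistic names listed are the common ones and may vary slightly across platforms and versions:

    -- host CPU count, physical memory and current load as reported to the instance
    select stat_name, value
      from v$osstat
     where stat_name in ('NUM_CPUS', 'PHYSICAL_MEMORY_BYTES', 'LOAD');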

    2. The ASLR (Address Space Layout Randomization) Linux feature is being used
           ASLR is designed to load shared memory segments at random addresses, whereas Oracle expects multiple processes to map a shared memory object at the same address across processes.
           When ASLR is turned on, Oracle cannot guarantee the address of these shared objects, hence this error.

         Mainly reported on RHEL5 and Oracle 11.2.0.2.
         To verify ASLR:
                 - /sbin/sysctl -a | grep randomize
                 kernel.randomize_va_space = 1

                If the parameter is set to anything other than 0, then ASLR is in use.

             Solution:
                      - disable ASLR by setting the following parameters in /etc/sysctl.conf
                                kernel.randomize_va_space = 0
                                kernel.exec-shield = 0

                      A reboot is required for kernel.exec-shield to take effect.

    3.  Incorrect database settings
                - PGA_AGGREGATE_TARGET set to TRUE/FALSE
                        this should be a numeric value
               - PRE_PAGE_SGA is set to TRUE
                            instructs Oracle to read the entire SGA into memory at instance startup
                            OS page entries are then prebuilt for each SGA page. This can increase the time for instance startup but decreases the amount of time required for Oracle to reach full
                            performance capacity after startup
                        It can also increase process startup duration, because every process that starts must access every page in the SGA; this can cause the PMON process to start late and exceed the timeout.

               - check the output of SQL> select * from V$RESOURCE_LIMIT;
                   provides details of resources such as sessions, processes, locks, etc.
                   it shows the initialization limit for each resource, the maximum value reached since the last database startup, and the current utilization.

               Solution:
                      - set PGA_AGGREGATE_TARGET to an appropriate numeric value
                      - set PRE_PAGE_SGA to FALSE
                      - check whether the limits were reached and increase them accordingly
                      (example queries are sketched below)
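
For example, the two parameters and the most commonly exhausted limits can be checked with the standard V$PARAMETER and V$RESOURCE_LIMIT views; a sketch:

    -- verify PGA_AGGREGATE_TARGET is a numeric value and PRE_PAGE_SGA is FALSE
    select name, value
      from v$parameter
     where name in ('pga_aggregate_target', 'pre_page_sga');

    -- how close the instance has come to its process/session limits
    select resource_name, current_utilization, max_utilization, limit_value
      from v$resource_limit
     where resource_name in ('processes', 'sessions');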

  4. Other causes/known issues

            - too much activity on the machine
            - NFS latency issues
            - disk latency issues (I/O latency)
            - network latency