Friday, January 29, 2016

ORA-00445: Background Process 'xxx' did not start after 120 seconds

This error is raised when an Oracle background process is unable to spawn a child process within the stipulated time of 2 minutes (120 seconds), so startup of that child process is aborted.

The parent process starts the child process synchronously (the code is probably a plain sequence with nothing to handle the case where the child is not able to start), so it cannot continue with any of its tasks until child process spawning is completed. Sometimes this may lead to an instance-wide hang.

Where to find details:
  - Alert log
  - Incident trace file
       * its name is mentioned in the alert log itself
       * shows the instance/database-wide waits that the child process encountered while coming up
       * look under "PROCESS STATE" -> "Current Wait Stack"
  - Traditional trace file generated at the time of the issue
       shows details of the load on the machine
               * load average
               * memory consumption
               * output of ps (process status)
               * output of gdb (function stack trace)

Both the traditional and the incident trace files need to be reviewed.

Stages of process startup
 - Queued
 - Forking
 - Execution
 - Initialization

The Forking and Execution phases are directly linked to the load on the system and its resources.
The traditional trace file contains information about which phase the process was in. Look for the text "waited for process":
    .. "to be spawned" - Queued/Forking
    .. "to initialize" - Execution/Initialization
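
The mapping above can be sketched as a small helper for scanning a trace file. This is only an illustration: the function name and the sample line are hypothetical, and the matching relies solely on the phrases quoted above, not on the exact wording Oracle writes to the trace.

```python
from typing import Optional


def startup_phase(trace_line: str) -> Optional[str]:
    """Map a 'waited for process' trace message to the startup phases it implies.

    Returns None when the line is not a 'waited for process' message or
    does not contain either of the two known suffixes.
    """
    text = trace_line.lower()
    if "waited for process" not in text:
        return None
    if "to be spawned" in text:
        return "Queued/Forking"
    if "to initialize" in text:
        return "Execution/Initialization"
    return None


if __name__ == "__main__":
    # Hypothetical sample line, not verbatim Oracle output.
    sample = "Waited for process W000 to be spawned"
    print(startup_phase(sample))
```

A result of Queued/Forking points towards load or resource shortage on the host, while Execution/Initialization suggests looking at what the new process was waiting on inside the instance.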

RCA:

The root cause falls into one of the following two categories:
 - Contention among processes
 - OS and network level issues (OS resource issues, network storage issues)

Known issues/potential solutions:

  1. Lack of OS resources and incorrect configuration
             Memory or swap space may be insufficient to spawn a new process.
            - check the OS error log for the time when the error occurred
                       AIX - errpt -a command
            - run the HCVE report
                 verifies that the OS resources are set as recommended by Oracle ( RDA - Health Check / Validation Engine Guide (Doc ID 250262.1) )
            - run OSWatcher
            - check the defined ulimit settings on UNIX
                    On UNIX systems, the ulimit command controls the limits on system resources, such as process data size, process virtual memory, and process file size.
         
         Solution:
                 - review the output of HCVE and allocate resources as recommended and feasible
                 - reset the ulimit settings if they are not appropriate
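
As a rough illustration of the ulimit check, the sketch below reads the limits of the current process through Python's standard resource module. It has to be run as the Oracle software owner to reflect the limits Oracle processes inherit. The three limits named above are covered; the max user processes limit is an addition assumed to be relevant here, and the function names are hypothetical.

```python
import resource

# Limits named in the text, plus RLIMIT_NPROC (an assumption: it is not
# mentioned above, but it caps how many processes a user may start).
LIMITS = {
    "data size": "RLIMIT_DATA",
    "virtual memory": "RLIMIT_AS",
    "file size": "RLIMIT_FSIZE",
    "max user processes": "RLIMIT_NPROC",
}


def fmt(value):
    """Render a raw limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {label: (soft, hard)} for every limit this platform defines."""
    result = {}
    for label, const_name in LIMITS.items():
        const = getattr(resource, const_name, None)
        if const is None:  # not every platform defines every limit
            continue
        soft, hard = resource.getrlimit(const)
        result[label] = (fmt(soft), fmt(hard))
    return result


if __name__ == "__main__":
    # Size limits are reported in bytes, unlike the units the shell's ulimit prints.
    for label, (soft, hard) in current_limits().items():
        print(f"{label:20s} soft={soft:>12s} hard={hard:>12s}")
```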

  2. ASLR (Address Space Layout Randomization), a Linux feature, is in use
           ASLR is designed to load shared memory objects at random addresses. In Oracle, multiple processes map a shared memory object at the same address across processes.
           When ASLR is turned on, Oracle cannot guarantee the address of these shared objects, hence this error.
         
         Mainly reported on RHEL 5 and Oracle 11.2.0.2.
         To verify ASLR:
                 - /sbin/sysctl -a | grep randomize
                 kernel.randomize_va_space = 1
              
                If the parameter is set to anything other than 0, then ASLR is in use.

             Solution:
                      - disable ASLR by setting the following parameters in /etc/sysctl.conf
                                kernel.randomize_va_space = 0
                                kernel.exec-shield = 0
                           
                        A reboot is required for kernel.exec-shield to take effect.
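
The ASLR check can also be scripted. The following is a minimal sketch that interprets text in the format printed by the sysctl command shown above; the function names are hypothetical, and the only rule applied is the one stated above (any value other than 0 means ASLR is in use).

```python
from typing import Optional


def parse_randomize_va_space(sysctl_output: str) -> Optional[int]:
    """Extract the kernel.randomize_va_space value from sysctl-style text."""
    for line in sysctl_output.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() == "kernel.randomize_va_space":
            try:
                return int(value.strip())
            except ValueError:
                return None
    return None


def aslr_in_use(sysctl_output: str) -> Optional[bool]:
    """True if ASLR is on, False if off, None if the parameter was not found."""
    value = parse_randomize_va_space(sysctl_output)
    return None if value is None else value != 0


if __name__ == "__main__":
    print(aslr_in_use("kernel.randomize_va_space = 1"))
```

The input text can come from a saved sysctl listing or from running /sbin/sysctl -a on the database host and passing its output to aslr_in_use.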

  3. Incorrect database settings
                - PGA_AGGREGATE_TARGET is set to TRUE/FALSE
                        this should be a numeric value
                - PRE_PAGE_SGA is set to TRUE
                        instructs Oracle to read the entire SGA into memory at instance startup
                        OS page table entries are then prebuilt for each SGA page. This can increase the time for instance startup, but decreases the amount of time required for Oracle to reach full performance capacity after startup.
                        It can also increase process startup duration, because every process that starts must access every page in the SGA. This can cause the PMON process to start late and exceed the timeout.

                - check the output of SQL> select * from v$RESOURCE_LIMIT;
                        provides details of resources such as sessions, processes, locks etc.
                        it shows the initial allocation for each resource, the maximum utilization reached since the last database startup, and the current utilization.

                Solution:
                      - set PGA_AGGREGATE_TARGET to an appropriate numeric value
                      - set PRE_PAGE_SGA to FALSE
                      - check whether the limits were reached and increase them accordingly
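
To make the last check concrete, here is a small sketch that flags resources whose peak usage reached the configured limit. It works on rows already fetched from v$RESOURCE_LIMIT, taking the RESOURCE_NAME, CURRENT_UTILIZATION, MAX_UTILIZATION and LIMIT_VALUE columns in that order. The function name and the sample rows are hypothetical, and how the rows are fetched is left out on purpose.

```python
from typing import List, Tuple

# (resource_name, current_utilization, max_utilization, limit_value)
Row = Tuple[str, int, int, str]


def exhausted_resources(rows: List[Row]) -> List[str]:
    """Return the names of resources whose peak usage reached their limit."""
    hits = []
    for name, _current, max_used, limit_value in rows:
        # LIMIT_VALUE is a padded string and may be non-numeric, e.g. UNLIMITED.
        limit_text = str(limit_value).strip()
        if not limit_text.isdigit():
            continue
        limit = int(limit_text)
        if limit > 0 and max_used >= limit:
            hits.append(name)
    return hits


# Made-up sample rows, for illustration only.
SAMPLE_ROWS = [
    ("processes", 148, 150, "       150"),
    ("sessions", 160, 170, "       248"),
    ("enqueue_locks", 20, 45, " UNLIMITED"),
]

if __name__ == "__main__":
    print(exhausted_resources(SAMPLE_ROWS))
```

With the sample rows, only processes is reported, since its peak utilization of 150 equals its limit.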

  4. Other causes/known issues
      
            - too much activity on the machine
            - NFS latency issues
            - disk latency issues (I/O latency)
            - network latency
