jump to navigation

appagent.bin + unable to fork new process 19 October 2010

Posted by David Alejo Marcos in Oracle 11.2, RAC.
Tags: ,
comments closed

Yesterday (18/10/2010)  we had a problem on our QA cluster.

After refreshing the database from production, several of our DBAs noticed they could not connect to the first node from time to time.

At this point, I did have a look.

The problem:

We had intermittent disconnections from node a (we have a 3-node cluster). I was told the error being raised was “unable to fork new process”.

I did have a look on Metalink, but I could not find any entry related to appagent.bin.

The solution:

We did have the same problem some time ago and we fixing by stopping the server. This type of fix is not our prefer option as we have 3 databases running on the same clusterware.

To identify the process with more threads we used the following command (Thanks to Coskan (Josh) Gundogar):

[oracle@ssssa trace]$ ps -eLf | grep -i appagent | wc -l
30793

we compared this number with node 2:

[oracle@ssssb trace]$  ps -eLf | grep appagent | wc -l
24

and it proved something was not right with the process.

Something interesting about this process on the first node was the date:

[oracle@ssssa trace]$ ps -eLf | grep appagent
oracle   14768     1  9159  0 30792 Oct15 ?       00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   14768     1  9223  0 30792 Oct15 ?       00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   14768     1  9317  0 30792 Oct15 ?       00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin

so all of them (30793) were 3 days old.

I found some errors on the alert.log file regarding this process:

[oracle@ssssssa]$ cd /u01/crs/oracle/product/11.2.0/grid/log/sssssssa
[oracle@sssssssa]$ tail -100f alertssssssssa.log

[client(17908)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/crs/oracle/product/11.2.0/grid/log/ssssssa/client/css.log.
2010-10-18 14:16:25.223
[/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin(9413)]CRS-5013:Agent "/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin" failed to start process "/u01/crs/oracle/product/11.2.0/grid/bin/lsnrctl" for action "check": details at "(:CLSN00008:)" in "/u01/crs/oracle/product/11.2.0/grid/log/ssssssa/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2010-10-18 14:16:37.512
[client(10936)]CRS-2313:Fatal error: failed to initialize GIPC communication layer. GIPC error: gipcretInitialization (2).
2010-10-18 14:16:37.512
[client(10936)]CRS-2316:Fatal error: cannot initialize GPnP, CLSGPNP_ERR (Generic GPnP error).
2010-10-18 14:16:37.539
[client(10936)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/crs/oracle/product/11.2.0/grid/log/ssssssa/client/css.log.
2010-10-18 14:16:38.920
[/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin(9413)]CRS-5013:Agent "/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin" failed to start process "/u01/crs/oracle/product/11.2.0/grid/opmn/bin/onsctli" for action "check": details at "(:CLSN00008:)" in "/u01/crs/oracle/product/11.2.0/grid/log/ssssssa/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2010-10-18 14:16:53.100
[/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin(9413)]CRS-5013:Agent "/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin" failed to start process "/u01/crs/oracle/product/11.2.0/grid/bin/lsnrctl" for action "check": details at "(:CLSN00008:)" in "/u01/crs/oracle/product/11.2.0/grid/log/ssssssa/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2010-10-18 14:17:25.227
[/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin(9413)]CRS-5013:Agent "/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin" failed to start process "/u01/crs/oracle/product/11.2.0/grid/bin/lsnrctl" for action "check": details at "(:CLSN00008:)" in "/u01/crs/oracle/product/11.2.0/grid/log/sssssa/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2010-10-18 14:17:38.921
[/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin(9413)]CRS-5013:Agent "/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin" failed to start process "/u01/crs/oracle/product/11.2.0/grid/opmn/bin/onsctli" for action "check": details at "(:CLSN00008:)" in "/u01/crs/oracle/product/11.2.0/grid/log/sssssa/agent/crsd/oraagent_oracle/oraagent_oracle.log"

and I was told one of the filesystem filled on Friday October 15th.

As we had an agreement with all 3 businesses to have a downtime I decided to kill the process:

[oracle@ssssssa trace]$ kill -9 14768

and it fixed the problem:

[oracle@ssssa trace]$ ps -eLf | grep appagent | wc -l
14

[oracle@ssssssa trace]$ ps -eLf | grep appagent
oracle   18655     1 18655  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18657  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18658  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18659  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18660  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18661  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18662  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18663  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18664  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18665  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18670  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18673  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18694  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   19230 22249 19230  0    1 15:18 pts/4    00:00:00 grep appagent

Conclusions:

I am not sure if the problem with filesystems getting full is related to this issue or “The OCR location in an ASM disk group is inaccessible” logged on the alert.log.

For the above, we have an SR with Oracle.

As always, comments are welcome.