
appagent.bin + unable to fork new process 19 October 2010

Posted by David Alejo Marcos in Oracle 11.2, RAC.

Yesterday (18/10/2010) we had a problem on our QA cluster.

After refreshing the database from production, several of our DBAs noticed they could not connect to the first node from time to time.

At this point, I had a look.

The problem:

We had intermittent disconnections from node a (we have a 3-node cluster). I was told the error being raised was “unable to fork new process”.

I had a look on Metalink, but could not find any entry related to appagent.bin.

The solution:

We had the same problem some time ago and fixed it by restarting the server. This type of fix is not our preferred option, as we have 3 databases running on the same clusterware.

To identify the process with the most threads we used the following command (thanks to Coskan (Josh) Gundogar):

[oracle@ssssa trace]$ ps -eLf | grep -i appagent | wc -l
30793

We compared this number with node 2:

[oracle@ssssb trace]$  ps -eLf | grep appagent | wc -l
24

and it proved something was not right with the process.
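The same idea can be taken a step further to rank every command by thread count, rather than grepping for one name at a time. A minimal sketch, assuming the standard `ps -eLf` column layout (where the command path is field 10):

```shell
#!/bin/sh
# Count threads (LWPs) per command from `ps -eLf`-style output, highest first.
# A sketch: it keys on field 10 (the command path) of the standard -eLf layout.
count_threads() {
  awk '{ n[$10]++ } END { for (cmd in n) printf "%d %s\n", n[cmd], cmd }' | sort -rn
}

# Typical use on a live node:
#   ps -eLf | tail -n +2 | count_threads | head
```

This would have pointed at appagent.bin immediately, without having to guess which process to grep for.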

Something interesting about this process on the first node was the date:

[oracle@ssssa trace]$ ps -eLf | grep appagent
oracle   14768     1  9159  0 30792 Oct15 ?       00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   14768     1  9223  0 30792 Oct15 ?       00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   14768     1  9317  0 30792 Oct15 ?       00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin

So all of them (30793 threads) were 3 days old.
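The process age can also be read directly from ps: on newer procps-ng versions the `etimes` output field gives elapsed seconds since the process started, which a small helper can turn into whole days. A sketch (the PID is the appagent.bin one from the listing above):

```shell
#!/bin/sh
# Convert a process's elapsed time in seconds (from `ps -o etimes=`) into days.
age_days() {
  echo $(( $1 / 86400 ))
}

# Live use, with the appagent.bin PID from the listing above:
#   age_days "$(ps -o etimes= -p 14768)"
age_days 259200   # prints 3: a process started exactly 3 days ago
```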

I found some errors on the alert.log file regarding this process:

[oracle@ssssssa]$ cd /u01/crs/oracle/product/11.2.0/grid/log/sssssssa
[oracle@sssssssa]$ tail -100f alertssssssssa.log

[client(17908)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/crs/oracle/product/11.2.0/grid/log/ssssssa/client/css.log.
2010-10-18 14:16:25.223
[/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin(9413)]CRS-5013:Agent "/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin" failed to start process "/u01/crs/oracle/product/11.2.0/grid/bin/lsnrctl" for action "check": details at "(:CLSN00008:)" in "/u01/crs/oracle/product/11.2.0/grid/log/ssssssa/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2010-10-18 14:16:37.512
[client(10936)]CRS-2313:Fatal error: failed to initialize GIPC communication layer. GIPC error: gipcretInitialization (2).
2010-10-18 14:16:37.512
[client(10936)]CRS-2316:Fatal error: cannot initialize GPnP, CLSGPNP_ERR (Generic GPnP error).
2010-10-18 14:16:37.539
[client(10936)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/crs/oracle/product/11.2.0/grid/log/ssssssa/client/css.log.
2010-10-18 14:16:38.920
[/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin(9413)]CRS-5013:Agent "/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin" failed to start process "/u01/crs/oracle/product/11.2.0/grid/opmn/bin/onsctli" for action "check": details at "(:CLSN00008:)" in "/u01/crs/oracle/product/11.2.0/grid/log/ssssssa/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2010-10-18 14:16:53.100
[/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin(9413)]CRS-5013:Agent "/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin" failed to start process "/u01/crs/oracle/product/11.2.0/grid/bin/lsnrctl" for action "check": details at "(:CLSN00008:)" in "/u01/crs/oracle/product/11.2.0/grid/log/ssssssa/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2010-10-18 14:17:25.227
[/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin(9413)]CRS-5013:Agent "/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin" failed to start process "/u01/crs/oracle/product/11.2.0/grid/bin/lsnrctl" for action "check": details at "(:CLSN00008:)" in "/u01/crs/oracle/product/11.2.0/grid/log/sssssa/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2010-10-18 14:17:38.921
[/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin(9413)]CRS-5013:Agent "/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin" failed to start process "/u01/crs/oracle/product/11.2.0/grid/opmn/bin/onsctli" for action "check": details at "(:CLSN00008:)" in "/u01/crs/oracle/product/11.2.0/grid/log/sssssa/agent/crsd/oraagent_oracle/oraagent_oracle.log"

and I was told one of the filesystems had filled up on Friday, October 15th.
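Since the full filesystem on October 15th lines up with the age of the stuck threads, a quick check for filesystems near capacity is worth scripting. A minimal sketch (the 90% threshold is an assumption; field 5 is the Capacity column in POSIX `df -P` output):

```shell
#!/bin/sh
# Print any filesystem at or above a usage threshold, from `df -P` output.
flag_full() {
  awk -v limit="${1:-90}" 'NR > 1 { sub(/%/, "", $5); if ($5 + 0 >= limit) print $5 "% " $6 }'
}

# Live use:
#   df -P | flag_full 90
```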

As we had an agreement with all 3 businesses to have downtime, I decided to kill the process:

[oracle@ssssssa trace]$ kill -9 14768

and it fixed the problem:

[oracle@ssssa trace]$ ps -eLf | grep appagent | wc -l
14

[oracle@ssssssa trace]$ ps -eLf | grep appagent
oracle   18655     1 18655  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18657  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18658  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18659  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18660  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18661  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18662  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18663  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18664  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18665  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18670  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18673  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18694  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   19230 22249 19230  0    1 15:18 pts/4    00:00:00 grep appagent

Conclusions:

I am not sure whether this issue was caused by the filesystems filling up or by the “The OCR location in an ASM disk group is inaccessible” errors logged in the alert.log.

For the above, we have an SR open with Oracle.

As always, comments are welcome.

Comments»

1. Daniesh Shaikh - 29 September 2011

Hi David,

We are facing the same issue in one of our environments, where there are 4 databases. I have raised an SR with Oracle but have not got any response yet. One of the nodes has been down for 2 days.
I would be very thankful if you could share the actual resolution for this.

Thanks and Regards,
Daniesh

2. Daniesh Shaikh - 30 September 2011

Hi David,

We are facing the same issue. I would be very thankful if you would please share the solution provided by Oracle.

Thanks and Regards,
Daniesh

David Alejo Marcos - 3 October 2011

Hi,

I do not recall Oracle providing us with any solution to this problem. We fixed it by killing the process (appagent.bin) after agreeing downtime with the business (in case the database or cluster crashed).

This was the workaround we used, as described in the post. If you decide to follow it, please arrange downtime and proceed with caution; I would also recommend checking that you have an up-to-date backup.

Thank you.

David Alejo-Marcos.

shaikh daniesh - 3 October 2011

Thanks, David

