LGWR is blocking sessions (or running very slow). 5 February 2010Posted by David Alejo Marcos in Enterprise Manager, Tuning.
I am aware this is a very generic subject or a very specific problem to every system or database, so I am going to write about the problem we had with one of our databases at the end of last week.
Last Friday, at lunchtime, we had a huge spike of orange looking wait event related to Commit (see fig_1) on Enterprise Manager. After digging a bit on some v$ views and Enterprise Manager we found that the LGWR was blocking several user sessions.
We raised a Severity 1 SR with oracle as the symptoms where the same as the ones described on Bug 5230765 – LGWR spins while waiting for ‘log file parallel write’ using ASM [ID 5230765.8. This bug was technically solved on 10.2.0.4 but it is always good to check things.
After uploading AWRs for the time the database was affected and the previous day when it was fine, some screen prints from Enterprise Manager, ASH for a short period of time and different log files (traces, alert log, etc); the best they could come up with was to resize the redo logs.
I must say the way they approached the problem was not right in my opinion.
Those redolog files never changed for the last 1+ year and we did not have the problem before. The problem went away at midnight and manifested itself on Monday and Tuesday, always around midday.
The database is running on a single instance, Oracle + ASM 10.2.0.4, and Linux x86-64.
One of the trends I noticed (or lack of trend) was on the dml being executed. Some times it was an insert, sometimes and update on different objects and different schemas, so a hot block was unlikely to be the problem.
Another big clue was the number of waits in Buffer Gets and Disk Reads where well above average for both per execution and per session.
The Storage manager and the SAs checked the system several times while it was performing really bad and both said that it did not look different from previous days or weeks, so not much on their side.
Business also confirmed that the workload was not different from the previous day. We also have to bear in mind that the first time we had the problem was on the 31st; but the second and third time was on the 3rd and 4th, so we could discard any month end batch performing worst than before.
We decided to add two more redo log groups and increase the size by 2.5 times as this is an online operation with little impact on the business if the time frame is the right one, making a little different as the I/O and CPU was still very high.
The next change was to increase the SGA to solve the wait events on the buffer gets and disk reads. The reason I proposed to increase the SGA was that I thought we did not have enough space on the buffer cache to hold all the data the business was selecting plus inserts/updates.
As soon as the buffer cache was getting full, the LGWR was calling the DBWR to flush changes out of the memory to free space, over killing the server.
This change was not easy to get approved as requires downtime and the system is critical, but after missing couple of SLAs it was a go-go situation.
Once the change was done (increase the SGA_TARGET and SGA_MAX_SIZE), we monitored the system closely. All processes and batches finished on time and the load was much less (see fig_2 below).
Note.- the grey spike is due to a network problem between the primary site and the DR.
You have to know your business and your databases. Oracle support came with an out of the book solution of increasing the redo logs but, as mentioned before, we did not change the size, neither the number for over 1 year.
We can get sometimes to focus on a single point (LGWR). Sometimes is good to make a step backwards and check the system. Different segments, different schemas, different dml and a huge amount buffer gets and disk read were the clue, not just LGWR blocking sessions.
We all have the rule of not switching redo logs more than x amount of times per hour, but sometimes we need to know why we have reached that number and not just increase the size as this may mask an underlying problem.
As always, comments are welcome.