
LGWR is blocking sessions (or running very slow). 5 February 2010

Posted by David Alejo Marcos in Enterprise Manager, Tuning.

I am aware this is either a very generic subject or a problem very specific to each system or database, so I am going to write about the problem we had with one of our databases at the end of last week.

The problem:

Last Friday, at lunchtime, we had a huge spike of orange-looking wait events related to Commit (see fig_1) on Enterprise Manager. After digging a bit into some v$ views and Enterprise Manager, we found that LGWR was blocking several user sessions.
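
For reference, this is the kind of check we used to confirm it; a minimal sketch, assuming a 10g/11g v$session with the blocking_session column:

-- Sessions currently blocked by the LGWR background process (sketch)
SELECT s.sid, s.serial#, s.username, s.event, s.seconds_in_wait
FROM   v$session s
WHERE  s.blocking_session IN (SELECT sid
                              FROM   v$session
                              WHERE  program LIKE '%LGWR%');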

We raised a Severity 1 SR with Oracle as the symptoms were the same as the ones described in Bug 5230765 – LGWR spins while waiting for ‘log file parallel write’ using ASM [ID 5230765.8]. This bug had technically already been solved, but it is always good to check these things.

After uploading AWR reports for the period the database was affected and for the previous day when it was fine, some screenshots from Enterprise Manager, ASH data for a short period of time and different log files (traces, alert log, etc.), the best they could come up with was to resize the redo logs.

I must say the way they approached the problem was not right in my opinion.

Those redo log files had not changed for over a year and we did not have the problem before. The problem went away at midnight and manifested itself again on Monday and Tuesday, always around midday.

The solution:

The database is running as a single instance, Oracle + ASM, on Linux x86-64.

One of the trends I noticed (or rather the lack of one) was in the DML being executed. Sometimes it was an insert, sometimes an update, on different objects and different schemas, so a hot block was unlikely to be the problem.

Another big clue was that Buffer Gets and Disk Reads were well above average, both per execution and per session.
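
A quick way to see this (a sketch, not the exact query we ran) is to rank statements by buffer gets and disk reads per execution from v$sqlarea:

-- Top 20 statements by buffer gets per execution (illustrative thresholds)
SELECT *
FROM  (SELECT sql_id,
              executions,
              ROUND(buffer_gets / executions) AS gets_per_exec,
              ROUND(disk_reads / executions)  AS reads_per_exec
       FROM   v$sqlarea
       WHERE  executions > 0
       ORDER  BY buffer_gets / executions DESC)
WHERE  ROWNUM <= 20;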

The storage manager and the SAs checked the system several times while it was performing really badly, and both said it did not look different from previous days or weeks, so not much to go on from their side.

The business also confirmed that the workload was no different from the previous day. We also have to bear in mind that the first time we had the problem was on the 31st, but the second and third times were on the 3rd and 4th, so we could discard any month-end batch performing worse than before.

We decided to add two more redo log groups and increase their size by 2.5 times, as this is an online operation with little impact on the business if the time frame is the right one. It made little difference, as I/O and CPU were still very high.
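
For illustration only (group numbers, sizes and the ASM disk group name below are made up), the operation looks like this; redo logs cannot be resized in place, so larger groups are added and the old, smaller ones dropped once they are inactive:

-- add new, larger groups on the ASM disk group
ALTER DATABASE ADD LOGFILE GROUP 5 ('+DATA') SIZE 1280M;
ALTER DATABASE ADD LOGFILE GROUP 6 ('+DATA') SIZE 1280M;

-- switch and checkpoint until the old groups become INACTIVE, then drop them
ALTER SYSTEM SWITCH LOGFILE;
ALTER SYSTEM CHECKPOINT;
ALTER DATABASE DROP LOGFILE GROUP 1;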

The next change was to increase the SGA to address the wait events on buffer gets and disk reads. The reason I proposed increasing the SGA was that I thought we did not have enough space in the buffer cache to hold all the data the business was selecting, plus the inserts/updates.
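
One way to sanity-check this assumption (a sketch, assuming db_cache_advice is enabled) is the buffer cache advisory, which estimates physical reads at different cache sizes:

-- Predicted physical reads for alternative buffer cache sizes
SELECT size_for_estimate, size_factor, estd_physical_reads
FROM   v$db_cache_advice
WHERE  name = 'DEFAULT'
AND    advice_status = 'ON'
ORDER  BY size_for_estimate;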

As soon as the buffer cache was getting full, LGWR was posting DBWR to flush changes out of memory to free space, putting excessive load on the server.

This change was not easy to get approved as it requires downtime and the system is critical, but after missing a couple of SLAs it was given the go-ahead.

Once the change was done (increasing SGA_TARGET and SGA_MAX_SIZE), we monitored the system closely. All processes and batches finished on time and the load was much lower (see fig_2 below).
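
The change itself is simple (the values below are illustrative, not the ones we used); SGA_MAX_SIZE is not a dynamic parameter, so the instance has to be restarted, which is where the downtime came from:

ALTER SYSTEM SET sga_max_size = 16G SCOPE = SPFILE;
ALTER SYSTEM SET sga_target   = 16G SCOPE = SPFILE;
SHUTDOWN IMMEDIATE
STARTUP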

Note: the grey spike is due to a network problem between the primary site and the DR.


You have to know your business and your databases. Oracle support came up with an out-of-the-book solution of increasing the redo logs but, as mentioned before, we had not changed their size or number for over a year.

We can sometimes get too focused on a single point (LGWR). Sometimes it is good to take a step back and check the whole system. Different segments, different schemas, different DML and a huge amount of buffer gets and disk reads were the clue, not just LGWR blocking sessions.

We all have the rule of not switching redo logs more than x times per hour, but sometimes we need to know why we have reached that number and not just increase the size, as this may mask an underlying problem.
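
A simple way to check that number before acting on it (an illustrative query) is to count log switches per hour from v$log_history:

-- Log switches per hour over the last seven days
SELECT TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS hour,
       COUNT(*)                               AS log_switches
FROM   v$log_history
WHERE  first_time > SYSDATE - 7
GROUP  BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
ORDER  BY 1;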

As always, comments are welcome.


1. youssef - 12 January 2011

So increasing the SGA will solve the wait events?

David Alejo Marcos - 18 January 2011


Yes, increasing SGA_TARGET and SGA_MAX_SIZE did the trick.

Thank you.

2. Jitesh - 1 February 2011

We are facing the same issue… and currently we have
SGA_MAX_SIZE = 676M and SGA_TARGET = 0. What would be an appropriate size for SGA_MAX_SIZE and SGA_TARGET?
Do we need to modify other init.ora parameters if we are increasing SGA_MAX_SIZE and SGA_TARGET?

David Alejo Marcos - 5 February 2011

Good morning,

It is quite difficult to give you numbers without knowing your system.

By having SGA_TARGET = 0 you are disabling automatic SGA tuning for the buffer_cache, shared_pool, large_pool, java_pool and streams_pool. As our problem was related to the buffer_cache not being big enough and our workload is quite linear during the day (no massive spikes), I have SGA_TARGET = SGA_MAX_SIZE. By doing this, I leave it to Oracle to resize memory structures as needed.

If you decide to set SGA_TARGET, you may also wish to monitor how often Oracle is resizing your memory structures by selecting from v$memory_resize_ops.
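
For example, something along these lines shows the most recent automatic resize operations:

-- Recent automatic memory resize operations, newest first
SELECT component, oper_type, initial_size, target_size, final_size, status,
       start_time, end_time
FROM   v$memory_resize_ops
ORDER  BY start_time DESC;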

Thank you,

David Alejo Marcos.
