Tuning RMAN Backups
25 November 2009 · Posted by David Alejo Marcos in RMAN.
Tags: backup, RMAN
First of all, I would like to say thanks to Peter Boyes for his invaluable assistance with resolving the RMAN performance problems. Peter’s expertise is in EVA configuration and tuning.
This was an interesting problem we faced 5 weeks ago. It has been solved for full backups, but I am still working on incremental backups…
Our environment:
2.- 1.6 TB 3-node RAC Oracle Enterprise Edition (10.2.0.4.1) database
3.- ASM and ASMLib 10.2.0.4.1 (8 disks for data, 2 disks for FRA and 1 disk for logs)
4.- SAN EVA 8100
We moved from 2-node RAC 10.2.0.3 running on Linux X86 and SAN EVA 8100 to the configuration mentioned above. Full backups took 9 hours on the old hardware.
Backups running on the new environment took 11 hours to finish, a big surprise to all of us, as the new hardware was much, much faster.
To find out what the problem was, we first needed to reproduce it on QA. As we did not have enough space for a full backup, we decided to modify the RMAN backup scripts to back up a fairly large tablespace (101GB).
I generated 10 different test scenarios, and we monitored the EVA using EVAperf and TLViz.
We have to bear in mind that the goal of this exercise was not just to reduce the time the backup takes, but also to reduce the impact on the EVA to a minimum, as it is shared with other products.
All backup scenarios ran with 2 channels (C1 and C2) of type disk. The scenarios are as follows:
Note.- I/O limitation is an RMAN feature to reduce the I/O rate per channel. To limit the I/O on a channel, you only need to specify the “rate” parameter and the speed. For example, to limit a channel to 20MB/s:
allocate channel C1 type disk rate 20M;
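As a minimal sketch, the two-channel rate-limited test backups would look something like this (the tablespace name TS_BIG is illustrative, not the real one):

```
run {
  # limit each channel to 20MB/s so RMAN cannot flood the EVA controllers
  allocate channel C1 type disk rate 20M;
  allocate channel C2 type disk rate 20M;
  # back up the 101GB test tablespace (name is hypothetical)
  backup tablespace TS_BIG;
}
```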
Scenario 1.- Normal backup, no I/O limitation.
Start time: 09:02
Finish time: 09:27
Runtime: 15:25 min C2, 15:40 min C1
Scenario 2.- Normal backup, both channels limited to 40MB/s.
Start time: 09:40
Finish time: 10:02
Runtime: 20:15 min C2, 22:30 min C1
Scenario 3.- Normal backup, both channels limited to 20MB/s.
Start time: 10:05
Finish time: 10:50
Runtime: 40:16 min C2, 14:51 min C1
Scenario 4.- Normal backup, both channels limited to 30MB/s.
Start time: 10:54
Finish time: 11:24
Runtime: 26:56 min C2, 30:01 min C1
Scenario 5.- Compressed backup, no limitation.
Start time: 11:32
Finish time: 11:56
Runtime: 20:55 min C2, 23:50 min C1
Note.- For this scenario, and all scenarios using compression, CPU %idle went from an average of 86% to an average of 72%. These numbers were consistent across all compressed backups. While the backups were running, we executed a very heavy procedure and %idle went down to 56%, but it did not affect performance.
Scenario 6.- Compressed backup, both channels limited to 30MB/s.
Start time: 11:58
Finish time: 12:28
Runtime: 26:56 min C2, 30:02 min C1
Scenario 7.- Normal backup, no limitation, window of 20 minutes with minimize load.
Start time: 12:31
Finish time: 12:49
Runtime: 16:36 min C2, 18:11 min C1
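For reference, the 20-minute window in this and the following scenarios relies on RMAN’s duration clause, roughly as follows (again with an illustrative tablespace name):

```
run {
  allocate channel C1 type disk;
  allocate channel C2 type disk;
  # spread the I/O over the 20-minute window;
  # if the backup cannot complete in time it aborts with ORA-19591
  backup duration 0:20 minimize load tablespace TS_BIG;
}
```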
Scenario 8.- Compressed backup, no limitation, window of 20 minutes with minimize load.
Start time: 12:57
Finish time: 13:17
ORA-19591: backup aborted because job time exceeded duration time
Scenario 9.- Compressed backup, both channels limited to 30MB/s, window of 20 minutes with minimize load.
Start time: 13:24
Finish time: 13:44
ORA-19591: backup aborted because job time exceeded duration time
Scenario 10.- Compressed backup, 4 channels, no limitation, window of 20 minutes with minimize load.
Start time: 14:55
Finish time: 15:14
Runtime: 07:15 min C3, 12:21 min C2, 14:25 min C4, 19:01 min C1
I have added several graphs for CPU, WriteMB and disk write latency on the EVA. I am afraid I did not allow much gap between tests, as we had a limited window in which to perform them. For this reason it can be a bit difficult to appreciate when one backup finished and the next started.
As we can see on the graph, scenario 3 (normal backup, I/O limited to 20MB/s) produced the best results on CPU utilization.
Normal backup with I/O limited to 30MB/s had a bigger impact than 20MB/s, but it finished in half the time.
Compressed backup with no limitation on I/O was impressive, but it would have had an impact if we had tried to back up our 1.6TB database.
Some results showing lower CPU were discarded almost immediately: we were using a window of 20 minutes (slightly longer than the fastest test) with minimize load, but those backups never finished.
As you can see in this graph, the lowest impact came from compressed backups. As soon as we tried uncompressed backups, the throughput went up to 130 MB/s, while compressed backups had a throughput of 55-60 MB/s.
At this point we had two references: limiting I/O to 20-30MB/s, and compression, looked like a good compromise between CPU usage, throughput and time spent on the backup.
Let’s have a look at the latency for those test scenarios.
Latency can be described as “the time between when a write request is received from a host and when the request completion is returned”. This value is normally measured in ms.
For this reason, we should aim for low latency.
From the graph, we can see that scenarios 5 and 6 (compressed backups, both with and without I/O limitation) returned low latency (especially scenario 6, which had a limitation of 30MB/s per channel).
The last spike on the graph corresponds to scenario 10, where we ran 4 channels with no limitation on I/O, a compressed backup and a window of 20 minutes with minimize load. This spike is important because it showed that, for our environment, it was better to limit the I/O per channel than to leave Oracle to tune it by using a window.
From those 3 graphs we reached the conclusion that the problem was in the controllers. We moved from old hardware to brand new, higher-spec hardware, and it was “too fast”: the EVA controllers could not write at the same rate that RMAN was sending data, so RMAN and, to some extent, other products using the Production EVA were flooding the EVA controllers.
Our solution for full backups of the 1.6TB database went from 2 channels with no limitation on I/O to 2 channels limited to 30MB/s with compression (our CPUs were pretty much idle for the whole backup).
This change proved successful: we moved from a 12-hour backup to a 4 hour 40 minute full backup with an output of 380 GB, resulting in a compression ratio of 4.38.
We then decided to de-tune our backups further to avoid any impact whatsoever on the EVA. This was done by reducing the I/O limit from 30MB/s to 20MB/s.
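In essence, our de-tuned production backup now runs along these lines (a sketch, not the exact script):

```
run {
  # 20MB/s per channel keeps the EVA controllers comfortable
  allocate channel C1 type disk rate 20M;
  allocate channel C2 type disk rate 20M;
  # binary compression keeps the EVA throughput at 55-60 MB/s
  backup as compressed backupset database;
}
```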
Our full backup now runs in just over 6 hours, but that is still much better than 12 hours…
Note.- V$RMAN_BACKUP_JOB_DETAILS is a very useful view for monitoring backups, compression ratios and runtimes, among other things.
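For example, a query along these lines shows the size, compression ratio and elapsed time of recent backup jobs:

```
select start_time,
       input_type,
       round(input_bytes  / 1024 / 1024 / 1024, 2) as input_gb,
       round(output_bytes / 1024 / 1024 / 1024, 2) as output_gb,
       compression_ratio,
       time_taken_display
  from v$rman_backup_job_details
 order by start_time desc;
```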
As always, comments are welcome.