ORA-16072: a minimum of one standby database destination is required 27 August 2011

Posted by David Alejo Marcos in ASM, Exadata, Oracle 11.2, RAC, RMAN.

This is a quick post regarding the error in the subject. It is the second time it has happened to me, so I thought I would write a bit about it.

The problem:

I am refreshing one of my UAT environments (which happens to be a Full Rack Exadata) using the Oracle RMAN duplicate command. Then the following happens (on both occasions):

1.- The duplicate command fails (lack of space for restoring archivelogs, or some other error). This can be fixed quite easily.

2.- The following error appears while trying to open the database after the restore and recover have finished:

SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 13710
Session ID: 1250 Serial number: 5
SQL> exit

On the alert.log file we can read the following:

Wed Aug 24 13:32:48 2011
alter database open
Wed Aug 24 13:32:49 2011
LGWR: STARTING ARCH PROCESSES
Wed Aug 24 13:32:49 2011
ARC0 started with pid=49, OS id=13950
ARC0: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC0: STARTING ARCH PROCESSES
LGWR: Primary database is in MAXIMUM AVAILABILITY mode
LGWR: Destination LOG_ARCHIVE_DEST_1 is not serviced by LGWR
LGWR: Minimum of 1 LGWR standby database required
Errors in file /apps/oracle/server/diag/rdbms/xxxx04/xxxx041/trace/xxxx041_lgwr_13465.trc:
ORA-16072: a minimum of one standby database destination is required
Errors in file /apps/oracle/server/diag/rdbms/xxxx04/xxxx041/trace/xxxx041_lgwr_13465.trc:
ORA-16072: a minimum of one standby database destination is required
LGWR (ospid: 13465): terminating the instance due to error 16072
Wed Aug 24 13:32:50 2011
ARC1 started with pid=48, OS id=13952
Wed Aug 24 13:32:50 2011
System state dump is made for local instance
System State dumped to trace file /apps/oracle/server/diag/rdbms/xxxx04/xxxx041/trace/xxxx041_diag_13137.trc
Trace dumping is performing id=[cdmp_20110824133250]
Instance terminated by LGWR, pid = 13465

The Solution:

The duplicated database is still in MAXIMUM AVAILABILITY protection mode (inherited from the source database), which requires at least one standby destination serviced by LGWR; as this UAT copy has no standby, LGWR terminates the instance on open. Downgrading the protection mode to MAXIMUM PERFORMANCE removes that requirement, so the fix is quite simple:

1.- Start up database in mount mode:

SQL> startup mount

ORACLE instance started.

Total System Global Area 1.7103E+10 bytes
Fixed Size                  2230472 bytes
Variable Size            4731176760 bytes
Database Buffers         1.2180E+10 bytes
Redo Buffers              189497344 bytes
Database mounted.

SQL> select open_mode, DATABASE_ROLE, guard_status, SWITCHOVER_STATUS from v$database;

OPEN_MODE            DATABASE_ROLE    GUARD_S SWITCHOVER_STATUS
-------------------- ---------------- ------- --------------------
MOUNTED              PRIMARY          NONE    NOT ALLOWED

2.- Execute the following command:

SQL> alter database set standby database to maximize performance;

Database altered.

SQL> select open_mode, DATABASE_ROLE, guard_status, SWITCHOVER_STATUS from v$database;

OPEN_MODE            DATABASE_ROLE    GUARD_S SWITCHOVER_STATUS
-------------------- ---------------- ------- --------------------
MOUNTED              PRIMARY          NONE    NOT ALLOWED
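
The protection mode itself can also be checked directly in v$database (a hedged check; these columns were not part of the original output):

SQL> select protection_mode, protection_level from v$database;

After the ALTER DATABASE above it should report MAXIMUM PERFORMANCE instead of MAXIMUM AVAILABILITY.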

3.- Stop database:

SQL shutdown immediate
ORA-01109: database not open
Database dismounted.
ORACLE instance shut down.

4.- Start up database mount mode:

SQL> startup mount
ORACLE instance started.

Total System Global Area 1.7103E+10 bytes
Fixed Size                  2230472 bytes
Variable Size            4731176760 bytes
Database Buffers         1.2180E+10 bytes
Redo Buffers              189497344 bytes
Database mounted.

5.- Open database:

SQL> alter database open;

Database altered.

SQL> select open_mode from v$database;

OPEN_MODE
--------------------
READ WRITE

SQL> select instance_name from v$instance;

INSTANCE_NAME
----------------
xxxx041

SQL>

As always, comments are welcome.

David Alejo-Marcos.

David Marcos Consulting Ltd.

Archive area +RECO has -7440384 free KB remaining (Usable_file_MB is negative) 17 July 2011

Posted by David Alejo Marcos in ASM, Exadata, Oracle 11.2, RMAN.

I must say, this has been a busy weekend.

We have been promoting a release to production and a guaranteed restore point was created on Friday as a rollback strategy. On Sunday I was called as we started to receive alerts.

The problem:

Our monitoring system started to send emails and SNMP Traps with the following alerts:

OEM alert for Automatic Storage Management +ASM4_ssssss4: Disk group RECO has used 100% of safely usable free space. (Current Disk Group Used % of Safely Usable value: 100)
OEM alert for Automatic Storage Management +ASM2_ssssss2: Disk group RECO has used 100% of safely usable free space. (Current Disk Group Used % of Safely Usable value: 100)
OEM alert for Automatic Storage Management +ASM3_ssssss3: Disk group RECO has used 100% of safely usable free space. (Current Disk Group Used % of Safely Usable value: 100)
OEM alert for Automatic Storage Management +ASM7_ssssss7: Disk group RECO has used 100% of safely usable free space. (Current Disk Group Used % of Safely Usable value: 100)
OEM alert for Automatic Storage Management +ASM8_ssssss8: Disk group RECO has used 100% of safely usable free space. (Current Disk Group Used % of Safely Usable value: 100)
OEM alert for Automatic Storage Management +ASM6_ssssss6: Disk group RECO has used 100% of safely usable free space. (Current Disk Group Used % of Safely Usable value: 100)
OEM alert for Automatic Storage Management +ASM1_ssssss1: Disk group RECO has used 100% of safely usable free space. (Current Disk Group Used % of Safely Usable value: 100)
OEM alert for Automatic Storage Management +ASM5_ssssss5: Disk group RECO has used 100% of safely usable free space. (Current Disk Group Used % of Safely Usable value: 100)
OEM alert for Database Instance : Archive area +RECO has -7440384 free KB remaining. (Current Free Archive Area (KB) value: -7440384)
OEM alert for Database Instance : Archive area +RECO has -21725184 free KB remaining. (Current Free Archive Area (KB) value: -21725184)

Not very nice in any situation, and even less so when you are in the middle of a critical, highly visible release.

I did have a look and this is what I found.

The solution:

The first thing I did was to check the Flash Recovery Area, as it is configured to write to our +RECO diskgroup:
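
The output below comes from the usual FRA views; queries along these lines will produce it (the column names match V$RECOVERY_FILE_DEST and V$FLASH_RECOVERY_AREA_USAGE, but the exact statements are reconstructed, not copied from my session):

SQL> select name,
            round(space_used/1024/1024)  used_mb,
            round(space_limit/1024/1024) limit_mb,
            round(space_used/space_limit*100, 2) pct_used
       from v$recovery_file_dest;

SQL> select file_type, percent_space_used, percent_space_reclaimable, number_of_files
       from v$flash_recovery_area_usage;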

NAME                    USED_MB   LIMIT_MB   PCT_USED
-------------------- ---------- ---------- ----------
+RECO                   1630569    2048000      79.62
Elapsed: 00:00:00.12
FILE_TYPE            PERCENT_SPACE_USED PERCENT_SPACE_RECLAIMABLE NUMBER_OF_FILES
-------------------- ------------------ ------------------------- ---------------
CONTROL FILE                          0                         0               1
REDO LOG                              0                         0               0
ARCHIVED LOG                          0                         0               0
BACKUP PIECE                      75.66                     75.66              86
IMAGE COPY                            0                         0               0
FLASHBACK LOG                      3.96                         0             707
FOREIGN ARCHIVED LOG                  0                         0               0
7 rows selected.
Elapsed: 00:00:01.62

The numbers looked OK; some backup files could be reclaimed (Oracle should do that automatically). Let's have a look at ASM:

oracle@ssss (+ASM1)$ asmcmd lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB   Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  NORMAL  N         512   4096  4194304  55050240  26568696          5004567        10782064              0             N  DATA/
MOUNTED  NORMAL  N         512   4096  4194304  35900928   3192844          3263720          -35438              0             N  RECO/
MOUNTED  NORMAL  N         512   4096  4194304   4175360   4061640           379578         1841031              0             N  SYSTEMDG/
Elapsed: 00:00:01.62

Bingo, this is where our problem is. USABLE_FILE_MB (for the +RECO diskgroup) indicates the amount of free space that can be used while still being able to restore redundancy after a disk failure (i.e. it accounts for the mirroring space). A negative number in this column can be critical: after a disk failure we might not have enough space to restore all files to the surviving disks.
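
For a NORMAL redundancy diskgroup, this value can be approximated as USABLE_FILE_MB ≈ (FREE_MB - REQ_MIR_FREE_MB) / 2. Plugging in the RECO figures above: (3192844 - 3263720) / 2 ≈ -35438 MB, which is exactly the negative number reported by lsdg.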

Our backups go to ASM and we copy them to tape afterwards. Our retention policy on disk is between 2 and 3 days, depending on the system.

When I checked the contents of the backupset directory on ASM I found some old backups:
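
The listing below is in asmcmd's long-listing format; presumably something along these lines produced it (the exact path is an assumption based on the backup piece names further down):

oracle@ssss (+ASM1)$ asmcmd ls -l +RECO/<db_name>/backupset/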

Type  Redund  Striped  Time             Sys  Name
Y    2011_07_17/
Y    2011_07_16/
Y    2011_07_15/
Y    2011_07_14/
Y    2011_07_13/
Y    2011_07_12/
Elapsed: 00:00:01.62

To delete those old backups I executed the following command from RMAN:

RMAN> delete backup tag EOD_DLY_110712 device type disk;
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=1430 instance=<instance> device type=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: SID=2138 instance=<instance> device type=DISK
allocated channel: ORA_DISK_3
channel ORA_DISK_3: SID=8 instance=<instance> device type=DISK
allocated channel: ORA_DISK_4
channel ORA_DISK_4: SID=150 instance=<instance> device type=DISK
List of Backup Pieces
BP Key  BS Key  Pc# Cp# Status      Device Type Piece Name
------- ------- --- --- ----------- ----------- ----------
8292    4020    1   1   AVAILABLE   DISK        +RECO//backupset/2011_07_12/sssss_eod_dly_110712_0.nnnn.nnnnn
8293    4021    1   1   AVAILABLE   DISK        +RECO//backupset/2011_07_12/sssss_eod_dly_110712_0.nnnn.nnnnn
Do you really want to delete the above objects (enter YES or NO)? yes
deleted backup piece
backup piece handle=+RECO//backupset/2011_07_12/sssss_eod_dly_110712_0.nnnn.nnnnnn RECID=8292 STAMP=nnnn
deleted backup piece
backup piece handle=+RECO//backupset/2011_07_12/sssss_eod_dly_110712_0.nnnn.nnnnnn RECID=8293 STAMP=nnnn
Deleted 2 objects
Elapsed: 00:00:01.62

After deleting a couple more old backups, the numbers looked much better:

State    Type    Rebal  Sector  Block       AU  Total_MB   Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  NORMAL  N         512   4096  4194304  55050240  26568696          5004567        10782064              0             N  DATA/
MOUNTED  NORMAL  N         512   4096  4194304  35900928   3548260          3263720          142270              0             N  RECO/
MOUNTED  NORMAL  N         512   4096  4194304   4175360   4061640           379578         1841031              0             N  SYSTEMDG/

Note.- There is another temporary fix: I could have changed db_recovery_file_dest to point to +DATA instead of +RECO, but as we have a guaranteed restore point, I thought releasing space from old backups was easier.
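
If we had gone down that route, the change would have been something like this (a sketch only; sid='*' assumes the parameter should change on every RAC instance):

SQL> alter system set db_recovery_file_dest='+DATA' scope=both sid='*';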

As always, comments are welcome.

David Alejo-Marcos.

David Marcos Consulting Ltd.

Unable to delete archive log and FRA is filling 7 September 2010

Posted by David Alejo Marcos in ASM, Oracle 11.2, RMAN.

A couple of weeks ago we had a problem with one of our busiest databases.

The FRA was filling quite rapidly and we just could not free enough space. What made this problem interesting was the fact that Oracle was telling us more than 40% of the space was marked as reclaimable.

The problem:

We have automatic alerts for when we are running out of space. When I had a look at the situation this is what I saw:

SQL> select * from v$flash_recovery_area_usage;
FILE_TYPE            PERCENT_SPACE_USED PERCENT_SPACE_RECLAIMABLE NUMBER_OF_FILES
-------------------- ------------------ ------------------------- ---------------
CONTROLFILE                         .01                         0               1
ONLINELOG                           .89                         0               9
ARCHIVELOG                        51.29                     49.67            1181
BACKUPPIECE                           0                         0               0
IMAGECOPY                             0                         0               0
FLASHBACKLOG                      40.43                     34.83            1658

My first thought was to remove old files using RMAN. After checking that all standby databases were in sync, I executed the following command (we have enough space to store 7 days' worth of archivelogs, but I wanted to play it safe as we had many changes during the weekend and one change in particular generated 60+ GB of archivelogs):

rman
RMAN> connect target /
RMAN> delete noprompt archivelog until time 'SYSDATE - 3';
List of Archived Log Copies
Key Thrd Seq S Low Time Name
------- ---- ------- - ------------------- ----
228321 1 71005 A 07.08.2010 14:27:00 +FRA/xxx/archivelog/2010_08_07/thread_1_seq_71005.670.726416889
228323 1 71006 A 07.08.2010 14:28:09 +FRA/xxx/archivelog/2010_08_07/thread_1_seq_71006.2112.726416957
228330 1 71007 A 07.08.2010 14:29:16 +FRA/xxx/archivelog/2010_08_07/thread_1_seq_71007.1210.726417025
.....
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of delete command on ORA_DISK_1 channel at 08/09/2010 18:19:58
ORA-15028: ASM file '+FRA/xxx/archivelog/2010_08_07/thread_1_seq_71005.670.726416889' not dropped; currently being accessed

Bingo: what should have been a simple operation has just become more complicated…

The solution:

Technically we have plenty of space in the FRA, but Oracle is complaining that a file is being used, and this file is old.

The only way I could fix the problem was by killing the process holding a lock on the file; this process had to be one of the ARC (archiver) processes.

In this case the solution was quite simple. The steps I followed were (bearing in mind that ARC processes are restarted automatically by Oracle):
1.- Find the process ids of the ARC processes:
ps -ef | grep -i ora_arc*
oracle 5607 1 1 19:02 ? 00:00:00 ora_arc9_prod1

2.- Kill the running process:
kill -9 5607
3.- Check the process has been started again before killing the next one:
ps -ef | grep -i ora_arc9_prod1
4.- Repeat steps 2 and 3 for every ARC process running for your instance (a scripted version of this cycle is sketched after this list).
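
A hedged shell version of that kill/check cycle, assuming the instance is called prod1 (adjust the pattern for your instance name):

for pid in $(ps -ef | grep '[o]ra_arc.*_prod1' | awk '{print $2}'); do
    kill -9 "$pid"
    sleep 10   # give Oracle time to respawn the archiver before killing the next one
done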

Once we had killed all ARC processes and were sure they had been restarted by Oracle, I did the following:
1.- Perform a couple of redo log switches and make sure they are being propagated to all standby databases:

SQL> alter system switch logfile;

2.- Connect to RMAN and perform a backup of the archivelog "being in use":
rman
RMAN> connect target /
RMAN> backup archivelog from sequence 71005 until sequence 71005 thread 1;

The output is:

Starting backup at 09.08.2010 18:50:02
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=620 instance=prod1 devtype=DISK
channel ORA_DISK_1: starting archive log backupset
channel ORA_DISK_1: specifying archive log(s) in backup set
input archive log thread=1 sequence=71005 recid=228321 stamp=726416895
channel ORA_DISK_1: starting piece 1 at 09.08.2010 18:50:04
channel ORA_DISK_1: finished piece 1 at 09.08.2010 18:50:11
piece handle=+FRA/xxx/backupset/2010_08_09/annnf0_tag20100809t185002_0.1097.726605405 tag=TAG20100809T185002 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:08
Finished backup at 09.08.2010 18:50:11
Starting Control File and SPFILE Autobackup at 09.08.2010 18:50:11
piece handle=/u04/oradata/prod/rman_disk/auto/PROD/c-4255406167-20100809-00_PROD comment=NONE
Finished Control File and SPFILE Autobackup at 09.08.2010 18:50:15

Once we have performed the backup, we can delete old archivelogs to release some space:

RMAN> delete noprompt archivelog until time 'SYSDATE - 3';
....
deleted archive log
archive log filename=+FRA/xxx/archivelog/2010_08_07/thread_1_seq_71005.670.726416889 recid=228321 stamp=726416895
deleted archive log
archive log filename=+FRA/xxx/archivelog/2010_08_07/thread_1_seq_71006.2112.726416957 recid=228323 stamp=726416962
....

3.- Monitor the space on the FRA:
SQL> select * from v$flash_recovery_area_usage;
FILE_TYPE            PERCENT_SPACE_USED PERCENT_SPACE_RECLAIMABLE NUMBER_OF_FILES
-------------------- ------------------ ------------------------- ---------------
CONTROLFILE                         .01                       .00               1
ONLINELOG                           .89                       .00               9
ARCHIVELOG                        49.09                     46.83            1168
BACKUPPIECE                         .00                       .00               0
IMAGECOPY                           .00                       .00               0
FLASHBACKLOG                      40.22                     33.91            1648

and again a couple of times later:

SQL> select * from v$flash_recovery_area_usage;
FILE_TYPE            PERCENT_SPACE_USED PERCENT_SPACE_RECLAIMABLE NUMBER_OF_FILES
-------------------- ------------------ ------------------------- ---------------
CONTROLFILE                         .01                       .00               1
ONLINELOG                           .89                       .00               9
ARCHIVELOG                         3.10                       .04             225
BACKUPPIECE                         .10                       .00               1
IMAGECOPY                           .00                       .00               0
FLASHBACKLOG                      39.91                     32.95            1633

As always, comments are welcome.

Securing passwords on scripts using Oracle Wallet 21 August 2010

Posted by David Alejo Marcos in Oracle Wallet, RMAN.

One thing I like about being an Oracle DBA is the never-ending possibilities of improving your systems.

I have been working on our backup strategy for some time, the main goal being to have a standard script to be used in all our systems.

While working on it I decided to try to secure our backup scripts (and scripts in general), as the password is hard-coded (never a good idea, especially because the user performing the backup needs sysdba privileges).

I decided to use Oracle Wallet. Below are the steps and the test I performed.

The problem:

Having the password hardcoded in any script is not a good idea, especially for RMAN backup scripts, as the account being used needs sysdba privileges.

The solution:

The tool I used to secure our scripts is Oracle Wallet. The steps I performed are:

1.- Create the wallet:

mkstore -wrl <wallet_location> -create

I would recommend storing the wallet in a different directory rather than in the Oracle home, for example $ORACLE_BASE/wallet:

mkstore -wrl $ORACLE_BASE/wallet -create

The command above will create an Oracle wallet with the auto-login feature enabled. From now on, only the operating system user who created the wallet will be able to manage it.

You will be prompted to provide a password. This password will have to be provided any time a maintenance operation is performed on the wallet.

2.- Add a database connection (including connection_string, username and password):

mkstore -wrl <wallet_location> -createCredential <db_connection_string> <username> <password>

db_connection_string is an entry on your tnsnames.ora or any service name to connect to the desired database.

An example:

mkstore -wrl $ORACLE_BASE/wallet -createCredential prod sys mypassword

mkstore -wrl $ORACLE_BASE/wallet -createCredential catalog rman myrmanpassword

3.- add the following code to your sqlnet.ora:

SQLNET.WALLET_OVERRIDE = TRUE
WALLET_LOCATION =
  (SOURCE =
    (METHOD = FILE)
    (METHOD_DATA =
      (DIRECTORY = /u01/app/oracle/wallet)
    )
  )

where:

WALLET_LOCATION: points to the directory where the wallet resides.
SQLNET.WALLET_OVERRIDE: forces all connections of the form /@db_connection_string to use the information stored in the wallet to authenticate against the database.

4.- Test the connection:
[oracle@xxxxx]rman
RMAN> connect TARGET /@<db_connection_string>;
connected to target database: xxxx (DBID=yyyyyyyyyyy)
RMAN> connect catalog /@<rman_connection_string>;
connected to recovery catalog database

At this point, we can replace all connections:
<username>/<password>@<db_connection_string>;

by

/@<db_connection_string>;

What options does the Wallet offer?
mkstore offers you the following:
1.- List the content stored in the wallet:
mkstore -wrl <wallet_location> -listCredential

2.- Add credentials:
mkstore -wrl <wallet_location> -createCredential <db_connection_string> <username> <password>

3.- Modify credentials:
mkstore -wrl <wallet_location> -modifyCredential <db_connection_string> <username> <password>

4.- Delete credentials:
mkstore -wrl <wallet_location> -deleteCredential <db_connection_string>

Note: if you need to store more than one <username>/<password> combination for the same database, you may have to create different entries in your tnsnames.ora.
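
For example, two aliases resolving to the same service but holding separate credentials in the wallet might look like this (host, port, service and user names are purely illustrative):

PROD_SYS =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = dbhost)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = prod))
  )

PROD_BKP =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = dbhost)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = prod))
  )

mkstore -wrl $ORACLE_BASE/wallet -createCredential PROD_SYS sys mypassword
mkstore -wrl $ORACLE_BASE/wallet -createCredential PROD_BKP backup_user anotherpassword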

Note.- If you are using the Wallet, you will probably face the following error sooner or later:

startup failed: ORA-00000: normal, successful completion

This can happen with startup force nomount or when duplicating a database using RMAN.

The reason is an Oracle Bug (Bug 7258404 currently under development). The workaround is to remove SQLNET.WALLET_OVERRIDE=TRUE from your sqlnet.ora.

We keep one sqlnet.ora with the Wallet configuration and the default one without it; in our scripts we simply set the TNS_ADMIN variable to pick the right one.
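
As a minimal sketch of what that looks like inside a backup script (paths, aliases and the backup command are assumptions, not the actual script):

#!/bin/bash
export ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1
export TNS_ADMIN=/u01/app/oracle/wallet_tns   # directory holding the wallet-enabled sqlnet.ora
export PATH=$ORACLE_HOME/bin:$PATH

rman <<EOF
connect target /@prod
connect catalog /@catalog
backup database plus archivelog;
exit
EOF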

As always, comments are welcome.

Move datafile to a different ASM diskgroup using RMAN 23 July 2010

Posted by David Alejo Marcos in ASM, Oracle 11.2, RMAN.

Yesterday we received an email from one of the developers with the following error:

ORA-00376: file 68 cannot be read at this time
ORA-01110: data file 68: ‘+DATA/xxx/datafile/yyy.366.724161769’

This error came to me to investigate and fix as soon as possible, as the database is critical for testing.

Earlier in the day we did a SnapClone from a local standby running on the same server, but this should not have had any impact on the other database.

As we are using SAN storage, I asked the SAN Manager and the SA to see if there was a chance something might have gone wrong.

They said no.

The problem:

ORA-00376: file 68 cannot be read at this time
ORA-01110: data file 68: ‘+DATA/xxx/datafile/yyy.366.724161769’

The status of the datafile on dba_data_files was AVAILABLE, but the online_status from v$recover_file was OFFLINE.
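
For reference, a hedged version of that second check (ONLINE_STATUS and ERROR are standard columns of v$recover_file; the exact query is reconstructed):

SQL> select file#, online_status, error from v$recover_file where file# = 68;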

The solution:

The first thing I did was to check the alert.log file for any anomaly. I found the following entry:

Sat Jul 10 19:55:58 2010
ORA-1683: unable to extend index yyyy.FC_CDS_CNTR_CRV_ISCLCLTD_IDX partition FC_CDS_CNTR_CRV_FEB10 by 128 in tablespace yyy
......
ORA-1653: unable to extend table yyy.FC_JOB_STATUS by 128 in tablespace yyy
alter tablespace yyy add datafile '+DATA' size 10G AUTOEXTEND ON NEXT 1G MAXSIZE unlimited
Completed: alter tablespace yyy add datafile '+DATA' size 10G AUTOEXTEND ON NEXT 1G MAXSIZE unlimited
alter tablespace yyy add datafile size 10G AUTOEXTEND ON NEXT 1G MAXSIZE unlimited
Completed: alter tablespace yyy add datafile size 10G AUTOEXTEND ON NEXT 1G MAXSIZE unlimited
undo_tablespace = UNDOTBS1
Successfully onlined Undo Tablespace 1.
undo_tablespace = UNDOTBS1
Successfully onlined Undo Tablespace 1.
Mon Jul 12 11:55:37 2010
SUCCESS: diskgroup DATA was mounted
SUCCESS: diskgroup DATA was dismounted
SUCCESS: diskgroup DATA was mounted
SUCCESS: diskgroup DATA was dismounted
SUCCESS: diskgroup DATA was mounted
SUCCESS: diskgroup DATA was dismounted
SUCCESS: diskgroup DATA was mounted
SUCCESS: diskgroup DATA was dismounted
SUCCESS: diskgroup DATA was mounted
SUCCESS: diskgroup DATA was dismounted
Mon Jul 12 12:02:47 2010
alter tablespace yyy add datafile '+DATA' size 10G AUTOEXTEND ON NEXT 1G MAXSIZE unlimited
Mon Jul 12 12:02:47 2010
SUCCESS: diskgroup DATA was mounted
Mon Jul 12 12:03:49 2010
SUCCESS: diskgroup DATA was dismounted
Mon Jul 12 12:03:49 2010
SUCCESS: diskgroup DATA was mounted
SUCCESS: diskgroup DATA was dismounted
SUCCESS: diskgroup DATA was mounted
Mon Jul 12 12:03:49 2010
Completed: alter tablespace yyy add datafile '+DATA' size 10G AUTOEXTEND ON NEXT 1G MAXSIZE unlimited
Mon Jul 12 12:11:11 2010

and earlier that day:
Thu Jul 22 09:05:08 2010
Errors in file /u01/app/oracle/admin/whqaint/bdump/xxx_ckpt_522.trc:
ORA-01110: data file 68: '+DATA/xxx/datafile/yyy.366.724161769'
ORA-01115: IO error reading block from file 68 (block # 1)
ORA-15078: ASM diskgroup was forcibly dismounted
Thu Jul 22 09:05:23 2010
Errors in file /u01/app/oracle/admin/xxx/bdump/xxx_ckpt_522.trc:
ORA-01110: data file 68: '+DATA/xxx/datafile/yyy.366.724161769'
ORA-01115: IO error reading block from file 68 (block # 1)
ORA-15078: ASM diskgroup was forcibly dismounted
Thu Jul 22 09:05:38 2010
Errors in file /u01/app/oracle/admin/xxx/bdump/xxx_ckpt_522.trc:
ORA-01171: datafile 68 going offline due to error advancing checkpoint
ORA-01110: data file 68: '+DATA/xxx/datafile/xxx.366.724161769'
ORA-01115: IO error reading block from file 68 (block # 1)
ORA-15078: ASM diskgroup was forcibly dismounted

The key line is ORA-15078: ASM diskgroup was forcibly dismounted

As we are using ASM, the first step of our SnapClone procedure is to dismount several diskgroups; technically, this operation should not impact the other database, as both databases should use different diskgroups.

When the development DBA created the datafile to extend the tablespace, she did it on the wrong diskgroup (DATA instead of DATB). It is an easy mistake to make as the names are similar.

Now that we know the reason, it is fairly simple to fix it. The steps I followed are:
1.- Check the assumption is correct (datafile was created on a different diskgroup):
SQL> select file_name, status
from dba_data_files
where tablespace_name = 'xxx';
FILE_NAME                                        STATUS
------------------------------------------------ ---------
+DATB/xxx/datafile/yyy.339.718314941             AVAILABLE
+DATB/xxx/datafile/yyy.358.718303181             AVAILABLE
+DATB/xxx/datafile/yyy.338.718314943             AVAILABLE
+DATB/xxx/datafile/yyy.309.718325181             AVAILABLE
+DATA/xxx/datafile/yyy.366.724161769             AVAILABLE
+DATB/xxx/datafile/yyy.281.724238729             AVAILABLE

2.- Change the status of the datafile to offline:
SQL> alter database datafile '+DATA/xxx/datafile/yyy.366.724161769' offline;

3.- Copy the datafile using RMAN:
[oracle@ssssssssssssss bdump]$ rman target /
Recovery Manager: Release 10.2.0.4.0 - Production on Thu Jul 22 15:52:52 2010
Copyright (c) 1982, 2007, Oracle. All rights reserved.
connected to target database: xxx (DBID=4255406167)
RMAN> copy datafile '+DATA/xxx/datafile/yyy.366.724161769' to '+DATB';

In the output of the command you can see the source and the destination of the datafile:
Starting backup at 22-JUL-10
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=403 devtype=DISK
channel ORA_DISK_1: starting datafile copy
input datafile fno=00068 name=+DATA/xxx/datafile/yyy.366.724161769
output filename=+DATB/xxx/datafile/yyy.280.725039609
tag=TAG20100722T155329 recid=1 stamp=725039763
channel ORA_DISK_1: datafile copy complete, elapsed time: 00:02:35
Finished backup at 22-JUL-10

4.- Update the Oracle data dictionary with the new information:
SQL> alter database rename
file '+DATA/xxx/datafile/yyy.366.724161769'
to '+DATB/xxx/datafile/yyy.280.725039609';

Once the datafile has been renamed in the data dictionary, Oracle removes the old copy of the file in ASM.

5.- Switch to the datafile copy using RMAN:
[oracle@ssssssssssssss bdump]$ rman target /
Recovery Manager: Release 10.2.0.4.0 - Production on Thu Jul 22 15:58:31 2010
Copyright (c) 1982, 2007, Oracle. All rights reserved.
connected to target database: xxx (DBID=4255406167)
RMAN> switch datafile '+DATB/xxxx/datafile/yyy.280.725039609' to copy;

The output should be similar to:

using target database control file instead of recovery catalog
datafile 68 switched to datafile copy "+DATB/xxx/datafile/yyy.280.725039609"

6.- Perform media recovery of the datafile:
RMAN> recover datafile '+DATB/xxx/datafile/yyy.280.725039609';

7.- Make the datafile available by changing the status to online:
SQL> alter database datafile '+DATB/xxx/datafile/yyy.280.725039609' online;

8.- Check the status and location of the datafile:
SQL> select file_name, status from dba_data_files where tablespace_name = 'yyy';
FILE_NAME                                        STATUS
------------------------------------------------ ---------
+DATB/xxx/datafile/yyy.339.718314941             AVAILABLE
+DATB/xxx/datafile/yyy.358.718303181             AVAILABLE
+DATB/xxx/datafile/yyy.338.718314943             AVAILABLE
+DATB/xxx/datafile/yyy.309.718325181             AVAILABLE
+DATB/xxx/datafile/yyy.280.725039609             AVAILABLE
+DATB/xxx/datafile/yyy.281.724238729             AVAILABLE

and job done.

As always, comments are welcome.

RMAN backup fails (ORA-27211: Failed to load Media Management Library) 16 July 2010

Posted by David Alejo Marcos in Oracle 11.2, RMAN.

The current backup setup for some of our systems is to perform a backup using NetBackup.

We had been experiencing some problems on one of our databases. The interesting thing is that the backup from the standby database was working fine, but the same backup from production was failing.

The problem:

RMAN> crosscheck backup;
released channel: ORA_DISK_1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of crosscheck command at 07/15/2010 11:41:52
ORA-19554: error allocating device, device type: SBT_TAPE, device name:
ORA-27211: Failed to load Media Management Library
Additional information: 2

The solution:

I spoke with our SAs and I was told we were running the right version of NetBackup. After some investigation, I decided to check the libraries myself, and this is what I found:

[oracle@standby dbhome_1]$ ls -lrt ./lib/libobk.so
lrwxrwxrwx 1 oracle oinstall 36 Jul 14 11:09 ./lib/libobk.so -> /usr/openv/netbackup/bin/libobk.so64
[oracle@primary dbhome_1]$ ls -lrt ./lib/libobk.so
lrwxrwxrwx 1 oracle oinstall 34 Jul 14 14:14 ./lib/libobk.so -> /usr/openv/netbackup/bin/libobk.so

On our primary database, the link in the ORACLE_HOME pointed to the 32-bit library, while on the standby it pointed to the 64-bit one. As our platform is 64-bit, I asked the SA to relink the library.
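
The relink itself is a one-liner along these lines (a sketch; the paths come straight from the listing above):

[oracle@primary dbhome_1]$ ln -sf /usr/openv/netbackup/bin/libobk.so64 $ORACLE_HOME/lib/libobk.so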

After this was done, I proceeded to run another test. The result was a different error:

channel c01: starting piece 1 at 15-07-2010 10:04:37
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on c01 channel at 07/15/2010 12:05:42
RMAN-10038: database session for channel c01 terminated unexpectedly

I did check the alert.log on the primary database and this is what I found:

Thu Jul 15 10:31:59 2010
SERVER COMPONENT id=UTLRP_BGN: timestamp=2010-07-15 10:31:59
SERVER COMPONENT id=UTLRP_END: timestamp=2010-07-15 10:32:01
Thu Jul 15 10:32:32 2010
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0xF] [PC:0x3C2B078820, strcpy()+16] [flags: 0x0, count: 1]
Errors in file /opt/oracle/diag/rdbms/ssssss/yyyyyy/trace/yyyyy_ora_25177.trc  (incident=39130):
ORA-07445: exception encountered: core dump [strcpy()+16] [SIGSEGV] [ADDR:0xF] [PC:0x3C2B078820] [Address not mapped to object] []
Incident details in: /opt/oracle/diag/rdbms/ssssss/yyyyyy/incident/incdir_39130/yyyy_ora_25177_i39130.trc
Thu Jul 15 10:32:32 2010
Trace dumping is performing id=[cdmp_20100715103232]
Thu Jul 15 10:32:35 2010
Sweep [inc][39130]: completed
Sweep [inc2][39130]: completed

The following is an extract from /opt/oracle/diag/rdbms/sssssss/yyyyyy/trace/yyyyy_ora_25177.trc:

*** 2010-07-15 10:32:32.012
*** SESSION ID:(138.18967) 2010-07-15 10:32:32.012
*** CLIENT ID:() 2010-07-15 10:32:32.012
*** SERVICE NAME:(SYS$USERS) 2010-07-15 10:32:32.012
*** MODULE NAME:(rman@mandela.marketxs.com (TNS V1-V3)) 2010-07-15 10:32:32.012
*** ACTION NAME:(0000006 STARTED62) 2010-07-15 10:32:32.012
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0xF] [PC:0x3C2B078820, strcpy()+16] [flags: 0x0, count: 1]
Incident 39130 created, dump file: /opt/oracle/diag/rdbms/sssssss/yyyyy/incident/incdir_39130/yyy_ora_25177_i39130.trc
ORA-07445: exception encountered: core dump [strcpy()+16] [SIGSEGV] [ADDR:0xF] [PC:0x3C2B078820] [Address not mapped to object] []
ssexhd: crashing the process...
Shadow_Core_Dump = PARTIAL

I could not find any information regarding ORA-07445 and strcpy()+16, and the contents of /opt/oracle/diag/rdbms/ssssss/yyyyyy/incident/incdir_39130/yyyy_ora_25177_i39130.trc did not help much:

Dump continued from file: /opt/oracle/diag/rdbms/horvitz/UXS/trace/UXS_ora_25177.trc
ORA-07445: exception encountered: core dump [strcpy()+16] [SIGSEGV] [ADDR:0xF] [PC:0x3C2B078820][Address not mapped to object] []
========= Dump for incident 39130 (ORA 7445 [strcpy()+16]) ========
----- Beginning of Customized Incident Dump(s) -----
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0xF] [PC:0x3C2B078820, strcpy()+16] [flags: 0x0, count: 1]
Registers:
%rax: 0x3720393931393233 %rbx: 0x00007fff096551e0 %rcx: 0x0000000000000001
%rdx: 0x3720393931393233 %rdi: 0x3720393931393233 %rsi: 0x000000000000000f
%rsp: 0x00007fff09655138 %rbp: 0x00007fff09655160  %r8: 0x0000000000000004
%r9: 0x0000003c2b118760 %r10: 0x0000003c2b351a30 %r11: 0x0000000000000000
%r12: 0x000000001820db10 %r13: 0x0000000000000048 %r14: 0x000000000000000f
%r15: 0xffffffffffffffff %rip: 0x0000003c2b078820 %efl: 0x0000000000010213
> (0x3c2b078820) mov (%rsi),%al
(0x3c2b078822) test %al,%al
(0x3c2b078824) mov %al,(%rdx)
(0x3c2b078826) jz 0x3c2b0788e8
(0x3c2b07882c) inc %rsi

So I decided to run the backup with log and trace:

[oracle@xxxxx trace]$ cd /tmp
[oracle@xxxxxxx tmp]$ $ORACLE_HOME/bin/rman debug trace rman.trc log rman.log

The log file did not provide much information, but the trace file contained some interesting bits:

DBGMISC:      EXITED krmice [13:18:42.647] elapsed time [00:00:00:00.127]
Calling krmmpem from krmmexe
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of crosscheck command on ORA_SBT_TAPE_1 channel at 07/15/2010 13:18:42
RMAN-10032: unhandled exception during execution of job step 1:
ORA-03113: end-of-file on communication channel
ORA-06512: at line 223
RMAN-10031: RPC Error: ORA-03113  occurred during call to DBMS_BACKUP_RESTORE.VALIDATEBACKUPPIECE
DBGMISC:      ENTERED krmkursr [13:18:42.647]

I knew the package was compiled (I checked it), so my guess was a problem at NetBackup level.

The SA confirmed we were running the latest version (6.5.5), so I decided to have a look at those libraries myself and I spotted the problem:

-r-xr-xr-x 1 oracle oinstall    89873 May  1  2009 libobk.so64

The library had not been updated when the SAs deployed the patch; somehow it was still the old one, even though the patch was deployed in early June.

As soon as the SA copied libobk.so64 from the standby server, everything started to work as expected:

Starting backup at 15-07-2010 18:26:01
channel c01: starting full datafile backup set
channel c01: specifying datafile(s) in backup set
including current SPFILE in backup set
channel c01: starting piece 1 at 15-07-2010 16:26:02
channel c01: finished piece 1 at 15-07-2010 16:26:37
piece handle=83lis1oq_1_1 tag=TAG20100715T162601 comment=API Version 2.0,MMS Version 5.0.0.0
channel c01: backup set complete, elapsed time: 00:00:35
Finished backup at 15-07-2010 18:26:37
released channel: c01
RMAN>

So the problem was fixed.

Note.- Maybe I should have checked the NetBackup libraries just after finding out the link was wrong; this would have saved me 3 hours of troubleshooting.

On the other hand, it was a lesson learned and quite interesting to follow the RMAN trace file.

As always, comments are welcome.

Creating a 3-node standby database. Part II 15 March 2010

Posted by David Alejo Marcos in ASM, RAC, RMAN, Standby.

In Part I we installed the software and restored the database on a single node of a 3-node RAC cluster.

I will explain how I configured the database on that CRS and kept it in sync with the current production database. We have to bear in mind that the objective of this exercise is to move from the current production (single node) to an existing 3-node RAC using ASM.

Now we have a database that is an exact copy of our production database. We need to configure it as standby so we can keep it up to date until the migration.

1.- The first step is to create a copy of the current controlfile as standby:

$ rman target /
RMAN> backup current controlfile for standby format '/tmp/stdbyctl.bkp';

Once we have the controlfile for standby we copy it to the destination box using scp.

2.- We stop the database with shutdown immediate. It is important to check that LOG_FILE_NAME_CONVERT is defined using the correct directories to avoid problems in the following steps.

3.- Start the standby database as nomount.

4.- Restore the controlfile for standby:

$ rman nocatalog target /
RMAN> restore standby controlfile from '/tmp/stdbyctl.bkp';

5.- Mount standby database.

RMAN> alter database mount;

6.- Catalog datafiles of the standby database:

The following command will list all files and ask if you want them to be cataloged. It is important to review the list and answer YES only if the list is correct. In our ASM configuration, the path for this database was ‘+EVB_DATA/BVMPRODSH/DATAFILE/’.

RMAN> catalog start with '+EVB_DATA/BVMPRODSH/DATAFILE/';

7.- Once the command has finished, we have to “commit” those changes on the controlfile:

RMAN> switch database to copy;
RMAN> EXIT;

8.- If you were using Flashback, it is highly recommended to disable it and re-enable it to avoid problems:

$ sqlplus / as sysdba
SQL> alter database flashback off;

Database altered.

SQL> alter database flashback on;

9.- Now we have to perform some housekeeping… Starting with the redo logs:

9.1.- Find out how many groups we have on the database and clear them:

SQL> select group# from v$log;

GROUP#
----------
1
2
3

SQL> alter database clear logfile group 1;

Database altered.

SQL> alter database clear logfile group 2;

Database altered.

SQL> alter database clear logfile group 3;

Database altered.

9.2.- Now we have to do the same for the standby redo logs:

SQL> select group# from v$standby_log;

GROUP#
----------
4
5
6

SQL> alter database clear logfile group 4;

Database altered.

SQL> alter database clear logfile group 5;

Database altered.

SQL> alter database clear logfile group 6;

Database altered.

OR

9.3.- Create standby redo logs if none are present:

SQL> select group# from v$standby_log;

no row selected

SQL> alter database add standby logfile group 4 size 250m;

Database altered.

SQL> alter database add standby logfile group 5 size 250m;

Database altered.

SQL> alter database add standby logfile group 6 size 250m;

Database altered.

9.4.- Create spfile in ASM.

This step is not a must, but it is a good idea to use spfiles due to the flexibility they provide:

SQL> create spfile='+EVB_DATA' from pfile;

SQL> shutdown immediate

Edit the init.ora file and paste in the location of the spfile. This can be found using asmcmd under +EVB_DATA/BVMPRODSH/parameterfile:

$ vi $ORACLE_HOME/dbs/init.ora
spfile='+EVB_DATA/BVMPRODSH/parameterfile/spfile.309.712071991'

SQL> startup mount;

10.- Configure the standby parameters, tnsnames.ora,…

10.1.- Create entries in the tnsnames.ora on the standby server pointing to the current production database. Check you can connect as sys and can tnsping it from the standby server.

10.2.- Create entries in the tnsnames.ora on the current production server pointing to the standby database. Check you can connect as sys and can tnsping it from the production server.

10.3.- Modify log_archive_config, on all databases, to list the primary database and ALL standby databases:

SQL> alter system set log_archive_config='DG_CONFIG=(<primary>,<standby1>,<standby2>...)' scope=both sid = '*';

10.4.- Configure log_archive_dest_x on the primary database to point to the new standby database. Check the corresponding log_archive_dest_state_x is enabled:

SQL> alter system set log_archive_dest_3='SERVICE=<standby_tns_alias> LGWR ASYNC VALID_FOR=(ALL_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=<standby_db_unique_name>' scope=both sid = '*';

10.5.- configure log_archive_dest_x to point to the production database on the new standby database. Check log_archive_dest_state_x is enabled:

SQL> alter system set log_archive_dest_3='SERVICE=<db_unique_name> LGWR ASYNC VALID_FOR=(ALL_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=<db_unique_name>' scope=both sid = '*';

10.6.- Configure FAL_CLIENT and FAL_SERVER for gap resolution for primary and standby databases:

SQL> alter system set fal_client=<db_unique_name> scope=both sid='*';
SQL> alter system set fal_server=<db_unique_name for primary and standby databases> scope=both sid='*';

11.- We are good to start the recovery process on this standby database:

SQL> alter database recover managed standby database disconnect from session;

Database altered.

SQL> exit;

12.- Check the primary and standby databases are working properly by switching a couple of logfiles and checking they are being transferred and applied on all standby databases:

On primary database:

SQL> alter system switch logfile;
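
A couple of hedged checks that help at this point (standard Data Guard views, not commands from the original procedure). On the primary:

SQL> select dest_id, status, error from v$archive_dest_status where status <> 'INACTIVE';

And on each standby:

SQL> select max(sequence#) from v$archived_log where applied = 'YES';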

Now we have to register the database on the CRS…

1.- Add database
srvctl add database -d bvmprodsh -o -n evbprod

2.- Add instances
srvctl add instance -d bvmprodsh -i bvmprod1 -n
srvctl add instance -d bvmprodsh -i bvmprod2 -n
srvctl add instance -d bvmprodsh -i bvmprod3 -n

3.- Modify instance to add ASM
srvctl modify instance -d bvmprodsh -i bvmprod1 -s +ASM1
srvctl modify instance -d bvmprodsh -i bvmprod2 -s +ASM2
srvctl modify instance -d bvmprodsh -i bvmprod3 -s +ASM3

4.- Create and start services:
$ srvctl add service -d bvmprodsh -s evbprod -r <primary_node1>,<primary_node2>,... -a <available_node1><available_node2>,...
$ srvctl start service -d bvmprodsh -s evbprod
$ srvctl status service -d bvmprodsh

On part III, I will explain how we tested the configuration by doing a switchover from the current production single instance database to the 3-node RAC standby database.

As always, comments are welcome.

PRKP-1001 : Error starting instance …. followed by CRS-0215 31 December 2009

Posted by David Alejo Marcos in ASM, RAC, RMAN.

Well, I am sure this is going to be the last post for 2009 as somehow I am running out of time to fix more things and write about them here.

The problem:

I was asked to create a POC (Proof of Concept) RAC database on one of our QA clusters, as we may decide to share the cluster between two different businesses.

The specification was quite simple: a 3-node RAC running on Linux, ASM (using ASMLib) and Oracle 10.2.0.4.1.

The creation of the ASM disks was quite simple, as well as the creation of the database from a full RMAN backup using the “duplicate target” command.

Once all components were registered on the cluster, I went for the start of the database:

srvctl start database -d …

I had the following error:

PRKP-1001 : Error starting instance iiiiiii on node nnnnnna
CRS-0215: Could not start resource ‘ora.dddddd.iiiiii1.inst’.

After having a look at the logfiles and on Google, the answers did not look very promising. Some people suggested stopping the CRS (impossible for us, as the cluster was shared between QA and the POC database and we could not justify the downtime). Other people suggested a patch to fix a bug (Bug 4337645), which I thought was unlikely as it had been running before without problems.

The Solution:

Well, it was quite easy now that I know the answer. The steps I followed were:

1.- Try to start the database using sqlplus:

sqlplus ‘/as sysdba’

SQL*Plus: Release 10.2.0.4.0 – Production on Tue Dec 15 10:52:39 2009

Copyright (c) 1982, 2007, Oracle.  All Rights Reserved.

Connected to an idle instance.

SQL> startup nomount
ORA-01078: failure in processing system parameters
ORA-01565: error in identifying file ‘+DATA_POC/dddddd/spfileddddd.ora’
ORA-17503: ksfdopn:2 Failed to open file +DATA_POC/dddddd/spfileddddd.ora
ORA-15077: could not locate ASM instance serving a required diskgroup
SQL> exit

2.- Check the disks in ASMLib (as root):

/etc/init.d/oracleasm listdisks

/etc/init.d/oracleasm querydisk

3.- I knew the ASM instance was up and running as the QA database was working, so the problem had to be with the new diskgroups. The following SQL statement confirmed my suspicion: the diskgroups were not mounted:

SQL> select name, state from v$asm_diskgroup;

NAME                           STATE
------------------------------ -----------
CTRLLOG                        MOUNTED
DATA                           MOUNTED
FRA                            MOUNTED
DATA_POC                       DISMOUNTED
FRA_POC                        DISMOUNTED
LOGCTL_POC                     DISMOUNTED

4.- Mounting the diskgroups (on all 3 ASM instances) was fairly simple:

SQL> alter diskgroup data_poc mount;

Diskgroup altered.

SQL> select name, state from v$asm_diskgroup;

NAME                           STATE
------------------------------ -----------
CTRLLOG                        MOUNTED
DATA                           MOUNTED
FRA                            MOUNTED
DATA_POC                       MOUNTED
FRA_POC                        DISMOUNTED
LOGCTL_POC                     DISMOUNTED

and the same for fra_poc and logctl_poc.

Once all 3 diskgroups were mounted on all 3 nodes, I tried to start the database again using srvctl.

A couple of things to remember:

1.- The ASMLib commands have to be executed on all servers.

2.- ASM diskgroups have to be mounted on all ASM instances.

3.- When moving from a single-instance database to a RAC database, remember to set cluster_database and any other RAC-related parameters (see the sketch below).
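
A hedged sketch of that last point, assuming an spfile is in use (the values are illustrative, not taken from the POC build):

SQL> alter system set cluster_database=true scope=spfile sid='*';
SQL> alter system set cluster_database_instances=3 scope=spfile sid='*';

Both parameters are static, so the database needs a restart for them to take effect.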

Happy new year.

Tuning Rman Backups 25 November 2009

Posted by David Alejo Marcos in RMAN.

First of all, I would like to say thanks to Peter Boyes for his invaluable assistance with resolving the RMAN performance problems. Peter’s expertise is in EVA configuration and tuning.

This was an interesting problem we faced 5 weeks ago. It has been solved for Full Backups but I am still working on incremental backups…

The System:

1.-Linux X86-64

2.- 1.6 TB 3-node RAC Oracle Enterprise Edition (10.2.0.4.1) database

3.- ASM and ASMLib 10.2.0.4.1 (8 disks for data, 2 disks for FRA and 1 disk for logs).

4.- SAN EVA 8100

The problem:

We moved from 2-node RAC 10.2.0.3 running on Linux X86 and SAN EVA 8100 to the configuration mentioned above. Full backups took 9 hours on the old hardware.

Backups running on the new environment took 11 hours to finish, a big surprise to all of us as the new hardware was much, much faster.

The approach:

To find out what the problem was, we needed to reproduce it first on QA. As we did not have enough space for a full backup, it was decided to modify the RMAN backup scripts to back up a fairly large tablespace (101 GB).

I generated 10 different tests and we monitored the EVA using EVAperf and TLViz.

We have to bear in mind that the goal of this exercise was not just to reduce the time the backup runs, but also to reduce the impact on the EVA to a minimum, as it is shared with other products.

All backup scenarios ran with 2 channels (C1 and C2) of type disk, unless stated otherwise. Those backup scenarios are as follows:

Note.- I/O limitation is an RMAN feature to reduce the I/O rate per channel. To limit the I/O per channel, you only need to specify "rate" and the speed. For example, to limit it to 20 MB/s:

allocate channel C1 type disk rate 20M;

Scenario 1:

Normal backup, no I/O limitation.

Start Time: 09:02

Finish Time: 09:27

Runtime: 15.25 min C2, 15:40 min C1

Scenario 2:

normal backup, both channels limited bandwidth to 40MB/s.

start time 09:40

finish time 10:02

runtime: 20:15 min C2, 22:30 min C1

Scenario 3:

normal backup, both channels limited bandwidth to 20MB/s.

start time 10:05

finish time 10:50

runtime: 40:16 min C2, 14:51 min C1

Scenario 4:

normal backup, both channels limited bandwidth to 30MB/s.

start time 10:54

finish time 11:24

runtime: 26:56 min C2, 30:01 min C1

Scenario 5:

compress backup, no limitation.

start time 11:32

finish time 11:56

runtime: 20:55 min C2, 23:50 min C1

For this scenario, and all scenarios using "compress", CPU % idle went from an average of 86% to an average of 72%. These numbers are consistent for all compressed backups. While the backups were running, we executed a very heavy procedure and % idle went down to 56%, but it did not affect performance.

Scenario 6:

compress backup,  both channels limited bandwidth to 30MB/s.

start time 11:58

finish time 12:28

runtime: 26:56 min C2, 30:02 min C1

Scenario 7:

normal backup, no limitation, window of 20 minutes with minimize load.

start time 12:31

finish time 12:49

runtime: 16:36 min C2, 18:11 min C1

Scenario 8:

compress backup, no limitation, window of 20 minutes with minimize load.

start time 12:57

finish time 13:17

runtime: ———

ORA-19591: backup aborted because job time exceeded duration time

Scenario 9:

compress backup, both channels limited bandwidth to 30MB/s, window of 20 minutes with minimize load.

start time 13:24

finish time 13:44

runtime: ———

ORA-19591: backup aborted because job time exceeded duration time

Scenario 10:

compress backup, 4 channels no limitation, window of 20 minutes with minimize load.

start time 14:55

finish time 15:14

runtime: 07:15 min C3, 12:21 min C2, 14:25 min C4, 19:01 min C1

I have added several graphs for CPU, WriteMB and disk write latency on the EVA. I am afraid I did not allow much gap between tests as we had a limited window to perform them. For this reason it can be a bit difficult to appreciate when a backup started and the previous one finished.

CPU:

As we can see in the graph, scenario 3 (normal backup, I/O limited to 20 MB/s) produced the best results for CPU utilization.

Normal backup with I/O limited to 30 MB/s had a bigger impact than 20 MB/s, but it finished in half the time.

Compressed backup with no limitation on I/O was impressive, but it would have had an impact if we had tried to back up our 1.6 TB database.

Some results showing lower CPU were discarded almost immediately: they used a window of 20 minutes (slightly longer than the fastest test) with minimize load, but those backups never finished.

WriteMB:

As you can see in this graph, the lowest impact came from compressed backups. As soon as we tried uncompressed backups the throughput went up to 130 MB/s, while compressed backups had a throughput of 55-60 MB/s.

At this point we had two reference points: limiting I/O to 20-30 MB/s and compressing looked like a good compromise between CPU usage, throughput and time spent on the backup.

Let's have a look at the latency for those test scenarios.

Disk Write Latency:

Latency can be described as the time between a write request being received from a host and the request completion being returned. This value is normally measured in ms.

For this reason, we should aim for low latency.

From the graph, we can see that scenarios 5 and 6 (compressed backups, with and without I/O limitation) returned low latency (especially scenario 6, which had a limitation of 30 MB/s per channel).

The last spike on the graph corresponds to scenario 10, where we ran 4 channels with no limitation on I/O, compressed backup and a window of 20 minutes with minimize load. This spike is important because it showed that, for our environment, it was better to limit the I/O per channel than to leave Oracle to tune it by using a window.

From those 3 graphs we reached the conclusion that the problem was in the controllers. We moved from old hardware to brand new, higher-spec hardware that was "too fast": the EVA controllers could not keep up with the rate at which RMAN was sending data, so RMAN and, to some extent, other products using the production EVA were flooding the controllers.

Our solution for full backups of a 1.6 TB database went from 2 channels with no I/O limitation to 2 channels limited to 30 MB/s with compression (our CPUs were pretty much idle for the whole backup).

These changes proved successful: we moved from a 12-hour backup to a full backup of 4 hours 40 minutes with an output of 380 GB, a compression ratio of 4.38.

We decided to de-tune our backups to avoid any impact whatsoever on the EVA. This was done by reducing the I/O from 30MB/s to 20MB/s.
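
As a sketch, the de-tuned configuration would look something like the following run block (channel names and the exact backup command are illustrative, not taken from the actual script):

run {
  allocate channel c1 device type disk rate 20M;
  allocate channel c2 device type disk rate 20M;
  backup as compressed backupset database plus archivelog;
  release channel c1;
  release channel c2;
}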

Our full backup now runs in just over 6 hours, but that is still much better than 12 hours.

Note.- V$RMAN_BACKUP_JOB_DETAILS is a very useful view to monitor backups, compression ratios and runtimes, among other things.
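
A query along these lines pulls out the figures mentioned above (the column names are the standard ones in the view; the exact query is not from the original post):

SQL> select start_time, end_time, status,
            round(input_bytes/1024/1024/1024)  input_gb,
            round(output_bytes/1024/1024/1024) output_gb,
            compression_ratio, time_taken_display
       from v$rman_backup_job_details
      order by start_time;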

As always, comments are welcome.