LGWR is blocking sessions (or running very slow). 5 February 2010
Posted by davidalejomarcos in Enterprise Manager, Tuning.Tags: Tuning
add a comment
I am aware this is a very generic subject or a very specific problem to every system or database, so I am going to write about the problem we had with one of our databases at the end of last week.
The problem:
Last Friday, at lunchtime, we had a huge spike of orange looking wait event related to Commit (see fig_1) on Enterprise Manager. After digging a bit on some v$ views and Enterprise Manager we found that the LGWR was blocking several user sessions.
We raised a Severity 1 SR with oracle as the symptoms where the same as the ones described on Bug 5230765 – LGWR spins while waiting for ‘log file parallel write’ using ASM [ID 5230765.8. This bug was technically solved on 10.2.0.4 but it is always good to check things.
After uploading AWRs for the time the database was affected and the previous day when it was fine, some screen prints from Enterprise Manager, ASH for a short period of time and different log files (traces, alert log, etc); the best they could come up with was to resize the redo logs.
I must say the way they approached the problem was not right in my opinion.
Those redolog files never changed for the last 1+ year and we did not have the problem before. The problem went away at midnight and manifested itself on Monday and Tuesday, always around midday.
The solution:
The database is running on a single instance, Oracle + ASM 10.2.0.4, and Linux x86-64.
One of the trends I noticed (or lack of trend) was on the dml being executed. Some times it was an insert, sometimes and update on different objects and different schemas, so a hot block was unlikely to be the problem.
Another big clue was the number of waits in Buffer Gets and Disk Reads where well above average for both per execution and per session.
The Storage manager and the SAs checked the system several times while it was performing really bad and both said that it did not look different from previous days or weeks, so not much on their side.
Business also confirmed that the workload was not different from the previous day. We also have to bear in mind that the first time we had the problem was on the 31st; but the second and third time was on the 3rd and 4th, so we could discard any month end batch performing worst than before.
We decided to add two more redo log groups and increase the size by 2.5 times as this is an online operation with little impact on the business if the time frame is the right one, making a little different as the I/O and CPU was still very high.
The next change was to increase the SGA to solve the wait events on the buffer gets and disk reads. The reason I proposed to increase the SGA was that I thought we did not have enough space on the buffer cache to hold all the data the business was selecting plus inserts/updates.
As soon as the buffer cache was getting full, the LGWR was calling the DBWR to flush changes out of the memory to free space, over killing the server.
This change was not easy to get approved as requires downtime and the system is critical, but after missing couple of SLAs it was a go-go situation.
Once the change was done (increase the SGA_TARGET and SGA_MAX_SIZE), we monitored the system closely. All processes and batches finished on time and the load was much less (see fig_2 below).
Note.- the gray spike is due to a network problem between the primary site and the DR.
Conclusion:
You have to know your business and your databases. Oracle support came with an out of the book solution of increasing the redo logs but, as mentioned before, we did not change the size, neither the number for over 1 year.
We can get sometimes to focus on a single point (LGWR). Sometimes is good to make a step backwards and check the system. Different segments, different schemas, different dml and a huge amount buffer gets and disk read were the clue, not just LGWR blocking sessions.
We all have the rule of not switching redo logs more than x amount of times per hour, but sometimes we need to know why we have reached that number and not just increase the size as this may mask an underlying problem.
As always, comments are welcome.
Modify RAC Services. 28 January 2010
Posted by davidalejomarcos in RAC.Tags: RAC, service
add a comment
Today I had to redefine the services on our 3 nodes cluster.
The change has been triggered by a new database running on the cluster and a more logical distribution of the business as we noticed an increase on the load of data using the interconnect.
Batches were running on two different nodes, depending of the type of product. Some tables are specific to the product, but we still have many common tables and all the information retrieved from common tables were being transferred from node to node.
After this change, all batches but one will run on the same preferred node.
The problem:
Redistribute existing services as per the following design:
Existing configuration:
Totem: PREF: qa1 qa2 qa3 AVAIL: TAF: NONE
CDS: PREF: qa2 AVAIL: qa3 TAF: NONE
Bonds: PREF: qa1 AVAIL: qa3 TAF: NONE
New configuration:
Totem: PREF: qa3 AVAIL: qa1 qa2 TAF: NONE
CDS: PREF: qa1 AVAIL: qa2 TAF: NONE
Bonds: PREF: qa1 AVAIL: qa2 TAF: NONE
The solution:
The way we were going to implement it was by dropping the existing service and re-creating it. This is not a bad idea, but implies a reboot of our application server and the possibility of errors on the application if anything or anyone tries to connect to the database using that service at the time we were re-creating it (we are a 24*7 shop).
After having a look on the Oracle manuals (Oracle RAC Administrator guide, appendix B), I found a better way to do it.
srvctl modify service gives us the possibility of redefine the service without need of re-creating it or even stopping it.
Now we can upgrade available nodes to preferred, downgrade them or simply creating a new definition while the service is still running (Note.- Stop and start the service is recommended as there are some restrictions, i.e. newly added nodes will not be used until the service is restarted).
So, the syntax is as follows:
srvctl modify service -d db_unique_name -s service_name -n -i pref_inst_list [ -a avail_inst_list] [-f]
where:
-n = Only uses the instances named for this service (unnamed instances already assigned to the service are removed).
-i = List of preferred instances.
-a = List of available instances.
-f = Disconnect all sessions during stop or relocate service operations.
In our case, the command I had to execute was:
srvctl modify service -s Totem -d qa -n -i qa3 -a qa1,qa2
to check the change:
[oracle ~]$ srvctl config service -s Totem -d qa -a
and the other services:
[oracle ~]$srvctl modify service -s CDS -d qa -n -i qa1 -a qa2
[oracle ~]$srvctl modify service -s bonds -d qa -n -i qa1 -a qa2
This will save us some typing and the potencial of making a mistake on the naming.
As always, comments are welcome.
PRKP-1001 : Error starting instance …. followed by CRS-0215 31 December 2009
Posted by davidalejomarcos in ASM, RAC, RMAN.Tags: ASM, RAC, RMAN
add a comment
Well, I am sure this is going to be the last post for 2009 as somehow I am running out of time to fix more things and write about them here.
The problem:
I was asked to create a POC (Proof Of Concept) RAC database on one of our QA clusters as we may decide to share the cluster for two different businesses.
The specification was quite simple, 3 node RAC running on Linux, ASM (using ASMLib) and Oracle 10.2.0.4.1.
The creation of the ASM disks was quite simple, as well as the creation of the database from a full RMAN backup using the “duplicate target” command.
Once all components were registered on the cluster, I went for the start of the database:
srvctl start database -d …
I had the following error:
PRKP-1001 : Error starting instance iiiiiii on node nnnnnna
CRS-0215: Could not start resource ‘ora.dddddd.iiiiii1.inst’.
After having a look on the logfiles and in google, the answers did not look really good. Some people suggested to stop the crs (impossible for us as it was shared with QA and the POC database and could not justify the downtime). Other people suggested a patch to fix a bug (Bug 4337645), which I thought it was unlikely as it was running before without problems.
The Solution:
Well, it was quite easy now that I know the answer. The steps I followed were:
1.- Try to start the database using sqlplus:
sqlplus ‘/as sysdba’
SQL*Plus: Release 10.2.0.4.0 – Production on Tue Dec 15 10:52:39 2009
Copyright (c) 1982, 2007, Oracle. All Rights Reserved.
Connected to an idle instance.
SQL> startup nomount
ORA-01078: failure in processing system parameters
ORA-01565: error in identifying file ‘+DATA_POC/dddddd/spfileddddd.ora’
ORA-17503: ksfdopn:2 Failed to open file +DATA_POC/dddddd/spfileddddd.ora
ORA-15077: could not locate ASM instance serving a required diskgroup
SQL> exit
2.- check the disk on ASMLib (as ROOT):
/etc/init.d/oracleasm listdisks
/etc/init.d/oracleasm querydisk
3.- I knew the ASM instance was up and running as the QA database was working. So the problem had to be on the new diskgroups. The following sql statement confirmed my suspicions, the diskgroups were not mounted:
SQL> select name, state from v$asm_diskgroup;
SQL> select name, state from v$asm_diskgroup;
NAME STATE
—————————— —————–
CTRLLOG MOUNTED
DATA MOUNTED
FRA MOUNTED
DATA_POC DISMOUNTED
FRA_POC DISMOUNTED
LOGCTL_POC DISMOUNTED
3.- Mounting the diskgroups (on all 3 ASM instances) was fairly simple:
SQL> alter diskgroup data_poc mount;
Diskgroup altered.
SQL> select name, state from v$asm_diskgroup;
NAME STATE
—————————— —————–
CTRLLOG MOUNTED
DATA MOUNTED
FRA MOUNTED
DATA_POC MOUNTED
FRA_POC DISMOUNTED
LOGCTL_POC DISMOUNTED
and the same for fra_poc and logctl_poc.
once all 3 diskgroups were mounted on all 3 nodes, I try to start the database again using srvctl.
Couple of things:
1.- ASMLib have to be execute on all servers
2.- ASM Diskgroups have to be mounted and all ASM instances.
3.- When moving from a single database to a RAC database, remember to set the cluster_database and any other RAC related parameters.
Happy new year.
ORA-06502: character string buffer too small while re/creating dbconsole 18 December 2009
Posted by davidalejomarcos in Enterprise Manager, Grid Control, SQL - PL/SQL.Tags: Enterprise Manager, Grid Control, SQL - PL/SQL
add a comment
wow,
we are getting close to the end of this year. This may be the last post for 2009, we will have to think about next year…
The problem:
OK, the last thing we faced here was the following error while re-creating the dbconsole using the standard emca -config dbcontrol db -repos recreate command:
CONFIG: ORA-06502: PL/SQL: numeric or value error: character string buffer too small
ORA-06512: at line 259
oracle.sysman.assistants.util.sqlEngine.SQLFatalErrorException: ORA-06502: PL/SQL: numeric or value error: character string buffer too small
ORA-06512: at line 259
followed for some nasty Java alerts and:
SEVERE: Error creating the repository
….
oracle.sysman.emcp.exception.EMConfigException: Error creating the repository
The solution:
Well, the first thing to check is that you are on the right ORACLE_HOME, this is not going to solve the problem, but it will help us to modify the script on the right ORACLE_HOME the first time.
The problem is located on a variable used to store the name of the host.
Normally this variable is defined for 32 characters. This is normally right for most systems, but we had a problem with one of ours as the domain of the servers plus the name of the database had a length of 33 characters.
How can we check the length of our host_name?, quite simple:
SYS@XXXXXXX> col host_name for a40
SYS@databee> select host_name, length(host_name) length from v$instance;
HOST_NAME LENGTH
————————————————————————– ———-
hhhhhhhhhhhhhhhhh.dddddd.dddddddd 33
Elapsed: 00:00:00.09
I had to change the names for security reasons.
Ok, the problem is down to a single character, the length is 33 characters and the variable has been defined as:
l_host_name VARCHAR2(32);
so it can only store 32.
the fix is quite simple.
Steps:
1.- Check you are on the right ORACLE_HOME
echo $ORACLE_HOME
2.- cd $ORACLE_HOME/sysman/admin/emdrep/sql/core/latest/self_monitor
3.- cp self_monitor_post_creation.sql self_monitor_post_creation.sql_bk
4.- Edit self_monitor_post_creation.sql and replace
l_host_name VARCHAR2(32);
for a length that can store your host_name
l_host_name VARCHAR2(35);
NOTE.- there are two places where this variable is defined on the script, remember to modify the definition on both places.
Once the script has been changed, you can rerun the command to re/create the dbconsole:
emca -config dbcontrol db -repos recreate
Perfect.
As always, comments are welcome.
Merry Xmas.
Importance to choose a good name for a variable 5 December 2009
Posted by davidalejomarcos in SQL - PL/SQL.Tags: archivelog, pl/sql, redolog, SQL, tune, variable
add a comment
After a very busy week, here we are again.
Last week was a busy one for different reasons. I was in Birmingham on Tuesday for the UKOUG. I will post my impressions regarding the presentations I went to.
The other reason is daily work.
On Monday we notices a huge amount of redo log being generated between 04:00 and 07:00. When I mean huge it really means huge. We moved from having 3 archive logs (500MB each) to 17 archivelogs.
We saw this increase not as one off (it could have been our batches trying to process an unusual volume of data) but it wasn’t.
So, lets see how we did troubleshoot the problem:
Symptoms:
Archivelog increase from 3 files to 17 in one hour.
Problem:
Standby database in US was lagging behind, this generated tickets and phone calls to the on call DBA at 04:15 am.
Analysis:
We started the analysis using a simple query to show how many log switches the database has performed per hour.
It was clear we had a dramatic increased in switches.We also noticed that this increased started on Monday at 04:00. Our batches are very light on Monday morning as we do not receive any data during the weekend. Data arrives to our systems from Monday to Friday.
Using this knowledge we could almost discard batches as the source of the problem.
At 06:00 we noticed that a specific Oracle job (AUTO_SPACE_ADVISOR_JOB) was taking longer than necessary and performing more inserts than last week.
The decision was made of disable the job for one night, to see it that helped. It was a shot in the dark, but worth trying.
Next day we noticed a decrease in the number of archivelogs, but only by 2 (still 1GB). After enabling the job, we decided to attach the problem from a different direction.
Oracle has some very cool tables so see what happened in the past. Those views start with dba_hist_*. I created the following query to find out which user was generating more redo on the database:
SELECT COUNT(*),
a.sql_id,
c.username
FROM dba_hist_active_sess_history a,
dba_hist_sqltext b,
dba_users c
WHERE a.sample_time BETWEEN
to_date ('26/11/2009 04','dd/mm/yyyy hh24')
AND to_date ('26/11/2009 05','dd/mm/yyyy hh24')
AND a.sql_id = b.sql_id
AND (UPPER(b.sql_text) like '%INSERT%'
OR UPPER(b.sql_text) LIKE '%UPDATE%'
OR UPPER(b.sql_text) LIKE '%DELETE%'
)
AND a.user_id = c.user_id
GROUP BY a.sql_id, c.username
ORDER BY 1 DESC;
This query showed me who executed more inserts/updates/deletes during the window defined.
I run the query for November 26th and December 3rd and different windows (04:00 – 05:00 and 06:00-07:00).
The query proved very useful and identified the user and the sql_id that run the most (for security reasons, I have amended the username):
COUNT(*) SQL_ID USERNAME -------- ------------- ---------- 58 0h9cf9v0d1bt0 batch1 9 a8gj2w0at9xjp batch2 6 c8wust7dkbqzy user1 4 4sr1q0ryvy6zn batch3 4 9aywg0c0wzmxv batch3
for December 3rd, this was the output:
COUNT(*) SQL_ID USERNAME ------- ------------- ----------- 168 9bmbf5t7580d8 User11 50 0h9cf9v0d1bt0 batch1 4 cg8dxqb0s8z7j user2
Wow, the number of executions for that single command was unexpected. Lets see the code:
SELECT *
FROM dba_hist_sqltext
WHERE sql_id = '9bmbf5t7580d8';
DBID SQL_ID SQL_TEXT
------- ------------- ----------------------------------
XXXXXX 9bmbf5t7580d8 UPDATE TABLE SET FIELD3= FIELD3,
FIELD4 = FIELD4, FIELD5= FIELD5
WHERE DATE = DATE AND
CURVE = CURVE AND
INSTRUMENT = INSTRUMENT
A quick word with the developers confirmed that the previous weekend a new feature was deployed. It was a new code that performed inserts if the data was not there or updates if the data was already on the database (I did not want to explain what MERGE command did at that time as this was a production problem).
I managed to have a copy of the procedure. The developers mentioned that the inserts were running very fast, but if the data was already on the table, the same procedure was very slow, so they run it in 2 threads from the application server.
I created a simple test on our QA environment and was able to reproduce the problem. The code is below:
procedure mypackage(
date table.date%TYPE,
curve table.CURVE%TYPE,
instrument table.INSTRUMENT%TYPE,
field3 table.field3%TYPE,
field4 table.field4%TYPE,
field5 table.field5%TYPE) IS
BEGIN
......
BEGIN
insert into table (date, CURVE, INSTRUMENT,
field3, field4, field5)
VALUES (date, CURVE, INSTRUMENT,
field3, field4, field5);
EXCEPTION
WHEN DUP_VAL_ON_INDEX THEN
UPDATE table SET
field3 = field3,
field4 = field4,
field5 = field5
WHERE date = date and curve = curve
and instrument = instrument;
END;
END mypackage;
Ok, even with the modified version of the code, you should be able to spot the problem.
Basically, the developer has decide to use the same name for the variable and the column to insert/update. The variable curve was storing the value to insert in the column curve of the table.
For the Insert, this is fine. Oracle will check the value of the variable, but for the update, the rules are different. Oracle was scanning and updating any single row of the table (8.9 million) as the condition:
WHERE date = date AND curve = curve AND instrument = instrument
was similar to do:
WHERE 1=1.
The second thread was updating every single row of the table (8.9 million) after every single insert done by the 1st thread…. No wonder we were generating so many archivelog files.
Solution:
The first step to fix the problem has been to disable one of the threads. We are lucky that the application is new, so it mostly performs inserts.
Developers are working now in renaming all variables and parameter to use a different name convention, something like p_<parameter>, l_<variable>, etc.
As always, comments are welcome.
Tuning Rman Backups 25 November 2009
Posted by davidalejomarcos in RMAN.Tags: backup, RMAN
add a comment
First of all, I would like to say thanks to Peter Boyes for his invaluable assistance with resolving the RMAN performance problems. Peter’s expertise is in EVA configuration and tuning.
This was an interesting problem we faced 5 weeks ago. It has been solved for Full Backups but I am still working on incremental backups…
The System:
1.-Linux X86-64
2.- 1.6 TB 3-node RAC Oracle Enterprise Edition (10.2.0.4.1) database
3.- ASM and ASMLib 10.2.0.4.1 (8 disk for data, 2 disk for FRA and 1 disk for Logs).
4.- SAN EVA 8100
The problem:
We moved from 2-node RAC 10.2.0.3 running on Linux X86 and SAN EVA 8100 to the configuration mentioned above. Full backups took 9 hours on the old hardware.
Backups running on the new environment took 11 hours to finish, a big surprise to all of us as the new hardware was much, much faster.
The approach:
To find out what the problem was we needed to reproduce the problem first on QA. As we did not have enough space for a full backup it was decided to modify the RMAN backup scripts to back up a fairly large tablespace (101GB).
I generated 10 different test and we monitored the EVA using EVAperf and TLViz.
We have to bear in mind that the goal of this exercise is not just reduce the time the backup runs but also to reduce the impact on the EVA to a minimum as it is shared with other products.
All backup scenarios run with 2 channels (C1 and C2) type disk. Those backup scenarios are as follows:
Note.- I/O limitation is a RMAN feature to reduce the I/O rate per channel. To limit the I/O per channel, you only need to specify “rate” and the speed. For example, to limit to 20MB:
allocate channel C1 type disk rate 20M;
Scenario 1:
Normal backup, no I/O limitation.
Start Time: 09:02
Finish Time: 09:27
Runtime: 15.25 min C2, 15:40 min C1
Scenario 2:
normal backup, both channels limited bandwidth to 40MB/s.
start time 09:40
finish time 10:02
runtime: 20:15 min C2, 22:30 min C1
Scenario 3:
normal backup, both channels limited bandwidth to 20MB/s.
start time 10:05
finish time 10:50
runtime: 40:16 min C2, 14:51 min C1
Scenario 4:
normal backup, both channels limited bandwidth to 30MB/s.
start time 10:54
finish time 11:24
runtime: 26:56 min C2, 30:01 min C1
Scenario 5:
compress backup, no limitation.
start time 11:32
finish time 11:56
runtime: 20:55 min C2, 23:50 min C1
for this scenario, and all scenarios using “compress”, CPU % idle went from an average of 86% to an average of 72%. These numbers are consistent for all compressed backups. While backups were running, we executed a very heavy procedure and %Idle went down to 56%, but it did not affect performance.
Scenario 6:
compress backup, both channels limited bandwidth to 30MB/s.
start time 11:58
finish time 12:28
runtime: 26:56 min C2, 30:02 min C1
Scenario 7:
normal backup, no limitation, window of 20 minutes with minimize load.
start time 12:31
finish time 12:49
runtime: 16:36 min C2, 18:11 min C1
Scenario 8:
compress backup, no limitation, window of 20 minutes with minimize load.
start time 12:57
finish time 13:17
runtime: ———
ORA-19591: backup aborted because job time exceeded duration time
Scenario 9:
compress backup, both channels limited bandwidth to 30MB/s, window of 20 minutes with minimize load.
start time 13:24
finish time 13:44
runtime: ———
ORA-19591: backup aborted because job time exceeded duration time
Scenario 10:
compress backup, 4 channels no limitation, window of 20 minutes with minimize load.
start time 14:55
finish time 15:14
runtime: 07:15 min C3, 12:21 min C2, 14:25 min C4, 19:01 min C1
I have added several graphs for CPU, WriteMB and Disk write Latency on the EVA. I am afraid I did not allow much gap between test as we had a limited window to perform our tests. For this reason it can be a bit difficult to appreciate when a backup started and the previous finished.
As we can see on the graph, the scenario 3 (normal backup, I/O limited to 20MB/s) produced the best results on CPU utilization.
Normal backup with I/O limited to 30MB/s had a bigger impact that 20MB/s, but if finished in half the time.
Compressed back with no limitation on I/O was impressive,but it would have had an impact if we tried to backup our 1.6TB database.
Some results showing lower CPU were discarded almost immediately as we were using a window of 20 minutes (slightly longer that the fastest test) with minimizing load, but those backups never finished.
As you can see in this graph, the lowest impact was done by compressing backups. As soon as we tried to use uncompressed backups those graphs the throughput went up to 130 MBps, while compressed backups had a throughput of 55-60 MBps.
At this point we had to references, limiting I/O to 20-30MBps and compressed looked a good compromises between CPU usage, throughput and time spent for the backup.
Lets have a look the latency for those test scenarios.
Latency can be described as the time between a write request is received from a host and the time the request completion is returned”. This value is normally measured in ms.
For this reason, we should aim for low latency.
From the graph, we can see that scenarios 5 and 6 (compressed backups with both, limitation in I/O and without limitation) returned low latency (specially scenario 6 which had a limitation of 30MB/s per channel.
The last spike on the graph corresponds to scenario 10, where we run 4 channels with no limitation on I/O, compressed backup and a window of 20 minutes with minimize load. This spike is important because it showed that for our environments, it was better to limit the I/O per channels rather than leaving Oracle to tune it by using a window.
From those 3 graphs we reached the conclusion that the problem was on the controllers. We moved from an old hardware to a brand new, higher spec hardware and it was “too fast” for the EVA controllers to write at the same ratio that rman was sending information so rman and, to some extend, other products using the Production EVA was flooding the EVA controllers.
Our solution for full backups on a 1.6TB went from 2 channels no limitation in I/O to 2 channels limited to 30MB/s and compression (our CPUs were pretty much idle for the whole backup).
This changes proved successful as we moved from a 12 hours backup to a 4 hours 40 minutes full backup with and output of 380 GB, resulting on a compress ratio of 4.38.
We decided to de-tune our backups to avoid any impact whatsoever on the EVA. This was done by reducing the I/O from 30MB/s to 20MB/s.
Our full backup now runs in just over 6 hours, but still much better that 12 hours….
Note.- V$RMAN_BACKUP_JOB_DETAILS is a very useful view to monitor backups, compression rates, and runtime among others.
As always, comments are welcome.
Listener status is down (or pending) in Enterprise Manager 24 November 2009
Posted by davidalejomarcos in Enterprise Manager.Tags: Agent, Grid Control
add a comment
This solution works for both Agents uploading data to a Grid Control Repository or to DBConsole.
We have a 3-nodes RAC and we are using dbconsole to monitor our production systems. We will be deploying Grid Control any time soon (providing Oracle fixes couple of bugs we discovered some time ago).
Until then, dbconsole is good enough.
Our configuration is as follows:
1.- Linux x86-64
2.- Oracle ASM (using ASMLib) 10.2.0.4
3.- Oracle Enterprise Edition 10.2.0.4.1
As you may know, Oracle recommends to run the listener from the ASM home instead of the database home and to have a different set of binaries for ASM in case you need to upgrade ASM (or you have more than one database running on the same server using different Oracle versions).
We run dbconsole from the Database Oracle home. This is as expected, but this leads us towards our problem, as the default location for SQL*Net configuration on RAC is the Oracle Home for the database.
For some time we have to deal with not knowing if the listener was down (as displayed on dbconsole) or pending (which normally indicates the agent is not uploading information to the repository.
Steps to deal with the problem are:
1.- Stop dbconsole/agent on all nodes.
2.- localte the target.xml file (normally under $ORACLE_HOME/sysman/emd/targets.xml if using dbconsole or $ORACLE_HOME/agent/sysman/emd/targets.xml if using Grid Control).
3.- make a backup of the file (always good practice)
4.- edit the file:
you need to search for an entry with the following format:
<Target TYPE=”oracle_listener” NAME=”LISTENER_xxxxxx”>
then you need to change the path on two properties:
<Property NAME=”ListenerOraDir” VALUE=”xxxxxxxxxx”/>
<Property NAME=”OracleHome” VALUE=”yyyyyyyyyyy”/>
where xxxxx should be your $ORACLE_HOME/network/admin
and yyyyyyy should be your $ORACLE_HOME
ie: /u01/app/oracle/admin/product/10.2.0/asm_1
5.- start dbconsole/agent on all nodes
6.- check the status of those listeners are up.
I read some documents on Oracle Support (old Metalink) and none work for us.
One of them was interesting by the fact of suggesting to add an environment variable (TNS_ADMIN) on the server. This is not recommended (at least on our systems) as we run ASM and DB and, hopefully one day, Grid Control Agent, and we like to keep things separated.
As always, comments are welcome.
ORA-15063:ASM discovered an insufficient number of disks for diskgroup …. 16 November 2009
Posted by davidalejomarcos in ASM.Tags: ASM
add a comment
Last Friday we started a task to move our local standby database from FC to FATA disks. The goal is to reduce costs as this standby database is only used to refresh QA/DEV environments and not as a proper DR.
We have three databases on this server:
1.- prod1 – local standby database
2.- whqaint – QA Integration database
3.- whtest – test database (50GB size)
The first two (prod1 and whqaint) are a copy of our production database (1.6TB) running on EVA. Whtest was created as a proof of concept (POC) for some testing.
All of them run on 10.2.0.4 and ASM, and all of them have different diskgroups on ASM (data, datb, dat_c).
We started by shutting down all three database, the asm and asmlib.
After the snap was done and the server restarted (to refresh the SCSi bus) I started asmlib.
All looked ok as all disk where listed but, when I queried one of the disk, is when all started to go wrong:
[root@xxxxxxxx init.d]# ./oracleasm querydisk DATA1
Disk “DATA1″ is labeled for disk “DATA4″
Well, something was wrong, but did not know what. The first call was to stop asmlib, disable asmlib, enable asmlib, start asmlib and perform a scandisk.
After these steps were done, the label matched with the Disk. Also a quick check on /dev/oracleasm/disks showed all disks, so far so good.
Once I started the asm, is when I knew it will be a long day:
ORA-15032: not all alterations performed
ORA-15063: ASM discovered an insufficient number of disks for diskgroup “FRB”
ORA-15063: ASM discovered an insufficient number of disks for diskgroup “FRA”
ORA-15063: ASM discovered an insufficient number of disks for diskgroup “FRA_C”
ORA-15063: ASM discovered an insufficient number of disks for diskgroup “DATB”
ORA-15063: ASM discovered an insufficient number of disks for diskgroup “DATA”
ORA-15063: ASM discovered an insufficient number of disks for diskgroup “DATA_C”
ORA-15063: ASM discovered an insufficient number of disks for diskgroup “CTRLLOG”
ORA-15063: ASM discovered an insufficient number of disks for diskgroup “CTRLLGG”
ORA-15063: ASM discovered an insufficient number of disks for diskgroup “CNTRL_C”
After speaking with the Storage Managed and the SA, all of them confirmed the LUNs were presented and all should be good.
I had a list of the devices before and after the snap, so I could see that it was not the same, the WWIDs were also different, meaning oracleasm should have scanned and recognized the new devices, but it was not the case.
Once thing we have to remember is that, for some reason, oracleasm LVM volumnes.
Oracleasm scans /proc/partitions table to detect block devices with ASM headers.
LVM managed devices are presented to the kernel as /dev/mapper/*, while /proc/partitions are /dev/dm-*
This was the first problem. The good thing is that by scanning those disk again, we managed to fix one of the problems:
[root@xxxxxxxx init.d]# ./oracleasm querydisk data1
Disk “DATA1″ is a valid ASM disk
At this point the only way forward I could think of was reinstalling asmlib, something I did not feel comfortable.
I had a look to our ASM spfile:
*.asm_diskgroups=’DATA’,'FRA’,'CTRLLOG’,'CTRLLGG’,'DATB’,'FRB’,'DATA_C’,'CNTRL_C’,'FRA_C’
*.asm_diskstring=’/dev/oracleasm/disks/’
As we were having problems identifying disks on asm_diskgroups, I decided to comment all three entries and start asm. ASM started without problems (as expected).
I then followed a try-and-error approach. Basically, leave asm_diskstring and +ASM.asm_diskgroups commented, create a copy of asm_diskgroups with only one diskgroup (DATA):
#*.asm_diskgroups=’DATA’,'FRA’,'CTRLLOG’,'CTRLLGG’,'DATB’,'FRB’,'DATA_C’,'CNTRL_C’,'FRA_C’
*.asm_diskgroups=’DATA’
#*.asm_diskstring=’/dev/oracleasm/disks/’
And try to start ASM again. And it worked!
To cut a long story short… I kept adding diskgroups until I found the ones failing (DATA_C, CNTRL_C and FRA_C)
Those diskgroups are the ones we created with local disks and, by default, they are mapped as /dev/dm-* (remember when I mentioned oracleasm only scans /proc/partitions and does not scan /dev/dm-*…).
Removing those diskgroups from the init+ASM.ora file allowed us to start both databases (local scandby and qaint).
Devices can be mounted in different ways, Linux (not sure about other platforms) adds those entries to /dev/dm-* by default…
Something to remember ….
Split Strings in Oracle 4 November 2009
Posted by davidalejomarcos in SQL - PL/SQL.Tags: SQL
add a comment
I received an email from one of our developers asking the following question:
Do you know how to transform a horrid legacy column consisting of semicolon-separated strings into a view containing one row per (originally semicolon-separated) value? For example:
Existing table:
Row1: Value1; Value2;
Row2: Value3; Value4; Value5;
Desired view based on the above:
Row1: Value1
Row2: Value2
Row3: Value3
Row4: Value4
Row5: Value5
I thought it was quite an interesting question, so I created a simple test:
davidmarcos@devdb> create table mytest (pos number, description varchar2(1000));
Table created.
Elapsed: 00:00:00.40
davidmarcos@devdb> insert into mytest values (1,’value1;value2;value3′);
1 row created.
Elapsed: 00:00:00.01
davidmarcos@devdb> insert into mytest values (2,’value4;value5′);
1 row created.
Elapsed: 00:00:00.03
davidmarcos@devdb> insert into mytest values (3,’value6′);
1 row created.
Elapsed: 00:00:00.04
davidmarcos@devdb> insert into mytest values (4,’value7;value8;value9;value10;value11′);
1 row created.
Elapsed: 00:00:00.28
davidmarcos@devdb> commit;
Commit complete.
Elapsed: 00:00:00.03
davidmarcos@devdb> select * from mytest;
POS DESCRIPTION
——- ————————————
1 value1;value2;value3
2 value4;value5
3 value6
4 value7;value8;value9;value10;value11
Elapsed: 00:00:00.06
The first thing I noticed is that the number of values are variable and the column can be null. The easiest way I found to find out how many different values we had per row was counting the number of “;” and adding 1 (last value does not have semicolon):
davidmarcos@devdb>select description,
2 pos,
3 (length(description)-length(replace(description,’;',”)))+1 cnt
4 from mytest
5 where description is not null;
DESCRIPTION POS CNT
—————————————— ——- ———-
value1;value2;value3 1 3
value4;value5 2 2
value6 3 1
value7;value8;value9;value10;value11 4 5
Elapsed: 00:00:00.03
This query will become the core of my select.
The next step was to split the string. I decide to use regular expressions (regexp_substr to be more specific).
To split the string I used the following:
regexp_substr(description,’[^;]+’,1,level)
Where:
description is the field.
^; indicates the string beginning with “;”
1 indicates where the search should start (in this case, the first string after “;”)
Level tells oracle which occurrence to search for.
The query looks like:
with data
as
(
select description,
pos,
(length(description)-length(replace(description,’;',”)))+1 cnt
from mytest
where description is not null
)
select distinct ltrim(regexp_substr(description,’[^;]+’,1,level)) result, pos
from data
connect by level <= cnt
order by pos, result
Note:
- ltrim was used to remove any blanks at the beginning of the string.
davidmarcos@devdb > with data
2 as
3 (
4 select description,
5 pos,
6 (length(description)-length(replace(description,’;',”)))+1 cnt
7 from mytest
8 where description is not null
9 )
10 select distinct ltrim(regexp_substr(description,’[^;]+’,1,level)) result, pos
11 from data
12 connect by level <= cnt
13 order by pos, result
14 /
RESULT POS
————————————— ———-
value1 1
value2 1
value3 1
value4 2
value5 2
value6 3
value10 4
value11 4
value7 4
value8 4
value9 4
11 rows selected.
Elapsed: 00:00:00.37
So far, so good. The only missing detail is to order the columns in a proper manner. I decided to order by level instead of description and hide it from the user (he is not really interested on the level, but on having the information ordered in the same way it was inserted on the column):
davidmarcos@devdb> with data
2 as
3 (
4 select description,
5 pos,
6 (length(description)-length(replace(description,’;',”)))+1 cnt
7 from mytest
8 where description is not null
9 ) select result, pos
10 from (
11 select distinct ltrim(regexp_substr(description,’[^;]+’,1,level)) result, pos, level
12 from data
13 connect by level <= cnt
14 order by pos, level
15 )
16 /
RESULT POS
—————————————- ———-
value1 1
value2 1
value3 1
value4 2
value5 2
value6 3
value7 4
value8 4
value9 4
value10 4
value11 4
11 rows selected.
Elapsed: 00:00:00.17
Today is Friday 30 October 2009
Posted by davidalejomarcos in Uncategorized.add a comment
And I decided to start a blog.
The reason is quite simple. A good friend and colleague of mine has suggested to share several, in our opinion, interesting test we have been doing with our systems recently.
In short, and in no specific order:
1.- Backups take longer than expected.
2.- Migrate 1.5 TB10.2.0.3 Oracle RAC 2 nodes on Linux RedHat 32b to 10.2.0.4 RAC 3 nodes on Linux RedHat 64b including physical standby database in the US.
3.- interesting query to transform:
from
row1: value1, value2, value3
row2:value4, value5
to
value1
value2
value3
value4
value5




