Importance to choose a good name for a variable 5 December 2009Posted by David Alejo Marcos in SQL - PL/SQL.
Tags: archivelog, pl/sql, redolog, SQL, tune, variable
After a very busy week, here we are again.
Last week was a busy one for different reasons. I was in Birmingham on Tuesday for the UKOUG. I will post my impressions regarding the presentations I went to.
The other reason is daily work.
On Monday we notices a huge amount of redo log being generated between 04:00 and 07:00. When I mean huge it really means huge. We moved from having 3 archive logs (500MB each) to 17 archivelogs.
We saw this increase not as one off (it could have been our batches trying to process an unusual volume of data) but it wasn’t.
So, lets see how we did troubleshoot the problem:
Archivelog increase from 3 files to 17 in one hour.
Standby database in US was lagging behind, this generated tickets and phone calls to the on call DBA at 04:15 am.
We started the analysis using a simple query to show how many log switches the database has performed per hour.
It was clear we had a dramatic increased in switches.We also noticed that this increased started on Monday at 04:00. Our batches are very light on Monday morning as we do not receive any data during the weekend. Data arrives to our systems from Monday to Friday.
Using this knowledge we could almost discard batches as the source of the problem.
At 06:00 we noticed that a specific Oracle job (AUTO_SPACE_ADVISOR_JOB) was taking longer than necessary and performing more inserts than last week.
The decision was made of disable the job for one night, to see it that helped. It was a shot in the dark, but worth trying.
Next day we noticed a decrease in the number of archivelogs, but only by 2 (still 1GB). After enabling the job, we decided to attach the problem from a different direction.
Oracle has some very cool tables so see what happened in the past. Those views start with dba_hist_*. I created the following query to find out which user was generating more redo on the database:
SELECT COUNT(*), a.sql_id, c.username FROM dba_hist_active_sess_history a, dba_hist_sqltext b, dba_users c WHERE a.sample_time BETWEEN to_date ('26/11/2009 04','dd/mm/yyyy hh24') AND to_date ('26/11/2009 05','dd/mm/yyyy hh24') AND a.sql_id = b.sql_id AND (UPPER(b.sql_text) like '%INSERT%' OR UPPER(b.sql_text) LIKE '%UPDATE%' OR UPPER(b.sql_text) LIKE '%DELETE%' ) AND a.user_id = c.user_id GROUP BY a.sql_id, c.username ORDER BY 1 DESC;
This query showed me who executed more inserts/updates/deletes during the window defined.
I run the query for November 26th and December 3rd and different windows (04:00 – 05:00 and 06:00-07:00).
The query proved very useful and identified the user and the sql_id that run the most (for security reasons, I have amended the username):
COUNT(*) SQL_ID USERNAME -------- ------------- ---------- 58 0h9cf9v0d1bt0 batch1 9 a8gj2w0at9xjp batch2 6 c8wust7dkbqzy user1 4 4sr1q0ryvy6zn batch3 4 9aywg0c0wzmxv batch3
for December 3rd, this was the output:
COUNT(*) SQL_ID USERNAME ------- ------------- ----------- 168 9bmbf5t7580d8 User11 50 0h9cf9v0d1bt0 batch1 4 cg8dxqb0s8z7j user2
Wow, the number of executions for that single command was unexpected. Lets see the code:
SELECT * FROM dba_hist_sqltext WHERE sql_id = '9bmbf5t7580d8'; DBID SQL_ID SQL_TEXT ------- ------------- ---------------------------------- XXXXXX 9bmbf5t7580d8 UPDATE TABLE SET FIELD3= FIELD3, FIELD4 = FIELD4, FIELD5= FIELD5 WHERE DATE = DATE AND CURVE = CURVE AND INSTRUMENT = INSTRUMENT
A quick word with the developers confirmed that the previous weekend a new feature was deployed. It was a new code that performed inserts if the data was not there or updates if the data was already on the database (I did not want to explain what MERGE command did at that time as this was a production problem).
I managed to have a copy of the procedure. The developers mentioned that the inserts were running very fast, but if the data was already on the table, the same procedure was very slow, so they run it in 2 threads from the application server.
I created a simple test on our QA environment and was able to reproduce the problem. The code is below:
procedure mypackage( date table.date%TYPE, curve table.CURVE%TYPE, instrument table.INSTRUMENT%TYPE, field3 table.field3%TYPE, field4 table.field4%TYPE, field5 table.field5%TYPE) IS BEGIN ...... BEGIN insert into table (date, CURVE, INSTRUMENT, field3, field4, field5) VALUES (date, CURVE, INSTRUMENT, field3, field4, field5); EXCEPTION WHEN DUP_VAL_ON_INDEX THEN UPDATE table SET field3 = field3, field4 = field4, field5 = field5 WHERE date = date and curve = curve and instrument = instrument; END; END mypackage;
Ok, even with the modified version of the code, you should be able to spot the problem.
Basically, the developer has decide to use the same name for the variable and the column to insert/update. The variable curve was storing the value to insert in the column curve of the table.
For the Insert, this is fine. Oracle will check the value of the variable, but for the update, the rules are different. Oracle was scanning and updating any single row of the table (8.9 million) as the condition:
WHERE date = date AND curve = curve AND instrument = instrument
was similar to do:
The second thread was updating every single row of the table (8.9 million) after every single insert done by the 1st thread…. No wonder we were generating so many archivelog files.
The first step to fix the problem has been to disable one of the threads. We are lucky that the application is new, so it mostly performs inserts.
Developers are working now in renaming all variables and parameter to use a different name convention, something like p_<parameter>, l_<variable>, etc.
As always, comments are welcome.