Friday, May 21, 2010

MAX_UTILIZATION_LIMIT - A Very useful option in 11gR2

Resource allocation and management in large SMP systems has always been a challenge especially in large data warehouse systems. At the client sites that I have been to, the challenge will be in the form of critical ETL jobs that does massive data processing while trickle-down feeds that gets incrementally loaded in the staging tables for subsequent processing.

While Oracle manages CPU resources efficiently using database resource manager APIs, there are cases when this does not satisfy the real world requirements.

For instance, the CPU resource allocation pretty much goes by allocation based on how much resource is currently available at the time a process started. Once the process starts it is impossible to take away resources from the job for redistribution.

Imagine a scenario where a slew of jobs from a low priority group kicks off when the system is totally idle. Based on the way resource management is handled in Oracle, all the available CPU resources will be divided among the processes. Now while these processes are consuming CPU, if a job from a high priority group kicks off, the resources are not immediately available and the high priority job has to work with what is available. Potentially, if the low priority jobs are many and as a whole consumes a significant amount of CPU, then this directly will impact the high priority job.

Oracle Database 11gR2 provides a new parameter "MAX_UTILIZATION_LIMIT" for the resource manager APIs that let's us set an "absolute" maximum CPU that a group and it's associated processes can use. This is really significant. With this now we can guarantee some low priority jobs will not eat up CPU cycles while more important jobs need to take a hit.

Below is a sample.

Some portion of the code was re-used from Jonathan Lewis's kill_cpu.

The test was conducted in a completely idle system running 11gR2/OEL5/ASM 11R2.

In the first case, the plan with 2 groups with "max_utilization" resource allocation of 50/50 was created and a simple create table statement was executed. The tkprof output,

create table tab_stingy_cpu_new_50_50 as
  select a.name,a.type,a.line,a.text,b.*
from dba_source a, dba_objects b
 where rownum <=1000000

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.12       0.18          0          0          0           0
Execute      1      7.04      27.22      10554      12559      27033     1000000
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        2      7.16      27.41      10554      12559      27033     1000000


The next sample shows, the elapsed time change when the max_utilization_limit is set to a higly imbalanced (1,99) fro the same respective groups.

exec dbms_resource_manager.create_pending_area;
exec dbms_resource_manager.update_plan_directive('STINGY_CPU','STINGY_CPU_CON_GROUP',NEW_MAX_UTILIZATION_LIMIT=>1);
exec dbms_resource_manager.update_plan_directive('STINGY_CPU','OTHER_GROUPS',NEW_MAX_UTILIZATION_LIMIT=>99);
exec dbms_resource_manager.validate_pending_area;
exec dbms_resource_manager.submit_pending_area


SQL> select a.sid,a.resource_consumer_group,
b.MAX_UTILIZATION_LIMIT from v$session a, DBA_RSRC_PLAN_DIRECTIVES  b where plan='STINGY_CPU'
and a.resource_consumer_group=b.group_or_subplan
and a.sid=(select sys_context('USERENV','SID') from dual)  2    3    4 
  5  /

       SID RESOURCE_CONSUMER_GROUP          MAX_UTILIZATION_LIMIT
---------- -------------------------------- ---------------------
       146 STINGY_CPU_CON_GROUP                                 1

create table tab_stingy_cpu_new_1_99 as
  select a.name,a.type,a.line,a.text,b.*
from dba_source a, dba_objects b
 where rownum <=1000000

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.12       6.17          0          0          0           0
Execute      1      7.03     305.25      10554      12559      27033     1000000
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        2      7.16     311.43      10554      12559      27033     1000000



Amazingly, to my surprise the CPU time is exactly the same (7.16 seconds). I confess, I did not expect this :) However, if you notice the elapsed time is 311 seconds. This is because, the max_utilization_limit enforces that the process does not take too much CPU though the machine is complete idle.

This in my opinion is an amazing feature from Data Warehousing systems perspective where I can be confident with my ETL load and Materialized view refresh timings.

After all, consistency is more important than speed when it comes to explaining performance to end users.

All comments welcome.

No comments: