Exotic SQL: Hints that can change results

In general hints are suggestions for the Oracle cost based optimizer (CBO) that can influence performance but will never change the result. The CBO will comply with the hint at all costs – if it is possible.

True? Well not always!

There is a very small set of hints which can change the result of a select or DML statement.

The CBO is an extremly complicated piece of software, so it can happen that there are bugs in it, which in turn lead to different results. This is not what I am considering here.

So here is my collection of hints that are able to change the output of a statement.

ignore_row_on_dupkey_index

This hint allows to avoid UK errors during an insert. So only those rows are inserted that do not violate the unique index. The other rows are silently ignored.

See this example by Richard Foote about the hint:
https://richardfoote.wordpress.com/2010/12/20/oracle11g-ignore_row_on_dupkey_index-hint-micro-cuts/

Similar hints are
(CHANGE_DUPKEY_ERROR_INDEX and RETRY_ON_ROW_CHANGE)

They are all called semantic hints. Which has the side effect that they are not disabled by setting _optimizer_ignore_hints=TRUE

https://docs.oracle.com/en/database/oracle/oracle-database/12.2/sqlrf/Comments.html#GUID-4664D3D8-6312-4C15-8E8F-4872DD7A44F8

For more examples see this older post from Rob van Wijk
http://rwijk.blogspot.de/2009/10/three-new-hints.html

driving_site

I consider the driving_site hint to be one of the “good” hints. There are many cases where we need it. It is good to set it, because it enables behaviour that otherwise the optimizer can not choose.

However there is one strange special issue that I found.
I described this behaviour in an older blog post: https://wordpress.com/post/svenweller.wordpress.com/108

If we have a distributed query then SYSTIMESTAMP and also DBTIMEZONE is affected by the driving site hint. It will return the timezone from the local (no hint) or the remote (with hint) database. SYSDATE however was not affected.

This effect happend on an older 10g database, I wasn’t able to test if the behaviour still exists in 12c.

A similar effect can be reached using the materialize hint.

opt_param(‘OPTIMIZER_SECURE_VIEW_MERGING’,’true|false’)

The opt_param hint will allow to set instance parameters only for the time while the select is running. Some of those parameters can potentially influence the outcome of queries.

One of them is OPTIMIZER_SECURE_VIEW_MERGING
https://docs.oracle.com/en/database/oracle/oracle-database/12.2/refrn/OPTIMIZER_SECURE_VIEW_MERGING.html

An example from Tom Kyte:

the goal of this parameter [OPTIMIZER_SECURE_VIEW_MERGING] is to prevent a function owned by some user B from seeing data of some other user A that is should not see. For example, …

and after some setup…

b%ORA11GR2> set autotrace on explain
b%ORA11GR2> alter system set optimizer_secure_view_merging = true;

System altered.

b%ORA11GR2> select * from a.v v1 where f(y) = ‘ok to see’;

X Y
———- ——————————
1 ok to see

I see: ok to see

Execution Plan
———————————————————-
Plan hash value: 1931062764

—————————————————————————
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
—————————————————————————
| 0 | SELECT STATEMENT | | 10000 | 292K| 27206 (1)| 00:05:27 |
|* 1 | VIEW | V | 10000 | 292K| 27206 (1)| 00:05:27 |
|* 2 | TABLE ACCESS FULL| T | 10000 | 292K| 27206 (1)| 00:05:27 |
—————————————————————————

Predicate Information (identified by operation id):
—————————————————

1 – filter(“F”(“Y”)=’ok to see’)
2 – filter(“Y_F”(“X”)=1)

there it is clear that the Y_F(X)=1 is evaluated and then f(y) is

b%ORA11GR2> alter system set optimizer_secure_view_merging = false;

System altered.

b%ORA11GR2> select * from a.v v1 where f(y) = ‘ok to see’;

X Y
———- ——————————
1 ok to see

I see: ok to see
I see: NOT ok to see

whoops, my function saw data that it should not

Execution Plan
———————————————————-
Plan hash value: 1601196873

————————————————————————–
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
————————————————————————–
| 0 | SELECT STATEMENT | | 100 | 3000 | 27207 (1)| 00:05:27 |
|* 1 | TABLE ACCESS FULL| T | 100 | 3000 | 27207 (1)| 00:05:27 |
————————————————————————–

Predicate Information (identified by operation id):
—————————————————

1 – filter(“F”(“Y”)=’ok to see’ AND “Y_F”(“X”)=1)

and now we can see the converse is true

b%ORA11GR2> alter system set optimizer_secure_view_merging = true;

System altered.

b%ORA11GR2> set autotrace off

There are also some side effects with VPD. See https://antognini.ch/2011/09/optimizer_secure_view_merging-and-vpd/

Also in some cases this parameter can result in errors.
Here is a report of such a case: https://support.esri.com/en/technical-article/000010620

Another example for a parameter that can influence result sets would be RESULT_CACHE_REMOTE_EXPIRATION
https://docs.oracle.com/cd/B28359_01/server.111/b28320/initparams208.htm#REFRN10294

The default is 0. Changeing it to a positiv integer, will allow us to get “stale” data from the cache instead of fetching a fresh value from the remote source.

fresh_mv

This 12.2 hint can make a real time materialised view to refresh itself. So if the data is stale then the hint will use the MV and somehow add the missing data to it. Without the hint we would still see old data, but with it we will see new data.

https://docs.oracle.com/en/database/oracle/oracle-database/12.2/sqlrf/Comments.html#GUID-5EF4198B-50B3-40D8-B12A-3D3115C69D9B

The FRESH_MV hint applies when querying a real-time materialized view. This hint instructs the optimizer to use on-query computation to fetch up-to-date data from the materialized view, even if the materialized view is stale.

A similar connected hint would be
no_rewrite/rewrite_or_error
All of those hints have the ability to change the behaviour, especially the usage of stale materialized views. Which then can give a different result.

Disclaimer

This list has no pretence to be complete. Also some of the mentioned behaviours might change in different database versions. Especially if it is not documented and not intended behaviour.

There are more such hints. I have recorded some more, but couldn’t consistently reproduce the issue. It might have been related to some bug or some other obscure circumstance, that I can’t remember.

Here is the current list for those “watch out” candidates:

  • first_rows
  • (no_)query_transformation
  • (no_)result_cache

Result cache in combination with deterministic functions, that are not truely deterministic can easily produce wrong/different results. However the main culprit then is the wrong usage of the deterministic pragma.

And some more strange things

The SELECT clause can also influence the number of rows. And not only via distinct.

    with tbl as (select 1 val from dual union all  
                select 2 val from dual union all  
                select 3 val from dual )  
        SELECT  CASE  0  
                    WHEN  0  
                        THEN  'Aardvark'  
                    WHEN  SUM (val)  
                        THEN  'Baracuda'  
                END  AS c  
        FROM    tbl;  

Aardvark
Aardvark
Aardvark
3 rows selected.

    with tbl as (select 1 val from dual union all  
                select 2 val from dual union all  
                select 3 val from dual )  
        SELECT  CASE  6  
                    WHEN  0  
                        THEN  'Aardvark'  
                    WHEN  SUM (val)  
                        THEN  'Baracuda'  
                END  AS c  
        FROM    tbl;  

Baracuda
1 row selected.

Discovered by Frank Kulash and discussed and tested in this OTN thread
https://community.oracle.com/thread/4097998?start=15&tstart=0

Advertisements

Exotic SQL: Bubble Sort with Model Clause

Introduction

This is a “just for fun” experiment. But I demonstrate some model clause effects on the way.

Task: Sort a pipe separated list using SQL.

For example this string ’29|1|3004|3|2|24′ has 6 elements separated by |. The elements should be sorted in numerical order so that the resulting string looks like this ‘1|2|3|24|29|3004’.

Standard way

The “normal” way would be to tokenize the input string, put the tokens into rows, sort the rows and aggregate the sorted result to reassemble the string.

Here is one possible solution.


with inputdata(src)
     as (select '29|1|3004|3|2|24' from dual)
    ,num_rows (part, remains)
     as (select to_number(substr(i.src,1,coalesce(nullif(instr(i.src,'|'),0)-1,length(i.src)))) as part
               ,substr(src,instr(i.src,'|')+1) as remains
         from inputdata i
         UNION ALL
         select to_number(substr(r.remains,1,coalesce(nullif(instr(r.remains,'|'),0)-1,length(r.remains)))) as part
               ,substr(remains, nullif(instr(r.remains,'|'),0)+1) as remains
         from num_rows r
         where r.remains is not null
         )
    , combine (resultdata)
as (select listagg(to_char(part,'FM9999999999'),'|') within group (order by part) from num_rows)
select * from combine;

The num_rows subquery in this example is a recursive with clause that splits the input string into rows, one row for each token. The logic to split the string into tokens is based around substr . Essentially it just cuts of an element from the beginning of the string and keeps the remainder until there are no elements left.
The combine subquery then uses listagg to sort and convert the rows into a string again.

Standard SQL technologies that each decent developer should know about.

There are multiple ways how to tokenize,  how to create the rows and how to aggregate the rows again. But the general principle is still the same.

Here is a second example solution. This time using connect by to create rows and some regular expressions to count and split.

with inputdata(src) as (select '29|1|3004|3|2|24' from dual )
select listagg(token,'|') within group (order by to_number(token)) as res
from (
  select src, regexp_substr(src,'[^|]+', 1, level) token
  from inputdata
  connect by level <= regexp_count(src,'[|*]') + 1
  )
group by src;

Exotic way

I was wondering if we can use a totally different approach.

Instead of letting the database do the hard work and implement a sort mechanism, we can do it ourselves. Here comes bubbles sort. It is one of the most inefficient sort mechanisms you can think of (unless the list is already sorted), but it is fairly easy to implement.

Since bubble sort is an iterative approach with a simple set of rules, the model clause immediately jumped to my mind.

So here is a solution based around a slightly optimised bubble sort mechanism (gnome sort). We will take it apart afterwards.
The solution here creates several rows, but this is just to see and check each step of the iteration. It is possible to do this using one row (dim=0) only, although it is much harder to develop and to understand.

with inputdata(src) as (select '29|1|3004|3|2|24' from dual)
select *
from inputdata
model dimension by (0 as dim)
measures (  src
          , 1 ele_pos1
          , 2 ele_pos2
          , cast (regexp_substr(src,'[^\|]+',1,1) as varchar2(100)) token1
          , cast (regexp_substr(src,'[^\|]+',1,2) as varchar2(100)) token2
          , regexp_count(src,'\|')+1 max_element
          , 2 hwm
          , cast (src as varchar2(500)) as res
          )
rules iterate (500)
      until (-- we can stop if the high water mark is at the last position and no switch needed
             hwm[iteration_number] >= max_element[0]
             and (ele_pos1[iteration_number] = 1 or to_number(token1[iteration_number]) < to_number(token2[iteration_number])))
( -- switch tokens if left token > right token
  res[iteration_number+1]
    = case when to_number(token1[iteration_number]) > to_number(token2[iteration_number]) then
      -- do the switch
      regexp_replace(
             regexp_replace(res[iteration_number],'[^\|]+',token2[iteration_number],1,ele_pos1[iteration_number])
                            ,'[^\|]+',token1[iteration_number],1,ele_pos2[iteration_number])
      else res[iteration_number] -- no switch needed
      end ,
  -- calculate next position for token1
  ele_pos1[iteration_number+1]
  = case
      when ele_pos1[iteration_number] = 1 then
        -- we reached first position, so go back to hwm
        hwm[iteration_number]
      when to_number(token1[iteration_number]) > to_number(token2[iteration_number]) then
        -- after a switch, move one position the the left and check there
        ele_pos1[iteration_number] - 1
      when ele_pos1[iteration_number] + 1 <= max_element[0]-1 then
        -- no more switch, so go to hwm and look for next element
         hwm[iteration_number]
       else
         -- just in case, don't do anything
         ele_pos1[iteration_number]
       end,
   -- calculate next position for token2
   ele_pos2[iteration_number+1]
    = case
       when ele_pos2[iteration_number] = 2 then
         --ele_pos2[iteration_number] + 1
         hwm[iteration_number] + 1
       when to_number(token1[iteration_number]) > to_number(token2[iteration_number]) then
        ele_pos2[iteration_number] - 1
      when ele_pos2[iteration_number] + 1 <= max_element[0] then
        hwm[iteration_number] + 1
      else
        ele_pos2[iteration_number]
      end,
  -- calculate high water mark
  hwm[iteration_number+1]
  = greatest(hwm[iteration_number],ele_pos2[iteration_number+1]),
  -- get token1 for new calculated position1
  token1[iteration_number+1]
  = regexp_substr(res[iteration_number+1],'[^\|]+',1,ele_pos1[iteration_number+1]),
  -- get token2 for new calculated position2
  token2[iteration_number+1]
  = regexp_substr(res[iteration_number+1],'[^\|]+',1,ele_pos2[iteration_number+1])
 );

Here is the result, but I show only the important column. Feel free to run the statement yourself (works on all supported database versions) and see the other helper columns.

Result (only Res column)
29|1|3004|3|2|24
1|29|3004|3|2|24
1|29|3004|3|2|24
1|29|3|3004|2|24
1|3|29|3004|2|24
1|3|29|3004|2|24
1|3|29|2|3004|24
1|3|2|29|3004|24
1|2|3|29|3004|24
1|2|3|29|3004|24
1|2|3|29|24|3004
1|2|3|24|29|3004
1|2|3|24|29|3004

Fairly easy SQL isn’t it?

65 rows instead of 8. So if your aim is to write obfuscated SQL then this certainly is the way to go.

If you want to understand it, skip to the “algorithm explained” chapter.

Rant

Now what is wrong with the model clause!? A simple algorithm like bubble sort looks this complex?

In the recent months I have written a number of model clause solutions for different problems. However none of them is easy. Apart from very specific cases all model solutions are terrible to look at and terrible complex to build. In comparison it is much easier to do the same in excel.

I attribute this to several things.

  1. The way to reference a cell is verbose. And when we look at the code then the brackets break the eye scanning mechanism of a trained developer eye. If more than one dimension is involved it is even worse.
    Here are some examples:
    measure1[cv(),3] or  resultdata[iteration_number+1]
    It is hard to see this as one cell reference.
    In Excel this would simply be: B$3 or C2.
  2. The way to reference a cell from the previous row is complex. It requires a working calculation for the cell address. We can do this by applying the row_number() analytic function or using iteration_number (for iterative models). But this requires some extra logic.
  3. There is no simple way to set several different measures at once. A rule can only be applied for one measure. This means that sometimes we need to repeat the same logic for a different measure.
    In the bubble sort example this happened for the calculation for the position of token 1 and for position of token 2. In this case it can be simplified a lot because the position of token 2 is always exactly +1 from position of token 1. In other cases it is not that simple. Additional measures can serve as a kind of variable to capture rules that then need to applied to several other columns.

Excel Comparison

The model clause resembles Excel very much.

  • Dimensions are rows.
  • Measures are columns.
  • Rules are formulas

But why does the model clause feels so more complex that some simple excel cells.

Excel has a similar way to address cells. It is called the R1C1-style reference. It uses number coordinates to find a cell. But most of times we use the A1-Style reference. This is a little more intuitive and much shorter.

But more importantly the process to build formulas (=rules) is different. While building formulas we click on a source cell and its coordinates are used automatically. Excel then calculates the difference from the current cell to the referenced cell.

The main difference to Excel is that is has a separate formula for each cell, and we use just a simple way to fill out the formulas to the other cells. We don’t need that copy mechanism with the model clause, because the rules automatically are applied for all relevant cells. But it is hard with the model clause to give the direct reference for a different row.

Also the different styles and the different ways to use absolute and relative cell references in Excel (e.g. B$2) are more convenient than doing the same in the model clause.

Pro-Tipp: A good way to develop model rules is to build the formulas in excel first, then copy the logic to SQL.

The model clause also has several advantages over excel. Among others are

  • Excel only has 2 dimensions and no decent concept of partitions.
  • Excel columns can not be named (although single cells and cell ranges can).
  • Model allows to return only updated rows.
  • Model has a easy way to work with invalid cell references (IGNORE NAV)

Algorithm explained

Before we look at the statement, let’s look at the bubble sort logic first.

bubble sort logic

Start: 29|1|3004|3|2|24

I start with the first two tokens. 29 and 1. Token1 is always the left and token2 is always the right token.

Rule 1: If token1 is bigger than token2 we need to switch them.
Rule 2: If we are at position 1, we can go one step to the right.
Rule 3: If we don’t need to switch, then we can also go one step to the right.
Rule 4: If we did switch then go one step to the left.

Now lets see how the rules can be applied to the string.

Step 1: 1|29|3004|3|2|24 – Rule 1
Step 2: 1|29|3004|3|2|24 – Rule 2
Step 3: 1|29|3004|3|2|24 – Rule 3
Step 4: 1|29|3|3004|2|24 – Rule 1
Step 5: 1|3|29|3004|2|24 – Rules 4+1
Step 6: 1|3|29|3004|2|24 – Rule 3 (here I used an additional optimisation, by storing a high water mark and jumping as far right as the HWM allows)
Step 7: 1|3|29|2|3004|24 – Rule 1

Stop: We can finish the iteration if the HWM is to the far right and if no switch is needed anymore.

model SQL explained in detail

So how is this logic implemented in the model clause?

A token is found using a regular expression with the position of the token.

regexp_substr(src,'[^\|]+',1,4)

This finds the fourth token.

We start with a way to number our rows. The initial row is defined as 0.

dimension by (0 as dim)

0 is only used because later we address our rows using an iteration_number. If we would start with 1, then we can potentially overwrite our data. Often 0 is a good starting row.

We then define several columns. And the initial value for those columns.

measures (  src
   , 1 ele_pos1
   , 2 ele_pos2
   , cast (regexp_substr(src,'[^\|]+',1,1) as varchar2(100)) token1
   , cast (regexp_substr(src,'[^\|]+',1,2) as varchar2(100)) token2
   , regexp_count(src,'\|')+1 max_element
   , 2 hwm
   , cast (src as varchar2(500)) as res
   )

The datatype and size of the columns is automatically deducted from the initial value. This is why cast is sometimes needed, if the column size should be bigger than what the initial value indicates.

Now we define that an iterative model is to be used.

rules iterate (500) until (stop criteria)

Iterative models are good for implementing procedural logic. Or for tasks when it is unknown beforehand what the area for the calculation should be.

It gives us a variable called iteration_number which can be used as a cell address.

Now lets look at a few rules.

-- switch tokens if left token > right token
res[iteration_number+1] 
    = case when to_number(token1[iteration_number]) > to_number(token2[iteration_number]) then
      -- do the switch
      regexp_replace(
             regexp_replace(res[iteration_number],'[^\|]+',token2[iteration_number],1,ele_pos1[iteration_number])
             ,'[^\|]+',token1[iteration_number],1,ele_pos2[iteration_number])
      else res[iteration_number] -- no switch needed
      end ,

The left hand side of the rule “res[iteration_number+1]” creates a new row for the res column. The right hand side of the rule references data from the previous row “token1[iteration_number]” to do the switch logic.

If data from the same row is needed, it is possible to use “token1[cv()]”. cv() stands for cell value of the current dimension. It is possible to calculate with those values. So instead of token1[iteration_number] we could also write “token1[cv()-1]” to fetch the value from the previous row. For iterative models I find it more convenient to stay consistent and use iteration_number instead.

This rule implements rule 1 (switch tokens) from the bubble sort logic. However it is not a 1:1 matching of rules. We can see this when looking at the next rule.

Instead the logic needs to be implemented for each column. To calculate the next position a similar case construct is needed.

-- calculate next position (for token1)
ele_pos1[iteration_number+1] 
       = case
         when ele_pos1[iteration_number] = 1 then
           -- we reached first position, so go back to hwm
           hwm[iteration_number]
         when to_number(token1[iteration_number]) > to_number(token2[iteration_number]) then
           -- after a switch, move one position the the left and check there
           ele_pos1[iteration_number] - 1
         when ele_pos1[iteration_number] + 1 <= max_element[0]-1 then           -- no more switch, so go to hwm and look for next element            hwm[iteration_number]  
         else 
           -- just in case, don't do anything
           ele_pos1[iteration_number]
         end

Bubble sort rules 2-4 are implemented by a case construct.

As we can see in this example, the rules from the model clause do not match our business rules. Sometimes they do, but more often they do not.

conclusion

The Model clause can open up new ways to do solve SQL puzzles. Bubble sort is a good candidate for an iterative model.

For real world cases the Model clause is hardly maintainable. If we find a solid way to match business rules to model clause rules, then we have a good way to react to future changes of those business rules.

Sometimes model clause can provide a performance advantage, because of they way how the data is handled.

Model clause is a tool in our toolbox, although an extremely sophisticated and complex one. We need to train using this tool on a regular basis.

btw: Some of the constructs from the model clause can be rediscovered in other statements. For example the new analytic views in 12.2 also have a “dimension by” and a “measures” part. But the rules are missing (one can argue that attribute dimensions and hierarchies resemble the rules).