Exotic SQL: Bubble Sort with Model Clause

Introduction

This is a “just for fun” experiment. But I demonstrate some model clause effects on the way.

Task: Sort a pipe separated list using SQL.

For example this string ’29|1|3004|3|2|24′ has 6 elements separated by |. The elements should be sorted in numerical order so that the resulting string looks like this ‘1|2|3|24|29|3004’.

Standard way

The “normal” way would be to tokenize the input string, put the tokens into rows, sort the rows and aggregate the sorted result to reassemble the string.

Here is one possible solution.


with inputdata(src)
     as (select '29|1|3004|3|2|24' from dual)
    ,num_rows (part, remains)
     as (select to_number(substr(i.src,1,coalesce(nullif(instr(i.src,'|'),0)-1,length(i.src)))) as part
               ,substr(src,instr(i.src,'|')+1) as remains
         from inputdata i
         UNION ALL
         select to_number(substr(r.remains,1,coalesce(nullif(instr(r.remains,'|'),0)-1,length(r.remains)))) as part
               ,substr(remains, nullif(instr(r.remains,'|'),0)+1) as remains
         from num_rows r
         where r.remains is not null
         )
    , combine (resultdata)
as (select listagg(to_char(part,'FM9999999999'),'|') within group (order by part) from num_rows)
select * from combine;

The num_rows subquery in this example is a recursive with clause that splits the input string into rows, one row for each token. The logic to split the string into tokens is based around substr . Essentially it just cuts of an element from the beginning of the string and keeps the remainder until there are no elements left.
The combine subquery then uses listagg to sort and convert the rows into a string again.

Standard SQL technologies that each decent developer should know about.

There are multiple ways how to tokenize,  how to create the rows and how to aggregate the rows again. But the general principle is still the same.

Here is a second example solution. This time using connect by to create rows and some regular expressions to count and split.

with inputdata(src) as (select '29|1|3004|3|2|24' from dual )
select listagg(token,'|') within group (order by to_number(token)) as res
from (
  select src, regexp_substr(src,'[^|]+', 1, level) token
  from inputdata
  connect by level <= regexp_count(src,'[|*]') + 1
  )
group by src;

Exotic way

I was wondering if we can use a totally different approach.

Instead of letting the database do the hard work and implement a sort mechanism, we can do it ourselves. Here comes bubbles sort. It is one of the most inefficient sort mechanisms you can think of (unless the list is already sorted), but it is fairly easy to implement.

Since bubble sort is an iterative approach with a simple set of rules, the model clause immediately jumped to my mind.

So here is a solution based around a slightly optimised bubble sort mechanism (gnome sort). We will take it apart afterwards.
The solution here creates several rows, but this is just to see and check each step of the iteration. It is possible to do this using one row (dim=0) only, although it is much harder to develop and to understand.

with inputdata(src) as (select '29|1|3004|3|2|24' from dual)
select *
from inputdata
model dimension by (0 as dim)
measures (  src
          , 1 ele_pos1
          , 2 ele_pos2
          , cast (regexp_substr(src,'[^\|]+',1,1) as varchar2(100)) token1
          , cast (regexp_substr(src,'[^\|]+',1,2) as varchar2(100)) token2
          , regexp_count(src,'\|')+1 max_element
          , 2 hwm
          , cast (src as varchar2(500)) as res
          )
rules iterate (500)
      until (-- we can stop if the high water mark is at the last position and no switch needed
             hwm[iteration_number] >= max_element[0]
             and (ele_pos1[iteration_number] = 1 or to_number(token1[iteration_number])  right token
  res[iteration_number+1]
    = case when to_number(token1[iteration_number]) > to_number(token2[iteration_number]) then
      -- do the switch
      regexp_replace(
             regexp_replace(res[iteration_number],'[^\|]+',token2[iteration_number],1,ele_pos1[iteration_number])
                            ,'[^\|]+',token1[iteration_number],1,ele_pos2[iteration_number])
      else res[iteration_number] -- no switch needed
      end ,
  -- calculate next position for token1
  ele_pos1[iteration_number+1]
  = case
      when ele_pos1[iteration_number] = 1 then
        -- we reached first position, so go back to hwm
        hwm[iteration_number]
      when to_number(token1[iteration_number]) > to_number(token2[iteration_number]) then
        -- after a switch, move one position the the left and check there
        ele_pos1[iteration_number] - 1
      when ele_pos1[iteration_number] + 1  to_number(token2[iteration_number]) then
        ele_pos2[iteration_number] - 1
      when ele_pos2[iteration_number] + 1 <= max_element[0] then
        hwm[iteration_number] + 1
      else
        ele_pos2[iteration_number]
      end,
  -- calculate high water mark
  hwm[iteration_number+1]
  = greatest(hwm[iteration_number],ele_pos2[iteration_number+1]),
  -- get token1 for new calculated position1
  token1[iteration_number+1]
  = regexp_substr(res[iteration_number+1],'[^\|]+',1,ele_pos1[iteration_number+1]),
  -- get token2 for new calculated position2
  token2[iteration_number+1]
  = regexp_substr(res[iteration_number+1],'[^\|]+',1,ele_pos2[iteration_number+1])
 );

Here is the result, but I show only the important column. Feel free to run the statement yourself (works on all supported database versions) and see the other helper columns.

Result (only Res column)
29|1|3004|3|2|24
1|29|3004|3|2|24
1|29|3004|3|2|24
1|29|3|3004|2|24
1|3|29|3004|2|24
1|3|29|3004|2|24
1|3|29|2|3004|24
1|3|2|29|3004|24
1|2|3|29|3004|24
1|2|3|29|3004|24
1|2|3|29|24|3004
1|2|3|24|29|3004
1|2|3|24|29|3004

Fairly easy SQL isn’t it?

65 rows instead of 8. So if your aim is to write obfuscated SQL then this certainly is the way to go.

If you want to understand it, skip to the “algorithm explained” chapter.

Rant

Now what is wrong with the model clause!? A simple algorithm like bubble sort looks this complex?

In the recent months I have written a number of model clause solutions for different problems. However none of them is easy. Apart from very specific cases all model solutions are terrible to look at and terrible complex to build. In comparison it is much easier to do the same in excel.

I attribute this to several things.

  1. The way to reference a cell is verbose. And when we look at the code then the brackets break the eye scanning mechanism of a trained developer eye. If more than one dimension is involved it is even worse.
    Here are some examples:
    measure1[cv(),3] or  resultdata[iteration_number+1]
    It is hard to see this as one cell reference.
    In Excel this would simply be: B$3 or C2.
  2. The way to reference a cell from the previous row is complex. It requires a working calculation for the cell address. We can do this by applying the row_number() analytic function or using iteration_number (for iterative models). But this requires some extra logic.
  3. There is no simple way to set several different measures at once. A rule can only be applied for one measure. This means that sometimes we need to repeat the same logic for a different measure.
    In the bubble sort example this happened for the calculation for the position of token 1 and for position of token 2. In this case it can be simplified a lot because the position of token 2 is always exactly +1 from position of token 1. In other cases it is not that simple. Additional measures can serve as a kind of variables to capture rules that then need to applied to several other columns.

Excel Comparison

The model clause resembles Excel very much.

  • Dimensions are rows.
  • Measures are columns.
  • Rules are formulas

But why does the model clause feels so more complex that some simple excel cells.

Excel has a similar way to address cells. It is called the R1C1-style reference. It uses number coordinates to find a cell. But most of times we use the A1-Style reference. This is a little more intuitive and much shorter.

But more importantly the process to build formulas (=rules) is different. While building formulas we click on a source cell and its coordinates are used automatically. Excel then calculates the difference from the current cell to the referenced cell.

The main difference to Excel is that is has a separate formula for each cell, and we use just a simple way to fill out the formulas to the other cells. We don’t need that copy mechanism with the model clause, because the rules automatically are applied for all relevant cells. But it is hard with the model clause to give the direct reference for a different row.

Also the different styles and the different ways to use absolute and relative cell references in Excel (e.g. B$2) are more convenient than doing the same in the model clause.

Pro-Tipp: A good way to develop model rules is to build the formulas in excel first, then copy the logic to SQL.

The model clause also has several advantages over excel. Among others are

  • Excel only has 2 dimensions and no decent concept of partitions.
  • Excel columns can not be named (although single cells and cell ranges can).
  • Model allows to return only updated rows.
  • Model has a easy way to work with invalid cell references (IGNORE NAV)

Algorithm explained

Before we look at the statement, let’s look at the bubble sort logic first.

bubble sort logic

Start: 29|1|3004|3|2|24

I start with the first two tokens. 29 and 1. Token1 is always the left and token2 is always the right token.

Rule 1: If token1 is bigger than token2 we need to switch them.
Rule 2: If we are at position 1, we can go one step to the right.
Rule 3: If we don’t need to switch, then we can also go one step to the right.
Rule 4: If we did switch then go one step to the left.

Now lets see how the rules can be applied to the string.

Step 1: 1|29|3004|3|2|24 – Rule 1
Step 2: 1|29|3004|3|2|24 – Rule 2
Step 3: 1|29|3004|3|2|24 – Rule 3
Step 4: 1|29|3|3004|2|24 – Rule 1
Step 5: 1|3|29|3004|2|24 – Rules 4+1
Step 6: 1|3|29|3004|2|24 – Rule 3 (here I used an additional optimisation, by storing a high water mark and jumping as far right as the HWM allows)
Step 7: 1|3|29|2|3004|24 – Rule 1

Stop: We can finish the iteration if the HWM is to the far right and if no switch is needed anymore.

model SQL explained in detail

So how is this logic implemented in the model clause?

A token is found using a regular expression with the position of the token.

regexp_substr(src,'[^\|]+',1,4)

This finds the fourth token.

We start with a way to number our rows. The initial row is defined as 0.

dimension by (0 as dim)

0 is only used because later we address our rows using an iteration_number. If we would start with 1, then we can potentially overwrite our data. Often 0 is a good starting row.

We then define several columns. And the initial value for those columns.

measures (  src
   , 1 ele_pos1
   , 2 ele_pos2
   , cast (regexp_substr(src,'[^\|]+',1,1) as varchar2(100)) token1
   , cast (regexp_substr(src,'[^\|]+',1,2) as varchar2(100)) token2
   , regexp_count(src,'\|')+1 max_element
   , 2 hwm
   , cast (src as varchar2(500)) as res
   )

The datatype and size of the columns is automatically deducted from the initial value. This is why cast is sometimes needed, if the column size should be bigger than what the initial value indicates.

Now we define that an iterative model is to be used.

rules iterate (500) until (stop criteria)

Iterative models are good for implementing procedural logic. Or for tasks when it is unknown beforehand what the area for the calculation should be.

It gives us a variable called iteration_number which can be used as a cell address.

Now lets look at a few rules.

-- switch tokens if left token > right token
res[iteration_number+1] 
    = case when to_number(token1[iteration_number]) > to_number(token2[iteration_number]) then
      -- do the switch
      regexp_replace(
             regexp_replace(res[iteration_number],'[^\|]+',token2[iteration_number],1,ele_pos1[iteration_number])
             ,'[^\|]+',token1[iteration_number],1,ele_pos2[iteration_number])
      else res[iteration_number] -- no switch needed
      end ,

The left hand side of the rule “res[iteration_number+1]” creates a new row for the res column. The right hand side of the rule references data from the previous row “token1[iteration_number]” to do the switch logic.

If data from the same row is needed, it is possible to use “token1[cv()]”. cv() stands for cell value of the current dimension. It is possible to calculate with those values. So instead of token1[iteration_number] we could also write “token1[cv()-1]” to fetch the value from the previous row. For iterative models I find it more convenient to stay consistent and use iteration_number instead.

This rule implements rule 1 (switch tokens) from the bubble sort logic. However it is not a 1:1 matching of rules. We can see this when looking at the next rule.

Instead the logic needs to be implemented for each column. To calculate the next position a similar case construct is needed.

-- calculate next position (for token1)
ele_pos1[iteration_number+1] 
       = case
         when ele_pos1[iteration_number] = 1 then
           -- we reached first position, so go back to hwm
           hwm[iteration_number]
         when to_number(token1[iteration_number]) > to_number(token2[iteration_number]) then
           -- after a switch, move one position the the left and check there
           ele_pos1[iteration_number] - 1
         when ele_pos1[iteration_number] + 1 <= max_element[0]-1 then           -- no more switch, so go to hwm and look for next element            hwm[iteration_number]  
         else 
           -- just in case, don't do anything
           ele_pos1[iteration_number]
         end

Bubble sort rules 2-4 are implemented by a case construct.

As we can see in this example, the rules from the model clause do not match our business rules. Sometimes they do, but more often they do not.

conclusion

The Model clause can open up new ways to do solve SQL puzzles. Bubble sort is a good candidate for an iterative model.

For real world cases the Model clause is hardly maintainable. If we find a solid way to match business rules to model clause rules, then we have a good way to react to future changes of those business rules.

Sometimes model clause can provide a performance advantage, because of they way how the data is handled.

Model clause is a tool in our toolbox, although an extremely sophisticated and complex one. We need to train using this tool on a regular basis.

btw: Some of the constructs from the model clause can be rediscovered in other statements. For example the new analytic views in 12.2 also have a “dimension by” and a “measures” part. But the rules are missing (one can argue that attribute dimensions and hierarchies resemble the rules).