Markus Winand's Blog

Analytic Top-N Queries

In Maintainability, Performance on 2010-07-30 at 10:55

One of the more advanced tricks I like to exploit is the analytic Top-N query. Although I have been using them for quite a while, I recently discovered a “limitation” that I was not aware of. Actually, to be honest, it is not a limitation; it is a missing optimization in a rarely used feature that can easily be worked around. I must admit that I am asking for quite a lot in this case.

The article starts with a general introduction to Top-N queries, applies that technique to analytic queries, and explains the case where I miss an optimization. But is it really worth all that effort? The article concludes with my answer to that question.

Please find the CREATE and INSERT statements at the end of the article.

Top-N Queries

Top-N queries are queries for the first N rows according to a specific sort order; e.g., the first three rows, like this:

select * from (
  select start_key, group_key, junk
    from demo
   where start_key = 'St'
   order by group_key
)
 where rownum <= 3;

That’s well known and very straightforward. However, the interesting part is performance, as usual. A naïve implementation executes the inner SQL first, that is, it fetches and sorts all matching records, and only then limits the result set to the first three rows. In the absence of a useful index, that is exactly what happens:

START_KEY  GROUP_KEY JUNK
---------- --------- ----------
St                 1 junk
St                 3 junk
St                10 junk

3 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 142682949

--------------------------------------------------------------
| Id | Operation               | Name | Rows  | Bytes | Cost |
--------------------------------------------------------------
|  0 | SELECT STATEMENT        |      |     3 |  1032 | 8240 |
|* 1 |  COUNT STOPKEY          |      |       |       |      |
|  2 |   VIEW                  |      |   370 |   124K| 8240 |
|* 3 |    SORT ORDER BY STOPKEY|      |   370 | 76960 | 8240 |
|* 4 |     TABLE ACCESS FULL   | DEMO |   370 | 76960 | 8239 |
--------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=3)
   3 - filter(ROWNUM<=3)
   4 - filter("START_KEY"='St')


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
      30370  consistent gets
      30365  physical reads
          0  redo size
        998  bytes sent via SQL*Net to client
        419  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
          3  rows processed

A full table scan is performed (more on that in a few seconds) to retrieve all the rows that match the where clause; about 370 according to the optimizer’s estimate. The next step sorts the entire result set. Finally, the limit is applied in the COUNT STOPKEY step and the number of rows is reduced to three.

The performance problem of this query is obviously the full table scan. Let’s create an index to make it go away:

create index demo_idx on demo (start_key);
exec dbms_stats.gather_index_stats(null, 'DEMO_IDX');

That’s much better:

START_KEY  GROUP_KEY JUNK
---------- --------- ----------
St                 1 junk
St                 3 junk
St                10 junk

3 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 1129354520

------------------------------------------------------------------------
| Id | Operation                      | Name     | Rows | Bytes | Cost |
------------------------------------------------------------------------
|  0 | SELECT STATEMENT               |          |    3 |  1032 |  372 |
|* 1 |  COUNT STOPKEY                 |          |      |       |      |
|  2 |   VIEW                         |          |  370 |   124K|  372 |
|* 3 |    SORT ORDER BY STOPKEY       |          |  370 | 76960 |  372 |
|  4 |     TABLE ACCESS BY INDEX ROWID| DEMO     |  370 | 76960 |  371 |
|* 5 |      INDEX RANGE SCAN          | DEMO_IDX |  370 |       |    3 |
------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=3)
   3 - filter(ROWNUM<=3)
   5 - access("START_KEY"='St')


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        360  consistent gets
        201  physical reads
          0  redo size
        998  bytes sent via SQL*Net to client
        419  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
          3  rows processed

You can see that the full table scan was replaced by an index lookup and the corresponding table access. The other steps remain unchanged.

However, this is still a bad execution plan because all matching records are fetched and sorted just to throw most of them away. The following index allows a much better execution plan:

drop index demo_idx;
create index demo_idx on demo (start_key, group_key);
exec dbms_stats.gather_index_stats(null, 'DEMO_IDX');

The new execution plan (the query now selects id, start_key, and group_key) looks like this:

        ID START_KEY  GROUP_KEY
---------- ---------- ---------
    936196 St                 1
    232303 St                 3
    759212 St                10

3 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 1891928015

-----------------------------------------------------------------------
| Id | Operation                     | Name     | Rows | Bytes | Cost |
-----------------------------------------------------------------------
|  0 | SELECT STATEMENT              |          |    3 |   465 |    7 |
|* 1 |  COUNT STOPKEY                |          |      |       |      |
|  2 |   VIEW                        |          |    3 |   465 |    7 |
|  3 |    TABLE ACCESS BY INDEX ROWID| DEMO     |    3 |    36 |    7 |
|* 4 |     INDEX RANGE SCAN          | DEMO_IDX |  370 |       |    3 |
-----------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=3)
   4 - access("START_KEY"='St')


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          7  consistent gets
          0  physical reads
          0  redo size
        609  bytes sent via SQL*Net to client
        419  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          3  rows processed

Well, that is efficient. The sort operation has vanished entirely because the index definition supports the ORDER BY clause. Even more powerful: the STOPKEY takes effect all the way down to the index range scan. You can see the reduced number of table accesses in the plan. Although not visible in the execution plan, the index range scan itself is also aborted after fetching the first three records.

That optimization has been in the Oracle database for quite a while, at least since 8i, I guess. After that preparation, I can demonstrate what 10gR2 has to offer on top of it.

Analytic Top-N Queries

It is actually the very same story with a small extension: I don’t want to retrieve the first N rows, but all rows where the group_key value is at its minimum for the respective start_key. A very straightforward solution is this:

select id, start_key, group_key
  from demo
 where start_key = 'St'
   and group_key = (select min(group_key)
                      from demo
                     where start_key = 'St'
                   );

That statement is perfectly legal, even performance-wise:

        ID START_KEY  GROUP_KEY
---------- ---------- ---------
    936196 St                 1

1 row selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 1142136980

------------------------------------------------------------------------
| Id | Operation                      | Name     | Rows | Bytes | Cost |
------------------------------------------------------------------------
|  0 | SELECT STATEMENT               |          |    1 |    12 |    8 |
|  1 |  TABLE ACCESS BY INDEX ROWID   | DEMO     |    1 |    12 |    5 |
|* 2 |   INDEX RANGE SCAN             | DEMO_IDX |    1 |       |    3 |
|  3 |    SORT AGGREGATE              |          |    1 |     7 |      |
|  4 |     FIRST ROW                  |          |    1 |     7 |    3 |
|* 5 |      INDEX RANGE SCAN (MIN/MAX)| DEMO_IDX |    1 |     7 |    3 |
------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("START_KEY"='St' 
               AND "GROUP_KEY"= (SELECT MIN("GROUP_KEY") FROM
              "DEMO" "DEMO" WHERE "START_KEY"='St'))
   5 - access("START_KEY"='St')


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
          8  consistent gets
          0  physical reads
          0  redo size
        550  bytes sent via SQL*Net to client
        419  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

Analytic functions

Analytic functions can perform calculations on the basis of multiple rows. They are, however, not to be confused with aggregate functions: analytic functions work without GROUP BY. A very typical use for analytic functions is a running balance, that is, the sum over all rows up to and including the current row.
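For illustration only, a running balance over a hypothetical transactions table (the table and column names are made up for this sketch) could look like this:

select booking_date, amount,
       sum(amount) OVER (order by booking_date
                          rows between unbounded preceding
                                   and current row) running_balance
  from transactions;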

The function used in the example (dense_rank) returns the rank of the current row according to the supplied OVER (ORDER BY) clause, which is, in turn, not to be confused with a regular ORDER BY.
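The “dense” part means the ranking has no gaps after ties. A quick illustration against the demo table (the concrete values are made up): for group_key values 1, 3, 3, 10, rank() yields 1, 2, 2, 4 while dense_rank() yields 1, 2, 2, 3.

select group_key,
       rank()       OVER (order by group_key) rnk,
       dense_rank() OVER (order by group_key) dense_rnk
  from demo
 where start_key = 'St';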

orafaq.com has a nice intro to Oracle analytic functions.

The first step is to fetch the smallest group_key. Because of the min/max optimization in combination with a well-suited index, the database doesn’t need to sort the data; it just picks the first record from the index, which must be the smallest anyway. The second step is a regular index lookup for the start_key and the group_key that was just retrieved by the sub-query.

Another possible implementation for that is to use an analytic function:

select * from (
  select id, start_key, group_key, 
         dense_rank() OVER (order by group_key) rnk
    from demo
   where start_key = 'St'
)
 where rnk <= 1;

Do you recognize the pattern? It is very similar to the traditional Top-N query described at the beginning of this article. Instead of limiting on the rownum pseudocolumn, we use an analytic function. The execution plan reveals the performance characteristics of that statement:

        ID START_KEY  GROUP_KEY        RNK
---------- ---------- --------- ----------
    936196 St                 1          1

1 row selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 3221234897

-----------------------------------------------------------------------
| Id | Operation                     | Name     | Rows | Bytes | Cost |
-----------------------------------------------------------------------
|  0 | SELECT STATEMENT              |          |  370 | 62160 |  374 |
|* 1 |  VIEW                         |          |  370 | 62160 |  374 |
|* 2 |   WINDOW NOSORT STOPKEY       |          |  370 |  4440 |  374 |
|  3 |    TABLE ACCESS BY INDEX ROWID| DEMO     |  370 |  4440 |  373 |
|* 4 |     INDEX RANGE SCAN          | DEMO_IDX |  370 |       |    3 |
-----------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("RNK"<=1)
   2 - filter(DENSE_RANK() OVER ( ORDER BY "GROUP_KEY")<=1)
   4 - access("START_KEY"='St')


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          6  consistent gets
          0  physical reads
          0  redo size
        610  bytes sent via SQL*Net to client
        419  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

What was the COUNT STOPKEY operation in the traditional Top-N query has become the WINDOW NOSORT STOPKEY operation for the analytic function. However, the expected number of rows is not known to the optimizer; any number of rows could have the lowest group_key value. Still, the index range scan is aborted once the required rows have been fetched. On the one hand, the consistent gets are even better than with the sub-query statement. On the other hand, the cost value is higher. Whenever you use analytic functions, run a benchmark to know the actual performance.
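The statistics blocks in this article look like SQL*Plus autotrace output. Assuming SQL*Plus and the PLUSTRACE role, such a benchmark session can be prepared like this before running the candidate statements:

set timing on
set autotrace traceonly explain statistics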

Let’s give this optimization some thought. The database knows that the index order corresponds to the OVER (ORDER BY) clause and avoids the sort operation. But even more impressive is that it can abort the range scan as soon as the first value that doesn’t match the rnk <= 1 expression is fetched. That is only possible because the dense_rank() function cannot decrease when the rows are fetched in the order of the OVER (ORDER BY) clause; the ranks form a non-decreasing sequence such as 1, 1, 2, 3, so once a rank exceeds one, no later row can pass the filter. That’s impressive, isn’t it?

Mass Top-N Queries

The next step towards the issue that made me write this article is a mass Top-N query. With the previous statement as a basis, it is actually quite simple: just remove the inner where clause to get the result for all start_key values, and add a partition clause to make sure the rank is built individually for each start_key:

select * from (
  select start_key, group_key, junk,
         dense_rank() OVER (partition by start_key
                                order by group_key) rnk
    from demo
)
 where rnk <= 1;

Declaring the partition is required to make sure that start_keys whose lowest group_key is not one still show up, with their lowest group_key value. Without the partition clause, the rank would be computed across all start_keys; a start_key whose smallest group_key is, say, 5 could then never reach rank one.

With that query, we have reached the end of the optimizer’s smartness, as of release 11gR2. At first sight, the plan is not surprising:

3260 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 1766530486

--------------------------------------------------------------
| Id | Operation                | Name | Rows | Bytes | Cost |
--------------------------------------------------------------
|  0 | SELECT STATEMENT         |      | 1000K|   340M| 8239 |
|* 1 |  VIEW                    |      | 1000K|   340M| 8239 |
|* 2 |   WINDOW SORT PUSHED RANK|      | 1000K|   198M| 8239 |
|  3 |    TABLE ACCESS FULL     | DEMO | 1000K|   198M| 8239 |
--------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("RNK"<=1)
   2 - filter(DENSE_RANK() OVER ( PARTITION BY "START_KEY" ORDER BY
              "GROUP_KEY")<=1)


Statistics
----------------------------------------------------------
         22  recursive calls
         20  db block gets
      30370  consistent gets
      33163  physical reads
          0  redo size
      59215  bytes sent via SQL*Net to client
       2806  bytes received via SQL*Net from client
        219  SQL*Net roundtrips to/from client
          0  sorts (memory)
          1  sorts (disk)
       3260  rows processed

It’s a full table scan. However, a “mass” query performs a full table scan for good reason, so that did not catch my attention. What did catch my attention is the following:

select * from (
  select * from (
    select start_key, group_key, junk,
           dense_rank() OVER (partition by start_key
                                  order by group_key) rnk
      from demo
  ) where rnk <= 1
) where start_key = 'St';

It is actually the individual Top-N query again. This time it is built on the basis of the mass Top-N query, which was set up as a view. That way, a single database view can be used for any mass query as well as for individual Top-N queries; that’s a maintainability benefit. If the advanced magic that aborts the index range scan were still working, it would be extremely efficient as well. The execution plan, shown after the following sketch, proves the opposite:
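A sketch of such a view (the view name demo_ranked is my invention, not part of the original setup):

create or replace view demo_ranked as
select start_key, group_key, junk,
       dense_rank() OVER (partition by start_key
                              order by group_key) rnk
  from demo;

Both use cases then reduce to select * from demo_ranked where rnk <= 1, optionally with an additional start_key filter.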

START_KEY  GROUP_KEY JUNK              RNK
---------- --------- ---------- ----------
St                 1 junk                1

1 row selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 1309566133

-----------------------------------------------------------------------
| Id | Operation                     | Name     | Rows | Bytes | Cost |
-----------------------------------------------------------------------
|  0 | SELECT STATEMENT              |          |  370 |   128K|  373 |
|* 1 |  VIEW                         |          |  370 |   128K|  373 |
|* 2 |   WINDOW NOSORT               |          |  370 | 76960 |  373 |
|  3 |    TABLE ACCESS BY INDEX ROWID| DEMO     |  370 | 76960 |  373 |
|* 4 |     INDEX RANGE SCAN          | DEMO_IDX |  370 |       |    3 |
-----------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("RNK"<=1)
   2 - filter(DENSE_RANK() OVER ( PARTITION BY "START_KEY" ORDER BY
              "GROUP_KEY")<=1)
   4 - access("START_KEY"='St')


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        363  consistent gets
          0  physical reads
          0  redo size
        808  bytes sent via SQL*Net to client
        419  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

Although no sort is required, the STOPKEY has disappeared from the WINDOW NOSORT operation. That means the full index range scan is performed, for all 359 rows where start_key='St'. On top of that, the number of consistent gets is rather high. A closer look at the execution plan reveals that the entire row is fetched from the table before the filter on the analytic expression is applied. The junk column that is fetched from the table is not required to evaluate this predicate; it would be possible to fetch that column only for those rows that pass the filter.
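That deferral can be expressed manually. The following sketch is my own workaround, not from the original article, and whether the optimizer really avoids the early table access this way needs to be verified with a benchmark: it ranks only the indexed columns plus the rowid, and joins back to the table afterwards.

select d.start_key, d.group_key, d.junk, t.rnk
  from (
        select rowid rid,
               dense_rank() OVER (partition by start_key
                                      order by group_key) rnk
          from demo
       ) t, demo d
 where t.rnk <= 1
   and d.rowid = t.rid;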

The “premature table access” is the reason why the full table scan is more efficient for the mass query than an index full scan. Have a look at the execution plan of the mass query when it is hinted to use a full index scan. The hint itself is not shown in the original statement; presumably it is an INDEX hint along these lines (my assumption):
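select * from (
  select /*+ index(demo demo_idx) */
         start_key, group_key, junk,
         dense_rank() OVER (partition by start_key
                                order by group_key) rnk
    from demo
)
 where rnk <= 1;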

3260 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 1402975529

------------------------------------------------------------------------
| Id | Operation                     | Name     | Rows | Bytes |  Cost |
------------------------------------------------------------------------
|  0 | SELECT STATEMENT              |          | 1000K|   340M| 1002K |
|* 1 |  VIEW                         |          | 1000K|   340M| 1002K |
|* 2 |   WINDOW NOSORT               |          | 1000K|   198M| 1002K |
|  3 |    TABLE ACCESS BY INDEX ROWID| DEMO     | 1000K|   198M| 1002K |
|  4 |     INDEX FULL SCAN           | DEMO_IDX | 1000K|       | 2504  |
------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("RNK"<=1)
   2 - filter(DENSE_RANK() OVER ( PARTITION BY "START_KEY" ORDER BY
              "GROUP_KEY")<=1)


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
    1002692  consistent gets
     817172  physical reads
          0  redo size
      59215  bytes sent via SQL*Net to client
       2806  bytes received via SQL*Net from client
        219  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       3260  rows processed

The expensive step in this execution plan is the table access. If the table access were moved up to take place after the window filter, the cost of this step would be about 3260 (one access for each row that passes the filter). Together with the cost of 2504 for the index full scan, the total cost of the plan would probably stay below 7000; that is, lower than the cost of the full table scan plan.

Conclusion

Just to re-emphasize the motivation behind the view that can serve both needs: it is about multidimensional optimization, and that has nothing to do with OLAP!

Typically, performance optimization takes only one dimension into account: performance. So far, so good, but what about long-term maintenance? Very often, performance optimization reduces the maintainability of the software. That’s not a coincidence; it’s because maintainability is the only degree of freedom during optimization. Unfortunately, reduced maintainability is very hard to notice. If it is noticed at all, it is probably years later.

I have been in both worlds for some years—operations and development—and try to optimize for all dimensions whenever possible because all of them are important for the business.

Create and Insert Statements

To try it yourself:

create table demo (
       id          number        not null,
       start_key   varchar2(255) not null,
       group_key   number        not null,
       junk        char(200),
       primary key (id)
);

insert into demo (
       select level, 
              dbms_random.string('A', 2) start_key, 
              trunc(dbms_random.value(0,1000)),
              'junk'
         from dual
      connect by level <= 1000000
);

commit;

exec DBMS_STATS.GATHER_TABLE_STATS(null, 'DEMO');

My tests were conducted on 11gR2.

