Finding the Best Match With a Top-N Query

database, mysql, oracle, postgresql, sql

Finding the Best Match With a Top-N Query

In Performance on 2010-09-29 at 11:16

There was an interesting index related performance problem on Stack Overflow recently. The problem was to check an input string against a table that holds about 2000 prefix patterns (e.g., LIKE 'xyz%'). A fast select is needed that returns one row if any pattern matches the input string, or no row otherwise.

I believe my solution is worth a few extra words to explain it in more detail. Even though it’s a perfect fit for Use The Index, Luke it’s a little early to put it as an exercise there. It is, however, a very good complement to my previous article Analytic Top-N queries—so I put it here.

Although the problem was raised for a MySQL database, my solution applies to all databases that can properly optimize Top-N queries.

The original SQL statement in the question was like that:

select 1
  from T1
 where 'fxg87698x84' like concat (C1, '%')

T1.C1 is the column that holds the prefix patterns—one per row. Although a prefixed LIKE filter can use an index range scan, the problem is that it is the wrong way around: it’s not searching for a string that matches the pattern, it’s searching for a pattern that matches the string.

The query, as written, must check all the patterns against the string. E.g., by a full table scan or (fast) full index scan. However, it’s always a full scan. Can that be improved?

Let’s start step-by-step. The simplest case is that the exact input string is a pattern in the table. A SQL statement to check for the exact pattern is very simple:

select C1
  from T1
 where C1 = 'fxg87698x84'

The next case that the exact pattern doesn’t exist in the table, but a prefix pattern, that matches the input string, exists. That pattern must be shorter than the input string—otherwise it cannot match. Because we aim to solve the problem with an index, let’s imagine the patterns, as they would be stored in an index:

axt3
fxg
      <- place where 'fxg87698x84' would be
tru56

If the exact pattern doesn’t exist, the preceding index entry is the best possible match (precondition: no overlapping patterns exist). That’s because shorter strings are considered “smaller” when sorted. So, let’s extend the select to find the preceding record if the exact pattern is not in the table:

select C1
  from T1
 where C1 <= 'fxg87698x84'
 order by c1 desc
 limit 1

The less than or equals condition will match the exact pattern, if it exists, and all that precede it. The reverse ORDER BY clause makes sure that the index is traversed upwards. In conjunction with the where clause, it means that the tree traversal is done to find the input string, and the leaf node scan continues upwards from there. The LIMIT 1 clause is the MySQL way to make a Top-N query so that the leaf node scan aborts after the first record. Voilà, this statement will return the best candidate pattern (or none at all) by performing a very small index range scan.

The final case we need to take care of is that no pattern matches the input string. There are two sub-variants that can happen: (a) a potentially matching pattern would be the very first entry in the index. In that case the Top-N query will not return any row and we are done; (b) the Top-N query returns a pattern that is not a prefix for the input string. That can be handled by wrapping the Top-N query to filter the result through the original LIKE expression:

select 1
  from (
        select C1
          from T1
         where C1 <= 'fxg87698x84'
         order by C1 desc
         limit 1
       ) tmp
 where 'fxg87698x84' like concat (C1, '%')

Done.

Simple? With a good understanding of index fundamentals, it is simple! That’s why I am writing a Web-Book about indexing basics: Use The Index, Luke!. Funny enough, the basics are the same for all databases—we all put our pants on one leg at a time.

Closing Note

The precondition for all that is that there are no overlapping patterns in the table. E.g., the statement doesn’t work with the following patterns:

axt3
fxg
fxg1
      <- place where 'fxg87698x84' would be
tru56

In that case, the closest entry doesn’t match although there is a matching entry. However, the FXG entry matches everything that FXG1 can possibly match—the two patterns are overlapping.

Second Closing Note

The original problem posted on Stack Overflow mentioned that this lookup must be performed 1 million times—within half an hour. The author did not mention if that target was reached, nor if the process is single-threaded.

However, considering the overall problem, the most computing resource efficient solution would probably be to sort both sets—the patterns and the input strings—and implement a manual merge. But that’s probably much more effort to implement. The index solution is very efficient on human resources. Whatever is the best solution for the business is up to the company to decide.

▶ No Responses

Markus Winand's Blog

Software Quality is Quality of Life

Finding the Best Match With a Top-N Query

Closing Note

Second Closing Note

The Book I’ve Written

About Me

Top Posts

Categories