Cheating at the Countdown show with Postgres
filed under: postgres

This Friday, a member of the local programmers' group on Facebook asked how others would approach the problem of finding all valid words from a word list that can be composed from a given list of letters, where the list may contain repeated letters. So if you have the letters "ABOTOG", you can assemble the words TAG, TAB, BOOT, BOAT, etc. This is essentially the same puzzle as seen in the Countdown TV game show.

The original poster had already solved the problem for an online game project, but was curious how others would approach it. It sparked a bit of interest in the office, and after some debate I set off to see how this can be done in Postgres.

Setup

I found a word list of 370,000 English words to test how the queries perform (words_alpha.txt from a GitHub repo). This is how I created the table and imported the data:

 create table wordlist (word text);
 \copy wordlist from 'words_alpha.txt';
 select count(*) from wordlist;
  count
---------
  370099
 (1 row)

Original author's SQL query

The author's original query was written for MySQL; the Postgres version of it looks like this:

SELECT word FROM wordlist WHERE
char_length(word)-char_length(replace(upper(word),'A','')) <= 0
AND char_length(word)-char_length(replace(upper(word),'B','')) <= 0
AND char_length(word)-char_length(replace(upper(word),'C','')) <= 0
AND char_length(word)-char_length(replace(upper(word),'D','')) <= 0
AND char_length(word)-char_length(replace(upper(word),'E','')) <= 2
AND char_length(word)-char_length(replace(upper(word),'F','')) <= 0
AND char_length(word)-char_length(replace(upper(word),'G','')) <= 1
AND char_length(word)-char_length(replace(upper(word),'H','')) <= 0
AND char_length(word)-char_length(replace(upper(word),'I','')) <= 1
AND char_length(word)-char_length(replace(upper(word),'J','')) <= 1
AND char_length(word)-char_length(replace(upper(word),'K','')) <= 0
AND char_length(word)-char_length(replace(upper(word),'L','')) <= 0
AND char_length(word)-char_length(replace(upper(word),'M','')) <= 1
AND char_length(word)-char_length(replace(upper(word),'N','')) <= 1
AND char_length(word)-char_length(replace(upper(word),'O','')) <= 0
AND char_length(word)-char_length(replace(upper(word),'P','')) <= 0
AND char_length(word)-char_length(replace(upper(word),'R','')) <= 0
AND char_length(word)-char_length(replace(upper(word),'S','')) <= 0
AND char_length(word)-char_length(replace(upper(word),'T','')) <= 0
AND char_length(word)-char_length(replace(upper(word),'U','')) <= 1
AND char_length(word)-char_length(replace(upper(word),'V','')) <= 0
AND char_length(word)-char_length(replace(upper(word),'Z','')) <= 0
AND char_length(word)-char_length(replace(upper(word),'X','')) <= 0
AND char_length(word)-char_length(replace(upper(word),'Y','')) <= 0
AND char_length(word)-char_length(replace(upper(word),'Q','')) <= 0
AND char_length(word)-char_length(replace(upper(word),'W','')) <= 0;

This query counts each letter's occurrences in a word (the word's length minus its length with that letter removed) and caps each count at the number of times the letter appears in the available letters, here "EJIGUNEM". It is relatively easy to understand, but somewhat resource intensive. This is the query plan Postgres chose:

                                                QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.wordlist  (cost=0.00..149953.60 rows=1 width=10) (actual time=113.228..755.601 rows=120 loops=1)
   Output: word
   Filter: (((char_length(wordlist.word) - char_length(replace(upper(wordlist.word), 'A'::text, ''::text))) <= 0) AND ((char_length(wordlist.word) - char_length(replace(upper(wordlist.word)
   Rows Removed by Filter: 369979
   Buffers: shared hit=1914
 Planning time: 0.273 ms
 Execution time: 755.696 ms
(7 rows)
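
As an aside, nobody would want to type 26 near-identical conditions by hand. Here is a small sketch of mine (not from the original post) that generates the predicate list for a given hand with format() and string_agg():

with hand(letters) as (values ('EJIGUNEM')),
counts as (
    -- occurrences of each letter A-Z in the hand
    select chr(64 + i) as letter,
           char_length(letters) - char_length(replace(letters, chr(64 + i), '')) as cnt
    from hand, generate_series(1, 26) as i
)
select string_agg(
           format('char_length(word)-char_length(replace(upper(word),%L,'''')) <= %s',
                  letter, cnt),
           E'\nAND ' order by letter) as conditions
from counts;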

Regex queries

The easiest win is using a regex in the query. With it you can limit which letters may appear in a word, and it works surprisingly well, even though it is essentially a sequential scan that reads the full table and checks every word against the pattern. This is the query for the letters "EJIGUNEM":

explain (analyze, verbose, buffers) select word from wordlist where word ~* '^[EJIGUNEM]+$';
                                                QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.wordlist  (cost=0.00..6540.24 rows=37 width=10) (actual time=40.506..142.450 rows=249 loops=1)
   Output: word
   Filter: (wordlist.word ~* '^[EJIGUNEM]+$'::text)
   Rows Removed by Filter: 369850
   Buffers: shared hit=1914
 Planning time: 0.220 ms
 Execution time: 142.491 ms
(7 rows)

A trained eye will spot that the regex does not actually enforce the letter repetition limit and therefore returns 249 rows instead of the correct 120: a word like "genuine" passes the regex because all of its letters are in the set, but it needs two n's where only one is available. Since I was doing this before the original author posted his solution, I created a Python function which checks whether a word can be composed from the given characters, but you could equally well use his query to filter the words:

create or replace function can_word(word text, chars text) returns boolean
language plpython3u immutable strict as $f$
# pool of available letters, one entry per occurrence
valid_chars = [c for c in chars.lower()]
for c in word:
    try:
        # consume one occurrence of this letter from the pool
        valid_chars.remove(c)
    except ValueError:
        # the letter is not in the pool (or is already used up)
        return False
return True
$f$;
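
A quick sanity check (my own example calls, not from the original post):

select can_word('gem', 'EJIGUNEM');      -- true: g, e and m are all available
select can_word('genuine', 'EJIGUNEM');  -- false: needs two n's, the hand has only one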

With this extra check, the query then becomes:

 explain (analyze, verbose, buffers)
 with words_maybe as (select word from wordlist where word ~* '^[EJIGUNEM]+$')
 select word from words_maybe where can_word(word, 'EJIGUNEM');
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
  CTE Scan on words_maybe  (cost=6540.24..6550.23 rows=12 width=32) (actual time=42.183..150.407 rows=120 loops=1)
    Output: words_maybe.word
    Filter: can_word(words_maybe.word, 'EJIGUNEM'::text)
    Rows Removed by Filter: 129
    Buffers: shared hit=1914
    CTE words_maybe
      ->  Seq Scan on public.wordlist  (cost=0.00..6540.24 rows=37 width=10) (actual time=42.141..148.823 rows=249 loops=1)
            Output: wordlist.word
            Filter: (wordlist.word ~* '^[EJIGUNEM]+$'::text)
            Rows Removed by Filter: 369850
            Buffers: shared hit=1914
  Planning time: 0.190 ms
  Execution time: 150.443 ms
 (13 rows)

This query returns the correct 120 rows. The CTE runs the cheap regex filter over the full table first, so the comparatively expensive plpython check only sees the 249 candidates; at about 150 ms, that is decent enough for most developers to stop here.

Cube extension

If we can represent the letters as a vector, we can try the cube extension. Let's create a function that evolves the original author's idea and turns the letter counts into a full 26-dimensional vector (we're assuming the alphabet is ASCII letters only here):

create or replace function word2vec(word text) returns cube
language plpython3u immutable strict as $f$
# one dimension per letter of the alphabet
cu = [0] * 26

for c in word.upper():
    intc = ord(c)
    if 65 <= intc <= 90:  # 'A'..'Z'
        cu[intc - 65] += 1
    else:
        # word contains a character outside A-Z: no vector for it
        return None
return str(tuple(cu))
$f$;
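
For the hand used above this produces the letter-count vector that also appears as the upper corner of the cube in the query plan below (E twice; G, I, J, M, N and U once):

select word2vec('EJIGUNEM');
-- (0, 0, 0, 0, 2, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0)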

We can now create an index over the letter vectors. We also include the word itself in the index, to enable Postgres to return the word while only accessing the index, without touching the heap:

create index wordlist_word_vec on wordlist using gist (word2vec(word), word);

And the query is now magically fast. cube_union(word2vec('EJIGUNEM'), '0') builds a 26-dimensional box with one corner at the origin and the other at the hand's letter-count vector, and the <@ operator checks that a word's vector is contained in that box, i.e. that each letter is used no more often than the hand allows:

 explain (analyze, verbose, buffers) select word from wordlist where word2vec(word) <@ cube_union(word2vec('EJIGUNEM'), '0');
                                                                                                        QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Bitmap Heap Scan on public.wordlist  (cost=95.28..1118.29 rows=370 width=10) (actual time=7.518..7.650 rows=120 loops=1)
    Output: word
    Recheck Cond: (word2vec(wordlist.word) <@ '(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),(0, 0, 0, 0, 2, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0)'::cube)
    Heap Blocks: exact=67
    Buffers: shared hit=1737
    ->  Bitmap Index Scan on wordlist_word_vec  (cost=0.00..95.18 rows=370 width=0) (actual time=7.500..7.500 rows=120 loops=1)
          Index Cond: (word2vec(wordlist.word) <@ '(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),(0, 0, 0, 0, 2, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0)'::cube)
          Buffers: shared hit=1670
  Planning time: 0.370 ms
  Execution time: 7.726 ms
 (10 rows)

The queries are now fast, using the index and returning the 120 matching rows in under 10 ms. While the original table is only 15 MB, the index is much bigger at 176 MB, so it's the common memory vs. CPU time tradeoff.
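
If you want to verify those sizes yourself, the standard size functions will do:

select pg_size_pretty(pg_relation_size('wordlist'));          -- table, ~15 MB
select pg_size_pretty(pg_relation_size('wordlist_word_vec')); -- index, ~176 MB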