Select rows where array of letters is contained in given array

The question:

I want to search for the letters in the “spelling” (text[]) column:
a n n o y t

I need it to find the words: any, annoy, no, an, toy
But not find: derivatives of annoy (annoying, annoyance), only find no once.

I also need the query to NOT find annoy if even a single letter is missing (such as anoyt).

I am using PostgreSQL 13.5

ronshome=# SELECT reference, word, spelling FROM word_mash_dictionary
           WHERE word LIKE 'annoy';
 reference | word  |  spelling
-----------+-------+-------------
       420 | annoy | {a,n,n,o,y}
(1 row)

This is the table structure:

                                                      Table "public.word_mash_dictionary"
   Column    |  Type  | Collation | Nullable |                         Default                         | Storage  | Stats target | Description
-------------+--------+-----------+----------+---------------------------------------------------------+----------+--------------+-------------
 reference   | bigint |           | not null | nextval('word_mash_dictionary_reference_seq'::regclass) | plain    |              |
 word        | text   |           |          |                                                         | extended |              |
 spelling    | text[] |           |          |                                                         | extended |              |
 ignore      | bigint |           |          |                                                         | plain    |              |
 list_100    | bigint |           |          |                                                         | plain    |              |
 list_300    | bigint |           |          |                                                         | plain    |              |
 list_500    | bigint |           |          |                                                         | plain    |              |
 list_800    | bigint |           |          |                                                         | plain    |              |
 list_1000   | bigint |           |          |                                                         | plain    |              |
 list_2000   | bigint |           |          |                                                         | plain    |              |
 list_3000   | bigint |           |          |                                                         | plain    |              |
 list_5000   | bigint |           |          |                                                         | plain    |              |
 list_7000   | bigint |           |          |                                                         | plain    |              |
 list_10000  | bigint |           |          |                                                         | plain    |              |
 word_length | bigint |           |          |                                                         | plain    |              |

The Solutions:

Below are the methods you can try. The first solution is probably the best. Try others if the first one doesn’t work. Senior developers aren’t just copying/pasting – they read the methods carefully & apply them wisely to each case.

Method 1

The “is contained” operator <@ for arrays mostly does it:

SELECT reference, word, spelling
FROM   word_mash_dictionary
WHERE  spelling <@ '{a,n,n,o,y,t}'::text[];

This can be supported with a GIN index on the array, which makes it fast for big tables. Like:

CREATE INDEX ON word_mash_dictionary USING gin (spelling);

However, one element in the search array covers any number of matches in spelling. So '{a,n,o,y}' would find '{a,n,n,o,y}' etc. False positives for words with repeated letters.

A set-operation with EXCEPT ALL would be exact (consider each copy of the same element separately). Wrapped into a custom function:

CREATE OR REPLACE FUNCTION f_arr_is_contained(arr1 text[], arr2 text[])
  RETURNS bool
  LANGUAGE plpgsql IMMUTABLE STRICT PARALLEL SAFE AS
$func$
DECLARE
BEGIN
   PERFORM unnest(arr1) EXCEPT ALL SELECT unnest(arr2);

   RETURN NOT FOUND;
END
$func$;

If every single letter is covered in the second term, no row is returned, and FOUND is false. So return NOT FOUND.

I chose LANGUAGE plpgsql because the function cannot be “inlined” anyway, so plpgsql might be faster. You can test the equivalent alternative with LANGUAGE plpgsql:

CREATE OR REPLACE FUNCTION f_arr_is_contained_sql(arr1 text[], arr2 text[])
  RETURNS bool
  LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
'SELECT NOT EXISTS (SELECT unnest (arr1) EXCEPT ALL SELECT unnest (arr2))';

The function cannot use any indexes, though, which would lead to expensive sequential scans over the whole table.

Combine both to be fast and exact:

SELECT reference, word, spelling
FROM   word_mash_dictionary
WHERE  spelling <@ '{a,n,o,y,t}'::text[]
AND    f_arr_is_contained(spelling, '{a,n,o,y,t}'::text[]);

db<>fiddle here

The first predicate finds all matches (and possibly some false positives) quickly with index support; the second predicate weeds out the (few!) false positives.

Aside, word and spelling should probably be declared NOT NULL.


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

Leave a Comment