The question:
I want to search for the letters in the “spelling” (text[]
) column:
a n n o y t
I need it to find the words: any, annoy, no, an, toy
But not find: derivatives of annoy (annoying, annoyance), only find no once.
I also need the query to NOT find annoy if even a single letter is missing (such as anoyt).
I am using PostgreSQL 13.5
ronshome=# SELECT reference, word, spelling FROM word_mash_dictionary
WHERE word LIKE 'annoy';
reference | word | spelling
-----------+-------+-------------
420 | annoy | {a,n,n,o,y}
(1 row)
This is the table structure:
Table "public.word_mash_dictionary"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
-------------+--------+-----------+----------+---------------------------------------------------------+----------+--------------+-------------
reference | bigint | | not null | nextval('word_mash_dictionary_reference_seq'::regclass) | plain | |
word | text | | | | extended | |
spelling | text[] | | | | extended | |
ignore | bigint | | | | plain | |
list_100 | bigint | | | | plain | |
list_300 | bigint | | | | plain | |
list_500 | bigint | | | | plain | |
list_800 | bigint | | | | plain | |
list_1000 | bigint | | | | plain | |
list_2000 | bigint | | | | plain | |
list_3000 | bigint | | | | plain | |
list_5000 | bigint | | | | plain | |
list_7000 | bigint | | | | plain | |
list_10000 | bigint | | | | plain | |
word_length | bigint | | | | plain | |
The Solutions:
Below are the methods you can try. The first solution is probably the best. Try others if the first one doesn’t work. Senior developers aren’t just copying/pasting – they read the methods carefully & apply them wisely to each case.
Method 1
The “is contained” operator <@
for arrays mostly does it:
SELECT reference, word, spelling
FROM word_mash_dictionary
WHERE spelling <@ '{a,n,n,o,y,t}'::text[];
This can be supported with a GIN index on the array, which makes it fast for big tables. Like:
CREATE INDEX ON word_mash_dictionary USING gin (spelling);
However, one element in the search array covers any number of matches in spelling
. So '{a,n,o,y}'
would find '{a,n,n,o,y}'
etc. False positives for words with repeated letters.
A set-operation with EXCEPT ALL
would be exact (consider each copy of the same element separately). Wrapped into a custom function:
CREATE OR REPLACE FUNCTION f_arr_is_contained(arr1 text[], arr2 text[])
RETURNS bool
LANGUAGE plpgsql IMMUTABLE STRICT PARALLEL SAFE AS
$func$
DECLARE
BEGIN
PERFORM unnest(arr1) EXCEPT ALL SELECT unnest(arr2);
RETURN NOT FOUND;
END
$func$;
If every single letter is covered in the second term, no row is returned, and FOUND
is false. So return NOT FOUND
.
I chose LANGUAGE plpgsql
because the function cannot be “inlined” anyway, so plpgsql might be faster. You can test the equivalent alternative with LANGUAGE plpgsql
:
CREATE OR REPLACE FUNCTION f_arr_is_contained_sql(arr1 text[], arr2 text[])
RETURNS bool
LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
'SELECT NOT EXISTS (SELECT unnest (arr1) EXCEPT ALL SELECT unnest (arr2))';
The function cannot use any indexes, though, which would lead to expensive sequential scans over the whole table.
Combine both to be fast and exact:
SELECT reference, word, spelling
FROM word_mash_dictionary
WHERE spelling <@ '{a,n,o,y,t}'::text[]
AND f_arr_is_contained(spelling, '{a,n,o,y,t}'::text[]);
db<>fiddle here
The first predicate finds all matches (and possibly some false positives) quickly with index support; the second predicate weeds out the (few!) false positives.
Aside, word
and spelling
should probably be declared NOT NULL
.
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0