I have some data like this:
And I want to compare the string values in both columns on the folowing condition: they’re similar (assign some value, like 1) if they share at least 2 chars. Otherwise, they’re not similar.
So in the example, PPS and PSP would be similar.
How can this substring comparison be achieved?
I know one approach would be to extract substrings and manually compare them, but it feels hacky and I don’t know the maximum number of chars that can occur.
Below are the methods you can try. The first solution is probably the best. Try others if the first one doesn’t work. Senior developers aren’t just copying/pasting – they read the methods carefully & apply them wisely to each case.
they’re similar … if they share at least 2 chars.
Unfortunately, there is no built-in “intersect” operator or function for strings or arrays. You can roll your own function to count overlapping characters:
CREATE FUNCTION f_count_overlapping_char(text, text) RETURNS int LANGUAGE sql PARALLEL SAFE IMMUTABLE STRICT AS $func$ SELECT count(*)::int FROM ( SELECT unnest(string_to_array($1, NULL)) INTERSECT ALL SELECT unnest(string_to_array($2, NULL)) ) sub; $func$;
INTERSECT ALL includes duplicate matching characters. To fold duplicates, use just
Then your query can be:
SELECT *, f_count_overlapping_char(t1.metaphone, t2.metaphone) AS overlap FROM tbl t1 JOIN tbl t2 ON t1.id < t2.id AND f_count_overlapping_char(t1.metaphone, t2.metaphone) >= 2;
But it’s expensive and does not scale well with more rows in the table – O(N²). Depending on your actual objective there are various superior alternatives – like trigram similarity provided by the additional module pg_trgm. See: