I regularly import data from httparchive.org. The data is a MySQL CSV export and I use pgloader, which handles the quirks of this export.
I also need to do some additional processing for normalisation:
- split the URL into protocol (http|https) and host parts
- convert the string date “Mon DD YYYY” to a date object
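For illustration, here is a minimal sketch of those two transforms outside the database, in Python. The regex mirrors the `regexp_match()` pattern used in the triggers below; I'm assuming the “Mon DD YYYY” format maps to `strptime`'s `%b %d %Y` (e.g. `Oct 15 2019`):

```python
import re
from datetime import date, datetime

# Pattern mirroring regexp_match(url, '(https|http)://(.+)/'):
# group 1 is the protocol, group 2 the host.
URL_RE = re.compile(r'(https|http)://(.+)/')

def split_url(url: str):
    """Return (protocol, host), or (None, None) if the URL does not match."""
    m = URL_RE.match(url)
    return (m.group(1), m.group(2)) if m else (None, None)

def parse_date(s: str) -> date:
    """Convert a 'Mon DD YYYY' string (e.g. 'Oct 15 2019') to a date."""
    return datetime.strptime(s, '%b %d %Y').date()

print(split_url('https://www.example.com/'))  # ('https', 'www.example.com')
print(parse_date('Oct 15 2019'))              # 2019-10-15
```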
Currently, I have some triggers to do this when importing the data but I’m looking for ways to improve this, particularly to see whether some steps can be run in parallel.
I have the following CTE for extracting protocol and host:
with split as (
  select regexp_match(url, '(https|http)://(.+)/') as parts
  from urls
)
Running locally, this works well as a select but seems very slow as an update:

with split as (
  select regexp_match(url, '(https|http)://(.+)/') as parts
  from urls
)
update urls
set protocol = parts[1],
    host = parts[2]
from split
An alternative, especially when working with a text source, would be to split the URL before it goes to Postgres.
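A sketch of that pre-processing step, streaming the CSV and replacing the URL column with protocol and host columns before the file ever reaches Postgres. The column layout (URL first) is a hypothetical stand-in for the real export:

```python
import csv
import io
import re

URL_RE = re.compile(r'(https|http)://(.+)/')

def preprocess(src, dst):
    """Stream a CSV, replacing the (assumed) first column containing the
    URL with separate protocol and host columns."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator='\n')
    for row in reader:
        m = URL_RE.match(row[0])
        protocol, host = (m.group(1), m.group(2)) if m else ('', '')
        writer.writerow([protocol, host] + row[1:])

# Example with in-memory files; in practice src would be the export and
# dst a temporary file handed to COPY or pgloader.
src = io.StringIO('https://www.example.com/,200\nhttp://foo.org/,404\n')
dst = io.StringIO()
preprocess(src, dst)
print(dst.getvalue())
# prints:
# https,www.example.com,200
# http,foo.org,404
```

Because this only ever holds one row in memory, it should cope with the full multi-gigabyte export.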
The uncompressed CSV is 3526430884 bytes (about 3.5 GB) and takes around 20 minutes to import with no processing, but more than twice as long with the processing. FWIW, I have also tried using a Foreign Data Wrapper, but even after solving various problems with the CSV (nulls, encoding), this leads to a memory error.
With some help, I’ve managed to run benchmarks and improve my triggers.
CREATE OR REPLACE FUNCTION public.extract_protocol() RETURNS trigger
LANGUAGE plpgsql AS $function$
DECLARE
  elements text[];
BEGIN
  elements := regexp_match(NEW.url, '(https|http)://(.+)/');
  NEW.protocol := elements[1];
  NEW.host := elements[2];
  RETURN NEW;
END;
$function$
This now runs faster than doing a subsequent update, but neither is the limiting factor. The bottleneck is now the overhead of maintaining the indices while inserting the cleaned-up data into the main table. I think my only option there is to weigh the cost of keeping the indices during the insert against dropping them and recreating them afterwards.
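One way to measure that trade-off is to time the bulk insert with the indices in place against a drop/reload/recreate cycle. A sketch; the index name, staging table, and column list here are hypothetical:

```sql
-- Drop secondary indices before the bulk load...
DROP INDEX IF EXISTS urls_host_idx;

-- ...insert the cleaned-up rows in one pass...
INSERT INTO urls (protocol, host, crawl_date)
SELECT protocol, host, crawl_date FROM staging_urls;

-- ...then rebuild the index over the full table.
CREATE INDEX urls_host_idx ON urls (host);
```

Rebuilding an index once over the whole table is usually cheaper than maintaining it row by row during a large insert, at the cost of the table being unindexed (and queries on it slow) while the load runs.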
Your UPDATE syntax generates a cross join of the table urls with the split result, which is essentially a cross join of the table with itself.
You need to have some kind of join condition between the target table and the source. The obvious choice would be the primary key column of the table.
with split as (
  select pk_column,
         regexp_match(url, '(https|http)://(.+)/') as parts
  from urls
)
update urls
set protocol = s.parts[1],
    host = s.parts[2]
from split s
where urls.pk_column = s.pk_column --<< here
I think your attempt to avoid evaluating the regex expression twice by using the CTE makes things slower, rather than faster. I would expect the overhead of joining the table with itself to be far bigger than the overhead of evaluating the regex twice.
So I think you should also try:

update urls
set protocol = (regexp_match(url, '(https|http)://(.+)/'))[1],
    host = (regexp_match(url, '(https|http)://(.+)/'))[2]
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, or CC BY-SA 4.0.