The question:
I am working with a PostgreSQL table with a large TEXT field, which is theoretically updated on a regular basis. I have thought about storing the data directly in the filesystem, but with TOAST, the data is already being stored off-page and compressed in the database so I figured I would keep things simple and just use the database storage.
To give some context, I am indexing RSS feeds. I will have a script run every 24h that pulls the RSS feeds and potentially updates the table. This can lead to a lot of dead tuples, and thus lots of space being used on disk. Of course, autovacuum will take care of this eventually but it has the potential to be a lot of data (many GB) and I want to make sure I know what will happen when I am doing lots of updates on this very large table.
One solution I have is to only update the TEXT field (storing the RSS data) if there are certain substantial changes to the feed, e.g. a new post on the website. This means that I could avoid doing the UPDATE unless I really have to. However, I still want to update the table (to keep track of when I most recently did an HTTP request). This will create a dead tuple with the old version of the row’s data.
What will happen to the TEXT data, if it isn’t actually changed? Will the UPDATE also duplicate the TEXT data, when it creates a dead tuple? Or will the TEXT data be left untouched, because it wasn’t changed and it is stored off-page?
The Solutions:
Below are the methods you can try. The first solution is probably the best. Try others if the first one doesn’t work. Senior developers aren’t just copying/pasting – they read the methods carefully & apply them wisely to each case.
Method 1
That’s the one major shortcut an UPDATE
takes in Postgres’ MVCC model (as compared to DELETE
+ INSERT
): Fields that are stored out-of-line (TOASTed) and not changed in the UPDATE
are kept as is. Meaning, the old (soon to be dead) row version and the new row version in the main relation point to the same TOASTed values, no additional bloat there.
During an
UPDATE
operation, values of unchanged fields are normally
preserved as-is; so anUPDATE
of a row with out-of-line values
incurs no TOAST costs if none of the out-of-line values change.
As a_horse_with_no_name pointed out:
“Not changed in the UPDATE
“, or “unchanged” as the manual puts it, means “not targeted in a SET
clause of the UPDATE
“. Postgres does not verify whether a new column value actually differs from the previous row version.
If possible, skip rows that do not change to begin with by adding a WHERE
clause. That still leaves cases updating multiple columns at once where some stay unchanged. If that applies to a your big column, it might pay to update it separately, and only where it actually changes. See:
If, OTOH, you update big, TOASTed fields on a regular basis, consider the LZ4 compression algorithm (new in Postgres 14) for those columns. Disk footprint is a bit bigger, but performance is much better. See:
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0