I am working with a PostgreSQL table with a large TEXT field, which is theoretically updated on a regular basis. I have thought about storing the data directly in the filesystem, but with TOAST, the data is already being stored off-page and compressed in the database so I figured I would keep things simple and just use the database storage.
To give some context, I am indexing RSS feeds. I will have a script run every 24h that pulls the RSS feeds and potentially updates the table. This can lead to a lot of dead tuples, and thus lots of space being used on disk. Of course, autovacuum will take care of this eventually but it has the potential to be a lot of data (many GB) and I want to make sure I know what will happen when I am doing lots of updates on this very large table.
One solution I have is to only update the TEXT field (storing the RSS data) if there are certain substantial changes to the feed, e.g. a new post on the website. This means that I could avoid doing the UPDATE unless I really have to. However, I still want to update the table (to keep track of when I most recently did an HTTP request). This will create a dead tuple with the old version of the row’s data.
What will happen to the TEXT data, if it isn’t actually changed? Will the UPDATE also duplicate the TEXT data, when it creates a dead tuple? Or will the TEXT data be left untouched, because it wasn’t changed and it is stored off-page?
Below are the methods you can try. The first solution is probably the best. Try others if the first one doesn’t work. Senior developers aren’t just copying/pasting – they read the methods carefully & apply them wisely to each case.
That’s the one major shortcut an
UPDATE takes in Postgres’ MVCC model (as compared to
INSERT): Fields that are stored out-of-line (TOASTed) and not changed in the
UPDATE are kept as is. Meaning, the old (soon to be dead) row version and the new row version in the main relation point to the same TOASTed values, no additional bloat there.
UPDATEoperation, values of unchanged fields are normally
preserved as-is; so an
UPDATEof a row with out-of-line values
incurs no TOAST costs if none of the out-of-line values change.
As a_horse_with_no_name pointed out:
“Not changed in the
UPDATE“, or “unchanged” as the manual puts it, means “not targeted in a
SET clause of the
UPDATE“. Postgres does not verify whether a new column value actually differs from the previous row version.
If possible, skip rows that do not change to begin with by adding a
WHERE clause. That still leaves cases updating multiple columns at once where some stay unchanged. If that applies to a your big column, it might pay to update it separately, and only where it actually changes. See:
If, OTOH, you update big, TOASTed fields on a regular basis, consider the LZ4 compression algorithm (new in Postgres 14) for those columns. Disk footprint is a bit bigger, but performance is much better. See: