The question:
Suppose I have a table with images that are supposed to go through several steps:
CREATE TABLE images (filename text, extracted bool, cropped bool, resized bool);
INSERT INTO images (filename, extracted, cropped, resized)
VALUES
('foo', false, false, false),
('bar', true, false, false),
('baz', true, true, false),
('qux', true, true, true);
At some point I have a query to find all images that are cropped but still need to be resized:
SELECT count(*) FROM images WHERE cropped AND NOT resized;
Now I believe the best way to make that query fast is a partial index:
CREATE INDEX ON images (cropped, resized) WHERE (cropped AND NOT resized);
I’d make it partial because cropped AND NOT resized is a relatively rare state, while there might be millions of images that are already fully processed and also millions that are not cropped yet.
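To give a sense of scale, here is a sketch of how that skew could be reproduced with synthetic data (the img_ filenames, the 1,000,000 rows and the roughly 0.1% of rows left in the cropped-but-not-resized state are all made up for illustration):
INSERT INTO images (filename, extracted, cropped, resized)
SELECT 'img_' || i,
       true,
       i % 2 = 0,                      -- half of the rows have been cropped
       i % 2 = 0 AND i % 1000 <> 0     -- almost all cropped rows are already resized
FROM generate_series(1, 1000000) AS i;
ANALYZE images;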
My question is now, do I need statistics in addition to the index?
One of these?
CREATE STATISTICS stat1 (dependencies) ON cropped, resized FROM images;
CREATE STATISTICS stat2 (ndistinct) ON cropped, resized FROM images;
CREATE STATISTICS stat3 (mcv) ON cropped, resized FROM images;
ANALYZE images;
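As an aside, once ANALYZE has run, the extended statistics that were actually built can be inspected through the pg_stats_ext system view (available since PostgreSQL 12), for example:
SELECT statistics_name, kinds, n_distinct, dependencies, most_common_vals
FROM pg_stats_ext
WHERE tablename = 'images';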
I found the chapter How the Planner Uses Statistics, which I had previously missed (or rather conflated with Statistics Used by the Planner), but it only talks about how statistics are turned into row estimates. What is unclear to me is how indexes are chosen, given that there are apparently no statistics about indexes.
The Solutions:
Below are the methods you can try. The first solution is probably the best one; try the others if it doesn’t work for you. Senior developers aren’t just copying and pasting – they read the methods carefully and apply them to their particular case.
Method 1
You are substantially over-thinking this. Your query is very simple, and there are only a few ways it could be executed. Whether it returns 7000 rows or 2000 rows doesn’t matter, because either way the index will look better than the meager alternatives.
If you really do want to run a wider variety of queries that give the planner more opportunities to make the wrong choice, it might be important to add extended statistics of the mcv variety.
Your two examples are utterly mismatched. The table of counts in your question would lead to vastly different row estimates from those shown in your answer: around 5,000,000 with no extended MCV stats, and around 1 with them. Certainly not 6872 vs 1782.
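If you want to see the effect on the estimates yourself, a simple way is to compare the row estimate in plain EXPLAIN output before and after adding the mcv statistics (a sketch; the numbers you get depend entirely on your data distribution):
EXPLAIN SELECT count(*) FROM images WHERE cropped AND NOT resized;
CREATE STATISTICS stat3 (mcv) ON cropped, resized FROM images;
ANALYZE images;
EXPLAIN SELECT count(*) FROM images WHERE cropped AND NOT resized;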
Method 2
My experiments showed that currently I don’t seem to need those statistics for the index to be used, but it was still unclear why that is. After a dive into the source, I think I can answer my own question.
Essentially, the decision to use an index is made in btcostestimate() and, in turn, genericcostestimate().
It helps to remember what kinds of statistics are available for each table:
- Number of tuples
- For each column: Number of distinct values (sometimes called “cardinality”)
- For each column: The most common values
- For each column: A histogram of the remaining (less common) values
- If configured: dependencies stats (“How many values in column A have only a single value appear in column B?”)
- If configured: ndistinct stats (the number of distinct value combinations in columns A and B)
- If configured: mcv stats (the most common value combinations in columns A and B)
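The per-column statistics from this list end up in the pg_stats system view after ANALYZE; a quick way to eyeball them (a sketch):
-- the table-level tuple count lives in pg_class.reltuples instead
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'images';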
For each index, Postgres determines which conditions can be checked using the index (the “Index Conds” or “indexQuals”). Based on those, genericcostestimate() (using clauselist_selectivity()) calculates a selectivity for the index, taking those extended statistics into account. This was actually reflected in my experiments: I got better row estimates with extended statistics of the mcv kind.
The difference in actual time is down to caching.
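If you want to confirm that a timing difference really comes from caching rather than from the plan, the buffer counters make it visible (a sketch):
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM images WHERE cropped AND NOT resized;
-- "shared hit" vs "read" in the Buffers line shows how much came from the buffer cache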
The predicate of the partial index would also be taken into account but only if it introduces additional restrictions, so it’s not relevant here.
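For completeness, here is a sketch of a variation (hypothetical, not from the question) where the predicate does introduce additional restrictions, because the predicate columns are not part of the index conditions:
-- hypothetical index name; only filename is indexed, but the index is still partial
CREATE INDEX images_pending_idx ON images (filename)
  WHERE (cropped AND NOT resized);
-- the index cond is only "filename = 'img_42'"; the predicate
-- (cropped AND NOT resized) further limits which rows the scan can return,
-- so its selectivity is folded into the estimate as well
EXPLAIN SELECT * FROM images
WHERE filename = 'img_42' AND cropped AND NOT resized;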
So here is how I think the index got chosen: first the index predicate was checked to see whether the index is usable at all. Then a selectivity was calculated for the particular conditions, which without the extended statistics was indeed a bit off. But further down, when the actual cost is calculated, the index is so small that even with the wrong row estimate the cost comes out very low.
So the answer is: yes, extended statistics are theoretically still needed for good row estimates, but also no, the index is still chosen without extended statistics because it is so small.
Method 3
Your query does not filter the table sufficiently, so the Postgres query optimizer doesn’t use this index and chooses a sequential scan (full table scan) instead. You can either change the table and query design, or accept the sequential scan.
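If you want to check whether the planner can use the index at all, a common debugging trick is to disable sequential scans for the session and look at the plan again (a sketch; this is a diagnostic aid, not something to leave enabled):
SET enable_seqscan = off;
EXPLAIN SELECT count(*) FROM images WHERE cropped AND NOT resized;
RESET enable_seqscan;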
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, or CC BY-SA 4.0.