The question:
I have a table with this structure:
ticker VARCHAR NOT NULL,
interval VARCHAR NOT NULL,
ts TIMESTAMP WITHOUT TIME ZONE NOT NULL,
price FLOAT8 NOT NULL,
UNIQUE (ticker, interval, ts)
There are 40 tickers (to eventually be extended to around 130) and 8 intervals.
New rows (40 * 8) are added every 10 seconds as a bulk copy, which represents 115k rows/hour. They are written once and never modified.
Read operations are always done for fairly large time ranges (multiple of days) and request one ticker and 3 intervals for it, using this:
SELECT * FROM exchange.{tableName}
WHERE ticker = '{ticker}' AND "interval" IN ({intervalsText})
AND ts >= '{fromTime.Format}' AND ts < '{toTime.Format}'
ORDER BY ts
My question here is if it would be beneficial to group all the intervals in a single row per ticker. Like this:
ticker VARCHAR NOT NULL,
ts TIMESTAMP WITHOUT TIME ZONE NOT NULL,
price_interval0 FLOAT8 NOT NULL,
price_interval1 FLOAT8 NOT NULL,
...
price_interval7 FLOAT8 NOT NULL,
UNIQUE (ticker, ts)
This means 8x fewer rows in the table and a smaller index, but each query may need to load the whole row just to return 3 values and discard 5.
I don’t know how Postgres organizes the data internally and if a whole row would be retrieved at once (which is my assumption) and then parts of it get extracted, etc…
Any advice would be greatly appreciated.
The Solutions:
Below are the methods you can try. The first solution is probably the best; try the others if it doesn’t work. Senior developers aren’t just copying and pasting – they read the methods carefully and apply them judiciously to each case.
Method 1
Over time, that’s going to be a lot of rows!
Basics
Yes, storing 8 float8 values in a single row will beat 8 rows with one float8 each by a long shot, in storage and performance.
But you can do more …
Table design
To optimize storage and performance:
CREATE TABLE ticker (
ticker_id smallint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
, ticker text NOT NULL UNIQUE
);
CREATE TABLE tbl (
  the_date date NOT NULL -- columns in this order!
, timeslot smallint NOT NULL
, ticker_id smallint NOT NULL REFERENCES ticker
, price_interval0 int NOT NULL
, price_interval1 int NOT NULL
-- , price_interval2 ... price_interval6 int NOT NULL
, price_interval7 int NOT NULL
, CONSTRAINT tbl_pkey PRIMARY KEY (ticker_id, the_date, timeslot) -- columns in this order!
);
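For illustration, the lookup table could be seeded once like this (the ticker names are hypothetical placeholders; the IDENTITY column assigns ticker_id automatically):
INSERT INTO ticker (ticker)
VALUES ('ticker_0'), ('ticker_1'), ('ticker_2'); -- ... one row per ticker name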
db<>fiddle here
Explanation and auxiliaries
One entry every 10 seconds comes to 6*60*24 = 8640 distinct timeslots per day. A smallint with its range of -2^15 to 2^15 - 1 easily holds that.
Of course, we don’t store the full ticker name every time. A smallint FK column covers 40 – 130 distinct tickers easily and references a ticker table – typically better for storage and performance.
The day as date (4 bytes), a timeslot smallint (2 bytes) and a smallint for the ticker ID, arranged in this sequence, occupy 8 bytes with no alignment padding!
Unfortunately, we cannot optimize the PK index perfectly at the same time: with its column order (ticker_id, the_date, timeslot) it incurs 8 bytes of alignment padding per entry. The only stain on the storage optimization.
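You can observe the effect of column order directly; a quick sanity check using pg_column_size() on composite values (byte counts assume a typical 64-bit build with 8-byte MAXALIGN):
SELECT pg_column_size(ROW(date '2022-04-20', 0::smallint, 0::smallint)) AS good_order -- 24-byte row header + 4 + 2 + 2 = 32
     , pg_column_size(ROW(0::smallint, date '2022-04-20', 0::smallint)) AS padded -- 24 + 2 + 2 padding + 4 + 2 = 34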
For convenience, you can add a VIEW to get pretty data:
CREATE VIEW tbl_pretty AS
SELECT ti.ticker, the_date + interval '10 sec' * timeslot AS ts, price_interval0, price_interval1
-- , price_interval2, ...
FROM tbl
JOIN ticker ti USING (ticker_id);
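With that, reads can stay as simple as in your original setup (the ticker name here is a hypothetical placeholder). Just be aware that a range predicate on the computed ts is applied as a filter after the view is inlined – for large scans, the row-comparison query further down uses the PK index much more effectively:
SELECT *
FROM tbl_pretty
WHERE ticker = 'ticker_3'
AND ts >= '2022-04-20' AND ts < '2022-04-22'
ORDER BY ts;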
As you can see, this expression produces your original timestamp:
the_date + interval '10 sec' * timeslot
The reverse conversion is used in the query below:
trunc(EXTRACT(EPOCH FROM time '12:34:56'))::int / 10
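To make the conversion concrete, a quick worked example (values computed by hand):
-- 12:34:56 is 12*3600 + 34*60 + 56 = 45296 seconds into the day
SELECT trunc(EXTRACT(EPOCH FROM time '12:34:56'))::int / 10 AS timeslot; -- 4529
-- and back: note the truncation to the 10-second grid
SELECT date '2022-04-20' + interval '10 sec' * 4529 AS ts; -- 2022-04-20 12:34:50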
Monetary values like a “price” shouldn’t be stored as floating point numbers. That’s a loaded foot gun. Use numeric. Or, since we are optimizing for storage & performance, an integer representing Cents typically works best. That’s only 4 bytes instead of 8 bytes for float8. (numeric depends on actual length, and is typically bigger.)
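If you go the integer route, the conversion is trivial; a sketch assuming prices never need more than two decimal places:
-- write path: round to whole Cents (go through numeric to avoid float artifacts)
SELECT round(103.37::numeric * 100)::int AS price_cents; -- 10337
-- read path: back to a display price
SELECT 10337 / 100.0 AS price; -- 103.37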
Storage
This will occupy:
- Table row: (24 tuple header + 4 item identifier + 4 + 2 + 2 + 4*8 + 4) = 72 bytes per row – no padding at all. (Your original idea for the compound row would occupy (24 + 4 + (min. 8) + 8 + 8*8) = 108 bytes or more per row.)
- PK index entry: (8 index tuple header + 2 + 2 padding + 4 + 2 + 6 padding) = 24 bytes.
Plus minimal overhead per 8kb data page, and no overhead for dead tuples (rows are written once and never updated).
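You can verify these numbers against a populated table; a sketch using standard system functions:
SELECT pg_column_size(t.*) AS tuple_bytes -- tuple incl. header, excl. the 4-byte item identifier
FROM tbl t
LIMIT 1;
SELECT pg_relation_size('tbl') AS table_bytes
     , pg_relation_size('tbl_pkey') AS pk_index_bytes;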
Details:
- Making sense of Postgres row sizes
- Calculating and saving space in PostgreSQL
- Configuring PostgreSQL for read performance
The PK index would be smaller (16 instead of 24 bytes per tuple) if we could make it on (the_date, timeslot, ticker_id). But we need it on (ticker_id, the_date, timeslot) to support your query optimally: equality before range.
Query
Your query becomes:
SELECT price_interval3, price_interval7 -- just the intervals you need
FROM tbl
WHERE ticker_id = (SELECT ticker_id FROM ticker WHERE ticker = 'ticker_3')
AND (the_date, timeslot) >= (date '2022-04-20', trunc(EXTRACT(EPOCH FROM time '00:00:00'))::int / 10)
AND (the_date, timeslot) < (date '2022-04-20', trunc(EXTRACT(EPOCH FROM time '00:01:00'))::int / 10)
ORDER BY the_date, timeslot;
Or short:
SELECT *
FROM tbl
WHERE ticker_id = 3
AND (the_date, timeslot) >= ('2022-04-20', 0)
AND (the_date, timeslot) < ('2022-04-20', 6)
ORDER BY the_date, timeslot;
Note the use of ROW value comparison! See:
- SQL syntax term for ‘WHERE (col1, col2) < (val1, val2)’
- What’s a good way of doing a > or < comparison that considers NULLs to be sorted first or last?
Performance
This is supported perfectly by the PK index on (ticker_id, the_date, timeslot). No other indexes are required. You get a plan like:
Index Scan using tbl_pkey on tbl (cost=0.27..8.29 rows=1 width=16)
Index Cond: ((ticker_id = 3) AND (ROW(the_date, timeslot) >= ROW('2022-04-20'::date, 0)) AND (ROW(the_date, timeslot) < ROW('2022-04-20'::date, 6)))
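For completeness, a hedged sketch of the write path, converting an incoming (ticker, ts, price, ...) record into the new layout. The VALUES row and its column names are hypothetical, and only three of the eight price columns are shown – the rest follow the same pattern. In practice your 10-second bulk copy would apply the same conversion for all 40 tickers at once:
INSERT INTO tbl (the_date, timeslot, ticker_id, price_interval0, price_interval1, price_interval7) -- , ... the rest
SELECT v.ts::date
     , trunc(EXTRACT(EPOCH FROM v.ts::time))::int / 10 -- timeslot within the day
     , ti.ticker_id
     , round(v.price0::numeric * 100)::int -- prices as Cents
     , round(v.price1::numeric * 100)::int
     , round(v.price7::numeric * 100)::int
FROM (VALUES ('ticker_3', timestamp '2022-04-20 12:34:50', 103.37, 104.01, 99.50)) v(ticker, ts, price0, price1, price7)
JOIN ticker ti USING (ticker);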
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.