More columns vs. more rows to optimize storage and performance?

The question:

I have a table with this structure:

ticker         VARCHAR                     NOT NULL,
interval       VARCHAR                     NOT NULL,
ts             TIMESTAMP WITHOUT TIME ZONE NOT NULL,
price          FLOAT8                      NOT NULL,
UNIQUE (ticker, interval, ts)

There are 40 tickers (to eventually be extended to around 130) and 8 intervals.
New rows (40 * 8) are added every 10 seconds as a bulk copy, which represents 115k rows/hour. They are written once and never modified.

Read operations are always done for fairly large time ranges (multiple of days) and request one ticker and 3 intervals for it, using this:

SELECT * FROM exchange.{tableName}
WHERE ticker = '{ticker}' AND "interval" IN ({intervalsText})
AND ts >= '{fromTime.Format}' AND ts < '{toTime.Format}'
ORDER BY ts

My question here is if it would be beneficial to group all the intervals in a single row per ticker. Like this:

ticker           VARCHAR                     NOT NULL,
ts               TIMESTAMP WITHOUT TIME ZONE NOT NULL,
price_interval0  FLOAT8                      NOT NULL,
price_interval1  FLOAT8                      NOT NULL,
...
price_interval7  FLOAT8                      NOT NULL,
UNIQUE (ticker, ts)

This means 8x less rows in the table, a smaller index, but each query may need to load the whole row to just return 3 values and discard 5.

I don’t know how Postgres organizes the data internally and if a whole row would be retrieved at once (which is my assumption) and then parts of it get extracted, etc…

Any advice would be greatly appreciated.

The Solutions:

Below are the methods you can try. The first solution is probably the best. Try others if the first one doesn’t work. Senior developers aren’t just copying/pasting – they read the methods carefully & apply them wisely to each case.

Method 1

Over time, that’s going to be a lot of rows!

Basics

Yes, storing 8 float8 in a single row will beat 8 rows with 1 float8 each by a long shot, in storage and performance.
But you can do more …

Table design

To optimize storage and performance:

CREATE TABLE ticker (
  ticker_id smallint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
, ticker text NOT NULL UNIQUE   
);

CREATE TABLE tbl (
  the_date        date     NOT NULL  -- columns in this order!
, timeslot        smallint NOT NULL
, ticker_id       smallint NOT NULL REFERENCES ticker
, price_interval0 int      NOT NULL
, price_interval1 int      NOT NULL
 ...
, price_interval7 int      NOT NULL
CONSTRAINT tbl_pkey PRIMARY KEY (ticker_id, the_date, timeslot);  -- columns in this order!
);

db<>fiddle here – including all

Explanation and auxiliaries

One entry every 10 seconds comes up to 6*60*24 = 8640 distinct timeslots per day. A smallint with its range of -2^15 to 2^15 can easily hold that.

Of course, we don’t store the full ticker name every time. A smallint FK column covers 40 – 130 distinct tickers easily, and references a ticker table. Typically better for storage and performance:

The day as date (4 bytes), a timeslot smallint (2 bytes) and a smallint for the ticker ID, arranged in this sequence occupy 8 bytes with no alignment padding!
Unfortunately, we cannot optimize the PK index perfectly at the same time and incur 8 bytes of alignment padding. The only stain on the storage optimization.

For convenience, you can add a VIEW to get pretty data:

CREATE VIEW tbl_pretty AS
SELECT ti.ticker, the_date + interval '10 sec' * timeslot AS ts, price_interval0, price_interval1
--   , price_interval2, ...
FROM   tbl
JOIN   ticker ti USING (ticker_id);

As you can see, this expression produces your original timestamp:

the_date + interval '10 sec' * timeslot

The reverse conversion will be used in the query below:

trunc(EXTRACT(EPOCH FROM time '12:34:56'))::int / 10)

Monetary values like a “price” shouldn’t be stored as floating point number. That’s a loaded foot gun. Use numeric. Or, since we are optimizing for storage & performance, an integer representing Cents typically works best. And that’s only 4 bytes instead of 8 bytes for float8. (numeric depends on actual length, typically bigger). See:

Storage

This will occupy:

  • (24(tuple header) + 4(item identifier) + 4 + 2 + 2 + 4*8 + 4) = 72 bytes per table row – no padding at all
    (Your original idea for the compound row would occupy (24 + 4 + (min. 8) + 8 + 8*8) = 108 bytes or more per row.)
  • (8(index header) + 2 + 2(padding) + 4 + 2 + 6(padding)) = 24 bytes per PK index entry

Plus minimal overhead per 8kb data page, and no overhead for dead tuples (never updated).

Details:

The PK index would be smaller (16 instead of 24 bytes per tuple) if we could make it on (the_date, timeslot, ticker_id). But we need it on (ticker_id, the_date, timeslot) to support your query optimally. Equality before range. See:

Query

Your query becomes:

SELECT price_interval3, price_interval7  -- just the intervals you need
FROM   tbl
WHERE  ticker_id = (SELECT ticker_id FROM ticker WHERE ticker = 'ticker_3')
AND    (the_date, timeslot) >= (date '2022-04-20', trunc(EXTRACT(EPOCH FROM time '00:00:00'))::int / 10)
AND    (the_date, timeslot) <  (date '2022-04-20', trunc(EXTRACT(EPOCH FROM time '00:01:00'))::int / 10)
ORDER  BY the_date, timeslot;

Or short:

SELECT *
FROM   tbl
WHERE  ticker_id = 3
AND    (the_date, timeslot) >= ('2022-04-20', 0)
AND    (the_date, timeslot) <  ('2022-04-20', 6)
ORDER  BY the_date, timeslot;

Note the use of ROW value comparison! See:

Performance

This is supported perfectly by the PK index on (ticker_id, the_date, timeslot). No other indexes required. You get a plan like:

Index Scan using tbl_pkey on tbl  (cost=0.27..8.29 rows=1 width=16)
  Index Cond: ((ticker_id = 3) AND (ROW(the_date, timeslot) >= ROW('2022-04-20'::date, 0)) AND (ROW(the_date, timeslot) < ROW('2022-04-20'::date, 6)))


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

Leave a Comment