Fetching large number of rows from large MySQL table

The question:

I’m working with an IoT application where we have a bunch of devices sending readings every minute or so to an AWS Aurora MySQL 5.6 (InnoDB) instance. The instance is a db.t2.medium (2 vCPUs, 4 GB RAM). SELECT queries where we fetch by device id and sensor type have begun taking longer and longer, and I’m guessing it’s because we’re outgrowing our instance size.

The table we query has about 60 million rows, and because of the way our feature displays the data on a graph, we fetch all historical data rather than paginate. I also suspect this is part of the problem. An example query looks like this:

    SELECT * FROM readings
    WHERE device_id = 1234
      AND sensor_type = 'pressure'
      AND time >= 1644092837 AND time <= 1646684837

It returns about 500K rows and takes around 5-8 seconds.

The readings table has four columns: device_id, sensor_type, time (a Unix timestamp stored as an int), and value. There is a composite index on (device_id, sensor_type, time).

My main question is: how have people handled returning a large number of rows from an already large table? This table is only going to grow, given how frequently the sensors send data. I’ve considered having a readings table per device, but I’m not comfortable with potentially thousands of tables, especially if we ever have to add or edit a column.

I’m also wondering how people have handled scaling up a database in an IoT use case, because I’m concerned our AWS bill will get very expensive if we just keep adding RAM or increasing the instance size.

(from Comment)

CREATE TABLE readings (
    device_id int(11) unsigned NOT NULL AUTO_INCREMENT, 
    sensor_type char(5) CHARACTER SET ascii NOT NULL DEFAULT '', 
    time int(11) unsigned NOT NULL, 
    value float NOT NULL, 
    PRIMARY KEY (device_id,sensor_type,time)
) ENGINE=InnoDB AUTO_INCREMENT=48025983 DEFAULT CHARSET=latin1

The Solutions:

Below are the methods you can try. The first solution is probably the best; try the others if it doesn’t work. Don’t just copy and paste: read each method carefully and apply it to your own case.

Method 1

Please provide SHOW CREATE TABLE. I suspect you have PRIMARY KEY(id), which is not the best option for a sensor table. Instead, it should be one of these:

  • If you can guarantee that there will never be two readings from one sensor at the same time (or you simply throw away one of the readings if that happens, e.g., via INSERT IGNORE; see the sketch after this list), have the following and get rid of id:

    PRIMARY KEY (sensor_id, time)
    
  • Else it gets a bit messy; I can elaborate if you need me to.
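
As a small illustration of that first option (a sketch only; the sensor_id column and the literal values are made up, not from the original schema), INSERT IGNORE simply skips a row whose key already exists instead of raising a duplicate-key error:

    -- Hypothetical sketch: with PRIMARY KEY (sensor_id, time), a second reading
    -- for the same sensor at the same second is silently discarded.
    INSERT IGNORE INTO readings (sensor_id, time, value)
    VALUES (42, 1644092837, 101.3);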

That makes it so that all the rows needed for your particular query are “clustered” together. That way, the time for the query will be proportional to the number of rows returned, not the number of rows in the table.

I suggested sensor_id; you seem to have a 2-column way to specify a unique sensor: (device_id, sensor_type). Well, that is nearly as good. However, it takes more space in your huge “Fact” table. So you might consider having a table Sensors with columns device_id and sensor_type (etc.) that maps each pair to a short sensor_id.

The datatypes are important. If you are blindly using INT (4 bytes) for ids, you should switch to SMALLINT UNSIGNED NOT NULL (2 bytes) for some id that won’t exceed 64K. See also TINYINT and MEDIUMINT.
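
Roughly, and only as a sketch (the Sensors table, the sensor_id surrogate key, and the assumption of well under 64K device/sensor combinations are mine, not part of the original schema), the two tables could look like this:

    -- Dimension table: map (device_id, sensor_type) to a short surrogate key.
    CREATE TABLE Sensors (
        sensor_id   SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- assumes < 64K sensors
        device_id   INT UNSIGNED NOT NULL,
        sensor_type CHAR(5) CHARACTER SET ascii NOT NULL,
        PRIMARY KEY (sensor_id),
        UNIQUE KEY (device_id, sensor_type)
    ) ENGINE=InnoDB;

    -- The "Fact" table then carries only the 2-byte sensor_id per row
    -- (shown with the same name for clarity; it would replace the current readings).
    CREATE TABLE readings (
        sensor_id SMALLINT UNSIGNED NOT NULL,
        time      INT UNSIGNED NOT NULL,   -- Unix timestamp, as in the original schema
        value     FLOAT NOT NULL,
        PRIMARY KEY (sensor_id, time)      -- clusters each sensor's readings together
    ) ENGINE=InnoDB;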

Shrink the datatypes now, before the table gets any bigger. The change will require downtime, and that downtime only grows as the table grows.

Consider PARTITION BY RANGE(TO_DAYS(time)) if you intend to remove “old” data. That will make the “delete” (via DROP PARTITION) immensely faster, but it is not likely to speed up anything else. More discussion: Partition
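
Since time here is an INT Unix timestamp rather than a DATETIME, ranging on the column directly is the closest equivalent to TO_DAYS(). A hedged sketch (partition names and boundary values are arbitrary examples):

    -- Monthly partitions on the Unix-timestamp column; old data can then be
    -- dropped almost instantly with DROP PARTITION instead of a huge DELETE.
    ALTER TABLE readings
    PARTITION BY RANGE (time) (
        PARTITION p202202 VALUES LESS THAN (1646092800),  -- rows before 2022-03-01 UTC
        PARTITION p202203 VALUES LESS THAN (1648771200),  -- rows before 2022-04-01 UTC
        PARTITION pmax    VALUES LESS THAN MAXVALUE
    );

    -- Removing a month of old readings is then a near-instant metadata change:
    ALTER TABLE readings DROP PARTITION p202202;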

Do you throw 500K rows at the graphing package? Choke! I doubt the resolution of the resulting chart would look any different if you had only 1K points. To that end, build and maintain a Summary table. It should contain the sensor_id, the hour (or other resolution), and the count/min/max/sum/… of the metric. Whether you use SUM() or AVG() depends on how you define “average”: what is the average over a day if some data samples are missing?
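
A hedged sketch of such a summary table and a job to fill it, at hourly resolution (the readings_hourly name and its columns are invented; key it by whatever identifies a sensor in your schema):

    -- Sketch: one row per device/sensor/hour.
    CREATE TABLE readings_hourly (
        device_id   INT UNSIGNED NOT NULL,
        sensor_type CHAR(5) CHARACTER SET ascii NOT NULL,
        hr          INT UNSIGNED NOT NULL,   -- hour bucket: time DIV 3600
        cnt         INT UNSIGNED NOT NULL,
        min_value   FLOAT NOT NULL,
        max_value   FLOAT NOT NULL,
        sum_value   DOUBLE NOT NULL,         -- keep SUM so averages can be recomputed
        PRIMARY KEY (device_id, sensor_type, hr)
    ) ENGINE=InnoDB;

    -- Periodically roll up the newly arrived raw rows (the time range is illustrative):
    INSERT INTO readings_hourly
    SELECT device_id, sensor_type, time DIV 3600 AS hr,
           COUNT(*), MIN(value), MAX(value), SUM(value)
    FROM readings
    WHERE time >= 1646680000 AND time < 1646683600
    GROUP BY device_id, sensor_type, hr
    ON DUPLICATE KEY UPDATE
        cnt       = cnt + VALUES(cnt),
        min_value = LEAST(min_value, VALUES(min_value)),
        max_value = GREATEST(max_value, VALUES(max_value)),
        sum_value = sum_value + VALUES(sum_value);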

The fetch from the summary table will be much faster, without hurting the resolution of the graph.

Note that data can be rolled up via min = min of mins, sum = sum of sums, count = sum of counts, etc. (Even stddev can be rolled up if you keep sum and sum-of-squares.)
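
For example (again only a sketch, against the hypothetical readings_hourly table above), a daily figure can be derived from the hourly rows without touching the raw data:

    -- Daily stats from the hourly summary: min of mins, max of maxes,
    -- sum of sums, sum of counts; the average is total sum / total count.
    SELECT device_id, sensor_type, hr DIV 24 AS day,
           MIN(min_value)            AS day_min,
           MAX(max_value)            AS day_max,
           SUM(cnt)                  AS day_cnt,
           SUM(sum_value) / SUM(cnt) AS day_avg
    FROM readings_hourly
    WHERE device_id = 1234
    GROUP BY device_id, sensor_type, day;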

More: Summary Tables

I assume you have a mixture of metrics (degrees, inches, etc)? FLOAT may be optimal; it takes 4 bytes. With more regularity (or scaling), some DECIMAL might be adequate with 2-3 bytes.

Do not split into lots of tables.

Not-yet-ready-for-prime-time: Sensor data in MySQL

(Response to Comment)

Huh? You cannot put the mouse on one point out of 500K! If, on the other hand, it takes a few steps to zoom in, then do as follows. Start with the Summary table; when that plays out, fetch from the Fact table. That is, at any stage, don’t provide all 500K points. Instead, let the user get a week’s average when clicking on the original graph. Then, as the user zooms in, fetch more and more refined data.
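
A hedged sketch of that two-tier fetch (the thresholds, literal values, and readings_hourly table are illustrative, carried over from the summary-table sketch above); the application picks the source based on the width of the requested time range:

    -- Wide range (months): plot hourly aggregates from the summary table.
    SELECT hr * 3600 AS bucket_start,
           sum_value / cnt AS avg_value, min_value, max_value
    FROM readings_hourly
    WHERE device_id = 1234 AND sensor_type = 'press'   -- illustrative values
      AND hr BETWEEN 1644092837 DIV 3600 AND 1646684837 DIV 3600;

    -- Narrow range after zooming in (a few hours): fetch the raw points.
    SELECT time, value
    FROM readings
    WHERE device_id = 1234 AND sensor_type = 'press'   -- illustrative values
      AND time BETWEEN 1646660000 AND 1646684837;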


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, or CC BY-SA 4.0.
