Postgres Index Cost-Benefit Analysis Using the Cumulative Statistics System

Index Overhead

Postgres MVCC mechanism creates a new tuple version when a row is updated. This means that tuple updates, barring hot updates, have an amplifying effect on write load in addition to inserts and deletes, because all indexes that contain a pointer to the old tuple version must also get updated.

Updates impose another insidious cost on query performance: index-only scans lose efficiency. If a table’s visibility map is stale due to heavy update volume, then index-only scans must fetch heap pages to confirm tuple visibility, nullifying the benefits of the index-only scan.

Then, there is WAL logging. To support physical replication, index changes have to be individually tracked in the WAL file along with the heap change if WAL logging is enabled. This is another cost overhead that requires careful consideration of index use.

Consequently, a single tuple operation has the potential to turn into multiple write operations depending on just table indexing choices, everything else being equal. The proper implementation of indexes is, therefore, a critical concern to prevent unnecessary write amplification and ensure database performance.

Cost-Benefit Analysis

Given this overhead, measuring and analysis of index utilization is crucial. This can be achieved by performing a cost-benefit analysis. In this context, this would involve:

Leveraging the built-in Cumulative Statistics System for collecting operational data on tables and indexes
Costing index utilization across various categories
Reporting these costs in a normalized format that allows for monitoring, trending, and comparison

The Cumulative Statistics System

Postgres provides data about service activity through its Cumulative Statistics System. This system is composed of multiple views that expose internal statistics for analysis. For our analysis, there are four views that we are interested in:

pg_stat_user_tables
pg_statio_user_tables
pg_stat_user_indexes
pg_statio_user_indexes

Before we employ these statistics, however, we have to understand what they mean. This is important as the information that is conveyed depends on the context.

Postgres aggregates tuple stats internally using two counters: tuples_returned and tuples_fetched. These two counters aggregate different types of stats based on context. For tables, tuples_returned is the number of heap tuples returned by sequential scans, while tuples_fetched is the number of heap tuples returned by bitmap index scans. For indexes, tuples_returned is the number index tuples returned, while tuples_fetched is the number of heap tuples returned by simple index scans.

Here is a tabular representation of the above explanation:

+---------+---------------------------+----------------------------------+
| Context | tuples_returned           | tuples_fetched                   |
+---------+---------------------------+----------------------------------+
| table   | Heap tuples returned by   | Heap tuples returned by          |
|         | sequential scans          | bitmap index scans               |
+---------+---------------------------+----------------------------------+
| index   | Index tuples returned     | Heap tuples returned by          |
|         |                           | simple index scans               |
+---------+---------------------------+----------------------------------+

Note the distinction between index tuples and heap tuples. We can now explain the actual statistics that are based on these counters.

In the pg_stat_user_tables view, seq_tup_read column displays the number of heap tuples accessed by sequential scans:

seq_tup_read = table.tuples_returned

On the other hand, idx_tup_fetch column in the same view aggregates all heap tuple accesses through index scans of the table as such:

idx_tup_fetch = table.tuples_fetched + sum(index.tuples_fetched)

In the pg_stat_user_indexes view, idx_tup_read column displays the number of index tuples accessed:

idx_tup_read = index.tuples_returned

And idx_tup_fetched aggregates heap tuples accesses by simple index scans:

idx_tup_fetched = index.tuples_fetched

For the sake of completeness, I should also explain a few additional columns of interest in the pg_statio_user_tables and pg_statio_user_indexes views. Similar to tuple statistics, there are two internal Postgres counters that track block statistics: blocks_fetched and blocks_hit. These two counters are then leveraged in the aforementioned views as part of block stats which encode information about Postgres’ cache (i.e. shared_buffers) performance.

In the pg_statio_user_tables view, heap_blks_read and heap_blks_hit columns display heap block stats, while idx_blks_read and idx_blks_hit columns expose the total aggregate index block stats for all indexes on a table:

heap_blks_read = table.blocks_fetched - table.blocks_hit
heap_blks_hit = table.blocks_hit

idx_blks_read = sum(index.blocks_fetched - index.blocks_hit)

idx_blks_hit = sum(index.blocks_hit)

The pg_statio_user_indexes view exposes the same index block stats per index:

idx_blks_read = index.blocks_fetched - index.blocks_hit

idx_blks_hit = index.blocks_hit

In both views, idx_blks_read exposes the number of requests that hit the disk through the system page cache.

Index Costing

The index costing approach described here is not intended to be an exact cost calculation mechanism. Rather, it is a ranking system based on distinct cost categories. Values estimated for each cost category are weighted averages reported as percentages. Each category can then be used to rank tables as individual cost contributors.

Cost categories depend on underlying factors which reflect the elementary assumptions for costing purposes. While factor values are assigned arbitrarily, this does not necessarily undermine the approach; their default values are intended to provide a starting baseline, and their utility relies on the proportional relationship between them as opposed to their absolute figures.

Cost Categories

Index Read Cost

Index read cost exposes the cost of reading index blocks from disk:

index blocks read * index block read cost factor

Because it is based on observed block reads, it is a good indicator for comparing index-heavy tables and spotting cases where index access is driving index block IO cost due to cache misses.

Index Write Cost

Index write cost is an estimate of the cost of maintaining index tuples as corresponding heap tuples get inserted, updated (excluding hot updates), and deleted since each modification typically requires updates to every index on the table.

index count * (tuple insert + update - hot update + delete) * (1 / weighted average index tuples per block) * index block write cost factor

This category approximates the write amplification based on index configuration. The resulting cost highlights the tax paid for indexes in write-heavy workloads, even when those indexes are not frequently used for reads.

Index Scan Cost

Index scan cost estimates the IO cost incurred when fetching heap blocks.

heap tuples fetched by index scans * (1 / weighted average heap tuples read per block) * heap block read cost factor * heap block miss ratio

This is the metric is intended to expose heap block IO cost when caching is poor and/or query selectivity is low.

Sequential Scan Cost

This category quantifies the cost of sequential scanning, by taking the observed number of tuples returned by sequential scans and converting that activity into estimated heap block reads under the table’s observed cache behavior.

heap tuples returned by sequential scans * (1 / weighted average heap tuples read per block) * heap block read cost factor * heap block miss ratio

This can be interpreted as a workload-driven measure of how much sequential access is contributing to heap block IO cost. This makes it useful for identifying tables where sequential scans are a major contributor to cost, especially when the cache hit ratio is low and sequential reads are frequently going to disk.

Index Gain

Index gain category estimates how much sequential scan heap read cost is avoided because of the availability of indexes. This is a hypothetical scenario where each index scan operation is counted as a sequential scan in the absence of indexes. The resulting total sequential scan cost is considered a gain achieved by the utilization of existing indexes.

sequential scan operations * table page count * heap block read cost factor * heap block miss ratio adjusted for table size

The miss ratio adjustment reflects the fact that large heap accesses are more likely to miss the cache under sequential scans given the same memory limits. This means that the cost of sequential scans would be higher in the hypothetical scenario. To extrapolate this additional cost for larger tables, a power function reduces the observed hit ratio (base) by using the table page count (exponent) as a proxy.

Implementation

See index_costing.sql file at https://github.com/buraks78/pg-sql-toolbox for details.

Here is an example output that is transposed for readability:

+------------------------+-----------------+
|relid                   |24754   |24757   |
+------------------------+-----------------+
|relname                 |sbtest1 |sbtest2 |
+------------------------+-----------------+
|schemaname              |public  |public  |
+------------------------+-----------------+
|sti_idx_tup_per_blk     |344.83  |350.88  |
+------------------------+-----------------+
|sti_idx_tup_read_per_blk|9.69    |10.39   |
+------------------------+-----------------+
|sti_idx_cache_hit       |0.999983|0.999982|
+------------------------+-----------------+
|s_heap_tup_per_blk      |22.7    |22.73   |
+------------------------+-----------------+
|s_heap_tup_read_per_blk |1.1     |1.1     |
+------------------------+-----------------+
|s_heap_cache_hit        |0.999971|0.999971|
+------------------------+-----------------+
|w_seq_scan              |60      |40      |
+------------------------+-----------------+
|w_seq_tup_read          |66.67   |33.33   |
+------------------------+-----------------+
|w_idx_scan              |50.09   |49.91   |
+------------------------+-----------------+
|w_idx_tup_fetch         |49.95   |50.05   |
+------------------------+-----------------+
|w_ins                   |50.19   |49.81   |
+------------------------+-----------------+
|w_upd                   |49.87   |50.13   |
+------------------------+-----------------+
|w_del                   |50.23   |49.77   |
+------------------------+-----------------+
|w_hot_upd               |25.2    |74.8    |
+------------------------+-----------------+
|w_live                  |50      |50      |
+------------------------+-----------------+
|w_dead                  |0       |0       |
+------------------------+-----------------+
|w_tup                   |50      |50      |
+------------------------+-----------------+
|w_reltuples             |50      |50      |
+------------------------+-----------------+
|w_relpages              |50.03   |49.97   |
+------------------------+-----------------+
|w_heap_blks_read        |50.03   |49.97   |
+------------------------+-----------------+
|w_heap_blks_hit         |50.02   |49.98   |
+------------------------+-----------------+
|w_idx_blks_read         |91.49   |8.51    |
+------------------------+-----------------+
|w_idx_blks_hit          |55.22   |44.78   |
+------------------------+-----------------+
|w_sti_idx_scan          |50.09   |49.91   |
+------------------------+-----------------+
|w_sti_idx_tup_read      |50.02   |49.98   |
+------------------------+-----------------+
|w_sti_idx_tup_fetch     |49.98   |50.02   |
+------------------------+-----------------+
|w_sti_idx_blks_read     |91.49   |8.51    |
+------------------------+-----------------+
|w_sti_idx_blks_hit      |55.22   |44.78   |
+------------------------+-----------------+
|w_sti_reltuples         |60      |40      |
+------------------------+-----------------+
|w_sti_relpages          |91.49   |8.51    |
+------------------------+-----------------+
|w_c_idx_read            |91.49   |8.51    |
+------------------------+-----------------+
|w_c_idx_write           |64.7    |35.3    |
+------------------------+-----------------+
|w_c_idx_scan            |49.95   |50.05   |
+------------------------+-----------------+
|w_c_seq_scan            |66.46   |33.54   |
+------------------------+-----------------+
|w_c_idx_gain            |50.13   |49.87   |
+------------------------+-----------------+

Relative weight fields are encoded with a w_ prefix. There are also a few additional summary fields for debugging purposes. See the github repo for further details.

Limitations

This costing approach depends on up-to-date statistics. Ensure that tables are analyzed properly since Postgres statistics can get stale under heavy write load.

The default values for the cost factors are just there to provide a baseline. Characteristics of each database, and that includes the behavior of the whole ecosystem around the database, will be unique. Cost factors should be adjusted accordingly. Specifically, shared buffer misses do not directly translate into disk accesses; system page cache behavior should be a critical consideration here. Similarly, there is currently no tablespace support. Tablespaces with different IO characteristics would depend on additional logic with their own sets of cost factors.

All indexes are treated equal. There is no difference between partial vs full indexes, or btree vs hash indexes.

There is an underlying assumption for index gain estimates: each hypothetical sequential scan operation is expected to be a full-table scan. This is not necessarily realistic. Therefore, this metric should be treated as a heuristic, an upper-bound cost indicator of where indexing is preventing expensive full-table reads.

WAL logging cost category does not exist. Considering how significant overhead can get to generate additional WAL records, I find the lack of relation specific WAL accounting interesting. Tracking record count and byte size stats for each relation, provided that this could be achieved without a significant performance hit in this critical path, would have been beneficial to extrapolate WAL logging cost with the help of the pg_stat_wal table. At this time, the write factor may be adjusted to reflect this cost heuristically in the index write cost category instead.

Logic Exception

Search This Blog