Rob Ulejczyk
I'm using the `world` database, as always. Get it here:
http://pgfoundry.org/projects/dbsamples/
And then:
psql -d postgres
create database world;
\q
psql -d world -f world.sql
psql world
\x auto
This is entirely optional, since the queries and examples I'm going to cover will be very basic.
I'm running on postgres 9.3 with default settings.
select version();
version
-----------------
PostgreSQL 9.3.7
world=# explain select * from city;
QUERY PLAN
---------------------------------------------------------
Seq Scan on city (cost=0.00..72.79 rows=4079 width=31)
The cost is shown as a range:
the first number is the startup cost,
the second is the total cost to fetch all rows.
Default settings:
# - Planner Cost Constants -
#seq_page_cost = 1.0 # measured on an arbitrary scale
#random_page_cost = 4.0 # same scale as above
#cpu_tuple_cost = 0.01 # same scale as above
#cpu_index_tuple_cost = 0.005 # same scale as above
#cpu_operator_cost = 0.0025 # same scale as above
With SSD disks random page reads are much closer in cost to sequential ones,
so consider lowering random_page_cost if that's your setup.
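For example, to experiment in a single session (the 1.5 here is only an illustrative value, not a recommendation):
set random_page_cost = 1.5;
-- and back to the configured value when done:
reset random_page_cost;
To make it permanent, change random_page_cost in postgresql.conf and reload.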
Seq Scan on city (cost=0.00..72.79 rows=4079 width=31)
Where did this number come from?
(disk pages reads * seq_page_cost) + (number of rows * cpu_tuple_cost)
The number of pages and tuples can be found in the pg_class table:
select reltuples, relpages from pg_class where relname = 'city';
reltuples | relpages
-----------+----------
4079 | 32
Seq Scan on city (cost=0.00..72.79 rows=4079 width=31)
(disk pages reads * seq_page_cost) + (number of rows * cpu_tuple_cost)
(32 * 1) + (4079 * 0.01) = 32 + 40.79 = 72.79
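This estimate can be reproduced straight from the catalog; a small sketch that reads the relevant cost settings and the pg_class numbers directly:
-- seq scan estimate: pages * seq_page_cost + tuples * cpu_tuple_cost
select relpages * current_setting('seq_page_cost')::numeric
     + reltuples::numeric * current_setting('cpu_tuple_cost')::numeric as seq_scan_cost
from pg_class
where relname = 'city';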
This is as simple as it gets. With an even slightly more complicated query that involves an index, the formula changes.
world=# SET enable_seqscan = OFF;
SET
world=# explain select * from city where id > 1;
QUERY PLAN
----------------------------------------------------------------------------
Index Scan using city_pkey on city (cost=0.28..162.66 rows=4079 width=31)
Index Cond: (id > 1)
(2 rows)
world=# SET enable_seqscan = ON;
Index pages are read sequentially, while the pages with the actual data are read in random order.
There is also a predicate that each row needs to be evaluated against.
number of pages in index (14) * seq_page_cost (1) +
tuples (4079) * cpu_index_tuple_cost (0.005) +
random_page_cost (4) * number of pages in table (32)
Information about pages and tuples in an index can be found in pg_class just like for a normal table:
world=# select relpages, reltuples from pg_class where relname = 'city_pkey';
relpages | reltuples
----------+-----------
14 | 4079
Index Scan using city_pkey on city (cost=0.28..162.66 rows=4079 width=31)
14 + 20.395 + 128 = 162.40
(pretty close, phew)
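The same back-of-the-envelope check can be written in SQL. Note this mirrors the rough formula above, not the planner's exact model, which also charges cpu_operator_cost for the predicate and scales with selectivity:
-- approximate index scan estimate for "where id > 1"
select i.relpages * current_setting('seq_page_cost')::numeric
     + i.reltuples::numeric * current_setting('cpu_index_tuple_cost')::numeric
     + t.relpages * current_setting('random_page_cost')::numeric as approx_index_scan_cost
from pg_class i, pg_class t
where i.relname = 'city_pkey'
  and t.relname = 'city';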
Proper index use can save time compared to a sequential scan.
world=# explain select * from city where id > 3000;
QUERY PLAN
---------------------------------------------------------------------------
Index Scan using city_pkey on city (cost=0.28..47.15 rows=1078 width=31)
Index Cond: (id > 3000)
Plans composed of several nodes may or may not require full scans upfront.
world=# explain select * from city c
inner join country cc on c.countrycode = cc.code;
QUERY PLAN
--------------------------------------------------------------------------
Hash Join (cost=10.38..171.25 rows=4079 width=144)
Hash Cond: (c.countrycode = cc.code)
-> Seq Scan on city c (cost=0.00..104.79 rows=4079 width=31)
-> Hash (cost=7.39..7.39 rows=239 width=113)
-> Seq Scan on country cc (cost=0.00..7.39 rows=239 width=113)
(5 rows)
The Hash node requires a full Seq Scan of country upfront, and the startup cost shows it.
The Hash Join can then probe the country hash as it reads city.
world=# explain analyze select * from city c
inner join country cc on c.countrycode = cc.code limit 1000;
QUERY PLAN
--------------------------------------------------------------------------
Limit (cost=10.38..49.82 rows=1000 width=144)
(actual time=0.364..2.393 rows=1000 loops=1)
-> Hash Join (cost=10.38..171.25 rows=4079 width=144)
(actual time=0.362..2.116 rows=1000 loops=1)
Hash Cond: (c.countrycode = cc.code)
-> Seq Scan on city c (cost=0.00..104.79 rows=4079 width=31)
(actual time=0.010..0.250 rows=1000 loops=1)
-> Hash (cost=7.39..7.39 rows=239 width=113)
(actual time=0.328..0.328 rows=239 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 36kB
-> Seq Scan on country cc (cost=0.00..7.39 rows=239 width=113)
(actual time=0.009..0.123 rows=239 loops=1)
Total runtime: 2.617 ms
The Hash Join operation was terminated as soon as it had fetched 1000 rows, even though the estimate was 4079.
Not every kind of operation can be terminated early.
Postgres knows when a plan is expected to return only a fraction of the rows and adjusts its choice accordingly.
world=# explain select * from city c inner join country cc on c.countrycode = cc.code limit 1;
QUERY PLAN
---------------------------------------------------------------------------------------------
Limit (cost=0.14..0.35 rows=1 width=144)
-> Nested Loop (cost=0.14..846.61 rows=4079 width=144)
-> Seq Scan on city c (cost=0.00..104.79 rows=4079 width=31)
-> Index Scan using country_pkey on country cc (cost=0.14..0.17 rows=1 width=113)
Index Cond: (code = c.countrycode)
Because we need only 1 row, Postgres chose a plan that is very expensive overall, but with no startup cost it turns out to be the best one for a single row.
world=# explain select * from city;
QUERY PLAN
---------------------------------------------------------
Seq Scan on city (cost=0.00..72.79 rows=4079 width=31)
The total number of rows can be found in the pg_class table.
world=# explain select * from city where id < 80;
QUERY PLAN
------------------------------------------------------------------------
Index Scan using city_pkey on city (cost=0.28..9.68 rows=80 width=31)
Index Cond: (id < 80)
How does Postgres know there are 80 rows with an id less than 80 without scanning the table?
And why is it wrong? We know there are only 79 such records.
To calculate this number it uses statistics from the `pg_stats` view.
world=# select * from pg_stats where tablename='city' and attname='id';
schemaname | public
tablename | city
attname | id
inherited | f
null_frac | 0
avg_width | 4
n_distinct | -1
most_common_vals |
most_common_freqs |
histogram_bounds | {1,41,82,123,164,204,245,286,327,368,408,449,490,...}
correlation | 1
most_common_elems |
most_common_elem_freqs |
elem_count_histogram |
The important one in this case is `histogram_bounds`. The histogram divides the data into equal-frequency buckets.
histogram_bounds | {1,41,82,123,164,204,245,286,327,368,408,449,490,...}
The query planner looks for the bucket that matches our condition.
That is the 2nd bucket (41-82), so the total number of matching rows will be the whole first bucket plus part of the second bucket.
Since all the buckets are equal, the bucket size is the total number of tuples divided by the number of buckets:
4079 / 100 = 40.79
Index Scan using city_pkey on city (cost=0.28..9.68 rows=80 width=31)
Now we need to estimate the number of rows matching our condition in the 2nd bucket. Assuming a linear distribution within the bucket, this is:
(80 - 41) / (82 - 41) = 39 / 41 = 0.95121
So about 95% of the items in that bucket match our condition, which makes:
40.79 * 0.95121 = 38.80
Then add the previous bucket to get the final number:
40.79 + 38.80 = 79.59, which rounds to 80
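The whole estimate can be written out as a single expression; a minimal sketch, assuming the default 100 buckets and the bucket boundaries (41 and 82) shown above:
select round( 4079.0 / 100                            -- the whole first bucket
            + 4079.0 / 100 * (80 - 41) / (82 - 41)    -- the matching share of the second bucket
            ) as estimated_rows;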
What about non-unique values?
world=# explain select * from city where countrycode = 'USA';
QUERY PLAN
--------------------------------------------------------
Seq Scan on city (cost=0.00..82.99 rows=274 width=31)
Filter: (countrycode = 'USA'::bpchar)
First, let's see the statistics for the countrycode column:
world=# select * from pg_stats where tablename='city' and attname='countrycode';
schemaname | public
tablename | city
attname | countrycode
inherited | f
null_frac | 0
avg_width | 4
n_distinct | 232
most_common_vals | {CHN,IND,USA,BRA,JPN,RUS,MEX,PHL,DEU,IDN,GBR,KOR,...}
most_common_freqs | {0.0889924,0.0835989,0.0671733,0.0612895,0.0607992,...}
histogram_bounds | {ABW,AGO,ARE,ARM,AUS,AUS,AUT,BEL,BEN,BGR,BGR,BLR,...}
correlation | 0.668009
most_common_elems |
most_common_elem_freqs |
elem_count_histogram |
most_common_vals | {CHN,IND,USA,BRA,JPN,RUS,MEX,PHL,DEU,IDN,GBR,KOR,...}
most_common_freqs | {0.0889924,0.0835989,0.0671733,0.0612895,0.0607992,...}
histogram_bounds | {ABW,AGO,ARE,ARM,AUS,AUS,AUT,BEL,BEN,BGR,BGR,BLR,...}
On top of the histogram we have most_common_vals and most_common_freqs.
We are looking for USA, and it is on the list of most common values, so:
4079 * 0.0671733 = 273.99 = 274
Seq Scan on city (cost=0.00..82.99 rows=274 width=31)
Filter: (countrycode = 'USA'::bpchar)
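The same number can be reproduced from the catalog; a sketch that hardcodes 0.0671733, the most_common_freqs entry for USA shown above:
select round(reltuples::numeric * 0.0671733) as estimated_rows
from pg_class
where relname = 'city';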
What if the value is not on the most common list?
world=# explain select * from city where countrycode = 'BEL';
QUERY PLAN
----------------------------------------------------------------------------------
Index Scan using idx_city_countrycode on city (cost=0.28..9.65 rows=4 width=31)
Index Cond: (countrycode = 'BEL'::bpchar)
Index Scan using idx_city_countrycode on city (cost=0.28..9.65 rows=4 width=31)
Subtract the sum of the frequencies of the most common values from 1:
1 - 0.8141700300000001 = 0.1858299699999999
Calculate the number of distinct values not covered by the most common list (232 distinct values minus the 36 entries in most_common_vals):
232 - 36 = 196
Divide these numbers to get the fraction of rows per remaining value:
0.1858299699999999 / 196 = 0.0009481120918367342
Multiply it by the total number of tuples:
4079 * 0.0009481120918367342 = 3.86, which rounds to 4
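The same arithmetic can be pulled straight out of pg_stats. A sketch, assuming n_distinct is stored as a positive count (it is here; a negative value means a fraction of the row count and would need extra handling):
select round( c.reltuples::numeric
            * (1 - mcv.freq_sum)
            / (s.n_distinct - mcv.n_mcv) ) as estimated_rows
from pg_stats s
join pg_class c on c.relname = s.tablename,
lateral (select sum(f)::numeric as freq_sum, count(*) as n_mcv
         from unnest(s.most_common_freqs) f) mcv
where s.tablename = 'city'
  and s.attname = 'countrycode';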
For non-unique columns with range predicates Postgres uses both the most common values and the histogram.
These calculations are described in the Postgres documentation.
What about multiple conditions?
world=# explain select * from city where id < 80 and countrycode = 'BEL';
QUERY PLAN
----------------------------------------------------------------------------------
Index Scan using idx_city_countrycode on city (cost=0.28..9.66 rows=1 width=31)
Index Cond: (countrycode = 'BEL'::bpchar)
Filter: (id < 80)
The query planner assumes that the two conditions are independent, so the individual selectivities can simply be multiplied:
4079 * 0.0009481120918367342 * 0.0195 = 0.07, rounded up to 1 (the estimate never goes to zero)
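The same arithmetic in SQL, with the clamp that keeps the estimate from ever reaching zero (the selectivity values are the ones derived earlier):
select greatest(1, round(4079 * 0.0009481120918367342 * 0.0195)) as estimated_rows;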
Statistics are gathered by analyze.
Basic information about vacuum and analyze runs can be found in pg_stat_all_tables:
world=# select last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
from pg_stat_all_tables where relname = 'city';
-[ RECORD 1 ]----+------------------------------
last_vacuum | 2015-06-07 17:25:35.239145+02
last_autovacuum | 2015-06-07 17:17:55.872564+02
last_analyze | 2015-06-07 17:29:42.195939+02
last_autoanalyze | 2015-06-07 17:17:55.997271+02
Statistics are based on a random sample of the data.
The sampling is controlled via `default_statistics_target` at the system, table and column level.
The target determines the size of the histogram and, indirectly, the number of rows taken for the sample.
The default is 100, and the number of sampled rows is `statistics_target * 300`, which gives `30 000` rows.
/*--------------------
* The following choice of minrows is based on the paper
* "Random sampling for histogram construction: how much is enough?"
* by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
* Proceedings of ACM SIGMOD International Conference on Management
* of Data, 1998, Pages 436-447. Their Corollary 1 to Theorem 5
* says that for table size n, histogram size k, maximum relative
* error in bin size f, and error probability gamma, the minimum
* random sample size is
* r = 4 * k * ln(2*n/gamma) / f^2
* Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
* r = 305.82 * k
* Note that because of the log function, the dependence on n is
* quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
* bin size error with probability 0.99. So there's no real need to
* scale for n, which is a good thing because we don't necessarily
* know it at this point.
*--------------------
*/
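The target can be raised where the default produces poor estimates; a sketch of the knobs involved (the 500 and 200 are just example values):
-- per column, followed by a statistics refresh:
alter table city alter column countrycode set statistics 500;
analyze city;
-- or for the whole session:
set default_statistics_target = 200;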
Width tells us the average size (in bytes) of a single tuple in the table.
Seq Scan on city (cost=0.00..72.79 rows=4079 width=31)
It's taken from `pg_stats` and is calculated as the sum of the per-column averages.
select sum(avg_width) from pg_stats where tablename = 'city';
sum
-----
31
Postgres stores data in pages (blocks), each 8 KB in size by default (this can be changed at compile time).
It fetches whole pages from disk, so a single page read can return multiple tuples.
This is great if the tuples we are looking for happen to be on the same page, but if we query for 10 tuples that live on 10 different pages it will take 10 expensive reads.
With sequential reads costed at a quarter of random page reads, it's very common that reading everything sequentially is faster than doing an index scan.
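You can see this layout directly: block_size reports the page size, and the hidden ctid system column shows which (page, tuple) slot each row currently occupies:
show block_size;
select ctid, id, name from city limit 5;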
Indexes are always sorted and are always read sequentially.
world=# explain select * from city where id > 3000;
QUERY PLAN
---------------------------------------------------------------------------
Index Scan using city_pkey on city (cost=0.28..47.15 rows=1078 width=31)
Index Cond: (id > 3000)
How can Postgres guess how many random reads it will take?
It uses statistics again.
The correlation column describes, for each column, the statistical correlation between the physical row ordering and the logical ordering of the column's values.
world=# select correlation from pg_stats where tablename='city' and attname='countrycode';
correlation
-------------
0.668009
It ranges from -1 to +1.
It relies on the < operator and will be null for columns that don't support it.
As the value gets closer to +1 or -1, the related index becomes more likely to be used, since matching tuples are physically close to each other.
When data is inserted it is generally appended at the end of the table. This works well if we mostly read new data.
However, when data is updated the new row version is also appended, so updates tend to fragment the data on disk, making it more expensive to read.
The table can be reorganised using the CLUSTER command.
This has to be done manually and locks the table for the duration, so you don't want to do it on a busy live table.
world=# select correlation from pg_stats where tablename='city'
and attname='countrycode';
correlation
-------------
0.668009
CLUSTER requires an index.
world=# create index idx_city_countrycode on city (countrycode);
CREATE INDEX
world=# cluster verbose city using idx_city_countrycode;
INFO: clustering "public.city" using index scan on "idx_city_countrycode"
INFO: "city": found 0 removable, 4079 nonremovable row versions in 32 pages
DETAIL: 0 dead row versions cannot be removed yet.
CPU 0.00s/0.00u sec elapsed 0.04 sec.
CLUSTER
world=# select correlation from pg_stats where tablename='city'
and attname='countrycode';
correlation
-------------
0.668009
No change?
The statistics need to be refreshed using `analyze`:
world=# analyze city;
ANALYZE
world=# select correlation from pg_stats where tablename='city'
and attname='countrycode';
correlation
-------------
1
Now the data in the table is properly sorted, but new and updated tuples will still be appended at the end.
CLUSTER needs to be re-run manually from time to time to keep the data in the right order.
It gives some idea about memory usage.
Reading from disk is really expensive.
To avoid expensive disk reads Postgres uses the `shared buffer cache`, a portion of memory dedicated to caching requested pages.
The shared buffer is almost always full, so when a new page needs to be stored, another page has to be evicted.
Postgres keeps statistics on how often pages are used and does its best to evict the right ones (we have no direct control over it).
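If the pg_buffercache contrib extension is available, you can at least peek at which of a table's pages are currently sitting in the shared buffers; a rough sketch:
create extension if not exists pg_buffercache;
select count(*) as cached_pages,
       sum(case when isdirty then 1 else 0 end) as dirty_pages
from pg_buffercache
where relfilenode = pg_relation_filenode('city')
  and reldatabase = (select oid from pg_database where datname = current_database());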
world=# explain (buffers, analyze) select * from city;
QUERY PLAN
-------------------------------------------------------------------
Seq Scan on city (cost=0.00..72.79 rows=4079 width=31)
(actual time=13.135..16.141 rows=4079 loops=1)
Buffers: shared read=32
Total runtime: 17.421 ms
world=# explain (buffers, analyze) select * from city;
QUERY PLAN
-------------------------------------------------------------------
Seq Scan on city (cost=0.00..72.79 rows=4079 width=31)
(actual time=0.009..0.476 rows=4079 loops=1)
Buffers: shared hit=32
Total runtime: 0.778 ms
That's more than a 20x difference for the same sequential scan.
There is also a file system cache, controlled by the operating system, sitting between the `shared buffer` and physical disk access.
Postgres caching duplicates what the operating system is already doing, but it's still worth having because Postgres has better knowledge of its data access patterns, while at the same time it has no knowledge of the underlying file system layout.
Updates produce dirty pages.
Unlike clean pages these can't simply be discarded; they need to be written to disk first.
Dirty pages are written out mostly by a background process, the checkpoint.
Checkpoints could produce huge IO spikes, but they are designed to spread the writes over time so the impact on performance is minimal.
By default a checkpoint is spread over half the time until the next one is scheduled.
The operating system still caches writes to disk, so there may be a small delay between a Postgres flush and the physical write occurring.
The amount of writing during a checkpoint partially depends on the size of the shared buffers.
This can potentially limit how far the memory cache scales under certain use cases.
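The relevant postgresql.conf knobs, with their 9.3 defaults; checkpoint_completion_target = 0.5 is the "spread over half the time" behaviour mentioned above:
# - Checkpoints -
#checkpoint_segments = 3 # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min # range 30s-1h
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0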
Postgres keeps track of IO stats.
world=# select * from pg_statio_user_tables where relname='city';
-[ RECORD 1 ]---+-------
relid | 360517
schemaname | public
relname | city
heap_blks_read | 32
heap_blks_hit | 98
idx_blks_read | 4
idx_blks_hit | 2
toast_blks_read | 0
toast_blks_hit | 0
tidx_blks_read | 0
tidx_blks_hit | 0
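A number that is often more useful than the raw counters is the heap cache hit ratio, which can be derived directly from those columns:
select round(100.0 * heap_blks_hit / nullif(heap_blks_hit + heap_blks_read, 0), 1) as heap_hit_pct
from pg_statio_user_tables
where relname = 'city';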
Proper caching and disk organization can have a huge impact on speed, but it's difficult to predict and almost impossible to replicate locally.
It can take a lot of time to discover the data access patterns before applying any fixes.
There is much more to be found in the documentation.