The FreeBSD Diary

Providing practical examples since 1998

PostgreSQL - analyzing a query to improve speed 1 March 2003

PostgreSQL is the leading open source database. Many people consider PostgreSQL to be a competitor to mySQL. People who say that don't know either product. PostgreSQL isn't competing with mySQL; it's competing with Oracle. Yes, it's being quiet about it. But the users aren't. They were Oracle users and they're coming to PostgreSQL.

Of course, the skeptics will ask for proof of my claims. I don't have any. I have anecdotal evidence, and third-hand information, such as Bruce Momjian's recent overseas trip report.

All the PostgreSQL versus mySQL noise is moot. What people should be concentrating on is PostgreSQL versus Oracle.

Put that in your DBMS and analyse it.

What's coming up

Anyone who has ever written a complex query has seen that despite the best of designs, they aren't always fast. Over the past day or two I've been struggling with the slowest part of FreshPorts. I think I've come up with a relatively simple solution. The trip to that solution was rather interesting. I will reproduce it below and show you how it went from a query which took many seconds to one which takes less than 0.01 seconds.

The first part of the article gives a brief introduction to FreshPorts, the underlying database tables, and how they relate to each other. Then we'll get into the queries, follow the analysis of a query, and finally arrive at the optimized result.

This article should be useful to anyone using PostgreSQL or designing databases. It should also prove useful to those who wonder how databases work and what it takes to make them run quickly. It might also provide a venue for my own education: people reading this article might know a better way to do things. Information is a two-way street.

You might want to read PostgreSQL Database Performance Tuning first. It gives you a few ideas about database tuning, but not much on tuning a query. PostgreSQL Hardware Performance Tuning might also interest you.

Last of all, it should show people how PostgreSQL can do this but mySQL can't... (that's not the goal of the article, and I'm not going to attempt to prove it... please just accept that mySQL can't do this type of thing...)

Explain it to me...

I'll be making use of the PostgreSQL EXPLAIN command. That's found in the Reference Manual of the PostgreSQL online documentation.

If what appears below doesn't make sense, try reading the above references first.

EXPLAIN shows the execution plan of a statement. You get to see which tables will be used, which indexes are involved, and the number of rows involved. Execution plans make use of statistics gathered by ANALYZE (commonly run as VACUUM ANALYZE). If you aren't getting the results you want, try running VACUUM ANALYZE over the table[s] in question.

EXPLAIN ANALYZE goes one step further and actually executes the command. This provides the actual run time for the various parts of the query. Be warned that EXPLAIN ANALYZE will actually delete, update, or insert any data involved in the query. Read the documentation for details.
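One standard way to avoid that side effect (my sketch, not something from the FreshPorts code; the date is made up) is to wrap the statement in a transaction and roll it back once you have the timings:

```sql
-- EXPLAIN ANALYZE really executes the statement, so time a
-- data-modifying statement inside a transaction, then undo it.
BEGIN;
EXPLAIN ANALYZE
 DELETE FROM commit_log
  WHERE commit_date < '2001-01-01';
ROLLBACK;
```

The plan and timings are printed as usual, but the ROLLBACK discards the deletions.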

Here are a few things to look for when reading the output of EXPLAIN:

  • Look for high costs
  • Look for sequential scans
  • Look for large numbers of rows
  • Look for index scans
High costs are an indication of something which takes a lot of setup. Sequential scans indicate a scan of the entire table; perhaps an index is required. Large numbers of rows might indicate a place where you can cut down the result set earlier in the query. Index scans are usually a good thing, as they tend to speed up the query.

Items to note

  • For small tables, an index often costs more than it is worth. Indexes are only really useful for larger data sets. Don't be upset if the query runs quickly but doesn't use an index.
  • If you think an index should be used, but it isn't being touched, try a VACUUM ANALYZE on that table. Then test the query again.
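For example, to refresh the planner's statistics for the tables used later in this article (nothing more than the standard command, run once per table):

```sql
-- Gather fresh planner statistics for the tables this article queries,
-- then re-run EXPLAIN to see whether the plan has changed.
VACUUM ANALYZE commit_log;
VACUUM ANALYZE commit_log_ports;
```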
FreshPorts - what is it?

FreshPorts (and its sister site, FreshSource) captures and stores the commits to a source repository. It can work with any source repository, but at present it is restricted to the FreeBSD source tree. FreshPorts concentrates on the ports subtree, while FreshSource captures and records the entire tree.

By its nature, FreshPorts is the more complex of the two sites. It was also the first. Users create watch lists of whatever ports they wish to keep up to date with. When a commit occurs against those ports, the user is sent an email notification. Users can also view all commits recorded against a given port. Since a port can consist of several dozen files, this information is very useful when tracking updates.

FreshPorts - the tables and sample data

The most important FreshPorts table is the commit_log table.

create table commit_log
(
    id               serial    not null,
    commit_date      timestamp not null,
    committer        text      not null,
    primary key (id)
);

create index commit_log_commit_date on commit_log (commit_date);
The actual table contains more columns but the above is sufficient for demonstration purposes.

Each new commit against the FreeBSD source repository is loaded into this table. FreshPorts is only interested in commits against the ports subtree. However, FreshSource and FreshPorts use the same database. A second table is used to indicate commits against particular ports:

create table commit_log_ports
(
    commit_log_id  integer     not null,
    port_id        integer     not null,
    primary key (commit_log_id, port_id)
);

alter table commit_log_ports
    add foreign key (commit_log_id)
       references commit_log (id) on update cascade on delete cascade;

alter table commit_log_ports
    add foreign key (port_id)
       references ports (id) on update cascade on delete cascade;

I will only mention the ports table so that you know it exists and is referred to by many other tables. The main problem at hand concerns the commit_log and commit_log_ports tables.

You can download the sample tables and data (859KB) used in this article should you wish to play with the results. Please note that I have only supplied the columns necessary to perform the queries indicated below. As such, your results will vary. NOTE: the examples were run on version 7.3.

The tarball contains a .sql file which can be loaded into your database with the following command:

createdb test
psql test < postgresql-analyze.sql
Feedback is appreciated.
So what's the problem?

FreshPorts, by default, displays the last 100 commits. If you didn't think about it, you'd think this was the answer:

EXPLAIN ANALYZE
  SELECT *
    FROM commit_log
ORDER BY commit_date
   LIMIT 100;

Limit  (cost=0.00..48.12 rows=100 width=252) (actual time=0.06..2.40 rows=100 loops=1)
   ->  Index Scan using commit_log_commit_date on commit_log  (cost=0.00..32975.88 rows=68533 width=252) (actual time=0.05..1.59 rows=101 loops=1)
Total runtime: 2.98 msec
(3 rows)

As you can see, it's rather fast to get the last 100 commits. Just under 3 ms. That's 0.003 seconds. That's damn fast in my book. Mind you, there are only 68,533 rows in that table.

Note that the (3 rows) refers to the EXPLAIN ANALYZE output, not the number of rows in the result set. Compare that to the 100 which appears in the Limit line of the output.

Unfortunately, that's the last 100 commits. Not the last 100 commits against the ports subtree. For that, we have to use the commit_log_ports table.

EXPLAIN ANALYZE
  SELECT distinct commit_log.id, commit_log.commit_date
    FROM commit_log, commit_log_ports
   WHERE commit_log.id = commit_log_ports.commit_log_id
ORDER BY commit_log.commit_date
   LIMIT 100;

Limit  (cost=13456.18..13463.68 rows=100 width=16) (actual time=4190.89..4193.10 rows=100 loops=1)
   ->  Unique  (cost=13456.18..13938.43 rows=6430 width=16) (actual time=4190.88..4192.31 rows=101 loops=1)
         ->  Sort  (cost=13456.18..13616.93 rows=64300 width=16) (actual time=4190.87..4191.39 rows=113 loops=1)
               Sort Key: commit_log.commit_date, commit_log.id
               ->  Hash Join  (cost=4103.81..7662.06 rows=64300 width=16) (actual time=1065.79..3150.16 rows=64300 loops=1)
                     Hash Cond: ("outer".commit_log_id = "inner".id)
                     ->  Seq Scan on commit_log_ports  (cost=0.00..1048.00 rows=64300 width=4) (actual time=0.11..434.34 rows=64300 loops=1)
                     ->  Hash  (cost=3032.33..3032.33 rows=68533 width=12) (actual time=1065.08..1065.08 rows=0 loops=1)
                           ->  Seq Scan on commit_log  (cost=0.00..3032.33 rows=68533 width=12) (actual time=0.10..645.86 rows=68544 loops=1)
Total runtime: 4195.82 msec
(10 rows)

OUCH! That's terrible. That's four seconds! Why did it take so long?

The query joins all rows in the two tables, sorts them, and only then keeps the first one hundred. That's quite inefficient, and it's not the approach we want. We can make things faster.

Don't search everything; provide a starting point

In this example, we're going to provide a starting place for the query. We aren't going to do the JOIN on the whole table. We are going to limit it to the most recent commits. Let's assume that in a given week, there are at least 100 commits against the ports tree. We will do this by adding the following condition to the WHERE clause:

commit_log.commit_date > now() - interval '1 week'
The new query is:
EXPLAIN ANALYZE
  SELECT distinct commit_log.id, commit_log.commit_date
    FROM commit_log, commit_log_ports
   WHERE commit_log.id = commit_log_ports.commit_log_id
     AND commit_log.commit_date > now() - interval '1 week'
ORDER BY commit_log.commit_date
   LIMIT 100;

Limit  (cost=7425.72..7433.22 rows=100 width=16) (actual time=1238.13..1238.62 rows=21 loops=1)
   ->  Unique  (cost=7425.72..7586.47 rows=2143 width=16) (actual time=1238.12..1238.44 rows=21 loops=1)
         ->  Sort  (cost=7425.72..7479.31 rows=21433 width=16) (actual time=1238.11..1238.22 rows=28 loops=1)
               Sort Key: commit_log.commit_date, commit_log.id
               ->  Hash Join  (cost=3603.44..5883.85 rows=21433 width=16) (actual time=1175.11..1237.71 rows=28 loops=1)
                     Hash Cond: ("outer".commit_log_id = "inner".id)
                     ->  Seq Scan on commit_log_ports  (cost=0.00..1048.00 rows=64300 width=4) (actual time=0.10..420.66 rows=64300 loops=1)
                     ->  Hash  (cost=3546.33..3546.33 rows=22844 width=12) (actual time=419.03..419.03 rows=0 loops=1)
                           ->  Seq Scan on commit_log  (cost=0.00..3546.33 rows=22844 width=12) (actual time=0.19..418.47 rows=57 loops=1)
                                 Filter: (commit_date > (now() - '7 days'::interval))
Total runtime: 1238.96 msec
(11 rows)

Now the query takes just over 1 second. That's not bad, but we can do better. Let's now assume we know how far back in time the 100th previous port commit was: it was on 18 February 2003, at 16:02:15.

EXPLAIN ANALYZE
  SELECT distinct commit_log.id, commit_log.commit_date
    FROM commit_log, commit_log_ports
   WHERE commit_log.id = commit_log_ports.commit_log_id
     AND commit_log.commit_date > '2003-02-18 16:02:15.133047-05'
ORDER BY commit_log.commit_date
   LIMIT 100;

Limit  (cost=1562.07..1563.65 rows=21 width=16) (actual time=737.54..739.94 rows=100 loops=1)
   ->  Unique  (cost=1562.07..1563.71 rows=22 width=16) (actual time=737.52..739.15 rows=101 loops=1)
         ->  Sort  (cost=1562.07..1562.62 rows=219 width=16) (actual time=737.52..738.09 rows=145 loops=1)
               Sort Key: commit_log.commit_date, commit_log.id
               ->  Hash Join  (cost=181.33..1553.56 rows=219 width=16) (actual time=507.25..736.43 rows=146 loops=1)
                     Hash Cond: ("outer".commit_log_id = "inner".id)
                     ->  Seq Scan on commit_log_ports  (cost=0.00..1048.00 rows=64300 width=4) (actual time=0.10..419.82 rows=64300 loops=1)
                     ->  Hash  (cost=180.74..180.74 rows=233 width=12) (actual time=5.68..5.68 rows=0 loops=1)
                           ->  Index Scan using commit_log_commit_date on commit_log  (cost=0.00..180.74 rows=233 width=12) (actual time=0.34..4.51 rows=215 loops=1)
                                 Index Cond: (commit_date > '2003-02-18 16:02:15.133047-05'::timestamp with time zone)
Total runtime: 740.66 msec
(11 rows)
Why is this faster? The key is in the EXPLAIN output. In the previous example, the planner was merely filtering. In this example, we are giving it a hard-and-fast starting point, as demonstrated by the Index Cond: line. Essentially, we are providing a date instead of a function call, which allows the query to find what is needed much more quickly.
We can go faster
This isn't fast enough. What if I know what the last 100 port commits are and I have them in a table? A table like this:
create table clp
(
    commit_log_id  integer    not null,
    commit_date    timestamp  not null,
    primary key (commit_log_id)
);

alter table clp
    add foreign key (commit_log_id)
       references commit_log (id) on update cascade on delete cascade;

Note that this table is not included in the example tarball.

The table is named clp after the commit_log_ports table.

We can populate the table with this query:

EXPLAIN ANALYZE
INSERT INTO clp
SELECT DISTINCT commit_log_id, commit_date
  FROM commit_log, commit_log_ports
 WHERE commit_log.id = commit_log_ports.commit_log_id
 ORDER BY commit_log.commit_date desc
 LIMIT 100;

Subquery Scan "*SELECT*"  (cost=13456.18..13463.68 rows=100 width=16) (actual time=4165.77..4169.47 rows=100 loops=1)
   ->  Limit  (cost=13456.18..13463.68 rows=100 width=16) (actual time=4165.75..4168.39 rows=100 loops=1)
         ->  Unique  (cost=13456.18..13938.43 rows=6430 width=16) (actual time=4165.74..4167.59 rows=101 loops=1)
               ->  Sort  (cost=13456.18..13616.93 rows=64300 width=16) (actual time=4165.73..4166.48 rows=145 loops=1)
                     Sort Key: commit_log.commit_date, commit_log_ports.commit_log_id
                     ->  Hash Join  (cost=4103.81..7662.06 rows=64300 width=16) (actual time=1066.31..3157.63 rows=64300 loops=1)
                           Hash Cond: ("outer".commit_log_id = "inner".id)
                           ->  Seq Scan on commit_log_ports  (cost=0.00..1048.00 rows=64300 width=4) (actual time=0.11..430.26 rows=64300 loops=1)
                           ->  Hash  (cost=3032.33..3032.33 rows=68533 width=12) (actual time=1065.58..1065.58 rows=0 loops=1)
                                 ->  Seq Scan on commit_log  (cost=0.00..3032.33 rows=68533 width=12) (actual time=0.10..647.40 rows=68544 loops=1)
Total runtime: 4176.07 msec
(11 rows)
Of course, the cost to insert into this table is relatively high, as the earlier examples showed. The ultimate goal is to maintain this table as we go along; I'll explain that in a subsequent section. For now, let's just try out this table and see how things go.

We will assume that new entries are added to this table all the time. If we query using this table:

EXPLAIN ANALYZE
SELECT DISTINCT clp.commit_log_id, clp.commit_date
   FROM commit_log, commit_log_ports, clp
  WHERE commit_log.id     = commit_log_ports.commit_log_id
    AND clp.commit_log_id = commit_log.id
  ORDER BY clp.commit_date desc
  LIMIT 100;

Limit  (cost=5703.16..5710.66 rows=100 width=20) (actual time=731.50..733.90 rows=100 loops=1)
   ->  Unique  (cost=5703.16..5772.32 rows=922 width=20) (actual time=731.49..733.11 rows=100 loops=1)
         ->  Sort  (cost=5703.16..5726.22 rows=9221 width=20) (actual time=731.48..732.05 rows=144 loops=1)
               Sort Key: clp.commit_date, clp.commit_log_id
               ->  Hash Join  (cost=3714.68..5095.91 rows=9221 width=20) (actual time=503.23..730.43 rows=144 loops=1)
                     Hash Cond: ("outer".commit_log_id = "inner".id)
                     ->  Seq Scan on commit_log_ports  (cost=0.00..1048.00 rows=64300 width=4) (actual time=0.09..417.49 rows=64300 loops=1)
                     ->  Hash  (cost=3712.18..3712.18 rows=1000 width=16) (actual time=7.03..7.03 rows=0 loops=1)
                           ->  Nested Loop  (cost=0.00..3712.18 rows=1000 width=16) (actual time=0.42..6.41 rows=100 loops=1)
                                 ->  Seq Scan on clp  (cost=0.00..20.00 rows=1000 width=12) (actual time=0.10..0.82 rows=100 loops=1)
                                 ->  Index Scan using commit_log_pkey on commit_log  (cost=0.00..3.68 rows=1 width=4) (actual time=0.03..0.04 rows=1 loops=100)
                                       Index Cond: ("outer".commit_log_id = commit_log.id)
Total runtime: 734.72 msec
(13 rows)

Note also that to get the above time, you will have to do a VACUUM ANALYZE on the clp table so that the query engine has statistics about that table.

Look at where all the time is used up in this query: it's in the LIMIT. That time represents almost the entire query. But we can use the same idea and structure the query in a different way so as to increase performance.

EXPLAIN ANALYZE
SELECT DISTINCT clp.commit_log_id, clp.commit_date
   FROM clp, commit_log, commit_log_ports
  WHERE clp.commit_log_id = commit_log.id
    AND commit_log.id     = commit_log_ports.commit_log_id
  ORDER BY clp.commit_date desc
  LIMIT 100;

Limit  (cost=6538.88..6545.86 rows=93 width=20) (actual time=934.51..936.90 rows=100 loops=1)
   ->  Unique  (cost=6538.88..6545.92 rows=94 width=20) (actual time=934.50..936.10 rows=100 loops=1)
         ->  Sort  (cost=6538.88..6541.23 rows=938 width=20) (actual time=934.49..935.06 rows=144 loops=1)
               Sort Key: clp.commit_date, clp.commit_log_id
               ->  Merge Join  (cost=0.00..6492.57 rows=938 width=20) (actual time=922.63..933.37 rows=144 loops=1)
                     Merge Cond: ("outer".commit_log_id = "inner".commit_log_id)
                     ->  Nested Loop  (cost=0.00..3744.18 rows=1000 width=16) (actual time=0.50..6.95 rows=100 loops=1)
                           ->  Index Scan using clp_pkey on clp  (cost=0.00..52.00 rows=1000 width=12) (actual time=0.31..1.21 rows=100 loops=1)
                           ->  Index Scan using commit_log_pkey on commit_log  (cost=0.00..3.68 rows=1 width=4) (actual time=0.03..0.04 rows=1 loops=100)
                                 Index Cond: ("outer".commit_log_id = commit_log.id)
                     ->  Index Scan using clp_cli on commit_log_ports  (cost=0.00..2469.87 rows=64300 width=4) (actual time=0.31..599.70 rows=64300 loops=1)
Total runtime: 937.68 msec
(12 rows)
Well, that's not quite what I wanted to do. But it does demonstrate a useful point: rearranging the table names in the FROM clause can affect performance.

Now it's time to apply our old trick of providing a starting point. We will restrict the commits to those after a certain time:

EXPLAIN ANALYZE
SELECT DISTINCT clp.commit_log_id, clp.commit_date
   FROM clp, commit_log, commit_log_ports
  WHERE commit_log.id          = commit_log_ports.commit_log_id
    AND clp.commit_log_id      = commit_log.id
    AND commit_log.commit_date > '2003-02-18 16:02:15.133047-05'
  ORDER BY clp.commit_date desc
  LIMIT 100;

Limit  (cost=299.65..299.67 rows=1 width=20) (actual time=13.27..15.66 rows=100 loops=1)
   ->  Unique  (cost=299.65..299.67 rows=1 width=20) (actual time=13.26..14.87 rows=100 loops=1)
         ->  Sort  (cost=299.65..299.66 rows=3 width=20) (actual time=13.25..13.81 rows=144 loops=1)
               Sort Key: clp.commit_date, clp.commit_log_id
               ->  Nested Loop  (cost=169.43..299.63 rows=3 width=20) (actual time=4.87..12.33 rows=144 loops=1)
                     ->  Hash Join  (cost=169.43..194.47 rows=3 width=16) (actual time=4.64..6.69 rows=100 loops=1)
                           Hash Cond: ("outer".commit_log_id = "inner".id)
                           ->  Seq Scan on clp  (cost=0.00..20.00 rows=1000 width=12) (actual time=0.02..0.77 rows=100 loops=1)
                           ->  Hash  (cost=168.97..168.97 rows=186 width=4) (actual time=4.52..4.52 rows=0 loops=1)
                                 ->  Index Scan using commit_log_commit_date on commit_log  (cost=0.00..168.97 rows=186 width=4) (actual time=0.45..3.37 rows=215 loops=1)
                                       Index Cond: (commit_date > '2003-02-18 16:02:15.133047-05'::timestamp with time zone)
                     ->  Index Scan using clp_cli on commit_log_ports  (cost=0.00..38.58 rows=9 width=4) (actual time=0.02..0.03 rows=1 loops=100)
                           Index Cond: ("outer".commit_log_id = commit_log_ports.commit_log_id)
Total runtime: 16.44 msec
(14 rows)
Now that's fast.  Very fast. In fact, it's more than 57 times faster.
Another approach

We now have a very fast query which will return the last 100 commits to the ports tree. Let's now figure out how to get the list of ports associated with those commits.

EXPLAIN ANALYZE
 SELECT clp.commit_log_id, clp.commit_date, ports.name
   FROM clp, commit_log, commit_log_ports, ports
  WHERE commit_log.id            = commit_log_ports.commit_log_id
    AND clp.commit_log_id        = commit_log.id
    AND commit_log.commit_date   > '2003-02-18 16:02:15.133047-05'
    AND commit_log_ports.port_id = ports.id
  ORDER BY clp.commit_date desc
  LIMIT 100;

Limit  (cost=314.99..314.99 rows=2 width=78) (actual time=389.48..390.66 rows=100 loops=1)
   ->  Sort  (cost=314.99..314.99 rows=3 width=78) (actual time=389.46..389.86 rows=101 loops=1)
         Sort Key: clp.commit_date
         ->  Nested Loop  (cost=169.43..314.97 rows=3 width=78) (actual time=40.67..387.90 rows=144 loops=1)
               ->  Nested Loop  (cost=169.43..299.63 rows=3 width=24) (actual time=16.52..24.91 rows=144 loops=1)
                     ->  Hash Join  (cost=169.43..194.47 rows=3 width=16) (actual time=16.15..18.40 rows=100 loops=1)
                           Hash Cond: ("outer".commit_log_id = "inner".id)
                            ->  Seq Scan on clp  (cost=0.00..20.00 rows=1000 width=12) (actual time=10.55..11.38 rows=100 loops=1)
                            ->  Hash  (cost=168.97..168.97 rows=186 width=4) (actual time=5.48..5.48 rows=0 loops=1)
                                 ->  Index Scan using commit_log_commit_date on commit_log  (cost=0.00..168.97 rows=186 width=4) (actual time=0.34..4.32 rows=215 loops=1)
                                       Index Cond: (commit_date > '2003-02-18 16:02:15.133047-05'::timestamp with time zone)
                     ->  Index Scan using clp_cli on commit_log_ports  (cost=0.00..38.58 rows=9 width=8) (actual time=0.03..0.04 rows=1 loops=100)
                           Index Cond: ("outer".commit_log_id = commit_log_ports.commit_log_id)
               ->  Index Scan using ports_pkey on ports  (cost=0.00..6.00 rows=1 width=54) (actual time=2.49..2.50 rows=1 loops=144)
                     Index Cond: ("outer".port_id = ports.id)
Total runtime: 391.55 msec
(16 rows)
That query returned only one hundred rows... but a single commit can touch several ports, so those 100 rows cover fewer than 100 commits. We need a subselect here: we want to limit the result set to 100 commits, not 100 ports. This is the query we will use as a starting point:
 EXPLAIN ANALYZE
  SELECT clp.commit_log_id
    FROM clp
ORDER BY clp.commit_date DESC
  LIMIT 100;

Limit  (cost=69.83..70.08 rows=100 width=12) (actual time=1.30..2.48 rows=100 loops=1)
   ->  Sort  (cost=69.83..72.33 rows=1000 width=12) (actual time=1.29..1.69 rows=100 loops=1)
         Sort Key: commit_date
         ->  Seq Scan on clp  (cost=0.00..20.00 rows=1000 width=12) (actual time=0.02..0.68 rows=100 loops=1)
Total runtime: 3.02 msec
(5 rows)
With that, let's join it up with the commit_log table:
EXPLAIN ANALYZE
SELECT commit_log.id
  FROM commit_log JOIN
            (SELECT clp.commit_log_id
               FROM clp
           ORDER BY clp.commit_date DESC
              LIMIT 100) AS TMP
        ON commit_log.id = TMP.commit_log_id;

Nested Loop  (cost=69.83..439.30 rows=100 width=8) (actual time=1.76..9.11 rows=100 loops=1)
   ->  Subquery Scan tmp  (cost=69.83..70.08 rows=100 width=12) (actual time=1.38..3.50 rows=100 loops=1)
         ->  Limit  (cost=69.83..70.08 rows=100 width=12) (actual time=1.37..2.59 rows=100 loops=1)
               ->  Sort  (cost=69.83..72.33 rows=1000 width=12) (actual time=1.36..1.79 rows=100 loops=1)
                     Sort Key: commit_date
                     ->  Seq Scan on clp  (cost=0.00..20.00 rows=1000 width=12) (actual time=0.02..0.68 rows=100 loops=1)
   ->  Index Scan using commit_log_pkey on commit_log  (cost=0.00..3.68 rows=1 width=4) (actual time=0.03..0.04 rows=1 loops=100)
         Index Cond: (commit_log.id = "outer".commit_log_id)
Total runtime: 9.69 msec
(9 rows)
As you can see, that's still pretty fast, and still only 100 records. Now we have the rows from commit_log which relate to the last 100 commits to the ports tree. Let's get the ports touched by those commits.
EXPLAIN ANALYZE
SELECT commit_log.id
  FROM commit_log JOIN
            (SELECT clp.commit_log_id
               FROM clp
           ORDER BY clp.commit_date DESC
              LIMIT 100) AS TMP
        ON commit_log.id = TMP.commit_log_id, commit_log_ports
WHERE commit_log_ports.commit_log_id = commit_log.id;

Hash Join  (cost=439.55..1810.22 rows=94 width=12) (actual time=504.90..731.12 rows=144 loops=1)
   Hash Cond: ("outer".commit_log_id = "inner".id)
   ->  Seq Scan on commit_log_ports  (cost=0.00..1048.00 rows=64300 width=4) (actual time=0.09..416.59 rows=64300 loops=1)
   ->  Hash  (cost=439.30..439.30 rows=100 width=8) (actual time=9.64..9.64 rows=0 loops=1)
         ->  Nested Loop  (cost=69.83..439.30 rows=100 width=8) (actual time=1.84..9.09 rows=100 loops=1)
               ->  Subquery Scan tmp  (cost=69.83..70.08 rows=100 width=12) (actual time=1.44..3.56 rows=100 loops=1)
                     ->  Limit  (cost=69.83..70.08 rows=100 width=12) (actual time=1.43..2.63 rows=100 loops=1)
                           ->  Sort  (cost=69.83..72.33 rows=1000 width=12) (actual time=1.42..1.82 rows=100 loops=1)
                                 Sort Key: commit_date
                                 ->  Seq Scan on clp  (cost=0.00..20.00 rows=1000 width=12) (actual time=0.09..0.73 rows=100 loops=1)
               ->  Index Scan using commit_log_pkey on commit_log  (cost=0.00..3.68 rows=1 width=4) (actual time=0.03..0.04 rows=1 loops=100)
                     Index Cond: (commit_log.id = "outer".commit_log_id)
Total runtime: 731.93 msec
(13 rows)
Wow! That is 75 times slower than before! Surely we can do better. I think the clue is the rows=64300 entry, so let's try reducing that. In this particular case, I know we're only looking at values within the clp table, so let's create this function:
CREATE OR REPLACE FUNCTION Anchor_CLID() RETURNS integer
AS '
   SELECT min(commit_log_id) AS RESULT
     FROM clp;

' LANGUAGE SQL IMMUTABLE;

See also how to make the above faster.

Note that the above is for version 7.3, which first introduced the IMMUTABLE designation. This function obtains the minimum commit_log_id value from the clp table. Let's use that in our query:

EXPLAIN ANALYZE
SELECT commit_log_id, port_id FROM
    (SELECT commit_log.id
      FROM commit_log JOIN
                (SELECT clp.commit_log_id
                   FROM clp
               ORDER BY clp.commit_date DESC
                  LIMIT 100) AS TMP
            ON commit_log.id = TMP.commit_log_id) AS TMP2 JOIN commit_log_ports
                         ON commit_log_ports.commit_log_id = TMP2.id
                         AND commit_log_ports.commit_log_id > Anchor_CLID();

Merge Join  (cost=442.62..594.26 rows=1 width=16) (actual time=8.52..12.97 rows=141 loops=1)
   Merge Cond: ("outer".id = "inner".commit_log_id)
   ->  Sort  (cost=442.62..442.87 rows=100 width=8) (actual time=8.44..8.86 rows=100 loops=1)
         Sort Key: commit_log.id
         ->  Nested Loop  (cost=69.83..439.30 rows=100 width=8) (actual time=1.39..7.81 rows=100 loops=1)
               ->  Subquery Scan tmp  (cost=69.83..70.08 rows=100 width=12) (actual time=1.32..3.41 rows=100 loops=1)
                     ->  Limit  (cost=69.83..70.08 rows=100 width=12) (actual time=1.31..2.51 rows=100 loops=1)
                           ->  Sort  (cost=69.83..72.33 rows=1000 width=12) (actual time=1.30..1.71 rows=100 loops=1)
                                 Sort Key: commit_date
                                 ->  Seq Scan on clp  (cost=0.00..20.00 rows=1000 width=12) (actual time=0.01..0.71 rows=100 loops=1)
               ->  Index Scan using commit_log_pkey on commit_log  (cost=0.00..3.68 rows=1 width=4) (actual time=0.02..0.02 rows=1 loops=100)
                     Index Cond: (commit_log.id = "outer".commit_log_id)
   ->  Index Scan using clp_cli on commit_log_ports  (cost=0.00..149.91 rows=39 width=8) (actual time=0.03..1.35 rows=141 loops=1)
         Index Cond: (commit_log_id > 69061)
Total runtime: 13.81 msec
(15 rows)
I'd say that's fast enough for my needs. The key to this function is the IMMUTABLE designation, which indicates that the function returns the same value for a given set of parameters. This allows the query planner to make certain optimizations that it otherwise could not: with this designation, the planner considers the function's return value to be a constant, and therefore it can make use of the available indexes. Please refer to the CREATE FUNCTION documentation for more information.

In fact, my function isn't really IMMUTABLE; it's actually STABLE. If I change the function to STABLE, the time goes up to nearly 19 seconds. If I add a < condition to the range, the time drops to just under one second, but it never gets near the speed of the IMMUTABLE version. Please read the CREATE FUNCTION documentation and pick the designation which best suits your particular application.
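For reference, the honest STABLE declaration differs only in its final keyword (same function body as above):

```sql
-- The same function, declared STABLE: the result may change between
-- statements (rows come and go from clp), but not within one statement.
CREATE OR REPLACE FUNCTION Anchor_CLID() RETURNS integer
AS '
   SELECT min(commit_log_id) AS RESULT
     FROM clp;
' LANGUAGE SQL STABLE;
```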

More about that clp table

For my specific needs, I'll be doing some more work on the clp table in order to maintain it as things go along. That may turn out to be the subject of a future article.

For now, the starting point is a couple of rules which will maintain the data:

CREATE OR REPLACE RULE commit_log_ports_insert AS
   ON INSERT TO commit_log_ports DO
   INSERT INTO clp
   SELECT new.commit_log_id,
              (SELECT commit_date
                  FROM commit_log
                WHERE id = new.commit_log_id)
   WHERE NOT EXISTS
      (SELECT commit_log_id
         FROM clp
        WHERE commit_log_id = new.commit_log_id);

CREATE OR REPLACE RULE commit_log_update AS
   ON UPDATE TO commit_log DO
   UPDATE clp
      SET commit_date   = new.commit_date
    WHERE commit_log_id = new.id;

The first rule will keep populating the clp table as new ports tree commits are added. It can handle multiple ports per commit by inserting only rows which do not already exist. The second rule will keep the commit_date accurate should any updates occur.

What about deletes? That's taken care of in the foreign key definition of the table. Any delete on the primary table will delete the corresponding row from this table.

Here's a query which will retain the latest 100 entries:

DELETE FROM clp
      WHERE commit_log_id NOT in (
          SELECT commit_log_id
            FROM clp
        ORDER BY commit_date DESC
           LIMIT 100);

Seq Scan on clp  (cost=0.00..570.12 rows=50 width=6) (actual time=9.81..399.59 rows=400 loops=1)
   Filter: (subplan)
   SubPlan
     ->  Materialize  (cost=5.62..5.62 rows=100 width=12) (actual time=0.02..0.38 rows=90 loops=500)
           ->  Limit  (cost=5.37..5.62 rows=100 width=12) (actual time=6.73..7.99 rows=100 loops=1)
                 ->  Sort  (cost=5.37..5.62 rows=101 width=12) (actual time=6.72..7.14 rows=101 loops=1)
                       Sort Key: commit_date
                       ->  Seq Scan on clp  (cost=0.00..2.01 rows=101 width=12) (actual time=0.21..3.87 rows=500 loops=1)
Total runtime: 410.57 msec
(9 rows)

But as you can see, it takes quite some time. Luckily, with some help from the PostgreSQL SQL mailing list, I was able to find a better way:

EXPLAIN ANALYZE
DELETE FROM clp
WHERE commit_date < (SELECT commit_date
                       FROM clp
                   ORDER BY commit_date DESC
                     OFFSET 99
                      LIMIT 1);

Seq Scan on clp  (cost=0.00..2.26 rows=34 width=6) (actual time=7.98..11.42 rows=400 loops=1)
   Filter: (commit_date < $0)
   InitPlan
     ->  Limit  (cost=5.62..5.62 rows=1 width=8) (actual time=7.29..7.31 rows=1 loops=1)
           ->  Sort  (cost=5.37..5.62 rows=101 width=8) (actual time=6.49..6.90 rows=101 loops=1)
                 Sort Key: commit_date
                 ->  Seq Scan on clp  (cost=0.00..2.01 rows=101 width=8) (actual time=0.21..3.73 rows=500 loops=1)
Total runtime: 15.36 msec
That's nearly 27 times faster than my original. Thanks folks.

This could be run from a perl script started by a cron job. It's not a critical task, but it does keep things tidy. I'd also VACUUM FULL ANALYZE that table at least once a week, if not daily.
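A sketch of what that cron job might look like; the database name, file path, and schedule are my assumptions, not part of the actual FreshPorts setup:

```shell
# crontab fragment: trim clp nightly at 03:00, then refresh its statistics.
# Assumes the DELETE query above is saved in /usr/local/etc/trim_clp.sql
# and the database is named "freshports".
0 3 * * * psql freshports -f /usr/local/etc/trim_clp.sql && vacuumdb --full --analyze --table clp freshports
```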

Robert Treat just reminded me of a way to make that function run faster. It uses an old trick I had forgotten about. The original query was:

   SELECT min(commit_log_id) AS RESULT
     FROM clp;

Aggregate  (cost=9.25..9.25 rows=1 width=4) (actual time=5.22..5.23 rows=1 loops=1)
   ->  Seq Scan on clp  (cost=0.00..8.00 rows=500 width=4) (actual time=0.02..3.02 rows=500 loops=1)
Total runtime: 5.31 msec

Let's try this instead:

    SELECT commit_log_id
      FROM clp
  ORDER BY commit_log_id DESC
     LIMIT 1;

Limit  (cost=0.00..0.03 rows=1 width=4) (actual time=0.05..0.07 rows=1 loops=1)
   ->  Index Scan Backward using clp_pkey on clp  (cost=0.00..12.50 rows=500 width=4) (actual time=0.04..0.05 rows=2 loops=1)
Total runtime: 0.15 msec

That new query is 35.4 times faster. This clearly demonstrates the cost of an aggregate function versus the speed of an index scan. Mind you, it could also be argued that aggregate functions could use the index.

Is that it?

That's it. You've seen how adjusting the query and taking a different approach can often improve the speed. It is not always clear how to adjust the query, but now that you have seen a few examples, that should get you started.
