The FreeBSD Diary

Providing practical examples since 1998

PostgreSQL - analyzing a query to improve speed 1 March 2003

PostgreSQL is the leading open source database. Many people consider PostgreSQL to be a competitor to mySQL. People who say that don't know either product. PostgreSQL isn't competing with mySQL; it's competing with Oracle. Yes, it's being quiet about it. But the users aren't. They were Oracle users and they're coming to PostgreSQL.

Of course, the skeptics will ask for proof of my claims. I don't have any. I have anecdotal evidence, and third-hand information, such as Bruce Momjian's recent overseas trip report.

All the PostgreSQL versus mySQL noise is moot. What people should be concentrating on is PostgreSQL versus Oracle.

Put that in your DBMS and analyse it.

What's coming up

Anyone who has ever written a complex query has seen that despite the best of designs, they aren't always fast. Over the past day or two I've been struggling with the slowest part of FreshPorts. I think I've come up with a relatively simple solution. The trip to that solution was rather interesting. I will reproduce it below and show you how it went from a query which took many seconds to one which takes less than 0.01 seconds.

The first part of the article gives a brief introduction to FreshPorts, the underlying database tables, and how they relate to each other. Then we'll get into the queries, follow the analysis of a query, and finally arrive at the optimized result.

This article should be useful to anyone using PostgreSQL or designing databases. It should also prove useful to those who wonder how databases work and what it takes to make them run quickly. It might also provide a venue for my own education: people reading this article might know a better way to do things. Information is a two-way street.

You might want to read PostgreSQL Database Performance Tuning first. It gives you a few ideas about database tuning, but not much on tuning a query. PostgreSQL Hardware Performance Tuning might also interest you.

Last of all, it should show people how PostgreSQL can do this but mySQL can't... (that's not the goal of the article, and I'm not going to attempt to prove it... please just accept that mySQL can't do this type of thing...)

Explain it to me...

I'll be making use of the PostgreSQL EXPLAIN command. That's found in the Reference Manual of the PostgreSQL online documentation.

If what appears below doesn't make sense, try reading the above references first.

EXPLAIN shows the execution plan of a statement. You get to see which tables will be used, which indexes are involved, and the number of rows involved. Execution plans make use of statistics gathered by ANALYZE (commonly run as VACUUM ANALYZE). If you aren't getting the results you want, try running VACUUM ANALYZE over the table[s] in question.

EXPLAIN ANALYZE goes one step further and actually executes the command. This provides the actual run time for the various parts of the query. Be warned that EXPLAIN ANALYZE will actually delete, update, or insert any data involved in the query. Read the documentation for details.
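One standard way to avoid that side effect (my sketch, not something from the FreshPorts code; the date is made up) is to wrap the statement in a transaction and roll it back once you have the timings:

```sql
-- EXPLAIN ANALYZE really executes the statement, so time a
-- data-modifying statement inside a transaction, then undo it.
BEGIN;
EXPLAIN ANALYZE
 DELETE FROM commit_log
  WHERE commit_date < '2001-01-01';
ROLLBACK;
```

The plan and timings are printed as usual, but the ROLLBACK discards the deletions.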

Here are a few things to look for when reading the output of EXPLAIN:

  • Look for high costs
  • Look for sequential scans
  • Look for large numbers of rows
  • Look for index scans
High costs are an indication of something which takes a lot of setup. Sequential scans indicate a scan of the entire table; perhaps an index is required. Large numbers of rows might indicate a place where you can cut down the result set earlier in the query. Index scans are usually a good thing, as they tend to speed up the query.

Items to note

  • For small tables, an index often costs more than it is worth. Indexes are only really useful for larger data sets. Don't be upset if the query runs quickly but doesn't use an index.
  • If you think an index should be used, but it isn't being touched, try a VACUUM ANALYZE on that table. Then test the query again.
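For example, to refresh the planner's statistics for the tables used later in this article (nothing more than the standard command, run once per table):

```sql
-- Gather fresh planner statistics for the tables this article queries,
-- then re-run EXPLAIN to see whether the plan has changed.
VACUUM ANALYZE commit_log;
VACUUM ANALYZE commit_log_ports;
```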
FreshPorts - what is it?

FreshPorts (and its sister site, FreshSource) captures and stores the commits to a source repository. It can work with any source repository, but at present it is restricted to the FreeBSD source tree. FreshPorts concentrates on the ports subtree, while FreshSource captures and records the entire tree.

By its nature, FreshPorts is the more complex of the two sites. It was also the first. Users create watch lists of whatever ports they wish to keep up to date with. When a commit occurs against those ports, the user is sent an email notification. Users can also view all commits recorded against a given port. Since a port can consist of several dozen files, this information is very useful when tracking updates.

FreshPorts - the tables and sample data

The most important FreshPorts table is the commit_log table.

create table commit_log
(
    id               serial    not null,
    commit_date      timestamp not null,
    committer        text      not null,
    primary key (id)
);

create index commit_log_commit_date on commit_log (commit_date);
The actual table contains more columns but the above is sufficient for demonstration purposes.

Each new commit against the FreeBSD source repository is loaded into this table. FreshPorts is only interested in commits against the ports subtree. However, FreshSource and FreshPorts use the same database. A second table is used to indicate commits against particular ports:

create table commit_log_ports
(
    commit_log_id  integer     not null,
    port_id        integer     not null,
    primary key (commit_log_id, port_id)
);

alter table commit_log_ports
    add foreign key (commit_log_id)
       references commit_log (id) on update cascade on delete cascade;

alter table commit_log_ports
    add foreign key (port_id)
       references ports (id) on update cascade on delete cascade;

I will only mention the ports table so that you know it exists and is referred to by many other tables. The main problem at hand concerns the commit_log and commit_log_ports tables.

You can download the sample tables and data (859KB) used in this article should you wish to play with the results. Please note that I have only supplied the columns necessary to perform the queries indicated below. As such, your results will vary. NOTE: the examples were run on version 7.3.

The tarball contains a .sql file which can be loaded into your database with the following command:

createdb test
psql test < postgresql-analyze.sql
Feedback is appreciated.
So what's the problem?

FreshPorts, by default, displays the last 100 commits. If you didn't think about it, you'd think this was the answer:

EXPLAIN ANALYZE
  SELECT *
    FROM commit_log
ORDER BY commit_date
   LIMIT 100;

Limit  (cost=0.00..48.12 rows=100 width=252) (actual time=0.06..2.40 rows=100 loops=1)
   ->  Index Scan using commit_log_commit_date on commit_log  (cost=0.00..32975.88 rows=68533 width=252) (actual time=0.05..1.59 rows=101 loops=1)
Total runtime: 2.98 msec
(3 rows)

As you can see, it's rather fast to get the last 100 commits. Just under 3 ms. That's 0.003 seconds. That's damn fast in my book. Mind you, there are only 68,533 rows in that table.

Note that the (3 rows) refers to the EXPLAIN ANALYZE output, not the number of rows in the result set. Compare that to the 100 which appears in the Limit line of the output.

Unfortunately, that's the last 100 commits. Not the last 100 commits against the ports subtree. For that, we have to use the commit_log_ports table.

EXPLAIN ANALYZE
  SELECT distinct commit_log.id, commit_log.commit_date
    FROM commit_log, commit_log_ports
   WHERE commit_log.id = commit_log_ports.commit_log_id
ORDER BY commit_log.commit_date
   LIMIT 100;

Limit  (cost=13456.18..13463.68 rows=100 width=16) (actual time=4190.89..4193.10 rows=100 loops=1)
   ->  Unique  (cost=13456.18..13938.43 rows=6430 width=16) (actual time=4190.88..4192.31 rows=101 loops=1)
         ->  Sort  (cost=13456.18..13616.93 rows=64300 width=16) (actual time=4190.87..4191.39 rows=113 loops=1)
               Sort Key: commit_log.commit_date, commit_log.id
               ->  Hash Join  (cost=4103.81..7662.06 rows=64300 width=16) (actual time=1065.79..3150.16 rows=64300 loops=1)
                     Hash Cond: ("outer".commit_log_id = "inner".id)
                     ->  Seq Scan on commit_log_ports  (cost=0.00..1048.00 rows=64300 width=4) (actual time=0.11..434.34 rows=64300 loops=1)
                     ->  Hash  (cost=3032.33..3032.33 rows=68533 width=12) (actual time=1065.08..1065.08 rows=0 loops=1)
                           ->  Seq Scan on commit_log  (cost=0.00..3032.33 rows=68533 width=12) (actual time=0.10..645.86 rows=68544 loops=1)
Total runtime: 4195.82 msec
(10 rows)

OUCH! That's terrible. That's four seconds! Why did it take so long?

The query joins all rows in the two tables, sorts them, and only then keeps the first one hundred. That's quite inefficient, and it's not the approach we want. We can make things faster.

Don't search everything; provide a starting point

In this example, we're going to provide a starting place for the query. We aren't going to do the JOIN on the whole table. We are going to limit it to the most recent commits. Let's assume that in a given week, there are at least 100 commits against the ports tree. We will do this by adding the following condition to the WHERE clause:

commit_log.commit_date > now() - interval '1 week'
The new query is:
EXPLAIN ANALYZE
  SELECT distinct commit_log.id, commit_log.commit_date
    FROM commit_log, commit_log_ports
   WHERE commit_log.id = commit_log_ports.commit_log_id
     AND commit_log.commit_date > now() - interval '1 week'
ORDER BY commit_log.commit_date
   LIMIT 100;

Limit  (cost=7425.72..7433.22 rows=100 width=16) (actual time=1238.13..1238.62 rows=21 loops=1)
   ->  Unique  (cost=7425.72..7586.47 rows=2143 width=16) (actual time=1238.12..1238.44 rows=21 loops=1)
         ->  Sort  (cost=7425.72..7479.31 rows=21433 width=16) (actual time=1238.11..1238.22 rows=28 loops=1)
               Sort Key: commit_log.commit_date, commit_log.id
               ->  Hash Join  (cost=3603.44..5883.85 rows=21433 width=16) (actual time=1175.11..1237.71 rows=28 loops=1)
                     Hash Cond: ("outer".commit_log_id = "inner".id)
                     ->  Seq Scan on commit_log_ports  (cost=0.00..1048.00 rows=64300 width=4) (actual time=0.10..420.66 rows=64300 loops=1)
                     ->  Hash  (cost=3546.33..3546.33 rows=22844 width=12) (actual time=419.03..419.03 rows=0 loops=1)
                           ->  Seq Scan on commit_log  (cost=0.00..3546.33 rows=22844 width=12) (actual time=0.19..418.47 rows=57 loops=1)
                                 Filter: (commit_date > (now() - '7 days'::interval))
Total runtime: 1238.96 msec
(11 rows)

Now the query takes just over 1 second. That's not bad, but we can do better. Let's now assume we know how far back in time the 100th previous port commit was: it was on 18 February 2003, at 16:02:15.

EXPLAIN ANALYZE
  SELECT distinct commit_log.id, commit_log.commit_date
    FROM commit_log, commit_log_ports
   WHERE commit_log.id = commit_log_ports.commit_log_id
     AND commit_log.commit_date > '2003-02-18 16:02:15.133047-05'
ORDER BY commit_log.commit_date
   LIMIT 100;

Limit  (cost=1562.07..1563.65 rows=21 width=16) (actual time=737.54..739.94 rows=100 loops=1)
   ->  Unique  (cost=1562.07..1563.71 rows=22 width=16) (actual time=737.52..739.15 rows=101 loops=1)
         ->  Sort  (cost=1562.07..1562.62 rows=219 width=16) (actual time=737.52..738.09 rows=145 loops=1)
               Sort Key: commit_log.commit_date, commit_log.id
               ->  Hash Join  (cost=181.33..1553.56 rows=219 width=16) (actual time=507.25..736.43 rows=146 loops=1)
                     Hash Cond: ("outer".commit_log_id = "inner".id)
                     ->  Seq Scan on commit_log_ports  (cost=0.00..1048.00 rows=64300 width=4) (actual time=0.10..419.82 rows=64300 loops=1)
                     ->  Hash  (cost=180.74..180.74 rows=233 width=12) (actual time=5.68..5.68 rows=0 loops=1)
                           ->  Index Scan using commit_log_commit_date on commit_log  (cost=0.00..180.74 rows=233 width=12) (actual time=0.34..4.51 rows=215 loops=1)
                                 Index Cond: (commit_date > '2003-02-18 16:02:15.133047-05'::timestamp with time zone)
Total runtime: 740.66 msec
(11 rows)
Why is this faster? The key is in the EXPLAIN output. In the previous example, the planner was merely filtering. In this example, we are giving it a hard-and-fast starting point, as demonstrated by the Index Cond: line. Essentially, we are providing a date instead of a function call, which allows the query to find what is needed much more quickly.
We can go faster
This isn't fast enough. What if I know what the last 100 port commits are and I have them in a table? A table like this:
create table clp
(
    commit_log_id  integer    not null,
    commit_date    timestamp  not null,
    primary key (commit_log_id)
);

alter table clp
    add foreign key (commit_log_id)
       references commit_log (id) on update cascade on delete cascade;

Note that this table is not included in the example tarball.

The table is named clp after the commit_log_ports table.

We can populate the table with this query:

EXPLAIN ANALYZE
INSERT INTO clp
SELECT DISTINCT commit_log_id, commit_date
  FROM commit_log, commit_log_ports
 WHERE commit_log.id = commit_log_ports.commit_log_id
 ORDER BY commit_log.commit_date desc
 LIMIT 100;

Subquery Scan "*SELECT*"  (cost=13456.18..13463.68 rows=100 width=16) (actual time=4165.77..4169.47 rows=100 loops=1)
   ->  Limit  (cost=13456.18..13463.68 rows=100 width=16) (actual time=4165.75..4168.39 rows=100 loops=1)
         ->  Unique  (cost=13456.18..13938.43 rows=6430 width=16) (actual time=4165.74..4167.59 rows=101 loops=1)
               ->  Sort  (cost=13456.18..13616.93 rows=64300 width=16) (actual time=4165.73..4166.48 rows=145 loops=1)
                     Sort Key: commit_log.commit_date, commit_log_ports.commit_log_id
                     ->  Hash Join  (cost=4103.81..7662.06 rows=64300 width=16) (actual time=1066.31..3157.63 rows=64300 loops=1)
                           Hash Cond: ("outer".commit_log_id = "inner".id)
                           ->  Seq Scan on commit_log_ports  (cost=0.00..1048.00 rows=64300 width=4) (actual time=0.11..430.26 rows=64300 loops=1)
                           ->  Hash  (cost=3032.33..3032.33 rows=68533 width=12) (actual time=1065.58..1065.58 rows=0 loops=1)
                                 ->  Seq Scan on commit_log  (cost=0.00..3032.33 rows=68533 width=12) (actual time=0.10..647.40 rows=68544 loops=1)
Total runtime: 4176.07 msec
(11 rows)
Of course, the cost to insert into this table is relatively high, as the earlier examples showed. The ultimate goal is to maintain this table as we go along; I'll explain that in a subsequent section. For now, let's just try out this table and see how things go.

We will assume that new entries are added to this table all the time. If we query using this table:

EXPLAIN ANALYZE
SELECT DISTINCT clp.commit_log_id, clp.commit_date
   FROM commit_log, commit_log_ports, clp
  WHERE commit_log.id     = commit_log_ports.commit_log_id
    AND clp.commit_log_id = commit_log.id
  ORDER BY clp.commit_date desc
  LIMIT 100;

Limit  (cost=5703.16..5710.66 rows=100 width=20) (actual time=731.50..733.90 rows=100 loops=1)
   ->  Unique  (cost=5703.16..5772.32 rows=922 width=20) (actual time=731.49..733.11 rows=100 loops=1)
         ->  Sort  (cost=5703.16..5726.22 rows=9221 width=20) (actual time=731.48..732.05 rows=144 loops=1)
               Sort Key: clp.commit_date, clp.commit_log_id
               ->  Hash Join  (cost=3714.68..5095.91 rows=9221 width=20) (actual time=503.23..730.43 rows=144 loops=1)
                     Hash Cond: ("outer".commit_log_id = "inner".id)
                     ->  Seq Scan on commit_log_ports  (cost=0.00..1048.00 rows=64300 width=4) (actual time=0.09..417.49 rows=64300 loops=1)
                     ->  Hash  (cost=3712.18..3712.18 rows=1000 width=16) (actual time=7.03..7.03 rows=0 loops=1)
                           ->  Nested Loop  (cost=0.00..3712.18 rows=1000 width=16) (actual time=0.42..6.41 rows=100 loops=1)
                                 ->  Seq Scan on clp  (cost=0.00..20.00 rows=1000 width=12) (actual time=0.10..0.82 rows=100 loops=1)
                                 ->  Index Scan using commit_log_pkey on commit_log  (cost=0.00..3.68 rows=1 width=4) (actual time=0.03..0.04 rows=1 loops=100)
                                       Index Cond: ("outer".commit_log_id = commit_log.id)
Total runtime: 734.72 msec
(13 rows)

Note also that to get the above time, you will have to do a VACUUM ANALYZE on the clp table so that the query engine has statistics about that table.

Look at where all the time is used up in this query: it's in the LIMIT. That time represents almost the entire query. But we can use the same idea and structure the query in a different way so as to increase performance.

EXPLAIN ANALYZE
SELECT DISTINCT clp.commit_log_id, clp.commit_date
   FROM clp, commit_log, commit_log_ports
  WHERE clp.commit_log_id = commit_log.id
    AND commit_log.id     = commit_log_ports.commit_log_id
  ORDER BY clp.commit_date desc
  LIMIT 100;

Limit  (cost=6538.88..6545.86 rows=93 width=20) (actual time=934.51..936.90 rows=100 loops=1)
   ->  Unique  (cost=6538.88..6545.92 rows=94 width=20) (actual time=934.50..936.10 rows=100 loops=1)
         ->  Sort  (cost=6538.88..6541.23 rows=938 width=20) (actual time=934.49..935.06 rows=144 loops=1)
               Sort Key: clp.commit_date, clp.commit_log_id
               ->  Merge Join  (cost=0.00..6492.57 rows=938 width=20) (actual time=922.63..933.37 rows=144 loops=1)
                     Merge Cond: ("outer".commit_log_id = "inner".commit_log_id)
                     ->  Nested Loop  (cost=0.00..3744.18 rows=1000 width=16) (actual time=0.50..6.95 rows=100 loops=1)
                           ->  Index Scan using clp_pkey on clp  (cost=0.00..52.00 rows=1000 width=12) (actual time=0.31..1.21 rows=100 loops=1)
                           ->  Index Scan using commit_log_pkey on commit_log  (cost=0.00..3.68 rows=1 width=4) (actual time=0.03..0.04 rows=1 loops=100)
                                 Index Cond: ("outer".commit_log_id = commit_log.id)
                     ->  Index Scan using clp_cli on commit_log_ports  (cost=0.00..2469.87 rows=64300 width=4) (actual time=0.31..599.70 rows=64300 loops=1)
Total runtime: 937.68 msec
(12 rows)
Well, that's not quite what I wanted to do. But it does demonstrate a useful point: rearranging the table names in the FROM clause can affect performance.

Now it's time to apply our old trick of providing a starting point. We will restrict the commits to those after a certain time:

EXPLAIN ANALYZE
SELECT DISTINCT clp.commit_log_id, clp.commit_date
   FROM clp, commit_log, commit_log_ports
  WHERE commit_log.id          = commit_log_ports.commit_log_id
    AND clp.commit_log_id      = commit_log.id
    AND commit_log.commit_date > '2003-02-18 16:02:15.133047-05'
  ORDER BY clp.commit_date desc
  LIMIT 100;

Limit  (cost=299.65..299.67 rows=1 width=20) (actual time=13.27..15.66 rows=100 loops=1)
   ->  Unique  (cost=299.65..299.67 rows=1 width=20) (actual time=13.26..14.87 rows=100 loops=1)
         ->  Sort  (cost=299.65..299.66 rows=3 width=20) (actual time=13.25..13.81 rows=144 loops=1)
               Sort Key: clp.commit_date, clp.commit_log_id
               ->  Nested Loop  (cost=169.43..299.63 rows=3 width=20) (actual time=4.87..12.33 rows=144 loops=1)
                     ->  Hash Join  (cost=169.43..194.47 rows=3 width=16) (actual time=4.64..6.69 rows=100 loops=1)
                           Hash Cond: ("outer".commit_log_id = "inner".id)
                           ->  Seq Scan on clp  (cost=0.00..20.00 rows=1000 width=12) (actual time=0.02..0.77 rows=100 loops=1)
                           ->  Hash  (cost=168.97..168.97 rows=186 width=4) (actual time=4.52..4.52 rows=0 loops=1)
                                 ->  Index Scan using commit_log_commit_date on commit_log  (cost=0.00..168.97 rows=186 width=4) (actual time=0.45..3.37 rows=215 loops=1)
                                       Index Cond: (commit_date > '2003-02-18 16:02:15.133047-05'::timestamp with time zone)
                     ->  Index Scan using clp_cli on commit_log_ports  (cost=0.00..38.58 rows=9 width=4) (actual time=0.02..0.03 rows=1 loops=100)
                           Index Cond: ("outer".commit_log_id = commit_log_ports.commit_log_id)
Total runtime: 16.44 msec
(14 rows)
Now that's fast.  Very fast. In fact, it's more than 57 times faster.
Another approach

We now have a very fast query which will return the last 100 commits to the ports tree. Let's now figure out how to get the list of ports associated with those commits.

EXPLAIN ANALYZE
 SELECT clp.commit_log_id, clp.commit_date, ports.name
   FROM clp, commit_log, commit_log_ports, ports
  WHERE commit_log.id            = commit_log_ports.commit_log_id
    AND clp.commit_log_id        = commit_log.id
    AND commit_log.commit_date   > '2003-02-18 16:02:15.133047-05'
    AND commit_log_ports.port_id = ports.id
  ORDER BY clp.commit_date desc
  LIMIT 100;

Limit  (cost=314.99..314.99 rows=2 width=78) (actual time=389.48..390.66 rows=100 loops=1)
   ->  Sort  (cost=314.99..314.99 rows=3 width=78) (actual time=389.46..389.86 rows=101 loops=1)
         Sort Key: clp.commit_date
         ->  Nested Loop  (cost=169.43..314.97 rows=3 width=78) (actual time=40.67..387.90 rows=144 loops=1)
               ->  Nested Loop  (cost=169.43..299.63 rows=3 width=24) (actual time=16.52..24.91 rows=144 loops=1)
                     ->  Hash Join  (cost=169.43..194.47 rows=3 width=16) (actual time=16.15..18.40 rows=100 loops=1)
                           Hash Cond: ("outer".commit_log_id = "inner".id)
                            ->  Seq Scan on clp  (cost=0.00..20.00 rows=1000 width=12) (actual time=10.55..11.38 rows=100 loops=1)
                            ->  Hash  (cost=168.97..168.97 rows=186 width=4) (actual time=5.48..5.48 rows=0 loops=1)
                                 ->  Index Scan using commit_log_commit_date on commit_log  (cost=0.00..168.97 rows=186 width=4) (actual time=0.34..4.32 rows=215 loops=1)
                                       Index Cond: (commit_date > '2003-02-18 16:02:15.133047-05'::timestamp with time zone)
                     ->  Index Scan using clp_cli on commit_log_ports  (cost=0.00..38.58 rows=9 width=8) (actual time=0.03..0.04 rows=1 loops=100)
                           Index Cond: ("outer".commit_log_id = commit_log_ports.commit_log_id)
               ->  Index Scan using ports_pkey on ports  (cost=0.00..6.00 rows=1 width=54) (actual time=2.49..2.50 rows=1 loops=144)
                     Index Cond: ("outer".port_id = ports.id)
Total runtime: 391.55 msec
(16 rows)
That query returned only one hundred rows... but a single commit can touch several ports, so those 100 rows cover fewer than 100 commits. We need a subselect here: we want to limit the result set to 100 commits, not 100 ports. This is the query we will use as a starting point:
 EXPLAIN ANALYZE
  SELECT clp.commit_log_id
    FROM clp
ORDER BY clp.commit_date DESC
  LIMIT 100;

Limit  (cost=69.83..70.08 rows=100 width=12) (actual time=1.30..2.48 rows=100 loops=1)
   ->  Sort  (cost=69.83..72.33 rows=1000 width=12) (actual time=1.29..1.69 rows=100 loops=1)
         Sort Key: commit_date
         ->  Seq Scan on clp  (cost=0.00..20.00 rows=1000 width=12) (actual time=0.02..0.68 rows=100 loops=1)
Total runtime: 3.02 msec
(5 rows)
With that, let's join it up with the commit_log table:
EXPLAIN ANALYZE
SELECT commit_log.id
  FROM commit_log JOIN
            (SELECT clp.commit_log_id
               FROM clp
           ORDER BY clp.commit_date DESC
              LIMIT 100) AS TMP
        ON commit_log.id = TMP.commit_log_id;

Nested Loop  (cost=69.83..439.30 rows=100 width=8) (actual time=1.76..9.11 rows=100 loops=1)
   ->  Subquery Scan tmp  (cost=69.83..70.08 rows=100 width=12) (actual time=1.38..3.50 rows=100 loops=1)
         ->  Limit  (cost=69.83..70.08 rows=100 width=12) (actual time=1.37..2.59 rows=100 loops=1)
               ->  Sort  (cost=69.83..72.33 rows=1000 width=12) (actual time=1.36..1.79 rows=100 loops=1)
                     Sort Key: commit_date
                     ->  Seq Scan on clp  (cost=0.00..20.00 rows=1000 width=12) (actual time=0.02..0.68 rows=100 loops=1)
   ->  Index Scan using commit_log_pkey on commit_log  (cost=0.00..3.68 rows=1 width=4) (actual time=0.03..0.04 rows=1 loops=100)
         Index Cond: (commit_log.id = "outer".commit_log_id)
Total runtime: 9.69 msec
(9 rows)
As you can see, that's still pretty fast, and still only 100 records. Now we have the rows from commit_log which relate to the last 100 commits to the ports tree. Let's get the ports touched by those commits.
EXPLAIN ANALYZE
SELECT commit_log.id
  FROM commit_log JOIN
            (SELECT clp.commit_log_id
               FROM clp
           ORDER BY clp.commit_date DESC
              LIMIT 100) AS TMP
        ON commit_log.id = TMP.commit_log_id, commit_log_ports
WHERE commit_log_ports.commit_log_id = commit_log.id;

Hash Join  (cost=439.55..1810.22 rows=94 width=12) (actual time=504.90..731.12 rows=144 loops=1)
   Hash Cond: ("outer".commit_log_id = "inner".id)
   ->  Seq Scan on commit_log_ports  (cost=0.00..1048.00 rows=64300 width=4) (actual time=0.09..416.59 rows=64300 loops=1)
   ->  Hash  (cost=439.30..439.30 rows=100 width=8) (actual time=9.64..9.64 rows=0 loops=1)
         ->  Nested Loop  (cost=69.83..439.30 rows=100 width=8) (actual time=1.84..9.09 rows=100 loops=1)
               ->  Subquery Scan tmp  (cost=69.83..70.08 rows=100 width=12) (actual time=1.44..3.56 rows=100 loops=1)
                     ->  Limit  (cost=69.83..70.08 rows=100 width=12) (actual time=1.43..2.63 rows=100 loops=1)
                           ->  Sort  (cost=69.83..72.33 rows=1000 width=12) (actual time=1.42..1.82 rows=100 loops=1)
                                 Sort Key: commit_date
                                 ->  Seq Scan on clp  (cost=0.00..20.00 rows=1000 width=12) (actual time=0.09..0.73 rows=100 loops=1)
               ->  Index Scan using commit_log_pkey on commit_log  (cost=0.00..3.68 rows=1 width=4) (actual time=0.03..0.04 rows=1 loops=100)
                     Index Cond: (commit_log.id = "outer".commit_log_id)
Total runtime: 731.93 msec
(13 rows)
Wow! That is 75 times slower than before! Surely we can do better. I think the clue is the rows=64300 entry, so let's try reducing that. In this particular case, I know we're only looking at values within the clp table, so let's create this function:
CREATE OR REPLACE FUNCTION Anchor_CLID() RETURNS integer
AS '
   SELECT min(commit_log_id) AS RESULT
     FROM clp;

' LANGUAGE SQL IMMUTABLE;

See also how to make the above faster.

Note that the above is for version 7.3, which first introduced the IMMUTABLE designation. This function obtains the minimum commit_log_id value from the clp table. Let's use that in our query:

EXPLAIN ANALYZE
SELECT commit_log_id, port_id FROM
    (SELECT commit_log.id
      FROM commit_log JOIN
                (SELECT clp.commit_log_id
                   FROM clp
               ORDER BY clp.commit_date DESC
                  LIMIT 100) AS TMP
            ON commit_log.id = TMP.commit_log_id) AS TMP2 JOIN commit_log_ports
                         ON commit_log_ports.commit_log_id = TMP2.id
                         AND commit_log_ports.commit_log_id > Anchor_CLID();

Merge Join  (cost=442.62..594.26 rows=1 width=16) (actual time=8.52..12.97 rows=141 loops=1)
   Merge Cond: ("outer".id = "inner".commit_log_id)
   ->  Sort  (cost=442.62..442.87 rows=100 width=8) (actual time=8.44..8.86 rows=100 loops=1)
         Sort Key: commit_log.id
         ->  Nested Loop  (cost=69.83..439.30 rows=100 width=8) (actual time=1.39..7.81 rows=100 loops=1)
               ->  Subquery Scan tmp  (cost=69.83..70.08 rows=100 width=12) (actual time=1.32..3.41 rows=100 loops=1)
                     ->  Limit  (cost=69.83..70.08 rows=100 width=12) (actual time=1.31..2.51 rows=100 loops=1)
                           ->  Sort  (cost=69.83..72.33 rows=1000 width=12) (actual time=1.30..1.71 rows=100 loops=1)
                                 Sort Key: commit_date
                                 ->  Seq Scan on clp  (cost=0.00..20.00 rows=1000 width=12) (actual time=0.01..0.71 rows=100 loops=1)
               ->  Index Scan using commit_log_pkey on commit_log  (cost=0.00..3.68 rows=1 width=4) (actual time=0.02..0.02 rows=1 loops=100)
                     Index Cond: (commit_log.id = "outer".commit_log_id)
   ->  Index Scan using clp_cli on commit_log_ports  (cost=0.00..149.91 rows=39 width=8) (actual time=0.03..1.35 rows=141 loops=1)
         Index Cond: (commit_log_id > 69061)
Total runtime: 13.81 msec
(15 rows)
I'd say that's fast enough for my needs. The key to this function is the IMMUTABLE designation, which indicates that the function returns the same value for a given set of parameters. This allows the query planner to make certain optimizations that it otherwise could not: with this designation, the planner considers the function's return value to be a constant, and therefore it can make use of the available indexes. Please refer to the CREATE FUNCTION documentation for more information.

In fact, my function isn't really IMMUTABLE; it's actually STABLE. If I change the function to STABLE, the time goes up to nearly 19 seconds. If I add a < condition to the range, the time drops to just under one second, but it never gets near the speed of the IMMUTABLE version. Please read the CREATE FUNCTION documentation and pick the designation which best suits your particular application.
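For reference, the honest STABLE declaration differs only in its final keyword (same function body as above):

```sql
-- The same function, declared STABLE: the result may change between
-- statements (rows come and go from clp), but not within one statement.
CREATE OR REPLACE FUNCTION Anchor_CLID() RETURNS integer
AS '
   SELECT min(commit_log_id) AS RESULT
     FROM clp;
' LANGUAGE SQL STABLE;
```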

More about that clp table

For my specific needs, I'll be doing some more work on the clp table in order to maintain it as things go along. That may turn out to be the subject of a future article.

For now, the starting point is a couple of rules which will maintain the data:

CREATE OR REPLACE RULE commit_log_ports_insert AS
   ON INSERT TO commit_log_ports DO
   INSERT INTO clp
   SELECT new.commit_log_id,
              (SELECT commit_date
                  FROM commit_log
                WHERE id = new.commit_log_id)
   WHERE NOT EXISTS
      (SELECT commit_log_id
         FROM clp
        WHERE commit_log_id = new.commit_log_id);

CREATE OR REPLACE RULE commit_log_update AS
   ON UPDATE TO commit_log DO
   UPDATE clp
      SET commit_date   = new.commit_date
    WHERE commit_log_id = new.id;

The first rule will keep populating the clp table as new ports tree commits are added. It can handle multiple ports per commit by inserting only rows which do not already exist. The second rule will keep the commit_date accurate should any updates occur.

What about deletes? That's taken care of in the foreign key definition of the table. Any delete on the primary table will delete the corresponding row from this table.

Here's a query which will retain the latest 100 entries:

DELETE FROM clp
      WHERE commit_log_id NOT in (
          SELECT commit_log_id
            FROM clp
        ORDER BY commit_date DESC
           LIMIT 100);

Seq Scan on clp  (cost=0.00..570.12 rows=50 width=6) (actual time=9.81..399.59 rows=400 loops=1)
   Filter: (subplan)
   SubPlan
     ->  Materialize  (cost=5.62..5.62 rows=100 width=12) (actual time=0.02..0.38 rows=90 loops=500)
           ->  Limit  (cost=5.37..5.62 rows=100 width=12) (actual time=6.73..7.99 rows=100 loops=1)
                 ->  Sort  (cost=5.37..5.62 rows=101 width=12) (actual time=6.72..7.14 rows=101 loops=1)
                       Sort Key: commit_date
                       ->  Seq Scan on clp  (cost=0.00..2.01 rows=101 width=12) (actual time=0.21..3.87 rows=500 loops=1)
Total runtime: 410.57 msec
(9 rows)

But as you can see, it takes quite some time. Luckily, with some help from the PostgreSQL SQL mailing list, I was able to find a better way:

EXPLAIN ANALYZE
DELETE FROM clp
WHERE commit_date < (SELECT commit_date
                       FROM clp
                   ORDER BY commit_date DESC
                     OFFSET 99
                      LIMIT 1);

Seq Scan on clp  (cost=0.00..2.26 rows=34 width=6) (actual time=7.98..11.42 rows=400 loops=1)
   Filter: (commit_date < $0)
   InitPlan
     ->  Limit  (cost=5.62..5.62 rows=1 width=8) (actual time=7.29..7.31 rows=1 loops=1)
           ->  Sort  (cost=5.37..5.62 rows=101 width=8) (actual time=6.49..6.90 rows=101 loops=1)
                 Sort Key: commit_date
                 ->  Seq Scan on clp  (cost=0.00..2.01 rows=101 width=8) (actual time=0.21..3.73 rows=500 loops=1)
Total runtime: 15.36 msec
That's nearly 27 times faster than my original. Thanks folks.

This could be run from a perl script started by a cron job. It's not a critical task, but it does keep things tidy. I'd also VACUUM FULL ANALYZE that table at least once a week, if not daily.
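A sketch of what that cron job might look like; the database name, file path, and schedule are my assumptions, not part of the actual FreshPorts setup:

```shell
# crontab fragment: trim clp nightly at 03:00, then refresh its statistics.
# Assumes the DELETE query above is saved in /usr/local/etc/trim_clp.sql
# and the database is named "freshports".
0 3 * * * psql freshports -f /usr/local/etc/trim_clp.sql && vacuumdb --full --analyze --table clp freshports
```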

Robert Treat just reminded me of a way to make that function run faster. It uses an old trick I had forgotten about. The original query was:

   SELECT min(commit_log_id) AS RESULT
     FROM clp;

Aggregate  (cost=9.25..9.25 rows=1 width=4) (actual time=5.22..5.23 rows=1 loops=1)
   ->  Seq Scan on clp  (cost=0.00..8.00 rows=500 width=4) (actual time=0.02..3.02 rows=500 loops=1)
Total runtime: 5.31 msec

Let's try this instead:

    SELECT commit_log_id
      FROM clp
  ORDER BY commit_log_id DESC
     LIMIT 1;

Limit  (cost=0.00..0.03 rows=1 width=4) (actual time=0.05..0.07 rows=1 loops=1)
   ->  Index Scan Backward using clp_pkey on clp  (cost=0.00..12.50 rows=500 width=4) (actual time=0.04..0.05 rows=2 loops=1)
Total runtime: 0.15 msec

That new query is 35.4 times faster. This clearly demonstrates the cost of an aggregate function versus the speed of an index scan. Mind you, it could also be argued that aggregate functions could use the index.

Is that it?

That's it. You've seen how adjusting the query and taking a different approach can often improve the speed. It is not always clear how to adjust the query, but now that you have seen a few examples, that should get you started.
