PostgreSQL tutorial: Efficient color similarity search

10. Jun. 2016 ob 10:08 pod kategorijami: colors postgres

I've previously blogged about colors and PostgreSQL here. In this post I will cover some insight into custom GiST index, which enables efficiently looking up similar colors.

Color distance measures

There are several revisions of Delta E color distance measure, each more accurate and more complicated. The first revision is called CIE 1976. After further research in (human) color perception it became clear the CIE 1976 has deficiencies and revised versions CIE 1994 and later CIE 2000 were released.

Numerically, CIE 1976 is just euclidean, straight line distance between points in L*a*b color space, as this color space was defined such to have this property. The CIE 2000 Delta E is far more complex and non uniform and lacks this property, so we can't use CIE 2000 with vanilla PostgreSQL 9.6.

So let's see how you can implement CIE 1976 color similarity search with PostgreSQL 9.6.

PostgreSQL cube extension

Cube contrib extension is a custom PostgreSQL data type, with which you can represent N-dimensional vectors. It also supports indexes, so you can efficiently look up values based on column of this data type.

Let's load the extension:

colorsdb=# create extension cube;
CREATE EXTENSION

Okay, quick overview how cube works! We can create a sample point and check it is a point:

 colorsdb=# select '(1,2,3)'::cube;
    cube
------------
  (1, 2, 3)
 (1 row)

 colorsdb=# select cube_is_point('(1,2,3)'::cube);
  cube_is_point
----------------
  t
 (1 row)

Or we can create a 3x3x3 cube and check its size:

 colorsdb=# select '(1,2,3),(4,5,6)'::cube;
         cube
----------------------
  (1, 2, 3),(4, 5, 6)
 (1 row)

 colorsdb=# select cube_is_point('(1,2,3),(4,5,6)'::cube);
  cube_is_point
----------------
  f
 (1 row)

 colorsdb=# select cube_size('(1,2,3),(4,5,6)'::cube);
  cube_size
------------
         27
 (1 row)

Cube data type is perfectly suited for color space representation. You can store the colors as 3-dimensional points in space, where point is a special case of cube - it has zero volume and has both upper right and lower left points equal.

You can get euclidean distance with cube_distance:

 colorsdb=# select cube_distance('(1,2,3)'::cube, '(1,2,6)'::cube);
  cube_distance
----------------
              3
 (1 row)

In upcoming PostgreSQL 9.6 version, this extension will also support distance operators. Here, <-> is an euclidean distance operator:

 colorsdb=# select '(1,2,3)'::cube <-> '(1,2,6)'::cube;
  ?column?
-----------
         3
 (1 row)

Let's create an test table with 500.000 records:

create table colors (color cube);

with random_colors as (
    select
        random() * 100 as l,
        random() * 255 - 128 as a,
        random() * 255 - 128 as b
    from generate_series(1, 500000) x)
insert into colors select cube(array[l,a,b]) from random_colors;

And we can now select 5 closest colors to (1,2,3):

colorsdb=# select '(1,2,3)'::cube <-> color, color from colors order by 1 asc limit 5;
     ?column?     |                            color
-------------------+-----------------------------------------------------------
 1.61559798496189 | (1.89253129065037, 1.71048436127603, 1.68480973271653)
 2.55645225973282 | (3.37235750630498, 1.64988295687363, 3.88588152406737)
 2.55950246687413 | (2.91422470472753, 1.19065871741623, 4.493908747565)
 2.85963596867927 | (1.00834178738296, 4.51141274580732, 4.3675724142231)
 3.11127226552722 | (1.72140300273895, 3.04948928579688, 0.161309270653874)
(5 rows)

What is the cost for this:

 colorsdb=# explain analyze select '(1,2,3)'::cube <-> color, color from colors order by 1 asc limit 10;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=20734.28..20734.30 rows=10 width=40) (actual time=151.844..151.846 rows=10 loops=1)
    ->  Sort  (cost=20734.28..21984.46 rows=500072 width=40) (actual time=151.843..151.845 rows=10 loops=1)
          Sort Key: (('(1, 2, 3)'::cube <-> color))
          Sort Method: top-N heapsort  Memory: 25kB
          ->  Seq Scan on colors  (cost=0.00..9927.90 rows=500072 width=40) (actual time=0.014..88.199 rows=500000 loops=1)
  Planning time: 0.054 ms
  Execution time: 151.864 ms
 (7 rows)

You can see PostgreSQL is doing a sequential scan and it takes 151 ms to get 5 nearest colors. Without index, that is. Since 9.1, PostgreSQL supports returning nearest items with use of GiST index. Let's try indexing colors (building index takes a while):

colorsdb=# create index colors_color_idx on colors using gist (color);
CREATE INDEX

And check the query now:

 colorsdb=# explain analyze select '(1,2,3)'::cube <-> color, color from colors order by 1 asc limit 10;
                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=0.29..1.20 rows=10 width=40) (actual time=0.236..0.565 rows=10 loops=1)
    ->  Index Scan using colors_color_idx on colors  (cost=0.29..45996.29 rows=500000 width=40) (actual time=0.235..0.562 rows=10 loops=1)
          Order By: (color <-> '(1, 2, 3)'::cube)
  Planning time: 0.051 ms
  Execution time: 0.585 ms
 (5 rows)

Yay, 0.6 ms, which is about 250-times faster for a table with 0,5 million records.