Simple moving objects database with Postgres

08. Aug. 2016 ob 20:32 pod kategorijami: moving-objects-database temporal postgres tutorial

"Moving objects" or "spatio-temporal database" is a database, which stores data with temporal and spatial aspects. One example is keeping a database of how real estate changes through time, another a bit more interactive is, say, tracking location and movement of vehicles through time and keeping the history in database.

What we want to be able to answer is such questions as:

where have the tracked objects been over time,

check whether an tracked object has passed through a checkpoint,

check if two tracked objects have crossed paths,

if two tracked objects have crossed paths, where and at what time has this happened?

To achieve this, we will combine two PostgreSQL's features: builtin range data types and operators, and spatial data types and operators provided by PostGIS extension.

For the example here, we will use GPX track exports of some of my runs. First we need to import GPX data into a table we create:

CREATE TABLE gpx (
    id integer,
    filename text,
    lat numeric,
    lon numeric,
    ts timestamptz);

I created a small Python script converter.py, which reads GPX XMLs in current dir and outputs CSV with coordinates out stdout. This is then directly usable in psql's \COPY command:

\COPY gpx FROM PROGRAM 'python converter.py' WITH CSV;

COPY 17549
Time: 617,839 ms

Modeling spatial aspects

Postgres supports spatial data types via PostGIS extension. I will not go into full detail about PostGIS here, for now it suffices to know it works by adding a geometry column to our table and that it allows us to index this column, which enables fast lookups. In my example we will model the distance the object covered with line geometry. Besides linestrings (lines), there are also points and polygons.

Modeling temporal aspects

Postgres supports ranged types, and among them are also time ranges. daterange is a range between two dates and tstzrange is a range between two timestamptz values. Ranges in Postgres are canonically represented in included-excluded notation, meaning the start of the range is included and the end of the range is excluded but marks the boundary and all values less than that are included. Example:

 SELECT '[2016-01-01, 2016-06-30]'::daterange;
         daterange
---------------------------
  [2016-01-01,2016-07-01)
 (1 row)

With this, we created a date range for the first half of this year, that is between first of January inclusive and last of June inclusive. We can see Postgres normalizes this into 1st January inclusive and 1st July exclusive.

A table for tracking objects

For tracking objects, lets create a new table. Table needs to have a few fields: a serial for identifying each segment, an object identifier to know which object the time and location belongs to, and, of course, time period and corresponding location of the object:

CREATE TABLE tracks (
    id serial,
    object_id integer,
    timeslice tstzrange);

You will notice the location is missing! The geometry columns are added via a special function AddGeometryColumn. We need to provide a number of parameters. Table name and column name are obvious. Next we need to provide spatial reference ID, which uniquely identifies a number of parameters required for correctly interpreting spatial data - here we use ID 4326, which denotes the WGS84 world geodetic system. Geometry type we'll be using is LINESTRING and the last parameter is number of dimensions, which is two:

SELECT AddGeometryColumn('tracks', 'geom', 4326, 'LINESTRING', 2);

Now we want to convert GPX data into tracks. Tracks will have temporal intervals in field timeslice and lines in field geom.

With following query we can convert numerical values in table gpx to real point data type. We use ST_MakePoint to create a point and ST_SetSRID, to set spatial reference system on geometry so PostGIS knows how to interpret it:

--- example
SELECT id, ts, ST_SetSRID(ST_MakePoint(lon, lat), 4326) pt FROM gpx;

Having points, we can use "lead" window function to pair current timestamp and point with timestamp and point in the next row:

--- example
SELECT
    id,
    ts,
    lead(ts) OVER (PARTITION BY id ORDER BY ts ASC) next_ts,
    pt,
    lead(pt) OVER (PARTITION BY id ORDER BY ts ASC) next_pt
FROM points;

Having such two timestamps and points in the same row, all we have to do now is join them into ranged types. Here we must be careful to exclude the last row, since the next_pt and next_ts are both NULL and will not generate correct geometry and time range:

--- example
SELECT
    id AS object_id,
    ST_MakeLine(pt, next_pt) geom,
    tstzrange(ts, next_ts) timeslice
FROM twopoints
WHERE next_pt IS NOT NULL ORDER BY id, ts ASC;

Now we can create a pipeline and join above three queries into one using WITH statement and insert resulting data into tracks table:

WITH
    points AS (
        SELECT id, ts, ST_SetSRID(ST_MakePoint(lon, lat), 4326) pt FROM gpx),
    twopoints AS (
        SELECT
            id,
            ts,
            lead(ts) OVER (PARTITION BY id ORDER BY ts ASC) next_ts,
            pt,
            lead(pt) OVER (PARTITION BY id ORDER BY ts ASC) next_pt
        FROM points),
    segments_and_timeslices AS (
        SELECT
            id AS object_id,
            ST_MakeLine(pt, next_pt) geom,
            tstzrange(ts, next_ts) timeslice
        FROM twopoints
        WHERE next_pt IS NOT NULL ORDER BY id, ts ASC)
INSERT INTO tracks (object_id, timeslice, geom)
SELECT object_id, timeslice, geom FROM segments_and_timeslices;

INSERT 0 17543
Time: 413,548 ms

There. In table tracks, we now have all we need to get meaningful results to the questions above. To make it speedy, we need to add a few more indexes:

modb=# CREATE INDEX tracks_timeslice_idx ON tracks USING GIST (timeslice);
CREATE INDEX
Time: 250,054 ms
modb=# CREATE INDEX tracks_geom_idx ON tracks USING GIST (geom);
CREATE INDEX
Time: 96,496 ms

Answering the questions

Now we can check where an object has been on 7th May this year between 13h and 14h. We do this by using && overlap operator with tstzrange:

SELECT
    A.id,
    A.object_id,
    A.geom
FROM tracks A
WHERE
    A.timeslice && '[2016-05-07 13:00:00+02,2016-05-07 14:00:00+02)'::tstzrange
    AND A.object_id = 0;

   id   | object_id |                                            geom
---------+-----------+--------------------------------------------------------------------------------------------
 117576 |         0 | 0102000020E610000002000000D0B4C4CA681C2D40B2D6506A2F044740AE4676A5651C2D40E84B6F7F2E044740
 117577 |         0 | 0102000020E610000002000000AE4676A5651C2D40E84B6F7F2E0447408CD82780621C2D408F34B8AD2D044740
...

Check if any two objects have crossed paths. In geometry, there's also && overlap operator, but it checks if the bounding boxes of compared geometries overlap, not the actual geometries. This is why we have to use another function here, ST_Intersects, which is exact:

SELECT A.object_id, B.object_id, A.id, B.id
FROM tracks A, tracks B
WHERE
    ST_Intersects(A.geom, B.geom)
    AND A.timeslice && B.timeslice
    AND A.id != B.id
    AND A.object_id < B.object_id;

 object_id | object_id | id | id
------------+-----------+----+----
(0 rows)

Time: 1032,741 ms

Obviously, since all the tracks are from me, I can't cross paths with myself. But we can use postgres to time travel one of my runs forward in time exactly the right amount to make it cross another:

UPDATE tracks
SET
    timeslice = tstzrange(
        lower(timeslice) + interval '302 days 20:33:37',
        upper(timeslice) + interval '302 days 20:33:37',
        '[)')
WHERE object_id = 5;

UPDATE 4632
Time: 164,788 ms

And now, if we repeat above query:

SELECT A.object_id, B.object_id, A.id, B.id
FROM tracks A, tracks B
WHERE
    ST_Intersects(A.geom, B.geom)
    AND A.timeslice && B.timeslice
    AND A.id != B.id
    AND A.object_id < B.object_id;

 object_id | object_id |   id   |   id
------------+-----------+--------+--------
         4 |         5 | 129396 | 133243
(1 row)

Time: 1099,215 ms

In this query it is enough to go through half of returned rows, since we are comparing all the tracks with all of the other and including only one permutation is enough. That's why we limit the query with A.object_id < B.object_id.

Checking if an object has passed a checkpoint then means specifying a line denoting checkpoint and looking up intersections with it in tracks table - an exercise left for the reader.

Modeling moving objects with ranges and lines is just one way. Your database representation might vary a lot depending on questions the users will be asking about the data.

Hopefully this tutorial has made a case how ranged data types in postgres can make otherwise very intensive task - finding intersections in both spatial and temporal dimensions - relatively simple.