I have been using Postgis 2.0 for 3/4 of a year now and while I really enjoy using it, excessive query processing time has rendered it basically unusable for my use case.

I tend to do heavy geoprocessing on municipal datasets which often have hundreds of thousands of multipolygons. These multipolygons are sometimes shaped very irregularly and can vary from 4 points to 78,000 points per multipolygon.

For example, when I intersect a parcel dataset with 329,152 multipolygons with a jurisdiction dataset containing 525 multipolygons, I get the following stats for total time consumed:

ArcGIS 10.0 (on same host with windows 7 OS): 3 minutes Postgis:56 minutes (not including geometry pre-processing queries)

In other words, it requires 1500% more time to do this intersection in Postgis than in ArcGIS --and this is one of my more simple queries!

One of the reasons ArcGIS supposedly runs faster is due to better indexes. Some programmers recently figured out how these indexes work and I am wondering if anyone knows how to build these indexes in Postgis (or build tables that would mimic the indexes). Perhaps this would solve most of the speed issues in Postgis. I can only hope there must be some way, especially since ArcGIS can only use 4 GB of RAM while I could use up to 4 times that for my postgis server!

Of course there are many reasons postgis can run slowly, so I will provide a detailed version of my system specs:

Machine: Dell XPS 8300 Processor: i7-2600 CPU @ 3.40 GHz 3.40 GHz Memory: Total Memory 16.0 GB (10.0 GB on virtual machine) Platform: Ubuntu Server 12.04 Virtual Box VM Potgres Version: 9.1.4 Postgis Version: POSTGIS="2.0.1 r9979" GEOS="3.3.5-CAPI-1.7.5" PROJ="Rel. 4.8.0, 6 March 2012" GDAL="GDAL 1.9.1, released 2012/05/15" LIBXML="2.7.8" LIBJSON="UNKNOWN" TOPOLOGY RASTER

I also detail the entire installation process that I used to set up postgis including the creation of the VM itself.

I also increased shared memory from the default 24MB to 6 GB in the conf file and ran the following commands to allow postgres to run:

sudo sysctl -w kernel.shmmax=7516192768 (I know this setting is deleted every time you restart the OS) sudo /etc/init.d/postgresql restart

As far as I can tell this does absolutely nothing noticeable in terms of performance.

Here are links to the data I used for this test:

Here are the steps I took to process the data:

ArcGIS

Add datasets to ArcMap Set coordinate system to central texas feet (srid 2277) Use intersection tool from dropdown menu

Postgis

Import parcels using:

shp2pgsql -c -s 2277 -D -i -I -W UTF-8 "tcad_parcels_06142012.shp" "public"."tcad_parcels_06142012" |psql -d postgis_testing -U postgres -h local_ip -p 5432

Import jurisdictions using:

shp2pgsql -c -s 2277 -D -i -I -W UTF-8 "jurisdictions.shp" "public"."jurisdictions" |psql -d postgis_testing -U postgres -h local_ip -p 5432

Clean invalid geometry in parcels:

DROP TABLE IF EXISTS valid_parcels; CREATE TABLE valid_parcels( gid serial PRIMARY KEY, orig_gid integer, geom geometry(multipolygon,2277) ); CREATE INDEX ON valid_parcels USING gist (geom); INSERT INTO valid_parcels(orig_gid,geom) SELECT gid orig_gid, st_multi(st_makevalid(geom)) FROM tcad_parcels_06142012; CLUSTER valid_parcels USING valid_parcels_geom_idx;

Clean invalid geometry in jurisdictions:

DROP TABLE IF EXISTS valid_jurisdictions; CREATE TABLE valid_jurisdictions( gid serial PRIMARY KEY, orig_gid integer, geom geometry(multipolygon,2277) ); CREATE INDEX ON valid_jurisdictions USING gist (geom); INSERT INTO valid_jurisdictions(orig_gid,geom) SELECT gid orig_gid, st_multi(st_makevalid(geom)) FROM jurisdictions; CLUSTER valid_jurisdictions USING valid_jurisdictions_geom_idx;

Run cluster:

cluster;

Run vacuum analyze:

vacuum analyze;

Perform intersection on cleaned tables:

CREATE TABLE parcel_jurisdictions( gid serial primary key, parcel_gid integer, jurisdiction_gid integer, isect_geom geometry(multipolygon,2277) ); CREATE INDEX ON parcel_jurisdictions using gist (isect_geom); INSERT INTO parcel_jurisdictions(parcel_gid,jurisdiction_gid,isect_geom) SELECT a.orig_gid parcel_gid, b.orig_gid jurisdiction_gid, st_multi(st_intersection(a.geom,b.geom)) FROM valid_parcels a, valid_jurisdictions b WHERE st_intersects(a.geom,b.geom);

Explain Analyze intersection query:

Total runtime: 3446860.731 ms Index Cond: (geom && b.geom) -> Index Scan using valid_parcels_geom_idx on valid_parcels a (cost=0.00..11.66 rows=2 width=1592) (actual time=0.030..4.596 rows=1366 loops=525) -> Seq Scan on valid_jurisdictions b (cost=0.00..113.25 rows=525 width=22621) (actual time=0.009..0.755 rows=525 loops=1) Nested Loop (cost=0.00..61428.74 rows=217501 width=24213) (actual time=2.625..3445946.889 rows=329152 loops=1) Join Filter: _st_intersects(a.geom, b.geom)

From everything I have read, my intersection query is efficient and I have absolutely no idea what I am doing wrong for the query to take 56 minutes on clean geometry!