Switch from Shapefile

ESRI Shapefile is a file format for storing geospatial vector data. It has been around since the early 1990s and is still the most commonly used vector data exchange format.

While Shapefiles have enabled many successful activities over the years, they also have a number of limitations that complicate software development and reduce efficiency.

We, members of the geospatial IT industry, believe that it is time to stop using Shapefiles as the primary vector data exchange format and to replace them with a format that takes advantage of the huge advances that have been made since Shapefile was introduced.


The good side

Shapefile does a lot of things right. Here are some reasons why Shapefile is so heavily used:

Shapefile is by far the most widely supported format in existing software packages.

While the format is proprietary, the specification is open.

For many use cases, it is good enough.

Its index file (*.shx) contains the offset and length of each feature in the main file (*.shp), which enables good reading performance.

It is relatively efficient in terms of file size. The resulting file, even un-zipped, is relatively small compared to some other (mostly text-based) formats.
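To make the role of the index file concrete, here is a minimal sketch of reading it. It assumes the layout given in the published Shapefile specification: a 100-byte header followed by one 8-byte record per feature, holding two big-endian 32-bit integers (offset and content length) counted in 16-bit words. The function name and the synthetic input are made up for illustration; this is not a full reader.

```python
import struct

SHX_HEADER_BYTES = 100              # fixed-size header before the index records
SHX_RECORD = struct.Struct(">ii")   # offset and content length, big-endian int32


def read_shx_index(data):
    """Return (byte_offset, byte_length) in the .shp file for each feature."""
    entries = []
    last_start = len(data) - SHX_RECORD.size
    for pos in range(SHX_HEADER_BYTES, last_start + 1, SHX_RECORD.size):
        offset_words, length_words = SHX_RECORD.unpack_from(data, pos)
        # The index counts in 16-bit words, so multiply by 2 to get bytes.
        entries.append((offset_words * 2, length_words * 2))
    return entries


# Synthetic index for illustration: a zeroed header plus two made-up records.
fake_shx = bytes(SHX_HEADER_BYTES) + SHX_RECORD.pack(50, 10) + SHX_RECORD.pack(64, 24)
index = read_shx_index(fake_shx)
print(index)  # [(100, 20), (128, 48)]
```

With such an index, a reader can seek straight to the n-th feature instead of scanning the whole main file.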



The bad side

No coordinate reference system definition

By default, a Shapefile carries no definition of the coordinate reference system (CRS) it uses. A CRS can be supplied in a .prj file, but that file is not a standard part of the specification, and it brings issues of its own; see "Multifile format" and "Projection Definition Inconsistencies" below.

Multifile format

The Shapefile format uses at least 3 files (*.shp, *.dbf, *.shx). You cannot share just one file; you must send them all. Users typically zip all the files into one archive and unzip them at the other end of the distribution chain, but this is cumbersome and error-prone. In addition, geospatial software packages routinely add their own sidecar files to try to overcome Shapefile limitations. These custom additions are not supported by other tools and limit interoperability.

NOTE: 3rd December is considered International Shapefile Day because, thanks to its modular, extensible architecture, a Shapefile can have 12+ sidecar files, 3 of which are mandatory.
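As a small illustration of why the multifile layout is error-prone, the sketch below checks whether the three mandatory component files are all present next to a given .shp path. The helper name and the "roads" dataset are hypothetical, and the check is case-sensitive, so files named with upper-case extensions would need extra handling.

```python
import tempfile
from pathlib import Path

# The three component files named in the text above.
MANDATORY_EXTENSIONS = (".shp", ".shx", ".dbf")


def missing_sidecars(shp_path):
    """Return the mandatory extensions that have no file next to shp_path."""
    base = Path(shp_path)
    return [ext for ext in MANDATORY_EXTENSIONS if not base.with_suffix(ext).exists()]


# Demo with empty placeholder files; the .shx file is deliberately left out.
with tempfile.TemporaryDirectory() as tmp:
    for ext in (".shp", ".dbf"):
        (Path(tmp) / f"roads{ext}").touch()
    demo_missing = missing_sidecars(Path(tmp) / "roads.shp")

print(demo_missing)  # ['.shx']
```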

10-character attribute names

Attribute names are limited to 10 characters. Longer names are usually shortened automatically. This leads to abbreviated and/or cryptic attribute names that are unintuitive to the recipient of the data.
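The effect of the limit is easy to reproduce. The sketch below uses plain truncation, which is only an assumption about how a tool might shorten names; real software may apply its own renaming scheme. The field names are invented for illustration.

```python
from collections import defaultdict


def truncate_field_names(names, limit=10):
    """Cut every name to at most `limit` characters (naive truncation)."""
    return [name[:limit] for name in names]


def find_collisions(names, limit=10):
    """Map each shortened name to the original names that collapse onto it."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:limit]].append(name)
    return {short: full for short, full in groups.items() if len(full) > 1}


names = ["population_2010", "population_2020", "area_km2"]
short = truncate_field_names(names)
collisions = find_collisions(names)
print(short)       # ['population', 'population', 'area_km2']
print(collisions)  # {'population': ['population_2010', 'population_2020']}
```

Two distinct columns end up with the same name, so a writer has to rename at least one of them, and the recipient has to guess what the result means.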

255 attribute fields

There can be only 255 attribute fields in the database file. For some applications this is limiting, especially in combination with the flat table structure.

Poor support for attribute data types

Only float, integer, date and character string data types are supported. Floating point numbers can be stored as text, but there is no support for big integers, so the format is not usable if you have data with big integer identifiers, as is common in cadastral maps. Text is limited to 254 characters. There is no support for more advanced field types such as blobs, images or arrays.

Unknown character set

There is no way to specify the character set used in the database file. Many applications still use the old Windows-* or ISO-* encodings, while UTF-8 is increasingly the norm today, yet there is no way to declare the encoding in the file header. Support for Unicode characters is also very limited.
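A short sketch of what goes wrong when the reader has to guess the encoding. The place name is just sample text, and cp1250 stands in for "one of the old Windows-* encodings".

```python
name = "Čáslav"  # sample text with non-ASCII characters

# The writer stored UTF-8 bytes, but the reader assumes Windows-1250.
utf8_bytes = name.encode("utf-8")
misread = utf8_bytes.decode("cp1250")
print(misread)  # garbled text of a different length than the original

# The opposite guess fails outright: these cp1250 bytes are not valid UTF-8.
cp1250_bytes = name.encode("cp1250")
try:
    cp1250_bytes.decode("utf-8")
    reverse_failed = False
except UnicodeDecodeError:
    reverse_failed = True
print(reverse_failed)
```

Neither outcome can be prevented by the file itself; the reader has to be told the encoding through some other channel.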

2GB size limit

For compatibility across software implementations, neither the .shp nor the .dbf component file should exceed 2 GB. The Shapefile format itself uses 32-bit offsets that count 16-bit words, so a .shp file cannot go over 8 GB in any case. The GDAL/OGR Shapefile driver can write past the 2 GB mark, but its implementation is limited to 4 GB. So 4 GB is all you can have in a single Shapefile, and only with some software. This sounds like enough, but it is not for all cases.
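The numbers above follow from simple arithmetic on the offset width. The 8 GB figure comes directly from "32-bit offsets to 16-bit words"; reading the 4 GB and 2 GB figures as signed-integer ceilings is an assumption made here for illustration, not something the format description states.

```python
BYTES_PER_WORD = 2  # offsets count 16-bit words
GIB = 2**30

# 32-bit offsets counting 16-bit words: the ceiling of the format itself.
format_ceiling = 2**32 * BYTES_PER_WORD

# Assumption: 4 GB corresponds to treating the same offsets as signed 32-bit.
signed_word_ceiling = 2**31 * BYTES_PER_WORD

# Assumption: 2 GB corresponds to software that tracks signed 32-bit byte positions.
signed_byte_ceiling = 2**31

for label, size in [
    ("format ceiling", format_ceiling),
    ("signed word offsets", signed_word_ceiling),
    ("signed byte positions", signed_byte_ceiling),
]:
    print(f"{label}: {size // GIB} GiB")
```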

Non-topological format

Shapefile is a simple-feature format. There is no way to store more complex geometry relationships.

No mixed geometry

Each file can hold only one of the supported geometry types (Point, Line, Polygon and others). Features with mixed geometry types are not possible.

Flat data structure

The data structure is limited to flat tables with no hierarchies, relations or tree structure.

Very limited 3D support

Shapefile can store neither material definitions nor textures (images with texture coordinates). 3D models are stored as a triangle or polygon soup; watertight models and parametric geometries are not supported.

Projection Definition Inconsistencies

By default, a Shapefile contains no information about its coordinate reference system at all. Some software packages do accept *.prj files, which may contain a CRS description. These files use Esri WKT definitions, which are often incompatible with the standard definitions in EPSG or other sources in aspects such as axis order or unit definitions. Furthermore, they often lack parameters required for reprojection ("Missing Bursa Wolf Parameters", anyone?).

Multipart features have to be determined per feature

Whether a line or polygon geometry is single-part or multipart cannot be reliably determined at the layer level; it must be determined for each individual feature. This leads to inconsistency during automatic data processing: you cannot rely on the input geometry type and must test each feature to see whether it is a single geometry or multiple geometries.

There is no NULL value

There is no way to mark a field of the attribute table as having no data. For numerical fields, you cannot distinguish zero from no data.

Know about another issue? Send us more!

Do you know about more limits, or do you want to expand on the existing ones? Please do so via a pull request or a comment in the repository.

Alternatives

What are the alternatives to the Shapefile format? To be honest, no alternative format has overthrown the Shapefile hegemony yet. Some formats nearly took over (KML, GML, GeoJSON), but their usage was limited to relatively narrow use cases only.

Although there are more than 80 vector data formats in use out there, only a few can be considered candidates for Shapefile replacement. Please note that we take only open (preferably community) formats into account.

OGC GeoPackage

OGC GeoPackage is one of the most promising formats, designed for today's modern applications. GeoPackage is published as a standard by the Open Geospatial Consortium.

Features

SQLite as backend

File based, single file

Vectors, rasters

Official extensions

Supported in many software packages

Description

GeoPackage is an open, standards-based, platform-independent, portable, self-describing, compact format for transferring geospatial information. The GeoPackage Encoding Standard describes a set of conventions for storing the following within an SQLite database:

vector features

tile matrix sets of imagery and raster maps at various scales

attributes (non-spatial data)

extensions

There are several published extensions for GeoPackage which make this format even more powerful. As of 2017, GeoPackage is supported in most GIS software packages.

One downside to GeoPackage is that the underlying SQLite database is a complex binary format that is not suitable for streaming. It must either be written to the local file system or be accessed through an intermediary service.

We recommend GeoPackage as a Shapefile replacement for scenarios where the recipient will want to query or edit the data locally.
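Because a GeoPackage is an ordinary SQLite database, its table of contents can be read with nothing but a SQLite client. The sketch below queries the gpkg_contents table defined by the GeoPackage standard. The demo file is a simplified stand-in with only two of that table's columns, not a valid GeoPackage, and the layer names are invented.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path


def list_gpkg_contents(path):
    """Return (table_name, data_type) for every entry in gpkg_contents."""
    with closing(sqlite3.connect(path)) as conn:
        return conn.execute(
            "SELECT table_name, data_type FROM gpkg_contents ORDER BY table_name"
        ).fetchall()


# Build a stand-in database so the example runs without a real GeoPackage.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = str(Path(tmp) / "demo.gpkg")
    with closing(sqlite3.connect(demo_path)) as conn:
        conn.execute(
            "CREATE TABLE gpkg_contents (table_name TEXT PRIMARY KEY, data_type TEXT NOT NULL)"
        )
        conn.executemany(
            "INSERT INTO gpkg_contents VALUES (?, ?)",
            [("buildings", "features"), ("orthophoto", "tiles")],
        )
        conn.commit()
    demo_rows = list_gpkg_contents(demo_path)

print(demo_rows)  # [('buildings', 'features'), ('orthophoto', 'tiles')]
```

For real work, a GIS library is the better choice; the point here is only that a single file holds vector layers, raster tiles and their metadata side by side.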

FlatGeobuf

FlatGeobuf is a new format, designed for performance and simplicity.

Features

Binary encoding based on FlatBuffers

File based, single file

Vectors

Can be efficiently serialized and streamed (read/write)

Description

FlatGeobuf is an open, standards-based, platform-independent, portable, self-describing, performant and compact format for transferring geospatial information. As of 2020, FlatGeobuf is supported in GDAL 3.1 and the QGIS development version. A reference TypeScript/JavaScript implementation is available and is suitable for use with, for example, OpenLayers and Leaflet.

We recommend FlatGeobuf as a Shapefile replacement for scenarios where performance is critical and for system-to-system integrations. Because of its streaming capabilities, it is also suitable as an alternative WFS output format and is available as an official extension to GeoServer.

GeoJSON

"GeoJSON isn't a shapefile replacement."

-- Sean Gillies

Features

JSON format

File based

Can handle complex data

File size grows fast

IETF Standard

Description

GeoJSON is a community format based on the popular JSON data exchange format. It is a very simple, human-readable, text-based format. Although it is technically possible to use it with other coordinate reference systems, the specification states clearly that WGS84 is the only system that should be used. It can handle complex vector data features and build complex hierarchical data models. Since GeoJSON is a JSON encoding, it is very easy to parse. It also supports streaming (features are dealt with as they come in, without waiting for the whole file to load). The problem with GeoJSON is that not all geometries can be represented and advanced coordinate reference systems are not well supported.

We recommend GeoJSON as a Shapefile replacement for data interchange, particularly for web services. For datasets with geometries or coordinate reference systems not representable in GeoJSON, GML may be suitable.
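To show how little is needed to produce and parse GeoJSON, here is a minimal sketch using only a JSON library. The feature is illustrative sample data: the coordinates are approximate, in longitude, latitude order on WGS84, and the property names are made up. Note the long attribute name, the explicit null and the nested list, none of which a Shapefile attribute table can hold.

```python
import json

feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [14.4378, 50.0755]},
            "properties": {
                "name": "Prague",
                "population_estimate_from_last_census": None,  # explicit "no data"
                "tags": ["capital", "city"],                   # nested values
            },
        }
    ],
}

# Serialize to text and parse it back with a general-purpose JSON parser.
text = json.dumps(feature_collection, ensure_ascii=False)
parsed = json.loads(text)
first = parsed["features"][0]
print(first["geometry"]["type"], first["properties"]["tags"])
```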

OGC GML

Another OGC Standard.

Features

XML Based

Only vectors

Hierarchies

Thanks to INSPIRE, at least partial support in many software packages

Description

GML was picked as the main vector data distribution format by the European INSPIRE initiative. It's a very complex format, and its direct usage in GIS software is limited. Its main use is as a data exchange format that needs to be ingested into the user's system (e.g. into a database) to be fully usable. GML is currently often used for open data datasets, since it is technology-neutral and a supported OGC Standard.

A major downside to GML is that it is an insanely complex standard. Few software packages support the entire standard, and support for individual parts of the standard varies widely.

We believe that GML is a candidate for Shapefile replacement for data interchange in situations where the data is too complex to be represented by GeoJSON. However, for the vast majority of datasets GML is overkill.

SpatiaLite

SpatiaLite is a popular file-based database for data storage.

Features

File based

SQL database

OGC Simple Features

Description

SpatiaLite is an open source library intended to extend the SQLite core to support fully fledged Spatial SQL capabilities. SQLite is intrinsically simple and lightweight:

a single lightweight library implementing the full SQL engine

standard SQL implementation: almost complete SQL-92

a whole database simply corresponds to a single monolithic file (no size limits)

any DB-file can be safely exchanged across different platforms, because the internal architecture is universally portable

Support for SpatiaLite is relatively limited, and most software that supports SpatiaLite also supports GeoPackage. Both build on the same underlying technology, SQLite. SpatiaLite lacks the support for extensions and raster data that is present in GeoPackage. While these are not necessarily must-have features, they may be useful. Like GeoPackage, it is unsuitable for streaming.

Since SpatiaLite offers no clear advantages over GeoPackage at this time, it should only be considered as a Shapefile replacement in niche scenarios.

CSV

Some people tend to use comma-separated files for storing geospatial data.

Features

Simple

Description

Among non-geospatial people, CSV is very popular, but for most geospatial applications it is not an ideal format. There are at least two reasons not to use CSV as a Shapefile replacement: it isn't standardized (there are many dialects out there), and support for non-point geospatial data is complicated.
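The sketch below illustrates the second point. It assumes one common convention, storing geometry as a WKT string in a column named wkt; nothing in CSV itself prescribes this. A polygon's WKT contains commas, so the writer has to quote the value, and any reader that simply splits on commas breaks it apart. The rows are invented sample data.

```python
import csv
import io

rows = [
    {"name": "well", "wkt": "POINT (14.4 50.1)"},
    {"name": "parcel", "wkt": "POLYGON ((0 0, 4 0, 4 3, 0 3, 0 0))"},
]

# Write with the csv module, which quotes values that contain commas.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "wkt"])
writer.writeheader()
writer.writerows(rows)
text = buffer.getvalue()

# A naive reader that splits on commas tears the polygon into pieces.
naive_parts = text.splitlines()[2].split(",")
print(len(naive_parts))  # 6 pieces instead of 2 columns

# A dialect-aware reader recovers the original rows.
round_trip = list(csv.DictReader(io.StringIO(text)))
print(round_trip == rows)  # True
```

Both sides have to agree on the column name, the geometry encoding and the quoting rules, and none of that is recorded in the file.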

OGC KML

OGC KML was a popular vector data format due to the popularity of Google Earth.

Features

file-based

XML

Combines geometry along with cartography

Supports just the WGS-84 coordinate system

Description

KML was originally devised as the exchange format for a software package called Keyhole. When Google purchased Keyhole and released it as Google Earth, KML gained in popularity. However, as the geospatial community hit the limits of both Google Earth and KML, KML's popularity has waned. Since it is XML based, it is not efficient for storing larger datasets. It combines cartography with the data geometry in one file, which is problematic when the data has the potential to be used in multiple ways. Since it officially supports only the WGS-84 coordinate reference system, it is not suitable for a number of applications.

ESRI GeoDatabase

At its most basic level, an ArcGIS geodatabase is a collection of geographic datasets of various types held in a common file system folder, a Microsoft Access database, or a multiuser relational DBMS (such as Oracle, Microsoft SQL Server, PostgreSQL, Informix, or IBM DB2).

Features

native data structure for ArcGIS

file-based (or database based)

complex data models

proprietary, closed format

Description

GeoDatabase is very often used in the ArcGIS environment as the main data exchange format. Its features are very complex and advanced. On the other hand, since it is a proprietary, closed format, implementations outside the environment of ESRI products are extremely limited. It is only a candidate for replacing Shapefiles in an environment centered on ArcGIS.

Last modification: 2017-10-08

Initially created by: Jachym Cepicky, OpenGeoLabs s.r.o.



This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License

Contribute: On GitHub