Is spatial really special?

Many GIS professionals come into the field from a specific discipline and become interested in GIS as a tool with which to accomplish their goals in their original field and by attrition or intent begin to shift their focus towards GIS. I followed this path myself. After going to school to study wildlife biology, I learned about GIS and enrolled in a minor program in GIS and spatial analysis. I think that this is a good thing in many ways. People with an interest in solving problems in other disciplines will push the field forward in directions that someone who’s sole focus was on GIS would be unlikely to go. We need those people in the industry.

But GIS is a technical discipline. At its core GIS is database technology, albeit with a spatial focus. In my humble opinion many university GIS departments, especially at the certificate level, focus too much on the “spatial” aspects of GIS and not enough on the underlying database technology. As a result these programs produce GIS professionals who are very good at cartography and spatial analysis but who find themselves unprepared for the modern GIS job market where there is an increasing need for people with skills in enterprise level database administration, SQL, and web-based GIS. Even if they are not performing those tasks themselves, GIS professionals should at least have an understanding of the technology in order to be able to communicate effectively with those who will be performing those tasks.

To some extent this is changing as university programs catch up. Those of us who were educated 10 or 20 years ago when most computers were single-user standalone machines, however, still struggle with the vocabulary of multi-user enterprise level client-server databases. Fortunately I was better placed than many in my position as I had spent almost 10 years as a database programmer before returning to school to study biology.

The underlying issue is the concept that “spatial is special”. I would argue that it is not, although it does complicate things. There is certainly a need to understand spatial reference systems. There is a need to understand principles of cartography for outputting results as maps. There is also a need to understand the idea of spatial relationships (touches, disjunct, overlaps, contains, etc) and spatial operations (buffers, intersects, unions, difference, etc). These concepts certainly add some complexity to data but in the end it is still just data and the field of computer science has provided us with some amazing tools for dealing with the data. These tools just need to be extended to deal with the complexities that spatial data include and that is what a spatial database is.

GIS and databases

This has always been the case. When Jack Dangermond created Arc/INFO in the late 1970’s he licensed an existing database called INFO and added spatial capabilities to it (The Arc part). Later when ESRI developed a PC edition of Arc/Info they used another existing database for the PC platform named dBase III to store the attribute data. When ArcView was released and they needed a simpler data storage structure they came up with the shapefile which also used dBase III file structures to store attribute data. When ArcGIS 8.0 was released ESRI introduced the concept of a geodatabase in two flavors. Personal geodatabases were built on top of the Microsoft access .mdb file structure and enterprise geodatabases were built on top of existing enterprise level databases such as SQLServer and Oracle. It wasn’t until ArcGIS 9.0 that ESRI released the first data storage mechanism that they developed from scratch, the File Geodatabase, to replace the personal geodatabase which was subject to file size limitations and performance issues. File GDB’s were very compact and speedy but they also were limited to single-user editing and had no SQL access.

GIS has always been, literally from day one, the merging of existing database technology with some specialized software for dealing with the complexities of spatial data. Database technology has gotten extremely good over the past several decades. The core technology of data storage and retrieval using SQL queries has not changed fundamentally. I could probably run the exact same SQL queries I was writing in the mid 80’s on today’s databases with very little if any modification. What has changed is that modern enterprise databases are so fast they can handle literally millions of concurrent users (think Facebook or Amazon.com), can be accessed from anywhere in the world, are so robust that when one fails it often makes the nightly news, and have to be secure enough to stand up to well organized state-sponsored hacking attempts. Database companies spend millions pouring over minute details to optimize indexing and query processing and shave seconds off of query times and millions more ensuring that no data stored in their systems are ever lost. It is very, very, good software. So good in fact that it is almost inconceivable that any GIS company, even one as large as ESRI, could ever improve on them.

In an ideal world then, in my opinion, the perfect GIS software would be based on existing database technology. The GIS part would largely revolve around display and cartography (visual outputs) and editing (visual inputs). The visual aspect is not handled well by spatial databases which are really focused on data storage and retrieval and often don’t have any specific user interface at all. Instead they simply process SQL commands passed to them from a client and return results that are then interpreted by the client. Also in a perfect world GIS education would begin with a good grounding in database technology, RDBMS theory, and SQL. As in all disciplines, you have to walk before you can run, and walking in GIS terms is a solid grounding in data science. The spatial aspects can and should be added later, but they would be difficult to understand until the basics are covered.

In short, a spatial database provides the following benefits.

Speed Security Multi-user editing Access from multiple clients SQL queries

Speed

As I mentioned previously, database companies like Oracle, IBM, and Microsoft invest millions of dollars every year in optimizing data retrieval and security. There is intense competition between these companies and billions of dollars at stake and they have the ability to invest resources that no single GIS company, even one as large as ESRI can hope to compete with. They can afford to hire some of the world’s best computer scientists to delve deep into indexing theory to save seconds off a query or to come up with algorithms to split tasks between multiple computers to improve performance.

Enterprise level databases generally don’t come with much in the way of a user interface but are accessed through SQL queries, and thus speed and security are their main selling points. For that reason you can bet your bottom dollar that they are utilizing every last bit of your computers capacity in the most efficient way possible That is not necessarily the case with GIS software, some of which is still chugging away with 32 bit code on a single CPU core when almost every computer sold in the past decade has been 64-bit and with multiple cores

Security

There are several aspects to consider for security. All computers can fail and all software has bugs that can result in crashes and loss of data. Thus, steps must be taken to minimize the potential for these issues to affect the integrity of your data. Hardware issues are generally dealt with by frequent backups and by the use of replication. Replication means that data is written multiple times to multiple disk drives and sometimes on multiple computers in multiple locations so that if a hardware issue occurs, or some type of disaster results in the loss of a drive or a computer or even an entire city, one of those replicates can step in and provide access to your data often without even having to stop the program. Even small companies can have their data replicated to the cloud somewhere to prevent gaps in availability and make it easy to recover from computer failures.

Another aspect of data security involves minimizing the potential for human errors. These are almost impossible to prevent entirely but there are steps that can be taken to reduce that potential. Enterprise level databases allow fine-tuned control over what individual people have access to through the granting of permissions. Each user has to log-in to the database with a name and password and that determines what kinds of changes they are allowed to make. In many databases, permissions can be granted down to the individual field level. In this way, users can be prevented from viewing, changing or deleting information that they should not have access to.

Triggers can also be used in most database systems. Triggers are custom functions that are executed in response to a database event. For instance, a buffer around a raptor nest could be created automatically when a nest is located and the size of the buffer might be dependent on the species. This saves time for the user as well as minimizes human error. Triggers can also perform validation to make sure that a value falls within a specified range, or is entered in all capital letters, etc. Doing this at the database level ensures that all the data is consistent and there is no possible way for invalid entries to occur.

Transactions are another method for ensuring data integrity. Transactions are a group of SQL statements that are run together as a group in such a way that if any one of the statements fails, the entire group is rolled back. This makes certain that there are no half measures in the database, as could occur if, for instance, a buffer was created automatically in response to entering a nest but something fails before all the field data are copied over. If all the steps were grouped in one transaction, the buffer would be removed if any of the following steps failed to execute. A properly set up database can go a long way towards preventing human-error from affecting the integrity of your data, but as a wise man once said “You can fool proof anything but you can’t idiot proof nothing”.

The final part of data security is preventing malicious attacks on your data. You may think that this is not an issue for your company if your data is not something that would attract a lot of attention, but it is surprising how many people are out there probing everything they can find. Sometimes it is just curiosity, sometimes it might be just to sow chaos in whatever manner they can, sometimes it is to use your computer or network as a stepping stone to someone else’s computer system, sometimes they will attempt to hold your data ransom and extort money from you. Some companies, however, do deal with data that would be valuable in the wrong hands. They will have concerted attempts, sometimes by skillful and well-organized groups of hackers trying to break into their data. Regardless it is always best to make reasonable steps to secure your data. Data breaches into large companies often make national news and no database company wants to learn that it was a weakness in their database software that was to blame. Such an event could cost a company billions in lost revenue and so database companies are highly motivated to make their systems secure.

Multi-user editing

Most of my career was spent in small, but growing environmental consulting firms. Even in these small companies, which didn’t even focus on GIS, I spent large amounts of times banging my head against wall and trying to solve issues related to multi-user editing in commercial GIS which was fundamentally a single-user system. Moving up the ladder to a multi-user enterprise geodatabase in a commercial GIS would have required spending more money than the entire value of the contracts I was working on and so I was told to “find a way to make it work”. Had I known then what I know now I could have switched to an open-source solution in a couple of days but hind-sight is 20-20.

Like many people I was stuck in the mindset that “spatial is special” and that any solution would have to come from ESRI, and ESRI did have solutions available but there was no way that my company could afford them. Even though solutions existed for limited multi-user editing access, ESRI showed little interest in implementing them. Even worse, their locking system prevented editing access for anyone other than the first person who opened not just a file or layer, but an entire directory or geodatabase. This locking occurred regardless of whether they were actually in editing mode or not. As a result, a lot of time was wasted having the entire company exit ArcGIS just so one person could open a map that they needed to edit.

QGIS on the other hand was originally developed to be a client to data stored in PostGIS (a spatial extension to the PostgreSQL database) so it has multi-user editing capability built into its very core. It can also be a client to spatial data stored in SQLServer and Oracle. In all these cases, the database handles everything related to data storage, which it is very good at, while QGIS focuses on what it is good at, which is editing and displaying spatial data. And that is exactly as it should be in my opinion.

Access from multiple clients

Enterprise databases usually exist independent of any user interface. They are focused solely on the storage and retrieval of data in the most efficient means possible. Client software accesses the data, usually through a SQL query. This data can be accessed by many types of clients in many places simultaneously. The clients might be full-blown desktop GIS software such as ArcGIS or QGIS or some combination of the two. QGIS and ArcGIS can both access the same data simultaneously in many cases, but be careful because it is common that the data will only be able to be edited in one or the other so you will need to think carefully about who will be provided with which software if you are mixing them. Another client might be a spreadsheet that is just pulling tabular data for inclusion in a report or a statistical program for more formal analysis. At the same time the IT department might be querying the database in response to a request for information from a project manager, technicians might be entering data in the field from a mobile app, or using the data to help them navigate, and clients might be viewing that data in real-time through a secure web interface that is provided to them.

Everyone has access to the same set of data from multiple places and multiple types of software but each one only has specific tools available to them and each one can have specific permissions granted to them. This is the power and promise of a spatial database and the thing that makes it all possible is a 40 year old technology called SQL or structured query language.

SQL

SQL has been around for a long time in computer terms. I was writing SQL queries to an Informix database as a part-time after school job while I was in high school in the early 80’s. It is stunning how much computer technology has changed in the intervening 34 years but the basic structure remains largely the same. I’m sure that some of the same SQL queries I typed into a dumb terminal with a 9” green monochrome screen and saved on a cassette tape would run today if I sent them from a mobile device on top of a mountain through an ajax query to a web server. And the reason that SQL has stood the test of time is that it is very, very, good at what it is intended to do.

The primary purpose of SQL is to query the database to retrieve the information that you need. Sure, it can do a lot of other things, such as defining database structure, managing data through INSERT, UPDATE, and DELETE statements, and managing users by assigning roles and granting permissions. The SQL SELECT statement, however, is the crown jewel. It is used to query the database to get the information that you need and people who are good with SQL can get exactly the information that they need very quickly.

The basic structure of a SQL SELECT query is very simple. It starts with the SELECT keywords, followed by a list of the fields in the database table that the user is interested in, followed by the FROM keyword and the name of the table in the database. The standard is that keywords are capitalized, while field names and table names are all lowercase. The following SQL statement, for instance would return the first name, last name, birthday, age, city and state from the customers table in the database.

SELECT fname, lname, birthdate, age, city, state FROM customers

You can also add a WHERE clause to filter a subset of the data, for instance:

WHERE state=’NY’

And you can also add an ORDER BY statement to get the data sorted in a particular order, for instance

ORDER BY city, lname, fname

And while the basic structure is very simple, things can start getting complex. You might want to group on a certain field to get subtotals summarized for that field. This requires the GROUP BY statement. You might also want to JOIN two or more tables using the JOIN statement to connect them on a common field. For instance you could join an eagle nest feature to all of its survey results. If you add subqueries and window functions, etc., things can get complex very quickly, and this is before we get into anything even remotely spatial.

An important consideration is that a SQL query, no matter how complex, is simply text. The statement is passed to the database by the client, the database processes the query on the server, and returns a set of data to the client. It’s up to the client to know how to send the data to the server and how to process the results. A GIS program might display point data on a map, while a web page might display a table of results. Text-based SQL means that you can store an entire query in a database, or send it in an email, and they are very easy to work with across the internet as well. A web page can send a SQL statement to the database through an AJAX request that can be dealt with on the server side and return results to the web page.

Many traditional personal database programs such as Microsoft Access attempt to remove the user from writing SQL queries by providing graphical user interfaces, but any query that you can build in its interface you can also view in SQL view to see the SQL behind the query. ArcGIS does this as well, they call their graphical user interface ‘Select by Attributes’, and you’ve probably used it.

But you can see in the area circled in red, that all you are doing is creating a SQL SELECT statement. There is nothing wrong with these graphical user interfaces, they can be great tools for learning SQL and can speed up some queries as well, but the user should understand what is really going on under the hood.

Spatial databases

Hopefully if you’ve read this far you are beginning to realize the advantages of an enterprise level client server relational database management system. But what does it mean to spatially enable a database? At a minimum, it means that you have the ability to store geometry in the database. Generally, the actual coordinates that form points, lines, and polygons are stored in binary format in a single column in the database. You can’t read this information directly because it’s binary but, as we’ll see in a bit, you can transform it into human readable form if you want. This means that client software that is used to edit and/or display the data from the database also has to understand how to convert the binary data into coordinates that it can use for that purpose. Conceptually, this is not much different than the way that shapefiles store data, except that the coordinates are stored directly in the database, not in a separate file.

Most modern databases have the ability to store geometry as a column in a database alongside a features attributes. Many also include a set of spatial functions for dealing with spatial data but the quality of these can vary considerably. Commercial databases such as Oracle, DB2, and SQLServer often have their own proprietary spatial extensions that allow spatial data to be included in their databases. Open-source databases such as PostgreSQL and even the file-based database SQLite have spatial extensions for them as additional open-source projects (PostGIS and SpatiaLite). In addition ESRI has its own proprietary spatial extension called ArcSDE that can work on top of both the above mentioned commercial databases as well as PostgreSQL. QGIS on the other hand can be used as a client to both the proprietary spatial extensions of commercial databases and the open-source spatial extensions to open source databases.

These spatial functions include tools for converting between coordinate systems, measurement (distance and area), testing relationships (overlaps, touches, etc), and creating new geometries (buffer, intersect, difference, etc).

Using a spatial database

We’ve established that enterprise database software is awesome and that they can be spatially enabled to store spatial data and they usually include some special functions for dealing with the complexities of spatial data. But how does one actually use them? Consider the following example where we have a set of eagle nest data classified into active (green) and inactive (orange) nests.

If we wanted to create a layer containing buffers around just the active nests in desktop GIS software, we would need to first select the active nests and then run a geoprocessing tool to create a second layer of buffer polygons which will create another file on disk. With a spatially enabled database we can create a layer based on a SQL query using the ST_Buffer function with the following SQL statement.

SELECT nest_id, status, st_buffer(geometry, 0.01) FROM bald_eagle_nests WHERE status=’ACTIVE NEST’;

The st_buffer function takes an input geometry and a buffer distance, in this case 0.01 degrees (because the data is stored in geographic coordinates), and returns a polygon geometry representing a buffer around the point.

So far, the standard desktop software doesn’t seem that awful in comparison. You could probably get your results almost as fast the first time. But what if you also want a second layer showing a smaller buffer around the inactive nests? With desktop GIS software you’d need to go through the same set of steps again, but with SQL you could just edit the buffer function and the where clause and execute it again.

But what if one of your field staff reports that one of the inactive nests has become active. In desktop software you’d have to go through all of the steps again, creating new files each time, deleting the old files, making sure that all of the maps reflect the new data, etc. With SQL, however, you can just execute the queries again, by refreshing the map view, without the need to create new files on disk. In fact, with an enterprise database, your field personnel could change the nest status from a mobile device in the field that would update the data in the database in real time. Then when you ran your query again it would reflect those changes even before your field personnel returns to the office with no need for anyone to do anything and no need to create a bunch of intermediate files on your computer.

Now let’s consider a slightly more complicated example. Consider that you have a client who is building pipelines in this area and they need to stay on top of which pipelines are within a certain distance of active eagle nests so they can avoid disturbing the eagles. But it is spring time and the eagles are building new nests and abandoning old nests and your client is busy building new pipelines and finishing old ones. You have some field staff who are monitoring the eagle nests constantly and other field staff that are surveying new pipeline routes and the data is changing constantly. It’s a very dynamic environment and your client needs accurate information in real time. In the old desktop GIS paradigm, your field staff would have to return to the office and enter their changes, you would have to perform a series of steps to create a new set of data showing just the pipelines that are close to eagle nests, and then you would have to update a map to send to your client.

In the new spatial database paradigm, you could provide your client with a web interface that showed exactly which pipelines were close to eagle nests. They could open the web interface and have the information they needed in real time as the data was updated by your field staff. In fact, for this specific use case, there would be no need for anyone to even open desktop GIS period. The following SQL query would provide this information.

SELECT e.nest_id, e.status, p.project, p.type, ST_Distance(e.geometry, p.geometry) as dist, p.geometry FROM linear_projects As p LEFT JOIN bald_eagle_nests as e ON ST_DWithin(e.geometry, p.geometry, 0.01) WHERE e.status=’ACTIVE NEST’ AND p.type=’Pipeline’ ORDER BY p.project, dist

This query joins the bald_eagle_nests table (aliased as e) with the linear_projects table (aliased as p). Because we are dealing with two tables joined together we have to prefix all field names with the name (or alias) of the table.

Also notice that we use the ST_Distance function to calculate the distance between any two geometries and that the tables are joined, not on the basis of a common field, which they do not have, but on the basis of a spatial relationship. In this case the spatial relationship is determined by the ST_DWithin function which returns true if the two geometries are within the specified distance of one another. The output of this query could be a table listing the pipelines that are currently impacted by an active eagle nest.

Or the output could be viewed on a map. In this case they are shown in red in QGIS, but it could just as easily be viewed remotely through a secure web interface.

Clearly this is just the tip of the iceberg when it comes to what is possible with spatially enabled databases. Hopefully this post at least stimulated your thinking. If you want to learn how you can get started TODAY, working with spatially enabled databases for free in QGIS with SpatiaLite, please head to part II of this post Quick start to spatial databases with QGIS and SpatiaLite.

Shameless Plug

If you want to learn more about spatial databases, I have a course on the Udemy platform available now. The course covers the basics of working with PostGIS data in QGIS, including Spatial SQL, accessing the database from multiple clients, basic database administration, etc. This course has 10 hours of video content and is available to readers of this blog for $20. More information here.

I also touch on the subject in my current series of courses on web programming for GIS application on Udemy. The first course Introduction to web programming for GIS applications provides the big picture of how web GIS works and includes several sections on databases, SQL, PostGIS, and building web interfaces to data stored in PostGIS. The second course on building client side web mapping applications with Leaflet and third course on mobile GIS applications with Leaflet are less relevant but will be useful for building any web based mapping applications. All of the above courses are available now on the Udemy platform for $20 each (80% off) using the links provided.

I also have a course on QGIS 3.0 for GIS professionals. This is relevant because QGIS was originally designed as an interface for working with spatial data in PostGIS and so spatial databases, especially PostGIS and SpatiaLite, are embedded very deeply in its DNA and QGIS works very well as a client to PostGIS data.

After this I will return to web mapping with a course on server-side web mapping applications with PostGIS and PHP. If you want to stay advised of further offerings, please subscribe to this blog, or to my YouTube channel, or follow me on Facebook, Twitter, or LinkedIn.