At OpenCounter we’re always trying to make interacting with local government as seamless as possible. This is why when someone comes to OpenCounter to check their zoning or research permits and fees for their business, we’re able to replicate the conversations that happen at the counter by matching common business terms to land uses defined by the City. If cities have definitions for uses, those are available to users as well.

OpenCounter already has the ability to show the user a map of where their use code is allowed. Now the process of finding a use code is simplified too.

In the last six months, more than 90% of users found a use code through the search without resorting to browsing the land use table.

A screenshot of the use code search page of https://business.sandiego.gov/

How do we do this? We use a technique at the intersection of natural language processing and machine learning called word embeddings to generate a graph representation of all of the use codes across all of the jurisdictions in OpenCounter, and use results from other jurisdictions to inform any particular search. As OpenCounter grows and is able to compile more uses and associate more words, this algorithm becomes more accurate.

For example, when a user searches for “salon”, we find “Personal Services” because “salon” is a short distance from words like “barber”, “spa”, “beauty”, and “nail”, which our system has already associated with “Personal Services”. For more details on the graph nature of the search, see the technical explanation below.

Technical explanation

The motivation for constructing a use code graph on top of word embeddings is threefold:

First, we are unable to make a single search engine that takes a query and maps it onto a global taxonomy of uses, since jurisdictions have slightly (and sometimes very) different sets of use codes.

Second, we don’t want individual search engines for each jurisdiction, because we want to exploit the similarities between the use code sets to improve the quality of the search.

Third, we want to minimize the amount of human annotation that needs to go into the system. Therefore we want to fuzzy match on words that OpenCounter hasn’t needed to explicitly associate with the system. Word embeddings allow this. For example, we want “pilates” to have a close association with any uses we know to be related to “yoga”.

Embedding use codes

A word embedding is a mapping from a space where each word is one dimension to a continuous vector space of much lower dimension. We use an implementation of word embeddings developed at Stanford called GloVe [1].

We use a pre-trained model of the top 400,000 words from English Wikipedia, projected onto a hypersphere.
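As a rough illustration of what "projected onto a hypersphere" could look like, the sketch below parses vectors in the plain-text GloVe layout (one word per line, followed by its components) and scales each to unit length. It assumes the projection is ordinary L2 normalization. The function name `load_unit_vectors` and the two toy vectors are hypothetical and not taken from OpenCounter's code.

```python
import io
import math

def load_unit_vectors(lines):
    """Parse GloVe-style text lines ("word x1 x2 ...") into unit-length vectors."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word = parts[0]
        values = [float(x) for x in parts[1:]]
        norm = math.sqrt(sum(x * x for x in values))
        if norm == 0.0:
            continue  # a zero vector has no direction to project
        # Assumption: "projected onto a hypersphere" means L2 normalization.
        vectors[word] = [x / norm for x in values]
    return vectors

# Toy stand-in for a GloVe text file (hypothetical values, two dimensions).
sample = io.StringIO("yoga 3.0 4.0\npilates 6.0 8.0\n")
vectors = load_unit_vectors(sample)
```

Once every vector has unit length, cosine similarity reduces to a dot product, which keeps later distance computations simple.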

At the beginning, we treat the use code as a bag of words [w₀, w₁, … , wₙ], which consists of the category, subcategory, name, and any additional keywords associated with a given use code. We embed each word individually from this bag of words to form a bag of vectors [p₀, p₁, … , pₙ]. Then we combine these vectors into a single vector v⃗ using tf-idf.
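The combination step might be sketched as follows. The article does not spell out how tf-idf is applied, so this assumes a tf-idf weighted sum of the word vectors, rescaled to unit length, with a smoothed idf formula as one common choice. The names `idf_weights` and `embed_bag`, and the toy vectors, are hypothetical.

```python
import math
from collections import Counter

def idf_weights(bags):
    """Smoothed inverse document frequency of each word across all bags."""
    n = len(bags)
    doc_freq = Counter(word for bag in bags for word in set(bag))
    return {word: math.log((1 + n) / (1 + df)) + 1.0
            for word, df in doc_freq.items()}

def embed_bag(bag, word_vectors, idf):
    """Combine a bag of words into one unit vector, weighting words by tf-idf."""
    counts = Counter(word for word in bag if word in word_vectors)
    if not counts:
        return None  # none of the words has an embedding
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word, tf in counts.items():
        weight = tf * idf.get(word, 1.0)  # assumption: raw count times idf
        for i, x in enumerate(word_vectors[word]):
            total[i] += weight * x
    norm = math.sqrt(sum(x * x for x in total))
    return [x / norm for x in total] if norm > 0 else None

# Hypothetical two-dimensional word vectors and two use code bags.
word_vectors = {"commercial": [1.0, 0.0], "office": [0.0, 1.0], "retail": [0.6, 0.8]}
bags = [["commercial", "office"], ["commercial", "retail"]]
idf = idf_weights(bags)
v = embed_bag(bags[0], word_vectors, idf)
```

In this toy data "commercial" appears in both bags and "office" in only one, so "office" gets the larger weight and pulls the combined vector toward its own direction.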

Use codes are allowed to have multiple vector representations, since a use like “Commercial Administrative” can be the correct result for “law firm” as well as “advertising” or “software consulting” — words that aren’t necessarily close in the vector space. This is accomplished by giving each use code multiple “keyword sets” that append onto the name of the use itself. For the sake of simplicity, the rest of this article will treat use codes as the fundamental unit, rather than keyword sets.

Each use code is then represented as a tuple (j, v⃗), where j ∈ J is the jurisdiction of the use code and v⃗ is its vector representation. J is the set of all jurisdictions in OpenCounter, that is, clients such as San Diego and Salt Lake City. Thus a use may share the same position in vector space with a use in a different jurisdiction, but is assumed to be unique within its own jurisdiction.

Each jurisdiction has its own set of use codes Uⱼ, which partition the set of all use codes in OpenCounter U.

Here is a visual representation of U in two dimensions, created using t-SNE [2] and Bokeh [3]:

t-SNE representation of the vector space of uses

Each use code is a point, and each color corresponds to a top-level use category, such as “commercial”, “industrial”, or “residential”.

In the interactive version of the above plot, the user can move their cursor across the points to watch how the use codes change: https://sam.zhang.fyi/html/use-code-tsne.html. (Warning: 13MB html file)

Exploiting the use code graph

We use cosine distance as the distance metric D between use codes. We denote the user’s query vector by q⃗ and the user’s jurisdiction by t.

We find that searching for the closest match to the query within the target jurisdiction, that is, minimizing { D(q⃗, u) | u ∈ Uₜ }, performs poorly. The main reason is that a particular jurisdiction often simply doesn’t have detailed use codes. It is much harder to match the query “office” to “Commercial > General Administrative” than it is to “Commercial > General Administrative > Office”.
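For reference, the naive baseline can be sketched in a few lines. This is an illustration only: the dictionary layout for use codes, the jurisdiction labels "A" and "B", and the two-dimensional vectors are all hypothetical.

```python
import math

def cosine_distance(a, b):
    """Cosine distance D(a, b) = 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def naive_search(query, use_codes, target):
    """Return the use code in the target jurisdiction closest to the query."""
    in_target = [u for u in use_codes if u["jurisdiction"] == target]
    return min(in_target, key=lambda u: cosine_distance(query, u["vector"]))

# Hypothetical data: jurisdiction "A" has only coarse use codes.
use_codes = [
    {"jurisdiction": "A", "name": "Commercial > Administrative", "vector": [0.6, 0.8]},
    {"jurisdiction": "A", "name": "Industrial > Warehouse", "vector": [1.0, 0.0]},
    {"jurisdiction": "B", "name": "Commercial > General Administrative > Office",
     "vector": [0.0, 1.0]},
]
query = [0.0, 1.0]  # stand-in for the embedded query "office"
result = naive_search(query, use_codes, "A")
```

The baseline never looks at the detailed use code in jurisdiction "B", even though it matches the query far more closely than anything in "A".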

The quality increases when we rely on the underlying structure of the graph to allow use codes to influence each other across jurisdictions. The naive approach of minimizing { D(q⃗, u) | u ∈ Uₜ } can be viewed as minimizing the length of a path ∑ D(pᵢ, pᵢ₊₁), where p₀ = q⃗, pₙ ∈ Uₜ, and n = 1 (this is the great circle distance across the hypersphere between q⃗ and its nearest point in Uₜ). Instead, to make use of the prior structure within the graph, we minimize the length of each step: pᵢ₊₁ = argmin { D(pᵢ, u) | u ∈ U ∧ u ∉ { p₀, p₁, … , pᵢ } }. This opens up a potentially large search tree, but we find it sufficient to perform only one step.

In other words, for each u ∈ U and each jurisdiction j ∈ J other than u’s own, we save the precomputed use s(u, j) = argmin { D(u, uⱼ) | uⱼ ∈ Uⱼ }. Then n = 2, and p₂ = s(p₁, t). This creates a precomputed graph G of size |U| × (|J| − 1), where for every given use code, we find the closest use code in every other jurisdiction.
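A brute-force sketch of building G is shown below, assuming every use code carries an id, a jurisdiction, and a vector. The field names, the jurisdiction labels, and the toy vectors are hypothetical, and a production system would likely use something faster than this quadratic loop.

```python
import math

def cosine_distance(a, b):
    """Cosine distance D(a, b) = 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def build_graph(use_codes):
    """Map (use code id, other jurisdiction j) to the id of s(u, j)."""
    by_jurisdiction = {}
    for u in use_codes:
        by_jurisdiction.setdefault(u["jurisdiction"], []).append(u)
    graph = {}
    for u in use_codes:
        for j, candidates in by_jurisdiction.items():
            if j == u["jurisdiction"]:
                continue  # G only links to *other* jurisdictions
            nearest = min(candidates,
                          key=lambda c: cosine_distance(u["vector"], c["vector"]))
            graph[(u["id"], j)] = nearest["id"]
    return graph

# Hypothetical data: |U| = 4 use codes across |J| = 3 jurisdictions.
use_codes = [
    {"id": "a1", "jurisdiction": "A", "vector": [0.0, 1.0]},
    {"id": "a2", "jurisdiction": "A", "vector": [1.0, 0.0]},
    {"id": "b1", "jurisdiction": "B", "vector": [0.6, 0.8]},
    {"id": "c1", "jurisdiction": "C", "vector": [0.8, 0.6]},
]
graph = build_graph(use_codes)
```

With four use codes and three jurisdictions the graph holds 4 × (3 − 1) = 8 entries, matching the |U| × (|J| − 1) size given above.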

A hypothetical search

Suppose there is a city with a use code called “Commercial > Administrative”, and that use code is the correct result for “office”.

When the user types in “office”, we first find the top N (say, 20) matches across the entire system, in all jurisdictions. These would likely be a group of use codes with names similar to “Commercial > General Administrative > Office”.

Then for each of those candidates for p₁, we look up s(p₁, t) within G, and rank the results using a linear combination of D(q⃗, p₁) and D(p₁, p₂).

This is essentially a form of query expansion, where we rely on the prior knowledge of all of the use codes in OpenCounter to expand our query from q⃗ to p₁. This is why, as OpenCounter grows and more use codes are added to the system, the quality of the search will continue to improve.
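The hypothetical search can be sketched end to end as follows. Several details are assumptions rather than the article's stated method: the weight `alpha` in the linear combination, computing s(p₁, t) on the fly instead of reading it from G, letting a candidate that already belongs to the target jurisdiction map to itself, and keeping only the best score per result. The data is the same kind of toy two-dimensional stand-in used for illustration, not real use codes.

```python
import math

def cosine_distance(a, b):
    """Cosine distance D(a, b) = 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def two_step_search(query, use_codes, target, top_n=20, alpha=0.5):
    """Rank target-jurisdiction use codes reached through the global top-N."""
    in_target = [u for u in use_codes if u["jurisdiction"] == target]
    # Step 1: candidates p1 are the top-N matches across all jurisdictions.
    candidates = sorted(
        use_codes, key=lambda u: cosine_distance(query, u["vector"]))[:top_n]
    best = {}
    for p1 in candidates:
        # Step 2: p2 = s(p1, t), computed here instead of looked up in G.
        p2 = min(in_target,
                 key=lambda u: cosine_distance(p1["vector"], u["vector"]))
        # Assumed ranking: alpha * D(q, p1) + (1 - alpha) * D(p1, p2).
        score = (alpha * cosine_distance(query, p1["vector"])
                 + (1 - alpha) * cosine_distance(p1["vector"], p2["vector"]))
        if score < best.get(p2["name"], float("inf")):
            best[p2["name"]] = score
    return sorted(best.items(), key=lambda item: item[1])

use_codes = [
    {"jurisdiction": "A", "name": "Commercial > Administrative", "vector": [0.6, 0.8]},
    {"jurisdiction": "A", "name": "Industrial > Warehouse", "vector": [1.0, 0.0]},
    {"jurisdiction": "B", "name": "Commercial > General Administrative > Office",
     "vector": [0.0, 1.0]},
]
query = [0.0, 1.0]  # stand-in for the embedded query "office"
results = two_step_search(query, use_codes, target="A")
```

Here the detailed use code in jurisdiction "B" acts as the intermediate point p₁ that carries the query to “Commercial > Administrative” in jurisdiction "A".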

Productionization notes

This system was developed in Python with the gensim package [4], but to avoid maintaining a microservice in a different programming language from our existing Rails application, we migrated the word vectors into Postgres to use with ActiveRecord. We store the data using the Postgres “cube” extension [5], which provides us with a high-dimensional cube data structure as well as performant distance functions.

Footnotes

[1] Pennington, Jeffrey, Richard Socher, and Christopher Manning. “GloVe: Global vectors for word representation.” Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014.

[2] Maaten, Laurens van der, and Geoffrey Hinton. “Visualizing data using t-SNE.” Journal of Machine Learning Research 9.Nov (2008): 2579–2605.

[3] https://bokeh.pydata.org/en/latest/

[4] https://radimrehurek.com/gensim/

[5] https://www.postgresql.org/docs/9.5/static/cube.html