Using custom functions with SQLAlchemy and SQLite

I recently developed a Web-based translation memory (TM) application in Python. One thing the application does is fuzzy glossary matching: given a source sentence, it'll find all terms in the glossary that are fuzzy substrings of that sentence (using my fuzzy substring matching module, which is based on the Levenshtein distance algorithm), and return the terms along with their translations.
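For reference, the plain Levenshtein distance that the matching builds on can be computed with the classic dynamic-programming recurrence. This is only a pure-Python sketch of the standard algorithm, not the subdist module itself (which implements a fuzzy-substring variant in C):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    # prev[j] holds the distance between a[:i-1] and b[:j] (the previous row)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```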

Here's how I created a custom function for fuzzy glossary searches, using SQLAlchemy for the ORM, with SQLite as the database engine. Assuming you've got your SessionClass object, create a session, and get a connection object:

```python
import subdist

def make_gloss_func(haystack):
    """Creates a fuzzy substring matcher using haystack"""
    get_score = subdist.get_score
    def gloss_func(needle):
        return get_score(needle, haystack)
    return gloss_func

class TM(object):
    """Represents a translation memory (TM)/glossary"""
    # stuff skipped…
    def gloss_search(self, query, minscore):
        """Do a fuzzy glossary search."""
        try:
            session = self.SessionClass()
            # Create the custom function
            gloss_func = make_gloss_func(query)
            conn = session.bind.connect()
            conn.connection.create_function("gloss_score", 1, gloss_func)
            # Execute the query
            search_string = """SELECT * FROM records
                               WHERE gloss_score(source) >= :minscore"""
            return conn.execute(search_string, dict(minscore=minscore))
        finally:
            self.SessionClass.remove()
```
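The mechanism underneath is the standard library's `sqlite3.Connection.create_function`, which SQLAlchemy exposes via the raw DBAPI connection. Here's a self-contained sketch of the same pattern using `sqlite3` directly, with a toy scoring function (exact substring membership standing in for subdist's fuzzy score — the table and data are made up for illustration):

```python
import sqlite3

def make_score_func(haystack):
    # Toy stand-in for subdist.get_score: returns 1.0 if the needle
    # occurs verbatim in the haystack, 0.0 otherwise.
    def score(needle):
        return 1.0 if needle in haystack else 0.0
    return score

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (source TEXT, target TEXT)")
conn.executemany("INSERT INTO records VALUES (?, ?)",
                 [("translation memory", "Translation Memory"),
                  ("fuzzy match", "unscharfer Treffer")])

# Register the closure as a one-argument SQL function named gloss_score
conn.create_function("gloss_score", 1,
                     make_score_func("a fuzzy match was found"))

rows = conn.execute(
    "SELECT source FROM records WHERE gloss_score(source) >= ?", (0.5,)
).fetchall()
print(rows)  # → [('fuzzy match',)]
```

Because the registered callable is a closure over the query sentence, a fresh function has to be created (and re-registered) for each search, which is exactly what `make_gloss_func` does above.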

Speedup

Just for fun, I compared the speed of (1) using custom functions in SQLite with (2) keeping the records as a Python array, and getting the matches in pure Python using a list comprehension. I found that the SQLAlchemy version is about 8 times faster. In the test, I created a glossary of 44,732 records using random word pairs, and got the fuzzy substrings for a query sentence.

    version         time
    native Python   0.7837 s
    SQLAlchemy      0.0966 s
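The pure-Python baseline in the comparison was, in essence, a list comprehension filtering an in-memory list of records. A minimal sketch of that approach, with a hypothetical exact-substring scorer standing in for subdist's fuzzy score:

```python
def gloss_search_native(records, query, minscore, score=None):
    """Filter (source, target) pairs whose source scores at least minscore
    against the query sentence. The default scorer is a toy stand-in."""
    if score is None:
        score = lambda needle, haystack: 1.0 if needle in haystack else 0.0
    return [(src, tgt) for src, tgt in records if score(src, query) >= minscore]

records = [("fuzzy match", "unscharfer Treffer"),
           ("spreadsheet", "Tabellenkalkulation")]
print(gloss_search_native(records, "a fuzzy match was found", 0.5))
# → [('fuzzy match', 'unscharfer Treffer')]
```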

Since both the fuzzy-matching code and the database engine are written in C, the SQLAlchemy version is probably running at near-C speed, with the only slowdown being the overhead of calling into them from Python (which is minimal, since most of the work happens inside the C code).

More importantly, the SQLAlchemy version easily meets my performance target of a 50,000-record search in 0.25 seconds, while the native Python version falls pretty far short.

Also interestingly, I found that psyco didn't speed up either version at all; in fact, it made both slightly slower. Another demonstration that you should profile rather than reaching for psyco as a panacea.

Here's the code used for the test.