Using the Prospective Search API on App Engine for instant traffic analysis

Posted by Nick Johnson | Filed under python, channels, app-engine, prospective-search

One of the really interesting new APIs released as part of App Engine recently is the Prospective Search API. Prospective search inverts the usual search paradigm, where you have a database of documents, and search queries match on those documents. In Prospective Search, you instead have a list of persistent search queries, and as new documents are created or updated, you match them against the queries. Twitter's live search interface is a good example of Prospective Search in action.

Today, in the first of a two post series, we'll be trying out the Prospective Search API with a sample application, Clio. Clio, named after the muse of history, is designed to give administrators insight into the actual live traffic being served by their app. With it, you can see user request logs as they occur, and apply filters so you only see the hits that interest you - invaluable on a heavily trafficked site. Mystified where people are getting to that 404 page from, and don't want to wait 12 hours for the analytics? Clio can help.

In this post, we'll go over the details of how to use the Prospective Search API to construct the server-side, query matching component of this project. In the next post, we'll show how to use the Channel API to deliver results to the client, a web interface administrators can use to view the data. Since even a straightforward implementation of a tool like this will be fairly complex, we'll be leaving out any features that don't contribute to demonstrating the Prospective Search and Channel APIs - but I'll always note as much when doing so.

The core of the Prospective Search API is a set of 3 functions: subscribe, which establishes a persistent subscription, unsubscribe, which removes a subscription, and match, which takes a document and matches it against all the searches registered for it. There are a few key concepts central to these functions, which it helps to understand before getting started:

A document class is a type of document. Documents in a class share a set of properties and value types in common. The Prospective Search API uses Datastore model classes as document classes, so every document is an instance of a db.Model subclass. It's important to note here that just because documents are datastore model instances, this doesn't mean that they'll be stored in the datastore - it's just a convenient way to represent and encode documents. Prospective Search supports a limited subset of value types for documents, listed here.

A topic defines a unique name for a set of alike documents. This is set to the name of the document class by default, so you usually won't have to worry about this. Searches are always specific to a topic - queries against one topic won't be matched by a document posted with a different topic.

A subscription ID uniquely identifies a subscription. Subscription IDs are user-specified - they can be anything you want that uniquely identifies the subscription.

Defining our document

The first thing we need to do is define a document class that will encapsulate our request records for the Prospective Search API. This is simply done with a model definition like so:

class RequestRecord(db.Model):
  """Encapsulates information for a request log record."""
  method = db.StringProperty(required=True)
  path = db.StringProperty(required=True)
  request_headers = db.StringListProperty(required=True)
  status_code = db.IntegerProperty(required=True)
  status_text = db.StringProperty(required=True)
  response_headers = db.StringListProperty(required=True)
  wall_time = db.IntegerProperty(required=True)
  cpu_time = db.IntegerProperty(required=True)
  random = db.FloatProperty(required=True)

Our model will capture most of the important properties of a request and our response to it: the method (e.g. GET or POST), the path (e.g. '/foo/bar'), all the request headers sent by the client, the status code and message we returned, all the response headers we returned, and the wallclock and CPU time taken. We also include an additional property, random, which will be set to a random number between 0 and 1 - we'll cover exactly why this is useful later.

Here's a place where a real production system would probably do more: It would be useful to record debug logs here in some form, as well as integrating with appstats if it's present so admins can easily find the appstats page for a given request.

Since we'll eventually be sending request records to browsers, let's define a simple method to convert it to a dict suitable for JSON encoding:

def to_json(self):
  """Returns a dict containing the relevant information from this record.

  Note that the return value is not a JSON string, but rather a dict
  that can be passed to a JSON library for encoding."""
  return dict((k, v.__get__(self, self.__class__))
              for k, v in self.properties().iteritems())
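The `v.__get__(self, self.__class__)` expression is the descriptor protocol at work: `properties()` returns the property *descriptors* defined on the class, and calling `__get__` on each one pulls the actual value off the instance. Here's a standalone illustration of the same pattern in plain modern Python (no App Engine; `Record` and its `properties()` method are stand-ins mimicking the db.Model API):

```python
# Standalone illustration of the to_json() pattern: iterate over a
# class's property descriptors and fetch each value off the instance
# via the descriptor protocol.

class Record(object):
    def __init__(self, method, path):
        self._method = method
        self._path = path

    method = property(lambda self: self._method)
    path = property(lambda self: self._path)

    @classmethod
    def properties(cls):
        # Mimics db.Model.properties(): maps name -> descriptor object.
        return {k: v for k, v in vars(cls).items()
                if isinstance(v, property)}

    def to_json(self):
        return dict((k, v.__get__(self, self.__class__))
                    for k, v in self.properties().items())


record = Record('GET', '/foo/bar')
print(record.to_json())  # -> {'method': 'GET', 'path': '/foo/bar'}
```

(The real code uses `iteritems()` because App Engine's runtime is Python 2; the mechanism is identical.)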

Recording requests

The next step in writing Clio is recording the relevant information about each request in a RequestRecord instance and passing it to the matcher API. For this, we'll use WSGI middleware. Since working with the WSGI environment directly is a bit awkward, we'll use webob's Request and Response objects to make them easier to deal with. Here's how we do it:

class LoggingMiddleware(object):
  def __init__(self, application):
    self.application = application

  def __call__(self, environ, start_response):
    # Don't record if the request is to clio itself, or the config says no.
    if (environ['PATH_INFO'] == config.QUEUE_URL
        or environ['PATH_INFO'].startswith(config.BASE_URL)
        or not config.should_record(environ)):
      return self.application(environ, start_response)

    request = webob.Request(environ)
    start_time = time.time()
    response = request.get_response(self.application)
    elapsed = int((time.time() - start_time) * 1000)
    status_code, status_text = response.status.split(' ', 1)

Notice that the first thing our middleware does is check if the current request is to Clio itself, and skip doing anything if it is. This is important, or you can easily end up with an endless loop of tasks reporting on each other ad infinitum! Next, we construct a WebOb request object from the environment, and use its get_response method to call the original WSGI app. We use a simple timer to keep track of how long all this took. Next, we construct a RequestRecord out of all the data we've collected and pass it to the Prospective Search API:

record = model.RequestRecord(
    method=request.method,
    path=request.path_qs,
    request_headers=_stringifyHeaders(request.headers),
    status_code=int(status_code),
    status_text=status_text,
    response_headers=_stringifyHeaders(response.headers),
    wall_time=elapsed,
    cpu_time=quota.get_request_cpu_usage(),
    random=random.random())

prospective_search.match(
    record,
    result_relative_url=config.QUEUE_URL,
    result_task_queue=config.QUEUE_NAME)

As we observed above, just because our document class is a db.Model subclass doesn't mean we have to store it in the datastore, and we don't - we just pass it to the Prospective Search API. The Prospective Search API doesn't return a list of matching searches directly - instead, it adds tasks to the task queue, so we provide it with two additional parameters: the URL of the task handler we want it to call, and the name of the queue to put tasks in.

Finally, we return the response from our middleware. Webob makes this easy by allowing us to call the Response object as a WSGI app itself:

return response(environ, start_response)

Using our middleware in a webapp follows the standard pattern established by libraries like appstats, by specifying it in appengine_config.py, like this:

def webapp_add_wsgi_middleware(app):
  from clio import middleware
  return middleware.LoggingMiddleware(app)

Registering queries

The next part of the puzzle is how we register queries against the API. This is pretty straightforward, but we'll need a way to keep track of the mapping between subscriptions and clients. We'll do this with a Subscription model:

class Subscription(db.Model):
  """Provides information on a client subscription to a filtered log feed."""
  client_id = db.StringProperty(required=True)
  created = db.DateTimeProperty(required=True, auto_now_add=True)

The only important piece of data we track here is the client ID, which is used by the Channel API to uniquely identify a connected client. We can't use this directly as the subscription key, because a client may subscribe to multiple feeds, so instead we'll use the key of the Subscription model for that. A more robust implementation would separate the user from their client ID here: channels expire, but a user may want a subscription to outlive any single channel. Since that's not relevant to our demonstration of the Prospective Search API, though, we'll leave it as an exercise for the reader.

Here's the handler that creates new subscriptions:

class SubscribeHandler(webapp.RequestHandler):
  """Handle subscription requests from clients."""

  def post(self):
    sub = model.Subscription(client_id=self.request.POST['client_id'])
    sub.put()
    prospective_search.subscribe(
        model.RequestRecord,
        self.request.POST['query'],
        str(sub.key()),
        lease_duration_sec=config.SUBSCRIPTION_TIMEOUT.seconds)
    self.response.out.write(str(sub.key()))

Our handler is passed a client ID and a search term by the client. It then constructs a new subscription record with the client ID, stores it to the datastore, and uses the newly created entity's key as the subscription ID. The prospective_search.subscribe method takes, in order, the document class we want to listen for matches on, the query - specified in a simple textual query language documented here - the subscription ID, and how long to subscribe for. Omitting this last argument will create a subscription that lasts until cancelled, but since we know our channels have a limited lifetime, we may as well specify that here.

Finally, we return the key of the Subscription object to the client, so it can use it to uniquely identify this subscription.

Handling results

The final step in the chain is handling results returned by the Prospective Search API. As I mentioned previously, results aren't returned by the match call, but are instead inserted onto the task queue. Since there may be many subscriptions matching a given document, the API will enqueue a single task with a number of matching subscriptions; it supplies the matching document along with a list of subscription IDs that matched it. Here's how we extract this data:

class MatchHandler(webapp.RequestHandler):
  """Process matching log entries and send them to clients."""

  def post(self):
    # Fetch the log record
    record = prospective_search.get_document(self.request)
    record_data = record.to_json()

    # Fetch the set of subscribers to send this record to
    subscriber_keys = map(db.Key, self.request.get_all('id'))
    subscribers = db.get(subscriber_keys)

First off, we get the matched document. The Prospective Search API provides a method, get_document, to extract this from the request object for us. Since we're going to be sending it to clients, we use the method we defined previously to convert it to a JSON-encodable dict. The list of subscription IDs is supplied as POST parameters with the name 'id'. Since our subscription IDs are datastore keys, we construct key objects out of them and retrieve the relevant subscriptions, so we know the client IDs to send the results to. Finally, we can iterate over the returned subscription entities, sending the message to each:

for subscriber_key, subscriber in zip(subscriber_keys, subscribers):
  # If the subscription has been deleted from the datastore, delete it
  # from the matcher API as well.
  if not subscriber:
    logging.error("Subscription %s deleted!", subscriber_key)
    prospective_search.unsubscribe(model.RequestRecord, str(subscriber_key))
  else:
    data = simplejson.dumps({
        'subscription_key': str(subscriber_key),
        'data': record_data,
    })
    channel.send_message(subscriber.client_id, data)

This should be fairly self-explanatory. One subtlety we haven't taken care of here is that it's possible one document could match multiple subscriptions held by the same client; a more productionized implementation would coalesce these into a single message, rather than sending the same document multiple times over the same channel.
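As a sketch of that coalescing step, the loop above could group the matched subscription keys by client ID first, then send each client one message listing all of its matched subscriptions. This is a hypothetical refinement, not part of Clio; the helper below is plain Python, and in the real handler `channel.send_message` would be called once per entry in the returned dict:

```python
# Hypothetical coalescing helper: group matched subscription keys by
# client ID so each client gets a single channel message per document.

import json
from collections import defaultdict


def coalesce(subscriptions, record_data):
    """subscriptions: list of (subscription_key, client_id) pairs.

    Returns a dict mapping client_id -> one JSON message covering all
    of that client's matched subscriptions."""
    by_client = defaultdict(list)
    for sub_key, client_id in subscriptions:
        by_client[client_id].append(sub_key)
    return {
        client_id: json.dumps({'subscription_keys': keys,
                               'data': record_data})
        for client_id, keys in by_client.items()
    }


messages = coalesce([('sub1', 'alice'), ('sub2', 'alice'), ('sub3', 'bob')],
                    {'path': '/foo'})
# 'alice' receives one message covering sub1 and sub2; 'bob' one for sub3.
```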

Conclusion

That's it for today. You've learned how to use the Prospective Search API to register persistent queries against a document stream; how to feed documents into that stream, and how to deal with the matches that result. In the next post, we'll demonstrate using the Channel API to send these results back to users in real-time, completing our system.

The complete source for Clio is available online, here, along with a simple demo app so you can test it out.

Got interesting ideas for what to do with the Prospective Search API? Let us know in the comments!
