A uTouch architecture introduction

As the Linux desktop increases in popularity, the user interface experience has become increasingly important. For example, most laptops today have multitouch capabilities that have yet to be fully exposed and exploited in the free software ecosystem. Soon we will be carrying around multitouch tablets with a traditional Linux desktop or similar foundation. In order to provide a high-quality and rich experience we must fully exploit multitouch gestures. The uTouch stack developed by Canonical aims to provide a foundation for gestures on the Linux desktop.

uTouch capabilities

The new X.org multitouch features allow for multitouch support in applications. We now have a software stack, uTouch, built on top of this multitouch support that can provide for practically any gesture scenario imaginable.

A "gesture" is normally thought of as a two-dimensional movement made by the user on some sort of input device—a two-finger pinch, for example, or a three-finger downward drag. Teaching a computer to recognize these movements requires a lower-level description, though; in uTouch, this description consists of values like the number of touches, movement thresholds, and timeout values. An application may register a "gesture subscription" describing a specific gesture and be notified when that gesture is recognized by the uTouch subsystem. Those notifications take the form of a sequence of events describing the gesture motion over time.

Key to understanding how uTouch works is knowledge of all the typical gesture use cases. First, we have the concept of gesture primitives: drag, pinch (including both "pinch" and "spread"), rotate, and tap. These primitives make up the foundation of all intuitive gestures. They can be strung together as needed for more complex gestures, such as a double tap. Stroke gestures, such as drawing an ‘M’ to open the mail client, may be recognized as a specific long gesture sequence, or as a sequence of drag gestures. Note, however, that uTouch does not have stroke gesture detection facilities built-in.

Second, there are two fundamental object interaction types: single motion, single interpretation gestures and direct object manipulation. The former involves gestures like a two-touch swipe to go backward and forward through browser history, while the latter involves gestures like a three-touch drag to move an application window around the desktop.

The single motion, single interpretation gestures require thresholds and/or timeouts. For example, the colloquially implied difference between a swipe and a drag is that a swipe must be a quick motion in a given direction, whereas a drag may be any motion that results in a displacement in space. To put it in uTouch gesture subscription terms, a swipe is a drag primitive gesture with a displacement threshold that must be crossed within a specific amount of time. For example, a two-touch swipe for browser history navigation may be implemented with a threshold of 100 pixels that must be crossed within half a second. In contrast, direct object manipulation usually implies a zero threshold. For example, as soon as three touches begin on a window, the window should be movable.
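The threshold-and-timeout distinction described above can be sketched in a few lines of Python. This is an illustration only, not the uTouch API: the Sample type and the function names are hypothetical, and only the 100-pixel and half-second values come from the browser history example.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Sample:
    """Centroid of the touches at one moment (hypothetical type)."""
    t: float  # seconds since the touches began
    x: float  # pixels
    y: float  # pixels


def crossed_threshold(samples, threshold_px, timeout_s):
    """True if the displacement from the first sample reaches
    threshold_px no later than timeout_s after the first sample."""
    if not samples:
        return False
    start = samples[0]
    for s in samples:
        if s.t - start.t > timeout_s:
            return False  # the timeout expired before the threshold was met
        if hypot(s.x - start.x, s.y - start.y) >= threshold_px:
            return True
    return False


def is_swipe(samples):
    # Values from the browser history example: 100 pixels in half a second.
    return crossed_threshold(samples, threshold_px=100, timeout_s=0.5)


def is_direct_manipulation(samples):
    # Zero threshold: recognized as soon as the touches begin.
    return crossed_threshold(samples, threshold_px=0, timeout_s=float("inf"))
```

With these definitions, a 120-pixel motion completed in 0.3 seconds counts as a swipe, the same motion spread over 0.8 seconds does not, and a direct manipulation is recognized from the very first sample.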

Most simple gesture interactions may be handled through gesture subscriptions consisting of the required gesture primitives and the object interaction types. However, there are times when an application needs to have further control over gesture recognition. For example, a bezel drag gesture occurs when the user begins a drag from the bezel of the screen and moves inward. This gesture must be distinguished from the user initiating a touch at the edge of the screen. The problem is that the bezel drag and the direct touch near the edge of the screen are indistinguishable at the beginning of the gesture. The distinguishing aspect is that the bezel drag is perpendicular to the bezel and has a non-zero initial velocity as seen by the touchscreen, whereas the direct touch near the edge of the screen will likely not have an initial velocity and/or may not be moving perpendicular to the bezel. To cater for a client that cares about one of these gestures but not the other, uTouch requires the client to accept or reject every gesture. When a gesture is rejected, the touches may be replayed to the X server, which allows for the mixing of gestures and raw multitouch in the same application.
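As a rough illustration of the kind of client-side check that could drive the accept or reject decision for a bezel drag, consider the following Python sketch. Everything in it is an assumption made for illustration: the function name, the speed and alignment cutoffs, and the idea that the client has an initial velocity estimate at hand. uTouch itself only requires that the client accept or reject the gesture.

```python
from math import hypot


def looks_like_bezel_drag(vx, vy, inward, min_speed=50.0, min_alignment=0.9):
    """Guess whether a touch that began at the screen edge is a bezel drag.

    vx, vy        -- initial velocity in pixels per second
    inward        -- unit vector pointing from the bezel into the screen
    min_speed     -- hypothetical cutoff for "non-zero initial velocity"
    min_alignment -- hypothetical cosine cutoff for "perpendicular to the bezel"
    """
    speed = hypot(vx, vy)
    if speed < min_speed:
        return False  # (nearly) stationary: an ordinary touch near the edge
    # Cosine of the angle between the velocity and the inward direction.
    alignment = (vx * inward[0] + vy * inward[1]) / speed
    return alignment >= min_alignment


LEFT_BEZEL_INWARD = (1.0, 0.0)  # moving right means moving into the screen
```

A client interested only in bezel drags would accept the gesture when this check passes and reject it otherwise, letting the touches be replayed.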

Another facet of uTouch, as hinted above, is that, by default, it operates through "touch grab" semantics. When used on top of X.org, uTouch gestures are recognized from touches received through touch grabs. One benefit of this approach is the ability to mix gestures and raw multitouch in the same application. However, it also allows for priority handling of gestures. For example, system gestures may be handled by a client listening to touches through a grab on the root window. When gestures are not recognized or are rejected by the uTouch client, the touches are replayed to the next touch grab or selecting client. Thus, global gestures, application gestures, and raw multitouch events are all possible when using uTouch.
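The priority handling described above amounts to offering a touch sequence to each client in turn until one accepts it. The Python sketch below models only that ordering; the client names, the tuple representation of a touch sequence, and the deliver() function are hypothetical and do not correspond to any X.org or uTouch call.

```python
def deliver(sequence, grab_chain):
    """Offer a touch sequence to each client in priority order.

    grab_chain is a list of (name, wants) pairs, highest priority first;
    wants(sequence) returns True to accept and False to reject.
    Returns the name of the accepting client, or None if all reject.
    """
    for name, wants in grab_chain:
        if wants(sequence):
            return name
        # Rejected: the touches are replayed to the next client in the chain.
    return None


# A touch sequence is reduced here to (what it looks like, number of touches).
grab_chain = [
    ("window manager", lambda s: s == ("swipe", 3)),  # grab on the root window
    ("application", lambda s: s == ("pinch", 2)),     # application gesture
    ("raw touch client", lambda s: True),             # selects raw touch events
]
```

In this model a three-touch swipe stops at the window manager, a two-touch pinch reaches the application, and anything else falls through to the client selecting raw touch events.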

The last major feature of uTouch is the ability to recognize multiple simultaneous gestures in the same area. For example, imagine a game where the user pinches bugs on the screen to squash them. The screen is one large gesture input area, but the user may use both hands to pinch bugs. In order to facilitate this interaction mode, whenever new touches begin within the gesture area they are combinatorially matched with other touches that begin within a "glue" time period. In our game example there is a two-touch pinch gesture subscription. If four touches begin in the game area within the glue time period, six combinations of potential gestures will be matched. As touch events are delivered, the state of each matched gesture will be updated and then checked against the threshold and timeout for the gesture subscription. If a gesture meets the threshold and timeout criteria, it will be delivered to the client. The client can then attempt to match up the touches of the gesture against its context to determine whether to accept or reject each gesture. In the example below, there will be four pinch gestures sent to the client:

[Figure: four touches, labeled A through D, on a game screen with bugs. Bug icons licensed under LGPL.]

There will be potential pinch gestures for: AB, CD, AD, and BC (AC and BD, by virtue of moving in the same direction, are not considered to be potential pinches). The application must determine which gestures make sense. One method would be to hit test the initial centroid of each gesture against the bugs on the screen. All gestures that hit a bug are accepted. Note that uTouch automatically rejects overlapping gestures, so as soon as AB and CD are accepted, AD and BC will be implicitly rejected.
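The combinatorial matching and the hit-test strategy can be sketched in Python as follows. This is a simplified model under stated assumptions: the touch coordinates, bug positions, and hit radius are invented, the direction filtering that removes AC and BD is not modeled, and the implicit rejection of overlapping gestures (done by uTouch itself) is imitated inside the hypothetical accept_hits() function.

```python
from itertools import combinations
from math import hypot


def centroid(points):
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)


def candidate_gestures(touches, size=2):
    """All combinations of `size` touches that began within the glue time."""
    return [frozenset(c) for c in combinations(sorted(touches), size)]


def accept_hits(candidates, touches, bugs, hit_radius=15.0):
    """Accept each gesture whose initial centroid lies on a bug.

    Gestures sharing a touch with an already accepted gesture are skipped,
    imitating the implicit rejection of overlapping gestures.
    """
    accepted = []
    for gesture in candidates:
        if any(gesture & other for other in accepted):
            continue  # overlaps an accepted gesture: implicitly rejected
        cx, cy = centroid([touches[t] for t in gesture])
        if any(hypot(cx - bx, cy - by) <= hit_radius for bx, by in bugs):
            accepted.append(gesture)
    return accepted


# Hypothetical layout: A and B straddle one bug, C and D straddle another.
touches = {"A": (0, 0), "B": (20, 0), "C": (100, 0), "D": (120, 0)}
bugs = [(10, 0), (110, 0)]
```

Four touches yield six two-touch combinations, and with this layout only AB and CD have a centroid that lands on a bug, so those two gestures are the ones accepted.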

There is a twist to this complex logic, however. Gesture events are received serially, so the client may need to know whether more gestures are possible for a set of touches. For example, if both one-touch and two-touch drag gestures are subscribed, a two-touch drag will cause two one-touch drag gestures and one two-touch drag gesture. If the uTouch client receives a one-touch drag first, it may not realize that a two-touch drag is coming for the same touch as well. To handle this issue, a gesture property is provided to denote that gesture construction has finished for all of the gesture's touches. When a gesture has finished construction, the client knows that it has received all possible gestures containing the same touches. Thus, in the one- and two-touch drag example, the one-touch gestures will not be flagged as having finished construction until at least the two-touch gesture begin event has been sent to the client.
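One way a client might use the construction-finished property is to hold off on any decision until the flag is set, and only then compare the gestures that share touches. The Python sketch below assumes a hypothetical Chooser class and a client policy of preferring the gesture with the most touches; neither is part of uTouch, which only supplies the flag.

```python
class Chooser:
    """Defers the accept decision until gesture construction has finished."""

    def __init__(self):
        self.seen = {}  # gesture id -> frozenset of touch ids

    def on_event(self, gesture_id, touches, construction_finished):
        """Return the id of the gesture to accept, or None to keep waiting."""
        self.seen[gesture_id] = frozenset(touches)
        if not construction_finished:
            return None  # more gestures containing these touches may arrive
        mine = self.seen[gesture_id]
        # Every gesture sharing a touch with this one has now been received.
        rivals = {gid: t for gid, t in self.seen.items() if t & mine}
        # Hypothetical client policy: prefer the gesture with the most touches.
        return max(rivals, key=lambda gid: len(rivals[gid]))
```

In the one- and two-touch drag example, the client would see both one-touch gestures and then the two-touch begin event with the flag unset, and would pick the two-touch gesture once a later event arrives with the flag set.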

The uTouch stack was designed to be flexible and provide for all possible gesture use cases. However, it is recognized that not all clients will care about multiple simultaneous gestures. There are plans to create a gesture subscription option that precludes the ability to have multiple simultaneous gestures. This will effectively push some policy into the recognizer, such as a preference for gestures with more touches. This will be particularly useful when subscribing to gestures on an indirect device, like a touchpad, where multiple simultaneous gestures are likely not wanted.

Lastly, uTouch is a complete gesture stack that surpasses the functionality of all available consumer platforms. uTouch works well with both touchscreens and touchpads, and supports both gestures and raw touch events in the same window or region of an application. In contrast, Windows only supports touchscreens and either gestures or raw touch events, but not both, in a given window. OS X supports touchpads but not touchscreens. Mobile platforms are limited to touchscreen support and to gestures in a single application at a time due to their modal task design. In contrast to each of these platforms, uTouch has been designed from the ground up to support all device types and all known use cases, including multiple applications and windows at the same time.

The technical architecture of uTouch

uTouch consists primarily of three components: uTouch-Frame, uTouch-Grail, and uTouch-Geis. Each of these will be described briefly below.

uTouch-Frame groups touches into units that are easier for uTouch-Grail to operate on. Gestures are recognized per-device and per-window, so touches are grouped into units representing pairs of devices and windows. This is also where all backends for each window system are implemented. uTouch-Frame events are platform independent.

Some window systems, like X11, also have the concept of touch sequence acceptance and rejection. This functionality is provided through uTouch-Frame as well.

Touch sequence acceptance and rejection is a core aspect of the uTouch stack when used for system-level gestures. Imagine that a finger-painting application listening for raw touch events (not gestures) is open in a desktop environment where three-touch swipes are used to switch between applications. When the user performs such a swipe, uTouch accepts the touch sequences on behalf of the window manager and switches applications. This prevents the painting application from handling (or even seeing) the touches. In contrast, when the user performs a three-touch tap, uTouch rejects the touch sequences because they do not match a known gesture. The painting application then receives the rejected touch sequences.

uTouch-Grail is the gesture recognizer of the uTouch project. It takes the per-device, per-window touch frames from uTouch-Frame and analyzes them for potential gestures.

Grail events are generated by frame events. Rather than duplicating the uTouch-Frame data, grail events contain gesture data and a reference to the frame event that generated them. This allows uTouch clients to see the full touch data comprising a gesture.

Grail gesture events consist of a set of touches, a uniform set of gesture properties, and a list of recognized gesture primitives. Again, the supported primitives are: drag, pinch, rotate, and tap. The gesture properties are:

- Gesture ID
- Gesture state (begin, update, end)
- A list of touch IDs for the touches comprising the gesture
- The uTouch-Frame event that generated the Grail event
- The original and current centroid position of the touches
- The original and current average radius, or distance from the centroid, of the touches
- A best-fit 2D affine transformation of the touches from their original positions
- A best-fit 2D affine transformation of the touches from their previous positions
- A flag denoting whether the gesture construction has finished

Drag, pinch, and rotate properties are encapsulated by the affine transformations. For more detail on how to use 2D affine transformations, please see this excellent Wikipedia article on transformation matrices.
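To illustrate how drag, pinch, and rotate values can be read out of a 2D affine transformation, here is a small Python sketch. It assumes a 2x3 row-major matrix containing only uniform scaling, rotation, and translation; the actual matrix layout and conventions used by uTouch-Grail are not described in this article, so treat the representation as hypothetical. Note that the translation part equals the drag displacement only when the transformation is expressed relative to the original centroid.

```python
from math import atan2, cos, hypot, isclose, pi, sin


def decompose(m):
    """Split a 2x3 affine matrix ((a, b, tx), (c, d, ty)) into gesture values.

    Assumes the matrix has the form
        [ s*cos(r)  -s*sin(r)  tx ]
        [ s*sin(r)   s*cos(r)  ty ]
    """
    (a, b, tx), (c, d, ty) = m
    return {
        "translation": (tx, ty),  # drag
        "scale": hypot(a, c),     # pinch: below 1 is a pinch, above 1 a spread
        "angle": atan2(c, a),     # rotate, in radians
    }


def make_transform(scale, angle, tx, ty):
    """Build a matrix of the assumed form, for trying out decompose()."""
    return (
        (scale * cos(angle), -scale * sin(angle), tx),
        (scale * sin(angle), scale * cos(angle), ty),
    )
```

Passing make_transform(1.5, pi / 6, 10, -4) through decompose() recovers a scale of 1.5, a rotation of one twelfth of a revolution, and a translation of (10, -4), up to floating-point rounding.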

During operation, a pool of recently-begun touches is maintained. In the current implementation this pool includes any touches that have begun within the past 60 milliseconds of "glue" time. When a new touch begins, it is combined in all possible combinations with touches in this pool in order to create potential gestures matching any active subscriptions.
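The pool of recently begun touches might be modeled as in the Python sketch below. Only the 60-millisecond glue time comes from the description above; the TouchPool class, its method names, and the way subscription touch counts are passed in are assumptions made for illustration.

```python
from itertools import combinations

GLUE_TIME = 0.060  # seconds, the value used by the current implementation


class TouchPool:
    """Tracks touches that began within the glue time."""

    def __init__(self, glue_time=GLUE_TIME):
        self.glue_time = glue_time
        self.recent = {}  # touch id -> time the touch began

    def touch_begin(self, touch_id, now, sizes=(2,)):
        """Register a new touch and return the potential gestures it creates.

        sizes lists the touch counts of the active subscriptions, for
        example (2,) for a single two-touch pinch subscription.
        """
        # Drop touches that began too long ago to be glued to this one.
        self.recent = {t: t0 for t, t0 in self.recent.items()
                       if now - t0 <= self.glue_time}
        others = sorted(self.recent)
        gestures = [frozenset(combo + (touch_id,))
                    for size in sizes
                    for combo in combinations(others, size - 1)]
        self.recent[touch_id] = now
        return gestures
```

Feeding four touches into the pool 10 milliseconds apart with a two-touch subscription produces zero, one, two, and three potential gestures in turn, for the total of six combinations mentioned earlier. A fifth touch arriving well after the glue time produces none.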

A new gesture instance is created for each combination of touches. Each instance has an event queue, and new instances have one begin event describing the original state of the touches. The events are queued until any gesture primitive is recognized. When frame events are processed, any changes to touches in a gesture instance generate a new grail event. The new touch state is analyzed, and subscription thresholds and timeouts are analyzed to determine if any of the subscription gesture primitives have been recognized. For example, the default rotate threshold is 1/50th of a revolution, and the default rotate timeout is one half second. If the threshold is met before the timeout expires, the rotate gesture primitive is recognized.

When a gesture primitive has been recognized, the grail event queue is flushed to the client. The client must process the gesture events and make a decision on whether to accept or reject each gesture.
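The queue-then-flush behavior can be illustrated with a small Python model of a single gesture instance subscribed to rotation. The default threshold and timeout are taken from the text above; the GestureInstance class, the tuple event format, and the use of radians are assumptions. The sketch also does not model what happens to an instance whose timeout expires, since that is not covered here.

```python
from math import pi

ROTATE_THRESHOLD = 2 * pi / 50  # 1/50th of a revolution, in radians
ROTATE_TIMEOUT = 0.5            # seconds


class GestureInstance:
    """Queues events until the rotate primitive is recognized."""

    def __init__(self, begin_time):
        self.begin_time = begin_time
        self.queue = [("begin", begin_time, 0.0)]
        self.recognized = False

    def update(self, now, angle):
        """Process a touch change; return the events delivered to the client."""
        event = ("update", now, angle)
        if self.recognized:
            return [event]  # already recognized: deliver immediately
        self.queue.append(event)
        in_time = now - self.begin_time <= ROTATE_TIMEOUT
        if in_time and abs(angle) >= ROTATE_THRESHOLD:
            self.recognized = True
            flushed, self.queue = self.queue, []
            return flushed  # flush everything queued so far, begin event first
        return []
```

Until the rotation passes the threshold, update() returns nothing. The call that crosses the threshold in time returns the begin event together with every queued update, and later calls return their events one at a time.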

uTouch-Geis is the C API layer for the uTouch implementation. uTouch originally began as a private X.org server extension. It has since been moved out of the X.org server and into the client side of the X11 system, which required a complete rewrite of uTouch-Frame and uTouch-Grail. However, we have managed to maintain API and ABI compatibility through uTouch-Geis, albeit with a few behavioral changes. uTouch-Geis has two API versions: version 1, a simpler interface, and version 2, an advanced interface. Although both are currently supported, the first version is deprecated in favor of the more flexible second version.

uTouch-Geis also makes gesture event control simpler by wrapping much of the X.org interaction behind an event loop abstraction. The uTouch stack requires careful management of touch grabs and timers. Any client may use uTouch-Frame and uTouch-Grail directly, but uTouch-Geis vastly simplifies incorporating gestures into an application. See the uTouch-Geis API documentation for more information.

Toolkit and application development

uTouch-Geis is nice, but its C API is still a bit cumbersome in certain scenarios. The uTouch team has created a QML plugin called uTouch-QML in order to make gesture integration in QML applications easier. This plugin provides native QML elements for subscribing to and handling gestures. It currently uses a legacy gesture handling system in the uTouch stack that does not provide for gesture accept/reject semantics or simultaneous gestures, but we plan to update it to include those features over the next six months.

We have also begun work on a gesture recognition system for the Chromium web browser. There are many potential gesture interactions that we hope to leverage in the browser. An initial implementation was proposed, but a rearchitecture of the gesture plumbing in Chromium required us to refactor it. We hope to merge an implementation into Chromium in the next few months.

Conclusion

Over the past two years the uTouch team has been working hard to bring multitouch gestures to the Linux desktop. We now have a complete stack that rivals, and in many ways surpasses, what is possible on other platforms. We look forward to further integration of uTouch gestures in desktop environments and applications, and we encourage everyone to take a look at what our stack has to offer.

