When Microsoft introduced the Cortana digital personal assistant at last year's Build developer conference, the company was already dropping hints about its future ambitions for the technology. Cortana was built largely on Microsoft's Bing service, and the Cortana team indicated those services would eventually be accessible to Web and application developers.

As it turns out, eventually is now. Though the most important elements are available only in a private preview, many of the machine learning capabilities behind Cortana have been released as part of Project Oxford, a joint effort between Microsoft Research and the Bing and Azure teams announced at Build in April. At the conference, Ars got a deep dive into the components of Project Oxford with Ryan Galgon, the senior program manager at Microsoft Technology and Research shepherding the project to market.

The APIs make it possible to add image and speech processing to just about any application, often with a single Web request. "They're all finished machine learning services in the sense that developers don't have to create any model for them in Azure," Galgon told Ars. "They're very modular." All of the services are exposed as representational state transfer (REST) Web services based on HTTP "verbs" (such as GET, PUT, and POST), and they require an Azure API subscription key. To boot, all the API requests and responses are encrypted via HTTPS to protect their content.
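In practice, that means a call to one of these services is a single authenticated HTTPS request. Here's a minimal sketch in Python, assuming the third-party requests library; the endpoint and parameters are taken from the Face API detection example covered later in this article, and the subscription key is a placeholder:

import requests  # third-party HTTP library (pip install requests)

# The Face API detection endpoint used in the example later in this article.
FACE_ENDPOINT = "https://api.projectoxford.ai/face/v0/detections"
SUBSCRIPTION_KEY = "your-azure-subscription-key"  # placeholder for a real Azure key

response = requests.post(
    FACE_ENDPOINT,
    params={"analyzesAge": "true", "analyzesGender": "true"},  # ask only for age and gender
    headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY},   # the Azure API key header
    json={"url": "http://arstechnica.com/wp-content/uploads/authors/Sean-Gallagher.jpg"},
)
response.raise_for_status()  # errors (bad key, rate limiting) surface here
print(response.json())       # a JSON array describing each detected face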

Currently, the Project Oxford services are free to try for anyone with an Azure account, though usage is rate-limited. While their idiosyncrasies are worked out, the services can be leveraged through software development kits for a number of platforms as well as Microsoft's Azure, bringing speech-to-text, text-to-speech, computer vision, and facial recognition capabilities to virtually any application, whether mobile, Web, or otherwise.

For now, the missing piece is the intelligence that can take applications' text and speech interactions to the next step. That capability is wrapped up in what Microsoft calls LUIS (Language Understanding Intelligent Service), a text-processing service that will be able to determine user intent from a string of text, whether it's typed or spoken. LUIS identifies "entities" within text, such as names, dates and times, actions, concepts, and things, and the service can be wired into cloud applications to perform the appropriate task.
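Because LUIS is still in private preview, its wire format isn't public. The Python sketch below is purely hypothetical, illustrating the kind of intent-plus-entities result the service is described as producing and how an application might dispatch on it:

# Purely hypothetical: LUIS is in private preview, and Microsoft hasn't
# published its response format. This only illustrates the kind of
# intent-plus-entities result described above and how an app might act on it.
parsed = {
    "query": "remind me to call Sean at 3pm tomorrow",
    "intent": "CreateReminder",          # the user's inferred goal
    "entities": {
        "person": "Sean",                # a name picked out of the text
        "datetime": "2015-05-08T15:00",  # a resolved date and time
    },
}

def handle(result):
    """Wire the inferred intent to the appropriate application task."""
    if result["intent"] == "CreateReminder":
        who = result["entities"]["person"]
        when = result["entities"]["datetime"]
        return "Reminder set: call {} at {}".format(who, when)
    return "Sorry, I didn't catch that."

print(handle(parsed))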

The services aren't perfect yet, but they demonstrate that Microsoft is trying to put its tools at the center of the next wave of new applications. Cortana aims to serve the mobile world and smart devices that will have no keyboards, mice, or even screens. Combined with the rest of the capabilities and interfaces being provided to developers through Azure and Bing, the approach makes a pretty strong case for Microsoft's continued relevance. Even if the Windows desktop moves from being the star of the show to a supporting role, it seems the many faces of Cortana are primed for the spotlight.

A quick trip through the public release of Project Oxford APIs.

Eye in the cloud

Two of the four sets of services in Project Oxford are focused on image processing. The first, the Face API, was partially demonstrated in Microsoft's sample How-old.net application, which guesses the age of people whose faces appear in an uploaded photo. The API "provides a set of detection, verification, proofing, and identification services," Galgon said, analyzing facial geometry and applying rules built through the machine-learning process. The Face service has been trained to guess the age and gender of the subjects it identifies with fair accuracy, and it can also perform facial recognition of an individual, either by matching photos of the same person within a collection or by comparing against a pre-loaded "person" identity.

The input for the Face API is an HTTP POST request that includes the image (a JPG, GIF, PNG, or BMP file) to be analyzed. Each type of processing request includes the photo either as a binary object sent as "application/octet-stream" data or as a URL pointing to a Web-accessible image, wrapped in a JavaScript Object Notation (JSON) body. Along with the image, the request includes a set of parameters instructing the API on which information to return.

For example, using my image from my author page on Ars to get facial geometry data, a guess at my gender and age, and analysis of my head pose (with estimated pitch, yaw, and roll away from the dead-on view), my app would send the following:

POST face/v0/detections?analyzesFaceLandmarks=true&analyzesAge=true&analyzesGender=true&analyzesHeadPose=true
Content-Type: application/json
Host: api.projectoxford.ai
Ocp-Apim-Subscription-Key: ••••••••••••••••••••••••••••••••
Content-Length: 81

{ "url":"http://arstechnica.com/wp-content/uploads/authors/Sean-Gallagher.jpg" }

And the Web service returns the following in JSON, accurately guessing my gender and slightly underestimating my age. It also provides a Face ID that can be used to check any other images against later—an ID that Azure retains for up to 24 hours:

[{
  "faceId": "ade42988-fd58-4422-b207-688e6a0d417d",
  "faceRectangle": {"top": 111, "left": 62, "width": 137, "height": 137},
  "faceLandmarks": {
    "pupilLeft": {"x": 107.8, "y": 144.8}, "pupilRight": {"x": 167.4, "y": 155.0},
    "noseTip": {"x": 121.3, "y": 182.6},
    "mouthLeft": {"x": 96.6, "y": 203.8}, "mouthRight": {"x": 156.0, "y": 215.5},
    "eyebrowLeftOuter": {"x": 84.0, "y": 134.2}, "eyebrowLeftInner": {"x": 120.3, "y": 139.4},
    "eyeLeftOuter": {"x": 95.3, "y": 145.5}, "eyeLeftTop": {"x": 105.8, "y": 141.4},
    "eyeLeftBottom": {"x": 104.6, "y": 150.5}, "eyeLeftInner": {"x": 114.3, "y": 148.4},
    "eyebrowRightInner": {"x": 146.9, "y": 143.5}, "eyebrowRightOuter": {"x": 190.9, "y": 151.1},
    "eyeRightInner": {"x": 153.7, "y": 155.2}, "eyeRightTop": {"x": 164.4, "y": 151.9},
    "eyeRightBottom": {"x": 163.8, "y": 160.8}, "eyeRightOuter": {"x": 174.5, "y": 158.2},
    "noseRootLeft": {"x": 124.4, "y": 151.8}, "noseRootRight": {"x": 137.1, "y": 153.7},
    "noseLeftAlarTop": {"x": 117.7, "y": 170.8}, "noseRightAlarTop": {"x": 138.1, "y": 174.0},
    "noseLeftAlarOutTip": {"x": 109.3, "y": 182.7}, "noseRightAlarOutTip": {"x": 144.1, "y": 187.9},
    "upperLipTop": {"x": 121.7, "y": 207.1}, "upperLipBottom": {"x": 121.4, "y": 211.2},
    "underLipTop": {"x": 120.9, "y": 213.3}, "underLipBottom": {"x": 119.4, "y": 219.4}
  },
  "attributes": {
    "headPose": {"pitch": 0.0, "roll": 10.7, "yaw": -13.4},
    "gender": "male",
    "age": 44
  }
}]
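For photos that aren't hosted at a Web-accessible URL, the same detection call can carry the raw image bytes instead. Here's a minimal Python sketch of that variant, again assuming the requests library; the filename and key are placeholders:

import requests  # third-party HTTP library (pip install requests)

# Send the image itself as "application/octet-stream" data rather than a
# JSON body holding a URL. The filename and key are placeholders.
with open("photo.jpg", "rb") as f:
    image_bytes = f.read()  # JPG, GIF, PNG, or BMP payload

response = requests.post(
    "https://api.projectoxford.ai/face/v0/detections",
    params={
        "analyzesFaceLandmarks": "true",
        "analyzesAge": "true",
        "analyzesGender": "true",
        "analyzesHeadPose": "true",
    },
    headers={
        "Content-Type": "application/octet-stream",
        "Ocp-Apim-Subscription-Key": "your-azure-subscription-key",
    },
    data=image_bytes,
)
response.raise_for_status()
for face in response.json():  # one entry per face found in the photo
    print(face["faceId"], face["attributes"]["gender"], face["attributes"]["age"])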

The Project Oxford Face API can be used for facial matching and recognition in a number of more sophisticated ways. It can be trained on specific faces, creating facial identity profiles (which can also be zapped remotely when no longer needed with a REST DELETE request). Using the facial geometry data associated with an image or an identity, the Face service can also process groups of faces. And in a fashion similar to Facebook's automatic image tagging, Face can identify the individuals in each photo, returning face rectangle and identification data in JSON format.
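Cleaning up a stored profile is just another REST call. The Python sketch below is illustrative only: the exact resource path for identity profiles comes from the Project Oxford documentation, so the "persons/{id}" path, the ID, and the key here are all hypothetical placeholders:

import requests  # third-party HTTP library (pip install requests)

# Hypothetical resource path; consult the Project Oxford docs for the real
# one. The ID stands in for an identifier returned when the identity
# profile was created, and the key is a placeholder.
person_id = "stored-identity-id"

response = requests.delete(
    "https://api.projectoxford.ai/face/v0/persons/" + person_id,
    headers={"Ocp-Apim-Subscription-Key": "your-azure-subscription-key"},
)
response.raise_for_status()  # a successful DELETE removes the stored profile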