Media capture functionality in Microsoft Edge

In the latest Windows 10 preview release, we added support for media capture APIs in Microsoft Edge for the first time. This feature is based on the Media Capture and Streams specification, developed jointly at the W3C by the Web Real-Time Communications Working Group and the Device APIs Working Group. It is known by some web developers simply as getUserMedia, which is the main interface that allows webpages to access media capture devices such as webcams and microphones. This feature can be toggled under the experimental features interface in Microsoft Edge, which can be found by navigating to about:flags. To allow for early feedback from the web development community, we’ve set this feature to be “on” by default in the latest Windows Insider preview.

The media capture functionality in Microsoft Edge is implemented based on the W3C Media Capture and Streams specification that recently reached the Working Group “Last Call” status. We are very excited that major websites such as Facebook share the same vision and have adopted standards-based interfaces to enable the best user experiences across browsers.

In this blog, we will share insights on some of our implementation decisions, and share details on what we have implemented today and what we are still working on for a future release. We will also suggest some best practices when using the media capture APIs.

A brief summary of the Media Capture and Streams APIs

The getUserMedia() method is a good starting point to understand the Media Capture APIs. The getUserMedia() call takes MediaStreamConstraints as an input argument, which defines the preferences and/or requirements for capture devices and captured media streams, such as camera facingMode, video resolution, and microphone volume. Through MediaStreamConstraints, you can also pick a specific capture device using its deviceId, which can be derived from the enumerateDevices() method. Once the user grants permission, the getUserMedia() call will return a promise with a MediaStream object if the specified MediaStreamConstraints can be met.
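As a quick illustration (the constraint values below are our own examples, not required defaults), a getUserMedia() call expressing preferences through MediaStreamConstraints might look like this:

```javascript
// Sketch: request a front-facing camera at a preferred resolution, plus a
// microphone. The specific values here are illustrative only.
const constraints = {
  video: {
    facingMode: "user",        // prefer the front-facing webcam
    width: { ideal: 640 },     // preferences, not hard requirements
    height: { ideal: 360 }
  },
  audio: true
};

// In the browser, getUserMedia() returns a promise that resolves with a
// MediaStream once the user grants permission.
if (typeof navigator !== "undefined" && navigator.mediaDevices) {
  navigator.mediaDevices.getUserMedia(constraints)
    .then(stream => console.log("Got", stream.getTracks().length, "track(s)"))
    .catch(err => console.error("Capture failed:", err.name));
}
```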

The MediaStream object will have one or both of the following: one MediaStreamTrack for the captured video stream from a webcam, and one MediaStreamTrack for the captured audio stream from a microphone. The MediaStream object can be rendered on multiple rendering targets, for example, by setting it on the srcObject attribute of MediaElement (e.g. video or audio tags), or the source node of a Web Audio graph. The MediaStreamTracks can also be used by the ORTC API (which we are in the process of implementing) to enable real-time communications.

User permissions

While media capture functionality can enable a lot of exciting user and business scenarios, it also introduces security and privacy concerns. Therefore, user consent on streaming audio and video from the capture devices is a critical part of the feature. The W3C spec recommends some best practices while also leaving some flexibility to each browser’s implementation. To balance security and privacy concerns with user experience, the implementation in Microsoft Edge does the following:

If the webpage is from an HTTP origin, the user will be prompted for permission when a capture device is accessed through the getUserMedia() call. We will allow permission to persist for the specific capture device type until all capture devices of the specific type are released by the webpage.

For webpages from an HTTPS origin, when a user grants permission for a webpage to access a capture device, the permission will persist for the specific capture device type. If the user navigates away to another page, all permissions will be dismissed. Microsoft Edge does not store any permanent permissions for a page or domain.

When a webpage calls getUserMedia() from an iframe, we will manage the capture device permission separately based on its own URL. This provides protection to the user in cases where the iframe is from a different domain than its parent webpage.

Once a user grants permission for a webpage to access a media capture device, it is important to help the user to track which browser tab is actively using the capture device, especially when the user has navigated to a different tab. Microsoft Edge will use a “recording” badge in the tab title to indicate tabs streaming audio and/or video data from the capture devices. Note that this feature is not implemented in the current release.

Capture device selection and settings

The getUserMedia() interface allows a lot of flexibility in capture device selection and settings through MediaStreamConstraints. The W3C spec has very detailed descriptions on the Constrainable Pattern and corresponding decision process. We’d like to share more of our implementation details, especially regarding default expectations.

The following table summarizes the default setting we have internally on some of the constraints.

Constraint      Default value
-----------     -------------------------------
width           640
height          360
aspectRatio     1.7777777778 (16:9)
frameRate       30
volume          1.0
sampleRate      device default
sampleSize      device default (16- or 32-bit)

When setting the constraints, please keep in mind that capture devices tend to have a wide range of different capabilities. Unless your target scenario has a must-have requirement, you should give the browser as much flexibility as possible to make device selection and setting decisions for you. Our capture pipeline is currently limited to the device default audio sample size and sample rate, and doesn’t support setting a different sampleSize or sampleRate. Additionally, our capture pipeline currently relies on the global setting in the Windows audio device manager to determine the audio sampleRate of specific microphone devices.
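For example (our own illustration), expressing a resolution as an ideal value lets the browser fall back gracefully, while exact turns it into a hard requirement that can fail on many devices:

```javascript
// Flexible: the browser may pick the closest resolution the webcam supports.
const flexible = { video: { width: { ideal: 1280 }, height: { ideal: 720 } } };

// Rigid: getUserMedia() will fail if no webcam supports exactly 1280x720,
// so reserve exact for genuine must-have requirements.
const rigid = { video: { width: { exact: 1280 }, height: { exact: 720 } } };
```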

If you plan to use the media capture streams jointly with ORTC in real-time communications, we suggest not setting the “volume” constraint. The Automatic Gain Control logic in the ORTC component will be invoked to handle the volume levels dynamically. The volume level can also be adjusted by Windows users through the audio device manager tool.

We don’t currently have any default or preferred facingMode for webcams. Instead, we encourage you to set the facingMode for your specific scenarios. Where it is not specified, we will try to pair up the webcam with the microphone in the device selection logic.

There is a known issue in the video capture pipeline which doesn’t allow setting webcam resolution in this preview release. We are working on a fix which we expect should be available in the next Insider build.

If the deviceId or groupId is not explicitly set in the MediaStreamConstraints, we will go through the following logic to select the capture devices. Here, let us assume we want to select one microphone and one webcam:

1. If there is one set of capture devices that satisfies the MediaStreamConstraints with the best match, we will choose those devices.

2. Otherwise, if multiple microphones and webcams match the MediaStreamConstraints equally well:

   - We first pick the system default microphone device for communications if it is on the list. We then pick the webcam that pairs with the microphone if there is one, or pick the first webcam on the webcam list.
   - If the system default microphone is not defined, we will enumerate through the capture devices to pair up each microphone and webcam based on their groupId. The first pair we find will be the one we select.
   - If the above fails, we will pick the first microphone and first webcam from the candidate device list.
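The groupId pairing step above can be sketched as a small helper. This is a simplified illustration of ours, not the actual Edge selection code, operating on the device list returned by enumerateDevices():

```javascript
// Find the first microphone/webcam pair that shares a groupId; otherwise
// fall back to the first microphone and first webcam on the candidate list.
function pairByGroupId(devices) {
  const mics = devices.filter(d => d.kind === "audioinput");
  const cams = devices.filter(d => d.kind === "videoinput");
  for (const mic of mics) {
    const cam = cams.find(c => c.groupId && c.groupId === mic.groupId);
    if (cam) return { mic, cam };
  }
  return { mic: mics[0] || null, cam: cams[0] || null };
}
```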



A headset, when plugged in, provides the default microphone device and default audio rendering option for communications scenarios. Advanced Windows users can change their audio device settings manually for specific purposes through the audio devices manager.

Updates to Media Elements

We have updated our Media Elements (audio/video tags) implementation to enable using it as a rendering target for MediaStream objects. The W3C spec has a table with a very good summary of the changes to the Media Elements. As part of our implementation decision, we now internally handle all real-time media streams, either from a local capture device or from a remote ORTC receiver object, using the Windows Media Foundation low-latency playback mode (i.e. the real-time mode). For video capture using built-in webcams, we also handle device rotation internally by setting the right property on video samples so the video tag can render video frames in the correct orientation.

In some other implementations of the feature, “srcObject” is not supported. Developers would need to convert a MediaStream object using the URL.createObjectURL() method and then set it on the “src” attribute of the Media Element. We do not currently support that legacy behavior, and instead follow the latest W3C spec. Both Chrome and Firefox currently have active tickets to track “srcObject” support.
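A feature-detecting helper (our own sketch) can cover both the standards-based srcObject path and the legacy object-URL path:

```javascript
// Prefer the standards-based srcObject attribute; fall back to the legacy
// URL.createObjectURL() approach for browsers that only support "src".
function attachStream(videoElement, stream) {
  if ("srcObject" in videoElement) {
    videoElement.srcObject = stream;
  } else {
    videoElement.src = URL.createObjectURL(stream);
  }
}
```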

Promises vs. callback patterns

Based on the W3C spec, we support both the promise-based getUserMedia() method and the callback-based getUserMedia() method. The callback-based method allows an easier transition if you have a webpage using the interface already (although it might be a vendor-prefixed version). We encourage web developers to use the promise-based approach to follow the industry trend for new interface design styles on the web.
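If you already have callback-style code, a small shim (our illustration; the function name is hypothetical) makes the transition to the promise pattern straightforward:

```javascript
// Wrap a callback-based getUserMedia implementation (success and error
// callbacks) in a promise so call sites can use the modern pattern.
function promisifyGetUserMedia(getUserMedia, constraints) {
  return new Promise((resolve, reject) => {
    getUserMedia(constraints, resolve, reject);
  });
}

// In the page (hypothetical usage):
// promisifyGetUserMedia(navigator.getUserMedia.bind(navigator), { audio: true })
//   .then(stream => { /* ... */ });
```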

Missing features

Our implementation does not currently support getting video resolutions not natively supported by the webcam. This is largely due to a lack of a video DSP module in our media capture pipeline. We currently don’t have plans to address this in the near term.

We also currently don’t support echoCancellation in our MediaTrackConstraintSet. This is a limitation in our current media capture pipeline. We plan to support echo cancellation in our ORTC media stack for real-time communications in a future update.

Sample scenarios using media capture

Media capture is an essential step in many scenarios, including real-time audio and video communications, snapping a photo, capturing a barcode, or recording a voice message. Below we walk through a couple of simple scenarios to introduce how to use the media capture functionality.

Scenario #1: Capture photo from webcam

First, get a video stream from a webcam and put it in a video tag for preview. Let’s assume we have a video tag on the page and it is set to autoplay.

https://gist.github.com/kypflug/d0a1f5058873a8187b2c

Here is one example which accounts for the legacy src approach, for browsers that don’t yet support the standards-based approach:

https://gist.github.com/kypflug/3a73cddf9ddf8a171923

Next, copy a video frame onto a canvas. Let’s assume we have set up the event listener so when you tap the video tag, we will invoke the following function:

https://gist.github.com/kypflug/fdcccc85d859b0e60337
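A sketch of that step (the 640-pixel cap and function names are our own assumptions, not from the original gist):

```javascript
// Pure helper: scale a source size down to a maximum width while keeping
// the aspect ratio intact.
function fitSize(srcWidth, srcHeight, maxWidth) {
  const scale = maxWidth / srcWidth;
  return { width: maxWidth, height: Math.round(srcHeight * scale) };
}

// Copy the current video frame onto the canvas (browser-only).
function captureFrame(video, canvas) {
  const { width, height } = fitSize(video.videoWidth, video.videoHeight, 640);
  canvas.width = width;
  canvas.height = height;
  canvas.getContext("2d").drawImage(video, 0, 0, width, height);
}
```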

Finally, save the picture:

https://gist.github.com/kypflug/14392abe88c7e02dec81

You can also change the code here to upload the data blob to your web server.
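For instance (a sketch of ours, assuming the canvas from the previous step), the picture can be serialized with toDataURL() and split into its parts before upload:

```javascript
// Split a data URL such as "data:image/png;base64,AAAA" into its MIME type
// and base64 payload, ready to post to a server.
function parseDataURL(dataURL) {
  const comma = dataURL.indexOf(",");
  const header = dataURL.slice(0, comma);        // e.g. "data:image/png;base64"
  return {
    mime: header.slice(5, header.indexOf(";")),  // e.g. "image/png"
    data: dataURL.slice(comma + 1)               // base64 payload
  };
}

// In the page (hypothetical usage):
// const { mime, data } = parseDataURL(canvas.toDataURL("image/png"));
```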

Don’t forget to release the webcam device after you complete the task. Some earlier browser implementations have a stop() method on the MediaStream object. That is not supported by the W3C spec. Instead, you should call the stop() method on the MediaStreamTrack object to release the capture device. For example:

https://gist.github.com/kypflug/dc7136fb3a88596fea01
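Releasing the device amounts to stopping every track on the stream; a minimal sketch (our helper name is illustrative):

```javascript
// Stop each MediaStreamTrack so the webcam/microphone is released and any
// in-use indicator turns off.
function releaseStream(stream) {
  for (const track of stream.getTracks()) {
    track.stop();
  }
}
```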

Once you know how to use one webcam, it should not be difficult to introduce a camera switching feature to your page to handle multiple webcams. You can check out our demo at the Microsoft Edge Dev site for more.

Scenario #2: Capture a voice message from microphone

Now let’s look at a simple example using the microphone.

First, get an audio stream from a microphone and set it as the source node of a web audio graph.

https://gist.github.com/kypflug/52efb8cc728db6a77a45

Next, extract the audio data from a web audio ScriptProcessorNode. This is too lengthy to include here, but you can check out the actual demo code at the Microsoft Edge Dev site and GitHub repository, where you can also add some simple web audio DSP filters before the ScriptProcessorNode.

Finally, save the audio data into a wav file. Once we have the audio data, we can add a wav file header and save the data blob. Again, please check out the actual demo code at the Microsoft Edge Dev site.
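As an illustration of the “add a wav file header” step (our own sketch, not the demo code; it assumes interleaved 16-bit PCM samples), the 44-byte RIFF/WAVE header can be written like this:

```javascript
// Build a 44-byte WAV (RIFF) header for 16-bit PCM audio, append the
// sample data, and return the combined bytes.
function encodeWav(samples, sampleRate, channels) {
  const bytesPerSample = 2;                       // 16-bit PCM
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeString = (offset, s) => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };

  writeString(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);         // RIFF chunk size
  writeString(8, "WAVE");
  writeString(12, "fmt ");
  view.setUint32(16, 16, true);                   // fmt chunk size
  view.setUint16(20, 1, true);                    // audio format: PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * channels * bytesPerSample, true); // byte rate
  view.setUint16(32, channels * bytesPerSample, true);              // block align
  view.setUint16(34, 16, true);                   // bits per sample
  writeString(36, "data");
  view.setUint32(40, dataSize, true);
  for (let i = 0; i < samples.length; i++) {
    view.setInt16(44 + i * bytesPerSample, samples[i], true);
  }
  return new Uint8Array(buffer);
}
```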

We’ve only talked about a couple of simple examples as starting points. Once you’ve gotten the captured video stream into a video tag and then onto a canvas, and gotten captured audio into web audio, it’s easy to see many scenarios that light up with just a bit more work. We’re eager for your feedback so we can further improve our implementation, and meanwhile we are looking forward to seeing what you do with these new tools!

– Shijun Sun, Senior Program Manager, Microsoft Edge