In the last couple of days, I’ve been experimenting with webRTC as a means of getting live real-time-communication (voice, video, data) flowing between two Universal Windows Platform apps and I thought I’d start to share my experiments here.

There’s a big caveat in that these are rough notes. I’m very new to these pieces and so there are probably quite a few mistakes in these posts that I’ll realise when I’ve spent more time on it, but I quite like the approach of ‘learning in public’.

Why look at webRTC from the web as a technology for communications between native applications?

I think it comes down to;

it already exists and there’s lots of folks using it on the web so there’s a strong re-use argument.

there’s the chance of interoperability.

and last, but by no means least, there is already an implementation out there for webRTC in UWP.

Working out the webRTC Basics

One of the advantages around well-used web technologies is that they come with a tonne of resources and the primary one that I’ve been reading is this one;

which tells me about the architecture of webRTC and its use in implementing RTC in a browser context and there is no shortage of code labs to show you how to implement things in JavaScript in a browser.

Once a technology like this works on the web it’s not unnatural to want to try and make use of it in other contexts and so there are also lots of tutorials that talk about making use of webRTC inside of an Android app or an iOS app but I didn’t find so much around Windows with/without UWP.

The other thing that’s great about that ‘Getting Started’ page is that it tells you about the core pieces of webRTC;

MediaStream – a stream of media, synchronized audio and video

RTCPeerConnection – what seems to be the main object in the API involved in shifting media streams between peers

RTCDataChannel – the channel for data that doesn’t represent media (e.g. chat messages)

and that led me to lots of samples on the web such as this one;

and these samples are great on the one hand because they got me used to the idea of a ‘flow’ that happens between 2 browsers that want to use webRTC which in my head runs something like this;

both browsers take a look at their media capabilities in terms of audio, video streams via the getUserMedia API.

browser one can now make ‘an offer’ to browser two around the capabilities that it has (streams, codecs, bitrates, etc) and it does this via the RTCPeerConnection.CreateOffer() API with the results represented via SDP (Session Description Protocol).

browser one uses the PeerConnection.SetLocalDescription(Type: Offer) API to store this SDP as its local description.

browser two can import that ‘offer’ by using the PeerConnection.SetRemoteDescription(Type: Offer) API with what it received from its peer.

browser two can then create ‘an answer’ via the RTCPeerConnection.CreateAnswer() API with another lump of SDP to describe its own capabilities.

browser two uses the PeerConnection.SetLocalDescription() API with the results from CreateAnswer().

browser one can import that ‘answer’ and perhaps the two endpoints can agree on some common means of communicating the audio/video streams.

browser one uses the PeerConnection.SetRemoteDescription(Type: Answer) API to store the answer that it got from the peer.
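The ordering of that dance can be sketched without any real webRTC library at all. What follows is a toy model, not the SDK: ToyPeer, OfferAnswerDemo and the fake "offer-from-…" strings are all made up for illustration, and only the three state names are borrowed from the signalling states in the webRTC specification (provisional answers and rollback are left out).

```csharp
using System;

// State names borrowed from the webRTC specification's signalling states.
public enum SignalingState { Stable, HaveLocalOffer, HaveRemoteOffer }

// Hypothetical toy peer: it carries no media, it only checks that the
// offer/answer calls arrive in a legal order.
public sealed class ToyPeer
{
    public string Name { get; }
    public SignalingState State { get; private set; } = SignalingState.Stable;

    public ToyPeer(string name) { Name = name; }

    // Stand-in for CreateOffer(): returns a fake description string.
    public string CreateOffer() => $"offer-from-{Name}";

    // An answer only makes sense once a remote offer has been applied.
    public string CreateAnswer()
    {
        Require(SignalingState.HaveRemoteOffer, "CreateAnswer");
        return $"answer-from-{Name}";
    }

    public void SetLocalDescription(string sdp)
    {
        if (IsOffer(sdp))
        {
            Require(SignalingState.Stable, "SetLocalDescription(offer)");
            State = SignalingState.HaveLocalOffer;
        }
        else
        {
            Require(SignalingState.HaveRemoteOffer, "SetLocalDescription(answer)");
            State = SignalingState.Stable;
        }
    }

    public void SetRemoteDescription(string sdp)
    {
        if (IsOffer(sdp))
        {
            Require(SignalingState.Stable, "SetRemoteDescription(offer)");
            State = SignalingState.HaveRemoteOffer;
        }
        else
        {
            Require(SignalingState.HaveLocalOffer, "SetRemoteDescription(answer)");
            State = SignalingState.Stable;
        }
    }

    private static bool IsOffer(string sdp) =>
        sdp.StartsWith("offer", StringComparison.Ordinal);

    private void Require(SignalingState expected, string operation)
    {
        if (State != expected)
        {
            throw new InvalidOperationException(
                $"{Name}: {operation} is not valid in state {State}");
        }
    }
}

public static class OfferAnswerDemo
{
    // Walks both peers through the full exchange and returns their final states.
    public static SignalingState[] Run()
    {
        var one = new ToyPeer("one");
        var two = new ToyPeer("two");

        string offer = one.CreateOffer();
        one.SetLocalDescription(offer);      // one: HaveLocalOffer
        two.SetRemoteDescription(offer);     // two: HaveRemoteOffer

        string answer = two.CreateAnswer();
        two.SetLocalDescription(answer);     // two: Stable
        one.SetRemoteDescription(answer);    // one: Stable

        return new[] { one.State, two.State };
    }
}
```

Both peers end up back in the Stable state once the answer has been applied at each end, and calling CreateAnswer() on a peer that has not yet seen an offer throws, which is the ordering constraint the list above describes.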

So, there’s this little dance between the two endpoints and one of the initially confusing things for me was that webRTC doesn’t dictate things like;

how browser one discovers that browser two might exist or be open to communicating with it in the first place.

how browser one and two swap address details so that they can ‘directly’ communicate with each other.

how browser one and two swap these ‘offers’ and ‘answers’ back and forth before they have figured out address details for each other – how they ‘talk’ when they can’t yet ‘talk’!

Instead, the specification calls that signalling and it’s left to the developer who is using webRTC to figure out how to implement it and if I go back to this article again;

then it took me a little time to figure out that when this article talks about ‘the server’, it is really talking about a specific implementation of a signalling server for webRTC and it uses a web socket server running on node.js to provide signalling but webRTC isn’t tied to that server or its implementation in any way – it just needs some implementation of signalling to work.

That seems like quite a lot to get your head around but there are more details involved in getting this type of communication working over the public internet.

More ‘Basics’ – webRTC and ICE, STUN, TURN

In a simple world, two browsers that wanted to send audio/video streams back and forth would just be able to exchange IP addresses and port numbers and set up sockets to do the communications but that’s not likely to be possible on the internet.

That’s where the article…

comes in and does a great job of explaining what signalling is for and how additional protocols come into play trying to make this happen on the internet where devices are likely to be behind firewalls and NATs.

Specifically, the article explains that the ICE Framework is used to try and figure out the ‘most direct’ way for the two peers to talk to each other.

If the two peers were somehow able to make a direct host<->host connection (e.g. on a common network) then that’s what ICE seems to prefer to choose.

If it needs to, it can use Session Traversal Utilities for NAT (STUN) to deal with a host that has its address hidden behind a NAT.

Additionally, if it needs to, it can use Traversal Using Relays around NAT (TURN) for scenarios where it is not possible to do point<->point communications between the two hosts. In that case a ‘relay’ (or man in the middle) server on the internet can be used to relay the messages between the two although, naturally, copying around media streams for lots of clients is likely to lead to a busy server and there’s the question of finding such a server and someone to pay for hosting it.
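The preference for direct connections shows up in the priority number that ICE attaches to each candidate. As a small sketch, the recommended formula from the ICE specification (RFC 8445, section 5.1.2) can be written out as below. The class and enum names are my own for illustration, and the default local preference of 65535 is an assumption that suits a host with a single network interface.

```csharp
using System;

public enum CandidateType { Host, PeerReflexive, ServerReflexive, Relayed }

public static class IcePriority
{
    // Recommended type preferences from RFC 8445: the more direct the
    // candidate, the higher the number.
    public static uint TypePreference(CandidateType type)
    {
        switch (type)
        {
            case CandidateType.Host: return 126;            // address on a local interface
            case CandidateType.PeerReflexive: return 110;   // learned during connectivity checks
            case CandidateType.ServerReflexive: return 100; // learned via STUN
            default: return 0;                              // relayed via TURN
        }
    }

    // priority = 2^24 * typePreference + 2^8 * localPreference + (256 - componentId)
    public static uint Compute(
        CandidateType type, uint localPreference = 65535, uint componentId = 1)
    {
        return (TypePreference(type) << 24)
             + (localPreference << 8)
             + (256 - componentId);
    }
}
```

With those defaults, a host candidate works out at 2130706431, a STUN-derived (server reflexive) candidate at 1694498815 and a TURN-relayed candidate at 16777215, so host candidates sort first and relayed ones last. These are the same large numbers that appear in the candidate lines of real SDP.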

UWP and webRTC

With some of that background coming together for me, I turned my attention to the github project which has an implementation of webRTC for the UWP;

and I found it surprisingly approachable.

I cloned down the entire repository, installed Strawberry Perl in order to help me build it and followed the simple instructions of running the prepare.bat file to build it all out and then, as instructed, I opened up the solution which (as below) contains;

and so the first thing that surprised me here is that the various API pieces that I’d been reading about (RTCPeerConnection etc) look to be directly represented here and I think they are built out as a WinRT library by the Org.WebRtc project – i.e. that project seems to take the x-platform C++ pieces and wrapper them up for UWP use.

Then there’s the PeerCC folder which contains two samples. A server and a client.

It took me a little while to figure out that the server is just a simple socket server which runs as a signalling server;

and I think it’s a standalone executable as I copied it to a virtual machine and ran it in the cloud and it ‘just worked’ and you simply get a command line output like;

There’s then the client (UWP) side of this sample which is in the other project and runs up an interface as below;

this app then lets you enter the details of your signalling server (I don’t think 127.0.0.1 loopback will work so I didn’t try that) and then if you run the same app on another PC and point it at the same signalling server then you can very quickly get video & voice running between those two machines using this sample.

It’s important to say that in the screenshot above, I have deliberately removed the ICE servers that the sample runs with by default – it uses stun[1234].l.google.com:19302 when you run it up for the first time. I only removed them because I wanted to prove to myself that the sample didn’t need them in the case where the participants could make a direct connection to each other.

So, it was pretty easy to get hold of these bits and find a sample that worked but there’s a lot of code in that sample and I felt that I needed to unpick it a little as it seemed to be showcasing all the features but not really giving me an indication of the minimum amount of code to get this working.

Unpicking the Sample

I spent some time reading, running, debugging this sample and it’s well structured and the bits that I found most interesting were in 3 places;

The code in the Signalling folder contains two classes that do quite a lot of the work. The Signalling class knows how to send messages back/forth to the signalling socket server and it uses a really simple HTTP protocol operating on a ‘long poll’ such that;

New clients announce themselves to the server.

Each client polls the server on a long timeout, waiting to be told about other clients arriving/leaving and any new messages.

The Conductor class is a form of ‘controller’ which largely centralises the webRTC API calls for the app and is used a lot by the MainViewModel which takes parameters to/from the UI and passes them onto that Conductor to get things done.
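To make the ‘long poll’ idea a little more concrete, here is a minimal in-memory sketch of it. It is not the sample’s signalling server and it involves no HTTP at all: ToySignallingServer, its method names and the "joined:"/"message:" string formats are all invented for illustration. Each client gets a mailbox and a poll simply blocks on that mailbox until something arrives or a timeout passes.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical in-memory stand-in for a long-poll signalling server.
public sealed class ToySignallingServer
{
    private readonly ConcurrentDictionary<int, BlockingCollection<string>> mailboxes =
        new ConcurrentDictionary<int, BlockingCollection<string>>();
    private int lastId;

    // A new client announces itself; clients already present are told about it.
    public int SignIn(string name)
    {
        int id = Interlocked.Increment(ref lastId);
        foreach (var mailbox in mailboxes.Values)
        {
            mailbox.Add($"joined:{id}:{name}");
        }
        mailboxes[id] = new BlockingCollection<string>();
        return id;
    }

    // Queue a message (for example an SDP offer) for another client.
    public void Send(int fromId, int toId, string message)
    {
        mailboxes[toId].Add($"message:{fromId}:{message}");
    }

    // The 'long poll': block until something arrives for this client,
    // or return null if the timeout expires first.
    public string Wait(int clientId, TimeSpan timeout)
    {
        string item;
        return mailboxes[clientId].TryTake(out item, timeout) ? item : null;
    }
}
```

A client would call Wait in a loop with a long timeout and treat a null result as ‘nothing happened, poll again’. A real server would also have to handle clients leaving and run over the network, which this sketch ignores.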

This is really great but I still wanted a ‘simpler’ sample that captured more of the ‘essence’ of what was necessary here without getting bogged down in the details of signalling and ICE Servers and so on.

And so I wrote my own.

Making My Own Sample (using the NuGet package)

I made my own sample which is quite difficult to use but which allowed me to get started with the basics by taking away the need for signalling servers and ICE servers.

I felt that if I could do this to get more of a basic understanding of the essentials then I could add the other pieces afterwards and layer on the complexity.

In writing that sample, I initially worked in a project where I had my own C# project code alongside the C++/CX code for the webRTC SDK so that I could use mixed-mode debugging and step through the underlying source as I made mistakes in using the APIs and that proved to be quite a productive approach.

However, as my C# code got a little closer to ‘working’ I switched from using the source code and/or the binaries I’d built from it and, instead, started using the webRTC SDK via the NuGet package that is shipped for it;

as that seemed to give Visual Studio less to think about when doing a rebuild on the project and simplified my dependencies.

While I’m trying to avoid having to use some kind of signalling service for my example, I still need something to transfer data between my two apps that want to communicate and so I figured that I would simply put the necessary data onto the screen and then copy it manually back and forwards between the two apps so that I become a form of human signalling service.

I made a new UWP project and switched on a number of capabilities to ensure that I didn’t bang up against problems (e.g. webcam, microphone, internet client/server and private network client/server).

I then constructed the ‘UI from hell’. The application runs up with an Initialise button only;

and, once initialised, it presents this confusing choice of buttons where only I know as the developer that there is a single, ‘safe’ path through pressing them

and I can then use the Create Offer button to create an offer from this machine and populate the text block above with it;

As the ‘human signalling server’, I now have a responsibility to take this offer information (by copying it out of the text block) along with the list of ICE candidates over to a copy of this same app on another machine.

I can do this via a networked clipboard or a file share or similar.

On that machine, I paste the offer as a ‘Remote Description’ as below;

and when I click the button, the app creates an answer for the offer;

and I can go back to the original machine and paste this as the answer to the offer;

and then I just need to swap over the details of the ICE candidates and I’ve got buttons to write/read these from a file;

and I create that file, copy it to the second machine and then use the ‘Add Remotes from File…’ button on that machine to add those remote candidates.

Now, I kind of expected to have to copy the ICE candidates in both directions but I find that once I have copied it from one app to the other, things seem to get figured out and, sure enough, here’s my hand waving at me from my other device;

and I’m getting both audio and video over that connection.

Now, clearly, manually moving these bits around over a network isn’t likely to be a realistic solution and I suspect that I still have a lot to learn here around the basics but I found it helpful as a way of exploring some of what was going on.

What’s surprising is how little code there is in my sample.

What Does the Code Look Like?

The ‘UI’ that I made here is largely just Buttons, TextBlocks, TextBoxes and a single MediaElement and I set the RealTimePlayback property on the MediaElement to True.

A lot of my code is then just property getters/setters and some callback functions for the UI but the main pieces of code end up looking something like this.

Initialisation

To get things going, I make use of a Media instance and an RTCPeerConnection instance and the code runs as below;

    // I find that if I don't do this before Initialize() then I crash.
    await WebRTC.RequestAccessForMediaCapture();

    WebRTC.Initialize(this.Dispatcher);

    RTCMediaStreamConstraints constraints = new RTCMediaStreamConstraints()
    {
        audioEnabled = true,
        videoEnabled = true
    };

    this.peerConnection = new RTCPeerConnection(
        new RTCConfiguration()
        {
            // Hard-coding these for now...
            BundlePolicy = RTCBundlePolicy.Balanced,

            // I got this wrong for a long time. Because I am not using ICE servers
            // I thought this should be 'NONE' but it shouldn't. Even though I am
            // not going to add any ICE servers, I still need ICE in order to
            // get candidates for how the 2 ends should talk to each other.
            // Lesson learned, took a few hours to realise it 🙂
            IceTransportPolicy = RTCIceTransportPolicy.All
        }
    );

    this.media = Media.CreateMedia();
    this.userMedia = await media.GetUserMedia(constraints);

    this.peerConnection.AddStream(this.userMedia);
    this.peerConnection.OnAddStream += OnRemoteStreamAdded;
    this.peerConnection.OnIceCandidate += OnIceCandidate;

and so this is pretty simple – I use the Media.CreateMedia() function and then call GetUserMedia telling it that I want audio+video. I then create a RTCPeerConnection and I use AddStream to add my one stream and I handle a couple of events.

Creating an Offer

Once initialised, creating an offer is really simple. The code’s as below;

    // Create the offer.
    var description = await this.peerConnection.CreateOffer();

    // We filter some pieces out of the SDP based on what I think
    // aren't supported Codecs. I largely took it from the original sample
    // when things didn't work for me without it.
    var filteredDescriptionSdp = FilterToSupportedCodecs(description.Sdp);
    description.Sdp = filteredDescriptionSdp;

    // Set that filtered offer description as our local description.
    await this.peerConnection.SetLocalDescription(description);

    // Put it on the UI so someone can copy it.
    this.LocalOfferSdp = description.Sdp;

and so it’s very much like the JavaScript examples out there. The only thing I’d add is that I’m filtering out some of the codecs because I saw the original sample do this too.
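The filtering function itself isn’t listed here, so as a rough illustration of what ‘filtering a codec out of SDP’ involves, here is a standalone sketch. It is not the sample’s FilterToSupportedCodecs: SdpCodecFilter and RemoveCodec are hypothetical names, and the sketch assumes that payload type numbers are unique across the whole description, which real SDP does not guarantee between media sections.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SdpCodecFilter
{
    // Attribute lines that start with a payload type number.
    private static readonly string[] PayloadAttributes =
        { "a=rtpmap:", "a=fmtp:", "a=rtcp-fb:" };

    // Removes every payload type whose rtpmap entry names the given codec.
    public static string RemoveCodec(string sdp, string codecName)
    {
        string[] lines = sdp.Split(
            new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);

        // Pass 1: collect payload types, e.g. "a=rtpmap:103 ISAC/16000" gives "103".
        var removed = new HashSet<string>();
        foreach (string line in lines)
        {
            if (!line.StartsWith("a=rtpmap:", StringComparison.Ordinal)) continue;
            string[] parts = line.Substring("a=rtpmap:".Length).Split(new[] { ' ' }, 2);
            if (parts.Length == 2 &&
                parts[1].StartsWith(codecName + "/", StringComparison.OrdinalIgnoreCase))
            {
                removed.Add(parts[0]);
            }
        }

        // Pass 2: drop those payload types from each m= line and drop
        // the attribute lines that describe them.
        var kept = new List<string>();
        foreach (string line in lines)
        {
            if (line.StartsWith("m=", StringComparison.Ordinal))
            {
                // Tokens 0..2 are "m=<media>", port and protocol; the rest are payload types.
                string[] tokens = line.Split(' ');
                IEnumerable<string> formats = tokens.Skip(3).Where(t => !removed.Contains(t));
                kept.Add(string.Join(" ", tokens.Take(3).Concat(formats)));
            }
            else if (!RefersTo(line, removed))
            {
                kept.Add(line);
            }
        }
        return string.Join("\r\n", kept) + "\r\n";
    }

    private static bool RefersTo(string line, HashSet<string> payloadTypes)
    {
        foreach (string prefix in PayloadAttributes)
        {
            if (line.StartsWith(prefix, StringComparison.Ordinal))
            {
                string payloadType = line.Substring(prefix.Length).Split(' ')[0];
                return payloadTypes.Contains(payloadType);
            }
        }
        return false;
    }
}
```

Given an audio section that lists payload types 111, 103 and 0 where 103 maps to ISAC, calling RemoveCodec(sdp, "ISAC") leaves an m= line listing only 111 and 0 and drops the rtpmap line for 103. The point is that a codec has to disappear from both the m= line and its attribute lines, otherwise the description becomes inconsistent.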

Accepting an Offer

When a remote offer is pasted into the UI and the button pressed, it’s imported/accepted by code;

    // Take the description from the UI and set it as our Remote Description
    // of type 'offer'
    await this.SetSessionDescription(RTCSdpType.Offer, this.RemoteDescriptionSdp);

    // And create our answer
    var answer = await this.peerConnection.CreateAnswer();

    // And set that as our local description
    await this.peerConnection.SetLocalDescription(answer);

    // And put it back into the UI
    this.LocalAnswerSdp = answer.Sdp;

and so again there’s little code here beyond the flow.

Accepting an Answer

There’s very little code involved in taking an ‘answer’ from the screen and dealing with it – it’s essentially a call to RTCPeerConnection.SetRemoteDescription with the SDP of the answer and a type set to Answer and so I won’t list out code for that here.

Dealing with ICE Candidates

I spent quite some time with these APIs assuming incorrectly that if I wasn’t using some ICE server on the internet then I didn’t need to think about ICE at all here but it seems to turn out that ICE is the mechanism via which all potential means for communication between the two endpoints are worked out and so not handling the ICE candidates meant that I never got any communication.

I handle the ICE candidates very simply here. There’s an event on the RTCPeerConnection which fires when it comes up with an ICE candidate and all I do is handle that event and put the details into a string in the UI, one candidate per line, with some separators which (hopefully) don’t naturally show up in the strings that I’m using them to separate;

    void OnIceCandidate(RTCPeerConnectionIceEvent args)
    {
        // One candidate per line in the form candidate|sdpMid|sdpMLineIndex
        this.IceCandidates +=
            $"{args.Candidate.Candidate}|{args.Candidate.SdpMid}|{args.Candidate.SdpMLineIndex}\n";
    }

and I have some code which writes this string to a file when the UI asks it to and some more code which reads back from a file and adds the remote candidates. That code’s not very interesting so here’s the relevant piece in taking the text lines from the file it’s just read and reconstructing them into instances of RTCIceCandidate before adding them to the RTCPeerConnection;

    foreach (var line in lines)
    {
        var pieces = line.Split('|');

        if (pieces.Length == 3)
        {
            RTCIceCandidate candidate = new RTCIceCandidate(
                pieces[0], pieces[1], ushort.Parse(pieces[2]));

            await this.peerConnection.AddIceCandidate(candidate);
        }
    }
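Since this text format gets copied around by hand, a slightly more defensive round trip is easy to sketch in isolation. The code below does not touch the SDK: CandidateLine is a hypothetical plain class standing in for RTCIceCandidate and CandidateCodec is an invented helper. It uses the same pipe-separated layout but skips any line that doesn’t parse rather than throwing.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical plain data holder standing in for the SDK's RTCIceCandidate.
public sealed class CandidateLine
{
    public string Candidate { get; set; }
    public string SdpMid { get; set; }
    public ushort SdpMLineIndex { get; set; }
}

public static class CandidateCodec
{
    // Same layout as the handler above: candidate|sdpMid|sdpMLineIndex
    public static string Format(CandidateLine c) =>
        $"{c.Candidate}|{c.SdpMid}|{c.SdpMLineIndex}";

    // Returns false (rather than throwing) when the line is malformed.
    public static bool TryParse(string line, out CandidateLine result)
    {
        result = null;
        string[] pieces = (line ?? string.Empty).Split('|');
        ushort index;

        if (pieces.Length != 3 ||
            !ushort.TryParse(pieces[2], NumberStyles.None,
                CultureInfo.InvariantCulture, out index))
        {
            return false;
        }
        result = new CandidateLine
        {
            Candidate = pieces[0],
            SdpMid = pieces[1],
            SdpMLineIndex = index
        };
        return true;
    }

    // Parses a whole block of text, one candidate per line, skipping bad lines.
    public static List<CandidateLine> ParseAll(string text)
    {
        var results = new List<CandidateLine>();
        string[] lines = (text ?? string.Empty).Split(
            new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string line in lines)
        {
            CandidateLine parsed;
            if (TryParse(line, out parsed))
            {
                results.Add(parsed);
            }
        }
        return results;
    }
}
```

Each parsed CandidateLine would then be turned into a real candidate and added to the peer connection as in the loop above. A stray blank line or a half-copied candidate gets dropped quietly here, which may or may not be what you want when debugging a connection that refuses to come up.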

Handling New Media Streams

Last but by no means least here is the act of handling a media stream when it ‘arrives’ from the remote peer.

I think that if I poked into the underlying APIs there’s some mechanism for getting hold of the raw streams here but it seems that the SDK has done some heavy lifting to at least make this very easy in the simple case in that there’s a method which looks to do the work of pairing up a media stream from webRTC with a MediaElement so that it takes it as a source.

So, the handler for the RTCPeerConnection.OnAddStream event just becomes;

    void OnRemoteStreamAdded(MediaStreamEvent args)
    {
        if (this.mediaElement.Source == null)
        {
            // Get the first video track that's present if any
            var firstTrack = args?.Stream?.GetVideoTracks().FirstOrDefault();

            if (firstTrack != null)
            {
                // Link it up with the MediaElement that we have in the UI.
                this.media.AddVideoTrackMediaElementPair(firstTrack, this.mediaElement, "a label");
            }
        }
    }

and I never imagined that would be quite as simple as it seems it could be.

Wrapping Up & Next Steps

As I said at the start of the post, these are just some rough notes as I try and figure my way around webRTC and the UWP webRTC SDK that’s out there on github.

It’s very possible that I’ve messed things up in the text above so feel free to tell me as I’m quite new to webRTC and this UWP SDK.

Writing this up though has been a useful exercise for me as I feel that I’ve got a handle on at least how to put together a ‘hello world’ demo with this SDK and I can perhaps now move on to look at some other topics with it.

The code for what I put together here is on github – keep in mind that the UI is ‘not so easy’ to use but if you follow the screenshots above then you can probably make it work if you have the motivation to do all the copying/pasting of information back/forth between different applications.

In terms of next steps, there are some things I’d like to try;

As far as I know, this code would only work if the two apps communicating were able to directly communicate with each other but I don’t think it’s more than a line of code or two in order to enable them to connect over different networks including the internet and I want to try that out soon.

Naturally, I need to reinstate a signalling service here and perhaps I can do some work to come up with a signalling service abstraction which can then be implemented using whatever technology might suit.

I’d like to try and get one end of this communication working on a non-PC device and, particularly, a HoloLens.

But those are all for another post, this one is long enough already.