Hi, my name is Kane, and for the last couple of weeks I’ve been working on a random video chat in raw JavaScript. The result can be seen at mitchat.com, and in this post I’m going to describe my experience and the technology used to build it.

I apologize in advance if my English is incorrect at times — it’s not my native language.

Server

I’m a long-time Linode customer, so that’s what I’m using to host MitChat: a basic VPS for $20/month running Ubuntu 12.04. Thanks to Linode’s awesome 2 TB of outbound traffic, I won’t have to think about changing servers for a while.

Stack

Laravel 4.1 (a PHP framework) sits behind nginx on port 443 (SSL). Alongside it, Node.js runs on the same server and the same domain, but on port 444. I chose Node because it’s JavaScript, it’s fast and, more importantly, it has a great WebSocket module (socket.io), which is the main pipeline for our video stream. More details on that below.

For version control I’m using Git, and for deployment to the production server I’m using Python’s awesome Fabric library.

Here’s a picture of the structure and its workflow:

Back end stack and basic workflow

OK, that was the back end. Now to the front end.

This one is pretty slim: require.js handles structure and module loading. On top of that we’ve got jQuery, Backbone, Underscore, glfx.js, and some private modules to handle dates, modal windows and a few other things.

Now to the meat.

PHP back end

The back end is very thin and there’s really not much to explain. The index page is generated by the Laravel PHP framework, which lets me handle assets and environment values. It also generates unique user IDs (and will enable Twitter sign-up in the future). And that’s all PHP does.

Node.js back end

This one is also a very thin layer. Node handles WebSocket connections to users, accepts text messages and video frames, forwards them to the correct users, and handles user events such as disconnects and requests to find a person to talk to. Nothing more. We want to keep the server as light and as fast as possible so that it can deal with a constant stream of video data.
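To give an idea of how little the Node layer has to do, here is a minimal sketch of the pairing and forwarding logic. This is not MitChat’s actual code: the class name ChatRelay, its methods and the event names (matched, partner-left) are all made up for illustration, and the transport is abstracted into a plain send function so the example runs without socket.io.

```javascript
// Hypothetical in-memory matchmaker and relay (all names are illustrative).
class ChatRelay {
  constructor() {
    this.waiting = null;       // id of the single user waiting for a partner
    this.partners = new Map(); // userId -> partnerId
    this.senders = new Map();  // userId -> function (event, payload)
  }

  // Register a connection; `send` delivers one event to that user.
  connect(userId, send) {
    this.senders.set(userId, send);
  }

  // Handle a request to find a person to talk to.
  findPartner(userId) {
    if (this.partners.has(userId) || this.waiting === userId) return;
    if (this.waiting === null) {
      this.waiting = userId; // nobody available yet, so wait
      return;
    }
    const other = this.waiting;
    this.waiting = null;
    this.partners.set(userId, other);
    this.partners.set(other, userId);
    this.senders.get(userId)('matched', other);
    this.senders.get(other)('matched', userId);
  }

  // Forward a text message or a video frame to the sender's partner.
  relay(userId, event, payload) {
    const partnerId = this.partners.get(userId);
    if (partnerId === undefined) return false; // not in a chat
    this.senders.get(partnerId)(event, payload);
    return true;
  }

  // Clean up on disconnect and tell the partner that the chat is over.
  disconnect(userId) {
    if (this.waiting === userId) this.waiting = null;
    const partnerId = this.partners.get(userId);
    if (partnerId !== undefined) {
      this.partners.delete(userId);
      this.partners.delete(partnerId);
      this.senders.get(partnerId)('partner-left', userId);
    }
    this.senders.delete(userId);
  }
}
```

With socket.io, the send function for each user would simply wrap socket.emit(event, payload), and relay() would be called from the handlers for incoming text and frame events. The server never inspects a frame, which is what keeps it light.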

JS front end

Here comes the largest part of the app. Backbone provides the structure for MitChat. I’ve been using Backbone for over a year and feel very comfortable with it; besides, Ember.js or Angular.js would be overkill for a simple app, and MitChat is certainly not a large project. Backbone handles just two views and a handful of events. One view is Global, which deals with some general stuff, and the other is ChatView, where most of the logic is concentrated.

Let’s look at what ChatView does:

Initialize WebSocket connection

Detect the browser (via Bowser) and check whether it supports the WebP image format

Handle WebSocket events — text feed, video feed, administrative feed

DOM Events:

— Start new chat

— Close current chat

— Input field typing (emit socket message to the Stranger telling him that you’re typing)

Initialize Media module

Media module events:

— Redraw event that deals with video feed
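The WebP check from the list above can be approximated with a well-known canvas trick. This is a sketch under assumptions, not necessarily the check MitChat performs: the function name canEncodeWebp is made up, and it only tests whether the browser can encode WebP (the receiving side also has to be able to decode it).

```javascript
// Returns true when the given canvas can encode WebP.
// A browser that cannot encode the requested type falls back to PNG,
// so the prefix of the resulting data URL tells us what happened.
function canEncodeWebp(canvas) {
  const dataUrl = canvas.toDataURL('image/webp');
  return dataUrl.indexOf('data:image/webp') === 0;
}
```

In the browser this would be called as canEncodeWebp(document.createElement('canvas')), and the result could be sent to the server along with the request to find a partner.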

I’m not going to talk about DOM events or WebSocket events, since those things are pretty trivial. I think the Media module is much more interesting.

Media

This is the key element of MitChat. First, we capture webcam video via an awesome WebRTC function called getUserMedia(). It gives us a video stream from your web camera, which we attach to a <video> element (one that is not in the DOM). After that, the video data from the <video> is drawn onto a normal canvas at 320x240px (cropped from the center if needed). This has to be done because getUserMedia() won’t necessarily return a stream in the exact size you ask for; you’ll get something close to it. Chrome might return one size while Firefox returns another. We can’t have that, because the resulting feed would be distorted. This is why we pass video frames through the canvas.
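The center crop boils down to a bit of arithmetic on the source rectangle passed to drawImage(). Here is a minimal sketch; the helper names centerCropRect and drawVideoFrame are hypothetical and not taken from MitChat’s code.

```javascript
// Compute the largest centered region of the source that has the same
// aspect ratio as the destination.
function centerCropRect(srcW, srcH, dstW, dstH) {
  const srcAspect = srcW / srcH;
  const dstAspect = dstW / dstH;
  let sw = srcW;
  let sh = srcH;
  if (srcAspect > dstAspect) {
    // Source is too wide: trim the left and right edges.
    sw = Math.round(srcH * dstAspect);
  } else if (srcAspect < dstAspect) {
    // Source is too tall: trim the top and bottom edges.
    sh = Math.round(srcW / dstAspect);
  }
  return {
    sx: Math.floor((srcW - sw) / 2),
    sy: Math.floor((srcH - sh) / 2),
    sw: sw,
    sh: sh,
  };
}

// Draw the current video frame onto a 2D canvas context, cropped and scaled.
function drawVideoFrame(ctx, video, dstW, dstH) {
  const r = centerCropRect(video.videoWidth, video.videoHeight, dstW, dstH);
  ctx.drawImage(video, r.sx, r.sy, r.sw, r.sh, 0, 0, dstW, dstH);
}
```

For example, a 1280x720 stream would be cropped to its central 960x720 region and then scaled down to 320x240, while a 640x480 stream would only be scaled.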

Now that we have a proper 320x240px image, we need to deal with blur. For this we use a WebGL shader via glfx.js, an awesome library written by Evan Wallace. Things are not so simple here either. Since glfx.js uses WebGL, we can’t apply filters directly to our first canvas. Instead, we have to use another canvas with an ‘experimental-webgl’ context. This “effects” canvas accepts a texture from the normal canvas, and then glfx.js applies its triangle blur shader.
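The two-canvas setup can be sketched with the documented glfx.js calls (fx.canvas(), texture(), loadContentsOf(), draw(), triangleBlur(), update()). The wrapper name createBlurPipeline and the blur radius are my assumptions for illustration; the glfx.js global fx is passed in as an argument so the function does not depend on a browser global.

```javascript
// Hypothetical wrapper around glfx.js. `fx` is the global exported by glfx.js,
// `sourceCanvas` is the normal 320x240 canvas, `radius` is the blur radius in
// pixels (the actual value MitChat uses is not known).
function createBlurPipeline(fx, sourceCanvas, radius) {
  // fx.canvas() creates the WebGL "effects" canvas and throws when WebGL
  // is not available, so callers should wrap this in try/catch.
  const effects = fx.canvas();
  const texture = effects.texture(sourceCanvas);
  return {
    canvas: effects,
    render: function () {
      // Upload the latest 2D frame into the texture, then blur it.
      texture.loadContentsOf(sourceCanvas);
      effects.draw(texture).triangleBlur(radius).update();
      return effects;
    },
  };
}
```

Calling render() once per frame leaves the blurred image on the effects canvas, which can then be encoded and sent.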

Media workflow

Only now do we have our blurred video feed at a blazing 10 fps. Every frame triggers the “redraw” event mentioned in the ChatView part.
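Capping the feed at 10 fps can be done with a small time-based throttle. This is only a sketch of one way to do it; the name createFrameThrottle is hypothetical and the real Media module may work differently.

```javascript
// Returns a function that answers "is it time to draw another frame?"
// for a given timestamp in milliseconds.
function createFrameThrottle(fps) {
  const interval = 1000 / fps; // minimum gap between frames, in ms
  let last = -Infinity;        // timestamp of the last accepted frame
  return function shouldDraw(now) {
    if (now - last < interval) return false;
    last = now;
    return true;
  };
}
```

In the browser, shouldDraw() would be called with the timestamp from each requestAnimationFrame tick, and the Media module would trigger its “redraw” event (for example via Backbone’s trigger()) only when it returns true.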

You might ask why MitChat uses WebGL instead of blurring the canvas in plain JS. The answer is speed: JS-generated blur is slow and CPU-intensive, while WebGL runs on the GPU and is very fast and efficient. Unfortunately, this limits the user base: you won’t be able to use MitChat on a device or browser that doesn’t support WebGL.

So, we’ve got our feed and we’re firing redraw events. Each event has a callback that receives a video frame. The frame is then sent via WebSocket to the server, which passes it on to the Stranger you’re talking to. On the Stranger’s side, the browser receives your frames, loads each one into an Image (to decode the base64 data) and then draws it onto the <canvas> that serves as the background for the chat.
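The receiving side can be sketched in a few lines. The function name createFrameReceiver is an assumption, and the image factory is passed in so that the example does not depend on browser globals; in the browser it would be function () { return new Image(); }.

```javascript
// Hypothetical receiver: turns incoming data URLs into drawn frames.
// `ctx` is the 2D context of the background canvas.
function createFrameReceiver(ctx, createImage) {
  return function onFrame(dataUrl) {
    const img = createImage();
    img.onload = function () {
      // The image is decoded at this point, so it can be painted.
      ctx.drawImage(img, 0, 0);
    };
    // Assigning a data URL makes the browser decode the base64 payload.
    img.src = dataUrl;
  };
}
```

The returned onFrame function would be registered as the handler for the incoming video feed event on the socket. Note that this simple version creates a new Image per frame and does not guard against frames finishing decoding out of order.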

FPS

While we’re on the subject of frames, let’s talk about fps. Since we’re not using actual data streams to send compressed video like dedicated software would (Skype, Flash-based apps), we have to watch our traffic: frames have to be sent and received in order and without gaps, so that there are no jumps or lags. That’s why frames have to be as lightweight as possible. For that, we’ve got two formats: JPEG and WebP. At the moment WebP is only supported in a few browsers (Chrome and Opera, but not Firefox or Safari), so it is used only if both users’ browsers support it. Otherwise MitChat falls back to JPEG with fluctuating quality and a lower fps. Lower, because of the size of the compressed frames: a JPEG frame weighs about 3-4 times as much as a WebP frame.
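The format choice and the traffic arithmetic can be sketched like this. Only the 10 fps figure and the rough 3-4x size ratio come from this post; the function names, the JPEG frame rate of 5 fps and the 5 KB frame size used below are assumptions for illustration.

```javascript
// Pick the frame format for a chat: WebP only when both sides support it.
function chooseFrameSettings(localWebp, remoteWebp) {
  if (localWebp && remoteWebp) {
    return { mime: 'image/webp', fps: 10 };
  }
  // JPEG frames are heavier, so send fewer of them (5 fps is a made-up value).
  return { mime: 'image/jpeg', fps: 5 };
}

// Rough outbound traffic for one user, in kilobytes per second.
function estimateKBps(frameBytes, fps) {
  return (frameBytes * fps) / 1024;
}
```

With a hypothetical 5 KB WebP frame, estimateKBps(5 * 1024, 10) gives 50 KB/s. A JPEG frame 3.5 times heavier would need 175 KB/s at the same frame rate, which is why the fallback has to drop frames, lower the quality, or both.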

It seems like we’re done here. Obviously, this is a very rough and general description of how MitChat works, but I’m assuming that if you’re interested in this stuff, you’ll be able to fill in the gaps yourself.