The whole idea behind it is that when you're writing code, people normally think that you write code, then you build it, and you ship something, and what you ship is what matters, right? Source code is just a way to get there. What we realized is that it's actually a huge and very, very deep source of data. When you have a Git repository, you can see everything that has happened there from the beginning of time until now. You can analyze trends; you can see so much stuff in there.

So what we did is create this engine product that provides a SQL interface, so you can find things in the repository. You can do things like "Find commit messages with this text", or whatever, but you can go even deeper than that: "I wanna see the content of the file, I wanna parse it, I wanna extract the function names, I wanna extract the strings", or whatever. There's a bunch of different projects that make this possible, and every single one of those projects is completely open source. I've created a product called The Engine, which puts all those together in a nice, easy-to-use way; it's a little binary, you get started and everything just works. That is what we call code as data - seeing source code as a source of data.
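To make the "SQL interface over a repository" idea concrete, here is a minimal sketch using Python's built-in sqlite3 with a hand-made `commits` table. The schema and column names here are purely illustrative assumptions, not the engine's actual tables; the point is just what a "find commit messages with this text" query looks like.

```python
import sqlite3

# Illustrative in-memory stand-in for commit data pulled from a Git repo.
# The real engine's schema may differ; this is just the shape of the idea.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commits (hash TEXT, author TEXT, message TEXT)")
conn.executemany(
    "INSERT INTO commits VALUES (?, ?, ?)",
    [
        ("a1b2c3", "alice", "Fix off-by-one error in parser"),
        ("d4e5f6", "bob", "Add SQL interface"),
        ("0a1b2c", "alice", "Refactor error handling"),
    ],
)

# "Find commit messages with this text" as a plain SQL query.
rows = conn.execute(
    "SELECT hash, author FROM commits WHERE message LIKE '%error%'"
).fetchall()
print(rows)
```

The same style of query generalizes to files, trees, and parsed file contents once those are exposed as tables.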

The other part is ML on code. ML on code is the part that I've been talking about, because it's super-exciting. The whole idea is learning stuff from source code. One of the things you can learn, for instance, is to predict a token in a program, given the rest. Say I give you a Go program and I've removed one variable name from somewhere, and you need to predict it. You train a neural network to do this, and eventually it will be able to do it quite accurately.
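A toy sketch of that prediction task: instead of a neural network (which is what the speaker describes), this uses a simple bigram frequency model over whitespace-split tokens, which is an assumption made purely to keep the example small and runnable. The Go-flavored corpus and tokenization are likewise invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus_tokens):
    """Count which token follows which in a token stream."""
    following = defaultdict(Counter)
    for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
        following[prev][nxt] += 1
    return following

def predict(following, prev_token):
    """Predict the most likely token to come after prev_token."""
    counts = following.get(prev_token)
    return counts.most_common(1)[0][0] if counts else None

# Toy "Go-like" training corpus, tokenized naively on whitespace.
corpus = "if err != nil { return err } if err != nil { return err }".split()
model = train_bigrams(corpus)

# A program with one token removed: "if err != nil { return ??? }"
print(predict(model, "return"))  # -> "err"
```

A real model conditions on much richer context than one preceding token, but the task is the same: fill in the blank.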

Now, what we try to do is not predict the missing pieces of a program, because in general programs do not have missing pieces. But if what we predict and what you wrote are very different - and, even more than that, we know that what you wrote is unpredictable - then we can tell you that there's probably a bug.

It's a slightly complicated way of doing it, but what this detects is copy/paste errors. You copy a section of code, paste it somewhere else and modify a bunch of things, but you end up checking for the previous error instead of the one you wanted, or something like that... That happens all the time. I know it happens to me all the time. With this, you're actually able to detect it directly.
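Continuing the toy bigram sketch from before (again an assumed simplification, not the actual system): the idea of "flag what the model finds unpredictable" can be shown by scoring each written token against the model and reporting the ones with very low probability in context, such as a leftover variable from a copy/paste.

```python
from collections import Counter, defaultdict

def train_bigrams(tokens):
    """Count which token follows which in a token stream."""
    following = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1
    return following

def flag_suspicious(following, tokens, threshold=0.1):
    """Flag tokens the model considers very unlikely in their context."""
    flags = []
    for prev, actual in zip(tokens, tokens[1:]):
        counts = following.get(prev)
        if not counts:
            continue  # unseen context: nothing to compare against
        prob = counts[actual] / sum(counts.values())
        if prob < threshold:
            flags.append((prev, actual))
    return flags

# Train on idiomatic error checks...
corpus = ("if err != nil { return err } " * 5).split()
model = train_bigrams(corpus)

# ...then scan a pasted snippet where the wrong variable survived the edit.
snippet = "if err != nil { return oldErr }".split()
print(flag_suspicious(model, snippet))  # -> [('return', 'oldErr')]
```

The `threshold` here is an arbitrary illustrative cutoff; the real question of when "unpredictable" means "buggy" rather than "unusual but fine" is exactly what makes this hard in practice.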

Building something that would use static analysis for that is possible, but it's really hard, because static analysis deals with syntax, and grammar, and stuff like that, but not really with the semantics of the program. I like this idea: when you're writing code, there are two things - what you say, and what you mean. When those two things differ, that's when you have a bug; that's when you say "Oh, actually that's not what I meant. Sorry" and you need to fix it. What we're trying to do is apply machine learning to see what you meant, compare it to what you said, and see whether we can find bugs in there. That's super-interesting, super-powerful, and we're doing a lot in that area, but that is more like the future.

Currently, the cool thing that we’ve just released yesterday was an analysis of the Kubernetes codebase.