Two years ago, I wrote a blog post about the ELE project. Shortly after, the Programming Research Laboratory in Prague (PRL-PRG) was established at the Faculty of Information Technology. Since then, PRL-PRG has matured into a strong European research laboratory in the field of programming languages, pursuing high-quality basic research with a number of exciting publications and projects from our multinational team.

Recently, the BigCode project was selected for funding as one of the best proposals in the OP VVV excellent research teams programme run by the Czech national science office. BigCode, with a budget of more than $3m, is a three-and-a-half-year project starting in September 2018. The BigCode team will closely collaborate with the ELE team and the PRL team in Boston.

The aim of BigCode is to automatically extract insights from large software code bases by a combination of program analysis and machine learning. This will enable us to address three challenge problems:

language ecosystem evolution,

predictive workload performance modelling and

synthesis of personalized programming hints.

Language Ecosystem Evolution

Can we evolve an entire language ecosystem? Evolution refers to adapting code to changes in the environment. We aim to automate the adaptation of user code, and of code embedded in discussion forums like StackOverflow, using knowledge extracted from programs already migrated to the new APIs. Validation will focus on the R language, which has 2 million users in fields ranging from microbiology to computational finance. We will add features such as gradual typing and multi-threading and automate migration of the whole R software stack.

Predictive Workload Performance Modelling

Can we predict the performance of the Internet without running it? We have arranged to gain access to the JavaScript code from all web pages archived by Google. We will extract a performance model from each page and cluster pages according to their model. This will allow companies, for example, to reduce their testing effort when releasing a new JavaScript implementation.
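
To make the clustering step a bit more concrete, here is a minimal sketch in Python. It assumes that each page's performance model has already been reduced to a small numeric feature vector; the two features used here (share of time spent in DOM manipulation and in numeric loops) and the page names are purely hypothetical, and plain k-means is just one possible clustering method, not necessarily the one the project will use.

```python
from math import dist

def kmeans(points, k, iterations=20):
    """Plain k-means. The first k points seed the centroids, so the
    result is deterministic (an illustrative choice, not a tuned one)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every page to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return labels, centroids

# Hypothetical per-page features: (share of DOM work, share of numeric loops).
pages = {
    "shop.example": (0.90, 0.10),
    "physics-demo.example": (0.10, 0.90),
    "news.example": (0.85, 0.15),
    "raytracer.example": (0.20, 0.80),
}
labels, _ = kmeans(list(pages.values()), k=2)
for name, label in zip(pages, labels):
    print(name, "-> cluster", label)
```

With clusters like these, a new JavaScript implementation could in principle be tested on a few representative pages per cluster rather than on every page.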

Synthesis of Personalized Programming Hints

Imagine a student working on a massive open online course (MOOC) with 100,000 other students solving the same problems. Can we provide programming hints that are tuned to one particular student’s knowledge based on answers from all the other students? The aim is to learn what students know, discover errors in their code, and propose fixes based on a massive code base of untested solutions.
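
As a toy illustration of proposing a fix from other students' answers, the sketch below picks the peer submission most similar to the student's code and shows the difference as a hint. The function names `nearest_peer` and `hint` are made up for this example, and text similarity from Python's standard `difflib` module is only a crude stand-in for the program analysis and learning the project has in mind. It also ignores the fact that peer solutions are untested.

```python
import difflib

def nearest_peer(student_code, peer_solutions):
    """Return the peer solution with the highest textual similarity."""
    return max(
        peer_solutions,
        key=lambda peer: difflib.SequenceMatcher(None, student_code, peer).ratio(),
    )

def hint(student_code, peer_solutions):
    """Return a unified diff from the student's code to the nearest peer."""
    peer = nearest_peer(student_code, peer_solutions)
    return list(difflib.unified_diff(
        student_code.splitlines(), peer.splitlines(),
        fromfile="student", tofile="peer", lineterm="",
    ))

student = "def add(a, b):\n    return a - b\n"
peers = [
    "def add(a, b):\n    return a + b\n",
    "def add(x, y):\n    total = x\n    total += y\n    return total\n",
]
print("\n".join(hint(student, peers)))
```

Choosing the closest submission, rather than an arbitrary correct one, keeps the hint close to what the student already wrote.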

Solving these “interdisciplinary” problems involves combining modern machine learning methods with traditional approaches to programming language analysis.

Machine Learning in BigCode

From the machine learning perspective, we can use large code repositories to train ML models, much as deep learning has been used for image and text processing and understanding. Understanding code is a bit more challenging than understanding natural language, because of the variety of possible structures, the diversity of programming styles and the expressiveness of programming languages. Compilers are also very sensitive to errors.

Learning to understand code and neural compilers have been around for some time.

Mining Source Code Repositories at Massive Scale Using Language Modeling. M. Allamanis, C. Sutton. MSR 2013.

You can refer to the work of Miltos Allamanis for some of the best papers in code understanding research. There is a lot of room for improving “deep code embeddings” and incorporating modern recurrent neural networks, convnets with attention and fancy concepts such as differentiable neural computers. Personalized code synthesis is also a new area of research, with the ambition of improving suggestions not only by context, but also by utilizing the programmer's coding history.
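
To give a flavour of what language modeling over code means, here is a deliberately tiny sketch: a bigram model that counts which token tends to follow which, and suggests the most frequent followers. It is a toy stand-in and not the models from the paper above or the neural approaches just mentioned; the tokenizer regex and the three-line corpus are assumptions made up for the example.

```python
import re
from collections import Counter, defaultdict

# Very rough tokenizer: identifiers, integers, or any single symbol.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def tokenize(code):
    return TOKEN.findall(code)

def train_bigrams(snippets):
    """Count, for every token, which tokens follow it in the corpus."""
    model = defaultdict(Counter)
    for snippet in snippets:
        tokens = tokenize(snippet)
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model

def suggest(model, prev_token, n=3):
    """Return up to n most frequent followers of prev_token."""
    followers = model.get(prev_token, Counter())
    return [token for token, _ in followers.most_common(n)]

corpus = [
    "for i in range(10):",
    "for x in items:",
    "for i in range(n):",
]
model = train_bigrams(corpus)
print(suggest(model, "for"))
print(suggest(model, "in"))
```

Training the same kind of model on a single programmer's own history instead of a shared corpus would be one simple way to make the suggestions personal.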

We are looking for both postdocs and junior researchers; experience with programming languages or advanced machine learning is an advantage.

Let us know if you are interested.