Code boffins at Rice University in Texas have developed a system called Bayou to partially automate the writing of Java code with the help of deep-learning algorithms and training data sampled from GitHub.

Much of modern programming is already automated in one way or another. Anyone including a code library or copying and pasting from Stack Overflow is essentially replaying stored keystrokes. Integrated development environments and text editors often include code completion, akin to the text autocompletion in messaging apps. Then there are low-code and no-code applications that translate basic intentions into specific programming instructions.

Bayou is a bit more ambitious. It fleshes out a skeleton Java program by generating API patterns or idioms, based on a programmer-supplied query consisting of API method names and variable types.

The project, available in an online demo, is described in a recently published paper, "Neural Sketch Learning for Conditional Program Generation," scheduled to be presented next month at the Sixth International Conference on Learning Representations, a deep learning conference being held in Canada.

Bayou's creators – Vijayaraghavan Murali, Letao Qi, Swarat Chaudhuri, and Chris Jermaine – describe the system as an assistive tool that allows a human programmer to specify a limited amount of information (a label) in order to produce a functioning program.

Complex

"I see Bayou as a smarter version of the kind of code-completion that's supported by IDEs," said Murali, a computer science researcher at Rice and one of the paper's co-authors, in an email to The Register.

"Bayou can generate more complex pieces of code, such as API calls, loops, and exception handling blocks, and it does this by learning common patterns from data. Our vision for Bayou would be to have it integrated within an IDE, running in the background suggesting snippets of code as the programmer is typing in their program."

What makes this approach noteworthy is that the label is not just a stub that gets replaced with a single correct answer. Rather, the system relies on a technique the researchers call "neural sketch learning," in conjunction with type-aware combinatorial search, to come up with possible answers.

Neural sketch learning is used to train a novel neural network called a Gaussian Encoder-Decoder on a data set of sample source code. It abstracts the source code into "tree-structured syntactic models," called sketches, which remove low-level names and operations but retain the code's control structure, the order in which API methods are invoked, and the types of data supplied and returned by these methods.
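To make the idea of a sketch more concrete, here is a rough illustration of what a tree-structured abstraction of a file-reading routine might look like. It is only a toy: the SketchNode class, its fields, and the "try"/"call"/"catch" node kinds are invented for this example and are not taken from Bayou's implementation. The point is that the tree keeps control structure, call order, and types, while variable names such as fr1 or br1 are gone.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified sketch node. All names here are illustrative
// assumptions, not Bayou's actual data structures.
final class SketchNode {
    final String kind;   // control structure or node type, e.g. "try", "call", "catch"
    final String label;  // API signature or exception type; never a variable name
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String kind, String label) {
        this.kind = kind;
        this.label = label;
    }

    // Appends a child and returns this node so calls can be chained.
    SketchNode add(SketchNode child) {
        children.add(child);
        return this;
    }

    // Renders the tree as an indented outline, one node per line.
    String render(int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        sb.append(kind);
        if (!label.isEmpty()) {
            sb.append(' ').append(label);
        }
        sb.append('\n');
        for (SketchNode child : children) {
            sb.append(child.render(depth + 1));
        }
        return sb.toString();
    }
}

final class SketchDemo {
    // Builds a sketch for "open a file, wrap it in a reader, read one line".
    static SketchNode readLineSketch() {
        return new SketchNode("try", "")
            .add(new SketchNode("call", "new FileReader(File)"))
            .add(new SketchNode("call", "new BufferedReader(FileReader)"))
            .add(new SketchNode("call", "BufferedReader.readLine() -> String"))
            .add(new SketchNode("catch", "FileNotFoundException"))
            .add(new SketchNode("catch", "IOException"));
    }

    public static void main(String[] args) {
        System.out.print(readLineSketch().render(0));
    }
}
```

Running SketchDemo prints the outline of the tree. Nothing in it mentions a concrete variable, which is what would let many differently written programs collapse to the same sketch.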

The neural network uses this information to match learned models to the supplied query and returns the best matching results.
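The "type-aware combinatorial search" half of the pipeline can be pictured with a small example. The sketch below is an assumption about the general idea, not the researchers' algorithm: given the parameter types a call needs and the variables in scope, it enumerates every assignment of variables to argument slots in which the types match. The class TypeAwareFill and its methods are hypothetical names, and types are compared as plain strings for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of type-directed enumeration. Names are hypothetical and
// types are plain strings; a real tool would use the compiler's type system.
final class TypeAwareFill {

    // Returns every in-scope variable whose declared type equals requiredType.
    static List<String> candidates(Map<String, String> scope, String requiredType) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, String> entry : scope.entrySet()) {
            if (entry.getValue().equals(requiredType)) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }

    // Enumerates all ways to fill the argument slots, one slot at a time.
    // A slot with no type-compatible variable yields no results at all.
    static List<List<String>> enumerate(Map<String, String> scope, List<String> paramTypes) {
        List<List<String>> results = new ArrayList<>();
        results.add(new ArrayList<>());
        for (String type : paramTypes) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> partial : results) {
                for (String variable : candidates(scope, type)) {
                    List<String> extended = new ArrayList<>(partial);
                    extended.add(variable);
                    next.add(extended);
                }
            }
            results = next;
        }
        return results;
    }

    public static void main(String[] args) {
        // Variable name -> declared type, in declaration order.
        Map<String, String> scope = new LinkedHashMap<>();
        scope.put("file", "File");
        scope.put("name", "String");
        scope.put("backup", "File");

        // Argument slots for the call: new FileReader(File)
        System.out.println(enumerate(scope, Arrays.asList("File")));
        // prints [[file], [backup]]
    }
}
```

Type information prunes the search sharply: the String variable is never considered for a File slot, so only the two plausible completions remain to be ranked.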

Here's an example. In this bit of Java code to read from a given file, Bayou takes the query – /// call:readLine – and fills in the calls to the appropriate API methods.

Bayou input:

    import java.io.File;

    public class Test {
        void read(File file) {
            {
                /// call:readLine
            }
        }
    }

Bayou output:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.IOException;
    import java.io.FileNotFoundException;
    import java.io.FileReader;

    public class TestIO {
        void read(File file) {
            {
                FileReader fr1;
                BufferedReader br1;
                String s1;
                try {
                    fr1 = new FileReader(file);
                    br1 = new BufferedReader(fr1);
                    s1 = br1.readLine();
                } catch (FileNotFoundException _e) {
                } catch (IOException _e) {
                }
                return;
            }
        }
    }

Murali sees room for further automation. "In the short term, we are working on supporting natural language queries in Bayou, and also providing an interactive user experience," he said. "In the longer term, we are interested in generating larger pieces of code such as a group of methods, or classes, after further research into this technology."

Bayou still has limitations. Presently, it can only handle a limited number of APIs: java.lang, java.io, and java.util. Also, it cannot manage wildcard types. And because the system is based on real-world code examples, it may miss obscure APIs not present in the training set.

"The advantage of using open-source projects in GitHub is that the patterns that Bayou learns from that data are the most common ones across a wide variety of programmers," Murali explained. "Having said that, we had to be meticulous with the quality of data that Bayou is trained on, as not all GitHub projects are of the same quality. We also had to be careful with forks and duplicates, as they would bias the patterns that Bayou ends up learning. An officially vetted corpus would mitigate such problems."
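The researchers did not say how they filtered forks and duplicates. One simple way to catch exact copies, offered purely as an illustrative assumption, is to fingerprint each source file after collapsing whitespace and keep only the first file seen with each fingerprint. The CorpusDedup class below is hypothetical; it would not catch near-duplicates with renamed variables or edited comments.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical corpus filter: drops files that are identical up to whitespace.
// This is an assumed approach for illustration, not the Bayou team's pipeline.
final class CorpusDedup {

    // Collapses runs of whitespace, then returns the SHA-256 digest as hex.
    static String fingerprint(String source) {
        String normalized = source.replaceAll("\\s+", " ").trim();
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(normalized.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Keeps the first occurrence of each distinct fingerprint, preserving order.
    static List<String> dedupe(List<String> sources) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String source : sources) {
            if (seen.add(fingerprint(source))) {
                kept.add(source);
            }
        }
        return kept;
    }
}
```

Passing in the contents of every file from a set of forked repositories would leave one copy of each unchanged file, so a heavily forked project no longer dominates the patterns learned from the corpus.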

Murali said he sees automation tools as a way to make programming available to a wider set of people.

"With further advances in this technology, such as the natural language-based interface that will soon be supported by Bayou, we envision programming to be made accessible to even non-programmers," he said.

The research was funded by a DARPA MUSE award and a Google Research Award. ®