The number of AI-related research papers has skyrocketed in recent years, outpacing papers from all other academic topics since 2000. Not surprisingly, this has resulted in a shortage of qualified peer reviewers in the machine learning community, particularly for conference paper submissions. Conference organizers are attempting to expand the supply, but it can take years of academic study in the field of AI to qualify a person as a peer reviewer.

Virginia Tech Associate Professor Jia-Bin Huang serves as an Area Chair for prestigious AI conferences CVPR 2019 and ICCV 2019. To reduce the workload of peer reviewers such as himself, Huang recently published research on arXiv that uses deep learning techniques to predict whether a paper should be accepted based solely on its visual appearance. The model bases its determinations on features such as layout, the presence of detailed results tables, and how much of the allotted page space is filled.

Huang’s Deep Paper Gestalt presents a promising experimental result: the classifier safely rejected 50 percent of the bad papers it checked while wrongly rejecting only 0.4 percent of the good papers.
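A result of this kind depends on where the rejection cutoff is placed on the classifier's output score. The following is a minimal sketch of that trade-off, not the paper's procedure: the function names and the toy scores are hypothetical, and higher scores are assumed to mean "looks more like a good paper."

```python
def pick_rejection_threshold(good_scores, max_false_reject):
    """Return the largest cutoff that rejects at most `max_false_reject`
    (a fraction between 0 and 1) of the good papers.

    A paper is rejected when its score is strictly below the cutoff.
    """
    ordered = sorted(good_scores)
    # Number of good papers we are allowed to lose.
    allowed = int(max_false_reject * len(ordered))
    allowed = min(allowed, len(ordered) - 1)
    return ordered[allowed]


def rejection_rate(scores, threshold):
    """Fraction of papers whose score falls strictly below the cutoff."""
    return sum(s < threshold for s in scores) / len(scores)


# Toy scores, invented purely for illustration.
good = [0.9, 0.8, 0.7, 0.6, 0.95, 0.3, 0.2, 0.85]
bad = [0.1, 0.5, 0.55, 0.7]

cutoff = pick_rejection_threshold(good, max_false_reject=0.25)
print("cutoff:", cutoff)
print("good papers rejected:", rejection_rate(good, cutoff))
print("bad papers rejected:", rejection_rate(bad, cutoff))
```

With real classifier scores, passing `max_false_reject=0.004` would correspond to the 0.4 percent budget quoted above. The share of bad papers filtered out at that cutoff is then whatever the score distributions allow.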

Huang told Synced via email, “the idea of training a classifier to recognize good/bad papers has been around since 2010 [Paper Gestalt, a 2010 paper by Carven von Bearnensquash from the University of Phoenix]. Since early December, I thought it might be good to revisit the problem with more modern tools. The goal is to provide some insights on what a good paper looks like and how we can improve our own work.”

Huang first created a new dataset, Computer Vision Paper Gestalt (CVPG), comprising both positive examples (the accepted papers in six CVPR and three ICCV proceedings from 2013 to 2018) and negative examples such as workshop papers. He removed the headers atop the first page to protect against potential data leakage and to make the model focus on the visual contents of the body of the paper.

The next step was to train the classifier: “We used ResNet-18 (pre-trained on ImageNet) as our classification network, and replaced the ImageNet 1,000 class classification head with two output nodes (good or bad papers). Following the practice of transfer learning, we fine-tuned the ImageNet pre-trained network on the proposed CVPG dataset with stochastic gradient descent with a momentum of 0.9 for a total of 50 epochs.”

The trained model achieved 92 percent accuracy on the test dataset of CVPR 2018 conference/workshop papers. Huang notes that a funny thing happened when he tested his own paper: the model rejected it.

To discover visual appearance patterns specific to good papers, Huang used Generative Adversarial Networks (GANs) to forge “good papers,” which featured “illustrative figures upfront”, “colorful images”, and “a balanced layout of texts/math/tables/plots.” He also trained a GAN model to translate the visual appearance of bad papers into that of good papers. Its suggestions include “adding teaser figure upfront”, “making the figures more colorful”, and “filling up the last page.”

Deep Paper Gestalt stirred a heated discussion on social media. Many said they appreciated Huang’s efforts to address the serious paper-reviewer shortage problem. A record-high 4,854 papers were submitted to NeurIPS 2018. Although the conference organizers have expanded their supply of reviewers, AI researchers are concerned that unqualified reviewers might dismiss their work.

In a story covered by Synced earlier this year, a Reddit user who identified as a predoctoral student posted that they had been selected as a NIPS reviewer, and needed advice on how to properly write paper reviews.

Along with the positive responses to Huang’s paper, many also expressed doubt regarding the efficacy of Huang’s technique in actual practice. Machine Learning Researcher David Madras tweeted “we know that a paper’s content, and not just its appearance, is what should determine its acceptance. This begs the question — what are the signals this model is picking up on? Would we trust a model based on these signals to determine the future of our field?”

In his paper, Huang identified a number of the model’s limitations:

Because it ignores the actual paper contents, the classifier may wrongly reject papers with good materials but bad visual layout, or accept crappy papers with good layout;

The trained classifier cannot be applied to other conference papers with different formatting styles;

The bad-to-good paper generator can only produce one single output as a good paper;

The collected training samples can be very noisy.

Huang concludes, “while the model achieves decent classification performance, we believe that it is unlikely that the classifier will ever be used in an actual conference.” The researcher did, however, tell Synced that “the tools for visualizing the class-specific activation maps and bad-to-good paper translator would be helpful to inform junior authors to prepare their paper submissions.”

The paper Deep Paper Gestalt is on arXiv and GitHub.

Good paper generator: TensorFlow (https://github.com/tkarras/progressive_growing_of_gans)

Bad-to-good paper generator: PyTorch (https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix/)