[TYPES] Two-tier reviewing process

By popular request, the following is a more detailed description of the reviewing process that Dave Evans and I used for IEEE Security and Privacy 2009 (informally, Oakland).

The reviewing process used by Oakland 2009 was adapted from a two-tier process used successfully by a few conferences in previous years. It was pioneered by Tom Anderson for SIGCOMM 2006, and used subsequently by SOSP 2007 and OSDI 2008. Unlike most conference review processes, we had a two-tier PC and three rounds of reviewing. I believe that this structure helped us make more informed decisions, led to better discussions at the PC meeting, gave authors more feedback, and resulted in a better product overall.

We had 77 days to review 253 submissions. This may sound like a lot of time, but reviewing for Oakland stretches across Christmas and other winter holidays.

The PC of 50 people was divided 25/25 into 'heavy' and 'light' groups. Despite the names, these PC members did similar amounts of work. The heavy members did a few more reviews and attended the PC meeting; the light members participated in electronic discussion before the meeting. Dividing the PC in half meant that we had a smaller group at the PC meeting and more effective discussions than in previous years. A two-tier PC also helped us recruit some PC members who preferred not to travel. We did not distinguish between heavy and light members in any external documents such as the proceedings. I think this helped us recruit light members.

Reviewing proceeded in three rounds, shown pictorially at http://www.cs.cornell.edu/andru/oakland09/reviewing-slides.pdf.

We started round 1 with 249 credible papers. Each paper received one heavy and one light reviewer. Reviewers had 35 days to complete up to 12 reviews. Based on these initial reviews, the chairs rejected 36 papers and marked 33 papers as probable rejects.

In round 2, we had 180 papers considered fully live, each of which received an additional heavy and light review.
Papers considered probable rejects were assigned just one additional reviewer. Round 2 started just after Christmas, and reviewers had 20 days to complete up to 12 reviews. After round 2, we had 3-4 reviews per live paper. Papers whose reviews were all negative were rejected at this point, with some electronic discussion to make sure everyone involved agreed.

By round 3, we were down to 68 papers, most of which were pretty good. Each live paper now received one additional heavy review, ensuring that there were three reviewers present at the PC meeting for each discussed paper. Reviewers received up to five papers to review, in ten days. Based on these reviews and more electronic discussion, we rejected four more papers. All papers with some support at this point made it to the PC meeting. The chairs actively worked to resolve papers through electronic discussion, which was important in achieving closure.

The PC meeting was a day and a half long, and resulted in 26 of the 68 papers being chosen for the program. Each paper was assigned a lead reviewer ahead of time. The lead reviewer presented not only their own view, but also those of the light reviewers who were not present. Where possible, we chose lead reviewers who were positive and confident about their reviews. At some points, we had breakout sessions for small groups of reviewers to discuss papers in parallel. However, no paper was accepted without the whole PC hearing the reasons for acceptance. This seems important for a broad conference like Oakland (or POPL).

One benefit of multiple rounds of reviewing was that we could do a better job of assigning reviewers in later rounds, for three reasons: first, the reviews helped us understand what the key issues were; second, we asked reviewers explicitly for suggestions; third, we could identify the problematic papers whose reviews were all low-confidence and fill those holes.
We also asked external experts to help review papers where we didn't have enough expertise in-house.

In the end, all papers received between 2 and 8 reviews, and accepted papers received between 5 and 8 reviews. The multiround structure meant that reviewing effort was concentrated on the stronger papers, and authors of accepted papers got more feedback, and often more expert feedback, than they had in previous years.

The reviewing load increased slightly over previous years for heavy reviewers (~23 reviews), but decreased slightly for light reviewers (~20). Keeping the load mostly constant was possible because we had a larger PC than in the past. The two-tier structure meant that despite a larger PC, we could have a smaller PC meeting.

Filtering out weak papers early helped keep the reviewing load manageable. Papers were rejected after round 1 only when they had two confident, strongly negative reviews. The chairs did this in consultation with each other. Papers with very negative reviews but without high confidence, or with confident reviews that were not as negative, were considered probable rejects and assigned a third review in round 2. If that review was positive, the paper received three reviews in round 3 instead of the usual one, ensuring that it made it to the PC meeting (this only happened in a couple of cases). PC members did not report any concerns to us that good papers might have been filtered out early.

Assigning the right reviewers in round 1 makes both filtering and assignment of additional reviewers more effective. To be able to assign round-1 reviewers efficiently, it is important for the chairs to get as much information from the PC as possible about which papers they would like to review and on which topics they are expert.

A final issue we put thought into was the rating scale. While the rating scale might not seem that important, in past years the Oakland committee had found that a badly designed rating scale could cause problems.
The four-point Identify the Champion scale (A-D) used by many PL conferences works fine for single-round reviewing. But for multiple rounds with early filtering, it's helpful to distinguish the papers that are truly weak from the ones that merely don't make the grade. Therefore, ratings came from the following scale:

  1: Strong reject. Will argue strongly to reject.
  2: Reject. Will argue to reject. (Identify the Champion's D)
  3: Weak reject. Will not argue to reject. (C)
  4: Weak accept. Will not argue to accept. (B)
  5: Accept. Will argue to accept. (A)
  6: Strong accept. Will argue strongly to accept.

As in Identify the Champion, giving the ratings meaningful semantics helped ensure consistency across reviewers. Papers that received 1's and 2's were easy to filter out after round 1; we rejected papers with confident 1/1 or 1/2 ratings, and some 2/2's. Having the extreme ratings of 1 and 6 also seemed to give reviewers a little more excuse to use 2 and 5 as ratings, staking out stronger positions than they might have otherwise. The absence of a middle 'neutral' point usefully forced reviewers to lean one way or the other.

Overall, this reviewing process probably involved somewhat more total work for the chairs than a conventional reviewing process, but it was also spread out more over the reviewing period. Problems could be identified and addressed much earlier. Total work for PC members was comparable to a conventional process. Some PC members appreciated that the multiple intermediate deadlines prevented a last-minute rush to get reviews done, and that the average quality of reviewed papers was higher.

Hope this helps,

-- Andrew