BAIRD AND THOMPSON: READING CHESS 551

recognized and discarded. If one or more correct candidates are found for both ply, then they are recorded and the text is never interpreted again. When there is more than one candidate, they are sorted in decreasing order of the product of the match scores (i.e., classifier’s top choices first). These preparsed moves comprise about 98% of the total moves.
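The ordering of candidates can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each candidate is a list of (character, match score) pairs with scores in (0, 1], and the function names are hypothetical.

```python
from math import prod


def candidate_score(candidate):
    """Product of the per-character match scores of one candidate."""
    return prod(score for _, score in candidate)


def candidate_text(candidate):
    """The string spelled out by a candidate's characters."""
    return "".join(ch for ch, _ in candidate)


def rank_candidates(candidates):
    """Sort candidates so the highest product of match scores comes first."""
    return sorted(candidates, key=candidate_score, reverse=True)
```

With this ordering, a candidate built entirely from the classifier's top choices is tried before any candidate that substitutes a lower-ranked character.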

Reverse typesetting is the use of the bounding rectangle of characters to prune lists of matches. Both height and width of boxes are compared, but not height above baseline. It is never required when pruning lists from the classifier, but it is helpful when scanning for missed brackets or eliminating wild choices made by the move generator.
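A minimal sketch of this kind of pruning is given below. The table of nominal glyph sizes, the relative tolerance, and the function names are all assumptions made for illustration; the text specifies only that box width and height are compared.

```python
def box_fits(observed, expected, tol=0.2):
    """True if an observed (width, height) is within a relative tolerance
    of the expected (width, height). Height above baseline is not used."""
    (ow, oh), (ew, eh) = observed, expected
    return abs(ow - ew) <= tol * ew and abs(oh - eh) <= tol * eh


def prune_by_box(candidates, observed_boxes, glyph_sizes, tol=0.2):
    """Keep candidate strings whose characters all fit the observed boxes.

    glyph_sizes maps a character to its nominal (width, height); this
    table is hypothetical and would come from the typeface in practice.
    """
    kept = []
    for text in candidates:
        if len(text) != len(observed_boxes):
            continue  # wrong number of characters for this span of text
        if all(ch in glyph_sizes and box_fits(obs, glyph_sizes[ch], tol)
               for ch, obs in zip(text, observed_boxes)):
            kept.append(text)
    return kept
```

Characters with the same box size cannot be told apart this way, which is consistent with using the technique only to discard wild choices rather than to pick a winner.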

VII. CORRECTING MOVES USING SEMANTICS

Preparsed moves are interpreted in the current chess context and, if legal, the move is made and the match continues. If the preparsed move is illegal, the match is terminated. If the move is not preparsed, then we have reached an impasse, and all legal next moves from the current board position are generated and pruned using reverse typesetting. The potentially exponential runtime of this exhaustive step is managed by taking the product of the scores of the matched moves and pruning below a threshold. All matches above threshold are made and matching continues in a depth-first, best-match-first manner. The first longest match is taken as the game score.

It is conceivable that some game interpretation could be legal and still differ from the clear intention of the editor, but such forced interpretations have not been observed. An example of the results of semantic analysis is shown in Fig. 13.
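The search described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the move generator is a caller-supplied function, the pruning score is taken to be the running product of match scores along the current path, and `char_overlap` is a toy stand-in for the classifier's match score. For brevity the sketch routes every move through the generator, whereas the text invokes it only at an impasse.

```python
def char_overlap(move, observed):
    """Toy match score: fraction of positions where the characters agree."""
    if len(move) != len(observed):
        return 0.0
    return sum(a == b for a, b in zip(move, observed)) / len(move)


def longest_legal_reading(start, observations, legal_moves, apply_move,
                          match_score=char_overlap, threshold=0.05):
    """Depth-first, best-match-first search for the first longest legal reading."""
    best = []

    def search(state, i, path, score):
        nonlocal best
        if len(path) > len(best):      # strict '>' keeps the first longest match
            best = list(path)
        if i == len(observations):
            return True                # every observed move has been explained
        scored = [(match_score(m, observations[i]), m) for m in legal_moves(state)]
        # Prune branches whose running product of scores falls below threshold.
        scored = [(s, m) for s, m in scored if score * s >= threshold]
        scored.sort(key=lambda t: t[0], reverse=True)   # best match first
        for s, m in scored:
            path.append(m)
            done = search(apply_move(state, m), i + 1, path, score * s)
            path.pop()
            if done:
                return True
        return False

    search(start, 0, [], 1.0)
    return best


def demo():
    # Hypothetical position graph standing in for a chess move generator.
    graph = {
        "start": {"e4": "A", "d4": "B"},
        "A": {"e5": "C"},
        "B": {"d5": "D"},
        "C": {},
        "D": {"c4": "E"},
        "E": {},
    }
    observed = ["?4", "d5", "c4"]      # the first move is garbled
    return longest_legal_reading(
        "start", observed,
        legal_moves=lambda s: list(graph[s]),
        apply_move=lambda s, m: graph[s][m],
    )
```

In the toy example the garbled first move matches both openings equally well; the branch through "e4" dies after two moves, so backtracking settles on the only reading that explains all three observed moves.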

The text that represents the players and event is matched against (separate) dictionaries of names. A largest substring algorithm is used to find the closest name in the dictionary. The scores for the matches are recorded on an error file and examined by hand for additions to the dictionaries. The dictionaries are initially created from the index in the Informant. This is highly redundant between issues, with only about 5% new entries per volume after a two-volume startup.
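One way to realize such a dictionary lookup is sketched below. The text does not define the scoring, so this sketch assumes the score is simply the length of the longest contiguous substring shared by the recognized text and a dictionary entry, compared case-insensitively; the function names and the example misreading are hypothetical.

```python
def longest_common_substring(a, b):
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1   # extend the run ending at (i-1, j-1)
                best = max(best, cur[j])
        prev = cur
    return best


def closest_name(text, dictionary):
    """Return (name, score) for the dictionary entry sharing the longest
    substring with the recognized text."""
    key = text.lower()
    scored = ((name, longest_common_substring(key, name.lower()))
              for name in dictionary)
    return max(scored, key=lambda t: t[1])
```

A low score for the best entry is the kind of signal that could be written to an error file for manual review, since it suggests the name is missing from the dictionary.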

VIII. RESULTS

Among the first 142 games, two were rejected by the system for typographical errors. Three others could not be corrected by the semantic analysis. The rest, 98% of the typo-free (correctly-printed) games, were accepted by the semantic analysis, and have been proofread by hand to confirm the interpretation. No forced interpretations (substitution errors) were found. If game interpretations were selected from those that could be assembled from the top three classifier choices (with syntactic but no semantic analysis), this fraction falls to 76%. Interpretation by shape alone (using only the classifier’s top choices) would have yielded 40% of the games completely correct, for an error rate a factor of 30 worse than that achieved by semantic analysis.
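The factor of 30 can be checked from the rounded game-level percentages quoted above, taking the error rate to be the fraction of typo-free games not read completely correctly:

```latex
\[
\frac{1 - 0.40}{1 - 0.98} \;=\; \frac{0.60}{0.02} \;=\; 30
\]
```

Using the unrounded count of three uncorrected games out of 140 typo-free games instead of the rounded 2% gives a ratio of about 28, so the quoted factor appears to be based on the rounded figures.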

Fig. 13. Results of the syntactic and semantic analysis for the game opening shown in Figs. 1, 11, and 12. Note that move numbers and commentary have been stripped off and a player’s name has been corrected. The opening classification “(R76/a)A62” has been recomputed.

One of the games that could not be fully corrected had garbled characters at the very end, where there was no later context from which to backtrack. Another had too many consecutive impasses, at the outset of a game, leading to an unmanageably large search space. The line of text involved is illustrated in Fig. 14. The top line is the bitmap (after cleaning up dirt), and the bottom line the top-choice interpretations by the classifier. Bad matches are marked with surrounding boxes.

For 99.4% of the ply, at least one candidate suggested by the classifier was syntactically correct. Of these, the top choice was semantically legal 98.7% of the time. About 0.5% of the ply were associated with impasses, and the final correct rate among ply after semantic analysis was 99.99%. Since ply are made of about three characters on average, the effective per-character success rate is probably better than 99.995%. We have no direct measurement of the uncorrected character recognition rate, and it is difficult to infer it in an unbiased way since, among other problems, about half of the characters are ignored as commentary.
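The per-character estimate can be reproduced under a simplifying assumption that is not stated in the text: a ply is read correctly only if each of its roughly three characters is, and character errors are independent. Writing $p_{\text{ply}}$ and $p_{\text{char}}$ for the two success rates:

```latex
\[
p_{\text{char}}^{\,3} = p_{\text{ply}} = 0.9999
\quad\Longrightarrow\quad
p_{\text{char}} = 0.9999^{1/3} \approx 1 - \frac{0.0001}{3} \approx 0.99997
\]
```

This is about 99.997%, consistent with the claim that the effective rate is better than 99.995%.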

We have continued reading, without manually proofreading every game, and after four volumes (945 pages, 2850 games, 2,176,865 characters) performance is holding steady, with legal interpretations found for over 97% of the games free from typographical errors. Since about a million of these characters occur in commentary, the semantic analysis was applied to about a million characters for an effective rate of successful interpretation of 99.995%.

The system was written entirely in the C programming language and ran under various research editions of the UNIX® operating system. Runtime for the image analysis phases averaged 10 CPU minutes per page on a DEC VAX® 11/8550, of which 87% was consumed in exclusive-or operations during template matching. Connected components analysis and layout analysis required about 10 CPU seconds apiece. Resegmentation of dirty, etc. shapes required an average of one CPU minute, with a moderately large variance due to variations in image quality from page to page. The syntactic and semantic analysis was performed on a Sequent; CPU time averaged 2

UNIX is a registered trademark of AT&T.
VAX is a registered trademark of Digital Equipment Corporation.