In my last post I discussed how to make a Japanese->English transliterator and outlined some problems that limited its usefulness. One problem is that there’s no obvious way to segment a sentence into words. I looked up existing solutions, and a lightweight JavaScript implementation caught my eye. I quickly ported it to Common Lisp and, to the surprise of absolutely no one, the results were awful.

It was clear that I needed an actual database of Japanese words to do segmentation properly. This would also solve the “kanji problem”, since the database would also include how to pronounce the words. My first hunch was Wiktionary, but its dump format turned out to be pretty inefficient to parse.

Fortunately, I quickly discovered the free JMdict database, which was exactly what I needed. It even had open-source code in Python for parsing and loading its XML dumps. Naturally, I wrote my own code to parse it instead, since that project’s database schema looked too complex for my needs. But I’m not going to discuss that in this post, as it is quite boring.

Now that I had a comprehensive Postgres database of every word in the Japanese language (not really, as it doesn’t include conjugations), it was only a matter of identifying the words in the sentence. To do this, we look up every substring of the sentence in the database and collect the exact matches. There are n(n+1)/2 substrings in a string of length n, so we aren’t doing too badly in terms of performance (and the string won’t be too long anyway, since prior to running this procedure I’ll be splitting the input by punctuation etc.)

(defstruct segment start end word)

(defun find-substring-words (str)
  (loop for start from 0 below (length str)
        nconcing (loop for end from (1+ start) upto (length str)
                       for substr = (subseq str start end)
                       nconcing (mapcar (lambda (word)
                                          (make-segment :start start :end end :word word))
                                        (find-word substr)))))
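To see what find-substring-words produces, here is a toy run with a stubbed-out find-word. In the real code find-word queries the JMdict database; the alist dictionary below is purely hypothetical:

```lisp
;; Stub dictionary standing in for the database lookup.
(defparameter *stub-dict*
  '(("AB" . (:ab)) ("BC" . (:bc)) ("ABC" . (:abc))))

(defun find-word (str)
  "Return the list of dictionary entries matching STR exactly."
  (cdr (assoc str *stub-dict* :test #'string=)))

;; (find-substring-words "ABC") then returns three segments,
;; [0,2) for "AB", [0,3) for "ABC" and [1,3) for "BC",
;; sorted by (start, end).
```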

The problem is that there are a lot of words, and many of them are spelled identically. I decided to assign a score to each word based on its length (longer is better), whether it’s the preferred spelling of the word, how common the word is, and whether it’s a particle (particles tend to be short and thus need a boost to increase their prominence).
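The actual weights live in the database-facing code, but a toy scorer along those lines might look like this (the toy-word record and all the numbers are made up for illustration; the real scorer uses JMdict’s priority and frequency tags):

```lisp
;; Illustrative scorer only, not the one used in the real code.
(defstruct toy-word text preferred common particle)

(defun word-score (w)
  (let ((score (expt (length (toy-word-text w)) 2))) ; longer is much better
    (when (toy-word-preferred w) (incf score 5))     ; preferred spelling
    (when (toy-word-common w)    (incf score 5))     ; common word
    (when (toy-word-particle w)  (incf score 3))     ; boost short particles
    score))
```

For example, (word-score (make-toy-word :text "は" :particle t)) yields 4: a one-letter base of 1 plus the particle boost of 3.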

This leaves the following problem: for a sentence, find the set of non-intersecting segments with the maximum total score. You might have better mathematical intuition than I do, but my first thought was:

This looks NP-hard, man. This problem has “travelling salesman” written all over it.

My first attempt to crack it was to calculate a score per letter for each word and greedily select the words with the highest scores. But a counterexample comes to mind rather easily: in the sentence “ABC” with words “AB” (score=5), “BC” (score=5) and “ABC” (score=6), the words “AB” and “BC” have the higher score per letter (2.5), but the optimal covering is provided by the word “ABC”, with its measly score per letter of 2.

At this point I was working with the most convenient mathematical instrument, which is pen and paper. The breakthrough came when I started to consider a certain relation between two segments: the segment a can be followed by the segment b iff (segment-start b) is greater than or equal to (segment-end a). Under this relation our segments form a transitive directed acyclic graph (the proof is left as an exercise for the reader). Clearly we just need to do a transitive reduction and use something similar to Dijkstra’s algorithm to find the path with the maximal score. This problem is solvable in polynomial time!

Pictured: actual notes drawn by me

In reality the algorithm turns out to be quite simple. Since find-substring-words always returns segments sorted by their start and then by their end, every segment can only be followed by segments after it in the list. We can then accumulate, for every segment, the largest total score and the path that produced it, using a nested loop:

(defstruct segment start end word (score nil) (accum 0) (path nil))

(defun find-best-path (segments)
  ;; assume segments are sorted by (start, end)
  ;; (as is the result of find-substring-words)
  (let ((best-accum 0)
        (best-path nil))
    (loop for (seg1 . rest) on segments
          when (> (segment-score seg1) (segment-accum seg1))
            do (setf (segment-accum seg1) (segment-score seg1)
                     (segment-path seg1) (list seg1))
               (when (> (segment-accum seg1) best-accum)
                 (setf best-accum (segment-accum seg1)
                       best-path (segment-path seg1)))
          when (> (segment-score seg1) 0)
            do (loop for seg2 in rest
                     if (>= (segment-start seg2) (segment-end seg1))
                       do (let ((accum (+ (segment-accum seg1) (segment-score seg2))))
                            (when (> accum (segment-accum seg2))
                              (setf (segment-accum seg2) accum
                                    (segment-path seg2) (cons seg2 (segment-path seg1)))
                              (when (> accum best-accum)
                                (setf best-accum accum
                                      best-path (segment-path seg2)))))))
    (values (nreverse best-path) best-accum)))
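As a sanity check, the earlier “ABC” counterexample can be fed through find-best-path, with the segments constructed by hand using the scores from that example:

```lisp
;; "ABC" with AB (score 5), ABC (score 6), BC (score 5):
;; AB and BC overlap at B, so the best covering is ABC alone.
(let ((segments (list (make-segment :start 0 :end 2 :word "AB"  :score 5)
                      (make-segment :start 0 :end 3 :word "ABC" :score 6)
                      (make-segment :start 1 :end 3 :word "BC"  :score 5))))
  (multiple-value-bind (path score) (find-best-path segments)
    (values (mapcar #'segment-word path) score)))
;; => ("ABC"), 6
```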

Of course, when I actually tried to run this algorithm, SBCL just crashed. How could that be? It took me a while to figure out, but notice how segment-path contains a list that includes the segment itself: a recursive, self-referential structure! When SBCL tried to print that in the REPL, it didn’t result in dragons flying out of my nose, but a crash still happened. Interestingly, Common Lisp has a solution for this: if *print-circle* is set to t, the printer will render the structure using referential #n= labels. Anyway, I just added the following before returning the result, to remove the self-references:
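The printer behaviour is easy to reproduce in isolation, without any of the segmentation code:

```lisp
;; A one-element list whose sole element is the list itself.
(defparameter *circ* (list nil))
(setf (first *circ*) *circ*)

;; With *print-circle* bound to t, the printer emits #n=/#n# labels
;; instead of recursing forever:
(let ((*print-circle* t))
  (prin1-to-string *circ*))
;; => "#1=(#1#)"
```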

(dolist (segment segments) (setf (segment-path segment) nil))

So, did it work? Yes, it did, and the result was impressive! Even though my scoring system is pretty barebones, it’s on par with, or even better than, Google Translate’s romanization on the few test sentences I tried. I still need to add conjugations, and it can’t handle personal names at all, but considering how little code there is, and the fact that it doesn’t even attempt grammatical analysis of the sentence (due to me not knowing the language), I am very happy with the result. I also plan to add a web interface, so that it’s possible to hover over words and see the translation. That would be pretty useful. The work-in-progress code is on my GitHub.