If you are reading this blog, you can probably read Latin script. It is quite widespread: according to Wikipedia, it is used by about 70% of the world’s population. Perhaps, like me, you have a native language that uses a different script. There are many writing systems in the world; some are related, and some are wildly different from each other. Fortunately, with the advent of the Internet and tools like Google Translate, it is increasingly possible to read text not only in languages you don’t understand, but even in languages whose writing systems you don’t understand.



Well, Google is Google, but can a mere mortal create something like that? Not to translate, but just to render an unfamiliar writing system in your preferred alphabet (a process called transliteration or transcription)? There’s no reason why not.

In this post I’ll walk through the process of romanization of Japanese, which is transcription from Japanese to Latin script. For example, “ありがとうございます” is romanized as “arigatō gozaimasu” under the Hepburn romanization method (one of many).



First off, the basics of Japanese writing are as follows:

There are several scripts used to write the Japanese language. Hiragana is a syllabary (a writing system where each character represents a syllable) used for words of Japanese origin. Katakana is another syllabary, used for loan words. Every possible syllable in Japanese has a hiragana form and a katakana form, which usually look completely different. Each script has about 50 characters. Chinese characters (kanji) are used for words of Japanese and Chinese origin. There are thousands of them, and most can be read in several different ways, which makes transcribing them difficult. We’re going to ignore kanji for now.



If we focus on romanizing hiragana and katakana (collectively called kana), the process seems pretty simple: just replace each kana with the syllable it represents, written in Latin letters. However, some characters do not represent a syllable but instead modify the syllable before or after them. These include the sokuon, which doubles the consonant of the following syllable (e.g. っ turns きて “kite” into きって “kitte”), and yōon characters, small versions of normal kana that replace the vowel of the preceding syllable (e.g. きょ is “kyo”, not “kiyo”).

OK, so the first thing we must do is bring some order to this madness. Since each kana has a hiragana version and a katakana version, it doesn’t make sense to work with the characters directly. Instead, I’m going to map each character to a keyword.

(defparameter *sokuon-characters* '(:sokuon "っッ"))

(defparameter *iteration-characters* '(:iter "ゝヽ" :iter-v "ゞヾ"))

(defparameter *modifier-characters*
  '(:+a "ぁァ" :+i "ぃィ" :+u "ぅゥ" :+e "ぇェ" :+o "ぉォ"
    :+ya "ゃャ" :+yu "ゅュ" :+yo "ょョ" :long-vowel "ー"))

(defparameter *kana-characters*
  '(:a "あア" :i "いイ" :u "うウ" :e "えエ" :o "おオ"
    :ka "かカ" :ki "きキ" :ku "くク" :ke "けケ" :ko "こコ"
    :sa "さサ" :shi "しシ" :su "すス" :se "せセ" :so "そソ"
    :ta "たタ" :chi "ちチ" :tsu "つツ" :te "てテ" :to "とト"
    :na "なナ" :ni "にニ" :nu "ぬヌ" :ne "ねネ" :no "のノ"
    :ha "は" :hha "ハ" :hi "ひヒ" :fu "ふフ" :he "へヘ" :ho "ほホ"
    :ma "まマ" :mi "みミ" :mu "むム" :me "めメ" :mo "もモ"
    :ya "やヤ" :yu "ゆユ" :yo "よヨ"
    :ra "らラ" :ri "りリ" :ru "るル" :re "れレ" :ro "ろロ"
    :wa "わワ" :wi "ゐヰ" :we "ゑヱ" :wo "を" :wwo "ヲ"
    :n "んン"
    :ga "がガ" :gi "ぎギ" :gu "ぐグ" :ge "げゲ" :go "ごゴ"
    :za "ざザ" :ji "じジ" :zu "ずズ" :ze "ぜゼ" :zo "ぞゾ"
    :da "だダ" :dji "ぢヂ" :dzu "づヅ" :de "でデ" :do "どド"
    :ba "ばバ" :bi "びビ" :bu "ぶブ" :be "べベ" :bo "ぼボ"
    :pa "ぱパ" :pi "ぴピ" :pu "ぷプ" :pe "ぺペ" :po "ぽポ"))

(defparameter *all-characters*
  (append *sokuon-characters* *iteration-characters*
          *modifier-characters* *kana-characters*))

(defparameter *char-class-hash*
  (let ((hash (make-hash-table)))
    (loop for (class chars) on *all-characters* by #'cddr
          do (loop for char across chars
                   do (setf (gethash char hash) class)))
    hash))



(defun get-character-classes (word)
  (map 'list (lambda (char) (gethash char *char-class-hash* char)) word))

This creates a hash table mapping every kana to a keyword that describes it, and we can now trivially convert a word into a list of “character classes” (non-kana characters are kept as-is). Next we need to transform this list into a kind of AST in which the modifier characters play the role of functions.
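As a sanity check, here is roughly what this stage should produce at the REPL, assuming the definitions above have been loaded (keyword printing may vary slightly by implementation):

```lisp
;; Kana are mapped to keywords; anything not in the table passes through as a character.
>>> (get-character-classes "きょうと")
(:KI :+YO :U :TO)

>>> (get-character-classes "abcです")
(#\a #\b #\c :DE :SU)
```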

(defun process-modifiers (cc-list)
  (loop with result
        for (cc . rest) on cc-list
        if (eql cc :sokuon)
          do (push (cons cc (process-modifiers rest)) result)
             (loop-finish)
        else if (member cc *modifier-characters*)
          do (push (list cc (pop result)) result)
        else
          do (push cc result)
        finally (return (nreverse result))))

This is your basic push/nreverse idiom with some recursion added. A sokuon is applied to everything to the right of it, because I wanted it to have lower precedence: (:sokuon :ka :+yu) is parsed as (:sokuon (:+yu :ka)) rather than the other way around. Now we can write the core of our algorithm:
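To illustrate both cases, a couple of sketched REPL calls (again assuming the code above is loaded):

```lisp
;; A yoon modifier wraps the syllable immediately before it...
>>> (process-modifiers '(:ki :+yo :u :to))
((:+YO :KI) :U :TO)

;; ...while a sokuon wraps everything to its right.
>>> (process-modifiers '(:sokuon :ka :+yu))
((:SOKUON (:+YU :KA)))
```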

(defun romanize-core (method cc-tree)
  (with-output-to-string (out)
    (dolist (item cc-tree)
      (cond ((null item))
            ((characterp item) (princ item out))
            ((atom item) (princ (r-base method item) out))
            ((listp item) (princ (r-apply (car item) method (cdr item)) out))))))

The functions r-base and r-apply are generic functions whose behavior depends on the romanization method. A third generic function, r-simplify, will “pretty up” the result. It is easy to write reasonable fallback methods for all three:

(defgeneric r-base (method item)
  (:documentation "Process atomic char class")
  (:method (method item)
    (string-downcase item)))

(defgeneric r-apply (modifier method cc-tree)
  (:documentation "Apply modifier to something")
  (:method ((modifier (eql :sokuon)) method cc-tree)
    (let ((inner (romanize-core method cc-tree)))
      (if (zerop (length inner))
          inner
          (format nil "~a~a" (char inner 0) inner))))
  (:method ((modifier (eql :long-vowel)) method cc-tree)
    (romanize-core method cc-tree))
  (:method ((modifier symbol) method cc-tree)
    (format nil "~a~a" (romanize-core method cc-tree) (string-downcase modifier))))

(defgeneric r-simplify (method str)
  (:documentation "Simplify the result of transliteration")
  (:method (method str) str))
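Even with no romanization method at all (passing nil), these fallbacks let romanize-core handle simple cases, though the generic modifier fallback betrays its crudeness:

```lisp
;; The fallbacks downcase symbol names and double sokuon consonants...
>>> (romanize-core nil '(:a :ri (:sokuon :ta)))
"aritta"

;; ...but a yoon modifier just gets its symbol name appended, which is clearly wrong.
>>> (romanize-core nil '((:+yu :ka)))
"ka+yu"
```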

Of course, relying on symbol names isn’t flexible at all. It’s better to have a mapping from each keyword to the string that represents it. This is where we resort to classes, storing the mapping in a slot.

(defclass generic-romanization ()
  ((kana-table :reader kana-table :initform (make-hash-table))))

(defmethod r-base ((method generic-romanization) item)
  (or (gethash item (kana-table method)) (call-next-method)))

(defmethod r-apply ((modifier symbol) (method generic-romanization) cc-tree)
  (let ((yoon (gethash modifier (kana-table method))))
    (if yoon
        (let ((inner (romanize-core method cc-tree)))
          (format nil "~a~a" (subseq inner 0 (max 0 (1- (length inner)))) yoon))
        (call-next-method))))

(defmacro hash-from-list (var list)
  (alexandria:with-gensyms (hash key val)
    `(defparameter ,var
       (let ((,hash (make-hash-table)))
         (loop for (,key ,val) on ,list by #'cddr
               do (setf (gethash ,key ,hash) ,val))
         ,hash))))



(hash-from-list *hepburn-kana-table*
    '(:a "a" :i "i" :u "u" :e "e" :o "o"
      :ka "ka" :ki "ki" :ku "ku" :ke "ke" :ko "ko"
      :sa "sa" :shi "shi" :su "su" :se "se" :so "so"
      :ta "ta" :chi "chi" :tsu "tsu" :te "te" :to "to"
      :na "na" :ni "ni" :nu "nu" :ne "ne" :no "no"
      :ha "ha" :hha "ha" :hi "hi" :fu "fu" :he "he" :ho "ho"
      :ma "ma" :mi "mi" :mu "mu" :me "me" :mo "mo"
      :ya "ya" :yu "yu" :yo "yo"
      :ra "ra" :ri "ri" :ru "ru" :re "re" :ro "ro"
      :wa "wa" :wi "wi" :we "we" :wo "wo" :wwo "wo"
      :n "n"
      :ga "ga" :gi "gi" :gu "gu" :ge "ge" :go "go"
      :za "za" :ji "ji" :zu "zu" :ze "ze" :zo "zo"
      :da "da" :dji "ji" :dzu "zu" :de "de" :do "do"
      :ba "ba" :bi "bi" :bu "bu" :be "be" :bo "bo"
      :pa "pa" :pi "pi" :pu "pu" :pe "pe" :po "po"
      :+a "a" :+i "i" :+u "u" :+e "e" :+o "o"
      :+ya "ya" :+yu "yu" :+yo "yo"))

(defclass generic-hepburn (generic-romanization)
  ((kana-table :initform (alexandria:copy-hash-table *hepburn-kana-table*))))

I’m going for a rather versatile class hierarchy here, starting with a completely empty kana-table for the generic-romanization method, but defining methods on it that work for any table. Then I define a class generic-hepburn that will serve as the basis for the different Hepburn variations. The table is taken from the Wikipedia article on Hepburn romanization, which is quite detailed. Reading it carefully, we can identify the exceptions that the functions above can’t handle. For example, a :sokuon before :chi is romanized as “tchi”, not as the “cchi” that simple consonant doubling would produce. Another exception: :chi followed by :+ya is romanized as “cha”, not “chya”. CLOS makes it easy to handle these irregularities before passing the torch to a less specific method.

(defmethod r-apply ((modifier (eql :sokuon)) (method generic-hepburn) cc-tree)
  (if (eql (car cc-tree) :chi)
      (concatenate 'string "t" (romanize-core method cc-tree))
      (call-next-method)))

(defmethod r-apply ((modifier (eql :+ya)) (method generic-hepburn) cc-tree)
  (case (car cc-tree)
    (:shi "sha")
    (:chi "cha")
    ((:ji :dji) "ja")
    (t (call-next-method))))

;; ... and the same for :+yu and :+yo
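We can already check the sokuon exception by chaining the earlier functions by hand (a convenience wrapper for this chain comes a bit later):

```lisp
;; The :chi special case produces "tchi" rather than the doubled "cchi".
>>> (let ((method (make-instance 'generic-hepburn)))
      (romanize-core method (process-modifiers (get-character-classes "こっち"))))
"kotchi"
```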

Another thing Hepburn romanizations do is simplify double vowels like “oo”, “ou” and “uu”. For example, our generic-hepburn romanizes “とうきょう” as “toukyou”, while most people are more familiar with the spellings “Tokyo” or “Tōkyō”.

(defun simplify-ngrams (str map)
  (let* ((alist (loop for (from to) on map by #'cddr collect (cons from to)))
         (scanner (ppcre:create-scanner (cons :alternation (mapcar #'car alist)))))
    (ppcre:regex-replace-all scanner str
                             (lambda (match &rest rest)
                               (declare (ignore rest))
                               (cdr (assoc match alist :test #'equal)))
                             :simple-calls t)))

(defclass simplified-hepburn (generic-hepburn)
  ((simplifications :initform nil :initarg :simplifications :reader simplifications
                    :documentation "List of simplifications e.g. (\"ou\" \"o\" \"uu\" \"u\")")))

(defmethod r-simplify ((method simplified-hepburn) str)
  (simplify-ngrams (call-next-method) (simplifications method)))
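A quick sketch of simplify-ngrams on its own, with a made-up simplification list:

```lisp
;; Every occurrence of "ee" is collapsed into the macron form.
>>> (simplify-ngrams "oneesan" '("ee" "ē"))
"onēsan"
```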



(defclass traditional-hepburn (simplified-hepburn)
  ((simplifications :initform '("oo" "ō" "ou" "ō" "uu" "ū"))))
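With the simplifications slot filled in, r-simplify now post-processes the raw romanization:

```lisp
;; Both "ou" digraphs collapse to the macron vowel.
>>> (r-simplify (make-instance 'traditional-hepburn) "toukyou")
"tōkyō"
```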



I’m using the “parse tree” feature of CL-PPCRE here to build a composite :alternation regex on the fly, then using regex-replace-all with a custom replacement function. It’s probably not the most efficient approach, but sometimes outsourcing string manipulation to a well-tested regex engine is the least painful solution. Anyway, we’re really close now; all that’s left is to chain our functions into a usable API.

(defparameter *hepburn-traditional* (make-instance 'traditional-hepburn))

(defvar *default-romanization-method* *hepburn-traditional*)

(defun romanize-list (cc-list &key (method *default-romanization-method*))
  "Romanize a character class list according to method"
  (let ((cc-tree (process-modifiers cc-list)))
    (values (r-simplify method (romanize-core method cc-tree)))))

(defun romanize-word (word &key (method *default-romanization-method*))
  "Romanize a word according to method"
  (romanize-list (get-character-classes word) :method method))



>>> (romanize-word "ありがとうございます")
"arigatōgozaimasu"

On my GitHub you can find an unabridged version of the above code. However, some difficult problems with Japanese romanization can’t be solved so easily. Even leaving kanji aside, the hiragana character は is pronounced either “ha” or “wa” depending on whether it is used as a particle. For example, the common greeting “こんにちは” is romanized as “konnichiwa”, not “konnichiha”, because は here plays the role of a particle. Which brings us to another problem: there are no spaces between words, so it isn’t possible to determine whether は is part of a word or a standalone particle without a dictionary, and even then it can be ambiguous! I’m ending the post on this note, since I’m still not sure how to solve this. さようなら！