Table of Contents

Introduction

Unlike previous iterations of this series, today we’ll be diving into the field of linguistics a bit more intensely. The reason for my lack of any new posts in a month and a half has been the result of my work on a completely new language which is now approaching an alpha state (at which point, I will theoretically be able to build a full parser for it). Today, I’ll be covering why I decided to invent a language, where it came from, how it is different, and how it all ties into the overall goal of narrative scripting.

Future posts will most certainly reference this language, so if you aren’t interested in the background and just want the TL;DR of the language features and relevance to narrative scripting, then feel free to skip on down to the conclusion where I will review everything.

Without further ado, let’s begin!

Issues With “Toki Sona”

Prior to this post, I had puffed up the possibilities of using a toki pona-derived language, heretofore referred to as toki sona. While I was quite excited about tp’s potential to combine concepts together and support a minimal vocabulary with simple pronunciation and an intuitive second-hand vocabulary (e.g. “water-enclosedConstruction” = bathroom), there were also a variety of issues that forced me to reconsider its adaptation towards narrative scripting.

First and foremost is the ambiguity within the language. Syntactic ambiguity makes it nearly impossible for an algorithm to easily understand what is being stated, and tp has several instances of this lack of clarity. For example, “mi moku moku” could mean “the hungry me eats” or “I hungrily eat” or even some new double-word emphasis that someone is experimenting with: “I really dove into eating [something]” or “I am really hungry”. With the language unable to clearly distinguish modifiers and verbs from each other, identifying parts of speech and therefore the semantic intent of a word and its relationship to other words is needlessly complicated.

The one remedy I thought of for this would be to add hyphens in between nouns/verbs and their associated modifiers, not only allowing us to instantly disambiguate this syntactic confusion, but also to simplify and accelerate the computer’s parsing with a clear delimeter (a special symbol to separate two ideas). However, this solution is not audibly communicable during speech and therefore retains all of these issues in spoken dialogue, violating our needs. Using an audible delimeter would of course be completely impractical.

The other problem of ambiguity with the language is the intense level of semantic ambiguity present due to the restricted nature of the language’s vocabulary. The previously mentioned “bathroom” (“tomo telo”) could also be a bathhouse, a pool, an outhouse, a shower stall, or any number of other related things. Sometimes, the distinction is minor and unimportant (bathroom vs. outhouse), but other times that distinction may be the exact thing you wish to convey. What happens then if we specify “bathroom of outside”?

One possibility is the use of “ma” meaning “the land, the region, the outdoors, the earth”, but then we don’t know if this is an outdoor bathroom or if it is the only indoor bathroom in the region, or if it is just a giant pit in the earth that people use. The other possibility could be “poka” meaning “nearby” or “around”, but that is even more unclear as it specifies a request purely for an indoor bathroom of a given proximity.

As you can see, communicating specific concepts is not at all tp’s specialty. In fact, it goes against the very philosophy of the language: the culture supported by its creators and speakers is one that stresses the UNimportance of knowing such details.

If you ever happen to speak with a writer, however, they will tell you the importance of words, word choice, and the evocative nature of speech. They can make you feel different emotions and manipulate the thoughts of the reader purely through the style of speech and the nuanced meanings of the terminology they have used. If we are to support this capacity in a narrative scripting language, we cannot be allowed to build its foundation on a philosophy prejudiced against good writing.

The final issue, related to the philosophy, is the grammatical limitations imposed by its syntax.

Inter-sentence conjunctions like English’s FANBOYS (“I was tired, yet he droned on.”) are not entirely absent thankfully: they have an “also” (“kin”) and a “but” (“taso”) that can be used to start the following sentence and relate two ideas. One can even adverbial phrases (the only types of dependent clauses allowed) to assist in relating ideas. However, limitations are still present, and the reason for that is a mix of the philosophy and the (admirable) goal of keeping the vocabulary size compact. You cannot satisfactorily describe a single noun with adjectives of multiple, tiered details (“I like houses with toilet seats of gold and a green doorway”). This is a problem many have attempted to deal with revolving around the “noun1 pi multi-word modifier” technique that converts a set of words into an adjective describing noun1. Users of tp have debated on ways of combating this. One that I had considered, and which is mildly popular, was re-appropriating the “and” conjunction for nouns (“en”) as a way of connecting pi’s, but because you effectively need open and closed parentheses to accomplish the more complex forms of description, there isn’t really a clean way of handling this.

Through all of the limitations, prejudice of writing, and ambiguity, toki pona, and any language closely related to it, is inevitably going to find itself wanting in viability for narrative scripting. Time to move on.

Lojban: Let’s Speak Logically!

In an attempt to solve the problems of toki pona, a friend recommended to me that I check out the “logical language”, Lojban (that ‘j’ is soft, as in “beige”).

Lojban is unlike any spoken language you have ever learned: it borrows much of its syntax from functional programming languages like Haskell. Every phrase/clause is made up of a single word indicating a set of relationships and the other words are all things that act as “parameters” by plugging in concepts for the relations.

For example, “cukta” is the word for book. If you simply use it on its own, it plugs “book” into the parameter of another word. However, that’s not all it means. In full, “cukta” means…

x1 is a book containing work x2 by author x3 for audience x4 preserved in medium x5

If you were to have tons of cukta’s following each other (with the proper particles separating them), having a full version would mean…

A book is a book about books that is written by a book for books by means of a book.

You can also specifically mark which “x” a given word is supposed to plugin as, without needing to use any of the other x’s, so the word cukta can also function as the word for “topic”, “author”, “audience”, and “medium” (all in relation to books). If that’s not conservation of vocabulary, I don’t know what is.

It should be noted that 5-parameter words in Lojban are far more rare than simpler ones with only 2 or 3 parameters. Still, it’s impressive that, using this technique, Lojban is able to communicate a large number of topics using a compressed vocabulary, and yet remain extremely explicit about the meaning of its words.

Just as important to notice is how Lojban completely does away with the concept of a “noun”, “verb”, “object”, “preposition”, or anything of the sort. Concepts are simply reduced to a basic entity-relation-entity form: entity A has some relationship x? to entity B. This certainly makes things easier for the computer. In addition, while on the one hand one might think this would make things easier to understand when learning (since it is a much simpler system), the fact that it is so vastly different from the norm means that people coming from more traditional languages will have a more difficult time understanding this system, especially given the plurality of relationships that are possible with a single word.

Another strong advantage of Lojban is that it is structured to provide perfect syntactic clarity to a computer program and can be completely parsed by a computer in a single pass. In laymen’s terms, it means that the computer only needs to “read” the text one time to understand with 100% accuracy the “parts of speech” of every word in a sentence. There is no need for it to guess how a word is going to be syntactically interpreted.

In addition, Lojban employs a strict morphological structure on its words to indicate their meaning. For example, each of these “root set” words like “cukta” have one of the two following patterns: CVCCV and CCVCV (C being “consonant” and V being “vowel”). This makes it much easier for the computer to pick out these words in contrast to other words such as particles, foreign words, etc. Every type of word in the language conforms to morphological standards of a similar sort. The end result is that Lojban parsers, i.e. “text readers” are very very fast in comparison to those for other languages.

One more great advantage of Lojban is that it has these terrifically powerful words called “attitudinal indicators” that allow one to communicate a complex emotion using words on a spectrum. For example, “iu” is “love”, but alternative suffixes give you “iucu’i” (“lack of love”, a neutral state) and “iunai” (“hate/fear”, the opposite state). You can even combine these terms to compose new emotions like “iu.iunai” (literally “love-hate”).

For all of these great elements though, Lojban has two aspects that make it abhorrent to use for the simple narrative scripting we are aiming for. It is too large of a language: 1,350 words just for the “core” set that allows you to say reasonable sentences. While this is spectacularly small for a traditional language, in comparison to toki pona’s nicely compact 120, it is unacceptably massive. As game designers, we simply can’t expect people to devote the time needed to learn such a huge language within a reasonable play time.

The other damaging aspect is the sheer complexity of the language’s phonology and morphology. When someone wishes to invent a new word using the root terms, they essentially mash them together end-to-end. While this would be fine alone, switching letters around and having part of the latter consumed by the end of the former is unfortunately very difficult to follow. For example…

skami = “x1 is a computer used for purpose x2”

pilno = “x1 uses/employs x2 [tool, apparatus, machine, agent, acting entity, material] for purpose x3.”

skami pilno => sampli = “computer user”

Because “skami pilno” was a commonly occuring word in Lojban’s usage, a new word with the “root word” morphology can be invented on the fly by combining the letters. Obviously, this appears very difficult to do on the fly and effectively involves people learning an entirely new word for the concept.

All that to say that Lojban brings some spectacularly innovative concepts to the table, but due to its complex nature, fails to inspire any hope for an accessible scripting language for players.

tokawaje: The Spectral Language

We need some way of combining the computer-compatibility of Lojban with the elegance and simplicity of toki pona that omits as much ambiguity as possible, yet also allows the user to communicate as broadly and as specifically as needed using a minimal vocabulary.

Over the past month and a half, I’ve been developing just such a language, and it is called “tokawaje”. An overview of the language’s phonology, morphology, grammar, and vocabulary, along with some English and toki pona translations, can be found on my commentable Google Sheets page here (concepts on the Dictionary tab can be searched for with “abc:” where “abc” is the 3-letter root of the word). With grammar and morphology concepts derived from both Lojban and toki pona, and with a minimal vocabulary sized at 150 words, it approximates a toki pona-like simplicity with the potential depth of Lojban. While it is still in its early form, allow me to walk through the elements of tokawaje that capture the strengths of the other two despite avoiding their pitfalls.

Lojban has three advantages that improve its computer accessibility:

The entity-relation-entity syntax for simpler parsing and syntactic analysis. Morphological and grammatical constraints: the word and grammar structure is directly linked to its meaning. The flexibility of meaning for every individual learned word: “cukta” means up to 5 different things.

toki pona has two advantages that improve its human accessibility:

Words that are not present in the language can be estimated by combining existing words together and using composition to construct new words. This makes words much more intuitive. It is extremely easy to pronounce words due to its mouth-friendly word construction (every consonant must be followed by a single vowel).

“tokawaje” accomplishes this by…

Using a similar, albeit heavily modified entity-relation-entity syntax. Having its own set of morphological constraints to indicate syntax. Using words that represent several things that are associated with one another on a spectrum. Relying on toki pona-like combinatoric techniques to compose new words as needed. Using a phonology and morphology focused on simple sound combinations that are easily pronounced. Must match the pattern: VCV(CV)*(CVCV)*.

Now, once more, but with much more detail:

1) Entity-Relation-Entity Syntax

Sentences are broken up into simple 1-to-1 relations that are established in a context. These contexts contain words that each require a grammar prefix to indicate their role in that context. After the prefix, each word then has some combination of concepts to make a full word. Concepts are each composed of some particle, tag, or root, (some spectrum of topic/s) followed by a precision marker that indicates the exact meaning on that spectrum.

The existing roles are… (pronounced like in Spanish):

prefix ‘u’: a left-hand-side entity (lhs) similar to a subject. prefix ‘a’: a relation similar to a verb or preposition. prefix ‘e’: a right-hand-side entity (rhs) similar to an object. prefix ‘i’: a modifier for another word, similar to an adjective or adverb. prefix ‘o’: a vocative marker, i.e. an interjection meant to direct attention.

Sentences are composed of contexts. For example, “I am real to you,” is technically two contexts. One asserts that “I am real” while the other asserts that “my being real” is in “your” perspective. This nested-context syntax is at the heart of tokawaje.

These contexts are connected with each other using context particles:

‘xa’ (pronounced “cha”) meaning opening a new context (every sentence silently starts with one of these). ‘xo’ meaning close the current context. ‘xi’ meaning close all open contexts back to the original layer.

(These also must each be prefixed with a corresponding grammar prefix)

Examples of Concept Composition:

‘xa’ = an incomplete word composed of only a particle+precision. “uxa” = a full word with a concept composed of a grammar prefix and a particle+precision. “min” = root for pronouns, “mina” = “self”, full “umina” = “I”. “vel” = root for “veracity”, “vela”= “truth”, full “avela” = “is/are”. “sap” = root for object-aspects, “sapi” = “perspective”, full “asapi” = “from X’s perspective”.

Sample Breakdown:

“I am real to you” => “umina avela evela uxo asapi emino.”

“umina” {u: subject, min/a: “pronoun=self”} “avela” {a: relation, vel/a: “veracity=true”} “evela” {e: object, vel/a: “veracity=true”} “uxo” {u: subject, xo: “context close”} // indicating the previous content was all a left-hand-side entity for an external context. “asapi” {a: relation, sap/i: “aspect=perspective”} “emino” {e: object, min/o: pronoun=you}

It’s no coincidence that the natural grammatical breakdown of a sentence looks very much like JSON data (web API anyone?). In reality, it would be closer to…

{ prefix: ‘u’, concepts: [ [“min”,”a”] ] }

…since the meanings would be stored locally between client and server devices.

This is DIFFERENT from Lojban in the sense that no single concept will encompass a variety of relations to other words, but it is SIMILAR in that the concept of a “subject”/”verb”/”object” structure isn’t technically there in reality. For example:

“umina anisa evelo” => “I -inside-> lie” => “I am inside a lie.”

In this case, “am inside” isn’t even a verb, but purely a relation simulating an English prepositional phrase where no “is” verb is technically present.

These contexts can be used without a complete context to create gerunds, adjective phrases, etc. For example, to create a gerund left-hand-side entity of “existing”, I might say

“avela uxo avela evelo.” => “Existing is (itself) a falsehood.”

You might ask, “how do we tell the difference with something like [uxavela]? Might it be {u: object, xav/e: something, la: something}? Actually, no. The reason the computer can immediately understand the proper interpretation is because of tokawaje’s second Lojban incorporation:

2) Strict Morphological Constraints for Syntactic Roles

Consonants are split up into two groups: those reserved for particles, such as ‘x’ and those reserved for roots, such as ‘v’. The computer will always know the underlying structure of a word’s morphology and consequent syntax. Therefore, given the word “uxavela” we will know with 100% certainty that the division is u (has the V-form common to all prefixes), xa (CV-form of all particles), and vela (CVCV-form of all roots).

Particles can be split up into two categories based on their usual placement in a word.

Those that are usually the first concept in a word. ‘x’ = relating to contexts (as you have already seen previously) ‘xa’ = open ‘xo’ = close ‘xi’ = cascading close ‘xe’ = a literal grammar context (to talk about tokawaje IN tokawaje) ‘f’ = relating to irrelevant and/or non-tokawaje content ‘fa’ = name/foreign word with non-tokawaje morphology constraints ‘fo’ = name/foreign word with tokawaje morphology constraints ‘fe’ = filler word for something irrelevant Those that are usually AFTER a concept as a suffix (could be mid-word). ‘z’ = concept manipulation ‘za’ (zah) = shift meaning more towards the ‘a’ end of the spectrum ‘zo’ (zoh) = shift meaning more towards the ‘o’ end of the spectrum ‘zi’ (zee) = the source THING that assumes the left-hand-side of this relation. Ex. “uvelazi” => that which is something Shorthand for “ufe avela uxo”. ‘ze’ (zeh) = the object THING that assumes the right-hand-side of this relation. Ex. “uvelaze” => that which something is. Shorthand for “avela efe uxo”. ‘zu’ (as in “food”) = questioning suffix ‘zy’ (zai) = commanding suffix ‘zq’ (zow) = requesting suffix ‘c’ = tensing, pronounced “sh” ‘ca’ = future tense ‘ci’ = progressive tense ‘co’ = past tense ‘b’ = logical manipulation ‘be’ = not ‘ba’ = and ‘bi’ = to (actually, it is the “piping” functionality in programming, if you know about that) ‘bo’ = or ‘bu’ = xor

All other consonants in the language fall into the “root word” set. With these clear divisions, tokawaje will always know what role a concept has in manipulating the meaning of that word.

I’d also like to point out that informal, conversational uses of these two groups of particles may completely remove the distinction between them. For example, someone may simply say:

“uzq” => “Please.”

This would not actually impact the computer’s capacity to distinguish terms though. I even plan to make my own parser assume that lack of a grammar prefix implies an intended ‘u’ prefix (not that that’s encouraged)

3) Concepts in Tokawaje Exist on Spectra

Most every word in the language has exactly 4 meanings, with 3 non-root concepts using more than that: the grammar prefixes and ‘z’-based word manipulators (as you’ve already seen), and general expressive noises / sound effects which are vowel-only. This technique allows for vocabulary that is flexible, yet intuitive, despite its initial appearance of complexity.

4) Sounds and Structure are Designed for Clear, Flowing Speech

Every concept is restricted to a form that facilitates clear pronunciation and a consistent rhythm. Together, these elements ensure that the language is simple to learn phonetically.

Concepts have the form C (particles/tags) or CVC (roots) along with a vowel grammar prefix and a vowel precision suffix, resulting in a minimum word of VCV or VCVCV.

The rhythm to concepts emphasizes the middle CV: u-MI-na, a-VE-la, etc. Even with suffixes applied to words, this pattern never becomes unmanageable. The result is a nice, flowy-feeling language:

uvelominacoze / avelominaco (“velomina” => a personal falsehood) u-VE-lo-MI-na-CO-ze (that which one lied to oneself about) a-VE-lo-MI-na-co (to lie to oneself in the past)

5) Tokawaje Employs Tiered Combinatorics to Invent New Concepts

The first concept always communicates the root “what” of a thing while the subsequent concepts add further description of the thing. This structure emulates toki pona’s noun-combining mechanics.

‘u’, ‘a’, and other non-‘i’ terms are primary descriptors and more closely adhere to WHAT a thing is. ‘i’ terms are secondary descriptors and approximate the additional properties of a thing BEYOND simply WHAT it is. Fundamentally, every concept follows these simple rules:

Non-‘i’ words are more relevant to describing their role’s reality than ‘i’ words. However, individual words are described more strongly by their subsequent ‘i’ words than they are by other terms. Multiple non-‘i’ words will further describe that non-‘i’ term such that later non-‘i’ words act as descriptors for the next-left non-‘i’ word and its associated ‘i’ words.

Let’s say I have the following sentence (I’ll be using the filler particle “fe” with an artificially inserted number to reference more easily. Think of each of these as a root+precision CVCV form):

“ufe1fe2 ife3fe4 ife5 ufe6fe7 ife8 avela uxofe9 ife10 afe11”

This can be broken down in the following way:

Any pairing of adjacent fe’s form a compound word in which the second fe is an adjective for the previous fe, but the two of them together form a single concept. For example, “ife3fe4”: fe4 is modifying fe3, but the two together form an adjective modifying the noun “ufe1fe2”. The subject is primarily described by “ufe1fe2” and secondarily by “ufe6fe7” since they are both prefixed with ‘u’, but one comes later. “ufe6fe7” is technically modifying “ufe1fe2”, even if “ufe1fe2” is also being more directly modified by the ‘i’-terms following it. Each of those ‘u’ terms are additionally modified by their adjacent ‘i’ term adjective modfiers. “ife5” is an adverb modifying “ife3fe4”. The “existence of ” the “ufe1-8” entity is the u-term of the “afe11” relation. The entirety of that u-term has a primary adjective descriptor of “-fe9” and a secondary adjective descriptor of “ife10”.

Suppose the word for “dog” were “uhumosoviloja” (u/lhs,humo/beast,sovi/land,loja/loyal = “loyal land-beast”). How might you describe a disloyal dog then? You would use an ‘u’ for stating it is a dog (that identifying aspect) and an ‘i’ for the disloyalty (the added on description). The spectrum of loyalty (“loj”) would therefore show up twice.

“uhumosoviloja ilojo” => “disloyal dog”

For clarity purposes, you may even split up the “loja”, but that wouldn’t impact the meaning since “uloja” still has a higher priority than “ilojo”.

“uhumosovi uloja ilojo” => “disloyal dog” (equivalent)

Let’s say there were actual distinctions between words though. How about we take the noun phrase “big wood box room”? Here’s the necessary vocabulary:

“sysa” => “big/large/to be big/amount”

“lijavena” => “rigid thing of plants” => “wood/wooden/to be wood”

“tema” => “of or relating to cubes”

“kita” => of or relating to rooms and/or enclosed spaces”

Now let’s see some adaptations:

ukitasysa => an “atrium”, a “gym”, some space that, by definition, is large. ukita usysa => same thing. ukita isysa => a room that happens to be relatively big. ukita utema ulijavena usysa => cube room of large-wood. ukita utema ulijavena isysa => cube room of large-wood. ukita utema ilijavena usysa => room of wooden large-boxes. ukita itema ulijavena usysa => a cube-shaped room of large-wood. ikita utema ulijavena usysa => [something] related to rooms that is cube-shaped, wooden, and large. ukita utema ilijavena isysa => a room of large-wood cubes. ukita itema ilijavena usysa => the naturally large room associated with wooden cubes. ikita itema ulijavena usysa =>[something] related to cube-shaped rooms that is a large-plant. ukita itema ilijavena isysa => the room of large-wood boxes. ikita itema ilijavena usysa => [something] related to plant-box rooms that is an amount. (an inventory of greenhouses or something?) ikita itema ilijavena isysa => [something] related to rooms of large-wood boxes. ukita usysa utema ilijavena => large room of wooden boxes. ukita utema usysa ilijavena => room of wood-amount boxes. ukita utemasysa ilijavena => room of wooden big-boxes. ukita usysa iba ulijavena itema => A big-room and a cube-related plant.

Some of these are a little crazy and some of them are amazingly precise. The point is, we are achieving this level of precision using a vocabulary closer to the scope of toki pona. I can guarantee you that you would never have been able to say any of this in a language as vague as TP nor will it ever try to approximate this level of clarity. I can likewise guarantee that Lojban will never have a minified version of itself available for video games. Good thing we don’t need we have an alternative.

Conclusion

As you can see, tokawaje combines the breadth, depth and computer-accessibility of Lojban with the simplicity, intuitiveness, and human-accessibility of toki pona.

For those of you wanting the TL;DR:

The invented language, tokawaje, is a spectrum-based language. Clarity of pronunciation, compactness of vocabulary (150 words), and combinatoric techniques to invent concepts all lend the language to great accessibility for new users of the language. On the other hand, a sophisticated morphology and grammar with clear constraints on word formations, sentence structure, and their associated syntax and semantics result in a language that is well-primed for speedy parsing in software applications.

More information on the language can be found on my commentable Google Sheets page here (concepts on the Dictionary tab can be searched for with “abc:” where “abc” is the 3-letter root of the word).

This is definitely the longest article I’ve written thus far, but it properly illuminates the faults with pre-existing languages and addresses the potential tokawaje has to change things for the better. Please also note that tokawaje is still in an early alpha stage and some of its details are liable to change at this time.

If you have any comments or suggestions, please let me know in the comments below or, if you have specific thoughts that come up while perusing the Google Sheet, feel free to comment on it directly.

Next time, I’ll likely be diving into the topic of writing a parser. Hope you’ve enjoyed it.

Cheers!

Next Article: Coming Soon!

Previous Article: Relationship and Perception Modeling