Speech and language technology on a shoestring

Scottish Gaelic provides a model for small-team, high-impact strategies for languages with few speakers

Small languages often struggle to gain a foothold in Speech and Language Technology (SALT). SALT can refer to any interface of language and digital technology, from localized user interfaces and digital wordlists to text-to-speech or voice recognition tools. Efforts in this area often end up as disjointed, isolated and uncoordinated projects, which can waste resources and slow progress. Yet with technology increasingly permeating our lives, the need to plant a linguistic flag in cyberspace becomes ever more important for small and medium languages. Research supports the intuitive notion that the “metalanguage of technology” (the ways we talk about technology) impacts our wider patterns of language use.

Coupled with these more subtle effects are the more practical issues of how to interface with technology, such as predictive texting, auto-correction and voice recognition, if it does not readily support your language. The existence of domains where a language is not functional, such as scientific and technological domains, has well-known, negative impacts on attitudes towards a language. Further, the need for ‘modernizing the image of a language’ frequently appears in attitudinal studies of minority languages.

Ultimately, there is most likely a need for a linguistic SALT rights charter aimed at the private sector multinational corporations that develop commercial software applications. In the meantime, using the Scottish Gaelic experience as a case study, I argue that if efforts are focused, targeted and planned, a very small team can provide a language with a significant digital foothold. This can enable everyday users of a language to conduct a large percentage of their daily technological interactions in their language.

Scottish Gaelic is a Celtic language that has experienced a long displacement by English over the last millennium. Today, it remains a community language primarily in parts of the rural Outer Hebrides and Highlands, and has a few tens of thousands of fluent speakers, including a few thousand in Canada. In recent years, a decentralized but concerted effort to revitalize the Gaelic language has been developing, and one focus of this effort has been information and communication technology. Today, you can boot up a Gaelic operating system, edit documents in a Gaelic office suite, surf the web with a Gaelic browser, and play dozens of Gaelic video games. Just a few years ago, almost none of this was possible.

What to aim for, and how to get there on a shoestring

The ultimate goals of a SALT initiative should be to have the maximum number of speakers using a wide range of SALT in their language, to produce these with a minimum of resources and within a short time-frame, and to use future-proofed approaches where possible. For this, three main ingredients are needed:

A prioritized roadmap for lexicographical (dictionary-related) and SALT development.

A translator (with at least an interest in technology and computing), a lexicographer (or at least someone with a keen interest in dictionaries) and an IT developer. The developers usually do not have to be involved full-time.

A willingness to cooperate with larger, existing projects, preferably Free and Open-Source Software (FOSS) for better sustainability.

This is not dissimilar to other approaches to language computerization such as the Basic Language Resource Kit (BLARK). However, my approach is less focused on the development and processing of linguistic corpora which, in their pure form, are likely to remain out of the reach of smaller languages.

How does it work?

Gaelic was fortunate that in 2009, Kevin Scannell, an experienced localizer and computer scientist, gave us some sound advice regarding key aspects of a technological and lexicographical roadmap. In our case, this led to a radically different design of a planned dictionary project (described below) and the localization of Mozilla Firefox. In turn, this led to an almost natural progression of further projects.

Using a slightly modified map (with the benefit of hindsight), three key elements of a successful SALT development strategy would appear to be:

A dictionary project (or at least a very advanced and well-maintained wordlist)

Localization and development efforts

Dissemination efforts (including boosting user trust in the organization promoting the tools)

Dictionaries

Dictionary projects have historically wasted much effort by creating printed dictionaries based on text documents. While the immediate bonus is the relatively quick creation of a dictionary and a low technological bar, these are not future-proof as they are not easily amended and are hard to convert into digital tools (for example a spell-checker). For slightly different reasons, the “app trap” should also be avoided unless significant resources are at hand. App-based dictionaries sound great but involve a lot of time (=money) in terms of development for multiple platforms and almost interminable bugfixing. With the increasing popularity and sinking cost of data services on mobile devices, web-based dictionaries often are a more sustainable resource.

Faclair Beag (AFB), an online English-Scottish Gaelic dictionary. Image courtesy iGàidhlig.

Creating a digital dictionary based on a lexical database is an approach which is initially slower but ultimately leads to a more powerful resource. In the Gaelic case, this was achieved by the Faclair Beag (AFB) project, which from the end-user perspective functions more or less like any other online dictionary: bidirectional searches, pronunciations, sound and grammatical information, etc. However, behind the scenes, the lexicographical data is stored in tables in relational databases not only identifying general part-of-speech data but also micro-level morpho(phono)logical data, gender, case, tense, mood and person.

The immediate benefit is smart dictionary searches, which means a user can enter an inflected form and be directed to the appropriate root. Experience with AFB has shown this to be highly useful and popular not only for direct users of AFB but also for third-party projects such as Wordlink, a project which offers language learners a split browser screen with learning content on the left and, via left-clicks, an automatic lookup in a given dictionary on the right.
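To make the idea of a lexical database with smart searches more concrete, here is a minimal sketch in Python using an in-memory SQLite database. It is not AFB's actual schema or code: the table names (`lemma`, `form`), the column names and the function `smart_lookup` are all hypothetical, and the handful of Gaelic entries are illustrative sample data only. The point is simply that once every inflected form is stored as a row linked to its headword, resolving an inflected search term becomes a single join.

```python
import sqlite3
import unicodedata


def build_demo_db():
    """Create a tiny in-memory lexical database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE lemma (
            id       INTEGER PRIMARY KEY,
            headword TEXT NOT NULL,
            pos      TEXT NOT NULL,   -- part of speech
            gender   TEXT,            -- 'm' / 'f' for nouns
            gloss    TEXT NOT NULL    -- English translation
        );
        CREATE TABLE form (
            lemma_id         INTEGER NOT NULL REFERENCES lemma(id),
            surface          TEXT NOT NULL,  -- the inflected form as written
            grammatical_case TEXT,
            number           TEXT
        );
        CREATE INDEX form_surface ON form(surface);
        """
    )
    # Illustrative sample entries, not taken from AFB.
    conn.executemany(
        "INSERT INTO lemma VALUES (?, ?, ?, ?, ?)",
        [
            (1, "cù", "noun", "m", "dog"),
            (2, "bean", "noun", "f", "woman, wife"),
        ],
    )
    conn.executemany(
        "INSERT INTO form VALUES (?, ?, ?, ?)",
        [
            (1, "cù", "nominative", "singular"),
            (1, "coin", "nominative", "plural"),
            (2, "bean", "nominative", "singular"),
            (2, "mnà", "genitive", "singular"),
            (2, "mnathan", "nominative", "plural"),
        ],
    )
    conn.commit()
    return conn


def smart_lookup(conn, query):
    """Return (headword, gloss, case, number) rows for any stored form."""
    # Normalize accents and case so 'Mnà' typed in any Unicode form matches.
    key = unicodedata.normalize("NFC", query.strip()).lower()
    return conn.execute(
        """
        SELECT l.headword, l.gloss, f.grammatical_case, f.number
        FROM form AS f JOIN lemma AS l ON l.id = f.lemma_id
        WHERE f.surface = ?
        ORDER BY l.headword
        """,
        (key,),
    ).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(smart_lookup(db, "mnathan"))
```

The same `form` table could also be dumped as a flat wordlist, which is one reason a database-backed dictionary is easier to turn into other tools than a text document.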

The Roadmap

The initial stages of the proposed roadmap likely apply to most languages in question, but once a certain level is achieved, the exact order of projects can be varied based on a needs analysis, community feedback or indeed requests.

Proposed roadmap for the most effective path towards a full range of SALT.

Once the initial projects (the dictionary, browser and office suite) have been tackled, the lexicography and localization/development start feeding off each other in various ways.

The work of an IT developer is not generally required on a full-time basis, but rather at certain junctures (for example an introduction to placeholders, plural formatting, the initial creation of the spellchecker, generating web-based statistics etc.).
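Plural formatting is a good example of why a short introduction from a developer helps. The sketch below, in Python, assumes the four plural categories that, as far as I understand them, the Unicode CLDR rules assign to Scottish Gaelic (1 and 11; 2 and 12; 3 to 10 and 13 to 19; everything else). The function names and the sample strings are hypothetical and purely illustrative; a real project would rely on the plural machinery of its localization framework rather than hand-written code, and the strings would need checking by a fluent speaker.

```python
def gd_plural_category(n: int) -> str:
    """Map a non-negative integer to an assumed Gaelic plural category."""
    if n in (1, 11):
        return "one"
    if n in (2, 12):
        return "two"
    if 3 <= n <= 10 or 13 <= n <= 19:
        return "few"
    return "other"


def format_count(n: int, templates: dict) -> str:
    """Pick the template for n's category and fill in the {n} placeholder."""
    return templates[gd_plural_category(n)].format(n=n)


# Illustrative strings for "n dog(s)"; one string per plural category.
DOG_TEMPLATES = {
    "one": "{n} chù",
    "two": "{n} chù",
    "few": "{n} coin",
    "other": "{n} cù",
}

if __name__ == "__main__":
    for count in (1, 2, 3, 11, 12, 13, 20):
        print(format_count(count, DOG_TEMPLATES))
```

A localizer who only has an English-style singular/plural pair to fill in cannot express these distinctions, which is why the translator needs to know what placeholders and plural categories the software actually supports.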

Spreading the word

Details about availability, installation and common issues of the Gaelic SALT tools, particularly end-user software, are provided via the web (iGaidhlig.net) and various social media platforms. Some face-to-face workshop trials have also been held.

Common problems

So far, we have encountered multiple challenges, not all of which can be avoided. These have included:

Participating in a commercial or larger FOSS project is often a high-risk strategy: it may lead to great benefits, but offers little guarantee of long-term sustainability.

For example, the once-popular Google In Your Language project, which enabled volunteers to translate Google product interfaces, was quietly shelved by Google after several years of use by small language communities. Joining Adaptxt, which produced a predictive texting mobile keyboard app, provided Scottish Gaelic with access to an industry-standard tool. However, Adaptxt shelved its predictive texting project in 2016 and wasn’t open to the idea of community development. Fortunately, SwiftKey stepped into the breach at around the same time, but this again highlights the pitfalls of non-FOSS collaboration partners going out of business and/or shelving projects. Collaboration on Gaelic text-to-speech with Cereproc, on the other hand, has so far resulted in a solid partnership and an industry-standard tool, not least because of the company’s laudably inclusive ethos.

So overall, if efforts are to some extent user-demand driven, engaging with such projects may occasionally be unavoidable or even desirable to reduce development workload.

Smaller (and even bigger) FOSS projects can “die” when key people leave.

Contributions to a project might be considered squandered if key developers leave a FOSS project and its development stagnates, or if the key drivers shelve the project or drop developer support (such as the now sadly defunct Firefox OS or Ubuntu Phone projects).

“Sexy” projects can turn out to be high-cost and low-impact.

Some projects may not be worth the investment of limited resources. Machine translation, for example, can lead to mass production of poor translations.

Technological problems can arise, such as issues with implementing correct plurals, or a requirement that ties the interface language to a “locale.”

“Force locale” is a feature which forces the interface language to match that of the operating system (OS). This works well if the OS is available in a given language and if the user is monolingual. In the case of smaller languages which may not have supported locale settings, this can make a carefully-localized interface completely inaccessible to the user.
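By contrast, an application that treats the interface language as a separate user preference, and only falls back on the OS locale when no preference is set, avoids this trap. The following Python sketch illustrates that fallback order under my own assumptions; the function `choose_ui_language` and its parameters are hypothetical and do not describe any particular product's behaviour.

```python
def choose_ui_language(user_preference, os_locale, available, default="en"):
    """Pick an interface language without forcing it to match the OS locale.

    Order of preference: explicit user choice, then OS locale, then default.
    Tags such as 'gd_GB.UTF-8' are reduced to 'gd-gb' and then to 'gd'.
    """
    supported = {tag.replace("_", "-").lower() for tag in available}

    candidates = [tag for tag in (user_preference, os_locale, default) if tag]
    for tag in candidates:
        # Drop any encoding suffix ('.UTF-8') and normalize separators/case.
        normalized = tag.split(".")[0].replace("_", "-").lower()
        if normalized in supported:
            return normalized
        # Fall back from a regional tag ('gd-gb') to the bare language ('gd').
        base = normalized.split("-")[0]
        if base in supported:
            return base
    return default


if __name__ == "__main__":
    # A Gaelic speaker on an English-locale OS still gets the Gaelic interface.
    print(choose_ui_language("gd", "en_GB.UTF-8", {"en", "gd"}))
```

With a forced locale, the equivalent logic would ignore `user_preference` entirely, so a user whose OS offers no Gaelic locale could never reach the Gaelic interface, however carefully it had been localized.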

Evaluating the Gaelic experience to date

The first digital Gaelic tool — the Stòr-dàta, an online termbase — appeared in 1994. Between 1994 and 2008, about a dozen other tools appeared, most of which then fell dormant for a period (the Opera web browser, OpenOffice.org and the Ning-based social network AbairThusa) or died off when funding/support ran out or the localizer moved on. Since the end of 2009, however, over 50 additional programs and tools have appeared, ranging from games and web-apps through predictive texting tools to operating systems (Ubuntu and Windows), allowing users to conduct a large percentage of their daily IT interactions through the medium of Gaelic.

These were almost all created by two (largely) unpaid part-time localizers and two (largely) unpaid part-time developers. Their time involvement is difficult to quantify, but an estimate puts it at 1.5 full-time equivalent of localizer and lexicographer time and 0.5 full-time equivalent of developer time over the last four years.

An unexpected benefit of having a small team produce a large number of localizations is unusually high consistency of the translations, especially in terms of terminology. Many FOSS projects suffer from having too many ‘cooks’ spoiling the broth of consistency, within and across projects.

The most significant challenge, though, is not technical but human. Most everyday users of technology use it “as it comes out of the box” and generally are reluctant to tinker with it unless coached by someone experienced. For example, although use of the Gaelic Firefox is slowly growing, it has taken almost 3 years to grow the userbase by approximately 20 to around 120 regular users. This pattern of low uptake (below what might be expected based on a product’s market penetration) appears to be common across languages (the Irish Firefox has about 300 regular users) and other projects.

Example of information leaflet produced to advertise Gaelic software. Image courtesy iGàidhlig.

Uptake of tools which are for, but not in, Gaelic is higher. For example, since 2009 AFB has received approximately 140,000 searches per month, and the predictive texting tool Adaptxt had been downloaded 2,871 times in Gaelic and 3,810 times in Irish as of 31 January, 2014. While this is encouraging, uptake remains an issue. Regarding home users, face-to-face pilot workshops where users are guided through available tools and the installation process have proven popular, and the current aim is to expand this, ideally through hiring a travelling community “promoter” who would hold free workshops across Scotland.

Unfortunately, in spite of interest from the educational and public sector in Scotland, such efforts are hampered by the current IT provision model. Outside suppliers are contracted to provide (often thin or dumb) clients with limited or no administrative rights for the end-users, which can stymie efforts to install localized versions of software. At best, the user may install software not on the approved list on a local system but not across, for example, all computers of a school. Until there is a requirement to provide Gaelic IT alongside the English (or until an alternate route is found), there is little that can be done to improve the provision of Gaelic technology in spite of availability.

The take-home message

First, the dissemination of information, user support and promotion must be considered at an early stage, as such tools will not simply disseminate through their mere existence.

Second, FOSS is often harder to “sell” to everyday users than the commercial and proprietary software they may be more familiar with. Ultimately, however, FOSS provides the only really sustainable model for small and medium languages.

Finally, our experience with Gaelic demonstrates that such a SALT development strategy is feasible and realistic. Since 2009 Gaelic has acquired dozens of new tools, applications, and IT opportunities through the work of a small group of people, and any language development agency or initiative should seriously consider supporting or setting up such a community.