India’s startling linguistic diversity shows itself in many forms: in oral traditions, through the literature, in the film industries, etc. But the most visible marker of our linguistic identities is, of course, our scripts.

In fact, as linguistic nationalism burgeoned in early 20th-century India, unique scripts came to mark the distinctiveness of individual languages. The notion that each language requires a unique script to be independent still persists.

However, diversity comes with its own challenges. European-style nationalism, after all, has traditionally emphasised the need for unity among languages and, by extension, their scripts. Objectively speaking, it’s easy to see why: a common script provides a certain cultural cohesion. But in the real world, imposing one would mean overlooking potent socio-cultural factors, and that’s a no-go area.

A link script

To try and bridge this graphical gulf between India’s languages without upsetting the balance of power inherent in such a project, a team at IIT-Madras recently came up with what they believe to be a solution: Bharati, a unified Indic script that could be used as an intermediary script between languages, similar to the way the Latin script is used by most European languages. (E.g., the author’s name would be represented as ‘Karthik’ in English, German, French and Spanish.)

At the moment, Bharati is able to represent words written in nine Indian scripts (encompassing at least 12 languages): Devanagari, Eastern Nagari (Assamese + Bengali), Gujarati, Gurmukhi, Kannada, Malayalam, Odia, Tamil and Telugu.

The logic is that something like the Hindi proper noun ‘Patna’, while natively written in Devanagari (पटना), can also be made intelligible to speakers of other languages by writing it in a script that everyone can recognise. So in Bharati, ‘Patna’ is:

[Image: ‘Patna’ written in the Bharati script]

According to the people behind it, Bharati is designed to be easy to learn. They have also developed an optical character recognition (OCR) system for it and have said that it is nearly 100% accurate.

This is a cool innovation, but it also raises many questions. For example, to put it rather bluntly, why would anyone learn it?

Scripts are used primarily for their cultural value, and not for the utilitarian purpose of mass-learning, let alone digitisation. So no matter how well-made Bharati is, it will need to overcome a significant level of inertia before India’s massive population becomes familiar with it. Notwithstanding Team Bharati’s outreach efforts (already underway), this is not likely to happen.

In fact, if Indians are indeed looking for a ‘neutral’ script to play the intermediary, a modified form of the Latin script would be an easier choice in many ways. Many Indians are already familiar with it. It also carries significant cultural heft thanks to its association with English, India’s most aspirational language.

Academia has already performed this feat. Linguistic texts use carefully defined romanisation schemes to repurpose the Latin script for Indian languages. The most inclusive of these schemes is ISO 15919, with a close correspondence between characters in Indic scripts and the Latin ones used to represent them, resulting in next to no ambiguity once a reader knows how the system works. In this scheme, for example, ‘Patna’ would be ‘Paṭnā’.
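To give a feel for how such a one-to-one scheme works, here is a minimal sketch in Python. It is not a full ISO 15919 implementation: the tables cover only a handful of Devanagari characters chosen for illustration, and the function name `romanise` is hypothetical. Note also that a strict character-by-character mapping spells out every inherent vowel, so it renders पटना as ‘paṭanā’; the familiar ‘Paṭnā’ reflects the way Hindi speakers drop that vowel in speech, which this sketch does not attempt to model.

```python
# Illustrative subset only: a real ISO 15919 table covers far more characters.
CONSONANTS = {"क": "k", "ट": "ṭ", "त": "t", "न": "n", "प": "p", "भ": "bh", "र": "r"}
VOWEL_SIGNS = {"ा": "ā", "ि": "i", "ी": "ī", "ु": "u", "ू": "ū"}
INDEPENDENT_VOWELS = {"अ": "a", "आ": "ā", "इ": "i", "ई": "ī"}
VIRAMA = "्"  # suppresses the inherent vowel of the preceding consonant


def romanise(text):
    """Map Devanagari to Latin one character at a time (hypothetical helper)."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt in VOWEL_SIGNS:
                # An explicit vowel sign replaces the inherent 'a'.
                out.append(VOWEL_SIGNS[nxt])
                i += 1
            elif nxt == VIRAMA:
                # Virama: no vowel follows this consonant.
                i += 1
            else:
                # No sign and no virama: write the inherent vowel.
                out.append("a")
        elif ch in INDEPENDENT_VOWELS:
            out.append(INDEPENDENT_VOWELS[ch])
        else:
            # Characters outside the toy tables pass through unchanged.
            out.append(ch)
        i += 1
    return "".join(out)


print(romanise("भारती"))
print(romanise("पटना"))
```

Because every Indic character maps to exactly one Latin sequence, the process is mechanical, which is precisely what ad hoc romanisation lacks.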

Of course, Indic transliteration using the Latin script is already widespread on digital devices but is almost entirely ad hoc, with little consistency.

Digital languages

The next question is how something like Bharati would work in India’s rapidly growing language space online. A 2017 study by Google and KPMG, the consulting group, predicted that 90% of the 326 million Indians estimated to come online for the first time between 2016 and 2021 will be accessing web-pages in an Indian language.

To make Bharati available to these users, its script will have to be added to the Unicode Standard to ensure its letters are properly encoded and displayed on websites and in browsers.

Until this happens, no script can be said to have a meaningful digital presence, and Bharati itself would be restricted to offline use and consumption. This is also why the script of Tulu, which is many centuries old, hasn’t been brought online yet.
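What being "in Unicode" means in practice can be seen with a few lines of Python, using the standard library’s `unicodedata` module. This is only a sketch: the helper names `describe` and `is_encoded` are hypothetical, and the use of a private-use code point to stand in for a letter from an unencoded script is an assumption made for illustration.

```python
import unicodedata


def describe(text):
    """List each character's code point and its official Unicode name."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


def is_encoded(ch):
    """True if the character has a name in this Python's Unicode database."""
    return unicodedata.name(ch, None) is not None


# Every letter of 'Patna' in Devanagari has an agreed code point and name.
for code_point, name in describe("पटना"):
    print(code_point, name)

# Assumption: a private-use code point stands in for a letter from a script
# that has no Unicode block of its own. It has no standard name or meaning.
print(is_encoded("\uE000"))
```

A font and a keyboard can be improvised for private-use code points, but text stored that way means nothing to any software that does not share the same private convention.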

Additionally, the idea of an OCR for a script that isn’t widely used is something of a non-starter. To understand this better, it is useful to know what OCR can do for Indian languages.

An OCR machine scans a document or image, attempts to recognise individual letters in it through the shapes of their characters, matches them with their text forms and finally converts the document or image into text. So OCR for Indian languages can potentially bring textual material dating back to the introduction of printing in India to millions of Indians via the internet, and provide otherwise inaccessible novels, poems and histories a new lease of shelf-life. In the process, technical experts can also use such projects to improve the scope and efficiency of AI-driven language technologies.
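The matching step at the heart of that process can be sketched in a few lines of Python. This is a deliberately tiny toy, not how Team Bharati’s or Google’s systems work: the 3x3 Latin-letter templates and the function names are hypothetical, and real OCR engines rely on far richer features and, increasingly, on neural networks.

```python
# Hypothetical 3x3 bitmaps ('1' = ink) for three letters.
TEMPLATES = {
    "I": ("010", "010", "010"),
    "L": ("100", "100", "111"),
    "T": ("111", "010", "010"),
}


def distance(a, b):
    """Count the pixels at which two same-sized bitmaps differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognise_glyph(glyph):
    """Return the letter whose template is closest to the scanned glyph."""
    return min(TEMPLATES, key=lambda ch: distance(glyph, TEMPLATES[ch]))


def recognise(glyphs):
    """Convert a sequence of already-segmented glyph bitmaps into text."""
    return "".join(recognise_glyph(g) for g in glyphs)


# The last glyph is a 'T' with one stray pixel, as a scan might produce.
scanned = [
    ("100", "100", "111"),
    ("010", "010", "010"),
    ("111", "010", "011"),
]
print(recognise(scanned))
```

The more printed material there is in a script, the more examples a system has to learn its letter shapes from, which is why OCR pays off most for scripts with a large existing body of text.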

On the other hand, even should Bharati gain mainstream acceptance, researchers will have to digitise a large volume of materials written in the conventional scripts for its benefit.

As with so many of its efforts with Indic languages, Google is leading the way with Indic OCR as well. Its Navlekhā platform allows publishers to upload scanned documents and PDFs and then digitises them via Google’s OCR, giving the publishers a chance to publish their content online as well.

So while Team Bharati’s initiative is interesting as an intellectual exercise, there are significant issues that need to be addressed before it becomes a viable alternative to, say, modified Latin.

In fact, the questions raised above pertain to anyone working with Indic scripts, their promotion and their digitisation. They are the questions assailing anyone working towards improving the Indian language internet.

Karthik Malli is a Bangalore-based communications professional with a keen interest in linguistics, history and travel. He tweets on Indian languages @TianChengWen.