Azure Translator Text now supports Inuktitut, the language spoken by Inuit in the far north of North America.
Over the years, I've received a lot of requests to provide machine translation for Inuktitut. There are tools available, particularly from Microsoft, to train your own neural translation engine for an unsupported language using a corpus of translated documents, and there is a great bilingual corpus from the debates of the Nunavut legislative assembly. Despite all that, I knew that this would not be possible. Other machine translation experts agreed that it was beyond the state of the art. On a couple of occasions I applied, unsuccessfully, for funding to push beyond the state of the art and make this possible. Why is machine translation of Inuktitut so difficult?
Inuktitut belongs to a class called "polysynthetic languages". Most of the languages that you know are probably "agglutinative". There are some root words which can be modified by changing the beginning or the end of the word. The root word is in the dictionary, but the words that are formed by adding or changing suffixes and prefixes typically are not, because everyone knows the rules. These agglutinative languages are part of a larger class called synthetic languages, which includes other simple ways of sticking words together, usually with a small set of rules that apply to one part of speech. For example, German can stick a lot of known nouns end-to-end to make a new word. There is still one root word, though, and all the other words modify or narrow down its sense. The resulting word behaves like a longer version of the base word and has the same part of speech.
Inuktitut is polysynthetic. The combination rules are much more complex. There can be several root concepts and root words, and the resulting word can lose or change its part of speech, because the full word is an entire sentence with subject, object, verb, adjective, even subordinate clauses, all contained in one big compound word. How you join the pieces together varies according to complex rules about what comes before and after the join. A well-known example is the word "ᖃᖓᑕᓲᒃᑯᕕᒻᒨᕆᐊᖃᓛᖅᑐᖓ", which means "I'll have to go to the airport". Verbs, nouns, subject, object: they're all contained in the same word.
Most, though not all, Native American languages are polysynthetic. With other languages, a neural network can start from a dictionary and some rules, and the translation engine can be trained to see patterns of three or four words in a row that always translate to the same three or four words in another language. That doesn't work here. Almost all the neural translation engines I have seen are word-based. There are languages that are written without spaces, like Chinese, Japanese, and Thai, but they still have individual words, and breaking the text up into those words is relatively simple. Not so with polysynthetic languages.
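To get a feel for why a word-based vocabulary breaks down, here is a small Python sketch. It is only an illustration under loud assumptions: the roots and suffixes are made-up syllables, not real Inuktitut, and simple concatenation ignores the sound changes that happen at real joins. It counts how many distinct surface words a handful of morphemes can produce.

```python
from itertools import product

# Made-up morphemes for illustration only; these are NOT real Inuktitut.
ROOTS = ["ka", "mi", "tu"]
# Three suffix slots; the empty string means "slot not used".
SLOTS = [
    ["", "la", "ri", "su"],
    ["", "na", "ko", "pi"],
    ["", "ta", "ngu", "vi"],
]


def surface_forms(roots, slots):
    """Build every word obtainable by attaching one choice per slot to a root."""
    return {
        root + "".join(suffixes)
        for root in roots
        for suffixes in product(*slots)
    }


forms = surface_forms(ROOTS, SLOTS)
morphemes = set(ROOTS) | {s for slot in SLOTS for s in slot if s}

print(f"{len(morphemes)} morphemes produce {len(forms)} distinct words")
```

Every extra slot multiplies the number of possible words, while the morpheme inventory only grows by addition. A word-based engine would need to have seen each full word in training, which is why most of them never show up in even a large corpus.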
I haven't seen any information about how Microsoft tackled the problem for Inuktitut. I am assuming that they used a tool to break words down into morphemes. The tool I would have used, had I received the funding, is the National Research Council's Uqailaut Inuktitut Morphological Analyzer, but I don't know whether Microsoft did something similar. I am watching for any publications about it. There have been some advances lately in modeling and translating these languages, so that is not the only possible approach.
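The general idea of breaking a word into known pieces before translating can be sketched in a few lines of Python. This is a toy under stated assumptions, not how Uqailaut or Microsoft's engine works: the lexicon is a hypothetical set of English morphemes, and the function only does surface matching, whereas a real Inuktitut analyzer also has to undo the changes that morphemes undergo at the joins.

```python
from typing import List, Optional, Set


def segment(word: str, lexicon: Set[str]) -> Optional[List[str]]:
    """Split word into lexicon entries, preferring the fewest pieces.

    Returns None when no complete segmentation exists.
    """
    # best[i] holds the shortest known segmentation of word[:i].
    best: List[Optional[List[str]]] = [None] * (len(word) + 1)
    best[0] = []
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            prefix = best[start]
            if prefix is None or piece not in lexicon:
                continue
            candidate = prefix + [piece]
            current = best[end]
            if current is None or len(candidate) < len(current):
                best[end] = candidate
    return best[len(word)]


# Hypothetical toy lexicon; English is used only to keep the example readable.
TOY_LEXICON = {"un", "break", "able", "re", "think", "ing"}

print(segment("unbreakable", TOY_LEXICON))
print(segment("rethinking", TOY_LEXICON))
print(segment("xyz", TOY_LEXICON))
```

Once words are split this way, the translation engine can learn patterns over morphemes instead of over whole words, which is much closer to the situation it handles well in other languages.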
On the other hand, perhaps they trained a neural network to decompose words into morphemes and recompose them, without a standalone processor. If that's the case, then the same techniques could be used for various other widespread but hard-to-translate polysynthetic languages: Algonquian languages like Cree and Ojibwe, Iroquoian languages like Mohawk, Athabascan languages like Dene or Navajo, or Siouan languages like Dakotan. It's a game changer.
Oh, and if you were curious, PointFire Translator now supports translation to and from Inuktitut on SharePoint sites. Just use the language code "iu". Your browser should already support Canadian Aboriginal Syllabics.
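If you want to try the Azure service directly rather than through PointFire, a request might look like the Python sketch below. Treat it as a rough outline and not official sample code: it assumes the Translator v3 REST endpoint, assumes that Azure uses the same "iu" code mentioned above, and the function names, key and region values are placeholders of my own.

```python
import json
import urllib.parse
import urllib.request

# Assumed global endpoint of the Translator v3 REST API.
ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"


def build_translate_request(text, to_lang, from_lang, key, region):
    """Build (but do not send) a translation request for one piece of text."""
    query = urllib.parse.urlencode(
        {"api-version": "3.0", "from": from_lang, "to": to_lang}
    )
    body = json.dumps([{"text": text}]).encode("utf-8")
    headers = {
        "Ocp-Apim-Subscription-Key": key,        # placeholder: your own key
        "Ocp-Apim-Subscription-Region": region,  # placeholder: your resource region
        "Content-Type": "application/json; charset=UTF-8",
    }
    return urllib.request.Request(
        f"{ENDPOINT}?{query}", data=body, headers=headers, method="POST"
    )


def translate(text, to_lang, from_lang, key, region):
    """Send the request and return the first translation (needs a valid key)."""
    request = build_translate_request(text, to_lang, from_lang, key, region)
    with urllib.request.urlopen(request) as response:
        result = json.loads(response.read().decode("utf-8"))
    return result[0]["translations"][0]["text"]


# Building the request needs no network access, so this is safe to run as-is.
request = build_translate_request(
    "I'll have to go to the airport", "iu", "en", "YOUR_KEY", "YOUR_REGION"
)
print(request.full_url)
```

Calling translate() with a real key and region sends the request. The text that comes back should be in syllabics, so the output needs to be handled as UTF-8.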