Indonesia has more than 700 languages. Can AI save them?

Growing up in Banyuwangi, a regency in the Indonesian province of East Java, Antariksawan Jusuf spoke Using with his family and friends. It wasn’t until he went to university in Bali, where he had to speak the national language, Bahasa Indonesia, that he realized Using was in danger of dying out.

“Using is threatened by modernization,” Antariksawan, now 58, told Rest of World. “A lot of parents now prefer Bahasa Indonesia when they communicate with their children.”

It’s not just Using that is under threat. Indonesia has more than 700 regional languages and nearly 800 dialects across its vast archipelago. But more than 400 dialects are at risk of becoming extinct by the end of the 21st century, according to researchers. The government has turned to artificial intelligence to help preserve the languages, and make them more accessible to the population.

Popular large language models (LLMs), such as OpenAI’s GPT, Google’s Gemini, and Meta’s Llama, are trained largely on English-language data, excluding billions of people who speak languages that are not commonly found online. Non-English-speaking nations are trying to bridge the gap by building their own multilingual LLMs in low-resource languages (which are widely spoken but do not have a lot of data on the internet) as well as endangered languages.

“We are heading toward monolingualism due to globalization and modernization,” Endang Aminudin Aziz, head of the language development agency at the Ministry of Education and Culture, told Rest of World. “We are working on revitalizing languages to keep them from extinction. AI technology and LLMs, I think, will help.”

To train LLMs, large quantities of high-quality data are needed, including books, media, and academic papers, as well as public code repositories such as those on GitHub, and other data sets. As these are in short supply in regional languages, there are concerns around whether the available data best represents the cultures, Nuurrianti Jalli, an assistant professor at Oklahoma State University’s media school, told Rest of World. “You have to ask: Where does the data come from? Who is behind them?”

This is all the more important in a country where censorship is rampant and information is tightly controlled by the government. A diverse range of data sources is necessary to ensure that the LLMs’ output is inclusive and unbiased, said Jalli.

“Involving multiple experts, including those not aligned with government views, can help ensure that the data’s context is accurately represented,” Jalli said. “This is particularly important where data might be manipulated to favor certain political powers.”

Earlier this year, Yellow.AI launched Komodo-7B, an LLM trained on Bahasa Indonesia and 11 other regional languages including Javanese, Balinese, and Sundanese. It uses Indonesian textbooks, among other sources, to ensure diversity, co-founder Rashid Khan told Rest of World. While Komodo-7B is currently aimed at business applications, and not at preserving local languages and dialects, that is not an impossible goal in the near future, Khan said.

According to him, it would require “a high level of digitization,” and that can only happen with community effort. The training of LLMs would get easier, he said, “once we get to very high levels of digitization, where a particular language, its books, its papers, poems — all of this becomes available online easily.” But for now, the bulk of training data is still in English, Khan said. “If that keeps happening … then some of the other languages will be left behind.”

So far, besides Bahasa Indonesia, only two regional languages have digitized texts: Balinese and Makassarese. Antariksawan is hopeful that Using can be another. He helped publish a Bahasa Indonesia-Using dictionary that took years to research, and has written a novel in the two languages. He has also set up a collective in Banyuwangi to preserve the Using language and culture, which publishes short stories, novels, and videos of folktales and children’s songs.

The team is working with the Banyuwangi regional library to digitize the literature, he said, to make it more accessible to their community and to tech firms looking for data to train LLMs.

There is a lot of interest in reaching Indonesia’s 275 million people. Last year, Singaporean startup Wiz.AI launched an LLM for Indonesian, which “captures the linguistic nuances and cultural contexts of the region, leading to more contextually relevant outputs,” the company said. The Singapore government-led SEA-LION family of open-source LLMs, launched last year, also trains its models on Bahasa Indonesia and other Southeast Asian languages.

Most recently, Indonesian telecom company Indosat Ooredoo Hutchison signed an agreement with Mumbai-based Tech Mahindra to develop Garuda LLM. The model can be applied across industries including health care, e-commerce, education, finance, and agriculture, it said. “By preserving Bahasa Indonesia and its dialects … we promote linguistic diversity and enhance accessibility and inclusivity in the digital realm,” Vikram Sinha, chief executive officer of Indosat Ooredoo Hutchison, said in a statement.

But the paucity of data means that Garuda will train on 16 billion original Bahasa Indonesia tokens (the basic units of data, which can be a word or a character) and will have 1.2 billion parameters, the values a model learns during training to make predictions. The SEA-LION family, which is being built as open-source, has models with 3 billion and 7 billion parameters. Komodo-7B also has 7 billion parameters; Wiz.AI’s Bahasa Indonesia LLM has 13 billion parameters.
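
To make the word-versus-character distinction concrete, here is a minimal Python sketch that counts tokens in one short phrase both ways. The sample phrase and the function names are illustrative assumptions, not taken from Garuda or any other model mentioned here, and production LLMs typically use subword tokenizers that fall somewhere between these two extremes.

```python
def word_tokens(text: str) -> list[str]:
    """Word-level view: every whitespace-separated word is one token."""
    return text.split()


def char_tokens(text: str) -> list[str]:
    """Character-level view: every non-space character is one token."""
    return [ch for ch in text if not ch.isspace()]


# Hypothetical sample phrase ("Good morning, Indonesia" in Bahasa Indonesia).
sentence = "Selamat pagi Indonesia"

print(len(word_tokens(sentence)))  # 3 tokens when counting words
print(len(char_tokens(sentence)))  # 20 tokens when counting characters
```

The same text yields very different token counts depending on the unit chosen, which is why a figure such as "16 billion tokens" only means something relative to the tokenizer behind it.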

In comparison, GPT-4 is reported to have about 1.76 trillion parameters, while Gemini is said to have up to 175 trillion; neither company has confirmed these figures.

For Antariksawan, even this is a start. The Using language dates back to the 13th century, and he is determined to preserve it for future generations. “My hope is that the younger generation can learn Using from an early age, and won’t have to struggle as much as I did to find texts in the language,” he said. “I hope AI technology and LLMs can take us to the next level.”
