Large language models are not truly performant polyglots: how to escape the English-centric trap?

Reading time: 17 minutes

Many people claim that large language models, and ChatGPT in particular, can speak and answer in almost any language like humans.
Those marketing messages take a big shortcut by advertising these tools as efficient polyglots.

The progress in natural language processing is real, but some problems and limits remain on the table.

Are these tools equally efficient in all languages, as some of us seem to think?
Isn't English better represented, and don't English prompts receive better outputs?

How are these models trained, and how important are the choice and volume of the initial corpus? How do vendors fight hallucinations and cultural biases?

I ran several examples and compiled more than 30 sources and papers from different backgrounds to step outside the hype bubble and dive in properly.

The goal: an objective look at the existing limits of these models, limits that have real consequences when dealing with diverse use cases in the real world.

From dialects to larger languages, how good are LLMs at dealing with multiple contexts?

Let's start with a small dialect test and see what happens when we try to work with the Ch'timi/Picard one.

Spoken mainly in the north of France, this dialect has approximately 700,000 speakers.
Training sources should be scarce, so this dialect should typically be underrepresented in most models.

What are the capabilities of Bard and Llama 2 in this regard?
First, my conversation with Bard about the Ch'ti language.


Question to Bard: Have you been trained on the Chti dialect from north of France?
Yes, I have been trained on the Chti dialect from north of France. I have access to a massive dataset of text and code, including a significant amount of text in Chti. This means that I can understand and generate Chti text, and I can translate between Chti and French.

For example, if you ask me to translate the following sentence from Chti to French: Ch’est bin bon, ch’carbonade flamande!

I will give you the following translation: La carbonade flamande est très bonne! I can also generate Chti text, such as the following poem: Ch’Nord, c’est mi pays, Ch’Nord, c’est mi chez-moi. J’aime ses gens, j’aime sa langue, J’aime ses paysages.

I am still under development, but I am learning to use the Chti dialect more and more effectively. I hope that one day I will be able to communicate fluently in Chti with native speakers.

Question to Bard: How many sources in chti were available to train your model?

There were 5 sources in Chti available to train my model:
  • https://fr.wikipedia.org/wiki/Chti
  • https://fr.wiktionary.org/wiki/Catégorie:Chti
  • https://www.chti.com/
  • https://www.lavoixdunord.fr/region/chti/
  • https://www.francebleu.fr/emissions/chti-et-fier-de-l-etre

These sources include a variety of text formats, such as Wikipedia articles, news articles, and social media posts. This diversity of text formats helped me to learn the Chti dialect in a comprehensive way. I am still under development, but I am learning to use the Chti dialect more and more effectively. I hope that one day I will be able to access and process even more Chti text, so that I can become even better at communicating in this beautiful language.

Question to Bard: What does mean “tchul” in chti language?

The word “tchul” in the Chti dialect means “very” or “a lot.” It is a common intensifier that can be used to describe a variety of things, such as the weather, food, people, or experiences. For example, you might say “tchul bon” to mean “very good” or “tchul chaud” to mean “very hot.”

Here are some examples of how the word “tchul” can be used in Chti sentences: Ch’est tchul bon, ch’carbonade flamande! Il fait tchul chaud aujourd’hui. C’est un tchul gentil garçon. J’ai tchul bien dormi. Ch’ai tchul bien rigolé à ton histoire. The word “tchul” is thought to come from the Picard word “tchu,” which means “very.” It is also related to the French word “tellement,” which also means “very.” I hope this helps.

Question to Bard: Can you translate “Muche tin tchul v’lo ch’garde!” to french?

Yes, I can translate “Muche tin tchul v’lo ch’garde!” to French. The translation is: “Cache ton vélo, le garde arrive!” This is a common warning phrase used in the Chti dialect to tell someone to hide their bike because the police are coming. The word “tchul” means “very” or “a lot” in Chti, so the phrase is essentially saying “Hide your bike, the police are coming very soon!” I hope this is helpful!

Now, my conversation with Llama 2 using the same prompts.


Question to Llama: Have you been trained on the Chti dialect from North of France?

Yes, I have been trained on a variety of languages and dialects, including the Chti dialect from north of France. While my training data includes a wide range of languages and dialects, it's important to note that I may not be perfect at understanding or responding in every single dialect or language. However, I'll do my best to assist you with any questions or requests you might have!

Question to Llama: How many sources in chti were available to train your model?

I was trained on a large corpus of text data that includes a variety of sources, including books, articles, and websites, among others. The exact number of sources that were used to train my model for the Chti dialect is not publicly available information, but rest assured that my training data includes a diverse range of texts and topics to help me understand and generate responses in various languages and dialects.
It's worth noting that the quality and quantity of training data can impact the performance of machine learning models like myself. The more diverse and extensive the training data, the better the model can understand and respond to different queries and contexts. However, even with a large amount of training data, there may still be instances where I don't fully understand or respond accurately to certain questions or requests, especially if they are highly specialized or use very specific terminology.

Question to Llama: What does mean "tchul" in chti language?

In the Chti dialect, "tchul" means "to eat". It's a verb that describes the act of consuming food. Here are some examples of how "tchul" can be used in sentences: * J'vas tchuler une pomme. (I ate an apple.) * Il est tchul, il a faim. (He is hungry, he needs to eat.) * Tchule que j'ai pas faim. (Eat, because I am not hungry.) In addition to its basic meaning, "tchul" can also be used in idiomatic expressions such as "tchuler des gribouilles" (to nibble on something) or "faire un tchul" (to have a snack).

Question to Llama: Can you translate "Muche tin tchul v'lo ch'garde!" to french?

Sure! "Muche tin tchul v'lo ch'garde!" is a sentence in the Chti dialect, and it translates to "Mangeons tranquille, voilà qui est gardé!" in French. Here's a word-for-word translation of the sentence: "Muche" = "manger" (to eat) "tin" = "tout" (all, everything) "tchul" = "manger" (to eat) (this is the verb conjugated in the imperative form) "v'lo" = "voilà" (there it is) "ch'garde" = "qui est gardé" (who/that is kept, preserved) So the full sentence in French would be: "Mangeons tranquille, voilà qui est gardé!" which means "Let's eat quietly, there it is, who/that is kept!"

Both Bard and Llama seem to provide sound explanations and are very convincing, but both of them are wrong.
According to the Wikipedia page, "tchul" means "butt/ass" in English, or "cul" in French.

Let's now have a look at OpenAI and GPT when it comes to larger dialects and languages.

Another way to measure how efficient a product is, is to read users' feedback and the answers provided by the vendor or the community.

The OpenAI Community forum is a good place to take the pulse of real-life problems and struggles. In 2021, questions were raised about the Greek language, and the answer from OpenAI staff back then was that the API is "best at English":

https://community.openai.com/t/use-api-with-the-greek-language/5443

In 2022, the same kind of answer came from OpenAI staff regarding a Spanish user's question. I quote: "most of our models are purposely trained to use mostly English":

https://community.openai.com/t/chatgpt-spanish-support/24706

A link to the GPT-3 paper is provided as support and explanation for this answer: “Although GPT-3’s training data is still primarily English (93% by word count), it also includes 7% of text in other languages. These languages are documented in the supplemental material.”

In 2023, the same question came up for Japanese; one answer from another member pointed to the official documentation.

https://community.openai.com/t/need-help-openai-japanese-language-support/426467

In 2023, a user had difficulty getting the model to answer in Portuguese. The English fallback behavior seems to be stronger than the instructions.

https://community.openai.com/t/how-force-the-api-to-really-follow-all-the-instructions/321455
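As an aside, the common workaround discussed in such threads is to pin the output language explicitly in the system message and repeat the constraint in the user message. Here is a minimal, hypothetical sketch with the OpenAI Python SDK; the model name and prompts are illustrative, and this only mitigates the problem.

    # Minimal, hypothetical sketch (OpenAI Python SDK v1): pinning the output
    # language in both the system and user messages. Model name and prompts are
    # illustrative; this mitigates, but does not remove, the English fallback.
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            # "You are an assistant. Always answer in Brazilian Portuguese."
            {"role": "system",
             "content": "Você é um assistente. Responda sempre em português do Brasil."},
            # "Summarize the following text in Portuguese: ..."
            {"role": "user",
             "content": "Resuma o seguinte texto em português: ..."},
        ],
        temperature=0.2,
    )
    print(response.choices[0].message.content)

Even with this kind of explicit instruction, the thread above reports outputs drifting back to English, which is exactly the point.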

The official message in the OpenAI FAQ is a bit ambiguous about the model's capabilities. Officially, OpenAI does not support languages other than English.

https://help.openai.com/en/articles/6742369-how-do-i-use-the-openai-api-in-different-languages

“While the OpenAI website is only available in English, you can use our models in other languages as well. The models are optimized for use in English, but many of them are robust enough to generate good results for a variety of languages.”

Only available in English, and there's a "but"? "Robust enough" is a vague claim, and it takes more investigation to understand why providing a clear answer is so complicated.

Can we reach the same level of quality and accuracy for all languages with today's multilingual models? What do the studies say?

We can find some answers in studies. The first one I read is about the Basque language, spoken by approximately 750,000 people.
It spots a performance difference between prompts written in English and in other languages:
"models are unable to leverage their full multilingual potential when prompted in non-English languages":

https://arxiv.org/abs/2308.01223

The second example is a Portuguese study pointing out that the generic English-centric models developed today are not an efficient solution: monolingual pretraining performs better.

https://arxiv.org/abs/2304.07880

The third example looks at how American culture can affect the outputs of GPT-3 by disregarding and overriding the inputs, a form of twisted cultural translation.

“We observed when values embedded in the input text were mutated in the generated outputs and noted when these conflicting values were more aligned with reported dominant US values.”
The conclusion says: “We provided some detail on results that indicate often when the embedded values of a text are altered, they are altered to be more in line with statistically reported dominant values of US citizenry.”

https://arxiv.org/abs/2203.07785

Can the new versions of GPT fix those problems? Can other languages be served as well as English?

If we come back to the marketing and read this, we understand that GPT-4 should be better than GPT-3:

https://openai.com/research/gpt-4

But to run the MMLU benchmark in languages other than English, OpenAI had to translate it, because the benchmarks on the market mainly exist in English, which is a blind spot in performance measurement:
“Many existing ML benchmarks are written in English.”

https://arxiv.org/pdf/2303.08774.pdf

In the same paper, we can read about limitations that are still present in GPT-4 because of the very core technology of GPT.

Hallucinations still exist: “Despite its capabilities, GPT-4 has similar limitations as earlier GPT models. Most importantly, it still is not fully reliable (it “hallucinates” facts and makes reasoning errors).”

The model is still stubborn and overconfident: “GPT-4 can also be confidently wrong in its predictions, not taking care to double-check work when it’s likely to make a mistake.”

We can add to this picture an article mentioning that OpenAI was not as transparent this time when releasing GPT-4:
“GPT-4 is the most secretive release the company has ever put out”; “OpenAI has chosen not to reveal how large GPT-4”

https://www.technologyreview.com/2023/03/14/1069823/gpt-4-is-bigger-and-better-chatgpt-openai/amp/

According to the vendor, the new version is better than the old one, and we have to trust them and accept it.

The question of the possibility of making an effective multilingual AI still remains.

As OpenAI does not really want to explain how they solved (or not) this problem, let's try to find another relevant source.

I found this article, “Making Generative AI Effectively Multilingual at Scale”, whose introduction is interesting and summarizes well the problem we face:
“The benefits and value of revolutionary technology like GPT-4 are much greater for English speakers over other languages.”

https://blog.modernmt.com/making-generative-ai-multilingual-at-scale/

It's worth reading the numbers and the root causes:

  • The resourcefulness gap describes the asymmetry in the availability of high-quality digitized text that can serve as training data.
  • Current models will only amplify this English bias further.
  • OpenAI's fine-tuning and performance evaluation of the model are primarily focused on English, "with a US-centric point of view."

This article also talks about concrete problems that occur today:
“Many users have documented and reported sub-optimal performance when searching with Bing Chat when they query in Spanish rather than English.”

“ChatGPT performed substantially worse at answering factual questions or summarizing complex text in non-English languages and was more likely to fabricate information.”

One of the sources used in this article comes from the Center for Democracy and Technology:
"Lost in Translation: Large Language Models in Non-English Content Analysis"

https://cdt.org/wp-content/uploads/2023/05/non-en-content-analysis-primer-051223-1203.pdf?ref=blog.modernmt.com

"Multilingual language models are designed to address these gaps in data availability by inferring semantic and grammatical connections between higher- and lower-resource languages, allowing the former to bootstrap the latter. However, this architecture raises its own concerns. Multilingual language models are still usually trained disproportionately on English language text and thus end up transferring values and assumptions encoded in English into other language contexts where they may not belong"

“For example, a multilingual model might associate the word “dove” in all languages with “peace” even though the Basque word for dove (“uso”) can be an insult.”

The statistics and numbers in “The Resourcedness Gap: Why the Largest Language Models are in English” (page 17) are self-explanatory.

Chapters 2.B.3, “MULTILINGUAL LANGUAGE MODELS DO NOT AND CANNOT WORK EQUALLY WELL IN ALL LANGUAGES”, and 2.B.4, “WHEN MULTILINGUAL LANGUAGE MODELS FAIL, THEIR PROBLEMS ARE HARD TO IDENTIFY, DIAGNOSE, AND FIX.”, highlight interesting problems.

“Multilingual language models not only do not work equally well in all languages but they cannot, since the more languages a multilingual model is trained on, the less it can capture unique traits of any specific languages”

“the Google researchers behind the Perspective API, a model for detecting “toxic” content, found that their model flagged tweets that used the Italian word “sfiga” (which roughly translates to “bad luck”) as hate speech because two of the three examples included in the training dataset that contained the subword “sfiga” were labeled as hate speech”
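To make the subword mechanics concrete, here is a small, hedged illustration using a generic multilingual tokenizer from the Hugging Face transformers library (not the Perspective API pipeline): rare, low-resource words tend to be split into several shared subword pieces, so whatever a model has learned about a piece bleeds into every word containing it.

    # Hedged illustration of subword fragmentation with a generic multilingual
    # tokenizer; this is not the Perspective API pipeline, just the mechanism.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

    for word in ["unlucky", "sfiga", "tchul"]:
        # Low-resource words usually break into more, less meaningful pieces,
        # so labels attached to a shared piece can contaminate unrelated words.
        print(word, "->", tokenizer.tokenize(word))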

Is English also the most represented language in other models like Llama or Bard? Yes.

When we ask Bard about the breakdown of its training data, we obtain this answer:


Bard's training dataset includes the following languages and their proportions:
  • English: 61.25%
  • Non-English: 38.75%
The non-English portion of the dataset is further divided into the following languages and their proportions:
  • Arabic: 6.25%
  • Chinese: 6.25%
  • French: 6.25%
  • German: 6.25%
  • Hindi: 6.25%
  • Japanese: 6.25%
  • Korean: 6.25%
  • Portuguese: 6.25%
  • Russian: 6.25%
  • Spanish: 6.25%
It is important to note that this is just a general breakdown of the languages in Bard's training dataset. The actual proportions of each language may vary depending on the specific dataset that was used to train Bard. Additionally, Bard is still under development, so its capabilities may vary depending on the language. Please let me know if you have any other questions.

In its Llama 2 paper, Meta provides statistics about the training of its model and warns users:
"Most data is in English, meaning that Llama 2 will perform best for English-language use cases." (The source PDF is accessible through the link on the following page; see page 20 of the PDF.)

https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/

BLOOM was built differently from GPT.

The choice of sources was more balanced and transparent.
We can see that English represents only about 30% of the training data, which means there is more room for other languages.

https://huggingface.co/bigscience/bloom#training-data

The BLOOM project has openness at its core and involved a large number of researchers.
“A group of over 1,000 AI researchers has created a multilingual large language model bigger than GPT-3—and they’re giving it out for free.”

https://www.technologyreview.com/2022/07/12/1055817/inside-a-radical-new-project-to-democratize-ai/

Open source, transparency and diversity: basically an anti-OpenAI.

Unlike OpenAI, language support is an official commitment in BLOOM:
"It can handle 46 of them, including French, Vietnamese, Mandarin, Indonesian, Catalan, 13 Indic languages (such as Hindi), and 20 African languages. Just over 30% of its training data was in English. The model also understands 13 programming languages".
"Hugging Face organized workshops with African AI researchers to try to find data sets such as records from local authorities or universities that could be used to train the model on African languages".

Is it enough to address our current issues?

Like other LLMs, BLOOM is flawed, and all the problems around alignment and human reinforcement remain on the table, costing time and energy.
If multilingual models are bound to fall short because they can't genuinely deliver diversity and accuracy today, what strategy do we have to make these tools better in languages other than English?
Simplifying the problem by specializing the models through training or fine-tuning could be a solution.

GPT-3 has been tested in Sweden in the education field.
This study from KTH had a simple purpose: "The purpose of this study was to improve education by leveraging GPT-3 as a teacher tool."
It leveraged GPT-3 as a question generator in Swedish for high-school teachers:

https://kth.diva-portal.org/smash/get/diva2:1789082/FULLTEXT01.pdf

“The consensus is that GPT-3 performed well, but not perfect on all aspects. The imperfection is not due to many errors being made, but rather that the questions on average does not reach the quality of teacher-created questions in terms of relevance, pedagogical ability, and adaptability to the students”

Sweden also started a project and invested in creating its own language model.

https://www.ai.se/en/news/what-would-you-do-large-generative-language-model

A matter of linguistic sovereignty and digital democracy.

https://medium.com/ai-sweden/why-do-we-need-a-large-gpt-for-swedish-9a0409877f46

What if the solution to performance and accuracy was not to grow bigger but to stay small?

If we look at the craftsmanship and initiatives of smaller actors, we can see that solutions exist and can be implemented at a lower price.
Harper Carroll is a machine learning enthusiast who shares her tutorials on fine-tuning existing models.
Tutorials are available here for both Llama 2 and Mistral. The idea is to make the process accessible, cost-effective and probably more performant; a rough sketch of the approach follows the link below.

https://www.harpercarroll.com/articles/ai/llm-finetune-own-data
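Her tutorials are built around parameter-efficient fine-tuning. As a rough sketch of that general approach (not her exact notebooks), here is what a LoRA fine-tune with the Hugging Face transformers and peft libraries typically looks like; the base model, dataset file and hyperparameters below are placeholders.

    # Rough sketch of a LoRA fine-tune with Hugging Face `transformers` + `peft`.
    # Base model, dataset file and hyperparameters are placeholders, not the
    # exact setup from the tutorials linked above.
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    base_model = "mistralai/Mistral-7B-v0.1"  # placeholder; any causal LM works
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(base_model)

    # Train small LoRA adapters instead of updating all of the model's weights.
    lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)

    # Any text corpus in the target language or domain; "my_corpus.txt" is hypothetical.
    data = load_dataset("text", data_files="my_corpus.txt")["train"]
    data = data.map(lambda row: tokenizer(row["text"], truncation=True, max_length=512),
                    remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=1,
                               gradient_accumulation_steps=8, num_train_epochs=1,
                               learning_rate=2e-4, logging_steps=10),
        train_dataset=data,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    model.save_pretrained("lora-out/adapter")  # only the small adapter is saved

In practice, tutorials like these typically also add 4-bit quantization (QLoRA) so that the base model fits on a single consumer GPU.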

Here's another interesting experiment with specialized models, from Alexander Doria.
The idea is to go back to the 17th century and chat with a bot "from" that era.

https://huggingface.co/Pclanglais/MonadGPT?text=Hey+my+name+is+Clara%21+How+are+you%3F

“MonadGPT is a finetune of Mistral-Hermes on 11,000 early modern texts in English, French and Latin, mostly coming from EEBO and Gallica.”

As the price of computing falls and the technical barrier lowers, we are entering a new phase in which open source will take up more and more space in the world of large language models.
Ever-bigger models are costly to create, maintain and filter.

This conclusion was also drawn, and identified as a challenge, in a February 2023 paper evaluating GPT models.

“One of the main benefits of training such costly models is to achieve high performance across diverse tasks and languages, but this demands more data across languages. Which poses several challenges for models scalability, diversity, and fairness. As a future research direction, we propose to tackle the challenge of enabling truly multilingual capability for such models that would enable the same capabilities across languages.”

https://arxiv.org/pdf/2302.09210.pdf

“Enabling truly multilingual capability for such models”? The word “truly” is important here: it means that today, this is not the case.

The "smaller" concurrence will be more efficient because they can focus and adapt faster

It also seems that the real competitive danger for OpenAI and Google comes from the smaller specialized models, especially the "open source"(*) ones. And this competition is already here.

(*) I use the common term "open source" here for clarity, but these models are not truly open source; they are free to use.

Here's part of the answer: “One half-day of training using a few hundred dollars yields similar results to mainstream large models, open-source and commercial-free domain-specific LLM solution”

https://www.hpc-ai.tech/blog/one-half-day-of-training-using-a-few-hundred-dollars-yields-similar-results-to-mainstream-large-models-open-source-and-commercial-free-domain-specific-llm-solution

And here's how Google and OpenAI have somehow seen this coming:
"Leaked Internal Google Document Claims Open Source AI Will Outcompete Google and OpenAI"

https://www.semianalysis.com/p/google-we-have-no-moat-and-neither

This sentence from the conclusion is important: “Large models aren’t more capable in the long run if we can iterate faster on small models”

The timeline at the end of the article is also interesting. Things move fast, but not in the way the marketing of paid features loudly advertises every day.

This is probably why OpenAI recently opened up the possibility for its users to create agents that embed their own content, bringing more specialization and customization to GPT through a mechanism that plays a role similar to fine-tuning.
They also decided to lower their prices in the face of growing competition from free alternatives.
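As a hedged sketch of the general mechanism behind "injecting your own content" (retrieval over embeddings, an assumption about the approach in general rather than OpenAI's internal agent implementation), the snippet below embeds a few documents, picks the one closest to the question, and passes it to the chat model as context; model names and texts are illustrative.

    # Hedged sketch of retrieval over embeddings; an assumption about the general
    # mechanism, not OpenAI's internal agent implementation. Names are illustrative.
    import numpy as np
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    documents = [
        "Ch'ti is a Picard dialect spoken in the north of France.",
        "BLOOM was trained on 46 natural languages and 13 programming languages.",
    ]

    def embed(texts):
        resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
        return np.array([item.embedding for item in resp.data])

    doc_vectors = embed(documents)

    question = "How many natural languages was BLOOM trained on?"
    q_vector = embed([question])[0]

    # Cosine similarity between the question and each document.
    scores = doc_vectors @ q_vector / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vector))
    context = documents[int(np.argmax(scores))]

    answer = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": f"Answer using only this context: {context}"},
            {"role": "user", "content": question},
        ],
    )
    print(answer.choices[0].message.content)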

They also opened the door to partnerships to obtain data from more diverse sources.
"For example, we recently partnered with the Icelandic Government and Miðeind ehf to improve GPT-4's ability to speak Icelandic by integrating their curated datasets."

https://openai.com/blog/data-partnerships

Being English-centric is not an anecdotal problem in the world of large language models, and this lack of diversity is showing its limits a little more every day.

Thanks for reading,
Julien

Published: 2023-11-12 07:00:00 +0000 UTC