Large language models are not truly performant polyglots: how to escape the English-centric trap?
Many people claim that large language models, and ChatGPT in particular, can speak and answer in almost any language as well as a human.
These marketing messages take a considerable shortcut by advertising these tools as proficient polyglots.
The progress in natural language processing is real, but significant problems and limitations remain.
Are these tools equally effective in all languages, as some of us seem to think?
Isn't English better represented in the training data, and don't English prompts therefore receive better outputs?
How are these models trained, and how important are the choice and volume of the initial corpus? How do vendors fight hallucinations and cultural biases?