Tech Tip: The Challenges of Multilingual AI in Rare Languages

Written by Capria Value-Add
January 19, 2024

Our recent exploration into the difficulties language models like GPT face with non-English languages, especially less common regional languages such as Telugu and Kannada, reveals critical insights. It also underscores the limitations of using general-purpose, pre-trained language models like ChatGPT for multilingual tasks.


Tokenization: The First Step in Language Processing

At the core of these language models, words are numeric data. The first step is tokenizing the input sentence: breaking it down into indexed numeric chunks. A tokenizer, a separate component within the language model pipeline, handles this task.
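A minimal sketch of that step, text in and numeric token IDs out. The vocabulary and IDs below are invented for illustration; real models like GPT learn subword vocabularies with tens of thousands of entries.

```python
# Toy vocabulary: each text piece maps to a numeric ID. Real tokenizers
# learn these pieces from data; these four are hypothetical.
TOY_VOCAB = {"Hi,": 0, " how": 1, " are": 2, " you?": 3}

def tokenize(text: str) -> list[int]:
    """Greedily match the longest vocabulary entry at each position."""
    ids = []
    i = 0
    while i < len(text):
        for piece in sorted(TOY_VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                ids.append(TOY_VOCAB[piece])
                i += len(piece)
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return ids

print(tokenize("Hi, how are you?"))  # [0, 1, 2, 3]
```

The model never sees the raw text, only this sequence of IDs, which is why the vocabulary's coverage of a language matters so much.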

The Complexity of Tokenization: Beyond Simple Breakdown

Tokenization is less straightforward than it might appear. A commonly used algorithm is Byte Pair Encoding (BPE), which starts with individual characters. The tokenizer then builds a vocabulary based on frequency, merging common character pairs into larger subwords. However, this vocabulary has a fixed size, which plays a crucial role in the challenges faced by regional languages.
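One BPE merge step can be sketched as follows: count adjacent symbol pairs across a corpus of words, then fuse the most frequent pair into a single new vocabulary entry. The tiny word-frequency corpus here is made up for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Hypothetical corpus: words start as tuples of single characters.
words = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}
pair = most_frequent_pair(words)  # ("l", "o"), seen 7 times
words = merge_pair(words, pair)   # "lo" is now a single vocabulary entry
```

Repeating this until the fixed vocabulary size is reached is what produces large English subwords; languages with less training data get far fewer merges.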

English Dominance in Training Data: A Hindrance for Rare Languages

Given that the training data for large language models is primarily a filtered subset of the internet, English often overshadows languages like Telugu in the dataset. While both languages start with individual characters, the dominance of English allows for the creation of larger subwords in its vocabulary. In contrast, Telugu struggles to develop similarly sized tokens due to the limited vocabulary space.

The Tokenization Gap: A Comparative Illustration

For instance, the English sentence “Hi, how are you?” might be tokenized into just four tokens. Conversely, the same sentiment expressed in Telugu could result in as many as 70 tokens, because far fewer large subword tokens exist for Telugu. This disparity is stark.
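A rough illustration of why the gap arises: subwords the vocabulary never learned fall back toward raw UTF-8 bytes, so a Telugu sentence can cost close to one token per byte in the worst case. The byte counts below are illustrative, not from any specific GPT tokenizer.

```python
english = "Hi, how are you?"
telugu = "మీరు ఎలా ఉన్నారు?"  # roughly the same greeting in Telugu

english_bytes = len(english.encode("utf-8"))  # 1 byte per ASCII character
telugu_bytes = len(telugu.encode("utf-8"))    # 3 bytes per Telugu character

# With well-developed English subwords, the sentence needs only a handful
# of tokens; without them, Telugu can pay roughly one token per byte.
print(english_bytes, telugu_bytes)
```

The same short greeting therefore occupies many times more of the model's context and output budget in Telugu than in English.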

Efficiency and Time Constraints: The Multilingual Challenge

Language models generate predictions token by token. Consequently, writing a word in Telugu might require predicting around 18 tokens, whereas English might need just one. Theoretically, this makes generation in Telugu up to 18 times slower than in English, highlighting the inefficiency and increased time consumption when using a general-purpose, pre-trained language model for regional languages.
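The slowdown above is simple arithmetic: one forward pass per predicted token, so per-word latency scales with the token count. The per-token latency below is a hypothetical figure for illustration.

```python
tokens_per_word_english = 1   # a common English word is often one token
tokens_per_word_telugu = 18   # the Telugu figure cited in the text
ms_per_token = 50             # hypothetical per-token generation latency

english_ms = tokens_per_word_english * ms_per_token
telugu_ms = tokens_per_word_telugu * ms_per_token
print(telugu_ms / english_ms)  # 18.0, i.e. 18x slower per word
```

Since most APIs also bill per token, the same ratio applies to cost, not just latency.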

Understanding these intricacies is crucial in developing more inclusive and efficient multilingual AI systems. As we continue to push the boundaries of AI, addressing these language-specific challenges remains a key priority.
