Our recent exploration of how language models like GPT handle non-English languages, especially less common regional languages such as Telugu and Kannada, reveals critical insights. It underscores the limitations of using general-purpose language models like ChatGPT for multilingual tasks.
Tokenization: The First Step in Language Processing
At the core of these language models, text is numeric data. The first step is to tokenize the input sentence, breaking it into chunks that each map to an index in a vocabulary. A tokenizer, a separate component that sits in front of the language model, handles this task.
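As a minimal sketch of this mapping, the toy tokenizer below greedily matches the longest known subword and emits its integer ID. The vocabulary here is invented purely for illustration; real tokenizers learn theirs from data.

```python
# A toy tokenizer: maps known subwords to integer IDs using greedy
# longest-match lookup. The vocabulary is hypothetical, chosen just to
# cover one example sentence.
VOCAB = {"Hi": 0, ",": 1, " how": 2, " are": 3, " you": 4, "?": 5}

def tokenize(text: str) -> list[int]:
    ids = []
    while text:
        # Greedily take the longest vocabulary entry that prefixes the text.
        match = max((s for s in VOCAB if text.startswith(s)), key=len, default=None)
        if match is None:
            raise ValueError(f"no token covers: {text!r}")
        ids.append(VOCAB[match])
        text = text[len(match):]
    return ids

print(tokenize("Hi, how are you?"))  # → [0, 1, 2, 3, 4, 5]
```

The model never sees the letters, only the resulting list of indices.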
The Complexity of Tokenization: Beyond Simple Breakdown
Tokenization is less straightforward than it sounds. A commonly used algorithm is Byte Pair Encoding (BPE), which starts from individual characters and repeatedly merges the most frequent adjacent pair into a larger subword, gradually building a vocabulary. However, this vocabulary has a fixed size, and that limit plays a crucial role in the challenges regional languages face.
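The merge loop at the heart of BPE can be sketched in a few lines. This is a simplified version on a tiny invented corpus; production tokenizers run tens of thousands of merges over far larger data.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus; return the commonest."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if tuple(symbols[i:i + 2]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Word frequencies, each word split into characters (toy corpus).
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for _ in range(3):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged:", pair)
```

After a few iterations, frequent character runs like "low" become single vocabulary entries.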
English Dominance in Training Data: A Hindrance for Rare Languages
Because the training data for large language models is mostly a filtered subset of the internet, English vastly outweighs languages like Telugu in the dataset. Both languages start from individual characters, but English's dominance means the merge process allocates most of the larger subwords to English. Telugu, in contrast, struggles to earn similarly sized tokens within the limited vocabulary budget.
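A tiny frequency count makes the imbalance concrete. In this sketch the corpus is 95% English and 5% Telugu (both words are illustrative samples, and the skew is invented but directionally realistic): every merge a frequency-driven tokenizer would pick first is an English pair.

```python
from collections import Counter

# Toy mixed corpus where English dominates, as it does in web-scale data.
corpus = ["hello"] * 95 + ["హలో"] * 5  # హలో: Telugu rendering of "hello"

pairs = Counter()
for word in corpus:
    for a, b in zip(word, word[1:]):
        pairs[(a, b)] += 1

# The highest-frequency pairs — the ones merged first — are all English.
print(pairs.most_common(3))
```

With a capped vocabulary, every slot spent on an English merge is a slot a Telugu subword never gets.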
The Tokenization Gap: A Comparative Illustration
For instance, the English sentence “Hi, how are you?” might be tokenized into just four tokens. The same sentiment expressed in Telugu could result in as many as 70 tokens, because few larger Telugu subwords ever made it into the vocabulary. The disparity is stark.
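Part of the gap is visible even before any merges are learned. Byte-level tokenizers fall back to one token per byte, and Telugu codepoints take three bytes each in UTF-8, so the baseline cost per character is already tripled. (The Telugu sentence below is an illustrative rendering of the same greeting.)

```python
# Byte counts set the worst-case token count for a byte-level tokenizer:
# with no learned merges, one byte = one token.
en = "Hi, how are you?"
te = "హాయ్, మీరు ఎలా ఉన్నారు?"  # the same greeting in Telugu (illustrative)

print(len(en.encode("utf-8")))  # 16 bytes → at worst 16 byte-level tokens
print(len(te.encode("utf-8")))  # several times more bytes, hence tokens
```

English starts at one byte per character and then gets generous merges on top; Telugu starts at three bytes per character and gets few merges.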
Efficiency and Time Constraints: The Multilingual Challenge
Language models generate output one token at a time. Consequently, a single Telugu word might require around 18 token predictions where an English word needs just one. In theory, this makes generation in Telugu up to 18 times slower than in English, highlighting the inefficiency and increased time consumption of using a general-purpose, pre-trained language model for regional languages.
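The slowdown follows directly from the token counts, since autoregressive decoding spends one forward pass per token. A back-of-the-envelope estimate (the per-token latency below is hypothetical; the 18-token figure is from the text above):

```python
# Wall-clock generation time scales linearly with tokens emitted.
tokens_per_word_en = 1     # a common English word is often one token
tokens_per_word_te = 18    # worst-case Telugu figure cited above
latency_per_token_ms = 50  # hypothetical per-token decode latency

en_ms = tokens_per_word_en * latency_per_token_ms
te_ms = tokens_per_word_te * latency_per_token_ms
print(te_ms / en_ms)  # → 18.0: the slowdown tracks the token ratio
```

Whatever the real per-token latency, the relative slowdown equals the token-count ratio, so a fatter tokenization directly inflates both latency and per-word inference cost.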
Understanding these intricacies is crucial in developing more inclusive and efficient multilingual AI systems. As we continue to push the boundaries of AI, addressing these language-specific challenges remains a key priority.
