Understanding Tokenizers in AI Models

In this article, we will explore tokenizers in the AI context.
We have briefly covered this topic in an older article that explored the basic features of Apache OpenNLP. If you haven't already, do check that as well.
In AI, tokenizers are tools or algorithms that break down natural language text into smaller units which are known as tokens. These tokens can be individual words, subwords, or even characters, depending on the specific tokenizer and its configuration.
Tokenization is an essential step in natural language processing (NLP) tasks because it converts raw text into a format that machine learning algorithms can process.
By splitting text into tokens, tokenizers provide a structured representation of the input data, enabling subsequent analysis, feature extraction, or modeling tasks.
Tokenizers are designed to handle different types of text data, including languages with complex grammatical structures, punctuation, and other linguistic variations.
They often consider contextual information, such as word order and sentence structure, to determine the appropriate token boundaries.
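To make the idea concrete, here is a minimal, library-free sketch (written for this article, not taken from any particular tokenizer) that splits the same sentence at the word level and at the character level; subword tokenization needs a learned vocabulary, so it appears later in the Hugging Face example.
# Naive word-level tokenization: split on whitespace (real tokenizers also separate punctuation)
text = "Tokenization is an important step in natural language processing."
word_tokens = text.split()
print(word_tokens)  # note that 'processing.' keeps its trailing period
# Character-level tokenization: every character becomes its own token
char_tokens = list(text)
print(char_tokens[:12])  # ['T', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']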
There are various types of tokenizers used in AI, including rule-based tokenizers, statistical tokenizers, and neural network-based tokenizers.
Rule-based tokenizers rely on predefined rules and patterns to segment text, while statistical tokenizers use statistical models to identify token boundaries based on patterns found in the training data. Neural network-based tokenizers employ machine learning techniques, often leveraging deep learning models, to learn tokenization patterns directly from data.
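To illustrate the rule-based approach, here is a minimal sketch that uses a single regular expression as its "rule": a token is either a run of word characters or a single punctuation mark. Production rule-based tokenizers use far richer rule sets, so treat this only as a toy example.
import re
# Rule: a token is a run of word characters OR a single non-space, non-word character
pattern = re.compile(r"\w+|[^\w\s]")
text = "Tokenization is an important step in natural language processing."
print(pattern.findall(text))
# ['Tokenization', 'is', 'an', 'important', 'step', 'in', 'natural', 'language', 'processing', '.']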
Popular tokenization libraries in the AI community include NLTK (Natural Language Toolkit), spaCy, and the tokenizers library developed by Hugging Face.
These libraries provide pre-trained tokenizers for various languages and offer customizable options to adapt tokenization behavior to specific use cases.
Let's go through a few basic examples of each library.
NLTK
NLTK is a popular Python library for NLP tasks. It provides various tokenizers, including word tokenizers and sentence tokenizers.
Here's an example of word tokenization using NLTK:
import nltk
nltk.download('punkt')  # download the Punkt models used by NLTK's tokenizers (only needed once)
from nltk.tokenize import word_tokenize
text = "Tokenization is an important step in natural language processing."
tokens = word_tokenize(text)
print(tokens)
Output:
['Tokenization', 'is', 'an', 'important', 'step', 'in', 'natural', 'language', 'processing', '.']
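NLTK's sentence tokenizer works the same way; the short example below reuses the 'punkt' models downloaded above to split a small piece of text into sentences.
from nltk.tokenize import sent_tokenize
text = "Tokenization is an important step. It prepares text for NLP models."
print(sent_tokenize(text))
# ['Tokenization is an important step.', 'It prepares text for NLP models.']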
spaCy
spaCy is another widely used NLP library in Python. It offers an efficient tokenizer that provides linguistic annotations and detailed token information.
Here's an example:
import spacy
nlp = spacy.load('en_core_web_sm')  # load spaCy's small English pipeline
text = "Tokenization is an important step in natural language processing."
doc = nlp(text)
tokens = [token.text for token in doc]
print(tokens)
Output:
['Tokenization', 'is', 'an', 'important', 'step', 'in', 'natural', 'language', 'processing', '.']
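Because spaCy attaches linguistic annotations to each token, the same doc object can also report attributes such as the part-of-speech tag or whether a token is punctuation. A small follow-up, reusing the doc created above:
# Each spaCy token carries annotations beyond its raw text
for token in doc:
    print(token.text, token.pos_, token.is_punct)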
Transformers Library by Hugging Face
The Transformers library by Hugging Face provides state-of-the-art models and tokenization tools for various NLP tasks. It offers tokenizers for popular pre-trained models like BERT, GPT-2, and T5.
Here's an example using the tokenizer from the Transformers library:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')  # pre-trained BERT (uncased) tokenizer
text = "Tokenization is an important step in natural language processing."
tokens = tokenizer.tokenize(text)
print(tokens)
Output:
['token', '##ization', 'is', 'an', 'important', 'step', 'in', 'natural', 'language', 'processing', '.']
Did you notice the "##" added to some of the tokens? Unlike with the other libraries, the word "Tokenization" is not treated as a single token here.
This technique is known as WordPiece tokenization.
What is WordPiece tokenization?
Adding ## to tokens in Hugging Face's Transformers library is related to the subword tokenization technique called WordPiece tokenization. This technique is commonly used in models like BERT (Bidirectional Encoder Representations from Transformers).
In WordPiece tokenization, words are split into subwords or "word pieces" to handle out-of-vocabulary (OOV) words and to capture morphological variations. The ## prefix is used to indicate that a subword is a continuation of a previous subword within a word.
For example, consider the word "tokenization." After applying WordPiece tokenization, it may be split into two subwords: "token" and "##ization". The ## prefix indicates that "##ization" is a continuation of "token" and should be combined when reconstructing the original word.
The purpose of adding ## to tokens is to maintain the integrity of the original word while capturing subword information. This allows the model to handle OOV words and recognize similar subword patterns across different words, which can be beneficial for understanding the semantics and context of the text.
During training, models that use WordPiece tokenization learn to handle these subword tokens and generate appropriate representations. When decoding or generating text with the model, the ## markers are used to reconstruct the words correctly.
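To see that round trip in code, the snippet below reuses the tokenizer and tokens from the BERT example above and calls convert_tokens_to_string, which merges the ## continuations back into whole words (lowercased, since the model is uncased):
# Merge the WordPiece subwords back into a readable string
print(tokenizer.convert_tokens_to_string(tokens))
# roughly: tokenization is an important step in natural language processing .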
Not all tokenizers in Hugging Face's Transformers library use WordPiece tokenization. Some models may use different tokenization techniques, such as Byte Pair Encoding (BPE), which may result in different tokenization patterns.
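For contrast, here is the same sentence run through the GPT-2 tokenizer, which uses byte-level BPE. Instead of ## continuations, it marks tokens that begin a new word with a special 'Ġ' character standing in for the preceding space; the exact splits depend on the vocabulary learned during training.
from transformers import AutoTokenizer
gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')  # GPT-2 uses byte-level BPE rather than WordPiece
text = "Tokenization is an important step in natural language processing."
print(gpt2_tokenizer.tokenize(text))
# e.g. ['Token', 'ization', 'Ġis', 'Ġan', ...] (the leading Ġ encodes the space before a word)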
Fun Stuff
Want to see how GPT-3 and Codex models generate tokens? Try this tool provided by OpenAI: https://platform.openai.com/tokenizer.