Understanding Tokenizers in AI Models

In this article, we will explore tokenizers in the AI context.
We have briefly covered this topic in an older article that explored the basic features of Apache OpenNLP. If you haven't already, do check that as well.
In AI, tokenizers are tools or algorithms that break down natural language text into smaller units which are known as tokens. These tokens can be individual words, subwords, or even characters, depending on the specific tokenizer and its configuration.
Tokenization is an essential step in natural language processing (NLP) tasks because it converts raw text into a format that machine learning algorithms can process.
By splitting text into tokens, tokenizers provide a structured representation of the input data, enabling subsequent analysis, feature extraction, or modeling tasks.
Tokenizers are designed to handle different types of text data, including languages with complex grammatical structures, punctuation, and other linguistic variations.
They often consider contextual information, such as word order and sentence structure, to determine the appropriate token boundaries.
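To make the idea concrete, here is a minimal, library-free sketch (written for this article, not taken from any particular tokenizer) that splits the same sentence at the word level and at the character level; subword tokenization needs a learned vocabulary, so it appears later in the Hugging Face example.
# Naive word-level tokenization: split on whitespace (real tokenizers also separate punctuation)
text = "Tokenization is an important step in natural language processing."
word_tokens = text.split()
print(word_tokens)  # note that 'processing.' keeps its trailing period
# Character-level tokenization: every character becomes its own token
char_tokens = list(text)
print(char_tokens[:12])  # ['T', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']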
There are various types of tokenizers used in AI, including rule-based tokenizers, statistical tokenizers, and neural network-based tokenizers.
Rule-based tokenizers rely on predefined rules and patterns to segment text, while statistical tokenizers use statistical models to identify token boundaries based on patterns found in the training data. Neural network-based tokenizers employ machine learning techniques, often leveraging deep learning models, to learn tokenization patterns directly from data.
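To illustrate the rule-based approach, here is a minimal sketch that uses a single regular expression as its "rule": a token is either a run of word characters or a single punctuation mark. Production rule-based tokenizers use far richer rule sets, so treat this only as a toy example.
import re
# Rule: a token is a run of word characters OR a single non-space, non-word character
pattern = re.compile(r"\w+|[^\w\s]")
text = "Tokenization is an important step in natural language processing."
print(pattern.findall(text))
# ['Tokenization', 'is', 'an', 'important', 'step', 'in', 'natural', 'language', 'processing', '.']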
Popular tokenization libraries in the AI community include NLTK (Natural Language Toolkit), spaCy, and the tokenizers library developed by Hugging Face.
These libraries provide pre-trained tokenizers for various languages and offer customizable options to adapt tokenization behavior to specific use cases.
Let's go through a few basic examples of each library.
NLTK
NLTK is a popular Python library for NLP tasks. It provides various tokenizers, including word tokenizers and sentence tokenizers.
Here's an example of word tokenization using NLTK:
import nltk
nltk.download('punkt')  # download the Punkt models used by NLTK's tokenizers (only needed once)
from nltk.tokenize import word_tokenize
text = "Tokenization is an important step in natural language processing."
tokens = word_tokenize(text)
print(tokens)
Output:
['Tokenization', 'is', 'an', 'important', 'step', 'in', 'natural', 'language', 'processing', '.']
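NLTK's sentence tokenizer works the same way; the short example below reuses the 'punkt' models downloaded above to split a small piece of text into sentences.
from nltk.tokenize import sent_tokenize
text = "Tokenization is an important step. It prepares text for NLP models."
print(sent_tokenize(text))
# ['Tokenization is an important step.', 'It prepares text for NLP models.']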
spaCy
spaCy is another widely used NLP library in Python. It offers an efficient tokenizer that provides linguistic annotations and detailed token information.
Here's an example:
import spacy
nlp = spacy.load('en_core_web_sm')  # load spaCy's small English pipeline
text = "Tokenization is an important step in natural language processing."
doc = nlp(text)
tokens = [token.text for token in doc]
print(tokens)
Output:
['Tokenization', 'is', 'an', 'important', 'step', 'in', 'natural', 'language', 'processing', '.']
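Because spaCy attaches linguistic annotations to each token, the same doc object can also report attributes such as the part-of-speech tag or whether a token is punctuation. A small follow-up, reusing the doc created above:
# Each spaCy token carries annotations beyond its raw text
for token in doc:
    print(token.text, token.pos_, token.is_punct)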
Transformers Library by Hugging Face
The Transformers library by Hugging Face provides state-of-the-art models and tokenization tools for various NLP tasks. It offers tokenizers for popular pre-trained models like BERT, GPT-2, and T5.
Here's an example using the tokenizer from the Transformers library:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')  # pre-trained BERT (uncased) tokenizer
text = "Tokenization is an important step in natural language processing."
tokens = tokenizer.tokenize(text)
print(tokens)
Output:
['token', '##ization', 'is', 'an', 'important', 'step', 'in', 'natural', 'language', 'processing', '.']
Did you notice the "##" added to some of the tokens? Unlike with the other libraries, the word "Tokenization" is not treated as a single token here.
This technique is known as WordPiece tokenization.
What is WordPiece tokenization?
Adding ## to tokens in Hugging Face's Transformers library is related to the subword tokenization technique called WordPiece tokenization. This technique is commonly used in models like BERT (Bidirectional Encoder Representations from Transformers).
In WordPiece tokenization, words are split into subwords or "word pieces" to handle out-of-vocabulary (OOV) words and to capture morphological variations. The ## prefix is used to indicate that a subword is a continuation of a previous subword within a word.
For example, consider the word "tokenization." After applying WordPiece tokenization, it may be split into two subwords: "token" and "##ization". The ## prefix indicates that "##ization" is a continuation of "token" and should be combined when reconstructing the original word.
The purpose of adding ## to tokens is to maintain the integrity of the original word while capturing subword information. This allows the model to handle OOV words and recognize similar subword patterns across different words, which can be beneficial for understanding the semantics and context of the text.
During training, models that use WordPiece tokenization learn to handle these subword tokens and generate appropriate representations. When decoding or generating text with the model, the ## markers are used to reconstruct the words correctly.
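To see that round trip in code, the snippet below reuses the tokenizer and tokens from the BERT example above and calls convert_tokens_to_string, which merges the ## continuations back into whole words (lowercased, since the model is uncased):
# Merge the WordPiece subwords back into a readable string
print(tokenizer.convert_tokens_to_string(tokens))
# roughly: tokenization is an important step in natural language processing .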
Not all tokenizers in Hugging Face's Transformers library use WordPiece tokenization. Some models may use different tokenization techniques, such as Byte Pair Encoding (BPE), which may result in different tokenization patterns.
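For contrast, here is the same sentence run through the GPT-2 tokenizer, which uses byte-level BPE. Instead of ## continuations, it marks tokens that begin a new word with a special 'Ġ' character standing in for the preceding space; the exact splits depend on the vocabulary learned during training.
from transformers import AutoTokenizer
gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')  # GPT-2 uses byte-level BPE rather than WordPiece
text = "Tokenization is an important step in natural language processing."
print(gpt2_tokenizer.tokenize(text))
# e.g. ['Token', 'ization', 'Ġis', 'Ġan', ...] (the leading Ġ encodes the space before a word)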
Fun Stuff
Want to see how GPT-3 and Codex models generate tokens? Try this tool provided by OpenAI: https://platform.openai.com/tokenizer.