Introduction to Apache OpenNLP

Apr 30, 2023

Apache OpenNLP is a Java library that uses machine learning to process natural language text. It’s released under the Apache License, Version 2.0.

It supports common NLP tasks such as tokenization, sentence segmentation, part-of-speech tagging, and document categorization (which can be used for sentiment analysis).

To learn more about all the use cases that Apache OpenNLP offers, refer to the official manual at opennlp.apache.org.

Before we get into the examples, note that ML algorithms need a trained model. This model can either be provided by the framework itself or generated by the user.

In this article, we will go through three use cases:

One that doesn't need a trained model,
One that needs a model that Apache OpenNLP already provides,
And one for which we create our own training data and model.

For the use cases where OpenNLP itself offers a trained model (like Part of Speech Tagging), the model can be downloaded from http://opennlp.sourceforge.net/models-1.5/.
The models are available in multiple languages, so make sure to download the right one from the site. We chose English for this article, so the downloaded model has the prefix “en”.

Implementation

In a Maven-based Spring Boot application, we can add the Apache OpenNLP dependency to the pom file like this:

<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>2.0.0</version>
</dependency>

Case 1: Tokenization

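Tokenization is one of the most basic steps in NLP. It breaks text down into smaller units (tokens), such as words and punctuation marks.

The resulting tokens give a quick first view of what a text contains and serve as the input for later steps such as POS tagging.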

Apache OpenNLP provides an API in the SimpleTokenizer class to generate an array of tokens from a given text.

import opennlp.tools.tokenize.SimpleTokenizer;

SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
String[] tokens = tokenizer.tokenize("Your Service is bad");

The tokens generated by the above code are:

[Your, Service, is, bad]

Case 2: Part of Speech Tagging

For any ML algorithm to work well, it needs reliable training data. For this use case, OpenNLP provides a pre-trained model: en-pos-maxent.bin.

Using Part of Speech Tagging, also known as POS Tagging, we can extract nouns, verbs, adjectives, and so on from a given text.

This type of tagging is used in use cases like sentiment analysis, where only the verbs or adjectives from a given text are analyzed further.

Here is a simple Java method that extracts adverbs, adjectives, and verbs from a given text.
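
A minimal sketch of such a method could look like the following. It is an illustration rather than the original code: the class name PosFilter and the tagger parameter are assumptions, and the tagger is passed in as a function so that the filtering logic can run without a model file. It keeps every token whose Penn Treebank tag starts with RB (adverb), JJ (adjective), or VB (verb).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

class PosFilter {

    // Keeps the tokens tagged as adverb (RB*), adjective (JJ*) or verb (VB*).
    // The tagger must return exactly one tag per token, in the same order.
    static List<String> extract(String[] tokens, UnaryOperator<String[]> tagger) {
        String[] tags = tagger.apply(tokens);
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            String tag = tags[i];
            if (tag.startsWith("RB") || tag.startsWith("JJ") || tag.startsWith("VB")) {
                kept.add(tokens[i]);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] tokens = {"Your", "Service", "is", "bad"};
        // Stub tagger (assumption): returns fixed tags instead of loading a model.
        UnaryOperator<String[]> stubTagger = t -> new String[] {"PRP$", "NN", "VBZ", "JJ"};
        System.out.println(extract(tokens, stubTagger)); // prints [is, bad]
    }
}
```

With OpenNLP on the classpath, the tagger argument could be new POSTaggerME(new POSModel(modelIn))::tag, where modelIn is an InputStream over en-pos-maxent.bin, and the tokens could come from SimpleTokenizer as in Case 1.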

If we pass the text “Your Service is bad” to the above method, the POS tagger will tag the tokens generated from the text as:

Your_PRP$ Service_NN is_VBZ bad_JJ

and the output generated by the method would be:

[is, bad]

It’s difficult to remember what each tag like NN or VBZ stands for, so we can refer to this tag list: https://dpdearing.com/posts/2011/12/opennlp-part-of-speech-pos-tags-penn-english-treebank/.
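
For the handful of tags that appear in this article, a small lookup table can also do the job. The sketch below is only an illustration: the class name PennTagLookup is an assumption, and the map covers just five Penn Treebank tags out of the full set.

```java
import java.util.Map;

class PennTagLookup {

    // A few Penn Treebank tags; the full tag set is considerably larger.
    static final Map<String, String> DESCRIPTIONS = Map.of(
        "PRP$", "possessive pronoun",
        "NN", "noun, singular or mass",
        "VBZ", "verb, 3rd person singular present",
        "JJ", "adjective",
        "RB", "adverb");

    // Splits a "token_TAG" string at the last underscore and describes the tag.
    // Assumes the input contains at least one underscore.
    static String describe(String taggedToken) {
        int split = taggedToken.lastIndexOf('_');
        String token = taggedToken.substring(0, split);
        String tag = taggedToken.substring(split + 1);
        return token + ": " + DESCRIPTIONS.getOrDefault(tag, "unknown tag " + tag);
    }

    public static void main(String[] args) {
        for (String tagged : "Your_PRP$ Service_NN is_VBZ bad_JJ".split(" ")) {
            System.out.println(describe(tagged));
        }
    }
}
```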

Case 3: Sentiment Analyzer / Document Categorizer

A Sentiment Analyzer / Document Categorizer can be used to analyze the sentiment of a given text, such as a product review or a tweet.

This use case depends completely on the domain and the requirement at hand, so Apache OpenNLP doesn't provide a pre-trained model for it.

For use cases like these, the user has to generate their own trained model.

Once the model is generated, we use it to categorize an input text.

For example, if I had to separate happy speech from sad speech, I would have to train the model with ample data so that the ML algorithm can categorize a given input based on that.

First, let’s see how to generate a trained model.

Create a file, add training data to it, and save the file with a .txt extension.
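
As an illustration of the file layout, the sketch below writes a few samples in the one-sample-per-line format that OpenNLP's document categorizer reads: the category label comes first, followed by the whitespace-separated text. The sentences, the class name TrainingDataWriter, and the file name en-training-data.txt are all made up for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class TrainingDataWriter {

    // Hypothetical samples: the first token is the category, the rest is the text.
    static final List<String> SAMPLES = List.of(
        "Angry Your service is bad",
        "Angry I will not recommend this product",
        "Angry This is the worst experience I have had",
        "Neutral The package arrived on Monday",
        "Neutral I ordered two items last week",
        "Neutral The store opens at nine");

    // Writes one sample per line and returns the path that was written.
    static Path write(Path target) throws IOException {
        return Files.write(target, SAMPLES, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = write(Path.of("en-training-data.txt"));
        System.out.println("Wrote " + SAMPLES.size() + " samples to " + file);
    }
}
```

A real training file would need far more samples per category than this to produce a useful model.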

In the above data, Angry and Neutral are the two categories. Any input that’s run against a model trained on this data is categorized as either Angry or Neutral; if the algorithm cannot clearly decide on a category, it returns the first category from the set (Neutral).

Note that I created the above training data manually for learning purposes, so it may not be accurate enough for professional use.

Next, create the trained model file with a .bin extension. A Java program reads the training file, trains the document categorizer on it, and writes the resulting model to disk.
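
A minimal sketch of such a program, assuming the opennlp-tools dependency from the Implementation section is on the classpath, might look like this. The class name DoccatTrainer, the training file path, and the cutoff setting are assumptions made for illustration, not the original code.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

class DoccatTrainer {

    // Trains a document categorizer on trainingFile and writes the model to modelFile.
    // The parent directory of modelFile must already exist.
    static void train(File trainingFile, File modelFile) throws IOException {
        InputStreamFactory input = new MarkableFileInputStreamFactory(trainingFile);

        TrainingParameters params = TrainingParameters.defaultParams();
        // Assumption: a cutoff of 0 keeps rare words, which suits a tiny hand-made data set.
        params.put(TrainingParameters.CUTOFF_PARAM, 0);

        DoccatModel model;
        // Each line of the file becomes one DocumentSample (category + text).
        try (ObjectStream<DocumentSample> samples = new DocumentSampleStream(
                new PlainTextByLineStream(input, StandardCharsets.UTF_8))) {
            model = DocumentCategorizerME.train("en", samples, params, new DoccatFactory());
        }

        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(modelFile))) {
            model.serialize(out);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical input path; the output path matches the one loaded later in the article.
        train(new File("src/main/resources/nlp-model/en-training-data.txt"),
              new File("src/main/resources/nlp-model/en-trained-model.bin"));
    }
}
```

The default cutoff discards features that occur only a few times, which can leave nothing to train on when the data set is very small; that is why the sketch lowers it.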

Second, let’s use the generated model to categorize a given input:

// Load the trained model; try-with-resources closes the stream afterwards
File file = ResourceUtils.getFile("src/main/resources/nlp-model/en-trained-model.bin");
DoccatModel m;
try (InputStream in = new FileInputStream(file)) {
    m = new DoccatModel(in);
}
DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
// categorize() expects the input text as an array of tokens
String[] inputText = input.split(" ");
double[] outcomes = myCategorizer.categorize(inputText);
String category = myCategorizer.getBestCategory(outcomes);

For the input text “I will not recommend this product”, the category generated in the output is “Angry”.

That’s it, we just learned three use cases of Apache OpenNLP. We also found out how to use pre-trained models and how to create our own.

The complete code for this article is available on GitHub. You can run the unit tests under the test package to try out all three use cases.

Leave a star on the repository if you like my work 👩‍🏭

Thank you for reading The NonConformist Techie. This post is public, so feel free to share it.
