In February 2018, I wrote an article about ten interesting Python libraries for Natural Language Processing (NLP).
Compared to 2018, the NLP landscape has widened further, and the field has gained even more traction. While not a perfect measurement, the large number of available libraries and packages is a good indicator of how much (openly accessible) material is out there. At the time of writing, looking for “NLP” on PyPI leads to more than 1,560 results. Similarly, Libraries.io lists 450 Python packages/libraries for the NLP keyword. Therefore, I thought it was time to have a look at some interesting NLP libraries again. Please note, however, that this is not meant as a replacement for the 2018 article but rather as an update and addition. So, if you have not done so yet, please also look at the original list, where I made some more general comments about NLP and NLP libraries.
Since I wrote that first article, fortunately, a lot has changed in the field of NLP, and we have seen a lot of interesting progress. Things are moving fast! The biggest change, at least from my perspective, is the much more widespread adoption of neural networks, deep learning, and particularly Transformer-based models (see, for example, BERT (late 2018), GPT-2 (2019), and most recently GPT-3 (2020)). While ‘traditional’ approaches to NLP and computational linguistics are still going strong, and rightfully so, these new models and methods are moving the field forward and allow us to do new and exciting things. Hence, I tried to reflect this change in the list of Python libraries as well.
Top 10 Python Libraries
Disclaimer: In the following I’m going to use the terms ‘library’, ‘module’, and ‘package’ rather loosely. Please don’t give me a hard time about it. For the sake of this article, a library will be defined as a (collection of) (third party) extension(s) that you can use in your own code that provides some additional functionality.
As I said before, there is a myriad of tools and libraries to choose from. Hence it is extremely hard to come up with a reasonable “Top 10 List”. The list I am presenting here has been compiled following a few, rather loose, criteria: I wanted to showcase actively developed, widely used, and multipurpose libraries that are useful to as many developers and researchers as possible. I also took the number of GitHub stars (as a proxy for spread and popularity) as well as the date of the last commit into account. Ultimately, however, the list is also influenced by my experience and my view of what is useful. I am certain that there will be readers who would have liked to see a very different set of libraries - that’s perfectly fine and reasonable!
Where it makes sense, I have provided short examples for the libraries. These examples by no means demonstrate the full capabilities of these tools. They are merely there for you to get a feeling for how the APIs look and feel 😊.
|Library||Outstanding Functions/Features||GitHub Stars||Last Commit|
|🤗 Transformers||It provides easy and elegant access to a large number of powerful models (Transformer-based) that can be used directly or finetuned using PyTorch or TensorFlow.||32,900||2020-08-26|
|spaCy||It is a powerful all-purpose library with a great API, an excellent ecosystem, and good support for Deep Learning.||17,100 (+ 8,757 since 2018)||2020-08-26|
|Gensim||It is great for highly efficient and scalable topic/semantic modeling. It also provides easy access to Word2Vec and fastText.||11,100 (+ 6,312 since 2018)||2020-08-26|
|NLTK||It is the ‘mother’ of all NLP libraries and works excellently for educational purposes. Also, it is still the de-facto standard for many NLP tasks.||9,200 (+ 3,178 since 2018)||2020-07-30|
|flair||It is easy-to-use while providing powerful features such as stacking embeddings. It also gives you access to their custom “Flair embeddings.”||9,200||2020-08-24|
|AllenNLP||It allows you to build better NLP models by providing high-level abstractions. It also has a very good CLI for running pre-trained models.||8,900||2020-08-26|
|TextBlob||It is an easy-to-use toolkit that works really well for common, more ‘traditional’ NLP tasks.||7,200 (+ 2,393 since 2018)||2020-06-23|
|Stanza||It provides a complete and robust NLP pipeline and works very well as an all-purpose library. It is language-agnostic and currently supports 66 languages with pre-trained neural models. It also provides access to Stanford’s CoreNLP.||4,600||2020-08-14|
|NLP Architect||It allows you to explore and experiment with various state-of-the-art techniques.||2,500||2020-07-29|
|Spark NLP||It allows you to run complex NLP pipelines in a distributed environment. It also comes with more than 220 pre-trained pipelines and models that can be easily used.||1,400||2020-07-29|
The Top 10 Python NLP Libraries. The ‘+ X since 2018’ notes refer to the number of stars reported in the 2018 article. “GitHub Stars” and “Last Commit” as of 2020-08-26.
As you might have noticed, my 2020 list differs significantly from the 2018 one. Compared to the old list, the following libraries have been removed or replaced: Pattern, Polyglot, Vocabulary, PyNLPl, Stanford CoreNLP Python, and MontyLingua. These libraries have either been replaced by a newly developed one (e.g., Stanza replacing Stanford CoreNLP Python) or are not actively developed or maintained anymore. The exclusion of these libraries does not mean that they are bad or unusable in any way. However, I simply wanted to highlight actively developed projects as well as libraries that represent current developments in the field.
As you can see, the libraries that stayed on the list, measured in the number of GitHub stars, have all gained in popularity. Especially spaCy (first released in 2015), an absolutely fantastic all-purpose library, has gained a lot of traction. Also, again looking at the stars, we can see the huge interest in Transformers (🤗 Transformers having 32.9k stars).
An important side note: Many of these libraries and associated pre-trained models, despite often being multilingual or even language-agnostic, still have a strong focus on (standard) English. Similarly, a large amount of linguistic and NLP research focuses on English and other “major” languages. It is very important for me to point out that “language” should not be equated with (standard) English (or Spanish, or German, or French, …) and that the marginalization of other languages and varieties is a severe problem. Therefore, I want to point out that there are various research projects (see, for example, Masakhane for African languages) devoted to other languages that deserve both attention and support.
Libraries in Detail
This article will only focus on those libraries that I have newly added to the list. For details and examples regarding the other libraries, please just have a look at the previous article.
Hugging Face, hence the 🤗, has released a number of highly interesting NLP libraries, utilities, and datasets/models. The most widely known one is 🤗 Transformers, which provides easy access to many pre-trained models as well as great interfaces to PyTorch and TensorFlow. Contrary to many of the other libraries, 🤗 Transformers is not a “toolkit,” but a very useful and powerful way of working with various language models and embeddings that is particularly optimized for NLU and NLG tasks.
For this brief example, we will be using the pipeline() function to run pre-trained models for various common tasks. Essentially, pipeline() automatically retrieves (downloads) the desired model and also handles the preprocessing (especially tokenization) for us. If you are interested in what happens under the hood, definitely look at the documentation that nicely outlines all of the steps taken.
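A minimal sketch of such a pipeline() call could look like the following. It is not the original listing: the helper names (load_classifier, stars_from_label, rate_reviews) are made up for illustration, and it assumes that the model is published on the Hugging Face hub as nlptown/bert-base-multilingual-uncased-sentiment and that its labels have the form "4 stars".

```python
# Assumption: hub ID of the sentiment model discussed in this section.
MODEL_NAME = "nlptown/bert-base-multilingual-uncased-sentiment"


def load_classifier():
    """Return a sentiment pipeline; the model is downloaded on first use."""
    # Imported lazily so the helpers below can be used without transformers.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_NAME)


def stars_from_label(label):
    """Turn a label such as '4 stars' or '1 star' into an integer."""
    return int(label.split()[0])


def rate_reviews(reviews, classifier=None):
    """Return a (review, stars, confidence) tuple for every review."""
    if classifier is None:
        classifier = load_classifier()
    # The pipeline accepts a list of strings and returns one dict per input,
    # each with a 'label' and a 'score' key.
    predictions = classifier(reviews)
    return [
        (review, stars_from_label(prediction["label"]), prediction["score"])
        for review, prediction in zip(reviews, predictions)
    ]
```

Calling rate_reviews() with a list of review texts and no classifier argument loads the model first, which needs a network connection on the first run.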
In the above example, the bert-base-multilingual-uncased-sentiment model is used to predict the sentiment of three statements (reviews). This particular model has been finetuned for sentiment analysis on product reviews and returns a rating between one and five stars. As we can see, for the three simple examples, it performed really well and produced the desired results.
While this example demonstrated how easy it is to get started with 🤗 Transformers, the library’s true value lies in its capabilities when it comes to (re)training/finetuning and customizing models for specific use-cases using either PyTorch or TensorFlow. Also, definitely have a look at the large set of available models for various languages and tasks provided by Hugging Face.
flair, developed at the Humboldt University of Berlin in collaboration with Zalando Research, provides very easy access to powerful state-of-the-art NLP embeddings and models. While flair is built specifically with the powerful “Flair embeddings” (see “Contextual String Embeddings for Sequence Labeling” for more details) in mind, the library also supports embeddings such as BERT, ELMo, and GloVe.
It comes with a number of embeddings and pre-trained models for various languages and tasks including, for example, Named Entity Recognition, Part-of-Speech Tagging, and Semantic Frame Detection. For each tagging task, flair can use a different model, allowing you to customize your pipeline very well. As we will see below, the library also supports stacking embeddings, which is an exciting concept. Besides, flair also supports biomedical data (biomedical NER), which leads to various interesting additional use-cases outside of NLP and linguistics.
In this example, we are using a MultiTagger to tag/annotate a Sentence object for Named Entities and Parts-of-Speech. We’re also adding a custom label Python to the second token of the sentence. Without further configuration, flair will pick appropriate models. In this case, en-pos-ontonotes-v0.5 (POS based on OntoNotes) and en-ner-conll03-v0.4 (NER based on CoNLL-03) were used to tag the sentence.
While the proposed “Flair embeddings” are interesting in themselves, the core feature of the library is the ability to easily use different types of embeddings together.
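Combining embeddings can be sketched roughly as follows. This is an illustration rather than the original listing: the function names build_stack and embed_tokens are hypothetical, and the model identifiers "glove", "news-forward", and "news-backward" are assumed to be the ones you want to download.

```python
def build_stack():
    """Combine GloVe word vectors with forward/backward Flair embeddings."""
    # Imported lazily: flair pulls in PyTorch and downloads models on first use.
    from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings

    return StackedEmbeddings([
        WordEmbeddings("glove"),
        FlairEmbeddings("news-forward"),
        FlairEmbeddings("news-backward"),
    ])


def embed_tokens(sentence, embeddings):
    """Embed a sentence in place and return (token text, vector length) pairs."""
    # embed() attaches one combined vector to every token of the sentence.
    embeddings.embed(sentence)
    return [(token.text, len(token.embedding)) for token in sentence]
```

With flair installed, embed_tokens(Sentence("Python is great."), build_stack()) would embed the sentence, where Sentence comes from flair.data. Every token then carries one vector whose length is the sum of the lengths of the individual embeddings.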
Above, we have used two types of embeddings (GloVe and Flair) to embed a sentence using stacked embeddings. The resulting embedding still is a single PyTorch vector that can be used just like any other. This is a powerful concept that allows you to generate embeddings that possibly provide you with better results.
In addition to the interesting research and a great library, the team behind flair is also doing a fantastic job creating tutorials that will get you started right away.
AllenNLP is a platform/library built on top of PyTorch geared towards researchers interested in Deep Learning and building language models (see “A Deep Semantic Natural Language Processing Platform”). While it offers a variety of functions, AllenNLP shines when building custom models.
Since AllenNLP is primarily used to build and train custom models, there is not really a short and simple example I could put here. However, the team has prepared a really good “Your first model” tutorial, which leads you through the whole process. I would highly recommend going through this tutorial, and not just for the beautifully type-hinted code. It nicely demonstrates how the library makes your life a lot easier when compared to “plain” PyTorch. Therefore, you should consider the library if you are thinking about building highly customized solutions.
Please also note that AllenNLP can be used as a really helpful CLI tool to run predictions based on pre-trained models. AllenNLP also works great with spaCy and uses the library extensively in the background.
Stanza is Stanford NLP Group’s new official Python NLP library and replaces the Stanford CoreNLP Python Interface. Stanza currently supports 66 languages and provides a language-agnostic approach to text analysis and many common tasks (see Stanza: A Python Natural Language Processing Toolkit for Many Human Languages). The library can also be used to easily access Stanford’s CoreNLP (Java) from Python.
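A standard English pipeline can be sketched as shown below. The helper names load_pipeline and describe_words are made up for this illustration, and the sketch assumes that the default English models are sufficient.

```python
def load_pipeline(lang="en"):
    """Download the models for a language (once) and build a pipeline."""
    # Imported lazily so describe_words() can be used without stanza installed.
    import stanza

    stanza.download(lang)
    return stanza.Pipeline(lang)


def describe_words(doc):
    """Return (text, lemma, universal POS tag) for every word in a document."""
    return [
        (word.text, word.lemma, word.upos)
        for sentence in doc.sentences
        for word in sentence.words
    ]
```

With Stanza installed, describe_words(load_pipeline()("Stanza supports many languages.")) runs the pipeline on one sentence and lists its annotated words.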
In the example above, we have used a standard English pipeline on one sentence.
As you can see, the API looks quite similar to spaCy. While, according to the paper cited above, Stanza outperforms spaCy (which is optimized for speed and efficiency), the two libraries seem relatively similar. Hence, given that spaCy is widely used and that you can use Stanza models in spaCy using spacy-stanza, I would currently prefer using spaCy in most cases.
That said, Stanza is really powerful, the multilingual/language-agnostic approach is very interesting, and elegantly accessing CoreNLP is certainly also very useful to many. In addition, Stanza has some very helpful utilities for training and evaluating your own models.
NLP Architect (Intel® AI Lab) is an NLP/NLU library focused on Deep Learning. Its primary purpose is to provide a fast and easy way to access, explore, train, and integrate current and future NLP models. Currently, NLP Architect comes with a variety of NLP and NLU models that can be accessed/trained using the library.
Similarly to the AllenNLP case, I will not provide an example here because of the complexity that example would need to have. Instead, I would advise you to have a look at the official examples and tutorials. A good starting point, for example, is the documentation for the “BIST Dependency Parser.” The documentation walks you through training a parser for extracting dependencies using NLP Architect.
Please note that NLP Architect currently “is released as reference code for research purposes” and “is intended to be used locally and has not been designed […] for production usage or web-deployment” (GitHub).
NLP Architect is a very interesting project because it is under active development and sees a lot of change. It also provides access to some fascinating research and allows you to explore state-of-the-art techniques. However, as stated by the team, the library is currently not very well suited for any sort of production use. Nevertheless, I believe that Intel’s AI Lab is going to do some very interesting things with this in the future!
Spark NLP is an NLP library built on top of Apache Spark ML and is optimized for NLP machine learning pipelines in distributed environments. This is one of the reasons why I have added the library to the list: “Big Data” infrastructures (e.g., the Hadoop ecosystem, Spark, etc.) are becoming more important in linguistics and NLP, and tools supporting these types of infrastructure and data management are gaining in importance in some areas of research.
Spark NLP currently comes with more than 220 pre-trained models and pipelines supporting more than 46 languages and a variety of tasks (e.g., tokenization, stemming, n-grams, POS-tagging, language detection, Named Entity Recognition, Sentiment Analysis). Of course, current transformers/embeddings (BERT, GPT-X, etc.) are supported as well. I highly recommend exploring John Snow Labs’ website as they have lots of good live demos and Colab notebooks to study.
The above example is a slightly modified version of the official “Detect Emotions in Text” demo (Colab). I have abbreviated and reformatted the example to make it more readable here. While it is no longer PEP 8 compliant, I believe that it is easier to understand this way. In the example, we create a pipeline based on a pre-trained model and successfully predict the emotion/sentiment for three sentences. Be aware that we are working with a PySpark DataFrame here. Hence, outputting the result looks more complicated than it actually is!
As a very nice addition, John Snow Labs has also provided a “Spark NLP Workshop” that nicely demonstrates and teaches how to use the library.
Conclusion and Outlook
Of course, this was not a comprehensive review of all available resources and tools by any means. Nevertheless, I want to point out three things that seem important:
Firstly, all libraries (and their underlying models) are working very well in general, and while there are winners in certain areas or tasks, all of these are very usable. Therefore, in many cases, the decision will simply come down to the rest of your infrastructure, pre-existing pipelines, and your API preferences.
Secondly, many of the libraries I have presented do a very similar thing: they provide easy, streamlined access to some underlying models and embeddings. Do not forget that these models are the interesting bit and that great APIs, to some degree, are just a very welcome addition on top.
Lastly, none of the libraries and models are performing any kind of “magic.” It ultimately always comes down to the user applying these tools in a meaningful and theoretically sound way. Many of these tools make highly complicated tasks seem very easy. However, if you plan to use any of these in production or research, make sure to understand the underlying mechanics, the data, the linguistics, and all of the possible caveats. If you’re interested, also have a look at my 2019 talk “Natural Language Processing is Harder Than You Think” at 36c3.
Finally, I am going to end this article with a recommendation. In 2018, I recommended TextBlob for those who need to get results quickly and easily. While TextBlob is still actively developed and a very nice library, I now generally strongly recommend spaCy for its great API, support for ML/DL, pre-trained models, fantastic documentation, and ecosystem (have a look at spacy- on PyPI).
Please let me know if you think I have forgotten an important library! I am always interested in exploring new tools!