Natural Language Processing
When I started studying linguistics a few years back, one of the first questions that arose was what defines a language, or language itself. It’s still one of the first things I (fiercely) discuss with students in my linguistics classes. While this is definitely not the topic of this article (which is more of a notebook anyway), I believe that thinking about computational (natural) language processing provides an additional angle on this question.
When I talk about languages, I usually refer to so-called natural languages. These are languages that are or have “been learned and spoken naturally by a community, as opposed to an artificial system resembling language in one or more respects” (Matthews 2007: 259). In other words, natural languages have evolved over time (historically) and are used ‘naturally’ by some speakers or signers (not all languages are necessarily spoken).
Natural Language Processing (NLP) is concerned with computationally analyzing language data. Bird et al. (2009: ix) provide a very broad and general definition: NLP covers “any kind of computer manipulation of natural language. At one extreme, it could be as simple as counting word frequencies to compare different writing styles. At the other extreme, NLP involves ‘understanding’ complete human utterances, at least to the extent of being able to give useful responses to them.”
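The simple end of that spectrum, counting word frequencies, can be sketched with nothing but the standard library. The two sample texts, the helper names, and the naive tokenization are assumptions made for illustration only.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase word tokens in a text."""
    # Naive tokenization: runs of letters and apostrophes count as words.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


def relative_frequency(counts, word):
    """Share of all tokens in `counts` that are `word`."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


# Two made-up snippets standing in for texts by different writers.
text_a = "The cat sat on the mat. The cat slept."
text_b = "A dog barked, and a dog ran, and a dog slept."

freq_a = word_frequencies(text_a)
freq_b = word_frequencies(text_b)

# Compare how heavily each text relies on the word "the".
print(freq_a.most_common(2))
print(relative_frequency(freq_a, "the"), relative_frequency(freq_b, "the"))
```

Comparing relative rather than absolute frequencies keeps texts of different lengths comparable.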
NLP can therefore provide invaluable insights (language learning, language patterns, history of language, linguistic forensics, etc.) as well as solutions to very practical language-related problems (e.g. Alexa understanding what you want from her).
Since NLP is largely computational, we need the right tools to do it. These tools (or more specifically libraries - at least regarding this article) implement theoretical concepts, approaches, and algorithms and make them accessible to researchers, data scientists, engineers, and whoever else is interested in analyzing language.
The goal of this article is to provide a limited overview of various Python libraries that can be used to perform and implement some common and less common NLP tasks. While reading this, please don’t forget that there are also a lot of perfectly fine applications (e.g. AntConc for Corpus Linguistics) that provide direct access to (parts of) the “NLP toolset” - no coding involved.
Top 10 Python Libraries
Disclaimer: In the following I’m going to use the terms ‘library’, ‘module’, and ‘package’ rather loosely. Please don’t give me a hard time about it. For the sake of this article, a library will be defined as a (collection of) (third party) extension(s) that you can use in your own code that provides some additional functionality.
In the following I’m going to present ten of the most important (if not the most important) libraries used for Natural Language Processing (NLP) in Python. While some entries in this list seem like no-brainers (e.g. NLTK), others (e.g. Vocabulary) are the result of a more opinionated selection.
I chose to present them in order of GitHub stars. While the number of stars is not necessarily a good measure of quality or importance, it is one way of bringing order to something that, due to significant differences in scope and purpose, is inherently hard to order.
| Library | Outstanding Function/Feature | GitHub Stars (Feb. 2018) |
| --- | --- | --- |
| spaCy | Extremely optimized NLP library that is meant to be operated together with deep learning frameworks such as TensorFlow or PyTorch. | 8343 |
| Gensim | Highly efficient and scalable topic/semantic modelling. | 6312 |
| Pattern | Web (data) mining / crawling and common NLP tasks. | 6095 |
| NLTK | The ‘mother’ of all NLP libraries. Excellent for educational purposes and the de-facto standard for many NLP tasks. | 6022 |
| TextBlob | Modern multi-purpose NLP toolset that is really great for fast and easy development. | 4807 |
| Polyglot | Multilingualism and transliteration capabilities. | 754 |
| Vocabulary | Retrieve semantic information from individual words. | 397 |
| PyNLPl | Extensive functionality regarding FoLiA XML and many other common NLP formats (CQL, Giza, Moses, ARPA, Timbl, etc.). | 294 |
| Stanford CoreNLP Python | Reliable, robust and accurate NLP platform based on a client-server architecture. Written in Java, and accessible through multiple Python wrapper libraries. | 188 |
| MontyLingua | End-to-end NLP processor working with Python and Java. Historical! | - |
For each of these libraries I will provide a short description and a short code example highlighting one of the library’s features. These examples are not meant to show off everything the libraries can do, but to give you a feeling for the API you’re going to deal with.
1. spaCy
spaCy is a relatively young project that labels itself as “industrial-strength natural language processing”. The library provides most of the standard functionality (tokenization, PoS tagging, parsing, named entity recognition, …) and is built to be lightning fast.
spaCy also interfaces really nicely with all major deep learning frameworks and comes prepackaged with some really good and useful language models. While spaCy does not do everything (yet), it does the things it does do really, really well - including its beautiful documentation!
In this quick example, we’re loading a bit of text into an nlp object. This then allows us to easily print the tokens next to their parts of speech.
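A minimal sketch of such an example could look like this. It assumes that spaCy and its small English model (`en_core_web_sm`) are installed; the function name and the sample sentence are made up for illustration.

```python
def pos_tags(text, model="en_core_web_sm"):
    """Return (token, part-of-speech) pairs for `text` and print them."""
    # Imported inside the function so the snippet can be loaded even
    # where spaCy is not installed. Assumed setup:
    #   pip install spacy
    #   python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load(model)  # load the (assumed) English pipeline
    doc = nlp(text)          # tokenize, tag, and parse the text
    pairs = [(token.text, token.pos_) for token in doc]
    for word, pos in pairs:
        print(word, pos)
    return pairs
```

Calling `pos_tags("Natural language processing is fun.")` should print one token per line, followed by its coarse-grained part-of-speech tag.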
2. Gensim
Gensim is a fairly specialized library that is highly optimized for (unsupervised) semantic (topic) modelling.
Semantic analysis, and topic modelling in particular, is a very specific sub-discipline of NLP, but an important and exciting one. Gensim is the go-to library for this kind of NLP and text mining. It’s fast, scalable, and very efficient.
If you are, however, looking for an all-purpose NLP library, Gensim should probably not be your first choice.
python -m gensim.scripts.make_wiki
Above you can see a very simple example (adapted from the documentation) of extracting topics from a Wikipedia dump via LDA. The ‘hardest’ part of working with Gensim is figuring out how to load/build vectorized corpora and dictionaries from (plain) text data.
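The corpus-building step can be sketched for a handful of in-memory documents. This is only an illustration, not the Wikipedia workflow: the function name, the whitespace tokenization, and the parameter values are assumptions, and it presumes Gensim is installed.

```python
def build_lda(documents, num_topics=2):
    """Train a tiny LDA model from a list of plain-text documents."""
    # Imported inside the function so the snippet can be loaded even
    # where Gensim is not installed (`pip install gensim`).
    from gensim import corpora, models

    # Naive whitespace tokenization; real projects would also remove
    # stop words and very rare terms.
    texts = [doc.lower().split() for doc in documents]

    # Map every token to an integer id ...
    dictionary = corpora.Dictionary(texts)
    # ... and turn each document into a bag-of-words vector.
    corpus = [dictionary.doc2bow(text) for text in texts]

    lda = models.LdaModel(corpus, id2word=dictionary,
                          num_topics=num_topics, passes=10, random_state=0)
    return dictionary, corpus, lda
```

With the returned model, `lda.print_topics()` lists the most heavily weighted words per topic. On a toy corpus the topics will not be meaningful; the point is the dictionary-to-corpus-to-model pipeline.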
3. Pattern
Pattern, first of all, is a web mining library (module) for Python that can be used to crawl and parse a variety of sources such as Google, Twitter, and Wikipedia.
However, Pattern also comes packed with various NLP tools (PoS tagging, n-grams, sentiment analysis, WordNet), machine learning capabilities (vector space models, clustering, classification), and various tools for conducting network analysis.
With all these features, Pattern is a great all-in-one solution for many (linguistic) data mining projects.
Pattern is maintained by CLiPS, so there is not only good documentation and many examples, but also a lot of academic publications that make use of the library.
The big (really, really big) downside is that Pattern currently only supports legacy (2.5+) Python. Well, there is this semi-official fork that tries to lift Pattern into current Python - give it a try!
In the above example we’re crawling Google for results containing the keyword ‘NLP’. We then print the URL and text of each result, along with its bigrams. While this is really a fairly pointless example, it shows how easily crawling and NLP tasks can be performed in unison with Pattern.
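The crawling part depends on a live search API, but the bigram step on its own is easy to sketch in plain Python. Note that this is not Pattern’s implementation, just an illustration of what a bigram is; the function name and the whitespace tokenization are assumptions.

```python
def bigrams(text):
    """Return all pairs of adjacent (lowercased) tokens in `text`."""
    tokens = text.lower().split()
    # Pair each token with its successor.
    return list(zip(tokens, tokens[1:]))


print(bigrams("NLP makes crawling and parsing easy"))
```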
4. Natural Language Toolkit / NLTK
NLTK is usually the first contender when listing or talking about Python NLP libraries. The Natural Language Toolkit is fairly mature (it has been in development since 2001) and has positioned itself as one of the primary resources when it comes to Python and language processing.
It’s really hard to give too much praise to NLTK since it is not only written and maintained really well, but also comes packaged with a ridiculous amount of (example) data, corpora and pre-trained models. Since the NLTK was primarily developed as an educational library, there is also a fairly brilliant textbook (for free) that accompanies the library.
NLTK is huge, and has rightfully been described as “a academic researcher’s theme-park” by an article very similar to this one.
Nevertheless, its growing size, educational focus, and long history have made NLTK a bit hard to work with and have resulted in an approach to some problems that is rather inefficient compared to other libraries. Still, for many tasks, and by no means only educational ones, it’s the de-facto standard library for good reasons.
If you are new to the whole field of natural language processing or (computational) linguistics, the NLTK and its handbook are a great way to get started by being thrown in at the deep end.
Similar to the spaCy example, we’re simply tokenizing and tagging a short text here.
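A minimal sketch of this step, assuming NLTK is installed and its tokenizer and tagger data have been downloaded beforehand (the exact resource names depend on the NLTK version); the function name is made up for illustration.

```python
def tokenize_and_tag(text):
    """Return a list of (token, PoS tag) pairs for `text`."""
    # Imported inside the function so the snippet can be loaded even
    # where NLTK is not installed (`pip install nltk`). The tokenizer
    # and tagger models must be fetched once via nltk.download().
    import nltk

    tokens = nltk.word_tokenize(text)  # split the text into word tokens
    return nltk.pos_tag(tokens)        # attach a PoS tag to each token
```

In contrast to the spaCy sketch, tokenization and tagging are two separate function calls here, which is typical of NLTK’s toolbox-style API.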
5. TextBlob
TextBlob is definitely one of my favorite libraries and my personal go-to when it comes to prototyping or implementing common NLP tasks.
It is based on both NLTK and Pattern and provides a very straightforward API to all common (and some less common) NLP tasks. While TextBlob does nothing particularly new or exciting, it makes working with text very enjoyable and removes a lot of barriers.
While TextBlob certainly isn’t the fastest or most complete library, it offers everything that one needs on a day-to-day basis in an extremely accessible and manageable way.
Above we’re reading in the same text as before (only one sentence). We let TextBlob split the text into sentences (well, …) and perform some straightforward sentiment analysis (based on PatternAnalyzer from Pattern).
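A minimal sketch of these two steps, assuming TextBlob and its corpora are installed; the function name is made up for illustration.

```python
def sentences_and_sentiment(text):
    """Split `text` into sentences and score its overall sentiment."""
    # Imported inside the function so the snippet can be loaded even
    # where TextBlob is not installed. Assumed setup:
    #   pip install textblob
    #   python -m textblob.download_corpora
    from textblob import TextBlob

    blob = TextBlob(text)
    sentences = [str(sentence) for sentence in blob.sentences]
    # The default analyzer returns a (polarity, subjectivity) pair:
    # polarity ranges from -1.0 to 1.0, subjectivity from 0.0 to 1.0.
    sentiment = blob.sentiment
    return sentences, sentiment.polarity, sentiment.subjectivity
```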
The TextBlob object can do much more. Let’s say we wanted to extract all noun phrases from a given piece of text:
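A possible sketch, again assuming TextBlob and its corpora are installed (the function name is hypothetical):

```python
def noun_phrases(text):
    """Return the noun phrases TextBlob finds in `text` as a list."""
    # Imported inside the function so the snippet can be loaded even
    # where TextBlob is not installed.
    from textblob import TextBlob

    # `noun_phrases` is a property of the blob; it yields lowercased
    # phrases, which are converted to a plain list here.
    return list(TextBlob(text).noun_phrases)
```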
If you need to get things done, give TextBlob a try. If you’re lucky, you will be done - if not, you’ve only lost a few minutes :). If you’re looking for a starting point, Allison Parrish wrote a really nice introduction to NLP that is based on TextBlob.
6. Polyglot
Polyglot is in some respects very different from the libraries we’ve talked about so far. While it has similar features (tokenization, PoS tagging, word embeddings, named entity recognition, etc.), Polyglot is primarily designed for multilingual applications.
Within the space of simultaneously dealing with various languages, it provides some very interesting features such as language detection and transliteration that are usually not as pronounced in other packages.
For example, if we wanted to figure out the language of a given piece of text, we could do this in Polyglot:
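A minimal sketch of language detection, assuming Polyglot and its native dependencies are installed; the function name is made up for illustration.

```python
def detect_language(text):
    """Return (name, code, confidence) of the most likely language."""
    # Imported inside the function so the snippet can be loaded even
    # where Polyglot is not installed. Polyglot also needs the PyICU
    # and pycld2 packages, which can be tricky to build.
    from polyglot.detect import Detector

    language = Detector(text).language  # best guess for the whole text
    return language.name, language.code, language.confidence
```

Very short inputs may be rejected as unreliable, so a full sentence or more is a safer test case than a single word.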
Due to its focused approach, Polyglot certainly is not always the right choice. But if you want to work with various languages at the same time (currently they support between 16 and 196 languages depending on the feature in question), Polyglot is a library to keep in mind.
7. Vocabulary
Vocabulary is another rather small Python library that provides some very interesting and specialized features. Vocabulary, as its name suggests, is basically a dictionary in the form of a Python module.
To be precise, Vocabulary will give you the meaning, synonyms, antonyms, part of speech, translation, examples, pronunciation, and hyphenation of a given word.
It’s somewhat similar to WordNet, but provides some syntactic sugar and returns data in nice formats (i.e. JSON). However, it does not have the same scope as WordNet (usually part of NLTK) and is best seen as a simple substitute or alternative.
Vocabulary is great because it is simple and super easy to use. At the same time, it is very limited and seems to sometimes produce questionable results. That being said, Vocabulary is a lovely small library that, given its scope, works reasonably well and definitely deserves some more attention.
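To give a feeling for the API, here is a minimal sketch that looks up a single word. It assumes the Vocabulary package is installed and that the web services it queries are reachable, which may no longer be the case; the function name is hypothetical.

```python
def look_up(word):
    """Collect a few pieces of dictionary information about `word`."""
    # Imported inside the function so the snippet can be loaded even
    # where Vocabulary is not installed (`pip install vocabulary`).
    from vocabulary.vocabulary import Vocabulary as vb

    # Each call queries a web service; when nothing is found, the
    # library returns False instead of data.
    return {
        "meaning": vb.meaning(word),
        "synonyms": vb.synonym(word),
        "antonyms": vb.antonym(word),
    }
```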
8. PyNLPl
PyNLPl, or ‘pineapple’ for short, is a lesser known NLP library that is particularly well suited for reading and writing common (and less common) file formats that can be found in the NLP space.
PyNLPl could, for example, come in really handy if you need to work with FoLiA XML (Linguistic Annotation) or GIZA++ files that are sometimes hard to read and parse.
In this simple example we’re opening a FoLiA XML file and adding a simple “Hello World” structure to it.
Ultimately, PyNLPl can save you a whole lot of time in some rather specific cases. If you find yourself in a situation in which you need to work with FoLiA, GIZA, Moses, ARPA, Timbl, or CQL, remember that there is a friendly pineapple to help. For all other cases I would go with the larger, more actively maintained libraries.
9. Stanford CoreNLP Python
CoreNLP is actively being developed at and by Stanford’s Natural Language Processing Group and is a well-known, long-standing player in the field. The toolkit provides very robust, accurate, and optimized techniques for tagging, parsing, and analyzing text in various languages. It also supports annotation pipelines and is easily extendible and accessible. Overall, CoreNLP is one of the NLP toolkits that has been and is definitely used in production a lot.
The official (read: reference) Python wrapper makes working with CoreNLP, which needs to be present on the system (or via a server), relatively straightforward:
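A sketch of how this could look with Stanford’s `corenlp` package. It assumes the package is installed, that a local CoreNLP distribution is available, and that the `CORENLP_HOME` environment variable points to it; the function name and the choice of annotators are made up for illustration.

```python
def lemmas(text):
    """Return the lemmas of `text`, grouped by sentence."""
    # Imported inside the function so the snippet can be loaded even
    # where the wrapper is not installed (`pip install stanford-corenlp`).
    import corenlp

    # The client starts a CoreNLP server in the background and shuts
    # it down again when the `with` block is left.
    annotators = "tokenize ssplit pos lemma".split()
    with corenlp.CoreNLPClient(annotators=annotators) as client:
        annotation = client.annotate(text)

    # The annotation is a structured document: sentences contain tokens,
    # and each token carries attributes such as its lemma.
    return [[token.lemma for token in sentence.token]
            for sentence in annotation.sentence]
```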
In this example we’ve annotated a piece of text (as in the other examples). We can then access various attributes (e.g. the lemmatized version) of sentences and tokens.
To demonstrate (and showcase) another Python wrapper, we’ll do the same thing in py-corenlp:
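A minimal sketch, assuming py-corenlp is installed and a CoreNLP server is already running; the URL and port are assumptions and must match your own setup, and the function name is hypothetical.

```python
def lemmas_via_pycorenlp(text, url="http://localhost:9000"):
    """Return the lemmas of `text`, grouped by sentence."""
    # Imported inside the function so the snippet can be loaded even
    # where py-corenlp is not installed (`pip install pycorenlp`).
    from pycorenlp import StanfordCoreNLP

    nlp = StanfordCoreNLP(url)  # connect to the running server
    output = nlp.annotate(text, properties={
        "annotators": "tokenize,ssplit,pos,lemma",
        "outputFormat": "json",
    })
    # With JSON output, the result is a nested dict of sentences and tokens.
    return [[token["lemma"] for token in sentence["tokens"]]
            for sentence in output["sentences"]]
```

Unlike the reference wrapper, py-corenlp does not start the server for you; it only talks to one that is already up.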
Generally speaking, CoreNLP provides a great infrastructure for NLP tasks. However, the client-server architecture introduces some overhead that might be counterproductive for smaller projects or during prototyping. On the other hand, CoreNLP allows you to construct complex NLP systems that are accessible from various programming languages, not just Python.
Be careful: There is more than one of these wrapper libraries. I’ve linked to the reference implementation by Stanford themselves, but there are presumably others that work better in some cases. For our convenience, Stanford keeps a list of the major Python wrappers.
10. MontyLingua
I don’t know many (…any) people who would start a new project based on MontyLingua. Nevertheless, I decided to put it on this list because of its history (MontyLingua was developed at MIT’s Media Lab between 2002 and 2004 by Hugo Liu) and charming approach. Also, in its time, MontyLingua was known to outperform many of its competitors, including NLTK, in some respects.
MontyLingua is designed to make things as easy as possible. You feed in raw text and you’ll receive some semantic interpretation of that text - out of the box. Instead of relying on complex machine learning techniques, it comes equipped with so-called “common sense knowledge”, i.e. rules and knowledge about everyday contexts that enrich the interpretations of the system. Well, to be fair, these rules are based on the “Open Mind Common Sense” project which heavily relies on AI.
The tool(s), written in Python and also available in Java, consists of six modules that provide tokenization, tagging, chunking, phrase extraction, lemmatization, and language generation capabilities.
As you can see, I’m not even going to the trouble of writing an actual code example.
To be honest, don’t use MontyLingua, but acknowledge its existence and possibly browse its (really old) code.
Ultimately, MontyLingua proves two points. Firstly, even fairly old libraries and tools can still bring value to the table, even if that value is sometimes mainly educational. Secondly, there are many small projects, often academic, that are less flashy than larger libraries such as spaCy, but equally important given the right problem.
What About all the Other Libraries?
Of course, these are not all the NLP libraries out there! If you look for NLP on PyPI you will find more than 470 packages and libraries.
It would be (almost) impossible to highlight all of these very interesting tools and resources. However, I will try to point out three directions that seem to be interesting:
One emerging category of NLP libraries is concerned with providing higher-level interfaces to various deep learning frameworks. For example, finch, which provides access to and models for TensorFlow, is one library that is quickly gaining attention on GitHub for its focus on deep (Mandarin) NLP.
Secondly, there are various highly interesting libraries that are super specialized. To name a few examples: CLTK is the go-to library for classical languages such as Greek and Latin. PyTextRank is a straightforward implementation of the TextRank algorithm/approach proposed by Mihalcea (2004) and others. Newspaper is a really clever library that helps you to build and analyze newspaper corpora from the web. Such (highly) specialized libraries (or add-ons to existing libraries such as textblob-de) are also indispensable when working with lesser ‘known’ (read: less funded) languages.
Thirdly, I want to emphasize the (growing) NLP capabilities of the well-known (standard) ‘data science’ libraries. For example, scikit-learn provides not only some example data, but also some very useful models and algorithms for natural language processing.
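As a small illustration, a bag-of-words representation can be built with scikit-learn’s `CountVectorizer`. This is just one of the available text features, assuming scikit-learn is installed; the function name and the sample documents are made up.

```python
def bag_of_words(documents):
    """Return the vocabulary and the document-term count matrix."""
    # Imported inside the function so the snippet can be loaded even
    # where scikit-learn is not installed (`pip install scikit-learn`).
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer()  # lowercases and tokenizes by default
    matrix = vectorizer.fit_transform(documents)

    # `vocabulary_` maps each term to its column index; sorting by
    # that index yields the terms in column order.
    vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    return vocabulary, matrix.toarray().tolist()
```

For the documents `["the cat sat", "the dog sat"]`, each row of the matrix counts how often each vocabulary term occurs in the corresponding document. Such a matrix can be fed directly into scikit-learn’s classifiers or clustering algorithms.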
Where to Start?
If you are new to natural language processing, I would recommend starting with the NLTK for inspiration. It’s a treasure trove of data and methods that will be perfect to get your feet wet. If, on the other hand, you are looking to get results quickly, go straight to TextBlob and enjoy the ride.
Bird, Steven, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. Safari Books Online. Sebastopol: O’Reilly Media.
Matthews, Peter. 2007. The Concise Oxford Dictionary of Linguistics. 2nd ed. Oxford Paperback Reference. Oxford: Oxford University Press.