After thinking (and reading) about Wikipedia scraping and topic modeling today, I wanted to provide a (really) simple, but clear example of topic modeling using LDA (Latent Dirichlet Allocation). Technically, the example will make use of Python (3.6), NLTK, wikipedia, and Gensim.
The example consists of four steps:
- Download (scrape) a number of random Wikipedia articles
- Download (scrape) two additional Wikipedia articles (‘Car’ and ‘Bus’)
- Use LDA to model (or rather discover) abstract ‘topics’ in these articles (excluding ‘Bus’)
- Use the model to predict the topic of a new/unknown article (in this case ‘Bus’)
Topic Modeling and LDA
Topic Modeling is primarily concerned with identifying ‘topics’ (in this sense, a pattern of co-occurring words; ultimately, a topic is a distribution over a given vocabulary) in a corpus (= set of documents). Put simply, a Topic Model is an abstract statistical model of these topics in a given corpus.
Latent Dirichlet Allocation is a generative probabilistic (Bayesian) model developed by Blei et al. (2003). Without going into the details, LDA will lead us to a list of ‘topics’, each consisting of multiple weighted words. These words are what define the ‘topic’ - there is no explicit name or label for each topic.
Before going into the example, I want to issue a warning: the corpus used is extremely small and the results will likely be somewhat skewed. Also, I will not go into fine-tuning the model (e.g. finding the optimal number of topics). While both issues are fairly severe, they will not really matter for this simple example.
Getting the Data
First, I want to download a number of random Wikipedia articles (just the content) in order to introduce some randomness into the corpus and ultimately the topics. Then, I want to add the articles ‘Car’ and ‘Bus’ to the corpus. These two will serve as the actual basis for the example.
The wikipedia module for Python makes accessing the Wikipedia API very straightforward.
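A minimal sketch of this step, assuming the `random()` function of the wikipedia package and five random titles (so that adding ‘Car’ and ‘Bus’ gives seven in total). The helper name `collect_titles` and its `random_titles` parameter are made up for illustration; the parameter only exists so the function can be tried offline with a stand-in.

```python
def collect_titles(n_random=5, extra=("Car", "Bus"), random_titles=None):
    """Return n_random random article titles followed by the fixed extra titles."""
    if random_titles is None:
        # Third-party package (pip install wikipedia); imported lazily so the
        # function can also be exercised offline with a stand-in.
        import wikipedia
        random_titles = wikipedia.random
    titles = random_titles(pages=n_random)
    # wikipedia.random returns a bare string when only one page is requested.
    if isinstance(titles, str):
        titles = [titles]
    return list(titles) + list(extra)
```

Calling `collect_titles()` without arguments queries Wikipedia, so the random part of the result differs on every run.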
This will return a list of seven article titles. These can then be downloaded:
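The download itself could be sketched as follows, assuming `wikipedia.page(title)` and the `content` attribute of the returned page object. The name `fetch_articles` and the `get_page` parameter are hypothetical and only there to keep the function testable without network access.

```python
def fetch_articles(titles, get_page=None):
    """Return a list of (title, article_content) tuples for the given titles."""
    if get_page is None:
        import wikipedia  # third-party: pip install wikipedia
        get_page = wikipedia.page
    articles = []
    for title in titles:
        page = get_page(title)  # one API request per title
        articles.append((title, page.content))
    return articles
```

Random titles occasionally point to disambiguation pages, for which the wikipedia package raises an exception; this sketch does not handle that case.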
This will lead us to a list of tuples. Each item consists of the title and the actual content of the article, i.e. (title, article_content). Now we have to clean, tokenize, and stem the articles. With the help of NLTK, this can be done fairly quickly:
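A minimal sketch of the cleaning step. The `\w+` pattern and the Porter stemmer are assumptions on my part (stems such as ‘vehicl’ and ‘releas’ are at least consistent with a Porter-style stemmer), and the function names are illustrative. The NLTK stopword list has to be downloaded once with `nltk.download("stopwords")`.

```python
def clean(text, tokenize, stop_words, stem):
    """Lowercase the text, tokenize it, drop stopwords, and stem what is left."""
    tokens = tokenize(text.lower())
    return [stem(token) for token in tokens if token not in stop_words]


def clean_with_nltk(text):
    """Apply clean() with NLTK's regex tokenizer, stopword list, and stemmer."""
    # Imported here so that clean() itself stays usable without NLTK.
    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer
    from nltk.tokenize import RegexpTokenizer

    tokenizer = RegexpTokenizer(r"\w+")  # assumed pattern: runs of word characters
    return clean(
        text,
        tokenizer.tokenize,
        set(stopwords.words("english")),
        PorterStemmer().stem,
    )
```

Applied to the downloaded articles, this would be something like `[(title, clean_with_nltk(content)) for title, content in articles]`.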
After a very simple tokenization process based on a regular expression, (English) stopwords are removed and the articles are stemmed.
Document-Term Matrix/Dictionary and Vectorization
For the LDA model, we need a dictionary (a gensim Dictionary) and all articles in vectorized form (we will be using a bag-of-words approach); taken together, the vectorized articles make up the document-term matrix.
In the first step, for the sake of simplicity, we build a list that just contains the text of all articles. Then gensim is used to construct a dictionary - “a mapping between words and their integer ids” (Řehůřek 2017).
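Both steps can be sketched like this, assuming that `articles` by now holds (title, tokens) pairs with the stemmed tokens in place of the raw text. `corpora.Dictionary` is gensim's dictionary class; the two function names are made up for illustration.

```python
def article_texts(articles):
    """Keep only the token lists from a list of (title, tokens) pairs."""
    return [tokens for _title, tokens in articles]


def build_dictionary(token_lists):
    """Return a gensim Dictionary mapping each distinct token to an integer id."""
    # Imported lazily so gensim is only required once the function is called.
    from gensim import corpora  # third-party: pip install gensim
    return corpora.Dictionary(token_lists)
```

The resulting mapping can be inspected through the dictionary's `token2id` attribute.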
Constructing a vector representation of an article is equally simple:
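With gensim this is a single call, `dictionary.doc2bow(tokens)`, which returns a sparse list of (token_id, count) pairs. To show what such a vector contains, here is a pure-Python illustration; it is not gensim's implementation, and `bag_of_words` with its plain `token2id` dict is a stand-in of my own.

```python
from collections import Counter


def bag_of_words(token2id, tokens):
    """Count how often each known token occurs; unknown tokens are ignored."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    # Sparse representation: only ids that actually occur, sorted by id.
    return sorted(counts.items())


# Tiny made-up vocabulary, not taken from the real articles.
vocabulary = {"car": 0, "vehicl": 1}
vector = bag_of_words(vocabulary, ["car", "bu", "car", "vehicl"])
print(vector)  # [(0, 2), (1, 1)]
```

Word order is lost in this representation; only the counts per word remain, which is all LDA needs.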
With these two things in place, we can build (train) the LDA model. We need to provide at least two arguments: a predefined number of topics and a number of passes.
First, a corpus of all articles is constructed and vectorized. Then, an LDA model is trained with five topics over 100 passes. I’ve chosen the number of topics based on intuition - theoretically, one would have to experiment here. A higher number of passes usually leads to more precise results, but increases the training time. For this example, 100 is good enough.
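The training step described above might be sketched as follows, assuming gensim's `LdaModel` class with its `num_topics`, `id2word`, and `passes` arguments. The function name `train_lda` is illustrative; `token_lists` stands for the stemmed training articles, i.e. everything except ‘Bus’.

```python
def train_lda(dictionary, token_lists, num_topics=5, passes=100):
    """Vectorize the training articles and fit an LDA model on them."""
    from gensim import models  # third-party: pip install gensim

    # Bag-of-words corpus: one sparse (token_id, count) vector per article.
    corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]
    return models.LdaModel(
        corpus,
        num_topics=num_topics,
        id2word=dictionary,  # lets the model report words instead of ids
        passes=passes,
    )
```

A listing of the topics with their three strongest words can then be obtained with `lda.print_topics(num_topics=5, num_words=3)`. Because training is randomized and the random articles differ, each run produces different topics.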
[(0, '0.000*"list" + 0.000*"track" + 0.000*"releas"'), (1, '0.018*"hous" + 0.018*"troy" + 0.018*"caswel"'), (2, '0.008*"winner" + 0.007*"footbal" + 0.007*"cup"'), (3, '0.036*"car" + 0.015*"vehicl" + 0.010*"use"'), (4, '0.000*"list" + 0.000*"track" + 0.000*"releas"')]
The corpus consists of seven articles (‘Ese Que Va Por Ahí’, ‘Just Got Paid, Let's Get Laid’, ‘The Nearly Man’, ‘Invergordon F.C.’, ‘Caswell House (Troy, Michigan)’, ‘Car’, ‘Bus’); as outlined at the beginning, ‘Bus’ is held back and not used for training. The output above shows the five topics, each with its three most heavily weighted words. One topic (Topic 3), for instance, is defined by car, vehicl, and use.
Having this in mind, we can now try to apply our model to a new text. As indicated above, we will be using the ‘Bus’ article that has not been used as training data. Intuitively, we would expect Topic 3 to also be assigned to this article.
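A minimal sketch of the prediction step, assuming gensim's `get_document_topics` method; `predict_topics` is an illustrative name, and `tokens` stands for the stemmed ‘Bus’ article.

```python
def predict_topics(lda, dictionary, tokens):
    """Return (topic_id, probability) pairs for one unseen, tokenized article."""
    # The new article must be vectorized with the same dictionary as the
    # training data; words the dictionary has never seen are ignored.
    bow = dictionary.doc2bow(tokens)
    return lda.get_document_topics(bow)
```

Passing a one-element list of vectors to the model instead, as in `list(lda[[bow]])`, gives the same information wrapped in an outer list.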
While this looks rather complicated, we are just feeding the last article into the model as a vector. The model returns the following predictions:
[[(0, 0.073538951351519902), (1, 0.028946596735425441), (2, 0.080663353421899911), (3, 0.71335464756715505), (4, 0.10349645092399962)]]
As expected, the probability of Topic 3, compared to the others, is fairly high. :)
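To read off the most likely topic programmatically rather than by eye, the prediction can simply be searched for the highest probability. The snippet uses the values printed above; `most_likely_topic` is a small helper of my own.

```python
# Prediction for the 'Bus' article, copied from the output above.
prediction = [[
    (0, 0.073538951351519902),
    (1, 0.028946596735425441),
    (2, 0.080663353421899911),
    (3, 0.71335464756715505),
    (4, 0.10349645092399962),
]]


def most_likely_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])


topic_id, probability = most_likely_topic(prediction[0])
print(topic_id, round(probability, 3))  # 3 0.713
```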
The whole code can be accessed as a Jupyter Notebook on GitHub.
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (2003): 993-1022.
Řehůřek, Radim. “Gensim: corpora.dictionary.” Gensim. Accessed July 21, 2017. https://radimrehurek.com/gensim/corpora/dictionary.html.