Word Clouds and Language Models: Visualizing Text
I’m going to do this demonstration in two stages:
First we will build word clouds using TF-IDF weights on a relatively small corpus of documents. We’ll build the word clouds on a per-document basis, so the resulting graphs will show the terms that are most important for distinguishing one document from another.
The second demonstration will use a considerably larger corpus to generate a topic model. We will then graph the individual topics as word clouds. This should (although I’m no expert on topic models) tell us what terms are most important in ascertaining whether or not a document is relevant to a particular subject.
I’ll also be making use of the Natural Language Toolkit (NLTK) to perform operations such as stemming and tokenization. While the Word Cloud package does this reasonably well, I prefer to use a package which was specifically designed to perform these types of operations (sorry Andreas).
Building a Term Frequency-Inverse Document Frequency Word Cloud
We’ll start by loading the documents, which are passed to the script as command-line arguments:

~~~
import sys

for file in sys.argv[1:]:                     # Iterate over the files
    contents = open(file).read().lower()      # Load file contents
~~~
Next we need to split the document into its constituent components: words, punctuation marks and so on. For this we’ll use NLTK. We’ll also stem the tokens using the Porter stemmer. Stemming attempts to resolve variants of words back to their **root form**. For example, instances of the words “runs” and “running” should both be resolved to the word “run” (although irregular forms like “ran” will usually slip past a rule-based stemmer). Stemming is quite crude and the results aren’t really suitable for human consumption, but it’s a fairly effective method of merging tokens.
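To get a feel for what the stemmer actually does, here’s a quick check. The outputs in the comments are what I’d expect from NLTK’s Porter implementation, so treat them as indicative:

~~~
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["run", "runs", "running", "ran", "easily"]:
    print(word, "->", stemmer.stem(word))

# Expected output (indicative):
# run -> run
# runs -> run
# running -> run
# ran -> ran
# easily -> easili
~~~

That last one illustrates why stemmed output isn’t really fit for human eyes.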
We’ll store all our documents in memory just because it’s easier to implement for now. However, this will quickly become unsustainable for large collections. We really ought to load our corpus bit by bit (functionality which Gensim is designed to support). We’ll store both the original tokenized text and the stemmed text.
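As an aside, Gensim’s streaming support usually takes the form of a corpus object whose `__iter__` method yields one processed document at a time, re-reading the files on each pass instead of holding everything in memory. A minimal sketch of that idea (the class name is my own and it isn’t used in the rest of this demonstration):

~~~
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

class StreamedDocuments:
    """Yields one stemmed, tokenized document at a time (hypothetical sketch)."""

    def __init__(self, files):
        self.files = files

    def __iter__(self):
        stemmer = PorterStemmer()
        for file in self.files:
            contents = open(file).read().lower()   # Load one file at a time
            yield [stemmer.stem(token) for token in word_tokenize(contents)]
~~~

An object like this can be handed to Gensim anywhere it expects a sequence of tokenized documents, without the whole collection ever sitting in memory at once. For now, though, we’ll stick with the in-memory version: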
~~~
import sys
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()   # Stemmer for reducing terms to root form
stemmed_corpus = []         # For storing the stemmed tokens
original_corpus = []        # For storing the non-stemmed tokens

for file in sys.argv[1:]:                                 # Iterate over the files
    contents = open(file).read().lower()                  # Load file contents
    tokens = word_tokenize(contents)                      # Extract tokens
    stemmed = [stemmer.stem(token) for token in tokens]   # Stem tokens
    stemmed_corpus.append(stemmed)                        # Store stemmed document
    original_corpus.append(tokens)                        # Store original document
~~~
This is all that’s involved in preparing the collection for Gensim. We’re now ready to start analysing the collection using the data structures the library provides.
Gensim substitutes integer IDs for terms using a dictionary structure, so the first thing we need to do is build the dictionary. The dictionary contains unique instances of the terms found within the collection; it is a complete lexicon of the vocabulary used in the documents. Building the dictionary is as simple as importing the data structure and passing our stemmed tokens to it. We’re only going to build the dictionary from the stemmed tokens; later we’ll use the non-stemmed tokens for presentation purposes.
~~~
import sys
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from gensim.corpora import Dictionary

stemmer = PorterStemmer()   # Stemmer for reducing terms to root form
stemmed_corpus = []         # For storing the stemmed tokens
original_corpus = []        # For storing the non-stemmed tokens

for file in sys.argv[1:]:                                 # Iterate over the files
    contents = open(file).read().lower()                  # Load file contents
    tokens = word_tokenize(contents)                      # Extract tokens
    stemmed = [stemmer.stem(token) for token in tokens]   # Stem tokens
    stemmed_corpus.append(stemmed)                        # Store stemmed document
    original_corpus.append(tokens)                        # Store original document

dictionary = Dictionary(stemmed_corpus)                   # Build the dictionary
~~~
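If you want to peek at what the dictionary has built, its token2id attribute exposes the term-to-ID mapping (the terms and IDs you see will obviously depend on your collection):

~~~
print(len(dictionary), "unique stemmed terms")            # Size of the lexicon
for term, term_id in list(dictionary.token2id.items())[:10]:
    print(term_id, term)                                  # A few term -> ID pairs
~~~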
We are now going to build the TF-IDF model. There are two steps involved in this process: first we convert the text corpus into a vector format, then we pass those vectors to the TF-IDF model for analysis. Don’t forget to import the model from the gensim package!
~~~
import sys
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

stemmer = PorterStemmer()   # Stemmer for reducing terms to root form
stemmed_corpus = []         # For storing the stemmed tokens
original_corpus = []        # For storing the non-stemmed tokens

for file in sys.argv[1:]:                                 # Iterate over the files
    contents = open(file).read().lower()                  # Load file contents
    tokens = word_tokenize(contents)                      # Extract tokens
    stemmed = [stemmer.stem(token) for token in tokens]   # Stem tokens
    stemmed_corpus.append(stemmed)                        # Store stemmed document
    original_corpus.append(tokens)                        # Store original document

dictionary = Dictionary(stemmed_corpus)                   # Build the dictionary

# Convert to vector corpus
vectors = [dictionary.doc2bow(text) for text in stemmed_corpus]

# Build TF-IDF model
tfidf = TfidfModel(vectors)
~~~
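It’s worth knowing what that vector format actually looks like. doc2bow produces a sparse bag-of-words representation: a list of (token ID, count) pairs for each document. The values below are purely illustrative:

~~~
print(vectors[0][:5])   # Each vector is a list of (token_id, count) pairs
# e.g. [(0, 2), (1, 1), (2, 4), (3, 1), (4, 1)]
~~~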
TF-IDF determines which words are most important for distinguishing the content of one document from another. Arguably, it identifies the terms on which the document is most authoritative, although there is no solid theoretical basis for this; proving it has been something of a bane for the computer science community. Still, for our purposes we’ll say it’s true. What we plan to do is produce a word cloud where the size of a term is dictated by its TF-IDF weight in a given document.
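For reference, the classic formulation weights a term by how often it occurs in the document, discounted by how many documents it appears in at all. Here’s a back-of-the-envelope version; as far as I know Gensim’s defaults differ slightly (a log-base-2 IDF plus normalised document vectors), but the intuition is the same:

~~~
from math import log2

def tf_idf(term_count, total_docs, docs_containing_term):
    """Classic TF-IDF: term frequency scaled by inverse document frequency."""
    return term_count * log2(total_docs / docs_containing_term)

# A term appearing 3 times in a document but in only 2 of 10 documents overall...
print(tf_idf(3, 10, 2))   # 3 * log2(5)    ~= 6.97
# ...outweighs one appearing 3 times but present in 9 of the 10 documents.
print(tf_idf(3, 10, 9))   # 3 * log2(10/9) ~= 0.46
~~~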
We’ll generate a word cloud for one of the documents that we’ve loaded. Let’s use whatever the first document in the stemmed_corpus array is. Remember, this document exists across three arrays – stemmed_corpus, original_corpus and vectors. We’ll need to use them all. Fortunately the document will exist at the same index in each array. We could produce a series of word clouds for every document in the collection, but seeing as this is just a demonstration, we’ll only run it for one. Let’s start by finding the TF-IDF weights for each term in the chosen document.
~~~
import sys
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

stemmer = PorterStemmer()   # Stemmer for reducing terms to root form
stemmed_corpus = []         # For storing the stemmed tokens
original_corpus = []        # For storing the non-stemmed tokens

for file in sys.argv[1:]:                                 # Iterate over the files
    contents = open(file).read().lower()                  # Load file contents
    tokens = word_tokenize(contents)                      # Extract tokens
    stemmed = [stemmer.stem(token) for token in tokens]   # Stem tokens
    stemmed_corpus.append(stemmed)                        # Store stemmed document
    original_corpus.append(tokens)                        # Store original document

dictionary = Dictionary(stemmed_corpus)                   # Build the dictionary

# Convert to vector corpus
vectors = [dictionary.doc2bow(text) for text in stemmed_corpus]

# Build TF-IDF model
tfidf = TfidfModel(vectors)

# Get TF-IDF weights for the first document
weights = tfidf[vectors[0]]
~~~
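The weights come back as (token ID, weight) pairs. To give a sense of where this is heading, one way of turning them into a word cloud might look like the sketch below; the WordCloud arguments and the output filename are my own choices rather than anything prescribed, and bear in mind the labels will be stemmed forms (the original_corpus tokens could be used to map back to more readable words):

~~~
from wordcloud import WordCloud

# Map token IDs back to their (stemmed) terms, paired with their TF-IDF weights
frequencies = {dictionary[term_id]: weight for term_id, weight in weights}

# Size each term by its TF-IDF weight rather than its raw frequency
cloud = WordCloud(width=800, height=600, background_color="white")
cloud.generate_from_frequencies(frequencies)
cloud.to_file("tfidf_cloud.png")   # Hypothetical output filename
~~~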
…