Scraping Data to build N-gram Word Clouds in R (LifeVantage Use Case)

As social networks, news, blogs, and countless other sources flood our data lakes and warehouses with unstructured text data, R programmers turn to tools like word clouds (aka tag clouds) to help make sense of it all.

Using the tm.plugin.webmining package to scrape data on the Nrf2 antioxidant supplement-maker LifeVantage, this tutorial extends several existing tutorials to go beyond 1-gram (i.e. single-word) word clouds to N-gram word clouds (i.e. 2 or more words per token).

[Example word cloud: Foundation-l mailing list, without headers and quotes]

Word clouds are visualization tools in which mined text data are displayed as a group of important terms arranged according to some algorithm (e.g. words scaled by term density, shading corresponding to frequency, etc.). This allows a researcher to summarize thousands or even millions of text records at a glance.

To get started, we load a few good libraries. Whereas tm is R's most popular text-mining package, tm.plugin.webmining gives us the ability to quickly scrape data from several popular internet sources.

# Packages ----------------------------------------------------------------
require(RWeka)
require(tau)
require(tm)
require(tm.plugin.webmining)
require(wordcloud)
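
If you don't have these installed yet, a one-time install takes care of it (note that RWeka additionally requires a working Java installation):

install.packages(c("RWeka", "tau", "tm", "tm.plugin.webmining", "wordcloud"))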

The goal is to scrape an example dataset related to LifeVantage, maker of the Nrf2-activating supplement Protandim. LifeVantage is a publicly traded corporation with a vast MLM salesforce, so we can find data on it from sources such as Google and Yahoo! Finance.

# Scrape Google Finance ---------------------------------------------------
googlefinance <- WebCorpus(GoogleFinanceSource("NASDAQ:LFVN"))

# Scrape Google News ------------------------------------------------------
lv.googlenews <- WebCorpus(GoogleNewsSource("LifeVantage"))
p.googlenews  <- WebCorpus(GoogleNewsSource("Protandim"))
ts.googlenews <- WebCorpus(GoogleNewsSource("TrueScience"))

# Scrape NYTimes ----------------------------------------------------------
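# Note: NYTimesSource requires an API key; set nytimes_appid to your
# New York Times Article Search API key before running this block.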
lv.nytimes <- WebCorpus(NYTimesSource(query = "LifeVantage", appid = nytimes_appid))
p.nytimes  <- WebCorpus(NYTimesSource("Protandim", appid = nytimes_appid))
ts.nytimes <- WebCorpus(NYTimesSource("TrueScience", appid = nytimes_appid))

# Scrape Reuters ----------------------------------------------------------
lv.reutersnews <- WebCorpus(ReutersNewsSource("LifeVantage"))
p.reutersnews  <- WebCorpus(ReutersNewsSource("Protandim"))
ts.reutersnews <- WebCorpus(ReutersNewsSource("TrueScience"))

# Scrape Yahoo! Finance ---------------------------------------------------
lv.yahoofinance <- WebCorpus(YahooFinanceSource("LFVN"))

# Scrape Yahoo! News ------------------------------------------------------
lv.yahoonews <- WebCorpus(YahooNewsSource("LifeVantage"))
p.yahoonews  <- WebCorpus(YahooNewsSource("Protandim"))
ts.yahoonews <- WebCorpus(YahooNewsSource("TrueScience"))

# Scrape Yahoo! Inplay ----------------------------------------------------
lv.yahooinplay <- WebCorpus(YahooInplaySource("LifeVantage"))

Done! Our neat little plugin for tm makes scraping text data easier than ever. Custom scraping with sources other than those supported by tm.plugin.webmining can be achieved in R with tools such as RCurl, scrapeR, and XML, but let’s save that for another tut.
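
Before moving on, it's worth peeking at what a WebCorpus actually contains; a minimal check, assuming the Google News scrape returned at least one article:

# Peek at the scraped corpus
length(lv.googlenews)                        # number of articles retrieved
meta(lv.googlenews[[1]])                     # metadata: title, datetimestamp, origin, etc.
substr(content(lv.googlenews[[1]]), 1, 200)  # first 200 characters of the text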

Now that we have our data it’s time to mine the text, starting with only 1-grams. We’ll want to pre-process the text before setting up a Term-Document Matrix (TDM) for our word clouds.

We’ll use tm_map to transform everything to lower case so that differently cased instances of the same term aren't missed. We’ll also remove common junk words (“stopwords”) and custom-defined words of non-interest, strip extra whitespace, and stem words to better match recurrences of the same root.

# Text Mining the Results -------------------------------------------------
corpus <- c(googlefinance, lv.googlenews, p.googlenews, ts.googlenews, lv.yahoofinance,
            lv.yahoonews, p.yahoonews, ts.yahoonews, lv.yahooinplay)
# Not included: lv.nytimes, p.nytimes, ts.nytimes, lv.reutersnews, p.reutersnews, ts.reutersnews

inspect(corpus)
wordlist <- c("lfvn", "lifevantage", "protandim", "truescience", "company", "fiscal", "nasdaq")

ds0.1g <- tm_map(corpus, content_transformer(tolower))
ds1.1g <- tm_map(ds0.1g, content_transformer(removeWords), wordlist)
ds1.1g <- tm_map(ds1.1g, content_transformer(removeWords), stopwords("english"))
ds2.1g <- tm_map(ds1.1g, stripWhitespace)
ds3.1g <- tm_map(ds2.1g, removePunctuation)
ds4.1g <- tm_map(ds3.1g, stemDocument)

tdm.1g <- TermDocumentMatrix(ds4.1g)
dtm.1g <- DocumentTermMatrix(ds4.1g)

The TDM (and its transposition, the DTM) serves as the basis for so many analyses performed in text mining, including word clouds.
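
A quick sanity check on what we've built never hurts; a minimal sketch:

# Dimensions and a small corner of the term-document matrix
dim(tdm.1g)                # terms x documents
inspect(tdm.1g[1:5, 1:5])  # counts for the first few terms and documents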

With our TDM in hand we can proceed to create our 1-gram word cloud. Below you'll also find some optional steps which help you gain familiarity with the data: you can check out which terms were most frequent, find the correlates of interesting words, and create a term-term adjacency matrix for further analysis.


findFreqTerms(tdm.1g, 40)
findFreqTerms(tdm.1g, 60)
findFreqTerms(tdm.1g, 80)
findFreqTerms(tdm.1g, 100)

findAssocs(dtm.1g, "skin", .75)
findAssocs(dtm.1g, "scienc", .5)
findAssocs(dtm.1g, "product", .75) 

tdm89.1g <- removeSparseTerms(tdm.1g, 0.89)
tdm9.1g  <- removeSparseTerms(tdm.1g, 0.9)
tdm91.1g <- removeSparseTerms(tdm.1g, 0.91)
tdm92.1g <- removeSparseTerms(tdm.1g, 0.92)

tdm2.1g <- tdm92.1g

# Create a Boolean matrix (counts # of docs containing each term, not raw term counts)
tdm3.1g <- as.matrix(tdm2.1g)
tdm3.1g[tdm3.1g >= 1] <- 1

# Transform into a term-term adjacency matrix
termMatrix.1gram <- tdm3.1g %*% t(tdm3.1g)

# Inspect corners of the term-term adjacency matrix
termMatrix.1gram[5:10, 5:10]
termMatrix.1gram[1:10, 1:10]
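
A natural follow-up is to treat the adjacency matrix as a network; a quick sketch, assuming the igraph package is installed:

# Sketch: visualize term co-occurrence as a graph (igraph assumed installed)
require(igraph)
g <- graph.adjacency(termMatrix.1gram, weighted = TRUE, mode = "undirected")
g <- simplify(g)  # drop self-loops and duplicate edges
plot(g, vertex.size = 5, vertex.label.cex = 0.8)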

# Create a WordCloud to Visualize the Text Data ---------------------------
notsparse <- tdm2.1g
m = as.matrix(notsparse)
v = sort(rowSums(m),decreasing=TRUE)
d = data.frame(word = names(v),freq=v)
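
# Optional sanity check: eyeball the top stemmed terms before plotting
head(d, 10)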

# Create the word cloud
pal = brewer.pal(9,"BuPu")
wordcloud(words = d$word,
          freq = d$freq,
          scale = c(3,.8),
          random.order = F,
          colors = pal)


Running the code above will give you a nice, stemmed 1-gram word cloud that looks something like this:

[1-gram word cloud: LifeVantage]

Note the use of random.order and a sequential palette from RColorBrewer, which lets me capture more information in the word cloud by assigning meaning to the ordering and coloring of terms. For more details see a recent StackOverflow answer I wrote on this topic.

That’s it for the 1-gram case!

We’ve already gone a bit further than other word cloud tutorials by covering scraping data and symbolic shading/ordering in word clouds. However, we can make a major leap to n-gram word clouds and in doing so we’ll see how to make almost any text-mining analysis flexible enough to handle n-grams by transforming our TDM.

The initial difficulty you run into with n-grams in R is that tm, the most popular package for text mining, does not inherently support tokenization of bi-grams or n-grams. Tokenization is the process of representing a word, part of a word, or group of words (or symbols) as a single data element called a token.

Fortunately, we have some hacks that allow us to continue using tm with an upgraded tokenizer. There's more than one way to achieve this. We can write our own simple tokenizer using the textcnt() function from tau:

tokenize_ngrams <- function(x, n = 3) {
  return(rownames(as.data.frame(unclass(textcnt(x, method = "string", n = n)))))
}


or we can invoke RWeka's tokenizer within tm:

# BigramTokenizer ####
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
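
To see what the tokenizer produces, here's a quick illustration on a toy sentence:

# Example: bigram tokens from a toy sentence
BigramTokenizer("the quick brown fox")
# [1] "the quick"   "quick brown" "brown fox"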


Voilà! Now we are all set to make our n-gram TDM and carry out n-gram text mining analyses including word clouds.


# Create an n-gram Word Cloud ----------------------------------------------
tdm.ng <- TermDocumentMatrix(ds4.1g, control = list(tokenize = BigramTokenizer))
dtm.ng <- DocumentTermMatrix(ds4.1g, control = list(tokenize = BigramTokenizer))

# Try removing sparse terms at a few different levels
tdm89.ng <- removeSparseTerms(tdm.ng, 0.89)
tdm9.ng  <- removeSparseTerms(tdm.ng, 0.9)
tdm91.ng <- removeSparseTerms(tdm.ng, 0.91)
tdm92.ng <- removeSparseTerms(tdm.ng, 0.92)


This time I’ll choose tdm91.ng, which leaves me with 56 2-gram stemmed terms. Note that the sparsity level you enter into removeSparseTerms() may differ slightly from what you see when inspecting the TDM:

> tdm91.ng
<<TermDocumentMatrix (terms: 56, documents: 240)>>
Non-/sparse entries: 1441/11999
Sparsity           : 89%
Maximal term length: 24
Weighting          : term frequency (tf)


Finally, we generate our n-gram word cloud!

notsparse <- tdm91.ng
m = as.matrix(notsparse)
v = sort(rowSums(m),decreasing=TRUE)
d = data.frame(word = names(v),freq=v)

# Create the word cloud
pal = brewer.pal(9,"BuPu")
wordcloud(words = d$word,
          freq = d$freq,
          scale = c(3,.8),
          random.order = F,
          colors = pal)

Now bask in the glory of your n-gram word cloud!

[n-gram word cloud: LifeVantage]

Depending on the length of your terms and other factors, you may need to play with rescaling a bit to make everything fit. This can be accomplished by changing the values of the scale parameter of wordcloud().
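
For instance, one way to tame an overcrowded cloud (a sketch; tweak the numbers to taste):

# Shrink the font-size range and cap the number of words plotted
wordcloud(words = d$word,
          freq = d$freq,
          scale = c(2.5, .5),  # smaller max/min font sizes
          max.words = 100,     # drop the rarest terms if they don't fit
          random.order = F,
          colors = pal)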

That's it! The same TDM you used here can be used for n-gram classification, text regression, recommendation, PCA, and any number of other data products.
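
As a parting sketch, the same matrix drops straight into standard modeling workflows; for example, principal components of the n-gram features:

# Reuse the n-gram DTM as a feature matrix (documents x n-gram terms)
X <- as.matrix(dtm.ng)
pca <- prcomp(X)                # principal components of the n-gram counts
summary(pca)$importance[, 1:5]  # variance explained by the first five PCs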

The full code from this tutorial can be found on GitHub.


(c) 2014
Please send a backlink/citation if you use this tut! 🙂