Archive for the ‘Text Mining’ Category

How to Conditionally Remove Character of a Vector Element in R

I have (sometimes incomplete) data on addresses that looks like this:

data <- c("1600 Pennsylvania Avenue, Washington DC", 
          ",Siem Reap,FC,", "11 Wall Street, New York, NY", ",Addis Ababa,FC,")  

where I need to remove the first and/or last character if either of them is a comma.

Avinash Raj was able to help me with this on Stack Overflow, and the question turned out to be a popular one, so I’ll show the solution here:

> data <- c("1600 Pennsylvania Avenue, Washington DC", 
+           ",Siem Reap,FC,", "11 Wall Street, New York, NY", ",Addis Ababa,FC,")
> gsub("(?<=^),|,(?=$)", "", data, perl=TRUE)
[1] "1600 Pennsylvania Avenue, Washington DC"
[2] "Siem Reap,FC"                           
[3] "11 Wall Street, New York, NY"           
[4] "Addis Ababa,FC" 

Pattern explanation:

  • (?<=^), — In regex, (?<=...) is called a positive lookbehind. In our case it asserts that what precedes the comma must be the line start ^, so it matches a leading comma.
  • | — Alternation, the regex OR operator, used here to combine the two patterns.
  • ,(?=$) — The lookahead asserts that what follows the comma must be the line end $, so it matches a comma at the line end.
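As an aside (my own note, not part of the original answer): because each lookaround here is anchored to the start or end of the string, plain anchors give the same result without needing perl=TRUE:

```r
data <- c("1600 Pennsylvania Avenue, Washington DC",
          ",Siem Reap,FC,", "11 Wall Street, New York, NY", ",Addis Ababa,FC,")

# ^, matches a leading comma; ,$ matches a trailing comma
gsub("^,|,$", "", data)
# [1] "1600 Pennsylvania Avenue, Washington DC"
# [2] "Siem Reap,FC"
# [3] "11 Wall Street, New York, NY"
# [4] "Addis Ababa,FC"
```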


Java: How to import StAX libraries for parsing XML

In short:


import javax.xml.stream.*; //XMLInputFactory, XMLStreamReader, event codes
import java.io.*;          //FileReader
import java.util.*;        //usually, but not always, needed

In long:

Here are steps in writing code to parse an XML document with StAX.

  1. Import the StAX libraries (javax.xml.stream), as shown above.
  2. Create an XMLInputFactory.
  3. Create an XMLStreamReader and pass a Reader to it, such as a FileReader. The XML file is passed as a parameter to the FileReader.
  4. Iterate through the contents of the XML file using the stream reader’s next() method.
  5. next() returns an event code that indicates which part of the document has been read, such as DTD, START_ELEMENT, CHARACTERS, or END_ELEMENT.
  6. If you get the START_ELEMENT event code, you can retrieve the element’s name using the getLocalName() method. To read the attributes, use the getAttributeValue() method.
  7. To read the text between the start and end tags, wait until you receive the CHARACTERS event code. Afterwards, you can read the text using getText().
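Putting the steps together, here is a minimal runnable sketch (the XML string, class name, and printed labels are my own illustration, not from the lesson):

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxDemo {
    // Walks the XML stream, recording element names, attributes, and text
    public static List<String> parse(Reader source) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();        // step 2
        XMLStreamReader reader = factory.createXMLStreamReader(source); // step 3
        List<String> out = new ArrayList<>();
        while (reader.hasNext()) {            // step 4: iterate the stream
            int event = reader.next();        // step 5: event code
            if (event == XMLStreamConstants.START_ELEMENT) {            // step 6
                out.add("element: " + reader.getLocalName());
                for (int i = 0; i < reader.getAttributeCount(); i++) {
                    out.add("attr: " + reader.getAttributeLocalName(i)
                            + "=" + reader.getAttributeValue(i));
                }
            } else if (event == XMLStreamConstants.CHARACTERS
                       && !reader.isWhiteSpace()) {                     // step 7
                out.add("text: " + reader.getText());
            }
        }
        reader.close();
        return out;
    }

    public static void main(String[] args) throws Exception {
        // A StringReader works here too; for a file, pass a FileReader instead
        String xml = "<book id=\"1\"><title>StAX Basics</title></book>";
        for (String line : parse(new StringReader(xml))) {
            System.out.println(line);
        }
        // prints: element: book / attr: id=1 / element: title / text: StAX Basics
    }
}
```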

Thanks to my instructor Carl Limsico for the step-by-step!

Scraping Data to build N-gram Word Clouds in R (LifeVantage Use Case)

As social networks, news, blogs, and countless other sources flood our data lakes and warehouses with unstructured text data, R programmers look to tools like word clouds (aka tag clouds) to aid in consumption of the data.

Using the tm.plugin.webmining package to scrape data on the Nrf2 antioxidant supplement-maker LifeVantage, this tutorial extends several existing tutorials to go beyond 1-gram (i.e. single-word) word clouds to N-gram word clouds (i.e. 2 or more words per token).


Word clouds are visualization tools wherein text data are mined such that important terms are displayed as a group according to some algorithm (e.g. scaling of words based upon term density, shading corresponding to frequency, etc.). This allows a researcher to summarize thousands or even millions of text records at a glance.

To get started, we load a few good libraries. Whereas tm is R’s popular text-mining package, tm.plugin.webmining gives us the ability to quickly scrape data from several popular internet sources.

# Packages ----------------------------------------------------------------
require(tm)
require(tm.plugin.webmining)
require(RColorBrewer)
require(wordcloud)
require(RWeka) # for NGramTokenizer, used in the n-gram section
require(tau)   # for textcnt, an alternative tokenizer

The goal is to scrape an example dataset related to LifeVantage, the maker of the Nrf2-activating supplement Protandim. LifeVantage is a publicly traded corporation with a vast MLM salesforce, so we can find data on it from sources such as Google and Yahoo! Finance:

# Scrape Google Finance ---------------------------------------------------
googlefinance <- WebCorpus(GoogleFinanceSource("NASDAQ:LFVN"))

# Scrape Google News ------------------------------------------------------
lv.googlenews <- WebCorpus(GoogleNewsSource("LifeVantage"))
p.googlenews  <- WebCorpus(GoogleNewsSource("Protandim"))
ts.googlenews <- WebCorpus(GoogleNewsSource("TrueScience"))

# Scrape NYTimes ----------------------------------------------------------
lv.nytimes <- WebCorpus(NYTimesSource(query = "LifeVantage", appid = nytimes_appid))
p.nytimes  <- WebCorpus(NYTimesSource("Protandim", appid = nytimes_appid))
ts.nytimes <- WebCorpus(NYTimesSource("TrueScience", appid = nytimes_appid))

# Scrape Reuters ----------------------------------------------------------
lv.reutersnews <- WebCorpus(ReutersNewsSource("LifeVantage"))
p.reutersnews  <- WebCorpus(ReutersNewsSource("Protandim"))
ts.reutersnews <- WebCorpus(ReutersNewsSource("TrueScience"))

# Scrape Yahoo! Finance ---------------------------------------------------
lv.yahoofinance <- WebCorpus(YahooFinanceSource("LFVN"))

# Scrape Yahoo! News ------------------------------------------------------
lv.yahoonews <- WebCorpus(YahooNewsSource("LifeVantage"))
p.yahoonews  <- WebCorpus(YahooNewsSource("Protandim"))
ts.yahoonews <- WebCorpus(YahooNewsSource("TrueScience"))

# Scrape Yahoo! Inplay ----------------------------------------------------
lv.yahooinplay <- WebCorpus(YahooInplaySource("LifeVantage"))

Done! Our neat little plugin for tm makes scraping text data easier than ever. Custom scraping with sources other than those supported by tm.plugin.webmining can be achieved in R with tools such as RCurl, scrapeR, and XML, but let’s save that for another tut.

Now that we have our data it’s time to mine the text, starting with only 1-grams. We’ll want to pre-process the text before setting up a Term-Document Matrix (TDM) for our word clouds.

We’ll use tm_map to transform everything to lower case so that differently cased instances of the same term still match. We’ll also remove common junk words (“stopwords”) and custom-defined words of non-interest, strip extra whitespace and punctuation, and stem words to better match recurrences of the same root word.

# Text Mining the Results -------------------------------------------------
corpus <- c(googlefinance, lv.googlenews, p.googlenews, ts.googlenews, lv.yahoofinance, lv.yahoonews, p.yahoonews,
ts.yahoonews, lv.yahooinplay) #lv.nytimes, p.nytimes, ts.nytimes,lv.reutersnews, p.reutersnews, ts.reutersnews,

wordlist <- c("lfvn", "lifevantage", "protandim", "truescience", "company", "fiscal", "nasdaq")

ds0.1g <- tm_map(corpus, content_transformer(tolower))
ds1.1g <- tm_map(ds0.1g, content_transformer(removeWords), wordlist)
ds1.1g <- tm_map(ds1.1g, content_transformer(removeWords), stopwords("english"))
ds2.1g <- tm_map(ds1.1g, stripWhitespace)
ds3.1g <- tm_map(ds2.1g, removePunctuation)
ds4.1g <- tm_map(ds3.1g, stemDocument)

tdm.1g <- TermDocumentMatrix(ds4.1g)
dtm.1g <- DocumentTermMatrix(ds4.1g)

The TDM (and its transposition, the DTM) serves as the basis for so many analyses performed in text mining, including word clouds.

With our TDM in hand we can proceed to create our 1-gram word cloud. Below you’ll also find some optional steps which help you gain familiarity with the data: you can check out which terms were most frequent, find the correlates of interesting words, and create a term-term adjacency matrix for further analysis.

findFreqTerms(tdm.1g, 40)
findFreqTerms(tdm.1g, 60)
findFreqTerms(tdm.1g, 80)
findFreqTerms(tdm.1g, 100)

findAssocs(dtm.1g, "skin", .75)
findAssocs(dtm.1g, "scienc", .5)
findAssocs(dtm.1g, "product", .75) 

tdm89.1g <- removeSparseTerms(tdm.1g, 0.89)
tdm9.1g  <- removeSparseTerms(tdm.1g, 0.9)
tdm91.1g <- removeSparseTerms(tdm.1g, 0.91)
tdm92.1g <- removeSparseTerms(tdm.1g, 0.92)

tdm2.1g <- tdm92.1g

# Create a Boolean matrix (counts # of docs w/ each term, not raw term counts)
tdm3.1g <- as.matrix(tdm2.1g)
tdm3.1g[tdm3.1g >= 1] <- 1

# Transform into a term-term adjacency matrix
termMatrix.1gram <- tdm3.1g %*% t(tdm3.1g)

# Inspect terms numbered 5 to 10
inspect(tdm2.1g[5:10, ])

# Create a WordCloud to Visualize the Text Data ---------------------------
notsparse <- tdm2.1g
m = as.matrix(notsparse)
v = sort(rowSums(m),decreasing=TRUE)
d = data.frame(word = names(v),freq=v)

# Create the word cloud
pal = brewer.pal(9,"BuPu")
wordcloud(words = d$word,
          freq = d$freq,
          scale = c(3,.8),
          random.order = F,
          colors = pal)


Running the code above will give you a nice, stemmed 1-gram word cloud that looks something like this:

[LifeVantage 1-gram word cloud]

Note the use of random.order and a sequential palette from RColorBrewer, which allows me to capture more information in the word cloud by assigning meaning to the order and coloring of terms. For more details see a recent StackOverflow solution which I wrote on this topic.

That’s it for the 1-gram case!

We’ve already gone a bit further than other word cloud tutorials by covering scraping data and symbolic shading/ordering in word clouds. However, we can make a major leap to n-gram word clouds and in doing so we’ll see how to make almost any text-mining analysis flexible enough to handle n-grams by transforming our TDM.

The initial difficulty you run into with n-grams in R is that tm, the most popular package for text mining, does not inherently support tokenization of bi-grams or n-grams. Tokenization is the process of representing a word, part of a word, or group of words (or symbols) as a single data element called a token.

Fortunately, we have some hacks which allow us to continue using tm with an upgraded tokenizer.  There’s more than one way to achieve this. We can write our own simple tokenizer using the textcnt() function from tau:

tokenize_ngrams <- function(x, n=3) return(rownames(as.data.frame(unclass(textcnt(x, method="string", n=n)))))


or we can invoke RWeka’s tokenizer within tm:

# BigramTokenizer ####
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))


Voilà! Now we are all set to make our n-gram TDM and carry out n-gram text mining analyses including word clouds.

# Create an n-gram Word Cloud ----------------------------------------------
tdm.2g <- TermDocumentMatrix(ds4.1g, control = list(tokenize = BigramTokenizer))
dtm.2g <- DocumentTermMatrix(ds4.1g, control = list(tokenize = BigramTokenizer))

# Try removing sparse terms at a few different levels
tdm89.2g <- removeSparseTerms(tdm.2g, 0.89)
tdm9.2g  <- removeSparseTerms(tdm.2g, 0.9)
tdm91.2g <- removeSparseTerms(tdm.2g, 0.91)
tdm92.2g <- removeSparseTerms(tdm.2g, 0.92)


This time I’ll choose tdm89.2g, which leaves me with 56 2-gram stemmed terms. Note that the sparsity level that you enter into removeSparseTerms() may differ slightly from what you see when inspecting the TDM:

> tdm89.2g
<<TermDocumentMatrix (terms: 56, documents: 240)>>
Non-/sparse entries: 1441/11999
Sparsity           : 89%
Maximal term length: 24
Weighting          : term frequency (tf)

Finally, we generate our n-gram word cloud!

notsparse <- tdm89.2g
m = as.matrix(notsparse)
v = sort(rowSums(m),decreasing=TRUE)
d = data.frame(word = names(v),freq=v)

# Create the word cloud
pal = brewer.pal(9,"BuPu")
wordcloud(words = d$word,
          freq = d$freq,
          scale = c(3,.8),
          random.order = F,
          colors = pal)

Now bask in the glory of your n-gram word cloud!

[LifeVantage n-gram word cloud]

Depending on the length of your terms and other factors, you may need to play with rescaling a bit to make everything fit. This can be accomplished by changing the values of the scale parameter of wordcloud.

That's it! The same TDM you used here can be used for n-gram classification, text regression, recommendation, PCA, and any number of other data products.

The full code from this tutorial can be found on Github.

(c) 2014
Please send a backlink/citation if you use this tut! 🙂