Posts Tagged ‘data-science’
White & Black box Debuggers, Intelligent Debugging, and Dynamic Analysis
Debugging is a common task for data scientists, programmers, and security experts alike. In good ole RStudio we have a nice, simple built-in white-box debugger. For many analysis-oriented coders, the basic debugging functionality of an IDE like RStudio is all they know and it may be a surprise that debugging is a bigger, much sexier, topic. Below I define and describe key topics in debugging and dynamic analysis, as well as provide links to the most cutting edge free debuggers I use.
All Hail the Data Science Venn Diagram
Forged by the Gods, the ancient data science venn diagram is the oldest, most sacred representation of the field of data science.
I’ve been in love with this simple diagram since I first began working as a data scientist. I love it because it so clearly and simply represents the unique skillset that makes up data science. I’ll write more on this topic and how my own otherwise eclectic skillset coalesced into the practice of professional data science.
I wish I could take credit for creating this simple-but-totally-unsurpassed graphic. Over the past couple of years I’ve often used it as an avatar and if you look close enough you’ll even find it in the background of my (hacked) WordPress header image. While I like to think that it was immaculately convinced, the word on the street is that it was created for the public domain by Drew Conway, who is the co-author of Machine Learning for Hackers*, a private market intelligence and business consultant, my fellow recovering social scientist, recent PhD grad from NYU, and a fast-rising name in data science (yea, he wants to be like me).
*It’s an O’reilly book on ML in R which I kept with me at all times for at least a year; the code is on GitHub and I highly recommend it, though it’s a little basic and its social network analysis section is based on the deprecated Google Social Graph APIScraping Data to build N-gram Word Clouds in R (LifeVantage Use Case)
As social networks, news, blogs, and countless other sources flood our data lakes and warehouses with unstructured text data, R programmers look to tools like word clouds (aka tag clouds) to aid in consumption of the data.
Using the tm.plugin.webmining package to scrape data on the Nrf2 antioxidant supplement-maker LifeVantage, this tutorial extends several existing tutorials to go beyond 1-gram (i.e. single-word) word clouds to N-gram word clouds (i.e. 2 or more words per token).
Word clouds are visualization tools wherein text data are mined such that important terms are displayed as a group according to some algorithm (e.g. scaling of words based upon term density, shading corresponding to frequency, etc.). This can allow a researcher to summarize thousands or even millions of text records in only a glance.
To get started, we load a few good libraries. Whereas tm is the R’s popular text-mining package tm.plugin.webmining, gives us the ability to quickly scrape data from several popular internet sources.
# Packages ---------------------------------------------------------------- require(RWeka) require(tau) require(tm) require(tm.plugin.webmining) require(wordcloud)
The goal is to scrape an example dataset related to LifeVantage, the maker of Nrf2-activating supplement Protandim. LifeVantage is a publicly traded corporation with a vast MLM salesforce, thus we can find data from Google and Yahoo! Finance,
# Scrape Google Finance --------------------------------------------------- googlefinance <- WebCorpus(GoogleFinanceSource("NASDAQ:LFVN")) # Scrape Google News ------------------------------------------------------ lv.googlenews <- WebCorpus(GoogleNewsSource("LifeVantage")) p.googlenews <- WebCorpus(GoogleNewsSource("Protandim")) ts.googlenews <- WebCorpus(GoogleNewsSource("TrueScience")) # Scrape NYTimes ---------------------------------------------------------- lv.nytimes <- WebCorpus(NYTimesSource(query = "LifeVantage", appid = nytimes_appid)) p.nytimes <- WebCorpus(NYTimesSource("Protandim", appid = nytimes_appid)) ts.nytimes <- WebCorpus(NYTimesSource("TrueScience", appid = nytimes_appid)) # Scrape Reuters ---------------------------------------------------------- lv.reutersnews <- WebCorpus(ReutersNewsSource("LifeVantage")) p.reutersnews <- WebCorpus(ReutersNewsSource("Protandim")) ts.reutersnews <- WebCorpus(ReutersNewsSource("TrueScience")) # Scrape Yahoo! Finance --------------------------------------------------- lv.yahoofinance <- WebCorpus(YahooFinanceSource("LFVN")) # Scrape Yahoo! News ------------------------------------------------------ lv.yahoonews <- WebCorpus(YahooNewsSource("LifeVantage")) p.yahoonews <- WebCorpus(YahooNewsSource("Protandim")) ts.yahoonews <- WebCorpus(YahooNewsSource("TrueScience")) # Scrape Yahoo! Inplay ---------------------------------------------------- lv.yahooinplay <- WebCorpus(YahooInplaySource("LifeVantage"))
Done! Our neat little plugin for tm makes scraping text data easier than ever. Custom scraping with sources other than those supported by tm.plugin.webmining can be achieved in R with tools such as RCurl, scrapeR, and XML, but let’s save that for another tut.
Now that we have our data it’s time to mine the text, starting with only 1-grams. We’ll want to pre-process the text before setting up a Term-Document Matrix (TDM) for our word clouds.
We’ll use tm_map to transform everything to lower case to avoid missing matching between differently cased terms. We’ll also want to remove common junk words (“stopwords”) and custom-defined words of non-interest, remove extra spaces, and take the stems of words to better match recurrence of the root words.
# Text Mining the Results ------------------------------------------------- corpus <- c(googlefinance, lv.googlenews, p.googlenews, ts.googlenews, lv.yahoofinance, lv.yahoonews, p.yahoonews, ts.yahoonews, lv.yahooinplay) #lv.nytimes, p.nytimes, ts.nytimes,lv.reutersnews, p.reutersnews, ts.reutersnews, inspect(corpus) wordlist <- c("lfvn", "lifevantage", "protandim", "truescience", "company", "fiscal", "nasdaq") ds0.1g <- tm_map(corpus, content_transformer(tolower)) ds1.1g <- tm_map(ds0.1g, content_transformer(removeWords), wordlist) ds1.1g <- tm_map(ds1.1g, content_transformer(removeWords), stopwords("english")) ds2.1g <- tm_map(ds1.1g, stripWhitespace) ds3.1g <- tm_map(ds2.1g, removePunctuation) ds4.1g <- tm_map(ds3.1g, stemDocument) tdm.1g <- TermDocumentMatrix(ds4.1g) dtm.1g <- DocumentTermMatrix(ds4.1g)
The TDM (and its transposition, the DTM) serves as the basis for so many analyses performed in text mining, including word clouds.
With our TDM in hand we can proceed to create our 1-gram word cloud. Below you’ll also find some optional steps to which help gain familiarity with the data — you can check out which terms were most frequent, find the correlates of interesting words, and create a term-term adjacency matrix for further analysis.
findFreqTerms(tdm.1g, 40) findFreqTerms(tdm.1g, 60) findFreqTerms(tdm.1g, 80) findFreqTerms(tdm.1g, 100) findAssocs(dtm.1g, "skin", .75) findAssocs(dtm.1g, "scienc", .5) findAssocs(dtm.1g, "product", .75) tdm89.1g <- removeSparseTerms(tdm.1g, 0.89) tdm9.1g <- removeSparseTerms(tdm.1g, 0.9) tdm91.1g <- removeSparseTerms(tdm.1g, 0.91) tdm92.1g <- removeSparseTerms(tdm.1g, 0.92) tdm2.1g <- tdm92.1g # Creates a Boolean matrix (counts # docs w/terms, not raw # terms) tdm3.1g <- inspect(tdm2.1g) tdm3.1g[tdm3.1g>=1] <- 1 # Transform into a term-term adjacency matrix termMatrix.1gram <- tdm3.1g %*% t(tdm3.1g) # inspect terms numbered 5 to 10 termMatrix.1gram[5:10,5:10] termMatrix.1gram[1:10,1:10] # Create a WordCloud to Visualize the Text Data --------------------------- notsparse <- tdm2.1g m = as.matrix(notsparse) v = sort(rowSums(m),decreasing=TRUE) d = data.frame(word = names(v),freq=v) # Create the word cloud pal = brewer.pal(9,"BuPu") wordcloud(words = d$word, freq = d$freq, scale = c(3,.8), random.order = F, colors = pal)
Running the code above will give you a nice, stemmed 1-gram word cloud that looks something like this:
Note the use of random.order and a sequential pallet from RColorBrewer, which allows me to capture more information in the word cloud by assigning meaning to the order and coloring of terms. For more details see a recent StackOverflow solution which I wrote on this topic.
That’s it for the 1-gram case!
We’ve already gone a bit further than other word cloud tutorials by covering scraping data and symbolic shading/ordering in word clouds. However, we can make a major leap to n-gram word clouds and in doing so we’ll see how to make almost any text-mining analysis flexible enough to handle n-grams by transforming our TDM.
The initial difficulty you run into with n-grams in R is that tm, the most popular package for text mining, does not inherently support tokenization of bi-grams or n-grams. Tokenization is the process of representing a word, part of a word, or group of words (or symbols) as a single data element called a token.
Fortunately, we have some hacks which allow us to continue using tm with an upgraded tokenizer. There’s more than one way to achieve this. We can write our own simple tokenizer using the textcnt() function from tau:
tokenize_ngrams <- function(x, n=3) return(rownames(as.data.frame(unclass(textcnt(x,method="string",n=n)))))
or we can invoke RWeka‘s tokenizer within tm:
# BigramTokenizer #### BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
Voilà! Now we are all set to make our n-gram TDM and carry out n-gram text mining analyses including word clouds.
# Create an n-gram Word Cloud ---------------------------------------------- tdm.ng <- TermDocumentMatrix(ds5.1g, control = list(tokenize = BigramTokenizer)) dtm.ng <- DocumentTermMatrix(ds5.1g, control = list(tokenize = BigramTokenizer)) # Try removing sparse terms at a few different levels tdm89.ng <- removeSparseTerms(tdm.ng, 0.89) tdm9.ng <- removeSparseTerms(tdm.ng, 0.9) tdm91.ng <- removeSparseTerms(tdm.ng, 0.91) tdm92.ng <- removeSparseTerms(tdm.ng, 0.92)
This time I’ll choose tdm91.ng which leaves me with 56 2-gram stemmed terms. Note that the sparisity level that you enter into removeSparseTerms() may differ slightly from what you see when inspecting the TDM:
> <span class="GFKJRPGCNBB ace_keyword">tdm91.ng </span><<TermDocumentMatrix (terms: 56, documents: 240)>> Non-/sparse entries: 1441/11999 Sparsity : 89% Maximal term length: 24 Weighting : term frequency (tf)
Finally, we generate our n-gram word cloud!
notsparse <- tdm91.ng m = as.matrix(notsparse) v = sort(rowSums(m),decreasing=TRUE) d = data.frame(word = names(v),freq=v) # Create the word cloud pal = brewer.pal(9,"BuPu") wordcloud(words = d$word, freq = d$freq, scale = c(3,.8), random.order = F, colors = pal)
Now bask in the glory of your n-gram word cloud!Depending on the length of your terms and other factors, you may need to play with rescaling a bit to make everything fit. This can be accomplished by changing the values of the **scale** parameter of wordcloud. That's it! The same TDM you used here can be used for n-gram classification, text regression, recommendation, PCA, and any number of other data products.
The full code from this tutorial can be found on Github.
(c) 2014
Please send a backlink/citation if you use this tut! 🙂