Archive for the ‘Statistics’ Category

There are lots of common, mostly very simple, statistics used in experimentation and classification. You find them in Machine Learning courses, medical literature and just about everywhere.

There are some handy resources out there, such as a table of formulas for common performance measures in classification from University of Alberta, but I’ve still noticed a bit of a gap between the stats I commonly see when running validation on algorithms in various software packages and what’s in those cheatsheets.

It has everything from University of Alberta’s reference table (i.e. all the basics), all the stats used in the handy cofusionMatrix function in R’s caret package and a few others I threw in for good measure.

Enjoy

piRacy in R

Ahoy there land lubbers! In honor of TLAPD (international Talk Like a Pirate Day) I’ve decided to write a quick post on topic. The code to company this analysis can be found on Github.

I downloaded data on the top 10 most pirated movies by week over 8 weeks and their legal availability. The goal is to determine if pirates pirate movies completely hedonistically, or if pirating is a super-cool fallback for those 1337 Internet users who can’t find a legitimate copy of their flick available through the normal channels.

```## File: pirate.R
## Description: This script is 97% chum free
## Jason Miller
## http://hack-r.com
## http://github.com/hack-r
## http://hack-r.github.io

# International Talk Like a Pirate Day is Today, Sept. 19 =)
# http://www.talklikeapirate.com/

require(gvlma) #Screw library(), require() rules
require(MASS)

# Source Pirate Functions -------------------------------------------------
source("pirate_functions.R")

# Grab data on pirates ----------------------------------------------------
# Top 10 Pirated Movies by week and their legal availability across forms of media

```

This tells us the rank of the top 10 most pirated movies over time. I’ll invert the columns called rank so that I have a variable which increases as piracy increases.

```# Invert Rank So that Higher = More Piracy --------------------------------
piracy\$pirate <- 1/piracy\$rank
```

Now to determine if legal unavailability of movies drives their piracy.

If we run an Ordinary Least Squares multiple regression model then we’ll quickly discover that many parameters can’t be estimated due to singularity.

```# Determine if Legal Un-availability of Movies Drives Piracy -----------

mod <- lm(pirate ~ available_digital + streaming + rental + purchase + dvd +
netflix_instant + amazon_prime_instant_video + hulu_movies +
crackle + youtube_free + epix + streampix + amazon_video_rental +
apple_itunes_rental + android_rental + vudu_rental + youtube_rental +
amazon_video_purchase + apple_itunes_purchase + android_purchase +
vudu_purchase + amazon_dvd + amazon_bluray + netflix_dvd +redbox,data = piracy)

summary.lm(mod)
gvlma(mod)
```

What the singularity is telling us is that there’s no identifying variation in some of the model parameters. This means many of these highly pirated movies have absolutely NO availability in the singular venues.

Let’s look at some basic descriptive statistics:

```> # Arrr! Seems that we have a singularity in the Ordinary Least Squares regression
> #   above with all these predictors, matee! Well blow me down!
> # There was no identifying variation in some of the explanatory variables!
> # This means many of these highly pirated movies have absolutely NO availability
> # in many of these channels, for example:
> mean(piracy\$netflix_instant)
[1] 0
> mean(piracy\$amazon_prime_instant_video)
[1] 0
[1] 0
> mean(piracy\$epix)
[1] 0
> mean(piracy\$crackle)
[1] 0
> mean(piracy\$streampix)
[1] 0
> mean(piracy\$hulu_movies)
```

So, after running some diagnostic testing and rejecting OLS (see my code on Github) I decide to address both problems by running a more parsimonious probit model after having transformed the dependent variable to be dichotomous. I use bootstrap sampling to estimate the marginal effects of the parameters.

```piracy\$higher_piracy <- 0
piracy\$higher_piracy[piracy\$pirate >= .5] <- 1
table(piracy\$higher_piracy)

probit <- glm(higher_piracy ~ amazon_video_purchase + redbox + amazon_dvd
,data = piracy, family =binomial(link = "probit"))

summary.glm(probit)

mfxboot(modform = "higher_piracy ~ amazon_video_purchase + redbox + amazon_dvd ",
dist = "probit",
data = piracy)
```

The final results are only directional because this is a tiny sample, but they indicate a 6.3% decrease in probability of being a movie with high piracy if the digital movie is available on Amazon for purchase and a decline of 3.7% in probability of higher piracy if the movie is available as a DVD.

SAS: How to Define a Variable for Path to Folder/Project

%let prj = X:\Your\Path\;
libname dat “&prj” ;

It’s a best practice to use this because it makes it easier for other people to edit and reuse with their own paths later.

SAS: What’s CALL SYMPUT?

 CALL SYMPUT(macro-variable, value);

macro-variable
can be one of the following items:

• a character string that is a SAS name, enclosed in quotation marks. For example, to assign the character string testing to macro variable NEW, submit the following statement:
```call symput('new','testing');
```
• the name of a character variable whose values are SAS names. For example, this DATA step creates the three macro variables SHORTSTP, PITCHER, and FRSTBASE and respectively assign them the values ANN, TOM, and BILL.
```data team1;
input position : \$8. player : \$12.;
call symput(position,player);
datalines;
shortstp Ann
pitcher Tom
frstbase Bill
;
```
• a character expression that produces a macro variable name. This form is useful for creating a series of macro variables. For example, the CALL SYMPUT statement builds a series of macro variable names by combining the character string POS and the left-aligned value of _N_ and assigns values to the macro variables POS1, POS2, and POS3.
```data team2;
input position : \$12. player \$12.;
call symput('POS'||left(_n_), position);
datalines;
shortstp Ann
pitcher Tom
frstbase Bill
;
```
value
is the value to be assigned, which can be

• a string enclosed in quotation marks. For example, this statement assigns the string testing to the macro variable NEW:
```call symput('new','testing');
```
• the name of a numeric or character variable. The current value of the variable is assigned as the value of the macro variable. If the variable is numeric, SAS performs an automatic numeric-to-character conversion and writes a message in the log. Later sections on formatting rules describe the rules that SYMPUT follows in assigning character and numeric values of DATA step variables to macro variables.Note:   This form is most useful when macro-variable is also the name of a SAS variable or a character expression that contains a SAS variable because a unique macro variable name and value can be created from each observation, as shown in the previous example for creating the data set TEAM1.

If macro-variable is a character string, SYMPUT creates only one macro variable, and its value changes in each iteration of the program. Only the value assigned in the last iteration remains after program execution is finished.

• a DATA step expression. The value returned by the expression in the current observation is assigned as the value of macro-variable. In this example, the macro variable named HOLDATE receives the value July 4,1997 :
```data c;
input holiday mmddyy.;
call symput('holdate',trim(left(put(holiday,worddate.))));
datalines;
070497
;
run;
```

If the expression is numeric, SAS performs an automatic numeric-to-character conversion and writes a message in the log. Later sections on formatting rules describe the rules that SYMPUT follows in assigning character and numeric values of expressions to macro variables.

 Details

If macro-variable does not exist, SYMPUT creates it. SYMPUT makes a macro variable assignment when the program executes.

SYMPUT can be used in all SAS language programs, including SCL programs. Because it resolves variables at program execution instead of macro execution, SYMPUT should be used to assign macro values from DATA step views, SQL views, and SCL programs.

 Concepts

Scope of Variables Created with SYMPUT

SYMPUT puts the macro variable in the most local nonempty symbol table. A symbol table is nonempty if it contains the following:

• a value
• a computed %GOTO (A computed %GOTO contains % or & and resolves to a label.)
• the macro variable &SYSPBUFF, created at macro invocation time.

However, there are three cases where SYMPUT creates the variable in the local symbol table, even if that symbol table is empty:

• Beginning with Version 8, if SYMPUT is used after a PROC SQL, the variable will be created in a local symbol table.
• If an executing macro contains a computed %GOTO statement and uses SYMPUT to create a macro variable, the variable is created in the local symbol table.
• If an executing macro uses &SYSPBUFF and SYMPUT to create a macro variable, the macro variable is created in the local symbol table.

Problem Trying to Reference a SYMPUT-Assigned Value Before It Is Available

One of the most common problems in using SYMPUT is trying to reference a macro variable value assigned by SYMPUT before that variable is created. The failure generally occurs because the statement referencing the macro variable compiles before execution of the CALL SYMPUT statement that assigns the variable’s value. The most important fact to remember in using SYMPUT is that it assigns the value of the macro variable during program execution, but macro variable references resolve during the compilation of a step, a global statement used outside a step, or an SCL program. As a result:

• You cannot use a macro variable reference to retrieve the value of a macro variable in the same program (or step) in which SYMPUT creates that macro variable and assigns it a value.
• You must specify a step boundary statement to force the DATA step to execute before referencing a value in a global statement following the program (for example, a TITLE statement). The boundary could be a RUN statement or another DATA or PROC statement. For example:
```data x;
x='December';
call symput('var',x);

proc print;
title "Report for &var";
run;
```

 `corr(Y, X)` double precision double precision correlation coefficient `covar_pop(Y, X)` double precision double precision population covariance `covar_samp(Y, X)` double precision double precision sample covariance `regr_avgx(Y, X)` double precision double precision average of the independent variable (sum(X)/N) `regr_avgy(Y, X)` double precision double precision average of the dependent variable (sum(Y)/N) `regr_count(Y, X)` double precision bigint number of input rows in which both expressions are nonnull `regr_intercept(Y, X)` double precision double precision y-intercept of the least-squares-fit linear equation determined by the (X, Y) pairs `regr_r2(Y, X)` double precision double precision square of the correlation coefficient `regr_slope(Y, X)` double precision double precision slope of the least-squares-fit linear equation determined by the (X, Y) pairs `regr_sxx(Y, X)` double precision double precision sum(X^2) – sum(X)^2/N (“sum of squares” of the independent variable) `regr_sxy(Y, X)` double precision double precision sum(X*Y) – sum(X) * sum(Y)/N (“sum of products” of independent times dependent variable) `regr_syy(Y, X)` double precision double precision sum(Y^2) – sum(Y)^2/N (“sum of squares” of the dependent variable) `stddev(expression)` smallint, int, bigint, real, double precision, or numeric double precision for floating-point arguments, otherwise numeric historical alias for `stddev_samp` `stddev_pop(expression)` smallint, int, bigint, real, double precision, or numeric double precision for floating-point arguments, otherwise numeric population standard deviation of the input values `stddev_samp(expression)` smallint, int, bigint, real, double precision, or numeric double precision for floating-point arguments, otherwise numeric sample standard deviation of the input values `variance`(expression) smallint, int, bigint, real, double precision, or numeric double precision for floating-point arguments, otherwise numeric historical alias for `var_samp` `var_pop`(expression) smallint, int, bigint, real, double precision, or numeric double precision for floating-point arguments, otherwise numeric population variance of the input values (square of the population standard deviation) `var_samp`(expression) smallint, int, bigint, real, double precision, or numeric double precision for floating-point arguments, otherwise numeric sample variance of the input values (square of the sample standard deviation)

Scraping Data to build N-gram Word Clouds in R (LifeVantage Use Case)

As social networks, news, blogs, and countless other sources flood our data lakes and warehouses with unstructured text data, R programmers look to tools like word clouds (aka tag clouds) to aid in consumption of the data.

Using the tm.plugin.webmining package to scrape data on the Nrf2 antioxidant supplement-maker LifeVantage, this tutorial extends several existing tutorials to go beyond 1-gram (i.e. single-word) word clouds to N-gram word clouds (i.e. 2 or more words per token).

Word clouds are visualization tools wherein text data are mined such that important terms are displayed as a group according to some algorithm (e.g. scaling of words based upon term density, shading corresponding to frequency, etc.). This can allow a researcher to summarize thousands or even millions of text records in only a glance.

To get started, we load a few good libraries. Whereas tm is the R’s popular text-mining package tm.plugin.webmining, gives us the ability to quickly scrape data from several popular internet sources.

```# Packages ----------------------------------------------------------------
require(RWeka)
require(tau)
require(tm)
require(tm.plugin.webmining)
require(wordcloud)
```

The goal is to scrape an example dataset related to LifeVantage, the maker of Nrf2-activating supplement Protandim. LifeVantage is a publicly traded corporation with a vast MLM salesforce, thus we can find data from Google and Yahoo! Finance,

```# Scrape Google Finance ---------------------------------------------------

# Scrape NYTimes ----------------------------------------------------------
lv.nytimes <- WebCorpus(NYTimesSource(query = "LifeVantage", appid = nytimes_appid))
p.nytimes  <- WebCorpus(NYTimesSource("Protandim", appid = nytimes_appid))
ts.nytimes <- WebCorpus(NYTimesSource("TrueScience", appid = nytimes_appid))

# Scrape Reuters ----------------------------------------------------------
lv.reutersnews <- WebCorpus(ReutersNewsSource("LifeVantage"))
p.reutersnews  <- WebCorpus(ReutersNewsSource("Protandim"))
ts.reutersnews <- WebCorpus(ReutersNewsSource("TrueScience"))

# Scrape Yahoo! Finance ---------------------------------------------------
lv.yahoofinance <- WebCorpus(YahooFinanceSource("LFVN"))

# Scrape Yahoo! News ------------------------------------------------------
lv.yahoonews <- WebCorpus(YahooNewsSource("LifeVantage"))
p.yahoonews  <- WebCorpus(YahooNewsSource("Protandim"))
ts.yahoonews <- WebCorpus(YahooNewsSource("TrueScience"))

# Scrape Yahoo! Inplay ----------------------------------------------------
lv.yahooinplay <- WebCorpus(YahooInplaySource("LifeVantage"))

```

Done! Our neat little plugin for tm makes scraping text data easier than ever. Custom scraping with sources other than those supported by tm.plugin.webmining can be achieved in R with tools such as RCurl, scrapeR, and XML, but let’s save that for another tut.

Now that we have our data it’s time to mine the text, starting with only 1-grams. We’ll want to pre-process the text before setting up a Term-Document Matrix (TDM) for our word clouds.

We’ll use tm_map to transform everything to lower case to avoid missing matching between differently cased terms. We’ll also want to remove common junk words (“stopwords”) and custom-defined words of non-interest, remove extra spaces, and take the stems of words to better match recurrence of the root words.

```# Text Mining the Results -------------------------------------------------
ts.yahoonews, lv.yahooinplay) #lv.nytimes, p.nytimes, ts.nytimes,lv.reutersnews, p.reutersnews, ts.reutersnews,

inspect(corpus)
wordlist <- c("lfvn", "lifevantage", "protandim", "truescience", "company", "fiscal", "nasdaq")

ds0.1g <- tm_map(corpus, content_transformer(tolower))
ds1.1g <- tm_map(ds0.1g, content_transformer(removeWords), wordlist)
ds1.1g <- tm_map(ds1.1g, content_transformer(removeWords), stopwords("english"))
ds2.1g <- tm_map(ds1.1g, stripWhitespace)
ds3.1g <- tm_map(ds2.1g, removePunctuation)
ds4.1g <- tm_map(ds3.1g, stemDocument)

tdm.1g <- TermDocumentMatrix(ds4.1g)
dtm.1g <- DocumentTermMatrix(ds4.1g)

```

The TDM (and its transposition, the DTM) serves as the basis for so many analyses performed in text mining, including word clouds.

With our TDM in hand we can proceed to create our 1-gram word cloud.  Below you’ll also find some optional steps to which help gain familiarity with the data — you can check out which terms were most frequent, find the correlates of interesting words, and create a term-term adjacency matrix for further analysis.

```
findFreqTerms(tdm.1g, 40)
findFreqTerms(tdm.1g, 60)
findFreqTerms(tdm.1g, 80)
findFreqTerms(tdm.1g, 100)

findAssocs(dtm.1g, "skin", .75)
findAssocs(dtm.1g, "scienc", .5)
findAssocs(dtm.1g, "product", .75)

tdm89.1g <- removeSparseTerms(tdm.1g, 0.89)
tdm9.1g  <- removeSparseTerms(tdm.1g, 0.9)
tdm91.1g <- removeSparseTerms(tdm.1g, 0.91)
tdm92.1g <- removeSparseTerms(tdm.1g, 0.92)

tdm2.1g <- tdm92.1g

# Creates a Boolean matrix (counts # docs w/terms, not raw # terms)
tdm3.1g <- inspect(tdm2.1g)
tdm3.1g[tdm3.1g>=1] <- 1

# Transform into a term-term adjacency matrix
termMatrix.1gram <- tdm3.1g %*% t(tdm3.1g)

# inspect terms numbered 5 to 10
termMatrix.1gram[5:10,5:10]
termMatrix.1gram[1:10,1:10]

# Create a WordCloud to Visualize the Text Data ---------------------------
notsparse <- tdm2.1g
m = as.matrix(notsparse)
v = sort(rowSums(m),decreasing=TRUE)
d = data.frame(word = names(v),freq=v)

# Create the word cloud
pal = brewer.pal(9,"BuPu")
wordcloud(words = d\$word,
freq = d\$freq,
scale = c(3,.8),
random.order = F,
colors = pal)

```

Running the code above will give you a nice, stemmed 1-gram word cloud that looks something like this:

Note the use of random.order and a sequential pallet from RColorBrewer, which allows me to capture more information in the word cloud by assigning meaning to the order and coloring of terms. For more details see a recent StackOverflow solution which I  wrote on this topic.

That’s it for the 1-gram case!

We’ve already gone a bit further than other word cloud tutorials by covering scraping data and symbolic shading/ordering in word clouds. However, we can make a major leap to n-gram word clouds and in doing so we’ll see how to make almost any text-mining analysis flexible enough to handle n-grams by transforming our TDM.

The initial difficulty you run into with n-grams in R is that tm, the most popular package for text mining, does not inherently support tokenization of bi-grams or n-grams. Tokenization is the process of representing a word, part of a word, or group of words (or symbols) as a single data element called a token.

Fortunately, we have some hacks which allow us to continue using tm with an upgraded tokenizer.  There’s more than one way to achieve this. We can write our own simple tokenizer using the textcnt() function from tau:

```tokenize_ngrams <- function(x, n=3) return(rownames(as.data.frame(unclass(textcnt(x,method="string",n=n)))))
```

or we can invoke RWeka‘s tokenizer within tm:

```# BigramTokenizer ####
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
```

Voilà! Now we are all set to make our n-gram TDM and carry out n-gram text mining analyses including word clouds.

```
# Create an n-gram Word Cloud ----------------------------------------------
tdm.ng <- TermDocumentMatrix(ds5.1g, control = list(tokenize = BigramTokenizer))
dtm.ng <- DocumentTermMatrix(ds5.1g, control = list(tokenize = BigramTokenizer))

# Try removing sparse terms at a few different levels
tdm89.ng <- removeSparseTerms(tdm.ng, 0.89)
tdm9.ng  <- removeSparseTerms(tdm.ng, 0.9)
tdm91.ng <- removeSparseTerms(tdm.ng, 0.91)
tdm92.ng <- removeSparseTerms(tdm.ng, 0.92)

```

This time I’ll choose tdm91.ng which leaves me with 56 2-gram stemmed terms. Note that the sparisity level that you enter into removeSparseTerms() may differ slightly from what you see when inspecting the TDM:

```> <span class="GFKJRPGCNBB ace_keyword">tdm91.ng
</span><<TermDocumentMatrix (terms: 56, documents: 240)>>
Non-/sparse entries: 1441/11999
Sparsity           : 89%
Maximal term length: 24
Weighting          : term frequency (tf)
```
```

Finally, we generate our n-gram word cloud!

```
```notsparse <- tdm91.ng
m = as.matrix(notsparse)
v = sort(rowSums(m),decreasing=TRUE)
d = data.frame(word = names(v),freq=v)

# Create the word cloud
pal = brewer.pal(9,"BuPu")
wordcloud(words = d\$word,
freq = d\$freq,
scale = c(3,.8),
random.order = F,
colors = pal)

```
` `
```Now bask in the glory of your n-gram word cloud!

Depending on the length of your terms and other factors, you may need to play with rescaling a bit to make everything fit. This can be accomplished by changing the values of the **scale** parameter of wordcloud.

That's it! The same TDM you used here can be used for n-gram classification, text regression, recommendation, PCA, and any number of other data products.```

The full code from this tutorial can be found on Github.

(c) 2014