Posts Tagged ‘statistics’

R: Setup a grid search for xgboost (!!)

I find this code super useful because R’s implementation of xgboost (and to my knowledge Python’s) otherwise lacks support for a grid search:

# set up the cross-validated hyper-parameter search
xgb_grid_1 = expand.grid(
nrounds = 1000,
eta = c(0.01, 0.001, 0.0001),
max_depth = c(2, 4, 6, 8, 10),
gamma = 1
)

# pack the training control parameters
xgb_trcontrol_1 = trainControl(
method = "cv",
number = 5,
verboseIter = TRUE,
returnData = FALSE,
returnResamp = "all",                                                        # save losses across all models
classProbs = TRUE,                                                           # set to TRUE for AUC to be computed
summaryFunction = twoClassSummary,
allowParallel = TRUE
)

# train the model for each parameter combination in the grid,
#   using CV to evaluate
xgb_train_1 = train(
x = as.matrix(df_train %>%
select(-SeriousDlqin2yrs)),
y = as.factor(df_train$SeriousDlqin2yrs),
trControl = xgb_trcontrol_1,
tuneGrid = xgb_grid_1,
method = "xgbTree"
)

# scatter plot of the AUC against max_depth and eta
ggplot(xgb_train_1$results, aes(x = as.factor(eta), y = max_depth, size = ROC, color = ROC)) +
geom_point() +
theme_bw() +
scale_size_continuous(guide = "none")</code>

Rlogo

Mean and Multi-modal Mean Functions (methods) for Java

When it comes to stats Java ain’t no R. Still, we can do anything in one language that we can do in another.

Let’s have a look at some mean functions for Java, to illustrate:

public static double mean(double[] m) {
     double sum = 0;
     for (int i = 0; i < m.length; i++) {
         sum += m[i];
     }
    return sum / m.length;}
For multi-modal:
public static List<Integer> mode(final int[] a) {
     final List<Integer> modes = new ArrayList<Integer>();
     final Map<Integer, Integer> countMap = new HashMap<Integer, Integer>();

     int max = -1;

     for (final int n : numbers) {
         int count = 0;

         if (countMap.containsKey(n)) {
             count = countMap.get(n) + 1;
         } else {
             count = 1;
         }

         countMap.put(n, count);

         if (count > max) {
             max = count;
         }
     }

     for (final Map.Entry<Integer, Integer> tuple : countMap.entrySet()) {
         if (tuple.getValue() == max) {
             modes.add(tuple.getKey());
         }
     }

     return modes;}

Java

Stats: Moments

Significance of moments (raw, central, standardised) and cumulants (raw, standardised), in connection with named properties of distributions
Moment numberRaw momentCentral momentStandardised momentRaw cumulantStandardised cumulant
1mean00meanN/A
2variance1variance1
3skewnessskewness
4historical kurtosis (or flatness)modern kurtosis (i.e. excess kurtosis)
5hyperskewness
6hyperflatness
7+

Stats: Gini Importance in Random Forest Models

Every time a split of a node is made on variable m the gini impurity criterion for the two descendent nodes is less than the parent node. Adding up the gini decreases for each individual variable over all trees in the forest gives a fast variable importance that is often very consistent with the permutation importance measure.

Download: Hack-R Classification Performance Stats Cheatsheet

There are lots of common, mostly very simple, statistics used in experimentation and classification. You find them in Machine Learning courses, medical literature and just about everywhere.

There are some handy resources out there, such as a table of formulas for common performance measures in classification from University of Alberta, but I’ve still noticed a bit of a gap between the stats I commonly see when running validation on algorithms in various software packages and what’s in those cheatsheets.

I made my own, which you may download and reproduce freely (just do me a favor and link to this blog).

cheatsheet_image

It has everything from University of Alberta’s reference table (i.e. all the basics), all the stats used in the handy cofusionMatrix function in R’s caret package and a few others I threw in for good measure.

Enjoy

Download the PDF

Protected: Comparison of Approaches to Multinomial Logit Regression in R

This content is password protected. To view it please enter your password below:

piRacy in R

Ahoy there land lubbers! In honor of TLAPD (international Talk Like a Pirate Day) I’ve decided to write a quick post on topic. The code to company this analysis can be found on Github.

pirate

I downloaded data on the top 10 most pirated movies by week over 8 weeks and their legal availability. The goal is to determine if pirates pirate movies completely hedonistically, or if pirating is a super-cool fallback for those 1337 Internet users who can’t find a legitimate copy of their flick available through the normal channels.

## File: pirate.R
## Description: This script is 97% chum free
## Jason Miller
## http://hack-r.com
## http://github.com/hack-r
## http://hack-r.github.io

# International Talk Like a Pirate Day is Today, Sept. 19 =)
# http://www.talklikeapirate.com/

# Load Packages -----------------------------------------------------------
 require(gvlma) #Screw library(), require() rules
 require(MASS)

# Source Pirate Functions -------------------------------------------------
 source("pirate_functions.R")

# Grab data on pirates ----------------------------------------------------
# Top 10 Pirated Movies by week and their legal availability across forms of media
download.file("http://piracydata.org/csv", destfile = "p.csv")

piracy <- read.csv("p.csv")

This tells us the rank of the top 10 most pirated movies over time. I’ll invert the columns called rank so that I have a variable which increases as piracy increases.

# Invert Rank So that Higher = More Piracy --------------------------------
piracy$pirate <- 1/piracy$rank

Now to determine if legal unavailability of movies drives their piracy.

If we run an Ordinary Least Squares multiple regression model then we’ll quickly discover that many parameters can’t be estimated due to singularity.

# Determine if Legal Un-availability of Movies Drives Piracy -----------

mod <- lm(pirate ~ available_digital + streaming + rental + purchase + dvd +
              netflix_instant + amazon_prime_instant_video + hulu_movies +
            crackle + youtube_free + epix + streampix + amazon_video_rental +
            apple_itunes_rental + android_rental + vudu_rental + youtube_rental +
            amazon_video_purchase + apple_itunes_purchase + android_purchase +
            vudu_purchase + amazon_dvd + amazon_bluray + netflix_dvd +redbox,data = piracy)

summary.lm(mod)
gvlma(mod)

What the singularity is telling us is that there’s no identifying variation in some of the model parameters. This means many of these highly pirated movies have absolutely NO availability in the singular venues.

Let’s look at some basic descriptive statistics:

> # Arrr! Seems that we have a singularity in the Ordinary Least Squares regression
> #   above with all these predictors, matee! Well blow me down!
> # There was no identifying variation in some of the explanatory variables!
> # This means many of these highly pirated movies have absolutely NO availability
> # in many of these channels, for example:
> mean(piracy$netflix_instant)
[1] 0
> mean(piracy$amazon_prime_instant_video)
[1] 0
> mean(piracy$youtube_free)
[1] 0
> mean(piracy$epix)
[1] 0
> mean(piracy$crackle)
[1] 0
> mean(piracy$streampix)
[1] 0
> mean(piracy$hulu_movies)

So, after running some diagnostic testing and rejecting OLS (see my code on Github) I decide to address both problems by running a more parsimonious probit model after having transformed the dependent variable to be dichotomous. I use bootstrap sampling to estimate the marginal effects of the parameters.

piracy$higher_piracy <- 0
piracy$higher_piracy[piracy$pirate >= .5] <- 1
table(piracy$higher_piracy)

probit <- glm(higher_piracy ~ amazon_video_purchase + redbox + amazon_dvd
                  ,data = piracy, family =binomial(link = "probit"))

summary.glm(probit)

mfxboot(modform = "higher_piracy ~ amazon_video_purchase + redbox + amazon_dvd ",
                           dist = "probit",
                         data = piracy)
 

The final results are only directional because this is a tiny sample, but they indicate a 6.3% decrease in probability of being a movie with high piracy if the digital movie is available on Amazon for purchase and a decline of 3.7% in probability of higher piracy if the movie is available as a DVD.

Postgres / PaDB: Statistics

corr(YX)double precisiondouble precisioncorrelation coefficient
covar_pop(YX)double precisiondouble precisionpopulation covariance
covar_samp(YX)double precisiondouble precisionsample covariance
regr_avgx(YX)double precisiondouble precisionaverage of the independent variable (sum(X)/N)
regr_avgy(YX)double precisiondouble precisionaverage of the dependent variable (sum(Y)/N)
regr_count(YX)double precisionbigintnumber of input rows in which both expressions are nonnull
regr_intercept(YX)double precisiondouble precisiony-intercept of the least-squares-fit linear equation determined by the (XY) pairs
regr_r2(YX)double precisiondouble precisionsquare of the correlation coefficient
regr_slope(YX)double precisiondouble precisionslope of the least-squares-fit linear equation determined by the (XY) pairs
regr_sxx(YX)double precisiondouble precisionsum(X^2) – sum(X)^2/N (“sum of squares” of the independent variable)
regr_sxy(YX)double precisiondouble precisionsum(X*Y) – sum(X) * sum(Y)/N (“sum of products” of independent times dependent variable)
regr_syy(YX)double precisiondouble precisionsum(Y^2) – sum(Y)^2/N (“sum of squares” of the dependent variable)
stddev(expression)smallintintbigintrealdouble precision, or numericdouble precision for floating-point arguments, otherwise numerichistorical alias for stddev_samp
stddev_pop(expression)smallintintbigintrealdouble precision, or numericdouble precision for floating-point arguments, otherwise numericpopulation standard deviation of the input values
stddev_samp(expression)smallintintbigintrealdouble precision, or numericdouble precision for floating-point arguments, otherwise numericsample standard deviation of the input values
variance(expression)smallintintbigintrealdouble precision, or numericdouble precision for floating-point arguments, otherwise numerichistorical alias for var_samp
var_pop(expression)smallintintbigintrealdouble precision, or numericdouble precision for floating-point arguments, otherwise numericpopulation variance of the input values (square of the population standard deviation)
var_samp(expression)smallintintbigintrealdouble precision, or numericdouble precision for floating-point arguments, otherwise numericsample variance of the input values (square of the sample standard deviation)

Protected: Logistic-Linked GLM Regression Diagnostics in R

This content is password protected. To view it please enter your password below: