Posts Tagged ‘statistics’
R: Setup a grid search for xgboost (!!)
I find this code super useful because R’s implementation of xgboost (and to my knowledge Python’s) otherwise lacks support for a grid search:
# set up the cross-validated hyper-parameter search xgb_grid_1 = expand.grid( nrounds = 1000, eta = c(0.01, 0.001, 0.0001), max_depth = c(2, 4, 6, 8, 10), gamma = 1 ) # pack the training control parameters xgb_trcontrol_1 = trainControl( method = "cv", number = 5, verboseIter = TRUE, returnData = FALSE, returnResamp = "all", # save losses across all models classProbs = TRUE, # set to TRUE for AUC to be computed summaryFunction = twoClassSummary, allowParallel = TRUE ) # train the model for each parameter combination in the grid, # using CV to evaluate xgb_train_1 = train( x = as.matrix(df_train %>% select(-SeriousDlqin2yrs)), y = as.factor(df_train$SeriousDlqin2yrs), trControl = xgb_trcontrol_1, tuneGrid = xgb_grid_1, method = "xgbTree" ) # scatter plot of the AUC against max_depth and eta ggplot(xgb_train_1$results, aes(x = as.factor(eta), y = max_depth, size = ROC, color = ROC)) + geom_point() + theme_bw() + scale_size_continuous(guide = "none")</code>
Mean and Multi-modal Mean Functions (methods) for Java
When it comes to stats Java ain’t no R. Still, we can do anything in one language that we can do in another.
Let’s have a look at some mean functions for Java, to illustrate:
public static double mean(double[] m) {
double sum = 0;
for (int i = 0; i < m.length; i++) {
sum += m[i];
}
return sum / m.length;}
public static List<Integer> mode(final int[] a) {
final List<Integer> modes = new ArrayList<Integer>();
final Map<Integer, Integer> countMap = new HashMap<Integer, Integer>();
int max = -1;
for (final int n : numbers) {
int count = 0;
if (countMap.containsKey(n)) {
count = countMap.get(n) + 1;
} else {
count = 1;
}
countMap.put(n, count);
if (count > max) {
max = count;
}
}
for (final Map.Entry<Integer, Integer> tuple : countMap.entrySet()) {
if (tuple.getValue() == max) {
modes.add(tuple.getKey());
}
}
return modes;}
Stats: Moments
Moment number | Raw moment | Central moment | Standardised moment | Raw cumulant | Standardised cumulant |
---|---|---|---|---|---|
1 | mean | 0 | 0 | mean | N/A |
2 | – | variance | 1 | variance | 1 |
3 | – | – | skewness | – | skewness |
4 | – | – | historical kurtosis (or flatness) | – | modern kurtosis (i.e. excess kurtosis) |
5 | – | – | hyperskewness | – | – |
6 | – | – | hyperflatness | – | – |
7+ | – | – | – | – | – |
Stats: Gini Importance in Random Forest Models
Every time a split of a node is made on variable m the gini impurity criterion for the two descendent nodes is less than the parent node. Adding up the gini decreases for each individual variable over all trees in the forest gives a fast variable importance that is often very consistent with the permutation importance measure.
Download: Hack-R Classification Performance Stats Cheatsheet
There are lots of common, mostly very simple, statistics used in experimentation and classification. You find them in Machine Learning courses, medical literature and just about everywhere.
There are some handy resources out there, such as a table of formulas for common performance measures in classification from University of Alberta, but I’ve still noticed a bit of a gap between the stats I commonly see when running validation on algorithms in various software packages and what’s in those cheatsheets.
I made my own, which you may download and reproduce freely (just do me a favor and link to this blog).
It has everything from University of Alberta’s reference table (i.e. all the basics), all the stats used in the handy cofusionMatrix function in R’s caret package and a few others I threw in for good measure.
Enjoy
Download the PDF
piRacy in R
Ahoy there land lubbers! In honor of TLAPD (international Talk Like a Pirate Day) I’ve decided to write a quick post on topic. The code to company this analysis can be found on Github.
I downloaded data on the top 10 most pirated movies by week over 8 weeks and their legal availability. The goal is to determine if pirates pirate movies completely hedonistically, or if pirating is a super-cool fallback for those 1337 Internet users who can’t find a legitimate copy of their flick available through the normal channels.
## File: pirate.R ## Description: This script is 97% chum free ## Jason Miller ## http://hack-r.com ## http://github.com/hack-r ## http://hack-r.github.io # International Talk Like a Pirate Day is Today, Sept. 19 =) # http://www.talklikeapirate.com/ # Load Packages ----------------------------------------------------------- require(gvlma) #Screw library(), require() rules require(MASS) # Source Pirate Functions ------------------------------------------------- source("pirate_functions.R") # Grab data on pirates ---------------------------------------------------- # Top 10 Pirated Movies by week and their legal availability across forms of media download.file("http://piracydata.org/csv", destfile = "p.csv") piracy <- read.csv("p.csv")
This tells us the rank of the top 10 most pirated movies over time. I’ll invert the columns called rank so that I have a variable which increases as piracy increases.
# Invert Rank So that Higher = More Piracy -------------------------------- piracy$pirate <- 1/piracy$rank
Now to determine if legal unavailability of movies drives their piracy.
If we run an Ordinary Least Squares multiple regression model then we’ll quickly discover that many parameters can’t be estimated due to singularity.
# Determine if Legal Un-availability of Movies Drives Piracy ----------- mod <- lm(pirate ~ available_digital + streaming + rental + purchase + dvd + netflix_instant + amazon_prime_instant_video + hulu_movies + crackle + youtube_free + epix + streampix + amazon_video_rental + apple_itunes_rental + android_rental + vudu_rental + youtube_rental + amazon_video_purchase + apple_itunes_purchase + android_purchase + vudu_purchase + amazon_dvd + amazon_bluray + netflix_dvd +redbox,data = piracy) summary.lm(mod) gvlma(mod)
What the singularity is telling us is that there’s no identifying variation in some of the model parameters. This means many of these highly pirated movies have absolutely NO availability in the singular venues.
Let’s look at some basic descriptive statistics:
> # Arrr! Seems that we have a singularity in the Ordinary Least Squares regression > # above with all these predictors, matee! Well blow me down! > # There was no identifying variation in some of the explanatory variables! > # This means many of these highly pirated movies have absolutely NO availability > # in many of these channels, for example: > mean(piracy$netflix_instant) [1] 0 > mean(piracy$amazon_prime_instant_video) [1] 0 > mean(piracy$youtube_free) [1] 0 > mean(piracy$epix) [1] 0 > mean(piracy$crackle) [1] 0 > mean(piracy$streampix) [1] 0 > mean(piracy$hulu_movies)
So, after running some diagnostic testing and rejecting OLS (see my code on Github) I decide to address both problems by running a more parsimonious probit model after having transformed the dependent variable to be dichotomous. I use bootstrap sampling to estimate the marginal effects of the parameters.
piracy$higher_piracy <- 0 piracy$higher_piracy[piracy$pirate >= .5] <- 1 table(piracy$higher_piracy) probit <- glm(higher_piracy ~ amazon_video_purchase + redbox + amazon_dvd ,data = piracy, family =binomial(link = "probit")) summary.glm(probit) mfxboot(modform = "higher_piracy ~ amazon_video_purchase + redbox + amazon_dvd ", dist = "probit", data = piracy)
The final results are only directional because this is a tiny sample, but they indicate a 6.3% decrease in probability of being a movie with high piracy if the digital movie is available on Amazon for purchase and a decline of 3.7% in probability of higher piracy if the movie is available as a DVD.