## Archive for the ‘Statistics’ Category

## Happy Pi Day 2018!

In honor of Pi Day, I usually like to do a little on-topic code snippet.

This year I was running low on time, but I thought I’d ask the question “Can Pi be useful in predictive modeling, ML, AI, etc?”.

Of course the answer is going to be a big “yes!”. Transformations with mathematical constants are underutilized, perhaps because it’s not always intuitive to leverage a constant scalar in a model. Let’s see a trivial example with the famous `iris` data set, built into R.

Compare the 2 models below and you’ll be pleasantly surprised. Pi helped us explain more variance and helped to create another highly significant predictor capturing a potentially unique effect:

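```
data(iris)

# Baseline: linear model of the setosa indicator on sepal length
summary(lm(I(Species == "setosa") ~ Sepal.Length, data = iris))

# Same model plus sepal length raised to the power pi
summary(lm(I(Species == "setosa") ~ Sepal.Length + I(Sepal.Length^pi), data = iris))
```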

## R: Setup a grid search for xgboost (!!)

I find this code **super** useful because R’s implementation of xgboost (and, to my knowledge, Python’s) otherwise lacks built-in support for a grid search. The `caret` package fills the gap:

```
library(caret)    # train(), trainControl(), twoClassSummary
library(dplyr)    # %>% and select()
library(ggplot2)

# df_train is assumed to be a data frame with a binary SeriousDlqin2yrs column

# set up the cross-validated hyper-parameter search
# (recent caret versions require all seven xgbTree tuning columns)
xgb_grid_1 = expand.grid(
  nrounds = 1000,
  eta = c(0.01, 0.001, 0.0001),
  max_depth = c(2, 4, 6, 8, 10),
  gamma = 1,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

# pack the training control parameters
xgb_trcontrol_1 = trainControl(
  method = "cv",
  number = 5,
  verboseIter = TRUE,
  returnData = FALSE,
  returnResamp = "all",              # save losses across all models
  classProbs = TRUE,                 # set to TRUE for AUC to be computed
  summaryFunction = twoClassSummary,
  allowParallel = TRUE
)

# train the model for each parameter combination in the grid,
# using CV to evaluate
xgb_train_1 = train(
  x = as.matrix(df_train %>% select(-SeriousDlqin2yrs)),
  # classProbs = TRUE needs class levels that are valid R names, hence make.names()
  y = factor(make.names(df_train$SeriousDlqin2yrs)),
  trControl = xgb_trcontrol_1,
  tuneGrid = xgb_grid_1,
  metric = "ROC",
  method = "xgbTree"
)

# scatter plot of the AUC against max_depth and eta
ggplot(xgb_train_1$results, aes(x = as.factor(eta), y = max_depth, size = ROC, color = ROC)) +
  geom_point() +
  theme_bw() +
  scale_size_continuous(guide = "none")
```

## Mean and Multi-modal Mean Functions (methods) for Java

When it comes to stats Java ain’t no R. Still, we can do anything in one language that we can do in another.

Let’s have a look at a mean function and a (multi-modal) mode function for Java, to illustrate:

```
public static double mean(double[] m) {
    double sum = 0;
    for (int i = 0; i < m.length; i++) {
        sum += m[i];
    }
    // an empty array yields NaN (0.0 / 0)
    return sum / m.length;
}
```

```
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// returns every value that occurs with the highest frequency
public static List<Integer> mode(final int[] numbers) {
    final List<Integer> modes = new ArrayList<Integer>();
    final Map<Integer, Integer> countMap = new HashMap<Integer, Integer>();
    int max = -1;
    // first pass: count occurrences and track the highest count
    for (final int n : numbers) {
        int count = 0;
        if (countMap.containsKey(n)) {
            count = countMap.get(n) + 1;
        } else {
            count = 1;
        }
        countMap.put(n, count);
        if (count > max) {
            max = count;
        }
    }
    // second pass: collect all values whose count equals the maximum
    for (final Map.Entry<Integer, Integer> tuple : countMap.entrySet()) {
        if (tuple.getValue() == max) {
            modes.add(tuple.getKey());
        }
    }
    return modes;
}
```

## Stats: Dirichlet process

In probability theory, the **Dirichlet process** (after Peter Gustav Lejeune Dirichlet) is a family of stochastic processes whose realizations are probability distributions. In other words, a Dirichlet process is a probability distribution whose domain is itself a set of probability distributions. It is often used in Bayesian inference to describe the prior knowledge about the distribution of random variables, that is, how likely it is that the random variables are distributed according to one or another particular distribution.

The Dirichlet process is specified by a base distribution and a positive real number (alpha) called the concentration parameter. The base distribution (H) is the expected value of the process, that is, the Dirichlet process draws distributions “around” the base distribution in the way that a normal distribution draws real numbers around its mean. However, even if the base distribution is continuous, the distributions drawn from the Dirichlet process are almost surely discrete. The concentration parameter specifies how strong this discretization is: in the limit of alpha → 0, the realizations are all concentrated on a single value, while in the limit of alpha → infinity the realizations become continuous. In between the two extremes the realizations are discrete distributions with less and less concentration as alpha increases.
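To make the role of alpha concrete, here is a minimal sketch of one standard way to simulate a draw from a Dirichlet process, the stick-breaking construction. The text above does not mention this construction. The choice of a standard normal for H, the truncation rule, and every name in the code are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Truncated stick-breaking draw from DP(alpha, H), with H = N(0, 1) as an illustrative choice. */
final class StickBreakingSketch {

    /**
     * Returns the atoms of one realization as {location, weight} pairs.
     * Stops once the unassigned mass drops below tol or maxAtoms is reached,
     * so the weights sum to at most 1.
     */
    static List<double[]> draw(double alpha, int maxAtoms, double tol, Random rng) {
        List<double[]> atoms = new ArrayList<>();
        double remaining = 1.0; // length of the stick not yet broken off
        while (atoms.size() < maxAtoms && remaining > tol) {
            // Beta(1, alpha) via its inverse CDF, F(x) = 1 - (1 - x)^alpha
            double beta = 1.0 - Math.pow(rng.nextDouble(), 1.0 / alpha);
            double weight = beta * remaining;
            double location = rng.nextGaussian(); // atom location drawn from H
            atoms.add(new double[] {location, weight});
            remaining -= weight;
        }
        return atoms;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        for (double alpha : new double[] {0.1, 1.0, 50.0}) {
            List<double[]> atoms = draw(alpha, 1000, 1e-6, rng);
            System.out.printf("alpha=%.1f: %d atoms, first weight %.3f%n",
                    alpha, atoms.size(), atoms.get(0)[1]);
        }
    }
}
```

With a small alpha the first atom tends to take almost all of the mass, while a large alpha spreads the mass over many atoms, which matches the two limits described above.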

The Dirichlet process can also be seen as the infinite-dimensional generalization of the Dirichlet distribution. In the same way as the Dirichlet distribution is the conjugate prior for the categorical distribution, the Dirichlet process is the conjugate prior for infinite, nonparametric discrete distributions.

The Dirichlet process was formally introduced by Thomas Ferguson in 1973 [1] and has since been applied in data mining and machine learning, among others for natural language processing, computer vision and bioinformatics.

## Stats: Major Correlation Types (Pearson, Kendall, Spearman)

**Correlation** is a bivariate analysis that measures the strength of association between two variables. In statistics, the value of the correlation coefficient varies between +1 and -1. A value of exactly ±1 indicates a perfect degree of association between the two variables; as the value moves towards 0, the relationship becomes weaker. The three most commonly used types of correlation are Pearson correlation, Kendall rank correlation and Spearman correlation.

**Pearson r correlation:** Pearson *r* correlation is widely used in statistics to measure the degree of the relationship between linearly related variables. For example, in the stock market, if we want to measure how two commodities are related to each other, Pearson *r* correlation is used to measure the degree of relationship between the two commodities.

**Questions a Pearson correlation answers**

Is there a statistically significant relationship between age, as measured in years, and height, measured in inches?

Is there a relationship between temperature, measured in degrees Fahrenheit, and ice cream sales, measured by income?

Is there a relationship between job satisfaction, as measured by the JSS, and income, measured in dollars?

**Assumptions**

For the Pearson *r* correlation, both variables should be normally distributed. Other assumptions include linearity and homoscedasticity. Linearity assumes a straight-line relationship between the variables in the analysis, and homoscedasticity assumes that the data are equally spread about the regression line, that is, the variance around the line is constant.

**Key Terms**

**Effect size:** Cohen’s standard will be used to evaluate the correlation coefficient to determine the strength of the relationship, or the effect size, where coefficients between .10 and .29 represent a small association; coefficients between .30 and .49 represent a medium association; and coefficients of .50 and above represent a large association or relationship.

**Continuous data:** Data of this type possess the properties of magnitude and equal intervals between adjacent units. Equal intervals between adjacent units means that there are equal amounts of the variable being measured between adjacent units on the scale. An example would be age. An increase in age from 21 to 22 would be the same as an increase in age from 60 to 61: one year. In addition, we can perform mathematical functions on scale data to determine if X – Y = A – B, X – Y > A – B, or if X – Y < A – B. We can also perform other mathematical operations including addition, multiplication, and division.
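As a rough illustration of the Pearson section above, the following sketch computes *r* for two samples and labels the result using Cohen’s thresholds as quoted. The class and method names, the sample data, and the "negligible" label for values below .10 are assumptions made for the example.

```java
final class PearsonSketch {

    /** Pearson's r: covariance of x and y divided by the product of their standard deviations. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two samples of equal length >= 2");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy); // NaN if either sample is constant
    }

    /** Cohen's labels as quoted in the text; "negligible" below .10 is an added assumption. */
    static String effectSize(double r) {
        double magnitude = Math.abs(r);
        if (magnitude >= 0.50) return "large";
        if (magnitude >= 0.30) return "medium";
        if (magnitude >= 0.10) return "small";
        return "negligible";
    }

    public static void main(String[] args) {
        // made-up ages (years) and heights (inches)
        double[] ageYears = {8, 10, 12, 14, 16};
        double[] heightInches = {50.0, 54.5, 58.0, 63.0, 64.5};
        double r = pearson(ageYears, heightInches);
        System.out.printf("r = %.3f (%s association)%n", r, effectSize(r));
    }
}
```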

**Kendall rank correlation:** Kendall rank correlation is a non-parametric test that measures the strength of dependence between two variables. If we consider two samples, a and b, where each sample size is *n*, we know that the total number of pairings of a with b is *n*(*n*-1)/2.

**Key Terms**

**Concordant:** Ordered in the same way.

**Discordant:** Ordered differently.
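The counting of concordant and discordant pairs can be sketched as follows. The statistic computed here is the usual tau-a, (concordant - discordant) / (*n*(*n*-1)/2). The text above does not spell this formula out, so treat it, along with the names used, as an assumption. Tied pairs count as neither concordant nor discordant.

```java
final class KendallSketch {

    /** Kendall's tau-a over all n(n-1)/2 pairs of observations. */
    static double kendallTauA(double[] a, double[] b) {
        if (a.length != b.length || a.length < 2) {
            throw new IllegalArgumentException("need two samples of equal length >= 2");
        }
        int n = a.length;
        long concordant = 0;
        long discordant = 0;
        for (int i = 0; i < n - 1; i++) {
            for (int j = i + 1; j < n; j++) {
                // positive when the pair is ordered the same way in both samples
                double direction = Math.signum(a[i] - a[j]) * Math.signum(b[i] - b[j]);
                if (direction > 0) {
                    concordant++;
                } else if (direction < 0) {
                    discordant++;
                }
            }
        }
        double pairs = n * (n - 1) / 2.0;
        return (concordant - discordant) / pairs;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4, 5};
        double[] b = {2, 1, 4, 3, 5};
        System.out.printf("tau-a = %.3f%n", kendallTauA(a, b));
    }
}
```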

**Spearman rank correlation:** Spearman rank correlation is a non-parametric test that is used to measure the degree of association between two variables. It was developed by Spearman, thus it is called the Spearman rank correlation. The Spearman rank correlation test does not make any assumptions about the distribution of the data and is the appropriate correlation analysis when the variables are measured on a scale that is at least ordinal.

**Questions Spearman Correlation Answers**

Is there a statistically significant relationship between participant responses to two Likert-scale questions?

Is there a statistically significant relationship between how the horses place in the race and the horses’ ages?

**Assumptions**

The Spearman rank correlation test does not make any assumptions about the distribution. The assumptions of Spearman rho correlation are that the data must be at least ordinal and that scores on one variable must be monotonically related to the other variable.

**Key Terms**

**Effect size:** Cohen’s standard will be used to evaluate the correlation coefficient to determine the strength of the relationship, or the effect size, where coefficients between .10 and .29 represent a small association; coefficients between .30 and .49 represent a medium association; and coefficients of .50 and above represent a large association or relationship.

**Ordinal data:** Ordinal scales rank order the items that are being measured to indicate if they possess more, less, or the same amount of the variable being measured. An ordinal scale allows us to determine if X > Y, Y > X, or if X = Y. An example would be rank ordering the participants in a dance contest. The dancer who was ranked one was a better dancer than the dancer who was ranked two. The dancer ranked two was a better dancer than the dancer who was ranked three, and so on. Although this scale allows us to determine greater than, less than, or equal to, it still does not define the magnitude of the relationship between units.
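A minimal sketch of Spearman’s rho under one common definition: replace each value by its rank (ties share the average rank) and compute the Pearson correlation of the two rank vectors. The names and the horse-race numbers are made up for illustration.

```java
import java.util.Arrays;

final class SpearmanSketch {

    /** 1-based ranks of v; tied values share the average of the positions they occupy. */
    static double[] ranks(double[] v) {
        int n = v.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (p, q) -> Double.compare(v[p], v[q]));
        double[] rank = new double[n];
        int start = 0;
        while (start < n) {
            int end = start;
            while (end + 1 < n && v[order[end + 1]] == v[order[start]]) {
                end++;
            }
            double average = (start + end) / 2.0 + 1.0; // mean of positions start+1 .. end+1
            for (int k = start; k <= end; k++) {
                rank[order[k]] = average;
            }
            start = end + 1;
        }
        return rank;
    }

    /** Spearman's rho: Pearson correlation of the rank vectors. */
    static double spearman(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two samples of equal length >= 2");
        }
        double[] rx = ranks(x);
        double[] ry = ranks(y);
        int n = x.length;
        double meanRank = (n + 1) / 2.0; // mean of ranks 1..n, also with average ties
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = rx[i] - meanRank;
            double dy = ry[i] - meanRank;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy); // NaN if either sample is constant
    }

    public static void main(String[] args) {
        // made-up finishing places and ages (years) of six horses
        double[] place = {1, 2, 3, 4, 5, 6};
        double[] age = {4, 5, 3, 6, 7, 7};
        System.out.printf("rho = %.3f%n", spearman(place, age));
    }
}
```

Because only ranks enter the calculation, any monotonically increasing relationship gives a rho of 1 even when it is far from linear.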



## Passing-Bablok regression analysis

This is a placeholder; here are a few relevant links:

## Stats: Moments

Moment number | Raw moment | Central moment | Standardised moment | Raw cumulant | Standardised cumulant |
---|---|---|---|---|---|
1 | mean | 0 | 0 | mean | N/A |
2 | – | variance | 1 | variance | 1 |
3 | – | – | skewness | – | skewness |
4 | – | – | historical kurtosis (or flatness) | – | modern kurtosis (i.e. excess kurtosis) |
5 | – | – | hyperskewness | – | – |
6 | – | – | hyperflatness | – | – |
7+ | – | – | – | – | – |
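The named entries in the table can be computed with a few lines of code. This is a sketch using population (divide-by-*n*) definitions, which is an assumption; sample-corrected estimators differ slightly. Excess kurtosis is the standardised fourth moment minus 3.

```java
final class MomentsSketch {

    static double mean(double[] x) {
        double sum = 0.0;
        for (double value : x) {
            sum += value;
        }
        return sum / x.length;
    }

    /** k-th central moment: average of (x - mean)^k, dividing by n. */
    static double centralMoment(double[] x, int k) {
        double m = mean(x);
        double sum = 0.0;
        for (double value : x) {
            sum += Math.pow(value - m, k);
        }
        return sum / x.length;
    }

    /** k-th standardised moment: k-th central moment divided by sigma^k. */
    static double standardisedMoment(double[] x, int k) {
        return centralMoment(x, k) / Math.pow(centralMoment(x, 2), k / 2.0);
    }

    public static void main(String[] args) {
        double[] data = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.printf("mean            = %.4f%n", mean(data));
        System.out.printf("variance        = %.4f%n", centralMoment(data, 2));
        System.out.printf("skewness        = %.4f%n", standardisedMoment(data, 3));
        System.out.printf("kurtosis        = %.4f%n", standardisedMoment(data, 4));
        System.out.printf("excess kurtosis = %.4f%n", standardisedMoment(data, 4) - 3.0);
    }
}
```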

## Machine Learning: Definition of %Var(y) in R’s randomForest package’s regression method

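In the trace output of a regression forest, the second column (`%Var(y)`) is simply the first column (the out-of-bag MSE) divided by the variance of the responses that have been OOB up to that point (20 trees in the example discussed), times 100.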

Source:

https://stat.ethz.ch/pipermail/r-help/2008-July/167748.html
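The arithmetic behind that definition, as a small sketch. The method name is made up, and the use of the divide-by-*n* variance is an assumption here; the source only says "the variance of the response".

```java
final class PercentVarSketch {

    /** 100 * MSE / Var(y), with the variance taken over the OOB responses (dividing by n). */
    static double percentVar(double oobMse, double[] oobResponses) {
        int n = oobResponses.length;
        double mean = 0.0;
        for (double y : oobResponses) {
            mean += y;
        }
        mean /= n;
        double variance = 0.0;
        for (double y : oobResponses) {
            variance += (y - mean) * (y - mean);
        }
        variance /= n;
        return 100.0 * oobMse / variance;
    }

    public static void main(String[] args) {
        double[] y = {1, 2, 3, 4, 5}; // made-up responses
        System.out.printf("%%Var(y) = %.2f%n", percentVar(0.5, y));
    }
}
```

Read this way, a lower `%Var(y)` means the forest leaves a smaller share of the response variance unexplained.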

## Stats: Gini Importance in Random Forest Models

Every time a split of a node is made on variable m the gini impurity criterion for the two descendent nodes is less than the parent node. Adding up the gini decreases for each individual variable over all trees in the forest gives a fast variable importance that is often very consistent with the permutation importance measure.
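A sketch of the quantity being summed. Gini impurity of a node is 1 minus the sum of squared class proportions. The decrease for one split is taken here as the parent impurity minus the size-weighted impurities of the two children; that weighting, and all names, are assumptions of this sketch and not necessarily what a given random forest implementation does.

```java
final class GiniSketch {

    /** Gini impurity from per-class counts: 1 - sum of squared class proportions. */
    static double impurity(int[] classCounts) {
        int total = 0;
        for (int count : classCounts) {
            total += count;
        }
        if (total == 0) {
            return 0.0; // an empty node is treated as pure
        }
        double sumSquares = 0.0;
        for (int count : classCounts) {
            double p = (double) count / total;
            sumSquares += p * p;
        }
        return 1.0 - sumSquares;
    }

    /** Impurity decrease of one split, weighting each child by its share of the parent's samples. */
    static double decrease(int[] parent, int[] left, int[] right) {
        double nLeft = 0.0;
        double nRight = 0.0;
        for (int count : left) {
            nLeft += count;
        }
        for (int count : right) {
            nRight += count;
        }
        double n = nLeft + nRight;
        return impurity(parent)
                - (nLeft / n) * impurity(left)
                - (nRight / n) * impurity(right);
    }

    public static void main(String[] args) {
        int[] parent = {6, 4}; // made-up class counts before the split
        int[] left = {5, 1};
        int[] right = {1, 3};
        System.out.printf("gini decrease = %.4f%n", decrease(parent, left, right));
    }
}
```

Summing `decrease` over every split made on variable m, across all trees, gives the importance score described above.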

## Stats: ANOVA/ANCOVA: Type I, II, III SS

Consider a model with two factors, A and B, and their interaction AB. The different types of sums of squares arise depending on the stage of model reduction at which the tests are carried out. In particular:

**Type I (“sequential”):**

- `SS(A)` for factor A.
- `SS(B | A)` for factor B.
- `SS(AB | B, A)` for interaction AB.
- This tests the main effect of factor `A`, followed by the main effect of factor `B` *after* the main effect of `A`, followed by the interaction effect `AB` *after* the main effects.
- Because of the sequential nature and the fact that the two main factors are tested *in a particular order*, this type of sums of squares will give different results for unbalanced data depending on which main effect is considered first.
- For unbalanced data, this approach tests for a difference in the *weighted* marginal means. In practical terms, this means that the results are dependent on the realized sample sizes, namely the proportions in the particular data set. In other words, it is testing the first factor without *controlling* for the other factor.
- Note that this is often **not** the hypothesis that is of interest when dealing with unbalanced data.

**Type II:**

- `SS(A | B)` for factor A.
- `SS(B | A)` for factor B.
- This type tests for each main effect *after* the other main effect.
- Note that *no significant interaction* is assumed (in other words, you should test for interaction first (`SS(AB | A, B)`) and only if `AB` is not significant, continue with the analysis for main effects).
- If there is indeed no interaction, then type II is statistically more powerful than type III (see Langsrud [3] for further details).
- Computationally, this is equivalent to running a type I analysis with different orders of the factors, and taking the appropriate output (the second, where one main effect is run *after* the other, in the example above).

**Type III:**

- `SS(A | B, AB)` for factor A.
- `SS(B | A, AB)` for factor B.
- This type tests for the presence of a main effect *after* the other main effect and interaction. This approach is therefore valid in the presence of significant interactions.
- However, it is often not interesting to interpret a main effect if interactions are present (generally speaking, if a significant interaction is present, the main effects should not be further analyzed).
- If the interactions are not significant, type II gives a more powerful test.

When data is balanced, the factors are *orthogonal*, and types I, II and III all give the same results.
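For readers unfamiliar with the conditional notation, each term can be read as a difference between the model sums of squares of two nested fits. Here SS(·) is assumed to denote the sum of squares explained by a model containing the listed terms; the expansions below are a reading aid under that assumption, not a quotation from the source:

$$
\begin{aligned}
SS(B \mid A) &= SS(A, B) - SS(A) \\
SS(AB \mid A, B) &= SS(A, B, AB) - SS(A, B) \\
SS(A \mid B, AB) &= SS(A, B, AB) - SS(B, AB)
\end{aligned}
$$

Equivalently, each conditional sum of squares is the drop in residual sum of squares when the term on the left of the bar is added to a model that already contains the terms on the right.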