Archive for the ‘Data Science’ Category

R: Set up a grid search for xgboost (!!)

I find this code super useful because R’s implementation of xgboost (and to my knowledge Python’s) otherwise lacks support for a grid search:

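# load caret (train, trainControl), dplyr (%>%, select) and ggplot2 (plotting);
#   caret's "xgbTree" method also needs the xgboost package installed
library(caret)
library(dplyr)
library(ggplot2)

# set up the cross-validated hyper-parameter search
xgb_grid_1 = expand.grid(
  nrounds = 1000,
  eta = c(0.01, 0.001, 0.0001),
  max_depth = c(2, 4, 6, 8, 10),
  gamma = 1,
  # newer caret versions also expect these columns; the values are the xgboost defaults
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

# pack the training control parameters
xgb_trcontrol_1 = trainControl(
  method = "cv",
  number = 5,
  verboseIter = TRUE,
  returnData = FALSE,
  returnResamp = "all",               # save losses across all models
  classProbs = TRUE,                  # set to TRUE for AUC to be computed
  summaryFunction = twoClassSummary,
  allowParallel = TRUE
)

# train the model for each parameter combination in the grid,
#   using CV to evaluate (df_train is your own training data frame)
xgb_train_1 = train(
  x = as.matrix(df_train %>% select(-SeriousDlqin2yrs)),
  # classProbs = TRUE needs class levels that are valid R names ("X0"/"X1", not "0"/"1")
  y = as.factor(make.names(df_train$SeriousDlqin2yrs)),
  trControl = xgb_trcontrol_1,
  tuneGrid = xgb_grid_1,
  metric = "ROC",                     # select on the AUC reported by twoClassSummary
  method = "xgbTree"
)

# scatter plot of the AUC against max_depth and eta
ggplot(xgb_train_1$results, aes(x = as.factor(eta), y = max_depth, size = ROC, color = ROC)) +
  geom_point() +
  theme_bw() +
  scale_size_continuous(guide = "none")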


Magile Manifesto: Deprecating over- and mis-applied “Agile” concepts

After working in a couple of “Agile shops” that embodied the typical misapplication, misinterpretation, and commonly correlated (though technically unrelated) evils associated with the mutated forms of Agile, Scrum, and Lean now reaching Business Intelligence and other non-software-related business units en masse, I put together my own reboot.

Magile* Data Science Principles:

– Interactions over buzzwords and fluff
– Accurate information over false-but-compelling “high level” simplified reporting
– Collaboration over cutting throats
– Adaptive planning over planless adaptation
– Transparency over secrecy
– Individuals over groups

*  Miller’s rebooted Agile

Mean and Multi-modal Mode Functions (methods) for Java

When it comes to stats Java ain’t no R. Still, we can do anything in one language that we can do in another.

Let’s have a look at some mean functions for Java, to illustrate:

public static double mean(double[] m) {
     double sum = 0;
     for (int i = 0; i < m.length; i++) {
         sum += m[i];
     }
     return sum / m.length;
}
For multi-modal data, a mode function that returns every mode:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public static List<Integer> mode(final int[] a) {
     final List<Integer> modes = new ArrayList<Integer>();
     final Map<Integer, Integer> countMap = new HashMap<Integer, Integer>();

     int max = -1;

     // count how often each value occurs, tracking the highest count seen
     for (final int n : a) {
         int count = 0;

         if (countMap.containsKey(n)) {
             count = countMap.get(n) + 1;
         } else {
             count = 1;
         }

         countMap.put(n, count);

         if (count > max) {
             max = count;
         }
     }

     // every value that reaches the highest count is a mode
     for (final Map.Entry<Integer, Integer> tuple : countMap.entrySet()) {
         if (tuple.getValue() == max) {
             modes.add(tuple.getKey());
         }
     }

     return modes;
}


How to Conditionally Remove a Character from a Vector Element in R

I have (sometimes incomplete) data on addresses that looks like this:

data <- c("1600 Pennsylvania Avenue, Washington DC", 
          ",Siem Reap,FC,", "11 Wall Street, New York, NY", ",Addis Ababa,FC,")  

where I need to remove the first and/or last character if either one is a comma.

Avinash Raj was able to help me with this on S.O. and the question turned out to be a popular one, so I’ll show the solution here:

> data <- c("1600 Pennsylvania Avenue, Washington DC", 
+           ",Siem Reap,FC,", "11 Wall Street, New York, NY", ",Addis Ababa,FC,")
> gsub("(?<=^),|,(?=$)", "", data, perl=TRUE)
[1] "1600 Pennsylvania Avenue, Washington DC"
[2] "Siem Reap,FC"                           
[3] "11 Wall Street, New York, NY"           
[4] "Addis Ababa,FC" 

Pattern explanation:

  • (?<=^), The (?<=...) construct is a positive look-behind. Here it asserts that what precedes the comma must be a line start ^, so it matches the starting comma.
  • | The logical OR operator, used to combine (i.e., OR) two regexes.
  • ,(?=$) The look-ahead asserts that what follows the comma must be a line end $, so it matches the comma at the line end.
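
Since this blog also deals in Java, here is a minimal sketch of the same trim there. It is not part of Avinash Raj's answer. The class and method names (EdgeCommas, stripEdgeCommas) are made up for illustration. Because ^ and $ are already zero-width anchors, the sketch drops the look-arounds and uses the shorter pattern ^,|,$, which should match the same commas.

```java
import java.util.regex.Pattern;

class EdgeCommas {
    // A comma at the very start of the string, or a comma at the very end.
    private static final Pattern EDGE_COMMA = Pattern.compile("^,|,$");

    // Removes a leading and/or trailing comma; interior commas are left alone.
    static String stripEdgeCommas(String s) {
        return EDGE_COMMA.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String[] data = {
            "1600 Pennsylvania Avenue, Washington DC",
            ",Siem Reap,FC,",
            "11 Wall Street, New York, NY",
            ",Addis Ababa,FC,"
        };
        for (String s : data) {
            System.out.println(stripEdgeCommas(s));
        }
    }
}
```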


Java: Determine if String is a URL/URI or file

In the spirit of making a more polymorphous app, you may need to pull off this trick, as I did in a recent assignment at Berkeley. I compiled a few different ways of getting the job done:
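import java.net.MalformedURLException;
import java.net.URL;

public boolean isLocalFile(String file) {
     try {
         new URL(file);   // parses as a URL with a known protocol, so not a local file
         return false;
     } catch (MalformedURLException e) {
         return true;     // no recognised protocol: treat it as a local path
     }
}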
There’s also a util for this in Android’s toolkit (not worth grabbing unless you’re specifically writing for Android, though).
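
Another way to get the job done, sketched here under a few assumptions of my own, is to inspect the scheme with java.net.URI instead of relying on an exception from URL (whose constructors are deprecated as of Java 20). The class and method names (LocationKind, hasNonFileScheme) and the example.com address are hypothetical. Note that this sketch makes different choices from the method above: a file: URI counts as local, and a one-letter scheme is assumed to be a Windows drive letter.

```java
import java.net.URI;
import java.net.URISyntaxException;

class LocationKind {
    // Returns true when the string parses as a URI whose scheme is something
    // other than "file" (http, https, ftp, ...). Everything else is treated
    // as a local path.
    static boolean hasNonFileScheme(String location) {
        try {
            String scheme = new URI(location).getScheme();
            if (scheme == null) {
                return false;               // relative or plain path, e.g. data/input.txt
            }
            if (scheme.length() == 1) {
                return false;               // assumption: "C:" style Windows drive letter
            }
            return !scheme.equalsIgnoreCase("file");
        } catch (URISyntaxException e) {
            return false;                   // spaces, backslashes, etc.: assume a local path
        }
    }

    public static void main(String[] args) {
        System.out.println(hasNonFileScheme("http://example.com/data.txt"));
        System.out.println(hasNonFileScheme("data/input.txt"));
    }
}
```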
Another semi-related thing, for when Java can’t find a file you point it at:
  1. Make sure the filename is correct (proper capitalization, matching extension etc – as already suggested).
  2. Use the Class.getResource method to locate your file in the classpath – don’t rely on the current directory:
    URL url = insertionSort.class.getResource("10_Random");
    
    File file = new File(url.toURI());
  3. Specify the absolute file path via command-line arguments:
    File file = new File(args[0]);

In Eclipse:

  1. Choose “Run configurations”
  2. Go to the “Arguments” tab
  3. Put your “c:/Users/HackR/somewhere/10_myjava.txt.or.something” into the “Program arguments” section


Java: How to import StAX libraries for parsing XML

In short:

 

import javax.xml.stream.*;
import java.io.*;
import java.util.*; // usually, but not always, needed

In long:

Here are steps in writing code to parse an XML document with StAX.

  1. Import the following libraries:
     import javax.xml.stream.*;
     import java.io.*;
  2. Create an XMLInputFactory. See the read() method above.
  3. Create an XMLStreamReader from the factory and pass a Reader to it, such as a FileReader. The XML file is passed as a parameter to FileReader.
  4. We can now iterate through the contents of our XML file using the stream reader’s next() method.
  5. next() returns an event code that indicates which part of the document has been read, such as DTD, START_ELEMENT, CHARACTERS and END_ELEMENT.
  6. If you get the START_ELEMENT event code, you can retrieve the element’s name using the getLocalName() method. To read the attributes, use the getAttributeValue() method.
  7. To read the text between the start and end tags, wait until you receive the CHARACTERS event code. Afterwards, you can read the text using getText().
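
The steps above can be sketched roughly as follows. This is my own minimal illustration, not the instructor's code. The class name StaxSketch, the describe method, and the dogs/dog/id XML are hypothetical, and a StringReader stands in for the FileReader so the example runs without a file on disk. Coalescing is switched on so that adjacent text arrives as a single CHARACTERS event.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class StaxSketch {
    // Walks the document and records one line per START_ELEMENT,
    // non-blank CHARACTERS, and END_ELEMENT event.
    static List<String> describe(Reader xml) {
        List<String> events = new ArrayList<>();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(xml);
            while (reader.hasNext()) {
                int event = reader.next();
                switch (event) {
                    case XMLStreamConstants.START_ELEMENT: {
                        String line = "start " + reader.getLocalName();
                        // null namespace: match the attribute by local name only
                        String id = reader.getAttributeValue(null, "id");
                        if (id != null) {
                            line += " id=" + id;
                        }
                        events.add(line);
                        break;
                    }
                    case XMLStreamConstants.CHARACTERS:
                        if (!reader.isWhiteSpace()) {
                            events.add("text " + reader.getText().trim());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        events.add("end " + reader.getLocalName());
                        break;
                    default:
                        break; // other event types are ignored in this sketch
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return events;
    }

    public static void main(String[] args) {
        String xml = "<dogs><dog id=\"1\">Rex</dog></dogs>";
        for (String line : describe(new StringReader(xml))) {
            System.out.println(line);
        }
    }
}
```

To read from disk instead, pass new FileReader("yourfile.xml") to describe and handle the FileNotFoundException it can throw.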

Thanks to my instructor Carl Limsico for the step-by-step!

Write an R Package from Scratch with Github

Writing an R package is simple. Writing an R package via GitHub is simple and smart. GitHub adds all the traditional benefits of version control, in addition to showing off your work and facilitating publication of your package. This tutorial was inspired by a blog post from the beautiful Hilary Parker last year. I used her tut myself, but trying to integrate it with GitHub led to some headaches, and I felt there were a couple of other small additions to be made.

This has been sitting in my Evernote for some time, so I figured it was about time to upload it to my own highly neglected blog. As a caveat, I still need to append more sample code and such, so watch for updates.

 
Step 0: Load the necessary packages  
if (!require("pacman")) install.packages("pacman") # Don’t use pacman yet? Get ready to fall in love
pacman::p_load("devtools", "roxygen2")

 

Step 1: Create your package directory
 
* Create a new repo on Github with the name of your package 
* Create a new project in RStudio from the Github repo
* Open a .R file to begin writing code
* Open the automatically generated README.md file and edit appropriately
 
Step 2: Add functions
 
* Enter your functions and save the file (e.g. dog_function.R)
* You can move this to the R folder once it has been automatically created in Step 3, or feel free to create the folder before saving the .R file (remember not to overwrite it in the next step)
 

Step 3: Add minimal documentation

* Use devtools by typing create("packagename")

* Copy the files in this newly created folder (except the .Rproj and .gitignore files) to the top level folder you cloned from Github
* Delete the folder created by create()
* Edit the files to reflect the details of your package, such as its license and author
 

Step 4: Add optional but recommended examples and docs

 

4a. data
 
* dir.create("data") # Example .RData goes here (optional, but strongly recommended)
* include a file called datalist to list the data in this folder, for example:
 
4b. vignettes
 
* dir.create("vignettes") # From the top level folder that you created on Github
* Add .pdf or .Rnw vignette files here
 
 
4c. man
 
* dir.create("man") # From the top level folder that you created on Github
* Add .Rd manual files here
 
Step 5: Process your documentation
 
setwd("./dogs")
document()
 
Step 6: Install your package!
 
setwd("..")
install("dogs")


R: Happy Pi Day

Today, 3/14/2015, is Pi Day (see http://piday.org).

In honor of Pi Day, I threw together a little R code on Github, which discusses pi, prints it, and creates Julia set (fractal) images based on it:

https://github.com/hack-r/Rpiday

Happy Pi Day!


Software Sec: C / C++ Buffer overflows and Robert Morris

Buffer Overflow = any access of a buffer outside of its allotted bounds
  • over-read or over-write
  • could be during iteration (running off the end), or direct access (pointer arithmetic)
  • this is a general definition; some people use more specific definitions of differing types of buffer overflows

A buffer overflow is a bug that affects low-level code, typically C and C++, with significant security implications.
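
For contrast, here is a small sketch of my own in Java, where every array access is bounds-checked at runtime. Reading one element past the end throws an exception instead of silently touching neighboring memory, which is the check that plain C and C++ arrays lack. The class and method names are hypothetical.

```java
class BoundsCheckSketch {
    // Tries to read one element past the end of a small buffer and
    // reports whether the runtime rejected the access.
    static boolean overReadIsRejected() {
        int[] buffer = new int[4];              // valid indices are 0..3
        try {
            int leaked = buffer[buffer.length]; // index 4: one past the end
            return false;                       // only reached if the read went through
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;                        // the JVM refused the out-of-bounds read
        }
    }

    public static void main(String[] args) {
        System.out.println("over-read rejected: " + overReadIsRejected());
    }
}
```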

Normally causes a crash, but can be used to:
  • dump/steal information
  • corrupt information
  • run code (payload)
They also share common features with other bugs.
C and C++ are among the most popular languages (behind Java), and therefore buffer overflows are a major vulnerability. C/C++ are heavily used in:
  •      OS Kernels
  •      embedded systems
  •      HPC servers
The first widely known buffer overflow exploit came in 1988 from a student named Robert Morris, as part of a self-propagating computer worm that attacked fingerd on VAXes (Morris was caught and punished but is now an MIT professor); this worm affected roughly 10% of the Internet.
In 2001, CodeRed exploited a buffer overflow in the MS-IIS server and infected more than 300,000 machines in 14 hours.
In 2003, the SQL Slammer worm infected 75,000 machines in 10 minutes by exploiting a buffer overflow in MS-SQL Server.
In 2014, a latent buffer overflow bug was found in X11 that had been present for over 23 years.

 

 

Computer History: CDC 6600

The first supercomputer was the CDC 6600, announced in 1964 by Control Data Corporation with a starting price of $6 million.