Archive for the ‘Editorial’ Category

TM Philosophy via gutenbergr

The gutenbergr package is an excellent API wrapper for Project Gutenberg, which provides unlimited free access to public domain books and materials.

Inspired by an analytical philosopher working in DS, I decided to spin up a Shiny app to allow myself and the public to text-mine one or more works at a time.

I started a new blog for the project at:

The app itself can be found at; however, the free hosting is a bit slow, so it won’t load quickly until the project gets moved to AWS, Azure, a NameCheap VM or some other affordable production environment.

Please see the initial, and currently only, blog article at for GitHub links to the full source code – and feel free to contribute!

Kaggle – my brief shining moment in the top 10

I started playing with the (all too addictive) Kaggle competitions this past December, on and off.

This past week I reached a personal high point by making the top 10 in a featured competition for the first time.


Since then, my ranking has dropped a bit, but there’s still time for me to take first! 😉 Just don’t hold your breath…

Magile Manifesto: Deprecating over- and mis-applied “Agile” concepts

I have worked in a couple of “Agile shops” that embodied the typical misapplication, misinterpretation, and commonly correlated (though technically unrelated) evils associated with the mutated forms of Agile, Scrum, and Lean now reaching Business Intelligence and other non-software business units en masse.

Magile* Data Science Principles:

– Interactions over buzzwords and fluff
– Accurate information over false-but-compelling “high level” simplified reporting
– Collaboration over cutting throats
– Adaptive planning over planless adaptation
– Transparency over secrecy
– Individuals over groups

*  Miller’s rebooted Agile

Editorial: Notes v. aRticles and Tuts

My original plan for this blog was for it to be composed of code snippets and miscellaneous notes, taken from my own Evernote notes with only minimal editing.

This accomplished a couple of goals — it gave me a practical use for my notes and allowed me to contribute knowledge to “the community” (by which I could mean the Data Science “community”, but by which I effectively just mean the Internet) in a time-effective manner. It also used to provide me with some small amount of ad revenue until

  1. some visitors complained that the ads detracted from the blog’s UX and
  2. Google AdSense froze my account anyway, because spammer-hackers*** spamdexing for a network of Russian international dating websites used a clever referrer-bomb hack to promote their sites, sending zombie traffic to a non-existent referrer URL appended to my domain (and to thousands of other folks’ domains).

However, I came to a bit of a dilemma. The short code snippets and brief programming notes gave me a fair amount of search engine traffic, but they were burying the smaller number of higher-quality, longer-length tutorials and articles (“aRticles”, for you R devotees) that folks from the R-Bloggers crowd come looking for.

My initial solution was to link those tuts and articles from the homepage, but this seemed insufficient. I could solve the issue with tags, except that, to follow the protocol used by R-Bloggers and others, only my article-style posts should receive “R” tags, whereas I have lots of snippets and quick-tips which need that tag to be properly indexed. Even if I did conform to that standard, it would only solve the problem for R-related material.

I could create completely separate blogs for the 2 types of content, but it would waste precious time in duplicating logistical tasks, would require me to share even more links for such a humble amount of content, and since I love this domain name I don’t want to detract from it with some other closely related but different domain.

Let me know if you have any thoughts. I haven’t decided, but I think I may try forcing users entering through the homepage to choose between the 2 distinct sections of the site, each directing to a subdomain with a separately-indexed blog.


*** More power to them! I think it’s pretty funny and clever, though I figured the folks back in Mountain View (i.e. Google) were smart enough to be able to just adjust the ad revenue calculation to remove that part of the traffic. Oh well, this blog wasn’t exactly about to buy me a luxury yacht with Google’s cash anyway.

Download: Hack-R Classification Performance Stats Cheatsheet

There are lots of common, mostly very simple, statistics used in experimentation and classification. You find them in Machine Learning courses, medical literature and just about everywhere.

There are some handy resources out there, such as a table of formulas for common performance measures in classification from University of Alberta, but I’ve still noticed a bit of a gap between the stats I commonly see when running validation on algorithms in various software packages and what’s in those cheatsheets.

I made my own, which you may download and reproduce freely (just do me a favor and link to this blog).


It has everything from University of Alberta’s reference table (i.e. all the basics), all the stats used in the handy confusionMatrix function in R’s caret package and a few others I threw in for good measure.
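
For a rough idea of the kind of statistics involved, here is a minimal sketch in Python (not R, and not taken from the cheatsheet itself) that derives several of them from the four cells of a binary confusion matrix. The function name `classification_stats` and the example counts are hypothetical, made up for illustration. The statistic names loosely follow the labels that caret's confusionMatrix prints; treat the exact selection as my assumption and not as a copy of the PDF.

```python
import math


def safe_div(num: float, den: float) -> float:
    """Divide, returning NaN instead of raising when the denominator is zero."""
    return num / den if den else math.nan


def classification_stats(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Common binary-classification statistics from confusion-matrix counts.

    tp, fp, fn, tn: true positives, false positives, false negatives,
    true negatives (hypothetical parameter names for this sketch).
    """
    n = tp + fp + fn + tn
    sensitivity = safe_div(tp, tp + fn)   # a.k.a. recall, true positive rate
    specificity = safe_div(tn, tn + fp)   # true negative rate
    ppv = safe_div(tp, tp + fp)           # positive predictive value, precision
    npv = safe_div(tn, tn + fn)           # negative predictive value
    accuracy = safe_div(tp + tn, n)

    # Agreement expected by chance, used for Cohen's kappa.
    expected = safe_div((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp), n * n)

    return {
        "accuracy": accuracy,
        "kappa": safe_div(accuracy - expected, 1 - expected),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": ppv,
        "neg_pred_value": npv,
        "prevalence": safe_div(tp + fn, n),
        "detection_rate": safe_div(tp, n),
        "detection_prevalence": safe_div(tp + fp, n),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
    }


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    for name, value in classification_stats(tp=40, fp=10, fn=5, tn=45).items():
        print(f"{name:>22}: {value:.4f}")
```

Everything here is plain arithmetic on the four counts, which is really the point of a cheatsheet: once you have the confusion matrix, each statistic is a one-liner.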


Download the PDF

All Hail the Data Science Venn Diagram

Forged by the Gods, the ancient data science Venn diagram is the oldest, most sacred representation of the field of data science.


I’ve been in love with this simple diagram since I first began working as a data scientist. I love it because it so clearly and simply represents the unique skillset that makes up data science. I’ll write more on this topic and how my own otherwise eclectic skillset coalesced into the practice of professional data science.

I wish I could take credit for creating this simple-but-totally-unsurpassed graphic. Over the past couple of years I’ve often used it as an avatar and if you look closely enough you’ll even find it in the background of my (hacked) WordPress header image. While I like to think that it was immaculately conceived, the word on the street is that it was created for the public domain by Drew Conway, who is the co-author of Machine Learning for Hackers*, a private market intelligence and business consultant, my fellow recovering social scientist, recent PhD grad from NYU, and a fast-rising name in data science (yeah, he wants to be like me).

*It’s an O’Reilly book on ML in R which I kept with me at all times for at least a year; the code is on GitHub and I highly recommend it, though it’s a little basic and its social network analysis section is based on the deprecated Google Social Graph API.

Protected: Another Take on library() v. require()

Stack Overflow Solutions

Just started! Have not answered any questions.