piRacy in R

Ahoy there, landlubbers! In honor of International Talk Like a Pirate Day (TLAPD), I’ve decided to write a quick post on the topic. The code to accompany this analysis can be found on GitHub.


I downloaded data on the top 10 most pirated movies by week over 8 weeks and their legal availability. The goal is to determine if pirates pirate movies completely hedonistically, or if pirating is a super-cool fallback for those 1337 Internet users who can’t find a legitimate copy of their flick available through the normal channels.

## File: pirate.R
## Description: This script is 97% chum free
## Jason Miller
## http://hack-r.com
## http://github.com/hack-r
## http://hack-r.github.io

# International Talk Like a Pirate Day is Today, Sept. 19 =)
# http://www.talklikeapirate.com/

# Load Packages -----------------------------------------------------------
 require(gvlma) #Screw library(), require() rules
 require(MASS)

# Source Pirate Functions -------------------------------------------------
 source("pirate_functions.R")

# Grab data on pirates ----------------------------------------------------
# Top 10 Pirated Movies by week and their legal availability across forms of media
download.file("http://piracydata.org/csv", destfile = "p.csv")

piracy <- read.csv("p.csv")

This tells us the rank of the top 10 most pirated movies over time. I’ll take the reciprocal of the rank column so that I have a variable that increases as piracy increases.

# Invert Rank So that Higher = More Piracy --------------------------------
piracy$pirate <- 1/piracy$rank
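To see what the reciprocal transform does, here is a small sketch on made-up ranks (the `movie_rank` vector is hypothetical and not taken from the downloaded data):

```r
# Hypothetical ranks: 1 = most pirated, 10 = least pirated of the top 10
movie_rank <- 1:10

# Same transform as above: reciprocal of the rank
pirate <- 1 / movie_rank

# A better rank gives a larger score, but the spacing is not even:
# the step from rank 1 to rank 2 is far larger than from rank 9 to rank 10
data.frame(movie_rank, pirate = round(pirate, 3))
```

Note that the transform is nonlinear, so differences near the top of the chart weigh much more than differences near the bottom. Only ranks 1 and 2 produce a score of 0.5 or more.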

Now to determine if legal unavailability of movies drives their piracy.

If we run an Ordinary Least Squares multiple regression model then we’ll quickly discover that many parameters can’t be estimated due to singularity.

# Determine if Legal Un-availability of Movies Drives Piracy -----------

mod <- lm(pirate ~ available_digital + streaming + rental + purchase + dvd +
            netflix_instant + amazon_prime_instant_video + hulu_movies +
            crackle + youtube_free + epix + streampix + amazon_video_rental +
            apple_itunes_rental + android_rental + vudu_rental + youtube_rental +
            amazon_video_purchase + apple_itunes_purchase + android_purchase +
            vudu_purchase + amazon_dvd + amazon_bluray + netflix_dvd + redbox,
          data = piracy)

summary(mod)
gvlma(mod)

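What the singularity is telling us is that there’s no identifying variation in some of the explanatory variables: a predictor that never changes across the sample (or that is perfectly collinear with other predictors) cannot have its coefficient estimated. Here it means these highly pirated movies have absolutely NO availability in several of these channels.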
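A minimal sketch of how a predictor with no variation produces this singularity, using simulated data (the `toy` data frame and its columns are made up for illustration and are not the piracy data):

```r
set.seed(19)
n <- 80

# Simulated stand-in data: one predictor varies, the other is always 0
toy <- data.frame(
  pirate    = runif(n),
  purchase  = rbinom(n, 1, 0.5),
  streaming = 0  # never available, so the column is constant
)

fit <- lm(pirate ~ purchase + streaming, data = toy)

# The constant column is collinear with the intercept,
# so lm() reports NA for its coefficient
coef(fit)
```

Calling `summary(fit)` on a model like this reports that a coefficient is not defined because of singularities, which is the same message the full model above produces.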

Let’s look at some basic descriptive statistics:

> # Arrr! Seems that we have a singularity in the Ordinary Least Squares regression
> #   above with all these predictors, matee! Well blow me down!
> # There was no identifying variation in some of the explanatory variables!
> # This means many of these highly pirated movies have absolutely NO availability
> # in many of these channels, for example:
> mean(piracy$netflix_instant)
[1] 0
> mean(piracy$amazon_prime_instant_video)
[1] 0
> mean(piracy$youtube_free)
[1] 0
> mean(piracy$epix)
[1] 0
> mean(piracy$crackle)
[1] 0
> mean(piracy$streampix)
[1] 0
> mean(piracy$hulu_movies)
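Rather than checking each column by hand, the constant columns can be flagged in one pass. A sketch on a small made-up data frame (the values in `toy` are hypothetical; only the column names echo the real data):

```r
# Hypothetical availability indicators for four movie-weeks
toy <- data.frame(
  netflix_instant = c(0, 0, 0, 0),
  redbox          = c(0, 1, 1, 0),
  amazon_dvd      = c(1, 1, 0, 1)
)

# A column with a single distinct value has no variation to identify an effect
is_constant <- vapply(toy, function(x) length(unique(x)) == 1, logical(1))

names(toy)[is_constant]  # columns to drop before fitting a regression
colMeans(toy)            # share of movie-weeks available in each channel
```

The same `vapply()` call could be pointed at the availability columns of `piracy` to list every channel that would trigger the singularity.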

So, after running some diagnostic tests and rejecting OLS (see my code on GitHub), I decided to address both problems, the singularity and the failed diagnostics, by running a more parsimonious probit model after transforming the dependent variable to be dichotomous. I use bootstrap sampling to estimate the marginal effects of the parameters.

# Flag movies ranked 1 or 2 (pirate = 1/rank >= 0.5) as "higher piracy"
piracy$higher_piracy <- 0
piracy$higher_piracy[piracy$pirate >= .5] <- 1
table(piracy$higher_piracy)

probit <- glm(higher_piracy ~ amazon_video_purchase + redbox + amazon_dvd,
              data = piracy, family = binomial(link = "probit"))

summary(probit)

mfxboot(modform = "higher_piracy ~ amazon_video_purchase + redbox + amazon_dvd",
        dist = "probit",
        data = piracy)
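The source of `mfxboot()` is not shown in this post, so the following is only a rough sketch of one common way to bootstrap probit marginal effects, not a copy of that function. The helper `probit_ame()` and the simulated `toy` data are hypothetical names introduced for illustration.

```r
set.seed(19)
n <- 200

# Simulated data: two binary predictors and a binary outcome
toy <- data.frame(x1 = rbinom(n, 1, 0.5), x2 = rbinom(n, 1, 0.3))
toy$y <- as.integer(0.3 - 0.8 * toy$x1 - 0.4 * toy$x2 + rnorm(n) > 0)

# Average marginal effects for a probit:
# mean normal density at the linear predictor, times each slope
probit_ame <- function(data) {
  fit <- glm(y ~ x1 + x2, data = data, family = binomial(link = "probit"))
  mean(dnorm(predict(fit, type = "link"))) * coef(fit)[-1]
}

# Point estimates on the full sample
est <- probit_ame(toy)

# Bootstrap: refit on resampled rows, then take the spread of the estimates
boot <- replicate(200, probit_ame(toy[sample(n, replace = TRUE), ]))
se <- apply(boot, 1, sd)

cbind(estimate = est, boot_se = se)
```

This sketch uses the derivative-based average marginal effect. For 0/1 predictors like the availability flags, the difference in predicted probability between 1 and 0 is often preferred, and the two can differ somewhat in small samples.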
 

The final results are only directional because this is a tiny sample, but they indicate a 6.3% decrease in the probability of being a high-piracy movie if the movie is available for digital purchase on Amazon, and a 3.7% decrease if the movie is available as a DVD.