Posts Tagged ‘big-data’

R: cbind fill for data.table

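cbind.fill <- function(...){
    # coerce every argument to a matrix so nrow() and ncol() are defined
    nm <- list(...)
    nm <- lapply(nm, as.matrix)
    # the tallest input sets the row count of the result
    n <- max(sapply(nm, nrow))
    # pad each shorter input with NA rows, then bind the columns
    do.call(cbind, lapply(nm, function (x)
        rbind(x, matrix(, n-nrow(x), ncol(x)))))
}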


Rails: Migration Files

When you generate a model in Ruby on Rails

rails generate model somename

A migration file is also generated.

A migration is a Ruby file that describes a change to the database schema, such as creating a table and defining the columns it holds.

Hive: Make CLI output files comma delimited

bash >> hive -e 'select * from some_Table' | sed 's/[\t]/,/g' > outputfile.txt

Here [\t] stands for a literal tab character, typed as Control+V followed by the Tab key, i.e.
sed 's/<Tab>/,/g'
with a real tab in place of <Tab>.

For example:

[user@server]$ hive -e "use dbname ; select * from tablename" | sed 's/[\t]/,/g' > kpi_event_jan8.csv
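
If typing a literal tab with Control+V is awkward, one alternative is to let printf produce the tab and hand it to sed through a variable. This is only a sketch: the helper name tabs_to_commas is made up for illustration, and the printf line stands in for real Hive output.

```shell
#!/bin/sh
# TAB holds one literal tab character, produced by printf so that nothing
# has to be typed with Control+V.
TAB="$(printf '\t')"

# tabs_to_commas is a hypothetical helper name: it reads tab-separated
# text on stdin and writes the same text with commas on stdout.
tabs_to_commas() {
    sed "s/${TAB}/,/g"
}

# Stand-in for Hive CLI output: a header row and one data row.
printf 'id\tname\n1\talice\n' | tabs_to_commas
```

Keep in mind that this is a plain character swap: a field that already contains a comma is not quoted, so the result is only a valid CSV when the data itself is comma-free.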

Hive: Get Column Names in CLI Queries

Add this to your query:

set hive.cli.print.header=true;


hive -e "set hive.cli.print.header=true; use a_db; select * from a_table;" > test
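
The header setting combines naturally with the comma trick from the previous post. Here is a rough sketch of a wrapper that does both; hive_to_csv is a name I made up, not something that ships with Hive, and it uses tr rather than sed for the tab swap.

```shell
#!/bin/sh
# hive_to_csv is a hypothetical wrapper, not part of Hive.
# $1 is the query text, e.g. "use a_db; select * from a_table".
# It turns on column headers, runs the query, and swaps tabs for commas.
hive_to_csv() {
    hive -e "set hive.cli.print.header=true; $1" | tr '\t' ','
}

# Usage (assumes a working hive CLI on the PATH):
#   hive_to_csv "use a_db; select * from a_table" > a_table.csv
```

Because the last command in the pipeline is tr, the function's exit status reflects tr and not hive, so check the output before trusting it in a script.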


Linux: How to Install and Configure a Seedbox

#rTorrent for Transferring Free and Open Source files only!
mkdir ~/install
mkdir /var/www/files
mkdir /var/www/watch
mkdir /var/www/.temp
chown -R www-data:www-data /var/www
apt-get update
apt-get -y upgrade
apt-get -y install apache2 apache2-utils autoconf build-essential ca-certificates comerr-dev libapache2-mod-php5 libcloog-ppl-dev libcppunit-dev libcurl3 libcurl4-openssl-dev libncurses5-dev ncurses-base ncurses-term libterm-readline-gnu-perl libsigc++-2.0-dev libssl-dev libtool libxml2-dev ntp openssl patch libperl-dev php5 php5-cli php5-dev php5-fpm php5-curl php5-geoip php5-mcrypt php5-xmlrpc pkg-config python-scgi dtach ssl-cert subversion zlib1g-dev pkg-config unzip htop irssi curl cfv nano unrar-free mediainfo libapache2-mod-scgi
ln -s /etc/apache2/mods-available/scgi.load /etc/apache2/mods-enabled/scgi.load
cd ~/install
svn checkout xmlrpc-c
cd xmlrpc-c
./configure --disable-cplusplus
make install
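cd ~/install
# the download step for this tarball is missing from the original notes
tar xvf libtorrent-0.13.2.tar.gz
cd libtorrent-0.13.2
./configure
make
make install
cd ~/install
# assumption: the original repeated the libtorrent step here; rtorrent 0.9.2
# is the release that pairs with libtorrent 0.13.2
tar xvf rtorrent-0.9.2.tar.gz
cd rtorrent-0.9.2
./configure --with-xmlrpc-c
make
make install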
nano ~/.rtorrent.rc
# Configuration file created for a single-user rutorrent seedbox
# Maximum and minimum number of peers to connect to per torrent.
# min_peers = 25
max_peers = 100
# Same as above but for seeding completed torrents (-1 = same as downloading)
min_peers_seed = -1
max_peers_seed = -1
# Maximum number of simultaneous uploads per torrent.
max_uploads = 100
# Global upload and download rate in KiB. "0" for unlimited.
download_rate = 0
upload_rate = 0
# Default directory to save the downloaded torrents.
directory = /var/www/files
# Default session directory. Make sure you don't run multiple instance
# of rtorrent using the same session directory. Perhaps using a
# relative path?
session = /var/www/.temp
# Watch a directory for new torrents, and stop those that have been
# deleted.
schedule = watch_directory,5,5,load_start=/var/www/watch/*.torrent
schedule = untied_directory,5,5,stop_untied=
# Close torrents when diskspace is low.
schedule = low_diskspace,5,60,close_low_diskspace=100M
# The ip address reported to the tracker.
#ip =
# The ip address the listening socket and outgoing connections is
# bound to.
#bind =
# Port range to use for listening.
port_range = 6890-6999
# Start opening ports at a random position within the port range.
#port_random = no
# Check hash for finished torrents. Might be useful until the bug is
# fixed that causes lack of diskspace not to be properly reported.
#check_hash = no
# Set whether the client should try to connect to UDP trackers.
#use_udp_trackers = yes
# Alternative calls to bind and ip that should handle dynamic ip's.
#schedule = ip_tick,0,1800,ip=rakshasa
#schedule = bind_tick,0,1800,bind=rakshasa
# Encryption options, set to none (default) or any combination of the following:
# allow_incoming, try_outgoing, require, require_RC4, enable_retry, prefer_plaintext
# The example value allows incoming encrypted connections, starts unencrypted
# outgoing connections but retries with encryption if they fail, preferring
# plaintext to RC4 encryption after the encrypted handshake
encryption = allow_incoming,enable_retry,prefer_plaintext
# Enable DHT support for trackerless torrents or when all trackers are down.
# May be set to "disable" (completely disable DHT), "off" (do not start DHT),
# "auto" (start and stop DHT as needed), or "on" (start DHT immediately).
# The default is "off". For DHT to work, a session directory must be defined.
dht = disable
# UDP port to use for DHT.
# dht_port = 6881
# Enable peer exchange (for torrents not marked private)
peer_exchange = no
# Do not modify the following parameters unless you know what you're doing.
# Hash read-ahead controls how many MB to request the kernel to read
# ahead. If the value is too low the disk may not be fully utilized,
# while if too high the kernel might not be able to keep the read
# pages in memory and thus end up thrashing.
#hash_read_ahead = 10
# Interval between attempts to check the hash, in milliseconds.
#hash_interval = 100
# Number of attempts to check the hash while using the mincore status,
# before forcing. Overworked systems might need lower values to get a
# decent hash checking rate.
#hash_max_tries = 10
scgi_port =
To test: 
cd ~/install
tar xvf rutorrent-3.5.tar.gz
mv rutorrent /var/www
tar xvf plugins-3.5.tar.gz
mv plugins /var/www/rutorrent
mv /var/www/rutorrent/* /var/www
chown -R www-data:www-data /var/www/rutorrent
#Set up authentication
nano /etc/apache2/sites-available/default
#paste this:
Options Indexes FollowSymLinks MultiViews
AllowOverride All
Order allow,deny
allow from all
nano /var/www/.htaccess
#paste this:
AuthType Basic
AuthName "Protected Area"
AuthUserFile /var/passwd/.htpasswd
Require valid-user
#change permissions to enable www-data group
chown -R www-data:www-data /var/www/.htaccess
# create pw file using Apache's htpasswd util
mkdir /var/passwd
htpasswd -c /var/passwd/.htpasswd testuser
chown -R www-data:www-data /var/passwd
#run on boot
nano /etc/rc.local
# add this before ‘exit 0’:
screen -S rtorrent -d -m rtorrent

All Hail the Data Science Venn Diagram

Forged by the Gods, the ancient data science Venn diagram is the oldest, most sacred representation of the field of data science.


[Image: Data Science Venn Diagram]

I’ve been in love with this simple diagram since I first began working as a data scientist. I love it because it so clearly and simply represents the unique skillset that makes up data science. I’ll write more on this topic and how my own otherwise eclectic skillset coalesced into the practice of professional data science.

I wish I could take credit for creating this simple-but-totally-unsurpassed graphic. Over the past couple of years I’ve often used it as an avatar, and if you look closely enough you’ll even find it in the background of my (hacked) WordPress header image. While I like to think that it was immaculately conceived, the word on the street is that it was created for the public domain by Drew Conway, who is the co-author of Machine Learning for Hackers*, a private market intelligence and business consultant, my fellow recovering social scientist, a recent PhD grad from NYU, and a fast-rising name in data science (yea, he wants to be like me).

*It’s an O’Reilly book on ML in R which I kept with me at all times for at least a year; the code is on GitHub and I highly recommend it, though it’s a little basic and its social network analysis section is based on the deprecated Google Social Graph API.

MySQL v. MS SQL Server Information Schema Queries

In MySQL, you would run the following to get information on tables in a database:
SELECT * FROM information_schema.tables
WHERE table_schema = 'YourDatabaseName';
In MS SQL Server, the same would be:

USE YourDatabaseName;
SELECT * FROM INFORMATION_SCHEMA.TABLES;

Quick Tip: HBase Scan (a simple example)

hbase(main):008:0> scan 'test'
ROW                   COLUMN+CELL
 row1                 column=cf:a, timestamp=1381963930588, value=value1
 row2                 column=cf:b, timestamp=1381963944569, value=value2
 row3                 column=cf:c, timestamp=1381963957538, value=value3

3 row(s) in 0.1270 seconds

Quick Tip: How to Connect to a Running Session (shell) in HBase in 1 Line

./bin/hbase shell

How to Connect R + EMR in 6 Short Steps w/ Segue

Want the power of your own Hadoop cluster but lack the budget, time, human resources, and/or capital? No problem!

Segue makes using Amazon’s Elastic MapReduce in R not only possible, but downright simple. The best thing about this Big Data trick is that it brings you scalability without the out-of-pocket or opportunity costs normally associated with setting up a Hadoop cluster, thanks to the elastic nature of EMR. The term elastic emphasizes that Amazon Web Services (AWS) extended the concept of scalability to their pricing model: clusters of virtual machines can be spun up and down at will, using whatever amount of RAM and compute power you deem necessary. As soon as you’re finished with your virtual cluster the billing clock stops.

It gets better. AWS provides a free tier of service for the community, so that you can learn what you’re doing and test out EMR on your use case without having to worry about money (last time I checked it was about 750 free machine hours per month; check for changes). I’ve boiled down a 30-45 minute presentation that I used to give on the topic to the following 6 simple steps:

1. Download the tar file from Google Code

2. Install it (can be done in bash or R):
R CMD INSTALL segue_0_05.tar.gz
3. Load the package in R:
library(segue)
or require(segue), depending on your school of thought on library() v. require()
4. Enter the AWS Credentials:
setCredentials("user", "pass")
5. Create your cluster!
emr.handle <- createCluster(numInstances=6)
6. Run your code using emrlapply() instead of regular lapply()

Pretty cool, huh? Just remember that the performance improvements you see with this or any other HPC/cluster solution depend heavily on how well your code lends itself to parallelism, which is why we must always remember to write our Big Data scripts such that they are massively parallel.
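
To make "lends itself to parallelism" a bit more concrete, here is a tiny local analogy in the shell. It has nothing to do with Segue or EMR; it just shows the shape of a job where every input can be processed on its own, which is the same property emrlapply() relies on. The helper name square_all is made up, and the -P flag of xargs is a common extension, not POSIX.

```shell
#!/bin/sh
# Each input line is an independent task: squaring one number needs nothing
# from any other line, so the work can be handed to any worker in any order.
# -P 4 asks xargs for up to four concurrent processes; -n 1 gives each
# process exactly one input value, which sh sees as $1.
square_all() {   # hypothetical helper name
    xargs -P 4 -n 1 sh -c 'echo "$1 $(( $1 * $1 ))"' _
}

# Completion order is not guaranteed, so sort by the input value afterwards.
printf '%s\n' 1 2 3 4 | square_all | sort -n
```

If a step needed the result of the previous step (a running total, say), it could not be split up like this, and adding more machines would not help.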