Posts Tagged ‘big-data’
Hive: Make CLI output files comma delimited
hive -e 'select * from some_Table' | sed 's/[\t]/,/g' > outputfile.txt
Here [\t] stands for a literal tab character: when typing the sed 's/ portion of the command, press Control+V and then the Tab key to insert it.
Example:
[user@server]$ hive -e "use dbname ; select * from tablename" | sed 's/ /,/g' > kpi_event_jan8.csv
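(In the example above, the blank between the first two slashes is the literal tab entered with Control+V, Tab.) If typing the literal tab is awkward, bash can expand it for you; here is a minimal sketch of the same trick, with dbname, tablename, and output.csv as placeholders:
# Let bash's $'...' quoting expand \t into a literal tab for sed
hive -e "use dbname ; select * from tablename" | sed $'s/\t/,/g' > output.csv
# Or use tr, which understands \t directly
hive -e "use dbname ; select * from tablename" | tr '\t' ',' > output.csv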
Linux: How to Install and Configure a Seedbox
#rTorrent for Transferring Free and Open Source files only!
mkdir ~/install
mkdir /var/www/files
mkdir /var/www/watch
mkdir /var/www/.temp
chown -R www-data:www-data /var/www
apt-get update
apt-get -y upgrade
apt-get -y install apache2 apache2-utils autoconf build-essential ca-certificates comerr-dev libapache2-mod-php5 libcloog-ppl-dev libcppunit-dev libcurl3 libcurl4-openssl-dev libncurses5-dev ncurses-base ncurses-term libterm-readline-gnu-perl libsigc++-2.0-dev libssl-dev libtool libxml2-dev ntp openssl patch libperl-dev php5 php5-cli php5-dev php5-fpm php5-curl php5-geoip php5-mcrypt php5-xmlrpc pkg-config python-scgi dtach ssl-cert subversion zlib1g-dev pkg-config unzip htop irssi curl cfv nano unrar-free mediainfo libapache2-mod-scgi
ln -s /etc/apache2/mods-available/scgi.load /etc/apache2/mods-enabled/scgi.load
cd ~/install
svn checkout http://svn.code.sf.net/p/xmlrpc-c/code/stable xmlrpc-c
cd xmlrpc-c
./configure --disable-cplusplus
make
make install
cd ~/install
wget http://libtorrent.rakshasa.no/downloads/libtorrent-0.13.2.tar.gz
tar xvf libtorrent-0.13.2.tar.gz
cd libtorrent-0.13.2
./autogen.sh
./configure
make
make install
nano ~/.rtorrent.rc
#PASTE THE FOLLOWING
# Configuration file created for www.filesharingguides.com for single user rutorrent seedbox
# Maximum and minimum number of peers to connect to per torrent.
# min_peers = 25
max_peers = 100
# Same as above but for seeding completed torrents (-1 = same as downloading)
min_peers_seed = -1
max_peers_seed = -1
# Maximum number of simultaneous uploads per torrent.
max_uploads = 100
# Global upload and download rate in KiB. "0" for unlimited.
download_rate = 0
upload_rate = 0
# Default directory to save the downloaded torrents.
directory = /var/www/files
# Default session directory. Make sure you don't run multiple instances
# of rtorrent using the same session directory. Perhaps using a
# relative path?
session = /var/www/.temp
# Watch a directory for new torrents, and stop those that have been
# deleted.
schedule = watch_directory,5,5,load_start=/var/www/watch/*.torrent
schedule = untied_directory,5,5,stop_untied=
# Close torrents when diskspace is low.
schedule = low_diskspace,5,60,close_low_diskspace=100M
# The ip address reported to the tracker.
#ip = 127.0.0.1
#ip = rakshasa.no
# The ip address the listening socket and outgoing connections is
# bound to.
#bind = 127.0.0.1
#bind = rakshasa.no
# Port range to use for listening.
port_range = 6890-6999
# Start opening ports at a random position within the port range.
#port_random = no
# Check hash for finished torrents. Might be useful until the bug is
# fixed that causes lack of diskspace not to be properly reported.
#check_hash = no
# Set whether the client should try to connect to UDP trackers.
#use_udp_trackers = yes
# Alternative calls to bind and ip that should handle dynamic ip's.
#schedule = ip_tick,0,1800,ip=rakshasa
#schedule = bind_tick,0,1800,bind=rakshasa
# Encryption options, set to none (default) or any combination of the following:
# allow_incoming, try_outgoing, require, require_RC4, enable_retry, prefer_plaintext
#
# The example value allows incoming encrypted connections, starts unencrypted
# outgoing connections but retries with encryption if they fail, preferring
# plaintext to RC4 encryption after the encrypted handshake
#
encryption = allow_incoming,enable_retry,prefer_plaintext
# Enable DHT support for trackerless torrents or when all trackers are down.
# May be set to "disable" (completely disable DHT), "off" (do not start DHT),
# "auto" (start and stop DHT as needed), or "on" (start DHT immediately).
# The default is "off". For DHT to work, a session directory must be defined.
#
dht = disable
# UDP port to use for DHT.
#
# dht_port = 6881
# Enable peer exchange (for torrents not marked private)
#
peer_exchange = no
#
# Do not modify the following parameters unless you know what you're doing.
#
# Hash read-ahead controls how many MB to request the kernel to read
# ahead. If the value is too low the disk may not be fully utilized,
# while if too high the kernel might not be able to keep the read
# pages in memory and thus end up thrashing.
#hash_read_ahead = 10
# Interval between attempts to check the hash, in milliseconds.
#hash_interval = 100
# Number of attempts to check the hash while using the mincore status,
# before forcing. Overworked systems might need lower values to get a
# decent hash checking rate.
#hash_max_tries = 10
scgi_port = 127.0.0.1:5000
######################################################
To test:
cd ~/install
wget http://rutorrent.googlecode.com/files/rutorrent-3.5.tar.gz
tar xvf rutorrent-3.5.tar.gz
mv rutorrent /var/www
wget http://rutorrent.googlecode.com/files/plugins-3.5.tar.gz
tar xvf plugins-3.5.tar.gz
mv plugins /var/www/rutorrent
mv /var/www/rutorrent/* /var/www
chown -R www-data:www-data /var/www/rutorrent
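ruTorrent talks to rtorrent over the SCGI socket defined by scgi_port above (127.0.0.1:5000). The post doesn't show how Apache is pointed at that socket, but with mod_scgi enabled earlier one common approach is an SCGIMount directive; a minimal sketch, with the /RPC2 mount point as an assumption rather than something specified here:
# Assumption: expose rtorrent's SCGI socket to ruTorrent via mod_scgi's SCGIMount directive
echo "SCGIMount /RPC2 127.0.0.1:5000" >> /etc/apache2/sites-available/default
service apache2 restart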
#Set up authentication
nano /etc/apache2/sites-available/default
#make sure the <Directory /var/www/> section reads as follows (AllowOverride All is what lets the .htaccess below take effect):
<Directory /var/www/>
    Options Indexes FollowSymLinks MultiViews
    AllowOverride All
    Order allow,deny
    allow from all
</Directory>
nano /var/www/.htaccess
#paste this:
AuthType Basic
AuthName "Protected Area"
AuthUserFile /var/passwd/.htpasswd
Require valid-user
#change permissions to enable www-data group
chown -R www-data:www-data /var/www/.htaccess
# create pw file using Apache's htpasswd util
mkdir /var/passwd
htpasswd -c /var/passwd/.htpasswd testuser
chown -R www-data:www-data /var/passwd
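To confirm the password prompt is actually enforced, reload Apache and request the site with and without credentials; a quick sketch using the testuser account created above:
# Reload Apache so the AllowOverride and auth changes take effect
service apache2 restart
# Expect HTTP 401 without credentials and 200 with them
curl -I http://localhost/
curl -I -u testuser http://localhost/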
#run on boot
nano /etc/rc.local
screen -S rtorrent -d -m rtorrent
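On a stock Debian/Ubuntu system /etc/rc.local ends with exit 0, so the screen line needs to go above it; a minimal sketch of the finished file, assuming the default rc.local layout:
#!/bin/sh -e
# /etc/rc.local - executed at the end of boot
# start rtorrent detached in a screen session named "rtorrent"
screen -S rtorrent -d -m rtorrent
exit 0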
All Hail the Data Science Venn Diagram
Forged by the Gods, the ancient data science Venn diagram is the oldest, most sacred representation of the field of data science.
I’ve been in love with this simple diagram since I first began working as a data scientist. I love it because it so clearly and simply represents the unique skillset that makes up data science. I’ll write more on this topic and how my own otherwise eclectic skillset coalesced into the practice of professional data science.
I wish I could take credit for creating this simple-but-totally-unsurpassed graphic. Over the past couple of years I’ve often used it as an avatar, and if you look closely enough you’ll even find it in the background of my (hacked) WordPress header image. While I like to think that it was immaculately conceived, the word on the street is that it was created for the public domain by Drew Conway, who is the co-author of Machine Learning for Hackers*, a private market intelligence and business consultant, my fellow recovering social scientist, a recent PhD grad from NYU, and a fast-rising name in data science (yea, he wants to be like me).
*It’s an O’Reilly book on ML in R that I kept with me at all times for at least a year; the code is on GitHub and I highly recommend it, though it’s a little basic and its social network analysis section is based on the deprecated Google Social Graph API.
MySQL v. MS SQL Server Information Schema Queries
-- MySQL
SELECT * FROM information_schema.tables WHERE table_schema = 'YourDatabaseName';
-- MS SQL Server
USE YourDatabaseName;
SELECT * FROM INFORMATION_SCHEMA.tables;
Quick Tip: HBase Scan (a simple example)
3 row(s) in 0.1270 seconds
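For context, that "3 row(s)" line is the summary the HBase shell prints once a scan finishes; a minimal sketch of the kind of command that produces it, with the table name as a placeholder:
# Run a scan non-interactively by piping it into the HBase shell ('mytable' is a placeholder)
echo "scan 'mytable'" | hbase shell
# Limit the rows returned while testing
echo "scan 'mytable', {LIMIT => 3}" | hbase shell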
How to Connect R + EMR in 6 Short Steps w/ Segue
Want the power of your own Hadoop cluster but lack the budget, time, human resources, and/or capital? No problem!
Segue makes using Amazon’s Elastic MapReduce in R not only possible, but downright simple. The best thing about this Big Data trick is that it brings you scalability without the out-of-pocket or opportunity costs normally associated with setting up a Hadoop cluster, because of the elastic nature of EMR. The term elastic emphasizes that Amazon Web Services (AWS) extended the concept of scalability to their pricing model: clusters of virtual machines can be spun up and down at will, using whatever amount of RAM and compute power you deem necessary. As soon as you’re finished with your virtual cluster, the billing clock stops.
It gets better. AWS provides a free tier of service for the community, so you can learn what you’re doing and test out EMR on your use case without having to worry about money (last time I checked it was about 750 free machine hours per month; check for changes). I’ve boiled down a 30-45 minute presentation that I used to give on the topic to the following 6 simple steps:
1. Download the tar file from Google code
R CMD BUILD segue
R CMD INSTALL segue_0_05.tar.gz
library(segue)
setCredentials("user", "pass")
emr.handle <- createCluster(numInstances=6 )
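From here, the cluster handle is what you pass to Segue’s cluster-aware apply; a minimal sketch of how a run might finish, assuming Segue’s emrlapply()/stopCluster() interface and a throwaway worker function written just for this example:
# Toy embarrassingly parallel job: estimate pi by Monte Carlo on each EMR worker (estimatePi is a made-up example)
estimatePi <- function(seed) {
  set.seed(seed)
  x <- runif(1e6); y <- runif(1e6)
  4 * mean(x^2 + y^2 <= 1)
}
# emrlapply() behaves like lapply(), but each list element is processed on the cluster
results <- emrlapply(emr.handle, as.list(1:6), estimatePi)
mean(unlist(results))
# shut the cluster down so the AWS billing clock stops
stopCluster(emr.handle)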
Pretty cool, huh? Just remember that the improvement in performance you experience when using this or any other HPC/cluster solution depends heavily on how well your code lends itself to parallelism, which is why we must always remember to write our Big Data scripts such that they are massively parallel.