Setting up Amazon EC2 and using Hadoop

Setting up EC2 account and tools

Create AMI signing certificate
mkdir ~/.ec2
cd ~/.ec2
openssl genrsa -des3 -out pk-<group>.pem 2048
openssl rsa -in pk-<group>.pem -out pk-unencrypt-<group>.pem
openssl req -new -x509 -key pk-<group>.pem -out cert-<group>.pem -days 1095
Share all three .pem files manually with group members
Troubleshooting: If the date on your client machine is wrong, your certs will not work
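The date problem can be checked locally before anything is uploaded. The following is a minimal sketch, assuming `openssl` is installed; `check_cert` is a hypothetical helper name, not part of any AWS tool. It prints the local clock next to the certificate's validity window, then uses `openssl x509 -checkend 0` to test the expiry side. A clock that is running behind has to be spotted by comparing the printed `notBefore` line with the local time.

```shell
# check_cert is a hypothetical helper name used only for this sketch.
check_cert() {
    cert="$1"
    echo "Local clock (UTC): $(date -u)"
    # Print the notBefore / notAfter lines of the certificate.
    openssl x509 -in "$cert" -noout -dates || return 1
    # -checkend 0 exits non-zero if the certificate has already expired
    # according to the local clock.
    if openssl x509 -in "$cert" -noout -checkend 0 >/dev/null; then
        echo "OK: certificate has not expired"
    else
        echo "PROBLEM: certificate is expired (or the local clock is far ahead)"
        return 1
    fi
}

# Usage (replace <group> as in the commands above):
#   check_cert ~/.ec2/cert-<group>.pem
```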
Upload certificate to AWS via IAM page
Account: 123456
Username: group** (e.g. group1, group5, group10)
Password: xxxxxxxxxxxxx
Click IAM tab -> users -> select yourself (use right arrow if needed)
In bottom pane select “Security Credentials” tab and click “Manage Signing Certificates”
Click “Upload Signing Certificate”
cat ~/.ec2/cert-<group>.pem
Copy contents into ‘Certificate Body’ textbox and click ‘OK’
Retrieve and unpack AWS tools
Create ec2 initialization script
vi (you can use your preferred editor)
export JAVA_HOME=/usr
export EC2_HOME=~/ec2-api-tools-
export PATH=$PATH:$EC2_HOME/bin
export EC2_PRIVATE_KEY=~/.ec2/pk-unencrypt-<group>.pem
export EC2_CERT=~/.ec2/cert-<group>.pem
These exports need to be run at every login
Alternatively, put them in ~/.profile to have them run automatically on login
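A quick sanity check after sourcing the exports can catch a wrong path before the EC2 tools fail with a less obvious message. This is a sketch only; `check_ec2_env` is a hypothetical helper name, and it checks nothing beyond whether the three paths exported above exist on disk.

```shell
# check_ec2_env is a hypothetical helper name used only for this sketch.
check_ec2_env() {
    status=0
    # EC2_HOME must point at the unpacked tools directory.
    if [ -z "${EC2_HOME:-}" ] || [ ! -d "${EC2_HOME:-}" ]; then
        echo "EC2_HOME is unset or not a directory: '${EC2_HOME:-}'" >&2
        status=1
    fi
    # The private key and the certificate must be readable files.
    for f in "${EC2_PRIVATE_KEY:-}" "${EC2_CERT:-}"; do
        if [ -z "$f" ] || [ ! -r "$f" ]; then
            echo "unset or unreadable file: '$f'" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage, after the export lines have been run in the current shell:
#   check_ec2_env && echo "EC2 environment looks complete"
```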
Test it out
ec2-describe-images -o self -o amazon
Create a new keypair (allows cluster login)
ec2-add-keypair <group>-keypair | grep -v KEYPAIR > ~/.ec2/id_rsa-<group>-keypair
chmod 600 ~/.ec2/id_rsa-<group>-keypair
Only do this once! It will create a new keypair in AWS every time you run it
Share the private key file with your group members only; keep it private otherwise
Don’t delete other groups’ keypairs!
Everyone has access to everyone else’s keypairs from the AWS console
EC2 tab -> Network and Security -> Keypairs
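Because every run of `ec2-add-keypair` registers a new keypair in AWS, it can help to wrap the command in a guard. The following is a sketch under the assumption that the key file lives at the path used above; `create_keypair_once` is a hypothetical helper name. It only checks for a local, non-empty key file, so it does not protect against a group member creating the keypair from another machine.

```shell
# create_keypair_once is a hypothetical helper name used only for this sketch.
create_keypair_once() {
    group="$1"
    keyfile="$HOME/.ec2/id_rsa-${group}-keypair"
    # Skip the AWS call entirely if a non-empty key file is already present.
    if [ -s "$keyfile" ]; then
        echo "$keyfile already exists; not creating another keypair" >&2
        return 0
    fi
    mkdir -p "$HOME/.ec2"
    # Same pipeline as above: drop the KEYPAIR header line, keep the private key.
    ec2-add-keypair "${group}-keypair" | grep -v KEYPAIR > "$keyfile"
    chmod 600 "$keyfile"
}

# Usage (group1 is just an example group name):
#   create_keypair_once group1
```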
Setting up Hadoop for EC2
Retrieve hadoop tools
wget <URL of hadoop-1.0.0.tar.gz>
tar -xzvf hadoop-1.0.0.tar.gz
Create hadoop-ec2 initialization script
vi (you can use your preferred editor)
export HADOOP_EC2_BIN=~/hadoop-1.0.0/src/contrib/ec2/bin
This export needs to be run at every login
Alternatively, put it in ~/.profile to have it run automatically on login
Configure hadoop with EC2 account
vi ~/hadoop-1.0.0/src/contrib/ec2/bin/
AWS_ACCESS_KEY_ID=<from Dr. Jin’s email>
Looks like FtDMaAuSXwzD7pagkR3AfIVTMjc6+pdab2/2iITL
The same keypair you set up earlier at ~/.ec2/id_rsa-<group>-keypair
Create/launch cluster
hadoop-ec2 launch-cluster <group>-cluster 2
Can take 10-20 minutes!
Keep an eye on it from the AWS -> EC2 console tab
Note your master node’s DNS name; you’ll need it later
Looks like:
Test login to master node
hadoop-ec2 login <group>-cluster
Troubleshooting: If you didn’t set up your keypair properly, you’ll get:
[ec2-user@ip-10-123-22-179 ~]$ hadoop-ec2 login test-cluster
Logging in to host
Warning: Identity file /home/ec2-user/.ec2/id_rsa-<group>-keypair not accessible: No such file or directory.
Permission denied (publickey,gssapi-with-mic).
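The error above usually means the private key file is missing on the machine you are logging in from. A minimal sketch of a local check follows; `check_keyfile` is a hypothetical helper name. Besides existence it also looks at the file mode, on the assumption that `ssh` will reject a private key that group or others can access.

```shell
# check_keyfile is a hypothetical helper name used only for this sketch.
check_keyfile() {
    keyfile="$1"
    if [ ! -f "$keyfile" ]; then
        echo "missing key file: $keyfile" >&2
        return 1
    fi
    # Characters 5-10 of the `ls -l` mode string are the group and other bits;
    # for a file with mode 600 they are all dashes.
    perms=$(ls -ld "$keyfile" | cut -c5-10)
    if [ "$perms" != "------" ]; then
        echo "permissions too open on $keyfile; run: chmod 600 $keyfile" >&2
        return 1
    fi
    echo "key file looks usable: $keyfile"
}

# Usage (replace <group> as in the commands above):
#   check_keyfile ~/.ec2/id_rsa-<group>-keypair
```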
Running a Map/Reduce Job

Copy the jar file to the master-node
scp -i ~/.ec2/id_rsa-<group>-keypair hadoop-1.0.0/hadoop-examples-1.0.0.jar root@<master node>:/tmp
Get your master node name from the ‘hadoop-ec2 login <group>-cluster’ command; it will look something like this:
(Optional) Copy your HDFS files to the master-node
Compress data for faster transfer
tar -cjvf data.bz2 <data-dir>
scp -i ~/.ec2/id_rsa-<group>-keypair data.bz2 root@<master node>:/tmp
Upload data to HDFS (HDFS is already set up on the nodes)
hadoop fs -put /tmp/<data-file> <hdfs-dest>

Login to the master node
hadoop-ec2 login <group>-cluster
Run the Map/Reduce job
hadoop jar /tmp/hadoop-examples-1.0.0.jar pi 10 10000000
Track task progress from the web
http://<master node>:50030
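The master node name appears in the `scp` targets and in the tracking URL above, so keeping it in one shell variable avoids retyping it. The following is only a sketch: `MASTER`, `scp_target` and `jobtracker_url` are hypothetical names, and `master.example.com` is a placeholder for the DNS name you noted when the cluster launched.

```shell
# Placeholder host; substitute the master node DNS name of your own cluster.
MASTER="master.example.com"

# Build the scp destination used in the copy steps above.
scp_target() {
    echo "root@$1:/tmp"
}

# Build the tracking URL (port 50030, as in the step above).
jobtracker_url() {
    echo "http://$1:50030"
}

# Usage:
#   scp -i ~/.ec2/id_rsa-<group>-keypair data.bz2 "$(scp_target "$MASTER")"
#   echo "Open $(jobtracker_url "$MASTER") in a browser"
```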

