Running pignlproc scripts on an EC2 Hadoop cluster
pignlproc requires Pig 0.8.1, which is not yet available on the managed Amazon Elastic MapReduce service. This page therefore gives instructions to run pignlproc scripts on a regular Amazon EC2 cluster managed with the Apache Whirr (Incubating) utility.
First set up a Hadoop cluster by following the instructions of the Apache Whirr getting started guide. Once the cluster is started (this can take a while), you can open an SSH connection to the master node with:
[local-laptop]$ ssh -i /path/to/id_rsa_whirr <your-username>@ec2-XXX-XXX-XXX-XX.compute-1.amazonaws.com
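For reference, the Whirr setup boils down to a simple properties file and a single launch command. Here is a minimal sketch based on the Whirr getting started guide (the exact property names can vary slightly between Whirr versions; the cluster name, instance counts and key paths are placeholders to adapt):
[local-laptop]$ cat ~/hadoop.properties
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,2 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.private-key-file=/path/to/id_rsa_whirr
whirr.public-key-file=/path/to/id_rsa_whirr.pub
[local-laptop]$ whirr launch-cluster --config ~/hadoop.properties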
Logging in to the master node can be useful to load data from an EBS volume (attached and mounted to the master node) directly into your cluster's HDFS virtual file-system. However, if you feed your Pig jobs from S3 you can do without ssh-ing to the cluster by using the hadoop distcp command as explained in the following section.
Note that the SOCKS proxy trick described in the Whirr getting started guide allows you to easily monitor what's happening on your cluster by pointing Firefox to the master node's public name, e.g.: http://ec2-XXX-XXX-XXX-XX.compute-1.amazonaws.com:50030
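The proxy itself is started with the hadoop-proxy.sh script that Whirr generates next to the cluster configuration (a sketch assuming the default ~/.whirr/myhadoopcluster layout; Firefox then needs to be configured, e.g. via FoxyProxy, to use a SOCKS proxy on localhost, port 6666 by default):
[local-laptop]$ . ~/.whirr/myhadoopcluster/hadoop-proxy.sh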
For some reason I cannot get Pig to read its input data directly from the s3n://bucket/ URI scheme. To work around this issue, copy the content of your input files from S3 to your HDFS cluster with the hadoop distcp command. You can run it either on the master node or from your client machine; in the latter case, download and extract the Hadoop archive from your closest mirror to your client machine's /opt folder and don't forget to put the bin/ sub-folder in your PATH:
[local-laptop]$ export HADOOP_CONF_DIR=~/.whirr/myhadoopcluster
[local-laptop]$ export PATH="$PATH:/opt/hadoop-0.20.2/bin"
[local-laptop]$ hadoop distcp s3n://wikipedia-chunks/frwiki/20101231/ /user/root/frwiki-201012310/
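Before launching any job you can check that the chunks actually landed in HDFS (the path matches the distcp destination above):
[local-laptop]$ hadoop fs -ls /user/root/frwiki-201012310/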
If you need to download data from somewhere other than your laptop (e.g. from the Wikipedia or DBpedia dump sites), it's better to download it directly into your home folder on the virtual machine using curl or wget and then push it to some HDFS folder, for instance:
[ec2-box]$ sudo su - root
[ec2-box]$ wget http://downloads.dbpedia.org/3.6/en/article_categories_en.nt.bz2
[ec2-box]$ export PATH="$PATH:/usr/local/hadoop/bin/"
[ec2-box]$ hadoop fs -mkdir /user/root/workspace # create target folder if missing
[ec2-box]$ hadoop fs -put article_categories_en.nt.bz2 workspace
You can then check the content of the workspace HDFS folder with hadoop fs -ls workspace.
Install Pig as usual; the only difference is to point Pig to the cluster configuration folder by using the PIG_CLASSPATH environment variable:
[local-laptop]$ wget http://apache.crihan.fr/dist//pig/pig-0.8.1/pig-0.8.1.tar.gz
[local-laptop]$ tar zxvf pig-0.8.1.tar.gz
[local-laptop]$ mkdir bin
[local-laptop]$ (cd bin && ln -s ~/pig-0.8.1/bin/pig)
[local-laptop]$ export PIG_CLASSPATH="$HOME/.whirr/myhadoopcluster"
[local-laptop]$ pig
[...]
grunt>
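To make sure grunt is talking to the remote HDFS rather than your local filesystem, you can list a cluster folder directly from the prompt (a quick sanity check; the fs command is available in Pig 0.8 and the folder below is the one created in the previous section):
grunt> fs -ls /user/root/workspace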
Once all your input data is loaded into HDFS and the PIG_CLASSPATH environment variable is pointing to the Whirr configuration folder, you can launch the example scripts as usual, but without the -x local command line parameter. For instance, let us extract the text and link information from the French Wikipedia dump:
[local-laptop]$ pig -P pignlproc.properties \
-p PIGNLPROC_JAR=target/pignlproc-0.1.0-SNAPSHOT.jar \
-p INPUT=frwiki-201012310 \
-p OUTPUT=workspace \
-p LANG=fr \
examples/ner-corpus/01_extract_sentences_with_links.pig
You can then monitor the progress with the jobtracker web interface in Firefox (through the SOCKS proxy).
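When the job is done, the results land as part files under the OUTPUT folder; a quick way to peek at them from your laptop (a sketch, the exact sub-folder layout depends on what the script stores):
[local-laptop]$ hadoop fs -ls /user/root/workspace
[local-laptop]$ hadoop fs -cat /user/root/workspace/*/part-* | head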
Once your job is finished, copy the output to your S3 bucket with the hadoop distcp or hadoop fs -cp commands, for instance to save the results of the previous script to my personal S3 bucket:
[local-laptop]$ hadoop distcp /user/root/workspace/ s3n://ogrisel/workspace/
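Note that writing to the s3n:// scheme requires your AWS credentials to be visible to Hadoop, either embedded in the URI or passed as configuration properties. Here is a sketch using the standard Hadoop 0.20 fs.s3n.* properties (the key values are placeholders for your own credentials):
[local-laptop]$ hadoop distcp \
    -D fs.s3n.awsAccessKeyId=YOUR_ACCESS_KEY \
    -D fs.s3n.awsSecretAccessKey=YOUR_SECRET_KEY \
    /user/root/workspace/ s3n://ogrisel/workspace/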
Don't forget to destroy the cluster by running the following command from your WHIRR_HOME folder:
[local-laptop]$ whirr destroy-cluster --config ~/hadoop.properties