
Running pignlproc scripts on an EC2 Hadoop cluster


pignlproc requires Pig 0.8.1, which is not yet available on the managed Amazon Elastic MapReduce service. This page therefore gives instructions for running pignlproc scripts on a regular Amazon EC2 cluster managed with the Apache Whirr (Incubating) utility.

Start a Hadoop cluster using Apache Whirr

First set up a Hadoop cluster by following the instructions in the Apache Whirr getting started guide. Once the cluster is started (this can take a while), you can open an SSH connection to the master node with:

[local-laptop]$ ssh -i /path/to/id_rsa_whirr your-login@ec2-XXX-XXX-XXX-XX.compute-1.amazonaws.com
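For reference, launching the cluster itself roughly follows the Whirr getting started guide: a properties file describing the cluster plus a single launch command. The sketch below is only indicative; the property names are Whirr's, but the values (cluster layout, key paths, provider name) depend on your Whirr version and AWS account and need to be adapted:

# ~/hadoop.properties (values to adapt)
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,5 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.private-key-file=/path/to/id_rsa_whirr
whirr.public-key-file=/path/to/id_rsa_whirr.pub

[local-laptop]$ whirr launch-cluster --config ~/hadoop.properties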

Logging in to the master node can be useful to load data from an EBS volume (attached and mounted on the master node) directly into the cluster's HDFS virtual file system. However, if you feed your Pig jobs from S3, you can avoid SSH-ing to the cluster altogether by using the hadoop distcp command as explained in the following section.

Note that the SOCKS proxy trick described in the Whirr getting started guide allows you to easily monitor what's happening on your cluster by pointing Firefox at the master node's public name, e.g.: http://ec2-XXX-XXX-XXX-XX.compute-1.amazonaws.com:50030
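Concretely, Whirr generates a proxy launcher next to the cluster configuration; the sketch below assumes the cluster name myhadoopcluster used elsewhere on this page and the default port (6666) from the Whirr guide:

[local-laptop]$ sh ~/.whirr/myhadoopcluster/hadoop-proxy.sh

Leave it running and configure Firefox to send the cluster hostnames through a SOCKS proxy on localhost port 6666, e.g. via a proxy auto-configuration (PAC) file or the FoxyProxy extension.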

Import data to HDFS

From S3

For some reason I cannot get Pig to read its input data directly from the s3n://bucket/ URI scheme. To work around this issue you can copy your input files from S3 to your HDFS cluster with the hadoop distcp command (distcp submits the copy as a MapReduce job to the cluster, so it can be run from the master node or from your client machine as shown below). Download and extract the Hadoop archive into your client machine's /opt folder from your closest mirror, and don't forget to put its bin/ sub-folder on your PATH:

[local-laptop]$ export HADOOP_CONF_DIR=~/.whirr/myhadoopcluster
[local-laptop]$ export PATH="$PATH:/opt/hadoop-0.20.2/bin"
[local-laptop]$ hadoop distcp s3n://wikipedia-chunks/frwiki/20101231/  /user/root/frwiki-201012310/
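If the copy fails with S3 authentication errors, the AWS credentials are probably missing from the Hadoop configuration. One workaround (just a sketch, with placeholder key values) is to pass them as Hadoop properties on the distcp command line:

[local-laptop]$ hadoop distcp \
    -D fs.s3n.awsAccessKeyId=YOUR_AWS_ACCESS_KEY_ID \
    -D fs.s3n.awsSecretAccessKey=YOUR_AWS_SECRET_ACCESS_KEY \
    s3n://wikipedia-chunks/frwiki/20101231/ /user/root/frwiki-201012310/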

From anywhere else

If you need to download data from somewhere other than your laptop (e.g. from the Wikipedia or DBpedia dump sites), it's better to download it directly into your home folder on the virtual machine using curl or wget and then push it to an HDFS folder, for instance:

[ec2-box]$ sudo su - root   # work as root so that relative HDFS paths resolve to /user/root
[ec2-box]$ wget http://downloads.dbpedia.org/3.6/en/article_categories_en.nt.bz2
[ec2-box]$ export PATH="$PATH:/usr/local/hadoop/bin/"   # make the hadoop command available
[ec2-box]$ hadoop fs -mkdir /user/root/workspace        # create target folder if missing
[ec2-box]$ hadoop fs -put article_categories_en.nt.bz2 workspace

You can then check the content of the workspace HDFS folder with hadoop fs -ls workspace.
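The dump can usually stay compressed since Pig handles bzip2-compressed input. If a script does expect uncompressed input, one way to decompress on the fly while uploading (a sketch relying on the FS shell accepting - as stdin) is:

[ec2-box]$ bzcat article_categories_en.nt.bz2 | hadoop fs -put - workspace/article_categories_en.nt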

Accessing your EC2 Hadoop cluster from pig

Install Pig as usual; the only difference is that you point Pig at the cluster configuration folder using the PIG_CLASSPATH environment variable:

[local-laptop]$ wget http://apache.crihan.fr/dist//pig/pig-0.8.1/pig-0.8.1.tar.gz
[local-laptop]$ tar zxvf pig-0.8.1.tar.gz
[local-laptop]$ mkdir bin                               # assumes ~/bin is (or will be) on your PATH
[local-laptop]$ (cd bin && ln -s ~/pig-0.8.1/bin/pig)   # symlink the pig launcher
[local-laptop]$ export PIG_CLASSPATH="$HOME/.whirr/myhadoopcluster"
[local-laptop]$ pig
[...]
grunt>
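To check that the grunt shell is indeed connected to the remote cluster HDFS rather than your local file system, you can list a cluster folder from grunt (assuming data was imported under /user/root as in the previous sections):

grunt> ls /user/root/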

Running the pignlproc example scripts

Once all your input data is loaded into HDFS and the PIG_CLASSPATH environment variable points to the Whirr configuration folder, you can launch the example scripts as usual, but without the -x local command line parameter. For instance, let us extract the text and link information from the French Wikipedia dump:

[local-laptop]$ pig -P pignlproc.properties \
  -p PIGNLPROC_JAR=target/pignlproc-0.1.0-SNAPSHOT.jar \
  -p INPUT=frwiki-201012310 \
  -p OUTPUT=workspace \
  -p LANG=fr \
  examples/ner-corpus/01_extract_sentences_with_links.pig

You can then monitor the progress with the JobTracker web interface in Firefox (through the SOCKS proxy set up earlier).

Saving the results and shutting down the cluster

Once your job is finished, copy the output to your S3 bucket with the hadoop distcp or hadoop fs -cp commands. For instance, to save the results of the previous script into my personal S3 bucket:

[local-laptop]$ hadoop distcp /user/root/workspace/ s3n://ogrisel/workspace/
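You can then verify that the copy went through by listing the target bucket (again assuming the S3 credentials are available to Hadoop, e.g. via the -D properties shown earlier):

[local-laptop]$ hadoop fs -ls s3n://ogrisel/workspace/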

Don't forget to destroy the cluster by running the following command from your WHIRR_HOME folder:

[local-laptop]$ whirr destroy-cluster --config ~/hadoop.properties