This is a fixed version of the lab originally hosted here.
RETRIEVE_AND_RANK_USER=<userIdFromServiceInstance>
RETRIEVE_AND_RANK_PASSWORD=<passwordFromServiceInstance>
curl -X POST -u "$RETRIEVE_AND_RANK_USER":"$RETRIEVE_AND_RANK_PASSWORD" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters" -d ""
Should respond with:
{"solr_cluster_id":"<cluster_id>","cluster_name":"","cluster_size":"","solr_cluster_status":"NOT_AVAILABLE"}
Store the solr_cluster_id and set it as an environment variable like so:
SOLR_CLUSTER_ID=<solrClusterId>
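If you have python3 available, a small sketch like this can pull the id out of the create response for you (the canned `RESPONSE` string below is illustrative; substitute the JSON your own create call printed):

```shell
# Illustrative only: replace RESPONSE with the JSON your create call returned.
RESPONSE='{"solr_cluster_id":"sc1234","cluster_name":"","cluster_size":"","solr_cluster_status":"NOT_AVAILABLE"}'
SOLR_CLUSTER_ID=$(printf '%s' "$RESPONSE" | python3 -c 'import sys, json; print(json.load(sys.stdin)["solr_cluster_id"])')
echo "$SOLR_CLUSTER_ID"   # prints sc1234 for the canned response above
```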
curl -u "${RETRIEVE_AND_RANK_USER}":"${RETRIEVE_AND_RANK_PASSWORD}" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/${SOLR_CLUSTER_ID}"
Responds with something like this:
{"solr_cluster_id":"<cluster_id>","cluster_name":"","cluster_size":"","solr_cluster_status":"NOT_AVAILABLE"}
- Keep querying this API until the `solr_cluster_status` is `READY`.
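That "keep querying" step can be scripted. Below is a minimal polling sketch; the `poll_cluster_status` helper name is our own invention, and it takes the status command as arguments so the real curl (shown commented out) can be swapped in:

```shell
# Poll a status command until the cluster reports READY.
poll_cluster_status() {
  # "$@" is any command that prints the cluster-status JSON
  while true; do
    status=$("$@" | sed -n 's/.*"solr_cluster_status":"\([A-Z_]*\)".*/\1/p')
    echo "cluster status: $status"
    [ "$status" = "READY" ] && break
    sleep 30   # be gentle on the API
  done
}
# Real usage would look like (hypothetical, matching the curl above):
# poll_cluster_status curl -s -u "$RETRIEVE_AND_RANK_USER":"$RETRIEVE_AND_RANK_PASSWORD" \
#   "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/$SOLR_CLUSTER_ID"
```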
- Download this zip file.
- Set the env var to your downloads folder like so:
DOWNLOADS_FOLDER=<pathToWhereYouDownloadFiles>
- Example in Git Bash on Windows:
DOWNLOADS_FOLDER=/C/Users/IBM_ADMIN/Downloads
- Upload the config:
curl -X POST -H "Content-Type: application/zip" -u "${RETRIEVE_AND_RANK_USER}":"${RETRIEVE_AND_RANK_PASSWORD}" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/${SOLR_CLUSTER_ID}/config/example_config" --data-binary @${DOWNLOADS_FOLDER}/cranfield-solr-config.zip
- Create a collection that uses the uploaded config:
curl -X POST -u "${RETRIEVE_AND_RANK_USER}":"${RETRIEVE_AND_RANK_PASSWORD}" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/${SOLR_CLUSTER_ID}/solr/admin/collections" -d "action=CREATE&name=example_collection&collection.configName=example_config"
- After a few minutes it should respond with the following (if successful):
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">12639</int></lst><lst name="success"><lst name="10.176.43.45:6113_solr"><lst name="responseHeader"><int name="status">0</int><int
name="QTime">2870</int></lst><str name="core">example_collection_shard1_replica1</str></lst><lst name="10.176.43.136:6844_solr"><lst name="responseHeader"><int name="status">0</int><int name="QTime">
3220</int></lst><str name="core">example_collection_shard1_replica2</str></lst></lst>
</response>
- Download the documents to search here, then index them into the collection:
curl -X POST -H "Content-Type: application/json" -u "${RETRIEVE_AND_RANK_USER}":"${RETRIEVE_AND_RANK_PASSWORD}" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/${SOLR_CLUSTER_ID}/solr/example_collection/update" --data-binary @${DOWNLOADS_FOLDER}/cranfield-data.json
- A successful response should look like the following (try a few times if it doesn't work the first time around).
{"responseHeader":{"status":0,"QTime":1785}}
- Try running the following command in Git Bash:
python ./train.py -u "${RETRIEVE_AND_RANK_USER}":"${RETRIEVE_AND_RANK_PASSWORD}" -i ${DOWNLOADS_FOLDER}/cranfield-gt.csv -c ${SOLR_CLUSTER_ID} -x example_collection -n "example_ranker"
You will see a failure. The reason is that the python script was not designed with Windows in mind (it is not a platform-independent script): it assumes certain things about the OS which are true on Linux and Mac OS machines but not on Windows. But fear not! The Bluemix Continuous Delivery Pipeline in Toolchains has you covered. Every job of every stage in the pipeline runs in a Linux "container", so you can have it run the python script for you.
- If you are on Linux or Mac OS instead, you can simply run the following in a terminal:
python ./train.py -u "${RETRIEVE_AND_RANK_USER}":"${RETRIEVE_AND_RANK_PASSWORD}" -i ${DOWNLOADS_FOLDER}/cranfield-gt.csv -c ${SOLR_CLUSTER_ID} -x example_collection -n "example_ranker"
Next we will try running the python script in a Delivery Pipeline in Bluemix. We will in fact have you go through the steps of manually creating your own empty toolchain, adding the pipeline tool to it, adding the shell script to the pipeline and kicking it off. We could've automated this entire process for you with a `Create Toolchain` button, but then you wouldn't appreciate all the work being done for you behind the scenes!
- First go to bluemix.net/devops which should lead you to the Getting Started page for Continuous Delivery in Bluemix.
- You'll want to click the `Create a toolchain` button here.
- On the next page, under the "Other Templates" section (we're all pros, aren't we?), select `Build your own toolchain`.
- Finally, ensure the `Organization` the toolchain is being created in is your own personal one, give your toolchain an appropriate name on the next page (we suggest `retrieve-and-rank-toolchain`) and then hit `Create`.
First off, you will need your own source code repository to store your scripts and other assets in. For this, Continuous Delivery supports GitHub, with more source code targets coming soon.
- Tools can be added from the Tool Catalog. To get there from your shiny new toolchain, click the `Add a Tool` button on the top-right.
- Now that you're on the Tool Catalog page, select the GitHub tool (feel free to use the search box at the top to filter).
- In case this is the first time you are using Continuous Delivery with GitHub, you will most likely have to `Authorize` the Continuous Delivery service to have access to your GitHub account here. If that's the case, click the button. This access allows us to do the things you ask us to, such as forking, cloning or accessing repositories on GitHub on your behalf. In case you are worried about granting this access, you can create a brand new GitHub account to use with us.
- Once authorized, you should see the configuration page below. Here we'd like to select `Clone` in the first dropdown and type the name of your repository in the second input box, such as `Retrieve-And-Rank-Lab`. For the 3rd and final input box, specify the location of this repository to "fork": `https://github.com/Bluemix-Watson-Labs/Retrieve-And-Rank-Lab`. You may also want to uncheck the `Enable GitHub issues` checkbox, as we won't be using it in this lab. Once completed, click the `Create Integration` button.
- Once you've added the GitHub tool, watch the fancy little animation while it gets created. Note that if you've already forked this project into your GitHub account in the past, you may get an error while it is configuring. If that's the case, just go into the errored tool and change the `Repository Type` to `Existing` rather than `Fork`, hit `Save` and it should work.
- Once that is all configured, you should see a page as below. If you see it, you are ready to move on to the next step, which is to add a `Delivery Pipeline` tool.
- Once again, click the `Add a Tool` button on the top-right of your toolchain page.
- Search or scroll down to find the `Delivery Pipeline` tool and click on it.
- In the configuration page for it, optionally give it a name like `train-ranker-pipeline`, but leave the checkbox unchecked: in this case we are just running a script, so there is no user interface associated with the deployment. Once you've given it a name, hit `Create Integration`.
- Now that your toolchain has the necessary tools, click into the Pipeline tool card once it's `Configured`.
- Welcome to the magical Pipeline User Interface! It's quite barebones right now, but we'll fix that shortly. Hit the `Add Stage` button.
- The very first step of any pipeline you create is generally a build step to pull in the source code from wherever it is located. So firstly, give the new stage a name like `Pull Scripts`. Under this initial `Input` tab, that's all you need to do, as the defaults for everything else should already be set based on the contents of your toolchain (which has a single GitHub repo). Move on to the `Jobs` tab.
- Every Pipeline stage is made up of 1 or more jobs. So click the `Add Job` button in the `JOBS` tab and select the `Build` type from the dropdown. The default `Simple` build is enough for our purposes (it will just package up the contents of your repository for the subsequent stages where the actual stuff happens). Once that is all done, hit `Save` at the bottom to return to the main Pipeline page.
- Once you're back at the main pipeline page, you should see the stage you just created, but that isn't enough yet, as you haven't actually created a stage to run your train-ranker python script! Let's create a second stage by clicking the `Add Stage` button.
- Let's call this next stage `Train Ranker` and then move on to the `JOBS` tab, as the default `INPUT` for this stage should be correct already.
- Under the `JOBS` tab, click the `ADD JOB` button and select `Deploy` this time around. For this `Deploy` job's name, let's call it `Run python script`. For everything else, you can likely leave the defaults (ensure it is using your personal org, and the `dev` space perhaps). Once you are down to the `Deploy Script` section, enter the following in it (remove the current contents, including the line that says `cf push "${CF_APP}"`).
#!/bin/bash
python --version
python ./train-ranker.py -u ${RANK_AND_RETRIEVE_USER}:${RANK_AND_RETRIEVE_PASSWORD} -i ./cranfield-gt.csv -c ${SOLR_CLUSTER_ID} -x example_collection -n "example_ranker"
- If you have a keen eye, you'll notice that we actually used environment variables in the above little scriptlet. Let's define those in the 3rd and final tab of this Pipeline stage, the aptly named `ENVIRONMENT PROPERTIES` tab.
- Use the `Add Property` button multiple times to add the following env vars that were used by the scriptlet we pasted, filling in each value as appropriate (hopefully you kept the values noted down somewhere)!
  - Text Property: `RANK_AND_RETRIEVE_USER`
  - Secure Property: `RANK_AND_RETRIEVE_PASSWORD`
  - Text Property: `SOLR_CLUSTER_ID`
- Once the configuration is completely filled in, hit that `Save` button to return to the main Pipeline page. Once there, let's kick off the pipeline to watch the fruit of our labour come to life by clicking the Play (`Run Stage`) button on the `Pull Scripts` stage - see image below.
- There we go! You've trained the ranker. You can view the logs for the running stage by clicking on the `View logs and history` link in the Train Ranker stage. Do so now and record the `ranker_id` to be used in your future curl commands.
RANKER_ID=<yourRankerID>
While you're waiting for the ranker to finish training, you can search your documents. This search, which uses the "Retrieve" part of the Retrieve and Rank service, does not use the machine learning ranking features. It's a standard Solr search query.
- Try the following retrieve curl while the ranker is training:
curl "https://${RETRIEVE_AND_RANK_USER}:${RETRIEVE_AND_RANK_PASSWORD}@gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/${SOLR_CLUSTER_ID}/solr/example_collection/select?q=what%20is%20the%20basic%20mechanism%20of%20the%20transonic%20aileron%20buzz&wt=json&fl=id%2Ctitle"
- Note: The `/select` request handler in the cURL command indicates that this is a standard Solr search query, not a query that uses the Rank portion of the Retrieve and Rank service. Standard Solr search results are unordered and might not contain the most significant results. Later in this lab, we will use the `/fcselect` request handler with a trained ranker to obtain results that are ordered (ranked) by significance.
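To make the difference concrete, here is the shape of the two endpoint URLs side by side (a sketch with abbreviated query strings, assuming `SOLR_CLUSTER_ID` and `RANKER_ID` are set as elsewhere in this lab):

```shell
# Same collection, two request handlers (query strings abbreviated with "..."):
BASE="https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/$SOLR_CLUSTER_ID/solr/example_collection"
RETRIEVE_URL="$BASE/select?q=...&wt=json&fl=id%2Ctitle"                       # plain Solr search, unranked
RANKED_URL="$BASE/fcselect?ranker_id=$RANKER_ID&q=...&wt=json&fl=id%2Ctitle"  # reranked by the trained ranker
```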
- Be sure to hit `CTRL+C` to cancel out of the request once the response is received.
- Your query returns the 10 most relevant results from Solr.
- If you want to try other queries, look in the cranfield-gt.csv that you downloaded above (a copy of it is also in this repo). Copy a question from the first column and paste it as the value of the `q` parameter in your browser. Note that before you use it, you must also URL-encode it. You can Google how to URL-encode it, or if you have Chrome, press `F12` to open the Developer tools, switch to the `Console` tab and type `encodeURIComponent('can the three-dimensional problem of a transverse potential flow about a body of revolution be reduced to a two-dimensional problem.')` to encode the question and `encodeURIComponent('id,title')` to encode that query param. For example you may get:
&q=can%20the%20three-dimensional%20problem%20of%20a%20transverse%20potential%20flow%20about%20a%20body%20of%20revolution%20be%20reduced%20to%20a%20two-dimensional%20problem.&wt=json&fl=id%2Ctitle
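If you'd rather stay in the shell than open a browser console, python3's `urllib.parse.quote` does the same job; the `urlencode` helper below is our own wrapper, not part of the lab assets:

```shell
# URL-encode a query value from the command line (requires python3).
urlencode() { python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' "$1"; }
Q=$(urlencode 'what is the basic mechanism of the transonic aileron buzz')
FL=$(urlencode 'id,title')
echo "&q=${Q}&wt=json&fl=${FL}"
```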
- Check the status of the ranker until you see a status of `Available`. With this sample data, training takes about 5 minutes and started when the pipeline finished earlier. Run the following command to check the ranker status.
curl -u "${RETRIEVE_AND_RANK_USER}":"${RETRIEVE_AND_RANK_PASSWORD}" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/rankers/${RANKER_ID}"
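As with the cluster earlier, this check can be looped until training finishes. A hedged sketch (the `check_ranker` stub below stands in for the real curl above, and we assume the status JSON contains `"status":"Available"` once training completes):

```shell
# check_ranker prints the ranker JSON; stubbed here so the sketch runs standalone.
check_ranker() { echo '{"ranker_id":"r1","name":"example_ranker","status":"Available"}'; }
# In the real lab you would replace check_ranker with the curl command above.
until check_ranker | grep -q '"status":"Available"'; do sleep 30; done
echo "ranker is ready"
```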
- Once trained, query the ranker to view reranked results. See the example curl below; note that the `select` request handler has been changed to `fcselect` and a `ranker_id` parameter has been added.
curl "https://${RETRIEVE_AND_RANK_USER}:${RETRIEVE_AND_RANK_PASSWORD}@gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/${SOLR_CLUSTER_ID}/solr/example_collection/fcselect?ranker_id=${RANKER_ID}&q=what%20is%20the%20basic%20mechanism%20of%20the%20transonic%20aileron%20buzz&wt=json&fl=id%2Ctitle"
- After evaluating the reranked search results, you can refine them by repeating the steps above, starting with training the ranker. You can also add new documents, as described in the stage before training the ranker, to broaden the scope of the search. Repeat the process until you are completely satisfied with the results. This can require multiple iterations of refining and reranking.
- As above, you can experiment with other queries. For example with:
&q=can%20the%20three-dimensional%20problem%20of%20a%20transverse%20potential%20flow%20about%20a%20body%20of%20revolution%20be%20reduced%20to%20a%20two-dimensional%20problem.&wt=json&fl=id%2Ctitle
- To clean up, you may want to delete the Solr cluster and the ranker that you created in this tutorial.
- Delete the cluster:
curl -i -X DELETE -u "${RETRIEVE_AND_RANK_USER}":"${RETRIEVE_AND_RANK_PASSWORD}" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/${SOLR_CLUSTER_ID}"
- When the cluster is deleted, the response is an HTTP 200 status code.
- Delete the ranker:
curl -X DELETE -u "${RETRIEVE_AND_RANK_USER}":"${RETRIEVE_AND_RANK_PASSWORD}" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/rankers/${RANKER_ID}"
- When the ranker is deleted, the response is an empty JSON object.
For more details, as well as what you may want to do next, see the original tutorial here.