If you want to run your own instance of ShiCo, there are a few things you will need:
- A set of word2vec models which your ShiCo instance will use.
- Run the python back end on your a server (you will need a server with enough memory to hold your word2vec models).
- Run a web server to serve the front end to the browser.
You are welcome to use our existing w2v models:
curl -o word2vecModels.tgz https://zenodo.org/records/1189328/files/word2vecModels.tgz
tar zxf word2vecModels.tgz
If you use these models, please contact us for more details on how the models were build and to know how to cite our work. You can also create your own models, based on your own corpus.
First download the code (or clone this repo) and install all Python requirements (contained in requirements.txt):
git clone https://github.com/h-gear/ShiCo.git
cd ShiCo
python3 -m venv --prompt shico venv3
source venv3/bin/activate
pip install -r requirements.txt
Next, launch the flask server as follows:
python shico/server/app.py -f "../word2vecModels/????_????.w2v"
Note: loading the word2vec models takes some time and may consume a large amount of memory. If they do not fit in your computer's memory (error message: Killed), you might want to remove some of the model files and relaunch the server:
rm -f ../word2vecModels/19?[1-9]*
You can check that the server is up and running by connecting to the server using curl (or your web browser):
curl http://localhost:8000/load-settings
Alternatively you use Gunicorn for starting the server, by setting your configuration on shico/server/config.py and then running:
gunicorn --bind 0.0.0.0:8000 --timeout 1200 shico.server.wsgi:app
The necessary files for serving the front end are located in the webapp folder. You will need to edit your configuration file (webapp/src/config.json) to tell the front end where your back end is running. For example, if your backend is running on localhost port 8000 as in the example above, you would set your configuration file as follows:
{
"baseURL": "http://localhost:8000"
}
If you are familiar with the Javascript world, you can use the gulp tasks provided. You can serve your front end as follows (from the webapp folder):
gulp serve # gulp and phantomjs do not run at our systems
You can build a deployable version (minified, uglified, etc) as follows:
gulp build
This will build a deployable version on the webapp/dist folder.
If you are not familiar with the Javascript world (or just don't feel like building your own deployable version), the demo branch of this repository contains a pre-build version of the front end. You can checkout (or download) that branch, and then you are ready to go.
Once you have a webapp/dist folder (whether downloaded or self built) you can serve the content of it using your favorite web server. For example, you could use Python SimpleHTTPServer as follows (from the webapp/dist folder):
python -m SimpleHTTPServer
In some cases, resulting vocabularies may contain words which we would like to filter. ShiCo offers the possibility of using a cleaning function, for filtering vocabularies after they have been generated. To use this option, it is necessary to indicate the name of the cleaning function when starting the ShiCo server. A sample cleaning function is provided (shico.extras.cleanTermList). You can use this function as follows:
python shico/server/app.py -c "shico.extras.cleanTermList"
If you are using gunicorn, in your config.py, you can set cleaningFunctionStr
to the name of your cleaning function, for instance:
cleaningFunctionStr = "shico.extras.cleanTermList"
Current implementation of ShiCo relies on gensim word2vec model most_similar
function, which in turn requires the calculation of the dot product between two large matrices, via numpy.dot
function. For this reason, ShiCo greatly benefits from using libraries which accelerate matrix multiplications, such as OpenBLAS. ShiCo has been tested using Numpy with OpenBLAS, producing a significant increase in speed.