By default, only Python 3.9 and higher are supported. You may be able to run the application using Python 3.7 or 3.8, but we do not explicitly maintain support for these versions.
Dependencies for the project are managed with Poetry. To install them, run:
poetry install
In the root directory of the project, create a .env file with the following contents:
DB_HOST = "localhost"
DB_PORT = 5432
DB_USERNAME = "geneweaver-dev"
DB_PASSWORD = ""
DB_NAME = "geneweaver-dev"
AUTH_CLIENTID = ""
AUTH_CLIENTSECRET = ""
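As a rough illustration of how these values fit together, the sketch below parses a .env file in the format above and assembles a PostgreSQL connection URL from the DB_* settings. The helper names (parse_env, build_db_url) and the URL layout are assumptions made for this example, not part of the Geneweaver codebase; the application itself may load its settings differently.

```python
from urllib.parse import quote


def parse_env(text):
    """Parse simple KEY = "value" lines into a dict (hypothetical helper)."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def build_db_url(env):
    """Assemble a postgresql:// URL from the DB_* settings (assumed layout)."""
    user = quote(env["DB_USERNAME"], safe="")
    password = quote(env.get("DB_PASSWORD", ""), safe="")
    # Leave the password out of the URL entirely when it is empty.
    credentials = f"{user}:{password}" if password else user
    return f"postgresql://{credentials}@{env['DB_HOST']}:{env['DB_PORT']}/{env['DB_NAME']}"


EXAMPLE_ENV = '''
DB_HOST = "localhost"
DB_PORT = 5432
DB_USERNAME = "geneweaver-dev"
DB_PASSWORD = ""
DB_NAME = "geneweaver-dev"
'''

if __name__ == "__main__":
    print(build_db_url(parse_env(EXAMPLE_ENV)))
```

Percent-encoding the user name and password keeps the URL valid if the password you receive contains characters such as @ or /.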
You will need to reach out to the Geneweaver development team for the AUTH_CLIENTID and AUTH_CLIENTSECRET values.
If you are not hosting your own database instance, you will need to reach out to the Geneweaver development team for the value of DB_PASSWORD, and you will need to install and set up the cloud-sql-proxy tool from Google.
If you are using development infrastructure hosted by The Jackson Laboratory, you will need to install the cloud-sql-proxy tool from Google. Installation instructions are available in Google's Cloud SQL Auth Proxy documentation.
You will also need gcloud and kubectl installed on your machine and configured with the appropriate credentials.
Once you have cloud-sql-proxy installed, you can run the following command to connect to the Geneweaver database. Please reach out to the Geneweaver team for the appropriate values for $PROJECT, $REGION, and $DBINSTANCE.
cloud-sql-proxy $PROJECT:$REGION:$DBINSTANCE
To use the development search index, you can port forward the search index service to your local machine using the following command. You will need to reach out to the Geneweaver team to get the appropriate cluster credentials.
kubectl port-forward \
--namespace dev \
$(kubectl get pod \
--namespace dev \
--selector="app=geneweaver-legacy" \
--output jsonpath='{.items[0].metadata.name}') \
9312:9312
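Before starting the application, it can help to confirm that the forwarded ports are actually accepting connections. The following is a minimal sketch using only the Python standard library. The function name port_is_open is hypothetical, and port 5432 for the database proxy is an assumption based on the DB_PORT value in the .env file; 9312 is the search index port forwarded by the kubectl command.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 5432: assumed local port for cloud-sql-proxy; 9312: forwarded search index.
    for name, port in (("database proxy", 5432), ("search index", 9312)):
        state = "reachable" if port_is_open("127.0.0.1", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```

A successful TCP connection only shows that something is listening on the port; it does not verify credentials or that the service behind it is healthy.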
Kubernetes deployments are fully defined in the deploy/k8s directory in this repository.
Deployments are orchestrated using skaffold. For example:
skaffold run -p jax-cluster-dev-10--dev
You should never need to run these commands on your own. They are intended to be run through CI/CD.
GeneWeaver is designed to run on Linux and requires a relatively recent release. It has been tested on CentOS 7 and Red Hat distributions but should run on other distributions with minimal changes.
To begin, you'll need the following application dependencies:
RedHat/Fedora/CentOS:
$ sudo yum install boost boost-devel cairo cairo-devel git graphviz libffi libffi-devel libpqxx libpqxx-devel postgresql-server postgresql-devel rabbitmq-server sphinx ImageMagick ImageMagick-devel
Debian/Ubuntu:
$ sudo apt-get install libboost-all-dev libcairo2 libcairo2-dev git graphviz libffi6 libffi-dev libpqxx-4.0 libpqxx-dev postgresql postgresql-server-dev-9.5 rabbitmq-server sphinxsearch imagemagick libmagickcore-dev libmagickwand-dev
Ensure that the following applications meet these version requirements:
- python == 3.7.*
- PostgreSQL >= 9.2
- Graphviz >= 2.3
- RabbitMQ >= 3.3.*
- Sphinx >= 2.1.*
Python 3 (3.7) versions and pip packages can be managed by pipenv. Detailed version information is specified and locked in Pipfile.lock.
(If you want to add or change the version information, please read the pipenv documentation and make sure the change is shared with other developers.)
# Install pipenv package and synchronize current python version for this repo.
$ cd {PROJECT_ROOT}
$ pip install pipenv
$ pipenv sync
$ cd {PROJECT_ROOT}
$ pipenv run python src/application.py
When the package list has changed, all packages should be synchronized again:
$ pipenv sync
For more details, see the pipenv documentation.
pipenv can pin the versions of packages explicitly. Since some packages are not backward compatible, pipenv helps keep the proper package versions installed.
To use pipenv from PyCharm, see the official PyCharm instructions.
If you cannot find pipenv in the interpreter settings, restart PyCharm so that it rechecks the PATH for pipenv.
Pipenv should take care of all package dependencies for you. If you run into trouble setting up Sphinx, you may have luck with the following Geneweaver hosted package.
GeneWeaver uses Sphinx's Python API. The Python package can no longer be found on PyPI, but we have a custom package that can be retrieved and installed:
$ wget http://geneweaver.org/sphinxapi.tar.gz
$ pip install sphinxapi.tar.gz
This will eventually change to support installing and setting up a skeleton DB
In most cases, installing the Postgres server will automatically create a new postgres user to own and administer the server. If this user does not already exist, create it:
$ useradd -d /var/lib/pgsql postgres
Switch to the postgres user account:
$ sudo su - postgres
Initialize a database cluster to store the database(s):
$ initdb -D /var/lib/pgsql/data
Start the server:
$ pg_ctl start -D /var/lib/pgsql/data -l logfile
Add a role for future connections. We typically use 'odeadmin':
$ createuser -s odeadmin
Create a new database. We typically use 'geneweaver':
$ createdb geneweaver
The default postgres settings are not optimal for dealing with large data sets. It is advised that you alter memory and cache parameters for a more performant database, especially if the database will be on its own server or you plan on utilizing variant annotation functions.
Copy the lines below to the end of the postgres configuration file, which should be found at /var/lib/pgsql/data/postgresql.conf. Each setting is commented and you should alter them depending on your needs and available resources. The settings below were based on a server with 24GB of RAM.
## The amount of memory postgres uses for caching data. A good value (assuming
## a separate database server) is 1/4 the available RAM.
shared_buffers = 7168MB
## An estimate of memory available for disk caching and used by the query
## planner. Conservative value is 1/2 the available memory and a more
## aggressive amount is 3/4.
effective_cache_size = 18432MB
## Postgres writes DB transactions in segments of 16MB and every time a number
## of these files (parameter below) has been written, a checkpoint occurs.
## Doing these frequently is resource intensive and requires a lot of overhead.
## The default is 3 (3 * 16MB = 48MB). A good value for larger datasets is
## anywhere from 32 (512MB) to 256 (4GB). Keep in mind large settings use
## more disk and cause longer recovery times.
## Note: checkpoint_segments was removed in PostgreSQL 9.5; on 9.5 and later,
## set max_wal_size instead.
checkpoint_segments = 64
## Memory used for in-memory sorts. This setting is used per connection and
## must be set with care. e.g. if it is set to 50MB and 30 users submit
## queries, you will be using 1.5GB of real memory. If a query involves a
## merge sort of 8 tables you are using (8 * 50MB = 400MB) of memory. For
## applications that don't have many users at once, the value can be set
## higher. Required whenever ORDER BY, DISTINCT, merge joins, or IN is used
## in a query.
work_mem = 128MB
## Memory used by maintenance operations (e.g. VACUUM, CREATE INDEX). Only a
## single maintenance operation can be executed at a time so this value can be
## much higher than work_mem.
maintenance_work_mem = 512MB
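The rules of thumb in the comments above can be turned into a small calculation. The sketch below only illustrates that arithmetic: the function names are hypothetical, the fractions are taken from the comments above, and the results are starting points rather than tuned values.

```python
def suggest_memory_settings(total_ram_mb, aggressive_cache=False):
    """Apply the rules of thumb above: 1/4 of RAM for shared_buffers and
    1/2 (conservative) or 3/4 (aggressive) of RAM for effective_cache_size."""
    cache_fraction = 0.75 if aggressive_cache else 0.5
    return {
        "shared_buffers": f"{total_ram_mb // 4}MB",
        "effective_cache_size": f"{int(total_ram_mb * cache_fraction)}MB",
    }


def worst_case_work_mem_mb(work_mem_mb, connections, sorts_per_query=1):
    """work_mem applies per sort operation per connection, so the worst
    case grows with both the number of users and the query complexity."""
    return work_mem_mb * connections * sorts_per_query


if __name__ == "__main__":
    ram_mb = 24 * 1024  # the 24GB server used as the example above
    settings = suggest_memory_settings(ram_mb, aggressive_cache=True)
    for name, value in settings.items():
        print(f"{name} = {value}")
    # 30 users with work_mem = 50MB, as in the comment above (about 1.5GB).
    print(worst_case_work_mem_mb(50, 30), "MB")
```

For a 24GB server this gives shared_buffers = 6144MB, slightly below the 7168MB used above, and effective_cache_size = 18432MB with the aggressive 3/4 fraction.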
If you altered any settings you will need to restart the server.
$ pg_ctl restart -D /var/lib/pgsql/data -m fast -l logfile
You can now log out of the postgres user account.
If you already have a copy of the DB, you can skip this section. Dumping a current instance of the GWDB is done in two parts. First, a copy of the schema is saved:
$ pg_dump -U odeadmin -Fc -Cs geneweaver > gw-schema.custom
Next, the data is stored separately. To speed up the restore process, we exclude two large tables, geneset_jaccard and result:
$ pg_dump -U odeadmin -Fc -a -T extsrc.geneset_jaccard -T production.result geneweaver > gw-data.custom
First restore the schema:
$ pg_restore --no-owner -d geneweaver -U postgres -s gw-schema.custom
Next, restore the data. The -j option specifies the number of parallel jobs to use during the restore process. The --disable-triggers option must be used; otherwise the restore will fail. This will take several hours.
$ pg_restore -a -d geneweaver -Fc --disable-triggers -j 6 -S odeadmin -U odeadmin gw-data.custom
RabbitMQ is the message broker used by Celery to distribute GW tool runs. The easiest and most common way to run it is with systemctl:
# Start Rabbitmq Server
sudo systemctl start rabbitmq-server
# Enable start on boot
sudo systemctl enable rabbitmq-server
# Check status
sudo systemctl status rabbitmq-server
Alternatively, it can be run in the background:
$ rabbitmq-server start &
Or it can be configured to start on boot and daemonized:
$ chkconfig rabbitmq-server on
$ /sbin/service rabbitmq-server start
Once it is running, it's a good idea to create a user with a password, a namespace (virtual host), and permissions:
rabbitmqctl add_user geneweaver geneweaver
rabbitmqctl add_vhost geneweaver
rabbitmqctl set_permissions -p geneweaver geneweaver ".*" ".*" ".*"
This would result in a [celery] url that looks like the following:
amqp://geneweaver:geneweaver@<RABBITMQ-SERVER-HOST>:5672/geneweaver
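A small sketch of how such a URL can be assembled, with the user, password, and virtual host percent-encoded in case they contain reserved characters. The function name build_broker_url and the host rabbit.example.org are placeholders for illustration, not values from the GeneWeaver configuration.

```python
from urllib.parse import quote


def build_broker_url(user, password, host, vhost, port=5672):
    """Build an amqp:// broker URL; user, password, and vhost are percent-encoded."""
    return (
        f"amqp://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(vhost, safe='')}"
    )


if __name__ == "__main__":
    # rabbit.example.org is a placeholder host name.
    print(build_broker_url("geneweaver", "geneweaver", "rabbit.example.org", "geneweaver"))
```

With the user, password, and virtual host created by the rabbitmqctl commands above, the result has the same shape as the example URL, with the placeholder host filled in.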
NOTE: This documentation on setting up Sphinx is in the process of being updated.
A sample sphinx config can be found in the sample-configs/ directory.
The following example stores the Sphinx config and indices under /var/lib.
Create a folder to hold the Sphinx config file and indices:
$ sudo mkdir /var/lib/sphinx/geneweaver
$ sudo cp geneweaver-configs/sphinx/sphinx.conf geneweaver-configs/sphinx/stopwords.txt /var/lib/sphinx/geneweaver
You'll have to make several edits to the sphinx configuration. First, edit the source base section to point to the newly set up Postgres DB.
Both the source geneset_src and source geneset_delta_src sections contain an sql_query variable that should be edited to support any species that are currently found in the database. You'll have to associate the sp_id with a common species name.
Under the index geneset section, specify the full path of the geneset index using the path variable. This path can be anywhere on the system; we typically use the sphinx folder. Set the stopwords variable to the full path of the stop words list we copied into the sphinx folder above. Make the same changes under the index geneset_delta section too.
Create a 'log' directory and change its owner to the sphinx user (chown [sphinxuser]).
Under the searchd section, change the log, query_log, and pid_file variables to point to full paths for each of those files.
Installing Sphinx will usually create a user to own the search server. If a sphinx user does not exist, create one:
$ useradd -d /var/lib/sphinx sphinx
Generate the search indices:
$ sudo -usphinx indexer --all --config /var/lib/sphinx/geneweaver/sphinx.conf
Start the server as the sphinx user:
$ sudo -usphinx searchd --config /var/lib/sphinx/geneweaver/sphinx.conf
Retrieve the GeneWeaver web application and toolset from the Bitbucket repos. Create a new project folder if you haven't already:
$ mkdir /opt/geneweaver && cd /opt/geneweaver
$ git clone https://<USERNAME>@bitbucket.org/geneweaver/website-py.git
$ git clone https://<USERNAME>@bitbucket.org/geneweaver/tools.git
Provided all previous installations were successful, the web application should be ready to start. First, edit src/config.py and change the CONFIG_PATH variable to point to a location to store the GeneWeaver config file.
The config file doesn't need to exist; GeneWeaver will generate one for you to edit at the path you provide. Run the application once to generate the config file:
$ cd website-py
$ python src/application.py
Edit the newly generated configuration file with the proper application, celery, database, and sphinx information. In most cases, the default celery configuration is appropriate.
Create a results directory with 777 permissions. The path will be placed in the config file.
As with the web application, edit tools/config.py and change the CONFIG_PATH variable to point to a location to store the tool config file. From the parent directory of tools, you can run the tools once to generate a default config.
If you installed virtualenv, be sure to run the tools from that environment.
$ celery -A tools.celeryapp worker --loglevel=info
Edit the config with the appropriate information.
If you need to upgrade an older version of the toolset running celery 3.x, follow these steps:
- Pull the latest version of the tools and website-py repos.
- Upgrade the package requirements using pip:
$ pip install -r website-py/sample-configs/requirements.txt
- Restart the tool and web applications.
NOTE: The tool table may be incompatible with the celeryapp.py tool list. You may drop the tool table data and reload with ODE-data-only-tool.dump to correct.
We use several highly optimized software implementations written in C/C++. This suite of tools can be found in the TOOLBOX/ directory and should be compiled prior to running the toolset written in Python.
Ensure gcc and other development tools exist on your system. If they are missing, install them:
$ sudo yum install gcc gcc-c++ make
These tools can be compiled using the "master" makefile located in TOOLBOX.
$ cd tools/TOOLBOX && make && cd ../..
The distribution generator tool is written in C++ and is used to generate a null distribution that we can use to assess the significance of a Jaccard similarity result. It is located in the tools/cpp_tools directory. This tool requires two dependencies, libpqxx and libpqxx-devel, which should have been installed earlier.
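To make the idea concrete, here is a minimal Python sketch of a Jaccard similarity score and a randomly sampled null distribution. It is not the algorithm implemented by the C++ distribution generator, which may sample differently; the function names, the sampling scheme, and the toy gene identifiers are all assumptions for illustration.

```python
import random


def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def null_distribution(background, size_a, size_b, trials=1000, seed=0):
    """Jaccard scores of random set pairs drawn from a background gene list."""
    rng = random.Random(seed)
    background = list(background)
    return [
        jaccard(rng.sample(background, size_a), rng.sample(background, size_b))
        for _ in range(trials)
    ]


def empirical_p_value(observed, null_scores):
    """Fraction of null scores at least as large as the observed score
    (with a +1 correction so the estimate is never exactly zero)."""
    hits = sum(score >= observed for score in null_scores)
    return (hits + 1) / (len(null_scores) + 1)


if __name__ == "__main__":
    genes = [f"gene{i}" for i in range(200)]  # toy background, not real data
    set_a, set_b = genes[:20], genes[10:30]
    observed = jaccard(set_a, set_b)  # 10 shared genes out of 30 distinct
    null_scores = null_distribution(genes, len(set_a), len(set_b))
    p_value = empirical_p_value(observed, null_scores)
    print(f"observed={observed:.3f}  p={p_value:.4f}")
```

The observed score is compared against scores of random set pairs of the same sizes; a small empirical p-value suggests the overlap is unlikely to arise by chance under this sampling scheme.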
This tool will open a connection to the database and requires you to set the proper connection info. If you have been following this guide, the only connection parameter you should have to change is the database host address. This must be changed in the following files: distribution_generator.cpp, drone.cpp, and fileGenerator.cpp. fileGenerator.cpp contains two separate lines where the host address must be changed.
To change all the necessary lines in a single step, run the following command in the tools directory, replacing DATABASE_IP with the address of your database host:
$ cd tools
$ find . -name "*.cpp" -exec sed -i 's/129\.62\.148\.19/DATABASE_IP/g' '{}' \;
Then compile the distribution generator:
$ cd cpp_tools && make
GeneWeaver should now be ready to run. Start the tools application from the parent directory of the tools:
$ celery -A tools.celeryapp worker --loglevel=info
Start the web application from the website-py directory:
$ cd website-py
$ python src/application.py
By default the web app runs on port 5000. You can point your browser to the host you assigned and you should see the GeneWeaver home page. The application may require sudo privileges to establish a connection on the given port.
Handling multiple requests using the Flask application alone may result in some performance issues. A web server can be used to handle user requests and route those requests to the web app. Start by installing nginx:
$ sudo yum install nginx
Serving Flask applications with nginx requires an additional deployment application such as uWSGI:
$ pip install uwsgi
Copy the sample uWSGI config, uwsgi.ini, to an easily accessible directory such as /srv/geneweaver. Change the chdir, venv, and socket variables to match your installation directories. If you want to change the number of worker processes to spawn and the number of threads per process, change the processes and threads variables.
There is a sample nginx config file in the sample-configs directory. The default nginx config, typically found in /etc/nginx, should only require minor edits. Copy the custom geneweaver location blocks, location / and location @geneweaver, from the sample nginx config to the one in /etc/nginx. Also ensure that the uwsgi_pass variable points to the correct socket location found in the uWSGI config. Start the nginx service:
$ sudo systemctl start nginx
Start uWSGI using the given configuration file:
$ uwsgi --ini uwsgi.ini
GeneWeaver should now be accessible using just the server name or IP address; all requests are routed through the default HTTP port (80).
Supervisor is a system management utility that can be used to control the GeneWeaver application. Start by installing it:
$ sudo yum install supervisor
Copy the sample supervisord config from the sample-configs directory to a directory of your choosing. Here we use the geneweaver application directory:
$ cp sample-configs/supervisord.conf /srv/geneweaver
Create a folder to store the supervisord logs, or store them in any directory you wish:
$ mkdir /srv/geneweaver/supervisord
Now edit the supervisord.conf file to match your installation and log directories. After editing, you can start the supervisor:
$ sudo supervisord -c /srv/geneweaver/supervisord.conf
To manage your applications use:
$ sudo supervisorctl -c /srv/geneweaver/supervisord.conf