Merge pull request #8 from autonomio/initial_release
#2 Initial Release Loose-Ends Tracker
Showing 15 changed files with 856 additions and 436 deletions.
@@ -0,0 +1,316 @@
# DistributedScan

The experiment is configured and started through the `DistributedScan()` command. All options affecting the experiment, other than the hyperparameters themselves, are configured through the `DistributedScan()` arguments. A typical experiment sets around 10 arguments.

## Minimal Example

```python
jako.DistributedScan(x=x, y=y, params=p, model=input_model, config='config.json')
```

## DistributedScan Arguments

`x`, `y`, `params`, `model` and `config` are the only arguments required to start the experiment; all others are optional.

Argument | Input | Description
--------- | ------- | -----------
`x` | array or list of arrays | prediction features
`y` | array or list of arrays | prediction outcome variable
`params` | dict or ParamSpace object | the parameter dictionary or the ParamSpace object after splitting
`model` | function | the Keras model as a function
`experiment_name` | str | Used for creating the experiment logging folder
`x_val` | array or list of arrays | validation data for x
`y_val` | array or list of arrays | validation data for y
`val_split` | float | validation data split ratio
`random_method` | str | the random method to be used
`seed` | float | Seed for random states
`performance_target` | list | A result at which point to end the experiment
`fraction_limit` | float | The fraction of permutations to be processed
`round_limit` | int | Maximum number of permutations in the experiment
`time_limit` | datetime | Time limit for the experiment in format `%Y-%m-%d %H:%M`
`boolean_limit` | function | Limit permutations based on a lambda function
`reduction_method` | str | Type of reduction optimizer to be used
`reduction_interval` | int | Number of permutations after which reduction is applied
`reduction_window` | int | the lookback window for the reduction process
`reduction_threshold` | float | The threshold at which reduction is applied
`reduction_metric` | str | The metric to be used for reduction
`minimize_loss` | bool | Set to True if `reduction_metric` is a loss
`disable_progress_bar` | bool | Disable the live updating progress bar
`print_params` | bool | Print the hyperparameters of each permutation
`clear_session` | bool | Clear backend session between permutations
`save_weights` | bool | Keep model weights (increases memory pressure for large models)
`config` | str or dict | Configuration containing information about the machines to distribute to and the database to upload the data to.
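
The `time_limit` format and the `boolean_limit` lambda can be illustrated without running an experiment. The sketch below uses only the standard library. The two-hour budget, the `hidden_layers` parameter and the threshold of 100 are hypothetical values chosen for illustration, and the assumption that the lambda receives one permutation as a dict is not confirmed by this page.

```python
from datetime import datetime, timedelta

# Build a `time_limit` string in the documented `%Y-%m-%d %H:%M` format.
# The start time and the two-hour budget are arbitrary, illustrative choices.
start = datetime(2022, 1, 1, 9, 30)
time_limit = (start + timedelta(hours=2)).strftime('%Y-%m-%d %H:%M')

# The string parses back with the same format.
parsed_limit = datetime.strptime(time_limit, '%Y-%m-%d %H:%M')

# A `boolean_limit` candidate: a lambda that returns True for permutations
# that should be kept. Receiving a dict per permutation is an assumption.
boolean_limit = lambda p: p['first_neuron'] * p['hidden_layers'] < 100

# Hypothetical permutations, used only to show what the lambda filters out.
permutations = [
    {'first_neuron': 12, 'hidden_layers': 2},
    {'first_neuron': 48, 'hidden_layers': 2},
    {'first_neuron': 48, 'hidden_layers': 4},
]
kept = [p for p in permutations if boolean_limit(p)]

print(time_limit)
print(kept)
```

Both values would then be passed as the `time_limit` and `boolean_limit` arguments of `DistributedScan()`.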

## DistributedScan Object Properties

Once the `DistributedScan()` procedures are completed, an object with several useful properties is returned. The namespace is kept strictly clean, so every property holds meaningful content.

After running the following experiment, the properties can be accessed through `distributed_scan_object`, which is a Python class object.

```python
distributed_scan_object = jako.DistributedScan(x, y, model=iris_model, params=p, fraction_limit=0.1, config='config.json')
```
<hr>

**`best_model`** picks the best model based on a given metric and returns the index number for the model.

```python
distributed_scan_object.best_model(metric='f1score', asc=False)
```
NOTE: `metric` has to be one of the metrics used in the experiment, and `asc` has to be True when the metric is something to be minimized.
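
The role of `asc` can be shown with a small stand-in for the results table. This sketch does not call Jako: the `results` list and the `pick_best` helper are hypothetical and only mimic choosing the index of the best row for a given metric.

```python
# Hypothetical per-permutation results; in a real experiment these values
# would come from the results of the scan.
results = [
    {'f1score': 0.81, 'val_loss': 0.42},
    {'f1score': 0.93, 'val_loss': 0.35},
    {'f1score': 0.88, 'val_loss': 0.29},
]

def pick_best(rows, metric, asc=False):
    """Return the index of the best row for `metric`.

    asc=False treats higher values as better (e.g. f1score);
    asc=True treats lower values as better (e.g. a loss).
    """
    indices = range(len(rows))
    key = lambda i: rows[i][metric]
    return min(indices, key=key) if asc else max(indices, key=key)

print(pick_best(results, 'f1score', asc=False))  # index of the highest f1score
print(pick_best(results, 'val_loss', asc=True))  # index of the lowest val_loss
```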

<hr>

**`data`** returns a pandas DataFrame with the results of the experiment together with the hyperparameter permutation details.

```python
distributed_scan_object.data
```

<hr>

**`details`** returns a pandas Series with various meta-information about the experiment.

```python
distributed_scan_object.details
```

<hr>

**`evaluate_models`** creates a new column in `distributed_scan_object.data` with the results of k-fold cross-validation.

```python
distributed_scan_object.evaluate_models(x_val=x_val,
                                        y_val=y_val,
                                        n_models=10,
                                        metric='f1score',
                                        folds=5,
                                        shuffle=True,
                                        average='binary',
                                        asc=False)
```

Argument | Description
-------- | -----------
`distributed_scan_object` | The class object returned by `DistributedScan()` upon completion of the experiment.
`x_val` | Input data (features) in the same format as used in `DistributedScan()`, but not the same data (or it will not be much of a validation).
`y_val` | Input data (labels) in the same format as used in `DistributedScan()`, but not the same data (or it will not be much of a validation).
`n_models` | The number of models to be evaluated. If set to 10, then the 10 models with the highest metric value are evaluated. See below.
`metric` | The metric to be used for picking the models to be evaluated.
`folds` | The number of folds to be used in the evaluation.
`shuffle` | Whether the data is shuffled. Always set to False for time series, but keep in mind that you might get periodical/seasonal bias.
`average` | One of the supported averaging methods: 'binary', 'micro', or 'macro'.
`asc` | Set to True if the metric is to be minimized.
`saved` | (bool) Whether a model saved on the local machine should be used.
`custom_objects` | (dict) If the model has a custom object, pass it here.
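
To make `folds` and `shuffle` concrete, here is a minimal sketch of how validation data could be partitioned into folds. It is an illustration in plain Python, not Jako's implementation; `make_folds` and the fixed seed are hypothetical.

```python
import random

def make_folds(n_samples, folds=5, shuffle=True, seed=0):
    """Split the sample indices 0..n_samples-1 into `folds` nearly equal parts."""
    indices = list(range(n_samples))
    if shuffle:
        # Shuffling breaks temporal order, so keep shuffle=False for time series.
        random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, folds)
    out, start = [], 0
    for i in range(folds):
        size = base + (1 if i < extra else 0)
        out.append(indices[start:start + size])
        start += size
    return out

parts = make_folds(23, folds=5, shuffle=False)
print([len(part) for part in parts])
```

In a k-fold evaluation, a model is scored separately on each part, which gives a spread of scores rather than a single number.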

<hr>

**`learning_entropy`** returns a pandas DataFrame with an entropy measure for each permutation, describing how much the results vary between the epochs of that permutation.

```python
distributed_scan_object.learning_entropy
```

<hr>

**`params`** returns a dictionary with the original input parameter ranges for the experiment.

```python
distributed_scan_object.params
```

<hr>

**`round_times`** returns a pandas DataFrame with the time when each permutation started, when it ended, and how many seconds it took.

```python
distributed_scan_object.round_times
```

<hr>

**`round_history`** returns epoch-by-epoch data for each model in a dictionary.

```python
distributed_scan_object.round_history
```

<hr>

**`saved_models`** returns the JSON (dictionary) for each model.

```python
distributed_scan_object.saved_models
```

<hr>

**`saved_weights`** returns the weights for each model.

```python
distributed_scan_object.saved_weights
```

<hr>

**`x`** returns the input data (features).

```python
distributed_scan_object.x
```

<hr>

**`y`** returns the input data (labels).

```python
distributed_scan_object.y
```

## Input Model

The input model is any Keras or tf.keras model. It's the model that Jako will use as the basis for the hyperparameter experiment.

#### A minimal example

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def input_model(x_train, y_train, x_val, y_val, params):

    # build the model from the hyperparameters of the current permutation
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation=params['activation']))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer=params['optimizer'])

    out = model.fit(x=x_train,
                    y=y_train,
                    validation_data=(x_val, y_val))

    # return the training history and the trained model, in that order
    return out, model
```
See specific details about defining the model [here](Examples_Typical?id=defining-the-model).

#### Models with multiple inputs or outputs (list of arrays)

In both cases, `x_val` and `y_val` must be set explicitly in `DistributedScan(... x_val, y_val ...)`, i.e. you split the data yourself before passing it into Jako. The snippets below use the minimal example above as a reference.

For **multi-input**, change `model.fit()` as shown below:

```python
out = model.fit(x=[x_train_a, x_train_b],
                y=y_train,
                validation_data=([x_val_a, x_val_b], y_val))
```

For **multi-output**, the same structure is expected, but instead of changing the `x` argument values, change `y`:

```python
out = model.fit(x=x_train,
                y=[y_train_a, y_train_b],
                validation_data=(x_val, [y_val_a, y_val_b]))
```
When the model is both **multi-input** and **multi-output**, both the `x` and `y` argument values follow the same structure:

```python
out = model.fit(x=[x_train_a, x_train_b],
                y=[y_train_a, y_train_b],
                validation_data=([x_val_a, x_val_b], [y_val_a, y_val_b]))
```

## Parameter Dictionary

The first step in an experiment is to decide which hyperparameters to include in the optimization process.

#### A minimal example

```python
p = {
    'first_neuron': [12, 24, 48],
    'activation': ['relu', 'elu'],
    'batch_size': [10, 20, 30]
}
```

#### Supported Input Formats

Parameters may be given either as a list or as a tuple.

As a set of discrete values in a list:

```python
p = {'first_neuron': [12, 24, 48]}
```
As a range of values `(min, max, steps)` in a tuple:

```python
p = {'first_neuron': (12, 48, 2)}
```

When a static value is preferred but it is still useful to include it in the parameter dictionary, use a single-item list:

```python
p = {'first_neuron': [48]}
```
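
The two input formats can be made concrete by expanding a dictionary into its permutations. The sketch below is plain Python rather than Jako's own code: `expand` and `permutations` are hypothetical helpers, and the way a `(min, max, steps)` tuple is expanded here (`steps` evenly spaced values starting at `min` and stopping before `max`) is an assumption that may differ from what the library does.

```python
from itertools import product

def expand(values):
    """Turn one dictionary entry into a list of candidate values."""
    if isinstance(values, tuple):
        # Assumed range semantics: `steps` evenly spaced values in [min, max).
        low, high, steps = values
        width = (high - low) / steps
        out = [low + i * width for i in range(steps)]
        if isinstance(low, int) and isinstance(high, int):
            # Integer bounds give integer values, without duplicates.
            out = sorted({int(v) for v in out})
        return out
    return list(values)

def permutations(params):
    """Return every combination of the expanded values as a list of dicts."""
    names = list(params)
    grids = [expand(params[name]) for name in names]
    return [dict(zip(names, combo)) for combo in product(*grids)]

p = {
    'first_neuron': (12, 48, 2),
    'activation': ['relu', 'elu'],
    'batch_size': [10, 20, 30]
}

print(expand(p['first_neuron']))
print(len(permutations(p)))
```

The number of permutations grows multiplicatively with each added parameter, which is why arguments such as `fraction_limit` and `round_limit` exist to cap how many are processed.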

## DistributedScan Config file

The config file holds all the information needed to connect to the remote machines. It also designates one of the remote machines as the central datastore. A sample config file looks like the following:

```
{
    "run_central_node": true,
    "machines": [
        {
            "machine_id": 1,
            "JAKO_IP_ADDRESS": "machine_1_ip_address",
            "JAKO_PORT": machine_1_port,
            "JAKO_USER": "machine_1_username",
            "JAKO_PASSWORD": "machine_1_password"
        },
        {
            "machine_id": 2,
            "JAKO_IP_ADDRESS": "machine_2_ip_address",
            "JAKO_PORT": machine_2_port,
            "JAKO_USER": "machine_2_username",
            "JAKO_KEY_FILENAME": "machine_2_key_file_path"
        }
    ],
    "database": {
        "DB_HOST_MACHINE_ID": 1,
        "DB_USERNAME": "database_username",
        "DB_PASSWORD": "database_password",
        "DB_TYPE": "database_type",
        "DATABASE_NAME": "database_name",
        "DB_PORT": database_port,
        "DB_ENCODING": "LATIN1",
        "DB_UPDATE_INTERVAL": 5
    }
}
```

In the sample, `machine_1_port`, `machine_2_port` and `database_port` are placeholders for integer values, and `DB_HOST_MACHINE_ID` is the id of the machine that hosts the datastore.

### DistributedScan config arguments

Argument | Input | Description
--------- | ------- | -----------
`run_central_node` | bool | If set to true, the central machine where the script runs is also included in the distributed run.
`machines` | list of dict | List of machine configurations.
`machine_id` | int | Id for each machine, in ascending order.
`JAKO_IP_ADDRESS` | str | IP address of the remote machine.
`JAKO_PORT` | int | Port number of the remote machine.
`JAKO_USER` | str | Username for the remote machine.
`JAKO_PASSWORD` | str | Password for the remote machine.
`JAKO_KEY_FILENAME` | str | Path to the RSA private key for the machine; can be supplied when a password is not available.
`database` | dict | Configuration parameters for the central datastore.
`DB_HOST_MACHINE_ID` | int | The machine id of the remote machine that hosts the database.
`DB_USERNAME` | str | Database username.
`DB_PASSWORD` | str | Database password.
`DB_TYPE` | str | Database type. Defaults to `postgres`. The available options are `postgres`, `mysql` and `sqlite`.
`DATABASE_NAME` | str | Database name.
`DB_PORT` | int | Database port.
`DB_ENCODING` | str | Defaults to `LATIN1`.
`DB_UPDATE_INTERVAL` | int | How often the database is updated, in seconds. Defaults to `5`.
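
Since `config` accepts either a path or a dict, the file can also be generated from Python. The following is a minimal sketch using only the standard library. The addresses, ports, user names, key path and credentials are made-up placeholder values, and `check_config` is a hypothetical helper that is not part of Jako.

```python
import json
import os
import tempfile

# All values below are illustrative placeholders, not working credentials.
config = {
    "run_central_node": True,
    "machines": [
        {"machine_id": 1, "JAKO_IP_ADDRESS": "192.0.2.10", "JAKO_PORT": 22,
         "JAKO_USER": "user_1", "JAKO_PASSWORD": "password_1"},
        {"machine_id": 2, "JAKO_IP_ADDRESS": "192.0.2.11", "JAKO_PORT": 22,
         "JAKO_USER": "user_2", "JAKO_KEY_FILENAME": "/path/to/key_file"},
    ],
    "database": {
        "DB_HOST_MACHINE_ID": 1,
        "DB_USERNAME": "db_user",
        "DB_PASSWORD": "db_password",
        "DB_TYPE": "postgres",
        "DATABASE_NAME": "jako_db",
        "DB_PORT": 5432,
        "DB_ENCODING": "LATIN1",
        "DB_UPDATE_INTERVAL": 5,
    },
}

def check_config(cfg):
    """Basic sanity checks derived from the argument table above."""
    ids = [m["machine_id"] for m in cfg["machines"]]
    assert ids == sorted(ids), "machine ids should be in ascending order"
    assert cfg["database"]["DB_HOST_MACHINE_ID"] in ids, "datastore host must be a listed machine"
    for m in cfg["machines"]:
        # Assumption: each machine needs either a password or a key file.
        assert "JAKO_PASSWORD" in m or "JAKO_KEY_FILENAME" in m
    return True

check_config(config)

# Write the file to a temporary directory and read it back.
path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(path, "w") as f:
    json.dump(config, f, indent=4)
with open(path) as f:
    loaded = json.load(f)
```

Either `path` or the `config` dict itself could then be passed as the `config` argument.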

@@ -0,0 +1,57 @@
# Install Options

Before installing Jako, it is recommended to first set up and start the following:
* A Python or conda environment.
* A PostgreSQL database set up on one of the machines. This will be used as the central datastore.

## Installing Jako

#### Creating a python virtual environment
```shell
virtualenv -p python3 jako_env
source jako_env/bin/activate
```

#### Creating a conda virtual environment
```shell
conda create --name jako_env
conda activate jako_env
```

#### Install latest from PyPI
```shell
pip install jako
```

#### Install a specific version from PyPI
```shell
pip install jako==0.1
```

#### Upgrade installation from PyPI
```shell
pip install -U --no-deps jako
```

#### Install from monthly (default branch)
```shell
pip install --upgrade --no-deps --force-reinstall git+https://github.com/autonomio/jako
```

#### Install from weekly (`dev` branch)
```shell
pip install --upgrade --no-deps --force-reinstall git+https://github.com/autonomio/jako@dev
```

#### Install from daily (`daily-dev` branch)
```shell
pip install --upgrade --no-deps --force-reinstall git+https://github.com/autonomio/jako@daily-dev
```
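
After any of the install commands, it can be useful to confirm which version ended up in the active environment. This is a small sketch using the standard library's `importlib.metadata` (Python 3.8+); `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

found = installed_version('jako')
print(found if found is not None else 'jako is not installed in this environment')
```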

## Installing a postgres database

To enable postgres in your central datastore, follow the steps in one of these guides:
* Postgres on Ubuntu machines: [link](https://blog.logrocket.com/setting-up-a-remote-postgres-database-server-on-ubuntu-18-04/)
* Postgres on Mac machines: [link](https://www.sqlshack.com/setting-up-a-postgresql-database-on-mac/)