Getting Started with JuliaHub.jl
This tutorial walks you through the basic operations you can do with JuliaHub.jl, from installation to submitting simple jobs and working with datasets. If you are unfamiliar with JuliaHub.jl, this is a good place to get started.
If you already know what you wish to achieve with JuliaHub.jl, you can also skip this and jump directly into one of the more detailed how-to guides.
In particular, the tutorial will show:
- How to install JuliaHub.jl and connect it to a JuliaHub instance.
- How to create, access and update a simple dataset.
- How to submit a simple job.
Installation
JuliaHub.jl is a registered Julia package and can be installed using Julia's package manager. You can access the Julia package manager REPL mode by pressing ], and you can install JuliaHub.jl with
pkg> add JuliaHub
Alternatively, you can use the Pkg standard library functions to install it.
import Pkg
Pkg.add("JuliaHub")
Once it is installed, simply use import or using to load JuliaHub.jl into your current Julia session.
julia> using JuliaHub
JuliaHub.jl does not have any exported names, so using JuliaHub does not introduce any functions or types into Main. Instead, JuliaHub.jl functions are designed to be called with the JuliaHub. prefix (e.g. JuliaHub.authenticate(...) or JuliaHub.submit_job(...)).
That said, there is nothing stopping you from explicitly bringing some names into your current scope, e.g. with using JuliaHub: submit_job, if you so wish!
Authentication
In order to communicate with a JuliaHub instance, you need a valid authentication token. If you are working in a JuliaHub Cloud IDE, you do not need to do anything to be authenticated, as the authentication tokens are automatically set up in the cloud environment. To verify this, you can still call authenticate, which should load the pre-configured token.
julia> JuliaHub.authenticate()
JuliaHub.Authentication("https://juliahub.com", "username", *****)
If you are working on a local computer, the easiest way to get started is to pass the URL of the JuliaHub instance to authenticate. Unless you have authenticated before, this will initiate an interactive browser-based authentication.
julia> JuliaHub.authenticate("juliahub.com")
Authentication required: please authenticate in browser.
The authentication page should open in your browser automatically, but you may need to switch to the opened window or tab. If the authentication page is not automatically opened, you can authenticate by manually opening the following URL: ...
Once you have completed the steps in the browser, the function should return a valid authentication token.
The authenticate function returns an Authentication object, which holds the authentication token. In principle, you can pass these objects directly to JuliaHub.jl functions via the auth keyword argument. However, in practice, this is usually not needed, because JuliaHub.jl also remembers the last authentication in the Julia session in a global variable. You can see the current globally stored authentication token with current_authentication.
julia> JuliaHub.current_authentication()
JuliaHub.Authentication("https://juliahub.com", "username", *****)
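To illustrate the auth keyword argument described above, here is a minimal sketch that passes the Authentication object explicitly instead of relying on the global default. It assumes you can authenticate against juliahub.com; substitute the URL of your own instance if it differs.

```julia
import JuliaHub

# Authenticate once and keep the returned Authentication object around.
auth = JuliaHub.authenticate("juliahub.com")

# Pass the object explicitly via the `auth` keyword argument, rather than
# relying on the authentication remembered in the global variable.
my_datasets = JuliaHub.datasets(; auth=auth)
println("Found ", length(my_datasets), " dataset(s)")
```

Passing auth explicitly can be useful if a single session needs to talk to more than one JuliaHub instance.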
There is more to authentication than this, including its relationship to the Julia package server and the JULIA_PKG_SERVER environment variable. See the Authentication how-to if you want to learn more.
Creating & accessing datasets
JuliaHub.jl allows you to create, access, and update the datasets that are hosted on JuliaHub. This section shows some of the basic operations you can perform with datasets.
The datasets function allows you to list the datasets you have. Optionally, you can also make it show any other datasets you have access to.
julia> JuliaHub.datasets()
JuliaHub.Dataset[]
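As a sketch of the optional behaviour mentioned above: the datasets function is assumed here to accept a shared keyword argument that also lists datasets other users have shared with you. That keyword is not confirmed by this tutorial, so check the function's reference documentation for the exact name and semantics. The snippet also assumes an already authenticated session.

```julia
import JuliaHub

# Assumes JuliaHub.authenticate(...) has already succeeded in this session.
own = JuliaHub.datasets()

# ASSUMPTION: `shared=true` additionally includes datasets that belong to
# other users but that you have access to.
accessible = JuliaHub.datasets(; shared=true)

println("own: ", length(own), ", accessible: ", length(accessible))
```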
Unless you have created datasets in the web UI or in the IDE, this list will likely be empty currently. To fix that, let us upload a simple dataset using JuliaHub.jl.
Just as an example, we'll generate a simple 5-by-5 matrix and save it to a file using the DelimitedFiles standard library.
julia> using DelimitedFiles
julia> mat = [i^2 + j^2 for i=1:5, j=1:5]
5×5 Matrix{Int64}:
  2   5  10  17  26
  5   8  13  20  29
 10  13  18  25  34
 17  20  25  32  41
 26  29  34  41  50
julia> writedlm("matrix.dat", mat)
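Before uploading, it can be worth checking locally that the file round-trips. The following sketch writes the same matrix to a temporary directory (an arbitrary choice, to avoid touching your working directory) and reads it back; writedlm uses a tab as its default delimiter, which is why '\t' is the delimiter when reading.

```julia
using DelimitedFiles

mat = [i^2 + j^2 for i = 1:5, j = 1:5]

# Write into a temporary directory so the working directory is left alone.
path = joinpath(mktempdir(), "matrix.dat")
writedlm(path, mat)

# The first row is stored as tab-separated text.
first_line = readline(path)

# Reading the file back with the same delimiter recovers the original matrix.
roundtrip = readdlm(path, '\t', Int)
println(roundtrip == mat)
```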
Now that the matrix has been serialized into a text file on the disk, we can upload that file to JuliaHub with upload_dataset.
julia> JuliaHub.upload_dataset("tutorial-matrix", "matrix.dat")
Transferred: 86.767 KiB / 86.767 KiB, 100%, 0 B/s, ETA -
Transferred: 1 / 1, 100%
Elapsed time: 2.1s
Dataset: tutorial-matrix (Blob)
 owner: username
 description:
 versions: 1
 size: 57 bytes
If you already happen to have a dataset with the same name, the upload_dataset call will fail. It is designed to be safe by default. However, you can pass update=true to upload your file as a new version of the dataset, or replace=true to delete all existing versions and upload a brand new version.
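The two options can be sketched as follows, assuming an authenticated session, an existing tutorial-matrix dataset, and a matrix.dat file in the current directory. The destructive variant is left commented out on purpose.

```julia
import JuliaHub

# update=true keeps the existing versions and adds the file as a new version.
ds = JuliaHub.upload_dataset("tutorial-matrix", "matrix.dat"; update=true)

# replace=true would instead delete all existing versions before uploading.
# Uncomment only if losing the old versions is acceptable:
# ds = JuliaHub.upload_dataset("tutorial-matrix", "matrix.dat"; replace=true)
```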
If we now call datasets, the new dataset should show up in the list.
julia> JuliaHub.datasets()
1-element Vector{JuliaHub.Dataset}:
 JuliaHub.dataset(("username", "tutorial-matrix"))
To see more details about the dataset, you can index into the array returned by datasets. Alternatively, you can also use the dataset function to pick out a single dataset by its name.
julia> JuliaHub.dataset("tutorial-matrix")
Dataset: tutorial-matrix (Blob)
 owner: username
 description:
 versions: 1
 size: 57 bytes
JuliaHub datasets also support basic metadata, such as tags and a description field. These can be set directly in the upload_dataset call, but we did not do so. That is fine, since we can use update_dataset to update the metadata at any time.
julia> JuliaHub.update_dataset("tutorial-matrix", description="An i^2 + j^2 matrix")
Dataset: tutorial-matrix (Blob)
 owner: username
 description: An i^2 + j^2 matrix
 versions: 1
 size: 57 bytes
The update_dataset function also immediately queries JuliaHub for the updated dataset metadata by internally calling JuliaHub.dataset("tutorial-matrix").
Finally, JuliaHub.jl also allows you to download the datasets you have with the download_dataset function. We can also imagine doing this on a different computer or in a JuliaHub job.
julia> JuliaHub.download_dataset("tutorial-matrix", "matrix-downloaded.dat")
Transferred: 86.767 KiB / 86.767 KiB, 100%, 0 B/s, ETA -
Transferred: 1 / 1, 100%
Elapsed time: 2.1s
"/home/runner/work/JuliaHub.jl/JuliaHub.jl/docs/build/matrix-downloaded.dat"
This downloads the dataset into a local file, after which you can e.g. read it back into Julia and do operations on it.
julia> mat = readdlm("matrix-downloaded.dat", '\t', Int)
5×5 Matrix{Int64}:
  2   5  10  17  26
  5   8  13  20  29
 10  13  18  25  34
 17  20  25  32  41
 26  29  34  41  50
julia> sum(mat)
550
While this demo uploaded a single file as a dataset, JuliaHub also supports uploading whole directories as a single dataset. For that, you can simply point upload_dataset to a directory, rather than a file. See the datasets how-to for more information on how to work with datasets.
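A directory upload might look like the sketch below. The dataset name tutorial-directory and the two file names are hypothetical and chosen only for illustration; an authenticated session is assumed.

```julia
import JuliaHub
using DelimitedFiles

# Build a small directory containing two files (hypothetical example data).
dir = mktempdir()
writedlm(joinpath(dir, "squares.dat"), [i^2 for i = 1:5])
writedlm(joinpath(dir, "cubes.dat"), [i^3 for i = 1:5])

# Pointing upload_dataset at the directory uploads it as a single dataset.
ds = JuliaHub.upload_dataset("tutorial-directory", dir)
```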
Submitting a job
JuliaHub.jl allows for easy programmatic submission of JuliaHub jobs. In this example, we submit a simple script that downloads the dataset from the previous step, does a simple calculation, and then uploads the result. We then access the result locally with JuliaHub.jl.
First, we need to specify the code that we want to run in the job. There are a few options for this, but in this example we use the @script_str string macro to construct a script-type computation, which simply runs the code snippet we specify.
The following script accesses the dataset, calculates the sum of all the elements, and stores the value in the job results. You will be able to access the contents of RESULTS both in the web UI and via JuliaHub.jl.
s = JuliaHub.script"""
using JuliaHub, DelimitedFiles
@info JuliaHub.authenticate()
JuliaHub.download_dataset("tutorial-matrix", "matrix-downloaded.dat")
mat = readdlm("matrix-downloaded.dat", '\t', Int)
mat_sum = @show sum(mat)
ENV["RESULTS"] = string(mat_sum)
"""
JuliaHub.BatchJob:
code = """
using JuliaHub, DelimitedFiles
@info JuliaHub.authenticate()
JuliaHub.download_dataset("tutorial-matrix", "matrix-downloaded.dat")
mat = readdlm("matrix-downloaded.dat", '\t', Int)
mat_sum = @show sum(mat)
ENV["RESULTS"] = string(mat_sum)
"""
sha256(project_toml) = 93a83d60d4a9c6a3d1438259fd506929eaad296b7e112e886b305781b85cb85b
sha256(manifest_toml) = 69cdbbbf2d3df6f0f561ac94bef374ad21e04420ac0ff00170d6af71e59d4cf9
In most cases, you also submit a Julia package environment (i.e. Project.toml and Manifest.toml files) together with a job. That environment then gets instantiated before the user-provided code is run.
The script"" string macro, by default, attaches the currently active environment to the job. This means that any packages that you are currently using should also be available on the job (although only registered packages added as non-development dependencies will work). You can use Base.active_project() or pkg> status to see what environment is currently active.
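As a small sketch of how to inspect the environment that would be attached, the snippet below prints the active project file and lists its direct dependencies using the Pkg standard library. Nothing here is specific to JuliaHub.jl.

```julia
import Pkg

# Path of the Project.toml of the currently active environment.
project_file = Base.active_project()
println("Active project: ", project_file)

# Pkg.project() describes the active environment; its `dependencies` field
# maps the names of direct dependencies to their UUIDs.
deps = sort(collect(keys(Pkg.project().dependencies)))
println("Direct dependencies: ", join(deps, ", "))
```

If the script run by the job loads a package, such as JuliaHub or DelimitedFiles in the example above, it is worth checking that the package appears in this list before submitting.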
To submit a job, you can simply call submit_job on it.
julia> j = JuliaHub.submit_job(s)
JuliaHub.Job: jr-xf4tslavut (Submitted)
 submitted: 2023-03-15T07:56:50.974+00:00
 started: 2023-03-15T07:56:51.251+00:00
 finished: 2023-03-15T07:56:59.000+00:00
The submit_job function also allows you to configure how the job gets run, such as how many CPUs or how much memory it has available. By default, though, it runs your code on a single node, picking the smallest instance that is available.
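A resource request might be sketched as below. The keyword names ncpu and memory, and the interpretation of memory as gigabytes, are assumptions that this tutorial does not confirm, so verify them against the submit_job reference before relying on them. The script body is a placeholder, and an authenticated session is assumed.

```julia
import JuliaHub

# Placeholder script; by default the active environment is attached to it.
s = JuliaHub.script"""
println("Hello from a JuliaHub job")
"""

# ASSUMPTION: `ncpu` is the number of CPU cores and `memory` is the amount
# of memory in GB. Check the submit_job documentation for the exact options.
j = JuliaHub.submit_job(s; ncpu=2, memory=4)
```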
At this point, if you go to the "Jobs" page of the web UI, you should see the job there. It may take a few moments to actually start running. You can also call job on the returned Job object to refresh the status of the job.
julia> j = JuliaHub.job(j)
JuliaHub.Job: jr-xf4tslavut (Running)
 submitted: 2023-03-15T07:56:50.974+00:00
 started: 2023-03-15T07:56:51.251+00:00
 finished: 2023-03-15T07:56:59.000+00:00
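Rather than refreshing by hand, the refresh step can be wrapped in a small polling loop. The helper poll_until below is hypothetical (it is not part of JuliaHub.jl) and is written generically so that it can be tried without a JuliaHub connection. The commented usage line assumes that JuliaHub.isdone reports whether a job has finished; treat that as an assumption to verify in the package reference.

```julia
# Hypothetical helper: repeatedly refresh `state` until `isfinished(state)`
# returns true, sleeping `interval` seconds between attempts.
function poll_until(refresh, isfinished, state; interval=5.0, max_attempts=120)
    for _ in 1:max_attempts
        state = refresh(state)
        isfinished(state) && return state
        sleep(interval)
    end
    error("gave up after $max_attempts attempts")
end

# With JuliaHub.jl this might be used as follows (ASSUMPTION: `j` is a
# submitted job and JuliaHub.isdone exists with this meaning):
#   j = poll_until(JuliaHub.job, JuliaHub.isdone, j)

# Stand-alone demonstration with a counter standing in for the job.
final_state = poll_until(x -> x + 1, x -> x >= 3, 0; interval=0)
println(final_state)
```

The max_attempts limit is a design choice that keeps the loop from waiting forever if a job never reaches a finished state.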
Finally, after the job has completed, refreshing the Job object should reflect the final status of the job and also give you access to the job's results.
julia> j = JuliaHub.job(j)
JuliaHub.Job: jr-xf4tslavut (Completed)
 submitted: 2023-03-15T07:56:50.974+00:00
 started: 2023-03-15T07:56:51.251+00:00
 finished: 2023-03-15T07:56:59.000+00:00
 outputs: "550"
julia> j.results
"550"
See the jobs how-to guide for more details on the different options when it comes to job submission.
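Because the script stored string(mat_sum) in RESULTS, the value of j.results comes back as a string rather than a number. A minimal sketch of turning it back into an integer is shown below; the literal "550" stands in for j.results so that the snippet can be run without a job.

```julia
# Stand-in for `j.results`; in the tutorial the job stored string(sum(mat)).
raw_results = "550"

# tryparse returns `nothing` instead of throwing if the string is not an integer.
total = tryparse(Int, raw_results)
total === nothing && error("unexpected job results: $raw_results")

println(total + 1)
```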
Next steps
This tutorial has hopefully given an overview of basic JuliaHub.jl usage. For more advanced usage, you may want to read through the more detailed how-to guides.