If using the variation script, you do not have to handle the upload of your sequencing data yourself. All you need to do is upload simple text files with download links for the data into a specially tagged history. The variation script will scan the history for you, parse the links, upload the data, and trigger variation analysis runs as the data becomes available.
To make this work you need to structure the history according to the expectations of the script. Here's how to do that:
- For each batch of samples you want to analyze, prepare one text file with download links.
- Links found in one file will be analyzed as one batch in one script run in Galaxy.
- There must be one link per line in the file.
- Link lines must follow one of these formats:
  - Just a link, in the form

    `<baseurl>/<sampleID>_[12].<file_extension>`

    For example,

    `ftp.sra.ebi.ac.uk/vol1/fastq/ERR545/006/ERR5451836/ERR5451836_1.fastq.gz`

    specifies an ENA download link for the forward (`_1`) reads of a sample with ID `ERR5451836`.

  - Explicit sample ID and link, in the form

    `<sample_id>: <baseurl>/<arbitrary_name>_[12].<file_extension>`

    For example,

    `ERR5451836: https://files.my-server.org/sample001_1.fq.gz`

    The explicitly provided sample ID will take precedence over the downloaded file's base name.

  - Explicit sample IDs and links, with sample IDs reused on multiple lines. Example:

        ERX5451093: ftp.sra.ebi.ac.uk/vol1/fastq/ERR573/000/ERR5739250/ERR5739250_1.fastq.gz
        ERX5451093: ftp.sra.ebi.ac.uk/vol1/fastq/ERR573/000/ERR5739250/ERR5739250_2.fastq.gz
        ERX5451093: ftp.sra.ebi.ac.uk/vol1/fastq/ERR573/001/ERR5739251/ERR5739251_1.fastq.gz
        ERX5451093: ftp.sra.ebi.ac.uk/vol1/fastq/ERR573/001/ERR5739251/ERR5739251_2.fastq.gz
        ERX5451094: ftp.sra.ebi.ac.uk/vol1/fastq/ERR573/005/ERR5739255/ERR5739255_1.fastq.gz
        ERX5451094: ftp.sra.ebi.ac.uk/vol1/fastq/ERR573/005/ERR5739255/ERR5739255_2.fastq.gz
        ERX5451094: ftp.sra.ebi.ac.uk/vol1/fastq/ERR573/006/ERR5739256/ERR5739256_1.fastq.gz
        ERX5451094: ftp.sra.ebi.ac.uk/vol1/fastq/ERR573/006/ERR5739256/ERR5739256_2.fastq.gz

    Cases like this are interpreted as the linked datasets representing partial data of the indicated samples. Using this format will trigger the execution of an extended workflow version that first merges the partial data into combined datasets for each indicated sample before performing a regular variation analysis.

    In the above example, `ERR5739250` and `ERR5739251` represent partial data of the sample described by ENA experiment `ERX5451093`, while `ERR5739255` and `ERR5739256` each provide partial data for `ERX5451094`. Specified like above, the variation script will trigger an analysis for two samples, `ERX5451093` and `ERX5451094`, with the downloaded data merged and rearranged appropriately.
- The order of lines in the file does not matter. You must, however, specify exactly one forward (`_1`) reads file and one reverse (`_2`) reads file for each sample.
- The sample ID portion of the link will be carried through the whole pipeline and will become the basename of every output file for that sample.
- If links do not specify the transport protocol directly, like in the above example, you need to configure the protocol in the variation script's config file (see the script's Usage instructions).
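To make the accepted link-line formats above more concrete, here is a minimal Python sketch of how such a batch file could be parsed and grouped by sample. The function and variable names are hypothetical illustrations, not the variation script's actual internals:

```python
import re

# Optional "<sample_id>: " prefix, followed by a URL whose file name
# ends in _1 or _2 plus a file extension.
LINE_RE = re.compile(
    r"^(?:(?P<sample_id>\S+):\s+)?"  # explicit sample ID (optional)
    r"(?P<url>\S+/(?P<base>[^/\s]+)_(?P<read>[12])\.\S+)$"  # link with read number
)

def parse_links_file(text):
    """Group the link lines of one batch file as {sample_id: {read: [urls]}}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = LINE_RE.match(line)
        if not m:
            raise ValueError(f"unrecognized link line: {line!r}")
        # An explicit sample ID takes precedence over the file's base name.
        sample = m.group("sample_id") or m.group("base")
        per_read = samples.setdefault(sample, {"1": [], "2": []})
        per_read[m.group("read")].append(m.group("url"))
    for sample, per_read in samples.items():
        # Forward and reverse links must come in matching numbers; more than
        # one _1/_2 pair per sample means partial data to be merged first.
        if len(per_read["1"]) != len(per_read["2"]):
            raise ValueError(f"unpaired reads data for sample {sample}")
    return samples
```

A file mixing bare ENA links and explicit `sample_id:` lines would thus yield one entry per sample, with the read-1 and read-2 URLs kept separate for pairing.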
- Create a new history on your target Galaxy server.
- Upload your batch files with download links to the new history as a Galaxy Collection:
  - Open the Galaxy Upload Manager (by clicking the *Upload Data* button on the top-right of the tool panel).
  - In the *Download from web or upload from disk* dialog window, switch to the *Collection* tab and confirm that *Collection Type* is set to *List*.

    Note: Even if you have just a single dataset with links from just one batch of data, you need to upload it into a (single-element) collection!
  - Select *Choose local files*.
  - Select the file(s) you want to upload.
  - Press *Start*.
  - Once the *Build* button gets enabled, click on it.
  - In the ensuing dialog, enter a name for the collection.

    Important: The name has to match the `metadata_collection_name` set in the variation script's config file (see the script's Usage instructions).

    Note: By clicking on the individual dataset names in that same dialog you can edit these, too. These names will be treated as the batch identifiers in the analysis and will be propagated to the history names generated by the scripts.

  - Press *Create list*.
- To make the variation script aware of the history and start processing the download links in the collection, add its recognized tag to your history.

  The history tag that the variation script will be looking for can be set in its configuration file under `metadata_history_tag`.

  Click on the *Edit history tags* icon below the history name in the history panel. This will reveal any existing tags of the history (none in your case) and a big tag icon. Click on the icon, start typing the tag name as it appears in the config file, and confirm with the Enter key.
That's it! On its next run, the variation script will pick up the history and process the first unprocessed links dataset in any suitably named collection. In each subsequent run it will then work on the links in the next dataset until all datasets are processed.
Whenever you obtain sequencing data for additional samples, you can add it exactly as described above. You can either

- create a completely new history with a collection in it and add the expected history tag to it, or
- reuse your existing, tagged history and simply add the new data as a new collection of datasets with download links.
Remember that in either case all collections must use the same name as defined in the variation script config file. What will differ between them are the names/batch identifiers of the contained datasets.
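Putting the two settings mentioned in this guide together, the relevant part of the variation script's config file might look like the following. The key names (`metadata_collection_name`, `metadata_history_tag`) are the ones referenced above, but the YAML layout and the example values are assumptions; consult the script's Usage instructions for the actual file format:

```yaml
# Hypothetical config excerpt; the real file layout may differ.
# Name that every uploaded collection of links datasets must use:
metadata_collection_name: batch-download-links
# History tag the script scans for on each run:
metadata_history_tag: variation-analysis
```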