data_import.md

File metadata and controls

138 lines (93 loc) · 6.19 KB

# Preparing a history with download links

If you use the variation script you do not have to handle the upload of your sequencing data yourself. All you need to do is upload simple text files with download links for the data into a specially tagged history. The variation script will scan the history for you, parse the links, upload the data, and trigger variation analysis runs as the data becomes available.

To make this work you need to structure the history according to the expectations of the script. Here's how to do that:

  1. For each batch of samples you want to analyze prepare one text file with download links.

    • Links found in one file will be analyzed together as one batch in a single script run in Galaxy

    • There must be one link per line in the file

    • Link lines must follow one of these formats:

      • Just a link in the form:

        `<baseurl>/<sampleID>_[12].<file_extension>`

        for example,

        ftp.sra.ebi.ac.uk/vol1/fastq/ERR545/006/ERR5451836/ERR5451836_1.fastq.gz

        specifies an ENA download link for the forward (_1) reads of a sample with ID ERR5451836.

      • Explicit sample ID and link:

        `<sample_id>: <baseurl>/<arbitrary_name>_[12].<file_extension>`

        for example:

        ERR5451836: https://files.my-server.org/sample001_1.fq.gz

        The explicitly provided sample ID will take precedence over the downloaded file's base name.

      • Explicit sample IDs and links with sample IDs reused on multiple lines

        Example:

        ERX5451093: ftp.sra.ebi.ac.uk/vol1/fastq/ERR573/000/ERR5739250/ERR5739250_1.fastq.gz
        ERX5451093: ftp.sra.ebi.ac.uk/vol1/fastq/ERR573/000/ERR5739250/ERR5739250_2.fastq.gz
        ERX5451093: ftp.sra.ebi.ac.uk/vol1/fastq/ERR573/001/ERR5739251/ERR5739251_1.fastq.gz
        ERX5451093: ftp.sra.ebi.ac.uk/vol1/fastq/ERR573/001/ERR5739251/ERR5739251_2.fastq.gz
        ERX5451094: ftp.sra.ebi.ac.uk/vol1/fastq/ERR573/005/ERR5739255/ERR5739255_1.fastq.gz
        ERX5451094: ftp.sra.ebi.ac.uk/vol1/fastq/ERR573/005/ERR5739255/ERR5739255_2.fastq.gz
        ERX5451094: ftp.sra.ebi.ac.uk/vol1/fastq/ERR573/006/ERR5739256/ERR5739256_1.fastq.gz
        ERX5451094: ftp.sra.ebi.ac.uk/vol1/fastq/ERR573/006/ERR5739256/ERR5739256_2.fastq.gz
        

        Lines like these are interpreted to mean that the linked datasets represent partial data of the indicated samples. Using this format triggers an extended workflow version that first merges the partial data into one combined dataset per indicated sample, then performs the regular variation analysis.

        In the above example, ERR5739250 and ERR5739251 represent partial data of a sample described by ENA experiment ERX5451093, while ERR5739255 and ERR5739256 each provide partial data for ERX5451094. Specified like this, the variation script will trigger an analysis for two samples, ERX5451093 and ERX5451094, with the downloaded data merged and rearranged appropriately.

    • The order of lines in the file does not matter.

      You must, however, specify exactly one forward (_1) and one reverse (_2) reads file for each sample (or, when using the partial-data format above, matching numbers of forward and reverse files per sample).

    • The sample ID portion of the link will be carried through the whole pipeline and will become the basename of every output file for that sample.

    • If links do not specify the transport protocol explicitly, as in the ENA examples above, you need to configure the protocol in the variation script's config file (see the script's Usage instructions)
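As a concrete illustration of the link-line formats above, here is a small parsing sketch. All names in it (`parse_links`, `validate`, the regex) are made up for this example and are not taken from the actual variation script:

```python
import re
from collections import defaultdict

# The two documented line shapes:
#   <baseurl>/<sampleID>_[12].<file_extension>
#   <sample_id>: <baseurl>/<arbitrary_name>_[12].<file_extension>
READS_RE = re.compile(r"_([12])\.[^/]+$")

def parse_links(lines):
    """Group download links per sample: {sample_id: {'1': [...], '2': [...]}}."""
    samples = defaultdict(lambda: {"1": [], "2": []})
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if ": " in line:
            # explicit "<sample_id>: <link>" form takes precedence
            sample_id, url = line.split(": ", 1)
        else:
            # bare link: derive the sample ID from the file's base name
            url = line
            sample_id = re.sub(r"_[12]\.[^/]*$", "", url.rsplit("/", 1)[-1])
        mate = READS_RE.search(url)
        if mate is None:
            raise ValueError(f"cannot tell forward from reverse reads: {url}")
        samples[sample_id][mate.group(1)].append(url)
    return dict(samples)

def validate(samples):
    # Every sample needs matching numbers of forward and reverse files
    # (exactly one of each unless partial data is to be merged).
    for sample_id, reads in samples.items():
        if not reads["1"] or len(reads["1"]) != len(reads["2"]):
            raise ValueError(f"unbalanced reads for sample {sample_id}")
```

Note how lines sharing a sample ID simply accumulate in the per-mate lists, which mirrors the partial-data merging case.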

  2. Create a new history on your target Galaxy server

  3. Upload your batch files with download links to the new history as a Galaxy Collection

    • Open the Galaxy Upload Manager (by clicking the Upload Data button on the top-right of the tool panel)

    • In the Download from web or upload from disk dialog window, switch to the Collection tab and confirm that Collection Type is set to List

      Note: Even if you have just a single dataset with links from just one batch of data, you need to upload it into a (single-element) collection!

    • Select Choose local files

    • Select the file(s) you want to upload

    • Press Start

    • Once the Build button becomes enabled, click it

    • In the ensuing dialog, enter a name for the collection

      Important: The name has to match the metadata_collection_name set in the variation script config file (see the script's Usage instructions)

      Note: By clicking on the individual dataset names in that same dialog you can edit them, too. These names are treated as the batch identifiers in the analysis and are propagated to the names of the histories the script generates.

    • Press Create list

  4. To make the variation script aware of the history and start processing the download links in the collection, add its recognized tag to your history

    The history tag that the variation script will be looking for can be set in its configuration file under metadata_history_tag.

    Click on the Edit history tags icon below the history name in the history panel. This reveals any existing tags of the history (none in your case) and a big tag icon. Click the icon, start typing the tag name as it appears in the config file, and confirm with the Enter key.
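For reference, the two configuration keys mentioned in this guide, metadata_collection_name and metadata_history_tag, might appear together in the script's config file roughly as follows. YAML is assumed here purely for illustration and the values are made-up examples; consult the script's Usage instructions for the actual format:

```yaml
# Hypothetical excerpt; key names are from this guide, values are examples only
metadata_history_tag: variation-links       # tag that marks histories to scan
metadata_collection_name: Links to data     # required name of every links collection
```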

That's it! On its next run, the variation script will pick up the history and process the first unprocessed links dataset in any suitably named collection. Each subsequent run then works on the links in the next dataset until all datasets are processed.
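The one-dataset-per-run behavior described here can be pictured with a toy sketch (assumed logic for illustration only, not the script's actual code):

```python
def next_unprocessed(collection, processed):
    """Return the first links dataset not handled by a previous run, else None."""
    for dataset in collection:
        if dataset not in processed:
            return dataset
    return None

# Simulate three script runs over a two-dataset collection: the first two
# runs each consume one dataset, the third finds nothing left to do.
processed = set()
collection = ["2021-05_batch", "2021-06_batch"]
for _ in range(3):
    dataset = next_unprocessed(collection, processed)
    if dataset is not None:
        processed.add(dataset)  # a real run would now trigger the analyses
```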

## Adding additional batches as they become available

Whenever you obtain sequencing data for additional samples you can add them exactly as described above. You can either

  • create a completely new history with a collection in it and add the expected history tag to it

  • reuse your existing, tagged history and simply add new data as a new collection of datasets with download links

Remember that in either case all collections must use the same name as defined in the variation script's config file. What will differ between them are the names/batch identifiers of the contained datasets.