Develop (#68)
* Change print statements to log statements

* Fix flake errors

* Add retry logic for 500 and 401 errors from CMR

* Subscriber: check if file exists before downloading

Prevents re-downloading files (e.g. in case a previous run
failed because of other file failures).

If the subscriber sees that a file already exists, it will also calculate
the file checksum and see if it matches the checksum in
CMR. If the checksum doesn't match, it will re-download.

There is now a --force/-f option that will cause the subscriber
to re-download files even if they exist and are up to date.

Issue #17

* Issues/15 (#65)

* updated get_search to include verbose option, not entire 'args' option

* added search after functionality to podaac access; removed scroll from initial parameters

* updated changelog

* closes #15

* Update python-app.yml

added netrc creation for future use in regression tests.

* Add checks for pre-existing files to downloader (#67)

* Check if file exists before download - downloader

* Update documentation

Co-authored-by: Wilbert Veit <[email protected]>

* Programmatic Regression Testing (#66)

* added programmatic regression testing. Currently relies on a valid .netrc file; refactoring might be needed to manually add a user/password to the CMR/TEA downloads

* Update python-app.yml

* updated regression tests, readied 1.9.0 version

* added -f option test to downloader regression

* Update python-app.yml

Co-authored-by: Joe Sapp <[email protected]>
Co-authored-by: mgangl <[email protected]>
Co-authored-by: Frank Greguska <[email protected]>
Co-authored-by: Wilbert Veit <[email protected]>
Co-authored-by: Wilbert Veit <[email protected]>
6 people authored Apr 28, 2022
1 parent 9e324a4 commit bd21411
Showing 17 changed files with 800 additions and 165 deletions.
13 changes: 11 additions & 2 deletions .github/workflows/python-app.yml
Original file line number Diff line number Diff line change
@@ -7,7 +7,7 @@ on:
push:
branches: [ main, develop ]
pull_request:
branches: [ main ]
branches: [ main, develop ]

jobs:
build:
@@ -33,4 +33,13 @@ jobs:
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with pytest
run: |
pytest
pytest -m "not regression"
- name: netrc-gen
uses: extractions/netrc@v1
with:
machine: urs.earthdata.nasa.gov
username: ${{ secrets.EDL_OPS_USERNAME }}
password: ${{ secrets.EDL_OPS_PASSWORD }}
- name: Regression Test with pytest
run: |
pytest -m "regression"
9 changes: 8 additions & 1 deletion CHANGELOG.md
@@ -4,9 +4,16 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)


## Unreleased

## [1.9.0]
### Added
- Check if a file already exists before downloading it. [17](https://github.com/podaac/data-subscriber/issues/17)
- Added automated regression testing
### Changed
- Implemented Search After CMR interface to allow granule listings > 2000 [15](https://github.com/podaac/data-subscriber/issues/15)
- Retry CMR queries on server error using random exponential backoff (max 60 seconds, 10 retries)
- Refresh token if CMR returns 401 error
- Converted print statements to log statements
### Deprecated
### Removed
### Fixed
24 changes: 20 additions & 4 deletions Downloader.md
@@ -6,9 +6,7 @@ For installation and dependency information, please see the [top-level README](R

```
$> podaac-data-downloader -h
usage: PO.DAAC bulk-data downloader [-h] -c COLLECTION -d OUTPUTDIRECTORY [--cycle SEARCH_CYCLES] [-sd STARTDATE] [-ed ENDDATE]
[-b BBOX] [-dc] [-dydoy] [-dymd] [-dy] [--offset OFFSET] [-e EXTENSIONS] [--process PROCESS_CMD]
[--version] [--verbose] [-p PROVIDER] [--limit LIMIT]
usage: PO.DAAC bulk-data downloader [-h] -c COLLECTION -d OUTPUTDIRECTORY [--cycle SEARCH_CYCLES] [-sd STARTDATE] [-ed ENDDATE] [-f] [-b BBOX] [-dc] [-dydoy] [-dymd] [-dy] [--offset OFFSET] [-e EXTENSIONS] [--process PROCESS_CMD] [--version] [--verbose] [-p PROVIDER] [--limit LIMIT]
optional arguments:
-h, --help show this help message and exit
@@ -22,6 +20,8 @@ optional arguments:
The ISO date time before which data should be retrieved. For Example, --start-date 2021-01-14T00:00:00Z
-ed ENDDATE, --end-date ENDDATE
The ISO date time after which data should be retrieved. For Example, --end-date 2021-01-14T00:00:00Z
-f, --force
Flag to force downloading files that are listed in CMR query, even if the file exists and checksum matches
-b BBOX, --bounds BBOX
The bounding rectangle to filter result in. Format is W Longitude,S Latitude,E Longitude,N Latitude without
spaces. Due to an issue with parsing arguments, to use this command, please use the -b="-180,-90,180,90" syntax
@@ -50,7 +50,7 @@ optional arguments:

Usage:
```
usage: PO.DAAC bulk-data downloader [-h] -c COLLECTION -d OUTPUTDIRECTORY [--cycle SEARCH_CYCLES] [-sd STARTDATE] [-ed ENDDATE]
usage: PO.DAAC bulk-data downloader [-h] -c COLLECTION -d OUTPUTDIRECTORY [--cycle SEARCH_CYCLES] [-sd STARTDATE] [-ed ENDDATE] [-f]
[-b BBOX] [-dc] [-dydoy] [-dymd] [-dy] [--offset OFFSET] [-e EXTENSIONS] [--process PROCESS_CMD]
[--version] [--verbose] [-p PROVIDER] [--limit LIMIT]
```
@@ -163,6 +163,22 @@ The subscriber allows the placement of downloaded files into one of several dire
* -dymd - optional, relative paths use the start time of a granule to layout data in a YEAR/MONTH/DAY path


### Downloader behavior when a file already exists

By default, when the downloader is about to download a file, it first:
- Checks whether the file already exists in the target location
- If it does, computes the file's checksum and compares it with the checksum CMR reports for that file

If the file already exists AND the checksum matches, the downloader will skip downloading that file.

This can drastically reduce the time for the downloader to complete. Also, since the checksum is verified, files will still be re-downloaded if for some reason the file has changed (or the file already on disk is corrupted).

You can override this default behavior and force the downloader to always download matching files by using --force/-f.

```
podaac-data-downloader -c SENTINEL-1A_SLC -d myData -f
```
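The decision above can be sketched as follows; the helper names and the choice of MD5 are illustrative assumptions, not necessarily the downloader's actual implementation:

```python
import hashlib
import pathlib


def checksum_matches(path, expected_checksum, algorithm="md5"):
    """Hash the local file and compare against the checksum CMR reported."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_checksum


def should_download(path, expected_checksum, force=False):
    """Skip the download only when the file exists, its checksum
    matches CMR's, and --force was not given."""
    if force:
        return True
    path = pathlib.Path(path)
    if not path.exists():
        return True
    return not checksum_matches(path, expected_checksum)
```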

### Setting a bounding rectangle for filtering results

If you're interested in a specific region, you can set the bounds parameter on your request to filter data that passes through a certain area. This is useful in particular for non-global datasets (such as swath datasets) with non-global coverage per file.
25 changes: 21 additions & 4 deletions Subscriber.md
@@ -6,14 +6,15 @@ For installation and dependency information, please see the [top-level README](R

```
$> podaac-data-subscriber -h
usage: PO.DAAC data subscriber [-h] -c COLLECTION -d OUTPUTDIRECTORY [-sd STARTDATE] [-ed ENDDATE] [-b BBOX] [-dc] [-dydoy] [-dymd] [-dy] [--offset OFFSET] [-m MINUTES] [-e EXTENSIONS] [--process PROCESS_CMD] [--version] [--verbose] [-p PROVIDER]
usage: PO.DAAC data subscriber [-h] -c COLLECTION -d OUTPUTDIRECTORY [-f] [-sd STARTDATE] [-ed ENDDATE] [-b BBOX] [-dc] [-dydoy] [-dymd] [-dy] [--offset OFFSET] [-m MINUTES] [-e EXTENSIONS] [--process PROCESS_CMD] [--version] [--verbose] [-p PROVIDER]
optional arguments:
-h, --help show this help message and exit
-c COLLECTION, --collection-shortname COLLECTION
The collection shortname for which you want to retrieve data.
-d OUTPUTDIRECTORY, --data-dir OUTPUTDIRECTORY
The directory where data products will be downloaded.
-f, --force Flag to force downloading files that are listed in CMR query, even if the file exists and checksum matches
-sd STARTDATE, --start-date STARTDATE
The ISO date time before which data should be retrieved. For Example, --start-date 2021-01-14T00:00:00Z
-ed ENDDATE, --end-date ENDDATE
@@ -37,12 +38,11 @@ optional arguments:
Specify a provider for collection search. Default is POCLOUD.
```

##Run the Script
## Run the Script

Usage:
```
usage: podaac_data_subscriber.py [-h] -c COLLECTION -d OUTPUTDIRECTORY [-sd STARTDATE] [-ed ENDDATE] [-b BBOX] [-dc] [-dydoy] [-dymd] [-dy] [--offset OFFSET]
[-m MINUTES] [-e EXTENSIONS] [--version] [--verbose] [-p PROVIDER]
usage: podaac_data_subscriber.py [-h] -c COLLECTION -d OUTPUTDIRECTORY [-f] [-sd STARTDATE] [-ed ENDDATE] [-b BBOX] [-dc] [-dydoy] [-dymd] [-dy] [--offset OFFSET] [-m MINUTES] [-e EXTENSIONS] [--version] [--verbose] [-p PROVIDER]
```

To run the script, the following parameters are required:
@@ -112,6 +112,7 @@ machine urs.earthdata.nasa.gov

**If the script cannot find the netrc file, you will be prompted to enter the username and password, and the script won't be able to generate the CMR token**


## Advanced Usage

### Request data from another DAAC...
@@ -141,6 +142,22 @@ The subscriber allows the placement of downloaded files into one of several dire
* -dydoy - optional, relative paths use the start time of a granule to layout data in a YEAR/DAY-OF-YEAR path
* -dymd - optional, relative paths use the start time of a granule to layout data in a YEAR/MONTH/DAY path

### Subscriber behavior when a file already exists

By default, when the subscriber is about to download a file, it first:
- Checks whether the file already exists in the target location
- If it does, computes the file's checksum and compares it with the checksum CMR reports for that file

If the file already exists AND the checksum matches, the subscriber will skip downloading that file.

This can drastically reduce the time for the subscriber to complete. Also, since the checksum is verified, files will still be re-downloaded if for some reason the file has changed (or the file already on disk is corrupted).

You can override this default behavior and force the subscriber to always download matching files by using --force/-f.

```
podaac-data-subscriber -c SENTINEL-1A_SLC -d myData -f
```
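As a rough illustration of how such a flag is defined with argparse (a sketch, not necessarily the subscriber's actual argument setup):

```python
import argparse

parser = argparse.ArgumentParser(prog="PO.DAAC data subscriber")
parser.add_argument("-f", "--force", action="store_true",
                    help="Force downloading files that are listed in the CMR "
                         "query, even if the file exists and checksum matches")

args = parser.parse_args(["-f"])
# args.force is True when -f/--force is given, False otherwise
```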

### Running as a Cron job

To automatically run and update a local file system with data files from a collection, one can use a syntax like the following:
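The repository's own example is collapsed in this diff view. As a hedged illustration only (the collection shortname, paths, and schedule below are invented for this sketch), a crontab entry might look like:

```
0 * * * * podaac-data-subscriber -c VIIRS_N20-OSPO-L2P-v2.61 -d /path/to/data -m 60 >> /var/log/podaac-subscriber.log 2>&1
```

Here -m 60 asks the subscriber for recently updated granules over a window matching the hourly schedule.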
1 change: 1 addition & 0 deletions dev-requirements.txt
@@ -0,0 +1 @@
pytest==7.1.1
4 changes: 4 additions & 0 deletions pyproject.toml
@@ -4,3 +4,7 @@ requires = [
"wheel"
]
build-backend = "setuptools.build_meta"
[tool.pytest.ini_options]
markers = [
    "regression: marks a test as a regression, requires netrc file (deselect with '-m \"not regression\"')"
]
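With the marker registered as above, regression tests are tagged and selected like this (the test names and bodies are illustrative):

```python
import pytest


@pytest.mark.regression
def test_end_to_end_download():
    # Would hit live CMR/Earthdata endpoints and needs a valid .netrc.
    pass


def test_plain_unit():
    # Plain unit test; runs in CI without credentials.
    assert "podaac".upper() == "PODAAC"
```

`pytest -m "not regression"` deselects the tagged test, while `pytest -m regression` runs only it, mirroring the two pytest steps in the workflow above.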
1 change: 1 addition & 0 deletions requirements.txt
@@ -3,3 +3,4 @@ chardet==4.0.0
idna==2.10
requests==2.25.1
urllib3>=1.26.5
tenacity>=8.0.1
6 changes: 3 additions & 3 deletions setup.py
@@ -4,7 +4,7 @@
long_description = fh.read()

setup(name='podaac-data-subscriber',
version='1.8.0',
version='1.9.0',
description='PO.DAAC Data Subscriber Command Line Tool',
url='https://github.com/podaac/data-subscriber',
long_description=long_description,
@@ -15,7 +15,7 @@
packages=['subscriber'],
entry_points='''
[console_scripts]
podaac-data-subscriber=subscriber.podaac_data_subscriber:run
podaac-data-downloader=subscriber.podaac_data_downloader:run
podaac-data-subscriber=subscriber.podaac_data_subscriber:main
podaac-data-downloader=subscriber.podaac_data_downloader:main
''',
zip_safe=False)
