Develop (#103)
* Issues/91 (#92)

* added citation creation tests and functionality to subscriber and downloader

* added a verbose option to the create_citation_file command, which was previously hard-coded

* updated changelog (whoops) and fixed regression tests:
1. Issue where the newly downloaded citation file affected the file counts
2. Issue where the logic for determining whether a file's modified time had changed was picking up the new citation file, which _always_ gets rewritten to update the 'last accessed' date.

* updated request to include exc_info in warning; fixed an issue where params not being a dictionary caused errors

* changed a warning to debug for the citation file; fixed test issues

* Enable debug logging during regression tests and set max parallel workflows to 2

* added output to pytest

* fixed test to only look for downloaded data files, not the citation file, due to 'random' CMR errors when creating a citation.

* added mock testing and retry on 503

* added 503 fixes

Co-authored-by: Frank Greguska <[email protected]>

* fixed issues where token was not propagated to CMR queries (#95)

* Misc fixes (#101)

* added ".tiff" to default extensions to address #100

* removed 'warning' message on not downloading all data to close #99

* updated help documentation for start/end times to close #79

* bumped the version and updated the CHANGELOG

* added token get,delete, refresh and list operations

* Revert "added token get,delete, refresh and list operations"

This reverts commit 15aba90.

* Update python-app.yml

Co-authored-by: Frank Greguska <[email protected]>
mike-gangl and Frank Greguska committed Sep 2, 2022
1 parent 1a5f534 commit 665c9c5
Showing 10 changed files with 224 additions and 30 deletions.
7 changes: 5 additions & 2 deletions .github/workflows/python-app.yml
@@ -13,9 +13,10 @@ jobs:
build:
strategy:
fail-fast: false
max-parallel: 2
matrix:
python-version: [ "3.7", "3.8", "3.9", "3.10" ]
poetry-version: [ "1.1" ]
poetry-version: [ "1.1.14" ]
os: [ ubuntu-18.04, macos-latest, windows-latest ]
runs-on: ${{ matrix.os }}
steps:
@@ -47,5 +48,7 @@ jobs:
username: ${{ secrets.EDL_OPS_USERNAME }}
password: ${{ secrets.EDL_OPS_PASSWORD }}
- name: Regression Test with pytest
env:
PODAAC_LOGLEVEL: "DEBUG"
run: |
poetry run pytest -m "regression"
poetry run pytest -o log_cli=true --log-cli-level=DEBUG -m "regression"
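
The PODAAC_LOGLEVEL variable added above drives the debug logging during regression runs. As a minimal sketch of how such an environment-driven level is typically wired into Python's logging module (the subscriber's actual setup may differ):

import logging
import os

# Read the desired level from the environment, falling back to INFO when the
# variable is unset or not a recognized level name.
level_name = os.environ.get("PODAAC_LOGLEVEL", "INFO").upper()
logging.basicConfig(level=getattr(logging, level_name, logging.INFO))
logging.debug("Debug logging enabled via PODAAC_LOGLEVEL")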
11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,17 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)

## 1.11.0
### Fixed
- Fixed an issue where token-refresh was expecting a dictionary, not a list of tuples
- Fixed issues where token was not propagated to downloader CMR query [94](https://github.com/podaac/data-subscriber/issues/94)
- Fixed an issue with 503 errors on data download not being retried. [97](https://github.com/podaac/data-subscriber/issues/97)
- added ".tiff" to default extensions to address #[100](https://github.com/podaac/data-subscriber/issues/100)
- removed erroneous 'warning' message on not downloading all data to close [99](https://github.com/podaac/data-subscriber/issues/99)
- updated help documentation for start/end times to close [79](https://github.com/podaac/data-subscriber/issues/79)
### Added
- Added citation file creation when data are downloaded [91](https://github.com/podaac/data-subscriber/issues/91). Required some updates to the regression testing.

## [1.10.2]
### Fixed
- Fixed an issue where using a default global bounding box prevented download of data that didn't use the horizontal spatial domain [87](https://github.com/podaac/data-subscriber/issues/87)
24 changes: 21 additions & 3 deletions poetry.lock

Some generated files are not rendered by default.

3 changes: 2 additions & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "podaac-data-subscriber"
version = "1.10.2"
version = "1.11.0"
description = "PO.DAAC Data Subscriber Command Line Tool"
authors = ["PO.DAAC <[email protected]>"]
readme = "README.md"
@@ -19,6 +19,7 @@ tenacity = "^8.0.1"
[tool.poetry.dev-dependencies]
pytest = "^7.1.2"
flake8 = "^4.0.1"
pytest-mock = "^3.8.2"

[tool.poetry.scripts]
podaac-data-subscriber = 'subscriber.podaac_data_subscriber:main'
75 changes: 73 additions & 2 deletions subscriber/podaac_access.py
@@ -10,19 +10,23 @@
from typing import Dict
from urllib import request
from urllib.error import HTTPError
from urllib.request import urlretrieve
import subprocess
from urllib.parse import urlencode
from urllib.request import Request, urlopen
import hashlib
from datetime import datetime
import time


import requests
import tenacity

__version__ = "1.10.2"
extensions = [".nc", ".h5", ".zip", ".tar.gz"]
__version__ = "1.11.0"
extensions = [".nc", ".h5", ".zip", ".tar.gz", ".tiff"]
edl = "urs.earthdata.nasa.gov"
cmr = "cmr.earthdata.nasa.gov"
token_url = "https://" + cmr + "/legacy-services/rest/tokens"
@@ -286,6 +290,26 @@ def get_temporal_range(start, end, now):
raise ValueError("One of start-date or end-date must be specified.")


def download_file(remote_file, output_path, retries=3):
    failed = False
    for r in range(retries):
        try:
            urlretrieve(remote_file, output_path)
        except HTTPError as e:
            if e.code == 503:
                logging.warning(f'Error downloading {remote_file}. Retrying download.')
                # back off on sleep time after each error...
                time.sleep(r + 1)
                if r == retries - 1:
                    failed = True
            else:
                raise
        else:
            # downloaded file without a 503 error
            break

    if failed:
        raise Exception("Could not download file.")


# Retry using random exponential backoff if a 500 error is raised. Maximum 10 attempts.
@tenacity.retry(wait=tenacity.wait_random_exponential(multiplier=1, max=60),
stop=tenacity.stop_after_attempt(10),
@@ -436,3 +460,50 @@ def make_checksum(file_path, algorithm):
for chunk in iter(lambda: f.read(4096), b""):
hash_alg.update(chunk)
return hash_alg.hexdigest()

def get_cmr_collections(params, verbose=False):
query = urlencode(params)
url = "https://" + cmr + "/search/collections.umm_json?" + query
if verbose:
logging.info(url)

# Build the request and query CMR for collections matching the parameters
req = Request(url)
response = urlopen(req)
result = json.loads(response.read().decode())
return result


def create_citation(collection_json, access_date):
citation_template = "{creator}. {year}. {title}. Ver. {version}. PO.DAAC, CA, USA. Dataset accessed {access_date} at {doi_authority}/{doi}"

# Better error handling here may be needed...
doi = collection_json['DOI']["DOI"]
doi_authority = collection_json['DOI']["Authority"]
citation = collection_json["CollectionCitations"][0]
creator = citation["Creator"]
release_date = citation["ReleaseDate"]
title = citation["Title"]
version = citation["Version"]
year = datetime.strptime(release_date, "%Y-%m-%dT%H:%M:%S.000Z").year
return citation_template.format(creator=creator, year=year, title=title, version=version, doi_authority=doi_authority, doi=doi, access_date=access_date)

def create_citation_file(short_name, provider, data_path, token=None, verbose=False):
# get collection umm-c METADATA
params = [
('provider', provider),
('ShortName', short_name)
]
if token is not None:
params.append(('token', token))

collection = get_cmr_collections(params, verbose)['items'][0]

access_date = datetime.now().strftime("%Y-%m-%d")

# create citation from umm-c metadata
citation = create_citation(collection['umm'], access_date)
# write file

with open(data_path + "/" + short_name + ".citation.txt", "w") as text_file:
text_file.write(citation)
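
To illustrate the new citation helpers, a hypothetical example; the UMM-C fragment below is made up for this sketch, and its field names simply mirror the lookups in create_citation:

from datetime import datetime

collection_umm = {
    "DOI": {"DOI": "10.5067/EXAMPLE-DOI", "Authority": "https://doi.org"},
    "CollectionCitations": [{
        "Creator": "Example Project",
        "ReleaseDate": "2020-01-01T00:00:00.000Z",
        "Title": "Example L4 Dataset",
        "Version": "4.2",
    }],
}

print(create_citation(collection_umm, datetime.now().strftime("%Y-%m-%d")))
# Example Project. 2020. Example L4 Dataset. Ver. 4.2. PO.DAAC, CA, USA.
# Dataset accessed YYYY-MM-DD at https://doi.org/10.5067/EXAMPLE-DOI

Both tools call create_citation_file(short_name, provider, data_path, token, verbose) after a successful download, which fetches this metadata from CMR and writes <data_path>/<short_name>.citation.txt.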
27 changes: 19 additions & 8 deletions subscriber/podaac_data_downloader.py
@@ -60,9 +60,9 @@ def create_parser():
help="Cycle number for determining downloads. can be repeated for multiple cycles",
action='append', type=int)
parser.add_argument("-sd", "--start-date", required=False, dest="startDate",
help="The ISO date time before which data should be retrieved. For Example, --start-date 2021-01-14T00:00:00Z") # noqa E501
help="The ISO date time after which data should be retrieved. For Example, --start-date 2021-01-14T00:00:00Z") # noqa E501
parser.add_argument("-ed", "--end-date", required=False, dest="endDate",
help="The ISO date time after which data should be retrieved. For Example, --end-date 2021-01-14T00:00:00Z") # noqa E501
help="The ISO date time before which data should be retrieved. For Example, --end-date 2021-01-14T00:00:00Z") # noqa E501

# Adding optional arguments
parser.add_argument("-f", "--force", dest="force", action="store_true", help = "Flag to force downloading files that are listed in CMR query, even if the file exists and checksum matches") # noqa E501
@@ -178,6 +178,7 @@ def run(args=None):
('provider', provider),
('ShortName', short_name),
('temporal', temporal_range),
('token', token),
]
if args.verbose:
logging.info("Temporal Range: " + temporal_range)
@@ -193,7 +194,11 @@
except HTTPError as e:
if e.code == 401:
token = pa.refresh_token(token, 'podaac-subscriber')
params['token'] = token
# Updated: params is not always a dictionary...
# in fact, here it's always a list of tuples
for i, p in enumerate(params):
    if p[0] == "token":
        params[i] = ("token", token)
results = pa.get_search_results(params, args.verbose)
else:
raise e
@@ -221,10 +226,6 @@

downloads = [item for sublist in downloads_all for item in sublist]

if len(downloads) >= page_size:
logging.warning("Only the most recent " + str(
page_size) + " granules will be downloaded; try adjusting your search criteria (suggestion: reduce time period or spatial region of search) to ensure you retrieve all granules.")

# filter list based on extension
if not extensions:
extensions = pa.extensions
@@ -268,7 +269,9 @@ def run(args=None):
skip_cnt += 1
continue

urlretrieve(f, output_path)
pa.download_file(f, output_path)

pa.process_file(process_cmd, output_path, args)
logging.info(str(datetime.now()) + " SUCCESS: " + f)
success_cnt = success_cnt + 1
@@ -284,6 +287,14 @@ def run(args=None):
logging.info("Downloaded Files: " + str(success_cnt))
logging.info("Failed Files: " + str(failure_cnt))
logging.info("Skipped Files: " + str(skip_cnt))

# create citation file if any downloads succeeded
if success_cnt > 0:
    try:
        pa.create_citation_file(short_name, provider, data_path, token, args.verbose)
    except Exception:
        logging.debug("Error generating citation", exc_info=True)

pa.delete_token(token_url, token)
logging.info("END\n\n")

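The token-refresh change above (and its twin in the subscriber below) fixes the same pitfall: params is built as a list of (key, value) tuples, so the old dict-style assignment could never work. A minimal demonstration, with illustrative values:

params = [('provider', 'POCLOUD'), ('ShortName', 'EXAMPLE'), ('token', 'old')]
# params['token'] = 'new'   # TypeError: list indices must be integers or slices

# Replace the matching tuple in place instead:
for i, (key, value) in enumerate(params):
    if key == 'token':
        params[i] = ('token', 'new')

assert ('token', 'new') in params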
26 changes: 18 additions & 8 deletions subscriber/podaac_data_subscriber.py
@@ -66,10 +66,10 @@ def create_parser():

# spatiotemporal arguments
parser.add_argument("-sd", "--start-date", dest="startDate",
help="The ISO date time before which data should be retrieved. For Example, --start-date 2021-01-14T00:00:00Z",
help="The ISO date time after which data should be retrieved. For Example, --start-date 2021-01-14T00:00:00Z",
default=False) # noqa E501
parser.add_argument("-ed", "--end-date", dest="endDate",
help="The ISO date time after which data should be retrieved. For Example, --end-date 2021-01-14T00:00:00Z",
help="The ISO date time before which data should be retrieved. For Example, --end-date 2021-01-14T00:00:00Z",
default=False) # noqa E501
parser.add_argument("-b", "--bounds", dest="bbox",
help="The bounding rectangle to filter result in. Format is W Longitude,S Latitude,E Longitude,N Latitude without spaces. Due to an issue with parsing arguments, to use this command, please use the -b=\"-180,-90,180,90\" syntax when calling from the command line. Default: \"-180,-90,180,90\".",
@@ -218,7 +218,12 @@ def run(args=None):
except HTTPError as e:
if e.code == 401:
token = pa.refresh_token(token, 'podaac-subscriber')
params['token'] = token
# Updated: params is not always a dictionary...
# in fact, here it's always a list of tuples
for i, p in enumerate(params):
    if p[0] == "token":
        params[i] = ("token", token)
results = pa.get_search_results(params, args.verbose)
else:
raise e
@@ -249,10 +254,6 @@

downloads = [item for sublist in downloads_all for item in sublist]

if len(downloads) >= page_size:
logging.warning("Only the most recent " + str(
page_size) + " granules will be downloaded; try adjusting your search criteria (suggestion: reduce time period or spatial region of search) to ensure you retrieve all granules.")

# filter list based on extension
if not extensions:
extensions = pa.extensions
@@ -294,7 +295,9 @@
skip_cnt += 1
continue

urlretrieve(f, output_path)
pa.download_file(f, output_path)

pa.process_file(process_cmd, output_path, args)
logging.info(str(datetime.now()) + " SUCCESS: " + f)
success_cnt = success_cnt + 1
@@ -314,6 +317,13 @@
logging.info("Downloaded Files: " + str(success_cnt))
logging.info("Failed Files: " + str(failure_cnt))
logging.info("Skipped Files: " + str(skip_cnt))

# create citation file if any downloads succeeded
if success_cnt > 0:
    try:
        pa.create_citation_file(short_name, provider, data_path, token, args.verbose)
    except Exception:
        logging.debug("Error generating citation", exc_info=True)

pa.delete_token(token_url, token)
logging.info("END\n\n")
#exit(0)
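On the ".tiff" default (#100): the extension list in podaac_access is only consulted via the `if not extensions` fallback shown above, i.e. when the user passes no -e flags. A hypothetical sketch of that filtering step (the real helper may differ):

extensions = [".nc", ".h5", ".zip", ".tar.gz", ".tiff"]
downloads = ["granule_a.nc", "granule_b.tiff", "granule_b.met"]

# Keep only URLs ending in one of the wanted extensions; before this commit,
# ".tiff" was missing from the defaults, so GeoTIFF granules were skipped.
filtered = [u for u in downloads if any(u.endswith(ext) for ext in extensions)]
assert filtered == ["granule_a.nc", "granule_b.tiff"]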
15 changes: 10 additions & 5 deletions tests/test_downloader_regression.py
@@ -19,10 +19,11 @@ def create_downloader_args(args):
@pytest.mark.regression
def test_downloader_limit_MUR():
shutil.rmtree('./MUR25-JPL-L4-GLOB-v04.2', ignore_errors=True)
args2 = create_downloader_args('-c MUR25-JPL-L4-GLOB-v04.2 -d ./MUR25-JPL-L4-GLOB-v04.2 -sd 2020-01-01T00:00:00Z -ed 2020-01-30T00:00:00Z --limit 1'.split())
args2 = create_downloader_args('-c MUR25-JPL-L4-GLOB-v04.2 -d ./MUR25-JPL-L4-GLOB-v04.2 -sd 2020-01-01T00:00:00Z -ed 2020-01-30T00:00:00Z --limit 1 --verbose'.split())
pdd.run(args2)
# count number of files downloaded...
assert len([name for name in os.listdir('./MUR25-JPL-L4-GLOB-v04.2') if os.path.isfile('./MUR25-JPL-L4-GLOB-v04.2/' + name)])==1
# When the tests run in parallel we sometimes get a 401 on the token, so
# make sure we only count downloaded data files here, not the citation file
assert len([name for name in os.listdir('./MUR25-JPL-L4-GLOB-v04.2') if os.path.isfile('./MUR25-JPL-L4-GLOB-v04.2/' + name) and "citation.txt" not in name]) == 1
shutil.rmtree('./MUR25-JPL-L4-GLOB-v04.2')

# Test the downloader on MUR25 data for start/stop, yyyy/mm/dd dir structure,
@@ -31,7 +32,7 @@ def test_downloader_limit_MUR():
@pytest.mark.regression
def test_downloader_MUR():
shutil.rmtree('./MUR25-JPL-L4-GLOB-v04.2', ignore_errors=True)
args2 = create_downloader_args('-c MUR25-JPL-L4-GLOB-v04.2 -d ./MUR25-JPL-L4-GLOB-v04.2 -sd 2020-01-01T00:00:00Z -ed 2020-01-02T00:00:00Z -dymd --offset 4'.split())
args2 = create_downloader_args('-c MUR25-JPL-L4-GLOB-v04.2 -d ./MUR25-JPL-L4-GLOB-v04.2 -sd 2020-01-01T00:00:00Z -ed 2020-01-02T00:00:00Z -dymd --offset 4 --verbose'.split())
pdd.run(args2)
assert exists('./MUR25-JPL-L4-GLOB-v04.2/2020/01/01/20200101090000-JPL-L4_GHRSST-SSTfnd-MUR25-GLOB-v02.0-fv04.2.nc')
assert exists('./MUR25-JPL-L4-GLOB-v04.2/2020/01/02/20200102090000-JPL-L4_GHRSST-SSTfnd-MUR25-GLOB-v02.0-fv04.2.nc')
@@ -54,7 +55,7 @@ def test_downloader_MUR():
t1 = os.path.getmtime('./MUR25-JPL-L4-GLOB-v04.2/2020/01/01/20200101090000-JPL-L4_GHRSST-SSTfnd-MUR25-GLOB-v02.0-fv04.2.nc')

# Set the args to --force to re-download those data
args2 = create_downloader_args('-c MUR25-JPL-L4-GLOB-v04.2 -d ./MUR25-JPL-L4-GLOB-v04.2 -sd 2020-01-01T00:00:00Z -ed 2020-01-02T00:00:00Z -dymd --offset 4 -f'.split())
args2 = create_downloader_args('-c MUR25-JPL-L4-GLOB-v04.2 -d ./MUR25-JPL-L4-GLOB-v04.2 -sd 2020-01-01T00:00:00Z -ed 2020-01-02T00:00:00Z -dymd --offset 4 -f --verbose'.split())
pdd.run(args2)
assert t1 != os.path.getmtime('./MUR25-JPL-L4-GLOB-v04.2/2020/01/01/20200101090000-JPL-L4_GHRSST-SSTfnd-MUR25-GLOB-v02.0-fv04.2.nc')
assert t2 != os.path.getmtime('./MUR25-JPL-L4-GLOB-v04.2/2020/01/02/20200102090000-JPL-L4_GHRSST-SSTfnd-MUR25-GLOB-v02.0-fv04.2.nc')
@@ -73,6 +74,10 @@ def test_downloader_GRACE_with_SHA_512(tmpdir):
pdd.run(args)
assert len( os.listdir(directory_str) ) > 0
filename = directory_str + "/" + os.listdir(directory_str)[0]
# if the citation file was chosen above, use the next file instead, since the
# citation file is rewritten (and its mtime updated) on every successful run
if "citation.txt" in filename:
    filename = directory_str + "/" + os.listdir(directory_str)[1]

modified_time_1 = os.path.getmtime(filename)
print( modified_time_1 )

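pytest-mock, added to the dev dependencies above, backs the new 503 mock tests. A hypothetical sketch of such a test (not the committed one): urlretrieve raises 503 twice and then succeeds, and pa.download_file should absorb the failures:

from urllib.error import HTTPError

from subscriber import podaac_access as pa


def test_download_file_retries_on_503(mocker):
    # Fail twice with 503, then succeed on the third attempt.
    mock_retrieve = mocker.patch(
        'subscriber.podaac_access.urlretrieve',
        side_effect=[
            HTTPError('url', 503, 'Service Unavailable', None, None),
            HTTPError('url', 503, 'Service Unavailable', None, None),
            None,
        ])
    mocker.patch('subscriber.podaac_access.time.sleep')  # keep the test fast

    pa.download_file('https://example.com/granule.nc', './granule.nc')
    assert mock_retrieve.call_count == 3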