Support Datetime other than those specified as 14-digits #283

Open
machawk1 opened this issue Nov 22, 2017 · 15 comments

Comments

@machawk1
Member

machawk1 commented Nov 22, 2017

The WARC 1.1 spec allows for more precise datetimes. These should be supported in the replay system. Does any tool exist that will generate these yet? If not, some sample data can be fabricated.

@machawk1
Member Author

The further precision does not appear to be present in the link above. What's the BnF link?

Also see iipc/warc-specifications#21.

@machawk1
Member Author

machawk1 commented Jul 1, 2018

A first line of WARC/1.1 causes an exception in the iterator we currently reuse from pywb to quickly invalidate a WARC and stop processing it.

@machawk1
Member Author

machawk1 commented Jul 1, 2018

Per the WARC/1.1 spec and iipc/warc-specifications#21, date strings like 2014-01 are legal but currently break the indexer with:

Traceback (most recent call last):
  File "/usr/local/bin/ipwb", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/site-packages/ipwb/__main__.py", line 17, in main
    args = checkArgs(sys.argv)
  File "/usr/local/lib/python2.7/site-packages/ipwb/__main__.py", line 151, in checkArgs
    results.func(results)
  File "/usr/local/lib/python2.7/site-packages/ipwb/__main__.py", line 32, in checkArgs_index
    debug=args.debug)
  File "/usr/local/lib/python2.7/site-packages/ipwb/indexer.py", line 141, in indexFileAt
    warcFileFullPath, **encryptionAndCompressionSetting)
  File "/usr/local/lib/python2.7/site-packages/ipwb/indexer.py", line 179, in getCDXJLinesFromFile
    for i in iterForCounting(fhForCounting):
  File "/usr/local/lib/python2.7/site-packages/pywb/warc/archiveiterator.py", line 543, in __call__
    for entry in entry_iter:
  File "/usr/local/lib/python2.7/site-packages/pywb/warc/archiveiterator.py", line 379, in create_record_iter
    entry = self.parse_warc_record(record)
  File "/usr/local/lib/python2.7/site-packages/pywb/warc/archiveiterator.py", line 465, in parse_warc_record
    get_header('WARC-Date'))
  File "/usr/local/lib/python2.7/site-packages/pywb/utils/timeutils.py", line 122, in iso_date_to_timestamp
    return datetime_to_timestamp(iso_date_to_datetime(string))
  File "/usr/local/lib/python2.7/site-packages/pywb/utils/timeutils.py", line 40, in iso_date_to_datetime
    the_datetime = datetime.datetime(*map(int, nums))
TypeError: Required argument 'day' (pos 3) not found

...based on 6d219f5.
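The final TypeError can be reproduced in isolation. This is a minimal sketch, assuming `nums` (the name shown in the traceback) holds only the numeric pieces present in the date string; how pywb actually builds that list is an assumption here.

```python
import datetime

# Assumption: for "2014-01" only two numeric parts are available.
nums = ["2014", "01"]

try:
    datetime.datetime(*map(int, nums))
except TypeError as err:
    # datetime() requires year, month, and day; the day is missing here.
    error_message = str(err)
```

Any WARC-Date coarser than a full date would hit the same constructor error.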

@machawk1 machawk1 changed the title Support Datetime larger than 14-digits Support Datetime other than those specified as 14-digits Jul 1, 2018
@machawk1
Member Author

machawk1 commented Jul 1, 2018

Added a sample (variableSizedDates) WARC that I believe conforms to the 1.1 standard with variable length datetime strings.

@machawk1
Member Author

machawk1 commented Jun 18, 2020

Encountered this again in testing, current master (73f136f):

% ipwb index samples/warcs/variableSizedDates.warc
Traceback (most recent call last):
  File "/usr/local/bin/ipwb", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/ipwb/__main__.py", line 19, in main
    args = checkArgs(sys.argv)
  File "/usr/local/lib/python3.7/site-packages/ipwb/__main__.py", line 167, in checkArgs
    results.func(results)
  File "/usr/local/lib/python3.7/site-packages/ipwb/__main__.py", line 34, in checkArgs_index
    debug=args.debug)
  File "/usr/local/lib/python3.7/site-packages/ipwb/indexer.py", line 174, in indexFileAt
    warcFileFullPath, **encryptionAndCompressionSetting)
  File "/usr/local/lib/python3.7/site-packages/ipwb/indexer.py", line 291, in getCDXJLinesFromFile
    record.rec_headers.get_header('WARC-Date'))
  File "/usr/local/lib/python3.7/site-packages/ipwb/util.py", line 165, in iso8601ToDigits14
    "%Y-%m-%dT%H:%M:%SZ")
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/_strptime.py", line 577, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/_strptime.py", line 359, in _strptime
    (data_string, format))

WARC 1.0 mandates a 14-digit date for the WARC-Date field:

A 14-digit UTC timestamp formatted according to YYYY-MM-DDThh:mm:ssZ, described in the W3C profile of ISO8601 [W3CDTF]. The timestamp shall represent the instant that data capture for record creation began. Multiple records written as part of a single capture event (see section 5.7) shall use the same WARC-Date, even though the times of their writing will not be exactly synchronized.

WARC-Date = "WARC-Date" ":" w3c-iso8601
w3c-iso8601 = YYYY-MM-DDThh:mm:ssZ
All records shall have a WARC-Date field.

WARC 1.1 allows for other variants, e.g., 2014-01

The WARC-Date is a UTC timestamp as described in the W3C profile of ISO 8601:1988 [W3CDTF], for example YYYY-MM-DDThh:mm:ssZ. The timestamp shall represent the instant that data capture for record creation began. Multiple records written as part of a single capture event (see section 5.7) shall use the same WARC-Date, even though the times of their writing will not be exactly synchronized.

WARC-Date may be specified at any of the levels of granularity described in [W3CDTF]. If WARC-Date includes a decimal fraction of a second, the decimal fraction of a second shall have a minimum of 1 digit and a maximum of 9 digits. WARC-Date should be specified with as much precision as is accurately known. This document recommends no particular algorithm for access software to choose a record by date when an exact match is not available.

WARC-Date = "WARC-Date" ":" w3c-iso8601
w3c-iso8601 = <a UTC timestamp formatted according to [W3CDTF]>
All records shall have a WARC-Date field.

See Annex A for examples on usage of WARC-Date fields.

It seems more flexible to simply read and interpret the date than to check which version of the spec the WARC claims to adhere to. As of now, iso8601ToDigits14() in util.py assumes the full YYYY-MM-DDThh:mm:ssZ form, hence WARC 1.0.

@machawk1
Member Author

machawk1 commented Jun 18, 2020

Given that the conversion is from ISO8601 to a 14-digit datetime, some options:

  1. Assume undefined aspects of the datetime, e.g., 2014-01 to 20140101000000
  2. Adapt to allow for fuzziness.

The former seems more straightforward but perhaps instills unintended assumptions. Fuzziness is inherent in datetimes, as time is continuous, e.g., the millisecond discussion for WARCs. If we read a fuzzy datetime from a WARC and go with option 2, will it be compatible with storing this value in a CDXJ record with no assumptions about the datetime beyond what is specified?
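Option 1 could be sketched as follows. This is only an illustration: `pad_to_digits14` is a hypothetical helper, not an ipwb function, and it assumes unspecified fields default to the earliest possible value (month 01, day 01, time 00:00:00). It only covers the year, year-month, and full-date granularities.

```python
import re

# Hypothetical helper, not part of ipwb: fills unspecified fields with the
# earliest possible value (month 01, day 01, time 00:00:00).
_DEFAULTS = "00000101000000"  # YYYYMMDDhhmmss template; the year part is never used
_PARTIAL = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")


def pad_to_digits14(w3cdtf_date):
    """Pad a year, year-month, or full date to a 14-digit string."""
    if not _PARTIAL.match(w3cdtf_date):
        raise ValueError(f"unsupported date: {w3cdtf_date!r}")
    digits = w3cdtf_date.replace("-", "")
    # Append the default tail for whatever fields were not supplied.
    return digits + _DEFAULTS[len(digits):]
```

With this, `pad_to_digits14("2014-01")` gives `20140101000000`, which makes the assumption explicit: the result is indistinguishable from a capture made at midnight on January 1.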

@ibnesayeed, can you provide some insight/feedback/commentary for this?

@machawk1
Member Author

machawk1 commented Jun 18, 2020

The key here is ISO8601 with "as much precision as is accurately known."

I cannot locate a module to accomplish this, but one approach is a series of tests (e.g., regex) running from the highest level of granularity (9 digits following the second) all the way down to just the year. Starting there might seem wasteful, given that the more common ISO8601 form stops at seconds.

For Python:

%Y-%m-%dT%H:%M:%SZ
%Y-%m-%dT%H:%MZ
%Y-%m-%dT%HZ
%Y-%m-%d
%Y-%m
%Y
%Y-%m-%dT%H:%M:%S.[0-9]{1-9}Z

With the last version not quite correct (but you, future person, hopefully get the gist).
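The series of tests might look like the sketch below, which tries the whole-second formats from the list above in order, most specific first. `parse_warc_date` is a hypothetical name, and the fractional-second case is deliberately left out because strptime has no directive for 1-9 digits.

```python
import datetime

# Formats from the list above, most specific first; fractional seconds are
# left out because strptime has no directive for 1-9 digits.
FORMATS = (
    "%Y-%m-%dT%H:%M:%SZ",
    "%Y-%m-%dT%H:%MZ",
    "%Y-%m-%dT%HZ",
    "%Y-%m-%d",
    "%Y-%m",
    "%Y",
)


def parse_warc_date(value):
    """Return (datetime, matched_format); hypothetical helper name."""
    for fmt in FORMATS:
        try:
            return datetime.datetime.strptime(value, fmt), fmt
        except ValueError:
            continue
    raise ValueError(f"unrecognized WARC-Date: {value!r}")
```

Returning the matched format alongside the datetime keeps track of how much precision the WARC actually specified, which a bare datetime object would lose.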

@machawk1
Member Author

machawk1 commented Jun 18, 2020

9cd23ba addresses some of this but I have yet to match the fraction-of-a-second example in that WARC:

import datetime
datetime.datetime.strptime('%Y-%m-%dT%H:%M:%S.%fZ','2014-02-10T00:00:01.000000002Z')
ValueError: time data '%Y-%m-%dT%H:%M:%S.%fZ' does not match format '2014-02-10T00:00:01.000000002Z'

@ibnesayeed
Member

There could be two possible approaches here:

  1. Identify all the potential datetime formats that are allowed and try to normalize them in one canonical form that is not lossy and is easier for lookup
  2. Gradually recognize and accommodate more formats that are in use, a canonical form for internal use will be helpful in this approach as well

We also need to figure out what URI format we want to support in the replay.

@machawk1
Member Author

machawk1 commented Jun 19, 2020

The parameters above are backward: the format string should be second. This works:

datetime.datetime.strptime('2014-02-10T00:00:01.000000Z','%Y-%m-%dT%H:%M:%S.%fZ')
datetime.datetime(2014, 2, 10, 0, 0, 1)

Note, however, that %f covers microseconds only, i.e., at most six digits. The WARC/1.1 spec says:

the decimal fraction of a second shall have a minimum of 1 digit and a maximum of 9 digits.

This is problematic, as it conflicts with the sub-second %f portion of the format string.

W3CDTF says:

s = one or more digits representing a decimal fraction of a second.

%f is insufficient on its own: when parsing, it accepts one to six digits, while WARC-Dates can have 1-9 digits. Is there a format portion (akin to %f) that allows for this specification?
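Absent such a directive, one workaround is to split the fraction off with a regex and keep it separately. A minimal sketch, assuming the full-second form followed by 1-9 fractional digits and a trailing Z; `parse_fractional_warc_date` is a hypothetical name, and returning nanoseconds as a separate integer is a design assumption, not something the spec prescribes.

```python
import datetime
import re

# 1-9 fractional digits, as WARC/1.1 allows.
_FRACTIONAL = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.(\d{1,9})Z$"
)


def parse_fractional_warc_date(value):
    """Return (datetime to the second, nanoseconds); hypothetical helper."""
    match = _FRACTIONAL.match(value)
    if match is None:
        raise ValueError(f"not a fractional WARC-Date: {value!r}")
    whole, fraction = match.groups()
    parsed = datetime.datetime.strptime(whole, "%Y-%m-%dT%H:%M:%S")
    # Right-pad so ".5" becomes 500000000 ns rather than 5 ns.
    nanoseconds = int(fraction.ljust(9, "0"))
    return parsed, nanoseconds
```

Keeping the fraction outside the datetime object avoids losing the last three digits to the microsecond limit.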

@machawk1
Member Author

This level of precision is unlikely but allowable per WARC/1.1, so we need a special case for compliance. One option is to first check compliance with:

import datetime

dt = '2014-02-10T00:00:01.123456789Z'
# Keep the first six fractional digits plus the trailing 'Z' so %f can parse it
dt_f = f'{dt[:26]}{dt[-1:]}'
datetime.datetime.strptime(dt_f, '%Y-%m-%dT%H:%M:%S.%fZ')

...then parse out dt[26:-1], append it to dt[20:26], check it is all digits, and if so, assign it to the final value of the datetime object.

@machawk1
Member Author

datetime.datetime.microsecond is not writable, so the more precise value cannot simply be set after parsing.
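A short demonstration of the limitation, using only the standard library. It also shows that replace(), which returns a new object, still cannot hold nine fractional digits because microsecond is capped at 999999.

```python
import datetime

parsed = datetime.datetime(2014, 2, 10, 0, 0, 1)

try:
    parsed.microsecond = 123456  # attributes are read-only
    assignment_failed = False
except AttributeError:
    assignment_failed = True

# replace() returns a new object instead of mutating the original.
updated = parsed.replace(microsecond=123456)

try:
    parsed.replace(microsecond=123456789)  # nine digits do not fit
    nanoseconds_fit = True
except ValueError:
    nanoseconds_fit = False
```

So any nanosecond-level value has to travel alongside the datetime object rather than inside it.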

@machawk1
Member Author

b76135a adds support for generating more precise, solely digit-based date strings. These become present in the CDXJs generated, for example:

% ipwb index samples/warcs/variableSizedDates.warc 
!context ["http://tools.ietf.org/html/rfc7089"]
!meta {"generator": "InterPlanetary Wayback v.0.2020.06.18.1933", "created_at": "2020-06-19T14:37:49.991232"}
us,memento)/ 20140101000000 {"locator": "urn:ipfs/QmNQX5gEjbEPModBHXb6w4EWveLkZ57uEC9Kzh8bho7QmL/QmX4gE6SdJK8v67XikqQFJrac4xaqB5kwsgona2nH9hZwm", "status_code": "200", "mime_type": "text/html", "original_uri": "http://memento.us/"}
us,memento)/ 20140210000001 {"locator": "urn:ipfs/QmNQX5gEjbEPModBHXb6w4EWveLkZ57uEC9Kzh8bho7QmL/QmXQB6e2aB7VRaA4CK5H33sTfVC6GxNd1JtSgCaWVuUbfj", "status_code": "200", "mime_type": "text/html", "original_uri": "http://memento.us/"}
us,memento)/ 20140210000001 {"locator": "urn:ipfs/QmNQX5gEjbEPModBHXb6w4EWveLkZ57uEC9Kzh8bho7QmL/QmYWRfaHFcN7ygLUiiKEF6ELApMbdhv7K3zRtrz5rog83U", "status_code": "200", "mime_type": "text/html", "original_uri": "http://memento.us/"}
us,memento)/ 20140210000001000000002 {"locator": "urn:ipfs/QmNQX5gEjbEPModBHXb6w4EWveLkZ57uEC9Kzh8bho7QmL/Qmb8q1BFPws4ZNhL9MczY9tb4mWEPdV41LNuXD6oMkvzcw", "status_code": "200", "mime_type": "text/html", "original_uri": "http://memento.us/"}

More adjustments may need to be made to ensure replay can handle the potentially longer date strings (see last line above). This issue is complete but I would like to investigate the other end: using the CDXJ files with long date strings.

@machawk1
Member Author

As suspected, when replaying the CDXJ above and accessing any memento, the digits14ToRFC1123() method is called and the parallel datetime.datetime.strptime(digits14, '%Y%m%d%H%M%S') within throws a ValueError.
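One way replay could tolerate the longer strings is to parse only the first 14 digits. This is a sketch under stated assumptions: `digits_to_rfc1123` is a hypothetical stand-in rather than the existing ipwb method, and it assumes every digit past the 14th is sub-second precision that an RFC 1123 date cannot express.

```python
import datetime
from email.utils import format_datetime


def digits_to_rfc1123(digits):
    """Hypothetical variant of a 14-digit converter that tolerates extra digits."""
    if not digits.isdigit() or len(digits) < 14:
        raise ValueError(f"expected at least 14 digits: {digits!r}")
    # Assumption: anything past the first 14 digits is sub-second precision,
    # which RFC 1123 dates cannot express, so it is dropped.
    parsed = datetime.datetime.strptime(digits[:14], "%Y%m%d%H%M%S")
    parsed = parsed.replace(tzinfo=datetime.timezone.utc)
    return format_datetime(parsed, usegmt=True)
```

Truncating means two captures within the same second produce the same header value, so the full digit string would still be needed for lookup.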

@machawk1 machawk1 pinned this issue Jun 22, 2020
@machawk1
Member Author

  • Check how replay handles CDXJ entries that are > (and <?) 14-digits, as generated from the indexer in 60de785.
