GitHub

This is a workflow and set of python scripts that extract a series of URLs and their metadata from a JSON file then caches them at the Internet Archive based on their update frequency.

The JSON URL data comes from the Police Data Accessability Project.

TO-DO:

alter the builder script to pull data directly from a database instead of buliding from JSON
- alter workflow to automatically add URLs to the active cache list when new entries appear in the database
handle exceptions:
- reattempt a cache
- store broken URL info and remove from active caching list
- update broken URL info in database itself
use the Internet Archive API instead of savepagenow
- compare current versions of a page to previously cached pages to determine caching needs instead of using approximate timedeltas
- schedule caching using Archive API task scheduler instead of using sequential cache requests

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
.github		.github
ErrorLogs		ErrorLogs
README.md		README.md
cache_info.json		cache_info.json
cache_url.py		cache_url.py
requirements.txt		requirements.txt
sources.json		sources.json
test.json		test.json
url_cache_builder.py		url_cache_builder.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Releases

Packages

Contributors 2

Languages

drowninginflowers/pdap_url_cache

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages