A program that creates archives of articles from specific journal sites (currently microPublication and Prompt) for sending to Portico and PMC.
Authors: Michael Hucka, Tom Morrell
Repository: https://github.com/caltechlibrary/pubarchiver
License: BSD/MIT derivative – see the LICENSE file for more information
- Introduction
- Installation
- Usage
- Getting help and support
- Contributing
- License
- Authors and history
- Acknowledgments
The Caltech Library is the publisher of a few academic journals and provides services for them. The services include archiving in a dark archive (specifically, Portico) as well as submitting articles to PMC. The archiving process involves pulling down articles from the journals and packaging them up in a format suitable for sending to the archives. PubArchiver is a program to help automate this process.
There are multiple ways of installing PubArchiver. Please choose the alternative that suits you.
You can use pipx to install PubArchiver. Pipx will install it into a separate Python environment that isolates the dependencies needed by PubArchiver from other Python programs on your system, and yet the resulting pubarchiver command will be executable from any shell – like any normal program on your computer. If you do not already have pipx on your system, it can be installed in a variety of easy ways, and it is best to consult Pipx's installation guide for instructions. Once you have pipx on your system, you can install PubArchiver with the following command:
pipx install pubarchiver
Pipx can also let you run PubArchiver directly using pipx run pubarchiver, although in that case, you must always prefix every pubarchiver command with pipx run. Consult the documentation for pipx run for more information.
The instructions below assume you have a Python 3 interpreter installed on your computer. Note that the default on macOS at least through 10.14 (Mojave) is Python 2 – please first install Python version 3 and familiarize yourself with running Python programs on your system before proceeding further.
On Linux, macOS, and Windows operating systems, you should be able to install pubarchiver with pip for Python 3. To install pubarchiver from the Python package repository (PyPI), run the following command:
python3 -m pip install pubarchiver
As an alternative to getting it from PyPI, you can use pip to install pubarchiver directly from GitHub:
python3 -m pip install git+https://github.com/caltechlibrary/pubarchiver.git
If you already installed PubArchiver once before, and want to update to the latest version, add --upgrade to the end of either command line above.
If you prefer to install PubArchiver directly from the source code, you can do that too. To get a copy of the files, you can clone the GitHub repository:
git clone https://github.com/caltechlibrary/pubarchiver
Alternatively, you can download the files as a ZIP archive directly from your browser using this link: https://github.com/caltechlibrary/pubarchiver/archive/refs/heads/main.zip
Next, after getting a copy of the files, run setup.py inside the code directory:
cd pubarchiver
python3 setup.py install
PubArchiver is a command-line program. The installation process should put a program named pubarchiver in a location normally searched by your shell interpreter. For help with usage at any time, run pubarchiver with the option --help (or -h for short).
pubarchiver -h
Options to pubarchiver use a dash (-) as the prefix character on macOS and Linux, and forward slash (/) on Windows.
The journal whose articles are to be archived must be indicated using the required option --journal (or -j for short). To see a list of supported journals, you can use --journal list like this:
pubarchiver --journal list
If not given any additional options besides a --journal option to select the journal, pubarchiver will proceed to contact the journal website as well as either DataCite or Crossref, and create an archive containing articles and their metadata for all articles published to date by the journal. The options below can be used to select articles and influence other pubarchiver behaviors.
The option --list-dois (or -l for short) can be used to obtain a list of all DOIs for all articles published by the selected journal. When --list-dois is used, pubarchiver prints the list to the terminal and exits without doing further work. This can be useful if you intend to use the --doi-file option discussed below.
If given the option --preview (or -p for short), pubarchiver will only print a list of articles it will archive and stop short of creating the archive. This is useful to see what would be produced without actually doing it. Note, however, that because it does not attempt to download the articles and associated files, it cannot report errors that might occur when actually creating an archive. Consequently, do not use the output of --preview as a prediction of eventual success or failure.
The value supplied after the option --dest (or -d for short) can be used to tell pubarchiver the intended destination where the archive will be sent. The option changes the structure and content of the archive created by pubarchiver. The possible alternatives are portico and pmc. Portico is assumed to be the default destination if no --dest option is given.
By default, pubarchiver will write its output to a new subdirectory it creates in the directory from which pubarchiver is being run. The option --output-dir (or -o for short; /o on Windows) can be used to select a different location. For example,
pubarchiver --journal micropublication --output-dir /tmp/micropub
The materials for each article will be written to an individual subdirectory named after the DOI of the article. The output for each article will consist of an XML metadata file describing the article, the article itself in PDF format, and (if the journal provides JATS) a subdirectory named jats containing the article in JATS XML format along with any image that may appear in the article. The image is always converted to uncompressed TIFF format, because it is considered a good preservation format. The PMC structure follows the naming and delivery specifications defined at https://www.ncbi.nlm.nih.gov/pmc/pub/filespec-delivery/.
Unless the option --no-zip (or -Z for short) is given, the output will be archived in ZIP format. If the output structure (as determined by the --dest option) is being generated for PMC, each article will be put into its own individual ZIP archive; otherwise, the default action is to put the collected output of all articles into a single ZIP archive file. The option --no-zip makes pubarchiver leave the output unarchived in the directory determined by the --output-dir option.
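To check what ended up in an archive, a short script can tally the files by type. The following is a minimal sketch using only Python's standard zipfile module; the function name summarize_zip is hypothetical and not part of PubArchiver, and the sketch makes no assumption about the directory layout inside the archive beyond the file types described above (PDF, XML, and TIFF).

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_zip(path):
    """Count archive members by lowercase file extension, skipping directories.

    Hypothetical helper for inspecting a ZIP file; not part of PubArchiver.
    """
    counts = Counter()
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # ZIP member names always use forward slashes.
            suffix = PurePosixPath(info.filename).suffix.lower()
            counts[suffix] += 1
    return counts
```

Comparing the number of .pdf entries against the number of articles expected is one quick way to spot an article whose PDF was not retrieved.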
If the option --after-date (or -a for short) is given, pubarchiver will download only articles whose publication dates are after the given date. Valid date descriptors are those accepted by the Python dateparser library. Make sure to enclose descriptions within single or double quotes. Examples:
pubarchiver --after-date "2014-08-29" ....
pubarchiver --after-date "12 Dec 2014" ....
pubarchiver --after-date "July 4, 2013" ....
pubarchiver --after-date "2 weeks ago" ....
The option --doi-file (or -f for short) can be used to tell pubarchiver to read a file containing DOIs and only fetch those particular articles instead of asking the journal for all articles. The format of the file indicated after the --doi-file option must be a simple text file containing one DOI per line.
The date filtering done by the --after-date option is applied after the list of articles is read using the --doi-file option, if present, and thus can be used to filter by date the articles whose DOIs are provided.
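As a minimal sketch of preparing input for --doi-file, the following Python helper writes a list of DOIs to a plain text file, one per line, skipping blank entries and duplicates. The function name write_doi_file and the DOIs shown are placeholders invented for illustration; the only requirement stated above is one DOI per line.

```python
from pathlib import Path


def write_doi_file(dois, path):
    """Write DOIs to `path`, one per line, dropping blanks and duplicates.

    Hypothetical helper, not part of PubArchiver. Returns the cleaned list.
    """
    seen = set()
    cleaned = []
    for doi in dois:
        doi = doi.strip()
        if doi and doi not in seen:
            seen.add(doi)
            cleaned.append(doi)
    # One DOI per line, each terminated by a newline.
    Path(path).write_text("".join(d + "\n" for d in cleaned), encoding="utf-8")
    return cleaned


if __name__ == "__main__":
    # Placeholder DOIs for illustration only.
    write_doi_file(["10.1234/example.0001", "10.1234/example.0002"], "dois.txt")
```

The resulting file could then be passed on the command line, for example as pubarchiver --journal micropublication --doi-file dois.txt.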
As it works, pubarchiver writes information to the terminal about the articles it puts into the archive, including whether any problems are encountered. To save this information to a file, use the option --rep-file (or -r for short), which will make pubarchiver write a report file. By default, the format of the report file is CSV; the option --rep-fmt (or -s for short) can be used to select csv or html (or both) as the format. The title of the report will be based on the current date, unless the option --rep-title (or -t for short) is used to supply a different title.
When pubarchiver downloads the JATS XML version of articles from the journal site, it will by default validate the XML content against the JATS DTD. To skip the XML validation step, use the option --no-check (or -X for short).
pubarchiver will print informational messages as it works. To reduce messages to only warnings and errors, use the option --quiet (or -q for short). Also, output is color-coded by default unless the --no-color option (or -C for short) is given; this option can be helpful if the color control sequences create problems for your terminal emulator.
If given the --debug option (or -@ for short), this program will output a detailed real-time trace of what it is doing. The output will be written to the given destination, which can be a dash character (-) to indicate console output, or a file path.
If given the --version option (or -V for short), this program will print version information and exit without doing anything else.
This program exits with a return code of 0 if no problems are encountered while fetching data from the server. It returns a nonzero value otherwise, following conventions for use in shells such as bash, which only understand return code values of 0 to 255. If no network is detected, it returns a value of 1; if it is interrupted (e.g., using ctrl-c) it returns a value of 2; if it encounters a fatal error, it returns a value of 3. If it encounters any non-fatal problems (such as a missing PDF file or JATS validation error), it returns a nonzero value equal to 100 + the number of articles that had failures. Summarizing the possible return codes:
Return value | Meaning |
---|---|
0 | No errors were encountered – success |
1 | No network detected – cannot proceed |
2 | The user interrupted program execution |
3 | An exception or fatal error occurred |
100 + n | Encountered non-fatal problems on a total of n articles |
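A script that runs pubarchiver unattended may want to turn these return codes into readable messages. The following is a minimal sketch of one way to do that in Python; the function name describe_exit_code and the message wording are hypothetical, and only the numeric values come from the return codes summarized above.

```python
def describe_exit_code(code):
    """Map a pubarchiver return code to a short description.

    Hypothetical helper, not part of PubArchiver; the numeric values follow
    the documented return codes.
    """
    meanings = {
        0: "success",
        1: "no network detected",
        2: "interrupted by user",
        3: "exception or fatal error",
    }
    if code in meanings:
        return meanings[code]
    if code > 100:
        # Documented as 100 + n, where n is the number of failed articles.
        return f"non-fatal problems on {code - 100} article(s)"
    return "unknown exit code"
```

The value to pass in would typically be the returncode attribute of the object returned by subprocess.run() after invoking pubarchiver.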
The following table summarizes all the command line options available. (Note: on Windows computers, / must be used as the prefix character instead of -):
Short | Long form opt | Meaning | Default | |
---|---|---|---|---|
-a A | --after-date A | Only get articles published after date A | Get all articles | ⬥ |
-C | --no-color | Don't color-code info messages | Color-code terminal output | |
-d D | --dest D | Destination for archive: Portico or PMC | Portico | |
-f F | --doi-file F | Only get articles whose DOIs are in file F | Get all articles | |
-j J | --journal J | Work with journal J | | ★ |
-l | --list-dois | Print a list of all known DOIs & exit | Do other actions instead | |
-o O | --output-dir O | Write output in directory O | Write in current dir | |
-p | --preview | Preview what would be archived & exit | Obtain the articles | |
-q | --quiet | Only print important messages | Be chatty while working | |
-r R | --rep-file R | Write list of articles & results in file R | Don't write a report | |
-s S | --rep-fmt S | With -r, write either html or csv | csv | |
-t T | --rep-title T | With -r, use T as the report title | Use the current date | |
-V | --version | Print program version info & exit | Do other actions instead | |
-X | --no-check | Don't validate JATS XML files | Validate JATS XML | |
-Z | --no-zip | Don't put output into one ZIP archive | ZIP up the output | |
-@ OUT | --debug OUT | Debugging mode; write trace to OUT | Normal mode | ⚑ |
⬥ Enclose the date in quotes if it contains space characters; e.g., "12 Dec 2014".
★ Required argument.
⚑ To write to the console, use the character - (a single dash) as the value of OUT; otherwise, OUT must be the name of a file where the output should be written.
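When pubarchiver is driven from another program, the options can be assembled as an argument list rather than a shell string. The following is a minimal sketch in Python, assuming the dash-prefixed long options listed in the table (so it applies to macOS and Linux); the function name build_command and its parameter names are hypothetical and not part of PubArchiver.

```python
def build_command(journal, dest=None, output_dir=None, after_date=None,
                  doi_file=None, preview=False, no_zip=False):
    """Return an argument list for pubarchiver built from documented long options.

    Hypothetical helper, not part of PubArchiver.
    """
    cmd = ["pubarchiver", "--journal", journal]
    if dest:
        cmd += ["--dest", dest]
    if output_dir:
        cmd += ["--output-dir", output_dir]
    if after_date:
        # Passed as a single list element, so no shell quoting is needed.
        cmd += ["--after-date", after_date]
    if doi_file:
        cmd += ["--doi-file", doi_file]
    if preview:
        cmd.append("--preview")
    if no_zip:
        cmd.append("--no-zip")
    return cmd


if __name__ == "__main__":
    print(build_command("micropublication", dest="pmc", after_date="2 weeks ago"))
```

The returned list can be handed to subprocess.run(), which avoids the quoting that a date such as "2 weeks ago" needs when typed in a shell.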
If you find an issue, please submit it in the GitHub issue tracker for this repository.
We would be happy to receive your help and participation with enhancing pubarchiver! Please visit the guidelines for contributing for some tips on getting started.
Copyright © 2019-2022, Caltech. This software is freely distributed under a BSD 3-clause license. Please see the LICENSE file for more information.
Tom Morrell developed the original algorithm for extracting metadata from DataCite and creating XML files for use with Portico submissions of microPublication.org articles. Starting with that original script, Mike Hucka created the much-expanded Microarchiver program (later renamed to PubArchiver).
The vector artwork used as a starting point for the logo for this repository was created by Cuby Design for the Noun Project. It is licensed under the Creative Commons Attribution 3.0 Unported license. The vector graphic was modified by Mike Hucka to change the color.
Nick Stiffler from the microPublication.org team helped figure out the requirements for PMC output (introduced in Microarchiver version 1.9), helped guide development of Microarchiver/PubArchiver, and engaged in many discussions about microPublication.org's needs.
PubArchiver makes use of numerous open-source packages, without which it would have been effectively impossible to develop PubArchiver with the resources we had. We want to acknowledge this debt. In alphabetical order, the packages are:
- Beautiful Soup – an HTML parsing library
- bun – a set of basic user interface classes and functions
- commonpy – a collection of commonly-useful Python functions
- crossrefapi – a Python library that implements the Crossref API
- dateparser – parser for human-readable dates
- humanize – make numbers more easily readable by humans
- lxml – an XML parsing library for Python
- Pillow – a fork of the Python Imaging Library
- plac – a command line argument parser
- recordclass – a mutable version of Python named tuples
- setuptools – library for setup.py
- sidetrack – simple debug logging/tracing package
- slack-cli – a command-line interface to Slack written in Bash
- urllib3 – a powerful HTTP library for Python
- xmltodict – a module to make working with XML feel like working with JSON
Finally, we are grateful for computing & institutional resources made available by the California Institute of Technology.