This repo contains sample data for developing and testing scripts that process text and XML. The data comes from the Library of Congress Web Archives (LCWA), a program that has been selecting, harvesting, and preserving Web sites since 2000. It is an unsystematically collected sample of metadata for 28 sites preserved in the Web archives at the Library of Congress, gathered for practicing basic operations and manipulations of XML information. The metadata follows the structure and schema defined in the Metadata Object Description Schema (MODS), a format initially developed in 2002 for the communication of resource description information by libraries and archives. The data is described in more detail below.

In addition to this readme, this git repository contains:
- Samples of library metadata in MODS XML format
- Jupyter notebooks
- Python scripts
The Jupyter notebook contains information about downloading MODS records for use as sample data to practice parsing XML in Python. These sample files were generated in August 2018 from MODS metadata records for archived Web sites collected by the Library of Congress.
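As a minimal sketch of the kind of parsing these files are meant for (this is not the notebook's own code), the example below reads the title from a MODS record using Python's standard library. The inline record is a hypothetical, heavily trimmed stand-in rather than one of the files in this repo, and the helper name `read_title` is made up for illustration. The one detail that tends to trip people up is that MODS elements live in the `http://www.loc.gov/mods/v3` namespace, so lookups need a namespace mapping.

```python
import xml.etree.ElementTree as ET

# MODS version 3 namespace; element lookups fail silently without it.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hypothetical, trimmed record for illustration only (not from this repo).
SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo>
    <title>Example Archived Site</title>
  </titleInfo>
</mods>"""


def read_title(xml_text):
    """Return the first titleInfo/title of a MODS record, or '' if absent."""
    root = ET.fromstring(xml_text)
    title = root.findtext(
        "mods:titleInfo/mods:title", default="", namespaces=MODS_NS
    )
    return title.strip()


title = read_title(SAMPLE)
print(title)
```

To try this on a sample file instead of the inline string, read the file's text and pass it to the same function; if a file wraps several records in a collection element, the path would need adjusting.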
Those familiar with the Library may know about the LCCN, a general control number that provides unique identifiers for most items that are held by the Library of Congress. The Web archives described in these MODS records do not have LCCNs. Instead, an LCWA (Library of Congress Web Archives) identifier serves as the unique identifier for each metadata record.
These are newer MODS records that don't have Library of Congress Subject Headings (LCSH):
lcwaN0010234,lcwaN0001999,lcwaN0003238,lcwaN0010144,lcwaN0010145,
lcwaN0012178,lcwaN0012179,lcwaN0012180,lcwaN0012184,lcwaN0012195,
lcwaN0010932,lcwaN0010933,lcwaN0010936,lcwaN0010937,lcwaN0010940
These have LCSH in `<subject>`:
lcwaN0010888,lcwaN0010226,lcwaN0009692,lcwaN0009700,lcwaN0010401
These are election sites that include `<subject>` elements with both "lcsh" headings and "local" headings noted as "lcwat", which represent a taxonomy that was developed for the quick categorization of sites during the nomination and harvesting process:
lcwaE0008846,lcwaE0008263,lcwaE0008338,lcwaE0008918,lcwaE0008001
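
To see which kinds of headings a record carries, one option is to group subject terms by their source. The sketch below assumes the "lcsh" and "lcwat" values appear in the `authority` attribute of each `<subject>` element; that is an assumption about these records, so check a sample file before relying on it. The inline record, its heading text, and the helper name `subjects_by_authority` are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hypothetical record for illustration; headings are invented, and the use of
# the authority attribute for "lcsh"/"lcwat" is an assumption.
SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <subject authority="lcsh"><topic>Elections</topic></subject>
  <subject authority="lcwat"><topic>Example local category</topic></subject>
  <subject><topic>Uncontrolled term</topic></subject>
</mods>"""


def subjects_by_authority(xml_text):
    """Map each authority value to the subject terms recorded under it."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for subject in root.findall("mods:subject", MODS_NS):
        # Subjects with no authority attribute are filed under "none".
        authority = subject.get("authority", "none")
        for child in subject:
            if child.text and child.text.strip():
                grouped[authority].append(child.text.strip())
    return dict(grouped)


grouped = subjects_by_authority(SAMPLE)
print(grouped)
```

A record with no "lcsh" key in the result would fall into the group without LCSH, and one with both "lcsh" and "lcwat" keys would match the election-site pattern described above.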
These are some previous-generation records, which illustrate slight differences in format and naming convention:
lcwa00097019
Brazilian Presidential Election 2010 Web Archive
dfd3979a7fb56bb3acc06b7b0129633c,00853935a711639f58b0f35bae8d7781
Example from 2002 Winter Olympics and NYPL (September 11, 2001 Web Archive)