---
layout: page
title: Data
permalink: /data/
---

books

View our collection of 1,483 editions of Robinson Crusoe, scraped from the University of Florida Digital Collections, HathiTrust Digital Library, and the Internet Archive, here.

metadata

Although this may seem like a large collection of texts, at least ten times as many editions exist: we managed to track down metadata for over 15,000 editions published across the globe.

[Here][1] is an example of what our raw metadata looks like. Notice the 'publisher' column, in which the publisher's name, city, and date are all combined in wildly different styles.
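To give a sense of what untangling such a column involves, here is a minimal Python sketch. It is not our actual cleaning code: the function name `split_publisher` is hypothetical, and the sample strings are invented for illustration rather than taken from crusoeData.csv. It assumes a four-digit year appears somewhere in the string and that a colon, when present, separates city from publisher name.

```python
import re

# Matches a plausible four-digit publication year (1500-2099).
YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")

# Punctuation commonly left dangling around the pieces.
TRIM = " ,;.[]"


def split_publisher(raw):
    """Best-effort split of a combined string into (city, name, year).

    Any part that cannot be recovered is returned as None.
    """
    text = raw.strip(TRIM)

    # Pull out the first year-like token, if there is one.
    match = YEAR_RE.search(text)
    year = int(match.group(1)) if match else None
    if match:
        text = text[:match.start()] + text[match.end():]
    text = text.strip(TRIM)

    # Assumed convention: "City: Name" (with or without spaces around ':').
    if ":" in text:
        city, name = text.split(":", 1)
        return city.strip(TRIM) or None, name.strip(TRIM) or None, year

    # No colon: treat whatever is left as the publisher name.
    return None, text or None, year


# Invented sample values, one per style.
for sample in ["London: Macmillan, 1868",
               "Philadelphia : J.B. Lippincott, [1891]",
               "New York: Harper"]:
    print(split_publisher(sample))
```

Rules like these only cover the styles they were written for; entries with several cities, date ranges, or no punctuation at all would still need manual review.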

[Here][2] is the latest version that we have been able to compile and process within the scope of our project. This dataset powers our map and our Doc2Vec models. However, it is by no means a ground-truth dataset, as our cleaning could only go so far in a 10-week timeframe.

challenges

  • ...
  • ...
  • ...

[1]: {{ site.url }}{{site.baseurl}}/data/crusoeData.csv
[2]: {{ site.url }}{{site.baseurl}}/data/crusoeMaster3Scenes.csv