Getting started at Text Corpus Labs

TextCorpusLabs/getting-started

MIT license

This document explains what we work on at Text Corpus Labs and how we work. It also serves as a reminder of the team's global research goals.

Raison d'être

We are a collection of researchers focused on collecting different modes of human communication through text. We want to share our work and ways of working with the broader academic community.

To this end we:

  • Create guidance on how to standardize the format of a text corpus. All the members of our lab have come to an understanding as to how a text corpus should look prior to being analyzed.
  • Create processes to automate the collection of text corpora from existing resources. Scraping and parsing can be challenging at times. Our goal is to enable reuse of a text corpus with the lowest possible barrier to entry for a new analysis.
  • Curate unique corpora. It has long been known that humans have different modes of communication, and text corpora reflect this difference. When a new mode of communication is believed to exist, we try to capture a sample of that mode.
  • Provide a "Methods and Materials" boilerplate describing how the corpus was collected.
  • Provide a citable DOI for the process. For unique corpora, we provide the DOI of the article in which the text corpus was introduced.

So that you can:

  • Get a text corpus on your local device with as little effort as possible.

Citations

It is always nice to see others build upon your efforts. If you use our work, please cite it using the provided DOI.

Getting the code to work

As of now, all members of our lab work on Windows PCs and program in Python. If that changes in the future, we will likely update this section to include other methods.

Prerequisites

The packages listed in the snippet below (7-Zip and Python 3) need to be installed. You can use any method to install them. On a Windows device, we recommend Chocolatey. If you decide to use Chocolatey, open an admin PowerShell prompt and run the code snippet below.

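# Allow the Chocolatey install script to run in this process only.
if('Unrestricted' -ne (Get-ExecutionPolicy)) { Set-ExecutionPolicy Bypass -Scope Process -Force }
# Download and run the Chocolatey installer.
iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))
# Reload environment variables so that choco is available in this session.
refreshenv

# Install the prerequisites.
choco install 7zip -y
choco install python3 -y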
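
After installing, it can be useful to confirm that the tools are reachable from a shell. The following is a minimal sketch, not part of the lab's tooling. The function name `missing_tools` is hypothetical, and the executable names `7z` and `python` are assumptions about what the installers place on the PATH; adjust them if your installation differs.

```python
import shutil


def missing_tools(tools=("7z", "python"), which=shutil.which):
    """Return the names from `tools` that cannot be found on the PATH.

    The default executable names are assumptions, not values taken
    from the lab's documentation.
    """
    return [tool for tool in tools if which(tool) is None]
```

Calling `missing_tools()` in a fresh shell returns an empty list when every tool was found. If a name is reported missing right after installation, opening a new shell so that the PATH is reloaded is usually the first thing to try.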

Python

Unless otherwise noted in the repository itself, all scripts have been tested on Python 3.9.x. In addition to the steps below, each repository's README.md lists any special instructions. After completing the steps here, follow those special instructions.

  1. Clone this repository, then open an admin shell in the ~/code directory.
  2. Install the required modules.
    pip install -r requirements.txt
    

Any code that uses an external dependency must declare the version of that dependency. All version information can be found in the repository’s ~/code/requirements.txt file. Other versions may work, especially when they differ only by a minor revision, but results are not guaranteed unless the exact versions are used.
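
To see whether your environment matches the declared versions, a check along the following lines may help. This is a sketch under stated assumptions, not a script shipped in any of the repositories: the function names `parse_pins`, `find_mismatches`, and `report` are hypothetical, and the parser only understands plain `name==version` lines (extras and other specifiers are ignored).

```python
import sys
from importlib import metadata
from pathlib import Path


def parse_pins(text):
    """Return {package: version} for every 'name==version' line."""
    pins = {}
    for raw in text.splitlines():
        # Drop trailing comments and environment markers.
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" not in line:
            continue  # blank line or an entry without an exact pin
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins, installed_version=metadata.version):
    """Return {package: (pinned, installed or None)} for unmet pins."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = installed_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


def report(path="requirements.txt"):
    """Print a note for each difference from the tested setup."""
    if sys.version_info[:2] != (3, 9):
        found = sys.version.split()[0]
        print(f"Scripts were tested on Python 3.9.x, found {found}")
    pins = parse_pins(Path(path).read_text())
    for name, (pinned, installed) in find_mismatches(pins).items():
        print(f"{name}: pinned {pinned}, installed {installed or 'missing'}")
```

Running `report()` from the ~/code directory after `pip install -r requirements.txt` should print nothing when the Python version and every pinned package match. Any line it does print is a place where your setup differs from the one the scripts were tested on.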

Steps

All the repositories contain a "Steps" section in the README.md. Please follow those guides to retrieve the text corpus.

You will likely want to perform additional text processing after retrieving the text corpus. Our goal is to provide a clean base for your analysis, not to be opinionated about what you do next. As you carry out your study, keep track of which steps are ours and which are your own additional processing. Doing so will make it easier to write your "Methods and Materials" section: use (and cite) our steps, then apply (and highlight) your unique contribution.
