This repository contains the materials, data, code, and write-ups associated with the Natural Stories Maze paper.
Analysis
- read_results.R takes in Data/raw_data and produces Data/cleaned.rds
- nat_stories.Rmd has the non-modelling analysis of accuracy, comprehension, and participant feedback; it takes in Data/cleaned.rds and produces Analysis/models/comp.rds
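- models.Rmd has the modelling analysis; it uses Prep_code/natural_stories_surprisals.rds and creates Data/maze_pre_error.Rds and Data/SPR/first.rds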
- models/ has saved summaries of models and other pre-processed data objects for inclusion in the paper (the paper should build without re-running any of the models)
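The Analysis files above can be run in sequence. The following is a minimal sketch, not the author's own workflow: it assumes the working directory is the repository root and that the rmarkdown package is installed. `pipeline_steps`, `run_step`, and `read_if_present` are hypothetical names introduced here, and the relative order of the two .Rmd files is an assumption. The scripts themselves may expect a different working directory.

```r
# Sketch only: assumes the working directory is the repository root
# and that the rmarkdown package is installed.

# Analysis files from the list above. read_results.R must come first because
# it writes Data/cleaned.rds; the order of the two .Rmd files is an assumption.
pipeline_steps <- c(
  "Analysis/read_results.R",   # Data/raw_data -> Data/cleaned.rds
  "Analysis/nat_stories.Rmd",  # Data/cleaned.rds -> Analysis/models/comp.rds
  "Analysis/models.Rmd"        # modelling
)

# Hypothetical helper (not part of the repository): run a single step.
run_step <- function(path) {
  if (grepl("\\.Rmd$", path)) {
    rmarkdown::render(path)    # knit an R Markdown file
  } else {
    source(path)               # run a plain R script
  }
}

# Hypothetical helper: read a saved object, or return NULL if the file
# has not been generated yet.
read_if_present <- function(path) {
  if (file.exists(path)) readRDS(path) else NULL
}

# To run everything (likely slow, since models.Rmd fits the models):
# for (step in pipeline_steps) run_step(step)

# Inspect the cleaned data without re-running anything.
cleaned <- read_if_present("Data/cleaned.rds")
```

Because models/ already holds the saved model summaries, reading the .rds files directly is usually enough for building the paper; the full loop is only needed when regenerating results from the raw data.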
Data
- raw_data is the raw output that Ibex produces
- cleaned.rds (generated by Analysis/read_results.R)
- maze_pre_error.Rds is a cleaned-up version containing only the pre-mistake data, used for modelling; created by models.Rmd
- SPR/ contains raw data from Futrell et al.; first.rds is a cleaned-up version containing first stories only, created by models.Rmd
Materials
- for_ns.js is the code to run the experiment (insert into Ibex maze framework)
- for_ibex.txt is the natural stories text split into sentences, with distractors
- ibex_questions.txt is the natural stories questions
- natural_stories_sentences.tsv is the text split into sentences
- raw_questions.txt is the raw natural stories questions
- practice.txt is the text and questions of practice items
- practice_post_maze.txt is the practice items with distractors in Ibex maze format
Prep_code
- nat_stories_prep.Rmd takes the raw Natural Stories materials and processes them for labels, Maze, and model surprisals; it also takes in tokenizations and surprisals and combines them into a table. This generates some of the files in Materials/
- useful.py handles formatting before and after running surprisals. Note: ngram, txl, and grnn were run on a cluster with a precursor to lm-zoo; GPT was run with lm-zoo. For replicating or altering this, I recommend using lm-zoo, although TXL is not currently on lm-zoo
- natural_stories_surprisals.rds is used in models.Rmd
- ns_pre_maze.txt is the natural stories sentences ready to get Maze distractors
- other files are inputs to, or intermediate outputs of, reformatting the natural stories materials for the experiment
- predictors/ contains all surprisal and frequency predictions and the model tokenization patterns
Papers
- Papers/Paper has the actual manuscript
- Amlap_2020_talk contains the abstract and slides for the presentation given at AMLaP 2020
- UCI_2021 contains slides for a lab meeting presentation
- Images/ and many loose image files are just that
- lab_meeting_2020 (.tex and .pdf) is from a pre-AMLaP lab meeting presentation