Skip to content

Latest commit

 

History

History
41 lines (32 loc) · 5.22 KB

README.md

File metadata and controls

41 lines (32 loc) · 5.22 KB

Analysis of Japanese Loanwords


General Info

Author: Lindsey Rojtas
Project Title: Analysis of Japanese Loanwords
My Background: I am an undergraduate computer science major and linguistics minor at the University of Pittsburgh. This is my repository for my term project in LING1340 (Data Science for Linguists), which evaluates the use of Japanese loan words borrowed from English. I can read a bit of Japanese, but I'm very limited in this reading. However, what I can understand is katakana characters, so I chose to work with loan words!


About the Project

Katakana is a syllabic Japanese script that is used for writing words that are onomatopoeia or loan words, usually of the Western variety. A way to compare this to English would be to think about how borrowed words are written in English, which is usually in italics (e.g. "getting a sense of deja vu"). Gairaigo is the word for "loan word" or "borrowed word" in Japanese. Click here to learn more about the differences between Katakana and the other two Japanese scripts, Hiragana and Kanji.

My project looks at two aspects of katakana use: the length of the katakana word versus its conversational frequency, and the ratio of katakana words to total length of a string of utterances by a certain speaker versus the age of that speaker. I wanted to see if length and usage were correlated in any way, as well as age and amount of loan words used.


Directory

General

Notebooks/Code (progress-notebooks folder)

Other

  • presentation: A folder for my presentation slides for this project in both .pdf (without audio) and .pptx (with audio) format
  • samples: A folder that holds samples of my found data under Fair Use.
  • images: A folder for the .png files for my plots that are included in final-report.md.
  • Guestbook: My peers' comments on my project

My Found Data

Two corpora were used in this project:

  • The Nagoya Conversation Corpus is a corpus of 129 unstructured conversations with several different participants of varying age groups. Information about the age groups can be found here. This corpus is licensed under Creative Commons BY-NC-ND 4.0. A sample of this dataset, a .txt file that has the first conversation in it, is available in samples.

  • The Balanced Corpus of Contemporary Written Japanese's Word List is a list of words and their frequencies, as well as some other arbitrary data that is not relevant for use in this project. The specific list of words that I'm using that only contain Katakana is a derivation of the list filtered by Reddit user Alphyn, who has given me permission to use the files that they filtered Katakana words in. I am able to use this corpus for academic purposes. There is no specification for republishing on BCCWJ's website for this data, so a sample set will be put up in samples under Fair Use that contains the first 100 entries of the .csv file that Alphyn provided.