Final Report:


Alejandro Ciuba, [email protected]


Summary

This final_report.md file contains detailed descriptions of the process I went through to obtain, clean, and explore the data, as well as the conclusions I try to draw from my findings.

Table of Contents

  1. Introduction
  2. The Questions
  3. Data-Sourcing & Clean-Up
  4. Analysis
  5. Conclusion
  6. Bibliography

Introduction

Opening

Video game corpora are surprisingly hard to find. For an industry that had a global revenue of over 150 billion dollars in 2019, with multiple billion-dollar companies creating games played by millions of people worldwide (Dobrilova, 2022), the linguistic research conducted on the medium is extremely sparse, and public linguistic data even more so. It is an artistic medium rife with linguistic research waiting to be done. For my research, I decided to look at some pragmatic aspects of video games, focusing on these specific questions:

  1. How are orders/requests realized in video game dialogues?
  2. What is the frequency and usage of the 2nd person pronoun, you?
  3. What are some common named entities in video games?

Note that the research ended up being more exploratory and that, while these questions were answered, they were answered on a per-game basis; a larger dataset would be necessary to draw conclusions about the medium as a whole. While the research completed in this repository may be small and focused on only a handful of video games, it is still a starting point for what could be more-developed linguistic research in the future, especially for pragmatics.

History of The Repository

The repository’s original research was slightly different. In the beginning, I had set out to answer various sociolinguistic questions related to video games; I was particularly interested in the fantasy-race sociolinguistics found in The Elder Scrolls—a game I had in my dataset. However, these questions ended up being too broad for me to answer and I felt like I lacked the proper time and knowledge needed to tackle them adequately. Ultimately, it was what I wanted to research given my data that caused the most headaches, with frequent changes and refinements made to the research questions. Questions were added, dropped, added again, dropped again, and altered during the beginning phases. It was not until I made research.ipynb that I got a true sense of what I should do and what questions I could/wanted to answer.

While most of the issues lay in what I should answer, there were still slight problems in other departments. For example, I had wanted to examine the use of deixis in video game quest dialogue, but creating a script which could accurately identify such a phenomenon was out of my grasp. This question ultimately ended up being dropped in favor of my current question about orders and requests in video games. I thought this would be a better fit because it would be easier to identify false positives in the data. Now it was a question of how to answer it, which I quickly realized was a big issue. While I could turn common order and request phrases into regular expressions using Python’s re package, I quickly discovered that I would need far more linguistic context for some of them (more on that in the analysis section), which then led me down a rabbit hole learning spaCy. Learning spaCy honestly took the biggest chunk of time and also contributed far more to the research than I expected. I cannot say this was a bad thing at all, as I absolutely loved learning it.


The Questions

Why I Selected What I Did

As previously mentioned, I originally wanted to answer various sociolinguistic questions regarding the video games I had at my disposal. However, these plans changed. With the first question:

  1. How are orders/requests realized in video game dialogues?

I wanted to examine in what ways video games drive the player to complete what they want/need them to do. For example, if a player needs to go somewhere, how does the game (or the game’s characters) tell the player to head over there? Do they try to frame it in a way that gives the player more agency?

For the second question:

  2. What is the frequency and usage of the 2nd person pronoun, you?

I wanted to see how often the game refers to the player. I wanted to also see how often games use other pronouns and if the context of the text affects this.

Lastly, in the third question:

  3. What are some common named entities in video games?

I wanted to examine what types of named entities are common in the games I had. Do they mention people frequently? What about places? Can these entities be used in more than one context (i.e., as toponyms)?

While I wasn't able to answer every initial question that provoked these three, I was still able to go in-depth with each.

What Conclusions Can I Draw?

All of my analysis is done on a per-game basis. This means that, while I may compare the games in my dataset with one another, I will not generalize to all games, or even to all games in a given genre. This is mainly because the dataset I am working with is extremely small and (particularly for anything which uses ML models) there are some false positives and false negatives which might slightly bias the data.


Data-Sourcing & Clean-Up

The Original Repository

I did not obtain most of this data from scratch. In fact, all of the data from The Elder Scrolls series, Star Wars: Knights of the Old Republic, and Torchlight II was collected by Judith van Stegeren for their paper, Fantastic Strings and Where to Find Them: The Quest for High-Quality Video Game Text Corpora, where they explore various ways of obtaining workable video game text corpora.

The data they collected from the games was already in a really good state. All I had to do was look through each of the dataframes they had made and adapt what was in them, which was all done in initial_data_exploration.ipynb. This mostly involved reorganizing the columns and dropping unnecessary data, as well as adding word counts to each dataframe. However, I did need to parse the book titles from the URLs in the original Elder Scrolls data, and some columns in the Torchlight II and Star Wars: Knights of the Old Republic dataframes had to be standardized.

The Hollow Knight Data

After reading their paper, I was able to use a couple of their techniques to get the script data from Hollow Knight. I was lucky enough to have found an extremely well-made fan-written script containing all of the text from the game. I then used my HDialogueParser.py script on an HTML version of the file to parse the characters with dumps of their dialogues throughout the game and put them into their own dataframe. The section for Hollow Knight in initial_data_exploration.ipynb is mostly just me doing some quick renaming and adding word counts. All word counts were done using a mapping lambda function with nltk.word_tokenize().
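The word-count step can be sketched like this. The notebook used nltk.word_tokenize, but to keep this snippet dependency-free I substitute a simple regex tokenizer, and the dataframe contents here are made up for illustration:

```python
import re

import pandas as pd

# Hypothetical mini-dataframe standing in for the Hollow Knight dialogue data.
df = pd.DataFrame({
    "character": ["Elderbug", "Quirrel"],
    "dialogue": ["Greetings, traveller.", "Hello there! Lovely spot, isn't it?"],
})

# The notebook used nltk.word_tokenize; this regex tokenizer is a rough,
# dependency-free stand-in (keeps internal apostrophes, drops punctuation).
def tokenize(text):
    return re.findall(r"\w+(?:'\w+)?", text)

# Word counts via a mapping lambda, as described above.
df["word_count"] = df["dialogue"].map(lambda t: len(tokenize(t)))
```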


Analysis

Before going through each question in detail, it’s important to give a brief overview of each game in the dataset: its genre, its main themes, what the data from the game pertains to, and how that data relates to the game as a whole. This will make it easier to contextualize the data and understand why the results turned out the way they did.

First, the biggest dataset is the one from The Elder Scrolls series, a series of single-player role-playing games where the player takes control of the main character (usually some prophesied hero) to explore a region of the medieval-style fantasy world of Tamriel. Created by Bethesda Softworks, the series' first entry, The Elder Scrolls: Arena, was published in 1994, and the dataset's most recent entry, The Elder Scrolls: Online, was released in 2014. Note, however, that the latest entry is not single-player but a massively multiplayer online game (MMO). The data from this series is made up entirely of in-game books which players can read when they come across them throughout each game's open world. Because of this, I will regularly refer to this dataset as TES Books or something similar.

The second game in the dataset is Torchlight II, another single-player role-playing game where the player completes quests while exploring randomly generated dungeons. A dungeon is a series of rooms and corridors where the player is pitted against enemies, all in the hope of finding items and completing quests. The game was made by Runic Games and released in 2012. This game’s dataset pertains to the quest dialogue given to the player as well as who (or what) said it. Note also that this game has multiple dialogue options at points, meaning that dialogue can change depending on what the player chooses.

The third game is Star Wars: Knights of the Old Republic (which, from now on, will always be abbreviated as KOTOR because I am tired of writing that), a 2003 single-player role-playing game developed by BioWare and published through LucasArts (now Lucasfilm Games). The player plays as Revan and traverses the Star Wars universe during the age of the Old Republic (set before the prequel trilogy). An interesting note about this game is that the player character is voiced, whereas in other games (e.g. The Elder Scrolls series), the player character does not have a voice and the game simply displays what the player character is saying. This game’s dataset is all the dialogue from the game, including background voicelines. It also contains speaker and listener information as well as comments from the developers.

The last game in the list is Hollow Knight, a Metroidvania-style game made by Team Cherry and released in early 2017. The player controls The Knight as they make their way through its dark fantasy world, the Kingdom of Hallownest. Along the way, the player encounters various creatures and NPCs which give them text-based dialogue. The dataset for this game contains all the dialogue given by all the characters throughout the entire game. Note that this means dialogue is only sorted by who said it and not by when the player meets them. However, there is still a slight chronological ordering regarding which dialogue is said when. For an extremely detailed list of when certain dialogue is said (as well as the complete text of the game), please see the Google Doc.

How Are Orders & Requests Realized in Video Games?

For the first and biggest of the questions, the way I tackled it can be broken down into a few steps:

  1. Figure out what forms to capture.
  2. Figure out how to capture those forms.
  3. Analyze said forms.

For the first step, I decided to focus on the forms I felt were the most common when giving both direct and indirect orders and requests. Furthermore, I wanted to make sure these captured forms resulted in few false positives, especially since I wouldn't have a complete context in which to analyze the utterances. Ultimately, I ended up capturing 13 different forms: 7 for orders and 6 for requests, which are listed here. Sadly, I do ignore some important forms; for example, orders/requests which are "disguised" as assertions. This is because I felt I wouldn't be able to adequately capture them without hurting recall and precision. Lastly, note that almost every form captures multiple similar ways of saying the same thing. For example, IO1 (indirect order 1) captures all of the following:

You (really) need to X.

You (really) have to X.

You (really) ought to X.

You (really) should X.

You (really) must X.

This means it captures 10 different ways of expressing an indirect order (5 verbs, each with or without "really"), but these could all be easily and logically grouped into one, more generalized form because they follow a nearly identical structure. When referring to these forms, I will use D/I for direct/indirect, O/R for orders/requests, and then their given number (e.g. DO1 means "Mand. Form").
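As an illustration, an IO1-style capture can be written as a single regular expression. This is a hedged sketch, not the project's exact pattern:

```python
import re

# Hedged sketch of an IO1-style capture; the project's actual regexes live in
# its scripts and may differ in coverage (tense, negation, etc.).
IO1 = re.compile(
    r"\byou (?:really )?(?:need to|have to|ought to|should|must)\b",
    re.IGNORECASE,
)

examples = [
    "You really must speak with the Elder.",
    "You need to find the hostages.",
    "I must leave at once.",  # not second person: should not match
]
hits = [bool(IO1.search(s)) for s in examples]
# hits -> [True, True, False]
```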

After figuring out what forms to capture, the next step was to devise a plan to extract them from the texts. For this, I realized I would need linguistic information for a few of the forms (e.g. DO3 must be in the present tense to capture only performative speech acts). To accomplish this, I used spaCy's English transformer model to tag each game's text dump with morphological information; here's an example with the first part of the Hollow Knight text dump. Note, however, that the text dumps for KOTOR and The Elder Scrolls were only samples due to memory limitations, with about 500 lines taken from each (see this script's random_sample_texts function). This also meant that any orders and requests captured using spaCy's Matcher patterns would lose some metadata associated with them, such as speaker/author. These are a minority of the forms, though, as eight of the thirteen could still be captured completely using standard regular expressions. To see all the utterances which were captured using these patterns and regular expressions, see here.
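spaCy Matcher patterns are just lists of per-token attribute dicts, so constraints like "present tense only" can be stated declaratively. Below is a hedged sketch of what a DO3-style pattern might look like; the verbs and attributes are my assumptions, not the project's actual pattern:

```python
# A hedged sketch of a present-tense-only Matcher pattern for a DO3-style
# performative ("I order you ..."). Running it requires a pipeline with a
# tagger, e.g. spaCy's English transformer model:
#
#   import spacy
#   from spacy.matcher import Matcher
#   nlp = spacy.load("en_core_web_trf")
#   matcher = Matcher(nlp.vocab)
#   matcher.add("DO3", [do3_pattern])
do3_pattern = [
    {"LOWER": "i"},
    {"LEMMA": {"IN": ["order", "command", "demand"]}, "TAG": "VBP"},  # present tense
    {"LOWER": "you"},
]
```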

Now, to analyze the data. I first took a look at some small samples in use via my concordances function. The forms captured via regular expression could still be analyzed with their metadata intact. Overall, the data here starts to paint a picture of which forms will be the most popular and which are infrequent (also note that there is a false positive in the last line for Hollow Knight). For example, DO2, IO1, IO4, and DR1 all appear in every single game, but IR1 and IR4 only show up once each.

The next step was to combine the utterances I had captured from both the regular expressions and the Matcher patterns. This was done by putting both techniques' results into one dataframe. I then produced the following table and chart:

Speech Acts per Game

| Speech Act | Hollow Knight | KOTOR | TES Books | Torchlight II |
| --- | --- | --- | --- | --- |
| DO1 | 182 | 40 | 243 | 211 |
| DO2 | 39 | 264 | 477 | 2 |
| DO3 | 0 | 0 | 2 | 0 |
| DR1 | 11 | 177 | 146 | 9 |
| DR2 | 0 | 0 | 2 | 1 |
| IO1 | 36 | 637 | 558 | 43 |
| IO2 | 0 | 0 | 1 | 1 |
| IO3 | 1 | 0 | 3 | 1 |
| IO4 | 11 | 187 | 95 | 1 |
| IR1 | 0 | 0 | 3 | 0 |
| IR2 | 11 | 291 | 170 | 11 |
| IR3 | 0 | 1 | 0 | 0 |
| IR4 | 0 | 1 | 0 | 0 |
*(Figure: Orders & Requests grouped by capture-type and stacked per video game.)*
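The combination step amounts to concatenating the two result sets and cross-tabulating forms against games. A minimal sketch with made-up capture results (the real rows came from the regexes and Matcher patterns described earlier):

```python
import pandas as pd

# Hypothetical capture results from the two techniques.
regex_hits = pd.DataFrame({
    "form": ["IO1", "IO1", "DR1"],
    "game": ["KOTOR", "KOTOR", "TES Books"],
})
matcher_hits = pd.DataFrame({"form": ["DO3"], "game": ["TES Books"]})

# Combine both result sets, then cross-tabulate forms against games.
combined = pd.concat([regex_hits, matcher_hits], ignore_index=True)
table = combined.groupby(["form", "game"]).size().unstack(fill_value=0)
```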

As you can see, the most popular forms are IO1, DO1, and DO2 by a large margin. This makes sense given that DO1 and DO2 are simply the positive and negative forms of commands. IO1 is also interesting because it makes the listener (the player) the subject of the sentence while still telling them where to go. Further research could be done on whether this is intentional, to try to keep the player the focus of the game. Another interesting fact is that, while some forms only appear once in the entire dataset, every captured form does appear at least once. Let's take a look at the IO2 occurrence from TES Books:

"Your goal is to keep your feet planted and bend out of the way quickly."

This line is from Saving Your Hide by in-game author Lieutenant Anders Gemane, where he talks about how to fight using a variety of weapons. This sample is interesting in particular because it comes from a (more-or-less) instruction manual, meaning the author is still directly addressing the reader; this indirect order is thus still technically being given to the player, who is the reader in-context.

Lastly, I wanted to dive deeper into the most popular forms. In particular, I wanted to examine which verb used in IO1 was the most popular, though examining the most popular verbs in general proved even more informative. This was simpler, as I could just use regular expressions to capture the verbs and filter through the already-built dataframe. Below are the results:

*(Figure: Verb occurrences in orders and requests, not grouped by game.)*
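The verb-extraction step can be sketched as below; both the utterances and the regex here are illustrative assumptions, not the project's exact ones:

```python
import re
from collections import Counter

# Hypothetical captured utterances; the project filtered its combined
# dataframe instead, but the idea is the same.
captured = [
    "You must go to the temple.",
    "You really should rest.",
    "Do not worry about me.",
    "You must not fail.",
]

# Grab the modal after an IO1-style trigger, or the verb in "do not X".
verb_re = re.compile(
    r"\byou (?:really )?(must|should|have|need|ought)\b|\bdo not (\w+)",
    re.IGNORECASE,
)

verb_counts = Counter(
    (m.group(1) or m.group(2)).lower()
    for m in (verb_re.search(s) for s in captured)
    if m
)
```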

As expected, "must", "should", and "have" are the most popular. An interesting appearance is "worry" due to phrases such as "do not worry." I am also surprised by the appearance of "fail", which occurs frequently due to phrases such as "do not fail." Something interesting for future research might be looking at how verbs with positive and negative sentiment are used in positive and negative orders.

What is the Frequency and Usage of the 2nd Person Pronoun, You?

This question can be thought of as a more straightforward version of the previous one in terms of steps:

  1. Capture the pronouns wanted
  2. Analyze their frequencies and usage

The original question centers on the 2nd person pronoun in particular because I wanted to see how it was used in relation to the player. The research ended up being more general, though still with a focus on "you". I was easily able to get what I wanted using nothing but regular expressions, which were quite easy to write; see here and here for them. I ignored forms like "y'all" or "ya" to try to focus on the 2nd person singular, which was the most likely to be used with the player.
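A minimal sketch of the second-person capture; the project's actual regexes are in the linked notebooks and may be broader:

```python
import re

# Word boundaries keep "your" from matching, though contractions like
# "you're" still match on their "you" portion.
you_re = re.compile(r"\byou\b", re.IGNORECASE)

text = "If I have any more work for you, I'll be sure to let you know."
matches = you_re.findall(text)
# len(matches) -> 2
```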

For the 2nd person pronoun, I also captured in-use examples directly from the data using my concordances function, which was specifically designed to find instances of a given word or phrase and highlight them in random sentences from the text columns of my dataframes. Here are a couple of samples from each game, taken from the original samples in the Jupyter Notebook:

TES Books

| Author | Title | Concordance |
| --- | --- | --- |
| Anonymous | silt-strider-station | ...costs of an unsettling sea voyage when YOU can travel overland in safety and comfort... |
| Anonymous | whats-yours-mine-little-larceny | ...YOU turn your back and then... |
| Anonymous | ruined-watchmasters-journal | ...our oath. See it upheld. Eight protect YOU and our Pale Watch... |

Torchlight II

| Speaker | Concordance |
| --- | --- |
| Elder Josimon | ...Have YOU found the hostages yet?... |
| NO SPEAKER | ...YOU summoned Cacklespit - a witch who has... |
| Shady Character | ...If I have any more work for YOU, I'll be sure to let you know... |

Hollow Knight

| Character | Concordance |
| --- | --- |
| Distant Villagers | ...Greetings. YOU are very tired. Sit and rest... |
| Zote the Mighty | ...Just what do you think you're doing?! YOU dare to come between me and my... |
| Willoh | ...Oh! Come in search of treats have YOU? I chanced upon a unique little fungus... |

KOTOR

| Speaker | Listener | Concordance |
| --- | --- | --- |
| Bounty Hunter | NO LISTENER | ...Hey - YOU're the one who won that swoop race,... |
| Player | NO LISTENER | ...Haven't I told YOU guys to get lost already?... |
| Nemo | NO LISTENER | ...I think, perhaps, YOU place an undue importance on rank and... |
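For reference, a concordance helper like the one used for these samples might look like this; it is a reimplementation sketch under my own assumptions, not the project's actual function, which works over dataframe text columns:

```python
import random
import re

def concordances(texts, phrase, width=40, k=2, seed=0):
    """Return up to k concordance lines with the phrase upper-cased.

    A hedged reimplementation sketch: find the phrase in each text, keep a
    window of `width` characters on either side, and sample k results.
    """
    pattern = re.compile(rf"\b{re.escape(phrase)}\b", re.IGNORECASE)
    lines = []
    for text in texts:
        match = pattern.search(text)
        if match:
            start = max(0, match.start() - width)
            end = min(len(text), match.end() + width)
            snippet = (
                text[start:match.start()]
                + match.group().upper()
                + text[match.end():end]
            )
            lines.append("..." + snippet + "...")
    random.Random(seed).shuffle(lines)  # random sample, reproducible here
    return lines[:k]
```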

As you can see, "you" in TES Books can appear both in an instruction-manual context (referring to the reader) and a character-dialogue context (one character talking to another), which is expected. Something interesting to observe with the KOTOR concordances is that they all say NO LISTENER, so I'm curious as to why they use the 2nd person pronoun; my guess is that the listener simply couldn't be extracted in those contexts when the dataframe was originally created. Torchlight II and Hollow Knight almost always refer to the player, since all their texts involve either characters talking to the player or quests giving the player instructions. Interestingly, with Torchlight II, NO SPEAKER means that it's the quest system "directly" talking to the player, i.e. text displayed to the player telling them where to go.

Lastly, I graphed the frequencies of "you" as well as some other pronouns. I only focused on subject, object, and indirect-object pronouns, since those are the contexts in which "you" can also appear. Here are the graphs:

*(Figure: Occurrence of "you" per total token count, as a percentage.)*

Other Pronoun Percentages

*(Figures: four charts of other pronoun percentages, singular on the left and plural on the right.)*

The percentages here, from what I can tell, follow some expected patterns given the type of data in each game. For example, since The Elder Scrolls dataset is composed of books, it makes sense that the third-person forms are the most prevalent compared to the other datasets. KOTOR containing the highest amount of first-person singular pronouns also makes sense considering the player character is voiced; 9485 out of the 29213 lines of dialogue come from him! The pronoun usage in Hollow Knight is interesting because the main character can't talk, meaning it all comes from other characters referring either to themselves or to others; it seems they tend to refer to themselves more than to others. Lastly, Torchlight II seems to use first-person pronouns far more than third-person ones. I don't fully know why this is the case, but I suspect it's because the characters tend to refer to themselves a lot, usually at the start of a quest when they ask the player character for help. However, more in-depth research would be needed to confirm this.

What Are Some Common Named Entities in Video Games?

Named entities were a question I wasn't sure how to answer at first. Not only did I not know how to properly tag the named entities I saw, but I was also unsure what I could investigate with them. Luckily, after learning spaCy, I was able to use its ner component to tag named entities with a high degree of precision and recall (about 0.9 each). After tagging, I created a dataframe to store each named entity, its type, and its game. Please remember from How Are Orders & Requests Realized in Video Games? that the data used from The Elder Scrolls and KOTOR are only samples due to memory limitations. I ended up exploring the following subquestions:

  1. What named entities are common in video games?
  2. Are there trends in named-entity types across games?
  3. Are named entities common in video games?
  4. A brief look at hapaxes in named entities.
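The tagging-and-dataframe step can be sketched as below. The rows here are made up; real tagging iterates over the entities a loaded spaCy model finds in each text:

```python
import pandas as pd

# Hypothetical (entity, type, game) rows as spaCy's ner component would
# yield. With a loaded model, the rows would come from roughly:
#   for ent in nlp(text).ents:
#       rows.append((ent.text, ent.label_, game))
rows = [
    ("Revan", "PERSON", "KOTOR"),
    ("Anders Gemane", "PERSON", "TES Books"),
    ("Tamriel", "GPE", "TES Books"),
    ("Cyrodiilic", "NORP", "TES Books"),
]
ner_df = pd.DataFrame(rows, columns=["entity", "type", "game"])

# Raw counts of named-entity types, as charted below.
type_counts = ner_df["type"].value_counts()
```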

First, the primary question was pretty straightforward; below are the graphs showing the raw counts of named-entity types, first across all datasets and then on a per-dataset level:

*(Figure: Named-entity type raw counts across all datasets.)*

*(Figures: four charts of named-entity type counts, one per dataset.)*

As seen in the first bar chart (as well as all the others), the most popular named-entity type is PERSON, which is interesting. However, I am not too sure about these results, as I had originally used spaCy's small English model, which produced ORG as the most popular tag, followed by PERSON. This leads me to believe the results aren't set in stone and that, ultimately, human taggers will need to comb through these datasets to ensure the best accuracy. On the other hand, I don't think PERSON being the most popular is too far from the truth, given that the data from The Elder Scrolls is all books and that the KOTOR characters seem to reference other people in-game quite frequently. Furthermore, a lot of Torchlight II's quests mention people specifically, either telling the player to go to them and/or to give them something.

Another issue I ran into was toponyms, which are extremely frequent, particularly in The Elder Scrolls data, where an entity can refer to a group of people, their territory, or their language. An example of this with the word Cyrodiilic can be viewed here.

Despite the problems, it was very interesting to see some trends across all the datasets. Obviously, the PERSON tag dominates every dataset, but they all also contain high amounts of NORP, except for Hollow Knight, which has high amounts of PRODUCT, WORK_OF_ART, GPE, and CARDINAL instead. CARDINAL is most likely due to the fact that the characters tend to talk about places, giving the player vague directions as to where to go. However, for all datasets, the low appearance of tags like LOC or ORG still leads me to believe that further research should be done to generate tags more accurately.

However, two areas which were not as susceptible to these tagging issues were the proportion of named entities among all noun chunks for each game, and the hapaxes for Hollow Knight and Torchlight II (I ignored hapaxes for The Elder Scrolls and KOTOR since they were samples). Below are the findings for each:

*(Figure: Named entities per total noun chunks (according to nlp), as percentages.)*

*(Figure: Hapaxes per noun chunks kept, as proportions.)*

As I expected, named entities generally make up a sizeable portion of all noun chunks in the datasets. For hapaxes, I ignored the type of tag, although further research could be done on which entity tags have the most hapaxes if a more accurate tagging scheme were devised. Overall, the number of hapaxes in both games is extremely low, with most being spelled-out numbers, characters, some typos (e.g. EzrohirFinally), and some sentence fragments the tagger purposely classified as single entities (numerical phrases are treated as one entity).
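The hapax count itself is just a frequency filter over the noun-chunk strings. A sketch with made-up chunks (the real ones came from spaCy's doc.noun_chunks):

```python
from collections import Counter

# Made-up noun-chunk strings, including one typo-style chunk like those
# mentioned above.
chunks = ["the Knight", "Hallownest", "the Knight", "EzrohirFinally", "a map"]

# Hapaxes are chunks that occur exactly once.
counts = Counter(chunks)
hapaxes = sorted(chunk for chunk, n in counts.items() if n == 1)
# hapaxes -> ['EzrohirFinally', 'Hallownest', 'a map']
```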


Conclusion

While there may be some inaccuracies in the data (false positives, false negatives, and incorrect tags), I still truly believe this is a good starting point for future research into the pragmatics of video games. In particular, I feel the data captured relating to orders and requests, as well as the pronoun frequencies, is extremely important, as it highlights trends that can lead to further questions. If I were to continue this research after the term ends, I would definitely explore these two routes, but also try to improve the ML models I currently use for tagging named entities; because of their current limitations, the conclusions drawn from that data are extremely broad and must be viewed with a skeptical eye.

This was one of the most intense projects I've ever done in my undergraduate career, lasting from February 14th, 2022 to May 1st, 2022. Over those months, I learned more than I could have ever imagined about NLP, computational linguistics, and Python packages like pandas, numpy, and (especially) spaCy. I am thrilled to say I am proud of the work I have done, and despite the highs and lows, I would gladly do it all over again.


Bibliography

Dataset Based On:

van Stegeren, J., & Theune, M. (2020). Fantastic Strings and Where to Find Them: The Quest for High-Quality Video Game Text Corpora. In Intelligent Narrative Technologies Workshop. essay, AAAI Press.

Data Collected From:

BioWare. (2003). Star Wars: Knights of the Old Republic (PC Version) [Video Game]. LucasArts.

Runic Games. (2012). Torchlight II (PC Version) [Video Game].

Bethesda Softworks. (1994-2014). The Elder Scrolls I-V and The Elder Scrolls Online (PC Versions) [Video Games].

Team Cherry. (2017). Hollow Knight (PC Version) [Video Game].

Additional Resources

Dobrilova, T. (2022, April 26). How much is the gaming industry worth in 2022? [+25 powerful stats]. Techjury. Retrieved April 30, 2022, from https://techjury.net/blog/gaming-industry-worth/.

Wikipedia Links

These links were exclusively used in final_report.md to give readers context for various game terminology and genres, as well as certain games and characters in particular.

Role-playing game. (2022, February 13). In Wikipedia. https://en.wikipedia.org/wiki/Role-playing_game.

The Elder Scrolls: Arena. (2022, April 22). In Wikipedia. https://en.wikipedia.org/wiki/The_Elder_Scrolls:_Arena.

The Elder Scrolls: Online. (2022, April 22). In Wikipedia. https://en.wikipedia.org/wiki/The_Elder_Scrolls_Online.

Massively multiplayer online game. (2022, April 18). In Wikipedia. https://en.wikipedia.org/wiki/Massively_multiplayer_online_game.

Open world. (2022, April 26). In Wikipedia. https://en.wikipedia.org/wiki/Open_world.

Random dungeon. (2022, March 4). In Wikipedia. https://en.wikipedia.org/wiki/Random_dungeon.

Revan. (2022, April 29). In Wikipedia. https://en.wikipedia.org/wiki/Revan.

Dialogue tree. (2021, December 26). In Wikipedia. https://en.wikipedia.org/wiki/Dialogue_tree.

Metroidvania. (2022, April 19). In Wikipedia. https://en.wikipedia.org/wiki/Metroidvania.

Non-Academic Resources

Much like the Wikipedia links, these were used to give context to the games; however, they are non-academic sites. I still left them here to try to give proper credit.

https://elderscrolls.fandom.com/wiki/Last_Dragonborn.

https://elderscrolls.fandom.com/wiki/Tamriel.

https://elderscrolls.fandom.com/wiki/Books_(Oblivion).