This is Joe Patrick's project repo for Data Science (LING 2340). It is an analysis of dialect features among the Southwest Slavic Balkan languages on Wikipedia, namely the Croatian and Serbian Wikis.

Data-Sci-2021/BCS-Wikipedia-Analysis

Multilingual Wikipedia domains as grounds for dialect variation in Serbian and Croatian

Joe Patrick

[email protected]

12/15/2021

The Project

This is the project repo for Joe Patrick's project in Data Science (LING 2340). The project analyzes dialect features occurring in the Serbian and Croatian Wikipedia domains. It is inspired by Ljubešić, Miličević-Petrović, and Samardžić (2018), who analyzed informal language data on Twitter. Ljubešić et al. collected geo-tagged data based on 16 features known to vary between the variants of the pluricentric language once referred to as 'Serbo-Croatian.' For the current study, those 16 features are narrowed to 10 that are more readily captured with regular expressions. The chosen features span several linguistic levels, including lexical and morphosyntactic features.
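As a rough illustration of what regex-based feature counting can look like, here is a minimal Python sketch. It is not the code used in this project (the final report is written in R Markdown), and the two feature pairs shown, tko/ko ('who') and što/šta ('what'), are well-known Croatian/Serbian contrasts chosen only as examples; they are not necessarily among the 10 features analyzed here. The names `FEATURES` and `count_features` are hypothetical, and the sketch assumes Latin-script input text.

```python
import re

# Hypothetical feature patterns for illustration only; these are NOT
# necessarily the features or regexes used in the actual analysis.
# Each pattern matches a whole word, case-insensitively.
FEATURES = {
    "tko": re.compile(r"\btko\b", re.IGNORECASE),  # 'who', typically Croatian
    "ko": re.compile(r"\bko\b", re.IGNORECASE),    # 'who', typically Serbian
    "što": re.compile(r"\bšto\b", re.IGNORECASE),  # 'what', typically Croatian
    "šta": re.compile(r"\bšta\b", re.IGNORECASE),  # 'what', typically Serbian
}


def count_features(text):
    """Return a dict mapping each feature name to its number of matches in text."""
    return {name: len(pattern.findall(text)) for name, pattern in FEATURES.items()}


# Made-up sample sentence, not taken from the project corpora.
sample = "Tko je to rekao? Ko zna šta je što."
counts = count_features(sample)
print(counts)
```

The word boundaries (`\b`) keep `ko` from matching inside `tko` or longer words such as `neko`. Raw counts like these would still need to be normalized by article length before comparing the two Wikis.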

The Data

To test my hypothesis, I built corpora from five articles imported from Wikipedia. The articles cover the topics Kosovo, Ustaše, Četnici, the Orthodox Church in Montenegro, and Nationalism, and each was matched across its Serbian and Croatian versions, taken from the respective Wikis. Controversial topics were chosen in an effort to determine whether emotionally or socially salient topics evoke more regional or dialectal forms. This is a variation on Labov's (1972) 'Danger of Death' question posed during sociolinguistic interviews, in which speakers are more likely to resort to the vernacular if they feel threatened or emotional. Both Ustaše and Četnici are collective plural names for nationalist parties from World War II that have been experiencing a resurgence since the Yugoslav wars of the 1990s. The 'Ustaše' are a Croatian wing of nationalism (originating during Croatia's time as a puppet state of the Nazi regime). Similarly, 'Četnici' are the Serbian answer to Croatian (and other Balkan) nationalist groups. Both 'Kosovo' and the 'Orthodox Church in Montenegro' are controversial, contemporary topics selected to potentially elicit stronger Serbian ideological data. The final topic, 'Nationalism', is intended to elicit marked data from either language domain.

Kosovo - Croatian
Četnici - Croatian
Ustaše - Croatian
Orthodox Church in Montenegro - Croatian
Nationalism - Croatian

Kosovo - Serbian
Četnici - Serbian
Ustaše - Serbian
Orthodox Church in Montenegro - Serbian
Nationalism - Serbian

Included Files and Folders

Final Report (knitted .md)
Final Report (.rmd)
License
Data Table
Ljubešić, Miličević-Petrović, and Samardžić (2018)
