This section compares methods for using vowel space density as a quantifiable, easily calculated measure of linguistic similarity.
The ideal for a corpus of data from a low-resource language is high-quality audio recordings of speech paired with phonetic or phonemic transcription, and perhaps a parallel translation into an analysis language. [citation] A further version that cleans up the errors and hesitations of spontaneous speech may also be created (though the unedited spontaneous version has value as well). [citation] Creating the transcription can be a labor-intensive process, though the recent advent of ASR tools for low-resource languages can speed this up somewhat. [citation]
In many low-resource language communities, language and dialect boundaries may not yet be clearly established. The degree of interaction among speakers across a geographical area is often inversely related to the degree of language and dialect variation across that area: the more speakers interact, the less their varieties tend to diverge. Mass media, such as radio and television content, can allow interaction of speakers from a broad area and may have a smoothing effect on regional variation, but for many understudied languages, mass media may not exist in the community's language. Documentation efforts and eventual development of the language (such as creation of a writing system if one does not exist, promotion of community literacy, and perhaps eventual multilingual education) benefit from an initial understanding of the range of language variation across an area. Members of the community who have had contact with speakers from across the area may have a sense of the variation within it, and there are methods to collect and record this kind of information. Where that has not yet been done, however, is there any other way to get a quick sense of the range and degree of variation within an area?
Documentation of this kind of variation is often done by way of dialect intelligibility surveys. Tools that such surveys often use include the audio recording and phonemic transcription of wordlists, recorded text testing, sentence repetition testing, and sociolinguistic questionnaires. (explanation of each?) Recording and transcribing a wordlist is a labor-intensive process, as is the recording and transcription of the kind of personal narrative that often serves as the basis of a recorded text test. Making just an audio recording of a personal story, however, is relatively quick and requires little labor. What information can be obtained from an audio recording alone, without these other research tools? Any information we can glean from audio recordings by themselves may help in planning the collection of the more labor-intensive materials in ways that better address the variation within that geographical area.
Enter the concept of vowel space density. From an unanalyzed audio recording, we can extract just those stretches of speech for which formants can be calculated, trace the path those formants follow through a multi-dimensional space of F1, F2, and even F3 values, and calculate the density of measured formant values within that space, accumulated over an entire personal narrative. This method has been used to calculate the area of the vowel space that meets a certain density threshold [citation], for purposes of speech therapy. For the study of dialect variation, it is also possible to examine the contours of the frequency distribution and compare the contours of one sample of speech with those generated from another sample. [example] Will this provide enough information to characterize the variation between dialects across a region where no other variation information is known? Will it provide information that is representative of the language as a whole, rather than of a particular speaker or a particular story?
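As a rough illustration of the pipeline, the sketch below extracts a formant track from one recording with the praat-parselmouth library, estimates the F1-F2 density with a Gaussian kernel, and measures the area above a density threshold. The file name, time step, and threshold fraction are illustrative assumptions, not values taken from the cited work.

```python
"""Minimal sketch of vowel space density from one audio recording.

Assumes the praat-parselmouth and scipy packages; "narrative.wav",
the 10 ms time step, and the 25% density threshold are placeholders.
"""
import numpy as np
import parselmouth
from scipy.stats import gaussian_kde


def formant_track(wav_path, time_step=0.01, max_formant=5500.0):
    """Return F1 and F2 (Hz) for every frame where Praat finds formants."""
    snd = parselmouth.Sound(wav_path)
    formant = snd.to_formant_burg(time_step=time_step, maximum_formant=max_formant)
    times = formant.xs()  # frame centre times
    f1 = np.array([formant.get_value_at_time(1, t) for t in times])
    f2 = np.array([formant.get_value_at_time(2, t) for t in times])
    ok = ~np.isnan(f1) & ~np.isnan(f2)  # keep only frames with measurable formants
    return f1[ok], f2[ok]


def density_grid(f1, f2, grid_size=100):
    """Kernel density estimate of the F1-F2 cloud, evaluated on a regular grid."""
    kde = gaussian_kde(np.vstack([f1, f2]))
    f1_axis = np.linspace(f1.min(), f1.max(), grid_size)
    f2_axis = np.linspace(f2.min(), f2.max(), grid_size)
    g1, g2 = np.meshgrid(f1_axis, f2_axis)
    density = kde(np.vstack([g1.ravel(), g2.ravel()])).reshape(g1.shape)
    return f1_axis, f2_axis, density


def area_above_threshold(f1_axis, f2_axis, density, fraction=0.25):
    """Area (Hz^2) of the region whose density exceeds `fraction` of the peak."""
    cell = (f1_axis[1] - f1_axis[0]) * (f2_axis[1] - f2_axis[0])
    return np.count_nonzero(density > fraction * density.max()) * cell


if __name__ == "__main__":
    f1, f2 = formant_track("narrative.wav")  # hypothetical recording
    f1_axis, f2_axis, density = density_grid(f1, f2)
    print("vowel space density area:", area_above_threshold(f1_axis, f2_axis, density))
```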
What are some of the factors that determine the formant frequencies in a sample of speech, and the density of the formant trace within that space? Male and female speakers produce formants in different frequency ranges, and even speakers of the same gender have different formant spaces depending on the size and unique shape of their vocal tracts. Can we account for this by normalizing the frequencies based on the median and maximum frequencies? How comparable is the result for different speakers of the same gender, and for speakers of different genders?
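One way to operationalize the median/maximum scaling suggested above is sketched below: each speaker's median formant value maps to 0 and their maximum to 1 before densities are compared. This is an assumption about how the normalization might be defined; Lobanov-style z-scoring would be an obvious alternative.

```python
import numpy as np


def median_max_normalize(formant_hz):
    """Rescale a formant track so the speaker's median maps to 0 and maximum to 1."""
    values = np.asarray(formant_hz, dtype=float)
    med = np.median(values)
    return (values - med) / (values.max() - med)


# Normalize F1 and F2 separately for each speaker before pooling or comparing:
# f1_norm, f2_norm = median_max_normalize(f1), median_max_normalize(f2)
```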
Another factor is the set of lexical items used in a sample of natural speech. Ideally, a sufficiently long sample of natural speech contains words that exercise the whole range of the formant space for that language, but a short sample can be skewed (or even deliberately gamed: "The rain in Spain stays mainly in the plain"), so the sample needs to be long enough, and we need to check that it does not show this property. Over a sufficiently long sample of speech, we would expect phoneme frequencies to approach their frequencies across the language's lexicon. In a relatively short sample of coherent discourse, however, certain lexical items will occur with higher than average frequency: in a story about a fish, the phrase "the fish" is likely to show up more often than it would in a very large sample of unconnected discourse, so the sounds in that phrase may appear more frequent than they are in a larger, more representative sample of language use.
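A simple check on whether a narrative is long enough is to build the density separately from its two halves and measure how much they differ; if the halves disagree strongly, lexical skew or sheer brevity is likely distorting the result. The sketch below does this with two-dimensional histograms and the Jensen-Shannon distance; the bin count is an arbitrary choice, and the f1/f2 arrays are assumed to come from a formant track like the one extracted earlier.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon


def half_sample_divergence(f1, f2, bins=40):
    """Jensen-Shannon distance between F1-F2 histograms of each half of one recording.

    Values near 0 suggest the density has stabilized; large values suggest
    the sample is too short or dominated by a few lexical items.
    """
    mid = len(f1) // 2
    rng = [[f1.min(), f1.max()], [f2.min(), f2.max()]]  # shared grid for both halves
    h_first, _, _ = np.histogram2d(f1[:mid], f2[:mid], bins=bins, range=rng, density=True)
    h_second, _, _ = np.histogram2d(f1[mid:], f2[mid:], bins=bins, range=rng, density=True)
    return jensenshannon(h_first.ravel(), h_second.ravel())
```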
As a test case, SIL language surveys often collect personal narratives as well as wordlists from multiple locations within a language area. This provides a convenient means to compare the vowel space density calculated from the narrative used in a recorded text test with the phonetic and phonemic variation seen in the wordlists. In one particular dataset [citation], there are also multiple personal narratives from the same speaker, which can be used to address some of the questions needed to evaluate this method.
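Under the same assumptions, the within-speaker comparison that this dataset makes possible could be sketched as follows: compute a distance between the vowel space densities of two recordings, then see whether distances between survey sites exceed distances between narratives told by the same speaker. The Jensen-Shannon distance and the bin count below are stand-ins for whatever comparison of contours is ultimately adopted.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon


def vsd_distance(f1_a, f2_a, f1_b, f2_b, bins=40):
    """Jensen-Shannon distance between the F1-F2 densities of two recordings,
    binned on a shared grid so the histograms are directly comparable."""
    f1_range = (min(f1_a.min(), f1_b.min()), max(f1_a.max(), f1_b.max()))
    f2_range = (min(f2_a.min(), f2_b.min()), max(f2_a.max(), f2_b.max()))
    h_a, _, _ = np.histogram2d(f1_a, f2_a, bins=bins, range=[f1_range, f2_range], density=True)
    h_b, _, _ = np.histogram2d(f1_b, f2_b, bins=bins, range=[f1_range, f2_range], density=True)
    return jensenshannon(h_a.ravel(), h_b.ravel())


# Evaluation plan suggested by the dataset:
#   within-speaker baseline: distance between two narratives by the same speaker
#   candidate dialect signal: distance between narratives from different sites
# If the cross-site distances are not clearly larger than the within-speaker
# baseline, the measure is more likely reflecting speaker or story effects
# than dialect variation.
```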