Soobin Choi 3 November 2022
I was able to import both the korean-learner-corpus (KLC) and PELIC data from GitHub and finished sorting out what I need for my project, as shown in the file ‘interim check’.
I also tokenized the korean-learner-corpus text, but the problem I face is that I am not sure how to count the tokens in a single text. I am thinking about using count() and then summing the number of tokens that share the same user id, but I have not tried this yet. I may get a lot of errors or unwanted results. I am also not sure how to count the morphemes in KLC. I believe I can count the number of slashes (/) in the corpus, since morpheme boundaries are marked with a slash, but I have not figured this out yet.
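One possible way to approach the counting described above is sketched below. It is only a sketch under assumptions: the column names (user_id, text, morph) and the romanized toy sentences are hypothetical and do not come from KLC, tokens are assumed to be separated by whitespace, and the slash is assumed to appear only as a morpheme boundary inside a word. Under those assumptions, a word with k slashes contains k + 1 morphemes, so the morpheme count is the number of slashes plus the number of annotated words, not the number of slashes alone.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data: column names and (romanized) sentences are assumptions,
# not the actual KLC layout.
klc <- tibble(
  user_id = c("u1", "u1", "u2"),
  text    = c("naneun hakgyoe gassda", "babeul meogeossda", "chaegeul ilgneunda"),
  morph   = c("na/neun hakgyo/e ga/ass/da", "bab/eul meog/eoss/da", "chaeg/eul ilg/neunda")
)

per_user <- klc %>%
  mutate(
    n_tokens     = str_count(text, "\\S+"),         # whitespace-delimited tokens
    n_boundaries = str_count(morph, fixed("/")),     # slashes = morpheme boundaries
    n_morphemes  = n_boundaries + str_count(morph, "\\S+")
  ) %>%
  group_by(user_id) %>%
  summarise(
    n_tokens     = sum(n_tokens),
    n_boundaries = sum(n_boundaries),
    n_morphemes  = sum(n_morphemes),
    .groups = "drop"
  )

print(per_user)
```

Here the per-text counts are computed first and then summed within each user id, which mirrors the plan of counting and then adding up by user.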
Luckily, for PELIC, the text is already tokenized, and the number of tokens in each text is included in the data, so it will definitely facilitate my project. The problem I faced when working with the PELIC data is that the information I need is scattered across different data sets, so I had to import almost all of the data sets and combine them using join functions.
Once I figure out how to count the tokens in KLC, I believe (and hope) there will not be many remaining issues.
Regarding the Korean learner corpus, I am not sure about the licensing at this moment because the data owner does not specify a license in the repository. Since the repository is publicly accessible, it might be okay to use the data and share the results, but I think a better approach is to reach out to the owner and ask for his permission.
sorting out the data based on nationality
counting tokens using unnest_tokens()
merging datasets with the necessary columns only
filtering data based on text length (larger than 10)
filtering based on the native language (Korean only)
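The counting and filtering steps listed above could look roughly like the following. This is a minimal sketch, not the project's actual code: the toy data frame and its column names (text_id, L1, text) are hypothetical, and the length threshold of 10 is taken to mean more than 10 tokens.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data: column names and texts are assumptions.
essays <- tibble(
  text_id = c(1, 2, 3),
  L1      = c("Korean", "Korean", "Arabic"),
  text    = c(
    "I went to school and studied English with my friends after lunch today",
    "I like apples",
    "My family lives in a small town near the sea and we visit often"
  )
)

# One row per token, then one row per text with its token count.
token_counts <- essays %>%
  unnest_tokens(word, text) %>%
  count(text_id, L1, name = "n_tokens")

# Keep only texts by Korean L1 writers that are longer than 10 tokens.
korean_long <- token_counts %>%
  filter(L1 == "Korean", n_tokens > 10)

print(korean_long)
```

Note that unnest_tokens() lowercases tokens and strips punctuation by default, which affects later type counts.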
I am struggling with collecting word types from the tokens. I am not sure how to use the unique() or duplicated() function on the tokens. Because of this, I could not move further. Also, in the case of PELIC, I need to make another column that contains only the lemma and POS, which I am still figuring out. I need Dan’s help here.
Tokenize and count the tokens for each essay. I am somewhat stuck here because I do not know how to count word types. How can I apply unique() or distinct() to values that are stored as a list?
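For tokens stored in a list-column, two approaches that might work are sketched here. The data frame and its column names (essay_id, tokens) are hypothetical. The first applies unique() to each list element with lapply(); the second unnests the list into one row per token and uses n_distinct() per essay.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: each element of `tokens` is a character vector.
essays <- tibble(
  essay_id = c(1, 2),
  tokens   = list(
    c("the", "cat", "saw", "the", "dog"),
    c("a", "dog", "a", "dog")
  )
)

# Option 1: stay in the list-column and apply unique() element by element.
type_counts <- essays %>%
  mutate(
    n_tokens = lengths(tokens),
    n_types  = lengths(lapply(tokens, unique))
  )

# Option 2: one row per token, then count distinct tokens per essay.
type_counts_long <- essays %>%
  unnest(cols = tokens) %>%
  group_by(essay_id) %>%
  summarise(n_tokens = n(), n_types = n_distinct(tokens), .groups = "drop")

print(type_counts_long)
```

Calling unique() directly on the whole list-column compares the list elements (whole essays) with each other, which is why it has to be applied inside each element instead.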
Changing full_join() to left_join() to make the data smaller
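The effect of this change can be illustrated with a small sketch. The two toy tables and the key column answer_id are hypothetical. full_join() keeps every key found in either table, while left_join() keeps only the rows of the first table, so rows that exist only in the second table are dropped.

```r
library(dplyr)

# Hypothetical toy tables sharing a key column.
texts    <- tibble(answer_id = c(1, 2, 3), text_len = c(120, 8, 45))
students <- tibble(answer_id = c(2, 3, 4), L1 = c("Korean", "Korean", "Arabic"))

full <- full_join(texts, students, by = "answer_id")  # keys 1, 2, 3, 4
left <- left_join(texts, students, by = "answer_id")  # keys 1, 2, 3 only

print(nrow(full))
print(nrow(left))
```

In the left join, answer_id 1 is kept with a missing L1 value, so unmatched rows in the first table still need to be handled afterwards.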