Using spacyr for language processing instead of the current UD treebank #87

Open
fishfree opened this issue Nov 1, 2024 · 3 comments

Comments


fishfree commented Nov 1, 2024

I have found that the UD treebank models perform poorly for some languages, especially CJK languages. spaCy supports many languages and performs much better than the UD treebank models.

@massimoaria
Owner

Thank you for your comment.

We have already considered using spaCy and decided to continue with Udpipe.

This is because spaCy is not native in R but requires a Python installation, which often leads to numerous errors and requires a lot of work on the part of the user. It took me a whole day to get a properly functioning Python environment on my Mac to be able to use spacyr.

To improve Udpipe's performance, we plan to train updated models for the most commonly used languages. This will be done in the coming months.

Author

fishfree commented Nov 18, 2024

@massimoaria Thank you!
UDPipe runs much faster than spaCy because it is written in C++, so the best option is probably to train UDPipe on more corpora to reach a higher F1.
After exploring further, I think that besides spacyr we may also need the spacy-conll package, which can parse texts into CoNLL-U format.
However, the CJK language models in spaCy do not output some CoNLL-U fields, namely feats, lemma, and misc. I suspect that the lack of these fields will cause problems for downstream analyses such as clustering. I also suspect that CJK languages, which do not use spaces as separators, will cause problems for some downstream analysis tasks.
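
To make the missing-field concern concrete, here is a minimal base R sketch that reads CoNLL-U text and lists the columns that are unspecified (`_`) for every token. The ten column names follow the CoNLL-U format; the sample text and the helper names `read_conllu_tokens` and `empty_fields` are made up for illustration and are not real spaCy or spacy-conll output.

```r
# The ten CoNLL-U columns, in order.
conllu_cols <- c("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")

# Hypothetical sample: placeholder tokens, NOT real spaCy output.
sample_conllu <- paste(
  "# text = placeholder",
  "1\tw1\t_\tNOUN\tNN\t_\t2\tnsubj\t_\t_",
  "2\tw2\t_\tVERB\tVV\t_\t0\troot\t_\t_",
  "",
  sep = "\n"
)

# Parse token lines into a data frame, skipping comments and blank lines.
read_conllu_tokens <- function(txt) {
  lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]
  lines <- lines[nzchar(lines) & !startsWith(lines, "#")]
  fields <- strsplit(lines, "\t", fixed = TRUE)
  stopifnot(all(lengths(fields) == length(conllu_cols)))
  out <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
  names(out) <- conllu_cols
  out
}

# Names of the columns whose value is "_" for every token.
empty_fields <- function(tokens) {
  is_empty <- vapply(tokens, function(col) all(col == "_"), logical(1))
  names(tokens)[is_empty]
}

conllu_tokens <- read_conllu_tokens(sample_conllu)
missing_cols <- empty_fields(conllu_tokens)
```

Running a check like this on real parser output before the downstream steps would show whether `lemma` or `feats` is actually empty for a given model, rather than relying on my impression.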

Author

fishfree commented Nov 18, 2024

I tried using spaCy to parse CJK languages. The file is attached for reference:
global.zip
Please change the extension to .R

The modified lines in Server.R are as follows:

  posTagging <- eventReactive({
    input$tokPosRun
  },{
    values$language <- sub("-.*","",input$language_model)

    # Select processing model based on language
    if (input$language_model %in% c("chinese", "japanese", "korean")) {
      # Initialize the spaCy model
      initialize_spacy_model(input$language_model, input$model_size)
      filtered_text <- values$txt %>% filter(doc_selected)
      doc_ids <- filtered_text$doc_id

      values$dfTag <- process_text_with_spacy(filtered_text$text)
      # Add `doc_ids` as a column in `values$dfTag`
      if (!"doc_id" %in% colnames(values$dfTag)) {
        values$dfTag <- cbind(doc_id = doc_ids, values$dfTag)
      }
    } else {
      ## download and load model language
      udmodel_lang <- loadLanguageModel(language = input$language_model)

      ## set cores for parallel computing
      ncores <- max(1, parallel::detectCores() - 1)

      ## set cores for windows machines
      if (Sys.info()[["sysname"]] == "Windows") {
        cl <- makeCluster(ncores)
        registerDoParallel(cl)
      }

      # Lemmatization and POS Tagging
      values$dfTag <- udpipe(object = udmodel_lang,
                             x = values$txt %>% filter(doc_selected),
                             parallel.cores = ncores)
    }
    # Merge metadata from the original txt object
    values$dfTag <- values$dfTag %>%
      left_join(values$txt %>% select(-text, -text_original), by = "doc_id") %>%
      filter(!is.na(upos)) %>%
      posSel(., c("ADJ","NOUN","PROPN", "VERB"))
    values$dfTag <- highlight(values$dfTag)
    values$dfTag$docSelected <- TRUE
    values$menu <- 1
  }
  )
  ## Tokenization & PoS Tagging ----

  output$optionsTokenization <- renderUI({
    list(
      selectInput(
        inputId = 'language_model', label = "Select language",
        choices = names(lang_map), selected = "english",
        multiple = FALSE,
        width = "100%"
      ),
      selectInput("model_size", "Select Model Size",
                  choices = c("small" = "sm", "medium" = "md", "large" = "lg"),
                  selected = "sm")
    )
  })
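
One detail in the spaCy branch may need care. As far as I can tell, `cbind(doc_id = doc_ids, values$dfTag)` only lines up when the table has exactly one row per document. A token-level table has many rows per document, so the call would either fail or recycle the ids incorrectly. The base R sketch below shows one way to attach the original ids, assuming the token table carries the position of its source document. The `doc_index` column, the helper `attach_doc_id`, and the sample data are hypothetical; they are not taken from global.zip.

```r
# Hypothetical stand-ins: two selected documents and a token-level table
# in which doc_index records the position of each token's source document.
doc_ids <- c("doc_A", "doc_B")
tokens <- data.frame(
  doc_index = c(1L, 1L, 1L, 2L, 2L),
  token     = c("w1", "w2", "w3", "w4", "w5"),
  upos      = c("PRON", "VERB", "NOUN", "NOUN", "ADJ"),
  stringsAsFactors = FALSE
)

# Map each token row to its original doc_id by position, then drop the
# helper index and put doc_id first.
attach_doc_id <- function(tokens, doc_ids) {
  stopifnot(all(tokens$doc_index %in% seq_along(doc_ids)))
  tokens$doc_id <- doc_ids[tokens$doc_index]
  keep <- c("doc_id", setdiff(names(tokens), c("doc_id", "doc_index")))
  tokens[, keep]
}

tagged <- attach_doc_id(tokens, doc_ids)
```

With the ids attached this way, the later `left_join(..., by = "doc_id")` can match every token row to the metadata of its own document.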
