Using spacyr for language processing instead of the current UD treebank #87

fishfree · 2024-11-01T07:18:04Z

I found that the UD treebank models perform poorly for some languages, especially CJK languages. spaCy supports many languages and performs much better than the UD treebank models.

massimoaria · 2024-11-18T07:42:24Z

Thank you for your comment.

We have already considered using spaCy and decided to continue with UDPipe.

This is because spaCy is not native to R: it requires a Python installation, which often leads to numerous errors and demands a lot of work from the user. It took me a whole day to get a properly functioning Python environment on my Mac so that I could use spacyr.

To improve UDPipe's performance, we plan to train updated models for the most commonly used languages. This will be done in the coming months.

fishfree · 2024-11-18T07:54:18Z

@massimoaria Thank you!
UDPipe runs much faster than spaCy because it is written in C++, so the best option would be to train UDPipe on larger corpora to achieve a higher F1.
After some exploring, I think that besides spacyr we may also need the spacy-conll package, which can parse texts into CoNLL-U format.
However, the CJK language models in spaCy do not output some CoNLL-U fields, i.e. feats / lemma / misc. I suspect that the lack of these fields will cause problems in downstream analyses such as clustering. I also suspect that the absence of spaces as word separators in CJK languages will cause problems in some downstream analysis tasks.
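
To make the concern about missing fields concrete, here is a minimal base R sketch of how one might measure which CoNLL-U fields are actually filled in a token-level data frame, and fall back to the surface token when the lemma is empty. The function names `conllu_field_coverage()` and `fill_missing_lemma()` and the toy data are hypothetical and are not part of the app or of spacyr. The lemma fallback is only an assumption that may be tolerable for Chinese and is lossy for Japanese and Korean.

```r
# Hypothetical helper (not part of the posted patch): share of rows in which
# each CoNLL-U field carries a real value. Absent columns count as 0.
conllu_field_coverage <- function(tokens,
                                  fields = c("lemma", "upos", "xpos", "feats", "misc")) {
  vapply(fields, function(f) {
    if (!f %in% names(tokens)) return(0)
    x <- as.character(tokens[[f]])
    # "_" is the CoNLL-U placeholder for an unspecified value
    filled <- !is.na(x) & nzchar(x) & x != "_"
    if (length(filled) == 0) 0 else mean(filled)
  }, numeric(1))
}

# Hypothetical fallback: use the surface form when the lemma is missing
fill_missing_lemma <- function(tokens) {
  if (!"lemma" %in% names(tokens)) tokens$lemma <- NA_character_
  lemma <- as.character(tokens$lemma)
  missing <- is.na(lemma) | !nzchar(lemma) | lemma == "_"
  tokens$lemma <- ifelse(missing, as.character(tokens$token), lemma)
  tokens
}

# Toy data imitating a parse in which lemma and feats are empty
toks <- data.frame(
  doc_id = c("d1", "d1", "d1"),
  token  = c("我", "喜欢", "猫"),
  lemma  = c("", "", ""),
  upos   = c("PRON", "VERB", "NOUN"),
  feats  = c("_", "_", "_"),
  stringsAsFactors = FALSE
)

coverage    <- conllu_field_coverage(toks)
toks_filled <- fill_missing_lemma(toks)
```

Running the coverage check before clustering or topic modeling would show whether a lemma-based step is receiving real lemmas or only empty placeholders.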

fishfree · 2024-11-18T07:58:43Z

I tried using spaCy to parse CJK languages. I am attaching the file for reference.
global.zip
Please change the extension to .R

The modified lines in Server.R are as follows:

  posTagging <- eventReactive({
    input$tokPosRun
  },{
    values$language <- sub("-.*","",input$language_model)

    # Select processing model based on language
    if (input$language_model %in% c("chinese", "japanese", "korean")) {
      # Initializing the spaCy model
      initialize_spacy_model(input$language_model, input$model_size)
      filtered_text <- values$txt %>% filter(doc_selected)
      doc_ids <- filtered_text$doc_id

      values$dfTag <- process_text_with_spacy(filtered_text$text)
      # Add `doc_ids` as a column in `values$dfTag`
      if (!"doc_id" %in% colnames(values$dfTag)) {
        values$dfTag <- cbind(doc_id = doc_ids, values$dfTag)
      }
    } else {
      ## download and load model language
      udmodel_lang <- loadLanguageModel(language = input$language_model)

      ## set cores for parallel computing
      ncores <- max(1,parallel::detectCores()-1)

      ## set cores for windows machines
      if (Sys.info()[["sysname"]]=="Windows") {
        cl <- makeCluster(ncores)
        registerDoParallel(cl)
      }

      # Lemmatization and POS Tagging
      values$dfTag <- udpipe(object=udmodel_lang, x = values$txt %>%
                               filter(doc_selected),
                             parallel.cores=ncores)
    }
    # Merge metadata from the original txt object
    values$dfTag <- values$dfTag %>%
      left_join(values$txt %>% select(-text, -text_original), by = "doc_id") %>%
      filter(!is.na(upos)) %>%
      posSel(., c("ADJ","NOUN","PROPN", "VERB"))
    values$dfTag <- highlight(values$dfTag)
    values$dfTag$docSelected <- TRUE
    values$menu <- 1
  }
  )

  ## Tokenization & PoS Tagging ----

  output$optionsTokenization <- renderUI({
    list(
      selectInput(
        inputId = 'language_model', label = "Select language",
        choices = names(lang_map), selected = "english",
        multiple = FALSE,
        width = "100%"
      ),
      selectInput(
        "model_size", "Select Model Size",
        choices = c("small" = "sm", "medium" = "md", "large" = "lg"),
        selected = "sm"
      )
    )
  })
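
The merge step above filters on a `upos` column and joins on `doc_id`, so whatever `process_text_with_spacy()` returns has to follow the udpipe column layout. The contents of global.zip are not shown here, so the following is only a sketch of one possible adapter, assuming the parse result uses spacyr-style column names (`pos`, and optionally `tag`). The function name `as_udpipe_layout()` and the toy input are hypothetical.

```r
# Hypothetical adapter: rename spacyr-style columns to the udpipe-style names
# used downstream, and add any expected column that is absent as NA.
as_udpipe_layout <- function(parsed) {
  rename_map <- c(pos = "upos", tag = "xpos")
  for (old in names(rename_map)) {
    new <- rename_map[[old]]
    if (old %in% names(parsed) && !new %in% names(parsed)) {
      names(parsed)[names(parsed) == old] <- new
    }
  }
  expected <- c("doc_id", "sentence_id", "token_id", "token",
                "lemma", "upos", "xpos", "feats", "misc")
  for (col in setdiff(expected, names(parsed))) {
    parsed[[col]] <- rep(NA_character_, nrow(parsed))
  }
  # Expected columns first, any extra columns (e.g. entity) afterwards
  parsed[, c(expected, setdiff(names(parsed), expected)), drop = FALSE]
}

# Toy input shaped like a token-level parse result without a lemma column
parsed <- data.frame(
  doc_id      = c("doc1", "doc1", "doc2"),
  sentence_id = c(1L, 1L, 1L),
  token_id    = c(1L, 2L, 1L),
  token       = c("東京", "です", "서울"),
  pos         = c("PROPN", "AUX", "PROPN"),
  stringsAsFactors = FALSE
)

tagged <- as_udpipe_layout(parsed)
```

Keeping `doc_id` inside the token-level result, rather than attaching a document-level vector with `cbind()` afterwards, avoids any row-count mismatch between documents and tokens.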

This was referenced Nov 19, 2024

"Error in select: Can't select columns past the end" when running Correspondence Analysis #106

Open

K choice Topic Modeling show nothing for Chinese texts #108

Open
