[FIX] Statistics - Regex count in whole document to only token #1014

Merged: 1 commit, Feb 16, 2024


PrimozGodec
Collaborator

@PrimozGodec PrimozGodec commented Oct 24, 2023

Issue

The regex counter counts matches only within individual tokens, so it misses matches that span multiple words.
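
A minimal sketch of the difference, using only Python's standard `re` module. The function names `count_in_tokens` and `count_in_document` and the sample data are hypothetical illustrations, not the widget's actual code:

```python
import re


def count_in_tokens(pattern, tokens):
    """Count regex matches token by token (hypothetical helper).

    Each token is searched separately, so a match can never
    cross a token boundary.
    """
    regex = re.compile(pattern)
    return sum(1 for token in tokens for _ in regex.finditer(token))


def count_in_document(pattern, text):
    """Count regex matches in the full document text (hypothetical helper)."""
    regex = re.compile(pattern)
    return sum(1 for _ in regex.finditer(text))


# Sample data, assumed for illustration only.
text = "New York is large. I love New York."
tokens = ["New", "York", "is", "large", "I", "love", "New", "York"]

# A multi-word pattern is invisible to the token-wise count.
print(count_in_tokens(r"New York", tokens))   # 0
print(count_in_document(r"New York", text))   # 2

# A single-word pattern gives the same count either way.
print(count_in_tokens(r"York", tokens))       # 2
print(count_in_document(r"York", text))       # 2
```

Single-word patterns are counted the same in both modes here. Only patterns that cross token boundaries differ, which is the case this issue describes.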

Description of changes

As discussed with @ajdapretnar, I added a dropdown beside each statistic so that the user can choose whether the statistic is computed on tokens/ngrams or on the full document. The dropdown currently has two options:

  • Preprocessed tokens - Statistics are computed on either tokens or ngrams, depending on which is more suitable for the statistic.
  • Documents - Statistics are computed on the full document text.

Discussion

  • Is Preprocessed tokens a good term, or do we have any other idea? Resolved: changed to Tokens.
  • Average word length is currently implemented only on documents, since the name doesn't make sense on ngrams. Should we rename it to Average term length and apply it to both documents and ngrams, so that it measures word length on documents and ngram length on ngrams? Resolved: renamed to Average term length and enabled for ngrams.

Includes
  • Code changes
  • Tests
  • Documentation

@PrimozGodec PrimozGodec marked this pull request as draft October 24, 2023 09:32
@codecov-commenter

codecov-commenter commented Oct 24, 2023

Codecov Report

Merging #1014 (6583069) into master (e7c360d) will increase coverage by 0.18%.
Report is 7 commits behind head on master.
The diff coverage is 96.03%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1014      +/-   ##
==========================================
+ Coverage   82.18%   82.37%   +0.18%     
==========================================
  Files          92       92              
  Lines       12283    12381      +98     
  Branches     1670     1690      +20     
==========================================
+ Hits        10095    10199     +104     
+ Misses       1880     1866      -14     
- Partials      308      316       +8     

@PrimozGodec
Collaborator Author

/rebase

@PrimozGodec PrimozGodec force-pushed the annotator-epsilon branch 2 times, most recently from 0af2b21 to c42d3b8 Compare February 2, 2024 10:18
@PrimozGodec PrimozGodec marked this pull request as ready for review February 2, 2024 10:32
@ajdapretnar ajdapretnar merged commit a163629 into biolab:master Feb 16, 2024
11 of 12 checks passed