[FIX] Statistics - Regex count in whole document to only token #1014

PrimozGodec · 2023-10-24T09:25:22Z

Issue

The regex counter counts matches only within individual tokens, so it misses multi-word matches.

Description of changes

As discussed with @ajdapretnar, I added a dropdown beside each statistic so that the user can decide whether to compute it on tokens/n-grams or on the full document. Currently, it includes two options:

Preprocessed tokens - Statistics are computed on either tokens or n-grams, depending on which is more suitable for the statistic.
Documents - Statistics are computed on the full document text.
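
To illustrate why the source matters for the regex counter, here is a minimal sketch. It is not the code from this PR: the function `count_regex`, its `source` parameter, and the whitespace tokenization are hypothetical and only meant to show the difference between the two options.

```python
import re


def count_regex(pattern, document, tokens, source="Documents"):
    """Count regex matches in the full text or token by token.

    Hypothetical helper for illustration; not the widget's implementation.
    """
    regex = re.compile(pattern)
    if source == "Documents":
        # Search the raw text, so a match may span several words.
        return sum(1 for _ in regex.finditer(document))
    # "Tokens": each token is searched in isolation, so a pattern
    # that spans a token boundary can never match.
    return sum(1 for tok in tokens for _ in regex.finditer(tok))


document = "I moved from New York to York."
tokens = document.split()  # assumed tokenization: split on whitespace

print(count_regex(r"New York", document, tokens, "Documents"))  # 1
print(count_regex(r"New York", document, tokens, "Tokens"))     # 0
print(count_regex(r"York", document, tokens, "Tokens"))         # 2
```

A single-word pattern gives the same count either way here, while the multi-word pattern is only found when the full document is searched.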

Discussion

~~Is Preprocessed tokens a good term, or do we have any other idea?~~ Changed to Tokens
Average word length is currently implemented only on documents, since the name doesn't make sense for n-grams. Should we rename it to Average term length and apply it to both documents and n-grams, so that it measures word length on documents and n-gram length on n-grams? Renamed to Average term length and enabled for n-grams.
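
As a rough sketch of what the renamed statistic could mean for each source (an assumption for illustration, not the implementation in this PR): the hypothetical helper `average_term_length` below takes the mean character length of whatever terms it is given, which are words when the source is the document and n-grams when the source is tokens. Measuring an n-gram by the length of its space-joined string is also an assumption.

```python
def average_term_length(terms):
    """Mean character length of the given terms (hypothetical helper)."""
    if not terms:
        return 0.0
    return sum(len(term) for term in terms) / len(terms)


document = "the quick brown fox"

# Documents: terms are the words of the raw text (assumed whitespace split).
words = document.split()

# Tokens with bigrams: terms are n-grams, here joined with a space (assumed).
bigrams = [" ".join(pair) for pair in zip(words, words[1:])]

print(average_term_length(words))    # 4.0
print(average_term_length(bigrams))  # 29 / 3, roughly 9.67
```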

Includes

Code changes
Tests
Documentation

codecov-commenter · 2023-10-24T09:37:01Z

Codecov Report

Merging #1014 (6583069) into master (e7c360d) will increase coverage by 0.18%.
Report is 7 commits behind head on master.
The diff coverage is 96.03%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1014      +/-   ##
==========================================
+ Coverage   82.18%   82.37%   +0.18%     
==========================================
  Files          92       92              
  Lines       12283    12381      +98     
  Branches     1670     1690      +20     
==========================================
+ Hits        10095    10199     +104     
+ Misses       1880     1866      -14     
- Partials      308      316       +8

PrimozGodec · 2024-01-30T10:25:35Z

/rebase

PrimozGodec marked this pull request as draft October 24, 2023 09:32

biolab-helper force-pushed the annotator-epsilon branch from 6d4b9ea to c963d4c Compare January 30, 2024 10:51

PrimozGodec force-pushed the annotator-epsilon branch 2 times, most recently from 0af2b21 to c42d3b8 Compare February 2, 2024 10:18

PrimozGodec marked this pull request as ready for review February 2, 2024 10:32

PrimozGodec force-pushed the annotator-epsilon branch from c42d3b8 to 94d3a97 Compare February 2, 2024 10:43

Statistics - Select statistic computation source

6583069

PrimozGodec force-pushed the annotator-epsilon branch from 94d3a97 to 6583069 Compare February 2, 2024 12:37

ajdapretnar merged commit a163629 into biolab:master Feb 16, 2024
11 of 12 checks passed
