Preprocess text: first word in custom stopwords list is ignored #1028
Comments
This is an editor issue. When I use Sublime text, the file contains
See what you get.
@markotoplak Is there a way we could sanitize this internally?
@ajdapretnar, I guess you are saving text as rich text format (rtf), not plain text. @wvdvegte probably has a different problem.
I thought the reason for not considering the first row for filtering is because in rtf, additional parameters get treated as text. So instead of a plain "orange" one would get "{fancyparam:15}orange" and thus the word would not be filtered.
I was indeed referring to the use of plain text (TXT), not RTF.
@wvdvegte Could you perhaps send the stopword list? I cannot replicate the issue, so perhaps there's something about the file that is the problem. Thanks!
I didn't manage to dig up what I was working on when I reported on this in December 2023, but when I'm trying to reproduce the problem, I'm not getting any of the custom stopwords filtered out:
Thank you! Now I've finally managed to reproduce the issue.
Typical for Microsoft, perhaps? I created the text file using Word for Mac ...
Describe the bug
In custom .txt (UTF-8) stopwords files, the first word is ignored as a stopword by Preprocess Text, i.e., it is not filtered out.
To Reproduce
Create a custom stopwords .txt file in UTF-8 encoding (in my case, I used MS Word), with one word per line, and load it in Preprocess Text. The first word will not be filtered out, but the rest will. Leaving the first line empty works around the problem, but it's not the obvious thing to do.
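One possible explanation, which this thread does not confirm, is that the editor writes a UTF-8 byte order mark (BOM) at the start of the file, so the first word is read as "\ufefforange" rather than "orange". That would also be consistent with an empty first line working around the problem. The sketch below is only an illustration in plain Python: `load_stopwords` is a hypothetical helper, not the Text add-on's actual loader, and the file contents are made up.

```python
import codecs
import os
import tempfile


def load_stopwords(path, encoding="utf-8-sig"):
    """Hypothetical helper: read one stopword per line, skipping blank lines.

    The "utf-8-sig" codec drops a leading BOM if one is present and
    otherwise behaves like plain "utf-8".
    """
    with open(path, encoding=encoding) as f:
        return {line.strip() for line in f if line.strip()}


# Write a sample stopwords file that starts with a UTF-8 BOM
# (an assumption about what the editor produced).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(codecs.BOM_UTF8 + "orange\napple\n".encode("utf-8"))
    path = tmp.name

naive = load_stopwords(path, encoding="utf-8")  # BOM stays glued to the first word
robust = load_stopwords(path)                   # BOM is stripped by utf-8-sig
os.remove(path)

print("orange" in naive)   # False: the set holds "\ufefforange" instead
print("orange" in robust)  # True
```

If the file has no BOM, both calls return the same set, so reading with "utf-8-sig" would be a harmless default under this assumption.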
Expected behavior
All custom stopwords should be filtered out.
Orange version:
3.36.2 (I don't know if it's the native Silicon version or the Intel version)
Text add-on version:
1.15.0
Operating system:
macOS 14.1.2 (23B92)