I can't get "R&D" and "SMEs" in the same word cloud #769
Comments
Here is a fully functional example. The version where I calculate my own frequencies works fine, but it is clear that there is an issue with the version where I send the full text to Word Cloud.

    #!venv/bin/python
    import collections
    import matplotlib.pyplot as plt
    import wordcloud

    if __name__ == "__main__":
        text = " ".join(["sme"] * 10 + ["r&d"] * 10)
        frequencies = collections.Counter()
        for word in text.split(" "):
            frequencies[word] += 1
        frequencies = dict(frequencies)

        cloud = wordcloud.WordCloud().generate(text)
        plt.imshow(cloud)
        plt.tight_layout(pad=0)
        plt.axis('off')
        plt.title('text')
        plt.show(bbox_inches='tight')

        cloud = wordcloud.WordCloud().generate_from_frequencies(frequencies)
        plt.imshow(cloud)
        plt.tight_layout(pad=0)
        plt.axis('off')
        plt.title('frequencies')
        plt.show(bbox_inches='tight')
You can provide a custom regexp:

    #!venv/bin/python
    import collections
    import matplotlib.pyplot as plt
    import wordcloud

    if __name__ == "__main__":
        text = " ".join(["sme"] * 10 + ["r&d"] * 10)
        frequencies = collections.Counter()
        for word in text.split(" "):
            frequencies[word] += 1
        frequencies = dict(frequencies)

        cloud = wordcloud.WordCloud(regexp=r"\b[\w&][\w'&]+").generate(text)  # Custom regexp includes "&" character
        plt.imshow(cloud)
        plt.tight_layout(pad=0)
        plt.axis('off')
        plt.title('text')
        plt.show(bbox_inches='tight')

        cloud = wordcloud.WordCloud().generate_from_frequencies(frequencies)
        plt.imshow(cloud)
        plt.tight_layout(pad=0)
        plt.axis('off')
        plt.title('frequencies')
        plt.show(bbox_inches='tight')

For reference, here is the default regexp used by wordcloud: word_cloud/wordcloud/wordcloud.py, lines 582 to 583 in 1072b0e
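To see why the custom pattern helps, it may be useful to run both patterns through Python's `re` module directly, without wordcloud. This is only a sketch: `DEFAULT_LIKE` is an assumption about what the library's default tokenizer roughly looks like (check the linked source for the exact expression in your version), and `tokenize` is a hypothetical helper, not part of wordcloud.

```python
import re

# Assumption: a pattern resembling wordcloud's default tokenizer.
# It only accepts word characters and apostrophes, so "&" splits a token.
DEFAULT_LIKE = r"\w[\w']*"

# The custom pattern from the reply above, which also accepts "&".
CUSTOM = r"\b[\w&][\w'&]+"


def tokenize(text, pattern):
    """Return the tokens that `pattern` extracts from `text` (hypothetical helper)."""
    return re.findall(pattern, text)


sample = "sme r&d"
default_tokens = tokenize(sample, DEFAULT_LIKE)
custom_tokens = tokenize(sample, CUSTOM)

print(default_tokens)  # "r&d" is split at the ampersand
print(custom_tokens)   # "r&d" survives as a single token
```

If the default really does behave like `DEFAULT_LIKE`, the separate "r" and "d" tokens would explain the "r d" label in the reported cloud. Note that the custom pattern needs at least two characters per token, so single-letter words are dropped.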
Description
I work in a field where "R&D" (research and development) and "SMEs" (small and medium enterprises) are important concepts. If I tokenize myself, then Word Cloud displays "r&d" correctly but does not display "smes" even though I have verified they are prominent in my frequency count. If I let Word Cloud tokenize, then it displays "smes" correctly but renders "R&D" as "r d" (i.e., with a space where there should be an ampersand). Is there anything I can do?
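For the route where you tokenize yourself, a quick way to check that both terms actually land in the frequency table is to count tokens with a pattern that keeps "&". This is a minimal sketch, not the library's own tokenizer: `count_tokens` and the sample sentence are made up for illustration, and the pattern is just one that happens to allow ampersands.

```python
import collections
import re


def count_tokens(text, pattern=r"\b[\w&][\w'&]+"):
    """Lowercase `text`, split it with `pattern`, and count each token.

    Hypothetical helper; the default pattern keeps "&" inside tokens.
    """
    return collections.Counter(re.findall(pattern, text.lower()))


# Made-up sample text containing both terms of interest.
text = "R&D spending by SMEs rose; SMEs report that R&D grants help."
frequencies = dict(count_tokens(text))

# Both "r&d" and "smes" should appear as keys with their counts.
print(frequencies)
```

A dictionary like `frequencies` has the shape that `generate_from_frequencies` is given in the example in this thread, so printing it first shows exactly which keys the cloud will be built from.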
Steps/Code to Reproduce
Example:
Expected Results
Either way, I should be able to get both "smes" and "r&d" on the same Word Cloud.
Actual Results
As described above, in one case I get "smes" and "r d", and in the other case, I get "r&d" but no "smes".
Versions
Windows-11-10.0.22631-SP0
Python 3.12.4 (tags/v3.12.4:8e8a4ba, Jun 6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)]
NumPy 1.26.4
matplotlib 3.9.0
wordcloud 1.9.3