This repo shows how to use the 'WordCount' module in Python to visualize the most frequent words in Arabic datasets, and in text in general, while working around the errors that can occur because the module is not designed to handle Arabic directly. Several preprocessing steps were needed before the dataset could be used properly.
The dataset consists of 2386 product reviews, written mainly in Arabic, with some reviews in English or Arabizi. Reviews are classified into three categories: Positive, Negative, and Neutral.
- The required modules were installed and the dataset was imported.
- The dataset was split into two columns.
- The sentences were tokenized into words and added to a list.
- To avoid interference from English, Arabizi, and special characters, these were removed as a partial cleaning of the dataset.
- We took the first 99 words, then created an instance of WordCount, passing Arabic stopwords (imported from the get_stop_words module) and the Shorooq font as arguments.
- For the words to display properly in the plot, we reshaped them and reversed their letters, added them to a list, and finally converted the list to a Pandas series.
- We generated the WordCount from the Pandas series and plotted the figure.
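
The cleaning and tokenizing steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the repo's actual code: the sample reviews, the tiny stopword set, the function names, and the choice of the Unicode range U+0621 to U+064A as "Arabic letters" are all assumptions made for the example. It also counts word frequencies with `collections.Counter` as a stand-in, and it leaves out the reshaping, font, and plotting steps.

```python
import re
from collections import Counter

# Assumption: "Arabic" means the basic letter range U+0621..U+064A.
# Everything else (Latin letters, digits, punctuation) is replaced by a space.
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]+")

# Hypothetical mini stopword set, for illustration only.
STOPWORDS = {"في", "من", "على"}


def clean(text):
    """Strip English, Arabizi, and special characters from one review."""
    return NON_ARABIC.sub(" ", text)


def tokenize(sentences):
    """Split each cleaned sentence on whitespace and collect all words."""
    words = []
    for sentence in sentences:
        words.extend(clean(sentence).split())
    return words


def top_words(words, n=99, stopwords=STOPWORDS):
    """Return the n most frequent non-stopword words with their counts."""
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)


# Made-up sample reviews mixing Arabic, English, and Arabizi.
reviews = [
    "المنتج ممتاز جدا good!!",
    "المنتج سيء 7elw",
    "ممتاز في كل شيء",
]

words = tokenize(reviews)
top = top_words(words, n=3)
```

Filtering by Unicode range is a blunt instrument: it also drops Arabic-Indic digits and diacritics, which may or may not be what a given dataset needs, so the pattern is worth adjusting to the data at hand.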