
#174 Clean the data #14

Closed
wants to merge 9 commits into from

Conversation


@SnehaHS65 SnehaHS65 commented Aug 11, 2024

Here is what I found about the dataset

  1. Title is the title of the book.
  2. Description is a short description of what the book contains.
  3. Attribution_url is the link to the book online.
  4. Chapters contains all the chapters of the book in JSON format.
     E.g. Title: Insects, Chapters: Grasshopper, Ants, etc.

So title, description, and chapters are the only columns useful for training the model. Hence, I have preprocessed these columns in this PR.
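The column selection described above can be sketched with pandas. The sample values and exact column names below are assumptions for illustration, not taken from the real dataset:

```python
import pandas as pd

# Tiny in-memory sample standing in for the real dataset
# (column names and sample values are assumed for illustration)
storybooks_pd = pd.DataFrame({
    "title": ["Insects"],
    "description": ["A small book about insects"],
    "attribution_url": ["https://example.org/insects"],
    "chapters": ['[{"text": "Grasshopper"}, {"text": "Ants"}]'],
})

# Keep only the columns that are useful for training the model
useful_columns = ["title", "description", "chapters"]
storybooks_pd = storybooks_pd[useful_columns]
print(list(storybooks_pd.columns))
```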

Also, is there no coderabbit for this?

@SnehaHS65 SnehaHS65 requested a review from a team as a code owner August 11, 2024 04:13
@SnehaHS65 SnehaHS65 self-assigned this Aug 11, 2024
@jo-elimu
Member

Also is there no coderabbit for this?

@coderabbitai Where are you? 🐇🐇🐇


@jo-elimu jo-elimu left a comment


@SnehaHS65 I'm unable to run the code because of a ModuleNotFoundError:

```
pip install -r requirements.txt; python prepare_data.py;
Traceback (most recent call last):
  File "prepare_data.py", line 4, in <module>
    import nltk
ModuleNotFoundError: No module named 'nltk'
```

Remember to add the new modules (nltk, sklearn, scipy) to requirements.txt so that your code can also run on other machines.
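For reference, the added requirements.txt entries might look like the lines below (unpinned here because the exact versions are not part of this comment). Note that the `sklearn` import is provided by the `scikit-learn` package on PyPI:

```
nltk
scikit-learn
scipy
```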

@SnehaHS65
Contributor Author

@jo-elimu, I added the new modules to requirements.txt and created a new Python file to download the NLTK data. Steps to run:

```
pip install -r requirements.txt
python download_nltk_data.py
python prepare_data.py
```

Fixes build error with Python 3.9:

```
ERROR: Ignored the following yanked versions: 1.11.0
ERROR: Ignored the following versions that require a different python version: 1.14.0 Requires-Python >=3.10; 1.14.0rc1 Requires-Python >=3.10; 1.14.0rc2 Requires-Python >=3.10; 2.1.0rc1 Requires-Python >=3.10
```
@jo-elimu
Member

@SnehaHS65 I mistakenly added the original requirements.txt file as a binary file, so I corrected that by storing it as a text file instead: e355e77


@jo-elimu jo-elimu left a comment


added the new modules to requirements.txt and created a new Python file to download the NLTK data. Steps to run:

```
pip install -r requirements.txt
python download_nltk_data.py
python prepare_data.py
```

Thank you, @SnehaHS65.

Instead of running two commands, can you trigger download_nltk_data.py from within prepare_data.py? That would make the code easier to run and maintain.

@SnehaHS65
Contributor Author

@jo-elimu, can you please review and approve?

```python
storybooks_pd['combined_chapters_text'] = storybooks_pd['chapters'].apply(extract_chapters_text)
print(f"storybooks_pd_new: \n{storybooks_pd}")

# Removing stop words
```
Member


@SnehaHS65 What are stop words, and why is it necessary to remove them from the storybooks?

Contributor Author


@jo-elimu, these are the basic NLP preprocessing steps we follow to train an ML model:

  1. Lowercase the text if your problem statement allows.
  2. Tokenize the text. This can be at the sentence level or word level.
  3. Lemmatize or stem the tokens. This reduces the tokens to their base form.
  4. Remove the following: stop words, punctuation marks, hyperlinks, smileys, email IDs, etc. Basically, anything that is not needed for classification.
  5. Vectorize the text using a BOW or TF-IDF approach.
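Steps 1, 2, and 5 can be sketched with scikit-learn, whose `TfidfVectorizer` lowercases and tokenizes internally by default (the sample texts below are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sample documents for illustration
texts = ["The Grasshopper and the Ants", "Ants work all summer"]

# TfidfVectorizer lowercases and tokenizes by default, then builds TF-IDF vectors
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)
print(tfidf.shape)  # (number of documents, vocabulary size)
```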

Stopwords are common words in a language that are often filtered out during NLP tasks because they are considered to carry little meaning or contribute minimally to the overall understanding of a text. Examples include “the,” “is,” “and,” “in,” etc.

But since this dataset is in Hindi, I collected Hindi stop words, stored them in a file, and removed them from the required columns.
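A minimal sketch of the stop-word removal step; the tiny Hindi stop-word set and whitespace tokenization below are simplifications for illustration, while the PR loads its full list from a file:

```python
# Tiny illustrative subset of Hindi stop words; the PR loads a full list from a file
hindi_stopwords = {"और", "है", "के", "की", "में"}

def remove_stopwords(text):
    # Whitespace tokenization is a simplification; the PR may tokenize differently
    tokens = text.split()
    return " ".join(t for t in tokens if t not in hindi_stopwords)

print(remove_stopwords("टिड्डा और चींटी घास में है"))  # -> "टिड्डा चींटी घास"
```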

prepare_data.py Outdated

```python
# Function to run another Python script
def run_script(script_name):
    subprocess.check_call(['python', script_name])
```

@jo-elimu jo-elimu Aug 15, 2024


@SnehaHS65 The subprocess code is not working for me:

```
pip install -r requirements.txt; python prepare_data.py;
  File "prepare_data.py", line 12, in run_script
    subprocess.check_call(['python', script_name])
  File "subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['python', 'download_nltk_data.py']' returned non-zero exit status 1.
```

Can we use `import download_nltk_data` instead?
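One way to sketch that suggestion is to call `nltk.download` directly from prepare_data.py rather than spawning a subprocess. The resource names below are assumptions; adjust them to whatever download_nltk_data.py actually fetches:

```python
import nltk

# Resource names are assumptions; adjust to what download_nltk_data.py fetches
REQUIRED_RESOURCES = ["stopwords", "punkt", "wordnet"]

def ensure_nltk_data(resources=REQUIRED_RESOURCES):
    """Download each NLTK resource, tolerating failures (e.g. when offline)."""
    for resource in resources:
        try:
            nltk.download(resource, quiet=True)
        except Exception as exc:
            print(f"Could not download {resource}: {exc}")

ensure_nltk_data()
```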

```
@@ -1 +1,7 @@
pandas==2.2.2
numpy==2.0.1
nltk==3.8.2
```

@jo-elimu
Member

@SnehaHS65 Please note that there is now an alternative solution in the pmml folder.

That model currently scores 0.33 as the Mean Absolute Error (MAE). And with reading levels 1-4, I guess being off by 0.33 is not too bad.

Anyway, there is room for improvement. So if you think adding NLP to the mix can lower the Mean Absolute Error from 0.33, feel free to try that as one of the experiments.

@SnehaHS65
Contributor Author

@SnehaHS65 Please note that there is now an alternative solution in the pmml folder.

That model currently scores 0.33 as the Mean Absolute Error (MAE). And with reading levels 1-4, I guess being off by 0.33 is not too bad.

Anyway, there is room for improvement. So if you think adding NLP to the mix can lower the Mean Absolute Error from 0.33, feel free to try that as one of the experiments.

@jo-elimu, is the model already done now? Should I close the PR?

@jo-elimu
Member

jo-elimu commented Aug 20, 2024

is the model already done now? Should I close the PR?

Yes, we now have one model that is currently working in production: elimu-ai/webapp#1822

However, there are probably many ways its accuracy can be improved. So if you want to continue your initial strategy of using Natural Language Processing (NLP) for this, please feel free to try adding that as another feature column to the preprocessed CSV (as part of #4).

With each new feature, run the model training again, and see how it affects the prediction score.
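For that comparison, scikit-learn's `mean_absolute_error` can score each experiment against the held-out reading levels; the labels below are made up for illustration:

```python
from sklearn.metrics import mean_absolute_error

# Made-up reading levels (1-4) for illustration: true vs. predicted
y_true = [1, 2, 3, 4, 2]
y_pred = [1, 2, 3, 3, 2]

mae = mean_absolute_error(y_true, y_pred)
print(f"MAE: {mae:.2f}")  # one prediction off by 1 over 5 samples -> 0.20
```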

@nya-elimu
Member

@coderabbitai

@jo-elimu jo-elimu closed this Sep 16, 2024