#174 Clean the data #14
Conversation
@coderabbitai Where are you? 🐇🐇🐇
@SnehaHS65 I'm unable to run the code because of a `ModuleNotFoundError`:

```
pip install -r requirements.txt; python prepare_data.py;

Traceback (most recent call last):
  File "prepare_data.py", line 4, in <module>
    import nltk
ModuleNotFoundError: No module named 'nltk'
```
Remember to add the new modules (`nltk`, `sklearn`, `scipy`) to `requirements.txt` so that your code can also run on other machines.
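For reference, a minimal sketch of the lines that would be added (left unpinned here; the exact versions are an assumption for the author to decide — note that `sklearn` is published on PyPI as `scikit-learn`):

```
nltk
scikit-learn
scipy
```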
@jo-elimu, added new modules to requirements.txt and created a new Python file to download NLTK data. Steps to run:

```
pip install -r requirements.txt
python download_nltk_data.py
python prepare_data.py
```
Fixes build error with Python 3.9:

```
ERROR: Ignored the following yanked versions: 1.11.0
ERROR: Ignored the following versions that require a different python version: 1.14.0 Requires-Python >=3.10; 1.14.0rc1 Requires-Python >=3.10; 1.14.0rc2 Requires-Python >=3.10; 2.1.0rc1 Requires-Python >=3.10
```
@SnehaHS65 I mistakenly added the original …, so I corrected that by storing it as a text file instead: e355e77
> added new modules to requirements.txt and created new python file to download nltk data. Steps to run:
>
> ```
> pip install -r requirements.txt
> python download_nltk_data.py
> python prepare_data.py
> ```
Thank you, @SnehaHS65.

Instead of running two commands, can you trigger `download_nltk_data.py` from within the `prepare_data.py` file? That would make the code easier to run and maintain.
@jo-elimu, can you please review and approve?
```python
storybooks_pd['combined_chapters_text'] = storybooks_pd['chapters'].apply(extract_chapters_text)
print(f"storybooks_pd_new: \n{storybooks_pd}")

# Removing stop words
```
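For context, a hypothetical sketch of what an `extract_chapters_text` helper could look like (the chapter structure is an assumption; the actual schema is not shown in this diff):

```python
def extract_chapters_text(chapters):
    # Assumed structure: a list of chapter dicts, each holding a 'text' field.
    if not isinstance(chapters, list):
        return ""
    return " ".join(str(chapter.get("text", "")) for chapter in chapters)
```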
@SnehaHS65 What are stop words, and why is it necessary to remove them from the storybooks?
@jo-elimu, these are the basic steps we follow in NLP to train the ML model:
- Lowercase the text if your problem statement allows.
- Tokenize the text. This can be at sentence level or word level.
- Lemmatize/stem the tokens. This reduces the tokens to their base form.
- Remove stop words, punctuation marks, hyperlinks, smileys, email IDs, etc. Basically anything that is not needed for classification.
- Vectorize the text using a BoW or TF-IDF approach. (A minimal sketch of the stop-word and vectorization steps follows below.)
Stop words are common words in a language that are often filtered out during NLP tasks because they carry little meaning or contribute minimally to the overall understanding of a text. Examples include “the,” “is,” “and,” “in,” etc.

But since this dataset is in Hindi, I collected Hindi stop words, stored them in a file, and removed them from the required columns.
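A minimal sketch of the stop-word removal and TF-IDF steps described above, assuming a `hindi_stopwords.txt` file with one word per line (the file name and sample sentences are illustrative; only the `combined_chapters_text` column name comes from the diff in this PR):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the collected Hindi stop words (assumed format: one word per line).
with open("hindi_stopwords.txt", encoding="utf-8") as f:
    hindi_stopwords = {line.strip() for line in f if line.strip()}

def remove_stopwords(text):
    # Word-level tokenization by whitespace; keep only non-stop-word tokens.
    return " ".join(tok for tok in text.split() if tok not in hindi_stopwords)

# Illustrative data; the real DataFrame comes from the storybooks dataset.
storybooks_pd = pd.DataFrame(
    {"combined_chapters_text": ["यह एक किताब है", "कीड़े घास में रहते हैं"]}
)
storybooks_pd["combined_chapters_text"] = (
    storybooks_pd["combined_chapters_text"].apply(remove_stopwords)
)

# Vectorize the cleaned text with TF-IDF for downstream model training.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(storybooks_pd["combined_chapters_text"])
print(X.shape)
```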
prepare_data.py (Outdated)
```python
import subprocess

# Function to run another Python script
def run_script(script_name):
    subprocess.check_call(['python', script_name])
```
@SnehaHS65 The `subprocess` code is not working for me:

```
pip install -r requirements.txt; python prepare_data.py;

  File "prepare_data.py", line 12, in run_script
    subprocess.check_call(['python', script_name])
  File "subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['python', 'download_nltk_data.py']' returned non-zero exit status 1.
```
Can we use `import download_nltk_data` instead?
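A minimal sketch of that import-based approach (the `ensure_nltk_data` helper and the specific NLTK packages are assumptions, not the PR's actual code):

```python
# download_nltk_data.py
import nltk

def ensure_nltk_data():
    # 'punkt' and 'stopwords' are assumed examples of the corpora needed.
    for package in ("punkt", "stopwords"):
        nltk.download(package, quiet=True)

# Runs both when executed directly and when imported by prepare_data.py.
ensure_nltk_data()
```

```python
# prepare_data.py
import download_nltk_data  # importing the module triggers the downloads
```

Python caches imports, so the download logic runs only once per interpreter session even if the module is imported again.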
```
@@ -1 +1,7 @@
pandas==2.2.2
numpy==2.0.1
nltk==3.8.2
```
@SnehaHS65 Looks like there is a problem with the `nltk` version you are using: https://github.com/elimu-ai/ml-storybook-reading-level/actions/runs/10441098460/job/28911811221?pr=14
@SnehaHS65 Please note that there is now an alternative solution in the …. That model currently scores 0.33 as the Mean Absolute Error (MAE). And with reading levels 1-4, I guess being off by 0.33 is not too bad.

Anyway, there is room for improvement. So if you think adding NLP to the mix can lower the Mean Absolute Error from 0.33, feel free to try that as one of the experiments.
@jo-elimu, is the model already done now? Should I close the PR?
Yes, we now have one model that is currently working in production: elimu-ai/webapp#1822

However, there are probably many ways its accuracy can be improved. So if you want to continue your initial strategy of using Natural Language Processing (NLP) for this, please feel free to try adding that as another feature column to the preprocessed CSV (as part of #4). With each new feature, run the model training again, and see how it affects the prediction score.
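As a hedged sketch of that experiment (the file path and the specific feature are illustrative, not the project's actual pipeline):

```python
import pandas as pd

# Load the preprocessed CSV (path is an assumption).
df = pd.read_csv("storybooks_preprocessed.csv")

# Example NLP-derived feature: average word length of the combined chapter text.
df["avg_word_length"] = df["combined_chapters_text"].fillna("").apply(
    lambda text: sum(len(word) for word in text.split()) / max(len(text.split()), 1)
)

df.to_csv("storybooks_preprocessed.csv", index=False)
# Re-run the model training and compare the new MAE against the 0.33 baseline.
```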
Here is what I found about the dataset.
For example: Title: Insects; Chapters: Grasshopper, Ants, etc.
So title, description, and chapters are the only useful things to train the model. Hence I have preprocessed these columns in this PR.
Also, is there no CodeRabbit for this?