
#174 Clean the data #14

Closed
wants to merge 9 commits into from

Conversation


@SnehaHS65 SnehaHS65 commented Aug 11, 2024

Here is what I found about the dataset

  1. Title is the title of the book.
  2. Description is a short description of what the book contains.
  3. Attribution_url is the link to the book online.
  4. Chapters contains all the chapters of the book in JSON format.
     E.g. Title: Insects, Chapters: Grasshopper, Ants, etc.

So title, description, and chapters are the only columns useful for training the model. Hence, I have preprocessed these columns in this PR.
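The column selection described above can be sketched with pandas. The sample values and exact column names below are assumptions for illustration, not taken from the real dataset:

```python
import pandas as pd

# Tiny in-memory sample standing in for the real dataset
# (column names and sample values are assumed for illustration)
storybooks_pd = pd.DataFrame({
    "title": ["Insects"],
    "description": ["A small book about insects"],
    "attribution_url": ["https://example.org/insects"],
    "chapters": ['[{"text": "Grasshopper"}, {"text": "Ants"}]'],
})

# Keep only the columns that are useful for training the model
useful_columns = ["title", "description", "chapters"]
storybooks_pd = storybooks_pd[useful_columns]
print(list(storybooks_pd.columns))
```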

Also, is there no coderabbit for this?

@SnehaHS65 SnehaHS65 requested a review from a team as a code owner August 11, 2024 04:13
@SnehaHS65 SnehaHS65 self-assigned this Aug 11, 2024
@jo-elimu
Member

Also is there no coderabbit for this?

@coderabbitai Where are you? 🐇🐇🐇


@jo-elimu jo-elimu left a comment


@SnehaHS65 I'm unable to run the code because of a ModuleNotFoundError:

```
pip install -r requirements.txt; python prepare_data.py;
Traceback (most recent call last):
  File "prepare_data.py", line 4, in <module>
    import nltk
ModuleNotFoundError: No module named 'nltk'
```

Remember to add the new modules (nltk, sklearn, scipy) to requirements.txt so that your code can also run on other machines.
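For reference, the added requirements.txt entries might look like the lines below (unpinned here because the exact versions are not part of this comment). Note that the `sklearn` import is provided by the `scikit-learn` package on PyPI:

```
nltk
scikit-learn
scipy
```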

@SnehaHS65
Contributor Author

@jo-elimu, I added the new modules to requirements.txt and created a new Python file to download the NLTK data. Steps to run:

```
pip install -r requirements.txt
python download_nltk_data.py
python prepare_data.py
```

Fixes build error with Python 3.9:

```
ERROR: Ignored the following yanked versions: 1.11.0
ERROR: Ignored the following versions that require a different python version: 1.14.0 Requires-Python >=3.10; 1.14.0rc1 Requires-Python >=3.10; 1.14.0rc2 Requires-Python >=3.10; 2.1.0rc1 Requires-Python >=3.10
```
@jo-elimu
Member

@SnehaHS65 I mistakenly added the original requirements.txt file as a binary file, so I corrected that by storing it as a text file instead: e355e77


@jo-elimu jo-elimu left a comment


added the new modules to requirements.txt and created a new Python file to download the NLTK data. Steps to run:

```
pip install -r requirements.txt
python download_nltk_data.py
python prepare_data.py
```

Thank you, @SnehaHS65.

Instead of running two commands, can you trigger download_nltk_data.py from within prepare_data.py? That would make the code easier to run and maintain.

@SnehaHS65
Contributor Author

@jo-elimu, can you please review and approve?

```python
storybooks_pd['combined_chapters_text'] = storybooks_pd['chapters'].apply(extract_chapters_text)
print(f"storybooks_pd_new: \n{storybooks_pd}")

# Removing stop words
```
Member


@SnehaHS65 What are stop words, and why is it necessary to remove them from the storybooks?

Contributor Author


@jo-elimu, these are the basic NLP preprocessing steps we follow to train an ML model:

  1. Lowercase the text if your problem statement allows.
  2. Tokenize the text. This can be at the sentence level or word level.
  3. Lemmatize or stem the tokens. This reduces the tokens to their base form.
  4. Remove the following: stop words, punctuation marks, hyperlinks, smileys, email IDs, etc. Basically, anything that is not needed for classification.
  5. Vectorize the text using a BOW or TF-IDF approach.
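Steps 1, 2, and 5 can be sketched with scikit-learn, whose `TfidfVectorizer` lowercases and tokenizes internally by default (the sample texts below are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sample documents for illustration
texts = ["The Grasshopper and the Ants", "Ants work all summer"]

# TfidfVectorizer lowercases and tokenizes by default, then builds TF-IDF vectors
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)
print(tfidf.shape)  # (number of documents, vocabulary size)
```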

Stopwords are common words in a language that are often filtered out during NLP tasks because they are considered to carry little meaning or contribute minimally to the overall understanding of a text. Examples include “the,” “is,” “and,” “in,” etc.

But since this dataset is in Hindi, I collected Hindi stop words, stored them in a file, and removed them from the required columns.
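A minimal sketch of the stop-word removal step; the tiny Hindi stop-word set and whitespace tokenization below are simplifications for illustration, while the PR loads its full list from a file:

```python
# Tiny illustrative subset of Hindi stop words; the PR loads a full list from a file
hindi_stopwords = {"और", "है", "के", "की", "में"}

def remove_stopwords(text):
    # Whitespace tokenization is a simplification; the PR may tokenize differently
    tokens = text.split()
    return " ".join(t for t in tokens if t not in hindi_stopwords)

print(remove_stopwords("टिड्डा और चींटी घास में है"))  # -> "टिड्डा चींटी घास"
```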

prepare_data.py Outdated

```python
# Function to run another Python script
def run_script(script_name):
    subprocess.check_call(['python', script_name])
```

@jo-elimu jo-elimu Aug 15, 2024


@SnehaHS65 The subprocess code is not working for me:

```
pip install -r requirements.txt; python prepare_data.py;
  File "prepare_data.py", line 12, in run_script
    subprocess.check_call(['python', script_name])
  File "subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['python', 'download_nltk_data.py']' returned non-zero exit status 1.
```

Can we use `import download_nltk_data` instead?
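One way to sketch that suggestion is to call `nltk.download` directly from prepare_data.py rather than spawning a subprocess. The resource names below are assumptions; adjust them to whatever download_nltk_data.py actually fetches:

```python
import nltk

# Resource names are assumptions; adjust to what download_nltk_data.py fetches
REQUIRED_RESOURCES = ["stopwords", "punkt", "wordnet"]

def ensure_nltk_data(resources=REQUIRED_RESOURCES):
    """Download each NLTK resource, tolerating failures (e.g. when offline)."""
    for resource in resources:
        try:
            nltk.download(resource, quiet=True)
        except Exception as exc:
            print(f"Could not download {resource}: {exc}")

ensure_nltk_data()
```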

```
@@ -1 +1,7 @@
pandas==2.2.2
numpy==2.0.1
nltk==3.8.2
```

@jo-elimu
Member

@SnehaHS65 Please note that there is now an alternative solution in the pmml folder.

That model currently scores 0.33 as the Mean Absolute Error (MAE). And with reading levels 1-4, I guess being off by 0.33 is not too bad.

Anyway, there is room for improvement. So if you think adding NLP to the mix can lower the Mean Absolute Error from 0.33, feel free to try that as one of the experiments.

@SnehaHS65
Contributor Author

@SnehaHS65 Please note that there is now an alternative solution in the pmml folder.

That model currently scores 0.33 as the Mean Absolute Error (MAE). And with reading levels 1-4, I guess being off by 0.33 is not too bad.

Anyway, there is room for improvement. So if you think adding NLP to the mix can lower the Mean Absolute Error from 0.33, feel free to try that as one of the experiments.

@jo-elimu, is the model already done now? Should I close the PR?

@jo-elimu
Member

jo-elimu commented Aug 20, 2024

is the model already done now? Should I close the PR?

Yes, we now have one model that is currently working in production: elimu-ai/webapp#1822

However, there are probably many ways its accuracy can be improved. So if you want to continue your initial strategy of using Natural Language Processing (NLP) for this, please feel free to try adding that as another feature column to the preprocessed CSV (as part of #4).

With each new feature, run the model training again, and see how it affects the prediction score.
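For that comparison, scikit-learn's `mean_absolute_error` can score each experiment against the held-out reading levels; the labels below are made up for illustration:

```python
from sklearn.metrics import mean_absolute_error

# Made-up reading levels (1-4) for illustration: true vs. predicted
y_true = [1, 2, 3, 4, 2]
y_pred = [1, 2, 3, 3, 2]

mae = mean_absolute_error(y_true, y_pred)
print(f"MAE: {mae:.2f}")  # one prediction off by 1 over 5 samples -> 0.20
```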

@nya-elimu
Member

@coderabbitai

@jo-elimu jo-elimu closed this Sep 16, 2024