This tool creates a word corpus from locally available documents. A word corpus is required to build word embeddings for certain Natural Language Processing tasks.
This tool converts the documents in the documents folder into a single clean txt file that can then be passed to a word vector generator such as GloVe, created by Stanford.
- Final: Contains all the necessary files for the toolkit.
- documents: Put all the documents that you want to convert into this folder. Currently it accepts pdf and docx files.
- docx2text.py: Converts the passed docx file to text.
- pdf2text.py: Converts the passed pdf file to text.
- pipeline.py: Picks each file from the documents folder and passes it to the correct converter.
- run.py: Execute this file to get the text file output. It cleans the text generated from all the files and saves the resulting text file in the output folder.
- output: This folder holds the output text file.
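The routing step described for pipeline.py can be pictured roughly as follows. This is only a minimal sketch, not the actual contents of pipeline.py: the function names `pick_converter` and `convert_folder`, and the idea of passing the converters in as a dictionary keyed by file extension, are assumptions made for illustration.

```python
import os


def pick_converter(filename, converters):
    """Return the converter registered for the file's extension, or None."""
    extension = os.path.splitext(filename)[1].lower()
    return converters.get(extension)


def convert_folder(folder, converters):
    """Convert every supported file in `folder` and return the extracted texts.

    `converters` maps an extension such as ".pdf" to a function that takes
    a file path and returns its text (hypothetical interface).
    """
    texts = []
    for name in sorted(os.listdir(folder)):
        converter = pick_converter(name, converters)
        if converter is None:
            continue  # unsupported file type, skip it
        texts.append(converter(os.path.join(folder, name)))
    return texts
```

With two converter functions available, a call might look like `convert_folder("documents", {".pdf": pdf_to_text, ".docx": docx_to_text})`, where both function names are placeholders.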
The final output is a single text file that combines the text of all the files in the documents folder.
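The cleaning and combining step could look something like the sketch below. The exact cleaning rules used by run.py are not stated here, so the rules shown (lowercasing, dropping everything except the letters a to z, collapsing whitespace) are assumptions, as are the names `clean_text` and `write_corpus`. The sketch targets Python 3.

```python
import re


def clean_text(text):
    """Lowercase, replace non-letter characters with spaces, collapse whitespace.

    These rules are an assumed example of "cleaning", not the tool's own rules.
    """
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return " ".join(text.split())


def write_corpus(texts, output_path):
    """Clean each document's text and write them all to one space-separated file."""
    cleaned = [clean_text(text) for text in texts]
    corpus = " ".join(part for part in cleaned if part)  # drop empty documents
    with open(output_path, "w", encoding="utf-8") as handle:
        handle.write(corpus)
    return corpus
```

A single file of whitespace-separated tokens is a common input shape for word vector trainers, which is why the sketch joins everything with spaces rather than keeping document boundaries.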
Some sample documents are already saved in the documents folder so that you can test the tool quickly.
- Python: 2.7 and above
- For pdf conversion
  - pdfminer (Python 2.7)
  - pdfminer.six (Python 3)
- For docx conversion
  - python-docx
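For orientation, the two libraries listed above can each extract text in a few lines on Python 3. This is a sketch of one possible approach and may differ from what docx2text.py and pdf2text.py actually do. The function names are hypothetical, and the pdf example uses the pdfminer.six high-level API only, not the older Python 2.7 pdfminer package.

```python
def docx_to_text(path):
    """Return the text of a docx file, one paragraph per line (python-docx)."""
    from docx import Document  # imported lazily; provided by python-docx

    document = Document(path)
    return "\n".join(paragraph.text for paragraph in document.paragraphs)


def pdf_to_text(path):
    """Return the text of a pdf file (pdfminer.six, Python 3)."""
    from pdfminer.high_level import extract_text  # imported lazily

    return extract_text(path)
```

The imports sit inside the functions so that a missing pdf library does not prevent docx conversion, and vice versa. Note that this docx sketch reads body paragraphs only; text inside tables is not included.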