A Jupyter Notebook-based toolkit for managing, analyzing, and modifying CoNLL-format annotations, aimed at NLP researchers and practitioners working with named entity recognition (NER) datasets. See the CoNLL files in the "samples" folder for the compatible formats.
If your CoNLL file contains special characters, reading it may fail with an encoding error, because the platform default encoding (cp1252 on Windows) cannot decode them. In that case, use conll-toolkit_special_encoding.ipynb, which opens the file explicitly as UTF-8, the encoding most commonly used for text files with special characters.
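As an illustration of the encoding issue described above, here is a minimal sketch of reading a file with an explicit UTF-8 encoding in plain Python. The helper name `read_conll_lines` and the demo file are hypothetical and are not taken from the notebooks; only the built-in `open()` call with its `encoding` parameter is standard.

```python
import os
import tempfile


def read_conll_lines(path, encoding="utf-8"):
    """Read a text file with an explicit encoding instead of the platform default.

    Hypothetical helper for illustration; the notebooks may read files differently.
    """
    with open(path, "r", encoding=encoding) as handle:
        return handle.read().splitlines()


# Write a small demo file containing non-ASCII characters, then read it back.
sample = "José B-PER\nvisited O\nMünchen B-LOC\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.conll")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write(sample)
    lines = read_conll_lines(demo_path)

print(lines)
```

Passing `encoding="utf-8"` makes the result independent of the operating system's default, which is the usual cause of decode errors on Windows.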
- Jupyter Notebook with CoNLL editing functionality
- Sample CoNLL file (yours.conll)
- 📊 View Annotations: Instantly visualize all annotations in the CoNLL file along with the total count
- 🏷️ Label Statistics: Analyze the distribution of labels in your dataset with detailed, sorted counts
- 🔍 Search Labels: Find entities with specific labels/tags and track their occurrences
- 🔍 Search Tokens: Find annotations containing specific tokens and track their occurrences
- ✂️ Remove Labels: Selectively remove labels from annotations
- 🔄 Merge Labels: Combine multiple labels into one
- ✏️ Rename Labels: Easily batch rename labels using a dictionary mapping
- ✂️ Delete Sentences: Selectively remove sentences containing particular labels
- ✂️ Delete Useless Sentences: Remove sentences that contain no annotations or labels
- ✂️ Delete Duplicate Sentences: Remove duplicate sentences
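To make a few of the features above concrete, the following is a rough sketch of how label statistics, removal of unannotated sentences, and removal of duplicate sentences could work on in-memory data. It is not the toolkit's implementation. It assumes a simple layout that may differ from the files in the "samples" folder: one token per line, whitespace-separated columns with the label last, blank lines between sentences, and `O` marking tokens without an entity. All function names here are hypothetical.

```python
from collections import Counter


def parse_conll(text):
    """Split CoNLL text into sentences, each a list of (token, label) pairs.

    Assumes every non-blank line has at least two whitespace-separated columns.
    """
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def label_counts(sentences):
    """Count how often each label occurs across all sentences."""
    return Counter(label for sentence in sentences for _, label in sentence)


def drop_unannotated(sentences, outside="O"):
    """Keep only sentences with at least one label other than `outside`."""
    return [s for s in sentences if any(label != outside for _, label in s)]


def drop_duplicates(sentences):
    """Keep the first occurrence of each sentence, preserving order."""
    seen, unique = set(), []
    for sentence in sentences:
        key = tuple(sentence)
        if key not in seen:
            seen.add(key)
            unique.append(sentence)
    return unique


sample = """John B-PER
lives O
in O
Florida B-LOC

It O
rains O

John B-PER
lives O
in O
Florida B-LOC
"""

sentences = parse_conll(sample)
counts = label_counts(sentences)
cleaned = drop_duplicates(drop_unannotated(sentences))

# Print labels sorted from most to least frequent.
for label, count in counts.most_common():
    print(label, count)
print(len(sentences), "sentences before cleaning,", len(cleaned), "after")
```

In this sample, the second sentence carries only `O` labels and the third repeats the first, so a single sentence remains after cleaning.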
- Python 3.6+
- Jupyter Notebook
- Clone the repository:
git clone https://github.com/SakibAhmedShuva/CoNLL-Toolkit.git
cd CoNLL-Toolkit
- Install required packages:
pip install jupyter
- Launch Jupyter Notebook:
jupyter notebook
- Open the conll-toolkit.ipynb notebook in your browser.
# Initialize the ConllEditor
editor = ConllEditor('yours.conll')
# View annotations
editor.view_annotations()
# Get label statistics
editor.label_stats()
# Search for a specific label
editor.search_by_label('B-PER')
# Search for a specific token
editor.search_by_token("Florida")
# Remove a label
editor.remove_label('O')
# Delete entire sentences containing specific labels
editor.delete_sentences_with_label("B-PER")
# Merge labels
editor.merge_labels(['B-PER', 'I-PER'], 'PER')
# Rename labels
editor.rename_labels({'B-ORG': 'B-COMPANY', 'I-ORG': 'I-COMPANY'})
# Save the modified file
editor.save('modified_yours.conll')
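The rename and merge calls above both take old labels and map them to new ones. A minimal sketch of that idea on in-memory data follows, assuming sentences are lists of (token, label) pairs. The functions `apply_label_mapping` and `merge_into` are hypothetical names used only for illustration, and the notebook's `ConllEditor` may behave differently.

```python
def apply_label_mapping(sentences, mapping):
    """Return a copy of the sentences with labels replaced according to `mapping`.

    Labels that are not keys of `mapping` are left unchanged.
    """
    return [
        [(token, mapping.get(label, label)) for token, label in sentence]
        for sentence in sentences
    ]


def merge_into(sentences, labels_to_merge, new_label):
    """Merge several labels into one by mapping each of them to `new_label`."""
    mapping = {label: new_label for label in labels_to_merge}
    return apply_label_mapping(sentences, mapping)


sentences = [
    [("Acme", "B-ORG"), ("Corp", "I-ORG"), ("hired", "O"),
     ("John", "B-PER"), ("Smith", "I-PER")],
]

renamed = apply_label_mapping(sentences, {"B-ORG": "B-COMPANY", "I-ORG": "I-COMPANY"})
merged = merge_into(sentences, ["B-PER", "I-PER"], "PER")

print(renamed[0])
print(merged[0])
```

Treating a merge as a rename with several keys pointing to the same value keeps both operations in one code path. Note that merging `B-PER` and `I-PER` into a single `PER` label discards the begin/inside distinction of the BIO scheme.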
Contributions to this project are welcome. Please feel free to submit a Pull Request.
This project is open source and available under the MIT License - see the LICENSE file for details.
- Inspired by the needs of the NLP community
- Built with Python and Jupyter Notebook