A powerful Jupyter Notebook-based toolkit for effortlessly managing, analyzing, and modifying CoNLL format annotations. Perfect for NLP researchers and practitioners working with named entity recognition (NER) datasets.

SakibAhmedShuva/CoNLL-Toolkit

🏷️ CoNLL-Toolkit

License: MIT Python 3.6+ Jupyter Notebook

A powerful Jupyter Notebook-based toolkit for effortlessly managing, analyzing, and modifying CoNLL format annotations. Perfect for NLP researchers and practitioners working with named entity recognition (NER) datasets. See the CoNLL files in the "samples" folder for examples of the compatible formats.

Note:

If your CoNLL file contains special characters, Python may fail to read it because the platform's default encoding (cp1252 on Windows) cannot decode them. In that case, use conll-toolkit_special_encoding.ipynb, which explicitly reads the file as UTF-8, the encoding most commonly used for text files with special characters.
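The encoding issue described in the note can be illustrated with a minimal sketch. The helper name `read_conll_text` and the demo file are assumptions made for illustration; they are not taken from either notebook. The point is only that passing `encoding="utf-8"` to `open` avoids relying on the platform default.

```python
import os
import tempfile

def read_conll_text(path, encoding="utf-8"):
    """Read a CoNLL file as text with an explicit encoding (hypothetical helper)."""
    with open(path, encoding=encoding) as handle:
        return handle.read()

# Demo: write a small file containing non-ASCII characters, then read it back.
sample = "Zürich B-LOC\nist O\nschön O\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".conll", encoding="utf-8", delete=False
) as tmp:
    tmp.write(sample)
    demo_path = tmp.name

text = read_conll_text(demo_path)
os.remove(demo_path)  # clean up the temporary demo file
print(text)
```

Without the explicit `encoding` argument, `open` falls back to a locale-dependent default, which is why the same file can load on one machine and fail on another.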

📋 Repository Contents

  • Jupyter Notebook with CoNLL editing functionality
  • Sample CoNLL file (yours.conll)

✨ Features

  • 📊 View Annotations: Instantly visualize all annotations in the CoNLL file along with the total count
  • 🏷️ Label Statistics: Analyze the distribution of labels in your dataset with detailed, sorted counts
  • 🔍 Search Labels: Find entities with specific labels/tags and track their occurrences
  • 🔍 Search Tokens: Find annotations containing specific tokens and track their occurrences
  • ✂️ Remove Labels: Selectively remove labels from annotations
  • 🔄 Merge Labels: Combine multiple labels into one
  • ✏️ Rename Labels: Easily batch rename labels using a dictionary mapping
  • ✂️ Delete Sentences: Selectively remove sentences containing particular labels
  • ✂️ Delete Useless Sentences: Remove sentences that contain no annotations or labels
  • ✂️ Delete Duplicate Sentences: Remove duplicate sentences
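
To make a few of the features above concrete, here is a rough sketch of label statistics, removal of unannotated sentences, and duplicate removal. It is not the toolkit's implementation. It assumes a simple layout (one whitespace-separated token and label per line, label in the last column, blank lines between sentences, `O` as the outside label), and all function names are hypothetical.

```python
from collections import Counter

def parse_conll(text):
    """Split CoNLL text into sentences, each a list of (token, label) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))  # first column: token, last: label
    if current:
        sentences.append(current)
    return sentences

def label_counts(sentences):
    """Count how often each label occurs across all sentences."""
    return Counter(label for sent in sentences for _, label in sent)

def drop_unlabeled(sentences, outside="O"):
    """Keep only sentences with at least one label other than `outside`."""
    return [s for s in sentences if any(label != outside for _, label in s)]

def drop_duplicates(sentences):
    """Remove exact duplicate sentences, keeping the first occurrence."""
    seen, unique = set(), []
    for sent in sentences:
        key = tuple(sent)
        if key not in seen:
            seen.add(key)
            unique.append(sent)
    return unique

demo = """John B-PER
lives O
in O
Florida B-LOC

It O
rains O

John B-PER
lives O
in O
Florida B-LOC
"""

sentences = parse_conll(demo)
counts = label_counts(sentences)
cleaned = drop_duplicates(drop_unlabeled(sentences))

for label, count in counts.most_common():  # sorted by descending frequency
    print(label, count)
print(len(sentences), "sentences before cleaning,", len(cleaned), "after")
```

Files with extra columns or document markers may need a more careful parser than this sketch.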

🚀 Getting Started

Prerequisites

  • Python 3.6+
  • Jupyter Notebook

Installation

  1. Clone the repository:
git clone https://github.com/SakibAhmedShuva/CoNLL-Toolkit.git
cd CoNLL-Toolkit
  2. Install required packages:
pip install jupyter
  3. Launch Jupyter Notebook:
jupyter notebook
  4. Open the conll-toolkit.ipynb notebook in your browser.

💻 Usage

Example Usage

# Initialize the ConllEditor
editor = ConllEditor('yours.conll')

# View annotations
editor.view_annotations()

# Get label statistics
editor.label_stats()

# Search for a specific label
editor.search_by_label('B-PER')

# Search for annotations containing a specific token
editor.search_by_token("Florida")

# Remove a label
editor.remove_label('O')

# Delete entire sentences containing specific labels
editor.delete_sentences_with_label("B-PER")

# Merge labels
editor.merge_labels(['B-PER', 'I-PER'], 'PER')

# Rename labels
editor.rename_labels({'B-ORG': 'B-COMPANY', 'I-ORG': 'I-COMPANY'})

# Save the modified file
editor.save('modified_yours.conll')
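
For readers curious about what renaming and merging amount to, the following standalone sketch applies a label mapping to in-memory sentences. It is an illustration under assumptions, not the code behind `ConllEditor`: the function names (`rename_in_sentences`, `merge_in_sentences`, `to_conll`) and the space-separated output format are hypothetical.

```python
def rename_in_sentences(sentences, mapping):
    """Return new sentences with each label replaced according to `mapping`."""
    return [
        [(token, mapping.get(label, label)) for token, label in sent]
        for sent in sentences
    ]

def merge_in_sentences(sentences, old_labels, new_label):
    """Merging is renaming several labels to the same target label."""
    return rename_in_sentences(sentences, {old: new_label for old in old_labels})

def to_conll(sentences):
    """Serialize sentences as 'token label' lines, blank line between sentences."""
    blocks = ("\n".join(f"{tok} {label}" for tok, label in sent) for sent in sentences)
    return "\n\n".join(blocks) + "\n"

data = [[("Acme", "B-ORG"), ("Corp", "I-ORG"), ("hired", "O"), ("Ann", "B-PER")]]

renamed = rename_in_sentences(data, {"B-ORG": "B-COMPANY", "I-ORG": "I-COMPANY"})
merged = merge_in_sentences(data, ["B-PER", "I-PER"], "PER")
output = to_conll(renamed)
print(output)
```

Treating a merge as a many-to-one rename keeps both operations on a single code path, and labels missing from the mapping pass through unchanged.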

🤝 Contributing

Contributions to this project are welcome. Please feel free to submit a Pull Request.

📄 License

This project is open source and available under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Inspired by the needs of the NLP community
  • Built with Python and Jupyter Notebook

🌐 Connect with Me

LinkedIn Kaggle LeetCode Email
