Releases: urduhack/resources

News datasets

10 Dec 08:06
Pre-release

Urdupoint.com

Word Tokenizer Model

30 Jun 06:15

Zip contains model file.

POS Tagger Model

01 Jul 08:54

Zip contains 3 files.

NER Model

02 Jul 05:49

Zip contains 3 files.

Lemmatizer Model

06 Jul 07:26
lemmatizer

[readme] docs updated.

UHaT: Urdu handwritten text dataset

11 May 20:18
75a2915

UHaT Dataset

UHaT: Urdu Handwritten Text Dataset

This dataset contains handwritten characters and digits of the Urdu language. The samples were written by more than 900 individuals.

Description and organization.
Size of images: All images are stored at a resolution of 28 by 28 pixels.

Number of images: The training set contains an average of 700 images per character. For example, there are 811 training images for AYN and 697 training images for ALIF. Similarly, the training set contains an average of 700 images per digit. For example, there are 678 training images for the digit one. The test set contains an average of 140 images per character. For example, there are 145 test images for the character ALIF. The test set also contains an average of 140 images per digit. For example, there are 147 test images for the digit nine.

The dataset is organized into four sub-directories: characters training set, characters test set, digits training set, and digits test set. Each sub-directory contains one sub-folder per character. For example, all the training images for the character ALIF are placed in the sub-folder Alif.

The folder hierarchy is given as:

Data > characterstrainset > alif

Data > characterstrainset > ayn

And so on.
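
The per-class image counts quoted above can be checked against a local copy of the data. The following is a minimal sketch, assuming the archive has been extracted so that each split directory (for example Data/characterstrainset) holds one sub-folder per class. The function name count_images_per_class is hypothetical, and the actual folder names on disk may differ from those shown here.

```python
from pathlib import Path
from typing import Dict


def count_images_per_class(split_dir: str) -> Dict[str, int]:
    """Return {class_folder_name: number_of_files} for one split directory.

    Assumes one sub-folder per class, each holding that class's image files.
    """
    counts: Dict[str, int] = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            # Count regular files only; nested folders are ignored.
            counts[class_dir.name] = sum(1 for f in class_dir.iterdir() if f.is_file())
    return counts


if __name__ == "__main__":
    # Assumed path, taken from the hierarchy described above.
    split = Path("Data") / "characterstrainset"
    if split.is_dir():
        for name, n in count_images_per_class(str(split)).items():
            print(f"{name}: {n}")
```

Running it on the characters training split should report roughly 700 files per folder if the layout matches the description.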

How to load directly.

You can also load the dataset directly from the uhat_dataset.npz file. See the load_dataset kernel.
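
The load_dataset kernel is not reproduced here, so the names of the arrays inside uhat_dataset.npz are not known from this page. As a hedged starting point, the sketch below uses NumPy's np.load to list whatever arrays the archive contains, along with their shapes and dtypes. The helper name describe_npz is hypothetical, and the file is assumed to sit in the working directory.

```python
import os

import numpy as np


def describe_npz(path):
    """Return {array_name: (shape, dtype_string)} for every array in an .npz archive."""
    info = {}
    # np.load returns an NpzFile for .npz archives; .files lists the stored array names.
    with np.load(path) as archive:
        for name in archive.files:
            arr = archive[name]
            info[name] = (arr.shape, str(arr.dtype))
    return info


if __name__ == "__main__":
    # Assumed location of the downloaded file.
    npz_path = "uhat_dataset.npz"
    if os.path.exists(npz_path):
        for name, (shape, dtype) in describe_npz(npz_path).items():
            print(name, shape, dtype)
```

Once the array names are known, individual arrays can be read with archive[name] in the same way.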

Acknowledgements

Thanks to all volunteers who contributed by providing handwriting samples.

Inspiration

This is an MNIST-style dataset. The machine learning community in general will find it useful for experimenting with and demonstrating machine learning models.
The dataset also gives researchers an opportunity to work on Urdu text recognition.

Homepage https://www.kaggle.com/hazrat/uhat-urdu-handwritten-text-dataset

IMDB Dataset of 50K Movie translated Urdu Reviews

07 May 11:41
1aea16a

Urdu Sentiment Analysis dataset

This is a dataset for binary sentiment classification containing substantially more data than previous
benchmark datasets. We provide a set of 40,000 highly polar movie reviews for training and 10,000 for testing.
To increase the availability of sentiment analysis datasets for a low-resource language like Urdu,
we opted to use the already available IMDB Dataset. We translated this dataset using Google Translate.
This is a binary classification dataset with two classes: positive and negative.
We chose this dataset because of the high polarity of each class.
It contains 50k samples divided equally between the two classes.

Homepage https://www.kaggle.com/akkefa/imdb-dataset-of-50k-movie-translated-urdu-reviews