Commit 0cc9e73 (1 parent: 68e0b23)
Showing 18 changed files with 5,649 additions and 17 deletions.
@@ -0,0 +1,303 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Jawi-to-Rumi" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"<div class=\"alert alert-info\">\n", | ||
"\n", | ||
"This tutorial is available as an IPython notebook at [Malaya/example/jawi-rumi](https://github.com/huseinzol05/Malaya/tree/master/example/jawi-rumi).\n", | ||
" \n", | ||
"</div>" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"<div class=\"alert alert-info\">\n", | ||
"\n", | ||
"This module trained on both standard and local (included social media) language structures, so it is save to use for both.\n", | ||
" \n", | ||
"</div>" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Explanation\n", | ||
"\n", | ||
"Originally from https://www.ejawi.net/converterV2.php?go=rumi able to convert Rumi to Jawi using heuristic method. So Malaya convert from heuristic and map it using deep learning model by inverse the dataset.\n", | ||
"\n", | ||
"`چوميل` -> `comel`" | ||
] | ||
}, | ||
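{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Below is a toy sketch of the dataset-inversion idea described above; it is not Malaya's actual training pipeline. A small dictionary stands in for the ejawi.net heuristic converter.\n", | ||
"\n", | ||
"```python\n", | ||
"# Toy stand-in for the heuristic Rumi -> Jawi converter (hypothetical data).\n", | ||
"heuristic_rumi_to_jawi = {'comel': 'چوميل', 'makan': 'ماكن'}\n", | ||
"\n", | ||
"# Invert the direction: the heuristic output (Jawi) becomes the model input,\n", | ||
"# and the original Rumi word becomes the target.\n", | ||
"pairs = [(jawi, rumi) for rumi, jawi in heuristic_rumi_to_jawi.items()]\n", | ||
"print(pairs)  # [('چوميل', 'comel'), ('ماكن', 'makan')]\n", | ||
"```" | ||
] | ||
}, | ||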
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"CPU times: user 5.95 s, sys: 1.15 s, total: 7.1 s\n", | ||
"Wall time: 9.05 s\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"%%time\n", | ||
"import malaya" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Use deep learning model\n", | ||
"\n", | ||
"Load LSTM + Bahdanau Attention Jawi to Rumi model.\n", | ||
"\n", | ||
"If you are using Tensorflow 2, make sure Tensorflow Addons already installed,\n", | ||
"\n", | ||
"```bash\n", | ||
"pip install tensorflow-addons U\n", | ||
"```" | ||
] | ||
}, | ||
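{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"If you are on Tensorflow 2, an optional sanity check (not part of the original notebook) is to confirm both packages import cleanly before loading the model,\n", | ||
"\n", | ||
"```python\n", | ||
"# Optional: verify Tensorflow and Tensorflow Addons are importable.\n", | ||
"import tensorflow as tf\n", | ||
"import tensorflow_addons as tfa\n", | ||
"\n", | ||
"print(tf.__version__, tfa.__version__)\n", | ||
"```" | ||
] | ||
}, | ||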
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"```python\n", | ||
"def deep_model(quantized: bool = False, **kwargs):\n", | ||
" \"\"\"\n", | ||
" Load LSTM + Bahdanau Attention Rumi to Jawi model.\n", | ||
" Original size 11MB, quantized size 2.92MB .\n", | ||
" CER on test set: 0.09239719040982326\n", | ||
" WER on test set: 0.33811816744187656\n", | ||
"\n", | ||
" Parameters\n", | ||
" ----------\n", | ||
" quantized : bool, optional (default=False)\n", | ||
" if True, will load 8-bit quantized model.\n", | ||
" Quantized model not necessary faster, totally depends on the machine.\n", | ||
"\n", | ||
" Returns\n", | ||
" -------\n", | ||
" result: malaya.model.tf.Seq2SeqLSTM class\n", | ||
" \"\"\"\n", | ||
"```" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"application/vnd.jupyter.widget-view+json": { | ||
"model_id": "530a47ea5c514ae9aa68c8a4e1e29d9c", | ||
"version_major": 2, | ||
"version_minor": 0 | ||
}, | ||
"text/plain": [ | ||
"HBox(children=(FloatProgress(value=0.0, description='Downloading', max=11034253.0, style=ProgressStyle(descrip…" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
}, | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"model = malaya.jawi_rumi.deep_model()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Load Quantized model\n", | ||
"\n", | ||
"To load 8-bit quantized model, simply pass `quantized = True`, default is `False`.\n", | ||
"\n", | ||
"We can expect slightly accuracy drop from quantized model, and not necessary faster than normal 32-bit float model, totally depends on machine." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stderr", | ||
"output_type": "stream", | ||
"text": [ | ||
"Load quantized model will cause accuracy drop.\n" | ||
] | ||
}, | ||
{ | ||
"data": { | ||
"application/vnd.jupyter.widget-view+json": { | ||
"model_id": "6d1d22a65abd48a28f9a1eb62f2d0c4d", | ||
"version_major": 2, | ||
"version_minor": 0 | ||
}, | ||
"text/plain": [ | ||
"HBox(children=(FloatProgress(value=0.0, description='Downloading', max=2926859.0, style=ProgressStyle(descript…" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
}, | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"quantized_model = malaya.jawi_rumi.deep_model(quantized = True)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"#### Predict\n", | ||
"\n", | ||
"```python\n", | ||
"def predict(self, strings: List[str], beam_search: bool = False):\n", | ||
" \"\"\"\n", | ||
" Convert to target string.\n", | ||
"\n", | ||
" Parameters\n", | ||
" ----------\n", | ||
" strings : List[str]\n", | ||
" beam_search : bool, (optional=False)\n", | ||
" If True, use beam search decoder, else use greedy decoder.\n", | ||
"\n", | ||
" Returns\n", | ||
" -------\n", | ||
" result: List[str]\n", | ||
" \"\"\"\n", | ||
"```\n", | ||
"\n", | ||
"If want to speed up the inference, set `beam_search = False`." | ||
] | ||
}, | ||
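{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The cells below use the default greedy decoder. Beam search decoding, exposed by the `beam_search` parameter above, would be called like this (slower, output not shown here),\n", | ||
"\n", | ||
"```python\n", | ||
"# Beam search decoding; slower than the default greedy decoder.\n", | ||
"model.predict(['ساي سوك ماكن ايم'], beam_search = True)\n", | ||
"```" | ||
] | ||
}, | ||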
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"['saya suka makan im',\n", | ||
" 'eak ack kotok',\n", | ||
" 'aisuk berthday saya, jegan lupa bawak hadiah']" | ||
] | ||
}, | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"model.predict(['ساي سوك ماكن ايم', 'اياق اچق كوتوق', 'ايسوق بيرثداي ساي، جڬن لوڤا باوق هديه'])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"['saya suka makan im',\n", | ||
" 'eak ack kotok',\n", | ||
" 'aisuk berthday saya, jegan lopa bawak hadiah']" | ||
] | ||
}, | ||
"execution_count": 5, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"quantized_model.predict(['ساي سوك ماكن ايم', 'اياق اچق كوتوق', 'ايسوق بيرثداي ساي، جڬن لوڤا باوق هديه'])" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.7.7" | ||
}, | ||
"varInspector": { | ||
"cols": { | ||
"lenName": 16, | ||
"lenType": 16, | ||
"lenVar": 40 | ||
}, | ||
"kernels_config": { | ||
"python": { | ||
"delete_cmd_postfix": "", | ||
"delete_cmd_prefix": "del ", | ||
"library": "var_list.py", | ||
"varRefreshCmd": "print(var_dic_list())" | ||
}, | ||
"r": { | ||
"delete_cmd_postfix": ") ", | ||
"delete_cmd_prefix": "rm(", | ||
"library": "var_list.r", | ||
"varRefreshCmd": "cat(var_dic_list()) " | ||
} | ||
}, | ||
"types_to_exclude": [ | ||
"module", | ||
"function", | ||
"builtin_function_or_method", | ||
"instance", | ||
"_Feature" | ||
], | ||
"window_display": false | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |