added constituency parsing
huseinzol05 committed Aug 16, 2020
1 parent 38ffcb8 commit 7b18536
Showing 130 changed files with 66,267 additions and 907 deletions.
3 changes: 3 additions & 0 deletions README-pypi.rst
@@ -26,6 +26,9 @@ Features
- **Augmentation**

Augment any text using dictionary of synonym, Wordvector or Transformer-Bahasa.
- **Constituency Parsing**

Transfer learning on BERT-base-bahasa, Tiny-BERT-bahasa, Albert-base-bahasa, Albert-tiny-bahasa, XLNET-base-bahasa.
- **Dependency Parsing**

Transfer learning on BERT-base-bahasa, Tiny-BERT-bahasa, Albert-base-bahasa, Albert-tiny-bahasa, XLNET-base-bahasa, ALXLNET-base-bahasa.
3 changes: 3 additions & 0 deletions README.rst
@@ -46,6 +46,9 @@ Features
- **Augmentation**

Augment any text using dictionary of synonym, Wordvector or Transformer-Bahasa.
- **Constituency Parsing**

Transfer learning on BERT-base-bahasa, Tiny-BERT-bahasa, Albert-base-bahasa, Albert-tiny-bahasa, XLNET-base-bahasa.
- **Dependency Parsing**

Transfer learning on BERT-base-bahasa, Tiny-BERT-bahasa, Albert-base-bahasa, Albert-tiny-bahasa, XLNET-base-bahasa, ALXLNET-base-bahasa.
Binary file added accuracy/constituency-accuracy.png
31 changes: 31 additions & 0 deletions accuracy/constituency-template.js
@@ -0,0 +1,31 @@
option = {
    xAxis: {
        type: 'category',
        axisLabel: {
            interval: 0,
            rotate: 30
        },
        data: ['bert-base (470MB)', 'tiny-bert (125MB)',
            'albert-base (180MB)', 'tiny-albert (56.7MB)',
            'xlnet-base (498MB)']
    },
    yAxis: {
        type: 'value',
        min: 69,
        max: 82
    },
    grid: {
        bottom: 100
    },
    backgroundColor: 'rgb(252,252,252)',
    series: [{
        data: [80.35, 74.89, 79.01, 70.84, 81.43],
        type: 'bar',
        label: {
            normal: {
                show: true,
                position: 'top'
            }
        },
    }]
};
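
The template above starts the y-axis at 69 rather than 0, which is the kind of distortion the accuracy notes warn about. As a rough illustration (not part of the commit), the Python sketch below reuses the F-scores from the template and compares the true ratio between the best and worst model with the ratio of the bar heights as drawn. The variable and function names are illustrative assumptions.

```python
# Values copied from accuracy/constituency-template.js.
models = ['bert-base', 'tiny-bert', 'albert-base', 'tiny-albert', 'xlnet-base']
f_scores = [80.35, 74.89, 79.01, 70.84, 81.43]
Y_MIN = 69  # yAxis.min in the template


def drawn_height(score, y_min=Y_MIN):
    """Height of a bar when the axis starts at y_min instead of 0."""
    return score - y_min


best_score = max(f_scores)
worst_score = min(f_scores)
best_model = models[f_scores.index(best_score)]
worst_model = models[f_scores.index(worst_score)]

# Ratio of the actual F-scores (what the numbers say).
true_ratio = best_score / worst_score
# Ratio of the bar heights on the truncated axis (what the eye sees).
drawn_ratio = drawn_height(best_score) / drawn_height(worst_score)

print(f'best: {best_model}, worst: {worst_model}')
print(f'true ratio: {true_ratio:.2f}x, drawn ratio: {drawn_ratio:.2f}x')
```

The best bar appears several times taller than the worst one, although the underlying scores differ by roughly 15%, so the labels on top of the bars are the values to read.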
87 changes: 35 additions & 52 deletions accuracy/models-accuracy.ipynb

Large diffs are not rendered by default.

40 changes: 20 additions & 20 deletions accuracy/models-accuracy.rst
@@ -1,5 +1,5 @@
**All y-axis been distorted and this can cause misrepresents data and
incorrect conclusion.**
**All y-axes have been purposely distorted, which can misrepresent the
data and lead to incorrect conclusions.**

Abstractive Summarization
-------------------------
@@ -16,12 +16,6 @@ dataset available inside the notebooks. All training sessions stored in
display(Image('abstractive-accuracy.png', width=500))
.. image:: models-accuracy_files/models-accuracy_2_0.png
:width: 500px


Full score at ``malaya.summarization.abstractive._t5_availability``.

.. code:: python
@@ -54,12 +48,6 @@ dataset available inside the notebooks. All training sessions stored in
display(Image('dependency-accuracy.png', width=500))
.. image:: models-accuracy_files/models-accuracy_5_0.png
:width: 500px


bert-base
^^^^^^^^^

@@ -375,12 +363,6 @@ dataset available inside the notebooks. All training sessions stored in
display(Image('emotion-accuracy.png', width=500))
.. image:: models-accuracy_files/models-accuracy_13_0.png
:width: 500px


multinomial
^^^^^^^^^^^

@@ -1759,3 +1741,21 @@ dataset available inside the notebooks. All training sessions stored in
:width: 500px


Constituency Parsing
--------------------

Trained on 80% of the dataset and tested on the remaining 20%. A link to
download the dataset is available inside the notebooks. All training sessions are stored in
`session/constituency <https://github.com/huseinzol05/Malaya/tree/master/session/constituency>`__.

**Graph based on FScore**.

.. code:: ipython3

    display(Image('constituency-accuracy.png', width=500))

.. image:: models-accuracy_files/models-accuracy_91_0.png
   :width: 500px
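
The 80/20 split mentioned above could look like the following minimal sketch. It uses only the standard library; ``split_80_20`` and the toy sentences are hypothetical, and the actual notebooks may split the data differently.

```python
import random


def split_80_20(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists.

    Hypothetical helper for illustration; not part of Malaya.
    """
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy stand-in for the parsed sentences in the real dataset.
sentences = [f'sentence {i}' for i in range(10)]
train, test = split_80_20(sentences)
print(len(train), len(test))
```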

9 changes: 9 additions & 0 deletions docs/Api.rst
@@ -21,6 +21,12 @@ malaya.cluster
.. automodule:: malaya.cluster
    :members:

malaya.constituency
---------------------

.. automodule:: malaya.constituency
    :members:

malaya.dependency
------------------

@@ -309,6 +315,9 @@ malaya.model.tf
.. autoclass:: malaya.model.tf.TRANSLATION()
    :members:

.. autoclass:: malaya.model.tf.CONSTITUENCY()
    :members:

malaya.model.ml
----------------------------------

9 changes: 9 additions & 0 deletions docs/Constituency.rst
@@ -0,0 +1,9 @@
Constituency Parsing
=====================

.. note::

    This tutorial is available as an IPython notebook
    `here <https://github.com/huseinzol05/Malaya/tree/master/example/constituency>`_.

.. include:: load-constituency.rst
2 changes: 2 additions & 0 deletions docs/Contributing.rst
@@ -12,6 +12,8 @@ Report bugs through `Github issue`_.
Please report relevant information and preferably code that exhibits the
problem.

Do not email us about issues; we will not respond to such emails. Submit a proper GitHub issue instead.

Fix Bugs
--------

3 changes: 3 additions & 0 deletions docs/README.rst
@@ -46,6 +46,9 @@ Features
- **Augmentation**

Augment any text using dictionary of synonym, Wordvector or Transformer-Bahasa.
- **Constituency Parsing**

Transfer learning on BERT-base-bahasa, Tiny-BERT-bahasa, Albert-base-bahasa, Albert-tiny-bahasa, XLNET-base-bahasa.
- **Dependency Parsing**

Transfer learning on BERT-base-bahasa, Tiny-BERT-bahasa, Albert-base-bahasa, Albert-tiny-bahasa, XLNET-base-bahasa, ALXLNET-base-bahasa.
10 changes: 8 additions & 2 deletions docs/index.rst
@@ -24,6 +24,7 @@ Contents:
Deployment
Api
Windows
Contributing

.. toctree::
:maxdepth: 2
@@ -69,10 +70,16 @@ Contents:
:maxdepth: 2
:caption: Tagging Module

Dependency
Entities
Pos

.. toctree::
   :maxdepth: 2
   :caption: Parsing Module

   Constituency
   Dependency

.. toctree::
:maxdepth: 2
:caption: Summarization Module
@@ -105,7 +112,6 @@ Contents:
:maxdepth: 2
:caption: Misc

Contributing
Crawler
Donation
Translator-malaya
96 changes: 53 additions & 43 deletions docs/load-clustering.rst
@@ -23,49 +23,59 @@
Generate scatter plot for unsupervised clustering
-------------------------------------------------

\```python

def cluster_scatter( corpus, vectorizer, num_clusters = 5, titles =
None, colors = None, stemming = True, stop_words = None, cleaning =
simple_textcleaning, clustering = KMeans, decomposition = MDS, ngram =
(1, 3), figsize = (17, 9), batch_size = 20, ): """ plot scatter plot on
similar text clusters.

::

Parameters
----------

corpus: list
vectorizer: class
num_clusters: int, (default=5)
size of unsupervised clusters.
titles: list
list of titles, length must same with corpus.
colors: list
list of colors, length must same with num_clusters.
stemming: bool, (default=True)
If True, sastrawi_stemmer will apply.
stop_words: list, (default=None)
list of stop words to remove. If None, default is malaya.texts._text_functions.STOPWORDS
ngram: tuple, (default=(1,3))
n-grams size to train a corpus.
cleaning: function, (default=simple_textcleaning)
function to clean the corpus.
batch_size: int, (default=10)
size of strings for each vectorization and attention. Only useful if use transformer vectorizer.

Returns
-------
dictionary: {
'X': X,
'Y': Y,
'labels': clusters,
'vector': transformed_text_clean,
'titles': titles,
}
"""
```
.. code:: python

    def cluster_scatter(
        corpus,
        vectorizer,
        num_clusters = 5,
        titles = None,
        colors = None,
        stemming = True,
        stop_words = None,
        cleaning = simple_textcleaning,
        clustering = KMeans,
        decomposition = MDS,
        ngram = (1, 3),
        figsize = (17, 9),
        batch_size = 20,
    ):
        """
        Plot a scatter plot of similar text clusters.

        Parameters
        ----------
        corpus: list
        vectorizer: class
        num_clusters: int, (default=5)
            size of unsupervised clusters.
        titles: list
            list of titles, length must be the same as corpus.
        colors: list
            list of colors, length must be the same as num_clusters.
        stemming: bool, (default=True)
            If True, sastrawi_stemmer will be applied.
        stop_words: list, (default=None)
            list of stop words to remove. If None, default is malaya.texts._text_functions.STOPWORDS
        ngram: tuple, (default=(1,3))
            n-grams size to train a corpus.
        cleaning: function, (default=simple_textcleaning)
            function to clean the corpus.
        batch_size: int, (default=20)
            size of strings for each vectorization and attention. Only useful when using a transformer vectorizer.

        Returns
        -------
        dictionary: {
            'X': X,
            'Y': Y,
            'labels': clusters,
            'vector': transformed_text_clean,
            'titles': titles,
        }
        """

.. code:: python