added constituency parsing

mesolitica · Aug 16, 2020 · 7b18536
1 parent 38ffcb8
commit 7b18536
Showing 130 changed files with 66,267 additions and 907 deletions.
diff --git a/README-pypi.rst b/README-pypi.rst
@@ -26,6 +26,9 @@ Features
 -  **Augmentation**
 
    Augment any text using dictionary of synonym, Wordvector or Transformer-Bahasa.
+-  **Constituency Parsing**
+
+   Transfer learning on BERT-base-bahasa, Tiny-BERT-bahasa, Albert-base-bahasa, Albert-tiny-bahasa, XLNET-base-bahasa.  
 -  **Dependency Parsing**
 
    Transfer learning on BERT-base-bahasa, Tiny-BERT-bahasa, Albert-base-bahasa, Albert-tiny-bahasa, XLNET-base-bahasa, ALXLNET-base-bahasa.

diff --git a/README.rst b/README.rst
@@ -46,6 +46,9 @@ Features
 -  **Augmentation**
 
    Augment any text using dictionary of synonym, Wordvector or Transformer-Bahasa.
+-  **Constituency Parsing**
+
+   Transfer learning on BERT-base-bahasa, Tiny-BERT-bahasa, Albert-base-bahasa, Albert-tiny-bahasa, XLNET-base-bahasa.  
 -  **Dependency Parsing**
 
    Transfer learning on BERT-base-bahasa, Tiny-BERT-bahasa, Albert-base-bahasa, Albert-tiny-bahasa, XLNET-base-bahasa, ALXLNET-base-bahasa.

diff --git a/accuracy/constituency-accuracy.png b/accuracy/constituency-accuracy.png
diff --git a/accuracy/constituency-template.js b/accuracy/constituency-template.js
@@ -0,0 +1,31 @@
+option = {
+    xAxis: {
+        type: 'category',
+        axisLabel: {
+            interval: 0,
+            rotate: 30
+        },
+        data: ['bert-base (470MB)', 'tiny-bert (125MB)',
+            'albert-base (180MB)', 'tiny-albert (56.7MB)',
+            'xlnet-base (498MB)']
+    },
+    yAxis: {
+        type: 'value',
+        min: 69,
+        max: 82
+    },
+    grid: {
+        bottom: 100
+    },
+    backgroundColor: 'rgb(252,252,252)',
+    series: [{
+        data: [80.35, 74.89, 79.01, 70.84, 81.43],
+        type: 'bar',
+        label: {
+            normal: {
+                show: true,
+                position: 'top'
+            }
+        },
+    }]
+};
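The chart options above start the y-axis at 69 rather than 0, which is the distortion the accuracy page warns about. A minimal sketch of the effect, using only the scores and `yAxis.min` from the template; the helper names `drawn_height` and `exaggeration` are illustrative and not part of the repository:

```python
# Scores and axis minimum copied from constituency-template.js.
labels = ["bert-base", "tiny-bert", "albert-base", "tiny-albert", "xlnet-base"]
scores = [80.35, 74.89, 79.01, 70.84, 81.43]
Y_MIN = 69  # yAxis.min in the chart options


def drawn_height(score, y_min=Y_MIN):
    """Height of a bar as drawn when the axis starts at y_min instead of 0."""
    return score - y_min


def exaggeration(best, worst, y_min=Y_MIN):
    """Return (true ratio, drawn ratio) between two scores."""
    true_ratio = best / worst
    drawn_ratio = drawn_height(best, y_min) / drawn_height(worst, y_min)
    return true_ratio, drawn_ratio


best, worst = max(scores), min(scores)
true_ratio, drawn_ratio = exaggeration(best, worst)
print(f"true ratio:  {true_ratio:.2f}x")
print(f"drawn ratio: {drawn_ratio:.2f}x")
```

The xlnet-base score is about 1.15 times the tiny-albert score, but its bar is drawn about 6.76 times as tall, so the plot should be read from the value labels and not from the bar heights.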
diff --git a/accuracy/models-accuracy.ipynb b/accuracy/models-accuracy.ipynb
diff --git a/accuracy/models-accuracy.rst b/accuracy/models-accuracy.rst
@@ -1,5 +1,5 @@
-**All y-axis been distorted and this can cause misrepresents data and
-incorrect conclusion.**
+**All y-axes are deliberately truncated, which can misrepresent the
+data and lead to incorrect conclusions.**
 
 Abstractive Summarization
 -------------------------
@@ -16,12 +16,6 @@ dataset available inside the notebooks. All training sessions stored in
     
     display(Image('abstractive-accuracy.png', width=500))
 
-
-
-.. image:: models-accuracy_files/models-accuracy_2_0.png
-   :width: 500px
-
-
 Full score at ``malaya.summarization.abstractive._t5_availability``.
 
 .. code:: python
@@ -54,12 +48,6 @@ dataset available inside the notebooks. All training sessions stored in
 
     display(Image('dependency-accuracy.png', width=500))
 
-
-
-.. image:: models-accuracy_files/models-accuracy_5_0.png
-   :width: 500px
-
-
 bert-base
 ^^^^^^^^^
 
@@ -375,12 +363,6 @@ dataset available inside the notebooks. All training sessions stored in
     
     display(Image('emotion-accuracy.png', width=500))
 
-
-
-.. image:: models-accuracy_files/models-accuracy_13_0.png
-   :width: 500px
-
-
 multinomial
 ^^^^^^^^^^^
 
@@ -1759,3 +1741,21 @@ dataset available inside the notebooks. All training sessions stored in
    :width: 500px
 
 
+Constituency Parsing
+--------------------
+
+Trained on 80% of the dataset, tested on the remaining 20%. The download
+link for the dataset is inside the notebooks. All training sessions are stored in
+`session/constituency <https://github.com/huseinzol05/Malaya/tree/master/session/constituency>`__.
+
+**Graph based on FScore**.
+
+.. code:: ipython3
+
+    display(Image('constituency-accuracy.png', width=500))
+
+
+
+.. image:: models-accuracy_files/models-accuracy_91_0.png
+   :width: 500px
+
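The section added above reports FScore without saying how it is computed. A minimal sketch, assuming the usual labeled-bracket F1 for constituency parsing, where a predicted span counts as correct only if its label, start and end all match a gold span. The function name `bracket_f1` and the example spans are hypothetical, and comparing spans as sets is a simplification:

```python
def bracket_f1(gold_spans, pred_spans):
    """Labeled-bracket precision, recall and F1 for one sentence.

    Each span is a (label, start, end) tuple; spans are compared as sets.
    """
    gold, pred = set(gold_spans), set(pred_spans)
    matched = len(gold & pred)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans for a five-token sentence (not taken from the dataset).
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
pred = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
precision, recall, f1 = bracket_f1(gold, pred)
print(precision, recall, f1)
```

Here three of the four predicted brackets match, so precision, recall and F1 are all 0.75. The evaluation in the training notebooks may differ, for example in how punctuation or unary chains are handled.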
diff --git a/accuracy/models-accuracy_files/models-accuracy_91_0.png b/accuracy/models-accuracy_files/models-accuracy_91_0.png
diff --git a/docs/Api.rst b/docs/Api.rst
@@ -21,6 +21,12 @@ malaya.cluster
 .. automodule:: malaya.cluster
     :members:
 
+malaya.constituency
+---------------------
+
+.. automodule:: malaya.constituency
+    :members:
+
 malaya.dependency
 ------------------
 
@@ -309,6 +315,9 @@ malaya.model.tf
 .. autoclass:: malaya.model.tf.TRANSLATION()
     :members:
 
+.. autoclass:: malaya.model.tf.CONSTITUENCY()
+    :members:
+
 malaya.model.ml
 ----------------------------------
 

diff --git a/docs/Constituency.rst b/docs/Constituency.rst
@@ -0,0 +1,9 @@
+Constituency Parsing
+=====================
+
+.. note::
+
+    This tutorial is available as an IPython notebook
+    `here <https://github.com/huseinzol05/Malaya/tree/master/example/constituency>`_.
+
+.. include:: load-constituency.rst
diff --git a/docs/Contributing.rst b/docs/Contributing.rst
@@ -12,6 +12,8 @@ Report bugs through `Github issue`_.
 Please report relevant information and preferably code that exhibits the
 problem.
 
+Do not email us about issues; we will not respond to emails. Submit a proper GitHub issue instead.
+
 Fix Bugs
 --------
 

diff --git a/docs/README.rst b/docs/README.rst
@@ -46,6 +46,9 @@ Features
 -  **Augmentation**
 
    Augment any text using dictionary of synonym, Wordvector or Transformer-Bahasa.
+-  **Constituency Parsing**
+
+   Transfer learning on BERT-base-bahasa, Tiny-BERT-bahasa, Albert-base-bahasa, Albert-tiny-bahasa, XLNET-base-bahasa.  
 -  **Dependency Parsing**
 
    Transfer learning on BERT-base-bahasa, Tiny-BERT-bahasa, Albert-base-bahasa, Albert-tiny-bahasa, XLNET-base-bahasa, ALXLNET-base-bahasa.

diff --git a/docs/index.rst b/docs/index.rst
@@ -24,6 +24,7 @@ Contents:
    Deployment
    Api
    Windows
+   Contributing
 
 .. toctree::
    :maxdepth: 2
@@ -69,10 +70,16 @@ Contents:
    :maxdepth: 2
    :caption: Tagging Module
 
-   Dependency
    Entities
    Pos
 
+.. toctree::
+   :maxdepth: 2
+   :caption: Parsing Module
+
+   Constituency
+   Dependency
+
 .. toctree::
    :maxdepth: 2
    :caption: Summarization Module
@@ -105,7 +112,6 @@ Contents:
    :maxdepth: 2
    :caption: Misc
 
-   Contributing
    Crawler
    Donation
    Translator-malaya

diff --git a/docs/load-clustering.rst b/docs/load-clustering.rst
@@ -23,49 +23,59 @@
 Generate scatter plot for unsupervised clustering
 -------------------------------------------------
 
-\```python
-
-def cluster_scatter( corpus, vectorizer, num_clusters = 5, titles =
-None, colors = None, stemming = True, stop_words = None, cleaning =
-simple_textcleaning, clustering = KMeans, decomposition = MDS, ngram =
-(1, 3), figsize = (17, 9), batch_size = 20, ): """ plot scatter plot on
-similar text clusters.
-
-::
-
-   Parameters
-   ----------
-
-   corpus: list
-   vectorizer: class
-   num_clusters: int, (default=5)
-       size of unsupervised clusters.
-   titles: list
-       list of titles, length must same with corpus.
-   colors: list
-       list of colors, length must same with num_clusters.
-   stemming: bool, (default=True)
-       If True, sastrawi_stemmer will apply.
-   stop_words: list, (default=None)
-       list of stop words to remove. If None, default is malaya.texts._text_functions.STOPWORDS
-   ngram: tuple, (default=(1,3))
-       n-grams size to train a corpus.
-   cleaning: function, (default=simple_textcleaning)
-       function to clean the corpus.
-   batch_size: int, (default=10)
-       size of strings for each vectorization and attention. Only useful if use transformer vectorizer.
-
-   Returns
-   -------
-   dictionary: {
-       'X': X,
-       'Y': Y,
-       'labels': clusters,
-       'vector': transformed_text_clean,
-       'titles': titles,
-   }
-   """
-   ```
+.. code:: python
+
+
+   def cluster_scatter(
+       corpus,
+       vectorizer,
+       num_clusters = 5,
+       titles = None,
+       colors = None,
+       stemming = True,
+       stop_words = None,
+       cleaning = simple_textcleaning,
+       clustering = KMeans,
+       decomposition = MDS,
+       ngram = (1, 3),
+       figsize = (17, 9),
+       batch_size = 20,
+   ):
+       """
+       Plot a scatter plot of similar text clusters.
+
+       Parameters
+       ----------
+
+       corpus: list
+       vectorizer: class
+       num_clusters: int, (default=5)
+           size of unsupervised clusters.
+       titles: list
+           list of titles, length must be the same as corpus.
+       colors: list
+           list of colors, length must be the same as num_clusters.
+       stemming: bool, (default=True)
+           If True, sastrawi_stemmer will be applied.
+       stop_words: list, (default=None)
+           list of stop words to remove. If None, default is malaya.texts._text_functions.STOPWORDS
+       ngram: tuple, (default=(1,3))
+           n-grams size to train a corpus.
+       cleaning: function, (default=simple_textcleaning)
+           function to clean the corpus.
+       batch_size: int, (default=20)
+           size of strings for each vectorization and attention. Only useful with a transformer vectorizer.
+
+       Returns
+       -------
+       dictionary: {
+           'X': X,
+           'Y': Y,
+           'labels': clusters,
+           'vector': transformed_text_clean,
+           'titles': titles,
+       }
+       """
 
 .. code:: python