Write technical documentation. WIP on #38.

cceh · Nov 27, 2019 · 4fa0988
1 parent 7a2fbb9
commit 4fa0988
Showing 101 changed files with 17,837 additions and 37,852 deletions.
diff --git a/doc_src/Makefile b/doc_src/Makefile
@@ -50,7 +50,7 @@ clean:
 
 .PHONY: html
 html:
-	$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)
+	$(SPHINXBUILD) -a -E -b html $(ALLSPHINXOPTS) $(BUILDDIR)
 	cp _config.yml $(BUILDDIR)/
 	@echo
 	@echo "Build finished. The HTML pages are in $(BUILDDIR)."

diff --git a/doc_src/collation_tool.rst b/doc_src/collation_tool.rst
@@ -1,52 +1,113 @@
-================
- Collation Tool
-================
+.. _collation-tool:
 
-The collation tool has a frontend and a backend component.
-The collatables are stored in TEI files.
 
+Collation Tool
+==============
+
+Description of the collation tool and the processing of the TEI files.
+
+
+Pre-Processing of the TEI files
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We extract every chapter of every capitular from all manuscripts and store them
+separately in the Postgres database on the application server.  The text stored
+in the database is already normalized.
+
+If a manuscript contains more than one copy of a chapter, all copies are
+extracted.  If a corrector hand was active in the chapter, an original and a
+corrected version are both extracted.  The collation tool knows about all these
+versions and offers them to the user.
 
 .. uml::
    :align: center
-   :caption: Collation Tool
+   :caption: Data flow during pre-processing
 
    skinparam backgroundColor transparent
    skinparam DefaultTextAlignment center
+   skinparam componentStyle uml2
+
+   database  "Manuscript files\n(XML+TEI)"      as tei
+   note left of tei: AFS:publ/mss/*.xml
+
+   cloud "VM" {
+     component "Cron"                          as cron
+     component "Makefile"                      as make
+     component "mss-extract-chapters.xsl"      as saxon
+     database  "Chapter files\n(plain text)"   as chapters
+     note left of chapters: AFS:publ/cache/extracted/*/*.txt
+     component "import.py"                     as import
+     database  "Database\n(Postgres)"          as db
+   }
+
+   tei      --> saxon
+   saxon    --> chapters
+   chapters --> import
+   import   --> db
+
+   cron .> make
+   make .> saxon
+   make .> import
+
+The Makefile is run by cron on the Capitularia VM at regular intervals.
+
+The manuscript files are in the AFS.  The AFS is mounted onto the VM.
 
-   component "Frontend\n\n(Javascript)" as client
+The Makefile knows all the dependencies between the files and runs the
+appropriate tools to keep the database up-to-date with the manuscript files.
+
+All intermediate files can be found in the cache/extracted directory.  There is
+one directory per manuscript and one file per chapter, copy, and hand.  The
+intermediate files are normalized, e.g. V is replaced by U.
+
+The import.py script imports the intermediate text files into the database.
+
+
+Collation
+~~~~~~~~~
+
+The collation tool is divided into two parts: a frontend written in JavaScript
+using the Vue.js library, and a backend application server written in Python.
+The backend retrieves the chapters to collate from the database and calls the
+CollateX executable to do the actual collation.  The results are sent to the
+frontend, which formats them for display.
+
+.. uml::
+   :align: center
+   :caption: Data flow during collation
+
+   skinparam backgroundColor transparent
+   skinparam DefaultTextAlignment center
+   skinparam componentStyle uml2
 
-   component "Backend\n\n(Wordpress Plugin)" as api
-   component "CollateX\n\n(Java)"  as cx
-   component "Transformation\n\n(XSLT)"       as xslt
+   cloud "VM" {
+     database  "Database\n(Postgres)"   as db
+     component "API Server\n(Python)"   as api
+     component "CollateX\n(Java)"       as cx
+   }
+   component "Frontend\n(Javascript)" as client
 
-   database "TEI Files"   as tei
+   db     --> api
+   api    --> client
+   api    <- cx
+   api    -> cx
 
-   client <-> api
-   api <-> cx
-   api <-- xslt
-   xslt <-- tei
 
-The frontend is written in Javascript using the VueJS library.  It communicates
-with the backend using AJAX calls.  The frontend displays the data to the user
-and lets the user manipulate it, while the backend does the actual collation.
+The collation unit is the chapter, so that only short texts need to be collated,
+saving much processing time.
 
-The collatables are stored in TEI files.  The backend has to be preprocessed
-them to obtain streams of words with all markup stripped.  The streams of words
-are then sent to CollateX to do the collation and finally to the frontend, who
-formats and displays them.
+We aim to reimplement all the CollateX functionality we need in Python or
+JavaScript and then drop the dependency on CollateX.
 
-The collatables are subdivided into capitularies and sections, so that only
-short texts need to be collated, saving much processing time.  The backend also
-extracts the wanted sections from the TEI files.
+The Wordpress cap-collation-user plugin delivers the JavaScript client to the
+user.  After that, all communication happens directly between the client and the
+application server.
 
-Currently and for historical reasons the backend is implemented as Wordpress
-plugin in PHP.  We aim to rewrite it ASAP using a Python application framework
-and at the same time we'll rewrite all the functionality we need of CollateX in
-Python and drop the dependency on CollateX and Java.
 
+.. _custom-collatex:
 
 Custom Version of CollateX
-==========================
+~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Our custom version of CollateX uses a custom word comparison function.
 

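Review note on doc_src/collation_tool.rst: the new text says the stored chapter texts are normalized (V replaced by U) and that the backend passes the chapters to CollateX. The following is a minimal sketch of that idea, not the project's implementation. It assumes CollateX's JSON input format (a "witnesses" list of objects with "id" and "content"). The function names, the witness ids and the whitespace collapsing are hypothetical, and in the project the normalization happens during pre-processing rather than at collation time.

```python
from __future__ import annotations

import json


def normalize(text: str) -> str:
    """Illustrative normalization only.

    Replaces V by U (and v by u), as mentioned in the documentation.
    Collapsing runs of whitespace is an extra assumption of this sketch.
    """
    text = text.replace("V", "U").replace("v", "u")
    return " ".join(text.split())


def build_collatex_input(versions: dict[str, str]) -> str:
    """Build a CollateX JSON input document from chapter versions.

    `versions` maps a hypothetical witness id (for example one id per
    manuscript, copy and hand) to the text of that version of the chapter.
    """
    witnesses = [
        {"id": witness_id, "content": normalize(text)}
        for witness_id, text in sorted(versions.items())
    ]
    return json.dumps({"witnesses": witnesses}, ensure_ascii=False)


if __name__ == "__main__":
    # Hypothetical ids and texts, for illustration only.
    print(build_collatex_input({
        "ms-a": "Vt episcopi vivant",
        "ms-b": "Ut  episcopi uiuant",
    }))
```

Because both versions normalize to the same string, a collator fed this input would see no variants between them, which is the point of normalizing before collation.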
diff --git a/doc_src/collections.rst b/doc_src/collections.rst
@@ -1,6 +1,5 @@
-=============
- Collections
-=============
+Collections
+===========
 
 We want to find *collections* of capitularies, currently very vaguely defined as
 capitularies that are often copied together.
@@ -12,7 +11,7 @@ potential collections of capitularies.
 
 
 Algorithm
-=========
+~~~~~~~~~
 
 Description of the algorithm used by the :code:`cluster.py` script.
 

diff --git a/doc_src/conf.py b/doc_src/conf.py
@@ -20,6 +20,7 @@
 # add these directories to sys.path here. If the directory is relative to the
 # documentation root, use os.path.abspath to make it absolute, like shown here.
 
+sys.path.insert (0, os.path.abspath ('../server'))
 sys.path.insert (0, os.path.abspath ('..'))
 
 # -- General configuration ------------------------------------------------
@@ -38,14 +39,15 @@
     'sphinx.ext.imgconverter',
     # 'sphinx_js',
     'sphinxcontrib.plantuml',
-    #'sauml.sauml',
+    'sphinxcontrib.httpdomain',
+    'sauml.sauml',
 ]
 
 #js_source_path = '../server/es6'
 #jsdoc_config_path = '../jsdoc.json'
 
-#sauml_arguments = ['mysql://capitularia@mysql2.uni-koeln.de/capitularia']
-#sauml_dot_table = 'bgcolor=#e7f2fa&color=#2980B9'
+sauml_arguments = ['postgresql+psycopg2://capitularia@localhost:5432/capitularia']
+sauml_dot_table = 'bgcolor=#e7f2fa&color=#2980B9'
 
 # Add any paths that contain templates here, relative to this directory.
 templates_path = ['_templates']
@@ -63,7 +65,7 @@
 
 # General information about the project.
 project = 'Capitularia'
-copyright = '2018 CCeH - Licensed under the GNU GPL v3 or later'
+copyright = '2018-19 CCeH - Licensed under the GNU GPL v3 or later'
 author = 'Marcello Perathoner'
 
 # The version info for the project you're documenting, acts as replacement for

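Review note on doc_src/conf.py: both sys.path.insert calls use index 0, so the call that runs last ends up first on the search path. The sketch below illustrates the resulting order on a copy of a path list. `simulate_inserts` is a hypothetical helper written for this note and is not part of the project.

```python
import os


def simulate_inserts(search_path, directories):
    """Return a copy of search_path after inserting each directory at index 0.

    This mirrors the order of the sys.path.insert calls in conf.py without
    modifying the real sys.path.
    """
    path = list(search_path)
    for directory in directories:
        path.insert(0, os.path.abspath(directory))
    return path


# Same order as in conf.py: '../server' first, then '..'.
result = simulate_inserts(["site-packages"], ["../server", ".."])

if __name__ == "__main__":
    for entry in result:
        print(entry)
```

With this order, a module that exists in both the repository root and the server directory would be imported from the repository root.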
diff --git a/doc_src/index.rst b/doc_src/index.rst
@@ -19,7 +19,11 @@ Developer Manual
 .. toctree::
    :maxdepth: 2
 
+   intro
+   webprojekt/webprojekt
+   vm/vm
    collation_tool
+   meta_search
    collections
 
 

diff --git a/doc_src/intro.rst b/doc_src/intro.rst
@@ -0,0 +1,77 @@
+==============
+ Introduction
+==============
+
+Introduction to the Capitularia Website Project.
+
+
+Platforms
+=========
+
+The project uses three main platforms:
+
+- the RRZK WebProject (https://capitularia.uni-koeln.de),
+- the Capitularia VM  (https://api.capitularia.uni-koeln.de),
+- the AFS Filesystem  (/afs/rrz.uni-koeln.de/vol/www/projekt/capitularia/)
+
+.. uml::
+   :align: center
+   :caption: Main components of the project
+
+   skinparam backgroundColor transparent
+   skinparam DefaultTextAlignment center
+   skinparam componentStyle uml2
+
+   cloud "RRZK WebProject" {
+     rectangle "Apache" as apache {
+       component "Wordpress" as wp
+     }
+     database  "Database\n(mysql)"   as mysql
+   }
+
+   cloud "Capitularia VM" {
+     component "App Server\n(Python+Flask)"   as api
+     database  "Database\n(Postgres)"   as db
+   }
+
+   cloud "AFS Filesystem" {
+     database "Files" as afs
+   }
+
+   wp    <->  api
+   wp    <--> afs
+   api   <--> afs
+
+   api   <-> db
+
+   mysql <-> wp
+
+
+The Apache web server runs the Wordpress app and serves static files.  We wrote
+a Wordpress theme and many :ref:`Wordpress plugins <plugins>` to add the
+functionality we needed for our project.  As it became harder to implement all of
+that as plugins, we moved part of the functionality onto an application server on
+a VM.
+
+The Capitularia VM is a root VM on which we installed recent software.  It runs
+the Postgres database and the :ref:`Python application server <app-server>`.
+In addition, it hosts a recent OpenJDK, Saxon, and a
+:ref:`customized version of CollateX <custom-collatex>`.
+
+The application server does :ref:`collations <collation-tool>` and
+:ref:`metadata and fulltext search <meta-search>` in the capitulars.  The
+database holds manuscript metadata and the pre-processed text of every chapter
+in every manuscript.
+
+The AFS Filesystem holds the manuscript files and other project files.  It is
+accessible from the VM and the Apache web server.  The editors also have direct
+access to it through SSH.
+
+
+Components
+==========
+
+- Website
+- Meta Search
+- Collation Tool
+- Page Generator