Write technical documentation. WIP on #38.
MarcelloPerathoner committed Nov 27, 2019
1 parent 7a2fbb9 commit 4fa0988
Showing 101 changed files with 17,837 additions and 37,852 deletions.
2 changes: 1 addition & 1 deletion doc_src/Makefile
@@ -50,7 +50,7 @@ clean:

.PHONY: html
html:
$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)
$(SPHINXBUILD) -a -E -b html $(ALLSPHINXOPTS) $(BUILDDIR)
cp _config.yml $(BUILDDIR)/
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)."
121 changes: 91 additions & 30 deletions doc_src/collation_tool.rst
@@ -1,52 +1,113 @@
================
Collation Tool
================
.. _collation-tool:

The collation tool has a frontend and a backend component.
The collatables are stored in TEI files.

Collation Tool
==============

Description of the collation tool and the processing of the TEI files.


Pre-Processing of the TEI files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We extract every chapter of every capitular from all manuscripts and store them
separately in the Postgres database on the application server. The text stored
in the database is already normalized.

If a manuscript contains more than one copy of a chapter, all copies are
extracted. If a corrector hand was active in the chapter, an original and a
corrected version are both extracted. The collation tool knows about all these
versions and offers them to the user.

.. uml::
:align: center
:caption: Collation Tool
:caption: Data flow during pre-processing

skinparam backgroundColor transparent
skinparam DefaultTextAlignment center
skinparam componentStyle uml2

database "Manuscript files\n(XML+TEI)" as tei
note left of tei: AFS:publ/mss/*.xml
cloud "VM" {
component "Cron" as cron
component "Makefile" as make
component "mss-extract-chapters.xsl" as saxon
database "Chapter files\n(plain text)" as chapters
note left of chapters: AFS:publ/cache/extracted/*/*.txt
component "import.py" as import
database "Database\n(Postgres)" as db
}

tei --> saxon
saxon --> chapters
chapters --> import
import --> db

cron .> make
make .> saxon
make .> import

The Makefile is run by cron on the Capitularia VM at regular intervals.

The manuscript files are in the AFS. The AFS is mounted onto the VM.

component "Frontend\n\n(Javascript)" as client
The Makefile knows all the dependencies between the files and runs the
appropriate tools to keep the database up-to-date with the manuscript files.

All intermediate files can be found in the cache/extracted directory, with one
directory per manuscript and one file per chapter, copy, and hand. The
intermediate files are normalized, e.g. V is replaced by U.

The import.py script imports the intermediate text files into the database.
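
To make the file layout concrete, here is a minimal sketch of how a script
could walk the intermediate files. It is not the actual import.py: the function
names are hypothetical, the database step is left out, and only the V to U rule
mentioned above is shown (the real normalization happens earlier, during
extraction, and probably covers more than this one rule).

```python
from pathlib import Path

# Assumption: only the V -> U rule is documented in this section;
# folding lowercase v as well is a guess made for illustration.
V_TO_U = str.maketrans({"V": "U", "v": "u"})


def normalize(text):
    """Illustrate the documented normalization rule (V replaced by U)."""
    return text.translate(V_TO_U)


def iter_chapter_files(root):
    """Yield (manuscript, file stem, text) for every <root>/<manuscript>/<name>.txt.

    The directory name stands for the manuscript; the file stem stands for
    the chapter, copy, and hand. How those are encoded in the real file
    names is not specified here, so the stem is returned unparsed.
    """
    for path in sorted(Path(root).glob("*/*.txt")):
        yield path.parent.name, path.stem, path.read_text(encoding="utf-8")
```

An importer would iterate over ``iter_chapter_files("cache/extracted")`` and
write one database row per yielded tuple.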


Collation
~~~~~~~~~

The collation tool is divided into two parts: a frontend written in JavaScript
using the Vue.js library, and a backend application server written in Python.
The backend retrieves the chapters to collate from the database and calls the
CollateX executable to do the actual collation. The results are sent to the
frontend, which formats them for display.
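
As an illustration of the hand-off to CollateX, the sketch below builds a
pre-tokenized witness list in the JSON input format that CollateX accepts (a
``witnesses`` array whose entries carry an ``id`` and a list of ``tokens`` with
a ``t`` key). How the backend actually passes data to the executable is not
described here, so the function name, the witness ids, and the whitespace
tokenization are all assumptions.

```python
import json


def build_collatex_input(witnesses):
    """Build a CollateX JSON input document.

    witnesses: mapping of witness id -> normalized chapter text.
    Tokenizing on whitespace is an assumption made for this sketch.
    """
    return {
        "witnesses": [
            {"id": witness_id, "tokens": [{"t": word} for word in text.split()]}
            for witness_id, text in witnesses.items()
        ]
    }


if __name__ == "__main__":
    # Hypothetical witness ids and texts, for illustration only.
    payload = build_collatex_input(
        {"ms-a": "ut omnes ueniant", "ms-b": "ut ueniant"}
    )
    print(json.dumps(payload, indent=2))
```

Since the collation unit is the chapter, each such document stays small.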

.. uml::
:align: center
:caption: Data flow during collation

skinparam backgroundColor transparent
skinparam DefaultTextAlignment center
skinparam componentStyle uml2

component "Backend\n\n(Wordpress Plugin)" as api
component "CollateX\n\n(Java)" as cx
component "Transformation\n\n(XSLT)" as xslt
cloud "VM" {
database "Database\n(Postgres)" as db
component "API Server\n(Python)" as api
component "CollateX\n(Java)" as cx
}
component "Frontend\n(Javascript)" as client

database "TEI Files" as tei
db --> api
api --> client
api <- cx
api -> cx

client <-> api
api <-> cx
api <-- xslt
xslt <-- tei

The frontend is written in Javascript using the VueJS library. It communicates
with the backend using AJAX calls. The frontend displays the data to the user
and lets the user manipulate it, while the backend does the actual collation.
The collation unit is the chapter, so that only short texts need to be collated,
saving much processing time.

The collatables are stored in TEI files. The backend has to preprocess
them to obtain streams of words with all markup stripped. The streams of words
are then sent to CollateX to do the collation and finally to the frontend, which
formats and displays them.
We aim to rewrite all the functionality we need of CollateX in Python or
Javascript and then drop the dependency on CollateX.

The collatables are subdivided into capitularies and sections, so that only
short texts need to be collated, saving much processing time. The backend also
extracts the wanted sections from the TEI files.
The Wordpress cap-collation-user plugin delivers the Javascript client to the
user. After that, all communication happens directly between the client and the
application server.

Currently and for historical reasons the backend is implemented as Wordpress
plugin in PHP. We aim to rewrite it ASAP using a Python application framework
and at the same time we'll rewrite all the functionality we need of CollateX in
Python and drop the dependency on CollateX and Java.

.. _custom-collatex:

Custom Version of CollateX
==========================
~~~~~~~~~~~~~~~~~~~~~~~~~~

Our custom version of CollateX uses a custom word comparison function.

7 changes: 3 additions & 4 deletions doc_src/collections.rst
@@ -1,6 +1,5 @@
=============
Collections
=============
Collections
===========

We want to find *collections* of capitularies, currently very vaguely defined as
capitularies that are often copied together.
@@ -12,7 +11,7 @@ potential collections of capitularies.


Algorithm
=========
~~~~~~~~~

Description of the algorithm used by the :code:`cluster.py` script.

10 changes: 6 additions & 4 deletions doc_src/conf.py
@@ -20,6 +20,7 @@
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.

sys.path.insert (0, os.path.abspath ('../server'))
sys.path.insert (0, os.path.abspath ('..'))

# -- General configuration ------------------------------------------------
@@ -38,14 +39,15 @@
'sphinx.ext.imgconverter',
# 'sphinx_js',
'sphinxcontrib.plantuml',
#'sauml.sauml',
'sphinxcontrib.httpdomain',
'sauml.sauml',
]

#js_source_path = '../server/es6'
#jsdoc_config_path = '../jsdoc.json'

#sauml_arguments = ['mysql://capitularia@mysql2.uni-koeln.de/capitularia']
#sauml_dot_table = 'bgcolor=#e7f2fa&color=#2980B9'
sauml_arguments = ['postgresql+psycopg2://capitularia@localhost:5432/capitularia']
sauml_dot_table = 'bgcolor=#e7f2fa&color=#2980B9'

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
@@ -63,7 +65,7 @@

# General information about the project.
project = 'Capitularia'
copyright = '2018 CCeH - Licensed under the GNU GPL v3 or later'
copyright = '2018-19 CCeH - Licensed under the GNU GPL v3 or later'
author = 'Marcello Perathoner'

# The version info for the project you're documenting, acts as replacement for
4 changes: 4 additions & 0 deletions doc_src/index.rst
@@ -19,7 +19,11 @@ Developer Manual
.. toctree::
:maxdepth: 2

intro
webprojekt/webprojekt
vm/vm
collation_tool
meta_search
collections


77 changes: 77 additions & 0 deletions doc_src/intro.rst
@@ -0,0 +1,77 @@
==============
Introduction
==============

Introduction to the Capitularia Website Project.


Platforms
=========

The project uses three main platforms:

- the RRZK WebProject (https://capitularia.uni-koeln.de),
- the Capitularia VM (https://api.capitularia.uni-koeln.de),
- the AFS Filesystem (/afs/rrz.uni-koeln.de/vol/www/projekt/capitularia/)

.. uml::
:align: center
:caption: Main components of the project

skinparam backgroundColor transparent
skinparam DefaultTextAlignment center
skinparam componentStyle uml2

cloud "RRZK WebProject" {
rectangle "Apache" as apache {
component "Wordpress" as wp
}
database "Database\n(mysql)" as mysql
}

cloud "Capitularia VM" {
component "App Server\n(Python+Flask)" as api
database "Database\n(Postgres)" as db
}

cloud "AFS Filesystem" {
database "Files" as afs
}

wp <-> api
wp <--> afs
api <--> afs

api <-> db

mysql <-> wp


The Apache web server runs the Wordpress app and serves static files. We wrote
a Wordpress theme and many :ref:`Wordpress plugins <plugins>` to add the
functionality we needed for our project. As it got harder to implement all that
as plugins we moved part of that functionality onto an application server on
a VM.

The Capitularia VM is a root VM on which we installed recent software. It runs
the Postgres database and the :ref:`Python application server <app-server>`.
Next to that it hosts a recent OpenJDK, Saxon and a
:ref:`customized version of CollateX <custom-collatex>`.

The application server does :ref:`collations <collation-tool>` and
:ref:`metadata and fulltext search <meta-search>` in the capitulars. The
database holds manuscript metadata and the pre-processed text of every chapter
in every manuscript.

The AFS Filesystem holds the manuscript files (and other project files). It is
accessible from the VM and the Apache web server. The editors also have direct
access to it through ssh.


Components
==========

- Website
- Meta Search
- Collation Tool
- Page Generator