Code for Text Engineering courses, University of Cologne
Course plan and material (in German)
Session | Functional | Technical | Uses | Literature |
tm1 | Corpus and data access | OOD and TDD basics; object DB and native queries | DB4O; Crawler (ir6) | Gamma et al. (1994), Ch. 1; Bloch (2008), Item 16 |
tm2 | Data enrichment with standoff annotation | Generics; XML binding for export and import; schema generation as a form of MDD (code-first) | Index (ir2); TF-IDF (ir5); JAXB (or Java 6, which includes it) | Thompson & McKelvie (1997); Bloch (2008), Ch. 5; Naftalin & Wadler (2006), Part 1 |
tm3 | Text classification with Naive Bayes | Delegation and strategy for modular classification | Crawler (ir6) | Gamma et al. (1994), p. 315; Bloch (2008), Item 21 |
tm4 | Comparative text classification and evaluation | Using the Weka API; adapter for integration | Weka (developer version) | Gamma et al. (1994), p. 139; Witten & Frank (2005) |
tm5 | Flat k-means clustering and purity evaluation | Java Concurrency API (CopyOnWriteArrayList, ExecutorService); visualization with Graphviz DOT | TF-IDF vectors and cosine similarity (ir5) | Bloch (2008), Item 68 |
tm6 | Release engineering | CRISP builds with Ant | All previous code | Clark (2006), Ch. 2 |
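As a rough illustration of two notions named in the tm5 row, the following sketch computes the cosine similarity of two term vectors and the purity of a flat clustering against gold labels. It is not taken from the course code: the class name `ClusterEvaluation`, the method signatures, and the sample labels are assumptions made for this example, and it assumes Java 8 or later.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of the course packages.
public class ClusterEvaluation {

    /** Cosine similarity of two equal-length vectors; 0 if either is a zero vector. */
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /**
     * Purity: every cluster is credited with the count of its most frequent
     * gold label; the sum of these counts is divided by the number of items.
     */
    public static double purity(int[] clusterIds, List<String> goldLabels) {
        // cluster id -> (gold label -> number of items with that label)
        Map<Integer, Map<String, Integer>> counts = new HashMap<>();
        for (int i = 0; i < clusterIds.length; i++) {
            counts.computeIfAbsent(clusterIds[i], k -> new HashMap<>())
                  .merge(goldLabels.get(i), 1, Integer::sum);
        }
        int majoritySum = 0;
        for (Map<String, Integer> perLabel : counts.values()) {
            int max = 0;
            for (int c : perLabel.values()) {
                max = Math.max(max, c);
            }
            majoritySum += max;
        }
        return clusterIds.length == 0 ? 0 : (double) majoritySum / clusterIds.length;
    }

    public static void main(String[] args) {
        // Six documents in two clusters; each cluster has a 2:1 majority,
        // so the purity is 4/6.
        int[] clusters = {0, 0, 0, 1, 1, 1};
        List<String> gold = Arrays.asList(
                "sport", "sport", "politics", "politics", "politics", "sport");
        System.out.println("purity = " + purity(clusters, gold));
        System.out.println("cosine = " + cosine(new double[] {1, 2}, new double[] {2, 4}));
    }
}
```

Purity rewards clusterings with many small clusters (one document per cluster is trivially pure), so it is usually compared across runs with the same number of clusters.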
- For each session, a file that runs both as a Java application and as a JUnit test is located at de.uni_koeln.phil_fak.iv.tm.pX.PraxisX.java (X is the session number)
- To run all tests, run All.java as a JUnit test. This needs the corpora in data/; to generate them, run All.java as a Java application first
- The Ant script compiles and deploys the code as an executable JAR (ant deploy), generates Javadoc (ant doc), and runs the tests (ant test), whose results are summarized in an HTML report (ant report)
- Bloch, Joshua (2008), Effective Java, Second Edition, Addison-Wesley.
- Clark, Mike (2006), Projekt-Automatisierung, Hanser.
- Gamma, Erich, Richard Helm, Ralph Johnson and John Vlissides (1994), Design Patterns. Elements of Reusable Object-Oriented Software, Addison-Wesley.
- Naftalin, Maurice and Philip Wadler (2006), Java Generics and Collections, O’Reilly.
- Thompson, H. S. and D. McKelvie (1997), Hyperlink semantics for standoff markup of read-only documents. In Proceedings of SGML Europe ’97: The next decade – Pushing the Envelope, pp. 227–229.
- Witten, Ian H. and Eibe Frank (2005), Data Mining: Practical Machine Learning Tools and Techniques, Second Edition, Morgan Kaufmann.