forked from kvorg/CUWI
-
Notifications
You must be signed in to change notification settings - Fork 0
/
TODO
223 lines (203 loc) · 9.3 KB
/
TODO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
-*- outline -*-
* CWB::Model BUGS:
** Log errors from Model instantiation (handle exceptions)
** Paging: displays 1 match more than pagesize
** Check query mangling in detail
** CQP heurisitcs:
- Bug in combined query syntax: a* [word="t.*"], add test!
- enable quoting of CQP escapes for "[rule]" and '[rule]'
** In paging with Result objects, the end page calculation is wrong
** Default overrides new for ingorecase (fixed)
** Better tests/fails for failing/missing registry specifications
** Clean up newlines at the end of description
** Cleanup CQP query link situation with advanced query form options
** _token split bug with / in token (could limit to no of atts and hope for the best)
* CWB::Model
** Add Model::Result use classes:
Model::Result::Hits as a collection with an additional cpos
interface Model::Result::Hit and Model::Result::Frequency as
instances -> api: @words = $result->hits->each($_->match('word'));
-> clean up CUWI code
** Implement maxquerytime properly
** Multiple sorting and structural / alignement queries
** Add ->next method to result for API, evolve result
** document query and result
** Structure handling in Query:
structure-based tree view (treebanks)
** Multicorpora
spook:
- fix paging calculation, should be optional, wrapped up in a function
** spook: sub-searches (based on existing search): subquery
aka Support for chained query support -> store previous queries in link/params
** spook: any needed support for form-based alignement and structural queries
** Fix all file work to use IO::File and File::Spec
** Make multicorpora use parallel fork or webapp subquery to other daemons
to implement parallel queries on commandline and in web interface, respectively
** Improve test coverage
** Colocation support (using cwb-scan-corpus and scan grid intefrace?)
** Info files: translateable names/titles
** Add a handle for literal matching (no regexps) with %l
** Test cwb beta release compatibiliy - UTF-8
** Implement cached queries:
- CWB::Model::Corpus keeps a pool of ageing Query objects, matched on query
- queries are reused with different show options, paging etc.
- possibly used to support chained queries
- also:
*** cache cqp connections in queries (only hit if query in instance)
*** cache small but complex results and subquery parents in cpos files
*** possibly use fork in Model, but use HTTP subquery in CUWI
** Add corpus positions dump and cposlist as an export option
** Support for list-oriented queries and macros, i.e.
/codist["whose", pos];
/codist[lemma, "go", word];
** Native CQP and cwb-corpus-scan interfaces with non-blocking queries using
Mojo::IOLoop->timer(0, sub { poll query or render-cb ; reset timer; }
or suitable newer solutions
The query should have a flag when still running and refuse new requests
from the query cache API
* Web interface BUGS
** fix page meta titles, localize
** update boundled cluetip version
** bigger blurb fonts, not centered
** show corpus names in lowercase, boldened when possible
** missing si translations: query, search etc.
** add test to check for errors with result exports
** fix error with a full query but nonexisting corpus
** fix encodings for frequency downloads
** fix missing display mode before logging on line 65
** escapes for ?, *, |, handling for single ", whaterver ~ is for
** format info tags, links
** samples (reduce) with frequency lists?
** fix warnings (with tests)
** algnement and links on info tags
** missing "not" for structural constraint
** wordlist -> ignoremeta in links disables ignorecase
** wordlist: quote metacharacters in string links: check/test
** classes: sanitize with the virtual corpus before sending
done, missing: tests
** fix controller params / %params ambiguity in controller
(probably unify names, fix values with redirect and pass to model
through a filter - setting params is proven to work ok
use flash to pass messages about fixing params)
** fix wierd behavior with last page on \. search on imp-goo
** move export modules in and eval encapsualtion and handle exceptions
** make export modules work asynchroniously
** consider moving export modules to Model, caveat previous
** disable context links for biggest display mode (sentence or paragraph)
** possibly closures for exception handling are buggy
- should rethrow the exception and catch it in webapp?
- or just check for return valu?e
- make a nice exception page or part of page for fatals
- find a way to test all of this
* Web interface
** document info file format
** make constraint args disappear after a full CQP query, with a discreet note
** frequencies ->
when set to 1, make all and support no_frequencies
** config option: simple search attribute candidates per corpus
** simple search form -> always show alignements (optional with config)
** change wordlist exporting to divide matches into multiple fields
** add options to ignore structural tags (urls, strings, regexes)
** more logging: paging, sorting, shows, tx->query
** config upgrade: most general options also per corpus, corpus overrides
** config reporting via API
** add API for multiplexed multicorpus queries via interal http
** structural attribute filters from tooltips (possibly use subqueries or such)
** add progress and redirect for result exports
** when more than 7 attributes are present, switch display to 4 drop-down lists
(possibly list selection performs a direct query redirect)
** use chunked transfer encoding for results?
** konrad, imp, spook: multiple sort and multiple structure constraint
** finish frequency support with cwb-scan-corpus
and disk-related long-polled results, see CWB::Model TODO
** split display modes into simple search, detail, frequencies, collocations
** spook: missing templates for query result exports
oocalc, tab-delmited, corpus position dump, tei/xml, latex, yaml
** spook: password protected groups: better implementation
** spook: sub-searches (based on existing search), possibly with caching
** add config file support for ID for structural attribute in result
** do not use url-like attributes and *_id attributes in search forms*
** additional grid query compositing interface
as per http://nl.ijs.si/jos/cqp/uni and http://nl2.ijs.si/fpj.html
** add sample implementation with redirect to a fixed sample seeed
[check status and add tests]
** add generate config helper to cuwi command line, and help
** add on-line form-based config with auth
** handle metadata with structural attributes via config /re/s
* Tests
** Finish CWB::Model tests
*** virtual config errors
*** t/01_registry.t-23-# test for non-existing registry
*** 01 tests for multiple registry directories
*** 01 tests for ->registry accessor
*** 01 registry update and reloading
*** 01 broken registry files, missing info files
*** 01 exception handling
*** 01 introduce exceptions in tests
*** 02 infofile: peer corpora/alignement
*** 03 query: faulty options
*** 05 result: display tests (show, all/sample)
*** 05 query/result encoding roundtrip (in queries and all display modes)
*** 05 result: one too many hit in result pages
*** 05 result: check number of hits against 2 or more pages for 0-based page offsets
*** 06 result: alignement, alignement encoding
** CUWI Webapp tests
*** config file parsing
*** config file errors: consistency
*** perl as config format
*** export module detection (use_ok vs. loader?)
*** export options working
*** all layouts (to enable template hacking)
** Online tests for HTML and CSS syntax checking
* Links:
** http://nlp.fi.muni.cz/trac/noske
** http://corpus.byu.edu/coca/
** http://parasol.unibe.ch/UsagePARASOL.html
** http://corpus.leeds.ac.uk/paraquery.html
** http://www.comp.leeds.ac.uk/ssharoff/
** http://the.sketchengine.co.uk/
** http://corpus.leeds.ac.uk./natasha mana serge cqp search
** SPOOK concord: http://nl.ijs.si/et/project/SPOOK/konkor/
** Para corpora:
- http://nl2.ijs.si/index-bi.html
where ALIGNED name is the aligned corpus, aligned by <seg>
- http://nl2.ijs.si/dsi.html
* Release-critical
** cuwi growing and cleanup
** spook stuff
** default info file processing
** decent test coverage
** logo
** l18n
** colocations MI3 LL
http://nlp.stanford.edu/fsnlp/promo/colloc.pdf
http://nlp.stanford.edu/fsnlp/
** missing tabular download formats
** documentation
especially:
*** config
*** registries and info files
*** CWB::Model API
* Future:
** corpus groups
- groups need special description files -> default corpus, links to members
- grouped corpora have corpus selectors for corpora within group
- alternatively, use SISTER_CORPORA in info files for the drop-down menu
** user registration, storable queries, diff between storable queries etc.
From the list:
Message:
From our experience working interactively in CQP, it's even more useful to
be able to run subqueries, i.e. filter query results either by collocates
(tokens with a certain property within a specified range, e.g. a finite
verb within 3 words) or by another CQP query. This could easily be
implemented using "set keyword" and subqueries in CQP, but the results
would have to be stored as saved queries (because they can't easily be
reproduced when they're dropped from the cache).
** CEQL parser
** CQI interface
** treebank view
** admin section:
*** corpus building
*** ad-hoc corpora
*** corpus building toolchain: ToTaLe plugin etc.
*** wordnet API, slownet plugin