-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.txt
181 lines (129 loc) · 6.31 KB
/
README.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
Stanford GHKM Rule Extractor - August 2012
--------------------------------------------
(c) 2009-2012 The Board of Trustees of The Leland Stanford Junior University.
All Rights Reserved.
Rule extractor written by Michel Galley.
Support code written by Stanford JavaNLP members.
The system requires Java 6 (JDK1.6).
The GHKM rule extractor included in this distribution replicates almost
exactly the one described in (Galley et al.; 2004, 2006). Sentence pairs for
which the two extractors generate different outputs are rare (one such
sentence is shown in samples/costs.*). The new implementation generates rule
extracts but not derivation forests. The original implementation could
generate derivation forests, though (DeNeefe et al., 2007) indirectly suggests
that such derivations (needed to run EM) are not needed to get
state-of-the-art performance.
USAGE
Unix:
> extract.sh <root> <memory> <joshua_format>
<root>: <root>.{f,ptb,a} should be files containing respectively
source-language sentences, target-language parsed sentences (PTB format),
and word alignments (see ./samples/ for sample data files).
<memory>: heap size for the JVM
<joshua_format>: whether to print rules in Joshua format (true|false).
If false, print rules in xrs format.
Sample usage 1:
> extract.sh samples/astronauts 1g false
The above command prints rules to stdout in the following format:
xrs rule LHS -> xrs rule RHS ||| features
e.g.,
VP(VBG("coming") PP(IN("from") x0:NP)) -> "来自" x0 ||| 0.33333334 0.5
The two features printed by default are relative frequencies for:
p(rule | root)
p(rule | LHS)
In the example given above, this corresponds to:
p(rule | VP) = 0.33
p(rule | VP(VBG("coming") PP(IN("from") x0:NP))) = 0.6
Sample usage 2:
> extract.sh samples/astronauts 1g true
In this case, the output is printed in a format that looks like Joshua's:
root ||| source-language yield ||| target-language yield ||| features
e.g.,
[VP] ||| 来自 [NP,1] ||| coming from [NP,1] ||| 0.33333334 0.5
Printing rules in Joshua format discards internal tree annotation.
To get a behavior strictly equivalent to GHKM as implemented in (Galley et
al., 2006), one should create auxiliary non-terminals (not implemented),
though it is not clear whether this extra step would really be useful.
OPTIONS
The class edu.stanford.nlp.mt.syntax.RuleExtractor supports the following arguments:
-joshuaFormat (true|false) [default: false]
Determines whether to print rules in Joshua format.
-maxLHS (N) [default: 15]
Maximum number of nodes in LHS tree.
-maxRHS (N) [default: 10]
Maximum number of elements in RHS sequence.
-maxCompositions (N) [default: 0]
Maximum number of compositions of minimal rules (see (Galley et al., 2006) for
an explanation of the difference between minimal and composed rules).
-startAtLine (N) [default: 0]
Start extraction at the given line.
-endAtLine (N) [default: -1]
End extraction at the given line.
-reversedAlignment (true|false) [default: false]
If RuleExtractor complains about your word alignment, try this.
-fFilterCorpus (file) [default: nil]
Option to filter rules on the fly against a specific dev or test file. If
the RHS (i.e., source language) side of the rule contains any word that does
appear in the dev/test file, it is automatically discarded. This way of
filtering rules is admittedly very crude, and a suffix-array implementation
would be ideal to extract very large rule sets (one may want to reuse
Joshua's implementation).
- extractors <feature_extractor_1>:...:<feature_extractor_N>
See "Feature API" section.
Parameters to maxCompositions, maxLHS, and maxRHS can have a dramatic impact on the
amount of memory needed for extraction.
TESTING
to make sure you are getting the same output as we do, please cd to samples and run:
make sample
astronauts.grammar.joshua and astronauts.grammar.xrs should then be identical to
astronauts.grammar.joshua.orig and astronauts.grammar.xrs.orig.
FEATURE API
The rule extractor currently only generates two features for each rule, though
it is relatively easy to add new ones. The first step is to create a new class
that implements edu.stanford.nlp.mt.syntax.train.FeatureExtractor. The second
step is to enable its use by adding it to RuleExtractor's command line, e.g.,
if you just created NewExtractor:
> java edu.stanford.nlp.mt.syntax.train.RuleExtractor -extractors edu.stanford.nlp.mt.syntax.train.RelativeFrequencyFeatureExtractor:edu.stanford.nlp.mt.syntax.train.NewExtractor
TODO
This release lacks a decoder and a suffix array implementation.
Components that could be added in future releases:
- interfacing with Joshua to exploit its support suffix arrays;
- multi-threading;
- additional features.
LICENSE
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
For more information, bug reports, fixes, contact:
Michel Galley
Dept of Computer Science, Gates 2A
Stanford CA 94305-9020
USA
CHANGES
2012-08-09
Bug fix contributed by Karl Moritz Hermann:
a very small subset of rules were missing due to bad hashing.
2010-03-08
Added support for multi-threading.
Reduced memory usage.
2010-02-21
Initial release.
MORE INFORMATION
The details of this software can be found in these papers:
Michel Galley, Mark Hopkins, Kevin Knight, Daniel Marcu: What's in a
translation rule? HLT-NAACL 2004: 273-280.
Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve
DeNeefe, Wei Wang, Ignacio Thayer: Scalable Inference and Training of
Context-Rich Syntactic Translation Models. ACL 2006.
For more information, look in the included Javadoc, starting with the
edu.stanford.nlp.mt.syntax.train.RuleExtractor class documentation.
Please send any questions or feedback to [email protected].