
RecordDAWG only stores a very small subset of data when reading from generator function #9

Open
dan-blanchard opened this issue Apr 30, 2013 · 4 comments


@dan-blanchard
Contributor

Are RecordDAWGs compatible with using generator functions to fill them?

I have a very large TSV file that contains words, the contexts they appear in, and some statistics about both. The data is computed from Gigaword, so the file is 29 GB with 507 million lines. When I tried to build a dictionary mapping word-context pairs to their rank ratios, it took over 80 GB of RAM (the machine I'm using only has 64), so I tried using a RecordDAWG instead.

Unfortunately, when I use code like the following, Python churns away for a long time, seemingly loading the DAWG (and using about 45 GB of RAM, which is more reasonable), but the DAWG it saves contains only 6 elements and takes up 1.6 KB. Is this because the RecordDAWG constructor cannot read from a generator function?

def ngram_rank_ratio_iter(rank_ratio_file):
    for line in rank_ratio_file:
        line = line.strip('\r\n')
        split_line = line.split('\t')
        # Key is "word<TAB>context"; value is the rank ratio in column 8.
        word_context = '{}\t{}'.format(split_line[0], split_line[1])
        ratio = float(split_line[7])
        yield (word_context, (ratio,))

rr_dawg = dawg.RecordDAWG('d', ngram_rank_ratio_iter(open('huge_file.tsv')))
rr_dawg.save('huge_file.dawg')
@dan-blanchard
Contributor Author

It actually looks like this works fine if I limit the number of lines read to 1 million, so I guess it is not the generator function after all. I think the problem is the 536,870,912 (2^29) DAWG unit limit. I did not think this would be a problem, because I only want to store 507,369,749 elements, but I just noticed that the limit is on the number of transitions, not the number of elements.

On a related note, it looks like RecordDAWG.__getitem__ returns a one-element list containing a one-element tuple in my case, which seems strange. Shouldn't it just return the tuple? Sorry, this is not directly related to my initial issue.

>>> import dawg
>>> d = dawg.RecordDAWG('d')
>>> d.load('sorted_ngrams.txt.rankratios.dawg')
<dawg.RecordDAWG object at 0x2b1e3a53eb00>
>>> d['zurich\t#csum ,']
[(1.06259921167,)]

@kmike
Member

kmike commented Apr 30, 2013

Hi Dan,

I think there could be two issues.

  1. By default, the DAWG constructor "unrolls" the generator and sorts it, because the library requires the data to be sorted before insertion. If you know your data is already sorted, pass "input_is_sorted=True" to the constructor; this way DAWG building will use far less memory, because no intermediate Python list is constructed.
  2. As you said, there is a limit on total transitions (see e.g. "Can't build dictionary" #5). I'd recommend marisa-trie if DAWG doesn't work for your data: it has slower access times and sometimes requires more memory to build, but it can handle larger datasets and is often more memory-efficient.
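Pre-sorting a dataset that doesn't fit in memory (so that input_is_sorted=True can be used) can be sketched as a chunked external merge sort using only the standard library. This is a hypothetical helper, not part of the DAWG API, and it assumes the input strings contain no embedded newlines:

```python
import heapq
import tempfile

def external_sort(lines, chunk_size=100_000):
    """Sort an iterable of strings with bounded memory: sort
    fixed-size chunks, spill each sorted run to a temp file,
    then lazily k-way merge the runs with heapq.merge."""
    run_files = []

    def spill(chunk):
        f = tempfile.TemporaryFile(mode='w+')
        f.writelines(line + '\n' for line in sorted(chunk))
        f.seek(0)
        run_files.append(f)

    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            spill(chunk)
            chunk = []
    if chunk:
        spill(chunk)

    # heapq.merge consumes the sorted runs lazily, so only one
    # line per run is held in memory at a time.
    runs = [(l.rstrip('\n') for l in f) for f in run_files]
    yield from heapq.merge(*runs)
    for f in run_files:
        f.close()
```

In practice a Unix `sort` on the TSV file accomplishes the same thing; the point is that sorting on disk first lets the constructor skip building the in-memory list.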

RecordDAWG.__getitem__ returns a list because it is possible to store multiple values for the same key (this is a feature of BytesDAWG and RecordDAWG).

If you need only a single integer value, give IntCompletionDAWG a try: it will be several times faster and could use less memory (depending on the data). It could also use fewer transitions, by the way. Or use marisa-trie's Trie and store the values in an array (marisa-trie provides a perfect hash function that maps each key to a number in 0..len(trie)-1 and vice versa).
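The key-to-dense-id pattern kmike describes can be illustrated with a stdlib stand-in: a sorted key list plus binary search plays the role of marisa-trie's perfect hash, and values live in a compact external array. This is a sketch of the idea, not the marisa-trie API:

```python
import bisect
from array import array

# Stand-in for marisa-trie's key <-> id mapping: a sorted list of
# keys gives each key a stable index in 0..len(keys)-1.
keys = sorted(['new york\tcity', 'zurich\t#csum ,'])

# External value store: one C double per key, far smaller than a dict.
values = array('d', [0.0] * len(keys))

def key_id(key):
    """Return the dense integer id of a key (KeyError if absent)."""
    i = bisect.bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        raise KeyError(key)
    return i

values[key_id('zurich\t#csum ,')] = 1.06259921167
```

With marisa-trie the sorted list is replaced by the trie itself, which is both smaller and faster to query; the external-array part stays the same.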

Regarding (1): input_is_sorted is slightly complicated with RecordDAWG because you need to make sure the values are also sorted properly when a key has several values. For integer values this means storing them in big-endian format and sorting by value. But it seems this doesn't affect you, because you're only interested in a single value per key.
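The big-endian point can be checked directly with the stdlib struct module: for unsigned integers packed big-endian ('>I'), lexicographic byte order matches numeric order, while little-endian byte order does not:

```python
import struct

nums = [3, 256, 1, 70000]

# Big-endian puts the most significant byte first, so comparing the
# packed bytes lexicographically is the same as comparing the numbers.
by_big = sorted(nums, key=lambda n: struct.pack('>I', n))

# Little-endian reverses the byte order, so byte-wise sorting scrambles
# the numeric order.
by_little = sorted(nums, key=lambda n: struct.pack('<I', n))
```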

@kmike
Member

kmike commented Apr 30, 2013

I once tried to store a huge number of bigrams, each associated with a float. I was not interested in high precision, so my first approach was to store them in a RecordDAWG with 1-byte "floats" (floats converted to ints 0 <= x < 256 using some maths). But in practice marisa-trie (with an external array for values) worked better.
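The 1-byte float trick can be sketched as a simple linear quantizer over a known value range. The range endpoints below are made-up illustrations, not values from the thread; the maximum round-trip error is half a quantization step, i.e. (hi - lo) / 255 / 2:

```python
def quantize(x, lo=0.0, hi=10.0):
    """Map a float in [lo, hi] to an int in 0..255 (lossy)."""
    x = min(max(x, lo), hi)            # clamp to the known range
    return round((x - lo) / (hi - lo) * 255)

def dequantize(b, lo=0.0, hi=10.0):
    """Recover an approximation of the original float."""
    return lo + (b / 255) * (hi - lo)
```

The resulting byte can then be stored in a RecordDAWG with the 'B' struct format instead of 'd', cutting the per-value payload from 8 bytes to 1.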

@kmike
Member

kmike commented Apr 30, 2013

Anyway, I don't know why DAWG saving didn't raise an exception; this looks like a bug.
