
RecordDAWG only stores a very small subset of data when reading from generator function #9

Open
dan-blanchard opened this issue Apr 30, 2013 · 4 comments


@dan-blanchard
Contributor

Are RecordDAWGs compatible with using generator functions to fill them?

I have a very large TSV file that contains words, the contexts they appear in, and some statistics about both. The data is computed from Gigaword, so the file is 29 GB with 507 million lines. When I tried to build a dictionary mapping word-context pairs to their rank ratios, it took over 80 GB of RAM (the machine I'm using only has 64), so I tried using a RecordDAWG instead.

Unfortunately, when I use code like the following, Python churns away for a long time, seemingly loading the DAWG (and using about 45 GB of RAM, which is more reasonable), but the DAWG it saves contains only 6 elements and takes up 1.6 KB. Is this because the RecordDAWG constructor cannot read from a generator function?

def ngram_rank_ratio_iter(rank_ratio_file):
    for line in rank_ratio_file:
        line = line.strip('\r\n')
        split_line = line.split('\t')
        # Key is "word<TAB>context"; value is the rank ratio in column 8.
        word_context = '{}\t{}'.format(split_line[0], split_line[1])
        ratio = float(split_line[7])
        yield (word_context, (ratio,))

rr_dawg = dawg.RecordDAWG('d', ngram_rank_ratio_iter(open('huge_file.tsv')))
rr_dawg.save('huge_file.dawg')
@dan-blanchard
Contributor Author

It actually looks like this works fine if I limit the number of lines read to 1 million, so I guess it is not the generator function after all. I think the problem is the 536,870,912 (2^29) DAWG unit limit. I did not think this would be a problem, because I only want to store 507,369,749 elements, but I just noticed that the limit is on the number of transitions, not the number of elements.

On a related note, it looks like RecordDAWG.__getitem__ returns a one-element list containing a one-element tuple in my case, which seems strange. Shouldn't it just return the tuple? Sorry, this is not directly related to my initial issue.

>>> import dawg
>>> d = dawg.RecordDAWG('d')
>>> d.load('sorted_ngrams.txt.rankratios.dawg')
<dawg.RecordDAWG object at 0x2b1e3a53eb00>
>>> d['zurich\t#csum ,']
[(1.06259921167,)]

@kmike
Member

kmike commented Apr 30, 2013

Hi Dan,

I think there could be two issues.

  1. By default, the DAWG constructor "unrolls" the generator and sorts it, because the library requires the data to be sorted before insertion. If you know your data is already sorted, pass "input_is_sorted=True" to the constructor; this way DAWG building will use far less memory, because no intermediate Python list is constructed.
  2. As you said, there is a limit on total transitions (see e.g. "Can't build dictionary" #5). I'd recommend marisa-trie if DAWG doesn't work for your data: it has slower access times and sometimes requires more memory to build, but it can handle larger datasets and is often more memory-efficient.
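Pre-sorting a dataset that doesn't fit in memory (so that input_is_sorted=True can be used) can be sketched as a chunked external merge sort using only the standard library. This is a hypothetical helper, not part of the DAWG API, and it assumes the input strings contain no embedded newlines:

```python
import heapq
import tempfile

def external_sort(lines, chunk_size=100_000):
    """Sort an iterable of strings with bounded memory: sort
    fixed-size chunks, spill each sorted run to a temp file,
    then lazily k-way merge the runs with heapq.merge."""
    run_files = []

    def spill(chunk):
        f = tempfile.TemporaryFile(mode='w+')
        f.writelines(line + '\n' for line in sorted(chunk))
        f.seek(0)
        run_files.append(f)

    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            spill(chunk)
            chunk = []
    if chunk:
        spill(chunk)

    # heapq.merge consumes the sorted runs lazily, so only one
    # line per run is held in memory at a time.
    runs = [(l.rstrip('\n') for l in f) for f in run_files]
    yield from heapq.merge(*runs)
    for f in run_files:
        f.close()
```

In practice a Unix `sort` on the TSV file accomplishes the same thing; the point is that sorting on disk first lets the constructor skip building the in-memory list.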

RecordDAWG.__getitem__ returns a list because it is possible to store multiple values for the same key (this is a feature of BytesDAWG and RecordDAWG).

If you need only a single integer value, give IntCompletionDAWG a try: it will be several times faster and could use less memory (depending on the data). It could also use fewer transitions, by the way. Or use marisa-trie's Trie and store the values in an array (marisa-trie provides a perfect hash function that maps each key to a number in 0..len(trie)-1 and vice versa).
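The key-to-dense-id pattern kmike describes can be illustrated with a stdlib stand-in: a sorted key list plus binary search plays the role of marisa-trie's perfect hash, and values live in a compact external array. This is a sketch of the idea, not the marisa-trie API:

```python
import bisect
from array import array

# Stand-in for marisa-trie's key <-> id mapping: a sorted list of
# keys gives each key a stable index in 0..len(keys)-1.
keys = sorted(['new york\tcity', 'zurich\t#csum ,'])

# External value store: one C double per key, far smaller than a dict.
values = array('d', [0.0] * len(keys))

def key_id(key):
    """Return the dense integer id of a key (KeyError if absent)."""
    i = bisect.bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        raise KeyError(key)
    return i

values[key_id('zurich\t#csum ,')] = 1.06259921167
```

With marisa-trie the sorted list is replaced by the trie itself, which is both smaller and faster to query; the external-array part stays the same.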

Regarding (1): input_is_sorted is slightly complicated with RecordDAWG because you need to make sure the values are also sorted properly when a key has several values. For integer values this means storing them in big-endian format and sorting by value. But it seems this doesn't affect you, because you're only interested in a single value per key.
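The big-endian point can be checked directly with the stdlib struct module: for unsigned integers packed big-endian ('>I'), lexicographic byte order matches numeric order, while little-endian byte order does not:

```python
import struct

nums = [3, 256, 1, 70000]

# Big-endian puts the most significant byte first, so comparing the
# packed bytes lexicographically is the same as comparing the numbers.
by_big = sorted(nums, key=lambda n: struct.pack('>I', n))

# Little-endian reverses the byte order, so byte-wise sorting scrambles
# the numeric order.
by_little = sorted(nums, key=lambda n: struct.pack('<I', n))
```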

@kmike
Member

kmike commented Apr 30, 2013

I once tried to store a huge number of bigrams, each associated with a float. I was not interested in high precision, so my first approach was to store them in a RecordDAWG with 1-byte "floats" (floats converted to ints 0 <= x < 256 using some maths). But in practice marisa-trie (with an external array for values) worked better.
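The 1-byte float trick can be sketched as a simple linear quantizer over a known value range. The range endpoints below are made-up illustrations, not values from the thread; the maximum round-trip error is half a quantization step, i.e. (hi - lo) / 255 / 2:

```python
def quantize(x, lo=0.0, hi=10.0):
    """Map a float in [lo, hi] to an int in 0..255 (lossy)."""
    x = min(max(x, lo), hi)            # clamp to the known range
    return round((x - lo) / (hi - lo) * 255)

def dequantize(b, lo=0.0, hi=10.0):
    """Recover an approximation of the original float."""
    return lo + (b / 255) * (hi - lo)
```

The resulting byte can then be stored in a RecordDAWG with the 'B' struct format instead of 'd', cutting the per-value payload from 8 bytes to 1.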

@kmike
Member

kmike commented Apr 30, 2013

Anyway, I don't know why DAWG saving didn't raise an exception; this looks like a bug.
