Fill and write each array before creating the next one, to save memory. #177

Open

wants to merge 3 commits into master

Conversation

@qwertystop commented Apr 2, 2017

The previous version said "we'll have to do something more clever for huge datasets."

I tried this with a huge dataset (the entire five-CD Okami soundtrack, WAV-formatted with the headers stripped out, tracks separated by five seconds of silence, repeated three times and shuffled... about 10 GB, but then the preprocessor maps each byte to a uint32, so it gets magnified a bit).

I wouldn't call it "something clever," but it seems to work. All I did was reorganize things so that instead of making all the numpy arrays at once, filling them, and writing them all to h5 files, it makes one, fills it, writes it, garbage-collects it, and then moves on to the next.
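
A minimal sketch of that fill-one-write-one pattern, assuming h5py and numpy (the split names, placeholder data, and output.h5 filename are illustrative, not the preprocessor's actual API):

import gc

import h5py
import numpy as np

# Placeholder data standing in for the tokenized train/val/test splits.
splits = {
    'train': range(0, 1000),
    'val': range(1000, 1100),
    'test': range(1100, 1200),
}

with h5py.File('output.h5', 'w') as f:
    for name, tokens in splits.items():
        # Build only this split's array, write it to the HDF5 file,
        # then drop the reference and collect before allocating the next.
        arr = np.fromiter(tokens, dtype=np.uint32)
        f.create_dataset(name, data=arr)
        del arr
        gc.collect()

Only one split's array is live at a time, which is what keeps peak memory close to the size of the largest split rather than the whole dataset.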

@qwertystop (Author)

Realized another improvement: the preprocessor can now use numpy arrays of uint16 (more space-efficient for files whose vocabulary only slightly exceeds 255 tokens, e.g. raw bytes where all 256 single-byte values occur) and uint64 (probably won't be necessary, but as long as I'm adding types I may as well be thorough).

@@ -45,33 +45,44 @@
# Choose the datatype based on the vocabulary size
dtype = np.uint8
if len(token_to_idx) > 255:
Collaborator

Minor comment #1: use a single level of branching:

if len(..) > 4294967295:
    ...
elif len(..) > 65535:
    ...
else:
    ...
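
Filled out with the dtypes this PR adds, that single-level branch might look something like the sketch below; the thresholds are simply the maxima of uint8, uint16, and uint32, and token_to_idx is a placeholder for the vocabulary mapping built earlier in the preprocessor:

import numpy as np

token_to_idx = {}  # placeholder; in the preprocessor this maps tokens to indices

# Pick the smallest unsigned integer type that can hold every index.
if len(token_to_idx) > 4294967295:   # more than uint32 can hold
    dtype = np.uint64
elif len(token_to_idx) > 65535:      # more than uint16 can hold
    dtype = np.uint32
elif len(token_to_idx) > 255:        # more than uint8 can hold
    dtype = np.uint16
else:
    dtype = np.uint8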


# Write data to HDF5 file
Collaborator

Minor comment #2: Indentation doesn't match the rest of the file.

@ChrisCummins (Collaborator)

LGTM, and this is a useful improvement, though I haven't tested it. Two minor comments left inline.

@qwertystop (Author) commented Apr 2, 2017

Fixed the commented issues. I have tested it / am currently testing it: the preprocessor ran cleanly (1.8 GB input.txt, 256 "characters", in that the input is non-text-encoded bytes), and the network is currently calculating validation loss for iteration 4000 (running at default settings). It certainly doesn't seem like it's going to break. Perhaps when this is done I'll work out how to make it read two bytes per "character" and test it on Red Book audio (16-bit stereo) to exercise the larger data type. Then four bytes, treating both channels together as a single sample, to bump it up again.

(if someone else wants to do that, it'd be done faster – I've used up my spare-time budget getting this far)
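
One way the two-bytes-per-"character" idea could be prototyped (purely illustrative, not part of this PR; the synthetic bytes stand in for a headerless PCM dump): reinterpret the raw buffer as little-endian 16-bit samples with numpy before building the vocabulary.

import numpy as np

# Illustrative sketch only: treat each pair of raw bytes as one
# little-endian 16-bit sample, the layout Red Book audio uses.
raw = np.random.default_rng(0).integers(0, 256, size=2048, dtype=np.uint8).tobytes()

# Drop a trailing odd byte, if any, then reinterpret the buffer as uint16.
samples = np.frombuffer(raw[:len(raw) - len(raw) % 2], dtype='<u2')

# Each distinct 16-bit value becomes one vocabulary entry, so indices need
# at least uint16 (and uint32 once both channels are fused into one token).
vocab = np.unique(samples)
print(len(vocab), 'distinct 16-bit tokens')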

@binary-person
Anybody want to merge this? It looks perfect as a solution for a 2 GB dataset on a machine with 4 GB of RAM.
