Fill and write each array before creating the next one, to save memory. #177
Conversation
Realized another improvement: the preprocessor can now use numpy arrays of the smallest dtype that fits the vocabulary (uint8/uint16/uint32).
scripts/preprocess.py (outdated diff)
@@ -45,33 +45,44 @@
# Choose the datatype based on the vocabulary size
dtype = np.uint8
if len(token_to_idx) > 255:
Minor comment #1: use a single level of branching:
if len(..) > 4294967295:
....
elif len(..) > 65535:
...
else:
...
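(For concreteness, a sketch of how the dtype selection could look with a single level of branching; `token_to_idx` comes from the diff above, and the uint64 fallback for the largest case is an assumption rather than anything in the PR.)

```python
import numpy as np

# Pick the smallest unsigned integer dtype that can hold every token index.
# Single level of branching, per the suggestion above; the uint64 branch is
# an assumed fallback for vocabularies larger than 2**32 - 1.
if len(token_to_idx) > 4294967295:
    dtype = np.uint64
elif len(token_to_idx) > 65535:
    dtype = np.uint32
elif len(token_to_idx) > 255:
    dtype = np.uint16
else:
    dtype = np.uint8
```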
scripts/preprocess.py (outdated diff)
# Write data to HDF5 file
Minor comment #2: Indentation doesn't match the rest of the file.
LGTM; this is a useful improvement, though I haven't tested it. Two minor comments left inline.
Fixed the commented issues. I have tested it / am currently testing it: the preprocessor ran cleanly (1.8 GB input.txt with 256 "characters", in that the input is raw bytes rather than encoded text), and the network is currently calculating validation loss for iteration 4000, running at default settings. It certainly doesn't seem like it's going to break. Perhaps when this is done I'll work out how to make it read two bytes per "character" and test it on Red Book audio (16-bit stereo) to bump up the data type, then four bytes, treating both channels together as a single sample, to bump it up again. (If someone else wants to do that, it'd be done faster; I've used up my spare-time budget getting this far.)
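(For anyone picking that up: a minimal sketch of what reading two or four bytes per "character" might look like, assuming headerless little-endian PCM input; the file name and dtypes here are illustrative, not part of this PR.)

```python
import numpy as np

# Hypothetical sketch, not part of this PR: treat each 16-bit little-endian
# sample of headerless Red Book audio as one token (vocabulary up to 65536),
# or each 32-bit stereo frame (both channels together) as one token.
tokens_16bit = np.fromfile('input.raw', dtype='<u2')  # two bytes per "character"
tokens_32bit = np.fromfile('input.raw', dtype='<u4')  # four bytes per "character"
```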
Anybody want to merge this? It looks perfect for handling a 2 GB dataset with 4 GB of RAM.
The previous version said "we'll have to do something more clever for huge datasets."
I tried this with a huge dataset (the entire five-CD Okami soundtrack, WAV-formatted with the headers stripped out, tracks separated by five seconds of silence, repeated three times and shuffled... about 10 GB, but then the preprocessor maps each byte to a uint32, so it gets magnified a bit).
I wouldn't call it "something clever," but it seems to work. All I did was reorganize things so that instead of creating all the numpy arrays at once, filling them, and writing them all to the HDF5 file, it creates one, fills it, writes it, garbage-collects it, then moves on to the next.
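A rough sketch of that pattern, assuming h5py; the split names, sizes, and stand-in token data are illustrative rather than copied from preprocess.py:

```python
import gc
import h5py
import numpy as np

# Sketch of the one-array-at-a-time pattern described above: create each
# split's array, fill it, write it to the HDF5 file, then free it before
# creating the next one. Names, sizes, and token data are placeholders.
dtype = np.uint8                                   # picked from the vocab size
splits = {'train': 8_000_000, 'val': 1_000_000, 'test': 1_000_000}

with h5py.File('data.h5', 'w') as f:
    for name, size in splits.items():
        arr = np.zeros(size, dtype=dtype)          # one array at a time
        arr[:] = np.random.randint(0, 256, size)   # stand-in for real token ids
        f.create_dataset(name, data=arr)           # write this split out
        del arr                                    # drop the only reference...
        gc.collect()                               # ...so the memory is reclaimed
```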