Fill and write each array before creating the next one, to save memory. #177
Conversation
Realized another improvement: the preprocessor can now use numpy arrays of the smallest dtype that fits the vocabulary (uint8/uint16/uint32).
scripts/preprocess.py (outdated diff)
@@ -45,33 +45,44 @@
# Choose the datatype based on the vocabulary size
dtype = np.uint8
if len(token_to_idx) > 255:
Minor comment #1: use a single level of branching:
if len(..) > 4294967295:
....
elif len(..) > 65535:
...
else:
...
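(For concreteness, a sketch of how the dtype selection could look with a single level of branching; `token_to_idx` comes from the diff above, and the uint64 fallback for the largest case is an assumption rather than anything in the PR.)

```python
import numpy as np

# Pick the smallest unsigned integer dtype that can hold every token index.
# Single level of branching, per the suggestion above; the uint64 branch is
# an assumed fallback for vocabularies larger than 2**32 - 1.
if len(token_to_idx) > 4294967295:
    dtype = np.uint64
elif len(token_to_idx) > 65535:
    dtype = np.uint32
elif len(token_to_idx) > 255:
    dtype = np.uint16
else:
    dtype = np.uint8
```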
scripts/preprocess.py (outdated diff)
# Write data to HDF5 file
Minor comment #2: Indentation doesn't match the rest of the file.
LGTM; this is a useful improvement, though I haven't tested it. Two minor comments left inline.
Fixed the commented issues. I have tested it / am currently testing it: the preprocessor ran cleanly (1.8 GB input.txt with 256 "characters", in that the input is raw bytes rather than encoded text), and the network is currently calculating validation loss for iteration 4000, running at default settings. It certainly doesn't seem like it's going to break. Perhaps when this is done I'll work out how to make it read two bytes per "character" and test it on Red Book audio (16-bit stereo) to bump up the data type, then four bytes, treating both channels together as a single sample, to bump it up again. (If someone else wants to do that, it'd be done faster; I've used up my spare-time budget getting this far.)
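(For anyone picking that up: a minimal sketch of what reading two or four bytes per "character" might look like, assuming headerless little-endian PCM input; the file name and dtypes here are illustrative, not part of this PR.)

```python
import numpy as np

# Hypothetical sketch, not part of this PR: treat each 16-bit little-endian
# sample of headerless Red Book audio as one token (vocabulary up to 65536),
# or each 32-bit stereo frame (both channels together) as one token.
tokens_16bit = np.fromfile('input.raw', dtype='<u2')  # two bytes per "character"
tokens_32bit = np.fromfile('input.raw', dtype='<u4')  # four bytes per "character"
```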
Anybody want to merge this? It looks perfect for handling a 2 GB dataset with 4 GB of RAM.
The previous version said "we'll have to do something more clever for huge datasets."
I tried this with a huge dataset (the entire five-CD Okami soundtrack, WAV-formatted with the headers stripped out, tracks separated by five seconds of silence, repeated three times and shuffled... about 10 GB, but then the preprocessor maps each byte to a uint32, so it gets magnified a bit).
I wouldn't call it "something clever," but it seems to work. All I did was reorganize things so that instead of creating all the numpy arrays at once, filling them, and writing them all to the HDF5 file, it creates one, fills it, writes it, garbage-collects it, then moves on to the next.
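A rough sketch of that pattern, assuming h5py; the split names, sizes, and stand-in token data are illustrative rather than copied from preprocess.py:

```python
import gc
import h5py
import numpy as np

# Sketch of the one-array-at-a-time pattern described above: create each
# split's array, fill it, write it to the HDF5 file, then free it before
# creating the next one. Names, sizes, and token data are placeholders.
dtype = np.uint8                                   # picked from the vocab size
splits = {'train': 8_000_000, 'val': 1_000_000, 'test': 1_000_000}

with h5py.File('data.h5', 'w') as f:
    for name, size in splits.items():
        arr = np.zeros(size, dtype=dtype)          # one array at a time
        arr[:] = np.random.randint(0, 256, size)   # stand-in for real token ids
        f.create_dataset(name, data=arr)           # write this split out
        del arr                                    # drop the only reference...
        gc.collect()                               # ...so the memory is reclaimed
```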