vocab not being written correctly #27

ajtritt · 2021-02-22T23:53:14Z

In a DeepIndexFile that is currently sitting on Cori at $CSCRATCH/exabiome/deep-index/input/gtdb/r95/ar122_r95.input.h5, the vocabulary attribute on seq_table/sequence was [A T C G N], when it should be [A C Y W S K D V N T G R W S M H B N].

This attribute should get set during conversion here:

deep-taxon/src/exabiome/gtdb/prepare_data.py

Line 306 in d4ddbf3

vocab = np.array(list(vocab_it.characters()))

According to these lines, the correct vocabulary should be returned:

deep-taxon/src/exabiome/sequence/convert.py

Lines 354 to 355 in d7f54dd

    
    chars = ('ACYWSKDVN'
             'TGRWSMHBN')
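A check along the following lines would flag the mismatch at conversion time. This is only a minimal sketch: `check_vocab` and `EXPECTED_VOCAB` are hypothetical names, not code from the repository, and it assumes the vocabulary arrives as a sequence of single characters (as `str` or `bytes`).

```python
# Expected vocabulary, copied from the `chars` tuple quoted above.
EXPECTED_VOCAB = tuple('ACYWSKDVN' 'TGRWSMHBN')


def check_vocab(vocab, expected=EXPECTED_VOCAB):
    """Raise ValueError if `vocab` differs from `expected` (hypothetical helper)."""
    # Values read back from HDF5 may be bytes, so normalise everything to str.
    found = tuple(c.decode() if isinstance(c, bytes) else str(c) for c in vocab)
    if found != tuple(expected):
        raise ValueError(
            "unexpected vocabulary: found [%s], expected [%s]"
            % (" ".join(found), " ".join(expected))
        )
    return found


# The value reported in this issue fails the check.
try:
    check_vocab(list("ATCGN"))
except ValueError as err:
    print(err)
```

Comparing the full ordered tuple, rather than only the set of characters, also catches a vocabulary that has the right characters in the wrong order.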


* add function to extract profile data
* add assert to catch bad vocab. See #27
* add command to give dataset info
* Clean up some things
  - clean up argparse arguments
  - use CSV logger
  - always use DDP. Remove SLURM/LSF support (until PL stabilizes)
* clean up job sumission
* update base model to work with PL
* improve chunking efficiency
* add command to exec
* remove name from environment
* remove read of downsample

* add function to extract profile data
* add assert to catch bad vocab. See #27
* add command to give dataset info
* Clean up some things
  - clean up argparse arguments
  - use CSV logger
  - always use DDP. Remove SLURM/LSF support (until PL stabilizes)
* clean up job sumission
* update base model to work with PL
* improve chunking efficiency
* add command to exec
* remove name from environment
* remove read of downsample
* add classifier for ResNet feature models
* add options for using classifier
* add sensible statement for recurring error
* work out kinks in ResNet classifier
* add options for job submission script
* set model subdirectory if starting classification from features
* Resume from checkpoint of ResNetClassifier
  - Use LightningDataModule so we can determine number of model outputs before initializing model
* clean up to make checkpointing and adding classifer work
* Clean up for Cori (#28)
* add arguments to job runner
* add field that was missing after restarting from checkpoint

ajtritt self-assigned this Feb 22, 2021

ajtritt added a commit that referenced this issue Mar 2, 2021

add assert to catch bad vocab. See #27

343f419
