TimeSformer is Facebook's new transformer-based video model: https://github.com/facebookresearch/TimeSformer
This repository builds TensorRT engines for TimeSformer and shows that, when converting to FP16, large numerical discrepancies appear.
The network structure is fairly straightforward (a shape-level sketch follows this list):
- for a video of size 8x224x224, generate 8x14x14 patch embeddings
- run 12 divided space-time (T-S) self-attention blocks
- use the final classification token as the feature to compute the classification result
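For orientation, here is a shape-level sketch of that layout. This is not the official TimeSformer code; names are hypothetical and the attention blocks are left as placeholders.

```python
import torch
import torch.nn as nn

B, T, D = 1, 8, 768                               # batch, frames, embedding dim
patch_embed = nn.Conv2d(3, D, kernel_size=16, stride=16)

frames = torch.randn(B * T, 3, 224, 224)          # 8 frames of 224x224
patches = patch_embed(frames)                     # (B*T, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)       # (B*T, 196, 768)
tokens = tokens.reshape(B, T * 14 * 14, D)        # 8x14x14 = 1568 patch embeddings per clip
cls_token = torch.zeros(B, 1, D)                  # classification token
x = torch.cat([cls_token, tokens], dim=1)         # (B, 1569, 768)

for _ in range(12):
    # each divided space-time block: temporal self-attention, spatial
    # self-attention, then an MLP (omitted in this sketch)
    pass

cls_feature = x[:, 0]                             # fed to the classifier head
```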
Hence the following outputs are validated against known input-output pairs (a sketch of the difference check follows this list):
- Final probabilities
- Patch features after each of the 12 T-S blocks
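The numbers reported further down are element-wise differences between the reference outputs and the engine outputs. A minimal sketch of that check, assuming a max-absolute-difference metric (the actual metric lives in conversions.py; variable names here are hypothetical):

```python
import numpy as np

def max_abs_diff(reference, candidate):
    """Max absolute element-wise difference between two arrays."""
    return float(np.max(np.abs(np.asarray(reference) - np.asarray(candidate))))

# e.g. comparing PyTorch reference outputs against TensorRT engine outputs:
# print("Final probabilities difference:", max_abs_diff(torch_probs, trt_probs))
# for i, (ref, out) in enumerate(zip(torch_block_feats, trt_block_feats)):
#     print(f"Patch embedding diff after {i} S-T blocks: {max_abs_diff(ref, out)}")
```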
So far we have the following findings:
- The FP32 engine produces correct results.
- The FP16 engine, built without any modifications, produces incorrect results: the error on patch features jumps from about 0.01 to 0.4 at the 6th block.
- If we expose the difference tensors from layer norm as network outputs, the FP16 engine produces correct results (a sketch of how tensors can be marked as outputs follows this list).
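For reference, a hedged sketch of how intermediate tensors can be marked as extra engine outputs with the TensorRT Python API. The repo does this via the `--expose_ln_entries` flag; the name matching below is purely illustrative.

```python
import tensorrt as trt

def expose_intermediate_outputs(network, name_fragment):
    """Mark every layer output whose tensor name contains name_fragment
    as a network output, and keep it in FP32."""
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        for j in range(layer.num_outputs):
            tensor = layer.get_output(j)
            if tensor.name and name_fragment in tensor.name:
                network.mark_output(tensor)
                tensor.dtype = trt.float32
```

Marking a tensor as an output prevents TensorRT from fusing the surrounding layers, which is presumably why the FP16 results change once the layer-norm intermediates are exposed.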
The environment is built from the TensorRT Docker image on NGC:
docker pull nvcr.io/nvidia/tensorrt:21.06-py3
TensorRT: 7.2.3.4
CUDA: 11.3
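A quick sanity check of the TensorRT version inside the container (optional):

```python
import tensorrt as trt
print(trt.__version__)   # expected: 7.2.3.4
```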
# clone repo
git clone https://github.com/zmy1116/timesfromer_trtfp16_error
cd timesfromer_trtfp16_error
# download pretrained weights from TimeSformer official repo
wget https://www.dropbox.com/s/g5t24we9gl5yk88/TimeSformer_divST_8x32_224_K400.pyth
# Build and test the engine in FP16, with/without exposing layernorm diffs as outputs
python conversions.py TimeSformer_divST_8x32_224_K400.pyth test_data.p fp16engine_expose_ln result_fp16_expose_ln.p --expose_ln_entries --fp16
python conversions.py TimeSformer_divST_8x32_224_K400.pyth test_data.p fp16engine result_fp16 --fp16
# Build and test the engine in FP32, with/without exposing layernorm diffs as outputs
python conversions.py TimeSformer_divST_8x32_224_K400.pyth test_data.p fp32engine_expose_ln result_fp32_expose_ln.p --expose_ln_entries
python conversions.py TimeSformer_divST_8x32_224_K400.pyth test_data.p fp32engine result_fp32
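For context, a minimal sketch of how an FP16 engine is typically built with the TensorRT 7.x Python API; the actual flow in conversions.py may differ (the ONNX path and function name are assumptions):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

def build_engine(onnx_path, fp16=True):
    builder = trt.Builder(TRT_LOGGER)
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))
    config = builder.create_builder_config()
    config.max_workspace_size = 4 << 30          # 4 GB workspace
    if fp16:
        config.set_flag(trt.BuilderFlag.FP16)    # allow FP16 kernels
    return builder.build_engine(network, config)
```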
Printed traces/outputs for the 4 cases are included in the traces folder.
Results for FP16 without exposing layernorm diffs look like the following; notice the sudden error jump at the 6th block:
Run evaluation on test data
Final probabilities difference: 1.0182867
Patch embedding diff after 0 S-T blocks: 0.004299811087548733
Patch embedding diff after 1 S-T blocks: 0.00536126596853137
Patch embedding diff after 2 S-T blocks: 0.00613724160939455
Patch embedding diff after 3 S-T blocks: 0.00694211944937706
Patch embedding diff after 4 S-T blocks: 0.009320137090981007
Patch embedding diff after 5 S-T blocks: 0.013058185577392578
Patch embedding diff after 6 S-T blocks: 0.4091333746910095
Patch embedding diff after 7 S-T blocks: 0.430844247341156
Patch embedding diff after 8 S-T blocks: 0.44067439436912537
Patch embedding diff after 9 S-T blocks: 0.46074366569519043
Patch embedding diff after 10 S-T blocks: 0.45822423696517944
Patch embedding diff after 11 S-T blocks: 0.5575547814369202
Results for FP16 with exposing layernorm diffs look like the following:
Final probabilities difference: 0.0033378536
Patch embedding diff after 0 S-T blocks: 0.0042906636372208595
Patch embedding diff after 1 S-T blocks: 0.005335847847163677
Patch embedding diff after 2 S-T blocks: 0.0060969325713813305
Patch embedding diff after 3 S-T blocks: 0.006786249577999115
Patch embedding diff after 4 S-T blocks: 0.00906979851424694
Patch embedding diff after 5 S-T blocks: 0.014810287393629551
Patch embedding diff after 6 S-T blocks: 0.015742896124720573
Patch embedding diff after 7 S-T blocks: 0.015805872157216072
Patch embedding diff after 8 S-T blocks: 0.015810402110219002
Patch embedding diff after 9 S-T blocks: 0.01605822518467903
Patch embedding diff after 10 S-T blocks: 0.016058534383773804
Patch embedding diff after 11 S-T blocks: 0.01618446595966816