
Cuda error 77, please suggest how to debug #68

Open
sergei-mironov opened this issue Aug 24, 2020 · 5 comments

Comments

@sergei-mironov

Hi. I applied TASO (commit dce8c4d) to several models from onnx-models. One frequent error I got is:

Cuda failure: 77
/workspace/taso/src/cudnn/element_kernel.cu:242
Aborting...

Affected models are: inception-v2-9, mnist-8, resnet101-v2-7, resnet18-v2-7, roberta-base-11, shufflenet-9, vgg19-7, yolov4.
Since even the trivial mnist model is on the list, I suspect the problem is caused by some environment issue, such as a package version mismatch or the like.

The error message is not very verbose, and the referenced CUDA line doesn't look suspicious. I would be glad to provide more debugging information, but unfortunately I'm not an expert in low-level CUDA. Could you please suggest what I can do to collect more information?

@Alex-Sol

Alex-Sol commented Oct 20, 2020

You could modify Element::use_kernel() to return true so that the element-wise CUDA kernel is used instead of the cuDNN path. This works around the bug.

bool Element::use_kernel(void) const {
    switch (type) {
        case OP_EW_ADD:
            return true;
        case OP_EW_MUL:
        case OP_EW_MAX:
        case OP_EW_MIN:
            break;
        default:
            return false;
    }
......

@jiahuiyang

jiahuiyang commented Nov 18, 2020

Hi @Alex-Sol,
I met the same problem as @grwlf. After changing the Element::use_kernel function, I ran into the following problem with resnet50:

/home/TASO/src/cudnn/cuda_helper.cu:83: void helperSetBroadcastableTensorDescriptor(const taso::Tensor&, const taso::Tensor&, cudnnTensorDescriptor_t): Assertion `input.default_layout()' failed.
Aborted (core dumped)

Could you help me solve this problem?

@jiahuiyang

I found that the problem is related to the gemm operator. If I comment out some code in __init__.py as follows, I don't get the layout problem. But I still need to know how to solve it properly. @Alex-Sol

def _gemm(op, graph, tensors, initializer):
    inputs = _get_inputs(op, graph, tensors, initializer)
    attrs = _parse_attribute(op.attribute)
    if "transA" in attrs and attrs["transA"] == 1:
        inputs[0] = graph.transpose(inputs[0], (1,0), shuffle=True)
    if "transB" in attrs and attrs["transB"] == 1:
        inputs[1] = graph.transpose(inputs[1], (1,0), shuffle=True)
    outputs = graph.matmul(inputs[0], inputs[1])
    # if len(inputs) > 2:
    #     outputs = graph.add(outputs, inputs[2])
    return outputs
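For context (not stated in the thread): the commented-out graph.add is where the rank-1 ONNX Gemm bias meets the rank-2 matmul output, which is presumably what trips the default_layout assertion. A small NumPy sketch of the shape situation (NumPy is used here purely for illustration; TASO's add evidently has stricter rank/layout requirements than NumPy broadcasting):

```python
import numpy as np

# Typical ONNX Gemm shapes: (batch, k) @ (k, n) -> (batch, n)
out = np.zeros((8, 1000))   # matmul output, rank 2
bias = np.zeros((1000,))    # Gemm bias input "C", rank 1

# NumPy happily broadcasts the rank-1 bias against the rank-2 output,
# but an element-wise add that insists on equal-rank, default-layout
# tensors (as the failing assertion suggests TASO's does) would reject
# this pairing of a rank-1 and a rank-2 tensor.
summed = out + bias
print(summed.shape)         # (8, 1000)
```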

@Alex-Sol

@jiahuiyang This may be a bug caused by a mismatch between the dimensions of the bias and the matmul output.
I have fixed it like this in python/taso/__init__.py:

if len(inputs) > 2:
    dim = inputs[2].dim(0)
    reshape_bias = graph.reshape(inputs[2], (1,dim))
    outputs = graph.add(outputs, reshape_bias)
return outputs
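A minimal sketch of why the reshape helps, again using NumPy stand-ins rather than TASO's graph API: giving the bias an explicit leading axis makes both operands of the element-wise add the same rank.

```python
import numpy as np

out = np.zeros((8, 1000))        # matmul output, rank 2
bias = np.zeros((1000,))         # rank-1 Gemm bias

# Mirror of the fix: reshape the bias to (1, dim) so the add sees two
# rank-2 tensors; the leading axis of size 1 then broadcasts over the
# batch dimension.
dim = bias.shape[0]
reshape_bias = bias.reshape((1, dim))   # shape (1, 1000)
outputs = out + reshape_bias            # broadcasts over the batch axis
print(outputs.shape)                    # (8, 1000)
```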

@jiahuiyang


Great, thanks!


3 participants