
Harbor is throwing 503s #21250

Open
veerendra2 opened this issue Nov 26, 2024 · 10 comments
Labels
help wanted The issues that is valid but needs help from community

Comments

veerendra2 commented Nov 26, 2024

We get 503 errors once in a while on the /v2/* endpoint and have to re-run the GitHub pipeline to make it work.

$ k get pods -n harbor
NAME                                 READY   STATUS    RESTARTS      AGE
harbor-core-5d79cd78d5-f8lrs         2/2     Running   0             107m
harbor-core-5d79cd78d5-gxcdq         2/2     Running   0             107m
harbor-core-5d79cd78d5-nf2z2         2/2     Running   0             107m
harbor-exporter-7946548cbd-vnp7m     2/2     Running   0             107m
harbor-jobservice-66bf496776-6qv4h   2/2     Running   0             107m
harbor-jobservice-66bf496776-kmp6p   2/2     Running   0             107m
harbor-jobservice-66bf496776-pksgz   2/2     Running   0             106m
harbor-portal-5f47b6c7fb-478nv       2/2     Running   0             107m
harbor-portal-5f47b6c7fb-5kcq8       2/2     Running   0             107m
harbor-portal-5f47b6c7fb-kfnkg       2/2     Running   0             107m
harbor-postgres-db-0                 4/4     Running   1 (88m ago)   88m
harbor-postgres-db-1                 4/4     Running   2 (86m ago)   86m
harbor-postgres-db-2                 4/4     Running   2 (87m ago)   87m
harbor-registry-557df7f8c5-jqwxz     3/3     Running   0             107m
harbor-registry-557df7f8c5-tdr8w     3/3     Running   0             107m
harbor-registry-557df7f8c5-wlzf8     3/3     Running   0             107m
harbor-trivy-0                       2/2     Running   0             106m
harbor-trivy-1                       2/2     Running   0             106m
harbor-trivy-2                       2/2     Running   0             107m
redis-node-0                         4/4     Running   0             27h
redis-node-1                         4/4     Running   0             27h
redis-node-2                         4/4     Running   0             27h
  • We are able to access Harbor via the portal without any problem

Attaching screenshots:

  • Harbor metrics Grafana dashboard (screenshot)

  • istio-gateway logs for Harbor showing HTTP 503 responses (screenshot)

Steps to reproduce the problem:

  • Deploy Harbor via the Helm chart with Redis and PostgreSQL
  • Set the nginx proxy replicas to 0
  • Deploy a VirtualService to access Harbor (a sketch follows below)
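
A minimal sketch of such a VirtualService, for reference; the hostname, gateway reference and path split between core and portal are assumptions, not our exact manifest:

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: harbor
  namespace: harbor
spec:
  hosts:
  - harbor.example.com            # illustrative hostname
  gateways:
  - istio-system/harbor-gateway   # illustrative gateway reference
  http:
  - match:                        # API/registry paths go to harbor-core
    - uri:
        prefix: /v2/
    - uri:
        prefix: /api/
    - uri:
        prefix: /service/
    - uri:
        prefix: /c/
    route:
    - destination:
        host: harbor-core.harbor.svc.cluster.local
        port:
          number: 80
  - route:                        # everything else goes to the portal UI
    - destination:
        host: harbor-portal.harbor.svc.cluster.local
        port:
          number: 80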

Versions:
Please specify the versions of following systems.

  • harbor helm chart version: 1.16.0
  • harbor version: [v2.12.0](https://github.com/goharbor/harbor/releases/tag/v2.12.0)
  • AKS kubernetes version: v1.30.3

Additional context:

veerendra2 changed the title from "Error: Error response from daemon: login attempt to /v2/ failed with status: 503 Service Unavailable" to "Harbor is throwing 503s" on Nov 26, 2024
stonezdj (Contributor) commented:

Maybe something is wrong with Redis and PostgreSQL; the Postgres DB restarted 88 minutes ago. What is the deployment type of the three PostgreSQL instances?
For Redis:

2024-11-26 15:37:20;redis: 2024/11/26 14:37:20 pubsub.go:159: redis: discarding bad PubSub connection: read tcp 10.244.9.47:51360->10.244.0.149:26379: read: connection reset by peer
2024-11-26 15:37:20;redis: 2024/11/26 14:37:20 sentinel.go:587: sentinel: GetMasterAddrByName name="mymaster" failed: EOF
2024-11-26 15:37:19;redis: 2024/11/26 14:37:19 sentinel.go:587: sentinel: GetMasterAddrByName name="mymaster" failed: read tcp 10.244.20.157:60812->10.244.0.149:26379: read: connection reset by peer
2024-11-26 15:37:19;redis: 2024/11/26 14:37:19 pubsub.go:159: redis: discarding bad PubSub connection: read tcp 10.244.2.30:56824->10.244.0.149:26379: read: connection reset by peer
2024-11-26 15:37:19;redis: 2024/11/26 14:37:19 pubsub.go:159: redis: discarding bad PubSub connection: read tcp 10.244.2.30:56838->10.244.0.149:26379: read: connection reset by peer
2024-11-26 15:37:19;redis: 2024/11/26 14:37:19 pubsub.go:159: redis: discarding bad PubSub connection: read tcp 10.244.2.30:56854->10.244.0.149:26379: read: connection reset by peer
2024-11-26 15:37:19;redis: 2024/11/26 14:37:19 pubsub.go:159: redis: discarding bad PubSub connection: read tcp 10.244.20.157:60700->10.244.0.149:26379: read: connection reset by peer
2024-11-26 15:37:19;redis: 2024/11/26 14:37:19 pubsub.go:159: redis: discarding bad PubSub connection: read tcp 10.244.20.157:60708->10.244.0.149:26379: read: connection reset by peer
2024-11-26 15:37:19;redis: 2024/11/26 14:37:19 pubsub.go:159: redis: discarding bad PubSub connection: read tcp 10.244.9.47:51270->10.244.0.149:26379: read: connection reset by peer
2024-11-26 15:37:19;redis: 2024/11/26 14:37:19 pubsub.go:159: redis: discarding bad PubSub connection: EOF

veerendra2 (Author) commented Nov 27, 2024

@stonezdj We use https://github.com/zalando/postgres-operator to manage PostgreSQL

$ k get postgresql
NAME                 TEAM     VERSION   PODS   VOLUME   CPU-REQUEST   MEMORY-REQUEST   AGE    STATUS
harbor-postgres-db   harbor   15        3      2Gi      100m          1024Mi           130d   Running

$ k get sts
NAME                 READY   AGE
harbor-postgres-db   3/3     19h
harbor-trivy         3/3     130d
redis-node           3/3     130d

$ k get svc
NAME                        TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
harbor                      ClusterIP   10.0.134.156   <none>        80/TCP                       130d
harbor-core                 ClusterIP   10.0.198.244   <none>        80/TCP,8001/TCP              130d
harbor-exporter             ClusterIP   10.0.255.72    <none>        8001/TCP                     49d
harbor-jobservice           ClusterIP   10.0.233.25    <none>        80/TCP,8001/TCP              130d
harbor-portal               ClusterIP   10.0.129.86    <none>        80/TCP                       130d
harbor-postgres-db          ClusterIP   10.0.218.184   <none>        5432/TCP                     130d
harbor-postgres-db-config   ClusterIP   None           <none>        <none>                       130d
harbor-postgres-db-repl     ClusterIP   10.0.5.126     <none>        5432/TCP                     130d
harbor-registry             ClusterIP   10.0.234.173   <none>        5000/TCP,8080/TCP,8001/TCP   130d
harbor-trivy                ClusterIP   10.0.134.5     <none>        8080/TCP                     130d
patroni-metrics             ClusterIP   10.0.31.14     <none>        9547/TCP                     30d
postgres-exporter           ClusterIP   10.0.197.179   <none>        9187/TCP                     30d
redis                       ClusterIP   10.0.186.250   <none>        6379/TCP,26379/TCP           130d
redis-headless              ClusterIP   None           <none>        6379/TCP,26379/TCP           130d
redis-metrics               ClusterIP   10.0.250.114   <none>        9121/TCP                     130d

We updated the PostgreSQL sidecar container earlier; that's why there were some restarts.

Please let me know if any further details are needed.

EDIT

Attaching Jaeger traces for an endpoint (screenshot)

veerendra2 (Author) commented Nov 27, 2024

harbor-core is throwing 404 errors; I can see them in the istio-proxy container logs below (screenshot).

I increased the harbor-core replicas from 3 to 5 to see if there are any improvements.


veerendra2 (Author) commented:

Update

I even checked whether the sha256 layer path really exists in my storage account; indeed, the sha256 layer for the image does exist in the storage account (screenshot).

I searched for the same sha256 layer in Azure Storage Explorer (screenshot).

There are also a lot of ClientErrors/failed transactions in the Azure storage account insights, although these ClientErrors have existed for a long time (screenshot).

veerendra2 (Author) commented:

It seems that mainly the upstream (harbor-core) is resetting the connection. By the way, I increased the number of replicas from 3 to 5:

$ k get deploy
NAME                READY   UP-TO-DATE   AVAILABLE   AGE
harbor-core         5/5     5            5           131d
harbor-exporter     1/1     1            1           50d
harbor-jobservice   5/5     5            5           131d
harbor-nginx        0/0     0            0           131d
harbor-portal       3/3     3            3           131d
harbor-registry     5/5     5            5           131d

Still the same, getting 503s (screenshot).

stonezdj (Contributor) commented Dec 2, 2024

What is the output in the harbor-core log? Harbor core doesn't throw a 503 error in its own code; this error is usually thrown by front-end components.

reasonerjt added the "help wanted" label on Dec 2, 2024
veerendra2 (Author) commented:

@stonezdj

What is the output in the harbor-core log?

I already attached the harbor-core debug logs here.

The upstream (harbor-core) is resetting the connection; that's why the istio-proxy sidecar is throwing 503s.

We had to add retries to the VirtualService, like below, to fix "login attempt to https://[REDACTED]/v2/ failed with status: 503 Service Unavailable":

    match:
    - uri:
        prefix: /v2/
    retries:
      attempts: 3
      retryOn: "503"
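
For context, a hedged sketch of how that fragment might sit inside the full http entry of the VirtualService; the destination host and port are assumptions based on the services listed above:

http:
- match:
  - uri:
      prefix: /v2/
  retries:
    attempts: 3
    retryOn: "503"
  route:
  - destination:
      host: harbor-core.harbor.svc.cluster.local
      port:
        number: 80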

hajnalmt (Contributor) commented Dec 3, 2024

Hello @veerendra2
I can see some proxy-cache errors in the core logs.

Can you add some additional logs from the istio-proxy (for all the core pods), the actual container logs instead of the klogs output?

I am curious whether some of your traffic is being routed towards the blackhole cluster. Additionally, are you using the REGISTRY_ONLY outbound traffic policy instead of ALLOW_ANY? If yes, are you properly configuring the ServiceEntry, the DestinationRules and the Gateway? (A sketch of where that policy lives follows below.)

Most of the 503s I have investigated on Istio were solely because the traffic was routed to the blackhole cluster due to a misconfigured VirtualService, Gateway or DestinationRule.
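
For reference, a minimal sketch of where that outbound policy is usually configured (shown here as an IstioOperator meshConfig; it may equally be set in the istio ConfigMap, and the resource name is illustrative):

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio
  namespace: istio-system
spec:
  meshConfig:
    outboundTrafficPolicy:
      mode: ALLOW_ANY   # REGISTRY_ONLY routes unknown hosts to the BlackHoleCluster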

veerendra2 (Author) commented Dec 4, 2024

@hajnalmt

Can you add some additional logs from the istio-proxy (for all the core pods), actual container logs instead of the klogs output?

Please find attached the logs of istio-proxy (harbor-core pods) ->
kobs-export-logs.log

I am curious if some of your traffic is routed towards the blackhole cluster.

If this were the case, it should happen all the time, but in my case it happens only once in a while. After adding retries, the client experience is a lot better and there are almost no 503s from clients (i.e. docker login in GitHub Actions and docker pulls).

Additinally are you using REGISTRY_ONLY outbound traffic policy instead of ALLOW_ANY? If yes, are you properly configuring the service-entry, the destination rules and the gateway?

It should be ALLOW_ANY (we didn't set any outbound traffic policy), and there are services in our cluster that are able to access things outside of the cluster/mesh (for example, Azure Blob Storage, etc.).
Yesterday, I also added a ServiceEntry and DestinationRule to see if there were any improvements (but it's still the same, I don't see any)
-> https://gist.github.com/veerendra2/5f946d073aff391aff894407bc646281

EDIT
Forgot to mention before: the DestinationRule below already existed.

apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  labels:
    app: harbor-core
    kustomize.toolkit.fluxcd.io/name: harbor
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: harbor-core
  namespace: harbor
spec:
  host: harbor-core.harbor.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        idleTimeout: 50s
    loadBalancer:
      simple: LEAST_REQUEST

And similar for harbor-portal
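
A related knob that is sometimes tried for intermittent upstream connection resets (503s with Envoy's UC flag) is limiting keep-alive connection reuse towards the upstream; this is only a sketch on top of the existing rule, not a confirmed fix, and the maxRequestsPerConnection value is illustrative:

apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: harbor-core
  namespace: harbor
spec:
  host: harbor-core.harbor.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        idleTimeout: 50s
        maxRequestsPerConnection: 1   # illustrative: disables keep-alive reuse towards harbor-core
    loadBalancer:
      simple: LEAST_REQUEST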

veerendra2 (Author) commented Dec 4, 2024

There are also still a lot of client errors shown in the storage account insights (screenshot).
