Bug: Firestore AggregationQuery stuck when running locally in Docker #316

Open
ArmandBriere opened this issue Mar 26, 2024 · 2 comments
@ArmandBriere

Firestore AggregationQuery getting stuck

We are getting a Client.Timeout error when running an AggregationQuery to count the number of documents in a query inside a Docker container.

How to reproduce

Use the code provided below with the following folder structure:

.
├── credentials.json
├── Dockerfile
├── main.py
└── requirements.txt
  • credentials.json is used to authenticate to Google Cloud and access Firestore. For this example, we assume that Firestore is set up for the project and can be accessed with this service account key.
  • Dockerfile, main.py and requirements.txt are provided below.
  • Build and run the container:
docker build -t bug .
docker run -d -p 8888:8080 --name bug bug:latest
  • Running the following curl command outputs the expected result:
$ curl http://localhost:8888
Count: 0.0

At this point the code works well. The issue appears when we restart the Docker container and send multiple concurrent requests to the endpoint using the hey HTTP load generator to simulate real traffic on our application:

docker restart bug
hey -c 10 -n 100 -m GET http://localhost:8888/

From that point we are not getting any response back from the application, and hey reports Get "http://localhost:8888/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

This setup reproduces the error on both Linux and macOS.

The error only seems to appear when we restart the container, and it is fixed again by restarting the container a second time: it alternates between stuck and unstuck on every restart. We didn't manage to reproduce this bug by running the code outside of Docker.

Is there any undocumented caching or network behavior in this library that we should know about and that requires specific Docker configuration?

What we tested

  • Running the same code without the data = aggregate_query.count().get() line solves the timeout issue. We obviously no longer get the count we need, but this isolates the issue to that line.
  • Adding the timeout parameter, aggregate_query.count().get(timeout=2), does not change anything for us; the parameter doesn't seem to have any effect (see the sketch after this list).
  • We tested this code on different networks to rule out firewall rules that could block the network calls.
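
For reference, the timeout variant we tried looks roughly like this (a minimal sketch; the explicit Retry policy is an extra attempt layered on top of the plain timeout=2 call, and the deadline values are arbitrary):

from google.api_core.retry import Retry
from google.cloud.firestore_v1 import Query
from google.cloud.firestore_v1.aggregation import AggregationQuery


def count_with_timeout(query: Query) -> int:
    """Count documents, passing an explicit per-attempt timeout and retry policy."""
    aggregate_query = AggregationQuery(query)
    # timeout is the per-attempt deadline in seconds; Retry bounds the overall time.
    data = aggregate_query.count().get(timeout=2, retry=Retry(timeout=10.0))
    return int(data[0][0].value)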

Source code

  • main.py
"""BUGGED module."""

from datetime import datetime, timedelta
from typing import Tuple

import flask
import functions_framework
from flask import Response
from google.cloud.firestore_v1 import Query
from google.cloud.firestore_v1.aggregation import AggregationQuery
from google.cloud.firestore_v1.base_query import FieldFilter
from google.cloud.firestore_v1.client import Client as FirestoreClient

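# Module-level client, created once at import time and reused by every request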
FIRESTORE_CLIENT = FirestoreClient()


def count_data_in_query_bugged(query: Query) -> int:
    """Count data in query."""
    print("Start counting data in query")
    # Transform to aggregation query to count
    aggregate_query: AggregationQuery = AggregationQuery(query)
    data = aggregate_query.count().get()
    count = data[0][0].value
    print("end counting data in query")
    return count


@functions_framework.http
def entry_point(request: flask.Request) -> Tuple[Response | str, int]:
    print("Request received")
    start = datetime.now() - timedelta(days=1)
    end = datetime.now()

    query = (
        FIRESTORE_CLIENT.collection("statistics")
        .where(filter=FieldFilter("status", "==", "acceptable"))
        .where(filter=FieldFilter("timestamp", ">=", start))
        .where(filter=FieldFilter("timestamp", "<", end))
    )

    count = count_data_in_query_bugged(query)
    print(count)
    return f"Count: {count}", 200
  • Dockerfile
FROM python:3.11

WORKDIR /app

COPY . .

# Install requirements
RUN pip install -r requirements.txt

ENV FUNCTION_TARGET="entry_point"
ENV GOOGLE_APPLICATION_CREDENTIALS="/app/credentials.json"

# Run cloud function locally
CMD functions-framework --target=$FUNCTION_TARGET --debug
  • requirements.txt
blinker==1.7.0
cachetools==5.3.3
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.7
cloudevents==1.10.1
deprecation==2.1.0
Flask==3.0.2
functions-framework==3.5.0
google-api-core==2.18.0
google-auth==2.29.0
google-cloud-core==2.4.1
google-cloud-firestore==2.15.0
googleapis-common-protos==1.63.0
grpcio==1.62.1
grpcio-status==1.62.1
gunicorn==21.2.0
idna==3.6
itsdangerous==2.1.2
Jinja2==3.1.3
MarkupSafe==2.1.5
packaging==24.0
proto-plus==1.23.0
protobuf==4.25.3
pyasn1==0.5.1
pyasn1-modules==0.3.0
requests==2.31.0
rsa==4.9
urllib3==2.2.1
watchdog==4.0.0
Werkzeug==3.0.1
@zackarydev

Do you need to close the client or gracefully shutdown? What about a try-catch or increasing the client timeout secs?

@ArmandBriere
Author

Do you need to close the client or gracefully shutdown? What about a try-catch or increasing the client timeout secs?

I've tried instantiating a new Firestore client on every function call to check whether the globally shared client was the issue. It didn't change the results of the experiment; we are still getting stuck. A minimal sketch of that variant is below.
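
The per-call variant looks roughly like this (an illustrative sketch; the function name is made up and the filters are abbreviated compared to the repro):

from google.cloud.firestore_v1.aggregation import AggregationQuery
from google.cloud.firestore_v1.base_query import FieldFilter
from google.cloud.firestore_v1.client import Client as FirestoreClient


def count_data_in_query_per_call() -> int:
    """Same counting logic as the repro, but with a fresh client on each request."""
    client = FirestoreClient()  # new client (and new gRPC channel) for this call only
    query = client.collection("statistics").where(
        filter=FieldFilter("status", "==", "acceptable")
    )
    data = AggregationQuery(query).count().get()
    return int(data[0][0].value)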

I've also recreated the experiment in Go using the functions-framework-go library and I am not getting any errors or timeouts so far. I can only say that the docker restart sequence that breaks the Python code doesn't seem to affect the Go code. Switching to another programming language to work around a critical issue in the package isn't a solution for us.
