Bug: Firestore AggregationQuery stuck when running locally in Docker #316

Open
ArmandBriere opened this issue Mar 26, 2024 · 2 comments
@ArmandBriere

Firestore AggregationQuery getting stuck

We are getting a Client.Timeout error when running an AggregationQuery to count the number of documents in a query inside a Docker container.

How to reproduce

Use the code provided below with the following folder structure:

.
├── credentials.json
├── Dockerfile
├── main.py
└── requirements.txt
  • credentials.json is used to authenticate to Google Cloud and access Firestore. For this example, we assume that Firestore is set up for the project and can be accessed with this service account key.
  • Dockerfile, main.py and requirements.txt are provided below.
  • Build and run the container:
docker build -t bug .
docker run -d -p 8888:8080 --name bug bug:latest
  • Running the following curl command outputs the expected result:
$ curl http://localhost:8888
Count: 0.0

At this point the code works well. The issue appears when we restart the Docker container and send multiple concurrent requests to the endpoint using the hey HTTP load generator to simulate real traffic on our application:

docker restart bug
hey -c 10 -n 100 -m GET http://localhost:8888/

From that point we are not getting any response back from the application, and hey reports Get "http://localhost:8888/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

This setup reproduces the error on both Linux and macOS.

The error only seems to appear when we restart the container, and it is fixed again by restarting the container a second time: it alternates between stuck and unstuck on every restart. We didn't manage to reproduce this bug by running the code outside of Docker.

Is there any undocumented caching or network behavior in this library that we should know about and that requires specific Docker configuration?

What we tested

  • Running the same code without the data = aggregate_query.count().get() line solves the timeout issue. We obviously no longer get the count we need, but this isolates the issue to that line.
  • Adding the timeout parameter, aggregate_query.count().get(timeout=2), does not change anything for us; the parameter doesn't seem to have any effect (see the sketch after this list).
  • We tested this code on different networks to rule out firewall rules that could block the network calls.
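
For reference, the timeout variant we tried looks roughly like this (a minimal sketch; the explicit Retry policy is an extra attempt layered on top of the plain timeout=2 call, and the deadline values are arbitrary):

from google.api_core.retry import Retry
from google.cloud.firestore_v1 import Query
from google.cloud.firestore_v1.aggregation import AggregationQuery


def count_with_timeout(query: Query) -> int:
    """Count documents, passing an explicit per-attempt timeout and retry policy."""
    aggregate_query = AggregationQuery(query)
    # timeout is the per-attempt deadline in seconds; Retry bounds the overall time.
    data = aggregate_query.count().get(timeout=2, retry=Retry(timeout=10.0))
    return int(data[0][0].value)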

Source code

  • main.py
"""BUGGED module."""

from datetime import datetime, timedelta
from typing import Tuple

import flask
import functions_framework
from flask import Response
from google.cloud.firestore_v1 import Query
from google.cloud.firestore_v1.aggregation import AggregationQuery
from google.cloud.firestore_v1.base_query import FieldFilter
from google.cloud.firestore_v1.client import Client as FirestoreClient

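# Module-level client, created once at import time and reused by every request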
FIRESTORE_CLIENT = FirestoreClient()


def count_data_in_query_bugged(query: Query) -> int:
    """Count data in query."""
    print("Start counting data in query")
    # Transform to aggregation query to count
    aggregate_query: AggregationQuery = AggregationQuery(query)
    data = aggregate_query.count().get()
    count = data[0][0].value
    print("end counting data in query")
    return count


@functions_framework.http
def entry_point(request: flask.Request) -> Tuple[Response | str, int]:
    print("Request received")
    start = datetime.now() - timedelta(days=1)
    end = datetime.now()

    query = (
        FIRESTORE_CLIENT.collection("statistics")
        .where(filter=FieldFilter("status", "==", "acceptable"))
        .where(filter=FieldFilter("timestamp", ">=", start))
        .where(filter=FieldFilter("timestamp", "<", end))
    )

    count = count_data_in_query_bugged(query)
    print(count)
    return f"Count: {count}", 200
  • Dockerfile
FROM python:3.11

WORKDIR /app

COPY . .

# Install requirements
RUN pip install -r requirements.txt

ENV FUNCTION_TARGET="entry_point"
ENV GOOGLE_APPLICATION_CREDENTIALS="/app/credentials.json"

# Run cloud function locally
CMD functions-framework --target=$FUNCTION_TARGET --debug
  • requirements.txt
blinker==1.7.0
cachetools==5.3.3
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.7
cloudevents==1.10.1
deprecation==2.1.0
Flask==3.0.2
functions-framework==3.5.0
google-api-core==2.18.0
google-auth==2.29.0
google-cloud-core==2.4.1
google-cloud-firestore==2.15.0
googleapis-common-protos==1.63.0
grpcio==1.62.1
grpcio-status==1.62.1
gunicorn==21.2.0
idna==3.6
itsdangerous==2.1.2
Jinja2==3.1.3
MarkupSafe==2.1.5
packaging==24.0
proto-plus==1.23.0
protobuf==4.25.3
pyasn1==0.5.1
pyasn1-modules==0.3.0
requests==2.31.0
rsa==4.9
urllib3==2.2.1
watchdog==4.0.0
Werkzeug==3.0.1
@zackarydev

Do you need to close the client or gracefully shutdown? What about a try-catch or increasing the client timeout secs?

@ArmandBriere
Author

Do you need to close the client or gracefully shutdown? What about a try-catch or increasing the client timeout secs?

I've tried instantiating a new Firestore client on every function call to check whether the globally shared client was the issue. It didn't change the results of the experiment; we are still getting stuck. A minimal sketch of that variant is below.
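
The per-call variant looks roughly like this (an illustrative sketch; the function name is made up and the filters are abbreviated compared to the repro):

from google.cloud.firestore_v1.aggregation import AggregationQuery
from google.cloud.firestore_v1.base_query import FieldFilter
from google.cloud.firestore_v1.client import Client as FirestoreClient


def count_data_in_query_per_call() -> int:
    """Same counting logic as the repro, but with a fresh client on each request."""
    client = FirestoreClient()  # new client (and new gRPC channel) for this call only
    query = client.collection("statistics").where(
        filter=FieldFilter("status", "==", "acceptable")
    )
    data = AggregationQuery(query).count().get()
    return int(data[0][0].value)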

I've also recreated the experiment in Go using the functions-framework-go library and I am not getting any errors or timeouts so far. I can only say that the docker restart sequence that breaks the Python code doesn't seem to affect the Go code. Switching to another programming language to work around a critical issue in the package isn't a solution for us.
