Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow performance on string contains expression. #3188

Closed
kingnguyen93 opened this issue May 30, 2024 · 23 comments
Closed

Slow performance on string contains expression. #3188

kingnguyen93 opened this issue May 30, 2024 · 23 comments

Comments

@kingnguyen93
Copy link

kingnguyen93 commented May 30, 2024

See details below. I have table with milion of records;

image

Here is generated SQL. It's using = ANY (ARRAY['string']).

image

It tooks 0.3s to execute. But I change to this. It's using = ANY ('{ string }').

image

It tooks only 0.02s to execute.

I'm using PostgreSQL 12 on Windows Server 2019.

@kingnguyen93
Copy link
Author

@roji
Copy link
Member

roji commented May 30, 2024

Can you please post an actual minimal example showing the SQL above getting generated, and say which version you're using? EF doesn't generate that SQL for the LINQ you provided; given the minimal code sample below, I'm getting the following

SELECT b."Id", b."Name"
FROM "Blogs" AS b
WHERE b."Name" = ANY (@__names_0)

You can play around with the code sample just below to make it produce the problematic SQL, and then post that.

Minimal code sample
await using var context = new BlogContext();
await context.Database.EnsureDeletedAsync();
await context.Database.EnsureCreatedAsync();

string[] names = ["Blog1", "Blog2", "Blog3"];
_ = await context.Blogs.Where(b => names.Contains(b.Name)).ToListAsync();

public class BlogContext : DbContext
{
    public DbSet<Blog> Blogs { get; set; }

    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
        => optionsBuilder
            .UseNpgsql("Host=localhost;Username=test;Password=test")
            .LogTo(Console.WriteLine, LogLevel.Information)
            .EnableSensitiveDataLogging();
}

public class Blog
{
    public int Id { get; set; }
    public string Name { get; set; }
}

@kingnguyen93
Copy link
Author

@roji Yeah. Your sample is quite enough for the isssue. Can you check the value of @__names_0 parameter?
It's ARRAY["Blog1", "Blog2", "Blog3"], right? But I think it's good to be '{ Blog1, Blog2, Blog3}'.
In my case, WHERE b."Name" = ANY ('{ Blog1, Blog2, Blog3 }') is faster than WHERE b."Name" = ANY (ARRAY['Blog1', 'Blog2', 'Blog3']) with Name is indexed with B-Tree.

@roji
Copy link
Member

roji commented May 30, 2024

@kingnguyen93 there seems to be a bit of confusion here. The parameter cannot be '{ Blog1, Blog2, Blog3}' - that's a string, and ANY does not accept a string.

What I think you're trying to say, is that when there's no parameter at all - i.e. EF generates SQL with the array contents as constants in the SQL (= ANY('{ Blog1, Blog2 }')), that performs faster. That may be the guess; on the other hand, it means that every query generates a different SQL (since array contents differ, and the array isn't parameterized but rather integrated into the SQL), and this causes each query to require planning (as well as evicting unrelated, already-planned queries from Npgsql's internal auto-prepared statement cache).

All this has been discussed very extensively for SQL Server, please see #32394. It's true that I haven't focused on how PostgreSQL behaves here, but it's possible that it may be better to integrates constants in some cases. I'll make a note to investigate this.

@kingnguyen93
Copy link
Author

kingnguyen93 commented May 30, 2024

@roji Yep, you can take a look at 2 images above and line number 5. I just change condition from Id = ANY(ARRAY['Id1', 'Id2', 'Id3'] to Id = ANY('{ Id1, Id2, Id3 }') and it run 10x faster. You can check elapsed time at right bottom. I have about 1 milion record in that table right now.

@roji
Copy link
Member

roji commented May 30, 2024

@kingnguyen93 can you please help provide a minimal repro for this? For a trivial query it does not seem to repro - please see the below console program which creates and seeds a database, and then prints the query plans for both query types. In both cases I'm getting:

Index Only Scan using ix_x on numbers  (cost=0.42..12.88 rows=2 width=4)

Note that I'm not saying there isn't a problem - it just doesn't repro for this trivial cases (making the query more complex could trigger it - that's what I need help with).

await using var dataSource = NpgsqlDataSource.Create("Host=localhost;Username=test;Password=test");
await using var conn = await dataSource.OpenConnectionAsync();

{
    await using var command = new NpgsqlCommand(
        """
        DROP TABLE IF EXISTS numbers;

        CREATE TABLE numbers AS
        SELECT x FROM generate_series(1,1000000) x;

        CREATE INDEX IX_x on numbers(x);

        ANALYZE numbers;
        """, conn);
    await command.ExecuteNonQueryAsync();
}

// Constants in ANY
{
    await using var command = new NpgsqlCommand("EXPLAIN SELECT * FROM numbers WHERE x = ANY('{1,3}')", conn);
    var plan = (string)(await command.ExecuteScalarAsync())!;
    Console.WriteLine(plan);
}

// Parameter in ANY
{
    await using var command = new NpgsqlCommand("EXPLAIN SELECT * FROM numbers WHERE x = ANY($1)", conn);
    command.Parameters.Add(new() { Value = new[] { 1, 3 } });
    var plan = (string)(await command.ExecuteScalarAsync())!;
    Console.WriteLine(plan);
}

@kingnguyen93
Copy link
Author

kingnguyen93 commented May 31, 2024

@roji Here's my table.

CREATE TABLE "public"."crm_calllogs" (
  "stt_rec" "public"."ud_stt_rec" COLLATE "pg_catalog"."default" NOT NULL,
  "stt_rec_doitac" "public"."ud_stt_rec" COLLATE "pg_catalog"."default" NOT NULL,
  "call_id" "public"."ud_char50" COLLATE "pg_catalog"."default",
  "dien_giai" "public"."ud_memo" COLLATE "pg_catalog"."default",
  "ma_nvbh" "public"."ud_ma" COLLATE "pg_catalog"."default" NOT NULL,
  "ma_nhom" "public"."ud_ma" COLLATE "pg_catalog"."default" NOT NULL,
  "ma_dvcs" "public"."ud_ma" COLLATE "pg_catalog"."default",
  "date0" "public"."ud_ngay" NOT NULL,
  "time0" "public"."ud_time" NOT NULL,
  "user_id0" "public"."ud_smallint" NOT NULL,
  "status" "public"."ud_status" COLLATE "pg_catalog"."default" NOT NULL,
  "date2" "public"."ud_ngay",
  "time2" "public"."ud_time",
  "user_id2" "public"."ud_smallint",
  "ketnoi_yn" "public"."ud_smallint" NOT NULL
);

ALTER TABLE "public"."crm_calllogs" ADD CONSTRAINT "crm_calllogs_pkey" PRIMARY KEY ("stt_rec");

CREATE INDEX "crm_calllogs_date0_time0_idx" ON "public"."crm_calllogs" USING btree (
  "date0" "pg_catalog"."date_ops" DESC NULLS FIRST,
  "time0" "pg_catalog"."time_ops" DESC NULLS FIRST
);

CREATE INDEX "crm_calllogs_stt_rec_doitac_idx" ON "public"."crm_calllogs" USING btree (
  "stt_rec_doitac" COLLATE "pg_catalog"."default" "pg_catalog"."bpchar_ops" ASC NULLS LAST
);

image

image

I noiticed that something relate to index scan and data type when using ARRAY. ud_stt_rec is ''::bpchar

@kingnguyen93
Copy link
Author

@roji I think the problem is that SQL generated from Entity Framework is using ANY (ARRAY[]). And when it's executed at server, it auto translate to ANY ('{}') but with wrong data type. I'm using bpchar[] and the translated SQL using text[]. So we will lose index scan.

@roji
Copy link
Member

roji commented May 31, 2024

@kingnguyen93 if that's true, then you should be able to force bpchar[] manually in SQL and see the better performance: ANY(ARRAY['A1...']::bpchar[]). Can you please test that and report how that works? If that's the problem, then EF may be sending the wrong parameter time.

@kingnguyen93
Copy link
Author

@roji I already tested that. I think the root issue is at PG engine when convert ARRAY[] to native string array ‘{ }’. It added wrong cast type. I’m using bpchar[] but it using text[] that lead to lose index scan on query.
I don’t whether we should change to use string parameter at EF or wait for PG engine fix that.

@roji
Copy link
Member

roji commented May 31, 2024

I think the root issue is at PG engine when convert ARRAY[] to native string array ‘{ }’. It added wrong cast type.

But that's why I suggested adding ::bpchar[], to force PG to produce a bpchar[] rather than a text[]. At that point you have ANY() over a bpchar[] array type. Can you please post the exact SQL you tested here and the plan you see coming out?

I don’t whether we should change to use string parameter at EF or wait for PG engine fix that

Note that there's no string parameter anywhere here (at least for now). What I'm trying to figure out here - with your help - is what is the exact source of the plan difference. If this is a text[] vs. bpchar[] difference (which the above question is supposed to clarify), then it's fine to parameterize the array, and we just need to make sure the EF provider sends the correct type.

If it really is a question of ANY with constants inside vs. ANY with an array inside, then this would be the same as dotnet/efcore#32394 which I linked to above (I suspect that's the case).

@kingnguyen93
Copy link
Author

kingnguyen93 commented May 31, 2024

@roji Here's my test results:

  1. stt_rec_doitac = ANY (ARRAY['A1000000002201362CT1']) Execution Time: 323.798 ms
    image

  2. stt_rec_doitac = ANY ('{ "A1000000002201362CT1" }'::text[]) Execution Time: 336.404 ms
    image

  3. stt_rec_doitac = ANY ('{ "A1000000002201362CT1" }'::bpchar[]) Execution Time: 0.041 ms
    image

  4. stt_rec_doitac = ANY ('{ "A1000000002201362CT1" }') Execution Time: 0.048 ms
    image

@NinoFloris
Copy link
Member

Can you please run stt_rec_doitac = ANY (ARRAY['A1000000002201362CT1']::bpchar[])

@kingnguyen93
Copy link
Author

@NinoFloris Here. Execution Time: 0.044 ms

image

@roji
Copy link
Member

roji commented May 31, 2024

OK, so this is just a case of the array type (text[]) not matching the type of the column (bpchar), and so the index doesn't get used (in other words, this looks unrelated to dotnet/efcore#32394, which is about array parameterization for Contains in general.

We now need to understand why EFCore.PG sends an array with the wrong type. To know this, I need to see how you're configuring your EF model - can you start by confirming that you're actually configuring stt_rec_doitac as bpchar in e.g. OnModelCreating?

@kingnguyen93
Copy link
Author

kingnguyen93 commented May 31, 2024

@roji I don't config it. Because I develop on existed database. And it took time to cofig that. :)))))
But I don't understand why PG engine added cast type to optimized SQL. It should leave it like this stt_rec_doitac = ANY ('{ A1000000002201362CT1 }'), why did it add cast to stt_rec_doitac = ANY ('{ "A1000000002201362CT1" }'::text[]).
I think it's not related to EF. Cause EF only send ARRAY['A1000000002201362CT1'] to parameter. PG engine shoud detect target column or just leave it empty cast.

image

@NinoFloris
Copy link
Member

At the protocol level we have to send a type for a parameter, it's likely that we end up sending text[] there

@roji
Copy link
Member

roji commented May 31, 2024

I don't config it.

Then that's very likely the source of your problem. Even if you use EF against an existing database, if you don't tell EF what the column types are, it can't generate the right SQL and/or parameters there. As @NinoFloris wrote, when a parameter is sent to database, it's strongly-typed, and it's important for that type to be correct. Since EF isn't properly configured here, it simply assumes that the type is a PG text (that's the default mapping for .NET string), and so the array parameter type that gets inferred is text[]`.

Try configuring the property as bpchar with EF, it's very likely that the query will start working fast (if not I'd want to know).

@kingnguyen93
Copy link
Author

@roji I try to config data type. But I see that parameter sent to command is not strong type.

image

image

@roji
Copy link
Member

roji commented Jun 1, 2024

But I see that parameter sent to command is not strong type.

That's just EF's logging which is unaware of the Npgsql-specific typing; that's set on NpgsqlParameter.NpgsqlDbType, which EF's logging mechanism doesn't know. Try running the code and seeing how it performs.

To more reliably know what's getting sent, you can enable Npgsql's logging at the lower ADO.NET level, or use Wireshark to look at the packets.

@kingnguyen93
Copy link
Author

kingnguyen93 commented Jun 1, 2024

@roji I don't think so. I still got slow even if I config column type. I also try [Column(TypeName = "ud_stt_rec")]. And it's still the same.
I use Scaffold-DbContext and I got this. It don't even config type.

image

@kingnguyen93
Copy link
Author

kingnguyen93 commented Jun 1, 2024

@roji I found the root cause. It's not related to data type. I have to config HasMaxLength(). I think if we don't set it. It will use text as default for string.
Btw, I think most of our type is fixed type. So I think there's must is the way to config defaut.

@roji
Copy link
Member

roji commented Jun 1, 2024

It sounds like you're asking how to set the default max length for all string properties - see these docs.

Am going to close this issue as the problem has been found and fixed, but feel free to continue posting if you have further questions.

@roji roji closed this as not planned Won't fix, can't repro, duplicate, stale Jun 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants