
[Bug] Copying of private data sources can become very slow #5473

Open
lutter opened this issue Jun 6, 2024 · 0 comments
Labels
bug Something isn't working

lutter commented Jun 6, 2024

Bug report

The code currently copies private data sources row by row. Since we are seeing subgraphs with hundreds of thousands of private data sources, mostly file data sources, the current approach will scale poorly. For correctness, the copy currently has to happen in a single transaction.

There are two possible approaches to speed up copying of private data sources: (1) do the copy in one SQL query, or (2) break copying into batches like we do for normal subgraph data tables.

Option (1) is a bit easier to implement, though it will at some point also be too slow and lead to a long-running transaction. Option (2) would completely avoid that, but is more difficult to implement.

The main difficulty is that when private data sources are copied, we need to translate the manifest_idx from the source to the destination, since the numbering of data sources may have changed between source and destination.

For either approach, we should build a map source_idx -> destination_idx in the database and use it for the translation, rather than doing the translation in Rust code. That map could take a variety of forms: it could be as simple as an array we pass to the query, or a full-blown temporary table. The right choice depends on how big the map will be, though my understanding is that it should be fairly small. With an array, the query to copy would look something like

```sql
insert into dst.data_sources$ (block_range, manifest_idx, param, context, causality_region, done_at)
select
    case
      when upper(sds.block_range) <= $target_block then sds.block_range
      else int4range(lower(sds.block_range), null)
    end as block_range,
    xlat.dst as manifest_idx,
    sds.param, sds.context, sds.causality_region, sds.done_at
  from src.data_sources$ sds,
       unnest($manifest_idxs) with ordinality as xlat(dst, src)
 where sds.manifest_idx = xlat.src
```

The variable $manifest_idxs is an int array where, for data source number i in the source, $manifest_idxs[i] contains the number of that data source in the destination. (Note that with ordinality numbers elements starting at 1, so the array positions are 1-based.)
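A sketch of how such a map could be built on the Rust side, assuming data sources can be matched by name between the source and destination manifests (the helper and its inputs are hypothetical, not graph-node's actual API):

```rust
use std::collections::HashMap;

/// Hypothetical sketch: build the `$manifest_idxs` array from the data
/// source names of the source and destination manifests. Entry `i` holds
/// the destination index of the data source with index `i` in the source.
fn build_manifest_idx_map(src_names: &[&str], dst_names: &[&str]) -> Vec<i32> {
    // Index destination data sources by name for O(1) lookup.
    let dst_by_name: HashMap<&str, i32> = dst_names
        .iter()
        .enumerate()
        .map(|(i, name)| (*name, i as i32))
        .collect();
    // For each source data source, look up its destination index.
    src_names.iter().map(|name| dst_by_name[name]).collect()
}
```

The resulting vector can then be bound as the `$manifest_idxs` array parameter of the copy query.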

To get to option (2), the above query would need to be paged by vid, in the same way as we do for 'normal' data tables. The existing copy_table_state table could be used to track progress.
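The batching in option (2) could be driven by a small cursor over vid ranges, with the cursor position persisted between batches so the copy can resume. A minimal sketch, with hypothetical names (`CopyState`, `next_range`) that are not graph-node's actual types:

```rust
/// Hypothetical cursor for copying rows in batches keyed on `vid`.
/// `next_vid` would be persisted (e.g. in copy_table_state) after each batch.
struct CopyState {
    next_vid: i64,
    batch_size: i64,
}

impl CopyState {
    /// Return the half-open vid range `[start, end)` for the next batch and
    /// advance the cursor, or `None` once `max_vid` has been copied.
    fn next_range(&mut self, max_vid: i64) -> Option<(i64, i64)> {
        if self.next_vid > max_vid {
            return None;
        }
        let start = self.next_vid;
        let end = (start + self.batch_size).min(max_vid + 1);
        self.next_vid = end;
        Some((start, end))
    }
}
```

Each returned range would be plugged into the copy query as an additional `where sds.vid >= $start and sds.vid < $end` predicate, with one transaction per batch.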

@lutter lutter added the bug Something isn't working label Jun 6, 2024
@lutter lutter changed the title [Bug] Copying of private data sources is very slow [Bug] Copying of private data sources can become very slow Jun 6, 2024