
[Bug] Copying of private data sources can become very slow #5473

Open
lutter opened this issue Jun 6, 2024 · 0 comments
Labels
bug Something isn't working

lutter commented Jun 6, 2024

Bug report

The code currently copies private data sources row by row. Since we are seeing subgraphs with hundreds of thousands of private data sources, mostly file data sources, the current approach will scale poorly. For correctness, the copy currently has to happen in a single transaction.

There are two possible approaches to speed up copying of private data sources: (1) do the copy in one SQL query, or (2) break copying into batches like we do for normal subgraph data tables.

Option (1) is a bit easier to implement, though it will at some point also be too slow and lead to a long-running transaction. Option (2) would completely avoid that, but is more difficult to implement.

The main difficulty is that when private data sources are copied, we need to translate the manifest_idx from the source to the destination, since the numbering of data sources may have changed between source and destination.

For either approach, we should build a map source_idx -> destination_idx in the database and use it for the translation, rather than doing the translation in Rust code. That map could take a variety of forms: it could be as simple as an array we pass to the query, or a full-blown temporary table. The right choice depends on how big the map will be, though my understanding is that it should be fairly small. With an array, the query to copy would look something like

```sql
insert into dst.data_sources$ (block_range, manifest_idx, param, context, causality_region, done_at)
select
    case
      when upper(sds.block_range) <= $target_block then sds.block_range
      else int4range(lower(sds.block_range), null)
    end as block_range,
    xlat.dst as manifest_idx,
    sds.param, sds.context, sds.causality_region, sds.done_at
  from src.data_sources$ sds,
       unnest($manifest_idxs) with ordinality as xlat(dst, src)
 where sds.manifest_idx = xlat.src
```

The variable $manifest_idxs is an int array where, for data source number i in the source, $manifest_idxs[i] contains the number of that data source in the destination. (Note that with ordinality numbers elements starting at 1, so the array positions are 1-based.)
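A sketch of how such a map could be built on the Rust side, assuming data sources can be matched by name between the source and destination manifests (the helper and its inputs are hypothetical, not graph-node's actual API):

```rust
use std::collections::HashMap;

/// Hypothetical sketch: build the `$manifest_idxs` array from the data
/// source names of the source and destination manifests. Entry `i` holds
/// the destination index of the data source with index `i` in the source.
fn build_manifest_idx_map(src_names: &[&str], dst_names: &[&str]) -> Vec<i32> {
    // Index destination data sources by name for O(1) lookup.
    let dst_by_name: HashMap<&str, i32> = dst_names
        .iter()
        .enumerate()
        .map(|(i, name)| (*name, i as i32))
        .collect();
    // For each source data source, look up its destination index.
    src_names.iter().map(|name| dst_by_name[name]).collect()
}
```

The resulting vector can then be bound as the `$manifest_idxs` array parameter of the copy query.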

To get to option (2), the above query would need to be paged by vid, in the same way as we do for 'normal' data tables. The existing copy_table_state table could be used to track progress.
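The batching in option (2) could be driven by a small cursor over vid ranges, with the cursor position persisted between batches so the copy can resume. A minimal sketch, with hypothetical names (`CopyState`, `next_range`) that are not graph-node's actual types:

```rust
/// Hypothetical cursor for copying rows in batches keyed on `vid`.
/// `next_vid` would be persisted (e.g. in copy_table_state) after each batch.
struct CopyState {
    next_vid: i64,
    batch_size: i64,
}

impl CopyState {
    /// Return the half-open vid range `[start, end)` for the next batch and
    /// advance the cursor, or `None` once `max_vid` has been copied.
    fn next_range(&mut self, max_vid: i64) -> Option<(i64, i64)> {
        if self.next_vid > max_vid {
            return None;
        }
        let start = self.next_vid;
        let end = (start + self.batch_size).min(max_vid + 1);
        self.next_vid = end;
        Some((start, end))
    }
}
```

Each returned range would be plugged into the copy query as an additional `where sds.vid >= $start and sds.vid < $end` predicate, with one transaction per batch.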

@lutter lutter added the bug Something isn't working label Jun 6, 2024
@lutter lutter changed the title [Bug] Copying of private data sources is very slow [Bug] Copying of private data sources can become very slow Jun 6, 2024