The code currently copies private data sources row by row. Since we are seeing subgraphs with hundreds of thousands of private data sources, mostly file data sources, the current approach will scale poorly. For correctness, the copying currently has to happen in one transaction.
There are two possible approaches to speed up copying of private data sources: (1) do the copy in one SQL query, or (2) break copying into batches like we do for normal subgraph data tables.
Option (1) is a bit easier to implement, though it will at some point also be too slow and lead to a long-running transaction. Option (2) would completely avoid that, but is more difficult to implement.
The main difficulty is that when private data sources are copied, we need to translate the `manifest_idx` from the source to the destination, as the numbering of data sources can have changed between source and destination.
For either approach, we should build a map `source_idx -> destination_idx` in the database and use it for the translation rather than doing the translation in Rust code. That map could take a variety of forms: it could be as simple as an array we add to the query, or a full-blown temporary table. The right choice depends on how big the map will be, though my understanding is that it should be fairly small. With an array, the query to copy would look something like:
```sql
insert into dst.data_sources$(block_range, manifest_idx, param, context, causality_region, done_at)
select
  case
    when upper(sds.block_range) <= $target_block then sds.block_range
    else int4range(lower(sds.block_range), null)
  end as block_range,
  xlat.dst,
  sds.param, sds.context, sds.causality_region, sds.done_at
  from src.data_sources$ sds,
       unnest($manifest_idxs) with ordinality as xlat(dst, src)
 where sds.manifest_idx = xlat.src
```
The variable `$manifest_idxs` is an int array where, for data source number `i` in the source, `$manifest_idxs[i]` contains the number of that data source in the destination.
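As a sketch of how that array could be built, here is what the translation amounts to in plain Rust. The function name, the use of data source names as the join key between manifests, and the assumption that `manifest_idx` values are 0-based positions in the manifest are all mine, not taken from the actual code:

```rust
// Hypothetical sketch: build `$manifest_idxs` so that, for the data source
// at position i (0-based) in the source manifest, manifest_idxs[i] is its
// index in the destination manifest. Matching by name is an assumption
// about how source and destination manifests correspond.
fn build_manifest_idx_map(src_names: &[&str], dst_names: &[&str]) -> Vec<i32> {
    src_names
        .iter()
        .map(|name| {
            dst_names
                .iter()
                .position(|d| d == name)
                .map(|pos| pos as i32)
                .expect("data source missing in destination manifest")
        })
        .collect()
}

fn main() {
    // The destination graft added a data source and reordered the rest.
    let src = ["TokenDS", "FileDS"];
    let dst = ["FileDS", "NewDS", "TokenDS"];
    let map = build_manifest_idx_map(&src, &dst);
    assert_eq!(map, vec![2, 0]);
    println!("{:?}", map);
}
```

In the SQL above, `unnest($manifest_idxs) with ordinality as xlat(dst, src)` turns this array into a two-column mapping table, with `src` being the 1-based array position and `dst` the array element.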
To get to option (2), the above query would need to be paged by `vid`, in the same way as we do for 'normal' data tables. The existing `copy_table_state` table could be used to track progress there.
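A paged version of the copy could look roughly like the following. This is a sketch, not graph-node's actual implementation: the placeholders `$target_block`, `$manifest_idxs`, `$first_vid`, and `$batch_size` are assumed names, and each batch would run in its own transaction with the last copied `vid` recorded in `copy_table_state`:

```rust
// Hypothetical sketch of the batched query for option (2): the same copy
// query as above, restricted to one `vid` window per transaction. `src`
// and `dst` are the source and destination schema names.
fn batched_copy_query(src: &str, dst: &str) -> String {
    format!(
        "insert into {dst}.data_sources$(block_range, manifest_idx, param, context, causality_region, done_at)
         select case
                  when upper(sds.block_range) <= $target_block then sds.block_range
                  else int4range(lower(sds.block_range), null)
                end as block_range,
                xlat.dst,
                sds.param, sds.context, sds.causality_region, sds.done_at
           from {src}.data_sources$ sds,
                unnest($manifest_idxs) with ordinality as xlat(dst, src)
          where sds.manifest_idx = xlat.src
            and sds.vid >= $first_vid and sds.vid < $first_vid + $batch_size"
    )
}

fn main() {
    println!("{}", batched_copy_query("sgd1", "sgd2"));
}
```

After each batch the driver would commit, advance `$first_vid`, and repeat until no rows remain, which keeps every transaction short regardless of how many data sources the subgraph has.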