Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: improve performance of functions used when bulk loading dataset #557

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ajacombs
Copy link

Improves performance of the buildings_bulk_load.bulk_load_outlines_insert_supplied function, by optimising three other functions in buildings_reference which this calls: suburb_locality_intersect_polygon, town_city_intersect_polygon, and territorial_authority_grid_intersect_polygon.

These three functions perform equivalent tasks: finding the suburb, town, and TA which each building outline overlaps the most. It is possible for an outline to overlap none, one, or multiple of each of these. If an outline overlaps more than one, we want to choose the one with the most overlap. This requires calculating the area of the intersection, which takes a non-trivial amount of processing time. Most outlines however will only intersect with a single suburb, town, or TA.

By first finding which and how many areas an outline intersects, we can then avoid having to take the time to calculate the area of the intersection in the case where there is only a single candidate. This avoids significant amounts of computation.

A test bulk load into a database running in docker saw the time taken to execute buildings_bulk_load.bulk_load_outlines_insert_supplied when loading the Ecopia Hawkes Bay dataset with 134,209 outlines drop from 1h 44m 23s to 17m 59s.

Notes on Sqitch

I am unsure if the way I have added the changes through Sqitch is correct, and have pushed this branch up so others can check what I have done, though I am not confident it is fully correct.

This pull request should be seen more as a "here is what I did, is this the right way to do it?" than fully finished changes ready to be merged.

Steps I followed using Sqitch were:

  • Stashed the changes I had made in git so the db/sql directory had no pending changes.
  • Ran sqitch rework <file-identifier> -n '<Note text>'.
  • Popped the stash with edits to the deploy files of suburb_locality.sql, territorial_authority.sql and town_city.sql.
  • Did not change what Sqitch had generated in revert.
  • Edited the revert files of suburb_locality.sql, territorial_authority.sql and town_city.sql to add a block to check for text added in the new version of each function.

Things I'm not sure about with Sqitch:

  • In deploy: the autogenerated @4.0.0-dev1.sql files for town_city and territorial_authority seem to have the entirity of what the non-suffixed file was before these changes, but suburb_locality only has the previous version of the one function that was changed. That doesn't seem quite right, given the changes were equivalent across the three files.
  • In revert: I think I just generally don't understand what's going on here: in what order each file would be applied, what changes ought to be made in each version, and whether what is here is correct. Hence, I just left what Sqitch auto-generated, but I'm unsure whether this is valid.
  • In verify: I'm unsure why suburb_locality only had a single SELECT has_function_privilege generated, when the others had one generated for each function. I didn't add the others, but it seems inconsistent.

@ajacombs ajacombs self-assigned this Oct 16, 2022
@ghost ghost added pr: no changelog pr-esto adds when CHANGELOG not updated. pr: review requested pr-esto adds when a review is requested. pr: not closing issue pr-esto adds when no linked issue to close. labels Oct 16, 2022
@dwsilk
Copy link
Member

dwsilk commented Oct 16, 2022

This pull request should be seen more as a "here is what I did, is this the right way to do it?" than fully finished changes ready to be merged.

You can mark a PR as draft and then request a review, which is I guess the built-in GitHub functionality to indicate this.

@ajacombs ajacombs marked this pull request as draft October 16, 2022 23:33
@ajacombs
Copy link
Author

This pull request should be seen more as a "here is what I did, is this the right way to do it?" than fully finished changes ready to be merged.

You can mark a PR as draft and then request a review, which is I guess the built-in GitHub functionality to indicate this.

Aha, didn't know you could do that, thank you. Changed to draft now.

@Douglas-K
Copy link
Contributor

This is looking great overall, as far as I can see. For the Sqitch things you aren't sure about:

  • deploy suburb_locality looks fine. Master only has one function in it. So it is correct that this branch has one function for both @4.0.0-dev1.sql and your updated version.
  • revert looks fine. The oldest version has drop functions. Any subsequent versions match the corresponding deploy. I think you just need to change the first word in the header from "deploy" to "revert" (not that it has always been done in the past!)
  • verify. As above, suburb_locality only has one function in it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr: no changelog pr-esto adds when CHANGELOG not updated. pr: not closing issue pr-esto adds when no linked issue to close. pr: review requested pr-esto adds when a review is requested.
Development

Successfully merging this pull request may close these issues.

3 participants