
Draft: Example nextjs site that imports repositories from GitHub to Splitgraph, then exports them to Seafowl, then renders a chart of stargazers #21

Open
wants to merge 36 commits into `generated-import-plugins` from `example-nextjs-import-airbyte-github-export-seafowl`

Conversation

milesrichardson
Contributor

No description provided.

@milesrichardson force-pushed the example-nextjs-import-airbyte-github-export-seafowl branch from 6369655 to 301cb59 on June 9, 2023 23:11
@milesrichardson force-pushed the example-nextjs-import-airbyte-github-export-seafowl branch from f171fa4 to 140f865 on June 23, 2023 03:21
Implement the import panel and stub out the export panel,
using a single Stepper component and a react context with
a reducer for managing the state. Implement the fetch requests
to start the import, and also to await the import.

Co-Authored by GPT-4 ;)
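
As a rough illustration of that structure (names, state shapes, and actions
here are hypothetical, not the actual implementation), the context-plus-reducer
wiring might look something like this:

```tsx
import { createContext, useContext, useReducer, type Dispatch, type ReactNode } from "react";

type StepperState = { stepperState: "import_form" | "awaiting_import" | "import_complete" };

type StepperAction = { type: "start_import" } | { type: "import_complete" };

const reducer = (state: StepperState, action: StepperAction): StepperState => {
  switch (action.type) {
    case "start_import":
      return { stepperState: "awaiting_import" };
    case "import_complete":
      return { stepperState: "import_complete" };
  }
};

const StepperContext = createContext<{
  state: StepperState;
  dispatch: Dispatch<StepperAction>;
} | null>(null);

export const StepperProvider = ({ children }: { children: ReactNode }) => {
  const [state, dispatch] = useReducer(reducer, { stepperState: "import_form" });
  return (
    <StepperContext.Provider value={{ state, dispatch }}>{children}</StepperContext.Provider>
  );
};

// Both the import panel and the export panel read and dispatch through this hook
export const useStepper = () => {
  const ctx = useContext(StepperContext);
  if (!ctx) throw new Error("useStepper must be used within a StepperProvider");
  return ctx;
};
```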
…ort-to-seafowl-task`

The `start-export-to-seafowl` route takes a list of source
tables from Splitgraph (list of `{namespace,repository,table}`),
and starts a task to export them to Seafowl. It returns a list
of objects `{taskId: string; tableName: string;}`, where each
item represents the currently exporting table (and `tableName` is
the source table name).
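
To make the shapes concrete, here is a sketch of the request and response
types implied by that description (only `namespace`, `repository`, `table`,
`taskId`, and `tableName` come from the text above; the rest is an assumption):

```ts
// Source table identifier on Splitgraph, as described above
interface SourceTable {
  namespace: string;
  repository: string;
  table: string;
}

// Assumed request body for the `start-export-to-seafowl` route
interface StartExportRequestBody {
  tables: SourceTable[];
}

// Response: one entry per currently exporting table, where `tableName`
// is the source table name
type StartExportResponse = Array<{
  taskId: string;
  tableName: string;
}>;
```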

The `await-export-to-seafowl-task` route takes a single `taskId`
parameter and returns its status, i.e. `{completed: boolean; ...otherInfo}`.

The ExportPanel first renders a "Start Export" button. Then, while
the export is running, it renders an `ExportTableLoadingBar` for
each table that is being exported. Each of these individual
components sends its own polling request with its `taskId`
to the `await-export-to-seafowl-task` endpoint, and upon
completion of each task, sends an action to the reducer, which
handles it by updating the set of loading tasks. When the set of
loading tasks is complete, it changes the `stepperState` to
`export_complete`. If any of the tasks has an error, then the
`stepperState` changes to `export_error` which should cause all
loading bars to unmount - i.e., any error will short-circuit all
of the table loading, even if some were to complete. At that point
the user can click "start export" again.
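
A minimal sketch of how the reducer might handle those actions, assuming
illustrative state and action shapes (not the actual code):

```ts
type ExportStepperState =
  | { stepperState: "awaiting_export"; loadingTaskIds: string[] }
  | { stepperState: "export_complete" }
  | { stepperState: "export_error"; error: string };

type ExportAction =
  | { type: "export_task_complete"; taskId: string }
  | { type: "export_error"; error: string };

const exportReducer = (
  state: ExportStepperState,
  action: ExportAction
): ExportStepperState => {
  switch (action.type) {
    case "export_task_complete": {
      if (state.stepperState !== "awaiting_export") {
        return state;
      }
      const remaining = state.loadingTaskIds.filter((id) => id !== action.taskId);
      // When the set of loading tasks is empty, move the stepper to export_complete
      return remaining.length === 0
        ? { stepperState: "export_complete" }
        : { stepperState: "awaiting_export", loadingTaskIds: remaining };
    }
    case "export_error":
      // Any error short-circuits all table loading; the loading bars unmount
      // because the state is no longer awaiting_export
      return { stepperState: "export_error", error: action.error };
  }
};
```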

This completes the logic necessary for import and export, and now
it's just a matter of styling the components, linking to the
Splitgraph Console, adding explanatory text, and finally rendering
a chart with the data. We'll also want to create a meta-repository
in Splitgraph for tracking which GitHub repos we've imported so far,
analogously to how we track Socrata metadata for each Socrata repo.
The `airbyte-github` plugin by default imports 163 tables into
Splitgraph, but we only need a few of them for the analytics queries
we want to make in the demo app. So, hardcode the list of those,
but also hardcode the list of all 163 tables for reference, as well as
the 43 tables that actually end up imported along with the relevant ones
(because they either depend on them via a foreign key relationship, or
they're an airbyte meta table).

For the 43 tables, see this recent import of `splitgraph/seafowl`:

* https://www.splitgraph.com/miles/github-import-splitgraph-seafowl/20230526-224723/-/tables

This took 3 minutes and 40 seconds to import into Splitgraph.
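
The hardcoding might look something like the sketch below; the table names
shown are illustrative guesses at airbyte-github stream names, not the actual
lists from the demo:

```ts
// Illustrative only: stand-ins for the hardcoded lists described above
export const relevantTables = ["stargazers", "issues", "pull_requests", "commits"];

// The full list of 163 default tables, and the 43 tables that actually get
// imported along with the relevant ones, are hardcoded the same way (elided here)
export const allDefaultTables: string[] = [/* ...163 table names... */];
export const importedWithRelevantTables: string[] = [/* ...43 table names... */];
```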
…umed across page loads

Keep track of the current stepper state (e.g. taskId, import completion, etc.)
in the URL. Update the URL when the state changes, and initialize the state
from the URL on page load. Note that we need to default to an "uninitialized"
state, and then update the state from the URL via an `initialize_from_url` action, because
the `useRouter` hook is asynchronous, and we don't look at query parameters on
the server side with `getInitialProps` or similar. Thus we show a loading bar
before showing the import form (or whatever we're rendering based on the current state).
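
A sketch of that initialization, assuming a hypothetical hook name and action
shape (the real code may differ):

```ts
import { useEffect } from "react";
import { useRouter } from "next/router";

export function useInitializeStepperFromUrl(
  dispatch: (action: { type: "initialize_from_url"; query: Record<string, unknown> }) => void
) {
  const router = useRouter();

  useEffect(() => {
    // `router.query` is populated asynchronously on the client, so stay in the
    // "uninitialized" state (rendering a loading bar) until the router is ready
    if (!router.isReady) {
      return;
    }
    dispatch({ type: "initialize_from_url", query: router.query });
  }, [router.isReady, router.query, dispatch]);
}
```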

This makes development easier, since after a long import we can refresh the page
with the URL containing the task ID and start from there, rather than re-importing
every time. It also helps users, who can refresh the page without losing
progress once an import has started (the page will just poll the taskId from
the URL).
Export queries to tables `monthly_user_stats` and `monthly_issue_stats`
in the same schema/namespace as the imported tables. We also export the
tables themselves, or at least the few that we explicitly asked to import.
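
As a purely hypothetical illustration of the mapping (the actual analytics SQL
is not reproduced here), each query is paired with a destination table in the
same namespace as the imported data:

```ts
// Hypothetical shape; the real analytics queries are not shown here
const exportQueries = [
  {
    sourceQuery: "/* per-user monthly aggregates over the imported GitHub tables */",
    destinationTable: "monthly_user_stats",
  },
  {
    sourceQuery: "/* per-issue monthly aggregates over the imported GitHub tables */",
    destinationTable: "monthly_issue_stats",
  },
];
```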
After an import/export has completed, insert a row into the meta table,
which we will also use to fetch the previously imported repositories
from the client side when rendering the sidebar. We don't have transactional
guarantees on the DDN, so we can't do `INSERT ON CONFLICT`; instead we
avoid duplicate rows by first selecting the existing row, and returning `204`
if it's already been inserted into the `completed_repositories` table.
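
A sketch of that select-then-insert flow (the helper, column names, and exact
SQL are assumptions; only the `completed_repositories` table name and the
`204` behavior come from the description above):

```ts
// Assumed helper that runs a SQL statement against the DDN
type ExecuteSql = (query: string) => Promise<{ rows: unknown[] }>;

export async function markRepositoryCompleted(
  executeSql: ExecuteSql,
  githubNamespace: string,
  githubRepository: string
): Promise<{ status: 200 | 204 }> {
  // No transactional guarantees on the DDN, so no INSERT ... ON CONFLICT:
  // check for an existing row first, then insert only if it's missing.
  // (Parameterization/escaping elided for brevity.)
  const existing = await executeSql(
    `SELECT 1 FROM completed_repositories
       WHERE github_namespace = '${githubNamespace}'
         AND github_repository = '${githubRepository}';`
  );

  if (existing.rows.length > 0) {
    return { status: 204 }; // already recorded; nothing to insert
  }

  await executeSql(
    `INSERT INTO completed_repositories (github_namespace, github_repository)
       VALUES ('${githubNamespace}', '${githubRepository}');`
  );
  return { status: 200 };
}
```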

However, I did notice that when I inserted the same row twice, it only
showed up once when I queried it in the Console. I don't know if
this was due to a race condition, a bug, or because it's using the entire
row as a compound primary key and for some reason requiring that it be unique.
…he charts

The sidebar queries the DDN from the client side with `useSql` from `@madatdata/react`,
using the default anonymous (thus read-only) credential to query the "metadata table"
that contains the list of repositories with a successful import. It links
to a page for each one, which is currently a stub but where we will show the chart(s)
with Observable Plot.
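
The sidebar query might look roughly like this sketch; the table and column
names are assumptions, and the exact return shape of `useSql` may differ:

```tsx
import { useSql } from "@madatdata/react";

const ImportedRepositoriesList = () => {
  const { loading, error, response } = useSql<{
    github_namespace: string;
    github_repository: string;
  }>(`SELECT github_namespace, github_repository FROM completed_repositories;`);

  if (loading) return <div>Loading…</div>;
  if (error) return <div>Error loading imported repositories</div>;

  return (
    <ul>
      {response?.rows?.map((row) => (
        <li key={`${row.github_namespace}/${row.github_repository}`}>
          {/* Links to the (currently stubbed) per-repository chart page */}
          <a href={`/${row.github_namespace}/${row.github_repository}`}>
            {row.github_namespace}/{row.github_repository}
          </a>
        </li>
      ))}
    </ul>
  );
};
```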
…istles, etc.

Render each export(able|ed) query/table in a Splitgraph embed, with
a tabbed container for switching to a Seafowl embed when it's ready,
i.e. show each table/query individually, inline with its loading state.
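
Roughly, the tabbed container could be structured like this (component and
prop names are hypothetical):

```tsx
import { useState } from "react";

const EmbeddedQueryTabs = ({
  splitgraphEmbedUrl,
  seafowlEmbedUrl,
  exportComplete,
}: {
  splitgraphEmbedUrl: string;
  seafowlEmbedUrl: string;
  exportComplete: boolean;
}) => {
  const [activeTab, setActiveTab] = useState<"splitgraph" | "seafowl">("splitgraph");

  return (
    <div>
      <button onClick={() => setActiveTab("splitgraph")}>Splitgraph</button>
      {/* The Seafowl tab is only available once this table/query has finished exporting */}
      <button disabled={!exportComplete} onClick={() => setActiveTab("seafowl")}>
        Seafowl
      </button>
      <iframe src={activeTab === "splitgraph" ? splitgraphEmbedUrl : seafowlEmbedUrl} />
    </div>
  );
};
```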
…tead of each component

Previously, the API returned a unique taskId for each table being exported,
but a recent change optimized it to return one taskId for the whole set of
tables being exported (while still returning one taskId per exported query).
Previously, this demo code also rendered a loading component for each table,
and each component had its own hook for polling that table's taskId. Now that
multiple tables can share a taskId, it doesn't make sense for each component
to poll for its own taskId.

Now, we track the set of taskIds separately from the set of completed tables, and
we only poll for unique taskIds, which we do in a hook instead of in each component. And
each table preview checks the set of completed tables to know whether it's been completed.
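
A sketch of polling only the unique taskIds from one place, assuming
hypothetical state fields, action shapes, and endpoint payload:

```ts
import { useEffect } from "react";

type ExportPollAction =
  | { type: "export_task_complete"; taskId: string }
  | { type: "export_error"; taskId: string };

export function usePollUniqueExportTasks(
  loadingTaskIds: Set<string>, // deduplicated: multiple tables may share a taskId
  dispatch: (action: ExportPollAction) => void
) {
  useEffect(() => {
    const interval = setInterval(async () => {
      // One request per unique taskId, regardless of how many tables share it
      for (const taskId of loadingTaskIds) {
        const res = await fetch("/api/await-export-to-seafowl-task", {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({ taskId }),
        });
        const { completed, error } = await res.json();
        if (error) {
          dispatch({ type: "export_error", taskId });
        } else if (completed) {
          dispatch({ type: "export_task_complete", taskId });
        }
      }
    }, 3000);
    return () => clearInterval(interval);
  }, [loadingTaskIds, dispatch]);
}
```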
For each query to export, optionally provide a fallback `CREATE TABLE`
query which will run if the export job for the query fails. Implement
this by calling an API route `/api/create-fallback-table-after-failed-export`
after an export for a query fails for any reason.

This works around the bug where queries with an empty result fail
to export to Seafowl, see: splitgraph/seafowl#423
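
A sketch of how the client might invoke that fallback route after a failed
query export (the payload fields are assumptions):

```ts
export async function createFallbackTableAfterFailedExport(opts: {
  destinationSchema: string;
  destinationTable: string;
  fallbackCreateTableQuery?: string; // optional CREATE TABLE to run if the export failed
}) {
  if (!opts.fallbackCreateTableQuery) {
    return; // no fallback provided for this query
  }

  // Works around empty-result queries failing to export to Seafowl
  // (see splitgraph/seafowl#423)
  await fetch("/api/create-fallback-table-after-failed-export", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(opts),
  });
}
```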
…tinationTable

The point of exporting a query from Splitgraph to Seafowl is that once
the result is in Seafowl, we can just select from the destinationTable
and forget about the original query (which might not even be compatible
with Seafowl). So make sure that when we're embedding an exported query,
we only render the query in the embedded Splitgraph query editor, and
for the embedded Seafowl Console, we render a query that simply selects
from the destinationTable.
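
In other words, something along these lines (the helper name and arguments are
hypothetical):

```ts
function queriesForEmbeds(opts: {
  sourceQuery: string;       // original Splitgraph query; may not be Seafowl-compatible
  destinationSchema: string; // where the exported result lives in Seafowl
  destinationTable: string;
}) {
  return {
    // The embedded Splitgraph query editor shows the original query
    splitgraphQuery: opts.sourceQuery,
    // The embedded Seafowl Console only needs to read the materialized result
    seafowlQuery: `SELECT * FROM "${opts.destinationSchema}"."${opts.destinationTable}";`,
  };
}
```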
@milesrichardson force-pushed the example-nextjs-import-airbyte-github-export-seafowl branch from 43a9de7 to afa4c07 on June 29, 2023 02:27
milesrichardson added a commit to splitgraph/demo-github-analytics that referenced this pull request Aug 2, 2023
Migrate pull request from: splitgraph/madatdata#21

into its own repo, using `git-filter-repo` to include only commits from
subdirectory `examples/nextjs-import-airbyte-github-export-seafowl/`

ref: https://github.com/newren/git-filter-repo

This commit adds the Yarn files necessary for running the example in
an isolated repo (as opposed to one of multiple examples in a shared
multi-workspace `examples` directory), points the dependencies to `canary` versions
(which reflect versions in splitgraph/madatdata#20),
and also updates the readme with information for running in local development.
@milesrichardson
Contributor Author

This PR has been filtered into its own repo: https://github.com/splitgraph/demo-github-analytics

The demo is deployed to: https://demo-github-analytics.vercel.app/

(It's still a bit fragile - don't try to import a big repository with lots of issues/commits, since it will trigger a multi-hour ingestion job...)
