This repository has been archived by the owner on Feb 8, 2024. It is now read-only.
forked from GoogleCloudPlatform/DataflowTemplates
Commit f0df6b1 (1 parent: c43aab8)

This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.

Change readme to just contain needed build instructions

Showing 1 changed file with 21 additions and 186 deletions.
# Google Cloud Dataflow Template Pipelines

These Dataflow templates are an effort to solve simple, but large, in-Cloud data
tasks, including data import/export/backup/restore and bulk API operations,
without a development environment. The technology under the hood which makes
these operations possible is the
[Google Cloud Dataflow](https://cloud.google.com/dataflow/) service combined
with a set of [Apache Beam](https://beam.apache.org/) SDK templated pipelines.

This fork exists to "fix" a particular state of Google's original repo, since Google does not maintain any tags or versions there.
The only part changed so far is this README file.

Google is providing this collection of pre-implemented Dataflow templates as a
reference and to provide easy customization for developers wanting to extend
their functionality.
This fork just holds instructions on how to build Google's templates in a way that Google itself does not provide so far.

[![Open in Cloud Shell](http://gstatic.com/cloudssh/images/open-btn.svg)](https://console.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https%3A%2F%2Fgithub.com%2FGoogleCloudPlatform%2FDataflowTemplates.git)

## Note on Default Branch

As of November 18, 2021, our default branch is now named "main". This does not
affect forks. If you would like your fork and its local clone to reflect these
changes you can follow [GitHub's branch renaming guide](https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/managing-branches-in-your-repository/renaming-a-branch).
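
For a local clone that still uses `master`, the renaming guide boils down to roughly the following steps (a short sketch of the standard git commands; see the guide itself for details):

```sh
# Rename the local branch and point it at the renamed upstream branch.
git branch -m master main
git fetch origin
git branch -u origin/main main
git remote set-head origin -a
```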

## Template Pipelines

* [BigQuery to Bigtable](v2/bigquery-to-bigtable/src/main/java/com/google/cloud/teleport/v2/templates/BigQueryToBigtable.java)
* [BigQuery to Datastore](src/main/java/com/google/cloud/teleport/templates/BigQueryToDatastore.java)
* [BigQuery to TFRecords](src/main/java/com/google/cloud/teleport/templates/BigQueryToTFRecord.java)
* [Bigtable to GCS Avro](src/main/java/com/google/cloud/teleport/bigtable/BigtableToAvro.java)
* [Bulk Compressor](src/main/java/com/google/cloud/teleport/templates/BulkCompressor.java)
* [Bulk Decompressor](src/main/java/com/google/cloud/teleport/templates/BulkDecompressor.java)
* [Datastore Bulk Delete](src/main/java/com/google/cloud/teleport/templates/DatastoreToDatastoreDelete.java) *
* [Datastore to BigQuery](src/main/java/com/google/cloud/teleport/templates/DatastoreToBigQuery.java)
* [Datastore to GCS Text](src/main/java/com/google/cloud/teleport/templates/DatastoreToText.java) *
* [Datastore to Pub/Sub](src/main/java/com/google/cloud/teleport/templates/DatastoreToPubsub.java) *
* [Datastore Unique Schema Count](src/main/java/com/google/cloud/teleport/templates/DatastoreSchemasCountToText.java)
* [DLP Text to BigQuery (Streaming)](src/main/java/com/google/cloud/teleport/templates/DLPTextToBigQueryStreaming.java)
* [GCS Avro to Bigtable](src/main/java/com/google/cloud/teleport/bigtable/AvroToBigtable.java)
* [GCS Avro to Spanner](src/main/java/com/google/cloud/teleport/spanner/ImportPipeline.java)
* [GCS Text to Spanner](src/main/java/com/google/cloud/teleport/spanner/TextImportPipeline.java)
* [GCS Text to BigQuery](src/main/java/com/google/cloud/teleport/templates/TextIOToBigQuery.java) *
* [GCS Text to Datastore](src/main/java/com/google/cloud/teleport/templates/TextToDatastore.java)
* [GCS Text to Pub/Sub (Batch)](src/main/java/com/google/cloud/teleport/templates/TextToPubsub.java)
* [GCS Text to Pub/Sub (Streaming)](src/main/java/com/google/cloud/teleport/templates/TextToPubsubStream.java)
* [Jdbc to BigQuery](src/main/java/com/google/cloud/teleport/templates/JdbcToBigQuery.java)
* [Pub/Sub to BigQuery](src/main/java/com/google/cloud/teleport/templates/PubSubToBigQuery.java) *
* [Pub/Sub to Datastore](src/main/java/com/google/cloud/teleport/templates/PubsubToDatastore.java) *
* [Pub/Sub to GCS Avro](src/main/java/com/google/cloud/teleport/templates/PubsubToAvro.java)
* [Pub/Sub to GCS Text](src/main/java/com/google/cloud/teleport/templates/PubsubToText.java)
* [Pub/Sub to Pub/Sub](src/main/java/com/google/cloud/teleport/templates/PubsubToPubsub.java)
* [Pub/Sub to Splunk](src/main/java/com/google/cloud/teleport/templates/PubSubToSplunk.java) *
* [Spanner to GCS Avro](src/main/java/com/google/cloud/teleport/spanner/ExportPipeline.java)
* [Spanner to GCS Text](src/main/java/com/google/cloud/teleport/templates/SpannerToText.java)
* [Word Count](src/main/java/com/google/cloud/teleport/templates/WordCount.java)

\* Supports user-defined functions (UDFs).

For documentation on each template's usage and parameters, please see
the official [docs](https://cloud.google.com/dataflow/docs/templates/provided-templates).

## Getting Started

### Prerequisites

On your local machine you need to have:

* Java 8. The build fails with some newer Java versions, but with Java 8 it works just fine. Make sure that `JAVA_HOME` points to this version.
* Maven 3
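
Before building, it can help to verify the toolchain first (a minimal sketch; the `JAVA_HOME` path below is only an example and depends on where Java 8 is installed on your machine):

```sh
# Check that the expected Java and Maven versions are picked up.
java -version    # should report a 1.8.x runtime
mvn -version     # should report Maven 3.x and the JDK it runs with
# Example only - adjust to your local Java 8 installation path:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```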

### Building the Project

Build the entire project using the Maven compile command:

```sh
mvn clean compile
```

I propose to use roughly the same version convention in the bucket with built jobs as Google does: when you make a new build, place it in `gs://dataflow-staging-europe-west1-105942741667/templates/google/YYYY-MM-DD/Job_Name`.

## PubsubToAvro in subscription mode

PubsubToAvro can be built in two ways:
* with an input topic, where the job creates its subscription automatically every time it is started - this is the variant Google builds, but it is not the one we want;
* with an input subscription - this is the variant we need. Since Google for some reason does not build it in its public bucket, the instructions below describe how to do that.

To build a new version and upload it to GCS, simply run the `mvn compile exec:java` command shown under "Creating a Template File" below (put the actual date inside the template location!).
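
As an illustration of the dated-folder convention proposed above (a sketch; the date shown is only a placeholder, not an existing build):

```sh
# List builds staged under the dated-folder convention.
gsutil ls gs://dataflow-staging-europe-west1-105942741667/templates/google/
# A concrete build would then live at, for example:
# gs://dataflow-staging-europe-west1-105942741667/templates/google/2023-01-15/PubsubToAvro
```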

### Building/Testing from IntelliJ

IntelliJ, by default, will often skip necessary Maven goals, leading to
build failures. You can fix these in the Maven view by going to
**Module_Name > Plugins > Plugin_Name** where Module_Name and Plugin_Name are
the names of the respective module and plugin with the rule. From there,
right-click the rule and select "Execute Before Build".

The list of known rules that require this are:

* common > Plugins > protobuf > protobuf:compile
* common > Plugins > protobuf > protobuf:test-compile
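
If IntelliJ keeps tripping on these, one possible workaround (an assumption, not part of the original instructions) is to generate the protobuf sources once from the command line, so the generated classes already exist when IntelliJ builds:

```sh
# Run the source-generation phases for the common module (and its dependencies)
# from the directory that contains that module's parent pom.
# "--projects common" assumes the Maven module id is "common"; adjust if needed.
mvn generate-sources generate-test-sources --projects common --also-make
```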

### Formatting Code

From either the root directory or v2/ directory, run:

```sh
mvn spotless:apply
```

This will format the code and add a license header. To verify that the code is
formatted correctly, run:

```sh
mvn spotless:check
```

The directory to run the commands from is based on whether the changes are under
v2/ or not.

### Creating a Template File

Dataflow templates can be [created](https://cloud.google.com/dataflow/docs/templates/creating-templates#creating-and-staging-templates)
using a maven command which builds the project and stages the template
file on Google Cloud Storage. Any parameters passed at template build
time will not be able to be overwritten at execution time.

```sh
mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.teleport.templates.<template-class> \
-Dexec.cleanupDaemonThreads=false \
-Dexec.args=" \
--project=<project-id> \
--stagingLocation=gs://<bucket-name>/staging \
--tempLocation=gs://<bucket-name>/temp \
--templateLocation=gs://<bucket-name>/templates/<template-name>.json \
--runner=DataflowRunner"
```

For this fork's PubsubToAvro template in subscription mode, the concrete build-and-stage command is:

```sh
mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.teleport.templates.PubsubToAvro \
-Dexec.cleanupDaemonThreads=false \
-Dexec.args=" \
--project=epidemic-data-infra \
--stagingLocation=gs://dataflow-staging-europe-west1-105942741667/templates/google/staging \
--tempLocation=gs://dataflow-staging-europe-west1-105942741667/templates/google/temp \
--templateLocation=gs://dataflow-staging-europe-west1-105942741667/templates/google/[DATE in YYYY-MM-DD]/PubsubToAvro \
--runner=DataflowRunner \
--region=europe-west1 \
--useSubscription=true"
```

### Executing a Template File

Once the template is staged on Google Cloud Storage, it can then be
executed using the
[gcloud CLI](https://cloud.google.com/sdk/gcloud/reference/dataflow/jobs/run)
tool. The runtime parameters required by the template can be passed in the
parameters field via a comma-separated list of `paramName=Value`.

```sh
gcloud dataflow jobs run <job-name> \
--gcs-location=<template-location> \
--zone=<zone> \
--parameters <parameters>
```
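
For example, running the staged subscription-mode PubsubToAvro template could look roughly like this (a sketch: the job name, subscription, and output locations are placeholders, and the parameter names follow the public Pub/Sub-to-Avro template docs, so verify them against your build):

```sh
# Hypothetical run of the PubsubToAvro template staged by the command above.
gcloud dataflow jobs run pubsub-to-avro-example \
  --project=epidemic-data-infra \
  --region=europe-west1 \
  --gcs-location=gs://dataflow-staging-europe-west1-105942741667/templates/google/[DATE in YYYY-MM-DD]/PubsubToAvro \
  --parameters inputSubscription=projects/<project-id>/subscriptions/<subscription>,outputDirectory=gs://<bucket-name>/avro/,avroTempDirectory=gs://<bucket-name>/avro-temp/
```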

## Using UDFs

User-defined functions (UDFs) allow you to customize a template's
functionality by providing a short JavaScript function without having to
maintain the entire codebase. This is useful in situations in which you'd
like to rename fields, filter values, or even transform data formats
before output to the destination. All UDFs are executed by providing the
payload of the element as a string to the JavaScript function. You can
then use JavaScript's built-in JSON parser or other system functions to
transform the data prior to the pipeline's output. The return statement
of a UDF specifies the payload to pass forward in the pipeline. This
should always return a string value. If no value is returned or the
function returns undefined, the incoming record will be filtered from
the output.

### UDF Function Specification
| Template | UDF Input Type | Input Description | UDF Output Type | Output Description |
|-----------------------|----------------|-------------------------------------------------|-----------------|--------------------|
| Datastore Bulk Delete | String | A JSON string of the entity | String | A JSON string of the entity to delete; filter entities by returning undefined |
| Datastore to Pub/Sub | String | A JSON string of the entity | String | The payload to publish to Pub/Sub |
| Datastore to GCS Text | String | A JSON string of the entity | String | A single line within the output file |
| GCS Text to BigQuery | String | A single line within the input file | String | A JSON string which matches the destination table's schema |
| Pub/Sub to BigQuery | String | A string representation of the incoming payload | String | A JSON string which matches the destination table's schema |
| Pub/Sub to Datastore | String | A string representation of the incoming payload | String | A JSON string of the entity to write to Datastore |
| Pub/Sub to Splunk | String | A string representation of the incoming payload | String | The event data to be sent to the Splunk HEC events endpoint. Must be a string or a stringified JSON object |
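
When running a UDF-enabled template, the function is wired in through two runtime parameters that point at the script in GCS and name the function. A hedged example for the Pub/Sub to BigQuery template (all values are placeholders; check the template's documentation for its full parameter list):

```sh
# Attach a UDF stored in GCS to a UDF-enabled template at run time.
gcloud dataflow jobs run pubsub-to-bq-with-udf \
  --gcs-location=gs://dataflow-templates/latest/PubSub_to_BigQuery \
  --region=<region> \
  --parameters inputTopic=projects/<project-id>/topics/<topic>,outputTableSpec=<project-id>:<dataset>.<table>,javascriptTextTransformGcsPath=gs://<bucket-name>/udf/transform.js,javascriptTextTransformFunctionName=transform
```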

### UDF Examples

#### Adding fields

```js
/**
 * A transform which adds a field to the incoming data.
 * @param {string} inJson
 * @return {string} outJson
 */
function transform(inJson) {
  var obj = JSON.parse(inJson);
  obj.dataFeed = "Real-time Transactions";
  obj.dataSource = "POS";
  return JSON.stringify(obj);
}
```

#### Filtering records

```js
/**
 * A transform function which only accepts 42 as the answer to life.
 * @param {string} inJson
 * @return {string} outJson
 */
function transform(inJson) {
  var obj = JSON.parse(inJson);
  // only output objects which have an answer to life of 42.
  if (obj.hasOwnProperty('answerToLife') && obj.answerToLife === 42) {
    return JSON.stringify(obj);
  }
}
```

And then just use the dated folder from above.