
Connection to an external Spark cluster #1

Open
wetneb opened this issue Aug 4, 2021 · 7 comments
Labels: enhancement (New feature or request)

Comments

@wetneb (Member) commented Aug 4, 2021

When running OpenRefine with the Spark runner, we always spin up a local Spark instance. We should instead make it possible to configure OpenRefine to connect to an existing Spark cluster.

Proposed solution

Introduce the relevant configuration variables for the Spark runner to make this possible.
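
For illustration only (this is not OpenRefine code, and the master URL is a placeholder), the distinction boils down to which master the Spark context is created against:

```java
import org.apache.spark.SparkConf;

public class SparkMasterSketch {
    public static void main(String[] args) {
        // Current behaviour, roughly: an embedded local Spark instance.
        SparkConf local = new SparkConf()
                .setAppName("OpenRefine")
                .setMaster("local[*]");

        // Proposed behaviour: take the master URI from a configuration
        // variable so it can point at an existing cluster instead
        // (the URL below is a placeholder).
        SparkConf external = new SparkConf()
                .setAppName("OpenRefine")
                .setMaster("spark://spark-master.example.org:7077");
    }
}
```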

@wetneb added the enhancement (New feature or request) label on Aug 4, 2021
@wetneb (Member, Author) commented Dec 23, 2021

This is now configurable. As things stand it is not easy to use, though: we also need to publish a .jar file containing all of OpenRefine's application code (and its dependencies) so that it can be loaded in the remote Spark cluster.
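
As a rough sketch of why the jar matters (not the actual runner code; the jar path and master URL are placeholders): when the driver connects to a remote cluster, the executors need OpenRefine's classes on their classpath, which is typically achieved by registering a self-contained jar on the SparkConf.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class RemoteClusterSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("OpenRefine")
                .setMaster("spark://spark-master.example.org:7077")
                // The executors need OpenRefine's application code, hence the
                // need for a published uber-jar bundling all dependencies.
                .setJars(new String[] {
                        "/path/to/openrefine-spark-runner-with-dependencies.jar"
                });
        JavaSparkContext context = new JavaSparkContext(conf);
        context.close();
    }
}
```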

@fcomte commented Jan 3, 2022

I saw that we can start a Spark driver connected to a Mesos engine.
Is it possible for OpenRefine to start a Spark driver connected to a Kubernetes cluster instead of Mesos?

I am very interested in connecting OpenRefine and Spark in a cloud-native way.
I distribute these tools on a Kubernetes platform; one instance of this project is here: onyxia. Users can start OpenRefine from it.

@wetneb (Member, Author) commented Jan 3, 2022

Hi @fcomte,

I have not looked into this, but it looks like it should be feasible! Just to check, is your goal to run the OpenRefine web server in Kubernetes itself, or just to have OpenRefine connect to a Spark cluster running in Kubernetes?

For the latter, it seems doable, although there might be some minor things to tweak on OpenRefine's side. The existing parameter refine.runner.sparkMasterURI can already be used to supply a k8s:// address (as documented here: https://spark.apache.org/docs/latest/running-on-kubernetes.html#cluster-mode), and we could add further configuration options to pass other settings to Spark (such as spark.executor.instances and spark.kubernetes.container.image).
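
For concreteness, here is a sketch of the Spark-side settings involved (only the property names are standard Spark configuration keys; the API server URL, executor count and image name are placeholders, and how OpenRefine would expose them is still to be decided):

```java
import org.apache.spark.SparkConf;

public class KubernetesMasterSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("OpenRefine")
                // What refine.runner.sparkMasterURI would point at: a k8s:// address.
                .setMaster("k8s://https://kubernetes.example.org:6443")
                // Additional settings we would need to let users pass through.
                .set("spark.executor.instances", "4")
                .set("spark.kubernetes.container.image",
                     "example-registry/openrefine-spark:latest");
    }
}
```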

That being said, running OpenRefine with Spark is likely to be most useful when deploying OpenRefine workflows in a production environment, rather than when using it interactively through OpenRefine's UI. The latter is possible of course, but it is likely to be less responsive than with the default runner (especially when connecting to a remote cluster).

I am really keen to understand your needs better in any case.

@fcomte commented Jan 3, 2022

I am running OpenRefine inside a container (Kubernetes), and it works.
But this new feature that enables OpenRefine to run jobs on a Spark cluster has great potential for my users.

Of course, in my context the Spark cluster should use a Kubernetes master.
It would be perfect if I could start a runner that enables Spark cluster mode (with k8s).

I have no specific use case: I manage a data platform, I think empowering users with this feature would be great, and I am very interested in testing it myself.

How would the data be handled in that case?
Would it be possible to keep the data in an external S3 bucket?

@wetneb (Member, Author) commented Jan 5, 2022

At the moment, it is possible to import a file from an external source like S3, but this creates a copy in the workspace (which is stored locally). That is not so useful as things stand, especially in the context of working with an external Spark cluster, because it means that data will be streamed from OpenRefine to the worker nodes and back at every operation.
In the future I would like to make it possible to disable this initial copying and let people work off the original copy directly. If that copy is readily accessible to the workers, I would expect much better performance.
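
To illustrate the difference (a sketch only: the bucket, path and master URL are placeholders, and reading s3a:// URIs assumes the Hadoop S3A connector and credentials are available on the cluster), the idea is that workers would read the original copy directly rather than having the data flow through the OpenRefine host:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DirectS3ReadSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf()
                .setAppName("OpenRefine")
                .setMaster("spark://spark-master.example.org:7077"));

        // Each executor reads its own partitions straight from S3, instead of
        // the data being copied into the local workspace first and then
        // streamed to the workers at every operation.
        JavaRDD<String> lines = sc.textFile("s3a://example-bucket/dataset.csv");
        System.out.println("partitions: " + lines.getNumPartitions());
        sc.close();
    }
}
```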

@fcomte commented Jan 5, 2022

Yes, it would be important not to copy the whole file into the workspace when using Spark.

@wetneb transferred this issue from OpenRefine/OpenRefine on May 17, 2023
@wetneb (Member, Author) commented May 17, 2023

Another approach that also looks promising to me is to make it possible for the workspace itself to be accessed via HDFS.

Having the possibility of executing a workflow directly from a file is appealing on paper, but it is likely to be quite slow as soon as the original file is not splittable. Most import formats that OpenRefine supports are not splittable (even CSV with the default settings), so you would be limited to running the workflow on a single node, which defeats the purpose of Spark.
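
To make the splittability point concrete, here is a small sketch (paths are placeholders; gzip is used just as a common example of a non-splittable input):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SplittabilitySketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf()
                .setAppName("splittability-demo")
                .setMaster("local[*]"));

        // A gzip-compressed file cannot be split: Spark reads it as a single
        // partition, so a single task (and thus a single node) does all the work.
        int gz = sc.textFile("hdfs:///data/table.csv.gz").getNumPartitions();

        // The same data stored in a splittable form can be divided into many
        // partitions (roughly one per HDFS block), which is what lets Spark
        // spread the workflow across the cluster.
        int plain = sc.textFile("hdfs:///data/table.csv").getNumPartitions();

        System.out.println("gzip: " + gz + " partition(s), plain: " + plain);
        sc.close();
    }
}
```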
