Connection to an external Spark cluster #1
Comments
This is now configurable. As things stand it is not easy to use though: we also need to publish a …
I saw that we can start a Spark driver connected to a Mesos engine. I am very interested in connecting OpenRefine and Spark in a cloud-native way.
Hi @fcomte, I have not looked into this, but it looks like it should be feasible! Just to check, is your goal to run the OpenRefine web server in Kubernetes itself, or just to have OpenRefine connect to a Spark cluster running in Kubernetes? For the latter, it seems doable, although there might be some minor things to tweak on OpenRefine's side. The existing parameter …

That being said, running OpenRefine with Spark should mostly be useful when deploying OpenRefine workflows in some production environment, rather than using it interactively through OpenRefine's UI. The latter is possible of course, but it is likely to be less responsive than with the default runner (especially when connecting to a remote cluster).

I am really keen to understand your needs better in any case.
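For reference, on Spark's side connecting a driver to a Kubernetes-backed cluster is done through the master URL. A minimal sketch, assuming placeholder values for the API server address, container image and namespace (how OpenRefine would expose these settings is not decided here):

```java
import org.apache.spark.sql.SparkSession;

public class K8sSparkConnection {
    public static void main(String[] args) {
        // Placeholder values: replace with your cluster's API server address and Spark image.
        SparkSession spark = SparkSession.builder()
                .appName("openrefine-spark")
                // The "k8s://" prefix tells Spark to schedule executors on a Kubernetes cluster.
                .master("k8s://https://kubernetes.example.org:6443")
                .config("spark.kubernetes.container.image", "example/spark:3.4.0")
                .config("spark.kubernetes.namespace", "openrefine")
                .config("spark.executor.instances", "2")
                .getOrCreate();

        System.out.println("Connected, Spark version " + spark.version());
        spark.stop();
    }
}
```

Note that in client mode the executors need to reach back to the driver, which is simpler when the driver (here, OpenRefine) itself runs inside the cluster.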
I am running OpenRefine inside a container (Kubernetes), and it's working. Of course, in my context the Spark cluster should use a Kubernetes master. I have no specific use case: I manage a data platform, I think it would be great to empower users with this feature, and I am very interested in testing it myself. How would the data be handled in that case?
At the moment, it is possible to import a file from an external source like S3, but it will create a copy in the workspace (which is stored locally). That is not so useful as things stand, especially in the context of working with an external Spark cluster, because it means that data will be streamed from OpenRefine to the worker nodes and back at every operation.
Yes, it would be important not to copy the whole content of the file into the workspace if we use Spark.
One other approach which also looks promising to me is to make it possible to have the workspace itself accessed via HDFS. Having the possibility of executing a workflow directly from a file is appealing on paper, but it is likely to be quite slow as soon as the original file is not splittable. Most import formats that OpenRefine supports are not splittable (even CSV with the default settings), so you'd be limited to running the workflow on a single node, which defeats the purpose of Spark.
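To illustrate the splittability point with plain Spark (independently of OpenRefine's importers): a gzip-compressed file cannot be split, so it is read into a single partition, whereas an uncompressed text file can be divided across workers. A small sketch, with placeholder file paths:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SplittabilityDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("splittability-demo").setMaster("local[4]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // An uncompressed file can be split: Spark creates several partitions,
            // so the workflow can run in parallel across executors.
            JavaRDD<String> plain = sc.textFile("hdfs:///data/large-table.csv", 8);
            System.out.println("plain partitions: " + plain.getNumPartitions());

            // A gzip-compressed file is not splittable: it ends up in a single
            // partition, so the whole read happens on one executor.
            JavaRDD<String> gzipped = sc.textFile("hdfs:///data/large-table.csv.gz", 8);
            System.out.println("gzipped partitions: " + gzipped.getNumPartitions());
        }
    }
}
```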
When running OpenRefine with the Spark runner, we always spin up a local Spark instance for it. We should instead make it possible to configure OpenRefine to connect to an existing Spark cluster.
Proposed solution
Introduce the relevant configuration variables for the Spark runner to make this possible.
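A minimal sketch of what such a configuration hook could look like, assuming a hypothetical `refine.runner.spark.master` system property (the property names and defaults below are illustrative, not actual OpenRefine settings):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public final class SparkRunnerConfig {
    // Hypothetical property names; the actual keys would be defined by the Spark runner.
    private static final String MASTER_PROPERTY = "refine.runner.spark.master";
    private static final String APP_NAME_PROPERTY = "refine.runner.spark.appName";

    /** Builds a Spark context against the configured master, falling back to a local instance. */
    public static JavaSparkContext createContext() {
        String master = System.getProperty(MASTER_PROPERTY, "local[*]");
        String appName = System.getProperty(APP_NAME_PROPERTY, "OpenRefine");
        SparkConf conf = new SparkConf().setMaster(master).setAppName(appName);
        return new JavaSparkContext(conf);
    }
}
```

Started with something like `-Drefine.runner.spark.master=spark://host:7077` (or a `k8s://` URL), this would connect to the external cluster, while the default preserves the current behaviour of spinning up a local instance.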