Cannot insert spark dataframe in batches into a collection #14
Comments
The error I get says: "Py4JJavaError: An error occurred while calling o100353.save. : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: milvus." However, the jar file is right where I specified it. The SparkSession doesn't return any error on the ClickHouse driver I also specify, and it uses it just fine.
@timtimich35 It looks like the spark-milvus jar is not being loaded correctly. I'm not very familiar with PySpark, but have you tried adding the jars to the driver and executor class paths as well? .config("spark.driver.extraClassPath", '/data/notebook_files/clickhouse-native-jdbc-shaded-2.6.5.jar,/data/notebook_files/spark-milvus-1.0.0-SNAPSHOT.jar') .config("spark.executor.extraClassPath", '/data/notebook_files/clickhouse-native-jdbc-shaded-2.6.5.jar,/data/notebook_files/spark-milvus-1.0.0-SNAPSHOT.jar')
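For reference, a minimal sketch of how those options could be combined in a PySpark session builder. The jar paths are the ones quoted above; everything else (app name, etc.) is an assumption. Note that spark.jars takes a comma-separated list, while the extraClassPath settings conventionally use the OS path separator (':' on Linux):

```python
from pyspark.sql import SparkSession

# Jar paths quoted from the comment above; adjust to your environment.
jars = ",".join([
    "/data/notebook_files/clickhouse-native-jdbc-shaded-2.6.5.jar",
    "/data/notebook_files/spark-milvus-1.0.0-SNAPSHOT.jar",
])
class_path = jars.replace(",", ":")  # extraClassPath uses the OS path separator

spark = (
    SparkSession.builder
    .appName("milvus-batch-insert")                       # placeholder app name
    .config("spark.jars", jars)                           # ship jars to driver and executors
    .config("spark.driver.extraClassPath", class_path)    # make them visible on the driver classpath
    .config("spark.executor.extraClassPath", class_path)  # ...and on the executors
    .getOrCreate()
)
```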
@wayblink Nope, I haven't tried that approach yet. Will do and get back to you.
Dependencies:
python 3.8.12
pyspark 3.5.0
pymilvus 2.4.1
grpcio-tools 1.60.0
protobuf 4.25.3
Milvus cluster deployed in k8s using Milvus Operator 0.9.13
SparkSession setup:
Milvus setup:
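The code block for this step did not come through in the issue text. For illustration only, here is a hypothetical pymilvus 2.4 collection setup matching the data described below (an int64 id plus a 3000-dimensional float vector); the host and collection name are placeholders, not the author's actual values:

```python
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

# Placeholder connection details; the actual Milvus service address is not shown in the issue.
connections.connect(alias="default", host="milvus.milvus.svc.cluster.local", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=3000),
]
schema = CollectionSchema(fields, description="2.5M rows of 3000-dim float vectors")
collection = Collection(name="my_collection", schema=schema)  # placeholder collection name
```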
Given:
I work in a DataLore IDE deployed in k8s along with Milvus and Spark.
I have a Spark dataframe of 2.5 million rows and 2 columns: an id and a 3000-element vector of floats.
I try to load it in batches of 100,000 records each, so it should be 25 iterations in total.
None of the batches gets inserted.
Insert operation:
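The original code block is missing here as well. As a sketch only, this is roughly what a batched write through the spark-milvus data source could look like; the option names follow the connector's sample usage, while the host, port, collection name, and the id-based slicing of the dataframe are assumptions, not the author's actual code:

```python
# Hypothetical batched insert: 2.5M rows in chunks of 100,000 (25 batches),
# assuming ids are contiguous integers starting at 0.
batch_size = 100_000
total_rows = df.count()  # df: the 2.5M-row dataframe with columns "id" and "vector"
num_batches = (total_rows + batch_size - 1) // batch_size

for i in range(num_batches):
    lo, hi = i * batch_size, (i + 1) * batch_size
    batch = df.filter((df.id >= lo) & (df.id < hi))
    (
        batch.write
        .mode("append")
        .format("milvus")
        .option("milvus.host", "milvus.milvus.svc.cluster.local")  # placeholder
        .option("milvus.port", "19530")
        .option("milvus.collection.name", "my_collection")         # placeholder
        .option("milvus.collection.primaryKeyField", "id")
        .option("milvus.collection.vectorField", "vector")
        .option("milvus.collection.vectorDim", "3000")
        .save()
    )
```

If format("milvus") raises DATA_SOURCE_NOT_FOUND as in the error above, the connector jar is not on the classpath of the JVM performing the write, which is what the classpath suggestion in the comments is addressing.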
Can you please help me understand what I am doing wrong?