
XGBoost4j-spark train failed on the CPU hosts #10926

Open
NvTimLiu opened this issue Oct 24, 2024 · 2 comments
Comments

@NvTimLiu

XGBoost4j-spark train failed on the CPU hosts.

ENVS:

1. OS: ubuntu22.04 / NGC
2. Spark version: 3.5.1
3. XGBoost4j-spark: xgboost4j-spark-gpu_2.12-2.2.0-SNAPSHOT.jar
4. rapids-4-spark: 24.12.0-SNAPSHOT
5. Failed test: agaricus train


 + ngc batch exec --commandline bash -c 'cat /raid/tmp/driver-agaricus-Main-CPU.log' 7117740
  WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  INFO SparkContext: Running Spark version 3.5.0
  INFO SparkContext: OS info Linux, 5.4.0-107-generic, amd64
  INFO SparkContext: Java version 1.8.0_402
  WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/and LOCAL_DIRS in YARN).
  INFO ResourceUtils: ==============================================================
  INFO ResourceUtils: No custom resources configured for spark.driver.
  INFO ResourceUtils: ==============================================================
  INFO SparkContext: Submitted application: Agaricus-Mai-csv
  INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 8, script: , vendor: , memory t: 32768, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
  INFO ResourceProfile: Limiting resource is cpus at 8 tasks per executor
  INFO ResourceProfileManager: Added ResourceProfile id: 0
  INFO SecurityManager: Changing view acls to: root
  INFO SecurityManager: Changing modify acls to: root
  INFO SecurityManager: Changing view acls groups to: 
  INFO SecurityManager: Changing modify acls groups to: 
  INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: root; groups with view ers with modify permissions: root; groups with modify permissions: EMPTY
  INFO Utils: Successfully started service 'sparkDriver' on port 39803.
  INFO SparkEnv: Registering MapOutputTracker
  INFO SparkEnv: Registering BlockManagerMaster
  INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
  INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
  INFO SparkEnv: Registering BlockManagerMasterHeartbeat
  INFO DiskBlockManager: Created local directory at /raid/tmp/blockmgr-0034f8a7-578b-4364-bce3-68225f9bf27b
  INFO MemoryStore: MemoryStore started with capacity 8.4 GiB
  INFO SparkEnv: Registering OutputCommitCoordinator
  INFO JettyUtils: Start Jetty 0.0.0.0:4040 for SparkUI
  INFO Utils: Successfully started service 'SparkUI' on port 4040.
  INFO SparkContext: Added JAR file:///test/xgboost4j-spark.jar at spark://127.0.0.1:39803/jars/xgboost4j-spark.jar with timestamp 
  INFO SparkContext: Added JAR file:/test/xgb-apps.jar at spark://127.0.0.1:39803/jars/xgb-apps.jar with timestamp 1729610887859
  INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://127.0.0.1:7077...
  INFO TransportClientFactory: Successfully created connection to /127.0.0.1:7077 after 41 ms (0 ms spent in bootstraps)
  INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20241022152809-0001
  INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20241022152809-0001/0 on worker-20241022145613-127.0.0.1-35209  8 core(s)
  INFO StandaloneSchedulerBackend: Granted executor ID app-20241022152809-0001/0 on hostPort 127.0.0.1:35209 with 8 core(s), 32.0 GiB RAM
  INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20241022152809-0001/1 on worker-20241022145613-127.0.0.1-35209  8 core(s)
  INFO StandaloneSchedulerBackend: Granted executor ID app-20241022152809-0001/1 on hostPort 127.0.0.1:35209 with 8 core(s), 32.0 GiB RAM
  INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20241022152809-0001/2 on worker-20241022145611-127.0.0.1-42465  8 core(s)
  INFO StandaloneSchedulerBackend: Granted executor ID app-20241022152809-0001/2 on hostPort 127.0.0.1:42465 with 8 core(s), 32.0 GiB RAM
  INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20241022152809-0001/3 on worker-20241022145611-127.0.0.1-42465  8 core(s)
  INFO StandaloneSchedulerBackend: Granted executor ID app-20241022152809-0001/3 on hostPort 127.0.0.1:42465 with 8 core(s), 32.0 GiB RAM
  INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40511.
  INFO NettyBlockTransferService: Server created on 127.0.0.1:40511
  INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
  INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 127.0.0.1, 40511, None)
  INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:40511 with 8.4 GiB RAM, BlockManagerId(driver, 127.0.0.1, 40511, 
  INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 127.0.0.1, 40511, None)
  INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 127.0.0.1, 40511, None)
  INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20241022152809-0001/3 is now RUNNING
  INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20241022152809-0001/2 is now RUNNING
  INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20241022152809-0001/1 is now RUNNING
  INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20241022152809-0001/0 is now RUNNING
  INFO SingleEventLogFileWriter: Logging events to file:/tmp/spark-events/app-20241022152809-0001.inprogress
  INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
  INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
  INFO SharedState: Warehouse path is 'file:/spark-warehouse'.
  WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
  INFO MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
  INFO MetricsSystemImpl: s3a-file-system metrics system started
  INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor)  ID 2,  ResourceProfileId 0
  INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:40013 with 16.9 GiB RAM, BlockManagerId(2, 127.0.0.1, 40013, None)
  INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor)  ID 0,  ResourceProfileId 0
  INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:35771 with 16.9 GiB RAM, BlockManagerId(0, 127.0.0.1, 35771, None)
  INFO InMemoryFileIndex: It took 83 ms to list leaf files for 1 paths.
  INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor)  ID 1,  ResourceProfileId 0
  INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor)  ID 3,  ResourceProfileId 0
  INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:38295 with 16.9 GiB RAM, BlockManagerId(1, 127.0.0.1, 38295, None)
  INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:45991 with 16.9 GiB RAM, BlockManagerId(3, 127.0.0.1, 45991, None)
  INFO InMemoryFileIndex: It took 26 ms to list leaf files for 1 paths.
 
 ------ Training ------
 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
 Exception in thread "main" org.apache.spark.sql.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `features` cannot be resolved. Did you mean one of the following? [`feature_0`, `feature_1`, `feature_2`, `feature_3`, `feature_4`].;
 'Project [cast(label#509 as float) AS label#639, 'features]
 +- Project [cast(label#0 as double) AS label#509, feature_0#1, feature_1#2, feature_2#3, feature_3#4, feature_4#5, feature_5#6, feature_6#7, feature_7#8, #10, feature_10#11, feature_11#12, feature_12#13, feature_13#14, feature_14#15, feature_15#16, feature_16#17, feature_17#18, feature_18#19, feature_19#20, _21#22, feature_22#23, ... 103 more fields]
    +- Relation eature_1#2,feature_2#3,feature_3#4,feature_4#5,feature_5#6,feature_6#7,feature_7#8,feature_8#9,feature_9#10,feature_10#11,feature_11#12,feature_12#13,featureeature_15#16,feature_16#17,feature_17#18,feature_18#19,feature_19#20,feature_20#21,feature_21#22,feature_22#23,... 103 more fields] csv
 
 	at org.apache.spark.sql.errors.QueryCompilationErrors$.unresolvedAttributeError(QueryCompilationErrors.scala:307)
 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$te(CheckAnalysis.scala:147)
 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:266)
 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:264)
 	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:244)
 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:264)
 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:264)
 	at scala.collection.immutable.Stream.foreach(Stream.scala:533)
 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2(CheckAnalysis.scala:264)
 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2$adapted(CheckAnalysis.scala:182)
 	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:244)
 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0(CheckAnalysis.scala:182)
 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0$(CheckAnalysis.scala:164)
 	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis0(Analyzer.scala:188)
 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:160)
 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:150)
 	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:188)
 	at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:211)
 	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
 	at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:208)
 	at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:77)
 	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:138)
 	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:219)
 	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:546)
 	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:219)
 	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
 	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:218)
 	at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:77)
 	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74)
 	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66)
 	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:91)
 	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
 	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:89)
 	at org.apache.spark.sql.Dataset.withPlan(Dataset.scala:4363)
 	at org.apache.spark.sql.Dataset.select(Dataset.scala:1541)
 	at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.preprocess(XGBoostEstimator.scala:210)
 	at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.preprocess$(XGBoostEstimator.scala:188)
 	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.preprocess(XGBoostClassifier.scala:33)
 	at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train(XGBoostEstimator.scala:415)
 	at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train$(XGBoostEstimator.scala:409)
 	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:33)
 	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:33)
 	at org.apache.spark.ml.Predictor.fit(Predictor.scala:114)
 	at com.nvidia.spark.examples.agaricus.Main$.$anonfun$main$8(Main.scala:77)
 	at com.nvidia.spark.examples.utility.Benchmark.time(Benchmark.scala:29)
 	at com.nvidia.spark.examples.agaricus.Main$.main(Main.scala:77)
 	at com.nvidia.spark.examples.agaricus.Main.main(Main.scala)
 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 	at java.lang.reflect.Method.invoke(Method.java:498)
 	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
 	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1029)
 	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
 	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
 	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
 	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
 	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
 	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
  INFO SparkContext: Invoking stop() from shutdown hook
  INFO SparkContext: SparkContext is stopping with exitCode 0.
  INFO SparkUI: Stopped Spark web UI at http://127.0.0.1:4040
  INFO StandaloneSchedulerBackend: Shutting down all executors
  INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Asking each executor to shut down
  INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
  INFO MemoryStore: MemoryStore cleared
  INFO BlockManager: BlockManager stopped
  INFO BlockManagerMaster: BlockManagerMaster stopped
  INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
  ERROR TransportRequestHandler: Error sending result StreamResponse[streamId=/jars/xgboost4j-8696354,body=FileSegmentManagedBuffer[file=/test/xgboost4j-spark.jar,offset=0,length=338696354]] to /127.0.0.1:33046; closing connection
 io.netty.channel.StacklessClosedChannelException
 	at io.netty.channel.AbstractChannel.close(ChannelPromise)(Unknown Source)
  INFO SparkContext: Successfully stopped SparkContext
  INFO ShutdownHookManager: Shutdown hook called
  INFO ShutdownHookManager: Deleting directory /tmp/spark-1d29e677-7338-4fc8-bec8-e57284298ca1
  INFO ShutdownHookManager: Deleting directory /raid/tmp/spark-0dcd6655-62da-49f8-ba12-59f4e9c5739c
  INFO MetricsSystemImpl: Stopping s3a-file-system metrics system...
  INFO MetricsSystemImpl: s3a-file-system metrics system stopped.
  INFO MetricsSystemImpl: s3a-file-system metrics system shutdown complete.
 
 real	0m15.488s
 user	0m26.418s
 sys	0m3.454s
 
 0
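For context: the analyzer error above says the estimator is looking up a single vector column named `features`, while the CSV load produces individual `feature_0` … `feature_N` columns. A minimal workaround sketch using Spark ML's `VectorAssembler` (the CSV path, app name, and column names here are assumptions inferred from the log, not from the actual test code):

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object AssembleFeaturesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Agaricus-VectorAssembler-sketch")
      .getOrCreate()

    // Hypothetical input path; the real test reads the agaricus CSV dataset.
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/agaricus.csv")

    // Collect the individual feature_N columns produced by the CSV reader.
    val featureCols = raw.columns.filter(_.startsWith("feature_"))

    // Assemble them into one vector column named "features", which is the
    // default column name the estimator tries to resolve in preprocess().
    val assembler = new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")

    val assembled = assembler.transform(raw)
    // "assembled" now carries a "features" vector column alongside "label",
    // so a classifier configured with the default featuresCol can resolve it.
    assembled.printSchema()
  }
}
```

Alternatively, if the estimator supports passing the raw column names directly (e.g. a `setFeaturesCol(Array[String])` style API on the GPU path), the CPU path may simply be missing that handling, which would match the fix the maintainer mentions below.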
 
@NvTimLiu
Author

@wbo4958

Contributor

wbo4958 commented Oct 24, 2024

Hi @NvTimLiu, Thx, I will fix it.

Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants