{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":678861833,"defaultBranch":"master","name":"spark","ownerLogin":"riven-blade","currentUserCanPush":false,"isFork":false,"isEmpty":false,"createdAt":"2023-08-15T14:55:12.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/98205714?v=4","public":true,"private":false,"isOrgOwned":false},"refInfo":{"name":"","listCacheKey":"v0:1692124281.0","currentOid":""},"activityList":{"items":[{"before":null,"after":"0a04721973c34a3324c41ac68b4f9c203ecedf40","ref":"refs/heads/branch-1.5","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"[SPARK-17721][MLLIB][BACKPORT] Fix for multiplying transposed SparseMatrix with SparseVector\n\nBackport PR of changes relevant to mllib only, but otherwise identical to #15296\n\njkbradley\n\nAuthor: Bjarne Fruergaard \n\nCloses #15311 from bwahlgreen/bugfix-spark-17721-1.6.\n\n(cherry picked from commit 376545e4d38cd41b4a3233819d63bb81f5c83283)\nSigned-off-by: Joseph K. Bradley ","shortMessageHtmlLink":"[SPARK-17721][MLLIB][BACKPORT] Fix for multiplying transposed SparseM…"}},{"before":null,"after":"117843f85e8e69a43ffa531716a37c8c05dbabb0","ref":"refs/heads/branch-1.0","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"[SPARK-9633] [BUILD] SBT download locations outdated; need an update\n\nRemove 2 defunct SBT download URLs and replace with the 1 known download URL. Also, use https.\nFollow up on https://github.com/apache/spark/pull/7792\n\nAuthor: Sean Owen \n\nCloses #7956 from srowen/SPARK-9633 and squashes the following commits:\n\ncaa40bd [Sean Owen] Remove 2 defunct SBT download URLs and replace with the 1 known download URL. Also, use https.\n\nConflicts:\n\tsbt/sbt-launch-lib.bash","shortMessageHtmlLink":"[SPARK-9633] [BUILD] SBT download locations outdated; need an update"}},{"before":null,"after":"11ee9d191e26a41a44ff0ca8730a129934942ee7","ref":"refs/heads/branch-1.1","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"[SPARK-11813][MLLIB] Avoid serialization of vocab in Word2Vec\n\njira: https://issues.apache.org/jira/browse/SPARK-11813\n\nI found the problem during training a large corpus. Avoid serialization of vocab in Word2Vec has 2 benefits.\n1. Performance improvement for less serialization.\n2. Increase the capacity of Word2Vec a lot.\nCurrently in the fit of word2vec, the closure mainly includes serialization of Word2Vec and 2 global table.\nthe main part of Word2vec is the vocab of size: vocab * 40 * 2 * 4 = 320 vocab\n2 global table: vocab * vectorSize * 8. If vectorSize = 20, that's 160 vocab.\n\nTheir sum cannot exceed Int.max due to the restriction of ByteArrayOutputStream. In any case, avoiding serialization of vocab helps decrease the size of the closure serialization, especially when vectorSize is small, thus to allow larger vocabulary.\n\nActually there's another possible fix, make local copy of fields to avoid including Word2Vec in the closure. Let me know if that's preferred.\n\nAuthor: Yuhao Yang \n\nCloses #9803 from hhbyyh/w2vVocab.\n\n(cherry picked from commit e391abdf2cb6098a35347bd123b815ee9ac5b689)\nSigned-off-by: Xiangrui Meng ","shortMessageHtmlLink":"[SPARK-11813][MLLIB] Avoid serialization of vocab in Word2Vec"}},{"before":null,"after":"2f3e4e36017d16d67086fd4ecaf39636a2fb4b7c","ref":"refs/heads/branch-3.0","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"[SPARK-39932][SQL] WindowExec should clear the final partition buffer\n\n### What changes were proposed in this pull request?\n\nExplicitly clear final partition buffer if can not find next in `WindowExec`. The same fix in `WindowInPandasExec`\n\n### Why are the changes needed?\n\nWe do a repartition after a window, then we need do a local sort after window due to RoundRobinPartitioning shuffle.\n\nThe error stack:\n```java\nExternalAppendOnlyUnsafeRowArray INFO - Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter\n\norg.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 65536 bytes of memory, got 0\n\tat org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157)\n\tat org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:97)\n\tat org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:352)\n\tat org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.allocateMemoryForRecordIfNecessary(UnsafeExternalSorter.java:435)\n\tat org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:455)\n\tat org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:138)\n\tat org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:226)\n\tat org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.$anonfun$prepareShuffleDependency$10(ShuffleExchangeExec.scala:355)\n```\n\n`WindowExec` only clear buffer in `fetchNextPartition` so the final partition buffer miss to clear.\n\nIt is not a big problem since we have task completion listener.\n```scala\ntaskContext.addTaskCompletionListener(context -> {\n cleanupResources();\n});\n```\n\nThis bug only affects if the window is not the last operator for this task and the follow operator like sort.\n\n### Does this PR introduce _any_ user-facing change?\n\nyes, bug fix\n\n### How was this patch tested?\n\nN/A\n\nCloses #37358 from ulysses-you/window.\n\nAuthored-by: ulysses-you \nSigned-off-by: Hyukjin Kwon \n(cherry picked from commit 1fac870126c289a7ec75f45b6b61c93b9a4965d4)\nSigned-off-by: Hyukjin Kwon ","shortMessageHtmlLink":"[SPARK-39932][SQL] WindowExec should clear the final partition buffer"}},{"before":null,"after":"307f27e24e17afd92030194a3e6fec312fc19f4f","ref":"refs/heads/branch-1.2","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"[SPARK-11813][MLLIB] Avoid serialization of vocab in Word2Vec\n\njira: https://issues.apache.org/jira/browse/SPARK-11813\n\nI found the problem during training a large corpus. Avoid serialization of vocab in Word2Vec has 2 benefits.\n1. Performance improvement for less serialization.\n2. Increase the capacity of Word2Vec a lot.\nCurrently in the fit of word2vec, the closure mainly includes serialization of Word2Vec and 2 global table.\nthe main part of Word2vec is the vocab of size: vocab * 40 * 2 * 4 = 320 vocab\n2 global table: vocab * vectorSize * 8. If vectorSize = 20, that's 160 vocab.\n\nTheir sum cannot exceed Int.max due to the restriction of ByteArrayOutputStream. In any case, avoiding serialization of vocab helps decrease the size of the closure serialization, especially when vectorSize is small, thus to allow larger vocabulary.\n\nActually there's another possible fix, make local copy of fields to avoid including Word2Vec in the closure. Let me know if that's preferred.\n\nAuthor: Yuhao Yang \n\nCloses #9803 from hhbyyh/w2vVocab.\n\n(cherry picked from commit e391abdf2cb6098a35347bd123b815ee9ac5b689)\nSigned-off-by: Xiangrui Meng ","shortMessageHtmlLink":"[SPARK-11813][MLLIB] Avoid serialization of vocab in Word2Vec"}},{"before":null,"after":"330f7a789fb9d3b1e501f180f477bf0ad578d10e","ref":"refs/heads/branch-3.3","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"[SPARK-44581][YARN] Fix the bug that ShutdownHookManager gets wrong UGI from SecurityManager of ApplicationMaster\n\n### What changes were proposed in this pull request?\n\nI make the SecurityManager instance a lazy value\n\n### Why are the changes needed?\n\nfix the bug in issue [SPARK-44581](https://issues.apache.org/jira/browse/SPARK-44581)\n\n**Bug:**\nIn spark3.2 it throws the org.apache.hadoop.security.AccessControlException, but in spark2.4 this hook does not throw exception.\n\nI rebuild the hadoop-client-api.jar, and add some debug log before the hadoop shutdown hook is created, and rebuild the spark-yarn.jar to add some debug log when creating the spark shutdown hook manager, here is the screenshot of the log:\n![image](https://github.com/apache/spark/assets/62563545/ea338db3-646c-432c-bf16-1f445adc2ad9)\n\nWe can see from the screenshot, the ShutdownHookManager is initialized before the ApplicationManager create a new ugi.\n\n**Reason**\n\nThe main cause is that ShutdownHook thread is created before we create the ugi in ApplicationMaster.\n\nWhen we set the config key \"hadoop.security.credential.provider.path\", the ApplicationMaster will try to get a filesystem when generating SSLOptions, and when initialize the filesystem during which it will generate a new thread whose ugi is inherited from the current process (yarn).\nAfter this, it will generate a new ugi (SPARK_USER) in ApplicationMaster and execute the doAs() function.\n\nHere is the chain of the call:\nApplicationMaster.(ApplicationMaster.scala:83) -> org.apache.spark.SecurityManager.(SecurityManager.scala:98) -> org.apache.spark.SSLOptions$.parse(SSLOptions.scala:188) -> org.apache.hadoop.conf.Configuration.getPassword(Configuration.java:2353) -> org.apache.hadoop.conf.Configuration.getPasswordFromCredentialProviders(Configuration.java:2434) -> org.apache.hadoop.security.alias.CredentialProviderFactory.getProviders(CredentialProviderFactory.java:82)\n\n### Does this PR introduce _any_ user-facing change?\n\nNo\n\n### How was this patch tested?\n\nI didn't add new UnitTest for this, but I rebuild the package, and runs a program in my cluster, and turns out that the user when I delete the staging file turns to be the same with the SPARK_USER.\n\nCloses #42405 from liangyu-1/SPARK-44581.\n\nAuthored-by: 余良 \nSigned-off-by: Kent Yao \n(cherry picked from commit e584ed4ad96a0f0573455511d7be0e9b2afbeb96)\nSigned-off-by: Kent Yao ","shortMessageHtmlLink":"[SPARK-44581][YARN] Fix the bug that ShutdownHookManager gets wrong U…"}},{"before":null,"after":"379567fcdb0d85ca0c5598db70c5a8b9efca552a","ref":"refs/heads/branch-0.7","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"fix block manager UI display issue when enable spark.cleaner.ttl\n\nConflicts:\n\n\tcore/src/main/scala/spark/storage/StorageUtils.scala","shortMessageHtmlLink":"fix block manager UI display issue when enable spark.cleaner.ttl"}},{"before":null,"after":"4be566062defa249435c4d72eb106fe7b933e023","ref":"refs/heads/branch-2.4","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"Update Spark key negotiation protocol","shortMessageHtmlLink":"Update Spark key negotiation protocol"}},{"before":null,"after":"4d2d3d47e00e78893b1ecd5a9a9070adc5243ac9","ref":"refs/heads/branch-2.1","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"[SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL][BACKPORT-2.1] Shuffle+Repartition on a DataFrame could lead to incorrect answers\n\n## What changes were proposed in this pull request?\n\n Back port of #20393 and #22079.\n\n Currently shuffle repartition uses RoundRobinPartitioning, the generated result is nondeterministic since the sequence of input rows are not determined.\n\n The bug can be triggered when there is a repartition call following a shuffle (which would lead to non-deterministic row ordering), as the pattern shows below:\n upstream stage -> repartition stage -> result stage\n (-> indicate a shuffle)\n When one of the executors process goes down, some tasks on the repartition stage will be retried and generate inconsistent ordering, and some tasks of the result stage will be retried generating different data.\n\n The following code returns 931532, instead of 1000000:\n ```\n import scala.sys.process._\n\n import org.apache.spark.TaskContext\n val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>\n x\n }.repartition(200).map { x =>\n if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {\n throw new Exception(\"pkill -f java\".!!)\n }\n x\n }\n res.distinct().count()\n ```\n\n In this PR, we propose a most straight-forward way to fix this problem by performing a local sort before partitioning, after we make the input row ordering deterministic, the function from rows to partitions is fully deterministic too.\n\n The downside of the approach is that with extra local sort inserted, the performance of repartition() will go down, so we add a new config named `spark.sql.execution.sortBeforeRepartition` to control whether this patch is applied. The patch is default enabled to be safe-by-default, but user may choose to manually turn it off to avoid performance regression.\n\n This patch also changes the output rows ordering of repartition(), that leads to a bunch of test cases failure because they are comparing the results directly.\n\n Add unit test in ExchangeSuite.\n\n With this patch(and `spark.sql.execution.sortBeforeRepartition` set to true), the following query returns 1000000:\n ```\n import scala.sys.process._\n\n import org.apache.spark.TaskContext\n\n spark.conf.set(\"spark.sql.execution.sortBeforeRepartition\", \"true\")\n\n val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>\n x\n }.repartition(200).map { x =>\n if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {\n throw new Exception(\"pkill -f java\".!!)\n }\n x\n }\n res.distinct().count()\n\n res7: Long = 1000000\n ```\n\n Author: Xingbo Jiang \n\nAuthor: Xingbo Jiang \nAuthor: Henry Robinson \n\nCloses #22211 from henryr/spark-23207-branch-2.1.","shortMessageHtmlLink":"[SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL][BACKPORT-2.…"}},{"before":null,"after":"5b021ce0990ec675afc6939cc2c06f041c973d17","ref":"refs/heads/branch-0.5","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"Change version to 0.5.3-SNAPSHOT","shortMessageHtmlLink":"Change version to 0.5.3-SNAPSHOT"}},{"before":null,"after":"5ed89ceaf367590f79401abbf9ff7fc66507fe4e","ref":"refs/heads/branch-2.0","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"[SPARK-25089][R] removing lintr checks for 2.0\n\n## What changes were proposed in this pull request?\n\nsince 2.0 will be EOLed some time in the not too distant future, and we'll be moving the builds from centos to ubuntu, i think it's fine to disable R linting rather than going down the rabbit hole of trying to fix this stuff.\n\n## How was this patch tested?\n\nthe build system will test this\n\nCloses #22074 from shaneknapp/removing-lintr-2.0.\n\nAuthored-by: shane knapp \nSigned-off-by: Sean Owen ","shortMessageHtmlLink":"[SPARK-25089][R] removing lintr checks for 2.0"}},{"before":null,"after":"61e034807dc555d7ceadc534fcb0c82d50fc8719","ref":"refs/heads/branch-3.1","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"[SPARK-41541][SQL] Fix call to wrong child method in SQLShuffleWriteMetricsReporter.decRecordsWritten()\n\n### What changes were proposed in this pull request?\n\nThis PR fixes a bug in `SQLShuffleWriteMetricsReporter.decRecordsWritten()`: this method is supposed to call the delegate `metricsReporter`'s `decRecordsWritten` method but due to a typo it calls the `decBytesWritten` method instead.\n\n### Why are the changes needed?\n\nOne of the situations where `decRecordsWritten(v)` is called while reverting shuffle writes from failed/canceled tasks. Due to the mixup in these calls, the _recordsWritten_ metric ends up being _v_ records too high (since it wasn't decremented) and the _bytesWritten_ metric ends up _v_ records too low, causing some failed tasks' write metrics to look like\n\n> {\"Shuffle Bytes Written\":-2109,\"Shuffle Write Time\":2923270,\"Shuffle Records Written\":2109}\n\ninstead of\n\n> {\"Shuffle Bytes Written\":0,\"Shuffle Write Time\":2923270,\"Shuffle Records Written\":0}\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nExisting tests / manual code review only. The existing SQLMetricsSuite contains end-to-end tests which exercise this class but they don't exercise the decrement path because they don't exercise the shuffle write failure paths. In theory I could add new unit tests but I don't think the ROI is worth it given that this class is intended to be a simple wrapper and it ~never changes (this PR is the first change to the file in 5 years).\n\nCloses #39086 from JoshRosen/SPARK-41541.\n\nAuthored-by: Josh Rosen \nSigned-off-by: Hyukjin Kwon \n(cherry picked from commit ed27121607cf526e69420a1faff01383759c9134)\nSigned-off-by: Hyukjin Kwon ","shortMessageHtmlLink":"[SPARK-41541][SQL] Fix call to wrong child method in SQLShuffleWriteM…"}},{"before":null,"after":"62b3158b80c14668b16476818d72f1296b06da66","ref":"refs/heads/branch-0.8","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"Merge pull request #583 from colorant/zookeeper.\n\nMinor fix for ZooKeeperPersistenceEngine to use configured working dir\n\nAuthor: Raymond Liu \n\nCloses #583 and squashes the following commits:\n\n91b0609 [Raymond Liu] Minor fix for ZooKeeperPersistenceEngine to use configured working dir\n\n(cherry picked from commit 68b2c0d02dbdca246ca686b871c06af53845d5b5)\nSigned-off-by: Aaron Davidson \n\nConflicts:\n\tcore/src/main/scala/org/apache/spark/deploy/master/ZooKeeperPersistenceEngine.scala","shortMessageHtmlLink":"Merge pull request #583 from colorant/zookeeper."}},{"before":null,"after":"65cc451c89fd03daed8c315a91d93067ccdc3a5c","ref":"refs/heads/branch-1.3","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"[SPARK-12363] [MLLIB] [BACKPORT-1.3] Remove setRun and fix PowerIterationClustering failed test\n\n## What changes were proposed in this pull request?\n\nBackport JIRA-SPARK-12363 to branch-1.3.\n\n## How was the this patch tested?\n\nUnit test.\n\ncc mengxr\n\nAuthor: Liang-Chi Hsieh \nAuthor: Xiangrui Meng \n\nCloses #11265 from viirya/backport-12363-1.3 and squashes the following commits:\n\nec076dd [Liang-Chi Hsieh] Fix scala style.\n7a3ef5f [Xiangrui Meng] use Graph instead of GraphImpl and update tests and example based on PIC paper\nb86018d [Liang-Chi Hsieh] Remove setRun and fix PowerIterationClustering failed test.","shortMessageHtmlLink":"[SPARK-12363] [MLLIB] [BACKPORT-1.3] Remove setRun and fix PowerItera…"}},{"before":null,"after":"75cc3b2da9ee0b51ecf0f13169f2b634e36a60c4","ref":"refs/heads/branch-2.3","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"[SPARK-28891][BUILD][2.3] backport do-release-docker.sh to branch-2.3\n\n### What changes were proposed in this pull request?\nThis PR re-enables `do-release-docker.sh` for branch-2.3.\n\nAccording to the release manager of Spark 2.3.3 maropu, `do-release-docker.sh` in the master branch. After applying #23098, the script does not work for branch-2.3.\n\n### Why are the changes needed?\nThis PR simplifies the release process in branch-2.3 simple.\n\nWhile Spark 2.3.x will not be released further, as dongjoon-hyun [suggested](https://github.com/apache/spark/pull/23098#issuecomment-524682234), it would be good to put this change for\n\n1. to reproduce this release by others\n2. to make the future urgent release simple\n\n### Does this PR introduce any user-facing change?\nNo\n\n### How was this patch tested?\nNo test is added.\nThis PR is used to create Spark 2.3.4-rc1\n\nCloses #25607 from kiszk/SPARK-28891.\n\nAuthored-by: Kazuaki Ishizaki \nSigned-off-by: Dongjoon Hyun ","shortMessageHtmlLink":"[SPARK-28891][BUILD][2.3] backport do-release-docker.sh to branch-2.3"}},{"before":null,"after":"7c7d7f6a878b02ece881266ee538f3e1443aa8c1","ref":"refs/heads/branch-2.2","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"[SPARK-26806][SS] EventTimeStats.merge should handle zeros correctly\n\n## What changes were proposed in this pull request?\n\nRight now, EventTimeStats.merge doesn't handle `zero.merge(zero)` correctly. This will make `avg` become `NaN`. And whatever gets merged with the result of `zero.merge(zero)`, `avg` will still be `NaN`. Then finally, we call `NaN.toLong` and get `0`, and the user will see the following incorrect report:\n\n```\n\"eventTime\" : {\n \"avg\" : \"1970-01-01T00:00:00.000Z\",\n \"max\" : \"2019-01-31T12:57:00.000Z\",\n \"min\" : \"2019-01-30T18:44:04.000Z\",\n \"watermark\" : \"1970-01-01T00:00:00.000Z\"\n }\n```\n\nThis issue was reported by liancheng .\n\nThis PR fixes the above issue.\n\n## How was this patch tested?\n\nThe new unit tests.\n\nCloses #23718 from zsxwing/merge-zero.\n\nAuthored-by: Shixiong Zhu \nSigned-off-by: Shixiong Zhu \n(cherry picked from commit 03a928cbecaf38bbbab3e6b957fcbb542771cfbd)\nSigned-off-by: Shixiong Zhu ","shortMessageHtmlLink":"[SPARK-26806][SS] EventTimeStats.merge should handle zeros correctly"}},{"before":null,"after":"9caf3a9659821a3b4fd4394c9f4134adff9caf88","ref":"refs/heads/branch-1.0-jdbc","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"[SPARK-2696] Reduce default value of spark.serializer.objectStreamReset\n\nThe current default value of spark.serializer.objectStreamReset is 10,000.\nWhen trying to re-partition (e.g., to 64 partitions) a large file (e.g., 500MB), containing 1MB records, the serializer will cache 10000 x 1MB x 64 ~= 640 GB which will cause out of memory errors.\n\nThis patch sets the default value to a more reasonable default value (100).\n\nAuthor: Hossein \n\nCloses #1595 from falaki/objectStreamReset and squashes the following commits:\n\n650a935 [Hossein] Updated documentation\n1aa0df8 [Hossein] Reduce default value of spark.serializer.objectStreamReset\n\n(cherry picked from commit 66f26a4610aede57322cb7e193a50aecb6c57d22)\nSigned-off-by: Matei Zaharia ","shortMessageHtmlLink":"[SPARK-2696] Reduce default value of spark.serializer.objectStreamReset"}},{"before":null,"after":"a233fac0b8bf8229d938a24f2ede2d9d8861c284","ref":"refs/heads/branch-1.6","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"[SPARK-19688][STREAMING] Not to read `spark.yarn.credentials.file` from checkpoint.\n\n## What changes were proposed in this pull request?\n\nReload the `spark.yarn.credentials.file` property when restarting a streaming application from checkpoint.\n\n## How was this patch tested?\n\nManual tested with 1.6.3 and 2.1.1.\nI didn't test this with master because of some compile problems, but I think it will be the same result.\n\n## Notice\n\nThis should be merged into maintenance branches too.\n\njira: [SPARK-21008](https://issues.apache.org/jira/browse/SPARK-21008)\n\nAuthor: saturday_s \n\nCloses #18230 from saturday-shi/SPARK-21008.\n\n(cherry picked from commit e92ffe6f1771e3fe9ea2e62ba552c1b5cf255368)\nSigned-off-by: Marcelo Vanzin ","shortMessageHtmlLink":"[SPARK-19688][STREAMING] Not to read spark.yarn.credentials.file fr…"}},{"before":null,"after":"a51932a5f34c33d65e5d0fdc5552868b1b1cebc4","ref":"refs/heads/branch-3.5","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"[SPARK-44653][SQL][FOLLOWUP] ResolveUnion should not combine Unions\n\n### What changes were proposed in this pull request?\n\nThis is a followup of https://github.com/apache/spark/pull/42315 , to fix by-name Union as well. Ideally, an analyzer rule should not invoke an optimizer rule, and it's cleaner to keep the df union hack in the df code.\n\n### Why are the changes needed?\n\nfix the regression for by-name Union as well.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo\n\n### How was this patch tested?\n\nnew tests\n\nCloses #42483 from cloud-fan/union.\n\nAuthored-by: Wenchen Fan \nSigned-off-by: Kent Yao \n(cherry picked from commit a53a9e93bd75a3c08ffb63abe21db0944e6601e9)\nSigned-off-by: Kent Yao ","shortMessageHtmlLink":"[SPARK-44653][SQL][FOLLOWUP] ResolveUnion should not combine Unions"}},{"before":null,"after":"a846a225cedf61187ece611143b45806837bf0bc","ref":"refs/heads/branch-3.4","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"[SPARK-44745][DOCS][K8S] Document shuffle data recovery from the remounted K8s PVCs\n\n### What changes were proposed in this pull request?\n\nThis PR aims to document an example of shuffle data recovery configuration from the remounted K8s PVCs.\n\n### Why are the changes needed?\n\nThis will help the users use this feature more easily.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nManual review because this is a doc-only change.\n\n![Screenshot 2023-08-09 at 1 39 48 PM](https://github.com/apache/spark/assets/9700541/8cc7240b-570d-4c2e-b90a-54795c18df0a)\n\n```\n$ kubectl logs -f xxx-exec-16 | grep Kube\n...\n23/08/09 21:09:21 INFO KubernetesLocalDiskShuffleExecutorComponents: Try to recover shuffle data.\n23/08/09 21:09:21 INFO KubernetesLocalDiskShuffleExecutorComponents: Found 192 files\n23/08/09 21:09:21 INFO KubernetesLocalDiskShuffleExecutorComponents: Try to recover /data/spark-x/executor-x/blockmgr-41a810ea-9503-447b-afc7-1cb104cd03cf/11/shuffle_0_11160_0.data\n23/08/09 21:09:21 INFO KubernetesLocalDiskShuffleExecutorComponents: Try to recover /data/spark-x/executor-x/blockmgr-41a810ea-9503-447b-afc7-1cb104cd03cf/0e/shuffle_0_10063_0.data\n23/08/09 21:09:21 INFO KubernetesLocalDiskShuffleExecutorComponents: Try to recover /data/spark-x/executor-x/blockmgr-41a810ea-9503-447b-afc7-1cb104cd03cf/0e/shuffle_0_10283_0.data\n23/08/09 21:09:21 INFO KubernetesLocalDiskShuffleExecutorComponents: Ignore a non-shuffle block file.\n```\n\nCloses #42417 from dongjoon-hyun/SPARK-44745.\n\nAuthored-by: Dongjoon Hyun \nSigned-off-by: Dongjoon Hyun \n(cherry picked from commit 4db378fae30733cbd2be41e95a3cd8ad2184e06f)\nSigned-off-by: Dongjoon Hyun ","shortMessageHtmlLink":"[SPARK-44745][DOCS][K8S] Document shuffle data recovery from the remo…"}},{"before":null,"after":"afcccb42c96cc7785a85c9463523f34d7a900a1d","ref":"refs/heads/master","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"[SPARK-44718][SQL] Match ColumnVector memory-mode config default to OffHeapMemoryMode config value\n\n### What changes were proposed in this pull request?\n\nSet the column vector default memory mode to depend on the off-heap memory mode flag. This is to prevent a user from using Vectorized-Reader with an on-heap column-vector by default when the off-heap memory mode is enabled on the cluster.\n\n### Why are the changes needed?\nAvoid the unintentional usage of on-heap memory in vectorized-reader when off-heap memory mode is enabled by the user.\n\n### Does this PR introduce _any_ user-facing change?\nNo.\n\n### How was this patch tested?\nManual & existing tests.\n\nCloses #42394 from majdyz/offheap-colvec-mode-default-value.\n\nLead-authored-by: Zamil Majdy \nCo-authored-by: Zamil Majdy \nSigned-off-by: Wenchen Fan ","shortMessageHtmlLink":"[SPARK-44718][SQL] Match ColumnVector memory-mode config default to O…"}},{"before":null,"after":"b2680ae70f18a39b549ece2e6ca9fc8331148c2f","ref":"refs/heads/branch-1.4","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"[SPARK-14468] Always enable OutputCommitCoordinator\n\n## What changes were proposed in this pull request?\n\n`OutputCommitCoordinator` was introduced to deal with concurrent task attempts racing to write output, leading to data loss or corruption. For more detail, read the [JIRA description](https://issues.apache.org/jira/browse/SPARK-14468).\n\nBefore: `OutputCommitCoordinator` is enabled only if speculation is enabled.\nAfter: `OutputCommitCoordinator` is always enabled.\n\nUsers may still disable this through `spark.hadoop.outputCommitCoordination.enabled`, but they really shouldn't...\n\n## How was this patch tested?\n\n`OutputCommitCoordinator*Suite`\n\nAuthor: Andrew Or \n\nCloses #12244 from andrewor14/always-occ.\n\n(cherry picked from commit 3e29e372ff518827bae9dcd26087946fde476843)\nSigned-off-by: Andrew Or ","shortMessageHtmlLink":"[SPARK-14468] Always enable OutputCommitCoordinator"}},{"before":null,"after":"d46c54c5be5dc62237a0cdf584787d7fe16eab31","ref":"refs/heads/branch-0.6","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"Merge pull request #485 from andyk/branch-0.6\n\nFixes link to issue tracker in documentation page \"Contributing to Spark\"","shortMessageHtmlLink":"Merge pull request #485 from andyk/branch-0.6"}},{"before":null,"after":"e63783a23f49fafc5d9f464ecfd107d19bd87787","ref":"refs/heads/branch-0.9","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"[MAINTENANCE] Closes #2854\n\nThis commit exists to close a pull request on github.","shortMessageHtmlLink":"[MAINTENANCE] Closes #2854"}},{"before":null,"after":"f67e168acd9c13d27c55807736b7fd7cdf870e29","ref":"refs/heads/branch-3.2","pushedAt":"2023-08-15T18:31:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"riven-blade","name":"blade","path":"/riven-blade","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/98205714?s=80&v=4"},"commit":{"message":"[SPARK-44581][YARN] Fix the bug that ShutdownHookManager gets wrong UGI from SecurityManager of ApplicationMaster\n\n### What changes were proposed in this pull request?\n\nI make the SecurityManager instance a lazy value\n\n### Why are the changes needed?\n\nfix the bug in issue [SPARK-44581](https://issues.apache.org/jira/browse/SPARK-44581)\n\n**Bug:**\nIn spark3.2 it throws the org.apache.hadoop.security.AccessControlException, but in spark2.4 this hook does not throw exception.\n\nI rebuild the hadoop-client-api.jar, and add some debug log before the hadoop shutdown hook is created, and rebuild the spark-yarn.jar to add some debug log when creating the spark shutdown hook manager, here is the screenshot of the log:\n![image](https://github.com/apache/spark/assets/62563545/ea338db3-646c-432c-bf16-1f445adc2ad9)\n\nWe can see from the screenshot, the ShutdownHookManager is initialized before the ApplicationManager create a new ugi.\n\n**Reason**\n\nThe main cause is that ShutdownHook thread is created before we create the ugi in ApplicationMaster.\n\nWhen we set the config key \"hadoop.security.credential.provider.path\", the ApplicationMaster will try to get a filesystem when generating SSLOptions, and when initialize the filesystem during which it will generate a new thread whose ugi is inherited from the current process (yarn).\nAfter this, it will generate a new ugi (SPARK_USER) in ApplicationMaster and execute the doAs() function.\n\nHere is the chain of the call:\nApplicationMaster.(ApplicationMaster.scala:83) -> org.apache.spark.SecurityManager.(SecurityManager.scala:98) -> org.apache.spark.SSLOptions$.parse(SSLOptions.scala:188) -> org.apache.hadoop.conf.Configuration.getPassword(Configuration.java:2353) -> org.apache.hadoop.conf.Configuration.getPasswordFromCredentialProviders(Configuration.java:2434) -> org.apache.hadoop.security.alias.CredentialProviderFactory.getProviders(CredentialProviderFactory.java:82)\n\n### Does this PR introduce _any_ user-facing change?\n\nNo\n\n### How was this patch tested?\n\nI didn't add new UnitTest for this, but I rebuild the package, and runs a program in my cluster, and turns out that the user when I delete the staging file turns to be the same with the SPARK_USER.\n\nCloses #42405 from liangyu-1/SPARK-44581.\n\nAuthored-by: 余良 \nSigned-off-by: Kent Yao \n(cherry picked from commit e584ed4ad96a0f0573455511d7be0e9b2afbeb96)\nSigned-off-by: Kent Yao ","shortMessageHtmlLink":"[SPARK-44581][YARN] Fix the bug that ShutdownHookManager gets wrong U…"}}],"hasNextPage":false,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"cursor":"Y3Vyc29yOnYyOpK7MjAyMy0wOC0xNVQxODozMToyMS4wMDAwMDBazwAAAANrS8Rg","startCursor":"Y3Vyc29yOnYyOpK7MjAyMy0wOC0xNVQxODozMToyMS4wMDAwMDBazwAAAANrS8Rg","endCursor":"Y3Vyc29yOnYyOpK7MjAyMy0wOC0xNVQxODozMToyMS4wMDAwMDBazwAAAANrSdKp"}},"title":"Activity · riven-blade/spark"}