update readMe #5

Merged
merged 39 commits · Jan 2, 2024
Changes from 1 commit
update readMe
wuchengwei committed Dec 19, 2023
commit 1d042ea5a4ec093feffa70501cd48c193b667c81
2 changes: 2 additions & 0 deletions README.md
@@ -192,6 +192,8 @@ When using UDF, you should consider performance and optimization. Some functions
For complex logic or functions that require a lot of memory, further optimization and consideration may be required. UDFs are designed for simple logic and data processing; for more complex calculations, you may need to use Spark's native operators for processing.
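
For a concrete sense of the trade-off, here is a minimal sketch (not from this repository; the session, DataFrame, and column names are assumptions for the example) comparing a native operator with an equivalent Python UDF:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("native-vs-udf").getOrCreate()
df = spark.createDataFrame([("apple",), ("banana",)], ["text"])

# Native operator: runs inside the JVM, no Python serialization round-trip.
df.withColumn("len_native", F.length("text")).show()

# Python UDF: every row is shipped to a Python worker and back, so it is
# usually slower; reserve UDFs for logic Spark has no built-in for.
len_udf = F.udf(lambda s: len(s), IntegerType())
df.withColumn("len_udf", len_udf("text")).show()
```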

The deduplication module provides a Spark UDF rewrite of an ordinary Python function (one that checks whether a string is a substring of any other string), which makes it easy to use Spark's distributed capabilities. For details, compare `stringMatching.py` and `udf_spark_stringMatching.py`.
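
A minimal sketch of that rewrite pattern follows; the function name, sample data, and broadcast variable are illustrative assumptions, not the repository's actual code (see the two files above for that):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def is_substring(s, candidates):
    """Plain Python: True if s is a proper substring of any other string."""
    return any(s != c and s in c for c in candidates)

spark = SparkSession.builder.appName("dedup-udf-sketch").getOrCreate()

texts = ["apple pie", "apple", "banana"]
# Broadcast the full string list so every executor can compare locally.
bc = spark.sparkContext.broadcast(texts)

# Wrap the plain function as a Spark UDF so it runs on distributed rows.
is_substring_udf = udf(lambda s: is_substring(s, bc.value), BooleanType())

df = spark.createDataFrame([(t,) for t in texts], ["text"])
df.withColumn("is_dup", is_substring_udf("text")).show()
# "apple" is flagged because it is a substring of "apple pie".
```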

Simply converting the Python function into a Spark task will not work without a Spark cluster. A detailed, step-by-step guide to building a cluster is provided, which is convenient for novice users.

See [Spark cluster building](flagdata/deduplication/README.md) for an example.
2 changes: 2 additions & 0 deletions README_zh.md
@@ -192,6 +192,8 @@ Integration of individual Spark capabilities:
For complex logic or functions that require a lot of memory, further optimization and consideration may be needed. UDFs are designed for simple logic and data processing; for more complex calculations, you may need to use Spark's native operators for processing.

The deduplication module provides a Spark UDF rewrite of an ordinary Python function (one that checks whether a string is a substring of any other string), which makes it easy to use Spark's distributed capabilities. For details, see the comparison of `stringMatching.py` and `udf_spark_stringMatching.py`.

Simply converting the Python function into a Spark task will not work without a Spark cluster. A detailed, beginner-friendly guide to setting up a cluster is provided for novice users. See [Spark cluster building](flagdata/deduplication/README_zh.md) for an example.

### 2.4 Data Analysis Stage