update readMe #5

Merged
merged 39 commits · Jan 2, 2024
Changes from 1 commit
update readMe
wuchengwei committed Dec 19, 2023
commit 1d042ea5a4ec093feffa70501cd48c193b667c81
2 changes: 2 additions & 0 deletions README.md
@@ -192,6 +192,8 @@ When using UDF, you should consider performance and optimization. Some functions
For complex logic or functions that require a lot of memory, further optimization and consideration may be required. UDFs are designed for simple logic and data processing; for more complex calculations, you may need to use Spark's native operators for processing.
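
For a concrete sense of the trade-off, here is a minimal sketch (not from this repository; the session, DataFrame, and column names are assumptions for the example) comparing a native operator with an equivalent Python UDF:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("native-vs-udf").getOrCreate()
df = spark.createDataFrame([("apple",), ("banana",)], ["text"])

# Native operator: runs inside the JVM, no Python serialization round-trip.
df.withColumn("len_native", F.length("text")).show()

# Python UDF: every row is shipped to a Python worker and back, so it is
# usually slower; reserve UDFs for logic Spark has no built-in for.
len_udf = F.udf(lambda s: len(s), IntegerType())
df.withColumn("len_udf", len_udf("text")).show()
```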

The deduplication module provides a Spark UDF rewrite of an ordinary Python function (one that checks whether a string is a substring of any other string), which makes it easy to use Spark's distributed capabilities. For details, compare `stringMatching.py` and `udf_spark_stringMatching.py`.
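
A minimal sketch of that rewrite pattern follows; the function name, sample data, and broadcast variable are illustrative assumptions, not the repository's actual code (see the two files above for that):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def is_substring(s, candidates):
    """Plain Python: True if s is a proper substring of any other string."""
    return any(s != c and s in c for c in candidates)

spark = SparkSession.builder.appName("dedup-udf-sketch").getOrCreate()

texts = ["apple pie", "apple", "banana"]
# Broadcast the full string list so every executor can compare locally.
bc = spark.sparkContext.broadcast(texts)

# Wrap the plain function as a Spark UDF so it runs on distributed rows.
is_substring_udf = udf(lambda s: is_substring(s, bc.value), BooleanType())

df = spark.createDataFrame([(t,) for t in texts], ["text"])
df.withColumn("is_dup", is_substring_udf("text")).show()
# "apple" is flagged because it is a substring of "apple pie".
```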

Simply converting the Python function into a Spark task will not work without a Spark cluster. A detailed, step-by-step guide to building a cluster is provided, which is convenient for novice users.

See [Spark cluster building](flagdata/deduplication/README.md) for an example.
2 changes: 2 additions & 0 deletions README_zh.md
@@ -192,6 +192,8 @@ Integration of individual Spark capabilities:
For complex logic or functions that require a lot of memory, further optimization and consideration may be needed. UDFs are designed for simple logic and data processing; for more complex calculations, you may need to use Spark's native operators for processing.

The deduplication module provides a Spark UDF rewrite of an ordinary Python function (one that checks whether a string is a substring of any other string), which makes it easy to use Spark's distributed capabilities. For details, see the comparison of `stringMatching.py` and `udf_spark_stringMatching.py`.

Simply converting the Python function into a Spark task will not work without a Spark cluster. A detailed, beginner-friendly guide to setting up a cluster is provided for novice users. See [Spark cluster building](flagdata/deduplication/README_zh.md) for an example.

### 2.4 Data Analysis Stage