Compatibility Issue with Chinese Text in Document Parsing #3267
Conversation
feat(unstructured/partition/docx.py): add language detection and apply it to text type classification

- Added a `languages` attribute to the `Document` base class. A document needs an attribute like this to express its current language, because language issues come up in methods throughout the document classes; a common language list with a default value is necessary in those scenarios, and the attribute also partially satisfies domain-driven design.
- Added a `languages` option to `DocxPartitionerOptions` to specify the list of languages to use for text type classification.
- Modified `_DocxPartitioner.detect_text_type()` to use the specified languages, or to detect them automatically when "auto" is specified.
- This allows the partitioner to classify text elements more accurately based on language, improving overall partitioning quality.
- For HTML and MD (MD reuses the HTML partition method), the `languages` field is passed through the entire construction chain until it is finally used in the `is_possible_narrative_text` and `is_possible_title` functions. Previously, although these two functions supported language-specific judgments, the `languages` parameter was not passed correctly, so that capability was never enabled. This update enables it.
- **BREAKING CHANGE**: The `DocxPartitionerOptions` constructor and some other partition functions now take a new `languages` parameter. Strictly speaking this is a breaking change for existing code; however, since most parameters have default values, it is largely a warning rather than a hard break. The docx and md test cases have been re-run and pass, and simple test cases for the new feature have been added to verify the functionality.
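For context, here is a minimal sketch of the intended "auto" resolution described above. `resolve_languages` and `detect_languages_stub` are hypothetical names for illustration, not the actual unstructured API:

```python
from typing import Optional

def detect_languages_stub(text: str) -> list[str]:
    # Stand-in for a real language detector; flags CJK-range characters as Chinese.
    return ["zho"] if any("\u4e00" <= ch <= "\u9fff" for ch in text) else ["eng"]

def resolve_languages(text: str, languages: Optional[list[str]]) -> list[str]:
    # Treat None and [""] as "unspecified" and fall back to auto-detection,
    # mirroring the behavior discussed in this PR.
    if languages is None or languages == [""]:
        languages = ["auto"]
    if "auto" in languages and text:
        return detect_languages_stub(text) or []
    return languages
```

With a resolution step like this, `is_possible_narrative_text` and `is_possible_title` receive concrete language codes instead of an unset default.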
Ran the full test suite; the pass rate remains essentially consistent with the main branch.
… "DocxPartitionerOptions" to collapse into keyword arguments (kwargs). 2. Change "capitalizable_languages" to "non_capitalizable_languages" in the function "is_possible_narrative_text".
…ure/zh_adaptation # Conflicts: # CHANGELOG.md
This commit resolves an issue where the method 'is_possible_narrative_text' would incorrectly return 'True' for an empty list of languages. It now returns 'False' in that situation.
…ure/zh_adaptation
```
# Conflicts:
#	test_unstructured/documents/test_html.py
#	unstructured/documents/base.py
#	unstructured/documents/html.py
#	unstructured/documents/xml.py
#	unstructured/partition/epub.py
#	unstructured/partition/html.py
```
…ure/zh_adaptation
Sorry for the delayed response. I've recently had some changes at work and have just formed a team dedicated to LLM-related development. This caused some delays, and I couldn't address the earlier issues in time. I just finished testing on my Mac, and the results are consistent with the main branch (possibly due to some OCR packages or other issues, even the main branch can't fully pass the tests on my Mac). So, please review it again. @MthwRobinson FYI
…e code formatting

- Update CHANGELOG.md to include the compatibility-issue fix for Chinese text in document parsing.
- Reformat import statements in test_odt.py for better readability.
- Adjust import order in html.py to adhere to PEP 8 guidelines.
- Add a `languages` parameter to text-processing functions in pdf.py and text.py for improved language handling.
- Reformat long lines to improve code readability and maintain consistency.

Co-authored-by: Your Name <[email protected]>
@JIAQIA - could you resolve the merge conflicts on the HTML docs? Once those are resolved we can get the tests running in CI.
# Conflicts: # unstructured/documents/html.py # unstructured/partition/html.py
@MthwRobinson OK, merged main into this branch now.
CI is running on #3339
```diff
@@ -257,6 +257,7 @@ def element_from_text(
     text: str,
     coordinates: Optional[tuple[tuple[float, float], ...]] = None,
     coordinate_system: Optional[CoordinateSystem] = None,
+    languages: Optional[list[str]] = None,
```
This kwarg was in some partitioning functions but not all of them before, correct? @scanny would you rather go with `**kwargs` here?
I think this is fine for the time being. This argument is omissible, like `coordinates` and `coordinate_system` are, so if the caller doesn't have it they can just leave it out of the call. But I think all the partitioners have this parameter explicitly defined because it's used in the `apply_lang_metadata()` call, which I believe is global to all partitioners.
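For illustration, a hedged sketch of the two signature styles being weighed here; the function names are hypothetical:

```python
from typing import Optional

def element_from_text_explicit(
    text: str,
    languages: Optional[list[str]] = None,
) -> None:
    # Explicit optional parameter: discoverable in the signature, type-checked,
    # and omissible by callers that don't have a value for it.
    ...

def element_from_text_kwargs(text: str, **kwargs) -> None:
    # **kwargs alternative: adding options needs no signature change,
    # but the parameter is invisible to IDEs and type checkers.
    languages: Optional[list[str]] = kwargs.get("languages")
    ...
```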
Looks good overall, left a couple of inline comments and otherwise just waiting on CI.
Looks like a couple of linting issues (see this job). You can run `make tidy` to fix them.
…cOS, gsed (installed via brew) replaces the default sed. The script includes platform checks to use gsed on MacOS and sed on Linux. Additionally, awk is used for version extraction. Preliminary tests indicate the script works correctly on both Linux and MacOS.
OK, I ran `make check` and `make tidy`:

```
(unstructured) ➜  unstructured git:(feature/zh_adaptation) make tidy
ruff . --fix-only || true
warning: `ruff <path>` is deprecated. Use `ruff check <path>` instead.
autoflake --in-place .
black --line-length=100 .
All done! ✨ 🍰 ✨
640 files left unchanged.
(unstructured) ➜  unstructured git:(feature/zh_adaptation) make check
ruff check .
All checks passed!
black . --check --line-length=100
All done! ✨ 🍰 ✨
640 files would be left unchanged.
flake8 .
scripts/version-sync.sh -c \
	-f "unstructured/__version__.py" semver
From github.com:JIAQIA/unstructured
 * branch            main       -> FETCH_HEAD
Checking sed version...
Detected sed version: 4.9
version sync would make no changes to unstructured/__version__.py.
flake8 --per-file-ignores "" ./unstructured/ingest
```

Note that I improved the `make check` script to support lint checks on MacOS. On MacOS, gsed (installed via brew) replaces the default sed. The script includes platform checks to use gsed on MacOS and sed on Linux. Additionally, awk is used for version extraction. Preliminary tests indicate the script works correctly on both Linux and MacOS.
Added logic in the `test_weaviate_schema_is_valid` test function to check the existing Weaviate schema. If the class to be created already exists, the creation step is skipped and a corresponding message is printed to avoid creating a duplicate class.
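A hedged sketch of the guard described in that commit, assuming the weaviate-client v3-style `schema.get()` / `schema.create_class()` calls (the actual test wiring may differ):

```python
def ensure_class(client, class_obj: dict) -> None:
    # Skip creation if a class with the same name already exists in the schema.
    existing = {c["class"] for c in client.schema.get().get("classes", [])}
    if class_obj["class"] in existing:
        print(f"Class {class_obj['class']!r} already exists; skipping creation.")
        return
    client.schema.create_class(class_obj)
```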
@MthwRobinson May I ask if there are any other issues that I need to address? I noticed some errors in the CI/CD process, such as the chipper model not being found. However, I'm not familiar with that part of the code. If there are any errors in the CI/CD process that I need to resolve, we can discuss them here. Additionally, if there are other requirements that are not yet met for merging, please bring them up. This feature is very important to our team. Thank you.
```
# Conflicts:
#	CHANGELOG.md
#	examples/pgvector/pgvector.ipynb
#	examples/training/0-Core Concepts.ipynb
#	examples/training/1-Intro to Bricks.ipynb
#	examples/training/2-File Exploration.ipynb
#	examples/weaviate/weaviate.ipynb
#	test_unstructured/partition/test_auto.py
#	unstructured/documents/html.py
```
@JIAQIA, you'll need to update the CHANGELOG and bump the version.
@Coniferish OK, I've bumped to dev10 and added a CHANGELOG description.
…zh_adaptation # Conflicts: # CHANGELOG.md # unstructured/__version__.py
@JIAQIA, it looks like you need to merge with main and bump this again, since main has moved on to a new version.
# Conflicts: # CHANGELOG.md
@Coniferish I've updated the main branch to version 0.15.4 and bumped it to 0.15.5-dev0. However, I encountered an error in the CI/CD pipeline related to test_chipper. It seems like the test is trying to find a model, but I'm not sure how to fix it. Any ideas?
Hey @JIAQIA, it seems we don't have a good process in place to resolve this for contributors, so I will create a duplicate PR with your changes as a branch from our repo so it has the appropriate keys.
@Coniferish Thank you very much. Also, it seems that I couldn't find a guide for contributing code. During my recent code contribution, I encountered many issues that could have been clarified by documentation, such as upgrading the version number, or the earlier question about the use of … I'm not entirely sure if such a guide exists and I just missed it, but if possible, we could maintain a "Contribution Guide" document to help everyone improve their code-contribution efficiency.
@JIAQIA Thanks for the feedback. We have this contributing doc in our community repo, but I'm just seeing that repo has been archived. Adding that information to this repo would be useful and I'll create a ticket for that. Please let me know if you can think of any other things that would be useful to add and I'll make sure they get added. I've opened an issue here if you want to add any additional suggestions there.
```python
if languages is None or languages == [""]:
    languages = ["auto"]
if isinstance(languages, list) and "auto" in languages and text:
    languages = detect_languages(text) or []
```
@JIAQIA, this section needs to change. `detect_languages` should only be called when `detect_language_per_element=True`, and with this change it is being called for any potential NarrativeText element. `apply_lang_metadata` is the entry point all partitioners use for language detection, and it determines whether language detection is done at all. How we handle language detection admittedly needs to be refactored, but detecting the language of every element would slow down the processing of all documents, which we can't do.

If we just need to modify the condition for calling `exceeds_cap_ratio` to take into account languages that don't use capitalization, we can modify this to just check the `languages` parameter, and you can set that when partitioning documents in other languages. Would that work for you?
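A hedged sketch of that alternative; the set of ISO 639-3 codes is an assumption, and the condition mirrors the one quoted later in this thread:

```python
# Assumed set of codes for languages that don't use capitalization.
NON_CAPITALIZABLE_LANGUAGES = {"zho", "jpn", "kor", "tha"}

def should_check_cap_ratio(languages: list[str]) -> bool:
    # Run the capitalization heuristic only when none of the declared
    # languages is in the non-capitalizable set.
    return NON_CAPITALIZABLE_LANGUAGES.isdisjoint(languages)
```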
@Coniferish The issue of `detect_languages` being called repeatedly is indeed a problem, and I've also noticed that it's called again during `apply_lang_metadata`. However, as you mentioned, language detection is an area that needs deep refactoring, and undertaking such a refactor is beyond what I can handle within a reasonable timeframe. Therefore, I chose the following solution, based on these considerations:

1. If the user explicitly specifies the language as "eng", the repeated call to `detect_languages` is skipped. `detect_languages` is only triggered when the user sets the language to "auto" or "" (an empty string). It's worth noting that using "" to carry a specific meaning is hard to understand and process, and this should also be refactored.

2. To ensure that existing users are not affected, `detect_languages` is not called repeatedly when the language is set to "eng". To enable the new capability, I specify `languages=["auto"]` when using Unstructured. It's important to note that when `languages=["auto"]` is specified, repeated calls to `detect_languages` will indeed occur.

3. As I mentioned when I first proposed this solution, it will result in repeated calls to `detect_languages`. I believe the way to address this is to pass `detect_language_per_element` through the partition process. Additionally, we need to consider the different behaviors when `languages=["auto"]` and `detect_language_per_element=False` are configured, which should be laid out as a Cartesian product of the two settings to explain the program's behavior under each combination. If language detection has already been completed for an element in the first half of the partition process, that information should be stored in the element's metadata, and `apply_lang_metadata` should be optimized to skip detection when the element's metadata already contains language information (see the sketch after this comment).

However, I am currently very limited on time, so I am unable to complete step 3. It would require extensive testing and coverage, and I believe changes of that extent would significantly increase the difficulty of merging.

Therefore, with my changes, the capability is only activated when the user specifies that the document language is a non-English language like Chinese, or "auto". If the user specifies "eng" or does not specify a language, my changes have no effect.

Additionally, I hope the Unstructured team understands that metadata at the Element level is extremely important to us. If this metadata is incorrect, it is almost unusable in most scenarios. We therefore hope this initial small step can be merged first, so that non-English-speaking users can benefit, before we discuss deep optimization.

To reiterate: this change ensures that when the user sets the language to "eng", the number of `detect_languages` calls stays the same. The call count only changes when the language belongs to the "languages without capitalization" list or is set to "auto" (which, sadly, is the default), in which case there may be repeated calls (depending on the `detect_language_per_element` setting).

P.S. Regarding your suggestion to modify the condition in the `exceeds_cap_ratio` function to account for languages that do not use capitalization: wouldn't this also require calling `detect_languages`? My opinion is that the root of this issue is that the `languages` parameter is not passed correctly during the partition process, and that the call timing and the logic for storing `detect_languages` results are not ideal. So, based on my current understanding, modifying `exceeds_cap_ratio` might also introduce potential issues down the line.
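A minimal sketch of the caching idea in point 3, with simplified stand-in types (the real element metadata lives in unstructured; this skip logic is only illustrative):

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional

@dataclass
class Metadata:
    languages: Optional[list[str]] = None

@dataclass
class Element:
    text: str
    metadata: Metadata = field(default_factory=Metadata)

def apply_lang_metadata_sketch(
    elements: Iterable[Element],
    detect: Callable[[str], list[str]],
    detect_language_per_element: bool = False,
) -> list[Element]:
    out = []
    for el in elements:
        # Skip detection when language info was already stored earlier
        # in the partition process.
        if not el.metadata.languages and detect_language_per_element:
            el.metadata.languages = detect(el.text)
        out.append(el)
    return out
```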
Hi @JIAQIA, I'm not proposing that you refactor the language behavior. That is something we would take care of ourselves, so please don't worry about it.

To your other points:

> If the user explicitly specifies the language as "eng", the repeated call to detect_languages is skipped. detect_languages is only triggered when the user sets the language to "auto" or "" (an empty string). It's worth noting that using "" to carry a specific meaning is hard to understand and process, and this should also be refactored.

I agree that using an empty string is bad practice and something that needs to be changed. To clarify, though, it is being used to skip language detection by partitioners that depend on other partitioners (see this comment).

> To ensure that existing users are not affected, detect_languages is not called repeatedly when the language is set to "eng". To enable the new capability, I specify languages=["auto"] when using Unstructured. It's important to note that when languages=["auto"] is specified, repeated calls to detect_languages will indeed occur.

The default for the `languages` parameter of all the document-specific partitioners is `["auto"]`. Your changes would require users' behavior to change by having them pass `languages=["eng"]` if they do not want to detect the language per element.

When `languages=["auto"]`, partitioning should not automatically detect language per element. That is the main point I am raising. There is a separate parameter, `detect_language_per_element`, for controlling whether language is detected per element, and it should be the only parameter with that responsibility. The `languages` parameter is provided for users to declare what language the document is in, mostly for the purposes of OCR and metadata, but it could also be used to detect the language before the element type is determined. I thought the main issue you were trying to fix with this PR was the element type used for languages that do not use capitalization (reducing the number of `Title` and `UncategorizedText` elements that were detected and correctly categorizing them as `NarrativeText`). Please let me know if I'm mistaken about that.

I'm proposing that we keep your changes that make `languages` and `detect_language_per_element` accessible through the partition process. We could then use those parameters to more accurately determine whether `exceeds_cap_ratio` needs to be called (which seemed to be the main change in determining the element type in this PR). Instead of relying on the default `languages=["auto"]`, I'm saying that users partitioning non-English-language documents could define the languages param (e.g. `languages=["zho"]`) and that could be used in `is_possible_narrative_text`. Or, users could leave the `languages` parameter alone and set `detect_language_per_element=True`, and that could trigger `detect_languages` from within `is_possible_narrative_text`.

> As I mentioned when I first proposed this solution, it will result in repeated calls to detect_languages. I believe the way to address this is to pass detect_language_per_element through the partition process. Additionally, we need to consider the different behaviors when languages=["auto"] and detect_language_per_element=False are configured, which should be laid out as a Cartesian product of the two settings. If language detection has already been completed for an element in the first half of the partition process, that information should be stored in the element's metadata, and apply_lang_metadata should be optimized to skip detection when the element's metadata already contains language information.
Here is a table describing the current behavior. It sounds like we are on the same page about language detection only happening once and skipping detection if language data is already in the element's metadata :)
| `detect_language_per_element` | `languages` | behavior |
| --- | --- | --- |
| `True` | `["auto"]` (default) | Language is detected per element |
| `False` (default) | `["auto"]` (default) | Language is detected based on the text of the whole document |
| `True` | `["zho"]` | Language is detected per element (`languages` is ignored)** |
| `False` (default) | `["zho"]` | Language is not detected and "zho" is used |
**This behavior should probably change so that `["zho"]` would be used and language detection is skipped.
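A hedged usage sketch of the combinations in the table above; `partition_text` and both parameters exist in unstructured, but the behavior comments simply restate the table rather than verified output:

```python
from unstructured.partition.text import partition_text

doc = "这是一个中文段落。\nThis is an English paragraph."

# Row 2 of the table (both defaults): language detected from the whole document.
elements = partition_text(text=doc)

# Row 1: language detected per element.
elements = partition_text(text=doc, detect_language_per_element=True)

# Row 4: no detection; "zho" is applied to every element's metadata.
elements = partition_text(text=doc, languages=["zho"])
```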
On your other comments:

> P.S. Regarding your suggestion to modify the condition in the exceeds_cap_ratio function to account for languages that do not use capitalization: wouldn't this also require calling detect_languages?

Yes, I'm proposing that we call it here, but only if `detect_language_per_element=True`.

> My opinion is that the root of this issue is that the languages parameter is not passed correctly during the partition process, and that the call timing and the logic for storing detect_languages results are not ideal. So, based on my current understanding, modifying exceeds_cap_ratio might also introduce potential issues down the line.
I agree with this and think your changes making `languages` (and `detect_language_per_element`) available throughout the partition process are valuable. Your PR currently modifies the logic for how `exceeds_cap_ratio` is used to determine whether a text segment is `NarrativeText`: `if non_capitalizable_languages.isdisjoint(set(languages)) and exceeds_cap_ratio(...)`. I'm not saying we should change `exceeds_cap_ratio` (for now at least) or this condition. Your changes make it so `NarrativeText` is returned more often for non-English languages (which is good). I'm saying we should keep that logic, but should pass `detect_language_per_element` down to determine whether `detect_languages` should be called, which would help with recognizing whether the document language is in `non_capitalizable_languages`. Additionally, `languages` should be used here (for when users define `languages=["zho"]`, for example).
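A hedged sketch of that combined plumbing. The real `is_possible_narrative_text` lives in `unstructured.partition.text_type`; this simplified version only illustrates the proposed decision order, with stand-in heuristics injected as parameters:

```python
from typing import Callable

NON_CAPITALIZABLE_LANGUAGES = {"zho", "jpn", "kor", "tha"}  # assumed set

def is_possible_narrative_text_sketch(
    text: str,
    languages: list[str],
    detect_language_per_element: bool = False,
    detect: Callable[[str], list[str]] = lambda t: ["eng"],
    exceeds_cap_ratio: Callable[[str], bool] = lambda t: False,
) -> bool:
    if not text.strip():
        return False
    # Per-element detection only when the caller explicitly opted in.
    if detect_language_per_element:
        languages = detect(text) or languages
    # The PR's condition: apply the capitalization heuristic only for
    # languages that use capitalization.
    if NON_CAPITALIZABLE_LANGUAGES.isdisjoint(set(languages)) and exceeds_cap_ratio(text):
        return False
    return True
```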
Finally, I also want to note that we really appreciate your input on this. We are a team of primarily native English speakers, so hearing how we can better support other languages really helps other non-English users as well. Thank you!! I hope we can continue to get your feedback so we can create a better tool for you and others.

You have already contributed a lot. I'm more than happy to pick up this PR and make these changes if what I am proposing sounds good to you.
FYI: #3096
Thanks for your review.