From 120fac3186d72a1714334e70617930a51e07d623 Mon Sep 17 00:00:00 2001
From: JQQ
Date: Thu, 23 May 2024 19:46:13 +0800
Subject: [PATCH 01/20] ### feat(unstructured/partition/docx.py): Add language detection and apply to text type classification
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Added a `languages` attribute to the `Document` base class. A document needs a way to express its language, because language-dependent behavior comes up in various methods across the document classes. A shared list of languages serves as the default value for those methods, and the attribute also partially meets the requirements of domain-driven design.
- Added a `languages` option to `DocxPartitionerOptions` to specify the list of languages used for text type classification.
- Modified `_DocxPartitioner.detect_text_type()` to use the specified languages, or to detect the languages automatically when "auto" is specified.
- This allows the partitioner to classify text elements more accurately based on language, improving the overall partitioning quality.
- For HTML and MD (MD uses the HTML partition method), the `languages` field is now passed through the entire construction chain until it reaches the `is_possible_narrative_text` and `is_possible_title` functions. These two functions already supported language-specific judgments, but the `languages` parameter was not passed to them correctly, so the capability was never enabled. This update enables it.
- **BREAKING CHANGE**: The `DocxPartitionerOptions` constructor and some other partition functions now take a new `languages` parameter, which may break existing code. Because most parameters have default values, this is not entirely a breaking change; treat it as a warning. The docx and md test cases have been re-run and pass, and simple test cases for the new feature have been added to verify that it works correctly.
---
 example-docs/zho_md_partition.md         |   25 +
 examples/training/0-Core Concepts.ipynb  | 1411 +---------------------
 test_unstructured/partition/test_docx.py |  255 ++--
 test_unstructured/partition/test_md.py   |   49 +
 unstructured/documents/base.py           |    3 +-
 unstructured/documents/html.py           |   31 +-
 unstructured/documents/xml.py            |   10 +-
 unstructured/partition/docx.py           |   26 +-
 unstructured/partition/html.py           |    6 +-
 unstructured/partition/lang.py           |    7 +-
 unstructured/partition/text_type.py      |   10 +-
 11 files changed, 335 insertions(+), 1498 deletions(-)
 create mode 100644 example-docs/zho_md_partition.md

diff --git a/example-docs/zho_md_partition.md b/example-docs/zho_md_partition.md
new file mode 100644
index 0000000000..800ba56956
--- /dev/null
+++ b/example-docs/zho_md_partition.md
@@ -0,0 +1,25 @@
+## 春节放假通知
+
+## Spring Festival Holiday Notice
+
+庆祝春节假期。
+
+春节放假从大年 30 开始
+
+Celebrate the Spring Festival holiday. 
Holiday time: 2021年2月6日至2021年3月8日,共计放假一个月。比法定假期长三周。 + +## 标题 2 + +### 标题 3 + +## Another Title 2 + +正文开始。 + +- 一组1 + +- 一组2 + +- 一组3 + +正文结束。 diff --git a/examples/training/0-Core Concepts.ipynb b/examples/training/0-Core Concepts.ipynb index b65e43da63..188706bd10 100644 --- a/examples/training/0-Core Concepts.ipynb +++ b/examples/training/0-Core Concepts.ipynb @@ -19,14 +19,14 @@ "execution_count": 1, "id": "a326d600", "metadata": {}, - "outputs": [], "source": [ "import os\n", "import pathlib\n", "\n", "DIRECTORY = os.path.abspath(\"\")\n", "EXAMPLE_DOCS_DIRECTORY = os.path.join(DIRECTORY, \"..\", \"..\", \"example-docs\")" - ] + ], + "outputs": [] }, { "cell_type": "markdown", @@ -43,56 +43,23 @@ "execution_count": 2, "id": "015a9385", "metadata": {}, - "outputs": [], "source": [ "from unstructured.partition.auto import partition\n", "\n", "filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, \"layout-parser-paper-fast.pdf\")\n", "elements = partition(filename=filename)" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 3, "id": "a4e7a5bc", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ]" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ "elements" - ] + ], + "outputs": [] }, { "cell_type": "markdown", @@ -107,54 +74,21 @@ "execution_count": 4, "id": "dd54b5b0", "metadata": {}, - "outputs": [], "source": [ "with open(filename, \"rb\") as f:\n", " elements = partition(file=f)" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 5, "id": "97a7274b", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " 
,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ,\n", - " ]" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ "elements" - ] + ], + "outputs": [] }, { "cell_type": "markdown", @@ -195,29 +129,6 @@ "execution_count": 6, "id": "76a2e17a", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "UNITED STATES\n", - "\n", - "\n", - "SECURITIES AND EXCHANGE COMMISSION\n", - "\n", - "\n", - "Washington, D.C. 20549\n", - "\n", - "\n", - "FORM 10-K\n", - "\n", - "\n", - "ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\n", - "\n", - "\n" - ] - } - ], "source": [ "filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, \"example-10k.html\")\n", "elements = partition(filename=filename)\n", @@ -225,7 +136,8 @@ "for element in elements[:5]:\n", " print(element)\n", " print(\"\\n\")" - ] + ], + "outputs": [] }, { "cell_type": "markdown", @@ -240,23 +152,6 @@ "execution_count": 7, "id": "96c11b32", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Indicate by check mark whether the registrant has filed a report on and attestation to its management’s assessment of the effectiveness of its internal control over financial reporting under Section 404(b) of the Sarbanes-Oxley Act (15 U.S.C. 7262(b)) by the registered public accounting firm that prepared or issued its audit report.  ☐\n", - "\n", - "\n", - "This report contains statements that do not relate to historical or current facts but are “forward-looking” statements. These statements relate to analyses and other information based on forecasts of future results and estimates of amounts not yet determinable. These statements may also relate to future events or trends, our future prospects and proposed new products, services, developments or business strategies, among other things. 
These statements can generally (although not always) be identified by their use of terms and phrases such as anticipate, appear, believe, could, would, estimate, expect, indicate, intent, may, plan, predict, project, pursue, will continue and other similar terms and phrases, as well as the use of the future tense.\n", - "\n", - "\n", - "Actual results could differ materially from those expressed or implied in our forward-looking statements. Our future financial condition and results of operations, as well as any forward-looking statements, are subject to change and to inherent known and unknown risks and uncertainties. You should not assume at any point in the future that the forward-looking statements in this report are still valid. We do not intend, and undertake no obligation, to update our forward-looking statements to reflect future events or circumstances.\n", - "\n", - "\n" - ] - } - ], "source": [ "from unstructured.documents.elements import NarrativeText\n", "from unstructured.partition.text_type import sentence_count\n", @@ -265,7 +160,8 @@ " if isinstance(element, NarrativeText) and sentence_count(element.text) > 2:\n", " print(element)\n", " print(\"\\n\")" - ] + ], + "outputs": [] }, { "cell_type": "markdown", @@ -282,1291 +178,20 @@ "execution_count": 8, "id": "45d5b5f4", "metadata": {}, - "outputs": [], "source": [ "from unstructured.staging.base import convert_to_dict" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 9, "id": "5d13fc38", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[{'text': 'UNITED STATES', 'type': 'Title'},\n", - " {'text': 'SECURITIES AND EXCHANGE COMMISSION', 'type': 'Title'},\n", - " {'text': 'Washington, D.C. 
20549', 'type': 'Title'},\n", - " {'text': 'FORM 10-K', 'type': 'Title'},\n", - " {'text': 'ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934',\n", - " 'type': 'Uncategorized'},\n", - " {'text': 'For the fiscal year ended\\xa0December\\xa031, 2021',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934',\n", - " 'type': 'Uncategorized'},\n", - " {'text': 'For the transition period from\\xa0\\xa0\\xa0\\xa0\\xa0\\xa0\\xa0to',\n", - " 'type': 'Title'},\n", - " {'text': 'Commission file number:\\xa0000-30653', 'type': 'Title'},\n", - " {'text': 'Galaxy Gaming, Inc.', 'type': 'Title'},\n", - " {'text': '(Exact name of small business issuer as specified in its charter)',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Nevada', 'type': 'Title'},\n", - " {'text': '20-8143439', 'type': 'Uncategorized'},\n", - " {'text': '(State or other jurisdiction of incorporation or organization)',\n", - " 'type': 'Title'},\n", - " {'text': '(IRS Employer Identification No.)', 'type': 'Title'},\n", - " {'text': '6480 Cameron Street Ste. 
305 – Las Vegas, NV 89118',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '(Address of principal executive offices)', 'type': 'Title'},\n", - " {'text': '(702) 939-3254', 'type': 'Uncategorized'},\n", - " {'text': '(Registrant’s telephone number)', 'type': 'Title'},\n", - " {'text': 'Securities registered under Section\\xa012(b) of the Act:',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Title of each class', 'type': 'Title'},\n", - " {'text': 'Trading symbol', 'type': 'Title'},\n", - " {'text': 'Name of exchange on which registered', 'type': 'NarrativeText'},\n", - " {'text': 'Common stock', 'type': 'Title'},\n", - " {'text': 'GLXZ', 'type': 'Uncategorized'},\n", - " {'text': 'OTCQB marketplace', 'type': 'Title'},\n", - " {'text': 'Indicate by check mark if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act.\\xa0\\xa0\\xa0\\xa0Yes\\xa0\\xa0☐\\xa0\\xa0\\xa0\\xa0No\\xa0\\xa0☑',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Indicate by check mark if the registrant is not required to file reports pursuant to Section\\xa013 or Section\\xa015(d) of the Act. 
\\xa0\\xa0\\xa0\\xa0Yes\\xa0\\xa0☐\\xa0\\xa0\\xa0\\xa0No\\xa0\\xa0☑',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Indicate by checkmark whether the registrant (1)\\xa0has filed all reports required to be filed by Section\\xa013 or 15(d) of the Securities Exchange Act of 1934 during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2)\\xa0has been subject to such filing requirements for the past 90 days.\\xa0\\xa0\\xa0\\xa0Yes\\xa0\\xa0☑\\xa0\\xa0\\xa0\\xa0No\\xa0\\xa0☐',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Indicate by check mark whether the issuer has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T (§232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit such files). \\xa0\\xa0\\xa0\\xa0Yes\\xa0\\xa0☑\\xa0\\xa0\\xa0\\xa0No\\xa0\\xa0☐',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company. 
See the definitions of “large accelerated filer,” “accelerated filer,” “smaller reporting company,” and “emerging growth company” in Rule 12b-2 of the Exchange Act.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '☐ Accelerated filer', 'type': 'Title'},\n", - " {'text': '☐ Smaller reporting company', 'type': 'Title'},\n", - " {'text': 'Emerging growth Company', 'type': 'Title'},\n", - " {'text': 'If an emerging growth company, indicate by check mark if the registrant has elected not to use the extended transition period for complying with any new or revised financial accounting standard provided pursuant to Section 13(a) of the Exchange Act.\\xa0\\xa0☐',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Indicate by check mark whether the registrant has filed a report on and attestation to its management’s assessment of the effectiveness of its internal control over financial reporting under Section 404(b) of the Sarbanes-Oxley Act (15 U.S.C. 7262(b)) by the registered public accounting firm that prepared or issued its audit report.\\xa0\\xa0☐',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Indicate by check mark whether the registrant is a shell company (as defined in Rule 12b-2 of the Exchange Act).\\xa0\\xa0\\xa0\\xa0Yes\\xa0\\xa0☐\\xa0\\xa0\\xa0\\xa0No\\xa0\\xa0☑',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'State the aggregate market value of the voting and non-voting common equity held by non-affiliates computed by reference to the price at which the common equity was last sold, or the average bid and asked price of such common equity, as of the last business day of the registrant’s second fiscal quarter. 
$70,923,698.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Indicate the number of shares outstanding of each of the registrant’s classes of common stock, as of the latest practicable date: 23,718,968 common shares as of March 28, 2022.',\n", - " 'type': 'Uncategorized'},\n", - " {'text': 'GALAXY GAMING, INC.', 'type': 'Title'},\n", - " {'text': 'ANNUAL REPORT ON FORM 10-K FOR THE YEAR ENDED DECEMBER 31, 2021',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'TABLE OF CONTENTS', 'type': 'Title'},\n", - " {'text': 'PART I', 'type': 'Title'},\n", - " {'text': 'Item 1.', 'type': 'Title'},\n", - " {'text': 'Business', 'type': 'Title'},\n", - " {'text': 'Item 1A.', 'type': 'Title'},\n", - " {'text': 'Risk Factors', 'type': 'Title'},\n", - " {'text': 'Item 1B.', 'type': 'Title'},\n", - " {'text': 'Unresolved Staff Comments', 'type': 'Title'},\n", - " {'text': 'Item 2.', 'type': 'Title'},\n", - " {'text': 'Properties', 'type': 'Title'},\n", - " {'text': 'Item 3.', 'type': 'Title'},\n", - " {'text': 'Legal Proceedings', 'type': 'Title'},\n", - " {'text': 'Item 4.', 'type': 'Title'},\n", - " {'text': 'Mine Safety Disclosures', 'type': 'Title'},\n", - " {'text': 'PART II', 'type': 'Title'},\n", - " {'text': 'Item 5.', 'type': 'Title'},\n", - " {'text': 'Market for Registrant’s Common Equity and Related Stockholder Matters',\n", - " 'type': 'Title'},\n", - " {'text': '10', 'type': 'Uncategorized'},\n", - " {'text': 'Item 7.', 'type': 'Title'},\n", - " {'text': 'Management’s Discussion and Analysis of Financial Condition and Results of Operations',\n", - " 'type': 'Title'},\n", - " {'text': '12', 'type': 'Uncategorized'},\n", - " {'text': 'Item 7A.', 'type': 'Title'},\n", - " {'text': 'Quantitative and Qualitative Disclosures about Market Risk',\n", - " 'type': 'Title'},\n", - " {'text': '14', 'type': 'Uncategorized'},\n", - " {'text': 'Item 8.', 'type': 'Title'},\n", - " {'text': 'Financial Statements and Supplementary Financial Information',\n", - " 'type': 
'Title'},\n", - " {'text': '15', 'type': 'Uncategorized'},\n", - " {'text': 'Item 9.', 'type': 'Title'},\n", - " {'text': 'Changes in and Disagreements with Accountants on Accounting and Financial Disclosure',\n", - " 'type': 'Title'},\n", - " {'text': '35', 'type': 'Uncategorized'},\n", - " {'text': 'Item\\xa09A.', 'type': 'Title'},\n", - " {'text': 'Controls and Procedures', 'type': 'Title'},\n", - " {'text': '35', 'type': 'Uncategorized'},\n", - " {'text': 'Item 9B.', 'type': 'Title'},\n", - " {'text': 'Other Information', 'type': 'Title'},\n", - " {'text': '35', 'type': 'Uncategorized'},\n", - " {'text': 'PART III', 'type': 'Title'},\n", - " {'text': 'Item 10.', 'type': 'Title'},\n", - " {'text': 'Directors, Executive Officers and Corporate Governance',\n", - " 'type': 'Title'},\n", - " {'text': '36', 'type': 'Uncategorized'},\n", - " {'text': 'Item 11.', 'type': 'Title'},\n", - " {'text': 'Executive Compensation', 'type': 'Title'},\n", - " {'text': '39', 'type': 'Uncategorized'},\n", - " {'text': 'Item 12.', 'type': 'Title'},\n", - " {'text': 'Security Ownership of Certain Beneficial Owners and Management, and Related Stockholder Matters',\n", - " 'type': 'Title'},\n", - " {'text': '41', 'type': 'Uncategorized'},\n", - " {'text': 'Item 13.', 'type': 'Title'},\n", - " {'text': 'Certain Relationships and Related Transactions, and Director Independence',\n", - " 'type': 'Title'},\n", - " {'text': '41', 'type': 'Uncategorized'},\n", - " {'text': 'Item 14.', 'type': 'Title'},\n", - " {'text': 'Principal Accounting Fees and Services', 'type': 'Title'},\n", - " {'text': '41', 'type': 'Uncategorized'},\n", - " {'text': 'PART IV', 'type': 'Title'},\n", - " {'text': 'Item 15.', 'type': 'Title'},\n", - " {'text': 'Exhibits and Financial Statement Schedules', 'type': 'Title'},\n", - " {'text': '42', 'type': 'Uncategorized'},\n", - " {'text': 'SPECIAL NOTE REGARDING FORWARD-LOOKING STATEMENTS',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'This report contains 
statements that do not relate to historical or current facts but are “forward-looking” statements. These statements relate to analyses and other information based on forecasts of future results and estimates of amounts not yet determinable. These statements may also relate to future events or trends, our future prospects and proposed new products, services, developments or business strategies, among other things. These statements can generally (although not always) be identified by their use of terms and phrases such as anticipate, appear, believe, could, would, estimate, expect, indicate, intent, may, plan, predict, project, pursue, will continue and other similar terms and phrases, as well as the use of the future tense.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Actual results could differ materially from those expressed or implied in our forward-looking statements. Our future financial condition and results of operations, as well as any forward-looking statements, are subject to change and to inherent known and unknown risks and uncertainties. You should not assume at any point in the future that the forward-looking statements in this report are still valid. 
We do not intend, and undertake no obligation, to update our forward-looking statements to reflect future events or circumstances.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'PART I', 'type': 'Title'},\n", - " {'text': 'ITEM\\xa01.\\xa0BUSINESS', 'type': 'Title'},\n", - " {'text': 'BUSINESS', 'type': 'Title'},\n", - " {'text': 'Unless the context indicates otherwise, references to “Galaxy Gaming, Inc.,” “we,” “us,” “our,” or the “Company,” refer to Galaxy Gaming, Inc., a Nevada corporation (“Galaxy Gaming”).',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'We are an established global gaming company specializing in the design, development, acquisition, assembly, marketing and licensing of proprietary casino table games and associated technology, platforms and systems for the casino gaming industry. Casinos use our proprietary products and services to enhance their gaming operations and improve their profitability, productivity and security, as well as to offer popular cutting-edge gaming entertainment content and technology to their players. We market our products and services to online casinos worldwide and to land-based casino gaming companies in North America, the Caribbean, Central America, the United Kingdom, Europe and Africa and to cruise ship companies. We license our products and services for use solely in legalized gaming markets. We also license our content and distribute content from other companies to iGaming operators throughout the world.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Products and Services', 'type': 'Title'},\n", - " {'text': 'Proprietary Table Games. Casinos use Proprietary Table Games together with or in lieu of other games in the public domain (e.g. Blackjack, Craps, Roulette, etc.) because of their popularity with players and to increase profitability. 
Typically, Proprietary Table Games are grouped into two product types referred to as “Side Bets” and “Premium Games.” Side Bets are proprietary features and wagering options typically added to public domain games such as baccarat, pai gow poker, craps and blackjack table games. Examples of our Side Bets include 21+3®, Lucky Ladies® and Bonus Craps™. Premium Games are unique stand-alone games with their own set of rules and strategies. Examples of our Premium Games include Heads Up Hold ’em®, High Card Flush®, Cajun Stud® and Three Card Poker®. Generally, Premium Games generate higher revenue per table placement than the Side Bet games.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Enhanced Table Systems. Enhanced Table Systems are electronic enhancements used on casino table games to add to player appeal and to enhance game security. An example in this category is our Bonus Jackpot System (“BJS”), an advanced electronic system installed on gaming tables designed to collect data by detecting player wagers and other game activities. This information is processed and used to improve casino operations by evaluating game play, to improve dealer efficiency and to reward players through the offering of jackpots and other bonusing mechanisms. Typically, the BJS system includes an electronic video display, known as TableVision, which shows game information designed to generate player interest and to promote various aspects of the game. The BJS system can also be used to network numerous gaming tables together into a common system either within a casino or through the interconnection of multiple casinos, which we refer to as our Inter-Casino Link System. In 2022, we plan to introduce a new table system called Triton™. Triton is designed to be a platform on which we can build a suite of enhanced table game features and services, the first of which will be the progressive jackpot wagers currently provided by BJS. 
Triton is built using off-the-shelf electronic components and software in order to minimize field service issues.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'iGaming. On August 21, 2020, we completed the acquisition of 100% of the member interests in Progressive Games Partners, LLC (“PGP”). PGP holds the exclusive worldwide rights to a number of games titles (including ours) for relicensing to operators of online gaming systems principally in Europe, the United Kingdom, and, more recently, the United States. Prior to the acquisition, PGP had been the exclusive distributor of our games to the online gaming sector; by making the acquisition of PGP, we effectively eliminated the distributor fee that PGP charged us, and we now also receive the revenue PGP earns on the content of other licensors (to whom we pay a royalty fee). In many cases, these online operators provide “white label” gaming infrastructure for many separate online casino brands with the result that the content that PGP licenses can appear on hundreds of online gaming sites. PGP’s contracts with online operators prohibit those operators from deploying the content in markets where it is not legal to do so.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Product Strategy. In the physical casino market, we have a “three-dimensional” growth strategy. First, we seek to increase the number of casinos we serve with our games. Second, within a casino, we seek to increase the number of tables on which we have placements. Our current product placements are concentrated around blackjack, and we have developed side bets and other game content to address other table game categories such as baccarat, roulette and craps. Finally, by adding our enhanced systems to tables that already have our content, we can increase the billable units per table. 
For example, on a blackjack table that has one of our side bets we can add a second side bet and a progressive jackpot for each side bet thereby increasing the billable units for that table from one to four. As of December 31, 2021, we served 515 casinos worldwide, had content on 4,500 tables in those casinos and had a total of 6,709 billable units in those casinos.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Our strategy in iGaming is similar in that it seeks to have our content on as many online tables as possible. However, the structure of the iGaming business is different in that many of our customers are iGaming platform providers that offer a turnkey online gaming solution to online operators who deploy those online offerings directly to the gaming player. To a lesser extent, we license our content to online operators who have their own platform and serve gaming customers directly. The online analog to a casino is called a “skin”',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'where a skin is a separately branded and marketed URL. Online operators often offer multiple skins targeting different markets and using different themes. Our strategy is 1) to have our content on as many skins as possible and 2) to have as many of our games as possible on each skin. 
As of December 31, 2021, we had content on',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'over', 'type': 'Title'},\n", - " {'text': '1,000', 'type': 'Uncategorized'},\n", - " {'text': 'skins worldwide and', 'type': 'NarrativeText'},\n", - " {'text': 'approximately four to six', 'type': 'Title'},\n", - " {'text': 'game placements on', 'type': 'Title'},\n", - " {'text': 'each of', 'type': 'Title'},\n", - " {'text': 'those skins.', 'type': 'Title'},\n", - " {'text': 'Finally, we expect that additional states in the U',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'will legalize online gaming, allowing our online clients to offer games to a significantly bigger audience.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Recurring Revenue and Gross Profit', 'type': 'Title'},\n", - " {'text': 'A majority of our clients contract with us to use our products and services on a month-to-month basis with typically a 30–45 day termination notice requirement. We invoice our clients monthly, either in advance for unlimited use or in arrears for actual use, depending on the product or contract terms. Such recurring revenues accounted for substantially all of our total revenues in 2021 and 2020. Our license revenues have few direct costs thereby generating high gross profit margins. We do not report “gross profit” in our statements of operations included in this report. Instead, gross profit would be comparable to “revenues” minus “cost of ancillary products and assembled components,” both of which are presented in our statements of operations.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'For more information about our revenues, operating income and assets, see “Item 7. Management’s Discussion and Analysis of Financial Condition and Results of Operations” and “Item 8. 
Financial Statements and Supplementary Financial Information” included in this report.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'STRATEGY', 'type': 'Title'},\n", - " {'text': 'Our long-term business strategy focuses on increasing our value to casino clients by offering them enhanced services and support, and by producing innovative products and game play methodologies that their players enjoy. We believe that by increasing the value of our products and services to clients, we can continue to build our recurring revenues in both existing and new markets. To achieve this objective, we employ the following strategies:',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '•\\n\\nIncrease our per unit revenues by leveraging our Enhanced Table Systems;',\n", - " 'type': 'ListItem'},\n", - " {'text': 'Expand our portfolio of services, products and technologies;',\n", - " 'type': 'ListItem'},\n", - " {'text': '•\\n\\nExpand the number of markets we serve;', 'type': 'ListItem'},\n", - " {'text': 'Increase our per unit revenues by leveraging our Enhanced Table Systems;',\n", - " 'type': 'ListItem'},\n", - " {'text': '•\\n\\nGrow our iGaming content and partner base; and',\n", - " 'type': 'ListItem'},\n", - " {'text': 'Expand the number of markets we serve;', 'type': 'ListItem'},\n", - " {'text': '•\\n\\nPromote the use of our game content in adjacent gaming markets.',\n", - " 'type': 'ListItem'},\n", - " {'text': 'Grow our iGaming content and partner base; and',\n", - " 'type': 'ListItem'},\n", - " {'text': 'Expand our portfolio of services, products and technologies. Our strategy is to be an important vendor to casino operators by offering a complete and comprehensive portfolio of services, games, products, systems, technologies and methodologies for casino table games. We continuously develop and/or seek to acquire new proprietary table games to complement our existing offerings and to extend our penetration of proprietary table games on the casino floor. 
We believe we have a significant opportunity to replicate the success we have had with blackjack side bets by developing content for the other significant public domain casino games of baccarat, roulette and craps.',\n", - " 'type': 'ListItem'},\n", - " {'text': 'Promote the use of our game content in adjacent gaming markets.',\n", - " 'type': 'ListItem'},\n", - " {'text': 'Expand our portfolio of services, products and technologies. Our strategy is to be an important vendor to casino operators by offering a complete and comprehensive portfolio of services, games, products, systems, technologies and methodologies for casino table games. We continuously develop and/or seek to acquire new proprietary table games to complement our existing offerings and to extend our penetration of proprietary table games on the casino floor. We believe we have a significant opportunity to replicate the success we have had with blackjack side bets by developing content for the other significant public domain casino games of baccarat, roulette and craps.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Increase our revenue per unit by leveraging our Enhanced Table Systems. Our Enhanced Table Systems are placed on tables where we already have our side bet or premium game content deployed. By adding our Enhanced Table Systems, we significantly increase the revenue we earn from that table. Gaming operators deploy the Enhanced Table Systems because they generally increase the win for the casino by an amount that significantly exceeds the cost to license the system from us.\\xa0Our product strategy includes making Electronic Table Systems that support a multitude of side bets and premium games across several casino game segments (e.g., blackjack, craps, roulette, baccarat, etc.).',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Expand the number of markets we serve. 
In the past, there were table games markets in North America that we could not serve or in which we could not offer our full suite of products and services. In general, this was because we were not licensed to serve casinos in that market or the license we have limits the products and services we can provide. We believe that the redemption transaction we undertook in 2019 (discussed below in the “Significant Business Developments” section) has helped us with our licensing activities in existing and new markets, and will continue to help us, including table games markets outside of the United States. Since the redemption transaction, we have received new or expanded licenses in 21 jurisdictions in North America.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Grow our iGaming content and partner base. We have licensed our content to the iGaming segment for several years through our distributor, PGP.\\xa0In 2020, we acquired PGP in order to improve our financial results from the iGaming segment by eliminating the distribution fee to PGP and by adding the revenue that PGP earns from licensing the content owned by itself and others.\\xa0The COVID pandemic has resulted in a significant increase in jurisdictions considering legalizing iGaming, in many cases in concert with legalizing sports wagering. We intend to increase our revenues from iGaming in several ways.\\xa0First, we expect that our existing licensees will see growth in their current markets while adding new markets in the U.S. and elsewhere.\\xa0Second, we intend to add new licensees in the iGaming segment.\\xa0And finally, we intend to add to the number of games that we license to both existing and new licensees.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Promote the use of our game content in adjacent gaming markets. We have game content that is well-known and popular in physical casinos and online casinos. 
One example is the Electronic Table Games (“ETG”) market, which offers table game content on touch-',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'screen video devices. As casinos face rising labor costs, table games can become unprofitable at low bet minimums',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'and', 'type': 'Title'},\n", - " {'text': 'we believe', 'type': 'NarrativeText'},\n", - " {'text': 'casinos', 'type': 'Title'},\n", - " {'text': 'may', 'type': 'Title'},\n", - " {'text': 'seek to expand the use of ETGs to address this shortfall. Another example',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'is', 'type': 'NarrativeText'},\n", - " {'text': 'lotteries (both ticket lotteries and', 'type': 'Title'},\n", - " {'text': 'iLotteries', 'type': 'Uncategorized'},\n", - " {'text': '), where our well-known game content may attract patrons to lotteries as another way to enjoy it. There may be regulatory restrictions on the use of casino gaming content in certain lottery markets, but the addressable market is large even excluding these markets.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'COMPETITION', 'type': 'Title'},\n", - " {'text': 'We compete with several companies that develop and provide proprietary table games, electronic gaming platforms, game enhancements and related services. We believe that the principal competitive factors in our market include products and services that appeal to casinos and players, jurisdictional approvals and a well-developed sales and distribution network.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'We believe that our success will depend upon our ability to remain competitive in our field.\\xa0Competition can be based on price, brand recognition, player appeal and the strength of underlying intellectual property and superior customer service. 
Larger competitors may have longer operating histories, greater brand recognition, more firmly established supply relationships, superior capital resources, distribution and product inventory than we do. Smaller competitors may be more able to participate in developing and marketing table games, compared to other gaming products, because of the lower cost and complexity associated with the development of these products and a generally less stringent regulatory environment. We compete with others in efforts to obtain or create innovative products, obtain financing, acquire other gaming companies, and license and distribute products. We compete on these bases, as well as on the strength of our sales, service and distribution channels.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Our competitors include, but are not limited to, Scientific Games Corporation; Play AGS, Inc.; TCS/John Huxley; and Masque Publishing. Most of these competitors are larger than we are, have more financial resources than we do, and have more business segments than we do. In addition, we expect additional competitors to emerge in the future. There can be no assurances that we will be able to compete effectively in the future and failure to compete successfully in the market could have a material adverse effect on our business.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'SUPPLIERS', 'type': 'Title'},\n", - " {'text': 'We own outright the content for most of our Side Bets and Premium Games and therefore do not depend on suppliers for the majority of our revenues from these games. However, there are some games that we have licensed from others and to whom we pay royalty fees when we license those games to others (including in the online gaming sector). We generally have multi-year licensing agreements for this content. 
With respect to our Enhanced Table Systems, we obtain most of the parts for our products from third-party suppliers, including both off-the-shelf items as well as components manufactured to our specifications. We also assemble a small number of parts in-house that are used both for product assembly and for servicing existing products. We generally perform warehousing, quality control, final assembly and shipping functions from our facilities in Las Vegas, Nevada, although small inventories are maintained, and repairs are performed by our field service employees. We believe that our sources of supply for components and raw materials are adequate and that alternative sources of materials are available.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'In our iGaming business, we license some of our game content from other providers for re-licensing to online operators along with the content we own outright. We pay royalties to the owners of the content that we license from them.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'RESEARCH AND DEVELOPMENT', 'type': 'Title'},\n", - " {'text': 'We seek to develop and maintain a robust pipeline of new products and services to bring to market. We employ a staff of hardware and software engineers, graphic artists and game developers at our corporate offices to support, improve and upgrade our products and to develop and explore other potential table game products, technologies, methodologies and services. We also will use outside services for research and development from time to time.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'INTELLECTUAL PROPERTY', 'type': 'Title'},\n", - " {'text': 'Our products and the intellectual property associated with them are typically protected by patents, trademarks, copyrights and non-compete agreements. However, there can be no assurance that the steps we have taken to protect our intellectual property will be sufficient. 
Further, in the United States certain court rulings may make it difficult to enforce patents around the math relating to casino games, which makes us more dependent on copyrights and trademarks for protection. In addition, the laws of some foreign countries do not protect intellectual property to the same extent as the laws of the United States, which could increase the likelihood of infringement. Furthermore, other companies could develop similar or superior products without violating our intellectual property rights. If we resort to legal proceedings to enforce our intellectual property rights, the proceedings could be burdensome, disruptive and expensive, and distract the attention of management, and there can be no assurance that we would prevail.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'We have been subject to litigation claiming that we have infringed the rights of others and/or that certain of our patents and other intellectual property are invalid or unenforceable. We have also brought actions against others to protect our rights.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'GOVERNMENT REGULATION', 'type': 'Title'},\n", - " {'text': 'We are subject to regulation by governmental authorities in most jurisdictions in which we offer our products. The development and distribution of casino games, gaming equipment, systems technology and related services, as well as the operation of casinos, are all subject to regulation by a variety of federal, state, international, tribal, and local agencies with the majority of oversight provided by individual state gaming control boards. 
While the regulatory requirements vary by jurisdiction, most require:',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '•\\n\\nDocumentation of qualification, including evidence of financial stability;',\n", - " 'type': 'ListItem'},\n", - " {'text': 'Findings of suitability for the Company, individual officers, directors, key employees and major shareholders;',\n", - " 'type': 'ListItem'},\n", - " {'text': '•\\n\\nSpecific product approvals for games and gaming equipment; and',\n", - " 'type': 'ListItem'},\n", - " {'text': 'Documentation of qualification, including evidence of financial stability;',\n", - " 'type': 'ListItem'},\n", - " {'text': '•\\n\\nLicenses, registrations and/or permits.', 'type': 'ListItem'},\n", - " {'text': 'Specific product approvals for games and gaming equipment; and',\n", - " 'type': 'ListItem'},\n", - " {'text': 'Gaming regulatory requirements vary from jurisdiction to jurisdiction, and obtaining licenses, registrations, findings of suitability for our officers, directors, and principal stockholders and other required approvals with respect to us, our personnel and our products are time consuming and expensive. Generally, gaming regulatory authorities have broad discretionary powers and may deny applications for or revoke approvals on any basis they deem reasonable. We have approvals that enable us to conduct our business in numerous jurisdictions, subject in each case to the conditions of the particular approvals. These conditions may include limitations as to the type of game or product we may sell or lease, as well as limitations on the type of facility, such as riverboats, and the territory within which we may operate, such as tribal nations. Gaming laws and regulations serve to protect the public interest and ensure gambling related activity is conducted honestly, competitively and free of corruption. Regulatory oversight additionally ensures that the local authorities receive the appropriate amount of gaming tax revenues. 
As such, our financial systems and reporting functions must demonstrate high levels of detail and integrity.',\n", - " 'type': 'ListItem'},\n", - " {'text': 'Licenses, registrations and/or permits.', 'type': 'ListItem'},\n", - " {'text': 'Gaming regulatory requirements vary from jurisdiction to jurisdiction, and obtaining licenses, registrations, findings of suitability for our officers, directors, and principal stockholders and other required approvals with respect to us, our personnel and our products are time consuming and expensive. Generally, gaming regulatory authorities have broad discretionary powers and may deny applications for or revoke approvals on any basis they deem reasonable. We have approvals that enable us to conduct our business in numerous jurisdictions, subject in each case to the conditions of the particular approvals. These conditions may include limitations as to the type of game or product we may sell or lease, as well as limitations on the type of facility, such as riverboats, and the territory within which we may operate, such as tribal nations. Gaming laws and regulations serve to protect the public interest and ensure gambling related activity is conducted honestly, competitively and free of corruption. Regulatory oversight additionally ensures that the local authorities receive the appropriate amount of gaming tax revenues. As such, our financial systems and reporting functions must demonstrate high levels of detail and integrity.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'We also have authorizations with certain Native American tribes throughout the United States that have compacts with the states in which their tribal dominions are located or operate or propose to operate casinos. These tribes generally require suppliers of gaming and gaming-related equipment to obtain authorizations. 
Gaming on Native American lands within the United States is governed by the Federal Indian Gaming Regulatory Act of 1988 (“IGRA”) and specific tribal ordinances and regulations. Class\\xa0III gaming (table games and slot machines, for example), as defined under IGRA, also requires a Tribal-State Compact, which is a written agreement between a specific tribe and the respective state. This compact authorizes the type of Class\\xa0III gaming activity and the standards, procedures and controls under which the Class\\xa0III gaming activity must be conducted.\\xa0The National Indian Gaming Commission (“NIGC”) has oversight authority over gaming on Native American lands and generally monitors tribal gaming, including the establishment and enforcement of required minimum internal control standards.\\xa0Each tribe is sovereign and must have a tribal gaming commission or office established to regulate tribal gaming activity to ensure compliance with IGRA, NIGC, and its Tribal-State Compact.\\xa0We have complied with each of the numerous vendor licensing, specific product approvals and shipping notification requirements imposed by Tribal-State Compacts and enforced by tribal and/or state gaming agencies under IGRA in the Native American lands in which we do business.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'The nature of the industry and our worldwide operations make the license application process very time consuming and require extensive resources. We engage legal resources familiar with local customs in certain jurisdictions to assist in keeping us compliant with applicable regulations worldwide. 
Through this process, we seek to assure both regulators and investors that all our operations maintain the highest levels of integrity and avoid any appearance of impropriety.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'We have obtained or applied for all required government licenses, permits, registrations, findings of suitability and approvals necessary to develop and distribute gaming products in all jurisdictions where we directly operate. Although many regulations at each level are similar or overlapping, we must satisfy all conditions individually for each jurisdiction. Additionally, in certain jurisdictions we license our products through distributors authorized to do business in those jurisdictions.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'In addition to what may be required of our officers, board members, key employees and substantial interest holders, any of our stakeholders, including but not limited to investors, may be subject to regulatory requests and suitability findings. Failure to comply with regulatory requirements or obtaining a finding of unsuitability by a regulatory body could result in a substantial or total loss of investment.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'In the future, we intend to seek the necessary registrations, licenses, approvals, and findings of suitability for us, our products, and our personnel in other jurisdictions throughout the world. However, we may be unable to obtain such necessary items, or if such items are obtained, may be revoked, suspended, or conditioned. In addition, we may be unable to obtain on a timely basis, or to obtain at all, the necessary approvals of our future products as they are developed, even in those jurisdictions in which we already have existing products licensed or approved. 
If the necessary registrations are not sought after or the required approvals not received, we may be prohibited from selling our products in that jurisdiction or may be required to sell our products through other licensed entities at a reduced profit.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'EMPLOYEES', 'type': 'Title'},\n", - " {'text': 'We have 36 full-time employees, including executive officers, management personnel, accounting personnel, office staff, sales staff, service technicians and research and development personnel. As needed, we also employ part-time and temporary employees and pay for the services of independent contractors.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Significant 2020 and 2021 Business Developments', 'type': 'Title'},\n", - " {'text': \"Share Redemption. On May 6, 2019, we redeemed all 23,271,667 shares of our common stock held by Triangulum Partners, LLC (“Triangulum”), an entity controlled by Robert B. Saucier (“Saucier”), Galaxy Gaming's founder, and, prior to the redemption, the holder of a majority of our outstanding common stock. Our Articles of Incorporation (the “Articles”) provide that if certain events occur in relation to a stockholder that is required to undergo a gaming suitability review or similar investigative process, we have the option to purchase all or any part of such stockholder’s shares at a price per share that is equal to the average closing share price over the thirty calendar days preceding the purchase. The average closing share price over the thirty calendar days preceding the redemption was $1.68 per share.\",\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'The consideration owed to Triangulum for the redemption is $39,096,401 (the “Redemption Consideration Obligation”). 
The litigation between the Company and Triangulum related to the redemption and other matters was settled pursuant to a settlement agreement by a payment from the Company of $39,507,717 to Triangulum on November 15, 2021. See Note 10 and Note 11 to our audited consolidated financial statements included in Item 8 “Financial Statements and Supplementary Financial Information” for further details.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Credit Agreement Amendments. See Note 10 to our audited consolidated financial statements included in Item 8 “Financial Statements and Supplementary Financial Information” for further details of amendments made to the Company’s credit agreement.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Fortress Credit Agreement. See Note 10 to our audited consolidated financial statements included in Item 8 “Financial Statements and Supplementary Financial Information” for further details of the entry into the Fortress Credit Agreement.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Membership Interest Purchase Agreement. On February 25, 2020, Galaxy Gaming entered into a Membership Interest Purchase Agreement, dated February 25, 2020 (the “Purchase Agreement”), between the Company and the membership interest holders of PGP.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'On August 21, 2020, the Company entered into a First Amendment to the Purchase Agreement between the Company and the membership interest holders of PGP. The First Amendment, among other things, fixed the cash portion of the purchase price at $6.425 million and established that the stock portion would be satisfied through the issuance of 3,141,361 shares of the Company’s common stock with a value of $1.27 per share on the date of the acquisition.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'On August 21, 2020, the Company completed the acquisition of 100% of the member interests in PGP. 
The entirety of the purchase price ($10,414,528) has been allocated to customer relationships and is included in Other intangible assets, net, on the Company’s balance sheet. See Note 7 to our audited financial statements included in Item 8 “Financial Statements and Supplementary Financial Information” for further details. The Company also acquired certain receivables and payables in the net amount of $581,885, which was to be remitted to the sellers of PGP as the receivables and payables were settled. The remaining balance owed to the sellers at December 31, 2020 was paid on May 7, 2021.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'COVID-19. On March 11, 2020, the World Health Organization declared a pandemic related to the COVID-19 outbreak, which led to a global health emergency.\\xa0The public-health impact of the outbreak continues to remain largely unknown and still evolving. The related health crisis could continue to adversely affect the global economy, resulting in continued economic downturn that could impact demand for our products.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'On March 17, 2020, the Company announced that it suspended billing to customers who had closed their doors due to the COVID-19 outbreak. As a result, we did not earn revenue for the use of our games by our physical casino customers during the time that they were closed. In general, the online gaming customers who license our games through our distributor remained and continue to remain in operation in spite of the COVID-19 crisis. 
We earned revenue from them during the crisis and expect to continue to do so, but potentially at levels that may be lower than we previously received.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'As of the date of this filing, virtually all land-based casinos have re-opened, although operations have not returned to pre-COVID-19 levels.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'We also rely on third-party suppliers and manufacturers in China, many of whom were shut down or severely cut back production during the initial COVID-19 shutdown. Although this did not have a material effect on our supply chain, any future disruption of our suppliers and their contract manufacturers may impact our sales and operating results going forward.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Because of the uncertainties of COVID-19, the Company drew on its Revolving Loan in the amount of $1,000,000 on March 12, 2020. Also, on April 17, 2020, the Company obtained an unsecured loan of $835,300 through Zions Bancorporation, N.A. dba Nevada State Bank under the Paycheck Protection Program (the “PPP Loan”) pursuant to the Coronavirus Aid, Relief, and Economic Security Act (the “CARES Act”) and the Paycheck Protection Program Flexibility Act (the “Flexibility Act”). On July 16, 2020, the Company filed an application and supporting documentation for forgiveness in full of the PPP Loan. On November 21, 2020, the Company received notification the PPP Loan had been forgiven in full. Pursuant to the CARES Act, the Federal Reserve created the Main Street Priority Loan Program (“MSPLP”) to provide financing for small and medium-sized businesses. On October 26, 2020, the Company borrowed $4 million from Zions Bancorporation N.A., dba Nevada State Bank under this program. 
See Note 10 to our audited consolidated financial statements included in Item 8 “Financial Statements and Supplementary Financial Information” for further details.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'The COVID-19 crisis may change the behavior of gaming patrons. Most of our clients operate places of public accommodation, and their patrons may reduce visitation and play as a precaution. Further, governmental authorities may continue to impose reduced hours of operation or limit the capacity of such places of public accommodation. A long-term reduction in play could have a material adverse impact on our results of operations. Depending on the length and severity of any such adverse impact, we may fail to comply with our obligations, including covenants in our credit agreement, and we may need to reassess the carrying value of our assets.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'ITEM\\xa01A. RISK FACTORS', 'type': 'Title'},\n", - " {'text': 'A smaller reporting company is not required to provide the information required by this Item.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'ITEM\\xa01B. UNRESOLVED STAFF COMMENTS', 'type': 'Title'},\n", - " {'text': 'A smaller reporting company is not required to provide the information required by this Item.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'ITEM\\xa02. PROPERTIES', 'type': 'Title'},\n", - " {'text': 'We do not own any real property used in the operation of our current business.\\xa0We maintain our corporate office at 6480 Cameron Street, Suite 305, Las Vegas, Nevada, where we currently occupy approximately 14,000 square feet of combined office and warehouse space. We also maintain a small warehouse and service facility in Kent, Washington and a small office in Richland, Washington. 
See Note 9 to our audited financial statements included in Item 8 “Financial Statements and Supplementary Financial Information” for further details.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'ITEM\\xa03. LEGAL PROCEEDINGS', 'type': 'Title'},\n", - " {'text': 'We have been named in and have brought lawsuits in the normal course of business. See Note 11 to our audited financial statements included in Item 8 “Financial Statements and Supplementary Financial Information” for further details.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'ITEM\\xa04. MINE SAFETY DISCLOSURES', 'type': 'Title'},\n", - " {'text': 'A smaller reporting company is not required to provide the information required by this Item.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'PART II', 'type': 'Title'},\n", - " {'text': 'ITEM\\xa05. MARKET FOR REGISTRANT’S COMMON EQUITY AND RELATED STOCKHOLDER MATTERS',\n", - " 'type': 'Title'},\n", - " {'text': 'Our common stock is quoted on the OTCQB marketplace (“OTCQB”) under the ticker symbol GLXZ.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'The following table sets forth the range of high and low closing sale prices for our common stock for each of the periods indicated as reported by the OTCQB.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '2021', 'type': 'Uncategorized'},\n", - " {'text': '2020', 'type': 'Uncategorized'},\n", - " {'text': 'Quarter Ended', 'type': 'Title'},\n", - " {'text': 'High ($)', 'type': 'Title'},\n", - " {'text': 'Low ($)', 'type': 'Title'},\n", - " {'text': 'High ($)', 'type': 'Title'},\n", - " {'text': 'Low ($)', 'type': 'Title'},\n", - " {'text': 'March 31,', 'type': 'Uncategorized'},\n", - " {'text': '3.02', 'type': 'Uncategorized'},\n", - " {'text': '1.72', 'type': 'Uncategorized'},\n", - " {'text': '1.95', 'type': 'Uncategorized'},\n", - " {'text': '0.70', 'type': 'Uncategorized'},\n", - " {'text': 'June 30,', 'type': 'Uncategorized'},\n", - " {'text': '3.70', 'type': 
'Uncategorized'},\n", - " {'text': '2.70', 'type': 'Uncategorized'},\n", - " {'text': '1.36', 'type': 'Uncategorized'},\n", - " {'text': '0.73', 'type': 'Uncategorized'},\n", - " {'text': 'September 30,', 'type': 'Uncategorized'},\n", - " {'text': '4.64', 'type': 'Uncategorized'},\n", - " {'text': '3.68', 'type': 'Uncategorized'},\n", - " {'text': '1.36', 'type': 'Uncategorized'},\n", - " {'text': '1.08', 'type': 'Uncategorized'},\n", - " {'text': 'December 31,', 'type': 'Uncategorized'},\n", - " {'text': '4.45', 'type': 'Uncategorized'},\n", - " {'text': '3.67', 'type': 'Uncategorized'},\n", - " {'text': '1.95', 'type': 'Uncategorized'},\n", - " {'text': '0.95', 'type': 'Uncategorized'},\n", - " {'text': 'The Securities and Exchange Commission (the “SEC”) has adopted rules that regulate broker-dealer practices in connection with transactions in penny stocks. Penny stocks are generally equity securities with a market price of less than $5.00, other than securities registered on certain national securities exchanges or quoted on the NASDAQ system, provided that current price and volume information with respect to transactions in such securities is provided by the exchange or system. 
The penny stock rules require a broker-dealer, prior to a transaction in a penny stock, to deliver a standardized risk disclosure document prepared by the SEC, that: (a)\\xa0contains a description of the nature and level of risk in the market for penny stocks in both public offerings and secondary trading; (b)\\xa0contains a description of the broker’s or dealer’s duties to the customer and of the rights and remedies available to the customer with respect to a violation of such duties or other requirements of the securities laws; (c)\\xa0contains a brief, clear, narrative description of a dealer market, including bid and ask prices for penny stocks and the significance of the spread between the bid and ask price; (d)\\xa0contains a toll-free telephone number for inquiries on disciplinary actions; (e)\\xa0defines significant terms in the disclosure document or in the conduct of trading in penny stocks; and (f)\\xa0contains such other information and is in such form, including language, type size and format, as the SEC shall require by rule or regulation.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'The broker-dealer also must provide, prior to effecting any transaction in a penny stock, the customer with (a)\\xa0bid and offer quotations for the penny stock; (b)\\xa0the compensation of the broker-dealer and its salesperson in the transaction; (c)\\xa0the number of shares to which such bid and ask prices apply, or other comparable information relating to the depth and liquidity of the market for such stock; and (d)\\xa0a monthly account statement showing the market value of each penny stock held in the customer’s account.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'In addition, the penny stock rules require that prior to a transaction in a penny stock not otherwise exempt from those rules, the broker-dealer must make a special written determination that the penny stock is a suitable investment for the purchaser and receive the purchaser’s written 
acknowledgment of the receipt of a risk disclosure statement, a written agreement as to transactions involving penny stocks, and a signed and dated copy of a written suitability statement.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'These disclosure requirements may have the effect of reducing the trading activity for our common stock. Therefore, stockholders may have difficulty buying or selling our securities.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'HOLDERS OF OUR COMMON STOCK', 'type': 'Title'},\n", - " {'text': 'As of March 28, 2022, we had 23,718,968 shares of our common stock issued and outstanding and 40 shareholders of record.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'DIVIDEND POLICY', 'type': 'Title'},\n", - " {'text': 'There are no restrictions in our articles of incorporation or bylaws that prevent us from declaring dividends.\\xa0The Nevada Revised Statutes, however, do prohibit us from declaring dividends where after giving effect to the distribution of the dividend:',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '•\\n\\nOur total assets would be less than the sum of our total liabilities plus the amount that would be needed to satisfy the rights of shareholders who have preferential rights superior to those receiving the distribution.',\n", - " 'type': 'ListItem'},\n", - " {'text': 'We would not be able to pay our debts as they become due in the usual course of business; or',\n", - " 'type': 'ListItem'},\n", - " {'text': 'We have not declared any dividends, and we do not plan to declare any dividends in the foreseeable future. Even though we repaid in full the borrowings we made in 2020 from the MSPLP, we are prohibited from paying dividends or making share repurchases for one year after the repayment (until November 15, 2022). We are prohibited from paying dividends while our MSPLP is outstanding and for one year thereafter. 
In addition, the Fortress Credit Agreement imposes significant restrictions on our ability to pay dividends. See Note 10 to our audited consolidated financial statements included in Item 8 “Financial Statements and Supplementary Financial Information” for further details.',\n", - " 'type': 'ListItem'},\n", - " {'text': 'Our total assets would be less than the sum of our total liabilities plus the amount that would be needed to satisfy the rights of shareholders who have preferential rights superior to those receiving the distribution.',\n", - " 'type': 'ListItem'},\n", - " {'text': 'We have not declared any dividends, and we do not plan to declare any dividends in the foreseeable future. Even though we repaid in full the borrowings we made in 2020 from the MSPLP, we are prohibited from paying dividends or making share repurchases for one year after the repayment (until November 15, 2022). We are prohibited from paying dividends while our MSPLP is outstanding and for one year thereafter. In addition, the Fortress Credit Agreement imposes significant restrictions on our ability to pay dividends. See Note 10 to our audited consolidated financial statements included in Item 8 “Financial Statements and Supplementary Financial Information” for further details.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '10', 'type': 'Uncategorized'},\n", - " {'text': 'TRANSFER AGENT', 'type': 'Title'},\n", - " {'text': 'Our stock transfer agent and registrar is Philadelphia Stock Transfer, Inc. located at 2320 Haverford Street, Ardmore, PA 19003. Their telephone number is (484) 416-3124.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '11', 'type': 'Uncategorized'},\n", - " {'text': 'ITEM\\xa07. 
MANAGEMENT’S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'The following is a discussion and analysis of our financial condition, results of operations and liquidity and capital resources as of and for the years ended December\\xa031, 2021 and 2020. This discussion should be read together with our audited consolidated financial statements and related notes included in Item\\xa08. Financial Statements and Supplementary Financial Information. Some of the information contained in this discussion includes forward-looking statements that involve risks and uncertainties; therefore our “Special Note Regarding Forward-Looking Statements” should be reviewed\\xa0for a discussion of important factors that could cause actual results to differ materially from the results described in, or implied by, such forward-looking statements.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'OVERVIEW', 'type': 'Title'},\n", - " {'text': 'We develop, acquire, assemble and market technology and entertainment-based products and services for the gaming industry for placement on casino floors and on legal internet gaming sites.\\xa0Our products and services primarily relate to licensed casino operators’ table games activities and focus on either increasing their profitability, productivity and security or expanding their gaming entertainment offerings in the form of proprietary table games, electronically enhanced table game platforms, fully-automated electronic tables and other ancillary equipment. In addition, we license intellectual property to legal internet gaming operators. 
Our products and services are offered in highly regulated markets throughout the world.\\xa0Our products are assembled at our headquarters in Las Vegas, Nevada, as well as outsourced for certain sub-assemblies in the United States.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Results of operations for the years ended December\\xa031, 2021 and 2020. For the year ended December\\xa031, 2021, we generated gross revenues of $19,984,378 compared to $10,230,316 in 2020, representing an increase of $9,754,062, or 95.3%.\\xa0This increase was directly attributable to the re-opening of a significant portion of our land-based customers after the restrictions due to the COVID-19 crisis were lifted. Also, our online gaming revenues increased significantly due primarily to the acquisition of PGP in August of 2020 as well as to the opening of new markets in the U.S.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Selling, general and administrative expenses were $10,646,524 in 2021 compared to $8,964,930 in 2020, representing an increase of $1,681,594, or 18.8%. This increase was due to the 2021 employee bonus accrual being included in the current year as compared to no bonus accrual being included in the comparable prior-year period. Also, higher expenses were incurred in the current period directly related to the opening of jurisdictions throughout 2021 as COVID-19 restrictions were lifted (sales commissions, royalty expenses and repairs and maintenance of BJS units). Lastly, higher insurance payments were incurred in the current period as compared to the comparable prior-year period related to the financed Directors & Officers (“D&O) policy. 
These increased expenses incurred were offset by a decrease in legal fees related to the Triangulum Lawsuit and a decrease in distributor fees related to the acquisition of PGP in August 2020.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Research and development expenses were $520,449 in 2021 compared to $487,679\\xa0in 2020, representing an increase of $32,770, or 6.7%. This increase was primarily due to the 2021 employee bonus accrual. Prior year did not include an employee bonus accrual.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Share-based compensation expenses were $1,532,455 in 2021 compared to $737,991 in 2020, representing an increase of $794,464, or 107.7%. This increase was due to the quarterly restricted shares granted to our Board members being issued at a higher stock price than the comparable prior-year period. The increase was also due to increased amortization related to more shares being granted in the current period than the comparable prior-year period (two employees, a contractor and an additional Board member).',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'As a result of the changes described above, income from operations was $4,345,126 in 2021 compared to a loss from operations of $(2,255,010) in 2020, an increase of $6,600,136, or 292.7%.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Total interest expense was $1,505,386 in 2021 compared to $683,357 in 2020, an increase of $822,029, or 120.3%. The increase was attributable to the Fortress Credit Agreement entered into on November 15, 2021. Loan fees related to the MSPLP, the NSB Term Loan and the Revolving Loan were written off in November 2021. Also, the Fortress Credit Agreement bears a higher interest rate than the NSB Term Loan, the Revolving Loan, the MSPLP and the Triangulum promissory note, along with higher amortization on loan fees and warrants issued. 
See Note 10 to our audited consolidated financial statements included in Item 8 “Financial Statements and Supplementary Financial Information” for further details.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Share redemption consideration was $682,469 in 2021 compared to $781,928 in 2020, a decrease of $99,459, or 12.7%. The decrease was attributable to the settlement of the Triangulum litigation on November 15, 2021. A total of $411,316 in accrued interest through November 15, 2021 was paid in connection with the settlement. See Note 11 to our audited consolidated financial statements included in Item 8 “Financial Statements and Supplementary Financial Information” for further details.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'The income tax expense was $48,637 in 2021 based on an effective rate of 2.25 percent compared to the benefit of ($605,936) in 2020 based on an effective rate of 17.42 percent. The 2.25 percent effective tax rate for 2021 differed from the statutory federal income tax rate of 21.0 percent and was primarily attributable to (i) increased tax benefit from the exercise of stock options; (ii) the increased foreign rate differential and (iii) the Company maintaining a valuation allowance against its deferred tax assets.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '12', 'type': 'Uncategorized'},\n", - " {'text': 'Adjusted', 'type': 'Title'},\n", - " {'text': 'Earnings Before Interest, Taxes, Depreciation and Amortization (“',\n", - " 'type': 'Title'},\n", - " {'text': 'EBITDA', 'type': 'Uncategorized'},\n", - " {'text': '”)', 'type': 'Uncategorized'},\n", - " {'text': 'Adjusted EBITDA includes adjustment', 'type': 'NarrativeText'},\n", - " {'text': 'to net', 'type': 'NarrativeText'},\n", - " {'text': 'income', 'type': 'Title'},\n", - " {'text': '(loss)', 'type': 'Title'},\n", - " {'text': 'to exclude interest,', 'type': 'NarrativeText'},\n", - " {'text': 'income', 'type': 'Title'},\n", - " {'text': 'taxes, 
depreciation, amortization, share based compensation,',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'foreign currency exchange', 'type': 'Title'},\n", - " {'text': 'loss', 'type': 'Title'},\n", - " {'text': 'change in fair value of', 'type': 'Title'},\n", - " {'text': 'interest rate swap liability', 'type': 'Title'},\n", - " {'text': 'and', 'type': 'Title'},\n", - " {'text': 'severance and other expenses related to litigation',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Adjusted', 'type': 'Title'},\n", - " {'text': 'EBITDA is not a measure of performance defined in accordance with',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'U.S.', 'type': 'Uncategorized'},\n", - " {'text': 'Generally Accepted Accounting Principles (“', 'type': 'Title'},\n", - " {'text': 'GAAP', 'type': 'Uncategorized'},\n", - " {'text': '”)', 'type': 'Uncategorized'},\n", - " {'text': '. However', 'type': 'Title'},\n", - " {'text': ', Adjusted EBITDA is used by management to evaluate our operating performance. Management believes that disclosure of the Adjusted EBITDA metric offers investors,',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'regulators', 'type': 'Title'},\n", - " {'text': 'and other stakeholders a view of our operations in the same manner management evaluates our performance. When combined with U.S. GAAP results, management believes Adjusted EBITDA provides a comprehensive understanding of our financial results. Adjusted EBITDA should not be considered as an alternative to net income or',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'loss', 'type': 'Title'},\n", - " {'text': 'to net cash provided by operating activities as a measure of operating results or of liquidity. It may not be comparable to similarly titled measures used by other companies, and it excludes financial information that some may consider important in evaluating our performance. A reconciliation of U.S. 
GAAP net income to Adjusted EBITDA is as follows:',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Years ended December 31,', 'type': 'NarrativeText'},\n", - " {'text': 'Adjusted EBITDA Reconciliation:', 'type': 'Title'},\n", - " {'text': '2021', 'type': 'Uncategorized'},\n", - " {'text': '2020', 'type': 'Uncategorized'},\n", - " {'text': 'Net income (loss)', 'type': 'Title'},\n", - " {'text': '2,111,812', 'type': 'Uncategorized'},\n", - " {'text': '(2,208,887', 'type': 'Uncategorized'},\n", - " {'text': 'Interest expense', 'type': 'Title'},\n", - " {'text': '1,505,386', 'type': 'Uncategorized'},\n", - " {'text': '683,357', 'type': 'Uncategorized'},\n", - " {'text': 'Share redemption consideration', 'type': 'Title'},\n", - " {'text': '682,469', 'type': 'Uncategorized'},\n", - " {'text': '781,928', 'type': 'Uncategorized'},\n", - " {'text': 'Interest income', 'type': 'Title'},\n", - " {'text': '(2,048', 'type': 'Uncategorized'},\n", - " {'text': '(25,702', 'type': 'Uncategorized'},\n", - " {'text': 'Depreciation and amortization', 'type': 'Title'},\n", - " {'text': '2,858,991', 'type': 'Uncategorized'},\n", - " {'text': '2,222,042', 'type': 'Uncategorized'},\n", - " {'text': 'Share-based compensation', 'type': 'Title'},\n", - " {'text': '1,532,455', 'type': 'Uncategorized'},\n", - " {'text': '737,991', 'type': 'Uncategorized'},\n", - " {'text': 'Foreign currency exchange loss', 'type': 'Title'},\n", - " {'text': '64,879', 'type': 'Uncategorized'},\n", - " {'text': '34,961', 'type': 'Uncategorized'},\n", - " {'text': 'Change in fair value of interest rate swap liability',\n", - " 'type': 'Title'},\n", - " {'text': '(66,009', 'type': 'Uncategorized'},\n", - " {'text': '(74,487', 'type': 'Uncategorized'},\n", - " {'text': 'Provision (benefit) for income taxes', 'type': 'Title'},\n", - " {'text': '48,637', 'type': 'Uncategorized'},\n", - " {'text': '(605,937', 'type': 'Uncategorized'},\n", - " {'text': 'Paycheck Protection Program Loan forgiveness', 'type': 
'Title'},\n", - " {'text': '(840,243', 'type': 'Uncategorized'},\n", - " {'text': 'Severance expense', 'type': 'Title'},\n", - " {'text': '12,596', 'type': 'Uncategorized'},\n", - " {'text': '20,058', 'type': 'Uncategorized'},\n", - " {'text': 'Special project expense(1)', 'type': 'Title'},\n", - " {'text': '(15,338', 'type': 'Uncategorized'},\n", - " {'text': '652,198', 'type': 'Uncategorized'},\n", - " {'text': 'Adjusted EBITDA', 'type': 'Title'},\n", - " {'text': '8,733,830', 'type': 'Uncategorized'},\n", - " {'text': '1,377,279', 'type': 'Uncategorized'},\n", - " {'text': '(1)', 'type': 'Uncategorized'},\n", - " {'text': 'Includes expenses associated with the Triangulum Lawsuit in both 2021 and 2020. There is a credit balance in 2021 due to $720,000 in D&O insurance claim payments received in 2021.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Liquidity and capital resources. We have generally been able to fund our continuing operations, our investments, and the obligations under our existing borrowings through cash flow from operations. In 2020, as a result of the COVID-19 crisis, we were required to raise funds from financing sources in order to maintain operations. In addition to our normal operations, we may make acquisitions of products, technologies or entire businesses.\\xa0Our ability to access capital for operations or for acquisitions will depend on conditions in the capital markets and investors’ perceptions of our business prospects and such conditions and perceptions may not always favor us.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'As of December\\xa031, 2021, we had total current assets of $23,890,122 and total assets of $40,452,705. This compares to $11,562,833 and $30,574,594, respectively, as of December\\xa031, 2020. 
The increase in total current assets as of December\\xa031, 2021 was primarily due to an increase in the accounts receivable balance, resulting from higher billings and lower collections directly related to the COVID-19 crisis. Also, the Company entered into the Fortress Credit Agreement on November 15, 2021, which provided $5,273,464 in cash to the Company, after settlement of the NSB Term Loan, the Revolving Loan, the MSPLP and the Triangulum promissory note. See Note 10 to our audited consolidated financial statements included in Item 8 “Financial Statements and Supplementary Financial Information” for further details. The increase in total assets as of December\\xa031, 2021 was offset by amortization on the Company’s long-term other intangible assets.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Our total current liabilities as of December\\xa031, 2021 increased to $4,401,071 from $4,247,794 as of December 31, 2020, primarily due to the Company accruing for 2021 employee bonuses and an increase in accrued royalties in our online gaming business. These increases were offset by a decrease in current portion of long-term debt which was repaid in connection with the Fortress Credit agreement. See Note 10 to our audited consolidated financial statements included in Item 8 “Financial Statements and Supplementary Financial Information” for further details.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Despite the continuing effects of the COVID-19 crisis, our business was profitable and cash-flow positive in 2021. Based on our current forecast of operations, we believe we will have sufficient liquidity to fund our operations and to meet the obligations under our financing arrangements as they come due over at least the next 12 months.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'We continue to file applications for new or enhanced licenses in several jurisdictions, which may result in significant future legal and regulatory expenses. 
A significant increase in such expenses may require us to postpone growth initiatives or investments in personnel, inventory and research and development of our products. It is our intention to continue such initiatives and investments. However, to the extent we are not able to achieve our growth objectives or raise additional capital, we will need to evaluate the reduction of operating expenses.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '13', 'type': 'Uncategorized'},\n", - " {'text': 'Our operating activities', 'type': 'Title'},\n", - " {'text': 'provided', 'type': 'NarrativeText'},\n", - " {'text': '6,003,576', 'type': 'Uncategorized'},\n", - " {'text': 'in cash', 'type': 'Title'},\n", - " {'text': 'for the year ended', 'type': 'NarrativeText'},\n", - " {'text': 'December\\xa031, 2021', 'type': 'Title'},\n", - " {'text': ', compared to', 'type': 'NarrativeText'},\n", - " {'text': 'cash', 'type': 'Title'},\n", - " {'text': 'used', 'type': 'NarrativeText'},\n", - " {'text': 'of', 'type': 'Title'},\n", - " {'text': '$1,633,132', 'type': 'Uncategorized'},\n", - " {'text': 'for the year ended', 'type': 'NarrativeText'},\n", - " {'text': 'December\\xa031, 2020', 'type': 'Title'},\n", - " {'text': 'The increase in operating cash flow was primarily due to higher net income for the period',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'as a result of', 'type': 'Title'},\n", - " {'text': 'the re-opening of a significant portion of our land-based customers after the restrictions due to the COVID-19 crisis were lifted. Also, higher depreciation and amortization and share-based compensation contributed to the higher operating cash flow. 
These increases were partially offset by changes in operating assets and liabilities such as',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'ccounts', 'type': 'Uncategorized'},\n", - " {'text': 'eceivable', 'type': 'Uncategorized'},\n", - " {'text': 'and', 'type': 'Title'},\n", - " {'text': 'ccrued', 'type': 'Uncategorized'},\n", - " {'text': 'xpenses.', 'type': 'Uncategorized'},\n", - " {'text': 'Investing activities used cash of $233,734 for the year December\\xa031, 2021, compared to cash used of $6,456,714 for the year ended December\\xa031, 2020. This decrease was primarily due to closing of the acquisition of PGP in August 2020.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Cash provided by financing activities for the year ended December\\xa031, 2021 was $4,362,293. This compares to $4,389,234 cash provided by financing activities for the comparable prior-year period. The cash inflow in the current year was due to the Fortress Credit Agreement, offset by the pay-off of the Nevada State Bank (“NSB”) Term Loan, the Revolving Loan, the MSPLP and the Triangulum promissory note.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Critical Accounting Policies. Our consolidated financial statements have been prepared in accordance with U.S. GAAP. We consider the following accounting policies to be the most important to understanding and evaluating our financial results:',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Revenue recognition.\\xa0We account for our revenue in accordance with Accounting Standards Codification Topic 606,\\xa0Revenue from Contracts with Customers. We generate revenue primarily from the licensing of our intellectual property. We recognize revenue under recurring fee license contracts monthly as we satisfy our performance obligation, which consists of granting the customer the right to use our intellectual property. 
Amounts billed are determined based on flat rates or usage rates stipulated in the customer contract.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Goodwill and other intangible assets.\\xa0Goodwill and other intangible assets are assessed for impairment at least annually\\xa0or at other times during the year if events or circumstances indicate that it is more-likely-than-not that the fair value of a reporting asset is below the carrying amount. If found to be impaired, the carrying amounts will be reduced, and an impairment loss will be recognized.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Recently issued accounting pronouncements. We do not expect the adoption of recently issued accounting pronouncements to have a significant impact on our results of operations, financial position or cash flow.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'ITEM\\xa07A. QUANTITATIVE AND QUALITATIVE DISCLOSURES ABOUT MARKET RISK',\n", - " 'type': 'Title'},\n", - " {'text': 'A smaller reporting company is not required to provide the information required by this Item.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '14', 'type': 'Uncategorized'},\n", - " {'text': 'ITEM\\xa08. 
FINANCIAL STATEMENTS AND SUPPLEMENTARY FINANCIAL INFORMATION',\n", - " 'type': 'Title'},\n", - " {'text': 'INDEX TO FINANCIAL STATEMENTS', 'type': 'Title'},\n", - " {'text': 'Report of Independent Registered Public Accounting Firm, (Moss Adams LLP, San Diego, CA, PCAOB ID: 659)',\n", - " 'type': 'Uncategorized'},\n", - " {'text': '16', 'type': 'Uncategorized'},\n", - " {'text': 'Consolidated Balance Sheets as of December\\xa031, 2021\\xa0and 2020',\n", - " 'type': 'Title'},\n", - " {'text': '17', 'type': 'Uncategorized'},\n", - " {'text': 'Consolidated Statements of Operations and Comprehensive Income for the years ended December\\xa031, 2021 and 2020',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '18', 'type': 'Uncategorized'},\n", - " {'text': 'Consolidated Statements of Changes in Stockholders’ Deficit for the years ended December\\xa031, 2021 and 2020',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '19', 'type': 'Uncategorized'},\n", - " {'text': 'Consolidated Statements of Cash Flows for the years ended December\\xa031, 2021 and 2020',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '20', 'type': 'Uncategorized'},\n", - " {'text': 'Notes to Consolidated Financial Statements', 'type': 'Title'},\n", - " {'text': '21', 'type': 'Uncategorized'},\n", - " {'text': '15', 'type': 'Uncategorized'},\n", - " {'text': 'REPORT OF INDEPENDENT REGISTERED PUBLIC ACCOUNTING FIRM',\n", - " 'type': 'Title'},\n", - " {'text': 'To the Shareholders and the Board of Directors', 'type': 'Title'},\n", - " {'text': 'Galaxy Gaming, Inc.', 'type': 'Title'},\n", - " {'text': 'Opinion on the Financial Statements', 'type': 'Title'},\n", - " {'text': 'We have audited the accompanying consolidated balance sheets of Galaxy Gaming, Inc. 
(the “Company”) as of December 31, 2021 and 2020, the related consolidated statements of operations and comprehensive income, stockholders’ deficit, and cash flows for the years then ended, and the related notes (collectively referred to as the “consolidated financial statements”). In our opinion, the consolidated financial statements present fairly, in all material respects, the consolidated financial position of the Company as of December 31, 2021 and 2020, and the consolidated results of its operations and its cash flows for the years then ended, in conformity with accounting principles generally accepted in the United States of America.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Basis for Opinion', 'type': 'Title'},\n", - " {'text': 'These consolidated financial statements are the responsibility of the Company’s management. Our responsibility is to express an opinion on the Company’s consolidated financial statements based on our audits. We are a public accounting firm registered with the Public Company Accounting Oversight Board (United States) (PCAOB) and are required to be independent with respect to the Company in accordance with the U.S. federal securities laws and the applicable rules and regulations of the Securities and Exchange Commission and the PCAOB.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'We conducted our audits in accordance with the standards of the PCAOB. Those standards require that we plan and perform the audit to obtain reasonable assurance about whether the consolidated financial statements are free of material misstatement, whether due to error or fraud. The Company is not required to have, nor were we engaged to perform, an audit of its internal control over financial reporting. As part of our audits we are required to obtain an understanding of internal control over financial reporting but not for the purpose of expressing an opinion on the effectiveness of the Company’s internal control over financial reporting. 
Accordingly, we express no such opinion.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Our audits included performing procedures to assess the risks of material misstatement of the consolidated financial statements, whether due to error or fraud, and performing procedures to respond to those risks. Such procedures included examining, on a test basis, evidence regarding the amounts and disclosures in the consolidated financial statements. Our audits also included evaluating the accounting principles used and significant estimates made by management, as well as evaluating the overall presentation of the consolidated financial statements. We believe that our audits provide a reasonable basis for our opinion.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Critical Audit Matters', 'type': 'Title'},\n", - " {'text': 'Critical audit matters are matters arising from the current period audit of the consolidated financial statements that were communicated or required to be communicated to the audit committee and that (1) relate to accounts or disclosures that are material to the financial statements and (2) involved our especially challenging, subjective, or complex judgments. 
We determined that there are no critical audit matters.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '/s/ Moss Adams LLP', 'type': 'Title'},\n", - " {'text': 'San Diego, California', 'type': 'Title'},\n", - " {'text': 'March 30, 2022', 'type': 'Uncategorized'},\n", - " {'text': 'We have served as the Company’s auditor since 2020',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '16', 'type': 'Uncategorized'},\n", - " {'text': 'GALAXY GAMING, INC.', 'type': 'Title'},\n", - " {'text': 'CONSOLIDATED BALANCE SHEETS', 'type': 'Title'},\n", - " {'text': 'DECEMBER\\xa031, 2021 AND 2020', 'type': 'Title'},\n", - " {'text': 'ASSETS', 'type': 'Title'},\n", - " {'text': '2021', 'type': 'Uncategorized'},\n", - " {'text': '2020', 'type': 'Uncategorized'},\n", - " {'text': 'Current assets:', 'type': 'Title'},\n", - " {'text': 'Cash and cash equivalents', 'type': 'Title'},\n", - " {'text': '16,058,714', 'type': 'Uncategorized'},\n", - " {'text': '5,993,388', 'type': 'Uncategorized'},\n", - " {'text': 'Accounts receivable, net of allowance of $348,695 and $145,000, respectively',\n", - " 'type': 'Title'},\n", - " {'text': '4,377,165', 'type': 'Uncategorized'},\n", - " {'text': '2,493,254', 'type': 'Uncategorized'},\n", - " {'text': 'Inventory', 'type': 'Title'},\n", - " {'text': '770,248', 'type': 'Uncategorized'},\n", - " {'text': '668,525', 'type': 'Uncategorized'},\n", - " {'text': 'Income tax receivable', 'type': 'Title'},\n", - " {'text': '1,536,682', 'type': 'Uncategorized'},\n", - " {'text': '1,229,795', 'type': 'Uncategorized'},\n", - " {'text': 'Prepaid expenses', 'type': 'Title'},\n", - " {'text': '1,125,777', 'type': 'Uncategorized'},\n", - " {'text': '1,167,068', 'type': 'Uncategorized'},\n", - " {'text': 'Other current assets', 'type': 'Title'},\n", - " {'text': '21,536', 'type': 'Uncategorized'},\n", - " {'text': '10,803', 'type': 'Uncategorized'},\n", - " {'text': 'Total current assets', 'type': 'Title'},\n", - " {'text': '23,890,122', 'type': 
'Uncategorized'},\n", - " {'text': '11,562,833', 'type': 'Uncategorized'},\n", - " {'text': 'Property and equipment, net', 'type': 'Title'},\n", - " {'text': '98,594', 'type': 'Uncategorized'},\n", - " {'text': '116,724', 'type': 'Uncategorized'},\n", - " {'text': 'Operating lease right-of-use assets', 'type': 'NarrativeText'},\n", - " {'text': '1,167,903', 'type': 'Uncategorized'},\n", - " {'text': '1,367,821', 'type': 'Uncategorized'},\n", - " {'text': 'Assets deployed at client locations, net', 'type': 'NarrativeText'},\n", - " {'text': '360,735', 'type': 'Uncategorized'},\n", - " {'text': '232,156', 'type': 'Uncategorized'},\n", - " {'text': 'Goodwill', 'type': 'Title'},\n", - " {'text': '1,091,000', 'type': 'Uncategorized'},\n", - " {'text': '1,091,000', 'type': 'Uncategorized'},\n", - " {'text': 'Other intangible assets, net', 'type': 'Title'},\n", - " {'text': '13,677,264', 'type': 'Uncategorized'},\n", - " {'text': '16,086,896', 'type': 'Uncategorized'},\n", - " {'text': 'Other assets, net', 'type': 'Title'},\n", - " {'text': '167,087', 'type': 'Uncategorized'},\n", - " {'text': '117,164', 'type': 'Uncategorized'},\n", - " {'text': 'Total assets', 'type': 'Title'},\n", - " {'text': '40,452,705', 'type': 'Uncategorized'},\n", - " {'text': '30,574,594', 'type': 'Uncategorized'},\n", - " {'text': 'LIABILITIES AND STOCKHOLDERS’ DEFICIT', 'type': 'NarrativeText'},\n", - " {'text': 'Current liabilities:', 'type': 'Title'},\n", - " {'text': 'Accounts payable', 'type': 'Title'},\n", - " {'text': '374,323', 'type': 'Uncategorized'},\n", - " {'text': '467,792', 'type': 'Uncategorized'},\n", - " {'text': 'Accrued expenses', 'type': 'NarrativeText'},\n", - " {'text': '2,666,073', 'type': 'Uncategorized'},\n", - " {'text': '1,333,032', 'type': 'Uncategorized'},\n", - " {'text': 'Revenue contract liability', 'type': 'Title'},\n", - " {'text': '37,500', 'type': 'Uncategorized'},\n", - " {'text': '29,167', 'type': 'Uncategorized'},\n", - " {'text': 'Current portion of 
long-term debt', 'type': 'Title'},\n", - " {'text': '1,100,369', 'type': 'Uncategorized'},\n", - " {'text': '2,222,392', 'type': 'Uncategorized'},\n", - " {'text': 'Current portion of operating lease liabilities', 'type': 'Title'},\n", - " {'text': '222,806', 'type': 'Uncategorized'},\n", - " {'text': '195,411', 'type': 'Uncategorized'},\n", - " {'text': 'Total current liabilities', 'type': 'Title'},\n", - " {'text': '4,401,071', 'type': 'Uncategorized'},\n", - " {'text': '4,247,794', 'type': 'Uncategorized'},\n", - " {'text': 'Long-term operating lease liabilities', 'type': 'Title'},\n", - " {'text': '1,019,029', 'type': 'Uncategorized'},\n", - " {'text': '1,215,680', 'type': 'Uncategorized'},\n", - " {'text': 'Long-term debt and liabilities, net', 'type': 'Title'},\n", - " {'text': '52,143,810', 'type': 'Uncategorized'},\n", - " {'text': '49,691,184', 'type': 'Uncategorized'},\n", - " {'text': 'Interest rate swap liability', 'type': 'Title'},\n", - " {'text': '66,009', 'type': 'Uncategorized'},\n", - " {'text': 'Deferred tax liabilities, net', 'type': 'Title'},\n", - " {'text': '175,218', 'type': 'Uncategorized'},\n", - " {'text': '150,892', 'type': 'Uncategorized'},\n", - " {'text': 'Total liabilities', 'type': 'Title'},\n", - " {'text': '57,739,128', 'type': 'Uncategorized'},\n", - " {'text': '55,371,559', 'type': 'Uncategorized'},\n", - " {'text': 'Commitments and Contingencies (See Note 11)', 'type': 'Title'},\n", - " {'text': 'Stockholders’ deficit', 'type': 'NarrativeText'},\n", - " {'text': 'Preferred stock, 10,000,000 shares authorized, $0.001 par value;\\n\\xa0\\xa0 0 shares issued and outstanding, respectively',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Common stock, 65,000,000 shares authorized; $0.001 par value;\\n\\xa0\\xa0 23,523,969 and 21,970,638 shares issued and outstanding, respectively',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '23,524', 'type': 'Uncategorized'},\n", - " {'text': '21,971', 'type': 'Uncategorized'},\n", - " 
{'text': 'Additional paid-in capital', 'type': 'Title'},\n", - " {'text': '16,380,597', 'type': 'Uncategorized'},\n", - " {'text': '10,798,536', 'type': 'Uncategorized'},\n", - " {'text': 'Accumulated deficit', 'type': 'NarrativeText'},\n", - " {'text': '(33,543,351', 'type': 'Uncategorized'},\n", - " {'text': '(35,655,163', 'type': 'Uncategorized'},\n", - " {'text': 'Accumulated other comprehensive (loss) income',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '(147,193', 'type': 'Uncategorized'},\n", - " {'text': '37,691', 'type': 'Uncategorized'},\n", - " {'text': 'Total stockholders’ deficit', 'type': 'NarrativeText'},\n", - " {'text': '(17,286,423', 'type': 'Uncategorized'},\n", - " {'text': '(24,796,965', 'type': 'Uncategorized'},\n", - " {'text': 'Total liabilities and stockholders’ deficit',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '40,452,705', 'type': 'Uncategorized'},\n", - " {'text': '30,574,594', 'type': 'Uncategorized'},\n", - " {'text': 'The accompanying notes are an integral part of the consolidated financial statements.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '17', 'type': 'Uncategorized'},\n", - " {'text': 'GALAXY GAMING, INC.', 'type': 'Title'},\n", - " {'text': 'CONSOLIDATED STATEMENTS OF OPERATIONS AND COMPREHENSIVE INCOME (LOSS)',\n", - " 'type': 'Title'},\n", - " {'text': 'YEARS ENDED DECEMBER\\xa031, 2021 AND 2020',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '2021', 'type': 'Uncategorized'},\n", - " {'text': '2020', 'type': 'Uncategorized'},\n", - " {'text': 'Revenue:', 'type': 'Title'},\n", - " {'text': 'Licensing fees', 'type': 'Title'},\n", - " {'text': '19,984,378', 'type': 'Uncategorized'},\n", - " {'text': '10,230,316', 'type': 'Uncategorized'},\n", - " {'text': 'Total revenue', 'type': 'Title'},\n", - " {'text': '19,984,378', 'type': 'Uncategorized'},\n", - " {'text': '10,230,316', 'type': 'Uncategorized'},\n", - " {'text': 'Costs and expenses:', 'type': 'Title'},\n", - " {'text': 'Cost of ancillary 
products and assembled components',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '80,833', 'type': 'Uncategorized'},\n", - " {'text': '72,684', 'type': 'Uncategorized'},\n", - " {'text': 'Selling, general and administrative', 'type': 'NarrativeText'},\n", - " {'text': '10,646,524', 'type': 'Uncategorized'},\n", - " {'text': '8,964,930', 'type': 'Uncategorized'},\n", - " {'text': 'Research and development', 'type': 'Title'},\n", - " {'text': '520,449', 'type': 'Uncategorized'},\n", - " {'text': '487,679', 'type': 'Uncategorized'},\n", - " {'text': 'Depreciation and amortization', 'type': 'Title'},\n", - " {'text': '2,858,991', 'type': 'Uncategorized'},\n", - " {'text': '2,222,042', 'type': 'Uncategorized'},\n", - " {'text': 'Share-based compensation', 'type': 'Title'},\n", - " {'text': '1,532,455', 'type': 'Uncategorized'},\n", - " {'text': '737,991', 'type': 'Uncategorized'},\n", - " {'text': 'Total costs and expenses', 'type': 'Title'},\n", - " {'text': '15,639,252', 'type': 'Uncategorized'},\n", - " {'text': '12,485,326', 'type': 'Uncategorized'},\n", - " {'text': 'Income (loss) from operations', 'type': 'Title'},\n", - " {'text': '4,345,126', 'type': 'Uncategorized'},\n", - " {'text': '(2,255,010', 'type': 'Uncategorized'},\n", - " {'text': 'Other income (expense):', 'type': 'Title'},\n", - " {'text': 'Interest income', 'type': 'Title'},\n", - " {'text': '2,048', 'type': 'Uncategorized'},\n", - " {'text': '25,702', 'type': 'Uncategorized'},\n", - " {'text': 'Interest expense', 'type': 'Title'},\n", - " {'text': '(1,505,386', 'type': 'Uncategorized'},\n", - " {'text': '(683,357', 'type': 'Uncategorized'},\n", - " {'text': 'Share redemption consideration', 'type': 'Title'},\n", - " {'text': '(682,469', 'type': 'Uncategorized'},\n", - " {'text': '(781,928', 'type': 'Uncategorized'},\n", - " {'text': 'Foreign currency exchange (loss)', 'type': 'Title'},\n", - " {'text': '(64,879', 'type': 'Uncategorized'},\n", - " {'text': '(34,961', 'type': 
'Uncategorized'},\n", - " {'text': 'Change in estimated fair value of interest rate swap liability',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '66,009', 'type': 'Uncategorized'},\n", - " {'text': '74,487', 'type': 'Uncategorized'},\n", - " {'text': 'Paycheck Protection Program Loan forgiveness', 'type': 'Title'},\n", - " {'text': '840,243', 'type': 'Uncategorized'},\n", - " {'text': 'Total other expense', 'type': 'Title'},\n", - " {'text': '(2,184,677', 'type': 'Uncategorized'},\n", - " {'text': '(559,814', 'type': 'Uncategorized'},\n", - " {'text': 'Income (loss) before (provision) benefit for income taxes',\n", - " 'type': 'Title'},\n", - " {'text': '2,160,449', 'type': 'Uncategorized'},\n", - " {'text': '(2,814,824', 'type': 'Uncategorized'},\n", - " {'text': '(Provision) benefit for income taxes', 'type': 'Title'},\n", - " {'text': '(48,637', 'type': 'Uncategorized'},\n", - " {'text': '605,937', 'type': 'Uncategorized'},\n", - " {'text': 'Net income (loss)', 'type': 'Title'},\n", - " {'text': '2,111,812', 'type': 'Uncategorized'},\n", - " {'text': '(2,208,887', 'type': 'Uncategorized'},\n", - " {'text': 'Foreign currency translation adjustment', 'type': 'Title'},\n", - " {'text': '(184,884', 'type': 'Uncategorized'},\n", - " {'text': '37,691', 'type': 'Uncategorized'},\n", - " {'text': 'Comprehensive income (loss)', 'type': 'Title'},\n", - " {'text': '1,926,928', 'type': 'Uncategorized'},\n", - " {'text': '(2,171,196', 'type': 'Uncategorized'},\n", - " {'text': 'Net income (loss) per share:', 'type': 'Title'},\n", - " {'text': 'Basic', 'type': 'Title'},\n", - " {'text': '0.10', 'type': 'Uncategorized'},\n", - " {'text': '(0.12', 'type': 'Uncategorized'},\n", - " {'text': 'Diluted', 'type': 'Title'},\n", - " {'text': '0.10', 'type': 'Uncategorized'},\n", - " {'text': '(0.12', 'type': 'Uncategorized'},\n", - " {'text': 'Weighted-average shares outstanding:', 'type': 'Title'},\n", - " {'text': 'Basic', 'type': 'Title'},\n", - " {'text': '20,328,110', 
'type': 'Uncategorized'},\n", - " {'text': '18,282,262', 'type': 'Uncategorized'},\n", - " {'text': 'Diluted', 'type': 'Title'},\n", - " {'text': '21,840,609', 'type': 'Uncategorized'},\n", - " {'text': '18,282,262', 'type': 'Uncategorized'},\n", - " {'text': 'The accompanying notes are an integral part of the consolidated financial statements.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '18', 'type': 'Uncategorized'},\n", - " {'text': 'GALAXY GAMING, INC.', 'type': 'Title'},\n", - " {'text': 'CONSOLIDATED STATEMENTS OF CHANGES IN STOCKHOLDERS’ DEFICIT',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'YEARS ENDED DECEMBER\\xa031, 2021 AND 2020',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Common Stock', 'type': 'Title'},\n", - " {'text': 'Additional Paid in', 'type': 'Title'},\n", - " {'text': 'Accumulated (Deficit)', 'type': 'Title'},\n", - " {'text': 'Accumulated Other', 'type': 'Title'},\n", - " {'text': \"Total Shareholders'\", 'type': 'Title'},\n", - " {'text': 'Shares', 'type': 'Title'},\n", - " {'text': 'Amount', 'type': 'Title'},\n", - " {'text': 'Capital', 'type': 'Title'},\n", - " {'text': 'Earnings', 'type': 'Title'},\n", - " {'text': 'Comprehensive Income (Loss)', 'type': 'Title'},\n", - " {'text': 'Deficit', 'type': 'Title'},\n", - " {'text': 'Beginning balance, January 1, 2020', 'type': 'NarrativeText'},\n", - " {'text': '18,017,944', 'type': 'Uncategorized'},\n", - " {'text': '18,018', 'type': 'Uncategorized'},\n", - " {'text': '5,795,636', 'type': 'Uncategorized'},\n", - " {'text': '(33,446,276', 'type': 'Uncategorized'},\n", - " {'text': '(27,632,622', 'type': 'Uncategorized'},\n", - " {'text': 'Shares issued in connection with PGP asset acquisition',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '3,141,361', 'type': 'Uncategorized'},\n", - " {'text': '3,141', 'type': 'Uncategorized'},\n", - " {'text': '3,986,387', 'type': 'Uncategorized'},\n", - " {'text': '3,989,528', 'type': 'Uncategorized'},\n", - " {'text': 'Net 
loss', 'type': 'Title'},\n", - " {'text': '(2,208,887', 'type': 'Uncategorized'},\n", - " {'text': '(2,208,887', 'type': 'Uncategorized'},\n", - " {'text': 'Foreign currency translation', 'type': 'Title'},\n", - " {'text': '37,691', 'type': 'Uncategorized'},\n", - " {'text': '37,691', 'type': 'Uncategorized'},\n", - " {'text': 'Stock options exercised', 'type': 'NarrativeText'},\n", - " {'text': '558,000', 'type': 'Uncategorized'},\n", - " {'text': '559', 'type': 'Uncategorized'},\n", - " {'text': '278,775', 'type': 'Uncategorized'},\n", - " {'text': '279,334', 'type': 'Uncategorized'},\n", - " {'text': 'Share-based compensation', 'type': 'Title'},\n", - " {'text': '253,333', 'type': 'Uncategorized'},\n", - " {'text': '253', 'type': 'Uncategorized'},\n", - " {'text': '737,738', 'type': 'Uncategorized'},\n", - " {'text': '737,991', 'type': 'Uncategorized'},\n", - " {'text': 'Balance, December\\xa031, 2020', 'type': 'Title'},\n", - " {'text': '21,970,638', 'type': 'Uncategorized'},\n", - " {'text': '21,971', 'type': 'Uncategorized'},\n", - " {'text': '10,798,536', 'type': 'Uncategorized'},\n", - " {'text': '(35,655,163', 'type': 'Uncategorized'},\n", - " {'text': '37,691', 'type': 'Uncategorized'},\n", - " {'text': '(24,796,965', 'type': 'Uncategorized'},\n", - " {'text': 'Warrants issued in connection with Fortress credit agreement',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '3,149,002', 'type': 'Uncategorized'},\n", - " {'text': '3,149,002', 'type': 'Uncategorized'},\n", - " {'text': 'Net income', 'type': 'Title'},\n", - " {'text': '2,111,812', 'type': 'Uncategorized'},\n", - " {'text': '2,111,812', 'type': 'Uncategorized'},\n", - " {'text': 'Foreign currency translation', 'type': 'Title'},\n", - " {'text': '(184,884', 'type': 'Uncategorized'},\n", - " {'text': '(184,884', 'type': 'Uncategorized'},\n", - " {'text': 'Stock options exercised', 'type': 'NarrativeText'},\n", - " {'text': '1,094,998', 'type': 'Uncategorized'},\n", - " {'text': '1,095', 'type': 
'Uncategorized'},\n", - " {'text': '901,062', 'type': 'Uncategorized'},\n", - " {'text': '902,157', 'type': 'Uncategorized'},\n", - " {'text': 'Share-based compensation', 'type': 'Title'},\n", - " {'text': '458,333', 'type': 'Uncategorized'},\n", - " {'text': '458', 'type': 'Uncategorized'},\n", - " {'text': '1,531,997', 'type': 'Uncategorized'},\n", - " {'text': '1,532,455', 'type': 'Uncategorized'},\n", - " {'text': 'Balance, December\\xa031, 2021', 'type': 'Title'},\n", - " {'text': '23,523,969', 'type': 'Uncategorized'},\n", - " {'text': '23,524', 'type': 'Uncategorized'},\n", - " {'text': '16,380,597', 'type': 'Uncategorized'},\n", - " {'text': '(33,543,351', 'type': 'Uncategorized'},\n", - " {'text': '(147,193', 'type': 'Uncategorized'},\n", - " {'text': '(17,286,423', 'type': 'Uncategorized'},\n", - " {'text': 'The accompanying notes are an integral part of the consolidated financial statements.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '19', 'type': 'Uncategorized'},\n", - " {'text': 'GALAXY GAMING, INC.', 'type': 'Title'},\n", - " {'text': 'CONSOLIDATED STATEMENTS OF CASH FLOWS', 'type': 'Title'},\n", - " {'text': 'YEARS ENDED December\\xa031, 2021 AND 2020', 'type': 'Title'},\n", - " {'text': '2021', 'type': 'Uncategorized'},\n", - " {'text': '2020', 'type': 'Uncategorized'},\n", - " {'text': 'Cash flows from operating activities:', 'type': 'NarrativeText'},\n", - " {'text': 'Net income (loss)', 'type': 'Title'},\n", - " {'text': '2,111,812', 'type': 'Uncategorized'},\n", - " {'text': '(2,208,887', 'type': 'Uncategorized'},\n", - " {'text': 'Adjustments to reconcile net income (loss) to net cash provided by (used in) operating activities:',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Depreciation and amortization', 'type': 'Title'},\n", - " {'text': '2,858,991', 'type': 'Uncategorized'},\n", - " {'text': '2,222,042', 'type': 'Uncategorized'},\n", - " {'text': 'Amortization of right-of-use assets', 'type': 'Title'},\n", - " {'text': 
'228,522', 'type': 'Uncategorized'},\n", - " {'text': '329,040', 'type': 'Uncategorized'},\n", - " {'text': 'Amortization of debt issuance costs and debt discount',\n", - " 'type': 'Title'},\n", - " {'text': '369,093', 'type': 'Uncategorized'},\n", - " {'text': '38,195', 'type': 'Uncategorized'},\n", - " {'text': 'Bad debt expense', 'type': 'Title'},\n", - " {'text': '358,160', 'type': 'Uncategorized'},\n", - " {'text': '226,691', 'type': 'Uncategorized'},\n", - " {'text': 'Change in estimated fair value of interest rate swap liability',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '(66,009', 'type': 'Uncategorized'},\n", - " {'text': '(74,487', 'type': 'Uncategorized'},\n", - " {'text': 'Gain on forgiveness of Paycheck Protection Program Loan',\n", - " 'type': 'Title'},\n", - " {'text': '(835,300', 'type': 'Uncategorized'},\n", - " {'text': 'Deferred income tax', 'type': 'Title'},\n", - " {'text': '24,326', 'type': 'Uncategorized'},\n", - " {'text': '596,874', 'type': 'Uncategorized'},\n", - " {'text': 'Share-based compensation', 'type': 'Title'},\n", - " {'text': '1,532,455', 'type': 'Uncategorized'},\n", - " {'text': '737,991', 'type': 'Uncategorized'},\n", - " {'text': 'Changes in operating assets and liabilities:', 'type': 'Title'},\n", - " {'text': 'Accounts receivable', 'type': 'Title'},\n", - " {'text': '(2,367,258', 'type': 'Uncategorized'},\n", - " {'text': '(236,890', 'type': 'Uncategorized'},\n", - " {'text': 'Inventory', 'type': 'Title'},\n", - " {'text': '(427,795', 'type': 'Uncategorized'},\n", - " {'text': '(51,709', 'type': 'Uncategorized'},\n", - " {'text': 'Income tax receivable/payable', 'type': 'Title'},\n", - " {'text': '(306,887', 'type': 'Uncategorized'},\n", - " {'text': '(893,930', 'type': 'Uncategorized'},\n", - " {'text': 'Prepaid expense and other current assets', 'type': 'Title'},\n", - " {'text': '680,663', 'type': 'Uncategorized'},\n", - " {'text': '259,616', 'type': 'Uncategorized'},\n", - " {'text': 'Other assets', 'type': 
'Title'},\n", - " {'text': '(49,923', 'type': 'Uncategorized'},\n", - " {'text': 'Accounts payable', 'type': 'Title'},\n", - " {'text': '(91,242', 'type': 'Uncategorized'},\n", - " {'text': '(1,081,836', 'type': 'Uncategorized'},\n", - " {'text': 'Accrued expenses', 'type': 'NarrativeText'},\n", - " {'text': '1,338,195', 'type': 'Uncategorized'},\n", - " {'text': '(257,179', 'type': 'Uncategorized'},\n", - " {'text': 'Revenue contract liability', 'type': 'Title'},\n", - " {'text': '8,333', 'type': 'Uncategorized'},\n", - " {'text': 'Operating lease liabilities', 'type': 'NarrativeText'},\n", - " {'text': '(197,860', 'type': 'Uncategorized'},\n", - " {'text': '(403,363', 'type': 'Uncategorized'},\n", - " {'text': 'Net cash provided by (used in) operating activities',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '6,003,576', 'type': 'Uncategorized'},\n", - " {'text': '(1,633,132', 'type': 'Uncategorized'},\n", - " {'text': 'Cash flows from investing activities:', 'type': 'NarrativeText'},\n", - " {'text': 'Investment in intangible assets', 'type': 'Title'},\n", - " {'text': '(198,667', 'type': 'Uncategorized'},\n", - " {'text': 'Proceeds from sale of property and equipment', 'type': 'Title'},\n", - " {'text': '25,000', 'type': 'Uncategorized'},\n", - " {'text': 'Acquisition of PGP assets, net of cash acquired',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '(6,393,920', 'type': 'Uncategorized'},\n", - " {'text': 'Acquisition of property and equipment', 'type': 'Title'},\n", - " {'text': '(60,067', 'type': 'Uncategorized'},\n", - " {'text': '(62,794', 'type': 'Uncategorized'},\n", - " {'text': 'Net cash used in investing activities', 'type': 'NarrativeText'},\n", - " {'text': '(233,734', 'type': 'Uncategorized'},\n", - " {'text': '(6,456,714', 'type': 'Uncategorized'},\n", - " {'text': 'Cash flows from financing activities:', 'type': 'Title'},\n", - " {'text': 'Proceeds from draw on revolving loan', 'type': 'NarrativeText'},\n", - " {'text': '1,000,000', 
'type': 'Uncategorized'},\n", - " {'text': 'Proceeds from Paycheck Protection Program', 'type': 'Title'},\n", - " {'text': '835,300', 'type': 'Uncategorized'},\n", - " {'text': 'Proceeds from Mainstreet Priority Loan Program', 'type': 'Title'},\n", - " {'text': '3,920,000', 'type': 'Uncategorized'},\n", - " {'text': 'Proceeds from stock option exercises', 'type': 'Title'},\n", - " {'text': '902,157', 'type': 'Uncategorized'},\n", - " {'text': '279,334', 'type': 'Uncategorized'},\n", - " {'text': 'Principal payments on long-term debt', 'type': 'Title'},\n", - " {'text': '(13,104,942', 'type': 'Uncategorized'},\n", - " {'text': '(1,645,400', 'type': 'Uncategorized'},\n", - " {'text': 'Proceeds from Fortress Credit Agreement, net', 'type': 'Title'},\n", - " {'text': '58,800,000', 'type': 'Uncategorized'},\n", - " {'text': 'Payments of debt issuance costs', 'type': 'Title'},\n", - " {'text': '(3,138,521', 'type': 'Uncategorized'},\n", - " {'text': 'Settlement of Redemption Consideration Obligation',\n", - " 'type': 'Title'},\n", - " {'text': '(39,096,401', 'type': 'Uncategorized'},\n", - " {'text': 'Net cash provided by financing activities',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '4,362,293', 'type': 'Uncategorized'},\n", - " {'text': '4,389,234', 'type': 'Uncategorized'},\n", - " {'text': 'Effect of exchange rate changes on cash', 'type': 'Title'},\n", - " {'text': '(66,809', 'type': 'Uncategorized'},\n", - " {'text': '7,302', 'type': 'Uncategorized'},\n", - " {'text': 'Net increase (decrease) in cash and cash equivalents',\n", - " 'type': 'Title'},\n", - " {'text': '10,065,326', 'type': 'Uncategorized'},\n", - " {'text': '(3,693,310', 'type': 'Uncategorized'},\n", - " {'text': 'Cash and cash equivalents – beginning of period',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '5,993,388', 'type': 'Uncategorized'},\n", - " {'text': '9,686,698', 'type': 'Uncategorized'},\n", - " {'text': 'Cash and cash equivalents – end of period', 'type': 'Title'},\n", - 
" {'text': '16,058,714', 'type': 'Uncategorized'},\n", - " {'text': '5,993,388', 'type': 'Uncategorized'},\n", - " {'text': 'Supplemental cash flow information:', 'type': 'Title'},\n", - " {'text': 'Cash paid for interest', 'type': 'NarrativeText'},\n", - " {'text': '940,097', 'type': 'Uncategorized'},\n", - " {'text': '612,840', 'type': 'Uncategorized'},\n", - " {'text': 'Cash paid for income taxes', 'type': 'NarrativeText'},\n", - " {'text': '319,967', 'type': 'Uncategorized'},\n", - " {'text': '75,786', 'type': 'Uncategorized'},\n", - " {'text': 'Supplemental schedule of non-cash activities:', 'type': 'Title'},\n", - " {'text': 'Shares issued in connection with PGP asset acquisition',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '3,989,528', 'type': 'Uncategorized'},\n", - " {'text': 'Gain on forgiveness of Paycheck Protection Program Loan',\n", - " 'type': 'Title'},\n", - " {'text': '835,300', 'type': 'Uncategorized'},\n", - " {'text': 'Fortress warrants issued', 'type': 'NarrativeText'},\n", - " {'text': '3,149,002', 'type': 'Uncategorized'},\n", - " {'text': 'Insurance acquired under note payable', 'type': 'NarrativeText'},\n", - " {'text': '653,521', 'type': 'Uncategorized'},\n", - " {'text': '678,108', 'type': 'Uncategorized'},\n", - " {'text': 'Right-of-use assets obtained in exchange for lease liabilities',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '28,604', 'type': 'Uncategorized'},\n", - " {'text': '1,390,002', 'type': 'Uncategorized'},\n", - " {'text': 'Inventory transferred to assets deployed at client locations',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '326,072', 'type': 'Uncategorized'},\n", - " {'text': '48,838', 'type': 'Uncategorized'},\n", - " {'text': 'The accompanying notes are an integral part of the consolidated financial statements.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '20', 'type': 'Uncategorized'},\n", - " {'text': 'GALAXY GAMING, INC.', 'type': 'Title'},\n", - " {'text': 'NOTES TO CONSOLIDATED 
FINANCIAL STATEMENTS', 'type': 'Title'},\n", - " {'text': 'YEARS ENDED DECEMBER\\xa031, 2021 AND 2020',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'NOTE 1. NATURE OF OPERATIONS', 'type': 'Title'},\n", - " {'text': 'Unless the context indicates otherwise, references to “Galaxy Gaming, Inc.,” “we,” “us,” “our,” or the “Company,” refer to Galaxy Gaming, Inc., a Nevada corporation (“Galaxy Gaming”).',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'We are an established global gaming company specializing in the design, development, acquisition, assembly, marketing and licensing of proprietary casino table games and associated technology, platforms and systems for the casino gaming industry. Casinos use our proprietary products and services to enhance their gaming operations and improve their profitability, productivity and security, as well as to offer popular cutting-edge gaming entertainment content and technology to their players. We market our products and services to online casinos worldwide and to land-based casino gaming companies in North America, the Caribbean, Central America, the United Kingdom, Europe and Africa and to cruise ship companies. We license our products and services for use solely in legalized gaming markets. We also license our content and distribute content from other companies to iGaming operators throughout the world.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': \"Share Redemption. On May 6, 2019, we redeemed all 23,271,667 shares of our common stock held by Triangulum Partners, LLC (“Triangulum”), an entity controlled by Robert B. Saucier, Galaxy Gaming's founder, and, prior to the redemption, the holder of a majority of our outstanding common stock. 
Our Articles of Incorporation (the “Articles”) provide that if certain events occur in relation to a stockholder that is required to undergo a gaming suitability review or similar investigative process, we have the option to purchase all or any part of such stockholder’s shares at a price per share that is equal to the average closing share price over the thirty calendar days preceding the purchase. The average closing share price over the thirty calendar days preceding the redemption was $1.68 per share.\",\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'The consideration owed to Triangulum for the redemption was $39,096,401 (the “Redemption Consideration Obligation”). See Note 10. All of the litigation related to the Redemption Consideration Obligation and other matters between the Company and Triangulum was resolved on November 15, 2021, when Galaxy made a settlement payment in the amount of $39,507,717 to Triangulum. See Note 10 and Note 11.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Membership Interest Purchase Agreement. On February 25, 2020, Galaxy Gaming entered into a Membership Interest Purchase Agreement, dated February 25, 2020 (the “Purchase Agreement”), between the Company and the membership interest holders of Progressive Games Partners, LLC (“PGP”).',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'On August 21, 2020, the Company entered into a First Amendment to the Purchase Agreement between the Company and the membership interest holders of PGP. The First Amendment, among other things, fixed the cash portion of the purchase price at $6.425 million and established that the stock portion would be satisfied through the issuance of 3,141,361 shares of the Company’s common stock with a value of $1.27 per share on the date of the acquisition. The shares issued are being held in escrow with Philadelphia Stock Transfer, Inc., the Company’s stock transfer agent. 
The shares were released to the sellers in August 2021.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'On August 21, 2020, the Company completed the acquisition of 100% of the member interests in PGP. The entirety of the purchase price ($10,414,528) has been allocated to customer relationships and is included in Other intangible assets, net, on the Company’s balance sheet. See Note 7. The Company also acquired certain receivables and payables in the net amount of $581,885, which was to be remitted to the sellers of PGP as the receivables and payables were settled. The remaining balance of $76,053 at December 31, 2020 was paid to the sellers on May 7, 2021.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Management has determined that, for accounting purposes, the PGP transaction does not meet the definition of a business combination and, therefore, has been accounted for as an asset acquisition.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'COVID-19. On March 11, 2020, the World Health Organization declared a pandemic related to the COVID-19 outbreak, which led to a global health emergency.\\xa0The public-health impact of the outbreak continues to remain largely unknown and still evolving as new strains of COVID-19 continue to evolve. The related health crisis could continue to adversely affect the global economy, resulting in continued economic downturn that could impact demand for our products. Virtually all of our land-based clients have reopened, although casino revenues have not returned to pre-COVID-19 levels.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'We rely on third-party suppliers and manufacturers in China, many of whom were shut down or severely cut back production during some portion of 2020, with supply shortages continuing into 2021. 
Although we have been able to maintain inventories adequate to our needs, any future disruption of our suppliers and their contract manufacturers may impact our sales and operating results going forward.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '21', 'type': 'Uncategorized'},\n", - " {'text': 'Because of the uncertainties of COVID-19, the Company drew on its Revolving Loan in the amount of $1,000,000 on March 12, 2020. Also, on April 17, 2020, the Company obtained the Paycheck Protection Program (the “PPP Loan”) pursuant to the Coronavirus Aid, Relief, and Economic Security Act (the “CARES Act”) and the Paycheck Protection Program Flexibility Act (the “Flexibility Act”). On July 16, 2020, the Company filed an application and supporting documentation for forgiveness in full of the PPP Loan. On November 21, 2020, the Company received notification the PPP Loan had been forgiven in full, including $4,943 in accrued interest. Pursuant to the CARES Act, the Federal Reserve created the Main Street Priority Loan Program (“MSPLP”) to provide financing for small and medium-sized businesses. On October 26, 2020, the Company borrowed $4 million from Zions Bancorporation N.A., dba Nevada State Bank under this program. All of the Company’s obligations under the Nevada State Bank credit agreement were repaid on November 15, 2021. See Note 10.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Credit Agreement Amendments and Fortress Credit Agreement. See Note 10 for discussion of amendments made to the Company’s credit agreement and the entry into the Fortress Credit Agreement.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'NOTE 2.\\xa0SIGNIFICANT ACCOUNTING POLICIES', 'type': 'Title'},\n", - " {'text': 'The accompanying consolidated financial statements have been prepared in accordance with',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'generally accepted accounting principles in the United States of America (“U.S. 
GAAP”)',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'and the rules of', 'type': 'Title'},\n", - " {'text': 'the Securities and Exchange Commission (“SEC”)', 'type': 'Title'},\n", - " {'text': '. In the opinion of management,', 'type': 'Uncategorized'},\n", - " {'text': 'the accompanying consolidated financial statements contain all necessary adjustments (including',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'all', 'type': 'Title'},\n", - " {'text': 'those of a recurring nature', 'type': 'Title'},\n", - " {'text': 'and those necessary in order for the financial statements to be not misleading)',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'and all disclosures to present fairly our financial position and the results of our operations and cash flows for the periods presented',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Basis of accounting. The consolidated financial statements have been prepared on the accrual basis of accounting in conformity with U.S. GAAP.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Use of estimates and assumptions. We are required to make estimates, judgments and assumptions that we believe are reasonable based on our historical experience, contract terms, observance of known trends in our Company and the industry as a whole, and information available from other outside sources. Our estimates affect reported amounts for assets, liabilities, revenues, expenses and related disclosures. Actual results may differ from initial estimates.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Consolidation. The financial statements are presented on a consolidated basis and include the results of the Company and its wholly owned subsidiary, PGP. All intercompany transactions and balances have been eliminated in consolidation.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Reclassifications. 
Certain accounts and financial statement captions in the prior period have been reclassified to conform to the current period financial statement presentations and had no effect on net income (loss).',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Cash and cash equivalents.\\xa0We consider cash on hand and cash in banks as cash. We consider certificates of deposit and other short-term securities with maturities of three months or less when purchased as cash equivalents. Our cash in bank balances are deposited in insured banking institutions, which are insured up to $250,000\\xa0\\xa0per account. To date, we have not experienced uninsured losses, and we believe the risk of future loss is negligible.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Accounts receivable and allowance for doubtful accounts. Accounts receivable are stated at face value less an allowance for doubtful accounts. Accounts receivable are non-interest bearing. The Company reviews the accounts receivable on a monthly basis to determine if any receivables will potentially be uncollectible. The allowance for doubtful accounts is estimated based on specific customer reviews, historical collection trends and current economic and business conditions.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Inventory.\\xa0Inventory consists of ancillary products such as signs, layouts and bases for the various games and electronic devices and components to support all our electronic enhancements used on casino table games (“Enhanced Table Systems”), and we maintain inventory levels based on historical and industry trends.\\xa0We regularly assess inventory quantities for excess and obsolescence primarily based on forecasted product demand. 
Inventory is valued at the lower of net realizable value or cost, which is determined by the average cost method.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Assets deployed at client locations, net.\\xa0Our Enhanced Table Systems are assembled by us and accounted for as inventory until deployed at our casino clients’ premises (Note 6). Once deployed and placed into service at client locations, the assets are transferred from inventory and reported as assets deployed at client locations. These assets are stated at cost, net of accumulated depreciation. Depreciation on assets deployed at client locations\\xa0is calculated using the straight-line method over a three-year period.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Property and equipment, net.\\xa0Property and equipment are being depreciated over their estimated useful lives (three\\xa0to\\xa0five\\xa0years) using the straight-line method of depreciation (Note 5). Property and equipment are analyzed for potential impairment whenever events or changes in circumstances indicate the carrying value may not be recoverable and exceeds their fair value.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Goodwill.\\xa0Goodwill (Note 7) is assessed for impairment at least annually\\xa0or at other times during the year if events or circumstances indicate that it is more-likely-than-not that the fair value of a reporting asset is below the carrying amount. 
If found to be impaired, the carrying amount will be reduced, and an impairment loss will be recognized.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '22', 'type': 'Uncategorized'},\n", - " {'text': 'Other\\xa0intangible assets, net.\\xa0The following intangible assets have finite lives and are being amortized using the straight-line method over their estimated economic lives as follows:',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Patents', 'type': 'Title'},\n", - " {'text': '4 - 20 years', 'type': 'Title'},\n", - " {'text': 'Client relationships', 'type': 'Title'},\n", - " {'text': '9 - 22 years', 'type': 'Title'},\n", - " {'text': 'Trademarks', 'type': 'Title'},\n", - " {'text': '12 - 30 years', 'type': 'Title'},\n", - " {'text': 'Non-compete agreements', 'type': 'Title'},\n", - " {'text': '9 years', 'type': 'Title'},\n", - " {'text': 'Software', 'type': 'Title'},\n", - " {'text': '3 years', 'type': 'Title'},\n", - " {'text': 'Other intangible assets (Note 7) are analyzed for potential impairment at least annually or whenever events or changes in circumstances indicate the carrying value may not be recoverable and exceeds the fair value, which is the sum of the undiscounted cash flows expected to result from the use and eventual disposition of the intangible assets. No impairment was recorded for the years ended December 31, 2021 or 2020.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Interest rates swap agreement. In May 2018, the Company entered into an interest rate swap agreement to reduce the impact of changes in interest rates on its floating rate long-term debt. The interest rate swap has not been designated a hedging instrument and is adjusted to fair value through earnings in the Company’s statements of operations. 
The interest rate swap agreement matured on May 1, 2021.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Fair value of financial instruments.\\xa0We estimate fair value for financial assets and liabilities in accordance with Financial Accounting Standards Board (“FASB”) Accounting Standards Codification (“ASC”) Topic 820, Fair Value Measurement (“ASC 820”). ASC 820 defines fair value, provides guidance for measuring fair value, requires certain disclosures and discusses valuation techniques, such as the market approach (comparable market prices), the income approach (present value of future income or cash flow) and the cost approach (cost to replace the service capacity of an asset or replacement cost). ASC 820 utilizes a fair value hierarchy that prioritizes the inputs to valuation techniques used to measure fair value into three broad levels:',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '•\\n\\nLevel 2: Inputs other than quoted prices that are observable for the asset or liability, either directly or indirectly. These include quoted prices for similar assets or liabilities in active markets and quoted prices for identical or similar assets or liabilities in markets that are not active.',\n", - " 'type': 'ListItem'},\n", - " {'text': 'Level 1: Observable inputs such as quoted prices (unadjusted) in active markets for identical assets or liabilities.',\n", - " 'type': 'ListItem'},\n", - " {'text': '•\\n\\nLevel 3: Unobservable inputs that reflect the reporting entity’s own assumptions.',\n", - " 'type': 'ListItem'},\n", - " {'text': 'Level 2: Inputs other than quoted prices that are observable for the asset or liability, either directly or indirectly. 
These include quoted prices for similar assets or liabilities in active markets and quoted prices for identical or similar assets or liabilities in markets that are not active.',\n", - " 'type': 'ListItem'},\n", - " {'text': 'The estimated fair values of cash equivalents, accounts receivable and accounts payable approximate their carrying amounts due to their short-term nature. The estimated fair value of our long-term debt approximates its carrying value based upon our expected borrowing rate for debt with similar remaining maturities and comparable risk. The Company currently has no financial instruments measured at estimated fair value on a recurring basis based on valuation reports provided by counterparties.',\n", - " 'type': 'ListItem'},\n", - " {'text': 'Level 3: Unobservable inputs that reflect the reporting entity’s own assumptions.',\n", - " 'type': 'ListItem'},\n", - " {'text': 'The estimated fair values of cash equivalents, accounts receivable and accounts payable approximate their carrying amounts due to their short-term nature. The estimated fair value of our long-term debt approximates its carrying value based upon our expected borrowing rate for debt with similar remaining maturities and comparable risk. The Company currently has no financial instruments measured at estimated fair value on a recurring basis based on valuation reports provided by counterparties.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Leases.\\xa0We account for lease components (such as rent payments) separately from non-lease components (such as common-area maintenance costs, real estate and sales taxes and insurance costs). Operating and finance leases with terms greater than 12 months are recorded on the consolidated balance sheets as right-of-use assets with corresponding lease liabilities. 
Lease expense is recognized on a straight-line basis using the discount rate implicit in each lease or our incremental borrowing rate at lease commencement date (Note 9).',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Revenue recognition.\\xa0We account for our revenue in accordance with ASC Topic 606,\\xa0Revenue from Contracts with Customers. See Note 3.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Costs of ancillary products and assembled components.\\xa0Ancillary products include\\xa0pay tables\\xa0(display of payouts), bases, layouts, signage and other items as they relate to support of specific proprietary games in connection with the licensing of our games. Assembled components represent the cost of the equipment, devices and incorporated software used to support our Enhanced Table Systems.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Research and development.\\xa0We incur research and development (“R&D”) costs to develop our new and next-generation products. Our products reach commercial feasibility shortly before the products are released, and therefore R&D costs are expensed as incurred. Employee related costs associated with product development are included in R&D costs.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Foreign currency translation.\\xa0The functional currency for PGP is the Euro. Gains and losses from settlement of transactions involving foreign currency amounts are included in other income or expense in the consolidated statements of operations. Gains and losses resulting from translating assets and liabilities from the functional currency to U.S. 
dollars are included in accumulated other comprehensive income or (loss) in the consolidated statements of changes in stockholders’ deficit.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Net income per share.\\xa0Basic net income per share is calculated by dividing net income by the weighted-average number of common shares issued and outstanding during the year. Diluted net income per share is similar to basic, except that the weighted-average number of shares outstanding is increased by the potentially dilutive effect of outstanding stock options and restricted stock, if applicable, during the year.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '23', 'type': 'Uncategorized'},\n", - " {'text': 'Segmented information. We define operating segments as components of our enterprise for which separate financial information is reviewed regularly by the chief operating decision-makers to evaluate performance and to make operating decisions. We currently have two operating segments (land-based gaming and online gaming) which are aggregated into one reporting segment.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Share-based compensation.\\xa0We recognize compensation expense for all restricted stock and stock option awards made to employees, directors and independent contractors. The fair value of restricted stock is measured using the grant date trading price of\\xa0our\\xa0stock.\\xa0The fair value of stock option awards (Note 13) is estimated at the grant date using the Black-Scholes option-pricing model, and the portion that is ultimately expected to vest is recognized as compensation cost over the requisite service period. We have elected to recognize compensation expense for all options with graded vesting on a straight-line basis over the vesting period of the entire option. 
The determination of fair value using the Black-Scholes pricing model is affected by our stock price as well as assumptions regarding a number of complex and subjective variables, including expected stock price volatility, risk-free interest rate, expected dividends and projected employee stock option exercise behaviors. We estimate volatility based on historical volatility of our common stock, and estimate the expected term based on several criteria, including the vesting period of the grant and the term of the award. We estimate employee stock option exercise behavior based on actual historical exercise activity and assumptions regarding future exercise activity of unexercised, outstanding options.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Income taxes.\\xa0We are subject to income taxes in both the United States and in certain non-U.S. jurisdictions. We account for income taxes in accordance with ASC 740, Income Taxes (“ASC 740”) using the asset and liability method. Under the asset and liability method, deferred tax assets and liabilities are recognized for the future tax consequences attributable to temporary differences between the financial statement carrying amounts of existing assets and liabilities and their respective tax bases, and operating loss and tax credit carryforwards. These temporary differences will result in deductible or taxable amounts in future years when the reported amounts of the assets or liabilities are recovered or settled. Deferred tax assets and liabilities are measured using enacted tax rates expected to apply to taxable income in the years in which those temporary differences are expected to be recovered or settled. The effect on deferred tax assets and liabilities of a change in tax rates is recognized in income in the period that includes the enactment date. A valuation allowance is provided when it is more-likely-than-not that some or all of the deferred tax assets may not be realized. 
Adjustments to the valuation allowance increase or decrease our income tax provision or benefit. To the extent we believe that recovery is more likely than not, we establish a valuation allowance against these deferred tax assets. Significant judgment is required in determining our provision for income taxes, our deferred tax assets and liabilities, and any valuation allowance recorded against our deferred tax assets. As of December\\xa031, 2021 and 2020, we recorded a full valuation allowance against certain deferred assets.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'In the ordinary course of business, there are transactions and calculations where the ultimate tax outcome is uncertain. Additionally, our tax returns are subject to audit by various tax authorities. Although we believe that our estimates are reasonable, actual results could differ from these estimates. We recognize the tax benefit from an uncertain tax position if it is more-likely-than-not that the tax position will be sustained on examination by the taxing authorities, based on an evaluation of the technical merits of the position, which requires a significant degree of judgment (Note 13).',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Recently adopted accounting standards. Simplifying the Accounting for Income Taxes. In December 2019, the FASB issued Accounting Standard Update (“ASU”) No. 2019-12, Income Taxes (Topic 740): Simplifying the Accounting for Income Taxes (ASU 2019-12), which simplifies the accounting for income taxes. This guidance is effective for the first quarter of 2021 on a prospective basis. We adopted the new standard effective January 1, 2021, and its adoption did not have a material impact on our consolidated financial statements.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'New accounting standards not yet adopted. Financial Instruments – Credit Losses. In February 2020, the FASB issued ASU No. 
2020-02, Financial Instruments – Credit Losses (Topic 326). ASU 2020-02 provides updated guidance on how an entity should measure credit losses on financial instruments and delayed the effective date of Topic 326 for smaller reporting companies until fiscal years beginning after December 15, 2022. Early adoption is permitted. We do not believe the adoption of this guidance will have a material impact on our financial statements or related disclosures.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'NOTE 3. REVENUE RECOGNITION', 'type': 'Title'},\n", - " {'text': 'Revenue recognition. We generate revenue primarily from the licensing of our intellectual property. We recognize revenue under recurring fee license contracts monthly as we satisfy our performance obligation, which consists of granting the customer the right to use our intellectual property. Amounts billed are determined based on flat rates or usage rates stipulated in the customer contract.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Disaggregation of revenue', 'type': 'Title'},\n", - " {'text': 'The following table disaggregates our revenue by geographic location for the years ended December\\xa031, 2021 and 2020:',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '2021', 'type': 'Uncategorized'},\n", - " {'text': '2020', 'type': 'Uncategorized'},\n", - " {'text': 'North America and Caribbean', 'type': 'Title'},\n", - " {'text': '10,024,537', 'type': 'Uncategorized'},\n", - " {'text': '5,757,143', 'type': 'Uncategorized'},\n", - " {'text': 'Europe, Middle East and Africa', 'type': 'Title'},\n", - " {'text': '9,959,841', 'type': 'Uncategorized'},\n", - " {'text': '4,473,173', 'type': 'Uncategorized'},\n", - " {'text': 'Total revenue', 'type': 'Title'},\n", - " {'text': '19,984,378', 'type': 'Uncategorized'},\n", - " {'text': '10,230,316', 'type': 'Uncategorized'},\n", - " {'text': 'Contract liabilities. 
Amounts billed and cash received in advance of performance obligations fulfilled are recorded as contract liabilities and recognized as performance obligations are fulfilled.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '24', 'type': 'Uncategorized'},\n", - " {'text': 'Contract assets. The Company’s contract assets consist solely of unbilled receivables which are recorded when the Company recognizes revenue in advance of billings. Unbilled receivables totaled $771,293 and $502,860 for the years ended December 31, 2021 and 2020 and are included in the accounts receivable balance in the accompanying balance sheets.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Royalty agreements. From time to time, the Company licenses intellectual property from third-party owners and the Company, in turn, re-licenses that intellectual property to its casino clients. In these arrangements, the Company usually agrees to pay the owner of the intellectual property a royalty based on the revenues the Company receives from licensing the intellectual property to its casino clients.\\xa0For the years ended December 31, 2021 and 2020, license royalty payments of $1,670,210 and $438,837, respectively, are recorded net in revenue.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'NOTE 4. 
INVENTORY', 'type': 'Title'},\n", - " {'text': 'Inventory consisted of the following as of December\\xa031, 2021 and 2020:',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '2021', 'type': 'Uncategorized'},\n", - " {'text': '2020', 'type': 'Uncategorized'},\n", - " {'text': 'Raw materials and component parts', 'type': 'Title'},\n", - " {'text': '413,320', 'type': 'Uncategorized'},\n", - " {'text': '300,244', 'type': 'Uncategorized'},\n", - " {'text': 'Finished goods', 'type': 'NarrativeText'},\n", - " {'text': '356,928', 'type': 'Uncategorized'},\n", - " {'text': '368,281', 'type': 'Uncategorized'},\n", - " {'text': 'Inventory', 'type': 'Title'},\n", - " {'text': '770,248', 'type': 'Uncategorized'},\n", - " {'text': '668,525', 'type': 'Uncategorized'},\n", - " {'text': 'NOTE 5. PROPERTY AND EQUIPMENT', 'type': 'Title'},\n", - " {'text': 'Property and equipment consisted of the following at December\\xa031, 2021 and 2020:',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '2021', 'type': 'Uncategorized'},\n", - " {'text': '2020', 'type': 'Uncategorized'},\n", - " {'text': 'Furniture and fixtures', 'type': 'Title'},\n", - " {'text': '312,639', 'type': 'Uncategorized'},\n", - " {'text': '312,639', 'type': 'Uncategorized'},\n", - " {'text': 'Automotive vehicles', 'type': 'Title'},\n", - " {'text': '171,671', 'type': 'Uncategorized'},\n", - " {'text': '215,127', 'type': 'Uncategorized'},\n", - " {'text': 'Office and computer equipment', 'type': 'Title'},\n", - " {'text': '389,628', 'type': 'Uncategorized'},\n", - " {'text': '332,544', 'type': 'Uncategorized'},\n", - " {'text': 'Leasehold improvements', 'type': 'Title'},\n", - " {'text': '35,531', 'type': 'Uncategorized'},\n", - " {'text': '32,547', 'type': 'Uncategorized'},\n", - " {'text': 'Property and equipment, gross', 'type': 'Title'},\n", - " {'text': '909,469', 'type': 'Uncategorized'},\n", - " {'text': '892,857', 'type': 'Uncategorized'},\n", - " {'text': 'Less: accumulated depreciation', 'type': 
'NarrativeText'},\n", - " {'text': '(810,875', 'type': 'Uncategorized'},\n", - " {'text': '(776,133', 'type': 'Uncategorized'},\n", - " {'text': 'Property and equipment, net', 'type': 'Title'},\n", - " {'text': '98,594', 'type': 'Uncategorized'},\n", - " {'text': '116,724', 'type': 'Uncategorized'},\n", - " {'text': 'For the years ended December\\xa031, 2021 and 2020, depreciation expense related to property and equipment was $78,199 and $90,979, respectively.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'NOTE 6. Assets deployed at client locations',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Assets deployed at client locations, net consisted of the following at December\\xa031, 2021 and 2020:',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '2021', 'type': 'Uncategorized'},\n", - " {'text': '2020', 'type': 'Uncategorized'},\n", - " {'text': 'Enhanced table systems', 'type': 'Title'},\n", - " {'text': '1,139,827', 'type': 'Uncategorized'},\n", - " {'text': '890,560', 'type': 'Uncategorized'},\n", - " {'text': 'Less: accumulated depreciation', 'type': 'NarrativeText'},\n", - " {'text': '(779,092', 'type': 'Uncategorized'},\n", - " {'text': '(658,404', 'type': 'Uncategorized'},\n", - " {'text': 'Assets deployed at client location, net', 'type': 'NarrativeText'},\n", - " {'text': '360,735', 'type': 'Uncategorized'},\n", - " {'text': '232,156', 'type': 'Uncategorized'},\n", - " {'text': 'For the years ended December\\xa031, 2021 and 2020, depreciation expense related to assets deployed at client locations was $197,493 and $222,204, respectively.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '25', 'type': 'Uncategorized'},\n", - " {'text': 'NOTE 7. GOODWILL AND OTHER INTANGIBLE ASSETS', 'type': 'Title'},\n", - " {'text': 'Goodwill. 
A goodwill balance of $1,091,000\\xa0was created as a result of a transaction completed in October 2011 with Prime Table Games, LLC (“PTG”).',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Other intangible assets, net. Other intangible assets, net consisted of the following at December\\xa031, 2021 and 2020:',\n", - " 'type': 'NarrativeText'},\n", - " {'text': '2021', 'type': 'Uncategorized'},\n", - " {'text': '2020', 'type': 'Uncategorized'},\n", - " {'text': 'Patents', 'type': 'Title'},\n", - " {'text': '13,507,997', 'type': 'Uncategorized'},\n", - " {'text': '13,507,997', 'type': 'Uncategorized'},\n", - " {'text': 'Customer relationships', 'type': 'Title'},\n", - " {'text': '14,040,856', 'type': 'Uncategorized'},\n", - " {'text': '13,942,115', 'type': 'Uncategorized'},\n", - " {'text': 'Trademarks', 'type': 'Title'},\n", - " {'text': '2,880,967', 'type': 'Uncategorized'},\n", - " {'text': '2,880,967', 'type': 'Uncategorized'},\n", - " {'text': 'Non-compete agreements', 'type': 'Title'},\n", - " {'text': '660,000', 'type': 'Uncategorized'},\n", - " {'text': '660,000', 'type': 'Uncategorized'},\n", - " {'text': 'Software', 'type': 'Title'},\n", - " {'text': '283,340', 'type': 'Uncategorized'},\n", - " {'text': '183,415', 'type': 'Uncategorized'},\n", - " {'text': 'Other intangible assets, gross', 'type': 'Title'},\n", - " {'text': '31,373,160', 'type': 'Uncategorized'},\n", - " {'text': '31,174,494', 'type': 'Uncategorized'},\n", - " {'text': 'Less: accumulated amortization', 'type': 'NarrativeText'},\n", - " {'text': '(17,695,896', 'type': 'Uncategorized'},\n", - " {'text': '(15,087,598', 'type': 'Uncategorized'},\n", - " {'text': 'Other intangible assets, net', 'type': 'Title'},\n", - " {'text': '13,677,264', 'type': 'Uncategorized'},\n", - " {'text': '16,086,896', 'type': 'Uncategorized'},\n", - " {'text': 'For the years ended December\\xa031, 2021 and 2020, amortization expense related to the finite-lived intangible assets was $2,608,299 and $1,908,858 
respectively.',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Estimated future amortization expense is as follows:',\n", - " 'type': 'NarrativeText'},\n", - " {'text': 'Year Ended December\\xa031,', 'type': 'Uncategorized'},\n", - " {'text': 'Total', 'type': 'Title'},\n", - " {'text': '2022', 'type': 'Uncategorized'},\n", - " {'text': '2,325,888', 'type': 'Uncategorized'},\n", - " {'text': '2023', 'type': 'Uncategorized'},\n", - " {'text': '1,459,601', 'type': 'Uncategorized'},\n", - " {'text': '2024', 'type': 'Uncategorized'},\n", - " {'text': '1,444,126', 'type': 'Uncategorized'},\n", - " {'text': '2025', 'type': 'Uncategorized'},\n", - " {'text': '1,436,968', 'type': 'Uncategorized'},\n", - " {'text': '2026', 'type': 'Uncategorized'},\n", - " {'text': '1,436,968', 'type': 'Uncategorized'},\n", - " {'text': 'Thereafter', 'type': 'Title'},\n", - " {'text': '5,523,691', 'type': 'Uncategorized'},\n", - " {'text': 'Total amortization', 'type': 'Title'},\n", - " ...]" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ "convert_to_dict(elements)" - ] + ], + "outputs": [] } ], "metadata": { diff --git a/test_unstructured/partition/test_docx.py b/test_unstructured/partition/test_docx.py index bab761d4a0..9ddf30840d 100644 --- a/test_unstructured/partition/test_docx.py +++ b/test_unstructured/partition/test_docx.py @@ -45,11 +45,12 @@ PartitionStrategy, ) + # -- docx-file loading behaviors ----------------------------------------------------------------- def test_partition_docx_from_filename( - mock_document_file_path: str, expected_elements: list[Element] + mock_document_file_path: str, expected_elements: list[Element] ): elements = partition_docx(mock_document_file_path) @@ -62,7 +63,7 @@ def test_partition_docx_from_filename( def test_partition_docx_with_spooled_file( - mock_document_file_path: str, expected_elements: list[Text] + mock_document_file_path: str, expected_elements: list[Text] ): 
"""`partition_docx()` accepts a SpooledTemporaryFile as its `file` argument. @@ -73,7 +74,7 @@ def test_partition_docx_with_spooled_file( spooled_temp_file = tempfile.SpooledTemporaryFile() spooled_temp_file.write(test_file.read()) spooled_temp_file.seek(0) - elements = partition_docx(file=spooled_temp_file) + elements = partition_docx(file = spooled_temp_file) assert elements == expected_elements for element in elements: assert element.metadata.filename is None @@ -81,22 +82,22 @@ def test_partition_docx_with_spooled_file( def test_partition_docx_from_file(mock_document_file_path: str, expected_elements: list[Text]): with open(mock_document_file_path, "rb") as f: - elements = partition_docx(file=f) + elements = partition_docx(file = f) assert elements == expected_elements for element in elements: assert element.metadata.filename is None def test_partition_docx_uses_file_path_when_both_are_specified( - mock_document_file_path: str, expected_elements: list[Text] + mock_document_file_path: str, expected_elements: list[Text] ): f = io.BytesIO(b"abcde") - elements = partition_docx(filename=mock_document_file_path, file=f) + elements = partition_docx(filename = mock_document_file_path, file = f) assert elements == expected_elements def test_partition_docx_raises_with_neither(): - with pytest.raises(ValueError, match="either `filename` or `file` argument must be provided"): + with pytest.raises(ValueError, match = "either `filename` or `file` argument must be provided"): partition_docx() @@ -117,11 +118,11 @@ def test_parition_docx_from_team_chat(): @pytest.mark.parametrize("infer_table_structure", [True, False]) def test_partition_docx_infer_table_structure(infer_table_structure: bool): elements = partition_docx( - example_doc_path("fake_table.docx"), infer_table_structure=infer_table_structure + example_doc_path("fake_table.docx"), infer_table_structure = infer_table_structure ) table_element_has_text_as_html_field = ( - hasattr(elements[0].metadata, "text_as_html") - 
and elements[0].metadata.text_as_html is not None + hasattr(elements[0].metadata, "text_as_html") + and elements[0].metadata.text_as_html is not None ) assert table_element_has_text_as_html_field == infer_table_structure @@ -165,7 +166,7 @@ def test_partition_docx_includes_neither_page_breaks_nor_numbers_when_rendered_b breaks are a false-positive and will generally produce incorrect page numbers. """ elements = partition_docx( - example_doc_path("handbook-1p-no-rendered-page-breaks.docx"), include_page_breaks=True + example_doc_path("handbook-1p-no-rendered-page-breaks.docx"), include_page_breaks = True ) assert "PageBreak" not in [type(e).__name__ for e in elements] @@ -177,7 +178,7 @@ def test_partition_docx_includes_page_numbers_when_page_break_elements_are_suppr Only inclusion of PageBreak elements is affected by that option. """ - elements = partition_docx(example_doc_path("handbook-1p.docx"), include_page_breaks=False) + elements = partition_docx(example_doc_path("handbook-1p.docx"), include_page_breaks = False) assert "PageBreak" not in [type(e).__name__ for e in elements] assert elements[1].metadata.page_number == 1 @@ -186,7 +187,7 @@ def test_partition_docx_includes_page_numbers_when_page_break_elements_are_suppr def test_partition_docx_includes_page_break_elements_when_so_instructed(): elements = partition_docx( - example_doc_path("handbook-1p.docx"), include_page_breaks=True, starting_page_number=3 + example_doc_path("handbook-1p.docx"), include_page_breaks = True, starting_page_number = 3 ) assert "PageBreak" in [type(e).__name__ for e in elements] @@ -210,7 +211,7 @@ def test_partition_docx_detects_lists(): def test_partition_docx_from_filename_excludes_metadata_when_so_instructed(): - elements = partition_docx(example_doc_path("handbook-1p.docx"), include_metadata=False) + elements = partition_docx(example_doc_path("handbook-1p.docx"), include_metadata = False) assert all(e.metadata.to_dict() == {} for e in elements) @@ -218,7 +219,7 @@ def 
test_partition_docx_from_file_excludes_metadata_when_so_instructed(): with open(example_doc_path("simple.docx"), "rb") as f: assert all( element.metadata.to_dict() == {} - for element in partition_docx(file=f, include_metadata=False) + for element in partition_docx(file = f, include_metadata = False) ) @@ -226,13 +227,13 @@ def test_partition_docx_from_file_excludes_metadata_when_so_instructed(): def test_partition_docx_from_filename_prefers_metadata_filename_when_provided(): - elements = partition_docx(example_doc_path("simple.docx"), metadata_filename="test") + elements = partition_docx(example_doc_path("simple.docx"), metadata_filename = "test") assert all(element.metadata.filename == "test" for element in elements) def test_partition_docx_from_file_prefers_metadata_filename_when_provided(): with open(example_doc_path("simple.docx"), "rb") as f: - elements = partition_docx(file=f, metadata_filename="test") + elements = partition_docx(file = f, metadata_filename = "test") assert all(element.metadata.filename == "test" for element in elements) @@ -241,7 +242,7 @@ def test_partition_docx_from_file_prefers_metadata_filename_when_provided(): def test_partition_docx_metadata_date(mocker: MockFixture): mocker.patch( - "unstructured.partition.docx.get_last_modified_date", return_value="2029-07-05T09:24:28" + "unstructured.partition.docx.get_last_modified_date", return_value = "2029-07-05T09:24:28" ) elements = partition_docx(example_doc_path("fake.docx")) @@ -251,11 +252,11 @@ def test_partition_docx_metadata_date(mocker: MockFixture): def test_partition_docx_metadata_date_with_custom_metadata(mocker: MockFixture): mocker.patch( - "unstructured.partition.docx.get_last_modified_date", return_value="2023-11-01T14:13:07" + "unstructured.partition.docx.get_last_modified_date", return_value = "2023-11-01T14:13:07" ) elements = partition_docx( - example_doc_path("fake.docx"), metadata_last_modified="2020-07-05T09:24:28" + example_doc_path("fake.docx"), metadata_last_modified 
= "2020-07-05T09:24:28" ) assert elements[0].metadata.last_modified == "2020-07-05T09:24:28" @@ -264,11 +265,11 @@ def test_partition_docx_metadata_date_with_custom_metadata(mocker: MockFixture): def test_partition_docx_from_file_metadata_date(mocker: MockFixture): mocker.patch( "unstructured.partition.docx.get_last_modified_date_from_file", - return_value="2029-07-05T09:24:28", + return_value = "2029-07-05T09:24:28", ) with open(example_doc_path("fake.docx"), "rb") as f: - elements = partition_docx(file=f) + elements = partition_docx(file = f) assert elements[0].metadata.last_modified is None @@ -276,11 +277,11 @@ def test_partition_docx_from_file_metadata_date(mocker: MockFixture): def test_partition_docx_from_file_explicit_get_metadata_date(mocker: MockFixture): mocker.patch( "unstructured.partition.docx.get_last_modified_date_from_file", - return_value="2029-07-05T09:24:28", + return_value = "2029-07-05T09:24:28", ) with open(example_doc_path("fake.docx"), "rb") as f: - elements = partition_docx(file=f, date_from_file_object=True) + elements = partition_docx(file = f, date_from_file_object = True) assert elements[0].metadata.last_modified == "2029-07-05T09:24:28" @@ -288,11 +289,11 @@ def test_partition_docx_from_file_explicit_get_metadata_date(mocker: MockFixture def test_partition_docx_from_file_metadata_date_with_custom_metadata(mocker: MockFixture): mocker.patch( "unstructured.partition.docx.get_last_modified_date_from_file", - return_value="2023-11-01T14:13:07", + return_value = "2023-11-01T14:13:07", ) with open(example_doc_path("fake.docx"), "rb") as f: - elements = partition_docx(file=f, metadata_last_modified="2020-07-05T09:24:28") + elements = partition_docx(file = f, metadata_last_modified = "2020-07-05T09:24:28") assert elements[0].metadata.last_modified == "2020-07-05T09:24:28" @@ -303,7 +304,7 @@ def test_partition_docx_from_file_without_metadata_date(): sf = tempfile.SpooledTemporaryFile() sf.write(f.read()) sf.seek(0) - elements = 
partition_docx(file=sf, date_from_file_object=True) + elements = partition_docx(file = sf, date_from_file_object = True) assert elements[0].metadata.last_modified is None @@ -312,7 +313,7 @@ def test_partition_docx_from_file_without_metadata_date(): def test_get_emphasized_texts_from_paragraph( - opts_args: dict[str, Any], expected_emphasized_texts: list[dict[str, str]] + opts_args: dict[str, Any], expected_emphasized_texts: list[dict[str, str]] ): opts_args["file_path"] = example_doc_path("fake-doc-emphasized-text.docx") opts = DocxPartitionerOptions(**opts_args) @@ -335,7 +336,7 @@ def test_get_emphasized_texts_from_paragraph( def test_iter_table_emphasis( - opts_args: dict[str, Any], expected_emphasized_texts: list[dict[str, str]] + opts_args: dict[str, Any], expected_emphasized_texts: list[dict[str, str]] ): opts_args["file_path"] = example_doc_path("fake-doc-emphasized-text.docx") opts = DocxPartitionerOptions(**opts_args) @@ -348,9 +349,9 @@ def test_iter_table_emphasis( def test_table_emphasis( - opts_args: dict[str, Any], - expected_emphasized_text_contents: list[str], - expected_emphasized_text_tags: list[str], + opts_args: dict[str, Any], + expected_emphasized_text_contents: list[str], + expected_emphasized_text_tags: list[str], ): opts_args["file_path"] = example_doc_path("fake-doc-emphasized-text.docx") opts = DocxPartitionerOptions(**opts_args) @@ -364,8 +365,8 @@ def test_table_emphasis( def test_partition_docx_grabs_emphasized_texts( - expected_emphasized_text_contents: list[str], - expected_emphasized_text_tags: list[str], + expected_emphasized_text_contents: list[str], + expected_emphasized_text_tags: list[str], ): elements = partition_docx(example_doc_path("fake-doc-emphasized-text.docx")) @@ -416,7 +417,7 @@ def test_parse_category_depth_by_style(opts_args: dict[str, Any]): actual_depth = partitioner._parse_category_depth_by_style(paragraph) assert text in paragraph.text, f"paragraph[{[idx]}].text does not contain {text}" assert ( - actual_depth 
== depth + actual_depth == depth ), f"expected paragraph[{idx}] to have depth=={depth}, got {actual_depth}" @@ -441,7 +442,7 @@ def test_parse_category_depth_by_style_name(opts_args: dict[str, Any]): for idx, (depth, text) in enumerate(test_cases): assert ( - partitioner._parse_category_depth_by_style_name(text) == depth + partitioner._parse_category_depth_by_style_name(text) == depth ), f"test case {test_cases[idx]} failed" @@ -453,7 +454,7 @@ def test_parse_category_depth_by_style_ilvl(opts_args: dict[str, Any]): def test_add_chunking_strategy_on_partition_docx_default_args(): chunk_elements = partition_docx( - example_doc_path("handbook-1p.docx"), chunking_strategy="by_title" + example_doc_path("handbook-1p.docx"), chunking_strategy = "by_title" ) elements = partition_docx(example_doc_path("handbook-1p.docx")) chunks = chunk_by_title(elements) @@ -466,10 +467,10 @@ def test_add_chunking_strategy_on_partition_docx(): docx_path = example_doc_path("fake-doc-emphasized-text.docx") chunk_elements = partition_docx( - docx_path, chunking_strategy="by_title", max_characters=9, combine_text_under_n_chars=5 + docx_path, chunking_strategy = "by_title", max_characters = 9, combine_text_under_n_chars = 5 ) elements = partition_docx(docx_path) - chunks = chunk_by_title(elements, max_characters=9, combine_text_under_n_chars=5) + chunks = chunk_by_title(elements, max_characters = 9, combine_text_under_n_chars = 5) assert chunk_elements == chunks assert elements != chunk_elements @@ -483,20 +484,20 @@ def test_add_chunking_strategy_on_partition_docx(): def test_partition_docx_element_metadata_has_languages(): filename = example_doc_path("handbook-1p.docx") - elements = partition_docx(filename=filename) + elements = partition_docx(filename = filename) assert elements[0].metadata.languages == ["eng"] def test_partition_docx_respects_detect_language_per_element(): filename = example_doc_path("language-docs/eng_spa_mult.docx") - elements = partition_docx(filename=filename, 
detect_language_per_element=True) + elements = partition_docx(filename = filename, detect_language_per_element = True) langs = [element.metadata.languages for element in elements] assert langs == [["eng"], ["spa", "eng"], ["eng"], ["eng"], ["spa"]] def test_partition_docx_respects_languages_arg(): filename = example_doc_path("handbook-1p.docx") - elements = partition_docx(filename=filename, languages=["deu"]) + elements = partition_docx(filename = filename, languages = ["deu"]) assert elements[0].metadata.languages == ["deu"] @@ -504,8 +505,8 @@ def test_partition_docx_raises_TypeError_for_invalid_languages(): with pytest.raises(TypeError): filename = example_doc_path("handbook-1p.docx") partition_docx( - filename=filename, - languages="eng", # pyright: ignore[reportArgumentType] + filename = filename, + languages = "eng", # pyright: ignore[reportArgumentType] ) @@ -663,21 +664,21 @@ def expected_emphasized_texts(): def mock_document(): document = docx.Document() - document.add_paragraph("These are a few of my favorite things:", style="Heading 1") + document.add_paragraph("These are a few of my favorite things:", style = "Heading 1") # NOTE(robinson) - this should get picked up as a list item due to the • - document.add_paragraph("• Parrots", style="Normal") + document.add_paragraph("• Parrots", style = "Normal") # NOTE(robinson) - this should get dropped because it's empty - document.add_paragraph("• ", style="Normal") - document.add_paragraph("Hockey", style="List Bullet") + document.add_paragraph("• ", style = "Normal") + document.add_paragraph("Hockey", style = "List Bullet") # NOTE(robinson) - this should get dropped because it's empty - document.add_paragraph("", style="List Bullet") + document.add_paragraph("", style = "List Bullet") # NOTE(robinson) - this should get picked up as a title - document.add_paragraph("Analysis", style="Normal") + document.add_paragraph("Analysis", style = "Normal") # NOTE(robinson) - this should get dropped because it is empty - 
document.add_paragraph("", style="Normal") + document.add_paragraph("", style = "Normal") # NOTE(robinson) - this should get picked up as a narrative text - document.add_paragraph("This is my first thought. This is my second thought.", style="Normal") - document.add_paragraph("This is my third thought.", style="Body Text") + document.add_paragraph("This is my first thought. This is my second thought.", style = "Normal") + document.add_paragraph("This is my third thought.", style = "Body Text") # NOTE(robinson) - this should just be regular text document.add_paragraph("2023") # NOTE(robinson) - this should be an address @@ -726,16 +727,16 @@ class DescribeDocxPartitionerOptions: # -- .document ------------------------------- def it_loads_the_docx_document( - self, - request: FixtureRequest, - opts_args: dict[str, Any], + self, + request: FixtureRequest, + opts_args: dict[str, Any], ): document_ = instance_mock(request, Document) docx_Document_ = function_mock( - request, "unstructured.partition.docx.docx.Document", return_value=document_ + request, "unstructured.partition.docx.docx.Document", return_value = document_ ) _docx_file_prop_ = property_mock( - request, DocxPartitionerOptions, "_docx_file", return_value="abcde.docx" + request, DocxPartitionerOptions, "_docx_file", return_value = "abcde.docx" ) opts = DocxPartitionerOptions(**opts_args) @@ -749,7 +750,7 @@ def it_loads_the_docx_document( @pytest.mark.parametrize("arg_value", [True, False]) def it_knows_whether_to_emit_PageBreak_elements_as_part_of_the_output_element_stream( - self, arg_value: bool, opts_args: dict[str, Any] + self, arg_value: bool, opts_args: dict[str, Any] ): opts_args["include_page_breaks"] = arg_value opts = DocxPartitionerOptions(**opts_args) @@ -760,7 +761,7 @@ def it_knows_whether_to_emit_PageBreak_elements_as_part_of_the_output_element_st @pytest.mark.parametrize("arg_value", [True, False]) def it_knows_whether_to_include_text_as_html_in_Table_metadata( - self, arg_value: bool, 
opts_args: dict[str, Any] + self, arg_value: bool, opts_args: dict[str, Any] ): opts_args["infer_table_structure"] = arg_value opts = DocxPartitionerOptions(**opts_args) @@ -770,7 +771,7 @@ def it_knows_whether_to_include_text_as_html_in_Table_metadata( # -- .increment_page_number() ---------------- def it_generates_a_PageBreak_element_when_the_page_number_is_incremented( - self, opts_args: dict[str, Any] + self, opts_args: dict[str, Any] ): opts = DocxPartitionerOptions(**opts_args) @@ -782,7 +783,7 @@ def it_generates_a_PageBreak_element_when_the_page_number_is_incremented( next(page_break_iter) def but_it_does_not_generate_a_PageBreak_element_when_include_page_breaks_option_is_off( - self, opts_args: dict[str, Any] + self, opts_args: dict[str, Any] ): opts_args["include_page_breaks"] = False opts = DocxPartitionerOptions(**opts_args) @@ -796,7 +797,7 @@ def but_it_does_not_generate_a_PageBreak_element_when_include_page_breaks_option # -- .last_modified -------------------------- def it_gets_the_last_modified_date_of_the_document_from_the_caller_when_provided( - self, opts_args: dict[str, Any] + self, opts_args: dict[str, Any] ): opts_args["metadata_last_modified"] = "2024-03-05T17:02:53" opts = DocxPartitionerOptions(**opts_args) @@ -804,7 +805,7 @@ def it_gets_the_last_modified_date_of_the_document_from_the_caller_when_provided assert opts.last_modified == "2024-03-05T17:02:53" def and_it_falls_back_to_the_last_modified_date_of_the_file_when_a_path_is_provided( - self, opts_args: dict[str, Any], get_last_modified_date_: Mock + self, opts_args: dict[str, Any], get_last_modified_date_: Mock ): opts_args["file_path"] = "a/b/document.docx" get_last_modified_date_.return_value = "2024-04-02T20:32:35" @@ -816,7 +817,7 @@ def and_it_falls_back_to_the_last_modified_date_of_the_file_when_a_path_is_provi assert last_modified == "2024-04-02T20:32:35" def and_it_falls_back_to_the_last_modified_date_of_the_file_when_a_file_like_object_is_provided( - self, opts_args: 
dict[str, Any], get_last_modified_date_from_file_: Mock + self, opts_args: dict[str, Any], get_last_modified_date_from_file_: Mock ): file = io.BytesIO(b"abcdefg") opts_args["file"] = file @@ -830,7 +831,7 @@ def and_it_falls_back_to_the_last_modified_date_of_the_file_when_a_file_like_obj assert last_modified == "2024-04-02T20:42:07" def but_it_falls_back_to_None_for_the_last_modified_date_when_date_from_file_object_is_False( - self, opts_args: dict[str, Any], get_last_modified_date_from_file_: Mock + self, opts_args: dict[str, Any], get_last_modified_date_from_file_: Mock ): file = io.BytesIO(b"abcdefg") opts_args["file"] = file @@ -846,7 +847,7 @@ def but_it_falls_back_to_None_for_the_last_modified_date_when_date_from_file_obj # -- .metadata_file_path --------------------- def it_uses_the_user_provided_file_path_in_the_metadata_when_provided( - self, opts_args: dict[str, Any] + self, opts_args: dict[str, Any] ): opts_args["file_path"] = "x/y/z.docx" opts_args["metadata_file_path"] = "a/b/c.docx" @@ -856,7 +857,7 @@ def it_uses_the_user_provided_file_path_in_the_metadata_when_provided( @pytest.mark.parametrize("file_path", ["u/v/w.docx", None]) def and_it_falls_back_to_the_document_file_path_otherwise( - self, file_path: str | None, opts_args: dict[str, Any] + self, file_path: str | None, opts_args: dict[str, Any] ): opts_args["file_path"] = file_path opts_args["metadata_file_path"] = None @@ -871,18 +872,18 @@ def and_it_falls_back_to_the_document_file_path_otherwise( [(7, True, 7), (1, False, None)], ) def it_reports_None_when_no_rendered_page_breaks_are_found_in_document( - self, - request: FixtureRequest, - opts_args: dict[str, Any], - page_count: int, - document_contains_pagebreaks: bool, - expected_value: int | None, + self, + request: FixtureRequest, + opts_args: dict[str, Any], + page_count: int, + document_contains_pagebreaks: bool, + expected_value: int | None, ): _document_contains_pagebreaks_prop_ = property_mock( request, DocxPartitionerOptions, 
"_document_contains_pagebreaks", - return_value=document_contains_pagebreaks, + return_value = document_contains_pagebreaks, ) opts = DocxPartitionerOptions(**opts_args) opts._page_counter = page_count @@ -905,9 +906,9 @@ def it_keeps_track_of_the_page_number(self, opts_args: dict[str, Any]): assert opts.page_number == 3 def it_assigns_the_correct_page_number_when_starting_page_number_is_given( - self, opts_args: dict[str, Any] + self, opts_args: dict[str, Any] ): - opts = DocxPartitionerOptions(**opts_args, starting_page_number=3) + opts = DocxPartitionerOptions(**opts_args, starting_page_number = 3) assert opts.page_number == 3 list(opts.increment_page_number()) @@ -920,7 +921,7 @@ def it_assigns_the_correct_page_number_when_starting_page_number_is_given( [(None, "hi_res"), (PartitionStrategy.FAST, "fast"), (PartitionStrategy.HI_RES, "hi_res")], ) def it_knows_which_partitioning_strategy_to_use( - self, opts_args: dict[str, Any], arg_value: str, expected_value: str + self, opts_args: dict[str, Any], arg_value: str, expected_value: str ): opts_args["strategy"] = arg_value opts = DocxPartitionerOptions(**opts_args) @@ -933,7 +934,7 @@ def it_knows_which_partitioning_strategy_to_use( ("file_name", "expected_value"), [("page-breaks.docx", True), ("teams_chat.docx", False)] ) def it_knows_whether_the_document_contains_page_breaks( - self, opts_args: dict[str, Any], file_name: str, expected_value: bool + self, opts_args: dict[str, Any], file_name: str, expected_value: bool ): opts_args["file_path"] = example_doc_path(file_name) opts = DocxPartitionerOptions(**opts_args) @@ -943,7 +944,7 @@ def it_knows_whether_the_document_contains_page_breaks( # -- ._docx_file ----------------------------- def it_uses_the_path_to_open_the_presentation_when_file_path_is_provided( - self, opts_args: dict[str, Any] + self, opts_args: dict[str, Any] ): opts_args["file_path"] = "l/m/n.docx" opts = DocxPartitionerOptions(**opts_args) @@ -951,7 +952,7 @@ def 
it_uses_the_path_to_open_the_presentation_when_file_path_is_provided( assert opts._docx_file == "l/m/n.docx" def and_it_uses_a_BytesIO_file_to_replaces_a_SpooledTemporaryFile_provided( - self, opts_args: dict[str, Any] + self, opts_args: dict[str, Any] ): spooled_temp_file = tempfile.SpooledTemporaryFile() spooled_temp_file.write(b"abcdefg") @@ -965,7 +966,7 @@ def and_it_uses_a_BytesIO_file_to_replaces_a_SpooledTemporaryFile_provided( assert docx_file.getvalue() == b"abcdefg" def and_it_uses_the_provided_file_directly_when_not_a_SpooledTemporaryFile( - self, opts_args: dict[str, Any] + self, opts_args: dict[str, Any] ): file = io.BytesIO(b"abcdefg") opts_args["file"] = file @@ -978,11 +979,11 @@ def and_it_uses_the_provided_file_directly_when_not_a_SpooledTemporaryFile( assert docx_file.getvalue() == b"abcdefg" def but_it_raises_ValueError_when_neither_a_file_path_or_file_is_provided( - self, opts_args: dict[str, Any] + self, opts_args: dict[str, Any] ): opts = DocxPartitionerOptions(**opts_args) - with pytest.raises(ValueError, match="No DOCX document specified, either `filename` or "): + with pytest.raises(ValueError, match = "No DOCX document specified, either `filename` or "): opts._docx_file # -- fixtures -------------------------------------------------------------------------------- @@ -1325,3 +1326,89 @@ def it_includes_table_cell_text_in_Footer_text(self, opts_args: dict[str, Any]): element = next(footer_iter) assert element.text == "para1\ncell1 a b c d e f\npara2" + + +def create_test_docx(file_path): + from docx import Document as DocxDocument + + doc = DocxDocument() + + # Add headings and body text + doc.add_heading('春节放假通知', level = 1) + doc.add_paragraph('\n') + doc.add_paragraph('春节放假从大年 30 开始\n共计放假一个月\n比法定假期长三周\n') + + doc.add_heading('标题 2', level = 2) + doc.add_heading('标题 3', level = 3) + doc.add_heading('又一个标题 2', level = 2) + + doc.add_paragraph('正文普通\n') + + # Add a bulleted list + doc.add_paragraph('一组\n', style = 'ListBullet') + doc.add_paragraph('二组\n', style = 
'ListBullet') + doc.add_paragraph('三组\n', style = 'ListBullet') + + doc.add_paragraph('继续正文\n') + + # Save the document + doc.save(file_path) + + +def test_partition_zh_docs() -> None: + """ + Regression test for NarrativeText being misclassified as Title when partitioning Chinese DOCX documents + """ + with tempfile.NamedTemporaryFile(suffix = ".docx", delete = False) as tmp: + create_test_docx(tmp.name) + elements = partition_docx(tmp.name) + + # Print the partition results for inspection + for element in elements: + print(element) + + # Assertions + assert any('春节放假通知' in element.text for element in elements) + assert any('春节放假从大年 30 开始' in element.text for element in elements) + assert any('标题 2' in element.text for element in elements) + assert any('标题 3' in element.text for element in elements) + assert any('又一个标题 2' in element.text for element in elements) + assert any('正文普通' in element.text for element in elements) + assert any('一组' in element.text for element in elements) + assert any('二组' in element.text for element in elements) + assert any('三组' in element.text for element in elements) + assert any('继续正文' in element.text for element in elements) + assert list(filter(lambda x: '正文普通' in x.text, elements))[0].category == 'NarrativeText' + assert list(filter(lambda x: '一组' in x.text, elements))[0].category == 'ListItem' + assert list(filter(lambda x: '继续正文' in x.text, elements))[0].category == 'NarrativeText' + + +def test_partition_zh_docs_as_eng() -> None: + """ + Regression test for NarrativeText being misclassified as Title when partitioning Chinese DOCX documents + + When the language is explicitly set to English, the classifier is expected to be misled, so Chinese + narrative text is misclassified as Title. 
+ """ + with tempfile.NamedTemporaryFile(suffix = ".docx", delete = False) as tmp: + create_test_docx(tmp.name) + elements = partition_docx(tmp.name, languages=["eng"]) + + # Print the partition results for inspection + for element in elements: + print(element) + + # Assertions + assert any('春节放假通知' in element.text for element in elements) + assert any('春节放假从大年 30 开始' in element.text for element in elements) + assert any('标题 2' in element.text for element in elements) + assert any('标题 3' in element.text for element in elements) + assert any('又一个标题 2' in element.text for element in elements) + assert any('正文普通' in element.text for element in elements) + assert any('一组' in element.text for element in elements) + assert any('二组' in element.text for element in elements) + assert any('三组' in element.text for element in elements) + assert any('继续正文' in element.text for element in elements) + assert list(filter(lambda x: '正文普通' in x.text, elements))[0].category == 'Title' + assert list(filter(lambda x: '一组' in x.text, elements))[0].category == 'ListItem' + assert list(filter(lambda x: '继续正文' in x.text, elements))[0].category == 'Title' diff --git a/test_unstructured/partition/test_md.py b/test_unstructured/partition/test_md.py index e3484c753d..e7c7799820 100644 --- a/test_unstructured/partition/test_md.py +++ b/test_unstructured/partition/test_md.py @@ -323,3 +323,52 @@ def test_partition_md_parse_table(): elements = partition_md(filename=filename) assert len(elements) > 0 assert elements[0].category == ElementType.TABLE + + +def test_partition_zh_md() -> None: + """ + Regression test for NarrativeText being misclassified as Title when partitioning Chinese Markdown documents + """ + filename = example_doc_path("zho_md_partition.md") + elements = partition_md(filename=filename) + assert len(elements) > 0 + # Assertions + assert any('春节放假通知' in element.text for element in elements) + assert any('春节放假从大年 30 开始' in element.text for element in elements) + assert any('标题 2' in 
element.text for element in elements) + assert 
any('标题 3' in element.text for element in elements) + assert any('Another Title 2' in element.text for element in elements) + assert any('正文开始' in element.text for element in elements) + assert any('一组1' in element.text for element in elements) + assert any('一组2' in element.text for element in elements) + assert any('一组3' in element.text for element in elements) + assert any('正文结束' in element.text for element in elements) + assert list(filter(lambda x: '正文开始' in x.text, elements))[0].category == 'NarrativeText' + assert list(filter(lambda x: '一组' in x.text, elements))[0].category == 'ListItem' + assert list(filter(lambda x: '正文结束' in x.text, elements))[0].category == 'NarrativeText' + + +def test_partition_zh_docs_as_eng() -> None: + """ + Regression test for NarrativeText being misclassified as Title when partitioning Chinese Markdown documents + + When the language is explicitly set to English, the classifier is expected to be misled, so Chinese + narrative text is misclassified as Title. + """ + filename = example_doc_path("zho_md_partition.md") + elements = partition_md(filename=filename, languages=["eng"]) + assert len(elements) > 0 + # Assertions + assert any('春节放假通知' in element.text for element in elements) + assert any('春节放假从大年 30 开始' in element.text for element in elements) + assert any('标题 2' in element.text for element in elements) + assert any('标题 3' in element.text for element in elements) + assert any('Another Title 2' in element.text for element in elements) + assert any('正文开始' in element.text for element in elements) + assert any('一组1' in element.text for element in elements) + assert any('一组2' in element.text for element in elements) + assert any('一组3' in element.text for element in elements) + assert any('正文结束' in element.text for element in elements) + assert list(filter(lambda x: '正文开始' in x.text, elements))[0].category == 'Title' + assert list(filter(lambda x: '一组' in x.text, elements))[0].category == 'ListItem' + assert 
list(filter(lambda x: '正文结束' in x.text, 
elements))[0].category == 'Title' diff --git a/unstructured/documents/base.py b/unstructured/documents/base.py index a2c729b74e..77627c1c4d 100644 --- a/unstructured/documents/base.py +++ b/unstructured/documents/base.py @@ -9,9 +9,10 @@ class Document(ABC): """The base class for all document types. A document consists of an ordered list of pages.""" - def __init__(self): + def __init__(self, languages: Optional[list[str]] = None): self._pages: Optional[List[Page]] = None self._elements: Optional[List[Element]] = None + self._language: list[str] = languages or ["auto"] def __str__(self) -> str: return "\n\n".join([str(page) for page in self.pages]) diff --git a/unstructured/documents/html.py b/unstructured/documents/html.py index 95239cba0b..eeb1cd22c1 100644 --- a/unstructured/documents/html.py +++ b/unstructured/documents/html.py @@ -5,6 +5,7 @@ import sys from typing import Any, Callable, Dict, Iterator, List, Optional, Sequence, Tuple, cast +from unstructured.partition.lang import detect_languages from unstructured.partition.utils.constants import HTML_MAX_PREDECESSOR_LEN if sys.version_info < (3, 8): @@ -139,9 +140,12 @@ def __init__( stylesheet: Optional[str] = None, parser: VALID_PARSERS = None, assemble_articles: bool = True, + languages: Optional[list[str]] = None, + **kwargs: Any, ): self.assembled_articles = assemble_articles - super().__init__(stylesheet=stylesheet, parser=parser) + super().__init__(stylesheet=stylesheet, parser=parser, **kwargs) + self._languages: list[str] = languages or ["auto"] def _parse_pages_from_element_tree(self) -> List[Page]: """Parse HTML elements into pages. @@ -164,19 +168,22 @@ def _parse_pages_from_element_tree(self) -> List[Page]: for article in articles: descendanttag_elems: Tuple[etree._Element, ...] 
= () for tag_elem in article.iter(): + elem_languages = self._languages \ + if "auto" not in self._languages or not tag_elem.text \ + else detect_languages(tag_elem.text) if tag_elem in descendanttag_elems: # Prevent repeating something that's been flagged as text as we chase it # down a chain continue if _is_text_tag(tag_elem): - _page_elements, descendanttag_elems = _process_text_tag(tag_elem) + _page_elements, descendanttag_elems = _process_text_tag(tag_elem, languages=elem_languages) page.elements.extend(_page_elements) elif _is_container_with_text(tag_elem): tag_elem_tail = tag_elem.tail.strip() if tag_elem.tail else None if tag_elem_tail: - _page_elements, descendanttag_elems = _process_text_tag(tag_elem, False) + _page_elements, descendanttag_elems = _process_text_tag(tag_elem, False, languages=elem_languages) page.elements.extend(_page_elements) # NOTE(christine): generate a separate element using a tag tail @@ -185,6 +192,7 @@ def _parse_pages_from_element_tree(self) -> List[Page]: tag_elem.tag, (), depth=0, + languages=elem_languages, ) else: links = _get_links_from_tag(tag_elem) @@ -196,6 +204,7 @@ def _parse_pages_from_element_tree(self) -> List[Page]: depth=0, links=links, emphasized_texts=emphasized_texts, + languages=elem_languages, ) if element is not None: page.elements.append(element) @@ -407,6 +416,7 @@ def _get_emphasized_texts_from_tag(tag_elem: etree._Element) -> List[Dict[str, s def _parse_tag( tag_elem: etree._Element, include_tail_text: bool = True, + languages: Optional[list[str]] = None, ) -> Optional[Element]: """Parses `tag_elem` to a Text element if it contains qualifying text. 
@@ -442,6 +452,7 @@ def _parse_tag( links=links, emphasized_texts=emphasized_texts, depth=depth, + languages=languages, ) @@ -450,6 +461,7 @@ def _text_to_element( tag: str, ancestortags: Tuple[str, ...], depth: int, + languages: list[str], links: List[Link] = [], emphasized_texts: List[Dict[str, str]] = [], ) -> Optional[Element]: @@ -483,7 +495,7 @@ def _text_to_element( if len(text) < 2: return None - elif is_narrative_tag(text, tag): + elif is_narrative_tag(text, tag, languages=languages): return HTMLNarrativeText( text, tag=tag, @@ -491,7 +503,7 @@ def _text_to_element( links=links, emphasized_texts=emphasized_texts, ) - elif is_heading_tag(tag) or is_possible_title(text): + elif is_heading_tag(tag) or is_possible_title(text, languages=languages): return HTMLTitle( text, tag=tag, @@ -531,9 +543,9 @@ def _is_container_with_text(tag_elem: etree._Element) -> bool: return True -def is_narrative_tag(text: str, tag: str) -> bool: +def is_narrative_tag(text: str, tag: str, languages: Optional[list[str]] = None) -> bool: """Uses tag information to infer whether text is narrative.""" - return tag not in HEADING_TAGS and is_possible_narrative_text(text) + return tag not in HEADING_TAGS and is_possible_narrative_text(text, languages=languages) def is_heading_tag(tag: str) -> bool: @@ -615,6 +627,7 @@ def _is_text_tag( def _process_text_tag( tag_elem: etree._Element, include_tail_text: bool = True, + languages: Optional[list[str]] = None, ) -> tuple[list[Element], tuple[etree._Element]]: """Produces a document element from `tag_elem`.""" @@ -622,12 +635,12 @@ def _process_text_tag( if _has_break_tags(tag_elem): flattened_elems = _unfurl_break_tags(tag_elem) for _tag_elem in flattened_elems: - element = _parse_tag(_tag_elem, include_tail_text) + element = _parse_tag(_tag_elem, include_tail_text, languages=languages) if element is not None: page_elements.append(element) else: - element = _parse_tag(tag_elem, include_tail_text) + element = _parse_tag(tag_elem, 
include_tail_text, languages=languages) if element is not None: page_elements.append(element) descendant_tag_elems = tuple(tag_elem.iterdescendants()) diff --git a/unstructured/documents/xml.py b/unstructured/documents/xml.py index 8decc24aec..69f55dc458 100644 --- a/unstructured/documents/xml.py +++ b/unstructured/documents/xml.py @@ -18,6 +18,8 @@ def __init__( self, stylesheet: Optional[str] = None, parser: VALID_PARSERS = None, + languages: Optional[list[str]] = None, + **kwargs: Any, ): """Class for parsing XML documents. XML documents are parsed using lxml. @@ -42,7 +44,7 @@ def __init__( self.stylesheet = stylesheet self.parser = parser self.document_tree = None - super().__init__() + super().__init__(languages=languages) def _parse_pages_from_element_tree(self) -> List[Page]: raise NotImplementedError @@ -99,11 +101,12 @@ def from_string( text: str, parser: VALID_PARSERS = None, stylesheet: Optional[str] = None, + languages: Optional[list[str]] = None, **kwargs: Any, ) -> Self: """Supports reading in an XML file as a raw string rather than as a file.""" logger.info("Reading document from string ...") - doc = cls(parser=parser, stylesheet=stylesheet, **kwargs) + doc = cls(parser=parser, stylesheet=stylesheet, languages=languages, **kwargs) doc._read_xml(text) return doc @@ -114,8 +117,9 @@ def from_file( parser: VALID_PARSERS = None, stylesheet: Optional[str] = None, encoding: Optional[str] = None, + languages: Optional[list[str]] = None, **kwargs: Any, ) -> Self: _, content = read_txt_file(filename=filename, encoding=encoding) - return cls.from_string(content, parser=parser, stylesheet=stylesheet, **kwargs) + return cls.from_string(content, parser=parser, stylesheet=stylesheet, languages=languages, **kwargs) diff --git a/unstructured/partition/docx.py b/unstructured/partition/docx.py index e29a7089f2..531a664222 100644 --- a/unstructured/partition/docx.py +++ b/unstructured/partition/docx.py @@ -47,7 +47,7 @@ get_last_modified_date, 
get_last_modified_date_from_file, ) -from unstructured.partition.lang import apply_lang_metadata +from unstructured.partition.lang import apply_lang_metadata, detect_languages from unstructured.partition.text_type import ( is_bulleted_text, is_email_address, @@ -89,6 +89,8 @@ def partition_docx( A string defining the target filename path. file A file-like object using "rb" mode --> open(filename, "rb"). + detect_language_per_element + Detect language per element instead of at the document level. include_page_breaks When True, add a `PageBreak` element to the element-stream when a page-break is detected in the document. Note that not all DOCX files include page-break information. @@ -127,6 +129,7 @@ def partition_docx( metadata_last_modified=metadata_last_modified, starting_page_number=starting_page_number, strategy=strategy, + languages=languages, ) elements = _DocxPartitioner.iter_document_elements(opts) @@ -154,6 +157,7 @@ def __init__( metadata_last_modified: Optional[str], starting_page_number: int = 1, strategy: str | None = None, + languages: Optional[list[str]] = None, ): self._date_from_file_object = date_from_file_object self._file = file @@ -165,12 +169,18 @@ def __init__( self._strategy = strategy # -- options object maintains page-number state -- self._page_counter = starting_page_number + # -- languages is a list of languages to use for category detection -- + self._languages: list[str] = languages or ["auto"] @lazyproperty def document(self) -> Document: """The python-docx `Document` object loaded from file or filename.""" return docx.Document(self._docx_file) + @property + def languages(self) -> list[str]: + return self._languages + @lazyproperty def include_page_breaks(self) -> bool: """When True, include `PageBreak` elements in element-stream. 
@@ -864,9 +874,19 @@ def _parse_paragraph_text_for_element_type(self, paragraph: Paragraph) -> Option return Address if is_email_address(text): return EmailAddress - if is_possible_narrative_text(text): + if is_possible_narrative_text( + text, + languages=self._opts.languages + if "auto" not in self._opts.languages + else detect_languages(text, self._opts.languages) + ): return NarrativeText - if is_possible_title(text): + if is_possible_title( + text, + languages=self._opts.languages + if "auto" not in self._opts.languages + else detect_languages(text, self._opts.languages) + ): return Title return None diff --git a/unstructured/partition/html.py b/unstructured/partition/html.py index 1615b1755b..0fd9d72064 100644 --- a/unstructured/partition/html.py +++ b/unstructured/partition/html.py @@ -109,6 +109,7 @@ def partition_html( parser=parser, encoding=encoding, assemble_articles=html_assemble_articles, + languages=languages, ) elif file is not None: @@ -120,6 +121,7 @@ def partition_html( file_text, parser=parser, assemble_articles=html_assemble_articles, + languages=languages, ) elif text is not None: @@ -128,6 +130,7 @@ def partition_html( _text, parser=parser, assemble_articles=html_assemble_articles, + languages=languages, ) elif url is not None: @@ -139,7 +142,7 @@ def partition_html( if not content_type.startswith("text/html"): raise ValueError(f"Expected content type text/html. 
Got {content_type}.") - document = HTMLDocument.from_string(response.text, parser=parser) + document = HTMLDocument.from_string(response.text, parser=parser, languages=languages) if skip_headers_and_footers: document = filter_footer_and_header(document) @@ -153,6 +156,7 @@ def partition_html( last_modification_date=metadata_last_modified or last_modification_date, source_format=source_format if source_format else None, detection_origin=detection_origin, + languages=languages, **kwargs, ), languages=languages, diff --git a/unstructured/partition/lang.py b/unstructured/partition/lang.py index 18fc6c05db..90be172cf9 100644 --- a/unstructured/partition/lang.py +++ b/unstructured/partition/lang.py @@ -293,12 +293,15 @@ def _get_all_tesseract_langcodes_with_prefix(prefix: str) -> list[str]: def detect_languages( text: str, - languages: Optional[list[str]] = ["auto"], + languages: Optional[list[str]] = None, ) -> Optional[list[str]]: """ Detects the list of languages present in the text (in the default "auto" mode), or formats and passes through the user inputted document languages if provided. """ + if languages is None: + languages = ["auto"] + if not isinstance(languages, list): raise TypeError( 'The language parameter must be a list of language codes as strings, ex. 
["eng"]', @@ -412,7 +415,7 @@ def apply_lang_metadata( else: for e in elements: if hasattr(e, "text"): - e.metadata.languages = detect_languages(e.text) + e.metadata.languages = detect_languages(e.text, languages=e.metadata.languages) yield e else: yield e diff --git a/unstructured/partition/text_type.py b/unstructured/partition/text_type.py index 2989c24728..1528a86a05 100644 --- a/unstructured/partition/text_type.py +++ b/unstructured/partition/text_type.py @@ -34,7 +34,7 @@ def is_possible_narrative_text( text: str, cap_threshold: float = 0.5, non_alpha_threshold: float = 0.5, - languages: List[str] = ["eng"], + languages: Optional[list[str]] = None, language_checks: bool = False, ) -> bool: """Checks to see if the text passes all of the checks for a narrative text section. @@ -57,6 +57,8 @@ def is_possible_narrative_text( If True, conducts checks that are specific to the chosen language. Turn on for more accurate partitioning and off for faster processing. """ + if languages is None: + languages = ["eng"] _language_checks = os.environ.get("UNSTRUCTURED_LANGUAGE_CHECKS") if _language_checks is not None: language_checks = _language_checks.lower() == "true" @@ -77,7 +79,11 @@ def is_possible_narrative_text( cap_threshold = float( os.environ.get("UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD", cap_threshold), ) - if exceeds_cap_ratio(text, threshold=cap_threshold): + # NOTE: exceeds_cap_ratio is designed for english text, so we only use it if the language is english. + # For caution's sake, we will temporarily use "eng" in languages for judgment, that is, as long as English appears, + # we will make a judgment. In the future, we may need to modify it to where only pure English is needed for + # exceeds_cap_ratio judgment. + if "eng" in languages and exceeds_cap_ratio(text, threshold=cap_threshold): trace_logger.detail(f"Not narrative. 
Text exceeds cap ratio {cap_threshold}:\n\n{text}") # type: ignore # noqa: E501 return False From e2bbcc80ff52aa0b1b11398f7039f3ae616ac7da Mon Sep 17 00:00:00 2001 From: JQQ Date: Fri, 24 May 2024 13:56:35 +0800 Subject: [PATCH 02/20] Handle issues in several code paths caused by `languages` values such as [""]. Ran the full test suite; the pass rate is essentially the same as on the main branch. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- examples/pgvector/pgvector.ipynb | 96 ++--- examples/training/1-Intro to Bricks.ipynb | 379 +++--------------- examples/training/2-File Exploration.ipynb | 246 +----------- test_unstructured/documents/test_html.py | 4 +- test_unstructured/partition/test_auto.py | 2 + test_unstructured/partition/test_odt.py | 3 +- test_unstructured/partition/test_text_type.py | 4 +- unstructured/documents/base.py | 14 +- unstructured/documents/html.py | 7 +- unstructured/partition/epub.py | 1 + unstructured/partition/lang.py | 2 +- unstructured/partition/text_type.py | 17 +- 12 files changed, 141 insertions(+), 634 deletions(-) diff --git a/examples/pgvector/pgvector.ipynb b/examples/pgvector/pgvector.ipynb index 9c94a5a109..d20a5a9bef 100644 --- a/examples/pgvector/pgvector.ipynb +++ b/examples/pgvector/pgvector.ipynb @@ -34,7 +34,6 @@ "execution_count": 1, "id": "8a538b14", "metadata": {}, - "outputs": [], "source": [ "from sqlalchemy import (\n", " create_engine,\n", @@ -49,37 +48,37 @@ ")\n", "from pgvector.sqlalchemy import Vector\n", "from sqlalchemy.orm import declarative_base, sessionmaker" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 2, "id": "91893826", 
"metadata": {}, - "outputs": [], "source": [ "ADA_TOKEN_COUNT = 1536" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 3, "id": "6614726b", "metadata": {}, - "outputs": [], "source": [ "connection_string = \"postgresql://localhost:5432/postgres\"\n", "engine = create_engine(connection_string)\n", "\n", "Base = declarative_base()" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 4, "id": "bb2ffda8", "metadata": {}, - "outputs": [], "source": [ "class Element(Base):\n", " __tablename__ = \"unstructured_elements\"\n", @@ -94,28 +93,29 @@ " sent_to = Column(ARRAY(String))\n", " sent_from = Column(ARRAY(String))\n", " subject = Column(String)" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 5, "id": "e9130393", "metadata": {}, - "outputs": [], "source": [ "Base.metadata.create_all(engine)" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 6, "id": "d59aa418", "metadata": {}, - "outputs": [], "source": [ "Session = sessionmaker(bind=engine)\n", "session = Session()" - ] + ], + "outputs": [] }, { "cell_type": "markdown", @@ -132,31 +132,30 @@ "execution_count": 7, "id": "b08244dc", "metadata": {}, - "outputs": [], "source": [ "import datetime\n", "import os\n", "\n", "from langchain.embeddings.openai import OpenAIEmbeddings\n", "from unstructured.partition.email import partition_email" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 8, "id": "de97a526", "metadata": {}, - "outputs": [], "source": [ "EXAMPLE_DOCS_DIRECTORY = \"../../example-docs\"" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 9, "id": "5a18fd82", "metadata": {}, - "outputs": [], "source": [ "elements = []\n", "for f in os.listdir(EXAMPLE_DOCS_DIRECTORY):\n", @@ -165,28 +164,29 @@ "\n", " filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, f)\n", " elements.extend(partition_email(filename=filename))" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 10, "id": 
"b69915f0", "metadata": {}, - "outputs": [], "source": [ "embedding_function = OpenAIEmbeddings()" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 11, "id": "2e6537d3", "metadata": {}, - "outputs": [], "source": [ "for element in elements:\n", " element.embedding = embedding_function.embed_query(element.text)" - ] + ], + "outputs": [] }, { "cell_type": "markdown", @@ -203,7 +203,6 @@ "execution_count": 12, "id": "a47c99d3", "metadata": {}, - "outputs": [], "source": [ "items_to_add = []\n", "for element in elements:\n", @@ -219,18 +218,19 @@ " subject=element.metadata.subject,\n", " )\n", " )" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 13, "id": "5d6bbf43", "metadata": {}, - "outputs": [], "source": [ "session.add_all(items_to_add)\n", "session.commit()" - ] + ], + "outputs": [] }, { "cell_type": "markdown", @@ -247,29 +247,16 @@ "execution_count": 14, "id": "7ba10d65", "metadata": {}, - "outputs": [], "source": [ "vector = embedding_function.embed_query(\"email\")" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 16, "id": "25ed06a2", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "1 This is a test email to use for unit tests.\n", - "13 This is a test email to use for unit tests.\n", - "5 This is a test email to use for unit tests.\n", - "9 The unstructured logo is attached to this email.\n", - "19 It includes:\n" - ] - } - ], "source": [ "query = (\n", " session.query(Element)\n", @@ -280,7 +267,8 @@ "\n", "for element in query:\n", " print(element.id, element.text)" - ] + ], + "outputs": [] }, { "cell_type": "markdown", @@ -295,30 +283,17 @@ "execution_count": 17, "id": "532cb832", "metadata": {}, - "outputs": [], "source": [ "vector = embedding_function.embed_query(\"violets\")\n", "decay_rate = 0.10" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 18, "id": "2ebff5da", "metadata": {}, - "outputs": [ - { - "name": 
"stdout", - "output_type": "stream", - "text": [ - "0.0050977532596662945 - Violets are blue\n", - "0.001773595479626926 - Violets are blue\n", - "0.001773595479626926 - Violets are blue\n", - "0.0011421532895244265 - Roses are red\n", - "0.00029501066142995373 - Roses are red\n" - ] - } - ], "source": [ "query = (\n", " session.query(\n", @@ -335,15 +310,16 @@ "\n", "for element in query:\n", " print(f\"{element.decay_score} - {element.text}\")" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": null, "id": "edd8ba5f", "metadata": {}, - "outputs": [], - "source": [] + "source": [], + "outputs": [] } ], "metadata": { diff --git a/examples/training/1-Intro to Bricks.ipynb b/examples/training/1-Intro to Bricks.ipynb index 91a1b1cf14..21d1917909 100644 --- a/examples/training/1-Intro to Bricks.ipynb +++ b/examples/training/1-Intro to Bricks.ipynb @@ -19,14 +19,14 @@ "execution_count": 1, "id": "3908be82", "metadata": {}, - "outputs": [], "source": [ "import os\n", "import pathlib\n", "\n", "DIRECTORY = os.path.abspath(\"\")\n", "EXAMPLE_DOCS_DIRECTORY = os.path.join(DIRECTORY, \"..\", \"..\", \"example-docs\")" - ] + ], + "outputs": [] }, { "cell_type": "markdown", @@ -46,97 +46,44 @@ "execution_count": 2, "id": "8bbb73c0", "metadata": {}, - "outputs": [], "source": [ "from unstructured.partition.auto import partition\n", "\n", "filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, \"layout-parser-paper-fast.pdf\")\n", "elements = partition(filename=filename)" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 3, "id": "5319593c", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "LayoutParser : A Unified Toolkit for Deep Learning Based Document Image Analysis\n", - "\n", - "Zejiang Shen 1 ( (ea)\n", - " ), Ruochen Zhang 2 , Melissa Dell 3 , Benjamin Charles Germain Lee 4 , Jacob Carlson 3 , and Weining Li 5\n", - "\n", - "Allen Institute for AI shannons@allenai.org\n", - "\n", - "Brown 
University ruochen zhang@brown.edu\n", - "\n", - "Harvard University { melissadell,jacob carlson } @fas.harvard.edu\n", - "\n", - "University of Washington bcgl@cs.washington.edu\n", - "\n", - "University of Waterloo w\n", - "\n", - "li@uwaterloo.ca\n", - "\n", - "Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser , an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. 
The library is publicly available at https://layout-parser.github.io\n", - "\n", - "Keywords: Document Image Analysis · Deep Learning · Layout Analysis · Character Recognition · Open Source library · Toolkit.\n" - ] - } - ], "source": [ "print(\"\\n\\n\".join([str(el) for el in elements][:10]))" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 4, "id": "8de9bee1", "metadata": {}, - "outputs": [], "source": [ "with open(filename, \"rb\") as f:\n", " elements = partition(file=f, include_page_breaks=True)" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 5, "id": "75c6c73c", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "University of Washington bcgl@cs.washington.edu\n", - "\n", - "University of Waterloo w\n", - "\n", - "li@uwaterloo.ca\n", - "\n", - "Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser , an open-source library for streamlining the usage of DL in DIA research and applica- tions. 
The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io\n", - "\n", - "Keywords: Document Image Analysis · Deep Learning · Layout Analysis · Character Recognition · Open Source library · Toolkit.\n", - "\n", - "Introduction\n", - "\n", - "Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of document image analysis (DIA) tasks including document image classification [11,\n", - "\n", - "\n", - "\n", - "37], layout detection [38, 22], table detection [26], and scene text detection [4]. A generalized learning-based framework dramatically reduces the need for the manual specification of complicated rules, which is the status quo with traditional methods. DL has the potential to transform DIA pipelines and benefit a broad spectrum of large-scale document digitization projects.\n", - "\n", - "However, there are several practical difficulties for taking advantages of re- cent advances in DL-based methods: 1) DL models are notoriously convoluted for reuse and extension. Existing models are developed using distinct frame- works like TensorFlow [1] or PyTorch [24], and the high-level parameters can be obfuscated by implementation details [8]. It can be a time-consuming and frustrating experience to debug, reproduce, and adapt existing models for DIA, and many researchers who would benefit the most from using these methods lack the technical background to implement them from scratch. 
2) Document images contain diverse and disparate patterns across domains, and customized training is often required to achieve a desirable detection accuracy. Currently there is no full-fledged infrastructure for easily curating the target document image datasets and fine-tuning or re-training the models. 3) DIA usually requires a sequence of models and other processing to obtain the final outputs. Often research teams use DL models and then perform further document analyses in separate processes, and these pipelines are not documented in any central location (and often not documented at all). This makes it difficult for research teams to learn about how full pipelines are implemented and leads them to invest significant resources in reinventing the DIA wheel .\n" - ] - } - ], "source": [ "print(\"\\n\\n\".join([str(el) for el in elements][5:15]))" - ] + ], + "outputs": [] }, { "cell_type": "markdown", @@ -184,64 +131,23 @@ "execution_count": 6, "id": "b7ce3fa8", "metadata": {}, - "outputs": [], "source": [ "from unstructured.partition.html import partition_html\n", "\n", "url = \"https://www.cnn.com/2023/01/30/sport/empire-state-building-green-philadelphia-eagles-spt-intl/index.html\"\n", "elements = partition_html(url=url)" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 7, "id": "ab6d9307", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CNN\n", - "  —\n", - "\n", - "The Empire State Building was lit in green and white to celebrate the Philadelphia Eagles' victory in the NFC Championship game on Sunday — a decision that's sparked a bit of a backlash in the Big Apple.\n", - "\n", - "The Eagles advanced to the Super Bowl for the first time since 2018 after defeating the San Francisco 49ers 31-7, and the Empire State Building later tweeted how it was marking the occasion.\n", - "\n", - "Fly @Eagles Fly! We're going Green and White in honor of the Eagles NFC Championship Victory. 
pic.twitter.com/RNiwbCIkt7— Empire State Building (@EmpireStateBldg)\n", - "\n", - "January 29, 2023\n", - "\n", - "But given the fierce rivalry between the Eagles and the New York Giants, who the Super Bowl-bound team had comfortably defeated in the previous round of the NFL Playoffs, many were left questioning the move.\n", - "\n", - "œDid y'all lose a bet, ESPN contributor Mina Kimes asked in response to the tweet, while Giants running back Matt Breida also expressed his disbelief.\n", - "\n", - "SMH🤦🏾â™‚️— Matt Breida (@MattBreida)\n", - "\n", - "January 30, 2023\n", - "\n", - "œAs the representative for the Empire State Building, and a diehard Giants fan, let me be on the record saying that this is absolutely ridiculous, said New York City councilman Keith Powers.\n", - "\n", - "The Giants' Twitter account also acknowledged the divisive decision, writing: œI'm just here for the comments.\n", - "\n", - "The Empire State Building, whose original tweet honoring the Eagles was viewed nearly 30 million at the time of writing, said the color switch œhurt us more than it hurt you — but only after mocking another tweet calling the New York landmark œlame.\n", - "\n", - "The building was later lit in red to celebrate the Kansas City Chiefs' AFC Championship win against the Cincinnati Bengals.\n", - "\n", - "In Philadelphia, meanwhile, Eagles fans poured onto the streets on Sunday night. Large crowds gathered in the city as people climbed up light posts, street signs, and on top of a bus stop canopy.\n", - "\n", - "The city announced street closures and vehicle restrictions in Philadelphia's city center œdue to Eagles celebratory activity between 8th to 20th streets and Race to Lombard streets, the city's Office of Emergency Management tweeted on Sunday night.\n", - "\n", - "œPhiladelphians, let's celebrate joyously, safely, and respectfully and show the same love we have for our team to our city. 
Go Birds! Mayor Jim Kenney tweeted.\n", - "\n", - "The Eagles and the Chiefs face off in Super Bowl LVII on February 12.\n" - ] - } - ], "source": [ "print(\"\\n\\n\".join([str(el) for el in elements]))" - ] + ], + "outputs": [] }, { "cell_type": "markdown", @@ -260,23 +166,12 @@ "execution_count": 8, "id": "a1c4ba19", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "\"Philadelphia Eagles' victory\"" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ "from unstructured.cleaners.core import replace_unicode_quotes\n", "\n", "replace_unicode_quotes(\"Philadelphia Eaglesâ\\x80\\x99 victory\")" - ] + ], + "outputs": [] }, { "cell_type": "markdown", @@ -291,22 +186,14 @@ "execution_count": 9, "id": "215c4b35", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Philadelphia Eagles' victory\n" - ] - } - ], "source": [ "from unstructured.documents.elements import Text\n", "\n", "element = Text(\"Philadelphia Eaglesâ\\x80\\x99 victory\")\n", "element.apply(replace_unicode_quotes)\n", "print(element)" - ] + ], + "outputs": [] }, { "cell_type": "markdown", @@ -321,100 +208,67 @@ "execution_count": 10, "id": "ae048814", "metadata": {}, - "outputs": [], "source": [ "url = \"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023\"\n", "elements = partition_html(url=url)" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 11, "id": "4211194b", "metadata": {}, - "outputs": [], "source": [ "from unstructured.documents.elements import NarrativeText\n", "\n", "narrative_text = [el for el in elements if isinstance(el, NarrativeText)][2:]" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 12, "id": "3abd4280", "metadata": {}, - "outputs": [], "source": [ "import re\n", "\n", "remove_citations = lambda text: re.sub(\"\\[\\d{1,3}\\]\", \"\", text)" - ] + ], + "outputs": [] }, { 
"cell_type": "code", "execution_count": 13, "id": "3327feda", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'[1]\\xa0Geolocated combat footage has confirmed Russian gains in the Dvorichne area northwest of Svatove.'" - ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ "narrative_text[0].text" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 14, "id": "02eb95ae", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'\\xa0Geolocated combat footage has confirmed Russian gains in the Dvorichne area northwest of Svatove.'" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ "narrative_text[0].apply(remove_citations)\n", "narrative_text[0].text" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 15, "id": "b755cc86", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'Russian officials continue to propose measures to prepare Russia’s military industry for a protracted war in Ukraine while also likely setting further conditions for sanctions evasion.\\xa0Russian Prime Minister Mikhail Mishustin stated on February 8 that the Russian government will subsidize investment projects for the modernization of enterprises operating in the interests of the Russian military and will allocate significant funds for manufacturing new military equipment.\\xa0Mishustin also stated that the Russian government would extend benefits to Russian entrepreneurs who support the Russian military, including extended payment periods on rented federal property.\\xa0The Kremlin likely intends these measures to augment its overarching effort to gradually prepare Russia’s military industry for a protracted war in Ukraine while avoiding a wider economic mobilization that would create further domestic economic disruptions and corresponding discontent.'" - ] - }, - "execution_count": 15, - "metadata": {}, 
- "output_type": "execute_result" - } - ], "source": [ "narrative_text[6].apply(remove_citations)\n", "narrative_text[6].text" - ] + ], + "outputs": [] }, { "cell_type": "markdown", @@ -429,55 +283,33 @@ "execution_count": 16, "id": "7d65d7c8", "metadata": {}, - "outputs": [], "source": [ "from unstructured.cleaners.core import clean_extra_whitespace\n", "\n", "narrative_text[0].apply(clean_extra_whitespace)\n", "narrative_text[6].apply(clean_extra_whitespace)" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 17, "id": "a37f9bad", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'Geolocated combat footage has confirmed Russian gains in the Dvorichne area northwest of Svatove.'" - ] - }, - "execution_count": 17, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ "narrative_text[0].text" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 18, "id": "25245bc1", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'Russian officials continue to propose measures to prepare Russia’s military industry for a protracted war in Ukraine while also likely setting further conditions for sanctions evasion. Russian Prime Minister Mikhail Mishustin stated on February 8 that the Russian government will subsidize investment projects for the modernization of enterprises operating in the interests of the Russian military and will allocate significant funds for manufacturing new military equipment. Mishustin also stated that the Russian government would extend benefits to Russian entrepreneurs who support the Russian military, including extended payment periods on rented federal property. 
The Kremlin likely intends these measures to augment its overarching effort to gradually prepare Russia’s military industry for a protracted war in Ukraine while avoiding a wider economic mobilization that would create further domestic economic disruptions and corresponding discontent.'" - ] - }, - "execution_count": 18, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ "narrative_text[6].text" - ] + ], + "outputs": [] }, { "cell_type": "markdown", @@ -492,12 +324,12 @@ "execution_count": 19, "id": "0218cc7a", "metadata": {}, - "outputs": [], "source": [ "for element in narrative_text:\n", " element.apply(remove_citations)\n", " element.apply(clean_extra_whitespace)" - ] + ], + "outputs": [] }, { "cell_type": "markdown", @@ -514,35 +346,14 @@ "execution_count": 20, "id": "21819f56", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[\n", - " {\n", - " \"data\": {\n", - " \"text\": \"Geolocated combat footage has confirmed Russian gains in the Dvorichne area northwest of Svatove.\",\n", - " \"ref_id\": \"c311a941b80429f2ba0b6a2137f7315e\"\n", - " }\n", - " },\n", - " {\n", - " \"data\": {\n", - " \"text\": \"Russian military command additionally appears to have fully committed elements of several conventional divisions to decisive offensive operations along the Svatove-Kreminna line, as ISW previously reported.\",\n", - " \"ref_id\": \"79748ec84695bd88f41b13e98eae53be\"\n", - " }\n", - " }\n", - "]\n" - ] - } - ], "source": [ "import json\n", "from unstructured.staging.label_studio import stage_for_label_studio\n", "\n", "output = stage_for_label_studio(narrative_text)\n", "print(json.dumps(output[:2], indent=4))" - ] + ], + "outputs": [] }, { "cell_type": "markdown", @@ -578,112 +389,26 @@ "execution_count": 21, "id": "6d5cf8cf", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[\n", - " {\n", - " \"text\": \"Skip to main content\",\n", - " 
\"type\": \"Title\"\n", - " },\n", - " {\n", - " \"text\": \"(function(d){\\n var js, id = 'facebook-jssdk'; if (d.getElementById(id)) {return;}\\n js = d.createElement('script'); js.id = id; js.async = true;\\n js.src = \\\"//connect.facebook.net/en_US/all.js#xfbml=1\\\";\\n d.getElementsByTagName('head')[0].appendChild(js);\\n}(document));\",\n", - " \"type\": \"NarrativeText\"\n", - " }\n", - "]\n" - ] - } - ], "source": [ "from unstructured.staging.base import convert_to_isd\n", "\n", "isd = convert_to_isd(elements)\n", "print(json.dumps(isd[:2], indent=4))" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 22, "id": "706cc9c7", "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
[HTML table: type | text; 0 | Title | Skip to main content; 1 | NarrativeText | (function(d){\n var js, id = 'facebook-jssdk'...; 2 | Title | Search form; 3 | ListItem | Home; 4 | ListItem | Who We Are]
" - ], - "text/plain": [ - " type text\n", - "0 Title Skip to main content\n", - "1 NarrativeText (function(d){\\n var js, id = 'facebook-jssdk'...\n", - "2 Title Search form\n", - "3 ListItem Home\n", - "4 ListItem Who We Are" - ] - }, - "execution_count": 22, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ "from unstructured.staging.base import convert_to_dataframe\n", "\n", "df = convert_to_dataframe(elements)\n", "df.head()" - ] + ], + "outputs": [] }, { "cell_type": "markdown", @@ -698,24 +423,12 @@ "execution_count": 23, "id": "b2c1282e", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[,\n", - " ]" - ] - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ "from unstructured.staging.base import isd_to_elements\n", "\n", "isd_to_elements(isd[:2])" - ] + ], + "outputs": [] } ], "metadata": { diff --git a/examples/training/2-File Exploration.ipynb b/examples/training/2-File Exploration.ipynb index bc9097e800..3e157ce721 100644 --- a/examples/training/2-File Exploration.ipynb +++ b/examples/training/2-File Exploration.ipynb @@ -18,14 +18,14 @@ "execution_count": 1, "id": "59392a21", "metadata": {}, - "outputs": [], "source": [ "import os\n", "import pathlib\n", "\n", "DIRECTORY = os.path.abspath(\"\")\n", "EXAMPLE_DOCS_DIRECTORY = os.path.join(DIRECTORY, \"..\", \"..\", \"example-docs\")" - ] + ], + "outputs": [] }, { "cell_type": "markdown", @@ -42,24 +42,13 @@ "execution_count": 2, "id": "c6bd2f4a", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ "from unstructured.file_utils.filetype import detect_filetype\n", "\n", "filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, \"example-10k.html\")\n", "detect_filetype(filename=filename)" - ] + ], + "outputs": [] }, { "cell_type": "markdown", @@ -84,72 +73,23 @@ "execution_count": 3, "id": "c53f054e", 
"metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "MIME type was message/rfc822. This file type is not currently supported in unstructured.\n" - ] - }, - { - "data": { - "text/plain": [ - "FileType.EML 4\n", - "FileType.TXT 3\n", - "FileType.HTML 2\n", - "FileType.XML 2\n", - "FileType.PDF 2\n", - "FileType.JPG 2\n", - "FileType.UNK 1\n", - "FileType.DOCX 1\n", - "FileType.PPTX 1\n", - "FileType.XLSX 1\n", - "Name: filetype, dtype: int64" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ "from unstructured.file_utils.exploration import get_directory_file_info\n", "\n", "file_info = get_directory_file_info(EXAMPLE_DOCS_DIRECTORY)\n", "file_info.filetype.value_counts()" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 4, "id": "7e1b3300", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAiMAAAHzCAYAAADy/B0DAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/P9b71AAAACXBIWXMAAA9hAAAPYQGoP6dpAABDnElEQVR4nO3dd3gVZf7//9cJ5QAhCUYgoQQE6b1YCKIUI4hZMKt4IbB0WAt8hEVRUJRVhLgqgi5IUQGBpahL8cuiGEFsQaWFImtBgaAkwQIJNSC5f3/wI0sg5ZwAuc9Mno/rmj/OFM77vhjmvLhn7ns8xhgjAAAAS4JsFwAAAIo3wggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCppuwBfZGVl6cCBAwoJCZHH47FdDgAA8IExRkeOHFHVqlUVFJR3/4cjwsiBAwcUFRVluwwAAFAI+/fvV/Xq1fPc7ogwEhISIulsY0JDQy1XAwAAfJGRkaGoqKjs3/G8OCKMnLs1ExoaShgBAMBhCnrEggdYAQCAVYQRAABgFWEEAABYRRgBAABWEUYAAIBVhBEAAGAVYQQAAFhFGAEAAFYRRgAAgFWEEQAAYBVhBAAAWHVJYeS5556Tx+PRyJEj893v7bffVoMGDVSmTBk1bdpUq1evvpSvBQAALlLoMLJx40bNmjVLzZo1y3e/xMRE9erVS4MHD9bWrVsVFxenuLg47dy5s7BfDQAAXKRQYeTo0aPq06ePXnvtNV111VX57vvyyy/r9ttv1+jRo9WwYUNNmDBBrVq10rRp0wpVMAAAcJdChZFhw4YpNjZWMTExBe67YcOGi/br0qWLNmzYkOcxmZmZysjIyLEAAAB3KunvAUuWLNGWLVu0ceNGn/ZPTU1VREREjnURERFKTU3N85j4+Hg9/fTT/paWwzVj/nNJx/ti73OxV/w7AABwO796Rvbv368RI0boX//6l8qUKXOlatLYsWOVnp6evezfv/+KfRcAALDLr56RzZs36+DBg2rVqlX2ujNnzuiTTz7RtGnTlJmZqRIlSuQ4JjIyUmlpaTnWpaWlKTIyMs/v8Xq98nq9/pQGAAAcyq+ekVtvvVU7duxQUlJS9nLdddepT58+SkpKuiiISFJ0dLTWrl2bY11CQoKio6MvrXIAAOAKfvWMhISEqEmTJjnWBQcH6+qrr85e369fP1WrVk3x8fGSpBEjRqh9+/aaPHmyYmNjtWTJEm3atEmzZ8++TE0AAABOdtlnYE1OTlZKSkr257Zt22rRokWaPXu2mjdvrnfeeUcrVqy4KNQAAIDiyWOMMbaLKEhGRobCwsKUnp6u0NBQn45hNA0AAHb5+vvNu2kAAIBVhBEAAGAVYQQAAFhFGAEAAFYRRgAAgFWEEQAAYBVhBAAAWEUYAQAAVhFGAACAVYQRAABgFWEEAABYRRgBAABWEUYAAIBVhBEAAGAVYQQAAFhFGAEAAFYRRgAAgFWEEQAAYBVhBAAAWEUYAQAAVhFGAACAVYQRAABgFWEEAABYRRgBAABWEUYAAIBVhBEAAGAVYQQAAFhFGAEAAFYRRgAAgFWEEQAAYBVhBAAAWEUYAQAAVhFGAACAVX6FkRkzZqhZs2YKDQ1VaGiooqOj9d577+W5/7x58+TxeHIsZcqUueSiAQCAe5T0Z+fq1avrueeeU926dWWM0Ztvvqk777xTW7duVePGjXM9JjQ0VN9++232Z4/Hc2kVAwAAV/ErjHTr1i3H54kTJ2rGjBn64osv8gwjHo9HkZGRha8QAAC4WqGfGTlz5oyWLFmiY8eOKTo6Os/9jh49qpo1ayoqKkp33nmnvv766wL/7MzMTGVkZORYAACAO/kdRnbs2KHy5cvL6/Xq/vvv1/Lly9WoUaNc961fv77mzJmjlStXauHChcrKylLbtm31008/5fsd8fHxCgsLy16ioqL8LRMAADiExxhj/Dng1KlTSk5OVnp6ut555x29/vrr+vj
jj/MMJOc7ffq0GjZsqF69emnChAl57peZmanMzMzszxkZGYqKilJ6erpCQ0N9qvOaMf/xab9Lsfe52Cv+HQAAOFVGRobCwsIK/P3265kRSSpdurTq1KkjSWrdurU2btyol19+WbNmzSrw2FKlSqlly5bavXt3vvt5vV55vV5/SwMAAA50yfOMZGVl5ejFyM+ZM2e0Y8cOValS5VK/FgAAuIRfPSNjx45V165dVaNGDR05ckSLFi3S+vXrtWbNGklSv379VK1aNcXHx0uSnnnmGbVp00Z16tTR4cOH9cILL2jfvn0aMmTI5W8JAABwJL/CyMGDB9WvXz+lpKQoLCxMzZo105o1a3TbbbdJkpKTkxUU9L/OlkOHDmno0KFKTU3VVVddpdatWysxMdGn50sAAEDx4PcDrDb4+gDM+XiAFQAAu3z9/ebdNAAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqv8LIjBkz1KxZM4WGhio0NFTR0dF677338j3m7bffVoMGDVSmTBk1bdpUq1evvqSCAQCAu/gVRqpXr67nnntOmzdv1qZNm9SpUyfdeeed+vrrr3PdPzExUb169dLgwYO1detWxcXFKS4uTjt37rwsxQMAAOfzGGPMpfwB4eHheuGFFzR48OCLtvXs2VPHjh3TqlWrste1adNGLVq00MyZM33+joyMDIWFhSk9PV2hoaE+HXPNmP/4/OcX1t7nYq/4dwAA4FS+/n4X+pmRM2fOaMmSJTp27Jiio6Nz3WfDhg2KiYnJsa5Lly7asGFDvn92ZmamMjIyciwAAMCdSvp7wI4dOxQdHa2TJ0+qfPnyWr58uRo1apTrvqmpqYqIiMixLiIiQqmpqfl+R3x8vJ5++ml/S3MlengAAG7nd89I/fr1lZSUpC+//FIPPPCA+vfvr127dl3WosaOHav09PTsZf/+/Zf1zwcAAIHD756R0qVLq06dOpKk1q1ba+PGjXr55Zc1a9asi/aNjIxUWlpajnVpaWmKjIzM9zu8Xq+8Xq+/pQEAAAe65HlGsrKylJmZmeu26OhorV27Nse6hISEPJ8xAQAAxY9fPSNjx45V165dVaNGDR05ckSLFi3S+vXrtWbNGklSv379VK1aNcXHx0uSRowYofbt22vy5MmKjY3VkiVLtGnTJs2ePfvytwQAADiSX2Hk4MGD6tevn1JSUhQWFqZmzZppzZo1uu222yRJycnJCgr6X2dL27ZttWjRIo0bN06PP/646tatqxUrVqhJkyaXtxUAAMCx/Aojb7zxRr7b169ff9G6e+65R/fcc49fRQEAgOKDd9MAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAq/wKI/Hx8br++usVEhKiypUrKy4uTt9++22+x8ybN08ejyfHUqZMmUsqGgAAuIdfYeTjjz/WsGHD9MUXXyghIUGnT59W586ddezYsXyPCw0NVUpKSvayb9++SyoaAAC4R0l/dn7//fdzfJ43b54qV66szZs365ZbbsnzOI/Ho8jIyMJVCAAAXO2SnhlJT0+XJIWHh+e739GjR1WzZk1FRUX
pzjvv1Ndff53v/pmZmcrIyMixAAAAdyp0GMnKytLIkSN10003qUmTJnnuV79+fc2ZM0crV67UwoULlZWVpbZt2+qnn37K85j4+HiFhYVlL1FRUYUtEwAABLhCh5Fhw4Zp586dWrJkSb77RUdHq1+/fmrRooXat2+vZcuWqVKlSpo1a1aex4wdO1bp6enZy/79+wtbJgAACHB+PTNyzvDhw7Vq1Sp98sknql69ul/HlipVSi1bttTu3bvz3Mfr9crr9RamNAAA4DB+9YwYYzR8+HAtX75c69atU61atfz+wjNnzmjHjh2qUqWK38cCAAD38atnZNiwYVq0aJFWrlypkJAQpaamSpLCwsJUtmxZSVK/fv1UrVo1xcfHS5KeeeYZtWnTRnXq1NHhw4f1wgsvaN++fRoyZMhlbgoAAHAiv8LIjBkzJEkdOnTIsX7u3LkaMGCAJCk5OVlBQf/rcDl06JCGDh2q1NRUXXXVVWrdurUSExPVqFGjS6scAAC4gl9hxBhT4D7r16/P8XnKlCmaMmWKX0UBAIDig3fTAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKv8CiPx8fG6/vrrFRISosqVKysuLk7ffvttgce9/fbbatCggcqUKaOmTZtq9erVhS4YAAC4i19h5OOPP9awYcP0xRdfKCEhQadPn1bnzp117NixPI9JTExUr169NHjwYG3dulVxcXGKi4vTzp07L7l4AADgfB5jjCnswb/88osqV66sjz/+WLfcckuu+/Ts2VPHjh3TqlWrste1adNGLVq00MyZM336noyMDIWFhSk9PV2hoaE+HXPNmP/4tN+l2Ptc7BX/Dre0AwBQ/Pj6+31Jz4ykp6dLksLDw/PcZ8OGDYqJicmxrkuXLtqwYUOex2RmZiojIyPHAgAA3KlkYQ/MysrSyJEjddNNN6lJkyZ57peamqqIiIgc6yIiIpSamprnMfHx8Xr66acLWxoCjFt6d2iHb9zQBokeQ6AoFbpnZNiwYdq5c6eWLFlyOeuRJI0dO1bp6enZy/79+y/7dwAAgMBQqJ6R4cOHa9WqVfrkk09UvXr1fPeNjIxUWlpajnVpaWmKjIzM8xiv1yuv11uY0gAAgMP41TNijNHw4cO1fPlyrVu3TrVq1SrwmOjoaK1duzbHuoSEBEVHR/tXKQAAcCW/ekaGDRumRYsWaeXKlQoJCcl+7iMsLExly5aVJPXr10/VqlVTfHy8JGnEiBFq3769Jk+erNjYWC1ZskSbNm3S7NmzL3NTAACAE/nVMzJjxgylp6erQ4cOqlKlSvaydOnS7H2Sk5OVkpKS/blt27ZatGiRZs+erebNm+udd97RihUr8n3oFQAAFB9+9Yz4MiXJ+vXrL1p3zz336J577vHnqwAAQDHBu2kAAIBVhBEAAGAVYQQAAFhFGAEAAFYRRgAAgFWEEQAAYBVhBAAAWEUYAQAAVhFGAACAVYQRAABgFWEEAABYRRgBAABWEUYAAIBVhBEAAGAVYQQAAFhFGAEAAFYRRgAAgFWEEQAAYBVhBAAAWEUYAQAAVhFGAACAVYQRAABgFWEEAABYRRgBAABWEUYAAIBVhBEAAGAVYQQAAFhFGAEAAFYRRgAAgFWEEQAAYBVhBAAAWEUYAQAAVhFGAACAVX6HkU8++UTdunVT1apV5fF4tGLFinz3X79+vTwez0VLampqYWsGAAAu4ncYOXbsmJo3b67p06f7ddy3336rlJSU7KVy5cr+fjUAAHChkv4e0LVrV3Xt2tXvL6pcubI
qVKjg93EAAMDdiuyZkRYtWqhKlSq67bbb9Pnnn+e7b2ZmpjIyMnIsAADAna54GKlSpYpmzpypf//73/r3v/+tqKgodejQQVu2bMnzmPj4eIWFhWUvUVFRV7pMAABgid+3afxVv3591a9fP/tz27Zt9cMPP2jKlClasGBBrseMHTtWo0aNyv6ckZFBIAEAwKWueBjJzQ033KDPPvssz+1er1der7cIKwIAALZYmWckKSlJVapUsfHVAAAgwPjdM3L06FHt3r07+/OePXuUlJSk8PBw1ahRQ2PHjtXPP/+s+fPnS5KmTp2qWrVqqXHjxjp58qRef/11rVu3Th988MHlawUAAHAsv8PIpk2b1LFjx+zP557t6N+/v+bNm6eUlBQlJydnbz916pQefvhh/fzzzypXrpyaNWumDz/8MMefAQAAii+/w0iHDh1kjMlz+7x583J8fvTRR/Xoo4/6XRgAACgeeDcNAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsMrvMPLJJ5+oW7duqlq1qjwej1asWFHgMevXr1erVq3k9XpVp04dzZs3rxClAgAAN/I7jBw7dkzNmzfX9OnTfdp/z549io2NVceOHZWUlKSRI0dqyJAhWrNmjd/FAgAA9ynp7wFdu3ZV165dfd5/5syZqlWrliZPnixJatiwoT777DNNmTJFXbp08ffrAQCAy1zxZ0Y2bNigmJiYHOu6dOmiDRs25HlMZmamMjIyciwAAMCd/O4Z8VdqaqoiIiJyrIuIiFBGRoZOnDihsmXLXnRMfHy8nn766StdGgBYdc2Y/1zx79j7XOwV/fPd0AaJdvjqSrUhIEfTjB07Vunp6dnL/v37bZcEAACukCveMxIZGam0tLQc69LS0hQaGpprr4gkeb1eeb3eK10aAAAIAFe8ZyQ6Olpr167NsS4hIUHR0dFX+qsBAIAD+B1Gjh49qqSkJCUlJUk6O3Q3KSlJycnJks7eYunXr1/2/vfff79+/PFHPfroo/rmm2/06quv6q233tLf/va3y9MCAADgaH6HkU2bNqlly5Zq2bKlJGnUqFFq2bKlnnrqKUlSSkpKdjCRpFq1auk///mPEhIS1Lx5c02ePFmvv/46w3oBAICkQjwz0qFDBxlj8tye2+yqHTp00NatW/39KgAAUAwE5GgaAABQfBBGAACAVYQRAABgFWEEAABYRRgBAABWEUYAAIBVhBEAAGAVYQQAAFhFGAEAAFYRRgAAgFWEEQAAYBVhBAAAWEUYAQAAVhFGAACAVYQRAABgFWEEAABYRRgBAABWEUYAAIBVhBEAAGAVYQQAAFhFGAEAAFYRRgAAgFWEEQAAYBVhBAAAWEUYAQAAVhFGAACAVYQRAABgFWEEAABYRRgBAABWEUYAAIBVhBEAAGAVYQQAAFhFGAEAAFYVKoxMnz5d11xzjcqUKaMbb7xRX331VZ77zps3Tx6PJ8dSpkyZQhcMAADcxe8wsnTpUo0aNUrjx4/Xli1b1Lx5c3Xp0kUHDx7M85jQ0FClpKRkL/v27bukogEAgHv4HUZeeuklDR06VAMHDlSjRo00c+ZMlStXTnPmzMnzGI/Ho8jIyOwlIiLikooGAADu4VcYOXXqlDZv3qyYmJj//QFBQYqJidGGDRvyPO7o0aOqWbOmoqKidOedd+rrr7/O93syMzOVkZGRYwEAAO7kVxj59ddfdebMmYt6NiIiIpSamprrMfXr19ecOXO0cuVKLVy4UFlZWWrbtq1++umnPL8nPj5eYWFh2UtUVJQ/ZQI
AAAe54qNpoqOj1a9fP7Vo0ULt27fXsmXLVKlSJc2aNSvPY8aOHav09PTsZf/+/Ve6TAAAYElJf3auWLGiSpQoobS0tBzr09LSFBkZ6dOfUapUKbVs2VK7d+/Ocx+v1yuv1+tPaQAAwKH86hkpXbq0WrdurbVr12avy8rK0tq1axUdHe3Tn3HmzBnt2LFDVapU8a9SAADgSn71jEjSqFGj1L9/f1133XW64YYbNHXqVB07dkwDBw6UJPXr10/VqlVTfHy8JOmZZ55RmzZtVKdOHR0+fFgvvPCC9u3bpyFDhlzelgAAAEfyO4z07NlTv/zyi5566imlpqaqRYsWev/997Mfak1OTlZQ0P86XA4dOqShQ4cqNTVVV111lVq3bq3ExEQ1atTo8rUCAAA4lt9hRJKGDx+u4cOH57pt/fr1OT5PmTJFU6ZMKczXAACAYoB30wAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrChVGpk+frmuuuUZlypTRjTfeqK+++irf/d9++201aNBAZcqUUdOmTbV69epCFQsAANzH7zCydOlSjRo1SuPHj9eWLVvUvHlzdenSRQcPHsx1/8TERPXq1UuDBw/W1q1bFRcXp7i4OO3cufOSiwcAAM7ndxh56aWXNHToUA0cOFCNGjXSzJkzVa5cOc2ZMyfX/V9++WXdfvvtGj16tBo2bKgJEyaoVatWmjZt2iUXDwAAnK+kPzufOnVKmzdv1tixY7PXBQUFKSYmRhs2bMj1mA0bNmjUqFE51nXp0kUrVqzI83syMzOVmZmZ/Tk9PV2SlJGR4XOtWZnHfd63sPypp7Dc0A43tEGiHb5yQxsk2uErN7RBoh2+8rcN5/Y3xuS/o/HDzz//bCSZxMTEHOtHjx5tbrjhhlyPKVWqlFm0aFGOddOnTzeVK1fO83vGjx9vJLGwsLCwsLC4YNm/f3+++cKvnpGiMnbs2By9KVlZWfr999919dVXy+PxXJHvzMjIUFRUlPbv36/Q0NAr8h1XmhvaILmjHW5og0Q7Aokb2iC5ox1uaINUNO0wxujIkSOqWrVqvvv5FUYqVqyoEiVKKC0tLcf6tLQ0RUZG5npMZGSkX/tLktfrldfrzbGuQoUK/pRaaKGhoY4+uSR3tEFyRzvc0AaJdgQSN7RBckc73NAG6cq3IywsrMB9/HqAtXTp0mrdurXWrl2bvS4rK0tr165VdHR0rsdER0fn2F+SEhIS8twfAAAUL37fphk1apT69++v6667TjfccIOmTp2qY8eOaeDAgZKkfv36qVq1aoqPj5ckjRgxQu3bt9fkyZMVGxurJUuWaNOmTZo9e/blbQkAAHAkv8NIz5499csvv+ipp55SamqqWrRooffff18RERGSpOTkZAUF/a/DpW3btlq0aJHGjRunxx9/XHXr1tWKFSvUpEmTy9eKy8Dr9Wr8+PEX3R5yEje0QXJHO9zQBol2BBI3tEFyRzvc0AYpsNrhMaag8TYAAABXDu+mAQAAVhFGAACAVYQRAABgFWEEAABYRRgBAABWEUYAwIFSUlJslwBcNoSRXGzfvl2lS5e2XUa+ateurd9++812GVfcmTNndODAAdtlXLIff/xRnTt3tl0GHOLCN51fKCUlRR06dCiaYuAKu3btKnCfhQsXFkEluSOM5MIYozNnztguI1979+4N+Bovh507dyoqKsp2GZfsyJEjF70WIdDUqFEjR8CdNm1akbzy/HI7ceKEVq1alf353Is3zy2jR4/WyZMnLVZYsLlz52rixIm5bjs
XRCpVqlTEVfnvySef1B9//JHn9uTkZN12221FWFHhvPHGG/luP3LkiIYMGVJE1RRO69at9eKLLyq3qcXS0tLUvXt3PfDAAxYqO4swAkCS9NNPP+UIuI8//rh+/fVXixUVzptvvqlZs2Zlf542bZoSExO1detWbd26VQsXLtSMGTMsVliwd999V5MmTbqoztTUVHXs2FHh4eF6//33LVXnuzfffFPXX3+9du7cedG2WbNmqUmTJipZMiBfHp/DqFGj9Kc//UmpqakXbVuzZo0aN26sjRs3WqjMdwsXLtTzzz+vW265RT/88EOO9Y0aNdLhw4e1detWewUaXCQpKckEBQXZLiNfHo/HzJ8/36xcuTLfxemc8HfhCye0w+PxmLS0tOzP5cuXNz/88IPFigqnXbt25t13383+fGE7FixYYNq0aWOjNL+sWrXKeL1es3jxYmOMMSkpKaZBgwbmhhtuMBkZGZar8016errp27ev8Xq9ZtKkSebMmTNm37595tZbbzWhoaFm1qxZtkv0yZ49e0yHDh1MeHi4WbRokTHGmIyMDDNo0CBTqlQpM3bsWHPq1CnLVRYsLS3NxMXFmeDgYPPCCy+Y7t27m7Jly5rJkyebrKwsq7UVyzCSnp6e7/Lpp5864oejoCXQ2+ALJ/yI+8IJ7XBLGImMjDR79uzJ/lyxYsUcn7/99lsTGhpa9IUVwr/+9S9TpkwZM3fuXNOwYUNz3XXXmcOHD9suy28rVqwwERERpnnz5iY0NNTExMSYvXv32i7Lb1OmTDHBwcEmNjbW1KhRwzRq1Mh89dVXtsvyW+/evY3H4zHly5c327dvt12OMcaYwO8fuwIqVKggj8eT53ZjTL7bA0VqaqoqV65su4xLsn379ny3f/vtt0VUyaVp2bJlvufM8ePHi7Cawnv99ddVvnx5SdIff/yhefPmqWLFijn2eeihh2yU5rPDhw8rMzMz+/Mvv/ySY3tWVlaO7YGsd+/eOnz4sAYPHqxWrVrpww8/VFhYmO2y/NamTRs1bdpUa9euVXBwsMaNG6eaNWvaLstv9913nz755BOtWLFCwcHBWrVqlZo2bWq7LJ8dOnRIw4YN08qVKzVmzBgtXbpUvXr10vz589WqVSurtRXLMPLRRx/ZLuGSOSEs+aJFixbyeDy5PlR1br0T2hoXF2e7hEtWo0YNvfbaa9mfIyMjtWDBghz7eDyegA8j1atX186dO1W/fv1ct2/fvl3Vq1cv4qr8c2G4LVWqlA4fPqyOHTvm2G/Lli1FXZrfFi9erOHDh6tFixb673//qzfeeEOdO3fWgw8+qPj4eJUpU8Z2iT75/PPPNXDgQJUsWVLvv/++Xn/9dUVHR2vixIkaMWKE7fIKtGrVKg0dOlQ1atTQ5s2b1aBBAz3xxBN65JFHFB0drUcffVTjx4+39gwPb+11qKCgIFf0jOzbt8+n/Zz4vyjYMWLECH344YfavHnzRT90J06c0HXXXaeYmBi9/PLLlios2NNPP+3TfuPHj7/ClVyau+++W2vWrFF8fLz+7//+L3t9YmKiBg4cKEmaN2+eoqOjbZXok4cffljTpk3T8OHDNXHixOzzaunSpRo+fLgaN26suXPnqlatWpYrzZvX69X48eM1ZswYBQXlHLuSkJCgIUOG6KqrrlJSUpKV+ggjDjVw4EC98sorCgkJyXOf06dPq1SpUkVYFWBfWlqaWrRoodKlS2v48OGqV6+epLO3/KZNm6Y//vhDW7duVUREhOVK3e+mm27SvHnzVLdu3Yu2nThxQmPGjNGMGTN06tQpC9X5rk6dOpo7d65uvvnmi7alpaXpr3/9q9atW6cjR45YqM4327dvV7NmzfLcnpGRob/97W8FDmO+UoplGClRooRP+wXyPB59+/bV9OnTFRoamuv2TZs2acCAAbkOqQskycnJPu1Xo0aNK1zJpenUqZNP+61bt+4KV3JpsrKyNG/ePC1btkx79+6Vx+NRrVq11KNHD/X
t29cRt8wkac+ePXrggQeUkJCQfQvQ4/Hotttu06uvvqratWtbrrB4yMrKuuh/4Rf65JNPdMsttxRRRYVz/PhxlStXLt99FixYoL59+xZRRe5TLMNIUFCQatasqf79+6tly5Z57nfnnXcWYVX+ad26tdLS0vTGG2+oS5cu2etPnz6tp556SpMnT9agQYM0c+ZMi1UW7PxgeP6PxvnrPB5PQAdD6X/nVGxsbL69UVOmTCnCqvxjjFG3bt20evVqNW/eXA0aNJAxRv/973+1Y8cOde/eXStWrLBdpl9+//137d69W9LZ/92Gh4dbrsg3HTt2LDD4eTyegJ9I78yZM/r6669Vt25dlS1bNse248ePa/fu3WrSpEmBgcW22rVra+PGjbr66qttl1Jo3333nQ4fPqwbbrghe93atWv17LPP6tixY4qLi9Pjjz9urb5i+QDrV199pTfeeEMvv/yyatWqpUGDBqlPnz666qqrbJfmsy+//FLPPPOMunXrpoEDB2ry5Mn65ptv1L9/fx09elSrVq1yxPTjHo9H1atX14ABA9StWzdHTICUm3/84x+aO3eu3n77bfXp00eDBg1SkyZNbJfll3nz5umTTz7R2rVrL3pQct26dYqLi9P8+fPVr18/SxX6bu/evUpISNDp06d1yy23OO7vokWLFnluO3LkiBYtWuSIEUELFizQtGnT9OWXX160rXTp0ho0aJBGjhypv/zlLxaq850bZrx+7LHH1LRp0+wwsmfPHnXr1k0333yzmjVrpvj4eJUrV04jR460U2DRjyYOHCdOnDALFiwwnTp1MuXKlTM9e/Y0H3zwge2y/LJx40bTuHFjU6VKFVOqVCkzaNAgk56ebrssn6WkpJjnnnvO1K9f30RERJiHH37Y7Nq1y3ZZhZaYmGiGDBliQkNDzfXXX29mzJjhmL+P2267zcTHx+e5feLEiaZz585FWFHhrFu3zpQrVy57vp1SpUqZBQsW2C7rkp0+fdpMnTrVVKpUydSpUyd7MrRA1q5du3zrXLp0qbn55puLsKLCuXAOHieqXr26SUxMzP48YcIE07x58+zPr7/+eo7PRa1Yh5Hz/fjjj6Zjx44mKCjI/Pbbb7bL8dmOHTtMixYtTLly5UxwcLCjL7qffvqpGTRokAkJCTE33nijmT17tjlz5oztsgrl2LFjZt68eeb66683wcHBjggkERERZuvWrXlu37Jli4mIiCi6ggrppptuMnfeeac5cOCA+f33382DDz5oqlSpYrusS7Jw4UJTu3ZtU6VKFTN9+nRz+vRp2yX5pFKlSjkmnLvQjz/+aCpWrFh0BRWSG2a8LlOmjElOTs7+3KlTJzNu3Ljsz7t37zZhYWEWKjur2IeR/fv3mwkTJphrr73WVKlSxTz22GOO+IeelZVlJk2aZLxerxkwYIA5dOiQmT59uilfvrz585//bA4ePGi7xEJLTU11ZDA836effmoGDhxoypcvb2688UZz/Phx2yUVqFSpUubAgQN5bv/5559N6dKli7CiwgkLCzNff/119udjx46ZEiVKmF9//dViVYXz3nvvZc9a+swzz5ijR4/aLskv5cqVM9u2bctz+7Zt20y5cuWKsKLCccOM11WrVjVffvmlMcaYM2fOmNDQULNq1ars7bt27bI6M3FgPzV0hZw6dUpLly5V586dVbduXW3ZskVTp07V/v379dxzzzniuYU2bdron//8p95++23NnTtXFSpU0IMPPqht27bp119/VaNGjbR06VLbZfolMTFRQ4YMUb169XT06FFNnz5dFSpUsF2Wzw4cOKBJkyapXr166tGjh8LDw/Xll1/qiy++uOjhvUB05syZfM/9EiVK5PsG1kCRkZGRY9bYcuXKqWzZskpPT7dYlX+++uordezYUX/+85/VsWNH/fDDD3ryyScVHBxsuzS/1K1bV4mJiXlu/+yzz3Id9huIUlNTlZWVlecS6M+UdOjQQRMmTND+/fs1depUZWV
lqUOHDtnbd+3apWuuucZafYH/q3sFVKlSRSEhIerfv79effXV7InDjh07lmO/vIbNBoJatWrpvffeu2h0QO3atfXxxx9r6tSpGjx4sHr27GmpQt+kpKRo/vz5mjt3rg4dOqQ+ffro888/d9wDh3fccYc++ugjde7cWS+88IJiY2MdEWrPZ4zRgAED5PV6c93uhAcmz1mzZk2OadOzsrK0du3aHEPdu3fvbqM0n7Rp00Zly5bV/fffr1q1amnRokW57hfos+H27t1b48aNU9u2bS+a42Lbtm166qmn9Oijj1qqzndOGdKen4kTJ+q2225TzZo1VaJECb3yyis5wu2CBQt8nqLgSii2Q3vPye0kMw4YTpqcnKyoqKh8/5F8//33Af+/jlKlSqlatWrq37+/unfvnuew2Pwm6wkEQUFBqlKliipXrpzv30kgT999bkbMgsydO/cKV3JpfBkmGuj/vq+55hqfhvb++OOPRVRR4Zw+fVqdO3fWZ599ppiYGDVo0ECS9M033+jDDz/UTTfdpISEhICfnNEtM17/8ccf+vrrr1WpUiVVrVo1x7Zt27YpKirK2vD3YhlGPv74Y5/2a9++/RWupPBKlCihlJQUx//jyC0YXnhKBvoPh+Se6buBy+306dOaMmWKFi1apO+//17GGNWrV0+9e/fWyJEjVbp0adslFsiXGa+d7scff9T999+vDz74wMr3F8sw4gZuSeq8myawnJuf49SpU+rQoYMaN25suyTAujNnzujFF1/Uu+++q1OnTunWW2/V+PHjHfEsmK+2bdumVq1aWfuPn7Nual8mb731luLi4rIT+U8//aSqVatm/y/9+PHjmjZtWsDfy3TDfcw333xTjzzySIFTLQe6Xbt2qVGjRvnus3DhwoCe3Omjjz7Sn/70J504cUKSVLJkSc2ZMyega87Nu+++W+A+JUuWVGRkpJo0aRKQ/zMfNWpUruvDwsJUr1493XXXXXk+2xOITpw4oYSEBH333XeSpPr16ysmJsYxP+aTJk3S3//+9+yaX375ZR08eFBz5syxXZprFMuekQtvcYSGhiopKSn7fRVpaWmqWrVqQN8aCAoK0l//+tcCf8RfeumlIqqocNxyu6ls2bKaMGGCHn744YtCYlpamoYOHaqPPvoooF+k1a5dO1WsWFEzZsxQmTJlNG7cOC1fvlwHDhywXZpf/JlaPDIyUkuXLs31BWg2XTgD7jmHDx/W7t27FRERoXXr1gX8O5uks+FwyJAh+vXXX3Osr1ixot544w1169bNUmW+q1u3rh555BHdd999kqQPP/xQsbGxOnHiRMBPZe8r2z0jxTKMXHiLIyQkRNu2bXNcGImOjs73f3UejyfgX8zmlttN//73v/XAAw+ofv36mjdvnq699lpJZ3tDRowYocaNG2vOnDmqU6eO5UrzVqFCBSUmJmb38Bw/flyhoaFKS0tz9Ds5cmOMUVpamp599lklJiYG9IPFF8rIyFCfPn0UEhKS5yibQJGYmKgOHTqoe/fuevjhh9WwYUNJZ3sSJ0+erFWrVunjjz9WmzZtLFeaP6/Xq927dysqKip7XZkyZbR7925Vr17dYmWXD2HEAreEETf8iAcFBSktLU2VKlWyXcolO3jwoO677z4lJCTo73//uz799FMlJCTo2Wef1d/+9reAv62W2zl14b8Nt9m7d68aNGigkydP2i7FL1999ZXuuecen5+5suWOO+5QVFSUZs2alev2++67T/v379fq1auLuDL/lChRQqmpqTmuUyEhIdq+fbtq1aplsTLftWzZMt9r0PHjx/X999/zzAj8E+g/bP6oV69ege35/fffi6iawqtcubKWL1+uPn366NFHH1VwcLC+/PJLNW3a1HZpPnP6/By+SElJ0enTp1WjRg1dc801SktLs12S3ypWrOiIfxNffPGF/vGPf+S5fdiwYQE9avGc3ObgOXnypO6///4cc3UsW7bMRnk+iYuLs11CvoptGDn
/onvhBffw4cMWK/ONmzq0nn766Rw/gE516NAhDRs2TCtXrtSYMWO0dOlS9erVS/Pnz1erVq1sl+eT/v37X7Tu3H1yyRnDrAvSqVMnfffdd9ntcOK598UXX2TfCgxkJ06cyHfyyLCwMEf0SuX278JpD3YH+rQCxTaMXHhynX/BlQK/52Hu3LmOvIjm5t5773X87aZVq1Zp6NChqlGjhjZv3qwGDRroiSee0COPPKLo6Gg9+uijGj9+fEDPypqVlWW7hCIxf/58HT9+3HYZ+dq+fXuu69PT07V582ZNmjQp4H9cpLMPfq5bty7PCfXWrl0b8BMzSoE/0Z8vAn7EXxG/CwdF5MCBA2bfvn22yyhQUFCQ41/NbYwxpUuXNhMnTsz1LcMffPCBqVGjhtXXc8NZzr14LbcXslWqVMnEx8ebrKws22UW6KWXXjLh4eHmP//5z0XbVq1aZa6++mozefJkC5X5b8+ePWb27Nlm2rRpZufOnbbL8VuZMmXMCy+8kOt5k5qaarp162bKly9vobKzCCMu1aBBg4B/i6QxZy+6bggj+b2Z1Bhj0tPTzaBBg4qomivDKQH3fIcOHTKvvfaaGTNmTPYboDdv3mx++ukny5Xlb+/evbkuv//+u+3S/HLmzBnTo0cP4/F4TIMGDcyf//xnExcXZ+rXr2+CgoLMXXfdlWuADzTr1q0z5cqVyw6EpUqVMgsWLLBdll/eeecdU6lSJdOuXTuze/fu7PULFiww4eHh5uabbzbff/+9tfqK5Wiagpz/gJtTbdy4UcePH3fEw2FwhoYNG+Z41iLQbd++XTExMQoLC9PevXv17bffqnbt2ho3bpySk5M1f/582yUWG0uXLtXixYuzJz2rV6+e7r33Xt17772WK/ONW+bgCeQRf4SRXDjtoutkd911l0/7BfJT6r4g4Ba9mJgYtWrVSs8//3yOIcqJiYnq3bu39u7da7vEQnPD+eQkbpuDp0+fPlq8eLGCg4OVmJgYECP+AvdpOouc8IDb+Q4fPqx33nlHP/zwg0aPHq3w8HBt2bJFERERqlatmu3y8nXhQ7iLFi1St27dXPdCqgtHcDjR9ddfb7sEv2zcuDHX+S2qVaum1NRUCxVdPk47n9LT05WQkKC9e/fK4/Godu3auvXWW/MdaRNIMjIyVLFixezP5cqVU9myZZWenu6oMBLII/4II7lw0kX3wq7ooUOHKjw8XMuWLXNEV/SFT6m/8847ev755103yRYBt+h5vV5lZGRctP67775z/CR7TjqfFi5cqOHDh1/0dxEWFqaZM2eqZ8+elirzj9Pn4An4EX/WnlYJEE59wO2cW2+91YwePdoYY0z58uXNDz/8YIwx5vPPPzc1a9a0WFnhnN8G2LFt2zZTqVIlU6dOHVOyZMnsv48nnnjC9O3b13J1vhs8eLCJi4szp06dMuXLlzc//vij2bdvn2nZsqUZMWKE7fKKhc2bN5uSJUua/v37m6SkJHPy5Elz4sQJs3nzZtO3b19TqlQpk5SUZLvMAuU2qunCJdAHDAT6iL9iHUbccNENDQ3NfjL6/B/yvXv3Gq/Xa7O0QnFDGCHgBobDhw+bmJgYU6FCBVOiRAkTFRVlSpUqZW655RZz9OhR2+X5zMnn04ABA0yPHj3y3H733XebgQMHFmFFxVegj/gr1rdpRo0apQEDBmQ/4HbOHXfcod69e1uszHdu7op2IqffNpPc86xFWFiYEhIS9Nlnn2n79u06evSoWrVqpZiYGNul+czp59Pnn3+uV199Nc/t999/vx588MEirKj4atasWb7bQ0ND9cYbbxRRNRcr1mHEDRfd7t2765lnntFbb70l6ezMscnJyXrsscd09913W66uYO+++26Oz7ndh5UC+17s+Qi4gaddu3Zq166d7TIKxenn04EDB1SvXr08t9erV08///xzEVZ0ZbhhdJP1NljrkwkAlSpVMlu2bDHG5OyO/uCDD0z16tVtluYzp3dFu+Fe7Pn
ccNvMTc9afPjhhyY2NtbUrl3b1K5d28TGxpqEhATbZfnM6edTQZMapqamOurfd16cMslkfmy3oVj3jDi9V0Fyfle0296H4oZehcmTJ6tHjx6qXLmyTpw4ofbt2ys1NVXR0dGaOHGi7fJ89uqrr2rEiBHq0aOHRowYIensC+buuOMOTZkyRcOGDbNcYcHccD5dOArlfE54KakvnDS6KS+221CsJz1LT09Xjx49tGnTJh05ckRVq1bNvuiuXr06x6uhAV8MGTJEv/32m9566y2Fh4dr+/btKlGihOLi4nTLLbdo6tSptkv0mVMD7jnVq1fXmDFjNHz48Bzrp0+frkmTJjni9oDTz6egoKAC93HDm6Bx6Yp1GDnH6RfdtWvXasqUKfrvf/8r6ewMsiNHjnRcO3Jj/T6mnwi4gaN8+fJKSkpSnTp1cqz//vvv1bJlSx09etRSZb7jfAo8bpiDJxDbQBhxuPO7oqOjoyWd7Yp+5513HNMVnR+nTs1PwLWvd+/eatmypUaPHp1j/YsvvqhNmzZpyZIllirzn9PPJ7dww/uOArUNxT6MOP2i64au6Pw47X0obuCWgPvss8/qxRdf1E033ZSjHZ9//rkefvjhHFORP/TQQ7bKLNac1vPphvcdBWobinUYccNF1w1d0W5DwA0MtWrV8mk/j8ejH3/88QpXU3hOP5/y47Sez7CwMG3ZskXXXnttjh/yffv2qX79+jp58qTtEgsUqG0o+OkiF5s0aZKmTJmixYsX66GHHtJDDz2kRYsWacqUKZo0aZLt8nzSvXt3LV++/KL1K1eu1J/+9CcLFRXe4cOH9frrr2vs2LH6/fffJUlbtmxxzI+fdDbg3n777QoJCdGIESM0YsQIhYaG6o477tD06dNtl+eTw4cP6/bbb79ofefOnZWenm6hosLZs2ePT0sgBxE3nE/5mT9/vtatW2e7DJ+5YXRTwLbB1pjiQBAcHGy+//77i9Z/9913Jjg42EJF/pswYYIJCwszd9xxh5kwYYKZMGGCiY2NNRUqVDATJkwwL7/8cvYSyNwwNb8xxlSrVs3885//vGj9tGnTTNWqVS1U5L9evXqZ559//qL1L7zwgunZs6eFigpn3bp1tku4ZG44n9zEDXPwBGobivVtGjc84OaWruhAvY/pLzfcNnPLsxZer1fVq1fXwIED1b9/f0VFRdkuyW9uOJ/OCcQRHP5yw+imQG1DsQ4jbrnoukGg3sf0FwE3cPz6669asGCB3nzzTX399dfq1KmTBg8erLi4OJUuXdp2eT5xw/kkBe4IjsJyw+imQGtDsQ4jbrjofvTRR+rYsaPtMi5Z5cqVtWbNGrVs2TJHGElISNCgQYO0f/9+2yX6hIAbmLZs2aK5c+dq8eLFks7+yA8ePFjNmze3XFn+3HI+uaXnE1dOsQ4jbuCGrmjJ+TNNnkPADVwHDhzQ7Nmz9dxzz6lkyZI6efKkoqOjNXPmTDVu3Nh2eblyw/kkuafnU3LH6KaAbIO1p1UCgBsecPvll1/MSy+9ZJo3b25KlixpOnfubJYuXWoyMzNtl+YXp7/wz01Kly5tateubSZMmGCSk5Ntl3NJTp06Zd5++23TtWtXU7JkSdOmTRvz2muvmaNHj5o9e/aYPn36mIYNG9ou0/Xc8FJSY4yZPn26KVmypLn33nuzBwb06tXLlCpVykybNs12eT4J1DYU6zDipouuMcZs3rzZDB8+3Fx99dXm6quvNv/3f/9nkpKSbJfll08//dRMnz7d/OMf/3DU21XPIeDa17FjR3Po0KHsfwvh4eFmxIgRZseOHRftm5KSYjwej4UqfeOG88mYwB3B4S83jG4K1DYU6zDi9Itubn7++Wczfvx44/V6TXBwsClRooRp166d2blzp+3SigUCrn1BQUEmLS3NdOrUySxatMicPHkyz31Pnz5t1q9fX4TV+cct55Nbej7dMB1EoLahWIeR8znxonuOW7qiP/zwQxMbG2tq165tate
ubWJjYx3XO0LAtc/j8Zi0tDTbZVwWbjufnN7z6YY5eAK1DTzAeh4nPeDWqVMnLVu2TE8++aQWL14sY4z69u2rIUOGqEmTJjn2TU1NVdWqVZWVlWWp2oK5YWr+Czl1BIcknT59WitXrtScOXOUkJCg6667ToMHD1avXr30yy+/aNy4cdqyZYt27dplu9SLBAUFad26dQoPD893v2bNmhVRRZeHk88nt3DD6KaAbYO1GBQgnNqr4KauaGMC9z7mpXJSr4JbnrXweDwmKCjIeDyei5Zz64OCgmyXWShOOp8u5Iaez2uuucanpVatWrZLzVOgtqFYhhE3XHTd1BVtTODexywMAq5dHo/HbNy40ezduzffxSmcej6dL1BHcCBwFMsw4oaLrsfjMR999JHZtm1bvotTBOp9TF8RcAOHG9rhhvPpfG7p+XTD6KZAbUOxfGYkKChIqampqly5su1SCi0oKEgej0e5/fWdW+/xeBzzau6AvY/poxIlSiglJUW9evXSkCFDdNddd8nr9ea67x9//KHPP/9c7du3L+Iq8+eWZy3c8O/bDefT+dzyjh03TDIZqG0otmHE6RfdoKAgffXVVwW+8rlmzZpFVNGlcfpMk274AXRLwO3YsaOWL1+uChUq2C6l0NxwPp3PLe/YccP7jgK1DcU2jDj9ouu2i5XTEXADk1PfFOuG8+l8Tu/5zI0bRjcFUhuKbRhx+kXXbWHE6e9DIeAGHie/KdYN59P5nN7zmRcnTQeRl0BpQ8ki+6YAU6NGDUdfdNu3b++YbkFf3H777QF5H9MfX375ZYEBF0Vn1KhRGjBgQPabYs+544471Lt3b4uV+cZN59OePXtsl3DZ5DYHz7Rp03LMwXPPPfcE5Bw85wRkG4r6idlA4Ian7c936NAh89prr5kxY8aY3377zRhzdkbZn376yXJlvnP6TJNuOKc6dOhgDh06ZLuMyyY0NNTs3r3bGJPz5Wx79+41Xq/XZmkFcsP5dL5AHcHhKzeMbgr0NhTLMOKmi+62bdtMpUqVTJ06dUzJkiWzL7hPPPGE6du3r+XqCseJU/O77cfDDQHXyW+Kddv55PR37LhhOohAb0OxDCPnc/pF99ZbbzWjR482xuS84H7++eemZs2aFiu7NE6baZKAG3ic/KZYN51PxtDzGQgCvQ3F8gHWc5z8gNs5YWFh2rJli6699lqFhIRo27Ztql27tvbt26f69evr5MmTtkv0mZPfh3I+p47gOCcmJkatWrXKftbi3DmVmJio3r17a+/evbZL9El6erp69OihTZs26ciRI6patapSU1MVHR2t1atXKzg42HaJPnH6+XShQBrB4Ss3jG4K+DbYTkM2uaFXwcld0cYE/n1Mf7mhV8HJz1rkxslvinXD+ZQbp/V8uuF9R4HehmI7mkaSNm7cqFmzZl20vlq1akpNTbVQkf+6d++uZ555Rm+99Zaks0PjkpOT9dhjj+nuu++2XF3BPv74Y506dUq7du3SP//5z3xnmqxYsaI++uijIq7QP04fwSGdnaExIyPjovXfffedI0d3tGvXTu3atbNdRqG44Xw6JyBHcPjBDaObArkNxTqMuOGiO3nyZPXo0UOVK1fWiRMn1L59++yu6IkTJ9our0Dm/79LuHbt2gL3LVmyZEBPeS0RcG175ZVXfN7XCZNrOf186tSpk5YtW6Ynn3xSixcvljFGffv21fPPP68mTZpk7xccHKwXX3xRVatWtVht/pw+HYQU2G0o1mHEyRfdc8LCwpSQkKDPPvtM27dv19GjR9WqVSvFxMTYLs1nu3btKvDCGsj3Ys9HwLVrypQpPu3n8XgcEUacfj65recTV06xfoDVLQ+4OZnbZpocMmSIfvvtN7311lsKDw/X9u3bVaJECcXFxemWW27R1KlTbZfoMycHXLdw+vnklll93fC+o0BvQ7EOI+c47aLrpq5oN0zNfz4CLi4np59PAT+CoxDcMLopENtAGHEgN73nwS3/c7o
QAdeOUaNGacKECQoODtaoUaPy3fell14qoqoundPOp3Pc1vPphukgArUNxS6MuOWi6xZuDSNO45aA27FjR7344otq2bKlbr311jz383g8WrduXRFWVjy5refTDXPwBGobil0YcctF1y0C/T6mLwi4gaVEiRJKSUnJDrg9e/bUK6+8ooiICMuV+cZN55Pb/rPhhkkmA7UNxW40jRveHummrujzn54PxPuYvnDbCA6nu/D/V++9956OHTtmqRr/cT4FLqePbpICtw3FLoy4wdatW/XNN9+oZcuW2rp1a577eTyeIqzq0lx4H3Po0KEKDw/XsmXLAv5eLAE3sDmt89cN59M57du3V+nSpW2Xcdm4YTqIQG1DsbtN45aLrtO7oi8UqPcxiws3PWtRokQJpaamZv8vLyQkRNu3b/f5Fi2uDKf2fJ7P6aObpMBtQ7HrGXFLr4LTu6Iv5OSZJt0QcD/66KPsgHvu1plTA64xRgMGDMieXOvkyZO6//77L7rILlu2zEZ5BXLD+XQhJ/d8ns8Nk0wGahuKXRhx00X3fE7v4ArU+5i+IOAGlv79++f4/Je//MVSJYXjlvPpfG56x47k7PcdnRNobSh2YURyx0XX4/FcdDFy0sXpQoF6H9MXBNzAMnfuXNslXBI3nk9O7vl0w+gmJ7ShWIaRCznxouv0rugLOfl9KBIBF5eXG86n8zm559MNo5uc0IZiGUbccNF1elf0hQL1PmZhEXBxOTnxfDqfk3s+3TC6yQltKHajaaSzE/F07do1+6L7//7f/1OnTp246KLQ3DCCY+DAgT7t5/TbIE7ghvPpfIE6ggOBo1iGES66gcEJ9zF9RcDF5eTW88mJPZ9uGN3khDYUyzCCwOCmqfkJuLicOJ8Chxvm4HFCGwgjAIDLzk09n26YZDLQ20AYAQBcdm7q+bzwhX+hoaFKSkpS7dq1LVfmu0BvQ7EcTYPA4IT7mAAKxwkjOArLDf+HD7Q2EEZgjRtnmgTgPm6YDiLQ28BtGlgV6PcxARSOm3o+3TC6KdDbQM8IrHLbTJMAznJTz6cbJpkM9DbQMwKrLnyoKiQkRNu2bQuYh6oAFB49n/BVkO0CULwF+n1MAIVHzyd8xW0aWMX7UIDig4545IUwAqsC/T4mgMKj5xO+4pkRAMAVEegjOBA46BkBAFwR9HzCV/SMAAAAqxhNAwAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMCq/w88uqkaYYVZSwAAAABJRU5ErkJggg==\n", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], "source": [ "file_info.filetype.value_counts().plot.bar()" - ] + ], + "outputs": [] }, { "cell_type": "markdown", @@ -164,103 +104,10 @@ "execution_count": 5, "id": "a600fb0f", "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " filesize\n", - "filetype \n", - "FileType.DOCX 36602.0\n", - "FileType.EML 149088.5\n", - "FileType.HTML 1228404.0\n", - "FileType.JPG 64002.5\n", - "FileType.PDF 2429245.0\n", - "FileType.PPTX 38412.0\n", - "FileType.TXT 619.0\n", - "FileType.UNK 1102.0\n", - "FileType.XLSX 4765.0\n", - "FileType.XML 713.5" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ "file_info.groupby(\"filetype\").mean(numeric_only=True)" - ] + ], + "outputs": [] }, { "cell_type": "markdown", @@ -275,98 +122,41 @@ "execution_count": 6, "id": "e5e3a24d", "metadata": {}, - "outputs": [], "source": [ "from unstructured.file_utils.exploration import get_file_info\n", "\n", "filenames = [os.path.join(EXAMPLE_DOCS_DIRECTORY, f) for f in os.listdir(EXAMPLE_DOCS_DIRECTORY)]" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 7, "id": "d8e59472", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-html.html',\n", - " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/example-10k.html',\n", - " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/factbook.xml',\n", - " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email-header.eml',\n", - " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake.docx',\n", - " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email-image-embedded.eml',\n", - " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-text.txt',\n", - " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/layout-parser-paper-fast.pdf',\n", - " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/email-with-image.eml',\n", - " 
'/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/layout-parser-paper-fast.jpg',\n", - " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-power-point.pptx',\n", - " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email.txt',\n", - " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/README.md',\n", - " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/factbook.xsl',\n", - " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-excel.xlsx',\n", - " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email.eml',\n", - " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/layout-parser-paper.pdf',\n", - " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email-attachment.eml',\n", - " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/example.jpg']" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ "filenames" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": 8, "id": "cb0add28", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "MIME type was message/rfc822. 
This file type is not currently supported in unstructured.\n" - ] - }, - { - "data": { - "text/plain": [ - "FileType.EML 4\n", - "FileType.TXT 3\n", - "FileType.HTML 2\n", - "FileType.XML 2\n", - "FileType.PDF 2\n", - "FileType.JPG 2\n", - "FileType.UNK 1\n", - "FileType.DOCX 1\n", - "FileType.PPTX 1\n", - "FileType.XLSX 1\n", - "Name: filetype, dtype: int64" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ "file_info = get_file_info(filenames=filenames)\n", "file_info.filetype.value_counts()" - ] + ], + "outputs": [] }, { "cell_type": "code", "execution_count": null, "id": "ac4473b0", "metadata": {}, - "outputs": [], - "source": [] + "source": [], + "outputs": [] } ], "metadata": { diff --git a/test_unstructured/documents/test_html.py b/test_unstructured/documents/test_html.py index 3ee6d03b79..466abcd6e5 100644 --- a/test_unstructured/documents/test_html.py +++ b/test_unstructured/documents/test_html.py @@ -411,8 +411,8 @@ def test_read_with_existing_pages(): def test_parse_not_anything(monkeypatch): - monkeypatch.setattr(html, "is_narrative_tag", lambda *args: False) - monkeypatch.setattr(html, "is_possible_title", lambda *args: False) + monkeypatch.setattr(html, "is_narrative_tag", lambda *args, **kwargs: False) + monkeypatch.setattr(html, "is_possible_title", lambda *args, **kwargs: False) doc = """

This is nothing

""" document_tree = etree.fromstring(doc, etree.HTMLParser()) el = document_tree.find(".//p") diff --git a/test_unstructured/partition/test_auto.py b/test_unstructured/partition/test_auto.py index d7b837e547..70c6dbb857 100644 --- a/test_unstructured/partition/test_auto.py +++ b/test_unstructured/partition/test_auto.py @@ -667,6 +667,7 @@ def test_auto_partition_works_with_unstructured_jsons_from_file(): def test_auto_partition_odt_from_filename(): filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "fake.odt") elements = partition(filename=filename, strategy=PartitionStrategy.HI_RES) + # TODO "Lorem ipsum dolor sit amet." looks not like English, so it will be inferred as Narrative Text. Maybe need to Fix it assert elements[0] == Title("Lorem ipsum dolor sit amet.") @@ -675,6 +676,7 @@ def test_auto_partition_odt_from_file(): with open(filename, "rb") as f: elements = partition(file=f, strategy=PartitionStrategy.HI_RES) + # TODO "Lorem ipsum dolor sit amet." looks not like English, so it will be inferred as Narrative Text. Maybe need to Fix it assert elements[0] == Title("Lorem ipsum dolor sit amet.") diff --git a/test_unstructured/partition/test_odt.py b/test_unstructured/partition/test_odt.py index 72e2311f75..e7eab7d74e 100644 --- a/test_unstructured/partition/test_odt.py +++ b/test_unstructured/partition/test_odt.py @@ -33,7 +33,7 @@ def test_partition_odt_matches_partition_docx(): def test_partition_odt_from_filename(): elements = partition_odt(example_doc_path("fake.odt")) - + # TODO Lorem ipsum dolor sit amet. look like not English, how to detect Category? assert elements == [ Title("Lorem ipsum dolor sit amet."), Table( @@ -54,6 +54,7 @@ def test_partition_odt_from_file(): elements = partition_odt(file=f) assert elements == [ + # TODO Lorem ipsum dolor sit amet. look like not English, how to detect Category? 
Title("Lorem ipsum dolor sit amet."), Table( "Header row Mon Wed Fri" diff --git a/test_unstructured/partition/test_text_type.py b/test_unstructured/partition/test_text_type.py index edff0e9e5b..fb5c8443f0 100644 --- a/test_unstructured/partition/test_text_type.py +++ b/test_unstructured/partition/test_text_type.py @@ -71,7 +71,7 @@ def test_text_type_handles_non_english_examples(monkeypatch): assert text_type.is_possible_narrative_text(narrative_text, languages=[]) is True assert text_type.is_possible_narrative_text(title, languages=["eng"]) is False - assert text_type.is_possible_narrative_text(title, languages=[]) is False + assert text_type.is_possible_narrative_text(title, languages=["rus"]) is False assert text_type.is_possible_title(title, languages=["eng"]) is False assert text_type.is_possible_title(title, languages=[]) is True @@ -88,7 +88,7 @@ def test_text_type_handles_multi_language_examples(monkeypatch): assert text_type.is_possible_narrative_text(title, languages=["eng"]) is False assert text_type.is_possible_narrative_text(title, languages=["spa", "rus"]) is False - assert text_type.is_possible_narrative_text(title, languages=[]) is False + assert text_type.is_possible_narrative_text(title, languages=[]) is True assert text_type.is_possible_title(title, languages=["eng"]) is False assert text_type.is_possible_title(title, languages=["spa", "rus"]) is True diff --git a/unstructured/documents/base.py b/unstructured/documents/base.py index 77627c1c4d..a93e982fc4 100644 --- a/unstructured/documents/base.py +++ b/unstructured/documents/base.py @@ -12,7 +12,19 @@ class Document(ABC): def __init__(self, languages: Optional[list[str]] = None): self._pages: Optional[List[Page]] = None self._elements: Optional[List[Element]] = None - self._language: list[str] = languages or ["auto"] + self._language: list[str] + if not languages or languages == [""]: + # As [""] is a valid input, it's used to avoid duplicate language detection during partitioning. 
However, I + # believe this design could be improved. Due to the complexity involved in altering the architecture, we + # have chosen to keep it as it is for now. In order to maintain compatibility with past designs, maybe + # discuss better solutions with the core team in the future. + self._language: list[str] = ["auto"] + else: + self._language = languages + + @property + def languages(self) -> list[str]: + return self._language def __str__(self) -> str: return "\n\n".join([str(page) for page in self.pages]) diff --git a/unstructured/documents/html.py b/unstructured/documents/html.py index eeb1cd22c1..90605d909d 100644 --- a/unstructured/documents/html.py +++ b/unstructured/documents/html.py @@ -144,8 +144,7 @@ def __init__( **kwargs: Any, ): self.assembled_articles = assemble_articles - super().__init__(stylesheet=stylesheet, parser=parser, **kwargs) - self._languages: list[str] = languages or ["auto"] + super().__init__(stylesheet=stylesheet, parser=parser, languages=languages, **kwargs) def _parse_pages_from_element_tree(self) -> List[Page]: """Parse HTML elements into pages. @@ -168,8 +167,8 @@ def _parse_pages_from_element_tree(self) -> List[Page]: for article in articles: descendanttag_elems: Tuple[etree._Element, ...] 
= () for tag_elem in article.iter(): - elem_languages = self._languages \ - if "auto" not in self._languages or not tag_elem.text \ + elem_languages = self.languages \ + if "auto" not in self.languages or not tag_elem.text \ else detect_languages(tag_elem.text) if tag_elem in descendanttag_elems: # Prevent repeating something that's been flagged as text as we chase it diff --git a/unstructured/partition/epub.py b/unstructured/partition/epub.py index 937b29915d..644152cece 100644 --- a/unstructured/partition/epub.py +++ b/unstructured/partition/epub.py @@ -61,6 +61,7 @@ def partition_epub( source_format="epub", detection_origin=DETECTION_ORIGIN, date_from_file_object=date_from_file_object, + languages=[""], ) elements = list( diff --git a/unstructured/partition/lang.py b/unstructured/partition/lang.py index 90be172cf9..391854f9a5 100644 --- a/unstructured/partition/lang.py +++ b/unstructured/partition/lang.py @@ -415,7 +415,7 @@ def apply_lang_metadata( else: for e in elements: if hasattr(e, "text"): - e.metadata.languages = detect_languages(e.text, languages=e.metadata.languages) + e.metadata.languages = detect_languages(e.text) yield e else: yield e diff --git a/unstructured/partition/text_type.py b/unstructured/partition/text_type.py index 1528a86a05..237fb3fcc1 100644 --- a/unstructured/partition/text_type.py +++ b/unstructured/partition/text_type.py @@ -7,6 +7,8 @@ import sys from typing import List, Optional +from unstructured.partition.lang import detect_languages + if sys.version_info < (3, 8): from typing_extensions import Final # pragma: nocover else: @@ -59,6 +61,8 @@ def is_possible_narrative_text( """ if languages is None: languages = ["eng"] + if isinstance(languages, list) and "auto" in languages and text: + languages = detect_languages(text) _language_checks = os.environ.get("UNSTRUCTURED_LANGUAGE_CHECKS") if _language_checks is not None: language_checks = _language_checks.lower() == "true" @@ -83,7 +87,11 @@ def is_possible_narrative_text( # For 
caution's sake, we will temporarily use "eng" in languages for judgment, that is, as long as English appears, # we will make a judgment. In the future, we may need to modify it to where only pure English is needed for # exceeds_cap_ratio judgment. - if "eng" in languages and exceeds_cap_ratio(text, threshold=cap_threshold): + capitalizable_languages = { + "eng", "spa", "rus", "fra", "deu", "ita", "por", "nld", "swe", "nor", + "dan", "fin", "ell", "pol", "ces", "slk", "hun", "ron", "bul", "hrv" + } + if not capitalizable_languages.isdisjoint(set(languages)) and exceeds_cap_ratio(text, threshold=cap_threshold): trace_logger.detail(f"Not narrative. Text exceeds cap ratio {cap_threshold}:\n\n{text}") # type: ignore # noqa: E501 return False @@ -105,7 +113,7 @@ def is_possible_title( sentence_min_length: int = 5, title_max_word_length: int = 12, non_alpha_threshold: float = 0.5, - languages: List[str] = ["eng"], + languages: Optional[list[str]] = None, language_checks: bool = False, ) -> bool: """Checks to see if the text passes all of the checks for a valid title. @@ -126,10 +134,15 @@ def is_possible_title( If True, conducts checks that are specific to the chosen language. Turn on for more accurate partitioning and off for faster processing. """ + if languages is None: + languages = ["eng"] _language_checks = os.environ.get("UNSTRUCTURED_LANGUAGE_CHECKS") if _language_checks is not None: language_checks = _language_checks.lower() == "true" + if isinstance(languages, list) and "auto" in languages and text: + languages = detect_languages(text) + if len(text) == 0: trace_logger.detail("Not a title. 
Text is empty.") # type: ignore return False From 7da3d3565ed8f4375a0126b2b8137007469ef1de Mon Sep 17 00:00:00 2001 From: JQQ Date: Fri, 24 May 2024 14:19:49 +0800 Subject: [PATCH 03/20] lint check --- test_unstructured/partition/test_auto.py | 6 +- test_unstructured/partition/test_docx.py | 257 ++++++++++++----------- test_unstructured/partition/test_md.py | 62 +++--- unstructured/documents/base.py | 7 +- unstructured/documents/html.py | 14 +- unstructured/documents/xml.py | 4 +- unstructured/partition/docx.py | 12 +- unstructured/partition/text_type.py | 35 ++- 8 files changed, 218 insertions(+), 179 deletions(-) diff --git a/test_unstructured/partition/test_auto.py b/test_unstructured/partition/test_auto.py index 70c6dbb857..df7e75f854 100644 --- a/test_unstructured/partition/test_auto.py +++ b/test_unstructured/partition/test_auto.py @@ -667,7 +667,8 @@ def test_auto_partition_works_with_unstructured_jsons_from_file(): def test_auto_partition_odt_from_filename(): filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "fake.odt") elements = partition(filename=filename, strategy=PartitionStrategy.HI_RES) - # TODO "Lorem ipsum dolor sit amet." looks not like English, so it will be inferred as Narrative Text. Maybe need to Fix it + # TODO "Lorem ipsum dolor sit amet." looks not like English, so it will be inferred as + # Narrative Text. Maybe need to Fix it assert elements[0] == Title("Lorem ipsum dolor sit amet.") @@ -676,7 +677,8 @@ def test_auto_partition_odt_from_file(): with open(filename, "rb") as f: elements = partition(file=f, strategy=PartitionStrategy.HI_RES) - # TODO "Lorem ipsum dolor sit amet." looks not like English, so it will be inferred as Narrative Text. Maybe need to Fix it + # TODO "Lorem ipsum dolor sit amet." looks not like English, so it will be inferred as + # Narrative Text. 
Maybe need to Fix it assert elements[0] == Title("Lorem ipsum dolor sit amet.") diff --git a/test_unstructured/partition/test_docx.py b/test_unstructured/partition/test_docx.py index 9ddf30840d..6a4cb7d5c2 100644 --- a/test_unstructured/partition/test_docx.py +++ b/test_unstructured/partition/test_docx.py @@ -45,12 +45,11 @@ PartitionStrategy, ) - # -- docx-file loading behaviors ----------------------------------------------------------------- def test_partition_docx_from_filename( - mock_document_file_path: str, expected_elements: list[Element] + mock_document_file_path: str, expected_elements: list[Element] ): elements = partition_docx(mock_document_file_path) @@ -63,7 +62,7 @@ def test_partition_docx_from_filename( def test_partition_docx_with_spooled_file( - mock_document_file_path: str, expected_elements: list[Text] + mock_document_file_path: str, expected_elements: list[Text] ): """`partition_docx()` accepts a SpooledTemporaryFile as its `file` argument. @@ -74,7 +73,7 @@ def test_partition_docx_with_spooled_file( spooled_temp_file = tempfile.SpooledTemporaryFile() spooled_temp_file.write(test_file.read()) spooled_temp_file.seek(0) - elements = partition_docx(file = spooled_temp_file) + elements = partition_docx(file=spooled_temp_file) assert elements == expected_elements for element in elements: assert element.metadata.filename is None @@ -82,22 +81,22 @@ def test_partition_docx_with_spooled_file( def test_partition_docx_from_file(mock_document_file_path: str, expected_elements: list[Text]): with open(mock_document_file_path, "rb") as f: - elements = partition_docx(file = f) + elements = partition_docx(file=f) assert elements == expected_elements for element in elements: assert element.metadata.filename is None def test_partition_docx_uses_file_path_when_both_are_specified( - mock_document_file_path: str, expected_elements: list[Text] + mock_document_file_path: str, expected_elements: list[Text] ): f = io.BytesIO(b"abcde") - elements = 
partition_docx(filename = mock_document_file_path, file = f) + elements = partition_docx(filename=mock_document_file_path, file=f) assert elements == expected_elements def test_partition_docx_raises_with_neither(): - with pytest.raises(ValueError, match = "either `filename` or `file` argument must be provided"): + with pytest.raises(ValueError, match="either `filename` or `file` argument must be provided"): partition_docx() @@ -118,11 +117,11 @@ def test_parition_docx_from_team_chat(): @pytest.mark.parametrize("infer_table_structure", [True, False]) def test_partition_docx_infer_table_structure(infer_table_structure: bool): elements = partition_docx( - example_doc_path("fake_table.docx"), infer_table_structure = infer_table_structure + example_doc_path("fake_table.docx"), infer_table_structure=infer_table_structure ) table_element_has_text_as_html_field = ( - hasattr(elements[0].metadata, "text_as_html") - and elements[0].metadata.text_as_html is not None + hasattr(elements[0].metadata, "text_as_html") + and elements[0].metadata.text_as_html is not None ) assert table_element_has_text_as_html_field == infer_table_structure @@ -166,7 +165,7 @@ def test_partition_docx_includes_neither_page_breaks_nor_numbers_when_rendered_b breaks are a false-positive and will generally produce incorrect page numbers. """ elements = partition_docx( - example_doc_path("handbook-1p-no-rendered-page-breaks.docx"), include_page_breaks = True + example_doc_path("handbook-1p-no-rendered-page-breaks.docx"), include_page_breaks=True ) assert "PageBreak" not in [type(e).__name__ for e in elements] @@ -178,7 +177,7 @@ def test_partition_docx_includes_page_numbers_when_page_break_elements_are_suppr Only inclusion of PageBreak elements is affected by that option. 
""" - elements = partition_docx(example_doc_path("handbook-1p.docx"), include_page_breaks = False) + elements = partition_docx(example_doc_path("handbook-1p.docx"), include_page_breaks=False) assert "PageBreak" not in [type(e).__name__ for e in elements] assert elements[1].metadata.page_number == 1 @@ -187,7 +186,7 @@ def test_partition_docx_includes_page_numbers_when_page_break_elements_are_suppr def test_partition_docx_includes_page_break_elements_when_so_instructed(): elements = partition_docx( - example_doc_path("handbook-1p.docx"), include_page_breaks = True, starting_page_number = 3 + example_doc_path("handbook-1p.docx"), include_page_breaks=True, starting_page_number=3 ) assert "PageBreak" in [type(e).__name__ for e in elements] @@ -211,7 +210,7 @@ def test_partition_docx_detects_lists(): def test_partition_docx_from_filename_excludes_metadata_when_so_instructed(): - elements = partition_docx(example_doc_path("handbook-1p.docx"), include_metadata = False) + elements = partition_docx(example_doc_path("handbook-1p.docx"), include_metadata=False) assert all(e.metadata.to_dict() == {} for e in elements) @@ -219,7 +218,7 @@ def test_partition_docx_from_file_excludes_metadata_when_so_instructed(): with open(example_doc_path("simple.docx"), "rb") as f: assert all( element.metadata.to_dict() == {} - for element in partition_docx(file = f, include_metadata = False) + for element in partition_docx(file=f, include_metadata=False) ) @@ -227,13 +226,13 @@ def test_partition_docx_from_file_excludes_metadata_when_so_instructed(): def test_partition_docx_from_filename_prefers_metadata_filename_when_provided(): - elements = partition_docx(example_doc_path("simple.docx"), metadata_filename = "test") + elements = partition_docx(example_doc_path("simple.docx"), metadata_filename="test") assert all(element.metadata.filename == "test" for element in elements) def test_partition_docx_from_file_prefers_metadata_filename_when_provided(): with open(example_doc_path("simple.docx"), 
"rb") as f: - elements = partition_docx(file = f, metadata_filename = "test") + elements = partition_docx(file=f, metadata_filename="test") assert all(element.metadata.filename == "test" for element in elements) @@ -242,7 +241,7 @@ def test_partition_docx_from_file_prefers_metadata_filename_when_provided(): def test_partition_docx_metadata_date(mocker: MockFixture): mocker.patch( - "unstructured.partition.docx.get_last_modified_date", return_value = "2029-07-05T09:24:28" + "unstructured.partition.docx.get_last_modified_date", return_value="2029-07-05T09:24:28" ) elements = partition_docx(example_doc_path("fake.docx")) @@ -252,11 +251,11 @@ def test_partition_docx_metadata_date(mocker: MockFixture): def test_partition_docx_metadata_date_with_custom_metadata(mocker: MockFixture): mocker.patch( - "unstructured.partition.docx.get_last_modified_date", return_value = "2023-11-01T14:13:07" + "unstructured.partition.docx.get_last_modified_date", return_value="2023-11-01T14:13:07" ) elements = partition_docx( - example_doc_path("fake.docx"), metadata_last_modified = "2020-07-05T09:24:28" + example_doc_path("fake.docx"), metadata_last_modified="2020-07-05T09:24:28" ) assert elements[0].metadata.last_modified == "2020-07-05T09:24:28" @@ -265,11 +264,11 @@ def test_partition_docx_metadata_date_with_custom_metadata(mocker: MockFixture): def test_partition_docx_from_file_metadata_date(mocker: MockFixture): mocker.patch( "unstructured.partition.docx.get_last_modified_date_from_file", - return_value = "2029-07-05T09:24:28", + return_value="2029-07-05T09:24:28", ) with open(example_doc_path("fake.docx"), "rb") as f: - elements = partition_docx(file = f) + elements = partition_docx(file=f) assert elements[0].metadata.last_modified is None @@ -277,11 +276,11 @@ def test_partition_docx_from_file_metadata_date(mocker: MockFixture): def test_partition_docx_from_file_explicit_get_metadata_date(mocker: MockFixture): mocker.patch( 
"unstructured.partition.docx.get_last_modified_date_from_file", - return_value = "2029-07-05T09:24:28", + return_value="2029-07-05T09:24:28", ) with open(example_doc_path("fake.docx"), "rb") as f: - elements = partition_docx(file = f, date_from_file_object = True) + elements = partition_docx(file=f, date_from_file_object=True) assert elements[0].metadata.last_modified == "2029-07-05T09:24:28" @@ -289,11 +288,11 @@ def test_partition_docx_from_file_explicit_get_metadata_date(mocker: MockFixture def test_partition_docx_from_file_metadata_date_with_custom_metadata(mocker: MockFixture): mocker.patch( "unstructured.partition.docx.get_last_modified_date_from_file", - return_value = "2023-11-01T14:13:07", + return_value="2023-11-01T14:13:07", ) with open(example_doc_path("fake.docx"), "rb") as f: - elements = partition_docx(file = f, metadata_last_modified = "2020-07-05T09:24:28") + elements = partition_docx(file=f, metadata_last_modified="2020-07-05T09:24:28") assert elements[0].metadata.last_modified == "2020-07-05T09:24:28" @@ -304,7 +303,7 @@ def test_partition_docx_from_file_without_metadata_date(): sf = tempfile.SpooledTemporaryFile() sf.write(f.read()) sf.seek(0) - elements = partition_docx(file = sf, date_from_file_object = True) + elements = partition_docx(file=sf, date_from_file_object=True) assert elements[0].metadata.last_modified is None @@ -313,7 +312,7 @@ def test_partition_docx_from_file_without_metadata_date(): def test_get_emphasized_texts_from_paragraph( - opts_args: dict[str, Any], expected_emphasized_texts: list[dict[str, str]] + opts_args: dict[str, Any], expected_emphasized_texts: list[dict[str, str]] ): opts_args["file_path"] = example_doc_path("fake-doc-emphasized-text.docx") opts = DocxPartitionerOptions(**opts_args) @@ -336,7 +335,7 @@ def test_get_emphasized_texts_from_paragraph( def test_iter_table_emphasis( - opts_args: dict[str, Any], expected_emphasized_texts: list[dict[str, str]] + opts_args: dict[str, Any], expected_emphasized_texts: 
list[dict[str, str]] ): opts_args["file_path"] = example_doc_path("fake-doc-emphasized-text.docx") opts = DocxPartitionerOptions(**opts_args) @@ -349,9 +348,9 @@ def test_iter_table_emphasis( def test_table_emphasis( - opts_args: dict[str, Any], - expected_emphasized_text_contents: list[str], - expected_emphasized_text_tags: list[str], + opts_args: dict[str, Any], + expected_emphasized_text_contents: list[str], + expected_emphasized_text_tags: list[str], ): opts_args["file_path"] = example_doc_path("fake-doc-emphasized-text.docx") opts = DocxPartitionerOptions(**opts_args) @@ -365,8 +364,8 @@ def test_table_emphasis( def test_partition_docx_grabs_emphasized_texts( - expected_emphasized_text_contents: list[str], - expected_emphasized_text_tags: list[str], + expected_emphasized_text_contents: list[str], + expected_emphasized_text_tags: list[str], ): elements = partition_docx(example_doc_path("fake-doc-emphasized-text.docx")) @@ -417,7 +416,7 @@ def test_parse_category_depth_by_style(opts_args: dict[str, Any]): actual_depth = partitioner._parse_category_depth_by_style(paragraph) assert text in paragraph.text, f"paragraph[{[idx]}].text does not contain {text}" assert ( - actual_depth == depth + actual_depth == depth ), f"expected paragraph[{idx}] to have depth=={depth}, got {actual_depth}" @@ -442,7 +441,7 @@ def test_parse_category_depth_by_style_name(opts_args: dict[str, Any]): for idx, (depth, text) in enumerate(test_cases): assert ( - partitioner._parse_category_depth_by_style_name(text) == depth + partitioner._parse_category_depth_by_style_name(text) == depth ), f"test case {test_cases[idx]} failed" @@ -454,7 +453,7 @@ def test_parse_category_depth_by_style_ilvl(opts_args: dict[str, Any]): def test_add_chunking_strategy_on_partition_docx_default_args(): chunk_elements = partition_docx( - example_doc_path("handbook-1p.docx"), chunking_strategy = "by_title" + example_doc_path("handbook-1p.docx"), chunking_strategy="by_title" ) elements = 
partition_docx(example_doc_path("handbook-1p.docx")) chunks = chunk_by_title(elements) @@ -467,10 +466,10 @@ def test_add_chunking_strategy_on_partition_docx(): docx_path = example_doc_path("fake-doc-emphasized-text.docx") chunk_elements = partition_docx( - docx_path, chunking_strategy = "by_title", max_characters = 9, combine_text_under_n_chars = 5 + docx_path, chunking_strategy="by_title", max_characters=9, combine_text_under_n_chars=5 ) elements = partition_docx(docx_path) - chunks = chunk_by_title(elements, max_characters = 9, combine_text_under_n_chars = 5) + chunks = chunk_by_title(elements, max_characters=9, combine_text_under_n_chars=5) assert chunk_elements == chunks assert elements != chunk_elements @@ -484,20 +483,20 @@ def test_add_chunking_strategy_on_partition_docx(): def test_partition_docx_element_metadata_has_languages(): filename = example_doc_path("handbook-1p.docx") - elements = partition_docx(filename = filename) + elements = partition_docx(filename=filename) assert elements[0].metadata.languages == ["eng"] def test_partition_docx_respects_detect_language_per_element(): filename = example_doc_path("language-docs/eng_spa_mult.docx") - elements = partition_docx(filename = filename, detect_language_per_element = True) + elements = partition_docx(filename=filename, detect_language_per_element=True) langs = [element.metadata.languages for element in elements] assert langs == [["eng"], ["spa", "eng"], ["eng"], ["eng"], ["spa"]] def test_partition_docx_respects_languages_arg(): filename = example_doc_path("handbook-1p.docx") - elements = partition_docx(filename = filename, languages = ["deu"]) + elements = partition_docx(filename=filename, languages=["deu"]) assert elements[0].metadata.languages == ["deu"] @@ -505,8 +504,8 @@ def test_partition_docx_raises_TypeError_for_invalid_languages(): with pytest.raises(TypeError): filename = example_doc_path("handbook-1p.docx") partition_docx( - filename = filename, - languages = "eng", # pyright: 
ignore[reportArgumentType] + filename=filename, + languages="eng", # pyright: ignore[reportArgumentType] ) @@ -664,21 +663,21 @@ def expected_emphasized_texts(): def mock_document(): document = docx.Document() - document.add_paragraph("These are a few of my favorite things:", style = "Heading 1") + document.add_paragraph("These are a few of my favorite things:", style="Heading 1") # NOTE(robinson) - this should get picked up as a list item due to the • - document.add_paragraph("• Parrots", style = "Normal") + document.add_paragraph("• Parrots", style="Normal") # NOTE(robinson) - this should get dropped because it's empty - document.add_paragraph("• ", style = "Normal") - document.add_paragraph("Hockey", style = "List Bullet") + document.add_paragraph("• ", style="Normal") + document.add_paragraph("Hockey", style="List Bullet") # NOTE(robinson) - this should get dropped because it's empty - document.add_paragraph("", style = "List Bullet") + document.add_paragraph("", style="List Bullet") # NOTE(robinson) - this should get picked up as a title - document.add_paragraph("Analysis", style = "Normal") + document.add_paragraph("Analysis", style="Normal") # NOTE(robinson) - this should get dropped because it is empty - document.add_paragraph("", style = "Normal") + document.add_paragraph("", style="Normal") # NOTE(robinson) - this should get picked up as a narrative text - document.add_paragraph("This is my first thought. This is my second thought.", style = "Normal") - document.add_paragraph("This is my third thought.", style = "Body Text") + document.add_paragraph("This is my first thought. 
This is my second thought.", style="Normal") + document.add_paragraph("This is my third thought.", style="Body Text") # NOTE(robinson) - this should just be regular text document.add_paragraph("2023") # NOTE(robinson) - this should be an address @@ -727,16 +726,16 @@ class DescribeDocxPartitionerOptions: # -- .document ------------------------------- def it_loads_the_docx_document( - self, - request: FixtureRequest, - opts_args: dict[str, Any], + self, + request: FixtureRequest, + opts_args: dict[str, Any], ): document_ = instance_mock(request, Document) docx_Document_ = function_mock( - request, "unstructured.partition.docx.docx.Document", return_value = document_ + request, "unstructured.partition.docx.docx.Document", return_value=document_ ) _docx_file_prop_ = property_mock( - request, DocxPartitionerOptions, "_docx_file", return_value = "abcde.docx" + request, DocxPartitionerOptions, "_docx_file", return_value="abcde.docx" ) opts = DocxPartitionerOptions(**opts_args) @@ -750,7 +749,7 @@ def it_loads_the_docx_document( @pytest.mark.parametrize("arg_value", [True, False]) def it_knows_whether_to_emit_PageBreak_elements_as_part_of_the_output_element_stream( - self, arg_value: bool, opts_args: dict[str, Any] + self, arg_value: bool, opts_args: dict[str, Any] ): opts_args["include_page_breaks"] = arg_value opts = DocxPartitionerOptions(**opts_args) @@ -761,7 +760,7 @@ def it_knows_whether_to_emit_PageBreak_elements_as_part_of_the_output_element_st @pytest.mark.parametrize("arg_value", [True, False]) def it_knows_whether_to_include_text_as_html_in_Table_metadata( - self, arg_value: bool, opts_args: dict[str, Any] + self, arg_value: bool, opts_args: dict[str, Any] ): opts_args["infer_table_structure"] = arg_value opts = DocxPartitionerOptions(**opts_args) @@ -771,7 +770,7 @@ def it_knows_whether_to_include_text_as_html_in_Table_metadata( # -- .increment_page_number() ---------------- def it_generates_a_PageBreak_element_when_the_page_number_is_incremented( - self, 
opts_args: dict[str, Any] + self, opts_args: dict[str, Any] ): opts = DocxPartitionerOptions(**opts_args) @@ -783,7 +782,7 @@ def it_generates_a_PageBreak_element_when_the_page_number_is_incremented( next(page_break_iter) def but_it_does_not_generate_a_PageBreak_element_when_include_page_breaks_option_is_off( - self, opts_args: dict[str, Any] + self, opts_args: dict[str, Any] ): opts_args["include_page_breaks"] = False opts = DocxPartitionerOptions(**opts_args) @@ -797,7 +796,7 @@ def but_it_does_not_generate_a_PageBreak_element_when_include_page_breaks_option # -- .last_modified -------------------------- def it_gets_the_last_modified_date_of_the_document_from_the_caller_when_provided( - self, opts_args: dict[str, Any] + self, opts_args: dict[str, Any] ): opts_args["metadata_last_modified"] = "2024-03-05T17:02:53" opts = DocxPartitionerOptions(**opts_args) @@ -805,7 +804,7 @@ def it_gets_the_last_modified_date_of_the_document_from_the_caller_when_provided assert opts.last_modified == "2024-03-05T17:02:53" def and_it_falls_back_to_the_last_modified_date_of_the_file_when_a_path_is_provided( - self, opts_args: dict[str, Any], get_last_modified_date_: Mock + self, opts_args: dict[str, Any], get_last_modified_date_: Mock ): opts_args["file_path"] = "a/b/document.docx" get_last_modified_date_.return_value = "2024-04-02T20:32:35" @@ -817,7 +816,7 @@ def and_it_falls_back_to_the_last_modified_date_of_the_file_when_a_path_is_provi assert last_modified == "2024-04-02T20:32:35" def and_it_falls_back_to_the_last_modified_date_of_the_file_when_a_file_like_object_is_provided( - self, opts_args: dict[str, Any], get_last_modified_date_from_file_: Mock + self, opts_args: dict[str, Any], get_last_modified_date_from_file_: Mock ): file = io.BytesIO(b"abcdefg") opts_args["file"] = file @@ -831,7 +830,7 @@ def and_it_falls_back_to_the_last_modified_date_of_the_file_when_a_file_like_obj assert last_modified == "2024-04-02T20:42:07" def 
but_it_falls_back_to_None_for_the_last_modified_date_when_date_from_file_object_is_False( - self, opts_args: dict[str, Any], get_last_modified_date_from_file_: Mock + self, opts_args: dict[str, Any], get_last_modified_date_from_file_: Mock ): file = io.BytesIO(b"abcdefg") opts_args["file"] = file @@ -847,7 +846,7 @@ def but_it_falls_back_to_None_for_the_last_modified_date_when_date_from_file_obj # -- .metadata_file_path --------------------- def it_uses_the_user_provided_file_path_in_the_metadata_when_provided( - self, opts_args: dict[str, Any] + self, opts_args: dict[str, Any] ): opts_args["file_path"] = "x/y/z.docx" opts_args["metadata_file_path"] = "a/b/c.docx" @@ -857,7 +856,7 @@ def it_uses_the_user_provided_file_path_in_the_metadata_when_provided( @pytest.mark.parametrize("file_path", ["u/v/w.docx", None]) def and_it_falls_back_to_the_document_file_path_otherwise( - self, file_path: str | None, opts_args: dict[str, Any] + self, file_path: str | None, opts_args: dict[str, Any] ): opts_args["file_path"] = file_path opts_args["metadata_file_path"] = None @@ -872,18 +871,18 @@ def and_it_falls_back_to_the_document_file_path_otherwise( [(7, True, 7), (1, False, None)], ) def it_reports_None_when_no_rendered_page_breaks_are_found_in_document( - self, - request: FixtureRequest, - opts_args: dict[str, Any], - page_count: int, - document_contains_pagebreaks: bool, - expected_value: int | None, + self, + request: FixtureRequest, + opts_args: dict[str, Any], + page_count: int, + document_contains_pagebreaks: bool, + expected_value: int | None, ): _document_contains_pagebreaks_prop_ = property_mock( request, DocxPartitionerOptions, "_document_contains_pagebreaks", - return_value = document_contains_pagebreaks, + return_value=document_contains_pagebreaks, ) opts = DocxPartitionerOptions(**opts_args) opts._page_counter = page_count @@ -906,9 +905,9 @@ def it_keeps_track_of_the_page_number(self, opts_args: dict[str, Any]): assert opts.page_number == 3 def 
it_assigns_the_correct_page_number_when_starting_page_number_is_given( - self, opts_args: dict[str, Any] + self, opts_args: dict[str, Any] ): - opts = DocxPartitionerOptions(**opts_args, starting_page_number = 3) + opts = DocxPartitionerOptions(**opts_args, starting_page_number=3) assert opts.page_number == 3 list(opts.increment_page_number()) @@ -921,7 +920,7 @@ def it_assigns_the_correct_page_number_when_starting_page_number_is_given( [(None, "hi_res"), (PartitionStrategy.FAST, "fast"), (PartitionStrategy.HI_RES, "hi_res")], ) def it_knows_which_partitioning_strategy_to_use( - self, opts_args: dict[str, Any], arg_value: str, expected_value: str + self, opts_args: dict[str, Any], arg_value: str, expected_value: str ): opts_args["strategy"] = arg_value opts = DocxPartitionerOptions(**opts_args) @@ -934,7 +933,7 @@ def it_knows_which_partitioning_strategy_to_use( ("file_name", "expected_value"), [("page-breaks.docx", True), ("teams_chat.docx", False)] ) def it_knows_whether_the_document_contains_page_breaks( - self, opts_args: dict[str, Any], file_name: str, expected_value: bool + self, opts_args: dict[str, Any], file_name: str, expected_value: bool ): opts_args["file_path"] = example_doc_path(file_name) opts = DocxPartitionerOptions(**opts_args) @@ -944,7 +943,7 @@ def it_knows_whether_the_document_contains_page_breaks( # -- ._docx_file ----------------------------- def it_uses_the_path_to_open_the_presentation_when_file_path_is_provided( - self, opts_args: dict[str, Any] + self, opts_args: dict[str, Any] ): opts_args["file_path"] = "l/m/n.docx" opts = DocxPartitionerOptions(**opts_args) @@ -952,7 +951,7 @@ def it_uses_the_path_to_open_the_presentation_when_file_path_is_provided( assert opts._docx_file == "l/m/n.docx" def and_it_uses_a_BytesIO_file_to_replaces_a_SpooledTemporaryFile_provided( - self, opts_args: dict[str, Any] + self, opts_args: dict[str, Any] ): spooled_temp_file = tempfile.SpooledTemporaryFile() spooled_temp_file.write(b"abcdefg") @@ -966,7 +965,7 
@@ def and_it_uses_a_BytesIO_file_to_replaces_a_SpooledTemporaryFile_provided( assert docx_file.getvalue() == b"abcdefg" def and_it_uses_the_provided_file_directly_when_not_a_SpooledTemporaryFile( - self, opts_args: dict[str, Any] + self, opts_args: dict[str, Any] ): file = io.BytesIO(b"abcdefg") opts_args["file"] = file @@ -979,11 +978,11 @@ def and_it_uses_the_provided_file_directly_when_not_a_SpooledTemporaryFile( assert docx_file.getvalue() == b"abcdefg" def but_it_raises_ValueError_when_neither_a_file_path_or_file_is_provided( - self, opts_args: dict[str, Any] + self, opts_args: dict[str, Any] ): opts = DocxPartitionerOptions(**opts_args) - with pytest.raises(ValueError, match = "No DOCX document specified, either `filename` or "): + with pytest.raises(ValueError, match="No DOCX document specified, either `filename` or "): opts._docx_file # -- fixtures -------------------------------------------------------------------------------- @@ -1334,22 +1333,22 @@ def create_test_docx(file_path): doc = DocxDocument() # 添加标题和文本内容 - doc.add_heading('春节放假通知', level = 1) - doc.add_paragraph('\n') - doc.add_paragraph('春节放假从大年 30 开始\n共计放假一个月\n比法定假期长三周\n') + doc.add_heading("春节放假通知", level=1) + doc.add_paragraph("\n") + doc.add_paragraph("春节放假从大年 30 开始\n共计放假一个月\n比法定假期长三周\n") - doc.add_heading('标题 2', level = 2) - doc.add_heading('标题 3', level = 3) - doc.add_heading('又一个标题 2', level = 2) + doc.add_heading("标题 2", level=2) + doc.add_heading("标题 3", level=3) + doc.add_heading("又一个标题 2", level=2) - doc.add_paragraph('正文普通\n') + doc.add_paragraph("正文普通\n") # 添加列表 - doc.add_paragraph('一组\n', style = 'ListBullet') - doc.add_paragraph('二组\n', style = 'ListBullet') - doc.add_paragraph('三组\n', style = 'ListBullet') + doc.add_paragraph("一组\n", style="ListBullet") + doc.add_paragraph("二组\n", style="ListBullet") + doc.add_paragraph("三组\n", style="ListBullet") - doc.add_paragraph('继续正文\n') + doc.add_paragraph("继续正文\n") # 保存文档 doc.save(file_path) @@ -1357,9 +1356,10 @@ def 
create_test_docx(file_path): def test_partition_zh_docs() -> None: """ - Fix the issue of erroneously recognizing NarrativeText as Title when splitting Chinese DOCX documents + Fix the issue of erroneously recognizing NarrativeText as Title when splitting + Chinese DOCX documents """ - with tempfile.NamedTemporaryFile(suffix = ".docx", delete = False) as tmp: + with tempfile.NamedTemporaryFile(suffix=".docx", delete=False) as tmp: create_test_docx(tmp.name) elements = partition_docx(tmp.name) @@ -1368,29 +1368,30 @@ def test_partition_zh_docs() -> None: print(element) # 进行断言检查 - assert any('春节放假通知' in element.text for element in elements) - assert any('春节放假从大年 30 开始' in element.text for element in elements) - assert any('标题 2' in element.text for element in elements) - assert any('标题 3' in element.text for element in elements) - assert any('又一个标题 2' in element.text for element in elements) - assert any('正文普通' in element.text for element in elements) - assert any('一组' in element.text for element in elements) - assert any('二组' in element.text for element in elements) - assert any('三组' in element.text for element in elements) - assert any('继续正文' in element.text for element in elements) - assert list(filter(lambda x: '正文普通' in x.text, elements))[0].category == 'NarrativeText' - assert list(filter(lambda x: '一组' in x.text, elements))[0].category == 'ListItem' - assert list(filter(lambda x: '继续正文' in x.text, elements))[0].category == 'NarrativeText' + assert any("春节放假通知" in element.text for element in elements) + assert any("春节放假从大年 30 开始" in element.text for element in elements) + assert any("标题 2" in element.text for element in elements) + assert any("标题 3" in element.text for element in elements) + assert any("又一个标题 2" in element.text for element in elements) + assert any("正文普通" in element.text for element in elements) + assert any("一组" in element.text for element in elements) + assert any("二组" in element.text for element in elements) + assert any("三组" in element.text 
for element in elements) + assert any("继续正文" in element.text for element in elements) + assert list(filter(lambda x: "正文普通" in x.text, elements))[0].category == "NarrativeText" + assert list(filter(lambda x: "一组" in x.text, elements))[0].category == "ListItem" + assert list(filter(lambda x: "继续正文" in x.text, elements))[0].category == "NarrativeText" def test_partition_zh_docs_as_eng() -> None: """ - Fix the issue of erroneously recognizing NarrativeText as Title when splitting Chinese DOCX documents + Fix the issue of erroneously recognizing NarrativeText as Title when splitting + Chinese DOCX documents - When specifying the language as English, the partitioning result should be deceived, it will be recognized - incorrectly. + When specifying the language as English, the partitioning result should be + deceived, it will be recognized incorrectly. """ - with tempfile.NamedTemporaryFile(suffix = ".docx", delete = False) as tmp: + with tempfile.NamedTemporaryFile(suffix=".docx", delete=False) as tmp: create_test_docx(tmp.name) elements = partition_docx(tmp.name, languages=["eng"]) @@ -1399,16 +1400,16 @@ def test_partition_zh_docs_as_eng() -> None: print(element) # 进行断言检查 - assert any('春节放假通知' in element.text for element in elements) - assert any('春节放假从大年 30 开始' in element.text for element in elements) - assert any('标题 2' in element.text for element in elements) - assert any('标题 3' in element.text for element in elements) - assert any('又一个标题 2' in element.text for element in elements) - assert any('正文普通' in element.text for element in elements) - assert any('一组' in element.text for element in elements) - assert any('二组' in element.text for element in elements) - assert any('三组' in element.text for element in elements) - assert any('继续正文' in element.text for element in elements) - assert list(filter(lambda x: '正文普通' in x.text, elements))[0].category == 'Title' - assert list(filter(lambda x: '一组' in x.text, elements))[0].category == 'ListItem' - assert 
list(filter(lambda x: '继续正文' in x.text, elements))[0].category == 'Title' + assert any("春节放假通知" in element.text for element in elements) + assert any("春节放假从大年 30 开始" in element.text for element in elements) + assert any("标题 2" in element.text for element in elements) + assert any("标题 3" in element.text for element in elements) + assert any("又一个标题 2" in element.text for element in elements) + assert any("正文普通" in element.text for element in elements) + assert any("一组" in element.text for element in elements) + assert any("二组" in element.text for element in elements) + assert any("三组" in element.text for element in elements) + assert any("继续正文" in element.text for element in elements) + assert list(filter(lambda x: "正文普通" in x.text, elements))[0].category == "Title" + assert list(filter(lambda x: "一组" in x.text, elements))[0].category == "ListItem" + assert list(filter(lambda x: "继续正文" in x.text, elements))[0].category == "Title" diff --git a/test_unstructured/partition/test_md.py b/test_unstructured/partition/test_md.py index e7c7799820..0e172fa870 100644 --- a/test_unstructured/partition/test_md.py +++ b/test_unstructured/partition/test_md.py @@ -327,48 +327,50 @@ def test_partition_md_parse_table(): def test_partition_zh_md() -> None: """ - Fix the issue of erroneously recognizing NarrativeText as Title when splitting Chinese DOCX documents + Fix the issue of erroneously recognizing NarrativeText as Title when splitting + Chinese DOCX documents """ filename = example_doc_path("zho_md_partition.md") elements = partition_md(filename=filename) assert len(elements) > 0 # 进行断言检查 - assert any('春节放假通知' in element.text for element in elements) - assert any('春节放假从大年 30 开始' in element.text for element in elements) - assert any('标题 2' in element.text for element in elements) - assert any('标题 3' in element.text for element in elements) - assert any('Another Title 2' in element.text for element in elements) - assert any('正文开始' in element.text for element in elements) - assert 
any('一组1' in element.text for element in elements) - assert any('一组2' in element.text for element in elements) - assert any('一组3' in element.text for element in elements) - assert any('正文结束' in element.text for element in elements) - assert list(filter(lambda x: '正文开始' in x.text, elements))[0].category == 'NarrativeText' - assert list(filter(lambda x: '一组' in x.text, elements))[0].category == 'ListItem' - assert list(filter(lambda x: '正文结束' in x.text, elements))[0].category == 'NarrativeText' + assert any("春节放假通知" in element.text for element in elements) + assert any("春节放假从大年 30 开始" in element.text for element in elements) + assert any("标题 2" in element.text for element in elements) + assert any("标题 3" in element.text for element in elements) + assert any("Another Title 2" in element.text for element in elements) + assert any("正文开始" in element.text for element in elements) + assert any("一组1" in element.text for element in elements) + assert any("一组2" in element.text for element in elements) + assert any("一组3" in element.text for element in elements) + assert any("正文结束" in element.text for element in elements) + assert list(filter(lambda x: "正文开始" in x.text, elements))[0].category == "NarrativeText" + assert list(filter(lambda x: "一组" in x.text, elements))[0].category == "ListItem" + assert list(filter(lambda x: "正文结束" in x.text, elements))[0].category == "NarrativeText" def test_partition_zh_docs_as_eng() -> None: """ - Fix the issue of erroneously recognizing NarrativeText as Title when splitting Chinese DOCX documents + Fix the issue of erroneously recognizing NarrativeText as Title when splitting + Chinese DOCX documents - When specifying the language as English, the partitioning result should be deceived, it will be recognized - incorrectly. + When specifying the language as English, the partitioning result should be deceived, + it will be recognized incorrectly. 
""" filename = example_doc_path("zho_md_partition.md") elements = partition_md(filename=filename, languages=["eng"]) assert len(elements) > 0 # 进行断言检查 - assert any('春节放假通知' in element.text for element in elements) - assert any('春节放假从大年 30 开始' in element.text for element in elements) - assert any('标题 2' in element.text for element in elements) - assert any('标题 3' in element.text for element in elements) - assert any('Another Title 2' in element.text for element in elements) - assert any('正文开始' in element.text for element in elements) - assert any('一组1' in element.text for element in elements) - assert any('一组2' in element.text for element in elements) - assert any('一组3' in element.text for element in elements) - assert any('正文结束' in element.text for element in elements) - assert list(filter(lambda x: '正文开始' in x.text, elements))[0].category == 'Title' - assert list(filter(lambda x: '一组' in x.text, elements))[0].category == 'ListItem' - assert list(filter(lambda x: '正文结束' in x.text, elements))[0].category == 'Title' + assert any("春节放假通知" in element.text for element in elements) + assert any("春节放假从大年 30 开始" in element.text for element in elements) + assert any("标题 2" in element.text for element in elements) + assert any("标题 3" in element.text for element in elements) + assert any("Another Title 2" in element.text for element in elements) + assert any("正文开始" in element.text for element in elements) + assert any("一组1" in element.text for element in elements) + assert any("一组2" in element.text for element in elements) + assert any("一组3" in element.text for element in elements) + assert any("正文结束" in element.text for element in elements) + assert list(filter(lambda x: "正文开始" in x.text, elements))[0].category == "Title" + assert list(filter(lambda x: "一组" in x.text, elements))[0].category == "ListItem" + assert list(filter(lambda x: "正文结束" in x.text, elements))[0].category == "Title" diff --git a/unstructured/documents/base.py b/unstructured/documents/base.py index 
a93e982fc4..11bd043f0c 100644 --- a/unstructured/documents/base.py +++ b/unstructured/documents/base.py @@ -14,9 +14,10 @@ def __init__(self, languages: Optional[list[str]] = None): self._elements: Optional[List[Element]] = None self._language: list[str] if not languages or languages == [""]: - # As [""] is a valid input, it's used to avoid duplicate language detection during partitioning. However, I - # believe this design could be improved. Due to the complexity involved in altering the architecture, we - # have chosen to keep it as it is for now. In order to maintain compatibility with past designs, maybe + # As [""] is a valid input, it's used to avoid duplicate language detection during + # partitioning. However, I believe this design could be improved. Due to the + # complexity involved in altering the architecture, we have chosen to keep it as + # it is for now. In order to maintain compatibility with past designs, maybe # discuss better solutions with the core team in the future. self._language: list[str] = ["auto"] else: diff --git a/unstructured/documents/html.py b/unstructured/documents/html.py index 90605d909d..da4fb2abfd 100644 --- a/unstructured/documents/html.py +++ b/unstructured/documents/html.py @@ -167,22 +167,28 @@ def _parse_pages_from_element_tree(self) -> List[Page]: for article in articles: descendanttag_elems: Tuple[etree._Element, ...] 
= () for tag_elem in article.iter(): - elem_languages = self.languages \ - if "auto" not in self.languages or not tag_elem.text \ + elem_languages = ( + self.languages + if "auto" not in self.languages or not tag_elem.text else detect_languages(tag_elem.text) + ) if tag_elem in descendanttag_elems: # Prevent repeating something that's been flagged as text as we chase it # down a chain continue if _is_text_tag(tag_elem): - _page_elements, descendanttag_elems = _process_text_tag(tag_elem, languages=elem_languages) + _page_elements, descendanttag_elems = _process_text_tag( + tag_elem, languages=elem_languages + ) page.elements.extend(_page_elements) elif _is_container_with_text(tag_elem): tag_elem_tail = tag_elem.tail.strip() if tag_elem.tail else None if tag_elem_tail: - _page_elements, descendanttag_elems = _process_text_tag(tag_elem, False, languages=elem_languages) + _page_elements, descendanttag_elems = _process_text_tag( + tag_elem, False, languages=elem_languages + ) page.elements.extend(_page_elements) # NOTE(christine): generate a separate element using a tag tail diff --git a/unstructured/documents/xml.py b/unstructured/documents/xml.py index 69f55dc458..752e09fb93 100644 --- a/unstructured/documents/xml.py +++ b/unstructured/documents/xml.py @@ -122,4 +122,6 @@ def from_file( ) -> Self: _, content = read_txt_file(filename=filename, encoding=encoding) - return cls.from_string(content, parser=parser, stylesheet=stylesheet, languages=languages, **kwargs) + return cls.from_string( + content, parser=parser, stylesheet=stylesheet, languages=languages, **kwargs + ) diff --git a/unstructured/partition/docx.py b/unstructured/partition/docx.py index 531a664222..0be3464cd2 100644 --- a/unstructured/partition/docx.py +++ b/unstructured/partition/docx.py @@ -875,17 +875,21 @@ def _parse_paragraph_text_for_element_type(self, paragraph: Paragraph) -> Option if is_email_address(text): return EmailAddress if is_possible_narrative_text( - text, - 
languages=self._opts.languages + text, + languages=( + self._opts.languages if "auto" not in self._opts.languages else detect_languages(text, self._opts.languages) + ), ): return NarrativeText if is_possible_title( - text, - languages=self._opts.languages + text, + languages=( + self._opts.languages if "auto" not in self._opts.languages else detect_languages(text, self._opts.languages) + ), ): return Title diff --git a/unstructured/partition/text_type.py b/unstructured/partition/text_type.py index 237fb3fcc1..58384ee6ca 100644 --- a/unstructured/partition/text_type.py +++ b/unstructured/partition/text_type.py @@ -83,15 +83,36 @@ def is_possible_narrative_text( cap_threshold = float( os.environ.get("UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD", cap_threshold), ) - # NOTE: exceeds_cap_ratio is designed for english text, so we only use it if the language is english. - # For caution's sake, we will temporarily use "eng" in languages for judgment, that is, as long as English appears, - # we will make a judgment. In the future, we may need to modify it to where only pure English is needed for - # exceeds_cap_ratio judgment. + # NOTE: exceeds_cap_ratio is designed for english text, so we only use it if the + # language is english. For caution's sake, we will temporarily use "eng" in + # languages for judgment, that is, as long as English appears, we will make + # a judgment. In the future, we may need to modify it to where only pure English + # is needed for exceeds_cap_ratio judgment. 
capitalizable_languages = { - "eng", "spa", "rus", "fra", "deu", "ita", "por", "nld", "swe", "nor", - "dan", "fin", "ell", "pol", "ces", "slk", "hun", "ron", "bul", "hrv" + "eng", + "spa", + "rus", + "fra", + "deu", + "ita", + "por", + "nld", + "swe", + "nor", + "dan", + "fin", + "ell", + "pol", + "ces", + "slk", + "hun", + "ron", + "bul", + "hrv", } - if not capitalizable_languages.isdisjoint(set(languages)) and exceeds_cap_ratio(text, threshold=cap_threshold): + if not capitalizable_languages.isdisjoint(set(languages)) and exceeds_cap_ratio( + text, threshold=cap_threshold + ): trace_logger.detail(f"Not narrative. Text exceeds cap ratio {cap_threshold}:\n\n{text}") # type: ignore # noqa: E501 return False From aacd9caa5bdfcc5d34757ed48221b783f8b87366 Mon Sep 17 00:00:00 2001 From: JQQ Date: Wed, 29 May 2024 11:15:15 +0800 Subject: [PATCH 04/20] 1. Modify the "languages" parameter in the initialisation function of "DocxPartitionerOptions" to collapse into keyword arguments (kwargs). 2. Change "capitalizable_languages" to "non_capitalizable_languages" in the function "is_possible_narrative_text". 
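The two changes listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `unstructured`: `OptionsSketch` and `should_apply_cap_check` are hypothetical names, and only the `languages` handling and the set-based gate are modeled.

```python
from __future__ import annotations

from typing import Any

# Languages whose scripts have no upper/lower case distinction (set taken from the patch).
NON_CAPITALIZABLE_LANGUAGES = {"zho", "jpn", "kor"}


class OptionsSketch:
    """Hypothetical stand-in for DocxPartitionerOptions; only `languages` is modeled."""

    def __init__(self, starting_page_number: int = 1, **kwargs: Any):
        self._page_counter = starting_page_number
        # `languages` now arrives through **kwargs instead of a named parameter.
        self._languages: list[str] = kwargs.get("languages", ["auto"])

    @property
    def languages(self) -> list[str]:
        return self._languages


def should_apply_cap_check(languages: list[str]) -> bool:
    """Return True when none of the given languages is in the non-capitalizable set."""
    return NON_CAPITALIZABLE_LANGUAGES.isdisjoint(set(languages))
```

With the inverted set, any language outside Chinese, Japanese and Korean keeps the capitalization check, while a mixed list such as `["zho", "eng"]` skips it. The earlier allow-list behaved the other way around for mixed lists.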
--- unstructured/partition/docx.py | 4 ++-- unstructured/partition/text_type.py | 28 ++++++---------------------- 2 files changed, 8 insertions(+), 24 deletions(-) diff --git a/unstructured/partition/docx.py b/unstructured/partition/docx.py index 0be3464cd2..6ce39f31a0 100644 --- a/unstructured/partition/docx.py +++ b/unstructured/partition/docx.py @@ -157,7 +157,7 @@ def __init__( metadata_last_modified: Optional[str], starting_page_number: int = 1, strategy: str | None = None, - languages: Optional[list[str]] = None, + **kwargs: Any, ): self._date_from_file_object = date_from_file_object self._file = file @@ -170,7 +170,7 @@ def __init__( # -- options object maintains page-number state -- self._page_counter = starting_page_number # -- languages is a list of languages to use for category detection -- - self._languages: list[str] = languages or ["auto"] + self._languages: list[str] = kwargs.get("languages", ["auto"]) @lazyproperty def document(self) -> Document: diff --git a/unstructured/partition/text_type.py b/unstructured/partition/text_type.py index 58384ee6ca..21d8dc3502 100644 --- a/unstructured/partition/text_type.py +++ b/unstructured/partition/text_type.py @@ -88,29 +88,13 @@ def is_possible_narrative_text( # languages for judgment, that is, as long as English appears, we will make # a judgment. In the future, we may need to modify it to where only pure English # is needed for exceeds_cap_ratio judgment. 
- capitalizable_languages = { - "eng", - "spa", - "rus", - "fra", - "deu", - "ita", - "por", - "nld", - "swe", - "nor", - "dan", - "fin", - "ell", - "pol", - "ces", - "slk", - "hun", - "ron", - "bul", - "hrv", + # List of languages that can't be capitalized + non_capitalizable_languages = { + "zho", # Chinese + "jpn", # Japanese + "kor", # Korean } - if not capitalizable_languages.isdisjoint(set(languages)) and exceeds_cap_ratio( + if non_capitalizable_languages.isdisjoint(set(languages)) and exceeds_cap_ratio( text, threshold=cap_threshold ): trace_logger.detail(f"Not narrative. Text exceeds cap ratio {cap_threshold}:\n\n{text}") # type: ignore # noqa: E501 From 57f0afbce5000a33e8a144d38602acc1647a55ef Mon Sep 17 00:00:00 2001 From: JQQ Date: Fri, 31 May 2024 10:39:03 +0800 Subject: [PATCH 05/20] Update CHANGELOG.md --- CHANGELOG.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/CHANGELOG.md b/CHANGELOG.md index e467042c7b..826dc1be2a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -6,6 +6,8 @@ ### Fixes +* Compatibility Issue with Chinese Text in Document Parsing + ## 0.14.2 ### Enhancements From 5b07f6fab4804db234dda1bcd95a6658f642b662 Mon Sep 17 00:00:00 2001 From: JQQ Date: Fri, 31 May 2024 10:40:01 +0800 Subject: [PATCH 06/20] Update CHANGELOG.md --- CHANGELOG.md | 1 + 1 file changed, 1 insertion(+) diff --git a/CHANGELOG.md b/CHANGELOG.md index 8348323432..b420de82b9 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -19,6 +19,7 @@ * **Diable `table_as_cells` output by default** to reduce overhead in partition; now `table_as_cells` is only produced when the env `EXTACT_TABLE_AS_CELLS` is `true` * **Reduce excessive logging** Change per page ocr info level logging into detail level trace logging * **Replace try block in `document_to_element_list` for handling HTMLDocument** Use `getattr(element, "type", "")` to get the `type` attribute of an element when it exists. 
This is more explicit way to handle the special case for HTML documents and prevents other types of attribute error from being silenced by the try block +* **Compatibility Issue with Chinese Text in Document Parsing** ## 0.14.2 From f660e2ee1038275f2b7350054edd5697cb99799a Mon Sep 17 00:00:00 2001 From: JQQ Date: Fri, 21 Jun 2024 15:18:30 +0800 Subject: [PATCH 07/20] Fix Language auto bug --- unstructured/partition/docx.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/unstructured/partition/docx.py b/unstructured/partition/docx.py index 02b0b8899c..6bd1bbc33d 100644 --- a/unstructured/partition/docx.py +++ b/unstructured/partition/docx.py @@ -218,7 +218,7 @@ def __init__( # -- options object maintains page-number state -- self._page_counter = starting_page_number # -- languages is a list of languages to use for category detection -- - self._languages: list[str] = kwargs.get("languages", ["auto"]) + self._languages: list[str] = kwargs.get("languages") or ["auto"] @classmethod def register_picture_partitioner(cls, picture_partitioner: PicturePartitionerT): From 2bfe800d75cf64b94fdb3b9398ecb0472c0648ba Mon Sep 17 00:00:00 2001 From: JQQ Date: Fri, 21 Jun 2024 16:29:29 +0800 Subject: [PATCH 08/20] "Fix incorrect narrative text detection" This commit resolves an issue where the method 'is_possible_narrative_text' would incorrectly return 'True' for an empty list of languages. The corrected state should instead return 'False' for such situations. 
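One way to read this change, assuming the `non_capitalizable_languages` gate introduced earlier in this series: an empty language list is disjoint from every set, so the capitalization-ratio check still runs and title-like text is rejected as narrative. The sketch below uses a simplified, hypothetical `cap_ratio_exceeded` helper rather than the library's `exceeds_cap_ratio`, and `rejected_by_cap_check` is likewise an illustrative name.

```python
from __future__ import annotations

NON_CAPITALIZABLE_LANGUAGES = {"zho", "jpn", "kor"}


def cap_ratio_exceeded(text: str, threshold: float = 0.5) -> bool:
    """Simplified stand-in: share of alphabetic words that start with an uppercase letter."""
    words = [w for w in text.split() if w[0].isalpha()]
    if not words:
        return False
    capitalized = sum(1 for w in words if w[0].isupper())
    return capitalized / len(words) > threshold


def rejected_by_cap_check(text: str, languages: list[str]) -> bool:
    """Return True when the gate applies and the text exceeds the cap ratio."""
    gate_open = NON_CAPITALIZABLE_LANGUAGES.isdisjoint(set(languages))
    return gate_open and cap_ratio_exceeded(text)
```

Under these assumptions, a heavily capitalized heading passed with `languages=[]` is rejected by the gate, which matches the updated expectation in the test, while the same text passed with `languages=["zho"]` bypasses the check.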
--- test_unstructured/partition/test_text_type.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/test_unstructured/partition/test_text_type.py b/test_unstructured/partition/test_text_type.py index fb5c8443f0..dcdc43bb09 100644 --- a/test_unstructured/partition/test_text_type.py +++ b/test_unstructured/partition/test_text_type.py @@ -88,7 +88,7 @@ def test_text_type_handles_multi_language_examples(monkeypatch): assert text_type.is_possible_narrative_text(title, languages=["eng"]) is False assert text_type.is_possible_narrative_text(title, languages=["spa", "rus"]) is False - assert text_type.is_possible_narrative_text(title, languages=[]) is True + assert text_type.is_possible_narrative_text(title, languages=[]) is False assert text_type.is_possible_title(title, languages=["eng"]) is False assert text_type.is_possible_title(title, languages=["spa", "rus"]) is True From 499014382597d22501aadf83acc7ff8f016c9c97 Mon Sep 17 00:00:00 2001 From: JQQ Date: Fri, 21 Jun 2024 20:38:27 +0800 Subject: [PATCH 09/20] "Fix incorrect narrative text detection" This commit resolves an issue where the method 'is_possible_narrative_text' would incorrectly return 'True' for an empty list of languages. The corrected state should instead return 'False' for such situations. 
--- test_unstructured/partition/test_auto.py | 10 ++++------ test_unstructured/partition/test_odt.py | 10 +++++----- unstructured/documents/html.py | 13 ++++++++++--- unstructured/partition/email.py | 3 ++- unstructured/partition/html.py | 1 + unstructured/partition/lang.py | 3 ++- unstructured/partition/text_type.py | 6 +++--- 7 files changed, 27 insertions(+), 19 deletions(-) diff --git a/test_unstructured/partition/test_auto.py b/test_unstructured/partition/test_auto.py index e8232ca6e0..2942f7372d 100644 --- a/test_unstructured/partition/test_auto.py +++ b/test_unstructured/partition/test_auto.py @@ -667,9 +667,8 @@ def test_auto_partition_works_with_unstructured_jsons_from_file(): def test_auto_partition_odt_from_filename(): filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "fake.odt") elements = partition(filename=filename, strategy=PartitionStrategy.HI_RES) - # TODO "Lorem ipsum dolor sit amet." looks not like English, so it will be inferred as - # Narrative Text. Maybe needto Fix it - assert elements[0] == Title("Lorem ipsum dolor sit amet.") + # "Lorem ipsum dolor sit amet." looks not like English, and it look not like a Title + assert elements[0] == NarrativeText("Lorem ipsum dolor sit amet.") def test_auto_partition_odt_from_file(): @@ -677,9 +676,8 @@ def test_auto_partition_odt_from_file(): with open(filename, "rb") as f: elements = partition(file=f, strategy=PartitionStrategy.HI_RES) - # TODO "Lorem ipsum dolor sit amet." looks not like English, so it will be inferred as - # Narrative Text. Maybe need to Fix it - assert elements[0] == Title("Lorem ipsum dolor sit amet.") + # "Lorem ipsum dolor sit amet." 
looks not like English, and it look not like a Title + assert elements[0] == NarrativeText("Lorem ipsum dolor sit amet.") @pytest.mark.parametrize( diff --git a/test_unstructured/partition/test_odt.py b/test_unstructured/partition/test_odt.py index 73e759f017..606fc1ef69 100644 --- a/test_unstructured/partition/test_odt.py +++ b/test_unstructured/partition/test_odt.py @@ -14,7 +14,7 @@ function_mock, ) from unstructured.chunking.basic import chunk_elements -from unstructured.documents.elements import CompositeElement, Table, TableChunk, Title +from unstructured.documents.elements import CompositeElement, Table, TableChunk, Title, NarrativeText from unstructured.partition.docx import partition_docx from unstructured.partition.odt import partition_odt from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA @@ -32,9 +32,9 @@ def test_partition_odt_matches_partition_docx(): def test_partition_odt_from_filename(): elements = partition_odt(example_doc_path("fake.odt")) - # TODO Lorem ipsum dolor sit amet. look like not English, how to detect Category? + # Lorem ipsum dolor sit amet. look like not English, and not a Title. assert elements == [ - Title("Lorem ipsum dolor sit amet."), + NarrativeText("Lorem ipsum dolor sit amet."), Table( "Header row Mon Wed Fri" " Color Blue Red Green" @@ -53,8 +53,8 @@ def test_partition_odt_from_file(): elements = partition_odt(file=f) assert elements == [ - # TODO Lorem ipsum dolor sit amet. look like not English, how to detect Category? - Title("Lorem ipsum dolor sit amet."), + # Lorem ipsum dolor sit amet. look like not English, and not a Title. 
+ NarrativeText("Lorem ipsum dolor sit amet."), Table( "Header row Mon Wed Fri" " Color Blue Red Green" diff --git a/unstructured/documents/html.py b/unstructured/documents/html.py index 4d0e69ba01..a737704bdb 100644 --- a/unstructured/documents/html.py +++ b/unstructured/documents/html.py @@ -2,7 +2,7 @@ from __future__ import annotations -from typing import IO, Final, Iterator, cast +from typing import IO, Final, Iterator, cast, Any import requests from lxml import etree @@ -91,10 +91,10 @@ def _classify_text(self, text: str, tag: str) -> type[Text] | None: if len(text) < 2: return None - if tag not in HEADING_TAGS and is_possible_narrative_text(text): + if tag not in HEADING_TAGS and is_possible_narrative_text(text, languages=self._opts.languages): return NarrativeText - if tag in HEADING_TAGS or is_possible_title(text): + if tag in HEADING_TAGS or is_possible_title(text, languages=self._opts.languages): return Title return Text @@ -472,6 +472,7 @@ def __init__( metadata_last_modified: str | None, skip_headers_and_footers: bool, detection_origin: str | None, + **kwargs: Any, ): self._file_path = file_path self._file = file @@ -484,6 +485,12 @@ def __init__( self._metadata_last_modified = metadata_last_modified self._skip_headers_and_footers = skip_headers_and_footers self._detection_origin = detection_origin + self._languages = kwargs.get("languages") + + @property + def languages(self) -> list[str]: + """Languages to use for language detection.""" + return self._languages if self._languages and self._languages != [""] else ["auto"] @lazyproperty def detection_origin(self) -> str | None: diff --git a/unstructured/partition/email.py b/unstructured/partition/email.py index 1ab3c2ce3d..d7676149d9 100644 --- a/unstructured/partition/email.py +++ b/unstructured/partition/email.py @@ -281,7 +281,7 @@ def partition_email( attachment_partitioner: Optional[Callable[..., list[Element]]] = None, min_partition: Optional[int] = 0, chunking_strategy: Optional[str] = None, - 
languages: Optional[list[str]] = ["auto"], + languages: Optional[list[str]] = None, detect_language_per_element: bool = False, date_from_file_object: bool = False, **kwargs: Any, @@ -327,6 +327,7 @@ def partition_email( from message header failed, attempt to infer last_modified metadata from bytes, otherwise set it to None. """ + languages = languages or ["auto"] if content_source not in VALID_CONTENT_SOURCES: raise ValueError( f"{content_source} is not a valid value for content_source. " diff --git a/unstructured/partition/html.py b/unstructured/partition/html.py index 5967ef6fd4..b51187537b 100644 --- a/unstructured/partition/html.py +++ b/unstructured/partition/html.py @@ -90,6 +90,7 @@ def partition_html( metadata_last_modified=metadata_last_modified, skip_headers_and_footers=skip_headers_and_footers, detection_origin=detection_origin, + languages=languages, ) document = HTMLDocument.load(opts) diff --git a/unstructured/partition/lang.py b/unstructured/partition/lang.py index 391854f9a5..ddb7f078ab 100644 --- a/unstructured/partition/lang.py +++ b/unstructured/partition/lang.py @@ -349,6 +349,8 @@ def detect_languages( langdetect_result = detect_langs(text) except lang_detect_exception.LangDetectException as e: logger.warning(e) + if bool(re.match(r"^[\x00-\x7F]+$", text)): + return ["eng"] # default to English if text is only ascii characters return None # None as default langdetect_langs: list[str] = [] @@ -369,7 +371,6 @@ def detect_languages( for lang in langdetect_langs: if lang not in doc_languages: doc_languages.append(lang) - return doc_languages diff --git a/unstructured/partition/text_type.py b/unstructured/partition/text_type.py index 21d8dc3502..4ab56c7309 100644 --- a/unstructured/partition/text_type.py +++ b/unstructured/partition/text_type.py @@ -59,10 +59,10 @@ def is_possible_narrative_text( If True, conducts checks that are specific to the chosen language. Turn on for more accurate partitioning and off for faster processing. 
""" - if languages is None: - languages = ["eng"] + if languages is None or languages == [""]: + languages = ["auto"] if isinstance(languages, list) and "auto" in languages and text: - languages = detect_languages(text) + languages = detect_languages(text) or [] _language_checks = os.environ.get("UNSTRUCTURED_LANGUAGE_CHECKS") if _language_checks is not None: language_checks = _language_checks.lower() == "true" From b81cd0c2530fd3ff297285b0dd6cb3ccb7a5e0d1 Mon Sep 17 00:00:00 2001 From: JQQ Date: Fri, 21 Jun 2024 21:24:41 +0800 Subject: [PATCH 10/20] "Fix incorrect narrative text detection" This commit resolves an issue where the method 'is_possible_narrative_text' would incorrectly return 'True' for an empty list of languages. The corrected state should instead return 'False' for such situations. --- test_unstructured/partition/test_msg.py | 4 ++-- test_unstructured/partition/test_text_type.py | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/test_unstructured/partition/test_msg.py b/test_unstructured/partition/test_msg.py index 02dd5044c0..f0c2cf5a33 100644 --- a/test_unstructured/partition/test_msg.py +++ b/test_unstructured/partition/test_msg.py @@ -156,8 +156,8 @@ def test_partition_msg_can_process_attachments(): "Image", "Title", "Text", - "Title", - "Title", + "NarrativeText", + "NarrativeText", ] assert [type(e).__name__ for e in elements][-10:] == [ "Title", diff --git a/test_unstructured/partition/test_text_type.py b/test_unstructured/partition/test_text_type.py index dcdc43bb09..2427709b41 100644 --- a/test_unstructured/partition/test_text_type.py +++ b/test_unstructured/partition/test_text_type.py @@ -41,7 +41,7 @@ def test_headings_are_not_narrative_text(text, expected): ("Ask Me About Intellectual Property", False), # Exceeds the cap threshold ("7", False), # Fails because it is numeric ("intellectual property", False), # Fails because it does not contain a verb - ("Dal;kdjfal adawels adfjwalsdf. 
Addad jaja fjawlek", False), + ("Dal;kdjfal adawels adfjwalsdf. Addad jaja fjawlek", True), ("---------------Aske the teacher for an apple----------", False), # Too many non-alpha ("", False), # Doesn't have english words # Fails because it is empty ], @@ -59,7 +59,7 @@ def test_narrative_text_language_checks(): # NOTE(robinson) - This is true because we don't check english vocab if language checks # are set to False text = "Dal;kdjfal adawels adfjwalsdf. Addad jaja fjawlek" - assert text_type.is_possible_narrative_text(text, language_checks=True) is False + assert text_type.is_possible_narrative_text(text, language_checks=True) is True def test_text_type_handles_non_english_examples(monkeypatch): From 9ad8f5dce5d2ec6b942121f6e8510531e4675859 Mon Sep 17 00:00:00 2001 From: JQQ Date: Fri, 21 Jun 2024 21:44:52 +0800 Subject: [PATCH 11/20] fix: resolve compatibility issue with Chinese text parsing and improve code formatting - Update CHANGELOG.md to include compatibility issue fix for Chinese text in document parsing. - Reformat import statements in test_odt.py for better readability. - Adjust import order in html.py to adhere to PEP8 guidelines. - Add `languages` parameter to text processing functions in pdf.py and text.py for improved language handling. - Reformat long lines to improve code readability and maintain consistency. Co-authored-by: Your Name --- CHANGELOG.md | 2 +- test_unstructured/partition/test_odt.py | 2 +- unstructured/documents/html.py | 6 ++++-- unstructured/partition/pdf.py | 1 + unstructured/partition/text.py | 5 +++-- 5 files changed, 10 insertions(+), 6 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index ee46dbe37e..9617aaf328 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,7 @@ * **Pull from `wolfi-base` image.** The amd64 image now pulls from the `unstructured` `wolfi-base` image to avoid duplication of dependency setup steps. 
* **Fix windows temp file.** Make the creation of a temp file in unstructured/partition/pdf_image/ocr.py windows compatible. +* **Compatibility Issue with Chinese Text in Document Parsing** ### Features @@ -89,7 +90,6 @@ * **Diable `table_as_cells` output by default** to reduce overhead in partition; now `table_as_cells` is only produced when the env `EXTACT_TABLE_AS_CELLS` is `true` * **Reduce excessive logging** Change per page ocr info level logging into detail level trace logging * **Replace try block in `document_to_element_list` for handling HTMLDocument** Use `getattr(element, "type", "")` to get the `type` attribute of an element when it exists. This is more explicit way to handle the special case for HTML documents and prevents other types of attribute error from being silenced by the try block -* **Compatibility Issue with Chinese Text in Document Parsing** ## 0.14.2 diff --git a/test_unstructured/partition/test_odt.py b/test_unstructured/partition/test_odt.py index 606fc1ef69..3e5f7a4bd0 100644 --- a/test_unstructured/partition/test_odt.py +++ b/test_unstructured/partition/test_odt.py @@ -14,7 +14,7 @@ function_mock, ) from unstructured.chunking.basic import chunk_elements -from unstructured.documents.elements import CompositeElement, Table, TableChunk, Title, NarrativeText +from unstructured.documents.elements import CompositeElement, NarrativeText, Table, TableChunk from unstructured.partition.docx import partition_docx from unstructured.partition.odt import partition_odt from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA diff --git a/unstructured/documents/html.py b/unstructured/documents/html.py index a737704bdb..a899072a16 100644 --- a/unstructured/documents/html.py +++ b/unstructured/documents/html.py @@ -2,7 +2,7 @@ from __future__ import annotations -from typing import IO, Final, Iterator, cast, Any +from typing import IO, Any, Final, Iterator, cast import requests from lxml import etree @@ -91,7 +91,9 @@ def 
_classify_text(self, text: str, tag: str) -> type[Text] | None: if len(text) < 2: return None - if tag not in HEADING_TAGS and is_possible_narrative_text(text, languages=self._opts.languages): + if tag not in HEADING_TAGS and is_possible_narrative_text( + text, languages=self._opts.languages + ): return NarrativeText if tag in HEADING_TAGS or is_possible_title(text, languages=self._opts.languages): diff --git a/unstructured/partition/pdf.py b/unstructured/partition/pdf.py index cc43257730..83d2802413 100644 --- a/unstructured/partition/pdf.py +++ b/unstructured/partition/pdf.py @@ -458,6 +458,7 @@ def _process_pdfminer_pages( _text, coordinates=points, coordinate_system=coordinate_system, + languages=languages, ) coordinates_metadata = CoordinatesMetadata( points=points, diff --git a/unstructured/partition/text.py b/unstructured/partition/text.py index 96cd105250..77ac177577 100644 --- a/unstructured/partition/text.py +++ b/unstructured/partition/text.py @@ -257,6 +257,7 @@ def element_from_text( text: str, coordinates: Optional[tuple[tuple[float, float], ...]] = None, coordinate_system: Optional[CoordinateSystem] = None, + languages: Optional[list[str]] = None, ) -> Element: if is_in_header_position(coordinates, coordinate_system): return Header( @@ -291,13 +292,13 @@ def element_from_text( coordinates=coordinates, coordinate_system=coordinate_system, ) - elif is_possible_narrative_text(text): + elif is_possible_narrative_text(text, languages=languages): return NarrativeText( text=text, coordinates=coordinates, coordinate_system=coordinate_system, ) - elif is_possible_title(text): + elif is_possible_title(text, languages=languages): return Title( text=text, coordinates=coordinates, From 31993abda525fc1f5d56447826eeb68218f6718c Mon Sep 17 00:00:00 2001 From: JQQ Date: Tue, 2 Jul 2024 20:59:17 +0800 Subject: [PATCH 12/20] Merge from main branch --- test_unstructured/partition/test_msg.py | 4 ++-- unstructured/partition/html/partition.py | 8 ++++++++ 2 files changed, 
10 insertions(+), 2 deletions(-) diff --git a/test_unstructured/partition/test_msg.py b/test_unstructured/partition/test_msg.py index f0c2cf5a33..02dd5044c0 100644 --- a/test_unstructured/partition/test_msg.py +++ b/test_unstructured/partition/test_msg.py @@ -156,8 +156,8 @@ def test_partition_msg_can_process_attachments(): "Image", "Title", "Text", - "NarrativeText", - "NarrativeText", + "Title", + "Title", ] assert [type(e).__name__ for e in elements][-10:] == [ "Title", diff --git a/unstructured/partition/html/partition.py b/unstructured/partition/html/partition.py index 544bf645de..66b05956d0 100644 --- a/unstructured/partition/html/partition.py +++ b/unstructured/partition/html/partition.py @@ -98,6 +98,7 @@ def partition_html( metadata_last_modified=metadata_last_modified, skip_headers_and_footers=skip_headers_and_footers, detection_origin=detection_origin, + languages=languages, ) document = HTMLDocument.load(opts) @@ -130,6 +131,7 @@ def __init__( metadata_last_modified: str | None, skip_headers_and_footers: bool, detection_origin: str | None, + **kwargs: Any, ): self._file_path = file_path self._file = file @@ -142,6 +144,12 @@ def __init__( self._metadata_last_modified = metadata_last_modified self._skip_headers_and_footers = skip_headers_and_footers self._detection_origin = detection_origin + self._languages = kwargs.get("languages") + + @property + def languages(self) -> list[str]: + """Languages to use for language detection.""" + return self._languages if self._languages and self._languages != [""] else ["auto"] @lazyproperty def detection_origin(self) -> str | None: From 9bdb338478e9f8d3b96688a700043fca762d9ebc Mon Sep 17 00:00:00 2001 From: JQQ Date: Wed, 3 Jul 2024 11:47:56 +0800 Subject: [PATCH 13/20] Improved the make check script to support lint checks on MacOS. On MacOS, gsed (installed via brew) replaces the default sed. The script includes platform checks to use gsed on MacOS and sed on Linux. Additionally, awk is used for version extraction. 
Preliminary tests indicate the script works correctly on both Linux and MacOS. --- CHANGELOG.md | 1 + examples/weaviate/weaviate.ipynb | 15 ++++----------- scripts/version-sync.sh | 16 +++++++++++++--- unstructured/documents/html.py | 1 - 4 files changed, 18 insertions(+), 15 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 4cd5fc58aa..e10edbfa64 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -3,6 +3,7 @@ ### Enhancements * **`.doc` files are now supported in the `arm64` image.**. `libreoffice24` is added to the `arm64` image, meaning `.doc` files are now supported. We have follow on work planned to investigate adding `.ppt` support for `arm64` as well. +* Improved the make check script to support lint checks on MacOS. On MacOS, gsed (installed via brew) replaces the default sed. The script includes platform checks to use gsed on MacOS and sed on Linux. Additionally, awk is used for version extraction. Preliminary tests indicate the script works correctly on both Linux and MacOS. 
### Features diff --git a/examples/weaviate/weaviate.ipynb b/examples/weaviate/weaviate.ipynb index 482b9fef1c..bed6365e13 100644 --- a/examples/weaviate/weaviate.ipynb +++ b/examples/weaviate/weaviate.ipynb @@ -125,9 +125,7 @@ "outputs": [], "source": [ "client.collections.delete(unstructured_class_name)\n", - "collection = client.collections.create(\n", - " name=unstructured_class_name\n", - ")\n", + "collection = client.collections.create(name=unstructured_class_name)\n", "# we can get our collection at any time:\n", "collection = client.collections.get(unstructured_class_name)" ] @@ -213,9 +211,7 @@ "source": [ "with collection.batch.dynamic() as batch:\n", " for data_object in tqdm.tqdm(data_objects):\n", - " batch.add_object(\n", - " properties=data_object\n", - " )\n", + " batch.add_object(properties=data_object)\n", " failed_objs_a = client.batch.failed_objects # check if we have failed objects\n", " print(\"FAILED: \", failed_objs_a)" ] @@ -281,7 +277,7 @@ "results = collection.query.bm25(\n", " query=\"document understanding\",\n", " limit=2,\n", - " return_metadata=weaviate.classes.query.MetadataQuery(score=True)\n", + " return_metadata=weaviate.classes.query.MetadataQuery(score=True),\n", ")\n", "for object in results.objects:\n", " print(object.metadata.score, object.properties)" @@ -306,10 +302,7 @@ ], "source": [ "# We can also perform similarity search\n", - "results = collection.query.near_text(\n", - " query=\"document understanding\",\n", - " limit=4\n", - ")\n", + "results = collection.query.near_text(query=\"document understanding\", limit=4)\n", "for object in results.objects:\n", " print(object.properties)" ] diff --git a/scripts/version-sync.sh b/scripts/version-sync.sh index 6ee6728762..d405732c51 100755 --- a/scripts/version-sync.sh +++ b/scripts/version-sync.sh @@ -25,6 +25,13 @@ function getopts-extra() { done } +# Detect OS and set correct sed command +if [[ "$(uname)" == "Darwin" ]]; then + SED_CMD="gsed" +else + SED_CMD="sed" +fi + # 
Parse input options declare CHECK=0 declare SOURCE_FILE="CHANGELOG.md" @@ -135,14 +142,17 @@ for i in "${!FILES_TO_CHECK[@]}"; do # Replace semver in VERSIONFILE with semver obtained from SOURCE_FILE TMPFILE=$(mktemp /tmp/new_version.XXXXXX) # Check sed version, exit if version < 4.3 - if ! sed --version >/dev/null 2>&1; then + echo "Checking sed version..." + if ! $SED_CMD --version >/dev/null 2>&1; then CURRENT_VERSION=1.archaic else - CURRENT_VERSION=$(sed --version | head -n1 | cut -d" " -f4) + CURRENT_VERSION=$($SED_CMD --version | awk 'NR==1{print $4}') +# CURRENT_VERSION=$(sed --version | head -n1 | cut -d" " -f4) fi + echo "Detected sed version: $CURRENT_VERSION" REQUIRED_VERSION="4.3" if [ "$(printf '%s\n' "$REQUIRED_VERSION" "$CURRENT_VERSION" | sort -V | head -n1)" != "$REQUIRED_VERSION" ]; then - echo "sed version must be >= ${REQUIRED_VERSION}" && exit 1 + echo "sed version must be >= ${REQUIRED_VERSION}, now is ${CURRENT_VERSION}" && exit 1 fi sed -E -r "s/$RE_SEMVER/$UPDATED_VERSION/" "$FILE_TO_CHANGE" >"$TMPFILE" if [ $CHECK == 1 ]; then diff --git a/unstructured/documents/html.py b/unstructured/documents/html.py index 6375e46910..c8fe05d1fb 100644 --- a/unstructured/documents/html.py +++ b/unstructured/documents/html.py @@ -3,7 +3,6 @@ from __future__ import annotations from typing import TYPE_CHECKING, Final, Iterator, cast -from typing import IO, Any, Final, Iterator, cast from lxml import etree From 3ebabb4e9548151e044bcfe691482cc55f5ef01c Mon Sep 17 00:00:00 2001 From: JQQ Date: Wed, 3 Jul 2024 11:54:37 +0800 Subject: [PATCH 14/20] Improved the make check script to support lint checks on MacOS. On MacOS, gsed (installed via brew) replaces the default sed. The script includes platform checks to use gsed on MacOS and sed on Linux. Additionally, awk is used for version extraction. Preliminary tests indicate the script works correctly on both Linux and MacOS. 
--- scripts/version-sync.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/version-sync.sh b/scripts/version-sync.sh index d405732c51..4c60dc8cea 100755 --- a/scripts/version-sync.sh +++ b/scripts/version-sync.sh @@ -147,7 +147,7 @@ for i in "${!FILES_TO_CHECK[@]}"; do CURRENT_VERSION=1.archaic else CURRENT_VERSION=$($SED_CMD --version | awk 'NR==1{print $4}') -# CURRENT_VERSION=$(sed --version | head -n1 | cut -d" " -f4) + # CURRENT_VERSION=$(sed --version | head -n1 | cut -d" " -f4) fi echo "Detected sed version: $CURRENT_VERSION" REQUIRED_VERSION="4.3" From d12e6c6a388bb454ad0a924ea0f252f7202a2fb8 Mon Sep 17 00:00:00 2001 From: JQQ Date: Mon, 5 Aug 2024 14:46:52 +0800 Subject: [PATCH 15/20] Check for existing Weaviate class to avoid duplicate creation Added logic in the `test_weaviate_schema_is_valid` test function to check the existing Weaviate schema. If the class to be created already exists, the creation step is skipped and a corresponding message is printed to avoid creating a duplicate class. --- test_unstructured/staging/test_weaviate.py | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/test_unstructured/staging/test_weaviate.py b/test_unstructured/staging/test_weaviate.py index b13e32edb7..7d62c40f86 100644 --- a/test_unstructured/staging/test_weaviate.py +++ b/test_unstructured/staging/test_weaviate.py @@ -58,4 +58,12 @@ def test_weaviate_schema_is_valid(): unstructured_class = create_unstructured_weaviate_class() schema = {"classes": [unstructured_class]} client = Client(embedded_options=EmbeddedOptions()) - client.schema.create(schema) + # Fetch existing schema + existing_schema = client.schema.get() + + # Check if the class already exists + class_names = [cls["class"] for cls in existing_schema["classes"]] + if unstructured_class["class"] not in class_names: + client.schema.create(schema) + else: + print(f'Class "{unstructured_class["class"]}" already exists. 
Skipping creation.') From bca6bb142f67075e9e0f727e1f154f923cfea96e Mon Sep 17 00:00:00 2001 From: JQQ Date: Mon, 12 Aug 2024 18:24:45 +0800 Subject: [PATCH 16/20] bug fix: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Fixed some formatting issues in the Chinese test document. --- example-docs/zho_md_partition.md | 2 -- test_unstructured/partition/test_md.py | 27 -------------------------- 2 files changed, 29 deletions(-) diff --git a/example-docs/zho_md_partition.md b/example-docs/zho_md_partition.md index 800ba56956..3debfb1931 100644 --- a/example-docs/zho_md_partition.md +++ b/example-docs/zho_md_partition.md @@ -17,9 +17,7 @@ Celebrate the Spring Festival holiday. Holiday time: 2021年2月6日至2021年3 正文开始。 - 一组1 - - 一组2 - - 一组3 正文结束。 diff --git a/test_unstructured/partition/test_md.py b/test_unstructured/partition/test_md.py index 0e172fa870..56c3bb00d2 100644 --- a/test_unstructured/partition/test_md.py +++ b/test_unstructured/partition/test_md.py @@ -347,30 +347,3 @@ def test_partition_zh_md() -> None: assert list(filter(lambda x: "正文开始" in x.text, elements))[0].category == "NarrativeText" assert list(filter(lambda x: "一组" in x.text, elements))[0].category == "ListItem" assert list(filter(lambda x: "正文结束" in x.text, elements))[0].category == "NarrativeText" - - -def test_partition_zh_docs_as_eng() -> None: - """ - Fix the issue of erroneously recognizing NarrativeText as Title when splitting - Chinese DOCX documents - - When specifying the language as English, the partitioning result should be deceived, - it will be recognized incorrectly.
- """ - filename = example_doc_path("zho_md_partition.md") - elements = partition_md(filename=filename, languages=["eng"]) - assert len(elements) > 0 - # 进行断言检查 - assert any("春节放假通知" in element.text for element in elements) - assert any("春节放假从大年 30 开始" in element.text for element in elements) - assert any("标题 2" in element.text for element in elements) - assert any("标题 3" in element.text for element in elements) - assert any("Another Title 2" in element.text for element in elements) - assert any("正文开始" in element.text for element in elements) - assert any("一组1" in element.text for element in elements) - assert any("一组2" in element.text for element in elements) - assert any("一组3" in element.text for element in elements) - assert any("正文结束" in element.text for element in elements) - assert list(filter(lambda x: "正文开始" in x.text, elements))[0].category == "Title" - assert list(filter(lambda x: "一组" in x.text, elements))[0].category == "ListItem" - assert list(filter(lambda x: "正文结束" in x.text, elements))[0].category == "Title" From afec02f7c48115e2250ab1b8178c0f2b9e8e3a9c Mon Sep 17 00:00:00 2001 From: JQQ Date: Wed, 14 Aug 2024 12:33:46 +0800 Subject: [PATCH 17/20] doc: Add change log --- CHANGELOG.md | 1 + 1 file changed, 1 insertion(+) diff --git a/CHANGELOG.md b/CHANGELOG.md index e5206f2be3..faef912681 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -3,6 +3,7 @@ ### Enhancements * **Improve `pdfminer` embedded `image` extraction to exclude text elements and produce more accurate bounding boxes.** This results in cleaner, more precise element extraction in `pdf` partitioning. 
+* Fix Compatibility Issue with Chinese Text in Document Parsing ### Features From 36fb66ce7aa0bc9169a653bb2146948f9e802b87 Mon Sep 17 00:00:00 2001 From: JQQ Date: Wed, 14 Aug 2024 13:07:49 +0800 Subject: [PATCH 18/20] doc: Add change log --- CHANGELOG.md | 7 ++++++- unstructured/__version__.py | 2 +- 2 files changed, 7 insertions(+), 2 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index faef912681..7cebb75259 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,9 +1,14 @@ +## 0.15.1-dev10 + +### Enhancements + +* Fix Compatibility Issue with Chinese Text in Document Parsing + ## 0.15.1-dev9 ### Enhancements * **Improve `pdfminer` embedded `image` extraction to exclude text elements and produce more accurate bounding boxes.** This results in cleaner, more precise element extraction in `pdf` partitioning. -* Fix Compatibility Issue with Chinese Text in Document Parsing ### Features diff --git a/unstructured/__version__.py b/unstructured/__version__.py index f4bdb64eb8..eb9cb81eb6 100644 --- a/unstructured/__version__.py +++ b/unstructured/__version__.py @@ -1 +1 @@ -__version__ = "0.15.1-dev9" # pragma: no cover +__version__ = "0.15.1-dev10" # pragma: no cover From 6dc2aed62c8f0f50712e3f8d29f548e9f62436bb Mon Sep 17 00:00:00 2001 From: JQQ Date: Wed, 14 Aug 2024 13:14:51 +0800 Subject: [PATCH 19/20] doc: Add change log --- CHANGELOG.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/CHANGELOG.md b/CHANGELOG.md index ac574e0b5a..ef997cd63b 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,3 +1,9 @@ +## 0.15.2-dev1 + +### Enhancements + +* Fix Compatibility Issue with Chinese Text in Document Parsing + ## 0.15.2 ### Enhancements From 1a8adb30d9f8007868b12c6ac49d28da93c232a6 Mon Sep 17 00:00:00 2001 From: JQQ Date: Thu, 15 Aug 2024 12:16:17 +0800 Subject: [PATCH 20/20] doc: Add change log --- CHANGELOG.md | 6 ++++++ unstructured/__version__.py | 2 +- 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 
f8dbd67395..3947699d61 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,3 +1,9 @@ +## 0.15.5-dev0 + +### Enhancements + +* Fix Compatibility Issue with Chinese Text in Document Parsing + ## 0.15.4 ### Enhancements diff --git a/unstructured/__version__.py b/unstructured/__version__.py index 56b0a82573..5faa0051d8 100644 --- a/unstructured/__version__.py +++ b/unstructured/__version__.py @@ -1 +1 @@ -__version__ = "0.15.4" # pragma: no cover +__version__ = "0.15.5-dev0" # pragma: no cover
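
Note on the `languages` fallback introduced in PATCH 12/20: the new `languages` property in `unstructured/partition/html/partition.py` returns the caller's list unless it is missing, empty, or `[""]`, in which case it falls back to `["auto"]`. The standalone sketch below restates that rule outside the options class so it can be checked in isolation. The function name `resolve_languages` is hypothetical and is not part of the patch; only the fallback expression is taken from the diff.

```python
from __future__ import annotations


def resolve_languages(languages: list[str] | None) -> list[str]:
    """Return the languages to use, defaulting to ["auto"].

    Hypothetical helper that mirrors the expression used by the
    `languages` property in the patch:
        self._languages if self._languages and self._languages != [""] else ["auto"]
    """
    # None and [] are falsy; [""] is the "unset" placeholder value.
    if languages and languages != [""]:
        return languages
    return ["auto"]


if __name__ == "__main__":
    # Print the resolved value for each of the cases the property handles.
    for value in (None, [], [""], ["zho"], ["eng", "zho"]):
        print(value, "->", resolve_languages(value))
```

Under this rule an explicit list such as `["zho"]` is passed through unchanged, so only callers that omit `languages` get automatic detection.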