Importing of large TMs (e.g., 10GB XLIFF files) fails #3217

Open
pmarcis opened this issue Feb 27, 2019 · 6 comments

Comments

@pmarcis

pmarcis commented Feb 27, 2019

Hi!

I deployed amaGama on an Ubuntu machine with 64 GB RAM and a 2 TB SSD (almost empty) and tried importing a 10 GB XLIFF file (28.5 million segments). It did not work. I got the following errors (I can't tell much from them, though):

MemoryError

During handling of the above exception, another exception occurred:

[the MemoryError / "During handling of the above exception, another exception occurred:" pair repeats several more times]

Traceback (most recent call last):
  File "/home/marcis/anaconda3/lib/python3.7/site-packages/translate/storage/xliff.py", line 880, in parsestring
  File "/home/marcis/anaconda3/lib/python3.7/site-packages/translate/storage/base.py", line 781, in parsestring
MemoryError

During handling of the above exception, another exception occurred:

[the same pair repeats many more times]

Traceback (most recent call last):
  File "/home/marcis/anaconda3/lib/python3.7/site-packages/flask_script/__init__.py", line 417, in run
MemoryError

During handling of the above exception, another exception occurred:

MemoryError

Then, I tried splitting the large TM into smaller chunks of 300,000 segments each. That (almost) worked: 95 of the 96 parts imported. The one remaining part I had to split even further (down into chunks of up to 25,000 segments). The following error kept appearing (different from the error with the large TM file):

Importing /home/marcis/general.tm.51g.xlf
ERROR:root:Error while processing: /home/marcis/general.tm.51g.xlf
Traceback (most recent call last):
  File "/home/marcis/amagama/amagama/commands.py", line 161, in handlefile
    store = factory.getobject(filename)
  File "/home/marcis/anaconda3/lib/python3.7/site-packages/translate/storage/factory.py", line 209, in getobject
    store = storeclass.parsefile(storefile)
  File "/home/marcis/anaconda3/lib/python3.7/site-packages/translate/storage/base.py", line 900, in parsefile
    newstore = cls.parsestring(storestring)
  File "/home/marcis/anaconda3/lib/python3.7/site-packages/translate/storage/xliff.py", line 880, in parsestring
    xliff = super(xlifffile, cls).parsestring(storestring)
  File "/home/marcis/anaconda3/lib/python3.7/site-packages/translate/storage/base.py", line 781, in parsestring
    newstore.parse(storestring)
  File "/home/marcis/anaconda3/lib/python3.7/site-packages/translate/storage/lisa.py", line 335, in parse
    self.document = etree.fromstring(xml, parser).getroottree()
  File "src/lxml/etree.pyx", line 3213, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1877, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1765, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1127, in lxml.etree._BaseParser._parseDoc
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
  File "<string>", line 94704

Once I had split everything down to the small parts, the small XLIFF files (almost) imported without errors.

Why almost?

For every file, amaGama printed the following to the console:

Succesfully imported [FILE NAME]

But... when I looked in the PostgreSQL database, there were exactly 0 entries.

Any suggestions on what might have failed?

@pmarcis
Author

pmarcis commented Feb 28, 2019

An update ...

I found out that the import function does not support my XLIFF files. As a workaround, the following works:

  • First, convert the XLIFF files to PO files with the command: xliff2po in.xlf out.po
  • Then, strip lines containing #, fuzzy from out.po (xliff2po somehow assumed that all my segments are just suggestions and not actual translations) using: sed -i '/#, fuzzy/d' out.po.

After doing this, all files could be imported successfully!

I also had to switch from Python 3.7 to 2.7, as amaGama did not work with Python 3.7.

@friedelwolff
Member

I haven't worked on this in a while, but my (unconfirmed) suspicion is that the problem might be in the Translate Toolkit and not in amaGama. Can you check whether pocount from the Translate Toolkit works on these files? There was a change some time ago in the lxml library regarding the handling of large XML files, and this might be what is happening here, but I'm really just guessing.

@friedelwolff
Member

I also think I know why nothing was imported when using XLIFF: there is probably a mismatch between your view of the state of the translations and amaGama's (really the Translate Toolkit's). That is why the conversion to PO marks them as fuzzy. If you can paste a snippet of the XLIFF file, I should be able to confirm. I'm guessing you don't have approved="yes".
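
A minimal sketch (untested, just for illustration) of how you could check this with the Translate Toolkit; it assumes the toolkit is installed, that the file fits in memory, and that corpus.xlf is only an example name:

# Hypothetical check: count how many units the Translate Toolkit considers
# translated. Units whose <target> is not marked approved are typically
# reported as fuzzy rather than translated.
from translate.storage import factory

store = factory.getobject("corpus.xlf")  # example file name
translated = sum(1 for unit in store.units if unit.istranslated())
print("translated units:", translated, "out of", len(store.units))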

@pmarcis
Author

pmarcis commented Jun 4, 2019

I ran the pocount tool on one of the XLIFF files (1.3GB). I got the following result:

pocount corpus.xlf
Processing file : corpus.xlf
Type               Strings      Words (source)    Words (translation)
Translated:       0 (  0%)          0 (  0%)               0
Fuzzy:        2956993 (100%)   59813203 (100%)             n/a
Untranslated:     0 (  0%)          0 (  0%)             n/a
Total:        2956993          59813203                      0

Needs-Work:   2956993 (100%)   59813203 (100%)               0

My XLIFF files were generated from parallel corpora. An example is as follows (none of the segments has the approved attribute):

<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.0">
  <file original="abc.txt" source-language="lt" target-language="en" datatype="plaintext">
    <header/>
    <body>
      <trans-unit id="20275" xml:space="preserve">
        <source>- JAV doleris puslapis .</source>
        <target>- US Dollar Index .</target>
      </trans-unit>
    </body>
  </file>
</xliff>

However, now that I know that the approved attribute is required, I can add it in my conversion tool that converts the parallel corpus into XLIFF.
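
For illustration, a small post-processing sketch that could add the attribute to existing files with lxml (the file names are examples, and the XLIFF is assumed to have no XML namespace, as in the snippet above):

# Hypothetical post-processing step: mark every trans-unit as approved so the
# targets are treated as real translations rather than suggestions.
from lxml import etree

tree = etree.parse("corpus.xlf")                  # example input file
for unit in tree.iter("trans-unit"):              # no namespace in these files
    unit.set("approved", "yes")
tree.write("corpus.approved.xlf", encoding="UTF-8", xml_declaration=True)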

@friedelwolff
Member

I can't think of a reason why pocount would process it successfully but not amaGama. You might want to go with the smaller files (or the PO conversion) for now. One advantage of having several files is that you can run the import commands in parallel. You have to invoke the multiple import commands manually, but I tried to make that a safe way to speed up the import.

If you alter your XLIFF files to have approved="yes", pocount should also report the number of target words.

By the way, I'm probably starting to work on Python 3 support soon.

@friedelwolff
Member

Oh, I misread what you wrote: pocount works on the smaller file. Ok then things are consistent.

Although we have from the outset worked with and planned for gigabyte-sized databases, I don't think I've tried importing such large files. I can't think of a reason it shouldn't simply work, but the Translate Toolkit holds a complete file in memory while processing it, which is probably part of the problem here.

The issue with lxml parsing big files started with lxml 2.7 as a security precaution. You can add the parameter huge_tree=True next to resolve_entities=False near the bottom of translate/storage/lisa.py if you are interested in diving into the code (untested). It might help, but maybe the smaller files work well enough for your case?
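
An untested sketch of what that kind of parser setup looks like in lxml (the file name is only an example; the actual construction in lisa.py differs in detail):

# huge_tree=True lifts libxml2's safety limits on very large documents;
# resolve_entities=False mirrors the existing option mentioned above.
from lxml import etree

parser = etree.XMLParser(resolve_entities=False, huge_tree=True)
with open("corpus.xlf", "rb") as handle:   # example file name
    tree = etree.parse(handle, parser)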
