Importing of large TMs (e.g., 10GB XLIFF files) fails #3217
An update ... I found out that the import function does not support my XLIFF files. As a workaround, I found that converting the files to PO works:
Once I had done this, all files could be imported successfully! I also had to switch from Python 3.7 to 2.7, as amagama did not work with Python 3.7.
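The exact commands were not preserved in this copy of the thread; a minimal sketch of that kind of PO conversion, assuming the Translate Toolkit's xliff2po tool is installed and using a placeholder corpus/ directory, might look like this:

```python
# Hypothetical sketch: convert each XLIFF file to PO with the Translate
# Toolkit's xliff2po command before importing. Paths are placeholders.
import subprocess
from pathlib import Path

for xliff_path in Path("corpus").glob("*.xlf"):
    po_path = xliff_path.with_suffix(".po")
    # Basic invocation: xliff2po <input> <output>
    subprocess.run(["xliff2po", str(xliff_path), str(po_path)], check=True)
```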
I haven't worked on this in a while, but my (unconfirmed) suspicion is that the problem might be in the Translate Toolkit and not in amaGama. Can you check whether pocount (which ships with the Translate Toolkit) can process one of the files?
I also think I know why nothing was imported when using XLIFF: there is probably a mismatch between your view of the state of the translations and amaGama's (really the Translate Toolkit's). That is why the conversion to PO marks them as fuzzy. If you can paste a snippet of the XLIFF file, I should be able to confirm. I'm guessing you don't have approved="yes".
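One way to see that mismatch directly is to ask the Translate Toolkit which units it considers translated. A minimal sketch, assuming the translate package (the Translate Toolkit) is installed and using a small placeholder test file sample.xlf:

```python
# Minimal check: does the Translate Toolkit consider these units translated?
# "sample.xlf" is a placeholder for a small test file.
from translate.storage import xliff

store = xliff.xlifffile.parsefile("sample.xlf")
for unit in store.units:
    # Units without approved="yes" (or an equivalent state) may not count
    # as translated, which would explain why nothing gets imported.
    print(unit.getid(), unit.istranslated())
```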
I ran the pocount tool on one of the XLIFF files (1.3GB). I got the following result:
My XLIFF files were generated from parallel corpora. An example is as follows (none of the segments have the approved="yes" attribute):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.0">
  <file original="abc.txt" source-language="lt" target-language="en" datatype="plaintext">
    <header/>
    <body>
      <trans-unit id="20275" xml:space="preserve">
        <source>- JAV doleris puslapis .</source>
        <target>- US Dollar Index .</target>
      </trans-unit>
    </body>
  </file>
</xliff>
```

However, now I know that there must be the approved="yes" attribute.
I can't think of a reason why pocount would process it successfully but not amaGama. You might want to go with the smaller files (or PO conversion) for now. One advantage of using several files is that you can run parallel import commands. You have to invoke the multiple import commands manually, but I tried to make that a safe way to speed up the import. If you alter your XLIFF files to have approved="yes", the units should then be imported. By the way, I'm probably starting to work on Python 3 support soon.
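For illustration, a rough sketch of adding the attribute to one of the smaller split files with lxml (file names are placeholders; the pasted example above has no XML namespace, so plain tag names are enough here, and a very large file would also need the parser option mentioned in the next comment):

```python
# Hypothetical sketch: mark every trans-unit as approved so the Translate
# Toolkit treats the segments as translated. Paths are placeholders.
from lxml import etree

tree = etree.parse("chunk.xlf")  # one of the smaller split files
for unit in tree.iter("trans-unit"):
    unit.set("approved", "yes")
tree.write("chunk-approved.xlf", xml_declaration=True, encoding="UTF-8")
```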
Oh, I misread what you wrote: pocount works on the smaller file. OK, then things are consistent. Although we have from the outset worked with and planned for gigabyte-sized databases, I don't think I've tried importing such large files. I can't think of a reason it shouldn't simply work, but the Translate Toolkit holds a complete file in memory while processing it, which is probably part of the problem here. The issue with lxml refusing to parse big files started with lxml 2.7 as a security precaution. You can add the huge_tree parameter to the parser to lift that limit.
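Assuming the parameter meant here is lxml's huge_tree (the original comment is truncated at this point), the idea looks roughly like the sketch below; note that applying it to amaGama imports would require the Translate Toolkit's XLIFF parsing to pass it through, which is not something this thread confirms. File names are placeholders.

```python
# Sketch only: parse a very large XML file with libxml2's size limits lifted.
from lxml import etree

parser = etree.XMLParser(huge_tree=True)   # lifts libxml2's text-size/depth limits
tree = etree.parse("big.xlf", parser)      # "big.xlf" is a placeholder path
print(len(tree.findall(".//trans-unit")))
```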
Hi!
I deployed amagama on an Ubuntu machine with 64 GB RAM and a 2 TB SSD (almost empty) and tried importing a 10 GB XLIFF file (28.5 million segments). It did not work. I got the following errors (I can't tell much from them, though):
Then, I tried splitting the large TM into smaller chunks of 300,000 segments. That (almost) worked for 95 of the 96 parts. The remaining part I had to split even further (down to chunks of at most 25,000 segments). The following error kept appearing (different from the one for the large TM file):
Once I had split everything down to the small parts, the small XLIFF files (almost) imported without errors.
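For reference, a sketch of one way such splitting can be done without loading the whole file into memory. This is an illustration with placeholder file names (and header metadata copied from the example above), not the exact script used in this report; it assumes lxml is available.

```python
# Illustration only: stream a huge XLIFF file and write it back out in chunks
# of N trans-units so that each piece stays small enough to import.
from lxml import etree

CHUNK_SIZE = 300_000
HEADER = ('<?xml version="1.0" encoding="UTF-8"?>\n'
          '<xliff version="1.0">\n'
          '<file original="chunk" source-language="lt" target-language="en" datatype="plaintext">\n'
          '<header/>\n<body>\n')
FOOTER = '</body>\n</file>\n</xliff>\n'

def write_chunk(units, index):
    with open(f"chunk_{index:03d}.xlf", "w", encoding="utf-8") as out:
        out.write(HEADER)
        out.writelines(units)
        out.write(FOOTER)

units, index = [], 0
# huge_tree lifts libxml2's limits; iterparse keeps memory use bounded.
for _, elem in etree.iterparse("big.xlf", tag="trans-unit", huge_tree=True):
    units.append(etree.tostring(elem, encoding="unicode"))
    elem.clear()  # drop the element's content once it has been copied out
    if len(units) >= CHUNK_SIZE:
        write_chunk(units, index)
        units, index = [], index + 1
if units:
    write_chunk(units, index)
```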
Why almost? For every file, amagama printed the following in the console:
But when I looked in the PostgreSQL database, there were exactly 0 entries.
Any suggestions on what might have failed?