OAK-10803 -- compress/uncompress property #1526

Merged - 48 commits merged into apache:trunk on Jul 24, 2024

Conversation

ionutzpi
Contributor

No description provided.

pirlogea and others added 22 commits May 17, 2024 16:42
private final byte[] compressedValue;
private final Compression compression;

private static final int DEFAULT_COMPRESSION_THRESHOLD = Integer.getInteger("oak.mongo.compressionThreshold", -1);
Contributor

please use SystemPropertySupplier

Contributor Author

Done

@reschke
Contributor

reschke commented Jun 14, 2024

I believe the main concern is that we don't know yet exactly why we are doing this. IMHO it would be better to first collect data on the size of the properties.

@ionutzpi ionutzpi requested a review from reschke June 17, 2024 10:08
@@ -128,7 +188,7 @@ public boolean equals(Object object) {
         } else if (object instanceof DocumentPropertyState) {
             DocumentPropertyState other = (DocumentPropertyState) object;
             return this.name.equals(other.name)
-                    && this.value.equals(other.value);
+                    && this.getValue().equals(other.getValue());
Contributor

Hm. That means the "equals()" method would need to uncompress. That seems performance critical.

Contributor Author

Yes.

Contributor

is there a way that equals wouldn't have to uncompress?

Contributor

We could special-case the compressed case, and compare the byte arrays. Still expensive.

Maybe we could keep the String's hashCode in order to avoid comparing the byte arrays in most cases.

Contributor

Right, I think there should be a distinction between the non-compressed, compressed, and mixed cases. Only the compressed one is slightly more involved. For that one we could use a combination of length comparison, hashcode (calculated when needed) and/or plain Arrays.equals(byte[],byte[]).
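A minimal sketch of that split (illustrative only, not the committed code; the field layout is modeled on the hunks above, and comparing compressed bytes assumes both values were compressed by the same deterministic Compression, so equal strings yield identical bytes):

import java.util.Arrays;

// Sketch: equals() that only decompresses in the mixed case.
final class DocumentPropertyStateSketch {
    private final String name;
    private final String value;           // null when stored compressed
    private final byte[] compressedValue; // null when stored as a plain String

    DocumentPropertyStateSketch(String name, String value, byte[] compressedValue) {
        this.name = name;
        this.value = value;
        this.compressedValue = compressedValue;
    }

    String getValue() {
        return value != null ? value : uncompress(compressedValue);
    }

    @Override
    public boolean equals(Object object) {
        if (this == object) {
            return true;
        }
        if (!(object instanceof DocumentPropertyStateSketch)) {
            return false;
        }
        DocumentPropertyStateSketch other = (DocumentPropertyStateSketch) object;
        if (!name.equals(other.name)) {
            return false;
        }
        if (compressedValue == null && other.compressedValue == null) {
            return value.equals(other.value);            // plain case: no decompression
        }
        if (compressedValue != null && other.compressedValue != null) {
            return compressedValue.length == other.compressedValue.length
                    && Arrays.equals(compressedValue, other.compressedValue);
        }
        return getValue().equals(other.getValue());      // mixed case: must materialize
    }

    @Override
    public int hashCode() {
        return name.hashCode(); // cheap and consistent with equals()
    }

    private static String uncompress(byte[] bytes) {
        throw new UnsupportedOperationException("sketch only");
    }
}

The cached-hashcode idea would additionally store a lazily computed hash of the uncompressed string, to short-circuit the mixed case as well.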

Contributor Author

Done

// Assert that the compressed value is not null or empty
assertTrue(compressedValue != null && !compressedValue.isEmpty());
}
}
Contributor Author

Done.

Contributor

IIRC, the test fails, right? That means that for certain property values, compression would be lossy, right?

Contributor Author

The test failed. The compression didn't work properly.
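The likely mechanism (an assumption: the value is encoded to UTF-8 bytes before compressing): Java's String.getBytes replaces an unpaired surrogate with '?', so the round trip cannot restore the original string, independent of GZIP itself:

import java.nio.charset.StandardCharsets;

// Demonstrates why a broken (unpaired) surrogate is lossy under a
// String -> UTF-8 bytes -> String round trip.
public class BrokenSurrogateRoundTrip {
    public static void main(String[] args) {
        String broken = "a\uD800b"; // lone high surrogate: not valid UTF-16 text
        byte[] utf8 = broken.getBytes(StandardCharsets.UTF_8);
        String back = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(broken.equals(back)); // false
        System.out.println(back);                // "a?b" - the surrogate was replaced
    }
}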

Contributor

(assuming this was originally referring to testInterestingStrings)

> The test failed. The compression didn't work properly.

So is the plan to still fix this? I don't think this test should be expected to throw ComparisonFailure

Contributor Author

Done

Contributor

Commented in the ticket - to me it looks like this assertEquals(value,""); is not what we'd aim for.

Contributor Author

Compression now works properly even for interesting strings. Done

private final Compression compression;

private static final int DEFAULT_COMPRESSION_THRESHOLD =
SystemPropertySupplier.create("oak.mongo.compressionThreshold", -1).get();
Contributor

The property isn't Mongo-specific. This isn't handled entirely consistently in the existing code base either - perhaps it should be oak.documentMK. ?

Contributor Author

Done.

@ionutzpi ionutzpi requested a review from reschke June 21, 2024 12:40
@reschke
Contributor

reschke commented Jul 11, 2024

Why is this a MongoDB-specific aspect? For MongoDB we know it is not supported (see this skip here) - but for RDB it is supported. So shouldn't this PR keep the support for RDB?

I believe we should come up with consistent behaviour for all types of persistence. My preference would be to disallow Strings with broken UTF-8 serializations, but that would be a change for segment persistence, and that's why I faced resistance when I proposed that back then.

For this PR, it will have to handle what the current system allows.

@stefan-egli
Contributor

With the default set to disable compression, in theory this PR shouldn't do any harm and could be merged, as changing the default is a conscious decision that can be made based on the requirements of the system and which DocumentStore is used.

Having said that, I think the current state of testInterestingStrings isn't ideal: it lists a broken surrogate in a test string but then doesn't test it. So I'd say it should either use that broken surrogate in a test or not list it at all. Except that, now that this problem with broken surrogates has been identified, we need to address it, so we probably should have it in the test indeed.

What we might do is have the test use different DocumentStore fixtures and test on memory, mongo and rdb with the broken surrogate property and verify that the behavior with or without compression is the same.

That could mean that for mongo, the fact that compression doesn't support broken surrogates turns out to be a non-issue, as mongo already doesn't support them. It would just be good to have a test that verifies that.

For RDB it could be different: it could mean that if you want to use compression for RDB you'd have to accept that broken surrogates are not supported (unless we find a fast way to support it in compression).

@ionutzpi
Contributor Author

@reschke I created methods to check the behavior with a broken surrogate before and after introducing compression. After compression the test failed.

@reschke
Contributor

reschke commented Jul 12, 2024

> For RDB it could be different: it could mean that if you want to use compression for RDB you'd have to accept that broken surrogates are not supported (unless we find a fast way to support it in compression).

The elephant in the room is segment persistence, which IMHO allows any Java string. I don't believe discussing the differences of document store backends is helpful here; we should come to an agreement on what the expectations for Oak in general are. Having document store and segment store behave differently IMHO is a problem we should solve (by fixing segment store, which would be my preference).

@ionutzpi
Contributor Author

ionutzpi commented Jul 12, 2024

@reschke I vote for your preference (fixing the segment store).

@stefan-egli
Contributor

Turns out that it might look slightly different altogether:

In the latest test from @ionutzpi, where a broken surrogate is passed through NodeBuilder (except that it shouldn't also go via DocumentPropertyState there), it can be seen that when the property is applied to the commit, as part of persist / compareAgainstBaseState, it does go through JsonSerializer, which handles broken surrogates properly.

So in short: DocumentNodeStore does actually handle broken surrogates in property values fine.

What I think happened in the original testInterestingStrings is that it used the DocumentStore API directly - instead of the DocumentNodeStore API...

Wrt this PR I think there are a few things still open though:

  • beautify the newly introduced tests - they have a few issues, namely:
    • getBrokenSurrogateAndInitializeDifferentStores calls createPropAndCheckValue 3 times depending on the DocumentStore used, but there is no difference between the 3 calls; it can do that in 1 call (see the skeleton below).
    • the way it uses the fixture differs from how it is classically done in a parameterized test. I'd suggest either a parameterized test (which would dispose the fixture at the end) or making sure to dispose the store/documentNodeStore that was created. The way it is done now leaves the DocumentNodeStore undisposed, which means follow-up tests hit an active lease and wait for lease expiration, which extends the test run unnecessarily.
    • createPropAndCheckValue, as mentioned, goes via DocumentPropertyState when setting the property value - that is not how it happens in existing code. DocumentPropertyState is only used for reading; writing happens via a builder, which uses things like ModifiedDocumentNodeState and goes via jcr ValueImpl, StringPropertyState etc.
  • additionally, I believe there is a bigger issue: also based on the new test, when debugging through compression/decompression it can be seen that immediately after a DocumentPropertyState is created and its value is compressed, the code goes ahead and wants to put it into the cache; to do that it wants to know its size, so it calls parsed() - several times - and as part of that it decompresses - several times. I think this highlights a bigger problem: we should check if/how compression performs. As it is now, it would add a lot of decompression overhead. This needs further improvement in my view.
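A skeleton of the parameterized shape meant above (names such as BrokenSurrogateCompressionTest and createNodeStoreFor are illustrative placeholders; only createPropAndCheckValue is from the PR):

import java.util.Arrays;
import java.util.List;

import org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore;
import org.junit.After;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameter;
import org.junit.runners.Parameterized.Parameters;

// Skeleton: one fixture parameter, one test body, disposal in a teardown
// so no DocumentNodeStore is left behind holding an active lease.
@RunWith(Parameterized.class)
public class BrokenSurrogateCompressionTest {

    @Parameters(name = "{0}")
    public static List<Object[]> fixtures() {
        return Arrays.asList(new Object[][] { { "memory" }, { "mongo" }, { "rdb" } });
    }

    @Parameter
    public String fixture;

    private DocumentNodeStore nodeStore;

    @Test
    public void brokenSurrogate() throws Exception {
        nodeStore = createNodeStoreFor(fixture); // placeholder for fixture setup
        createPropAndCheckValue(nodeStore);      // one call instead of three
    }

    @After
    public void tearDown() {
        if (nodeStore != null) {
            nodeStore.dispose(); // releases the lease for follow-up tests
        }
    }

    private static DocumentNodeStore createNodeStoreFor(String fixture) {
        throw new UnsupportedOperationException("sketch only");
    }

    private static void createPropAndCheckValue(DocumentNodeStore ns) {
        // the PR's existing helper would be invoked here
    }
}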

wdyt?

@ionutzpi
Contributor Author

After @stefan-egli's insights, and looking at the performance implications of the whole property compression/decompression process, my opinion is that it doesn't bring much value to continue.

Comment on lines 303 to 304
nodeStore.dispose();
store.dispose();
Contributor

I would suggest doing that in a teardown or at least a finally block.
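For the finally variant, a minimal fragment (createNodeStore is a placeholder for the fixture setup; the teardown variant is shown in the skeleton above):

// Disposal runs even when an assertion fails mid-test.
DocumentNodeStore nodeStore = createNodeStore();
try {
    createPropAndCheckValue(nodeStore);
} finally {
    nodeStore.dispose();
}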

Contributor

@stefan-egli stefan-egli left a comment

After some further offline discussion the suggested plan is as follows:

  • fix the last bits of the PR (mainly the interestingString test buglet where it goes via DocumentPropertyState but shouldn't; the rest are nitpicks).
  • then merge - however, this time the default should really be disabled, so the feature does no harm by default.
  • before activating the compression, performance tests are necessary. At the moment the assumption is that enabling compression comes at a cost of CPU (it calls getValue() several times right after getNodeAtRevision, which seems inefficient and needs to be improved).

tl;dr: let's get this finished and merged, but before using compression, (more) performance tests and improvements are needed; the suggestion is to not use this feature as-is yet (unless the consequences are understood).
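One possible shape for that improvement (an assumption, not something decided in this thread): memoize the decompressed value, e.g. behind a SoftReference, so repeated getValue() / parsed() calls decompress at most once:

import java.lang.ref.SoftReference;

// Sketch: decompress lazily and cache the result; the SoftReference lets the
// cached String be reclaimed again under memory pressure.
final class CachedCompressedValue {
    private final byte[] compressedValue;
    private volatile SoftReference<String> cached;

    CachedCompressedValue(byte[] compressedValue) {
        this.compressedValue = compressedValue;
    }

    String getValue() {
        SoftReference<String> ref = cached;
        String v = ref != null ? ref.get() : null;
        if (v == null) {
            v = uncompress(compressedValue);   // the expensive step
            cached = new SoftReference<>(v);   // benign race: worst case, recompute
        }
        return v;
    }

    private static String uncompress(byte[] bytes) {
        throw new UnsupportedOperationException("sketch only");
    }
}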

Comment on lines 323 to 324
DocumentPropertyState documentPropertyState = new DocumentPropertyState(nodeStore, "p", test, Compression.GZIP);
builder.child(TEST_NODE).setProperty("p", documentPropertyState.getValue(), Type.STRING);
Contributor

Suggested change
-        DocumentPropertyState documentPropertyState = new DocumentPropertyState(nodeStore, "p", test, Compression.GZIP);
-        builder.child(TEST_NODE).setProperty("p", documentPropertyState.getValue(), Type.STRING);
+        builder.child(TEST_NODE).setProperty("p", test, Type.STRING);

As mentioned in the previous comment, going via DocumentPropertyState when setting a property on a NodeBuilder is not correct (I might not have made that point fully clear there).

With this fix, the tests actually succeed

Contributor Author

Done. Thanks.

Comment on lines 307 to 309
if (store != null) {
store.dispose();
}
Contributor

Suggested change
-        if (store != null) {
-            store.dispose();
-        }

nodeStore.dispose already disposes the DocumentStore; calling it twice here results in a test failure:

[ERROR] testBrokenSurrogateWithoutCompressionForRDB(org.apache.jackrabbit.oak.plugins.document.DocumentPropertyStateTest)  Time elapsed: 0.19 s  <<< ERROR!
java.lang.IllegalStateException: Connection handler is already closed (0ms ago)
	at org.apache.jackrabbit.oak.plugins.document.rdb.RDBConnectionHandler.getDataSource(RDBConnectionHandler.java:144)
	at org.apache.jackrabbit.oak.plugins.document.rdb.RDBConnectionHandler.getConnection(RDBConnectionHandler.java:154)
	at org.apache.jackrabbit.oak.plugins.document.rdb.RDBConnectionHandler.getRWConnection(RDBConnectionHandler.java:76)
	at org.apache.jackrabbit.oak.plugins.document.rdb.RDBDocumentStore.dispose(RDBDocumentStore.java:797)
	at org.apache.jackrabbit.oak.plugins.document.DocumentPropertyStateTest.getBrokenSurrogateAndInitializeDifferentStores(DocumentPropertyStateTest.java:308)

Contributor Author

Done.

Contributor

@stefan-egli stefan-egli left a comment

+1, lgtm now. I'll (squash and) merge. The feature should hopefully be disabled this time and thus have no effect.

@stefan-egli stefan-egli merged commit 35f625b into apache:trunk Jul 24, 2024
1 of 2 checks passed
thomasmueller pushed a commit that referenced this pull request Jul 29, 2024
thomasmueller added a commit that referenced this pull request Jul 30, 2024
thomasmueller added a commit that referenced this pull request Sep 6, 2024
* OAK-10341 Tree store

* OAK-10341 Tree store

* OAK-10341 Tree store

* OAK-10341 Tree store

* OAK-10341 Tree store

* OAK-10944: oak-auth-ldap: update commons-pool2 dependency to 2.12.0 (#1576)

* OAK-10705: oak-standalone: update dependencies (#1411)

Updated dependencies.

* OAK-10905 | Add a configurable  async checkpoint creator service (#1560)

* OAK-10905 | Add license header to  AsyncCheckpointService (#1579)

* OAK-10848: commons: remove use of slf4j.event.Level in SystemPropertySupplier implementation (#1580)

* OAK-10685: remove unused import of java.io.UnsupportedEncodingException

* OAK-10949: blob-cloud, segment-aws: update aws SDK to 1.12.761 (dependencies reference vulnerable amazon ion-java version) (#1581)

* OAK-10954: Update spotbugs plugin to 4.8.6.2 (#1588)

* OAK-10959: webapp: update Tomcat dependency to 9.0.90 (#1589)

* OAK-10960: blob-cloud, segment: update netty version to 4.1.111 (#1590)

* OAK-10945: Remove usage of Guava Function interface (#1578)

* OAK-10945: Remove usage of Guava Function interface (oak-upgrade)

* OAK-10945: Remove usage of Guava Function interface (oak-store-spi)

* OAK-10945: Remove usage of Guava Function interface (oak-it)

* OAK-10945: Remove usage of Guava Function interface (oak-store-document)

* OAK-10945: Remove usage of Guava Function interface (oak-store-composite)

* OAK-10945: Remove usage of Guava Function interface (oak-segment-tar)

* OAK-10945: Remove usage of Guava Function interface (oak-segment-azure)

* OAK-10945: Remove usage of Guava Function interface (oak-security-spi)

* OAK-10945: Remove usage of Guava Function interface (oak-security-search)

* OAK-10945: Remove usage of Guava Function interface (oak-run-commons)

* OAK-10945: Remove usage of Guava Function interface (oak-run)

* OAK-10945: Remove usage of Guava Function interface (oak-lucene)

* OAK-10945: Remove usage of Guava Function interface (oak-jcr)

* OAK-10945: Remove usage of Guava Function interface (oak-exercise)

* OAK-10945: Remove usage of Guava Function interface (oak-core-spi)

* OAK-10945: Remove usage of Guava Function interface (oak-core)

* OAK-10945: Remove usage of Guava Function interface (oak-commons)

* OAK-10945: Remove usage of Guava Function interface (oak-blob-plugins)

* OAK-10945: Remove usage of Guava Function interface (oak-blob-cloud, oak-blob-cloud-azure)

* OAK-10945: Remove usage of Guava Function interface (oak-upgrade) - cleanup

* OAK-10945: Remove usage of Guava Function interface (oak-store-spi) - cleanup

* OAK-10945: Remove usage of Guava Function interface (oak-store-document) - cleanup

* OAK-10945: Remove usage of Guava Function interface (oak-store-composite) - cleanup

* OAK-10945: Remove usage of Guava Function interface (oak-segment-tar) - cleanup

* OAK-10945: Remove usage of Guava Function interface (oak-lucene) - cleanup

* OAK-10945: Remove usage of Guava Function interface (oak-jcr) - cleanup

* OAK-10962: oak-solr-osgi: update zookeeper dependency to 3.9.2 (#1591)

* Update build.yml to disable Sonar for now

...because of failures, as in https://github.com/apache/jackrabbit-oak/actions/runs/10021673018

* OAK-10965 - Make ConsoleIndexingReporter thread safe. (#1592)

* OAK-6762: Convert oak-blob to OSGi R7 annotations (#1413)

done

* OAK-6773: Convert oak-store-composite to OSGi R7 annotations (#1489)

done

* OAK-10951 - Add a new configuration property (#1594)

- "oak.indexer.persistedLinkedList.cacheSize" - sets the cache size of the PersistedLinkedList used to traverse the FFS. This controls the number FFS entries kept in memory.

* OAK-10966 - Avoid object creation in PathUtils.isAncestor (#1596)

* OAK-10964: bump nimbus-jose-jwt dependency to latest (#1593)

* OAK-6773: Convert oak-store-composite to OSGi R7 annotations - fix line ends

* OAK-10803 -- compress/uncompress property, disabled by default (#1526)

Co-authored-by: pirlogea <[email protected]>

* OAK-10974 and OAK-10869 : temporarily disabling flaky tests

* OAK-10971 - Add a method to test if a path is a direct ancestor of another: PathUtils.isDirectAncestor()  (#1598)

* OAK-10803: fix NPE when Mongo is unavailable, remove '*' imports (#1601)

* OAK-10976 - Avoid unnecessary call to PathUtils.getName in IndexDefinition (#1603)

* OAK-10904: Close token refresh executor service after access token is no longer needed (#1545)

* OAK-10904: change token refresh executor to use daemon thread

* OAK-10904: add log

* OAK-10904: move storage credential method to concrete class and explicitly calling close to shutdown the executor

* OAK-10904: wrap code for generating storage credentials in try catch

* OAK-10904: minor changes

* OAK-10904: refactor log

* OAK-10977 - Cleanup IndexDefinition class (#1604)

- replace Guava usages with JDK equivalents
- fix typos
- improve formatting

* OAK-10978 - Skip Azure compaction when there's not enough garbage in the repository (#1606)

* OAK-10966 - Indexing job: create optimized version of PersistedLinkedList (#1595)

* Revert "OAK-10966 - Indexing job: create optimized version of PersistedLinkedList (#1595)"

This reverts commit 8a72ef8.

* Revert "OAK-10978 - Skip Azure compaction when there's not enough garbage in the repository (#1606)"

This reverts commit 5814638.

* Revert "OAK-10977 - Cleanup IndexDefinition class (#1604)"

This reverts commit ce5c7df.

* Revert "OAK-10904: Close token refresh executor service after access token is no longer needed (#1545)"

This reverts commit 632a15b.

* Revert "OAK-10976 - Avoid unnecessary call to PathUtils.getName in IndexDefinition (#1603)"

This reverts commit e411960.

* Revert "OAK-10803: fix NPE when Mongo is unavailable, remove '*' imports (#1601)"

This reverts commit 5dd6344.

* Revert "OAK-10971 - Add a method to test if a path is a direct ancestor of another: PathUtils.isDirectAncestor()  (#1598)"

This reverts commit c416850.

* Revert "OAK-10974 and OAK-10869 : temporarily disabling flaky tests"

This reverts commit aefb990.

* Revert "OAK-10803 -- compress/uncompress property, disabled by default (#1526)"

This reverts commit 25792e7.

* Revert "OAK-6773: Convert oak-store-composite to OSGi R7 annotations - fix line ends"

This reverts commit 818317c.

* Revert "OAK-10964: bump nimbus-jose-jwt dependency to latest (#1593)"

This reverts commit 9528fdd.

* Revert "OAK-10966 - Avoid object creation in PathUtils.isAncestor (#1596)"

This reverts commit b37db4c.

* Revert "OAK-10951 - Add a new configuration property (#1594)"

This reverts commit ebefe01.

* Revert "OAK-6773: Convert oak-store-composite to OSGi R7 annotations (#1489)"

This reverts commit 0521c63.

* Revert "OAK-6762: Convert oak-blob to OSGi R7 annotations (#1413)"

This reverts commit 6c44805.

* Revert "OAK-10965 - Make ConsoleIndexingReporter thread safe. (#1592)"

This reverts commit 8b05cae.

* Revert "Update build.yml to disable Sonar for now"

This reverts commit 8db9183.

* Revert "OAK-10962: oak-solr-osgi: update zookeeper dependency to 3.9.2 (#1591)"

This reverts commit 95ed6c4.

* Revert "OAK-10945: Remove usage of Guava Function interface (#1578)"

This reverts commit 5d56db1.

* Revert "OAK-10960: blob-cloud, segment: update netty version to 4.1.111 (#1590)"

This reverts commit cd35db7.

* Revert "OAK-10959: webapp: update Tomcat dependency to 9.0.90 (#1589)"

This reverts commit 5fb55e0.

* Revert "OAK-10954: Update spotbugs plugin to 4.8.6.2 (#1588)"

This reverts commit b098fd8.

* Revert "OAK-10949: blob-cloud, segment-aws: update aws SDK to 1.12.761 (dependencies reference vulnerable amazon ion-java version) (#1581)"

This reverts commit cafbd29.

* Revert "OAK-10685: remove unused import of java.io.UnsupportedEncodingException"

This reverts commit 36cfcfa.

* Revert "OAK-10848: commons: remove use of slf4j.event.Level in SystemPropertySupplier implementation (#1580)"

This reverts commit ba530fe.

* Revert "OAK-10905 | Add license header to  AsyncCheckpointService (#1579)"

This reverts commit 7f112fa.

* Revert "OAK-10905 | Add a configurable  async checkpoint creator service (#1560)"

This reverts commit f12ccb1.

* Revert "OAK-10705: oak-standalone: update dependencies (#1411)"

This reverts commit dc10a12.

* Revert "OAK-10944: oak-auth-ldap: update commons-pool2 dependency to 2.12.0 (#1576)"

This reverts commit 3d6d009.

* OAK-10341 Tree store (bugfix for loggers)

* OAK-10341 Tree store (PipelinedTreeStoreStrategy.java)

* OAK-10341 Tree store (tests)

* OAK-10341 Tree store (use less memory)

* OAK-10341 Tree store (fix memory cache calculation)

* OAK-10341 Tree store (blob prefetch)

* OAK-10341 Tree store (blob prefetch)

* OAK-10341 Tree store (blob prefetch)

* OAK-10341 Tree store (node prefetch)

* OAK-10341 Tree store (incremental)

* OAK-10341 Tree store (pack files)

* OAK-10341 Tree store (pack files)

* OAK-10341 Tree store (tests)

* OAK-10341 Tree store (incremental)

* OAK-10341 Tree store (compress)

* OAK-10341 Tree store (traverse included paths)

* OAK-10341 Tree store

* OAK-10341 Tree store

* OAK-10341 Tree store

* OAK-10341 Tree store

* OAK-10341 Tree store (use same blob prefetching configuration as for FlatFileStore)

* OAK-10341 Tree store (javadocs)

* OAK-10341 Tree store (code review)

---------

Co-authored-by: Julian Reschke <[email protected]>
Co-authored-by: mbaedke <[email protected]>
Co-authored-by: nit0906 <[email protected]>
Co-authored-by: Nuno Santos <[email protected]>
Co-authored-by: Tushar <[email protected]>
Co-authored-by: ionutzpi <[email protected]>
Co-authored-by: pirlogea <[email protected]>
Co-authored-by: Stefan Egli <[email protected]>
Co-authored-by: Andrei Dulceanu <[email protected]>