BufferedAdd with many documents slowly increases flush cycle time #523

Open
rms2219 opened this issue Sep 15, 2017 · 11 comments


rms2219 commented Sep 15, 2017

I'm using the BufferedAdd plugin to index my documents. In some instances, I'll need to index tens of millions of documents. While testing, I set my buffer size to 30,000. I noticed that when my indexing process begins, flushing those 30K documents takes about a second. After a couple million documents, that cycle takes a little longer (~1 second more). This goes on and on, and by around 10 or 11 million indexed documents I'm up to about 11 seconds between cycles. Is there something that Solarium is holding onto or writing to that would be slowing this down? I should add that this happens regardless of whether the Solr index is empty or already populated with millions of other documents.
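For context, an indexing loop like the one described presumably looks something like the following (a sketch in current Solarium 6 style; the endpoint config, adapter, and $documents source are placeholders, not the reporter's actual code):

use Solarium\Client;
use Solarium\Core\Client\Adapter\Curl;
use Symfony\Component\EventDispatcher\EventDispatcher;

require __DIR__.'/vendor/autoload.php';

$config = ['endpoint' => ['localhost' => ['host' => '127.0.0.1', 'port' => 8983, 'path' => '/', 'core' => 'mycore']]];
$client = new Client(new Curl(), new EventDispatcher(), $config);

$buffer = $client->getPlugin('bufferedadd');   // BufferedAdd plugin
$buffer->setBufferSize(30000);                 // flush automatically every 30,000 documents

foreach ($documents as $fields) {              // $documents: any iterable source of field arrays
    $buffer->createDocument($fields);          // buffers the document, flushing when the buffer is full
}

$buffer->commit();                             // flush the remainder and commit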

@thomascorthals
Member

I have tried to reproduce this with a buffer size of 30,000 and a dataset of 12,000,000 documents against a local Solr on my laptop. There is a noticeable difference between smaller (< 100 KiB) and larger (~ 600 KiB) docs. There are periodical peaks in the time between flushes that I didn't look into because they're not what this issue is about.

[graph: time between flushes for small and large docs, with trendlines]

The trendlines are near horizontal. There is a small upward slope (mSmall docs = 0.0002 and mLarge docs = 0.0003), but not nearly as much as @rms2219 was experiencing. My document ids were generated as test_1 .. test_12000000 so the later buffers are almost 90 KiB larger than the first ones. I tried again with test_00000001 .. test_12000000 to keep the buffer size consistent throughout the test. This time, the slope was slightly downward (mSmall docs = −0.000009 and mLarge docs = −0.0002).

[graph: time between flushes for small and large docs with zero-padded ids, with trendlines]
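For completeness, the zero-padded ids from the second run can be generated with something like this (a trivial sketch, not the exact test code):

// pad the numeric part so every id, and thus every buffer, has the same byte size
$data['id'] = sprintf('test_%08d', $i + 1);   // test_00000001 .. test_12000000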

I think it's safe to assume that larger buffers lead to slower flushes if everything else remains the same. One thing that could cause a steadily growing delay is a loop that bleeds data over into the next iteration. I tried this:

set_time_limit(0);
ini_set('memory_limit', ini_get('suhosin.memory_limit') ?: -1);

// initial field set; 'ignored_cat' is deliberately reused across iterations
$data = [
    'name' => 'Test doc',
    'ignored_cat' => [],
];

// $buffer is a BufferedAdd plugin instance obtained from the client
for ($i = 0; $i < 12000000; ++$i) {
    $data['id'] = 'test_'.$i;
    $data['ignored_cat'][] = $i;   // this field keeps growing with every document
    $buffer->createDocument($data);
}

I couldn't get it to complete even one flush cycle in a reasonable time. The size of the first buffer would already be a couple of gigabytes by the time it flushes. If something like this is at play, it's much more subtle.

@mkalkbrenner @wickedOne Do you agree that this was either caused by BufferedAdd behaviour that has since been changed or by something external to the BufferedAdd implementation entirely? My working hypothesis is that the documents themselves grew slightly larger with every buffer. If it was existing data spanning a larger period—perhaps across multiple versions of the platform that generated it—newer documents might be naturally larger because over time more attributes (more fields for Solr) and/or more accurate/elaborate content (larger fields for Solr) would be stored that was never added to existing older documents. Bulk import of that kind of data in chronological order would result in ever increasing buffer sizes and thus slower flushes. We would need sample documents and the script that was used to fill the buffer to confirm or rule out this hypothesis.

@thomascorthals
Member

@rms2219 I know it's been a while since you reported this. Are you still experiencing it with the latest Solarium version?

Could you provide sample documents evenly spread over the corpus and the script you used to add them to the buffer? If it is indeed caused by your documents slowly growing themselves, we wouldn't be able to tell from a contiguous block unless it spans multiple buffers. It would be even better if you could run a document size analysis on the full corpus yourself.
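Such a size analysis could be as simple as the following (an assumed approach, not an official tool: measure the serialized size of each document and average it per buffer-sized chunk, in the same order the corpus was indexed):

$bufferSize = 30000;
$chunkBytes = 0;
$count = 0;

foreach ($documents as $i => $fields) {   // $documents: the corpus in import order
    $chunkBytes += strlen(json_encode($fields));
    if (++$count === $bufferSize) {
        // a steadily rising average would support the "growing documents" hypothesis
        printf("chunk ending at doc %d: avg %.1f bytes/doc\n", $i + 1, $chunkBytes / $count);
        $chunkBytes = $count = 0;
    }
}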

@wickedOne
Collaborator

@thomascorthals i like the analysis you've done! i do suspect something is bleeding in the bufferedAdd plugin as i've been experiencing similar behaviour. tried to look into it a couple of times without finding a major pitfall. using a NoopEventListener makes it a bit more performant tho
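(A no-op dispatcher in this context is presumably something along these lines: a class that satisfies the dispatcher contract but never calls any listeners, so the per-document event overhead in the buffer loop disappears. The exact interface Solarium expects depends on the version, so treat this as a sketch, not the actual setup used here.)

use Symfony\Contracts\EventDispatcher\EventDispatcherInterface;

// fulfils the interface but skips all listeners
final class NoopEventDispatcher implements EventDispatcherInterface
{
    public function dispatch(object $event, ?string $eventName = null): object
    {
        return $event;   // intentionally do nothing
    }
}

// passed to the client instead of a real dispatcher (Solarium 6 style):
// $client = new Solarium\Client($adapter, new NoopEventDispatcher(), $config);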

@thomascorthals
Member

@wickedOne You wouldn't happen to have a publicly available corpus and test script that can reproduce this consistently?

There's one more thing I'm thinking that's just a wild guess at the moment. It might be something in the Solr configuration or the server it's running on; or something in the configuration of the machine Solarium is running on. If I try with the exact same script and data and can't reproduce it, we're probably looking at a cause outside Solarium. That would be a whole other rabbit hole. If I can reproduce it, we might get through the looking glass.

@wickedOne
Collaborator

it's not really publicly available, but here's the command i use to populate the cores.
https://gist.github.com/wickedOne/0821a954e14d0ba196a5134b9413ac33

as you can see i disable the debug classloader for performance reasons, just as my solarium client is equipped with no-op dispatchers. i usually run it with the default buffer size (200) as i've never seen noteworthy improvements from increasing it. one of the cores is ~28,000,000 documents and ends up taking about 45 min for a complete rebuild.

feel free to ask for more information / code as this has been a thorn in my side for quite a while 😎

@thomascorthals
Member

I have been wondering if using an SplFixedArray instead of a regular array for the buffer would do any good for this issue. Its main benefit is less memory usage, which doesn't necessarily translate into better performance.

Haven't tried it yet because it would introduce a BC break for PRE_FLUSH listeners that push additional documents to the buffer (will throw a RuntimeException) or use any of the array functions on it (will throw a TypeError).
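To illustrate the kind of breakage (a minimal sketch on PHP 8, not actual listener code):

$buffer = new \SplFixedArray(3);

try {
    // appending with [] is not supported on SplFixedArray
    $buffer[] = ['id' => 'extra_doc'];
} catch (\RuntimeException $e) {
    echo $e->getMessage(), PHP_EOL;   // index invalid or out of range
}

try {
    // array functions require a real array
    array_push($buffer, ['id' => 'extra_doc']);
} catch (\TypeError $e) {
    echo $e->getMessage(), PHP_EOL;
}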

It would also change the timing of the next flush when setting a lower buffer size mid-run. That shouldn't make a functional difference for the user though.

Not sure if this is worth pursuing as long as I can't reproduce this issue.

@wickedOne
Collaborator

willing to do a test run if you have an implementation.
as stated in #971 i no longer have access to that humongous core, but i also don't use the event dispatcher so the bc breaks you're considering would not be an issue

@thomascorthals
Member

I have a working implementation in thomascorthals/solarium:bufferedadd. Didn't open a PR yet because my benchmarks with examples/7.5.3-plugin-bufferedupdate-benchmarks.php are inconclusive.

Averaged over 3 runs of both the current master and my bufferedadd branch on my laptop, I get:

array

@@                  bufferedadd / buffereddelete                  @@
## add buffer size | add time  | delete buffer size | delete time ##
====================================================================
-              600 | 134155 ms |               2400 | 25412 ms      
-             6000 | 120299 ms |              24000 | 20750 ms      
-            60000 | 117067 ms |             240000 | 19501 ms      
+           600000 | 118768 ms |            1200000 | 20604 ms      

@@              bufferedaddlite / buffereddeletelite              @@
## add buffer size | add time  | delete buffer size | delete time ##
====================================================================
+              600 | 124654 ms |               2400 | 20753 ms      
+             6000 | 110800 ms |              24000 | 16747 ms      
-            60000 | 113540 ms |             240000 | 15994 ms      
+           600000 | 111846 ms |            1200000 | 16711 ms      

@@                        Peak memory usage                       @@
##          emalloc()          |               real               ##
====================================================================
+       1106894056 bytes       |         1113587712 bytes           

SplFixedArray

@@                  bufferedadd / buffereddelete                  @@
## add buffer size | add time  | delete buffer size | delete time ##
====================================================================
+              600 | 132067 ms |               2400 | 23412 ms      
+             6000 | 116605 ms |              24000 | 14340 ms      
+            60000 | 113163 ms |             240000 | 20014 ms      
-           600000 | 121793 ms |            1200000 | 20846 ms      

@@              bufferedaddlite / buffereddeletelite              @@
## add buffer size | add time  | delete buffer size | delete time ##
====================================================================
-              600 | 139566 ms |               2400 | 21175 ms      
-             6000 | 113731 ms |              24000 | 17309 ms      
+            60000 | 112279 ms |             240000 | 16913 ms      
-           600000 | 118047 ms |            1200000 | 17586 ms      

@@                        Peak memory usage                       @@
##          emalloc()          |               real               ##
====================================================================
-       1205461128 bytes       |         1226833920 bytes           

It seems that SplFixedArray generally leads to better execution times for the "classic" BufferedAdd with event dispatching, but worse for BufferedAddLite. I have no idea why that is. It even makes the lite version slower than the classic at the smallest buffer size. I've marked the fastest add time with + (rendered green) and the slowest with - (rendered red), compared between array and SplFixedArray. It's the reverse for deletes in some cases, but since deletes are much faster anyway, outside influences on my laptop skew them more.

The reported peak memory usage is exactly the same for all 3 runs with the same implementation. While SplFixedArray should use less memory than a regular array, I'm not surprised at the higher peak. The buffer does have to be converted to an array to hand it off to the update query in the end.

The (array) cast that's needed anyway means I didn't run into a BC break for PRE_FLUSH after all.

The BC change when lowering the buffer size mid-run turns out to be a bugfix, actually. Even if we forego SplFixedArray, I'm going to keep that in. It's an edge case, but the very minimal cost for handling it properly only happens once upfront in regular usage.

I'm inclined to vote against this change: the performance benefits aren't unequivocal, it increases memory usage by 9–10% (with the test data in the benchmark, at least), and it uses a more obscure PHP feature that is less obvious for Solarium developers to work with. If someone can prove that it solves the flush cycle time issue under all circumstances, we could still opt for that stability at the expense of performance in some circumstances and memory usage in probably all circumstances.

@wickedOne
Collaborator

thanks for the implementation and extensive analysis!
will dedicate some time tomorrow to see whether i come to another conclusion on my side

@thomascorthals
Member

Time to revisit this once again.

#1037 added support for JSON requests to BufferedAdd. It's not the default (yet) so you have to set it explicitly for now. The benchmarks show a significant improvement in performance over XML. This is what I expected: there's less string manipulation in the request builder (still need to concatenate because Solr JSON uses non-unique keys and you can't json_encode() that from a PHP data structure), no recursive calls (that's handled internally by json_encode()) and the actual request over the wire has a smaller footprint.
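For anyone who wants to try it, opting in should look roughly like this (the setter and constant names may differ per version; this is a sketch based on the update query API, see the plugin documentation for the exact call):

use Solarium\QueryType\Update\Query\Query as UpdateQuery;

$buffer = $client->getPlugin('bufferedadd');
// opt in to JSON formatted update requests; XML is still the default
$buffer->setRequestFormat(UpdateQuery::REQUEST_FORMAT_JSON);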

It still annoys me that I could never reproduce the issue to begin with. If the recursive calls in our code caused it, it will be gone completely with JSON requests. If it has something to do with the string manipulation, the effect could be much smaller. If it's something on the Solr side after all, it might be XML specific. But there is no way to tell for me. @wickedOne? 😉

@thomascorthals
Member

I've also noticed that PHP 8.2 performs noticeably faster than previous versions in the benchmarks.
