[stdlib] Fix String.split() implementations #3528
base: nightly
Signed-off-by: martinvuyk <[email protected]>
Great job. I've just added a nit-pick suggestion.
Also, is it possible to add a unit test?
Co-authored-by: Manuel Saelices <[email protected]> Signed-off-by: martinvuyk <[email protected]>
Hi, thanks for the review. Do you have any type of test in mind that the split test cases don't cover?
I thought this check would be broken in
❯ git diff
diff --git a/stdlib/test/collections/test_string.mojo b/stdlib/test/collections/test_string.mojo
index a664d321..b7a85c6c 100644
--- a/stdlib/test/collections/test_string.mojo
+++ b/stdlib/test/collections/test_string.mojo
@@ -824,6 +824,11 @@ def test_split():
     assert_equal(res6[2], "долор")
     assert_equal(res6[3], "сит")
     assert_equal(res6[4], "амет")
+    var res7 = in6.split("м")
+    assert_equal(res7[0], "Лоре")
+    assert_equal(res7[1], " ипсу")
+    assert_equal(res7[2], " долор сит а")
+    assert_equal(res7[3], "ет")
BTW, I still think it's a good test to add.
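The expectation in the diff above can be cross-checked outside Mojo. The sketch below uses Python's `str.split` as a stand-in, assuming `in6` holds the text "Лорем ипсум долор сит амет" (reconstructed from the assertions, since the diff does not show its definition). It also splits the raw UTF-8 bytes to illustrate that a two-byte separator such as "м" only ever matches on character boundaries.

```python
# Assumed value of `in6`, reconstructed from the assertions in the diff.
text = "Лорем ипсум долор сит амет"
sep = "м"  # encodes to two bytes in UTF-8

# Split on code points (Python's native str behavior).
parts = text.split(sep)

# Split on the raw UTF-8 bytes, then decode each piece back to str.
byte_parts = [
    piece.decode("utf-8")
    for piece in text.encode("utf-8").split(sep.encode("utf-8"))
]

print(parts)
print(parts == byte_parts)
```

Both approaches agree here because UTF-8 is self-synchronizing: the byte sequence of one character never appears inside or across the byte sequences of other characters.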
I'm not understanding, so the lines from |
It's just a diff, in case you want to complete it with more tests. LGTM anyway, so don't worry. Thanks for the contribution 🥇
…o into fix-split-implementations
@JoeLoser I managed to unify what was |
This is a huge amount of work - nice job @martinvuyk! The benchmark improvements look great.
I left a few comments/questions (mostly non-blocking). Would love to be able to land this big PR soon and we can also address some of the things as follow-ups to avoid PR churn/merge conflicts as I know you are working on other things in this area.
!sync |
FYI this is crashing several things internally in the parser with mutability, so we'll have to dig into that next week sometime. 😞 |
Ok 😞 |
Main issue
Fix String.split() implementations to use a generic implementation that does not assume indexing is by byte offset. Added all methods to StringLiteral and StringSlice. Some important optimizations were added by parametrizing and avoiding slicing with numeric tricks.
What this PR introduces
A trait for working with String, StringLiteral, and StringSlice in a generic manner. This will be used for this PR and many others.
Changes in behavior
This PR changes split("") behavior to be non-raising and to return the separated unicode characters, analogous to when the whole string has the separator at start, end, and in between every character. Closes #3635
StringSlice.split() now returns a List[StringSlice] of immutable origin.
Benchmark results:
CPU: Intel® Core™ i7-7700HQ
improvement metric: markdown percentage improvement ((old_value - new_value) / old_value)
Average improvement for split with a sequence: 44%. As a speedup factor, this is about 1.8x
Average improvement for split on any whitespace: 80%. As a speedup factor, this is 5x
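The relation between the percentage metric and the quoted speedup factors can be checked with a few lines of arithmetic. This is a minimal sketch in Python; the function names `improvement` and `speedup` are illustrative and not part of the PR or its benchmark tooling.

```python
def improvement(old_value: float, new_value: float) -> float:
    """Fractional improvement as defined above: (old - new) / old."""
    return (old_value - new_value) / old_value


def speedup(improvement_fraction: float) -> float:
    """Speedup factor old/new implied by a fractional improvement."""
    return 1.0 / (1.0 - improvement_fraction)


print(round(speedup(0.44), 2))  # 1.79
print(round(speedup(0.80), 2))  # 5.0
```

Since new = old * (1 - improvement), the ratio old/new is 1 / (1 - improvement), which gives roughly 1.8x for 44% and 5x for 80%. Applying this to an average of per-benchmark percentages is an approximation rather than an exact aggregate speedup.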
Benchmarks: bench_string_split[N] and bench_string_split_none[N] for N = 10, 30, 50, 100, 1000, 10000, 100000, 1000000.
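For the split("") change listed under "Changes in behavior", the description can be read as placing the empty separator before, between, and after every character. The Python model below sketches that reading. `split_model` is a hypothetical helper written for illustration and is not the Mojo implementation, and the exact output is an assumption drawn from the wording of the description.

```python
def split_model(text: str, sep: str) -> list[str]:
    """Hypothetical model of the described behavior, not the Mojo code."""
    if sep == "":
        # Treat the empty separator as sitting before, between, and after
        # every code point, which yields empty strings at both ends.
        return ["", *text, ""]
    return text.split(sep)


print(split_model("abc", ""))  # ['', 'a', 'b', 'c', '']
print(",a,b,c,".split(","))    # ['', 'a', 'b', 'c', '']
```

The second call shows the analogy from the description: a string with a real separator at the start, at the end, and between every character produces the same list.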
Giving some context to the numbers
At an average 5 letters per word and 300 words per page (in the English language):