[stdlib] Fix String.split() implementations #3528

Draft: martinvuyk wants to merge 115 commits into base: nightly

@martinvuyk (Contributor) commented Sep 22, 2024

Main issue

Fix the String.split() implementations so that they share one generic implementation that does not assume indexing is by byte offset. All methods were also added to StringLiteral and StringSlice. Some important optimizations were added by parametrizing and by using numeric tricks to avoid slicing.
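
To illustrate why the byte-offset assumption matters, here is a minimal sketch in Python (used only for illustration, it is not the Mojo code in this PR). The sample string and variable names are assumptions chosen for the example. It shows that the same substring has different offsets depending on whether you count Unicode code points or UTF-8 bytes, and that mixing the two units gives a wrong slice.

```python
# Illustrative only: Python stands in for Mojo here.
text = "Лорем ипсум"
needle = "ипсум"

# Offset counted in Unicode code points (what Python's str.find reports).
codepoint_offset = text.find(needle)

# Offset counted in UTF-8 bytes (what a byte-level search reports).
byte_offset = text.encode("utf-8").find(needle.encode("utf-8"))

print(codepoint_offset)  # 6
print(byte_offset)       # 11

# Using each offset with the matching unit works.
tail_by_codepoint = text[codepoint_offset:]
tail_by_byte = text.encode("utf-8")[byte_offset:].decode("utf-8")

# Using a byte offset to index by code point does not.
tail_mixed_units = text[byte_offset:]

print(tail_by_codepoint)        # ипсум
print(tail_by_byte)             # ипсум
print(repr(tail_mixed_units))   # ''
```

Each Cyrillic letter takes two bytes in UTF-8, so the two offsets drift apart as soon as a non-ASCII character precedes the match.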

What this PR introduces

Trait for working with String, StringLiteral, and StringSlice in a generic manner. This will be used for this PR and many others.

trait Stringlike(AsBytes):
    """Trait intended to be used only with `String`, `StringLiteral` and
    `StringSlice`."""

    fn find[T: Stringlike, //](self, substr: T, start: Int = 0) -> Int:
        """Finds the offset of the first occurrence of `substr` starting at
        `start`. If not found, returns -1.

        Parameters:
            T: The type of the substring.

        Args:
            substr: The substring to find.
            start: The offset from which to find.

        Returns:
            The offset of `substr` relative to the beginning of the string.
        """
        ...

    fn __iter__[
        is_mutable: Bool, origin: Origin[is_mutable].type
    ](self) -> _StringSliceIter[origin]:
        """Return an iterator over the string.

        Parameters:
            is_mutable: Whether the result will be mutable.
            origin: The origin of the data.

        Returns:
            An iterator over the string.
        """
        ...

Changes in behavior

This PR changes split("") so that it no longer raises. It now returns the individual Unicode characters, analogous to splitting a string that has the separator at the start, at the end, and between every character. Closes #3635
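
The analogy can be sketched in Python (illustration only; Python's own str.split("") raises ValueError, so this emulates the described behavior rather than calling it). Both helper names are hypothetical, and the sketch assumes the analogy is taken literally, so the result carries an empty leading piece and an empty trailing piece.

```python
def split_on_empty(text: str) -> list:
    """Hypothetical helper emulating the described split("") result."""
    # One piece per Unicode code point, framed by the empty pieces that a
    # separator at the very start and very end would produce.
    return [""] + list(text) + [""]


def split_with_explicit_separators(text: str, sep: str = "|") -> list:
    """Build the analogous string with `sep` at start, end, and between
    every character, then split on it. Assumes `sep` is not in `text`."""
    padded = sep + sep.join(text) + sep
    return padded.split(sep)


print(split_on_empty("abc"))                  # ['', 'a', 'b', 'c', '']
print(split_with_explicit_separators("abc"))  # ['', 'a', 'b', 'c', '']
print(split_on_empty("сит"))                  # ['', 'с', 'и', 'т', '']
```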

StringSlice.split() now returns a List[StringSlice] of immutable origin.

Benchmark results:

CPU: Intel® Core™ i7-7700HQ

Improvement metric: percentage reduction in runtime, (old_value - new_value) / old_value

Average improvement for split with a sequence: 44%. As a speedup factor (old_value / new_value), this is about 1.8x.
Average improvement for split on any whitespace: 80%. As a speedup factor, this is about 5x.

| Name | old_value (ms) | new_value (ms) | improvement |
| --- | --- | --- | --- |
| bench_string_split[10] | 0.000217157538201234 | 0.000120441848100144 | 44.54% |
| bench_string_split_none[10] | 0.00141580388409579 | 0.000165890426069714 | 88.28% |
| bench_string_split[30] | 0.000325987481757257 | 0.00015981611103423 | 50.97% |
| bench_string_split_none[30] | 0.00444625867755221 | 0.000352369613077174 | 92.07% |
| bench_string_split[50] | 0.000362292715460147 | 0.000174240738534596 | 51.91% |
| bench_string_split_none[50] | 0.00771124915361684 | 0.000463101261109653 | 93.99% |
| bench_string_split[100] | 0.000460465531865409 | 0.000235624846782933 | 48.83% |
| bench_string_split_none[100] | 0.0152726975133393 | 0.000855421818320383 | 94.40% |
| bench_string_split[1000] | 0.00304171546804573 | 0.00171716390636075 | 43.55% |
| bench_string_split_none[1000] | 0.16922413574791 | 0.0131455816333713 | 92.23% |
| bench_string_split[10000] | 0.0339078229634652 | 0.0197144336973933 | 41.86% |
| bench_string_split_none[10000] | 2.33214042304166 | 0.524931019971969 | 77.49% |
| bench_string_split[100000] | 0.346620968649082 | 0.205336961853133 | 40.76% |
| bench_string_split_none[100000] | 109.497963740448 | 53.6761552659305 | 50.98% |
| bench_string_split[1000000] | 3.3621430168819 | 2.38782905801937 | 28.98% |
| bench_string_split_none[1000000] | 11085.2204129 | 5511.5961835 | 50.28% |
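
The summary figures can be reproduced from the benchmark rows with a short Python sketch (illustration only). The function names are assumptions for this example. The percentages are copied from the results above, and the speedup factor is old_value / new_value, which equals 1 / (1 - improvement).

```python
def improvement(old_value: float, new_value: float) -> float:
    """Fractional runtime reduction: (old_value - new_value) / old_value."""
    return (old_value - new_value) / old_value


def speedup(fraction: float) -> float:
    """Convert a fractional runtime reduction into an old/new speedup factor."""
    return 1.0 / (1.0 - fraction)


# Improvement column from the results, in percent.
split_pct = [44.54, 50.97, 51.91, 48.83, 43.55, 41.86, 40.76, 28.98]
split_none_pct = [88.28, 92.07, 93.99, 94.40, 92.23, 77.49, 50.98, 50.28]

avg_split = sum(split_pct) / len(split_pct) / 100
avg_split_none = sum(split_none_pct) / len(split_none_pct) / 100

print(f"{avg_split:.0%} -> {speedup(avg_split):.1f}x")            # 44% -> 1.8x
print(f"{avg_split_none:.0%} -> {speedup(avg_split_none):.1f}x")  # 80% -> 5.0x

# Spot check of the first row, bench_string_split[10].
first_row = improvement(0.000217157538201234, 0.000120441848100144)
print(f"{first_row:.2%}")  # 44.54%
```

Note that the factor is derived from the averaged percentage, which is not the same as averaging the per-row speedup factors.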

Giving some context to the numbers

At an average 5 letters per word and 300 words per page (in the English language):

  • 10: 2 words
  • 30: 6 words
  • 50: 10 words
  • 100: 20 words
  • 1000: ~ 2/3 page (200 words)
  • 10_000: ~ 7 pages (2k words)
  • 100_000: ~ 67 pages (20k words)
  • 1_000_000: ~ 667 pages (200k words)
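
The conversion behind this list is simple arithmetic. It is sketched here in Python (illustration only), using the stated 5 letters per word and 300 words per page. The constant and function names are assumptions for this example.

```python
LETTERS_PER_WORD = 5
WORDS_PER_PAGE = 300


def words_and_pages(size: int) -> tuple:
    """Return (words, pages) for a benchmark input of `size` characters."""
    words = size // LETTERS_PER_WORD
    pages = words / WORDS_PER_PAGE
    return words, pages


for size in (10, 30, 50, 100, 1_000, 10_000, 100_000, 1_000_000):
    words, pages = words_and_pages(size)
    print(f"{size:>9}: {words:>7} words, {pages:8.2f} pages")
```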

@martinvuyk martinvuyk requested a review from a team as a code owner September 22, 2024 23:17
Signed-off-by: martinvuyk <[email protected]>
@martinvuyk martinvuyk changed the title [stdlib] Fix split implementations [stdlib] Fix String.split() implementations Sep 23, 2024
@msaelices (Contributor) left a comment

Great job. I've just added a nitpick suggestion.

Also, is it possible to add a unit test?

Co-authored-by: Manuel Saelices <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
@martinvuyk (Contributor, Author) commented

Hi, thanks for the review. Is there any type of test in mind that the split test cases don't cover?

@msaelices (Contributor) commented Sep 23, 2024

I thought this check would be broken in nightly but I am glad that it's not:

❯ git diff
diff --git a/stdlib/test/collections/test_string.mojo b/stdlib/test/collections/test_string.mojo
index a664d321..b7a85c6c 100644
--- a/stdlib/test/collections/test_string.mojo
+++ b/stdlib/test/collections/test_string.mojo
@@ -824,6 +824,11 @@ def test_split():
     assert_equal(res6[2], "долор")
     assert_equal(res6[3], "сит")
     assert_equal(res6[4], "амет")
+    var res7 = in6.split("м")
+    assert_equal(res7[0], "Лоре")
+    assert_equal(res7[1], " ипсу")
+    assert_equal(res7[2], " долор сит а")
+    assert_equal(res7[3], "ет")

BTW, I still think it's a good test to add.
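
For reference, the expectation encoded in that diff can be mirrored in Python (illustration only, not Mojo). The input string is an assumption: it is reconstructed by joining the expected pieces with the separator, since the definition of in6 is not shown in the diff. The sketch also checks that a byte-level split on the UTF-8 encoding of a multi-byte separator yields the same pieces.

```python
# Expected pieces copied from the proposed test.
expected = ["Лоре", " ипсу", " долор сит а", "ет"]
separator = "м"

# Assumption: in6 equals the pieces joined back with the separator.
text = separator.join(expected)
print(text)  # Лорем ипсум долор сит амет

# Split by code points (Python str semantics).
parts = text.split(separator)

# Split at the byte level, then decode each piece.
byte_parts = text.encode("utf-8").split(separator.encode("utf-8"))
decoded_parts = [piece.decode("utf-8") for piece in byte_parts]

print(parts == expected)          # True
print(decoded_parts == expected)  # True
```

The byte-level split stays correct here because UTF-8 is self-synchronizing: the two-byte encoding of the separator cannot match across the boundary of two other characters.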

@martinvuyk (Contributor, Author) commented

I don't quite understand: were the lines from var res7 ... onwards not merged in another PR, and would you like me to add them here?

@msaelices (Contributor) commented

It's just a diff in case you want to extend it with more tests. LGTM anyway, so don't worry. Thanks for the contribution 🥇

@martinvuyk (Contributor, Author) commented

@JoeLoser I managed to unify what were as_bytes_read() and as_bytes_write() into a single parametrized as_bytes() that also works for StringSlice, and I updated the PR description. This will allow some very cool generic algorithms on any type that implements the trait :)

@JoeLoser (Collaborator) left a comment

This is a huge amount of work - nice job @martinvuyk! The benchmark improvements look great.

I left a few comments/questions (mostly non-blocking). Would love to be able to land this big PR soon and we can also address some of the things as follow-ups to avoid PR churn/merge conflicts as I know you are working on other things in this area.

@JoeLoser (Collaborator) commented Nov 1, 2024

!sync

Signed-off-by: martinvuyk <[email protected]>
@JoeLoser (Collaborator) commented Nov 1, 2024

FYI this is crashing several things internally in the parser with mutability, so we'll have to dig into that next week sometime. 😞

@martinvuyk (Contributor, Author) commented

Ok 😞

@martinvuyk martinvuyk marked this pull request as draft November 12, 2024 18:50
Labels
imported-internally Signals that a given pull request has been imported internally.
5 participants