Consider using mypyc to speed things up like black does #159

Open
Bobronium opened this issue May 14, 2022 · 3 comments
@Bobronium (Contributor) commented May 14, 2022

Black is significantly faster than usort, especially when it produces no changes.
I believe such a performance difference may be due to black using mypyc: https://github.com/ichard26/black-mypyc-wheels
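
For reference, one rough way to check whether the installed black is running from a mypyc-compiled extension (this is just a heuristic for illustration, not an official check):

import black

# mypyc-compiled wheels ship black's modules as C extensions, so the module
# file ends in .so/.pyd rather than .py when a compiled wheel is installed.
print(black.__file__)
print("compiled" if not black.__file__.endswith(".py") else "pure Python")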

black is 4.25 times faster on check:

curl -s https://raw.githubusercontent.com/facebookexperimental/usort/main/usort/api.py -o example.py

time black --check example.py

# All done! ✨ 🍰 ✨
# 1 file would be left unchanged.
# black --check example.py  0.08s user 0.02s system 75% cpu 0.137 total

time usort check example.py            
# usort check example.py  0.34s user 0.04s system 86% cpu 0.434 total

black is 2 times faster when making changes:

curl -s https://raw.githubusercontent.com/python/cpython/main/Lib/asyncio/futures.py -o example.py  
                                                                                                                          
time black example.py                                                                             
# reformatted example.py

# All done! ✨ 🍰 ✨
# 1 file reformatted.
# black example.py  0.15s user 0.02s system 75% cpu 0.229 total
                                                                                                                          
time usort format example.py                                                                      
# Sorted example.py
# usort format example.py  0.40s user 0.04s system 93% cpu 0.469 total
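
Since single runs of time on short-lived CLIs are noisy, a small averaging harness along these lines can be used as a sanity check (a sketch; it assumes black and usort are on PATH and example.py exists as above, and it still includes interpreter startup, just like time does):

from __future__ import annotations

import subprocess
import time

def avg_runtime(cmd: list[str], runs: int = 5) -> float:
    # Run the command several times and average the wall-clock duration.
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, capture_output=True, check=False)
        total += time.perf_counter() - start
    return total / runs

for cmd in (["black", "--check", "example.py"], ["usort", "check", "example.py"]):
    print(cmd[0], f"{avg_runtime(cmd):.3f}s average")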

However, looking at a profiler run, I see that 3/5 of the time is taken by trailrunner (and nearly 90% of that time is spent in lock.acquire), even though usort was given only one file path:

pyinstrument --show-all -m usort format example.py

_     ._   __/__   _ _  _  _ _/_   Recorded: 05:51:13  Samples:  167
/_//_/// /_\ / //_// / //_'/ //     Duration: 0.405     CPU time: 0.153
/   _/                      v4.1.1

Program: usort format example.py

0.404 <module>  <string>:1
└─ 0.404 run_module  runpy.py:199
 ├─ 0.266 _run_module_code  runpy.py:89
 │  └─ 0.266 _run_code  runpy.py:63
 │     └─ 0.266 <module>  usort/__main__.py:1
 │        ├─ 0.256 __call__  click/core.py:1128
 │        │  └─ 0.256 main  click/core.py:987
 │        │     └─ 0.256 invoke  click/core.py:1623
 │        │        └─ 0.256 invoke  click/core.py:1393
 │        │           └─ 0.256 invoke  click/core.py:709
 │        │              └─ 0.256 wrapper  usort/cli.py:37
 │        │                 └─ 0.256 format  usort/cli.py:177
 │        │                    └─ 0.255 usort_path  usort/api.py:120
 │        │                       ├─ 0.242 run  trailrunner/core.py:226
 │        │                       │  └─ 0.242 run  trailrunner/core.py:171
 │        │                       │     ├─ 0.216 _chain_from_iterable_of_lists  concurrent/futures/process.py:561
 │        │                       │     │  └─ 0.216 result_iterator  concurrent/futures/_base.py:601
 │        │                       │     │     └─ 0.216 result  concurrent/futures/_base.py:417
 │        │                       │     │        └─ 0.216 wait  threading.py:288
 │        │                       │     │           └─ 0.216 lock.acquire  <built-in>:0
 │        │                       │     ├─ 0.019 __exit__  concurrent/futures/_base.py:635
 │        │                       │     │  └─ 0.019 shutdown  concurrent/futures/process.py:739
 │        │                       │     │     └─ 0.019 join  threading.py:1057
 │        │                       │     │        └─ 0.019 _wait_for_tstate_lock  threading.py:1095
 │        │                       │     │           └─ 0.019 lock.acquire  <built-in>:0
 │        │                       │     └─ 0.004 _default_executor  trailrunner/core.py:128
 │        │                       │        └─ 0.004 __init__  concurrent/futures/process.py:581
 │        │                       └─ 0.012 walk  trailrunner/core.py:210
 │        │                          └─ 0.012 walk  trailrunner/core.py:136
 │        │                             └─ 0.012 gitignore  trailrunner/core.py:67
 │        │                                └─ 0.011 pathspec  trailrunner/core.py:54
 │        │                                   └─ 0.011 from_lines  pathspec/pathspec.py:100
 │        │                                      └─ 0.011 <listcomp>  pathspec/pathspec.py:126
 │        │                                         └─ 0.011 __init__  pathspec/pattern.py:71
 │        │                                            └─ 0.009 compile  re.py:249
 │        │                                               └─ 0.009 _compile  re.py:288
 │        │                                                  └─ 0.008 compile  sre_compile.py:759
 │        │                                                     └─ 0.006 parse  sre_parse.py:937
 │        │                                                        └─ 0.006 _parse_sub  sre_parse.py:435
 │        │                                                           └─ 0.006 _parse  sre_parse.py:493
 │        └─ 0.010 <module>  usort/cli.py:1
 │           └─ 0.007 <module>  click/__init__.py:1
 │              └─ 0.007 <module>  click/core.py:1
 └─ 0.138 _get_module_details  runpy.py:103
    └─ 0.138 _get_module_details  runpy.py:103
       └─ 0.138 <module>  usort/__init__.py:1
          └─ 0.138 <module>  usort/api.py:1
             ├─ 0.114 <module>  usort/sorting.py:1
             │  ├─ 0.101 <module>  libcst/__init__.py:1
             │  │  ├─ 0.038 <module>  libcst/_nodes/expression.py:1
             │  │  │  ├─ 0.017 wrap  dataclasses.py:1175
             │  │  │  │  └─ 0.017 _process_class  dataclasses.py:882
             │  │  │  │     ├─ 0.006 _frozen_get_del_attr  dataclasses.py:599
             │  │  │  │     │  └─ 0.006 _create_fn  dataclasses.py:412
             │  │  │  │     ├─ 0.004 _init_fn  dataclasses.py:529
             │  │  │  │     │  └─ 0.004 _create_fn  dataclasses.py:412
             │  │  │  │     └─ 0.004 _cmp_fn  dataclasses.py:624
             │  │  │  │        └─ 0.004 _create_fn  dataclasses.py:412
             │  │  │  └─ 0.016 <module>  libcst/_nodes/op.py:1
             │  │  │     └─ 0.011 wrap  dataclasses.py:1175
             │  │  │        └─ 0.010 _process_class  dataclasses.py:882
             │  │  ├─ 0.021 <module>  libcst/_nodes/module.py:1
             │  │  │  └─ 0.020 <module>  libcst/_nodes/statement.py:1
             │  │  │     └─ 0.016 wrap  dataclasses.py:1175
             │  │  │        └─ 0.016 _process_class  dataclasses.py:882
             │  │  │           ├─ 0.005 _init_fn  dataclasses.py:529
             │  │  │           │  └─ 0.004 _create_fn  dataclasses.py:412
             │  │  │           └─ 0.004 _cmp_fn  dataclasses.py:624
             │  │  │              └─ 0.004 _create_fn  dataclasses.py:412
             │  │  ├─ 0.014 <module>  libcst/_parser/entrypoints.py:1
             │  │  │  └─ 0.010 <module>  libcst/_parser/grammar.py:1
             │  │  │     └─ 0.008 <module>  libcst/_parser/conversions/expression.py:1
             │  │  │        └─ 0.006 <module>  libcst/_parser/types/partials.py:1
             │  │  │           └─ 0.006 wrap  dataclasses.py:1175
             │  │  │              └─ 0.006 _process_class  dataclasses.py:882
             │  │  ├─ 0.008 <module>  libcst/metadata/__init__.py:1
             │  │  ├─ 0.007 <module>  libcst/_exceptions.py:1
             │  │  │  └─ 0.004 <module>  libcst/_parser/parso/pgen2/generator.py:1
             │  │  │     └─ 0.004 <module>  libcst/_parser/parso/pgen2/grammar_parser.py:1
             │  │  └─ 0.005 <module>  libcst/_batched_visitor.py:1
             │  └─ 0.012 <module>  usort/translate.py:1
             │     └─ 0.012 <module>  usort/types.py:1
             │        └─ 0.009 <module>  attr/__init__.py:1
             │           └─ 0.005 <module>  attr/converters.py:1
             ├─ 0.016 <module>  trailrunner/__init__.py:1
             │  └─ 0.016 <module>  trailrunner/core.py:1
             │     ├─ 0.005 <module>  multiprocessing/__init__.py:1
             │     │  └─ 0.005 <module>  multiprocessing/context.py:1
             │     │     └─ 0.004 <module>  multiprocessing/reduction.py:1
             │     └─ 0.004 <module>  concurrent/futures/__init__.py:1
             └─ 0.005 <module>  usort/config.py:1

So clearly, there's a lot of room for optimizations even before compiling code with mypyc :)

@amyreese (Member) commented:

Perf

Black is partly faster because its CST implementation (a fork of lib2to3) isn't doing as much as LibCST does when parsing, validating, and transforming the tree (e.g., it basically doesn't do any validation). There are ongoing efforts to improve the performance of LibCST, including an experimental PEG parser written in Rust, which can be enabled by setting LIBCST_PARSER_TYPE=native in your environment.
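
For example, something along these lines should exercise the native parser (a sketch; in practice you would set the variable in your shell environment when invoking µsort, and the native parser is still experimental):

import os

# Opt into LibCST's experimental Rust parser; the variable is set before
# importing libcst here purely for illustration.
os.environ["LIBCST_PARSER_TYPE"] = "native"

import libcst as cst

module = cst.parse_module("import b\nimport a\n")
print(module.code)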

The other part is that black doesn't need to materialize the tree if it hasn't modified it, while µsort will materialize the updated tree every time. If black hasn't changed the tree, it just returns the original bytes, so the happy case is that much faster.

µsort could probably be better about this, but it would require extra bookkeeping during sorting to determine whether anything has changed in any of the blocks of imports. It's not impossible, but it's probably not a quick change, and I would expect it to require more memory and CPU time to track/verify, especially for large files. I would be happy to answer questions or offer guidance if you are interested in making that happen.
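
To illustrate the shape of that optimization (not µsort's actual code; the change check here uses a full tree comparison purely for clarity, which would itself cost time on large files, hence the bookkeeping mentioned above):

import libcst as cst

def sort_source(original: str) -> str:
    tree = cst.parse_module(original)
    new_tree = apply_sorting(tree)      # stand-in for the actual import sorting
    if new_tree.deep_equals(tree):
        return original                 # happy path: skip materializing the tree
    return new_tree.code                # only render code when something changed

def apply_sorting(tree: cst.Module) -> cst.Module:
    # placeholder identity transform, so the happy path above is taken
    return tree

print(sort_source("import a\nimport b\n"), end="")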

Profiling

It's worth keeping in mind that µsort—like black—uses multiprocessing when formatting. black uses its own code wrapping the multiprocessing module, but µsort uses trailrunner, which handles the process of walking and filtering any paths given, and running the appropriate sorting functions via multiprocessing.

The reason you see 60%+ of the time spent in trailrunner is that trailrunner is spawning the child process and waiting for it to finish sorting the given paths. This is to be expected, and it's not wasted time; that's the time spent actually sorting the requested files.
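
Conceptually, the parent process is doing something like this while the children do the sorting (an illustration using concurrent.futures directly, not trailrunner's actual API):

from __future__ import annotations

from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def sort_one(path: Path) -> str:
    # stand-in for the per-file parse/sort work done in a child process
    return f"sorted {path}"

def run(paths: list[Path]) -> None:
    with ProcessPoolExecutor() as pool:
        # the parent mostly blocks here waiting on results, which is the
        # lock.acquire time visible in the profile above
        for message in pool.map(sort_one, paths):
            print(message)

if __name__ == "__main__":
    run(sorted(Path(".").rglob("*.py")))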

If you are interested in a detailed breakdown of time spent in the various pieces, please run µsort with the --benchmark option:

(.venv) jreese@butterfree ~/workspace/usort main  » usort --benchmark check usort/tests
parsing usort/tests/translate.py:                    47452 µs
sorting usort/tests/translate.py:                    23267 µs
parsing usort/tests/config.py:                       78881 µs
sorting usort/tests/config.py:                       19024 µs
parsing usort/tests/util.py:                         43918 µs
sorting usort/tests/util.py:                         12205 µs
parsing usort/tests/__init__.py:                     45413 µs
sorting usort/tests/__init__.py:                      1630 µs
parsing usort/tests/types.py:                        65670 µs
sorting usort/tests/types.py:                        14518 µs
parsing usort/tests/functional.py:                   85415 µs
sorting usort/tests/functional.py:                   18093 µs
parsing usort/tests/cli.py:                          83623 µs
sorting usort/tests/cli.py:                          17809 µs
parsing usort/tests/stdlibs.py:                      22188 µs
sorting usort/tests/stdlibs.py:                       1845 µs
parsing usort/tests/__main__.py:                      1785 µs
sorting usort/tests/__main__.py:                      1038 µs
walking usort/tests:                                  9186 µs
total for usort/tests:                              619267 µs

In actuality, we are only spending a few milliseconds in trailrunner when walking paths; the rest of the time is spent in child processes doing the actual sorting. The fewer files you are asking to sort, the more the overhead of spawning the child process(es) will affect your total runtime.

Another place where µsort could be better is in handling multiple path arguments. Currently, we walk and sort the given paths in sequence, rather than walking/gathering all of them up front. This means something like usort format some/path/*.py will end up sorting in sequence rather than in parallel. This is another case where a PR would be gladly accepted, but the workaround is to just pass the root directory rather than a shell glob, and use configurable excludes if necessary.

amyreese self-assigned this May 17, 2022
@amyreese (Member) commented:

To be clear, we are not dismissing the idea or benefit of projects like mypyc, but I wanted to at least discuss the reasons why there is a performance difference between black and µsort, and where trailrunner fits into µsort.

I believe @zsol has been investigating the possibility of mypyc for LibCST, and as the majority of time running µsort is spent in LibCST, that is probably the best place to focus efforts on improving the runtime speed of µsort.

@Bobronium (Contributor, Author) commented:

Thank you for the detailed response!

@zsol, is there an open issue discussing the use of mypyc with LibCST, or any related issue where we can continue this discussion?
