Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize calc_wsratio_v_wd function #24

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

izofat
Copy link
Contributor

@izofat izofat commented Sep 6, 2024

@aclerc
This is not ready for merge.

The calc_wsratio_v_wd function is faster, but expected_df and actual_df don’t match. It looks like the issue might be that when subsector_df is empty, direction and hour data aren’t being added. This commit include them right now .

detrend_df["circ_diff_to_d"] = circ_diff(detrend_df[ref_wd_col], d)
detrend_df["within_dir_bin"] = detrend_df["circ_diff_to_d"].abs() < dir_bin_width / 2
subsector_df = detrend_df[detrend_df["within_dir_bin"]].copy()
if len(subsector_df) > 0:

Can you check the commit for any other mistake.

@aclerc aclerc marked this pull request as draft September 6, 2024 14:12
@aclerc
Copy link
Contributor

aclerc commented Sep 6, 2024

Hi @izofat no worries, since you say it is not ready to merge I have converted it to draft. You can undo that when it is ready. I had a quick look at your changes and I do not see the reason for the test failure at first glance. The ruff format check is not passing so I would fix ruff issues next so that the test result is displayed in the github action. Seeing the test results will make it easier to tell what's wrong with the logic. Also the PR says your branch is out of date with main so I recommend updating.

@izofat
Copy link
Contributor Author

izofat commented Sep 8, 2024

Hi,@aclerc
I've made some changes and I'm wondering about coverage reports. There are differences between the coverage reports of the updated code and the original code, including test counts, which I'll include below for you to review.

I'll continue working on making the code faster.

Coverage reports with original code

============================================== 90 passed in 57.15s ==============================================
Poe => coverage report -m
Name                           Stmts   Miss  Cover   Missing
------------------------------------------------------------
wind_up/__init__.py                2      0   100%
wind_up/caching.py                28     13    54%   24-38
wind_up/combine_results.py        93     18    81%   58-59, 92-93, 104-105, 121, 133, 139-153
wind_up/constants.py              37      1    97%   32
wind_up/conversions.py             7      0   100%
wind_up/detrend.py               143     32    78%   103-105, 109-111, 129, 139-151, 177-178, 206-215, 248-251, 291-292, 304, 309-334
wind_up/interface.py             100     68    32%   25-26, 29-45, 51-102, 125-135, 152-181, 192-204
wind_up/long_term.py              70     16    77%   26, 29, 35-36, 39, 42, 72-73, 75-76, 83-84, 108-112, 126, 168
wind_up/main_analysis.py         357    296    17%   48-50, 60-61, 65-80, 91-145, 149-173, 189-232, 242-243, 272-273, 296-297, 313-320, 326-363, 379-435, 455-746, 750-773, 782-925
wind_up/math_funcs.py              8      0   100%
wind_up/models.py                220     20    91%   273-274, 280-281, 289-290, 292-293, 318-319, 384-385, 391-392, 418-424
wind_up/northing.py               75      6    92%   57, 74-75, 84, 100, 130
wind_up/northing_utils.py          7      0   100%
wind_up/optimize_northing.py     262     36    86%   157, 159, 161, 163, 165, 168, 171, 272-273, 281-282, 316-317, 384, 392-400, 403-404, 445, 528, 611-620, 639-640, 653, 662-663
wind_up/pp_analysis.py           231     77    67%   100-101, 180, 193, 257-275, 381-383, 398-404, 438-549
wind_up/reanalysis_data.py        87      7    92%   44-45, 74, 157, 177-178, 189
wind_up/result_manager.py         10      0   100%
wind_up/scada_funcs.py           241     47    80%   62-63, 152-155, 204-205, 261, 285-286, 347-358, 444-445, 454-491, 502-522
wind_up/scada_power_curve.py      52      9    83%   43-44, 75-77, 79-80, 82-83
wind_up/smart_data.py            102     18    82%   22-23, 35, 47-51, 55-56, 58-59, 66-67, 97-98, 108-109, 112, 154-155
wind_up/waking_state.py          151     23    85%   62-76, 179-180, 213-214, 237-239, 269, 272, 280, 288-289
wind_up/wind_funcs.py              5      0   100%
wind_up/windspeed_drift.py        37      6    84%   29-30, 36, 59, 70, 72
wind_up/ws_est.py                 76     10    87%   80, 94-95, 102-103, 129-138
wind_up/yaml_loader.py            19      2    89%   12-13
------------------------------------------------------------
TOTAL                           2420    705    71%

Coverage reports with the updated code

============================================== 99 passed in 51.41s ==============================================
Poe => coverage report -m
Name                           Stmts   Miss  Cover   Missing
------------------------------------------------------------
wind_up/__init__.py                2      0   100%
wind_up/backporting.py            12      7    42%   10-16
wind_up/caching.py                28     13    54%   24-38
wind_up/combine_results.py        96     19    80%   16, 62-63, 96-97, 108-109, 125, 137, 143-157
wind_up/constants.py              37      1    97%   32
wind_up/conversions.py             7      0   100%
wind_up/detrend.py               149     34    77%   19, 79, 124-126, 130-132, 150, 160-172, 198-199, 227-236, 269-272, 312-313, 325, 330-355
wind_up/interface.py             103     70    32%   22-24, 30-31, 34-50, 56-107, 130-140, 157-186, 197-209
wind_up/long_term.py              73     17    77%   14, 30, 33, 39-40, 43, 46, 76-77, 79-80, 87-88, 112-116, 130, 172
wind_up/main_analysis.py         358    296    17%   50-52, 62-63, 67-82, 93-147, 151-175, 191-234, 244-245, 274-275, 298-299, 315-322, 328-365, 381-437, 457-748, 752-775, 784-927
wind_up/math_funcs.py              9      0   100%
wind_up/models.py                222     20    91%   276-277, 283-284, 292-293, 295-296, 321-322, 387-388, 394-395, 421-427
wind_up/northing.py               78      7    91%   21, 61, 78-79, 88, 104, 134
wind_up/northing_utils.py          7      0   100%
wind_up/optimize_northing.py     266     39    85%   33-37, 164, 166, 168, 170, 172, 175, 178, 279-280, 288-289, 323-324, 391, 399-407, 410-411, 452, 535, 618-627, 646-647, 660, 669-670
wind_up/pp_analysis.py           234     78    67%   17, 104-105, 184, 197, 261-279, 385-387, 402-408, 442-553
wind_up/reanalysis_data.py        90      8    91%   18, 48-49, 78, 161, 181-182, 193
wind_up/result_manager.py         10      0   100%
wind_up/scada_funcs.py           244     48    80%   24, 67-68, 156-159, 208-209, 265, 289-290, 351-362, 448-449, 458-495, 506-526
wind_up/scada_power_curve.py      55     10    82%   12, 47-48, 79-81, 83-84, 86-87
wind_up/smart_data.py            102     18    82%   22-23, 35, 47-51, 55-56, 58-59, 66-67, 97-98, 108-109, 112, 154-155
wind_up/waking_state.py          154     24    84%   23, 67-81, 184-185, 218-219, 242-244, 274, 277, 285, 293-294
wind_up/wind_funcs.py              8      1    88%   7
wind_up/windspeed_drift.py        40      8    80%   11-13, 36-37, 43, 66, 77, 79
wind_up/ws_est.py                 79     11    86%   12, 85, 99-100, 107-108, 134-143
wind_up/yaml_loader.py            19      2    89%   12-13
------------------------------------------------------------
TOTAL                           2482    731    71%

@aclerc
Copy link
Contributor

aclerc commented Sep 9, 2024

Hi @izofat regarding your latest comment I see backporting.py is in the second coverage report but not the first, so I think the first coverage report does not include changes from the recent PR #17 which changed many files. I can see that your PR only changes one file so far, detrend.py, so I would focus on the coverage of that file.

@izofat
Copy link
Contributor Author

izofat commented Sep 9, 2024

@aclerc
I've fixed the issue about the test count.
But I'm still seeing some differences in the coverage reports of detrend

I can guess what stmts, miss, cover, missing all about but I wanna know what those are exactly. Maybe that can help why this difference exist.

Original

Name                           Stmts   Miss  Cover   Missing
------------------------------------------------------------
wind_up/detrend.py               146     33    77%   19, 107-109, 113-115, 133, 143-155, 181-182, 210-219, 252-255, 295-296, 308, 313-338

Updated

Name                           Stmts   Miss  Cover   Missing
------------------------------------------------------------
wind_up/detrend.py               149     34    77%   19, 79, 124-126, 130-132, 150, 160-172, 198-199, 227-236, 269-272, 312-313, 325, 330-355

@aclerc
Copy link
Contributor

aclerc commented Sep 9, 2024

Hi @izofat I think I am the same as you, I can guess what Stmts, Miss, Cover, and Missing mean but for precise definitions I would look at https://coverage.readthedocs.io/en/latest/

@izofat
Copy link
Contributor Author

izofat commented Sep 10, 2024

@aclerc
Thanks for mentioning coverage reports.
I did some research about coverage reports.
It seems like it's completely normal to change because the code have been changed.
It's ready to merge now.
I started to check remove_bad_detrend_results method this method is also slow.

Here what i did to make it faster without changing the calculations.
Filtering Efficiency: Combined the filtering of test_ws_col and ref_ws_col into a single step, minimizing the number of operations on detrend_df.
Vectorized Circular Differences: Replaced the sequential loop calculating circular differences with a vectorized Numpy array, significantly speeding up the computation for each direction bin.

@izofat izofat marked this pull request as ready for review September 10, 2024 13:03
@aclerc
Copy link
Contributor

aclerc commented Sep 10, 2024

Thanks @izofat I have put this review on my to-do list. What is the run time difference before and after your changes? Just for the test cases is fine or if you've benchmarked something else as well like the smarteole example that would be nice info.

@aclerc aclerc changed the title Optimize method performance; note that output dataframes are different Optimize calc_wsratio_v_wd function Sep 13, 2024
@aclerc
Copy link
Contributor

aclerc commented Sep 13, 2024

Hi @izofat , thank you for looking in to this optimization! I added a few commits to use pytest-benchmark to measure run time. The results I got are below. pr/24 looks a little faster but overall the performance is similar to main. I am not an optimization expert but I wonder if it would be faster to avoid calculating circ_diff at all. If detrend_df were ordered by ref_wd_col then it should be straightforward to select the rows which are within_dir_bin, just would need to handle 0-360 wrapping correctly. What do you think?

main
------------------------------------------------ benchmark: 1 tests ------------------------------------------------
Name (time in s)                   Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------
test_calc_wsratio_v_wd_scen     1.3951  2.6441  1.7732  0.5346  1.4863  0.7069       1;0  0.5640       5           1
--------------------------------------------------------------------------------------------------------------------

pr/24
------------------------------------------------ benchmark: 1 tests ------------------------------------------------
Name (time in s)                   Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------
test_calc_wsratio_v_wd_scen     1.1355  2.0814  1.3855  0.4067  1.1472  0.4392       1;0  0.7218       5           1
--------------------------------------------------------------------------------------------------------------------

@izofat
Copy link
Contributor Author

izofat commented Sep 13, 2024

Hi @aclerc,
Test result seems better than i calculated :).
%20 down at mean time and aproximently %25 increase at Operation per second (OPS).
This is very good if the project requires too frequent use of calc_wsratio_v_wd and we done this by just changing the base code not the logic.
The main reasons why functions are slow in the project is not doing one thing completely.
You are trying to do a part of each task in every iteration of the loop, instead of completing one task at a time.
Instead of that we should try to do things in order like calculating all the cirr_diffs first then going in to loop.
About the avoiding calculating the cirr_diff I don't have a certain clue about that. Ordering the dataframe would be faster with pandas than applying a math function to the dataframe. I think we should give it a change.

I would like to ask, what is the most used function in the project can you determine that. We should focus on that.

I have not been contributing for a while sorry about that. My college start next week and my project at work goes into production next week.
I will try to spend more time in the project next week.
Thanks for the feedback.
Best regards.
Gorkem

@aclerc
Copy link
Contributor

aclerc commented Sep 17, 2024

Hi @izofat , re "what is the most used function in the project" I did try to run the PyCharm profiler on a full analysis but I struggled to interpret the results. I am sure the key area to optimize will be in the function calc_test_ref_results because that has to run for every test-reference combination. The part that usually takes the most time is pre_post_pp_analysis_with_reversal_and_bootstrapping but detrending steps also take a long time. I am pretty confident that the tests test_detrend.py (relevant to detrending) and test_pp_analysis.py (relevant to pp_analysis) are the key focus areas.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants