
example generation regression between 6.47.0 -> 6.103.1 #4014

Open
BBBmau opened this issue Jun 18, 2024 · 11 comments
Labels
performance go faster! use less memory!

Comments


BBBmau commented Jun 18, 2024

Hello, we're interested in using the latest version of Hypothesis, but we noticed a significant increase in example generation time between versions:

I'm filing an issue since I've seen in the past that a similar regression occurred here, was the result of some necessary code changes, and was later fixed by performance work: #2308

Pinning to 6.47.0:

batman@workstation-shane:~/openpilot/selfdrive/car/tests$ pytest -n8 test_car_interfaces.py
/home/batman/.pyenv/versions/3.11.4/lib/python3.11/site-packages/pytest_benchmark/logger.py:46: PytestBenchmarkWarning: Benchmarks are automatically disabled because xdist plugin is active.Benchmarks cannot be performed reliably in a parallelized environment.
  warner(PytestBenchmarkWarning(text))
Test session starts (platform: linux, Python 3.11.4, pytest 8.2.1, pytest-sugar 1.0.0)
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
Using --randomly-seed=4244670443
rootdir: /home/batman/openpilot
configfile: pyproject.toml
plugins: timeout-2.3.1, xdist-3.6.1, cpp-2.5.0, cov-5.0.0, mock-3.14.0, forked-1.6.0, hypothesis-6.47.0, dash-2.11.1, benchmark-4.0.0, sugar-1.0.0, randomly-3.15.0, subtests-0.12.1, flaky-3.8.1, asyncio-0.23.7, anyio-4.4.0, nbmake-1.5.3, repeat-0.9.3
asyncio: mode=Mode.STRICT
8 workers [209 items]   collecting ... 

 selfdrive/car/tests/test_car_interfaces.py ✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓ 56% █████▋    
                                            ✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓                       100% ██████████
============================================================================ slowest 10 durations =============================================================================
1.16s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_028_FORD_F_150_LIGHTNING_MK1
1.08s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_030_FORD_MAVERICK_MK1
1.06s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_024_FORD_BRONCO_SPORT_MK1
1.03s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_025_FORD_ESCAPE_MK4
1.03s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_031_FORD_MUSTANG_MACH_E_MK1
1.02s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_029_FORD_F_150_MK14
0.99s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_026_FORD_EXPLORER_MK6
0.98s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_074_HYUNDAI_IONIQ_HEV_2022
0.96s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_027_FORD_FOCUS_MK4
0.95s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_151_SKODA_OCTAVIA_MK3

Results (22.66s):
     209 passed

On the latest 6.103.1:

batman@workstation-shane:~/openpilot/selfdrive/car/tests$ pytest -n8 test_car_interfaces.py
/home/batman/.pyenv/versions/3.11.4/lib/python3.11/site-packages/pytest_benchmark/logger.py:46: PytestBenchmarkWarning: Benchmarks are automatically disabled because xdist plugin is active.Benchmarks cannot be performed reliably in a parallelized environment.
  warner(PytestBenchmarkWarning(text))
Test session starts (platform: linux, Python 3.11.4, pytest 8.2.1, pytest-sugar 1.0.0)
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
Using --randomly-seed=264861645
rootdir: /home/batman/openpilot
configfile: pyproject.toml
plugins: timeout-2.3.1, xdist-3.6.1, cpp-2.5.0, cov-5.0.0, mock-3.14.0, forked-1.6.0, dash-2.11.1, benchmark-4.0.0, hypothesis-6.103.1, sugar-1.0.0, randomly-3.15.0, subtests-0.12.1, flaky-3.8.1, asyncio-0.23.7, anyio-4.4.0, nbmake-1.5.3, repeat-0.9.3
asyncio: mode=Mode.STRICT
8 workers [209 items]   collecting ... 

 selfdrive/car/tests/test_car_interfaces.py ✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓ 56% █████▋    
                                            ✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓                       100% ██████████
============================================================================ slowest 10 durations =============================================================================
1.70s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_029_FORD_F_150_MK14
1.65s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_027_FORD_FOCUS_MK4
1.62s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_025_FORD_ESCAPE_MK4
1.61s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_026_FORD_EXPLORER_MK6
1.56s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_028_FORD_F_150_LIGHTNING_MK1
1.56s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_108_KIA_NIRO_PHEV
1.51s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_030_FORD_MAVERICK_MK1
1.51s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_151_SKODA_OCTAVIA_MK3
1.50s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_150_SKODA_KODIAQ_MK1
1.49s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_149_SKODA_KAROQ_MK1

Results (36.67s):
     209 passed

tybug (Member) commented Jun 18, 2024

Thanks for the report! It would be great if you could bisect the versions here and narrow it down to a smaller version range (ideally, a single version). We've made quite a few core engine changes starting at around 6.89.1 that could have caused this.

Looking through #2308, I see this comment about generate_new_examples (#2308 (comment)), which may turn out to be prescient, as we just touched that in #4007.
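
For concreteness, here's a rough sketch of how that bisection could be scripted (a hypothetical harness, not an official tool): install each candidate release, run a standalone repro in a fresh interpreter, and time it. It assumes a repro script saved as repro.py (hypothetical name) that runs the slow generation when invoked with no arguments.

import subprocess
import sys
import time

# Candidate releases between the known-good and known-bad versions;
# narrow the list binary-search style (6.89.1 is roughly where the
# core engine changes mentioned above started landing).
VERSIONS = ["6.47.0", "6.89.1", "6.103.1"]

for version in VERSIONS:
  subprocess.run(
      [sys.executable, "-m", "pip", "install", "--quiet", f"hypothesis=={version}"],
      check=True,
  )
  start = time.monotonic()
  # Run in a fresh interpreter so the newly installed version is imported.
  subprocess.run([sys.executable, "repro.py"], check=False)
  print(f"hypothesis {version}: {time.monotonic() - start:.2f}s")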

tybug added the performance label on Jun 18, 2024

bongbui321 commented Jul 29, 2024

@tybug
There are minor speed regressions here and there from changes that improve the randomness of integer and float generation (e.g. trying to hit boundary values), but the big speed regression comes from tracking the IRTree in ConjectureData, introduced in commit 1e76ce2.

Can you explain what this IRTree tracking does? I'm not very familiar with the codebase. And is it possible to disable the observe condition, so that the IRTree isn't tracked, without sacrificing the quality of the generated tests?

tybug (Member) commented Jul 29, 2024

IRTree is a critical part of our new intermediate representation (#3921), which is powering improvements like faster shrinking (#3962), so we can't just turn it off! "big speed regression" after #3903 doesn't sound great, though. Thanks for narrowing this down; I'll see if I can reproduce.

bongbui321 commented:

import os
import sys
import time

import hypothesis.strategies as st
from hypothesis import Phase, given, settings

MAX_EXAMPLES = int(os.getenv("MAX_EXAMPLES", "60"))

def gen_empty_fingerprint():
  return {i: {} for i in range(8)}

@settings(max_examples=MAX_EXAMPLES, deadline=None,
          phases=(Phase.reuse, Phase.generate, Phase.shrink))
@given(data=st.data())
def get_fuzzy_car_interfaces(data):
  fingerprint_strategy = st.fixed_dictionaries({key: st.dictionaries(st.integers(min_value=0, max_value=0x800),
                                                                     st.integers(min_value=0, max_value=64)) for key in
                                                gen_empty_fingerprint()})
  # TODO: Add another st.fixed_dictionaries()
  data.draw(fingerprint_strategy)

if __name__ == "__main__":
  # first bad commit: 5de1fe84252051594fdc6879d4920c357a6d1368 - more likely to generate boundary values
  # from 3.5 - 3.8s:  6e2f394a253761677cdcc0990a32df54a62f079a
  #     - better after revert: 6e2f394a253761677cdcc0990a32df54a62f079a
  # from 4s - >5s: 1e76ce2e52e450d54470ed09b9c65fb1b598fb5c - track IRTree in ConjectureData

  # Runtime bounds passed on the command line; the exit code flags runs
  # whose total time falls inside [lower, higher] (nonzero = flagged).
  lower = float(sys.argv[1])
  higher = float(sys.argv[2])

  num_iterations = 20
  start = time.monotonic()
  for _ in range(num_iterations):
    get_fuzzy_car_interfaces()
  time_taken = time.monotonic() - start

  print(f"{num_iterations} iterations takes {time_taken}s")
  sys.exit(int(lower <= time_taken <= higher))

I have a quick script (above) for testing: 6.47.0 is more than 2x faster than 6.103.1.
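
(For reference: the lower/higher arguments look designed for git bisect run, where a nonzero exit marks a commit as bad, so a run whose total time falls inside the window gets flagged. Assuming the script is saved as repro.py, and with illustrative thresholds rather than ones from my actual runs, an invocation would look like:

git bisect run python repro.py 4 10
)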

tybug (Member) commented Aug 2, 2024

Still looking into this, but some initial diagnosis: it seems that LRUReusedCache.__balance is surprisingly expensive (8% of total runtime), and we access it often in ConjectureData._pooled_kwargs. ConjectureRunner._cache_key_ir may be another hotspot via example computation.
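
For anyone who wants to reproduce this kind of diagnosis, here's a minimal cProfile sketch (assuming the repro above is saved as repro.py; get_fuzzy_car_interfaces is the @given test from that script):

import cProfile
import pstats

# get_fuzzy_car_interfaces is the @given-wrapped test from the repro
# script posted above, assumed to be saved as repro.py next to this file.
from repro import get_fuzzy_car_interfaces

profiler = cProfile.Profile()
profiler.enable()
for _ in range(20):
  get_fuzzy_car_interfaces()
profiler.disable()

# Sort by cumulative time; internal hotspots like LRUReusedCache.__balance
# can show up near the top when they dominate the run.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(25)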

tybug (Member) commented Aug 5, 2024

@bongbui321 could you give 6.108.8 a try on your production workload? On the microbenchmark, I'm down from 5s to 3.5s. I think this is still worse than it used to be, but I'm out of low-hanging fruit for the moment. I've filed #4066 to follow up.


bongbui321 commented Aug 5, 2024

Great! I'll try it out sometime this week, and try to identify some more speed regressions post 6.47.0. Thanks for the quick fix!

jobh (Contributor) commented Aug 5, 2024

FYI @tybug

The performance of st.dictionaries element generation (with the parameters above) dropped a lot with 3f82f86; I don't know if that's known/expected?

tybug (Member) commented Aug 5, 2024

Yeah, I think that's #4066, as that commit changed mutator_groups to compute ir_starts_and_ends as an intermediary. Let's say "known now, but unexpected at the time" 😄


bongbui321 commented Aug 5, 2024

@tybug I see an improvement with the new 6.108.9 compared to 6.103.1 using the example above. But when testing with my actual test suite, performance is similar: around 27s on both 6.103.0 and 6.108.9, while 6.47.0 takes around 16s for that test.

I wonder if using pytest-xdist prevents the recent improvements you made from taking effect.
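
A quick way to check would be to rerun the suite on both versions without xdist (dropping -n8) and compare the single-process wall-clock times:

pytest test_car_interfaces.py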

tybug (Member) commented Sep 5, 2024

@bongbui321 thanks for checking. The graph in #4076 indicates we have ~recent systemic performance problems, so this may be a 'wait and see'. Sorry I couldn't be more helpful for the moment.
