Use FoldHash instead of FxHash #12943
base: main

Conversation
```diff
@@ -105,9 +105,10 @@ pyproject-toml = { version = "0.9.0" }
 quick-junit = { version = "0.4.0" }
 quote = { version = "1.0.23" }
 rand = { version = "0.8.5" }
 rustc-hash = { version = "2.0.0" }
```
I left it in red-knot; it was giving me some trouble around the lack of a `Default` impl.
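For illustration, a minimal sketch of the kind of trouble a hasher state without a `Default` impl causes (hypothetical types, not the actual red-knot code): deriving `Default` for any struct that embeds a `HashMap<K, V, S>` requires `S: Default`, so every construction site has to thread a state through by hand.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasher, Hasher};

// Hypothetical hasher state without a Default impl, standing in for
// whatever state caused trouble in red-knot.
struct SeededState(u64);

impl BuildHasher for SeededState {
    type Hasher = std::collections::hash_map::DefaultHasher;

    fn build_hasher(&self) -> Self::Hasher {
        let mut hasher = std::collections::hash_map::DefaultHasher::new();
        hasher.write_u64(self.0); // fold the seed into the initial state
        hasher
    }
}

// This derive does NOT compile, because `SeededState: Default` is missing:
// #[derive(Default)]
// struct SymbolTable {
//     symbols: HashMap<String, u32, SeededState>,
// }

fn main() {
    // Without Default, maps must be built explicitly with a state:
    let map: HashMap<String, u32, SeededState> = HashMap::with_hasher(SeededState(42));
    assert!(map.is_empty());
}
```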
We should resolve this before merging
Force-pushed from bca7a43 to 400732a.
CodSpeed Performance Report

Merging #12943 will degrade performance by 6.45%.
Benchmarks breakdown: …
Locally I see such a large regression that I have to assume I did something wrong here: …
@charliermarsh That is quite odd. Out of curiosity, what happens if you use …?
Let me try …
On my machine (Apple M2 laptop):

```
Benchmark 1: ./target/release/ruff check ~/tmp/cpython/ --no-cache --silent -e
  Time (mean ± σ):     104.1 ms ±   2.6 ms    [User: 752.9 ms, System: 87.4 ms]
  Range (min … max):   100.2 ms … 116.4 ms    100 runs

Benchmark 2: ../ruff-foldhash/target/release/ruff check ~/tmp/cpython --no-cache --silent -e
  Time (mean ± σ):     117.8 ms ±   1.7 ms    [User: 884.8 ms, System: 87.4 ms]
  Range (min … max):   114.1 ms … 122.6 ms    100 runs

Summary
  ./target/release/ruff check ~/tmp/cpython/ --no-cache --silent -e ran
    1.13 ± 0.03 times faster than ../ruff-foldhash/target/release/ruff check ~/tmp/cpython --no-cache --silent -e
```

So I can reproduce, somewhat. With …

FWIW, my benchmarks already showed that I'm slower than fxhash for … Foldhash also provides much better guarantees of staying solid regardless of input distribution; fxhash is very fast but has serious quality issues for certain inputs (which may or may not be a problem for ruff).
@charliermarsh I think there might be something else at play... I can also reproduce the slowdown... when I do … As a sanity check, I also made …
I have to sleep now, will investigate further tomorrow. Copy/pasting the …
I tried one more thing quickly before bed, and I found the source of the slowdown: it's the `Default` impl of foldhash's `RandomState`. @charliermarsh Does ruff happen to instantiate... thousands and thousands of (empty) hash maps?

The results on my machine if I monkey-patch …:

```
Benchmark 1: ../ruff/target/release/ruff check ~/tmp/cpython/ --no-cache --silent -e
  Time (mean ± σ):     103.4 ms ±   2.0 ms    [User: 755.9 ms, System: 88.8 ms]
  Range (min … max):   100.2 ms … 112.2 ms    100 runs

Benchmark 2: ../ruff-foldhash/target/release/ruff check ~/tmp/cpython --no-cache --silent -e
  Time (mean ± σ):     104.7 ms ±   2.1 ms    [User: 760.3 ms, System: 88.7 ms]
  Range (min … max):   100.5 ms … 116.1 ms    100 runs

Summary
  ../ruff/target/release/ruff check ~/tmp/cpython/ --no-cache --silent -e ran
    1.01 ± 0.03 times faster than ../ruff-foldhash/target/release/ruff check ~/tmp/cpython --no-cache --silent -e
```

The offending code (from foldhash):

```rust
impl Default for RandomState {
    fn default() -> Self {
        // We use our current stack address in combination with
        // PER_HASHER_NONDETERMINISM to create a new value that is very likely
        // to have never been used as a random state before.
        //
        // PER_HASHER_NONDETERMINISM ensures that even if we create two
        // RandomStates with the same stack location it is still highly unlikely
        // you'll end up with the same random seed.
        //
        // PER_HASHER_NONDETERMINISM is loaded and updated in a racy manner, but
        // this doesn't matter in practice - it is impossible that two different
        // threads have the same stack location, so they'll almost surely
        // generate different seeds, and provide a different possible update for
        // PER_HASHER_NONDETERMINISM. If we would use a proper atomic update
        // then hash table creation would have a global contention point, which
        // users could not avoid.
        //
        // Finally, not all platforms have a 64-bit atomic, so we use usize.
        static PER_HASHER_NONDETERMINISM: AtomicUsize = AtomicUsize::new(0);

        let nondeterminism = PER_HASHER_NONDETERMINISM.load(Ordering::Relaxed) as u64;
        let stack_ptr = &nondeterminism as *const _ as u64;
        let per_hasher_seed = folded_multiply(nondeterminism ^ stack_ptr, ARBITRARY2);
        PER_HASHER_NONDETERMINISM.store(per_hasher_seed as usize, Ordering::Relaxed);

        Self {
            per_hasher_seed,
            global_seed: GlobalSeed::new(),
        }
    }
}
```
The problem is contention on `PER_HASHER_NONDETERMINISM`: every `RandomState::default()` call stores to that shared static, so when many threads create hash maps at once the cache line holding it ping-pongs between cores.
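Foldhash's eventual fix isn't shown in this thread; purely as a sketch of the general technique, per-thread state removes the shared store (hypothetical code, not foldhash's actual implementation):

```rust
use std::cell::Cell;

// Hypothetical fix sketch: per-thread nondeterminism instead of a global
// atomic. Each thread evolves its own counter, so creating a hash map never
// writes to memory shared with other threads.
fn per_hasher_seed() -> u64 {
    thread_local! {
        static PER_HASHER_NONDETERMINISM: Cell<u64> = const { Cell::new(0) };
    }
    PER_HASHER_NONDETERMINISM.with(|nd| {
        // Mix in the address of the thread-local itself so different threads
        // start from different values; any cheap mixer would do here.
        let addr = nd as *const _ as u64;
        let next = (nd.get() ^ addr).wrapping_mul(0x2d358dccaa6c78a5);
        nd.set(next);
        next
    })
}

fn main() {
    // Two consecutive seeds on the same thread should differ.
    assert_ne!(per_hasher_seed(), per_hasher_seed());
    println!("seed: {:#018x}", per_hasher_seed());
}
```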
I think the solution for us specifically is to use …
Wow, thanks for taking a look @orlp! Please don't feel obligated of course -- only if it's interesting for you.
I would say... yes. At least, over the course of that command. CPython contains ~2,000 Python files, and we probably instantiate at least 10 hash maps per file -- so I'd guess something like ~20,000 hash maps?
Oh TIL. I could modify … to use …
@charliermarsh I will try to release 0.1.1 today, which gets rid of the contention. I would rather not see people use …
I'm afraid your guess is off by an order of magnitude and a half. Out of curiosity I inserted a count tracker in …
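Where exactly the count tracker went is elided above; a minimal sketch of one way to count map creations, using a wrapper `BuildHasher` whose `Default` increments a global counter (all names hypothetical):

```rust
use std::collections::hash_map::RandomState;
use std::collections::HashMap;
use std::hash::BuildHasher;
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical instrumentation: count how many hash maps are instantiated by
// counting how often the map's BuildHasher is default-constructed (once per map).
static MAPS_CREATED: AtomicUsize = AtomicUsize::new(0);

struct CountingState(RandomState);

impl Default for CountingState {
    fn default() -> Self {
        MAPS_CREATED.fetch_add(1, Ordering::Relaxed);
        CountingState(RandomState::default())
    }
}

impl BuildHasher for CountingState {
    type Hasher = std::collections::hash_map::DefaultHasher;

    fn build_hasher(&self) -> Self::Hasher {
        self.0.build_hasher()
    }
}

fn main() {
    for _ in 0..3 {
        let _map: HashMap<u32, u32, CountingState> = HashMap::default();
    }
    println!("maps created: {}", MAPS_CREATED.load(Ordering::Relaxed));
}
```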
Doesn't sound unreasonable. We have some rules that create hash maps internally to detect duplicates, and they are called for every node of a specific type. We could probably optimize some of those by re-using a hash map across calls instead of creating a new one whenever the function is called (see the sketch below). It's one of the main downsides of our rules system today: the rules are all stateless functions, and there's no option to run any setup or teardown code.
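A sketch of that reuse pattern with hypothetical names (ruff's actual rule API differs): the caller owns a scratch map and the rule `clear()`s it, keeping the allocation across calls.

```rust
use std::collections::HashMap;

// Hypothetical rule: report duplicate string keys within one node. Instead of
// building a fresh HashMap per call, the caller owns a scratch map that is
// cleared (capacity kept) between invocations.
fn check_duplicates<'a>(keys: &[&'a str], scratch: &mut HashMap<&'a str, usize>) -> Vec<usize> {
    scratch.clear(); // retains the allocation from previous calls
    let mut duplicates = Vec::new();
    for (i, key) in keys.iter().enumerate() {
        if scratch.insert(key, i).is_some() {
            duplicates.push(i);
        }
    }
    duplicates
}

fn main() {
    // One reused map across "nodes" instead of one allocation per node.
    let mut scratch = HashMap::new();
    for node in [vec!["a", "b", "a"], vec!["x", "y"]] {
        println!("{:?}", check_duplicates(&node, &mut scratch));
    }
}
```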
@charliermarsh I just pushed 0.1.1. With that fix in place I ran:

```
foldhash:        Time (mean ± σ):   103.4 ms ±  2.0 ms
foldhash(norng): Time (mean ± σ):   104.4 ms ±  1.8 ms
rustc-hash:      Time (mean ± σ):   105.0 ms ±  3.5 ms
ahash(norng):    Time (mean ± σ):   105.9 ms ±  4.7 ms
siphash:         Time (mean ± σ):   114.3 ms ±  4.7 ms
ahash:           Time (mean ± σ):   120.5 ms ±  8.4 ms
```

(Note that in earlier benchmarks I got 103.4 and 104.1 as times for …)

The siphash run falls back to the standard library's default SipHash by commenting out the custom `RandomState` in ruff's type aliases:

```rust
/// Type alias for [`std::collections::HashMap<K, V, foldhash::fast::RandomState>`].
// pub type HashMap<K, V> = std::collections::HashMap<K, V, RandomState>;
pub type HashMap<K, V> = std::collections::HashMap<K, V>;

/// Type alias for [`std::collections::HashSet<T, foldhash::fast::RandomState>`].
// pub type HashSet<T> = std::collections::HashSet<T, RandomState>;
pub type HashSet<T> = std::collections::HashSet<T>;
```

The poor performance for ahash …

EDIT: I also added a run …
Thanks @orlp, that's awesome. Appreciate you doing all this analysis here. Any idea why …?
@charliermarsh I think you meant the other way around? Anyway, I would assume that is noise, unless you happen to fill a hashmap/set from the iterator of another hashmap/set in some part of the code (because that is quadratic if they share the same hasher).
I don't have a strong opinion on this change, but I'm leaning towards not merging it, considering that there's no significant performance improvement.

- It introduces a new dependency, and it's unclear if we can convince upstream dependencies to make the same switch, considering that there's no clear performance win.
- The PR might cause a few merge conflicts. They're easy to resolve, but it requires extra work.
- We've all trained our muscle memory to use `FxHashSet`. Using `foldhash` will take some time to get used to.

Are there any advantages other than performance for using foldhash over rustc-hash that offset my concerns above? I read that it provides some basic DoS mitigation, although I'm not sure that's worth much, considering that it only raises the bar but doesn't prevent such attacks in any way.
Only two upstream dependencies of ruff use …
What we do in polars is that we have a …
Perhaps I was a bit too careful in my wording. I don't believe that is a realistic attacker model for ruff. Against non-interactive attackers I believe it is essentially impossible to construct inputs which will always collide regardless of the seed. So assuming you don't have someone heavily studying a long-running server executing some incredibly-long-running …
Ultimately they all come down to performance. Hash maps/sets are probabilistic data structures that rely on the random nature of hashes for their expected-time arguments. If you take away the randomization you lose the expected O(1) time per operation. This can either bite you with malicious inputs that intentionally cause collisions, or it can happen by accident: for example, if you iterate over one hash set to fill another, and both use linear probing with the same random seed, the expected time is O(n^2) (see the sketch below). Using foldhash prevents all of these issues categorically (if not facing an interactive attacker). If foldhash isn't any slower for typical input, why not also ensure you're fast for edge cases / malicious input?
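To make the accidental case concrete, a small experiment one could run with rustc-hash's deterministic `FxBuildHasher` (sizes arbitrary; whether and how severely the slowdown shows up depends on the table implementation):

```rust
use rustc_hash::FxBuildHasher;
use std::collections::HashSet;
use std::time::Instant;

fn main() {
    // Both sets use the exact same (deterministic) hasher, so iteration order
    // of `a` correlates with slot order in `b` while `b` is growing. This is
    // the pattern that can degrade toward quadratic behavior.
    let a: HashSet<u64, FxBuildHasher> = (0..1_000_000).collect();

    let start = Instant::now();
    let b: HashSet<u64, FxBuildHasher> = a.iter().copied().collect();
    println!("same-seed refill:   {:?} ({} items)", start.elapsed(), b.len());

    // Compare against a randomly seeded target set, which breaks the
    // correlation between iteration order and slot order.
    let start = Instant::now();
    let c: HashSet<u64> = a.iter().copied().collect();
    println!("random-seed refill: {:?} ({} items)", start.elapsed(), c.len());
}
```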
Yes, I'm a contributor but such a change would still require prior discussion.
We do this in red-knot too. It does help to minimize the change, but we called …
Thanks, I wasn't aware of that. I don't feel strongly about the change, but my concerns remain: it's not a free win, and it requires additional work that I don't consider a priority right now.
Alright, that wasn't too hard. No need even for a dictionary: every string of the form `AcRwqMBJW5iFhxpF________csQwAdk2` will collide with the exact same hash using rustc-hash 2.0.0, regardless of which 8-byte string you use to replace `________`:

```rust
use std::hash::BuildHasher;

fn main() {
    let hasher = rustc_hash::FxBuildHasher::default();
    dbg!(hasher.hash_one("AcRwqMBJW5iFhxpF________csQwAdk2"));
    dbg!(hasher.hash_one("AcRwqMBJW5iFhxpF_foo____csQwAdk2"));
    dbg!(hasher.hash_one("AcRwqMBJW5iFhxpF_foobar_csQwAdk2"));
    dbg!(hasher.hash_one("AcRwqMBJW5iFhxpF__orlp__csQwAdk2"));
}
```

So just using alphanumeric strings that's already an instant …
@MichaReiser To make this extra concrete, the following script outputs a 3.5 MB Python file:

```python
import random, string

TEMPLATE = "AcRwqMBJW5iFhxpF________csQwAdk2 = 0"
ALPHANUM = string.ascii_uppercase + string.ascii_lowercase + string.digits
for _ in range(10**5):
    r = ''.join(random.choices(ALPHANUM, k=8))
    print(TEMPLATE.replace("________", r))
```

On my machine with this PR the output file takes ~64 ms to process using …
Summary

Entirely out of curiosity … it was pretty easy.