-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixup .mailmap and add maintenance script to tools/ #9856
base: main
Are you sure you want to change the base?
Conversation
Thank you for opening a new pull request. Before your PR can be merged it will first need to pass continuous integration tests and be reviewed. Sometimes the review process can be slow, so please be patient. While you're waiting, please feel free to review other open PRs. While only a subset of people are authorized to approve pull requests for merging, everyone is encouraged to review open pull requests. Doing reviews helps reduce the burden on the core team and helps make the project's code better for everyone. One or more of the the following people are requested to review this:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm certainly happy tidying up the mailmap file where there's things missing. I'm also not bothered much if we want to store this script in the tools
directory to help, though I don't really know where the line on what we put there ought to be - maybe this would be better just staying somewhere in a personal repo for you? (Especially if you want to work on it more.)
# if email2 == canonical_email: | ||
# print_yellow( | ||
# f"Dropped mailmap superfluous second email <{email2}> that is equal to the first " | ||
# f"{mailmap_line(index)}" | ||
# ) | ||
# line = f"{canonical_name} <{canonical_email}>" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you want to clean this out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It tries to remove the cases like you pointed out for Ismael's entry only ones where the second actually email isn't needed. But it is sometimes over-zealous and I couldn't think of an easy way to get it to only trigger in the needed cases. I'll add a TODO comment to clarify and maybe someone else has an idea
elif all(dup_emails_canonical): | ||
emails_list = ", ".join(f"<{e}>" for e in dup_emails) | ||
msg = f" All {len(emails_list)} names are in mailmap: {emails_list}\n" | ||
msg += " This tool doesn't handle when normalization merges two unique authors." | ||
# If this comes up, either split the list of duplicates, | ||
# or do less-aggressive normalization | ||
elif dup_emails_canonical.count(True) == 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to be concerned about performance at all here? We do a lot of iterating all the way through the tuple. I'm guessing it probably shouldn't really matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think we care. It's fast enough since it is only running this on the cases where there is a duplication.
# Load the raw author entries | ||
raw_authors = get_raw_authors() | ||
if len(raw_authors) < 10: | ||
print_red("Problem: `get_raw_authors()` returned too few authors") | ||
sys.exit(ExitCode.GIT_LOG) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 10?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was just a sanity check but probably if I try this on some of the less-trafficked qiskit repos I'll need to reduce it. Maybe down to just 1.
Pull Request Test Coverage Report for Build 10946868160Details
💛 - Coveralls |
Summary
Our .mailmap was getting a bit crusty, with both some superfluous entries, and not properly deduplicating all authors (especially those in
Co-Authored-By:
lines).It was enough of a pain to do this that I wrote a dumb script to help.
The problem is sufficiently annoying that I now understand why there is no existing tool to do this properly...
The script calls out to
git log
andgit shortlog
to get authors.It parses the
.mailmap
file looking for errors and unnecessary lines.It does loose matching on names and emails by normalizing unicode and dropping punctuation.
It makes some assumptions about uniqueness of names and emails that are too restrictive to work as a general-purpose mailmap-fixing tool, but those assumptions hold for the current state of our repo and are likely to remain true.
I'm not exactly sure what to do with the script, but checking it into
tools/
with this PR, so it doesn't get lost seems a good start. If we wanted to trust it more then it's complex enough to need tests (where should tests fortools/
scripts even live?). Should it get aMakefile
line or atox
environment?The script is probably too fragile to make it part of the CI, although if you wanted to do it, you should use the form:
$ tools/fix_mailmap.py -sn
which would give a fairly terse output and not make any changes to
.mailmap
, just report a non-zero exit code if there is an error. It requires python 3.9 and no other dependencies.For interactive non-CI use it gives pretty complete feedback about what it's doing but you should definitely look at the output and check that it's making good choices since there's a lot of heuristics going on. Generally the pattern should be something like:
See the individual commits to this PR as examples of how that went for this big fixup. Hopefully things won't get so bad in general.
It's not a good idea to just run the script a whole bunch of times in a row without looking at the output because any bad assumptions could propagate pretty far through the mailmap by that time and it will be more of a pain to fix it.
Results
Before this PR:
After this PR:
and
$ git shortlog -sn --group trailer:Co-Authored-By --group author --email | grep '<'
gives a clean-looking author list.
I put in the
grep '<'
because we have some 85 mis-formed trailer lines of the formrather than the correct
I don't think there's any way to use a
.mailmap
to fix those - a history rewrite would be the only way to normalize those.