Improve citation of archived urls from Wayback #3327

Open
wants to merge 1 commit into
base: master

mvolz
Contributor

@mvolz mvolz commented Jul 1, 2024

People citing Wayback URLs presumably want the metadata from the website or news article itself.

Using the Embedded Metadata translator improves these results, e.g. by including the author and the date the original article was published instead of the archive date.

Ideally we would use the translator specific to the original URL, but this is a good first pass that improves things.

This also sets the access date to the date the page was archived, since that is the version of the page the metadata is extracted from.
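
For illustration, both the archive date and the original URL can be read out of a Wayback snapshot URL. The sketch below is a minimal, hypothetical helper (`parseWaybackUrl` and its field names are not part of this PR or of Zotero) and assumes the common `/web/<14-digit timestamp>/<original URL>` form; snapshot URLs with a modifier after the timestamp would not match and return `null`.

```javascript
// Hypothetical helper for illustration only; not code from this PR.
function parseWaybackUrl(url) {
	// Assumed shape: .../web/YYYYMMDDhhmmss/<original URL>
	const match = url.match(/\/web\/(\d{4})(\d{2})(\d{2})\d{6}\/(https?:\/\/.*)$/);
	if (!match) {
		return null;
	}
	const [, year, month, day, originalUrl] = match;
	return {
		archiveUrl: url,
		originalUrl,
		// Date the snapshot was taken, usable as an access date
		archivedOn: `${year}-${month}-${day}`
	};
}

const parsed = parseWaybackUrl(
	'https://web.archive.org/web/20240701123456/https://example.com/news/story'
);
console.log(parsed.archivedOn);  // 2024-07-01
console.log(parsed.originalUrl); // https://example.com/news/story
```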

@AbeJellinek
Member

Thanks - this is something I've wanted to work on for a while.

Haven't tried it, but couldn't we do something like:

if (typeof Proxy !== 'undefined') {
	doc = new Proxy(doc, {
		get(target, prop) {
			if (prop === 'location') {
				return new URL(url.match(/\/web\/(\d{4})(\d{2})(\d{2})\d{6}\/(http.*)$/)[4]);
			}
			return target[prop];
		}
	});
}
translator.setDocument(doc);
let possibleTranslators = await translator.getTranslators();
if (possibleTranslators.length) {
	translator.setTranslator(possibleTranslators);
}
else {
	translator.setTranslator('951c027d-74ac-47d4-a107-9c3069ab7b48');
}

to use site-specific translators when possible instead of just EM?

That'll only work in places where we have access to Proxy, which could just be Scaffold for all I know, but it would be really cool if it worked!

(I'll look into this more later.)
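
As a rough, standalone illustration of the `location` override idea, the sketch below wraps a plain object standing in for a document. `fakeDoc` and `withOriginalLocation` are hypothetical names, and the method binding is an addition that is not in the snippet above: it is included on the assumption that methods of a real DOM document may fail when invoked with a proxy as `this`. It has not been tried against real translators.

```javascript
// Plain object standing in for a real Document (assumption for the demo).
const fakeDoc = {
	location: new URL('https://web.archive.org/web/20240701123456/https://example.com/news/story'),
	title: 'Example story',
	getTitle() {
		return this.title;
	}
};

// Hypothetical wrapper: reports the original URL as `location`,
// forwards everything else to the wrapped object.
function withOriginalLocation(doc, originalUrl) {
	return new Proxy(doc, {
		get(target, prop) {
			if (prop === 'location') {
				return new URL(originalUrl);
			}
			const value = Reflect.get(target, prop);
			// Bind methods to the real target so they do not run with the proxy as `this`
			return typeof value === 'function' ? value.bind(target) : value;
		}
	});
}

const proxied = withOriginalLocation(fakeDoc, 'https://example.com/news/story');
console.log(proxied.location.hostname); // example.com
console.log(proxied.getTitle());        // Example story
console.log(fakeDoc.location.hostname); // web.archive.org
```

The wrapped object itself is left unchanged, so the archive URL is still available from it after translator detection.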

@mvolz
Contributor Author

mvolz commented Jul 3, 2024

Another thing to note: there's no obvious field to put the archive.org URL in. In this change I've used the original URL as the main URL (because that's what we want upstream!), but ideally we'd keep the archive.org URL as well. Maybe for this change you'd prefer to leave the archive.org link as the main URL instead of what I've done.

Ideally we'd have a separate field so that both can be included?

@dstillman
Member

Yeah, I'm pretty sure we want the archive.org URL as the main URL. If someone is saving from the Wayback Machine, there's a decent chance the original URL is no longer available or has changed, and so the archive.org URL is where you go online (e.g., double-clicking in Zotero) to see the content in question, and the point of including a URL in a citation is to enable others to see the original content. There's also the epistemological argument that, while we can believe that IA is trustworthy and not covertly changing the contents of archived pages, it's just factually inaccurate to claim that the original URL represents the content being referenced.

Why do you want the original URL on Wikipedia?

3 participants