Incorrect handling of Unicode characters #41

Open
kernelmethod opened this issue May 18, 2022 · 0 comments


As @fonsp pointed out in #39, URIs.jl does not handle Unicode characters correctly, at least according to RFC 3986. RFC 3986 Sec. 1.2.1 implies that URIs should only contain characters from the US-ASCII charset and that any other characters should be percent-encoded (RFC 3987 makes this a little more explicit). URIs.jl, however, accepts and works with any string as input, regardless of the underlying character set:

julia> using URIs

julia> url = URI("https://a/🌟/e")
URI("https://a/🌟/e")

julia> url.path
"/🌟/e"
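To make the character-set rule concrete, here is a minimal Python sketch that flags characters RFC 3986 does not allow to appear literally in a URI. The helper name `disallowed_chars` is made up for illustration and is not part of URIs.jl or any library. It only checks the character set: it does not verify that each "%" is followed by two hex digits or that the URI is otherwise well-formed.

```python
import string

# Characters that RFC 3986 allows to appear literally in a URI:
# unreserved + reserved (gen-delims and sub-delims) + "%" for escapes.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
RESERVED = set(":/?#[]@!$&'()*+,;=")
ALLOWED = UNRESERVED | RESERVED | {"%"}


def disallowed_chars(uri):
    """Hypothetical helper: return (index, character) pairs for every
    character in `uri` that falls outside the RFC 3986 character set."""
    return [(i, ch) for i, ch in enumerate(uri) if ch not in ALLOWED]


print(disallowed_chars("https://a/🌟/e"))            # [(10, '🌟')]
print(disallowed_chars("https://a/%F0%9F%8C%9F/e"))  # []
```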

After digging into this for a bit, there seems to be a split in how the standard (or most commonly used) URI library behaves across languages. In JavaScript, Go, and Rust, passing in a URI that contains Unicode characters either causes the URI to be percent-encoded or raises an error:

JavaScript
>> new URL("https://a/🌟/e").pathname
"/%F0%9F%8C%9F/e"
Go
package main

import (
	"fmt"
	"net/url"
	"os"
)

func main() {
	url, err := url.Parse("https://a/🌟/e")
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error parsing url: %s", err)
		return
	}
	fmt.Printf("%s\n", url)
	// Prints https://a/%F0%9F%8C%9F/e
}
Rust

Rust's http crate rejects a Unicode URI outright. With Uri::from_static, which panics on invalid input, that looks like this:

use http::Uri;

fn main() {
    let uri = Uri::from_static("https://a/🌟/e");
    println!("{}", uri.path());
}
$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/uri`
thread 'main' panicked at 'static str is not valid URI: invalid uri character', /home/kernelmethod/.cargo/registry/src/github.com-1ecc6299db9ec823/http-0.2.7/src/uri/mod.rs:365:23
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

But this isn't universally the case: in Python and Java, the Unicode characters are preserved as-is:

Python
>>> from urllib.parse import urlparse
>>> url = urlparse("https://a/🌟/e")
>>> url.path
'/🌟/e'
Java
import java.net.*;

class URITesting {
    public static void main(String[] args) {
        try {
            URI url = new URI("https://a/🌟/e");
            System.out.printf("path = %s\n", url.getPath());
        }
        catch (URISyntaxException ex) {
            System.out.println(ex);
        }
    }
}

One potential difference between these languages is that Java's java.net.URI tries to comply with RFC 2396 (the predecessor of RFC 3986), whereas Python's urllib.parse.urlparse seems to try to comply with a mix of standards.


In any case, there's a bit of a dilemma here: this library doesn't quite implement the RFC as specified, an issue that has also cropped up elsewhere, e.g. in the implementation of normpath (#20) and joinpath (related issue: #18). As far as this issue is concerned, there seem to be three ways URIs.jl could go:

  1. Percent-encode strings when we generate a URI to ensure compliance with the spec;
  2. Implement RFC 3987 under the hood, which does permit Unicode characters; or
  3. Keep the library's current behavior and try to specify which parts of URIs.jl comply with which RFCs, similar to what Python does for its urllib.parse module.
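As a rough illustration of what option (1) might involve, here is a Python sketch built on `urllib.parse.quote`. The function name `percent_encode_uri` is hypothetical, and this is not how URIs.jl works today. It encodes every character outside the RFC 3986 set as UTF-8 percent-escapes and leaves reserved characters and existing escapes alone. It is deliberately simplistic: it treats the URI as one string, so a non-ASCII host is percent-encoded rather than converted with IDNA, and a stray literal "%" is not detected.

```python
from urllib.parse import quote

# RFC 3986 reserved characters, plus "%" so that existing escapes are
# passed through instead of being double-encoded (an assumption that the
# input's "%" characters already introduce valid escapes).
SAFE = ":/?#[]@!$&'()*+,;=" + "%"


def percent_encode_uri(uri):
    """Hypothetical helper: percent-encode everything that is neither
    unreserved nor reserved, using UTF-8 for non-ASCII characters."""
    return quote(uri, safe=SAFE)


print(percent_encode_uri("https://a/🌟/e"))
# https://a/%F0%9F%8C%9F/e
```

Applying the function to an already-encoded URI leaves it unchanged, which matters if encoding happens every time a URI is constructed.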

I think option (1) is preferable: this library says that it implements URIs according to RFC 3986, so it should comply with that RFC.
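One point that may be relevant when weighing options (1) and (2): for UTF-8 input, the percent-encoded form loses no information, so a Unicode form can still be recovered for display. The Python sketch below shows the round trip for the path from the examples above. The function names are hypothetical. Note that `unquote` decodes every escape, including reserved characters such as `%2F`, which a careful URI-to-IRI conversion would leave encoded.

```python
from urllib.parse import quote, unquote


def encode_path(path):
    """Hypothetical helper: percent-encode a path, keeping "/" literal."""
    return quote(path, safe="/")


def display_path(path):
    """Hypothetical helper: decode percent-escapes as UTF-8 for display."""
    return unquote(path)


encoded = encode_path("/🌟/e")
print(encoded)                # /%F0%9F%8C%9F/e
print(display_path(encoded))  # /🌟/e
```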
