GH-44066: [Python] Add Python wrapper for JsonExtensionType #44070

Open
wants to merge 7 commits into
base: main
from

Conversation

Member

@rok rok commented Sep 11, 2024

Rationale for this change

We added canonical JsonExtensionType and we should make it usable from Python.

What changes are included in this PR?

Python wrappers for JsonExtensionType and JsonArray are added on the Python side, and JsonArray is added on the C++ side.

Are these changes tested?

Python tests for the extension type and array are included.

Are there any user-facing changes?

This adds the canonical JSON extension type to pyarrow.


⚠️ GitHub issue #44066 has been automatically assigned in GitHub to PR creator.

Member

@jorisvandenbossche jorisvandenbossche left a comment

Thanks for working on this!

Added some inline comments. One other comment:

  • Can you add the new classes to test_extension_type_constructor_errors in test_misc.py

Are you planning to work on enabling the parquet integration in a later PR?

class JsonArray(ExtensionArray):
    """
    Concrete class for Arrow arrays of JSON data type.

Member

Should we maybe note here that this array class does not guarantee that the data is actually valid JSON? (which I assume is the case?)

For example in the code example below, you can in theory put whatever string data you want in the array.
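To illustrate the point about validity: since the array class is not expected to check its contents, a caller who needs real JSON would have to validate the strings separately. A minimal sketch using only the standard library follows; `find_invalid_json` is a hypothetical helper written for this illustration, not part of pyarrow, and it works on a plain Python list of strings rather than on an Arrow array.

```python
import json


def find_invalid_json(values):
    """Return the indices of entries that do not parse as JSON.

    Hypothetical helper for illustration only (not a pyarrow API).
    None entries are treated as nulls and skipped.
    """
    bad = []
    for i, value in enumerate(values):
        if value is None:
            continue
        try:
            json.loads(value)
        except (TypeError, ValueError):  # JSONDecodeError subclasses ValueError
            bad.append(i)
    return bad


rows = ['{"id": 30, "values": ["a", "b"]}', None, "not json", "[1, 2"]
print(find_invalid_json(rows))
```

A check like this could be run on the values before wrapping them in the extension array, if the consumer depends on every entry being parseable.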

Member Author

@rok rok Sep 12, 2024

Added a note. Let me know if it's not extensive enough.


assert json_type.extension_name == "arrow.json"
assert json_type.storage_type == storage_type
assert json_type.__class__ is pa.JsonType
Member

Can you also check equality at the type instance level?

Suggested change
- assert json_type.__class__ is pa.JsonType
+ assert json_type.__class__ is pa.JsonType
+ assert json_type == pa.json(storage_type)
+ assert json_type != storage_type

Which also makes me wonder: if the storage type is different (e.g. string vs large string), does the comparison then return False? (It may be good to test that as well.)

Member Author

rok commented Sep 12, 2024

> Thanks for working on this!

Thanks for the review!

>   • Can you add the new classes to test_extension_type_constructor_errors in test_misc.py

Added.

> Are you planning to work on enabling the parquet integration in a later PR?

Do we need to do something Python-side? C++ side was covered by #13901. I've added a basic parquet test here and it works for me locally.

Question: any idea why pytest is saying pa.json is not callable in CI? Locally this works as expected.

Member

> Do we need to do something Python-side? C++ side was covered by #13901. I've added a basic parquet test here and it works for me locally.

I would be surprised if it works locally out of the box, because IIRC the option was disabled by default in C++ for now (so what we would need on the Python side is to expose the new arrow_extensions_enabled option so that the user can enable it).
Although I suppose that you wrote the file with pyarrow, and at that point we give priority to the stored schema, even if the option is not enabled.

> Question: any idea why pytest is saying pa.json is not callable in CI? Locally this works as expected.

There might somehow be some conflict with the stdlib module?

Member

> Question: any idea why pytest is saying pa.json is not callable in CI? Locally this works as expected.

> There might somehow be some conflict with the stdlib module?

Actually not that, but we already have a pyarrow.json module, so we will have to name this function differently anyway. For other functions with such a conflict we added a trailing underscore, so we could do that here as well.
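For readers wondering how a function can be callable locally and not in CI: a plausible mechanism (an assumption, not something confirmed in this thread beyond the name clash itself) is that importing a submodule rebinds the attribute of the same name on the parent package. The sketch below reproduces this with a throwaway package called `demo_pkg`, a made-up name used only for this illustration; it does not touch pyarrow.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_name_clash():
    """Show a package-level function being shadowed by a same-named submodule.

    `demo_pkg` is a hypothetical package created in a temporary directory.
    Returns (callable before submodule import, callable after).
    """
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = Path(tmp) / "demo_pkg"
        pkg_dir.mkdir()
        # The package defines a function named `json` ...
        (pkg_dir / "__init__.py").write_text(
            "def json():\n    return 'type factory'\n"
        )
        # ... and also ships a submodule named `json`.
        (pkg_dir / "json.py").write_text("")
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            pkg = importlib.import_module("demo_pkg")
            before = callable(pkg.json)  # still the function
            importlib.import_module("demo_pkg.json")
            after = callable(pkg.json)  # now the submodule object
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("demo_pkg", None)
            sys.modules.pop("demo_pkg.json", None)
    return before, after


print(demo_name_clash())
```

If this is what happened here, the result would depend on whether any earlier test had imported the submodule, which would fit the local versus CI difference. A trailing underscore on the function name avoids the clash entirely.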

@pytest.mark.parametrize("storage_type", (
    pa.utf8(), pa.large_utf8(), pa.string(), pa.large_string()))
@pytest.mark.parquet
def test_parquet_json(tmpdir, storage_type):
Member

Maybe move those tests to parquet subdir? (e.g. python/pyarrow/tests/parquet/test_data_types.py)
I know we already have parquet related tests in this file, but those are for custom extension type support, while this will be a built-in extension type

Member Author

Moved.

@rok force-pushed the python_json_extension_type_wrapper branch from 4393536 to cb6820b (September 12, 2024 14:01)
Member Author

rok commented Sep 12, 2024

>> Do we need to do something Python-side? C++ side was covered by #13901. I've added a basic parquet test here and it works for me locally.

> I would be surprised if it works locally out of the box, because IIRC the option was disabled by default in C++ for now (so what we would need on the Python side is to expose the new arrow_extensions_enabled option so that the user can enable it). Although I suppose that you wrote the file with pyarrow, and at that point we give priority to the stored schema, even if the option is not enabled.

I see. I assumed arrow_extensions_enabled was on by default. Do we want to expose a global setter? E.g.: pyarrow.parquet.set_arrow_extensions_enabled? Or would it be preferred to add a parameter for pq.write_table?

>> Question: any idea why pytest is saying pa.json is not callable in CI? Locally this works as expected.

> There might somehow be some conflict with the stdlib module?
> [..]
> Actually not that, but we already have a pyarrow.json module, so we will have to name this function differently anyway. For other functions with such a conflict we added a trailing underscore, so we could do that here as well.

It seems this was the case; I changed the extension type name to pa.json_.

@rok force-pushed the python_json_extension_type_wrapper branch 2 times, most recently from 0abe429 to 945e5fa (September 12, 2024 15:43)
@rok force-pushed the python_json_extension_type_wrapper branch from 945e5fa to b176855 (September 12, 2024 17:03)
inner = array.cast(storage_type)
assert inner == storage
Contributor

Nice!

Member Author

Thanks for providing that :)

Member Author

rok commented Sep 17, 2024

@jorisvandenbossche ping :)
