[WIP] Add a c++ implementation for podio-dump #620

Open: wants to merge 8 commits into base: master

tmadlener (Collaborator) commented Jun 7, 2024

BEGINRELEASENOTES

  • Add an implementation of podio-dump in c++ to make dumping files quicker

ENDRELEASENOTES

This is an attempt at making podio-dump quicker after several complaints (e.g. key4hep/EDM4hep#312). After some "profiling" it turns out that the slowest part of the python implementation is the loop over all the collections, which can be sped up significantly by moving to c++. In my local timings the current (python based) podio-dump is roughly eight times slower than the new (c++ based) podio-dump-cpp when dumping the example_frame.root file from the tests (timings via time):

        podio-dump   podio-dump-cpp
real    12.393s      1.513s
user    8.522s       1.251s
sys     3.823s       0.296s

The main disadvantages of the c++ implementation are that we need quite a bit of boilerplate for things that are trivial in python, e.g.:

  • We have to manually implement argument parsing and (parts of) the tabulate functionality
    • Since formatting with iostream and iomanip is bordering on masochism, I have decided to pull in fmt for now. In principle c++20 has similar functionality in std::format, but the counterpart of fmt::print (std::print) only comes with c++23. On top of that, std::format requires gcc >= 13 and clang >= 16.
  • Dumping datamodel definitions in YAML is missing entirely at the moment, since that would require dumping the internal json format as YAML. In python this is literally these 2 lines:

    podio/tools/podio-dump

    Lines 99 to 100 in d275460

    model_def = json.loads(reader.get_datamodel_definition(model_name))
    print(yaml.dump(model_def, sort_keys=False, default_flow_style=False))

    in c++ this would require at least one other library to be pulled in
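
To give an idea of the argument parsing boilerplate mentioned in the list above, here is a minimal sketch using only the standard library. The DumpArgs struct, the parseArgs function and the flags it accepts are hypothetical and chosen for illustration only; they are not necessarily what this PR implements.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical option set, for illustration only.
struct DumpArgs {
  std::string inputFile{};
  std::string category{"events"}; // assumed default
  bool detailed{false};
};

// Parse the arguments that follow the program name.
// Throws std::invalid_argument on malformed input.
DumpArgs parseArgs(const std::vector<std::string>& args) {
  DumpArgs parsed;
  bool haveInput = false;
  for (std::size_t i = 0; i < args.size(); ++i) {
    const std::string& arg = args[i];
    if (arg == "-d" || arg == "--detailed") {
      parsed.detailed = true;
    } else if (arg == "-c" || arg == "--category") {
      // This option consumes the next argument as its value
      if (i + 1 >= args.size()) {
        throw std::invalid_argument(arg + " requires a value");
      }
      parsed.category = args[++i];
    } else if (!arg.empty() && arg[0] == '-') {
      throw std::invalid_argument("unknown option: " + arg);
    } else if (!haveInput) {
      // First positional argument is taken as the input file
      parsed.inputFile = arg;
      haveInput = true;
    } else {
      throw std::invalid_argument("unexpected extra argument: " + arg);
    }
  }
  if (!haveInput) {
    throw std::invalid_argument("missing input file");
  }
  return parsed;
}
```

From main this could be called as parseArgs(std::vector<std::string>(argv + 1, argv + argc)). In python, argparse provides all of this (plus help text and type conversion) in a handful of lines.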

Since dumping the datamodel would require quite a bit of work in c++, I would be in favor of keeping that in python in a separate tool, while the other functionality could be covered by the c++ implementation.
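
For the table output that would stay on the c++ side, a stripped-down version of what tabulate does can be sketched as follows. This is not the code in tools/src/tabulate.h; the Row alias and the tabulate function are hypothetical, and the sketch deliberately sticks to iostream and iomanip (rather than fmt) so that it compiles without extra dependencies.

```cpp
#include <algorithm>
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, for illustration only.
using Row = std::vector<std::string>;

// Render a header and rows as left-aligned columns separated by two
// spaces, with a dashed rule below the header.
std::string tabulate(const Row& header, const std::vector<Row>& rows) {
  // Each column is as wide as its widest cell, header included
  std::vector<std::size_t> widths(header.size(), 0);
  for (std::size_t i = 0; i < header.size(); ++i) {
    widths[i] = header[i].size();
  }
  for (const auto& row : rows) {
    for (std::size_t i = 0; i < row.size() && i < widths.size(); ++i) {
      widths[i] = std::max(widths[i], row[i].size());
    }
  }

  std::ostringstream out;
  const auto printRow = [&](const Row& row) {
    for (std::size_t i = 0; i < widths.size(); ++i) {
      // Missing cells are rendered as empty strings
      const std::string cell = i < row.size() ? row[i] : std::string{};
      if (i + 1 < widths.size()) {
        out << std::left << std::setw(static_cast<int>(widths[i])) << cell << "  ";
      } else {
        out << cell; // no trailing padding after the last column
      }
    }
    out << '\n';
  };

  printRow(header);
  for (std::size_t i = 0; i < widths.size(); ++i) {
    out << std::string(widths[i], '-');
    if (i + 1 < widths.size()) {
      out << "  ";
    }
  }
  out << '\n';
  for (const auto& row : rows) {
    printRow(row);
  }
  return out.str();
}
```

The python tabulate package additionally handles numeric alignment and several output formats, which is part of the functionality that has to be reimplemented by hand here.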

TODO:

@@ -1,3 +1,7 @@
add_executable(podio-dump-cpp src/podio-dump.cpp)
target_link_libraries(podio-dump-cpp PRIVATE podio::podio podio::podioIO fmt::fmt)

tmadlener (Collaborator, Author) commented:

Executable needs to be installed.

tools/src/tabulate.h (outdated, resolved)

Zehvogel (Contributor) commented:

I wonder how an RDataFrame-based python version (with pre-compiled functions) would fare on a performance vs. comfort scale
