[BUG] deepspeed tries to call "hostname -I" which is not a valid flag for hostname. it should be "hostname -i" #6497

Open
sirus20x6 opened this issue Sep 5, 2024 · 10 comments

@sirus20x6

Describe the bug
DeepSpeed tries to call "hostname -I", which is not a valid flag for hostname. It should be "hostname -i".


Processing dataset chunks: 100%|██████████| 106/106 [00:11<00:00,  9.45it/s]
[2024-09-05 04:11:37,288] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.15.2+c210e601, git-hash=c210e601, git-branch=master
[2024-09-05 04:11:37,288] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-09-05 04:11:37,288] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
hostname: invalid option -- 'I'
Try 'hostname --help' or 'hostname --usage' for more information.
Traceback (most recent call last):
  File "/code/git/learnable-activations/mflow.py", line 429, in <module>
    run_experiment(args)
  File "/code/git/learnable-activations/mflow.py", line 384, in run_experiment
    model_engine, optimizer = prepare_deepspeed_model(model, args)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/git/learnable-activations/mflow.py", line 266, in prepare_deepspeed_model
    model_engine, _, _, _ = deepspeed.initialize(
                            ^^^^^^^^^^^^^^^^^^^^^
  File "/thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed/__init__.py", line 144, in initialize
    dist.init_distributed(dist_backend=dist_backend,
  File "/thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed/comm/comm.py", line 673, in init_distributed
    mpi_discovery(distributed_port=distributed_port, verbose=verbose)
  File "/thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed/comm/comm.py", line 701, in mpi_discovery
    result = subprocess.check_output(hostname_cmd, shell=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['hostname -I']' returned non-zero exit status 64.

System info (please complete the following information):

  • OS: Arch
  • GPU count and types x1 7900xtx
  • Interconnects (if applicable) one machine
  • Python version 3.12

Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?

#!/bin/bash
export OMPI_MCA_accelerator=rocm
mpirun -np 1 --mca accelerator rocm python mflow.py --deepspeed_config ds_config.json --log_interval 100 --batch_size 4 --local_rank -1

Additional context

the offending code:

    master_addr = None
    if rank == 0:
        hostname_cmd = ["hostname -I"]
        result = subprocess.check_output(hostname_cmd, shell=True)
        master_addr = result.decode('utf-8').split()[0]
    master_addr = comm.bcast(master_addr, root=0)
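
A minimal sketch of how this failure could be contained, not DeepSpeed's actual fix: the command is run so that a non-zero exit (status 64 in the traceback above) yields None instead of an uncaught CalledProcessError. The helper name first_ip_from_command is hypothetical, and the failing command is simulated with the Python interpreter so the example runs on any system.

```python
import subprocess
import sys


def first_ip_from_command(cmd):
    """Run cmd (an argument list) and return its first whitespace-separated
    output field, or None if the command is missing, fails, or prints nothing."""
    try:
        result = subprocess.check_output(cmd, stderr=subprocess.DEVNULL)
    except (subprocess.CalledProcessError, OSError):
        # Covers an unsupported flag (non-zero exit) and a missing binary.
        return None
    fields = result.decode("utf-8").split()
    return fields[0] if fields else None


# Simulate a hostname build that rejects -I by exiting with status 64.
failing_cmd = [sys.executable, "-c", "import sys; sys.exit(64)"]
print(first_ip_from_command(failing_cmd))  # None
```

A caller could then try ["hostname", "-I"] first and move on to another strategy when the result is None.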
@sirus20x6 sirus20x6 added bug Something isn't working training labels Sep 5, 2024
@sirus20x6 sirus20x6 changed the title from [BUG] to [BUG] deepspeed tries to call "hostname -I" which is not a valid flag for hostname. it should be "hostname -i" Sep 5, 2024
@loadams loadams self-assigned this Sep 5, 2024
@loadams
Contributor

loadams commented Sep 5, 2024

Hi @sirus20x6 - this issue looks to be similar to this one: #5597

Could you share the output of hostname --help and hostname -V?

@sirus20x6
Author

here you go!

$ hostname --help
Usage: hostname [OPTION...] [NAME]
Show or set the system's host name.

  -a, --aliases              alias names
  -d, --domain               DNS domain name
  -f, --fqdn, --long         DNS host name or FQDN
  -F, --file=FILE            set host name or NIS domain name from FILE
  -i, --ip-addresses         addresses for the host name
  -s, --short                short host name
  -y, --yp, --nis            NIS/YP domain name
  -?, --help                 give this help list
      --usage                give a short usage message
  -V, --version              print program version

Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.

Report bugs to <bug-inetutils@gnu.org>.
$ hostname -V
hostname (GNU inetutils) 2.5
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Debarshi Ray.

I believe the POSIX way of doing this is actually

getent hosts localhost

because net-utils, the package the hostname binary comes from, is a somewhat old, deprecated package, even though many people still have it installed out of muscle memory for those tools.

@sirus20x6
Author

Small correction: if you just want the first field, the POSIX way of getting the loopback address is

getent hosts localhost | awk '{ print $1 }'
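
For comparison, a rough Python sketch of the same pipeline: run getent, then keep the first whitespace-separated field, as awk '{ print $1 }' does. The function names are hypothetical, the sample string is illustrative rather than captured from a real machine, and the lookup returns None on systems without getent.

```python
import shutil
import subprocess


def first_field(output):
    """Return the first whitespace-separated field of the first non-empty line."""
    for line in output.splitlines():
        fields = line.split()
        if fields:
            return fields[0]
    return None


def loopback_from_getent():
    """Ask getent for localhost; return None if getent is unavailable or fails."""
    if shutil.which("getent") is None:
        return None
    try:
        out = subprocess.check_output(["getent", "hosts", "localhost"])
    except (subprocess.CalledProcessError, OSError):
        return None
    return first_field(out.decode("utf-8"))


# Illustrative input only, not captured from a real machine.
print(first_field("127.0.0.1       localhost\n"))  # 127.0.0.1
```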

@loadams
Contributor

loadams commented Sep 5, 2024

Thanks, @sirus20x6 - we are also looking at switching to just using socket.gethostname() and socket.gethostbyname_ex() to work around this entirely. Do you think that would work for your needs?
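
A minimal sketch of what that socket-based lookup might look like, assuming the goal is one IPv4 address for the current host. This is not the code on the DeepSpeed branch; the helper names, the preference for non-loopback addresses, and the "127.0.0.1" fallback are all assumptions made for illustration.

```python
import ipaddress
import socket


def pick_address(addresses):
    """Prefer the first non-loopback address; otherwise fall back to the first one."""
    for addr in addresses:
        if not ipaddress.ip_address(addr).is_loopback:
            return addr
    return addresses[0] if addresses else None


def resolve_master_addr():
    """Resolve this host's address without shelling out to hostname."""
    try:
        _name, _aliases, addresses = socket.gethostbyname_ex(socket.gethostname())
    except OSError:
        # The hostname may not be resolvable (e.g. no matching /etc/hosts entry).
        addresses = []
    return pick_address(addresses) or "127.0.0.1"


print(resolve_master_addr())
```

Because the result depends on how the machine resolves its own hostname, it can still be a loopback address such as 127.0.1.1 on some setups.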

@sirus20x6
Author

I believe so. Hopefully that will be more cross-platform and resilient

@loadams
Contributor

loadams commented Sep 5, 2024

If you want, you could test with pip install git+https://github.com/microsoft/deepspeed.git@loadams/update-hostname-I

@sirus20x6
Author

I will test as soon as I get home to my machine!

@sirus20x6
Author

doesn't install

> pip uninstall deepspeed
Found existing installation: deepspeed 0.15.2+c210e601
Uninstalling deepspeed-0.15.2+c210e601:
  Would remove:
    /thearray/git/ComfyUI/comfyvenv/bin/deepspeed
    /thearray/git/ComfyUI/comfyvenv/bin/deepspeed.pt
    /thearray/git/ComfyUI/comfyvenv/bin/ds
    /thearray/git/ComfyUI/comfyvenv/bin/ds_bench
    /thearray/git/ComfyUI/comfyvenv/bin/ds_elastic
    /thearray/git/ComfyUI/comfyvenv/bin/ds_report
    /thearray/git/ComfyUI/comfyvenv/bin/ds_ssh
    /thearray/git/ComfyUI/comfyvenv/bin/dsr
    /thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed-0.15.2+c210e601.dist-info/*
    /thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed/*
Proceed (Y/n)? y
  Successfully uninstalled deepspeed-0.15.2+c210e601
(comfyvenv) (base) neuromancer :) > pip install git+https://github.com/microsoft/deepspeed.git@loadams/update-hostname-I
Collecting git+https://github.com/microsoft/deepspeed.git@loadams/update-hostname-I
  Cloning https://github.com/microsoft/deepspeed.git (to revision loadams/update-hostname-I) to /tmp/pip-req-build-lvq7vagu
  Running command git clone --filter=blob:none --quiet https://github.com/microsoft/deepspeed.git /tmp/pip-req-build-lvq7vagu
  Running command git checkout -b loadams/update-hostname-I --track origin/loadams/update-hostname-I
  Switched to a new branch 'loadams/update-hostname-I'
  branch 'loadams/update-hostname-I' set up to track 'origin/loadams/update-hostname-I'.
  Resolved https://github.com/microsoft/deepspeed.git to commit 0d2aada49e58490a5a38867b0475f4b57e12c2ae
  Running command git submodule update --init --recursive -q
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [63 lines of output]
      [2024-09-05 23:23:45,883] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
      [2024-09-05 23:23:46,521] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
      /thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/transformers/utils/generic.py:441: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
        _torch_pytree._register_pytree_node(
      /thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/transformers/utils/generic.py:309: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
        _torch_pytree._register_pytree_node(
      DS_BUILD_OPS=0
      /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_types.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_types.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_utils.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_utils.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_common.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_common.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_op_desc.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_op_desc.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_cpu_op.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_cpu_op.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_thread.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_thread.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_pin_tensor.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_pin_tensor.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_io_handle.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_io_handle.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_io_handle.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_io_handle.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio_handle.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio_handle.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_thread.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_thread.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_utils.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_utils.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_common.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_common.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_types.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/common/deepspeed_aio_types.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_cpu_op.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_cpu_op.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_op_desc.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_aio_op_desc.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_copy.h -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_copy.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_copy.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_py_copy.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_pin_tensor.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/deepspeed_pin_tensor.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/py_ds_aio.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/aio/py_lib/py_ds_aio.cpp [skipped, no changes]
      Successfully preprocessed all matching files.
      Total number of unsupported CUDA function calls: 0
      
      
      Total number of replaced kernel launches: 0
      /tmp/pip-req-build-lvq7vagu/csrc/adam/fused_adam_frontend.cpp -> /tmp/pip-req-build-lvq7vagu/csrc/adam/fused_adam_frontend.cpp [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/includes/compat.h -> /tmp/pip-req-build-lvq7vagu/csrc/includes/compat.h [skipped, no changes]
      /tmp/pip-req-build-lvq7vagu/csrc/adam/multi_tensor_apply.cuh -> /tmp/pip-req-build-lvq7vagu/csrc/adam/multi_tensor_apply_hip.cuh [ok]
      /tmp/pip-req-build-lvq7vagu/csrc/includes/type_shim.h -> /tmp/pip-req-build-lvq7vagu/csrc/includes/type_shim_hip.h [ok]
      /tmp/pip-req-build-lvq7vagu/csrc/adam/multi_tensor_adam.cu -> /tmp/pip-req-build-lvq7vagu/csrc/adam/multi_tensor_adam.hip [ok]
      Successfully preprocessed all matching files.
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-req-build-lvq7vagu/setup.py", line 198, in <module>
          ext_modules.append(builder.builder())
                             ^^^^^^^^^^^^^^^^^
        File "/tmp/pip-req-build-lvq7vagu/op_builder/builder.py", line 699, in builder
          {'cxx': self.strip_empty_entries(self.cxx_args()), \
                                           ^^^^^^^^^^^^^^^
        File "/tmp/pip-req-build-lvq7vagu/op_builder/builder.py", line 842, in cxx_args
          CUDA_ENABLE = self.is_cuda_enable()
                        ^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-req-build-lvq7vagu/op_builder/builder.py", line 420, in is_cuda_enable
          assert_no_cuda_mismatch(self.name)
        File "/tmp/pip-req-build-lvq7vagu/op_builder/builder.py", line 86, in assert_no_cuda_mismatch
          torch_cuda_version = ".".join(torch.version.cuda.split('.')[:2])
                                        ^^^^^^^^^^^^^^^^^^^^^^^^
      AttributeError: 'NoneType' object has no attribute 'split'
      Total number of unsupported CUDA function calls: 0
      
      
      Total number of replaced kernel launches: 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
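
The AttributeError in this log comes from calling .split on torch.version.cuda, which is None here. PyTorch reports None for that attribute on builds without CUDA, which is presumably the case for this ROCm setup (an assumption based on the 7900 XTX and the hipify output above). A minimal sketch of a guard, with a hypothetical helper name and not the fix DeepSpeed applied:

```python
def cuda_major_minor(cuda_version):
    """Return 'major.minor' from a version string, or None when no CUDA
    version is reported (torch.version.cuda is None on non-CUDA builds)."""
    if cuda_version is None:
        return None
    return ".".join(cuda_version.split(".")[:2])


print(cuda_major_minor("12.1.105"))  # 12.1
print(cuda_major_minor(None))        # None
```

A caller would pass torch.version.cuda and skip the CUDA mismatch check when the result is None.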

@saforem2
Contributor

+1 for the socket approach

replacing the subprocess calls in deepspeed/comm/comm.py#L700-L702

with

import socket
master_addr = socket.gethostbyaddr(socket.gethostname())[0]

has been working for me on internal systems

+    import socket
     master_addr = None
     if rank == 0:
-        hostname_cmd = ["hostname -I"]
-        result = subprocess.check_output(hostname_cmd, shell=True)
-        master_addr = result.decode('utf-8').split()[0]
+        master_addr = socket.gethostbyaddr(socket.gethostname())[0]

also see: #2837

I'd be happy to submit a PR + test further if it would be useful
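
A self-contained sketch of the rank-0 lookup followed by a broadcast, so the idea can be tried without MPI. SingleProcessComm is a hypothetical stand-in for the real communicator, and the fallback to the bare hostname is an assumption, not part of the proposal above. Note that gethostbyaddr(...)[0] yields a host name rather than an IP address, and that master_addr has to be defined on every rank before the broadcast.

```python
import socket


class SingleProcessComm:
    """Hypothetical stand-in for an MPI communicator with a single rank."""

    def bcast(self, value, root=0):
        return value


def discover_master_addr(comm, rank):
    # Must be defined on every rank before the broadcast.
    master_addr = None
    if rank == 0:
        try:
            master_addr = socket.gethostbyaddr(socket.gethostname())[0]
        except OSError:
            # Reverse lookup can fail; fall back to the bare hostname.
            master_addr = socket.gethostname()
    return comm.bcast(master_addr, root=0)


print(discover_master_addr(SingleProcessComm(), rank=0))
```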

@tjruwase
Contributor

@saforem2, thanks for offering to help with this. Please see our concerns here. Would appreciate your insights and PR.
