Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fatal error: Could not initialize thread / error: reading root hints #1038

Closed
kulikov-a opened this issue Mar 30, 2024 · 8 comments · Fixed by #1041 or klutchell/unbound-docker#437
Closed

Comments

@kulikov-a
Copy link

kulikov-a commented Mar 30, 2024

Hi!
Describe the bug
Sporadic fatal errors possible at unbound start if the root-hints option is used.
The server may become unresponsive as a result.

The copy of the https://www.internic.net/domain/named.root is used for root-hints.
Errors seen (can occur either during one start attempt or be limited to one error per unsuccessful attempt):
error: reading root hints /root.hints 19:1: Empty label
error: reading root hints /root.hints 2:37: Syntax error, could not parse the RR's type
error: reading root hints /root.hints 2:9: Syntax error, could not parse the RR's type
error: reading root hints /root.hints 8:14: Syntax error, could not parse the RR's class
error: reading root hints /root.hints 13:15: Conversion error, ip6 addr expected
and so on..
so it ends up with:
error: Could not set root or stub hints and fatal error: Could not initialize thread (for one or more threads)

The sporadic nature of the errors and their content (error in reading the record type in a comment line, mismatch between the record type and the address type) suggest that there may be a thread race when reading a file?

To reproduce
Steps to reproduce the behavior:

  1. Use root-hints option
  2. Сreate conditions for periodic unbound restart
  3. Wait

Expected behavior
A clear and concise description of what you expected to happen.

System:

  • Unbound version: 1.19.3
  • OS: FreeBSD 13.2-RELEASE-p10
  • unbound -V output:
 --with-libexpat=/usr/local --with-ssl=/usr/local --enable-dnscrypt --disable-dnstap --with-libnghttp2 --with-dynlibmodule --enable-ecdsa --disable-event-api --enable-gost --with-libevent --with-pythonmodule=yes --with-pyunbound=yes ac_cv_path_SWIG=/usr/local/bin/swig LDFLAGS=-L/usr/local/lib --disable-subnet --disable-tfo-client --disable-tfo-server --with-pthreads --prefix=/usr/local --localstatedir=/var --mandir=/usr/local/share/man --infodir=/usr/local/share/info/ --build=amd64-portbld-freebsd13.2
Linked libs: libevent 2.1.12-stable (it uses kqueue), OpenSSL 3.0.13 30 Jan 2024
Linked modules: dns64 python dynlib respip validator iterator
DNSCrypt feature available

Additional information
not yet

Thanks!

cc @fichtner

@wcawijngaards
Copy link
Member

From what I can see, the threads each open a file with fopen(..), and then everything is a thread local variable, and then they call getc(file) on their own file. And in this manner the threads read the file. That is the same file for every thread, but they have their own fopen(..) call and FILE object. They call their own getc(..) call and the variables are otherwise owned by themselves. So there could be no race condition on the contents of the file.

If I try it here, on Linux and on FreeBSD, I get no race conditions. So it does not reproduce, either. There is development in a code branch, where the iterator stub hints are unshared from the threads, and as a result would be read only once, at reload time. This is in #1015 . The change would likely, also if you did not use the feature implemented there, remove the issue here. But it is better if we could find out what the cause is. How can I reproduce the failure, just a number of threads and root hints does not trigger it for me? Perhaps use valgrind --tool=helgrind on it and report if it finds failures, apart from other output, about pp_init and about default_event_base, that we do not need right now, about the root hints read?

@kulikov-a
Copy link
Author

@wcawijngaards
Hi! Thanks for feedback. I had to change the config and disable python module (throws: python exception in Py_InitializeFromConfig: init_fs_encoding: failed to get the Python codec of the filesystem encoding when starting under valgrind)
valgrind log attached (too many lines to paste into message)
valgrind.log

How can I reproduce the failure

In my experience, there is a better chance of reproducing on a bare-metal install (it didn’t reproduce on a VM for me) and giving at least some load (i run a couple of scripts with a looped nslookups from client). And restart unbound every 5 minutes. The error is realy sporadic. Under the described conditions, it may take a day before the error appears
actual config is also attached

unbound.conf.txt

@wcawijngaards
Copy link
Member

The issue seems to be that opening the same file in several places is not supported by the standard, and not portable. This makes it misbehave. The logs are very helpful in finding the problem, thank you. The changes to unshare the hints structure should provide a fix for this. Then, the root hints are read once, from one thread, after reload.

@fichtner
Copy link

fichtner commented Apr 3, 2024

@wcawijngaards thanks for looking into it. do you have any ETA for #1041 in an upcoming Unbound release? We can get by for the time being in OPNsense. BTW, this fast reload is pretty neat... been looking forward to something like this for years. Happy to see it progress!

Cheers,
Franco

@wcawijngaards
Copy link
Member

After some review and running tests I think that the change would likely move to the code repository. Then this change is going to be part of the next release. But there is no release date for that right now.

@fichtner
Copy link

fichtner commented Apr 3, 2024

Ok, don't need specifics. Sounds good. :)

@kulikov-a
Copy link
Author

@wcawijngaards sounds great!
thanks a lot!!

@wcawijngaards
Copy link
Member

The fix should result in the issue no longer creating a problem, since there is one structure that is read once.

wcawijngaards added a commit that referenced this issue Apr 25, 2024
- Merge #1041: Stub and Forward unshare. This has one structure
  for them and fixes #1038: fatal error: Could not initialize
  thread / error: reading root hints.
jedisct1 added a commit to jedisct1/unbound that referenced this issue May 7, 2024
* nlnet/master: (45 commits)
  - Fix for NLnetLabs#1062: declaration before statement, avoid print of null,   and redundant check for array size. And changelog note for merge of NLnetLabs#1062.
  Fix potential overflow bug while parsing port in function cfg_mark_ports
  - Set version number to 1.20.0 for release.
  - Fix for the DNSBomb vulnerability CVE-2024-33655. Thanks to Xiang Li   from the Network and Information Security Lab of Tsinghua University   for reporting it.
  - Fix doxygen comment for errinf_to_str_bogus.
  - Cleanup unnecessary strdup calls for EDE strings.
  - Man page entry for unbound-checkconf -q.
  - Fix NLnetLabs#876: [FR] can unbound-checkconf be silenced when configuration   is valid?
  - Add unit tests for cachedb and subnet cache expired data.
  - Fix cachedb with serve-expired-client-timeout disabled. The edns   subnet module deletes global cache and cachedb cache when it   stores a result, and serve-expired is enabled, so that the global   reply, that is older than the ecs reply, does not return after   the ecs reply expires.
  - Fix doc unit test for out of directory build.
  - Fix to disable fragmentation on systems with IP_DONTFRAG,   with a nonzero value for the socket option argument.
  Changelog note for NLnetLabs#1041 and NLnetLabs#1038. - Merge NLnetLabs#1041: Stub and Forward unshare. This has one structure   for them and fixes NLnetLabs#1038: fatal error: Could not initialize   thread / error: reading root hints.
  Update locking management for iter_fwd and iter_hints methods. (NLnetLabs#1054)
  - Fix configure flto check error, by finding grep for it.
  - Fix ci workflow for macos for moved install locations.
  - Merge NLnetLabs#1053: Remove child delegations from cache when grandchild   delegations are returned from parent.
  - When a granchild delegation is returned, remove any cached child delegations   up to parent to not cause delegation invalidation because of an   expired child delegation that would never be updated. Most likely to   happen without qname-minimisation. Reported by Roland van Rijswijk-Deij.
  - Fix edns subnet to sort rrset references when storing messages   in the cache. This fixes a race condition in the rrset locks.
  - Add checklock feature verbose_locking to trace locks and unlocks.
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants