Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Message #waitForExit doesn’t always return when the process has terminated #73

Open
Rinzwind opened this issue Aug 3, 2022 · 2 comments

Comments

@Rinzwind
Copy link

Rinzwind commented Aug 3, 2022

We’re having a problem on our Jenkins server: test runs often hang. It seems to be related to OSSubprocess. I tried to capture it in a small example:

https://github.com/Rinzwind/OSSubprocessTest

The example:

Below is a sample from the output we get on our Jenkins server. It shows that the test hangs:

  • The line Did #run (m: 2542, pid: 6480) shows that a ‘curl’ process was spawned with PID 6480
  • The lines 6480 Z 00:00:00 \_ curl <defunct> from the ‘ps’ command show that that process has terminated but that its status has not been consumed yet (it’s a zombie process)
  • There is no corresponding line Did #waitForExit (m: 2542) showing a return from #waitForExit
  • The only remaining output is from the ‘ps’ command

Note there’s also no more line Will #checkFinishedChildren after the last Did #run. As far as I understand, the child watcher in OSSVMProcess should be sending #checkFinishedChildren at least once a second:

childWatcher := [[
"OSProcess authors suspected that there were various ways in which OS signals
could be missed under conditions of heavy load. For that reason, we use
#waitTimeoutMSecs: with the semaphore, so that if a signal is missed,
we time out and rescan occasionally anyway (#updateActiveChildrenAndNotifyDead
sends queryExitStatus which sends waitpid() )
"
self sigChldSemaphore waitTimeoutMSecs: 1000.
processSynchronizationDelay wait. "Avoids lost signals in heavy process switching"
self checkFinishedChildren.
] repeat] newProcess priority: Processor highIOPriority.

So far I’ve only been able to reproduce this on our Jenkins server though, making it a little hard to debug.

Sample output:

[…]
+ /home/packer/smalltalkCI-master/bin/smalltalkci --smalltalk Pharo64-9.0 .smalltalk.ston
[…]
Downloading the latest 90 Image:
    http://files.pharo.org/get-files/90/pharoImage-x86_64.zip
[…]
Downloading the latest pharoVM:
	http://files.pharo.org/get-files/90/pharo-vm-Linux-x86_64-stable.zip
[…]
Build information: Pharo-9.0.0+build.1575.sha.9bb5f998e8a6d016ec7abde3ed09c4a60c0b4551 (64 Bit)
[…]
Running suite "Pharo64-9.0" with 1 tests.
[…]
Did #run (m: 2541, pid: 6478)
HTTP/1.1 200 OK
Content-Length: 5
Content-Type: text/plain
Server: Zinc HTTP Components 1.0 (Pharo/9.0)
Date: Wed, 03 Aug 2022 19:16:13 GMT

Test
Will #checkFinishedChildren
Did #checkFinishedChildren
Did #waitForExit (m: 2541)
Did #run (m: 2542, pid: 6480)
HTTP/1.1 200 OK
Content-Length: 5
Content-Type: text/plain
Server: Zinc HTTP Components 1.0 (Pharo/9.0)
Date: Wed, 03 Aug 2022 19:16:13 GMT

Test
Did sleep (n: 26)
    PID STAT     TIME COMMAND
   1230 Sl   00:00:10 pharo
   1245 S    00:00:00  \_ bash
   6480 Z    00:00:00  \_ curl <defunct>
Did sleep (n: 27)
    PID STAT     TIME COMMAND
   1230 Sl   00:00:10 pharo
   1245 S    00:00:00  \_ bash
   6480 Z    00:00:00  \_ curl <defunct>
[…]
@Rinzwind
Copy link
Author

Rinzwind commented Aug 3, 2022

Here’s sample output from a run using Pharo 11 in which the problem also occurred: the ‘curl’ process with PID 13026 has terminated, but the output is not showing a corresponding Did #waitForExit (m: 5718).

[…]
+ /home/packer/smalltalkCI-master/bin/smalltalkci --smalltalk Pharo64-11 .smalltalk.ston
[…]
Downloading the latest 110 Image:
    http://files.pharo.org/get-files/110/pharoImage-x86_64.zip
[…]
Downloading the latest pharoVM:
	http://files.pharo.org/get-files/110/pharo-vm-Linux-x86_64-stable.zip
[…]
Build information: Pharo-11.0.0+build.142.sha.78baf8659decaa50e73226a5eddcd410cebfa4f7 (64 Bit)
[…]
Running suite "Pharo64-11" with 1 tests.
[…]
Did #run (m: 5717, pid: 13024)
HTTP/1.1 200 OK
Server: Zinc HTTP Components 1.0 (Pharo/11.0)
Date: Wed, 03 Aug 2022 22:35:33 GMT
Content-Type: text/plain
Content-Length: 5

Test
Will #checkFinishedChildren
Did #checkFinishedChildren
Did #waitForExit (m: 5717)
Did #run (m: 5718, pid: 13026)
HTTP/1.1 200 OK
Server: Zinc HTTP Components 1.0 (Pharo/11.0)
Date: Wed, 03 Aug 2022 22:35:33 GMT
Content-Type: text/plain
Content-Length: 5

Test
Did sleep (n: 56)
    PID STAT     TIME COMMAND
   1221 Sl   00:00:20 pharo
   1236 S    00:00:00  \_ bash
  13026 Z    00:00:00  \_ curl <defunct>
Did sleep (n: 57)
    PID STAT     TIME COMMAND
   1221 Sl   00:00:20 pharo
   1236 S    00:00:00  \_ bash
  13026 Z    00:00:00  \_ curl <defunct>
[…]

@Rinzwind
Copy link
Author

After adding this to also log the sends of #waitTimeoutMSecs: in OSSVMProcess>>#initializeChildWatcher:

OSSVMProcess compile: ((OSSVMProcess sourceCodeAt: #initializeChildWatcher)
	copyReplaceAll: 'self sigChldSemaphore waitTimeoutMSecs: 1000.'
	with: '
		Stdio stdout nextPutAll: ''Will #waitTimeoutMSecs:''; lf; flush.
		self sigChldSemaphore waitTimeoutMSecs: 1000.
		Stdio stdout nextPutAll: ''Did #waitTimeoutMSecs:''; lf; flush.').

We get output on our Jenkins server as shown below. After the last Will #waitTimeoutMSecs: there is no Did #waitTimeoutMSecs:, so it seems there is no return from either the send of #sigChldSemaphore or the send of #waitTimeoutMSecs: to the Semaphore here:

self sigChldSemaphore waitTimeoutMSecs: 1000.

Sample output:

[…]
+ /home/packer/smalltalkCI-master/bin/smalltalkci --smalltalk Pharo64-11 .smalltalk.ston
[…]
Downloading the latest 110 Image:
    http://files.pharo.org/get-files/110/pharoImage-x86_64.zip
[…]
Downloading the latest pharoVM:
	http://files.pharo.org/get-files/110/pharo-vm-Linux-x86_64-stable.zip
[…]
Build information: Pharo-11.0.0+build.145.sha.1fc9017e1fe9a0d3c3b7580e2131ecfd9a27e428 (64 Bit)
[…]
Running suite "Pharo64-11" with 1 tests.
[…]
Did #run (m: 2767, pid: 6945)
HTTP/1.1 200 OK
Server: Zinc HTTP Components 1.0 (Pharo/11.0)
Date: Thu, 11 Aug 2022 13:35:28 GMT
Content-Type: text/plain
Content-Length: 5

Test
Did #waitTimeoutMSecs:
Will #checkFinishedChildren
Did #checkFinishedChildren
Will #waitTimeoutMSecs:
Did #waitForExit (m: 2767)
Did #run (m: 2768, pid: 6947)
Did sleep (n: 27)
HTTP/1.1 200 OK
Server: Zinc HTTP Components 1.0 (Pharo/11.0)
Date: Thu, 11 Aug 2022 13:35:28 GMT
Content-Type: text/plain
Content-Length: 5

Test
    PID STAT     TIME COMMAND
   1218 Sl   00:00:11 pharo
   1233 S    00:00:00  \_ bash
   6947 Z    00:00:00  \_ curl <defunct>
Did sleep (n: 28)
    PID STAT     TIME COMMAND
   1218 Sl   00:00:11 pharo
   1233 S    00:00:00  \_ bash
   6947 Z    00:00:00  \_ curl <defunct>
[…]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant