Timeout, but when?
We're running users' arbitrary code wrapped in Docker. This allows us to isolate the code from our infrastructure and keeps all of us happy and worry free. CPU, memory and networking can be assigned directly in the docker run command. That's good for us. But sometimes the user code ends up in an infinite loop — we're still happy, there is no network traffic, 0.5 GB allocated memory and low CPU usage. But the user is not. There is obviously something wrong and we're just slowly crunching through the infinity.
So we thought that introducing a process timeout would solve this. Docker does not have a timeout option in the run command and probably never will. But Symfony has it nice and easy in its Process package, and we even covered it with our own tests. Work done, what a good day!
After a while, some long-running jobs started showing up in our logs, and a client asked us to troubleshoot a 12-hour processing job (when the timeout was 3 hours). It was cracking my head open. And this test still worked; I even tested it on the production server.
try {
    # run sleep(5); inside the image
    $process = new Process('sudo docker run …');
    $process->setTimeout(1);
    $process->run();
    $this->fail("Process should time out.");
} catch (ProcessTimedOutException $e) {
    $this->assertContains('exceeded the timeout', $e->getMessage());
}
Then I finally noticed that the test lasted 7 seconds. Hold on, that's way too long for a 1s timeout! The process threw ProcessTimedOutException, but it didn't terminate the process. I attribute this behaviour to the sudo command. So what are the options now?
First, I created a failing test. This consisted of running an empty docker job (just start the container) and measuring the time. Let's say this job took 5 seconds. Then another run included a sleep 20; with the timeout set to 5 seconds. Ideally this run would finish after 10 seconds, but conditions change and it may be a bit shorter or a bit longer. So the second run must last longer than the first run (5 seconds), but also shorter than the first run plus sleep 20; (25 seconds). BUT! The first run could be longer than expected and might skew the range, so the simple condition might not work out. So I limited the range from both ends and created a narrower window in which the test must finish, e.g. >5s and <15s. In other words, I expect the run to finish shortly after the set timeout.
t1 = run empty docker container
t2 = run docker container with sleep 20 and timeout set to 5
# magic check
t1 < t2 < t1 + timeout + buffer
This was it, here's the actual code. Finally, I had a failing test. Now it was time to fix it.
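For illustration only, the window check from the pseudocode can be sketched in plain shell. This is not the test from the repository; the helper names and the 5-second buffer are assumptions, and the docker commands stay elided as above.

```shell
# Hypothetical helpers, written for POSIX sh.

# Print how many whole seconds a command takes to run.
elapsed_seconds() {
    es_start=$(date +%s)
    # A timed-out run exits non-zero; only the duration matters here.
    "$@" >/dev/null 2>&1 || true
    es_end=$(date +%s)
    echo $(( es_end - es_start ))
}

# Succeed only if t1 < t2 < t1 + timeout + buffer.
# $1 = t1, $2 = t2, $3 = timeout, $4 = buffer (all in seconds)
within_window() {
    [ "$2" -gt "$1" ] && [ "$2" -lt $(( $1 + $3 + $4 )) ]
}

# Usage, with the docker commands left elided as in the post:
#   t1=$(elapsed_seconds sudo docker run ...)   # empty container
#   t2=$(elapsed_seconds sudo docker run ...)   # sleep 20, timeout 5
#   within_window "$t1" "$t2" 5 5 || echo "timeout did not kill the job"
```

With t1 = 5, timeout = 5 and buffer = 5 this gives the >5s and <15s window described above.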
Docker won't implement timeout. Forking the process in PHP or bash is not the way I'd go. Having a separate process to monitor the jobs and terminate them from the outside using docker stop or docker kill would be quite complicated (although the architecture would be interesting). Fortunately, coreutils contains a timeout utility that looked like it might do the job.
Being stupid and a Linux noob, I started with this:
sudo timeout --signal=SIGKILL 3600 sudo docker run ...
Although it worked under my own user on the devel server, running it from Jenkins without a tty gave me another headache.
sudo: sorry, you must have a tty to run sudo
Dafuq! I thought I managed the /etc/sudoers and /etc/sudoers.d/docker files correctly.
# excerpt from various sudoers files,
# replace jenkins with whatever user/group you're using
Defaults:jenkins !requiretty
%jenkins ALL = NOPASSWD: /usr/bin/docker
%jenkins ALL = NOPASSWD: /usr/bin/timeout
I ended up creating a simple Jenkins job with just timeout 1 sleep 5 and playing with sudoers files and sudo. It should have been clear to me at first sight…
sudo timeout --signal=SIGKILL 3600 docker run ...
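This worked. I just removed the sudo inside timeout: timeout already runs as root, so docker run inherits root and the inner sudo is redundant (presumably that inner sudo, invoked by root rather than jenkins, was the one still demanding a tty). Worked like a charm. Hours wasted, facepalms delivered. This kills the docker job as soon as it hits the 1-hour limit. The test finally passed.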
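To see what the caller gets back, here is a minimal sketch that uses sleep in place of the docker job, like the Jenkins experiment above. It assumes GNU coreutils timeout, which reports status 137 (128 + 9) when the command is killed with SIGKILL; the function name is hypothetical.

```shell
# Hypothetical helper, assuming GNU coreutils timeout.
# $1 = limit in seconds, the rest = the command to run.
run_with_limit() {
    rl_limit=$1
    shift
    timeout --signal=SIGKILL "$rl_limit" "$@"
    rl_status=$?
    # 137 = 128 + 9 (SIGKILL). It is what a timed-out run reports here,
    # but any other SIGKILL of the command looks the same.
    if [ "$rl_status" -eq 137 ]; then
        echo "command was killed (limit: ${rl_limit}s)" >&2
    fi
    return "$rl_status"
}

# sleep stands in for the docker job
run_with_limit 1 sleep 5 || echo "exit status: $?"
```

Checking the exit status lets the calling code tell a killed job apart from one that finished on its own, which the Symfony exception alone did not guarantee.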
EDIT: As GitHub user jayaprabhakar noted, containers terminated by the timeout command will not clean up properly, e.g. any containers running with the remove flag will remain in the system. You need to remove them manually afterwards.
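One way to do that cleanup, sketched under assumptions rather than taken from our production code: give the container a name with docker run --name and force-remove it with docker rm -f once timeout returns. The wrapper name, the DOCKER variable and the container name are hypothetical, and the sketch assumes the caller is allowed to talk to Docker (prefix with sudo as above if not).

```shell
# Hypothetical wrapper; DOCKER can be overridden, e.g. for testing.
DOCKER=${DOCKER:-docker}

# $1 = timeout in seconds, $2 = container name,
# the rest = image and command passed to docker run.
run_job_with_timeout() {
    job_timeout=$1
    job_name=$2
    shift 2
    timeout --signal=SIGKILL "$job_timeout" "$DOCKER" run --name "$job_name" "$@"
    job_status=$?
    # Force-remove the named container whether or not the run timed out.
    # Errors are ignored in case the container is already gone.
    "$DOCKER" rm -f "$job_name" >/dev/null 2>&1 || true
    return "$job_status"
}
```

Because the container is removed by name, this does not rely on the remove flag being honoured when the run is killed.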