Code Coverage? Code Schmoverage!
In Keboola, we take testing really seriously. Every new component has to start with tests. For simple dockerized components they mostly run…
In Keboola, we take testing really seriously. Every new component has to start with tests. For simple dockerized components they mostly run on Travis. We have iterated into the ultimate setup, which Vlado showed in a previous post, in which the component is tested with the same image that is then pushed into production. This solves some headaches we had which were caused by using images from both Quay.io and DockerHub. This takes care of testing for a majority of components (actually 178 from 248 of our components now run in docker).
At Keboola we take testing very seriously. Every time a new component pings our #scm Slack channel, Martin pops out of nowhere and asks “Where do you have tests”? Apart from standard dockerized components, there some more complicated components like Storage, Orchestrator or Docker Runner. These require quite complex testing environments and therefore run on our Jenkins installation on development server.
For those core components we run also integration tests. So if the Phpunit tests pass, Jenkins deploys the stuff to development server and runs development orchestrator, which takes orchestration from our development
KBC project and runs it.
So for Docker Runner, it goes like this:
- Run tests on Travis (these are only tests with mocks, no docker operations are done).
- If that succeeds, run tests on Jenkins (those involve actually downloading and running containers) and executing API calls.
- If that succeeds, deploy to testing.
- Run test orchestration which executes actual jobs from KBC project on the new Docker Runner version.
- If that succeeds, then deploy to production.
So I’d say that is a pretty solid setup. And still I managed to put into production a bug in piece of code which is tested in about 20 tests. The piece of code was this (simplified):
$process = new Process('sudo docker run ...');
$process->setTimeout($this->getImage()->getProcessTimeout());
$startTime = time();
try {
$this->log->debug("Executing docker process.");
$process->run();
} catch (ProcessTimedOutException $e) {
// is actually not working
$this->removeContainer($containerId);
throw new UserException(
"Running container exceeded the timeout."
);
}
Then we run into an issue that Symfony process timeout was not playing well with docker run
. So Najlos changed the code to something like this:
$process = new Process('sudo timeout --signal=SIGKILL 3600 sudo docker run ...');
$process->setTimeout($this->getImage()->getProcessTimeout());
$startTime = time();
try {
$this->log->debug("Executing docker process.");
$process->run();
// this catches the timeout from `sudo timeout`
if ($process->getExitCode() == 137 && $duration >= $this->getImage()->getProcessTimeout()) {
throw new UserException(
"Running container exceeded the timeout."
);
}
} catch (ProcessTimedOutException $e) {
// is actually not working
$this->removeContainer($containerId);
throw new UserException(
"Running container exceeded the timeout."
);
}
Keep in mind that code is simplified, in fact it was slightly more complicated. When I was modifying this (for a completely unrelated reason), I realized that the outer timeout check is not doing anything. I.e. if the process exceeds timeout, it ends with exit code 137 and throws the UserException
from inside of the try-catch. The catch code is never reached. So I deleted it.
Well, obviously.
So I ended up with this:
$process = new Process('sudo timeout --signal=SIGKILL 3600 sudo docker run ...');
$startTime = time();
$this->log->debug("Executing docker process.");
$process->run();
// this catches the timeout from `sudo timeout`
if ($process->getExitCode() == 137 && $duration >= $this->getImage()->getProcessTimeout()) {
throw new UserException(
"Running container exceeded the timeout."
);
}
Seems fine. Tests green. I deployed to production and everything seemed to work. Until a couple of minutes later a customer complained that a job ended with the message that it exceeded timeout 3600 seconds while it was running only slightly longer then a minute.
It turned out that the problem was that I removed this line:
$process->setTimeout($this->getImage()->getProcessTimeout());
Unfortunately the Symfony Process
component uses a default (slightly poorly documented) 60s timeout for all processes. So to remove it I had to add:
$process->setTimeout(null);
Let’s not delve into why the 60s default timeout was there in the first place.
The problem is with the tests. We try to follow the principles of Aerospace Engineering — fixing the root cause of a problem. So I had to fix the tests to cover the scenario. The problem was not detected by any of our tests, because we were testing whether the timeout works by setting it to 10s and checking whether the process really terminated. This obviously worked because that timeout was smaller than the default one. So I added a test to check whether the the process can run more than 60 seconds.
Martin often likes to say: “I don’t give a shit about code coverage”. Riiight. I understand, Even with 100% code coverage, we had a scenario that was not covered. OK, this is nothing new.
Unfortunately this is no good. Apart from lengthening the test runtime the new tests do not really fix the root cause. They still rely on some internal behavior of something. What would happen if Symfony guys decide to change that default limit (like those guys did)? (I hope they don’t!) And more importantly how to prevent errors like this in future?