How we build and operate Keboola data platform
Ondřej Popelka 4 min read

TDD and the Weirdest Exception

It’s a perverted programmer joy when I can write:

throw new WeirdException('Phantom of the opera is here');

How did I end up with that? As we run more and more components in Docker, on 20th June 2016 we noticed that some docker run commands failed with the message: “docker: Error response from daemon: devicemapper: Error running deviceResume dm_task_run failed.” We spent a lot of time debugging this error, which was really boring and painful, so instead of describing it, I will throw in some nice programming haiku.

The Web site you seek
Cannot be located, but
Countless more exist.

Your file was so big.
It might be very useful.
But now it is gone.

It turned out that the error occurred irrespective of Docker version, OS version, or Docker worker, and on average in 1 of 400 docker runs. It had a slight affinity for the full hour, especially around 0:00 and 1:00, which is when many of our customers' orchestrators run. Sometimes it occurred once a week, sometimes 10 times a day. There is an endless, still open issue on GitHub full of desperate attempts to debug and localize the problem. One could say we were pretty fucked up.

The error behaves as if the Docker device mapper is in some awful state, but that state lasts for a few (milli)seconds only, so re-running the task always helped. Since we couldn't find any real fix, we were willing to accept anything that would prevent the jobs from failing. So we went with restarting the container if it failed with the above error.
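To restart only on this particular failure, the application first has to recognize it. The following is a minimal sketch of such a check, not the actual Keboola code: the function name `isTransientDockerError` is hypothetical, and the exit code 125 and the matched substrings are assumptions taken from the messages quoted in this article.

```php
<?php

// Exception name borrowed from the article; the empty class body is an assumption.
class WeirdException extends RuntimeException
{
}

/**
 * Guess whether a failed docker run hit the transient device mapper error.
 * Hypothetical helper: exit code and substrings are assumptions, not a Docker API.
 */
function isTransientDockerError(int $exitCode, string $output): bool
{
    if ($exitCode !== 125) {
        return false;
    }
    // Both quoted messages start with the daemon prefix and mention the device mapper.
    return strpos($output, 'Error response from daemon') !== false
        && (strpos($output, 'devicemapper') !== false
            || strpos($output, '/dev/mapper/') !== false);
}

/**
 * Throw the retryable exception when the failure looks transient.
 */
function failIfTransient(int $exitCode, string $output): void
{
    if (isTransientDockerError($exitCode, $output)) {
        throw new WeirdException('Phantom of the opera is here');
    }
}
```

Matching on message text is fragile, so a check like this would need revisiting whenever Docker changes its wording.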

The development of the retry mechanism was not so easy, because we never managed to simulate the error in a useful way (except for running 10000 containers at the same time so that one of them failed).

So I had to start off by writing a test for that, the core of it being:

// Create a stub for the Container
$container = $this->getMockBuilder(Container::class)
    ->setConstructorArgs([...])
    ->setMethods(['getRunCommand'])
    ->getMock();
// The first call returns a command that fakes the failure, the second one the real command
$container->method('getRunCommand')
    ->will($this->onConsecutiveCalls(
        'sh -c -e \'echo "failed: (125) docker: Error response from daemon: open /dev/mapper/...: no such file or directory." && exit 125\'',
        'docker run --volume="' . $root . '/data/":/data --net="bridge" "keboola/docker-demo-app:' . $tag . '"'
    ));
$container->run();

The test itself looks quite weird. I had to make a method public just for testing (we all know that is wrong) so that I could mock it and override the docker run command. Then I slip in a shell command which produces output similar to what Docker prints when it fails. If you think about it, it is actually great, because we are not testing Docker and its error. We are testing whether our application can recover from that error when it encounters it.
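Why the stand-in command is good enough can be seen in isolation. The sketch below assumes a POSIX `sh` is available; the helper `runCommand` is hypothetical and the error text is shortened for illustration. To the calling code, the fake command is indistinguishable from a failed docker run: some output plus exit code 125.

```php
<?php

/**
 * Run a shell command and return its exit code and captured stdout lines.
 * Hypothetical helper for illustration; not part of the article's code base.
 */
function runCommand(string $command): array
{
    $output = [];
    $exitCode = 0;
    // exec() fills $output with stdout lines and $exitCode with the exit status.
    exec($command, $output, $exitCode);
    return ['exitCode' => $exitCode, 'output' => $output];
}

// Prints a Docker-like error message and exits with 125, without touching Docker at all.
$fakeDockerFailure = 'sh -c -e \'echo "docker: Error response from daemon: simulated failure" && exit 125\'';
$result = runCommand($fakeDockerFailure);
```

Note that `exec()` captures only standard output, which is where `echo` writes.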

Then I did the retry mechanism:

...
do {
    $retry = false;
    try {
        $this->logger->notice("Executing docker process.");
        $this->run($process);
        $this->logger->notice("Docker process finished.");
        if (!$process->isSuccessful()) {
            $this->handleContainerFailure($process);
        }
    } catch (WeirdException $e) {
        $this->logger->notice("Phantom of the opera is here");
        // Wait a random while before running the container again
        sleep(random_int(1, 4));
        $retry = true;
        $retries++;
        // Give up after five failed attempts
        if ($retries >= 5) {
            $this->logger->notice("Weird error occurred too many times.");
            throw new ApplicationException($e->getMessage(), $e);
        }
    }
} while ($retry);
...
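The same pattern can be pulled out into a standalone helper. This is a sketch under assumptions, not the code we deployed: `retryTransient`, `TransientException` and the injectable `$sleeper` are hypothetical names, and unlike the loop above it checks the attempt limit before sleeping, so the final failure is reported without an extra delay.

```php
<?php

// Hypothetical marker for errors that are worth retrying.
class TransientException extends RuntimeException
{
}

/**
 * Call $operation until it succeeds or $maxAttempts transient failures have occurred.
 * $sleeper is injectable so that tests do not have to actually wait.
 */
function retryTransient(callable $operation, int $maxAttempts = 5, ?callable $sleeper = null)
{
    $sleeper = $sleeper ?? function (int $seconds): void {
        sleep($seconds);
    };
    $attempts = 0;
    while (true) {
        try {
            return $operation();
        } catch (TransientException $e) {
            $attempts++;
            if ($attempts >= $maxAttempts) {
                // Keep the original error as the previous exception.
                throw new RuntimeException('Transient error occurred too many times.', 0, $e);
            }
            // Random delay so that parallel jobs do not all retry at the same moment.
            $sleeper(random_int(1, 4));
        }
    }
}

// Usage: the operation fails twice, then succeeds on the third call.
$calls = 0;
$result = retryTransient(function () use (&$calls) {
    $calls++;
    if ($calls < 3) {
        throw new TransientException('flaky');
    }
    return 'done';
}, 5, function (int $seconds): void {
    // No-op sleeper keeps the example fast.
});
```

Any exception other than `TransientException` passes straight through, so genuine failures are not retried.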

And of course I wrote another test to verify that the retry mechanism will eventually end:

// Create a stub for the Container
$container = $this->getMockBuilder(Container::class)
    ->setConstructorArgs([...])
    ->setMethods(['getRunCommand'])
    ->getMock();
// More consecutive failures than the retry limit allows
$container->method('getRunCommand')
    ->will($this->onConsecutiveCalls(
        'sh -c -e \'echo "failed: (125) docker: Error response from daemon: open /dev/mapper/...: no such file or directory." && exit 125\'',
        'sh -c -e \'echo "failed: (125) docker: Error response from daemon: open /dev/mapper/...: no such file or directory." && exit 125\'',
        'sh -c -e \'echo "failed: (125) docker: Error response from daemon: open /dev/mapper/...: no such file or directory." && exit 125\'',
        'sh -c -e \'echo "failed: (125) docker: Error response from daemon: open /dev/mapper/...: no such file or directory." && exit 125\'',
        'sh -c -e \'echo "failed: (125) docker: Error response from daemon: open /dev/mapper/...: no such file or directory." && exit 125\'',
        'sh -c -e \'echo "failed: (125) docker: Error response from daemon: open /dev/mapper/...: no such file or directory." && exit 125\'',
        'docker run --volume="' . $root . '/data/":/data --net="bridge" "keboola/docker-demo-app:' . $tag . '"'
    ));
try {
    $container->run();
    $this->fail("Too many errors must fail");
} catch (ApplicationException $e) {
    // Expected: the retry mechanism gave up
}

Then I rewrote the retry mechanism a few times, deployed it, and set up Papertrail notifications for the error. Then we crossed our fingers and waited three days to see if the error would pop up. And it did (on 3rd of August). The retry mechanism worked: the KBC job took longer, but ended successfully. Test driven development at its best.

Today we run about 190 containers per hour and the error is more common; it usually appears a few times each day. So from now on, if you spot a KBC job running twice as long as usual, it means either that this happened or that our worker server went down.

Yes, it’s a shameful, horrible solution. It is not really a solution at all: half a year later, we still don’t know what causes the error or how to fix it. Perhaps a new AMI will solve it. But our customers expect working jobs, and we’ll do anything to meet that expectation, even if it means squashing our programmer’s ego.

Yeah and by the way Fuck Docker!

It’s still great, though.
