Life of a data extractor
No doubt, extractors represent an important part of ETL. In Keboola Connection, an extractor is a Docker application which helps the end user extract data of their choice (usually from a database or an API server).
The developer of an extractor can use any programming language; there are no limits.
When viewed from Keboola Connection, an extractor takes a disk volume as its input. The volume contains all the files needed to run the extractor — configuration files, state files, etc. At the end, the processed data is stored to this volume as well. Basically, an extractor is a black box which takes the input, extracts data, stores it and dies.
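To make that contract concrete, here is a minimal sketch of what such a data volume could look like — the file names and JSON shape below are illustrative assumptions, not the exact Keboola Connection specification:

```shell
# Illustrative sketch of a /data volume an extractor could receive;
# file names and JSON shape here are assumptions, not the exact contract.
mkdir -p data/out/tables
cat > data/config.json <<'EOF'
{"parameters": {"db": "test", "collection": "users"}}
EOF
# The extractor would read data/config.json and write CSV files
# into data/out/tables before it exits.
ls data
```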
Some of you may wonder how an extractor is created and tested, and how it becomes production ready. As an example, I chose to describe the development of the MongoDB Extractor, which is the last one I created.
Development
Since an extractor is executed as a Docker container, we should start development the same way — in a container. It is much easier than “dockerizing” an existing application later.
First of all, we need a Dockerfile.
FROM php:7.0
RUN apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv EA312927 \
&& echo 'deb http://repo.mongodb.org/apt/debian wheezy/mongodb-org/3.2 main' > /etc/apt/sources.list.d/mongodb-org-3.2.list \
&& apt-get update -q \
&& apt-get install unzip git libssl-dev mongodb-org-shell mongodb-org-tools ssh -y --no-install-recommends \
&& rm -rf /var/lib/apt/lists/*
RUN pecl install mongodb \
&& docker-php-ext-enable mongodb
WORKDIR /root
RUN curl -sS https://getcomposer.org/installer | php \
&& mv composer.phar /usr/local/bin/composer
COPY ./docker/php/php.ini /usr/local/etc/php/php.ini
COPY . /code
WORKDIR /code
RUN composer install --prefer-dist --no-interaction
CMD php ./src/run.php --data=/data
These instructions mean:
- take PHP 7.0 image (it’s based on Debian)
- install additional packages and do some cleanup
- install PECL MongoDB package
- change directory, download and install Composer
- copy custom php.ini settings, then copy current directory to /code path
- change directory and install Composer dependencies
- and set entry command.
Now we can start coding — create a run.php file, add some classes, tests or required Composer packages. Once the files mentioned in the Dockerfile (php.ini, composer.json) exist, the image builds successfully:
docker build -t keboola/mongodb-extractor .
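After the build succeeds, a quick smoke test (my own suggestion, not part of the original workflow) confirms that the PECL extension and Composer actually made it into the image:

```shell
# Sanity-check the freshly built image: the mongodb extension and
# Composer should both be available inside it.
docker run --rm keboola/mongodb-extractor php -m | grep mongodb
docker run --rm keboola/mongodb-extractor composer --version
```

These commands require a running Docker daemon, so they are meant for your development machine, not for copy-pasting blindly.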
Indeed, it is very impractical to build the image again and again each time we change the code. Also, while developing, we want to “hack” and have the PHP console accessible, e.g. to install new Composer packages or execute our run script manually.
The solution is very simple — mount the current working directory as a volume and set the image command to bash (it will override the CMD from the Dockerfile).
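In plain docker, that mount plus command override looks roughly like this (assuming the image was already built and tagged as shown above):

```shell
# Mount the working copy over /code and drop into an interactive shell,
# overriding the image's CMD; --rm discards the container on exit.
docker run --rm -it -v "$(pwd)":/code keboola/mongodb-extractor bash
```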
We can skip manual volume mounting and automate this by creating a docker-compose.yml with the required configuration:
version: '2'
services:
  php:
    build: .
    image: keboola/mongodb-extractor
    tty: true
    stdin_open: true
    command: bash
    volumes:
      - ./:/code
By specifying both the image and build values, the image will be built (.) and tagged (keboola/mongodb-extractor) — we’ll take advantage of this later.
For development purposes we only need to remember this command and then do whatever we like: docker-compose run --rm php
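Inside that shell you can then work as on any local checkout — for instance (the package name and the extra /data mount are just examples, not commands from the original workflow):

```shell
# Typical ad-hoc work inside the development container:
composer require --dev phpunit/phpunit   # add a dependency
# Run the extractor manually (assumes you also mounted a /data volume
# with a config file in it, as sketched earlier):
php ./src/run.php --data=/data
```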
Testing, Automation
After requiring the test libraries (see composer.json) and writing tests, we can create a shell script for running them — tests.sh:
#!/bin/bash
php --version \
&& composer --version \
&& ./vendor/bin/phpcs --standard=psr2 -n --ignore=vendor --extensions=php . \
&& ./vendor/bin/phpunit
Now everything is prepared to start automated testing and delivery. That means every time new code is pushed to one of our git branches, we want to test the application. A good Continuous Integration server can help with that.
We like Travis CI a lot, since it’s well documented and free for open source projects. The whole CI setup can be configured with a simple YAML file — .travis.yml:
sudo: required
language: bash
services:
  - docker
env:
  DOCKER_COMPOSE_VERSION: 1.6.2
before_install:
  - sudo apt-get update
  - sudo apt-get -o Dpkg::Options::="--force-confnew" -y install docker-engine
install:
  - sudo rm /usr/local/bin/docker-compose
  - curl -L https://github.com/docker/compose/releases/download/${DOCKER_COMPOSE_VERSION}/docker-compose-`uname -s`-`uname -m` > docker-compose
  - chmod +x docker-compose
  - sudo mv docker-compose /usr/local/bin
before_script:
  - docker -v
  - docker-compose -v
  - docker-compose build php
script: docker-compose run --rm php ./tests.sh
after_success:
  - docker images
Which means:
- we need sudo to run the tests; the language we’re using is bash (basically only docker and docker-compose commands are used), and we’ll need the Docker service
- since Travis ships an old Docker Compose version, we install the latest Docker Engine and Docker Compose 1.6.2
- before running the tests, we print the versions of the tools installed in the previous step
- then the php service (that’s our MongoDB extractor) is built
- finally, our tests.sh script will be executed (inside the container)
- at the end, we want to see images we created during the testing process
The good thing here is that everything is isolated. First we built our application image with the docker-compose build command, then the tests were run inside a container.
With this setup, we didn’t just test our application — we tested the whole environment in which the application will be executed.
Production
After the tests pass, the image is ready for production. To be usable in Keboola Connection, it must be pushed into an image repository.
The old-fashioned way is to trigger some service which rebuilds the image after the tests have passed — but that is wrong: the rebuilt image is not the one that was actually tested.
Is it possible to push exactly the same image the tests were run on? Yes, it is. Let’s define a deploy section in the CI configuration:
deploy:
  provider: script
  skip_cleanup: true
  script: ./deploy.sh
  on:
    tags: true
Create a deploy.sh script and make sure it is executable (has the x flag):
#!/bin/bash
docker login -e="." -u="$QUAY_USERNAME" -p="$QUAY_PASSWORD" quay.io
docker tag keboola/mongodb-extractor quay.io/keboola/mongodb-extractor:$TRAVIS_TAG
docker images
docker push quay.io/keboola/mongodb-extractor:$TRAVIS_TAG
- our deploy provider is script, we skip the cleanup step (so the image built earlier survives until deployment) and we only deploy when a new git tag is pushed
- the deployment script logs in to Quay.io and pushes the image we built at the very first stage (tagged keboola/mongodb-extractor in the Docker Compose file)
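So cutting a release boils down to pushing a git tag, which Travis exposes to deploy.sh as $TRAVIS_TAG. A minimal local sketch of the tagging step (throwaway repository, illustrative version number):

```shell
# Create a throwaway repo and tag a release; in the real project the tag
# would be pushed (git push origin 1.2.0) to trigger the Travis deploy.
mkdir -p demo-repo
git -C demo-repo init -q
git -C demo-repo -c user.email=demo@example.com -c user.name=demo \
  commit -q --allow-empty -m "release"
git -C demo-repo tag 1.2.0
git -C demo-repo tag --list
```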
Long story short
- after a push to any branch, a new image is built with the current code
- then the tests are run in that image — both the application and its environment are tested
- after pushing a tag, additional steps push the very same image into the image repository
Further reading:
- check the previous article to learn more about how we use Docker
- or how we run Keboola Connection backend