How we build and operate the Keboola data platform
Ondřej Hlaváček 2 min read

OpenRefine PHP Client

A week or two ago I started playing with OpenRefine and its integration with Keboola Connection. A couple of years ago I played with Google Refine and hit its memory limits pretty easily, so it was about time to check it out again. The OpenRefine engine would be a great fit for our pipeline, as a step before we process data further in SQL, Python or R.

OpenRefine is currently at version 2.6-RC2. The app does not support batch processing from the command line; all you can do is start the server. There are currently two libraries wrapping the OpenRefine API: P3 Batchrefine and OpenRefine Python Client Library.
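To make "wrapping the API" a bit more concrete, here is a rough sketch of the steps a batch run against a running server boils down to. It only builds the request list and does not talk to a server. The base URL (OpenRefine's default port on localhost), the helper names `command_url` and `batch_plan`, and the choice to show parameters in the query string are my assumptions for illustration; this is not the code of either library.

```python
from urllib.parse import urlencode

# Assumption: a local OpenRefine server listening on its default port.
BASE_URL = "http://127.0.0.1:3333"


def command_url(command, **params):
    """Build the URL of an OpenRefine core command, with optional query parameters."""
    url = f"{BASE_URL}/command/core/{command}"
    return f"{url}?{urlencode(params)}" if params else url


def batch_plan(project_id):
    """Return the (HTTP method, URL) steps of one hypothetical batch run.

    In a real run the project id is only known after the first step,
    because the server assigns it when the project is created.
    """
    return [
        # 1. Upload the input file and create a project.
        ("POST", command_url("create-project-from-upload")),
        # 2. Replay a JSON list of operations on that project.
        ("POST", command_url("apply-operations", project=project_id)),
        # 3. Download the transformed rows.
        ("POST", command_url("export-rows", project=project_id, format="csv")),
        # 4. Clean up so the server does not accumulate projects.
        ("POST", command_url("delete-project", project=project_id)),
    ]


if __name__ == "__main__":
    for method, url in batch_plan("1234567890"):
        print(method, url)
```

A wrapper library mostly adds the plumbing around these four calls: file uploads, error handling and, ideally, a CLI.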

OpenRefine Python Client Library uses OpenRefine 2.6-beta1 and does not have a CLI. That, combined with me not being a Python programmer, made me skip this library at first.

P3 Batchrefine is written in Java but has a CLI, so I can prepare everything on the side using bash or whatever and then just run the CLI. Sounds good! Everything was piled up in a single Docker image, packaged and released.

It worked fine… until a really weird error. If a certain operation (mass-replace) contained the character Á at the end of any argument, the operations were not executed. Other special characters worked fine; only this one caused trouble. Executing the operations directly on an OpenRefine server (I tried 2.6-beta1, 2.6-beta2 and 2.6-rc2) worked fine. So I started playing with the library. I tried to debug and reverse engineer it, but the library didn't help much: for example, the release file for version 1.1.7 contains version 1.1.2.
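The failing case is easy to pin down as data. The sketch below builds a mass-edit operation whose replacement value ends in Á and checks that it survives a UTF-8 round trip, which is the minimum any client has to get right before sending operations to the server. The operation layout is modeled on the JSON that OpenRefine exports from its operation history; the column name, the values and the helper names `mass_edit` and `utf8_round_trip` are made up for illustration, and nothing here claims to reproduce the actual cause of the bug.

```python
import json


def mass_edit(column, replacements):
    """Build an OpenRefine-style mass-edit operation as a plain dict."""
    return {
        "op": "core/mass-edit",
        "description": f"Mass edit cells in column {column}",
        "engineConfig": {"facets": [], "mode": "row-based"},
        "columnName": column,
        "expression": "value",
        "edits": [
            {"fromBlank": False, "fromError": False, "from": [old], "to": new}
            for old, new in replacements
        ],
    }


def utf8_round_trip(operations):
    """Serialize operations to UTF-8 bytes, then parse those bytes back."""
    # ensure_ascii=False keeps Á as a real character instead of a \u00c1 escape.
    payload = json.dumps(operations, ensure_ascii=False).encode("utf-8")
    return payload, json.loads(payload.decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical column and values; the replacement ends in Á on purpose.
    ops = [mass_edit("city", [("abc", "abcÁ")])]
    payload, parsed = utf8_round_trip(ops)
    assert parsed == ops, "operation definition changed during the round trip"
    print(payload.decode("utf-8"))
```

If a client passes this check but the server still skips the operations, the problem sits somewhere between the serialized bytes and the server, which is where I ended up looking.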

Although the library looked very promising, especially with its embedded mode, I started to realize that to move on quickly (the library looks pretty stale) I would need to learn Java and the whole development and build ecosystem around it, and then fix the library. We develop backend services mainly in PHP and our whole development and build pipeline is quite easy, so I opted for creating a new library that would meet our needs. As we'll be using it on a daily basis, there is a good chance that the library will be kept up to date.

And here it is —

Let us know your thoughts!

Note: while I was writing this post, Andrey fixed his P3 Batchrefine library so it can process UTF-8 operation definitions. So the battle is back on! We're able to switch between multiple libraries, and only our customers will decide which one to use.
