OpenRefine PHP Client
A week or two ago I started playing with OpenRefine and its integration with Keboola Connection. A couple of years ago I played with Google Refine and hit its memory limits pretty easily, so it was about time to check it out again. The OpenRefine engine would be a great fit for our pipeline, sitting before we further process data in SQL, Python or R.
OpenRefine is currently at version 2.6-RC2. The app does not support batch processing from the command line; all you can do is start the server. There are currently two libraries wrapping the OpenRefine API: P3 Batchrefine and OpenRefine Python Client Library.
OpenRefine Python Client Library targets OpenRefine 2.6-beta1 and does not have a CLI. That, combined with me not being a Python programmer, made me skip this library at first.
P3 Batchrefine is written in Java, but it has a CLI, so I can prepare everything on the side using bash or whatever and then just run the CLI. Sounds good! Everything went into a single Docker image, packaged and released.
It worked fine… until a really weird error. If a certain operation (mass-replace) contained the character Á at the end of any argument, the operations were not executed. Other special characters worked fine; only this one caused trouble. Executing the same operations directly on an OpenRefine server (I tried 2.6-beta1, 2.6-beta2 and 2.6-rc2) worked fine. So I started playing with the library. I tried to debug and reverse engineer it, but the library didn't help much. For example, the release file for version 1.1.7 contains version 1.1.2.
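To make the failure mode a bit more concrete, here is a minimal Python sketch of an operation definition whose argument ends in Á, serialized explicitly as UTF-8 and read back. It is an illustration only, not a diagnosis of the Batchrefine bug and not code from any of the libraries mentioned here. The helper names, the column name and the sample values are hypothetical, and the operation layout (`core/mass-edit` with an `edits` list) is my assumption about what a mass-replace step looks like in OpenRefine's exported operation JSON.

```python
import json


def build_mass_edit(column, replacements):
    """Build one mass-edit operation as a plain dict.

    The field names follow OpenRefine's exported operation JSON as far as
    I know; treat the exact layout as an assumption.
    """
    return {
        "op": "core/mass-edit",
        "description": "Mass edit cells in column " + column,
        "engineConfig": {"facets": [], "mode": "row-based"},
        "columnName": column,
        "expression": "value",
        "edits": [
            {"fromBlank": False, "fromError": False, "from": [src], "to": dst}
            for src, dst in replacements
        ],
    }


def encode_operations(operations):
    # ensure_ascii=False keeps "Á" as a literal character instead of a
    # \u00c1 escape, so the bytes really depend on the chosen charset.
    return json.dumps(operations, ensure_ascii=False).encode("utf-8")


def decode_operations(payload, charset="utf-8"):
    # Decode with an explicit charset; a mismatch here changes the values.
    return json.loads(payload.decode(charset))


# Hypothetical column and values; the replacement ends in "Á" on purpose.
ops = [build_mass_edit("city", [("PRAHA", "PRAHÁ")])]
payload = encode_operations(ops)

round_trip = decode_operations(payload)           # same charset on both sides
garbled = decode_operations(payload, "latin-1")   # deliberately wrong charset

print(round_trip == ops)
print(garbled[0]["edits"][0]["to"] == "PRAHÁ")
```

The point of the sketch is only that an operation definition survives the trip when both sides agree on UTF-8, and that the replacement value silently changes when they do not.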
The library looked very promising, especially with its embedded mode, but it also looks pretty stale. I started to realize that to move on quickly I would need to learn Java and the whole devel/build ecosystem around it, and then fix the library myself. We develop backend services mainly in PHP, where the whole devel/build pipeline is quite easy, so I opted for creating a new library that would meet our needs. As we'll be using it on a daily basis, there is a good chance that the library will be kept up to date.
And here it is — https://github.com/keboola/openrefine-php-client.
Let us know your thoughts!
Note: while I was writing this post, Andrey fixed his P3 Batchrefine library so that it can process UTF-8 operation definitions. So the battle is back on! We're able to switch between multiple libraries, and only our customers will decide which one to use.