Linguistic data analysis often requires working with XML files (many language corpora are annotated in various flavors of XML).
Since I’m (obviously) using git (more specifically GitLab) to manage my corpora, integrating XML linting into the CI process is a fairly straightforward idea. Thankfully, GitLab CI makes this extremely easy.
What do I need?
- GitLab (Community Edition; a recent version)
- At least one GitLab Runner (for the sake of simplicity, I’m running it in Shell executor mode)
- libxml / xmllint on the runner
Okay, what now?
Create a new .gitlab-ci.yml file in your repository. The simplest possible configuration needs only a single job in the test stage whose script calls xmllint.
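A minimal sketch of such a configuration follows. It assumes the corpus lives in a corpus/ directory at the repository root. The job name is arbitrary, and the --noout flag is my addition so that xmllint prints only errors instead of echoing every document.

```yaml
# .gitlab-ci.yml
# Hypothetical job name; GitLab accepts any job name here.
xmllint:
  stage: test
  script:
    # With the Shell executor, the glob is expanded by the runner's shell.
    # --noout suppresses the re-serialized document, leaving only error output.
    - xmllint --noout corpus/*.xml
```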
This will execute xmllint on all .xml files in /corpus during the testing phase of every CI build.
GitLab CI will look at the exit code of whatever has been plugged into script. If the exit code is non-zero (xmllint returns 1 when a file fails to parse), the build will be marked as failed. If the exit code is 0, the build will pass.
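To see the exit codes outside of CI, the following sketch can be run locally on any machine with xmllint installed. The file names and the check helper are made up for illustration, and the pass/fail wording only mirrors the rule GitLab CI applies.

```shell
#!/bin/sh
# Scratch directory with one well-formed and one broken file (hypothetical names).
tmp=$(mktemp -d)
printf '<corpus><text>ok</text></corpus>\n' > "$tmp/good.xml"
# The closing '>' of the root element is missing in this one.
printf '<corpus><text>ok</text></corpus\n' > "$tmp/bad.xml"

# check FILE: lint one file and report the verdict a CI job would reach.
check() {
    xmllint --noout "$1" 2>/dev/null
    code=$?
    if [ "$code" -eq 0 ]; then
        echo "exit code $code: build passes"
    else
        echo "exit code $code: build fails"
    fi
}

check "$tmp/good.xml"
check "$tmp/bad.xml"
```

When xmllint is given several files at once, as in the CI job, a single broken file is enough to make the exit code non-zero.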
Let’s say we have missed a closing angle bracket in our XML file. GitLab CI will mark the build as failed, and the runner’s job log will show the parser error that xmllint reported for the offending file.
After fixing the issue, the build will pass as expected.
GitLab CI makes it very easy to integrate testing (in this case linting) into the CI process. For this particular case, a pre-commit hook could be the better alternative. However, integrating the linting into the CI process ensures correct XML even when collaborating with people who don’t (or can’t) use the hook.
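For comparison, a pre-commit hook along these lines might look like the sketch below. It is not taken from my setup: the function name lint_staged_xml is hypothetical, and the script assumes it is saved as an executable .git/hooks/pre-commit on a machine that has xmllint.

```shell
#!/bin/sh
# Sketch of .git/hooks/pre-commit (the file must be executable).

# lint_staged_xml: run xmllint on every staged .xml file.
# Returns 1 if at least one file is not well-formed, 0 otherwise.
lint_staged_xml() {
    status=0
    # List added, copied, or modified files in the index that end in .xml.
    # Caveats: paths containing whitespace are not handled, and the
    # working-tree copy is checked rather than the staged blob.
    for f in $(git diff --cached --name-only --diff-filter=ACM -- '*.xml'); do
        xmllint --noout "$f" || status=1
    done
    return "$status"
}

# A non-zero exit status from the hook makes git abort the commit.
lint_staged_xml || exit 1
```

Because hooks live in .git/hooks and are not part of the cloned repository, every collaborator has to install this by hand, which is exactly the gap the CI job closes.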