Linguistic data analysis often requires working with XML files (many language corpora are annotated in various flavors of XML).

Since I’m (obviously) using Git (more specifically GitLab) to manage my corpora, integrating XML linting into the CI process is a fairly straightforward idea.

Thankfully, GitLab CI makes this extremely easy.

What do I need?

  • GitLab (Community Edition; a recent version)
  • At least one GitLab Runner (for the sake of simplicity, I’m running it with the shell executor)
  • xmllint (part of libxml2) installed on the runner

Okay, what now?

Create a new .gitlab-ci.yml file in the root of your repository. The simplest possible configuration could look something like this:

xml_test:
    stage: test
    script: "xmllint --noout corpus/*.xml"

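This will execute xmllint on all .xml files directly inside the corpus/ directory of the repository (the glob does not descend into subdirectories) during the test stage of every CI build. The --noout option stops xmllint from printing the parsed document, so only errors show up in the job log.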
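If the corpus is spread over subdirectories, a plain corpus/*.xml glob will miss the nested files. Here is a minimal sketch of one way to cover them, assuming a find that supports the POSIX -exec ... + form. The names lint_tree and XML_LINTER are hypothetical helpers introduced for this example, not part of GitLab or libxml2:

```shell
# lint_tree: hypothetical helper that lints every .xml file below a directory.
# XML_LINTER is an assumed override hook; it defaults to "xmllint --noout".
lint_tree() {
    dir=$1
    # -exec ... {} + batches many files into few linter calls; find exits
    # non-zero if any of those calls fails, so the status propagates to CI.
    # XML_LINTER is left unquoted on purpose so that it splits into words.
    find "$dir" -type f -name '*.xml' -exec ${XML_LINTER:-xmllint --noout} {} +
}
```

In .gitlab-ci.yml, the same idea should fit on one line: script: "find corpus -type f -name '*.xml' -exec xmllint --noout {} +".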

GitLab CI will look at the exit code of whatever has been plugged into script. If the exit code is 0, the build will pass. Any non-zero exit code (not just 1) marks the build as failed, and xmllint exits with a non-zero code whenever it runs into an error.
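To see locally how a command would be judged before pushing, the pass/fail rule can be mimicked in a few lines of shell. This is only a sketch of the rule described above; ci_verdict is a hypothetical helper name, not something GitLab provides:

```shell
# ci_verdict: hypothetical helper that runs a command and reports how a CI job
# would treat its exit status (0 passes, anything else fails).
ci_verdict() {
    if "$@" >/dev/null 2>&1; then
        echo "pass"
    else
        # In the else branch, $? still holds the exit status of the command.
        echo "fail (exit status $?)"
    fi
}
```

For example, running ci_verdict xmllint --noout corpus/xml-test.xml on the runner (or any machine with xmllint installed) prints the verdict without showing the parser messages.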

Example

Let’s say we have missed a closing angle bracket in our XML file. GitLab CI will mark the build as failed:

GitLab CI Example - Failed

More specifically, the runner returns:

<testXML</test>
        ^
corpus/xml-test.xml:1: parser error : Couldn't find end of Start Tag testXML line 1
<testXML</test>
        ^
corpus/xml-test.xml:1: parser error : Extra content at the end of the document
<testXML</test>
        ^
ERROR: Job failed: exit status 1

After fixing the issue, the build will pass as expected:

GitLab CI Example - Success

Conclusion

GitLab CI makes it very easy to integrate testing (in this case linting) into the CI process. For this particular case, a pre-commit hook could be the better alternative, since it catches broken XML before it is even committed. However, integrating the linting into the CI process ensures well-formed XML even when collaborating with people who don’t (or can’t) use the hook.
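For comparison, a pre-commit hook along these lines might look roughly like the sketch below. It assumes xmllint is installed locally. The names lint_xml_list, lint_staged_xml and XML_LINTER are hypothetical and only serve this example. Note that it checks the working-tree copy of each staged file, which can differ from the staged content:

```shell
#!/bin/sh
# Sketch of a .git/hooks/pre-commit script (the file must be executable).
# lint_xml_list, lint_staged_xml and XML_LINTER are hypothetical names.

# Read file names from stdin, one per line, and lint each of them.
# Returns 1 if at least one file fails, 0 otherwise.
lint_xml_list() {
    status=0
    while IFS= read -r file; do
        # Unquoted on purpose so "xmllint --noout" splits into command and option.
        ${XML_LINTER:-xmllint --noout} "$file" || status=1
    done
    return "$status"
}

# List staged files that were added, copied, or modified, keep only *.xml,
# and feed them to the linter. A non-zero status makes Git abort the commit.
lint_staged_xml() {
    git diff --cached --name-only --diff-filter=ACM -- '*.xml' | lint_xml_list
}
```

To activate it, end the hook file with a call to lint_staged_xml as its last line, so that the exit status of the script is the result of the lint run.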