Parsing a Maven pom.xml file with a shell script

My entry in this year's "stupid sed tricks" is a command-line run tool for Maven-based Java projects. Maven's a great choice for managing complex dependencies, especially because of the central repository that allows you to just declare, for example, that your project uses Apache Commons HTTPClient and let Maven resolve and download it for you. One thing that bothers me when I use Maven from the command line, though, is that it's a hassle to build up the correct CLASSPATH environment variable that lets you run the finished product. Ant made that pretty easy, but Maven wants to control everything via its <dependencies> section. This is fine when you're building a webapp, since a webapp builds its classpath dynamically from its /classes and /lib directories, but if you want to run from the command line, you're stuck with Maven's horrible exec plugin (which doesn't seem to let you define multiple configurations) or the assembly:assembly goal that conglomerates all of your class files into a single monster .jar file.

If you look at the <dependencies> section, though, you can see that it sort of looks like a classpath. Once you recognize that the <groupId>, <artifactId> and <version> identify a directory under your Maven repository, and that <artifactId> and <version> name the .jar file within that directory, you can see an automatic way to translate that section into a classpath. This is, of course, what Maven does in Java, but can you do it using command-line tools? As it turns out, you can, if you don't mind maintaining a three-deep, multi-command pipeline.
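To make that mapping concrete, here is a minimal sketch of the translation for a single dependency. The function name jar_path and the optional fourth argument are my own inventions for illustration; the default of $HOME/.m2/repository assumes Maven's standard local repository location.

```shell
# jar_path GROUP ARTIFACT VERSION [REPO]
# Prints where the jar for one dependency should live in a local repository.
# jar_path is a hypothetical helper name, not part of Maven.
jar_path() {
  group=$1
  artifact=$2
  version=$3
  repo=${4:-$HOME/.m2/repository}   # assumed default location
  # Dots in the groupId become directory separators; the dots in the
  # version are left alone.
  group_dir=$(printf '%s' "$group" | tr '.' '/')
  printf '%s/%s/%s/%s/%s-%s.jar\n' \
    "$repo" "$group_dir" "$artifact" "$version" "$artifact" "$version"
}

jar_path com.google.gwt gwt-user 2.5.1 /repo
# prints /repo/com/google/gwt/gwt-user/2.5.1/gwt-user-2.5.1.jar
```

The rest of the work is getting those three values out of the XML for every dependency at once.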

I'll use the GWT POM file as an example. The XML is "pretty-printed" for human consumption; sed is going to want it "de-pretty-printed". First, extract just the <dependencies> section, since that is what determines the classpath:

sed -n -e '/<dependencies>/,/<\/dependencies>/p' pom.xml

This gives you:

  <dependencies>
    <!-- Google Web Toolkit (GWT) -->

    <dependency>
      <groupId>com.google.gwt</groupId>
      <artifactId>gwt-user</artifactId>
      <version>2.5.1</version>
      <!-- "provided" so that we don't deploy -->
      <scope>provided</scope>
    </dependency>

    <!-- GWT projects do not usually need a dependency on gwt-dev, but MobileWebApp
         contains a GWTC Linker (AppCacheLinker) which in turn depends on internals
         of the GWT compiler. -->
    <dependency>
      <groupId>com.google.gwt</groupId>
      <artifactId>gwt-dev</artifactId>
      <version>2.5.1</version>
      <!-- "provided" so that we don't deploy -->
      <scope>provided</scope>
    </dependency>

    <!-- RequestFactory server -->

    <dependency>
      <groupId>com.google.web.bindery</groupId>
      <artifactId>requestfactory-server</artifactId>
      <version>2.5.1</version>
    </dependency>

    <!-- RequestFactory will use JSR 303 javax.validation if you let it -->
    <dependency>
      <groupId>org.hibernate</groupId>
      <artifactId>hibernate-validator</artifactId>
      <version>4.1.0.Final</version>
      <exclusions>
        <exclusion>
          <groupId>javax.xml.bind</groupId>
          <artifactId>jaxb-api</artifactId>
        </exclusion>
        <exclusion>
          <groupId>com.sun.xml.bind</groupId>
          <artifactId>jaxb-impl</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <!-- Required by Hibernate validator because slf4j-log4j is
         optional in the hibernate-validator POM -->
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
      <version>1.6.1</version>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-api</artifactId>
      <version>1.6.1</version>
    </dependency>

    <!-- Google App Engine (GAE) -->
    <dependency>
      <groupId>com.google.appengine</groupId>
      <artifactId>appengine-api-1.0-sdk</artifactId>
      <version>1.7.1</version>
    </dependency>
    <dependency>
      <groupId>com.google.appengine</groupId>
      <artifactId>appengine-testing</artifactId>
      <version>1.7.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>com.google.appengine</groupId>
      <artifactId>appengine-api-stubs</artifactId>
      <version>1.7.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>com.google.appengine</groupId>
      <artifactId>appengine-api-labs</artifactId>
      <version>1.7.1</version>
    </dependency>

    <!-- Objectify for persistence. It uses the stock javax.persistence annotations -->

    <dependency>
      <groupId>com.googlecode.objectify</groupId>
      <artifactId>objectify</artifactId>
      <version>3.0</version>
    </dependency>
    <dependency>
      <groupId>javax.persistence</groupId>
      <artifactId>persistence-api</artifactId>
      <version>1.0</version>
    </dependency>

    <!-- GIN and Guice for IoC / DI -->

    <dependency>
      <groupId>com.google.inject</groupId>
      <artifactId>guice</artifactId>
      <version>2.0</version>
    </dependency>
    <dependency>
      <groupId>com.google.gwt.inject</groupId>
      <artifactId>gin</artifactId>
      <version>1.0</version>
    </dependency>
    <!-- Use the JSR 330 injection interfaces-->
    <dependency>
      <groupId>javax.inject</groupId>
      <artifactId>javax.inject</artifactId>
      <version>1</version>
    </dependency>

    <!-- Unit tests -->

    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.8.1</version>
      <scope>test</scope>
    </dependency>
  </dependencies>

Now, concatenate everything onto one line:

sed -n -e '/<dependencies>/,/<\/dependencies>/p' pom.xml | tr -d '\n'

And remove whitespace:

sed -n -e '/<dependencies>/,/<\/dependencies>/p' pom.xml | tr -d '\n\t '
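If you want to see the effect of that tr step in isolation, a small sketch like the following works on a made-up fragment (flatten is just a name I picked for the wrapper):

```shell
# flatten: delete every newline, tab and space from standard input.
# Note that this also removes spaces inside element text and comments.
flatten() {
  tr -d '\n\t '
}

printf '<dependency>\n\t<groupId>junit</groupId>\n  <version>4.8.1</version>\n</dependency>\n' | flatten
# prints <dependency><groupId>junit</groupId><version>4.8.1</version></dependency>
```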

Ah, nice, command-line-friendly ugliness. Remove the comments (which are now useless anyway with the removal of whitespace):

sed -e 's/<!--[^>]*>//g'

(Notice that I can't capture using ".*" here; I have to use "[^>]*" instead to avoid greedy consumption). I couldn't do this in the prior step since it wouldn't have handled multi-line comments correctly. Finally, you have your classpath — it just happens to still be in XML form. You can almost translate it directly like this:

sed -e 's/<dependency><groupId>\([^<]*\)<\/groupId><artifactId>\([^<]*\)<\/artifactId><version>\([^<]*\)<\/version><\/dependency>/\/Users\/joshuadavies\/.m2\/repository\/\1\/\2\/\3\/\2-\3.jar:/g'

Here's what this almost classpath looks like:

One problem here is that the groupIds contain dots instead of path separators. You can't just apply s/\./\//g to the flattened line, since the versions themselves contain dots that you want to preserve. The solution is to apply s/\./\//g only to the <groupId> lines, before concatenating everything onto one line:

sed -n \
  -e '/<groupId>/s/\./\//g' \
  -e '/<dependencies>/,/<\/dependencies>/p' pom.xml |
  tr -d '\n\t ' | \
  sed \
  -e 's/<!--[^>]*>//g' \
  -e 's/<dependency><groupId>\([^<]*\)<\/groupId><artifactId>\([^<]*\)<\/artifactId><version>\([^<]*\)<\/version><\/dependency>/\/Users\/joshuadavies\/.m2\/repository\/\1\/\2\/\3\/\2-\3.jar:/g'
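The address prefix on that first substitution is what keeps the version numbers intact. A quick sketch on two made-up lines shows the difference (dots_to_slashes is a hypothetical wrapper name):

```shell
# dots_to_slashes: rewrite '.' as '/' only on lines that contain <groupId>.
# Lines without <groupId>, such as <version> lines, pass through unchanged.
dots_to_slashes() {
  sed -e '/<groupId>/s/\./\//g'
}

printf '<groupId>org.slf4j</groupId>\n<version>1.6.1</version>\n' | dots_to_slashes
# prints:
# <groupId>org/slf4j</groupId>
# <version>1.6.1</version>
```

This only works while each element still sits on its own line, which is why it has to happen before the tr step.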

There are still a few stragglers in there, though. The inclusion of optional elements like <scope> and <exclusions> cause the last pattern not to match. The easiest way to handle that is to strip them out. <scope> is easy enough to deal with:

sed -n \
  -e '/<groupId>/s/\./\//g' \
  -e '/<dependencies>/,/<\/dependencies>/p' pom.xml |
  tr -d '\n\t ' | \
  sed \
  -e 's/<scope>[^<]*<\/scope>//g' \
  -e 's/<!--[^>]*>//g' \
  -e 's/<dependency><groupId>\([^<]*\)<\/groupId><artifactId>\([^<]*\)<\/artifactId><version>\([^<]*\)<\/version><\/dependency>/\/Users\/joshuadavies\/.m2\/repository\/\1\/\2\/\3\/\2-\3.jar:/g'

<exclusions> is a little trickier, though, since it has multiple child elements. This can be accomplished at the upper (pretty-print) layer by just deleting those lines from the output entirely:

sed -n \
  -e '/<exclusions>/,/<\/exclusions>/d' \
  -e '/<groupId>/s/\./\//g' \
  -e '/<dependencies>/,/<\/dependencies>/p' pom.xml |
  tr -d '\n\t ' | \
  sed \
  -e 's/<scope>[^<]*<\/scope>//g' \
  -e 's/<!--[^>]*>//g' \
  -e 's/<dependency><groupId>\([^<]*\)<\/groupId><artifactId>\([^<]*\)<\/artifactId><version>\([^<]*\)<\/version><\/dependency>/\/Users\/joshuadavies\/.m2\/repository\/\1\/\2\/\3\/\2-\3.jar:/g'

Finally, of course, get rid of the "<dependencies>" delimiters:

sed -n \
  -e '/<exclusions>/,/<\/exclusions>/d' \
  -e '/<groupId>/s/\./\//g' \
  -e '/<dependencies>/,/<\/dependencies>/p' pom.xml |
  tr -d '\n\t ' | \
  sed \
  -e 's/<scope>[^<]*<\/scope>//g' \
  -e 's/<!--[^>]*>//g' \
  -e 's/<dependencies>//g' \
  -e 's/<\/dependencies>//g' \
  -e 's/<dependency><groupId>\([^<]*\)<\/groupId><artifactId>\([^<]*\)<\/artifactId><version>\([^<]*\)<\/version><\/dependency>/\/Users\/joshuadavies\/.m2\/repository\/\1\/\2\/\3\/\2-\3.jar:/g'
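If you'd rather not hard-code the repository path, the same pipeline can be wrapped in a function. This is only a sketch under a few assumptions: pom_classpath is a name I made up, the repository defaults to $HOME/.m2/repository, and the final substitution uses | as the sed delimiter so the path needs no escaping (which means the path itself must not contain | or &).

```shell
# pom_classpath POM [REPO]
# Prints a colon-separated classpath for the dependencies listed in POM.
# Only the dependencies written in the POM itself are covered.
pom_classpath() {
  pom=$1
  repo=${2:-$HOME/.m2/repository}   # assumed default; no '|' or '&' allowed
  sed -n \
    -e '/<exclusions>/,/<\/exclusions>/d' \
    -e '/<groupId>/s/\./\//g' \
    -e '/<dependencies>/,/<\/dependencies>/p' "$pom" |
  tr -d '\n\t ' |
  sed \
    -e 's/<scope>[^<]*<\/scope>//g' \
    -e 's/<!--[^>]*>//g' \
    -e 's/<dependencies>//g' \
    -e 's/<\/dependencies>//g' \
    -e 's|<dependency><groupId>\([^<]*\)</groupId><artifactId>\([^<]*\)</artifactId><version>\([^<]*\)</version></dependency>|'"$repo"'/\1/\2/\3/\2-\3.jar:|g'
  echo   # tr removed the final newline, so add one back
}
```

You might then run something like java -cp "target/classes:$(pom_classpath pom.xml)" com.example.Main, where com.example.Main stands in for whatever your own main class is.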

I did cheat in one place — the original POM file had placeholders for the GWT and GAE versions ${gwtVersion} and ${gae.version}. I replaced them in the POM file to simplify my work here; you could get a little wilder and do something like this:

GWT_VERSION=`grep "<gwtVersion>" pom.xml | sed -e 's/^ *<gwtVersion>\(.*\)<\/gwtVersion> *$/\1/'`
GAE_VERSION=`grep "<gae.version>" pom.xml | sed -e 's/^ *<gae.version>\(.*\)<\/gae.version> *$/\1/'`
sed -e 's/${gwtVersion}/'${GWT_VERSION}'/' -e 's/${gae.version}/'${GAE_VERSION}'/' pom.xml

All of this would run before parsing the POM file. This approach, however, would still require you to name each expansion variable in the script itself. Ideally, you'd do here exactly what Maven does: dynamically replace the property placeholders with the values that the user entered. These values come from the <properties> section of the POM; you can convert the <properties> list into a set of name-value pairs via:

sed -n -e '/<properties>/,/<\/properties>/s/ *<\([^>]*\)>\([^<]*\)<.*$/\1=\2/p' pom.xml

So that:

  <properties>
    <!-- Convenience property to set the GWT version -->
    <gwtVersion>2.5.1</gwtVersion>

    <!-- GWT needs at least java 1.6 -->
    <maven.compiler.source>1.6</maven.compiler.source>
    <maven.compiler.target>1.6</maven.compiler.target>

    <!-- GAE properties -->
    <gae.version>1.7.1</gae.version>
    <gae.application.version>1</gae.application.version>

    <!-- Don't let your Mac use a crazy non-standard encoding -->
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
  </properties>

Becomes:

gwtVersion=2.5.1
maven.compiler.source=1.6
maven.compiler.target=1.6
gae.version=1.7.1
gae.application.version=1
project.build.sourceEncoding=UTF-8
project.reporting.outputEncoding=UTF-8
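As a rough sketch of how those pairs could be consumed, the extraction can be wrapped in a function and read back in a loop, which looks a value up by name without assigning any shell variables. The names pom_properties and pom_property are hypothetical.

```shell
# pom_properties POM: print the <properties> section as name=value lines.
pom_properties() {
  sed -n -e '/<properties>/,/<\/properties>/s/ *<\([^>]*\)>\([^<]*\)<.*$/\1=\2/p' "$1"
}

# pom_property POM NAME: print the value of one property, if it is declared.
pom_property() {
  wanted=$2
  pom_properties "$1" | while IFS='=' read -r name value; do
    if [ "$name" = "$wanted" ]; then
      printf '%s\n' "$value"
    fi
  done
}
```

For example, pom_property pom.xml gae.version would print 1.7.1 for the properties shown above.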

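As is, this could be fed to eval, although names containing a '.', such as gae.version, are not valid shell variable names and would have to be renamed first: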

eval `sed -n -e '/<properties>/,/<\/properties>/s/ *<\([^>]*\)>\([^<]*\)<.*$/\1=\2/p' pom.xml`

This would create a set of shell variables in the script representing the expansion variables declared in the POM. But that's not really what you want here; you want sed substitution. So, rather than building a list of variable declarations, you can actually build another sed command! (Bet you never realized that sed supports recursion, eh?)

eval "sed `sed -n -e "/<properties>/,/<\/properties>/s/ *<\([^>]*\)>\([^<]*\)<.*$/-e 's\/\\\\$\\{\1\\}\/\2\/' /p" pom.xml | tr -d '\n'` pom.xml"

Which expands to:

sed -e 's/\${gwtVersion}/2.5.1/' 
	-e 's/\${maven.compiler.source}/1.6/' 
	-e 's/\${maven.compiler.target}/1.6/' 
	-e 's/\${gae.version}/1.7.1/' 
	-e 's/\${gae.application.version}/1/' 
	-e 's/\${project.build.sourceEncoding}/UTF-8/' 
	-e 's/\${project.reporting.outputEncoding}/UTF-8/' pom.xml

And feed this stream into the classpath builder:

eval "sed `sed -n -e "/<properties>/,/<\/properties>/s/ *<\([^>]*\)>\([^<]*\)<.*$/-e 's\/\\\\$\\{\1\\}\/\2\/' /p" pom.xml | 
tr -d '\n'` pom.xml" |
sed -n \
  -e '/<exclusions>/,/<\/exclusions>/d' \
  -e '/<groupId>/s/\./\//g' \
  -e '/<dependencies>/,/<\/dependencies>/p' |
  tr -d '\n\t ' | \
  sed \
  -e 's/<scope>[^<]*<\/scope>//g' \
  -e 's/<!--[^>]*>//g' \
  -e 's/<dependencies>//g' \
  -e 's/<\/dependencies>//g' \
  -e 's/<dependency><groupId>\([^<]*\)<\/groupId><artifactId>\([^<]*\)<\/artifactId><version>\([^<]*\)<\/version><\/dependency>/\/Users\/joshuadavies\/.m2\/repository\/\1\/\2\/\3\/\2-\3.jar:/g'

And this actually creates a CLASSPATH that you can use to execute the project from the command line; no setup required. There are probably a few edge cases not dealt with here — spaces in command lines or instances where property names might mis-expand due to the use of '.' characters in the search criteria of a sed command are two that spring to mind — but in practical use, I've actually found this to be workable for real-world Maven projects.
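For the '.' edge case in particular, one possible workaround is to write the generated substitutions to a temporary sed script, escaping the dots in each property name along the way, and run it with sed -f instead of eval. This is a sketch, not what I used above: expand_properties is a hypothetical name, and it assumes property values contain no |, & or backslash characters.

```shell
# expand_properties POM
# Prints POM with each ${name} placeholder replaced by the value declared
# for <name> in the <properties> section.
expand_properties() {
  pom=$1
  script=$(mktemp)
  sed -n -e '/<properties>/,/<\/properties>/s/ *<\([^>]*\)>\([^<]*\)<.*$/\1=\2/p' "$pom" |
  while IFS='=' read -r name value; do
    # Escape '.' so that gae.version matches only a literal dot.
    escaped=$(printf '%s' "$name" | sed 's/\./\\./g')
    # Emits one line such as: s|\${gae\.version}|1.7.1|g
    printf 's|\\${%s}|%s|g\n' "$escaped" "$value"
  done > "$script"
  sed -f "$script" "$pom"
  rm -f "$script"
}
```

The output of expand_properties pom.xml can be piped into the classpath-building sed in place of the eval line.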

Sumeet, 2017-09-23
Hey,

I am running a jenkins pipeline build. In the build I am cloning a repo from git and it has a pom.xml. Once I clone I want to increase the version of pom by one. I am looking for a sed command to do so.

Thanks In Advance
Josh, 2017-10-04
Hm - I really can't see any good way to do that with sed; you want to do arithmetic in sed, which it doesn't really support. The best solution I can think of for that would be an XSL transform, using, say, xsltproc:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:pom="http://maven.apache.org/POM/4.0.0" 
                    xmlns="http://maven.apache.org/POM/4.0.0" 
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
                    exclude-result-prefixes="pom"
                    version="1.0">
      <xsl:output method="xml" indent="yes" />
      
      <xsl:template match="@* | node()">
        <xsl:copy>
          <xsl:apply-templates select="@* | node()" />
        </xsl:copy>
      </xsl:template>
      
      <xsl:template match="pom:project/pom:version">
        <xsl:element name="version">
          <xsl:value-of select="concat(substring-before(., '.'), '.', number(substring-after(., '.')) + 1)" />
        </xsl:element>
      </xsl:template>
    </xsl:stylesheet>

Joshua Davies
