At my company we are developing MapReduce applications on Hadoop. There is a debate going on over dependency management for these projects, and I would like to hear your opinion.
We are using Cloudera’s Hadoop distribution (CDH).
Our development workflow:
- MapReduce projects are hosted in SVN repos
- each of them has a POM file with dependencies defined (and some other stuff too)
- we also create Oozie workflow projects, which declare these MapReduce projects as dependencies in their POM and define the execution flow of the MapReduce projects
- the build artifact of an Oozie project is a jar file containing all the MapReduce jars it uses and their dependencies (we use Maven’s assembly plugin to package it); this is the artifact we later deploy to HDFS (after extracting it)
- we build the projects with Maven, managed by Jenkins
- successful builds get deployed to an Archiva server
- deployment to HDFS is on-demand from Archiva, getting the artifact of the Oozie project build, extracting it and putting it to HDFS
- some dependencies (namely the ones used by Oozie: Hive, Sqoop, MySQL connector, Jline, commons-…, etc.) are not needed to build the projects, but they are needed to run them
Still with me?
Now the debate is about how to define these dependencies of the MapReduce and Oozie projects. There are two standpoints.
One says there is no need to define these dependencies (i.e. the ones not needed to build the projects) in the POM files; instead, keep them in a shared directory in HDFS and always assume they are there.
Pros:
- devs don’t need to take care of these (however, they still take care of the others)
- most likely, when updating the CDH distribution, it’s easier to update these in the shared directory than in each project individually (not sure if this is necessary though)
Cons:
- some dependencies are defined for the projects and some are assumed, which doesn’t feel right
- the shared directory can become a sink of unused JARs, and no one will know which are still used and which are not
- code becomes less portable because it assumes these JARs are always there in HDFS with the right version
So what do you guys think?
EDIT: forgot to write, but it’s quite obvious, that the 2nd option is to define all dependencies – even if they will repeat for most projects and need some maintenance.
I vote for the second option: declare and maintain the dependencies in each project instead of relying on a shared directory. The shared directory will change over time, and sooner or later some project will stop working because someone removed or changed a JAR it depended on. It is better to keep each dependency in the POM of the project it is intended for. That way, any project runs out of the box, independent of the current state of the shared directory.
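For the JARs that are needed only to run the jobs and not to build them (the question mentions the MySQL connector, among others), Maven's runtime scope might be a reasonable fit. The fragment below is only a sketch: the version property name is a placeholder I made up, and the coordinates should be checked against what your CDH release actually ships.

```xml
<!-- Fragment of a project POM: a JAR needed at run time only -->
<dependencies>
  <dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <!-- hypothetical property; define it in <properties> or in a parent POM -->
    <version>${mysql.connector.version}</version>
    <!-- runtime: left off the compile classpath, present on the runtime classpath -->
    <scope>runtime</scope>
  </dependency>
</dependencies>
```

As far as I know, a dependencySet in an assembly descriptor defaults to the runtime scope, which covers both compile and runtime dependencies, so such JARs should end up in the assembled artifact. It is worth verifying this against your own descriptor.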
To reduce the repetition, you might think about a parent POM that holds the default dependencies. Declare them, with their versions, in its dependencyManagement section; each project then declares the dependencies it really uses, without the versions.
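A minimal sketch of that parent POM setup could look like the following. The com.example coordinates, the hadoop-parent name and the hive.version property are all assumptions for illustration; only the dependencyManagement mechanism itself is standard Maven.

```xml
<!-- File 1: parent POM (hypothetical coordinates com.example:hadoop-parent) -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>hadoop-parent</artifactId>
  <version>1.0.0</version>
  <packaging>pom</packaging>

  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <!-- hypothetical property; set it to the version matching your CDH release -->
        <version>${hive.version}</version>
        <scope>runtime</scope>
      </dependency>
    </dependencies>
  </dependencyManagement>
</project>

<!-- File 2: fragment of a MapReduce or Oozie project POM -->
<parent>
  <groupId>com.example</groupId>
  <artifactId>hadoop-parent</artifactId>
  <version>1.0.0</version>
</parent>

<dependencies>
  <dependency>
    <!-- no version or scope here: both are inherited from dependencyManagement -->
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
  </dependency>
</dependencies>
```

Note that dependencyManagement only fixes versions and scopes. A project gets a dependency only if it lists it, so every POM still documents what the project really uses, and a CDH upgrade means changing the versions in one place.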
Another solution might be to use the import scope. With it, you can define a set of dependencies once, in a single POM project that is responsible for them, instead of maintaining that set in every project.
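Roughly, the import-scope variant could be sketched like this. Again, com.example:hadoop-deps and the sqoop.version property are hypothetical names; the import scope itself requires Maven 2.0.9 or later and is only valid inside dependencyManagement with type pom.

```xml
<!-- File 1: the POM project that owns the dependency set (hypothetical coordinates) -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>hadoop-deps</artifactId>
  <version>1.0.0</version>
  <packaging>pom</packaging>

  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>org.apache.sqoop</groupId>
        <artifactId>sqoop</artifactId>
        <!-- hypothetical property; set it to the version matching your CDH release -->
        <version>${sqoop.version}</version>
      </dependency>
    </dependencies>
  </dependencyManagement>
</project>

<!-- File 2: fragment of a consuming project POM -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.example</groupId>
      <artifactId>hadoop-deps</artifactId>
      <version>1.0.0</version>
      <!-- import pulls in the managed dependencies of hadoop-deps -->
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

<dependencies>
  <dependency>
    <!-- version comes from the imported dependency set -->
    <groupId>org.apache.sqoop</groupId>
    <artifactId>sqoop</artifactId>
  </dependency>
</dependencies>
```

Compared with a parent POM, the import approach leaves the parent slot free, which can matter if your projects already inherit from a company-wide parent.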