  1. Hadoop Common
  2. HADOOP-14898

Create official Docker images for development and testing features


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      This is the original mail from the mailing list:

      TL;DR: I propose to create official hadoop images and upload them to the dockerhub.
      
      GOAL/SCOPE: I would like to improve the existing documentation with easy-to-use, docker-based recipes for starting hadoop clusters with various configurations.
      
      The images could also be used to test experimental features. For example, ozone could be tested easily with this compose file and configuration:
      
      https://gist.github.com/elek/1676a97b98f4ba561c9f51fce2ab2ea6
      
      Or even the configuration could be included in the compose file:
      
      https://github.com/elek/hadoop/blob/docker-2.8.0/example/docker-compose.yaml
      
      I would like to create separate example compose files for federation, HA, metrics usage, etc. to make it easier to try out and understand the features.
      
      CONTEXT: There is an existing Jira https://issues.apache.org/jira/browse/HADOOP-13397
      But it’s about a tool to generate production-quality docker images (multiple types, in a flexible way). If there are no objections, I will create a separate issue for simplified docker images aimed at rapid prototyping and investigating new features, and register the branch on the dockerhub so that the images are built automatically.
      
      MY BACKGROUND: I have been working with docker-based hadoop/spark clusters for quite a while and have run them successfully in different environments (kubernetes, docker-swarm, nomad-based scheduling, etc.). My work is available at https://github.com/flokkr but those images handle more complex use cases (e.g. instrumenting java processes with btrace, or reading/reloading configuration from consul).
      And IMHO it’s better for the official hadoop documentation to suggest official apache docker images rather than external ones (which could change).
      

      The following list enumerates the key decision points for creating the docker images:

      A. automated dockerhub build / jenkins build

      Docker images could be built on the dockerhub (a branch pattern and the location of the Dockerfiles should be defined for a github repository) or could be built on a CI server and pushed.

      The second one is more flexible (it's easier to create a matrix build, for example).
      The first one has the advantage that we get an additional flag on the dockerhub showing that the build is automated (and built from the source by the dockerhub).

      The decision is easy as ASF supports the first approach: (see https://issues.apache.org/jira/browse/INFRA-12781?focusedCommentId=15824096&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15824096)

      B. source: binary distribution or source build

      The second question is about how the docker image is created. One option is to build the software on the fly during the creation of the docker image; the other is to use the binary releases.

      I suggest using the second approach because:

      1. In that case the hadoop:2.7.3 image could contain exactly the same hadoop distribution as the downloadable one

      2. We don't need to add development tools to the image, so the image could be smaller (which is important, as the goal of this image is to get started as fast as possible)

      3. The docker definition will be simpler (and easier to maintain)

      Usually this approach is used in other projects (I checked Apache Zeppelin and Apache Nutch)
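
      As a rough illustration of the binary-release approach, the sketch below derives the download location for a given release and checks that a downloaded tarball is byte-for-byte the expected one. The archive URL layout, the function names, and the use of a SHA-256 digest obtained from a trusted source are assumptions made for this example and are not part of the proposal.

```python
import hashlib

# Assumption: binary releases follow this Apache archive layout.
ARCHIVE_ROOT = "https://archive.apache.org/dist/hadoop/common"


def release_url(version: str) -> str:
    """Return the assumed download URL of the binary tarball for a release."""
    return f"{ARCHIVE_ROOT}/hadoop-{version}/hadoop-{version}.tar.gz"


def matches_release(tarball: bytes, expected_sha256: str) -> bool:
    """Check that the tarball bytes hash to the expected SHA-256 hex digest."""
    actual = hashlib.sha256(tarball).hexdigest()
    return actual == expected_sha256.strip().lower()
```

      An image build could fetch release_url("2.7.3") and refuse to continue when matches_release returns False, so that the image carries the same distribution as the downloadable one.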

      C. branch usage

      Another question is the location of the Dockerfile. It could be on the official source-code branches (branch-2, trunk, etc.) or we can create separate branches for the dockerhub (e.g. docker/2.7, docker/2.8, docker/3.0).

      With the first approach it's easier to find the Dockerfiles, but it's less flexible. For example, if we had a Dockerfile in the source code, it would have to be used for every release (for example, the Dockerfile from the tag release-3.0.0 would be used for the 3.0 hadoop docker image). In that case the release process is much harder: in case of a Dockerfile error (which could be tested on the dockerhub only after the tagging), a new release would have to be made after fixing the Dockerfile.

      Another problem is that with tags it's not possible to improve the Dockerfiles. I can imagine that we would like to improve, for example, the hadoop:2.7 images (for example by adding smarter startup scripts) while using exactly the same hadoop 2.7 distribution.

      Finally, with the tag-based approach we can't create images for the older releases (2.8.1, for example).

      So I suggest creating separate branches for the Dockerfiles.

      D. Versions

      We can create a separate branch for every version (2.7.1/2.7.2/2.7.3) or just for the main version (2.8/2.7). As these docker images are not for production but for prototyping, I suggest using (at least as a first step) just 2.7/2.8 and updating the images with each bugfix release.
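
      The mapping this implies can be sketched as follows: every bugfix release of a line lands on the same branch and the same image tag. The function names are hypothetical; the docker/2.7 branch pattern and the hadoop:2.7 image name are taken from the discussion above.

```python
def release_line(release: str) -> str:
    """Reduce a release such as '2.7.3' to its main version, '2.7'.

    Raises ValueError if the release has no minor component.
    """
    major, minor = release.split(".")[:2]
    return f"{major}.{minor}"


def docker_branch(release: str) -> str:
    """Branch that would hold the Dockerfile for this release line."""
    return f"docker/{release_line(release)}"


def image_tag(release: str) -> str:
    """Image name shared by all bugfix releases of the same line."""
    return f"hadoop:{release_line(release)}"
```

      With this scheme docker_branch("2.7.1") and docker_branch("2.7.3") both give docker/2.7, so a new bugfix release updates the existing image instead of adding a branch.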

      E. Number of images

      There are two options here, too: create a separate image for every component (namenode, datanode, etc.) or create just one image, in which case the command has to be defined manually wherever the image is used. The second seems to be more complex to use, but I think the maintenance is easier, and it's more visible what is being started.
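
      To make the single-image option concrete, here is a minimal sketch that builds a compose-style structure in which every service uses the same image and differs only in its command. The image name and service names are assumptions for illustration, and the configuration a real cluster needs (for example the filesystem address) is left out.

```python
# Hypothetical image name; every component runs from the same image.
IMAGE = "hadoop:2.7"

# The command is what distinguishes the components.
COMMANDS = {
    "namenode": ["hdfs", "namenode"],
    "datanode": ["hdfs", "datanode"],
    "resourcemanager": ["yarn", "resourcemanager"],
    "nodemanager": ["yarn", "nodemanager"],
}


def compose_services(image: str = IMAGE) -> dict:
    """Build a compose-style mapping with one service per component."""
    return {
        "services": {
            name: {"image": image, "command": list(command)}
            for name, command in COMMANDS.items()
        }
    }
```

      The explicit command in each service is the manual definition mentioned above; it is also what makes it visible which process each container starts.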

      F. Snapshots

      According to the spirit of the Release policy:

      https://www.apache.org/dev/release-distribution.html#unreleased

      We should distribute only final releases to the dockerhub and not snapshots. But we can also create an empty hadoop-runner image, which contains the starter scripts but not hadoop. It would be used for local development, where the newly built distribution could be mapped into the container with docker volumes.
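
      A sketch of how the hadoop-runner idea might be used locally: the function below assembles a docker run command line that bind-mounts a locally built distribution into the container. The mount point /opt/hadoop and the assumption that the runner image accepts the hadoop command as its arguments are illustrative guesses and are not defined by this issue.

```python
import os

RUNNER_IMAGE = "hadoop-runner"  # image name from the text
MOUNT_POINT = "/opt/hadoop"     # hypothetical location inside the container


def runner_command(dist_dir: str, command: list) -> list:
    """Build a `docker run` argument list that mounts a local build."""
    # Bind mounts need an absolute host path.
    host_path = os.path.abspath(dist_dir)
    return [
        "docker", "run", "--rm",
        "-v", f"{host_path}:{MOUNT_POINT}",
        RUNNER_IMAGE,
        *command,
    ]
```

      The returned list could be passed to subprocess.run to start, for example, a namenode from a freshly built snapshot without publishing any snapshot image.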

      Attachments

        1. HADOOP-14898.001.tar.gz
          7 kB
          Marton Elek
        2. HADOOP-14898.002.tar.gz
          7 kB
          Marton Elek
        3. HADOOP-14898.003.tgz
          7 kB
          Marton Elek
        4. docker_design.pdf
          81 kB
          Marton Elek

          People

            elek Marton Elek
            elek Marton Elek