Details
Description
Goals
- Reproducible reference ASF CI environment so contributors can clone it.
- An accepted “test result output” format that will certify a commit regardless of CI env.
- Turnaround times as fast as circleci (cloned environment scales to capacity).
- Intuitive CI implementation accessible to new contributors.
Existing Problems
- Many unknown flakies due to infrequent failure rates and limited test history,
- time-consuming to identify flakies as infra-related, test-related, code-related,
- ci-cassandra.a.o is hard to debug (donated heterogenous servers around the world, ASF controlled with limited physical access, two executors per agent [noisy neighbour]),
- slow turnaround times compared to circleci, also very variable times as the fixed resource pool running both post- and pre-commit CI becomes easily saturated,
- difficult to pre-commit test jenkins and cassandra-build changes,
- CI development efforts is split between ci-cassandra and circleci, despite ci-cassandra being our canonical and non-commercial CI,
- lacking parity of what is tested between ci-cassandra and circleci
- circleci is restricted to those with access to premium commercial circleci, creating classes of engineers in the community and an exclusive OSS culture,
- cassandra-builds as a separate repo (without release branches matching in-tree) adds complexity to changing matrix values (jdks, pythons, dist)
- mixture of jenkins dsl groovy, declarative and scripting pipeline.
- different pre-commit and post-commit jenkins pipelines are used.
Additional Goals
- Identify all remaining test flakies.
- Thin CI implementation that builds on a common set of CI-agnostic build and test scripts. Contributors are free to add/maintain other CI solutions while remaining aligned on what and how we test.
- Extendable by downstream codebases (designed for re-use and extension).
Proposal
- Provide a jenkins k8s operator based script that with one command-line spawns a ci-cassandra.a.o clone on the k8s cluster in context, runs the Jenkinsfile pipeline, saves the test result, and tears down the ci-cassandra.a.o clone. [turnkey solution]
- Parameters make spawn and tear-down optional, so ci-cassandra.a.o clones can be re-usable.
- Bring build and test scripts (including their docker images) from cassandra-builds to in-tree
- Provide a declarative jenkins pipeline that maps stages to CI-agnostic build and test scripts.
- CI-agnostic build and test scripts can be run with docker, and without any CI, on any machine.
- Branch specific testing context is defined outside of the CI code.
Unknowns
- with the known pipeline steps, the matrixes we desire (jdk, python, dist, arch), and the parallelisation possible, what is the fastest turnaround time we can expect,
- what parameterisation to the script is required for typical developer testing pre-commit,
- what is the cost of a single pipeline run, what is the expected cost of a year's post-commit CI,
- what is the size of the test result and how can it be saved and shared,
- what is the ideal stable agent resource specifications, can we use heterogenous environments if different test types have different minimum requirements,
- is multiplexing testing in jenkins a requirement to this epic (see
CASSANDRA-17932) - how does the community provide CI to contributors that cannot afford the k8s cluster costs
Non Goals
- deciding what to do with ci-cassandra.a.o if a ci-cassandra.a.o clone is donated and provides a more stable post-commit environment than the original ci-cassandra.a.o
- any work/improvements on circleci (e.g.
CASSANDRA-18001) - discussing or changing our branching and merging strategies
- introduction of a develop or staging branch for final pass pre-commit CI optimisation
- jira integration (automatic bot comments of test results) (see CASSANDRA-17277)
- introduction/support of additional tests: code coverage, jmh benchmarking, or larger performance testing; or matrix axis (jdks, pythons, dists, etc) or other checks. (e.g. CASSANDRA-18077, CASSANDRA-18072)
Timing and Priorisation
- Both 4.0 and 4.1 major releases have been delayed (and significant amounts of engineering time spent) on addressing an unknown number of flakies (further exacerbated by unknown CI infra problems).
- Test failures are still being caught months after commit and merge.
- 5.0 promises a significant increase in large contributions, increasing the risk and cost of failing to do stable trunk development.
- Downstream users are requesting closer alignment to upstream, and convergence to upstream's CI approach (including extending it for additional QA).
History and background context can be read in the previous 'Cassandra CI Status' dev@ threads: https://lists.apache.org/list?dev@cassandra.apache.org:lte=5y:%22Cassandra%20CI%20Status%22 and in this gdoc
Attachments
Issue Links
- is duplicated by
-
CASSANDRA-18273 Timeout occurred. Please note the time in the report does not reflect the time until the timeout.
- Resolved
- is related to
-
CASSANDRA-18731 Add declarative root CI structure
- In Progress
-
CASSANDRA-18567 Add jobs for resource-intensive Python upgrade tests
- Open