---
status: rejected
date: 1997-06-19
deciders: ['@as130521', '@ncrvoyix-swt-retail/edge']
informed: ['@nobody']
consulted: ['@nobody']
tags:
- build
- ci
- test
---

# 0002: Use Argo Workflows for K8s automation engine

## Context and Problem Statement

We have been running GitHub Actions private runner VMs for CI until now. As we look
to implement more sophisticated CI systems (e.g., ncr-swt-retail/edge-roadmap#1367,
ncr-swt-retail/edge-roadmap#6454), we feel GitHub Actions isn't meeting our needs:

- Ability to fan out dynamically and conditionally (e.g., one step produces a
  list, the next step fans out to handle each list item in parallel).
- Artifact sharing between jobs involves copying data out of GCP, where we
  host our runners, to GitHub servers, and then back to a separate runner in GCP.
  We shouldn't need to step outside of our workflow engine's features to implement
  something as simple as job steps passing outputs to each other.
- Actions does not facilitate the reuse of private actions across repositories.
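
The artifact round-trip looks like this in practice; a minimal sketch, where the job
names, file paths, and `self-hosted` runner labels are illustrative:

```yaml
# Even though both jobs run on our GCP-hosted runners, the file must be
# uploaded to GitHub's artifact storage and downloaded again.
jobs:
  build:
    runs-on: [self-hosted, gcp]
    steps:
      - run: echo data > out.txt
      # Copies out.txt from GCP up to GitHub's servers
      - uses: actions/upload-artifact@v4
        with:
          name: build-output
          path: out.txt
  test:
    runs-on: [self-hosted, gcp]
    needs: build
    steps:
      # Copies it back down into GCP on a different runner
      - uses: actions/download-artifact@v4
        with:
          name: build-output
```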

## Decision Drivers

- Operating K8s clusters is our bread and butter. Operating the VMs for GitHub
  Actions as the team scaled up has been a continual source of headaches and
  toil:
  - Responsible for producing and using a VM image which is compliant with NCR
    security requirements.
  - VMs are stateful, even while using the container execution agent for actions.
    State between builds accumulates and causes sporadic issues over time.
- GitHub Actions can only run one job per runner at a time. Compared to K8s,
  autoscaling VMs horizontally is outside our wheelhouse. Even without autoscaling,
  K8s's ability to run multiple concurrent jobs on a single node will allow us to
  add more workflows (especially lightweight ones) without needing to add more
  runners to handle the throughput.
- Secret and identity management: on GKE worker nodes, we can leverage workload
  identity to ensure each workflow has only the permissions it needs and can
  access its secrets in GCP Secret Manager.
  - Currently on GitHub Actions, we use one over-provisioned service account
    with all of the permissions for all of the workflows.
  - Despite having workload identity, we still need to manage an image pull
    secret in GitHub separately so that the Actions agent can pull our build
    container.
- Ability to build deeper integrations and reuse existing integrations via the
  K8s API (including the ability to intercept jobs before scheduling and modify
  them) vs the GitHub API.
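
As a sketch of the per-workflow identity model (all names here are hypothetical),
each workflow would run as its own K8s ServiceAccount, annotated to impersonate a
dedicated GCP service account via GKE workload identity:

```yaml
# One ServiceAccount per workflow, bound to a GCP service account that
# holds only the permissions that workflow needs.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: release-workflow
  namespace: ci
  annotations:
    iam.gke.io/gcp-service-account: release-workflow@my-project.iam.gserviceaccount.com
```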

## Considered Options

- [Tekton](https://github.com/tektoncd/pipeline)
- [Argo Workflows](https://github.com/argoproj/argo-workflows)
- [actions-runner-controller](https://github.com/actions/actions-runner-controller)

## Decision Outcome

Argo Workflows was chosen because of the high friction associated with adopting
Tekton.

## Pros and Cons of the Options

### Tekton

- **Pro**: maintained by Google, Red Hat, and others
- **Con**: despite its popularity, it doesn't seem to see much real-world usage.
  Features have been added to the Pipelines project without full integration with
  the Triggers project ([tektoncd/triggers#1562](https://github.com/tektoncd/triggers/issues/1562))
  for multiple releases, despite tektoncd/triggers being the sole solution for
  triggering or scheduling Tekton Pipelines.
- **Con**: similarly, many of the features seem like pet projects of maintainers
  rather than improvements to the core functionality of being a K8s orchestration
  engine.

### Argo Workflows

- **Pro**: long-standing reputation for real-world usage in large-scale
  environments, like [machine learning](https://argoproj.github.io/argo-workflows/use-cases/machine-learning/)
  and data pipelines.
- **Pro**: can dynamically fan out based on output from previous steps
- **Pro**: has concurrency controls for workflows
- **Pro**: can pass artifacts between steps without provisioning a PVC by using
  `emptyDir`

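The dynamic fan-out we want maps directly onto Argo's `withParam`; a minimal sketch
(images and template names are illustrative), where the second step expands into one
pod per element of the JSON list the first step prints:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: fan-out-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: list-items
        template: gen-list
    # Fan out: withParam iterates over the JSON list captured from
    # the previous step's stdout ({{steps.list-items.outputs.result}})
    - - name: handle-item
        template: process
        arguments:
          parameters:
          - name: item
            value: "{{item}}"
        withParam: "{{steps.list-items.outputs.result}}"
  - name: gen-list
    container:
      image: busybox
      command: [sh, -c]
      args: ['echo ''["a","b","c"]''']
  - name: process
    inputs:
      parameters:
      - name: item
    container:
      image: busybox
      command: [sh, -c]
      args: ['echo processing {{inputs.parameters.item}}']
```
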
### actions-runner-controller

- **Con**: despite running on K8s, it actually uses Docker-based Actions runners,
  which is an unnecessary layer of indirection on K8s and introduces a large
  security surface area via the Docker runtime for all job steps.
- **Con**: it was a community project until GitHub took ownership this year, so
  while GitHub may advertise it as an official solution, it has more rough edges
  than Actions runners built in-house (e.g., VM-based). The [troubleshooting
  guide](https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md#operations)
  and open issues are indicative of difficulties getting Actions runners to
  work nicely with K8s (runners getting stuck, jobs not getting assigned to
  runners, runners coming up before the network is available, etc.), which don't
  exist for K8s-native workflow engines.
- **Con**: despite running on K8s, [it has to be autoscaled in a special way
  instead of autoscaling the number of K8s nodes](https://github.com/actions/actions-runner-controller/blob/master/docs/automatically-scaling-runners.md).