---
status: rejected
date: 1997-06-19
deciders: ['@as130521', '@ncrvoyix-swt-retail/edge']
informed: ['@nobody']
consulted: ['@nobody']
tags:
- build
- ci
- test
---

# 0002: Use Argo Workflows for K8s automation engine

## Context and Problem Statement

We have been running GitHub Actions private runner VMs for CI until now. As we look
to implement more sophisticated CI systems (e.g., ncr-swt-retail/edge-roadmap#1367,
ncr-swt-retail/edge-roadmap#6454), we feel GitHub Actions isn't meeting our needs:

- Ability to fan out dynamically and conditionally (e.g., one step produces a
  list, the next step fans out to handle each list item in parallel).
- Artifact sharing between jobs involves copying data out of GCP, where we
  host our runners, to GitHub servers, and then back to a separate runner in GCP.
  We shouldn't need to step outside of our workflow engine's features to implement
  something as simple as job steps passing outputs to each other.
- Actions does not facilitate the reuse of private actions across repositories.
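
The artifact round-trip looks like this in practice; a minimal sketch, where the job
names, file paths, and `self-hosted` runner labels are illustrative:

```yaml
# Even though both jobs run on our GCP-hosted runners, the file must be
# uploaded to GitHub's artifact storage and downloaded again.
jobs:
  build:
    runs-on: [self-hosted, gcp]
    steps:
      - run: echo data > out.txt
      # Copies out.txt from GCP up to GitHub's servers
      - uses: actions/upload-artifact@v4
        with:
          name: build-output
          path: out.txt
  test:
    runs-on: [self-hosted, gcp]
    needs: build
    steps:
      # Copies it back down into GCP on a different runner
      - uses: actions/download-artifact@v4
        with:
          name: build-output
```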

## Decision Drivers

- Operating K8s clusters is our bread and butter. Operating the VMs for GitHub
  Actions as the team scaled up has been a continual source of headaches and
  toil:
  - Responsible for producing and using a VM image which is compliant with NCR
    security requirements.
  - VMs are stateful, even while using the container execution agent for actions.
    State between builds accumulates and causes sporadic issues over time.
- GitHub Actions can only run one job per runner at a time. Compared to K8s,
  autoscaling VMs horizontally is outside our wheelhouse. Even without autoscaling,
  K8s's ability to run multiple concurrent jobs on a single node will allow us to
  add more workflows (especially lightweight ones) without needing to add more
  runners to handle the throughput.
- Secret and identity management: on GKE worker nodes, we can leverage workload
  identity to ensure each workflow has only the permissions it needs and can
  access its secrets in GCP Secret Manager.
  - Currently on GitHub Actions, we use one over-provisioned service account
    with all of the permissions for all of the workflows.
  - Despite having workload identity, we still need to manage an image pull
    secret in GitHub separately so that the Actions agent can pull our build
    container.
- Ability to build deeper integrations and reuse existing integrations via the
  K8s API (including the ability to intercept jobs before scheduling and modify
  them) vs the GitHub API.
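
As a sketch of the per-workflow identity model (all names here are hypothetical),
each workflow would run as its own K8s ServiceAccount, annotated to impersonate a
dedicated GCP service account via GKE workload identity:

```yaml
# One ServiceAccount per workflow, bound to a GCP service account that
# holds only the permissions that workflow needs.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: release-workflow
  namespace: ci
  annotations:
    iam.gke.io/gcp-service-account: release-workflow@my-project.iam.gserviceaccount.com
```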

## Considered Options

- [Tekton](https://github.com/tektoncd/pipeline)
- [Argo Workflows](https://github.com/argoproj/argo-workflows)
- [actions-runner-controller](https://github.com/actions/actions-runner-controller)

## Decision Outcome

Argo Workflows was chosen because of the high friction associated with adopting
Tekton.

## Pros and Cons of the Options

### Tekton

- **Pro**: maintained by Google, Red Hat, and others
- **Con**: despite its popularity, it doesn't seem to see much real-world usage.
  Features have been added to the Pipelines project without full integration with
  the Triggers project ([tektoncd/triggers#1562](https://github.com/tektoncd/triggers/issues/1562))
  for multiple releases, despite tektoncd/triggers being the sole solution for
  triggering or scheduling Tekton Pipelines.
- **Con**: similarly, many of the features seem like pet projects of maintainers
  rather than improvements to the core functionality of being a K8s orchestration
  engine.

### Argo Workflows

- **Pro**: long-standing reputation for real-world usage in large-scale
  environments, like [machine learning](https://argoproj.github.io/argo-workflows/use-cases/machine-learning/)
  and data pipelines.
- **Pro**: can dynamically fan out based on output from previous steps
- **Pro**: has concurrency controls for workflows
- **Pro**: can pass artifacts between steps without provisioning a PVC by using
  `emptyDir`

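The dynamic fan-out we want maps directly onto Argo's `withParam`; a minimal sketch
(images and template names are illustrative), where the second step expands into one
pod per element of the JSON list the first step prints:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: fan-out-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: list-items
        template: gen-list
    # Fan out: withParam iterates over the JSON list captured from
    # the previous step's stdout ({{steps.list-items.outputs.result}})
    - - name: handle-item
        template: process
        arguments:
          parameters:
          - name: item
            value: "{{item}}"
        withParam: "{{steps.list-items.outputs.result}}"
  - name: gen-list
    container:
      image: busybox
      command: [sh, -c]
      args: ['echo ''["a","b","c"]''']
  - name: process
    inputs:
      parameters:
      - name: item
    container:
      image: busybox
      command: [sh, -c]
      args: ['echo processing {{inputs.parameters.item}}']
```
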
### actions-runner-controller

- **Con**: despite running on K8s, it actually uses Docker-based Actions runners,
  which is an unnecessary layer of indirection on K8s and introduces a large
  security surface area via the Docker runtime for all job steps.
- **Con**: it was a community project until GitHub took ownership this year, so
  while GitHub may advertise it as an official solution, it has more rough edges
  than Actions runners built in-house (e.g., VM-based). The [troubleshooting
  guide](https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md#operations)
  and open issues are indicative of difficulties getting Actions runners to
  work nicely with K8s (runners getting stuck, jobs not getting assigned to
  runners, runners coming up before the network is available, etc.), which don't
  exist for K8s-native workflow engines.
- **Con**: despite running on K8s, [it has to be autoscaled in a special way
  instead of autoscaling the number of K8s nodes](https://github.com/actions/actions-runner-controller/blob/master/docs/automatically-scaling-runners.md).