Continuous Integration and Delivery in our organization with Kubernetes, ArgoCD, GitOps — CardoAI

9 min read · Jul 21, 2022

At Cardo AI, we adopted Continuous Integration and Delivery to keep up with the need for constant development and to deploy our applications as quickly and efficiently as possible.

The world of software and technology is constantly changing and evolving, which means that the applications we are building and the underlying infrastructure supporting them have to keep up with these changes and new requirements.

Building an infrastructure that is reliable, secure, easy to change, and resilient to future modifications takes many experiments with different tools and practices.

In this article, we will go through what software development looks like at Cardo AI: what we did in the past, the main challenges we faced, and the new decisions we have made regarding our products’ infrastructure. All of this supports the ongoing changes in our software ecosystem through a continuous integration and delivery mentality.

How did we start implementing continuous integration?

The market we operate in moves extremely fast, and the need to respond quickly to business needs pushed us to find a way to build and deploy applications in a short period of time. At the same time, these applications need to run on an infrastructure that is highly scalable and that can adapt to changing environments.

To support these requirements, almost all of our resources are deployed in Kubernetes. While we believe that Kubernetes is not always the best way to go, after many experiments and evaluations, we concluded that it definitely fits our use case. In fact, Kubernetes can provide everything we need in a deployment platform: scaling, security, monitoring, resilience, etc. The downside? In order to get the most out of it, you need significant expertise to set it up correctly.

In the beginning, we did not use any managed Kubernetes services such as EKS or GKE. Instead, we set up Kubernetes on virtual machines (VMs), which is somewhat similar to a bare-metal strategy: we provisioned the VMs, installed Kubernetes, and configured security rules, networking, and so on ourselves.

This approach worked well enough. In addition, the managed Kubernetes services were not very mature at that time, which contributed to our decision to go with VMs.

From time to time, our developers needed access to some cluster resources for development or testing purposes. Handling this with Kubernetes hosted on VMs had its disadvantages: although AWS was our only cloud provider, the cluster and the other AWS resources were not linked together. As a result, controlling access and giving users exactly the right permissions was difficult.

This and other issues that we encountered along the way made us look for ways to improve.

To make development and delivery smoother, we built pipelines around this deployment platform. To build them, we hosted our own Jenkins server. The pipelines were quite straightforward and included the following steps:

Build the application
Run the tests
Build and push the image to the registry
Update Kubernetes manifest with the new image and redeploy
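Conceptually, the steps above form a sequence in which a later step only makes sense if the earlier ones succeeded. The sketch below is not our Jenkins configuration (that was written in Groovy); it is a minimal Python illustration of the ordering. The stage names, the `run_pipeline` helper, and the fail-fast behavior are assumptions introduced here for illustration.

```python
from typing import Callable, List, Tuple

# A stage is a name plus an action that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first one that fails.

    Returns the names of the stages that completed successfully.
    """
    completed: List[str] = []
    for name, action in stages:
        if not action():
            break  # later stages are skipped once one stage fails
        completed.append(name)
    return completed


# Hypothetical stand-ins for the real build, test, push, and deploy commands.
example_stages: List[Stage] = [
    ("build", lambda: True),
    ("test", lambda: False),  # simulate a failing test run
    ("push-image", lambda: True),
    ("update-manifest-and-redeploy", lambda: True),
]

print(run_pipeline(example_stages))
```

With a failing test stage, nothing is pushed or redeployed, which is the property a pipeline like this is meant to guarantee.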

Like every self-hosted service, our Jenkins server came with maintenance and other overheads. After some time, we realized that Jenkins was limiting our teams in several ways:

Developers had to log in to different systems
Scalability was limited
The Groovy language used by Jenkins to define workflows was difficult for developers

How we are moving forward

At Cardo, we always aim to move at the same speed as technology, as we believe that is the way to stay resilient to industry changes. We try to learn from our own “mistakes” and also to follow new technological trends. In this section, we will discuss what we have built now and how these changes have significantly improved our processes.

For the reasons explained in the previous paragraphs, we switched from Kubernetes on VMs to Amazon Elastic Kubernetes Service (EKS). Above all, managing and scaling all the resources that came with a self-hosted Kubernetes had become difficult.

Amazon Elastic Kubernetes Service (Amazon EKS) is a managed service that you can use to run Kubernetes on AWS without needing to install, operate, and maintain your own Kubernetes control plane or nodes. It offers:

Managed Control Plane
Managed Worker Nodes
Load Balancing
Logging
Tight integration with other AWS services (e.g., RBAC with IAM users)

In the past, we were using the traditional way of writing and managing Kubernetes manifests for our applications and resources. Non-sensitive resources were stored in a git repository and then they were applied to the cluster using Jenkins.

However, this took us to a point where we had to ask each other where resources were, and there were many inconsistencies between the actual state of the cluster and what our manifests defined. That was something we didn’t want: we wanted to automate as much as we could. When we heard about GitOps, we saw an opportunity to change and improve further.

What is GitOps

GitOps is an operational framework that takes DevOps best practices used for application development, such as version control, collaboration, compliance, and continuous integration/continuous delivery tooling, and applies them to infrastructure automation. GitOps is used to automate the process of provisioning infrastructure.

In a similar fashion to how teams use application source code, operations teams that adopt GitOps use configuration files stored as code (infrastructure as code). Using this framework, we now have a single source of truth for the whole infrastructure, which is a git repository.

All the manifests and configurations for our infrastructure live in a git repository in a declarative way.

Since we started using GitOps, we also made many changes in our continuous integration/delivery tools and strategies, as GitOps itself enforces some new practices (especially when it comes to continuous delivery). In the following paragraphs we will get back again to the GitOps strategy, but first, let’s review our adaptation to the new CI/CD strategy.

Continuous Integration / Continuous Delivery Pipeline

We already mentioned that Jenkins was not very intuitive, so we were looking for something simpler and more developer-friendly. After looking at the alternatives, we moved most of our pipelines to GitHub Actions, as it fulfilled our requirements and was much more straightforward, without a steep learning curve.

Before you continue reading, take a minute to look at the following picture, as it will give you an overall idea of the pattern we have implemented.

The processes that run in our continuous integration pipeline are mostly the same as before, but this is where GitOps comes in, so let’s describe the bigger picture. The steps in the CI are the following:

Build the application
Lint the code / Run the tests
Build an application image and push it to the registry
Update the repository (GitOps Repository) that holds the manifests with the latest image — Commit the changes

In the list above, I intentionally put the focus on the last step, updating the GitOps repository that holds the manifests with the latest image and committing the changes, because this is where we do things differently.
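To make that last step concrete, here is a minimal sketch of what "update the manifest with the latest image" can look like. It is not our actual workflow code: the `set_image_tag` helper, the image names, and the Deployment fragment are hypothetical, and the sketch only handles plain `image: name:tag` lines (no quoted values or digests). In a real pipeline a YAML-aware tool would be the more robust choice.

```python
import re


def set_image_tag(manifest: str, image: str, new_tag: str) -> str:
    """Return the manifest text with `image: <image>[:<tag>]` lines set to new_tag.

    Lines that reference other images are left untouched.
    """
    pattern = re.compile(
        r"^([ \t]*(?:-[ \t]*)?image:[ \t]*)" + re.escape(image) + r"(?::\S+)?[ \t]*$",
        re.MULTILINE,
    )
    # A function replacement avoids any backslash interpretation in the tag.
    return pattern.sub(lambda m: f"{m.group(1)}{image}:{new_tag}", manifest)


# Trimmed, hypothetical Deployment fragment as it might live in a GitOps repository.
MANIFEST = """\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  template:
    spec:
      containers:
        - name: example-app
          image: registry.example.com/example-app:1.0.0
        - name: sidecar
          image: registry.example.com/sidecar:2.3.1
"""

updated = set_image_tag(MANIFEST, "registry.example.com/example-app", "1.0.1")
print(updated)
```

After a change like this is written back to the file, the CI job commits and pushes it to the GitOps repository. From that point on, the pipeline's work is done.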

You may have noticed that the pipeline does not mention any deployment or delivery step. That is intentional: we no longer need to write anything in the pipeline that performs the actual delivery or deployment. Instead, an agent (ArgoCD) keeps watching the repository that holds the infrastructure manifests for changes.

After the GitHub Actions workflow of the application that changed updates the GitOps repository, ArgoCD takes over and works as a pull-based pipeline.

ArgoCD sits in the Kubernetes cluster and keeps polling for the desired state from the GitOps repository, comparing that state with the one that is actually running in the cluster.

This means that for applications managed by ArgoCD, changes made outside of Git do not persist. Let’s say that for application “x”, which is deployed using ArgoCD, we want to change the replicas from 1 to 2, and we make this change from the outside, either by applying a modified manifest directly to the cluster or by editing it with a UI tool like Lens.

The state will change for a moment, but as soon as ArgoCD detects that the running state is not the same as the one in the Git repository, it will automatically roll back the change (this relies on automated sync with self-healing being enabled for the application). This is exactly what we want in order to keep a consistent infrastructure for our applications.
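The replicas example can be expressed as a tiny reconciliation step. The sketch below is purely conceptual and is not how ArgoCD is implemented: state is reduced to a flat dictionary, and `find_drift` and `reconcile` are hypothetical helpers introduced here to show the idea of comparing the desired state from Git with the live state and resetting whatever differs.

```python
from typing import Any, Dict, List

State = Dict[str, Any]


def find_drift(desired: State, live: State) -> List[str]:
    """Return the sorted keys whose live value differs from the desired one."""
    return sorted(key for key in desired if live.get(key) != desired[key])


def reconcile(desired: State, live: State) -> State:
    """Return a copy of the live state with every drifted field reset to the desired value."""
    healed = dict(live)
    for key in find_drift(desired, live):
        healed[key] = desired[key]
    return healed


# What the Git repository declares for application "x".
desired_state = {"replicas": 1, "image": "example-app:1.0.1"}
# What the cluster looks like after someone changed the replicas by hand.
live_state = {"replicas": 2, "image": "example-app:1.0.1"}

print(find_drift(desired_state, live_state))
print(reconcile(desired_state, live_state))
```

The important design point is the direction of the comparison: Git is always the reference, so the only durable way to get two replicas is to change the manifest in the repository.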

Developers and other teams may need some level of access to the cluster and to AWS resources. In the beginning, when we were self-hosting Kubernetes, there was no way to use a single strategy to control access to both the AWS resources and the Kubernetes cluster.

We had to configure access separately for each user or user group, and most of the time the result did not match exactly what we wanted to achieve.

At Cardo AI, we like our teams to be cross-functional, so even though it’s not always preferred or needed, we wanted to give developers working on specific projects access to a limited set of resources in Kubernetes.

How are we managing access to cluster resources?

Since we are deploying our Kubernetes cluster in AWS, we can use the IAM mechanism to bind roles and access to our Kubernetes cluster.

We group the users and give each group specific access to the development resources. This is done through the tight integration of Kubernetes RBAC rules with IAM users on AWS. In practice, we separate the resources by namespace (development, staging, production) and then give a developer access to a specific namespace.

Users then generate their own Kubernetes configuration from the CLI using their AWS credentials, so no manual intervention is needed beyond the configuration already set up on EKS. This lets us restrict visibility as much as possible while allowing users to access only the resources they need.
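As an illustration of namespace-scoped access, the sketch below builds a Kubernetes `Role` and `RoleBinding` as plain Python dictionaries and prints them as JSON. The role name, the `dev-team` group, and the chosen resources and verbs are hypothetical and are not our real configuration. The sketch also assumes that IAM identities have already been mapped to the Kubernetes group on the EKS side (for example through the `aws-auth` ConfigMap), which is configured separately.

```python
import json
from typing import Any, Dict


def namespace_role(namespace: str, name: str = "developer") -> Dict[str, Any]:
    """Role granting read-only access to common workload resources in one namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            {
                "apiGroups": ["", "apps"],
                "resources": ["pods", "pods/log", "services", "deployments"],
                "verbs": ["get", "list", "watch"],
            }
        ],
    }


def bind_group(namespace: str, role_name: str, group: str) -> Dict[str, Any]:
    """RoleBinding that grants the Role to a Kubernetes group within the same namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-{group}", "namespace": namespace},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "Role",
            "name": role_name,
        },
        "subjects": [
            {"kind": "Group", "name": group, "apiGroup": "rbac.authorization.k8s.io"}
        ],
    }


role = namespace_role("development")
binding = bind_group("development", role["metadata"]["name"], "dev-team")
print(json.dumps([role, binding], indent=2))
```

Because both objects are namespaced, a developer in the hypothetical `dev-team` group can inspect workloads in `development` but sees nothing in `staging` or `production`. On the developer's side, a kubeconfig for an EKS cluster can typically be generated with `aws eks update-kubeconfig`.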

Conclusions

In this article, we saw how Cardo AI moved from the traditional way of provisioning and monitoring infrastructure to a new model that aims for security, reliability, and automation.

Like everyone in this field, we are trying to find the best tools and strategies for our products by continuously experimenting, but the goal remains the same: to automate and to make the development and delivery processes as smooth as possible. Changing our strategy to the one that I described enabled us to worry less about infrastructure provisioning and delivery processes.

Now, we are doing hundreds of deployments every day without being scared of the failures that could happen on the infrastructure or software side. We have more mechanisms for rolling back easily to previous versions. Also, all the manifests that control the infrastructure and applications are stored in a declarative way.

We have also improved on the security side: changes cannot be applied from the outside, only by authorized users and in a declarative way. Developers’ access to resources is much more controlled, without the overhead of many configurations that were hard to understand.

But we are not stopping here. We are always looking for ways to improve the whole picture, continuously integrating and delivering, from the infrastructure to the way developers interact with the software and the resources related to it. Do you want to become part of this adventure? Check out our job openings on our talents page.

Feel free to reach out to me if you have any questions.


Written by Valon Januzaj