Using GitOps for Stateful Workload Management with the vSphere CSI Driver on On-Premises Kubernetes


Kubernetes has become the de facto standard for container orchestration, providing powerful capabilities for deploying and managing stateless workloads. However, users running stateful applications on Kubernetes face unique challenges, especially in VMware environments. A key issue is that the virtual disks used by stateful apps can’t be attached to pods as easily as ephemeral storage: the volumes need to persist even when pods fail and restart. Overall, IT teams need to carefully evaluate these challenges and constraints before running stateful workloads on Kubernetes clusters on VMware.

Many users run containerized workloads on Kubernetes clusters in their vSphere environments using Amazon EKS Anywhere (EKS Anywhere). EKS Anywhere on vSphere does not include a default Container Storage Interface (CSI) driver. However, VMware offers a CSI driver, the vSphere Container Storage Plug-in, for managing stateful workloads. The vSphere Container Storage Plug-in is a volume plug-in that runs in a native Kubernetes cluster deployed in vSphere and is responsible for provisioning persistent volumes on vSphere storage. One advantage of this plug-in is its snapshot capability, which is important for backup and disaster recovery (DR) scenarios.

GitOps manages application and infrastructure deployment so that the system is described declaratively in a Git repository. It is an operational model that allows you to manage the state of multiple Kubernetes clusters by using the best practices of version control, immutable artifacts, and automation. Flux is a GitOps tool that can be used to automate the deployment of applications on Kubernetes as well as manage EKS Anywhere clusters. It works by continuously monitoring the state of a Git repository and applying changes to a cluster.

In this post, we demonstrate how to use GitOps to deploy and manage stateful workloads on your EKS Anywhere cluster in your vSphere environment with the vSphere CSI driver.

Figure 1: GitOps vSphere CSI driver installation architecture diagram

In this setup, we start by creating the vCenter configuration secrets that are necessary to provision storage with vCenter. Then, we install the External Secrets Operator to retrieve access credentials from AWS Secrets Manager, which are needed to set up the vSphere CSI driver. We use Secrets Manager to illustrate the approach; any other vault implementation works as well. Next, we configure GitOps through Flux to deploy the vSphere CSI driver manifests from a Git repository. Finally, we deploy a stateful workload to validate the backup and restore capabilities of persistent volumes on vCenter storage through the vSphere CSI driver. This flow is shown in the preceding diagram.

Prerequisites

Make sure the following prerequisites are complete:

  1. A Linux-based host machine using an Amazon Elastic Compute Cloud (Amazon EC2) instance, an AWS Cloud9 instance, or a local machine with access to your AWS account.
  2. Configure admin access to the EKS Anywhere cluster from the host machine.
  3. Configure IAM Roles for Service Account (IRSA) on the EKS Anywhere cluster.
  4. Install the following tools on the host machine from the previous steps: kubectl, helm, the Flux CLI, the AWS CLI, jq, and envsubst (a quick verification sketch follows this list).
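
The following shell snippet is a minimal sanity check for the CLI tools this walkthrough assumes; extend the list if your workflow needs more:

for tool in kubectl helm flux aws jq envsubst; do
  command -v "$tool" > /dev/null || echo "missing: $tool"
done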

Create the vCenter configuration secrets

The first step in our setup process is to create the necessary vCenter configuration secrets. Let’s export a few vCenter details as environment variables:

export EKSA_ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
export EKSA_OIDC_PROVIDER=<value of $ISSUER_HOSTPATH as configured in IRSA setup>
export EKSA_ES_SERVICE_ACCOUNT="external-secrets-sa"

# Comments show example values from the vSphere inventory;
# set these variables to reflect your vSphere cluster's environment.
export VSPHERE_USERNAME=<Your vCenter Admin Username>
export VSPHERE_PASSWORD=<Your vCenter Admin Password>
export VCENTER_DOMAIN_NAME=<Your vCenter Server Domain>     # sc2-rdops-vm06-dhcp-215-129.eng.vmware.com
export VSPHERE_DATA_CENTER_NAME=<Your Data Center Name>     # datacenter
export VSPHERE_CLUSTER_NAME=<Your Cluster Name>             # vSAN-cluster
export VCENTER_NAME=<Your vCenter Server Name>              # sc2-rdops-vm06-dhcp-215-129
export VSPHERE_IP_ADDRESS=$(getent hosts $VCENTER_DOMAIN_NAME | awk '{ print $1 }')

Next, let’s set up the configuration secrets that are loaded from Secrets Manager:

cat << 'EOF' >> vmwareconf.txt
global:
  port: 443
  insecureFlag: true
# vcenter section
vcenter:
  $VCENTER_NAME:
    server: $VSPHERE_IP_ADDRESS
    user: $VSPHERE_USERNAME
    password: $VSPHERE_PASSWORD
    datacenters:
    - $VSPHERE_DATA_CENTER_NAME
EOF

cat << 'EOF' >> csi-vsphere-vars.txt
[Global]
insecure-flag = "true"
port = "443"

[VirtualCenter "$VSPHERE_IP_ADDRESS"]
cluster-id = "$VSPHERE_CLUSTER_NAME"
user = "$VSPHERE_USERNAME"
password = "$VSPHERE_PASSWORD"
datacenters = "$VSPHERE_DATA_CENTER_NAME"
EOF

Next, let’s load the configuration secrets to Secrets Manager:

export VSPHERE_CONTROLLER_SECRET_ARN=$(aws secretsmanager create-secret \
  --name vsphere.conf \
  --secret-string "$(envsubst < vmwareconf.txt)" | jq -r '.ARN')

export CSI_DRIVER_SECRET_ARN=$(aws secretsmanager create-secret \
  --name csi-vsphere.conf \
  --secret-string "$(envsubst < csi-vsphere-vars.txt)" | jq -r '.ARN')

Install the External Secrets Operator

The next step in our setup process is to set up External Secrets so we can securely access the vCenter Cloud Controller Manager and CSI driver configuration secrets from Secrets Manager. First, let’s create an AWS Identity and Access Management (IAM) policy and role that allow the cluster to access only the Secrets Manager secrets we created in the previous step:

cat << EOF > vmware-csi-secrets-reader-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:ListSecrets",
        "secretsmanager:GetSecretValue"
      ],
      "Resource": [
        "$VSPHERE_CONTROLLER_SECRET_ARN",
        "$CSI_DRIVER_SECRET_ARN"
      ]
    }
  ]
}
EOF

aws iam create-policy \
  --policy-name vmware-csi-secrets-reader \
  --policy-document file://vmware-csi-secrets-reader-policy.json

export POLICY_ARN=$(aws iam list-policies \
  --query 'Policies[?PolicyName==`vmware-csi-secrets-reader`].Arn' \
  --output text)

cat << EOF > secrets-manager-trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::$EKSA_ACCOUNT_ID:oidc-provider/$EKSA_OIDC_PROVIDER"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringLike": {
          "$EKSA_OIDC_PROVIDER:sub": [
            "system:serviceaccount:kube-system:$EKSA_ES_SERVICE_ACCOUNT",
            "system:serviceaccount:vmware-system-csi:$EKSA_ES_SERVICE_ACCOUNT"
          ]
        }
      }
    }
  ]
}
EOF

export ES_ROLEARN=$(aws iam create-role --role-name ${EKSA_ES_SERVICE_ACCOUNT}-role \
  --assume-role-policy-document file://secrets-manager-trust-policy.json \
  --query Role.Arn --output text)

aws iam attach-role-policy --role-name ${EKSA_ES_SERVICE_ACCOUNT}-role \
  --policy-arn $POLICY_ARN

Next, we deploy external-secrets through Helm to sync secrets between Secrets Manager and the EKS Anywhere cluster:

helm repo add external-secrets https://charts.external-secrets.io

helm install external-secrets \
  external-secrets/external-secrets \
  -n external-secrets \
  --create-namespace

Next, let’s verify that external-secrets has been deployed successfully and that all pods are ready:

> kubectl get pods -n external-secrets
NAME                                                    READY   STATUS    RESTARTS   AGE
pod/external-secrets-5477599d89-7spkg                   1/1     Running   0          100s
pod/external-secrets-cert-controller-6cc64794fc-5czqj   1/1     Running   0          100s
pod/external-secrets-webhook-55555fc4fd-mncm5           1/1     Running   0          100s

To use IRSA for secrets retrieval, we need a service account in each namespace that uses external-secrets so it can assume the IAM role we just created. Since one of the service accounts resides in the vmware-system-csi namespace, we also create that namespace now:

kubectl create ns vmware-system-csi

cat << EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ${EKSA_ES_SERVICE_ACCOUNT}
  namespace: kube-system
  annotations:
    eks.amazonaws.com/role-arn: ${ES_ROLEARN}
    eks.amazonaws.com/audience: "sts.amazonaws.com"
    eks.amazonaws.com/sts-regional-endpoints: "true"
    eks.amazonaws.com/token-expiration: "86400"
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ${EKSA_ES_SERVICE_ACCOUNT}
  namespace: vmware-system-csi
  annotations:
    eks.amazonaws.com/role-arn: ${ES_ROLEARN}
    eks.amazonaws.com/audience: "sts.amazonaws.com"
    eks.amazonaws.com/sts-regional-endpoints: "true"
    eks.amazonaws.com/token-expiration: "86400"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ${EKSA_ES_SERVICE_ACCOUNT}-cluster-role
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs:
  - get
  - list
  - watch
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ${EKSA_ES_SERVICE_ACCOUNT}-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: ${EKSA_ES_SERVICE_ACCOUNT}-cluster-role
subjects:
- kind: ServiceAccount
  name: ${EKSA_ES_SERVICE_ACCOUNT}
  namespace: kube-system
- kind: ServiceAccount
  name: ${EKSA_ES_SERVICE_ACCOUNT}
  namespace: vmware-system-csi
EOF

Next, let’s create the ClusterSecretStore, a cluster-scoped SecretStore that can be referenced by ExternalSecrets in all namespaces:

cat << EOF | kubectl apply -f -
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: eksa-secret-store
spec:
  provider:
    aws:                        # set secretStore provider to AWS
      service: SecretsManager   # configure the service to be Secrets Manager
      region: us-west-2         # Region where the secret is stored
      auth:
        jwt:
          serviceAccountRef:
            name: ${EKSA_ES_SERVICE_ACCOUNT}
EOF

Verify the ClusterSecretStore status using the following command:

> kubectl get clustersecretstore eksa-secret-store
NAME                AGE     STATUS   CAPABILITIES   READY
eksa-secret-store   2m38s   Valid    ReadWrite      True
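
The ExternalSecret manifests themselves are delivered through the GitOps repository in the next step. For illustration only, an ExternalSecret that syncs the csi-vsphere.conf secret into the vmware-system-csi namespace would look roughly like the following sketch (the resource and target names here are hypothetical):

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: vsphere-config-secret        # hypothetical name for illustration
  namespace: vmware-system-csi
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: eksa-secret-store
    kind: ClusterSecretStore
  target:
    name: vsphere-config-secret      # Kubernetes Secret to be created
    creationPolicy: Owner
  data:
  - secretKey: csi-vsphere.conf      # key in the resulting Kubernetes Secret
    remoteRef:
      key: csi-vsphere.conf          # Secrets Manager secret we created earlier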

Configure GitOps with Flux to install Cloud Controller Manager and vSphere CSI Driver

Note that the flux install step can be skipped if you are already using a GitOps-enabled EKS Anywhere cluster; in that case, the EKS Anywhere installation process installs Flux on your behalf.

We use GitOps sync through Flux to handle the deployment of the CSI driver into our EKS Anywhere cluster. Deploy Flux in your EKS Anywhere cluster, then create the Git source and kustomization, using the following commands:

flux install

flux create source git vmware-csi \
  --url=https://github.com/aws-samples/containers-blog-maelstrom \
  --branch=main

flux create kustomization csi-driver-main \
  --source=GitRepository/vmware-csi \
  --path="./vmware-csi-driver-gitops" \
  --prune=true \
  --interval=1m \
  --namespace=flux-system
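
The flux create commands above are imperative shorthand. In a fuller GitOps workflow, you would instead commit the equivalent declarative manifests to the repository; a rough sketch (API versions depend on your Flux release):

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: vmware-csi
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/aws-samples/containers-blog-maelstrom
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: csi-driver-main
  namespace: flux-system
spec:
  interval: 1m
  path: ./vmware-csi-driver-gitops
  prune: true
  sourceRef:
    kind: GitRepository
    name: vmware-csi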

Verify that the CSI driver installation is successful

Check that the Cloud Controller Manager and the CSI driver were installed successfully, and that the storage class was created:

> kubectl get pods -n kube-system -l name=vsphere-cloud-controller-manager
NAME                                         READY   STATUS    RESTARTS   AGE
pod/vsphere-cloud-controller-manager-5mcm7   1/1     Running   0          5m58s
pod/vsphere-cloud-controller-manager-9zqgq   1/1     Running   0          5m58s
pod/vsphere-cloud-controller-manager-rhgm9   1/1     Running   0          5m58s

> kubectl get pods -n vmware-system-csi
NAME                                          READY   STATUS    RESTARTS        AGE
pod/vsphere-csi-controller-84bb459bd5-8llmm   7/7     Running   0               3m33s
pod/vsphere-csi-controller-84bb459bd5-mh922   7/7     Running   0               3m33s
pod/vsphere-csi-controller-84bb459bd5-vrfkw   7/7     Running   0               3m33s
pod/vsphere-csi-node-g6jfr                    3/3     Running   1 (3m20s ago)   3m33s
pod/vsphere-csi-node-gmlpd                    3/3     Running   2 (3m19s ago)   3m33s
pod/vsphere-csi-node-lmfvq                    3/3     Running   1 (3m21s ago)   3m33s
pod/vsphere-csi-node-s4cdt                    3/3     Running   2 (3m20s ago)   3m33s
pod/vsphere-csi-node-xqj7z                    3/3     Running   1 (3m20s ago)   3m33s
pod/vsphere-csi-node-z6rbp                    3/3     Running   2 (3m20s ago)   3m33s

> kubectl get sc
NAME                  PROVISIONER              RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
vmware-sc (default)   csi.vsphere.vmware.com   Delete          Immediate           false                  4h42m
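
For reference, the vmware-sc storage class reported above corresponds to a StorageClass manifest along these lines, reconstructed from the kubectl get sc output; the repository's actual manifest may also set vSphere-specific parameters such as a storage policy:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vmware-sc
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"   # marked (default) above
provisioner: csi.vsphere.vmware.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: false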

Verify GitOps deployment of sample stateful workload along with backup and restore

Finally, we validate our GitOps setup, which deployed a sample stateful workload, and exercise the backup and restore capabilities of the deployed vSphere CSI driver. The sample stateful workload deployed an app that created a volume and then took a snapshot of it.

> kubectl get pods -l job-name=app
NAME            READY   STATUS    RESTARTS   AGE
pod/app-4677h   1/1     Running   0          3m1s
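
The workload manifests live in the GitOps repository; purely for illustration, a Job that writes to the claimed volume might look like this sketch (the image, command, and mount path are assumptions, not the repository's actual manifest):

apiVersion: batch/v1
kind: Job
metadata:
  name: app
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: app
        image: public.ecr.aws/docker/library/busybox:stable   # illustrative image
        # Write a marker file to the persistent volume, then stay alive briefly
        # so the pod can be inspected.
        command: ["sh", "-c", "date > /data/written-at.txt && sleep 300"]
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: vmware-csi-claim   # the PVC shown in the outputs below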

To manage storage for stateful workloads, the vSphere CSI driver uses two API resources from the PersistentVolume subsystem: PersistentVolume (PV) and PersistentVolumeClaim (PVC). A PVC is a request for storage by a user and is similar to a Pod: Pods consume node resources and PVCs consume PV resources. Just as Pods can request specific levels of resources (CPU and memory), claims can request a specific size and access modes (for example, they can be mounted ReadWriteOnce, ReadOnlyMany, ReadWriteMany, or ReadWriteOncePod; see AccessModes). A PV is a piece of storage in the cluster that has been provisioned by an administrator or dynamically provisioned using Storage Classes. It is a resource in the cluster, just as a node is a cluster resource. PVs are volume plugins like Volumes, but they have a lifecycle independent of any individual Pod that uses them. This API object captures the details of the storage implementation, be that NFS, iSCSI, or a cloud-provider-specific storage system.
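
To make this concrete, the 4-GiB claim shown in the outputs below corresponds to a PVC along these lines (a sketch reconstructed from those outputs; the authoritative manifest is in the GitOps repository):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vmware-csi-claim
spec:
  accessModes:
  - ReadWriteOnce              # RWO, as reported by kubectl get pvc
  storageClassName: vmware-sc  # triggers dynamic provisioning by the vSphere CSI driver
  resources:
    requests:
      storage: 4Gi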

Run the following commands to see the sample stateful workload’s PVC, bound through our vmware-sc storage class, along with the PV that was dynamically provisioned on vCenter storage:

> kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                      STORAGECLASS   REASON   AGE
pvc-5ec73b4b-3d1e-41a6-8cac-5502094462eb   4Gi        RWO            Delete           Bound    default/vmware-csi-claim   vmware-sc               44s

> kubectl get pvc
NAME               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
vmware-csi-claim   Bound    pvc-5ec73b4b-3d1e-41a6-8cac-5502094462eb   4Gi        RWO            vmware-sc      114s

Figure 2: vCenter storage UI showing the test workload’s volume

Similar to how the PersistentVolume and PersistentVolumeClaim API resources are used to provision volumes, the VolumeSnapshotContent and VolumeSnapshot API resources are provided to create volume snapshots.

A VolumeSnapshotContent is a snapshot taken from a volume in the cluster; it is a cluster resource, just as a PV is. A VolumeSnapshot is a request for a snapshot, similar to a PersistentVolumeClaim. Next, let’s check on the VolumeSnapshot, a point-in-time snapshot of our volume that can be used to restore the stateful workload’s storage:
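
Reconstructed from the output below, the snapshot request is a VolumeSnapshot along these lines (a sketch; the actual manifest ships in the GitOps repository):

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: vmware-csi-volume-snapshot
spec:
  volumeSnapshotClassName: vmware-csi-snapshotclass
  source:
    persistentVolumeClaimName: vmware-csi-claim   # the PVC to snapshot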

> kubectl get volumesnapshot
NAME                         READYTOUSE   SOURCEPVC          SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS              SNAPSHOTCONTENT                                    CREATIONTIME   AGE
vmware-csi-volume-snapshot   true         vmware-csi-claim                           4Gi           vmware-csi-snapshotclass   snapcontent-6034c162-e256-4557-b0b9-08f545d723a6   3m35s          4m8s

Finally, let’s validate the restore operation on the stateful workload by checking the workload that was created from the point-in-time snapshot, using the following command:

> kubectl get pods -l kustomize.toolkit.fluxcd.io/name=storage-tester
NAME          READY   STATUS    RESTARTS   AGE
app-restore   1/1     Running   0          4m6s
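
Under the hood, the restore is driven by a new PVC whose dataSource references the VolumeSnapshot, prompting the CSI driver to provision a fresh volume populated from the snapshot. A sketch matching the restored-vmware-csi-claim that appears below:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-vmware-csi-claim
spec:
  storageClassName: vmware-sc
  dataSource:
    name: vmware-csi-volume-snapshot   # the snapshot created above
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 4Gi                     # at least the snapshot's restore size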

With the restored pod created, we can now see that an additional volume has been created, both in the vCenter UI and on our cluster, by running the following command again:

> kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                               STORAGECLASS   REASON   AGE
pvc-5ec73b4b-3d1e-41a6-8cac-5502094462eb   4Gi        RWO            Delete           Bound    default/vmware-csi-claim            vmware-sc               13m
pvc-b0f57d0c-ca93-45a4-9ea3-b7adc1c0f096   4Gi        RWO            Delete           Bound    default/restored-vmware-csi-claim   vmware-sc               2s

Figure 3: vCenter storage UI showing the snapshot restore test workload’s volume

Cleaning up

To avoid incurring future charges, clean up the EKS Anywhere cluster resources and AWS resources created during the lab:

# Clean up EKS Anywhere cluster resources
flux delete kustomization -n flux-system csi-driver-main
kubectl delete sc vmware-sc
kubectl delete clustersecretstore eksa-secret-store
kubectl delete clusterrolebinding ${EKSA_ES_SERVICE_ACCOUNT}-role-binding
kubectl delete clusterrole ${EKSA_ES_SERVICE_ACCOUNT}-cluster-role
kubectl delete sa ${EKSA_ES_SERVICE_ACCOUNT} -n kube-system
kubectl delete sa ${EKSA_ES_SERVICE_ACCOUNT} -n vmware-system-csi
kubectl delete ns vmware-system-csi
kubectl delete secret snapshot-webhook-certs -n kube-system
helm uninstall external-secrets -n external-secrets
rm -fv ./vmwareconf.txt ./csi-vsphere-vars.txt \
  ./vmware-csi-secrets-reader-policy.json ./secrets-manager-trust-policy.json
kubectl delete ns external-secrets
flux uninstall

# Clean up AWS resources
aws iam detach-role-policy --role-name ${EKSA_ES_SERVICE_ACCOUNT}-role \
  --policy-arn $POLICY_ARN
aws iam delete-policy \
  --policy-arn $POLICY_ARN
aws iam delete-role \
  --role-name ${EKSA_ES_SERVICE_ACCOUNT}-role
aws secretsmanager delete-secret \
  --region ${AWS_REGION} \
  --secret-id $VSPHERE_CONTROLLER_SECRET_ARN \
  --recovery-window-in-days=7
aws secretsmanager delete-secret \
  --region ${AWS_REGION} \
  --secret-id $CSI_DRIVER_SECRET_ARN \
  --recovery-window-in-days=7

Conclusion

In this post, we demonstrated how to use GitOps to deploy the vSphere CSI driver on your EKS Anywhere cluster in your vSphere environment, and then how to deploy a stateful workload to that cluster using the driver. We walked through the underlying PersistentVolumeClaim and PersistentVolume creation for the stateful workload, which dynamically provisioned a volume on vCenter storage. Finally, we backed up the volume by creating a point-in-time snapshot and restored the stateful workload from that snapshot. Users looking to run stateful workloads on EKS Anywhere clusters on vSphere can follow this approach to operate stateful workloads at scale.

To learn more about managing your EKS Anywhere environment, see the EKS Anywhere documentation and the vSphere Container Storage Plug-in documentation.
