QuickTip: Tanzu TKC/Guest Cluster: Error “Permission denied” when creating file with securityContext

This post was published 3 years 6 months 16 days ago, so the post may be outdated.

This time, I’m not getting into much detail about specific details around this topic. Consider this as a quick write-up/FYI. If you’ve ever encountered this issue or worked with securityContext‘s in Kubernetes, you might know what I’m talking about.

Disclaimer: Before proceeding, please don’t see this is as an official response from VMware. You continue on your own risk.

Some context

When deploying Grafana via TKG Extensions on your TKG(m) or TKGS/TKC, you might have encountered some "Permission denied" issues. This error is described in kb.vmware.com/s/article/82108 pretty well.

This also affects some TKGs releases, for example TKGS 1.20.2.

Reproduction

For some testing, I’ve written a yaml (inspired by kubernetes docs) to reproduce this error. I’ve used following:

apiVersion: v1
kind: Pod
metadata:
  name: fstypetest
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
  volumes:
    - name: fstypetest-volume
      persistentVolumeClaim:
        claimName: fstypetest-volume-claim
  containers:
  - name: fstypetest-pod
    image: busybox
    command: [ "sh", "-c", "sleep 1h" ]
    volumeMounts:
    - name: fstypetest-volume
      mountPath: /data
    securityContext:
      allowPrivilegeEscalation: false
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: fstypetest-volume-claim
spec:
  storageClassName: vsan-raid1-1-failure
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Mi

Then get into the shell of this pod via kubectl exec -it fstypetest -- sh and you will see this:

/ $ ls -lsh /data
total 12K
    12 drwx------    2 root     root       12.0K Jul  8 21:49 lost+found
/ $ ls -lsh /data^C
/ $ touch /data/test
touch: /data/test: Permission denied

How you want it to look like

Now, when running the same specification on a GuestCluster with TKGS 1.20.7 you however will see this:

/data $ touch test
/data $ ls -lsh
total 0
     0 -rw-r--r--    1 1000     2000           0 Jul  8 21:22 test

(Does look better, doesn’t it?)

So long story short: This issue seems to be finally solved in TKGS 1.20.7.

The solution

Updating your Tanzu Kubernetes Cluster/Guest Cluster to recently (5th of July 2021) released TKGS 1.20.7 does fix the issue finally.

$ kubectl get tanzukubernetesreleases | tail -n 5
v1.19.7---vmware.1-tkg.1.fc82c41    1.19.7+vmware.1-tkg.1.fc82c41    True    True         20d       [1.20.7+vmware.1-tkg.1.7fb9067 1.19.7+vmware.1-tkg.2.f52f85a]
v1.19.7---vmware.1-tkg.2.f52f85a    1.19.7+vmware.1-tkg.2.f52f85a    True    True         20d       [1.20.7+vmware.1-tkg.1.7fb9067]
v1.20.2---vmware.1-tkg.1.1d4f79a    1.20.2+vmware.1-tkg.1.1d4f79a    True    True         20d       [1.20.7+vmware.1-tkg.1.7fb9067]
v1.20.2---vmware.1-tkg.2.3e10706    1.20.2+vmware.1-tkg.2.3e10706    True    True         20d       [1.20.7+vmware.1-tkg.1.7fb9067]
v1.20.7---vmware.1-tkg.1.7fb9067    1.20.7+vmware.1-tkg.1.7fb9067    True    True         3d2h

The workaround

When you can’t or don’t want to update, there are two workarounds.

Workaround #1

This workaround is inspired by the steps provided for TKG(m) here), so they should be fine. For TKGS, you can use following:

kubectl patch deployment -n vmware-system-csi vsphere-csi-controller --type=json -p='[{"op": "add", "path": "/spec/template/spec/containers/4/args/-", "value": "--default-fstype=ext4"}]'
kubectl delete pod -n vmware-system-csi -l app=vsphere-csi-controller

This will patch your deployment named vsphere-cis-controller, then we remove all pods labeled with app=vsphere-cis-controller. As the deployment enforces multiple pods of the controller running at a given time, Kubernetes will start re-creating those deleted pods. This step is required to restart the pods, so that the deployment change takes effect.

However to note, this could result in some interruption when interacting with PVCs (e.g. during creationg/deletion) in a busy environment.

When issues appear, please consider raising an offical service request with VMware.

Workaround #2

The second workaround is the one I was using alternatively and should be less disruptive, as it’s going to modify Storage Classes we already have. However the caveat is, that all storage classes needs to be modified.

Check kubectl get sc for the names of your storage classes in use.

  1. Export the Storage Class definition:

    kubectl get sc vsan-raid1-1-failure -o yaml > tmp_sc.yaml
  2. Modify Storage Class by adding this below parameters (watch out for indentation!)

    csi.storage.k8s.io/fstype: "ext4"
  3. Replace the Storage Class:

    kubectl replace -f tmp_sc.yaml --force

The more automated way:

kubectl get sc vsan-raid1-1-failure -o yaml > tmp_sc.yaml
sed '/^parameters:.*/a\ \ csi.storage.k8s.io/fstype: "ext4"' -i tmp_sc.yaml
kubectl replace -f tmp_sc.yaml --force

Beside this, getting to the latest version for the final fix might be the best idea on a longer run. Hope this helps some people out there!

(Fun-Fact: I was writing this post as a solution for the current 1.20.2 release. During reproduction I figured out there’s 1.20.7 with a fix released.)

Patrik Kernstock

May I introduce my self? I am Patrik Kernstock, 25 years old, perfectionist, born in Austria and living in Ireland, Cork. Me explained in short: Tech- and security enthusiast, series & movies junky. Interesting in Linux, Container-stuff and many software solutions by Microsoft, Veeam and VMware.

0 0 votes
Article Rating
Subscribe
Notify of
guest

This site uses Akismet to reduce spam. Learn how your comment data is processed.

0 Comments
newest
oldest most voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x