{"id":3010744,"date":"2025-01-17T12:50:33","date_gmt":"2025-01-17T20:50:33","guid":{"rendered":"urn:uuid:79b32fa4-d0bf-416e-9282-a576fd8cd07b"},"modified":"2025-03-07T09:59:16","modified_gmt":"2025-03-07T17:59:16","slug":"fully-automated-kubernetes-operations","status":"publish","type":"research-post","link":"https:\/\/sie-dev.altis.cloud\/en\/innovation\/research-academia\/research\/fully-automated-kubernetes-operations\/","title":{"rendered":"Fully Automated Kubernetes Operations"},"content":{"rendered":"\n<p class=\"sie-paragraph\">Now, imagine a world where your Kubernetes clusters manage themselves. No more late-night crisis management, no manual updates, and no firefighting. What might sound like a distant fantasy is the new reality at PSN Platform. In the fast-paced, cloud-native world, Kubernetes quickly became the standard for container orchestration, offering immense power but at a cost: complexity. Traditionally, managing Kubernetes clusters demanded high manual intervention and deep expertise.<\/p>\n\n\n\n<p class=\"sie-paragraph\">But what if we could turn that complexity into simplicity? What if the entire process could be automated, freeing up developers and SREs to focus on what really drives value\u2014innovation and growth? That\u2019s exactly the transformation we\u2019ve achieved. Welcome to the new era of fully automated Kubernetes operations at PSN, where teams can now focus on what matters most while the platform takes care of itself.<\/p>\n\n\n\n<p class=\"sie-paragraph\"><strong class=\"sie-paragraph\">The Challenge: Wrestling with Complexity<\/strong><\/p>\n\n\n\n<p class=\"sie-paragraph\">Our journey began with a simple yet profound question: How can we make Kubernetes easier to manage while maintaining the flexibility and power that make it essential? 
Like many organizations, we faced numerous challenges:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Inconsistent Infrastructure as Code (IaC) Definitions:<\/strong> Different teams used different tools, leading to discrepancies in cluster configurations and resource allocations. This lack of consistency caused environments to fall out of sync, leading to instability and unpredictability.<\/li>\n\n\n\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Manual Processes:<\/strong> Setting up new Kubernetes clusters was a manual, multi-step process. Configuring VPCs, subnets, IAM roles\u2014each step required manual effort, leading to inefficiencies, errors, and delays. Scaling resources was labor-intensive, and keeping everything up to date required constant manual intervention.<\/li>\n\n\n\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Limited Automation:<\/strong> Repetitive tasks consumed valuable engineering time, bogging down our operations. Manual scaling, backups, and security patches were the norm, leaving our infrastructure vulnerable and our teams overworked.<\/li>\n\n\n\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Operational Complexity:<\/strong> Managing over 50 Kubernetes clusters, each with more than 50 add-ons, became increasingly complex. From networking and IAM roles to performance optimization and security, the manual effort was immense.<\/li>\n<\/ul>\n\n\n\n<p class=\"sie-paragraph\"><strong class=\"sie-paragraph\">The Root of the Problem: Why Complexity Ruled<\/strong><\/p>\n\n\n\n<p class=\"sie-paragraph\">As we dug deeper, we uncovered the root causes of our challenges:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Customization Needs:<\/strong> Geographically dispersed departments had different network and security designs that were built to standards and practices developed in isolation. 
This led to different network designs and approaches on how and where security controls were deployed.<\/li>\n\n\n\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Multi-Team Involvement:<\/strong> Managing these clusters required extensive coordination among teams\u2014network, security, observability. Each change needed approvals, often leading to bottlenecks and miscommunications.<\/li>\n\n\n\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Manual Validation:<\/strong> Manual processes introduced errors and delayed deployments. Without automated checks, standards varied, leading to inconsistencies and potential vulnerabilities.<\/li>\n\n\n\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Lack of Unified Processes:<\/strong> Without standardized tools and practices, each team followed its own methods, causing confusion and inefficiency. Onboarding new team members was slow, and resolving conflicts took time.<\/li>\n<\/ul>\n\n\n\n<figure data-wp-context=\"{&quot;imageId&quot;:&quot;69d9763b57ca3&quot;}\" data-wp-interactive=\"core\/image\" data-wp-key=\"69d9763b57ca3\" class=\"wp-block-image wp-lightbox-container\"><img decoding=\"async\" data-wp-class--hide=\"state.isContentHidden\" data-wp-class--show=\"state.isContentVisible\" data-wp-init=\"callbacks.setButtonStyles\" data-wp-on--click=\"actions.showLightbox\" data-wp-on--load=\"callbacks.setButtonStyles\" data-wp-on-window--resize=\"callbacks.setButtonStyles\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXcHm7s-gWTVs4Hbnte03_F8PEq1idHP3BtHy8fS5g6ebglLi_N1u5GgFrl-1cIu0P0Ir11NbQhaEnbwbt0bC-zfbXrf5SVq3lO5jn_X1uqUWv2to1-LZSP4QmvsIqT94HUGY3PEFQ?key=VIK-cAVeUAxHUf-SCtjLALtT\" alt=\"\" 
\/><button\n\t\t\tclass=\"lightbox-trigger\"\n\t\t\ttype=\"button\"\n\t\t\taria-haspopup=\"dialog\"\n\t\t\taria-label=\"Enlarge\"\n\t\t\tdata-wp-init=\"callbacks.initTriggerButton\"\n\t\t\tdata-wp-on--click=\"actions.showLightbox\"\n\t\t\tdata-wp-style--right=\"state.imageButtonRight\"\n\t\t\tdata-wp-style--top=\"state.imageButtonTop\"\n\t\t>\n\t\t\t<svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"12\" height=\"12\" fill=\"none\" viewBox=\"0 0 12 12\">\n\t\t\t\t<path fill=\"#fff\" d=\"M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z\" \/>\n\t\t\t<\/svg>\n\t\t<\/button><\/figure>\n\n\n\n<p class=\"sie-paragraph\"><strong class=\"sie-paragraph\">The Turning Point: Embracing Standards and Automation<\/strong><\/p>\n\n\n\n<p class=\"sie-paragraph\">Faced with these complexities, we knew a fundamental shift was needed. The answer lay in defining standards and embracing automation\u2014not just as a tool, but as a new operational philosophy.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Standardizing Tools and Practices: <\/strong>We selected Terraform as our unified IaC tool, allowing us to standardize our environments and reduce surprises during deployments. By adopting a cookie-cutter process, we streamlined cluster creation, making it faster and more consistent across the organization. We addressed regulatory requirements by deploying policy engines that enforce security standards, ensuring that our Kubernetes clusters comply with the PSN Security Policy and with regulations such as PCI and SOX.<\/li>\n\n\n\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Automating IaC and Monitoring: <\/strong>We automated the creation and configuration of Kubernetes clusters using Terraform and custom scripts integrated with a GitOps pipeline. 
We implemented monitoring tools to detect and alert on issues proactively, reducing the need for manual intervention. This included monitoring cluster add-ons, worker nodes, and subnet IP usage, allowing auto-scaling based on traffic patterns.<\/li>\n\n\n\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Self-Service Processes: <\/strong>To empower our teams, we introduced self-service tools that allowed them to handle common tasks independently. Teams could now create IAM roles, container image repositories, and more without waiting for approvals, accelerating the onboarding process.<\/li>\n\n\n\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Automated Validation: <\/strong>We automated the validation of our IaC configurations and Kubernetes cluster setups. Using tools like Terratest and Sonobuoy, we ensured that our infrastructure was provisioned correctly and that critical add-ons were configured properly. Monitoring and logging configurations were also validated to ensure they met our standards for ongoing management.<\/li>\n\n\n\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Fostering Team Collaboration: <\/strong>We established a RACI (Responsible, Accountable, Consulted, and Informed) structure to set clear expectations about each team\u2019s role in the cluster creation and validation process. 
This approach removed the confusion of what needed approvals, accelerating the delivery of clusters from weeks to hours.<\/li>\n<\/ul>\n\n\n\n<figure data-wp-context=\"{&quot;imageId&quot;:&quot;69d9763b58061&quot;}\" data-wp-interactive=\"core\/image\" data-wp-key=\"69d9763b58061\" class=\"wp-block-image wp-lightbox-container\"><img decoding=\"async\" data-wp-class--hide=\"state.isContentHidden\" data-wp-class--show=\"state.isContentVisible\" data-wp-init=\"callbacks.setButtonStyles\" data-wp-on--click=\"actions.showLightbox\" data-wp-on--load=\"callbacks.setButtonStyles\" data-wp-on-window--resize=\"callbacks.setButtonStyles\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXe8xTziD86O4QFrDf3foR_ArUvVx38fTxxwcfAXGWeyhxVnPa9JtX1pgUCUHMgquk3AdGviUJSHxeyzgdyHE19dvssdC-6DmQ9zklYk2ofBkJWIwuifZssxDdsurRP7EHaGbLiGhQ?key=VIK-cAVeUAxHUf-SCtjLALtT\" alt=\"\" \/><button\n\t\t\tclass=\"lightbox-trigger\"\n\t\t\ttype=\"button\"\n\t\t\taria-haspopup=\"dialog\"\n\t\t\taria-label=\"Enlarge\"\n\t\t\tdata-wp-init=\"callbacks.initTriggerButton\"\n\t\t\tdata-wp-on--click=\"actions.showLightbox\"\n\t\t\tdata-wp-style--right=\"state.imageButtonRight\"\n\t\t\tdata-wp-style--top=\"state.imageButtonTop\"\n\t\t>\n\t\t\t<svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"12\" height=\"12\" fill=\"none\" viewBox=\"0 0 12 12\">\n\t\t\t\t<path fill=\"#fff\" d=\"M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z\" \/>\n\t\t\t<\/svg>\n\t\t<\/button><\/figure>\n\n\n\n<p class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Conquering the Chaos of Kubernetes Add-Ons<\/strong><\/p>\n\n\n\n<p class=\"sie-paragraph\"><strong class=\"sie-paragraph\">The Initial Setup: Helm and Terraform<\/strong><\/p>\n\n\n\n<p class=\"sie-paragraph\">When we first ventured into managing Kubernetes add-ons, we chose Helm for most configurations, 
complemented by Terraform modules for each component. This approach provided consistency across our growing fleet of Kubernetes clusters, seamlessly fitting into our deployment workflow. Initially, everything worked like a charm, and we were confident in our setup.<\/p>\n\n\n\n<p class=\"sie-paragraph\"><strong class=\"sie-paragraph\">The Growing Pains: Emerging Issues<\/strong><\/p>\n\n\n\n<p class=\"sie-paragraph\">However, as our fleet expanded\u2014reaching over 50 clusters, each supporting more than 50 add-ons\u2014we began encountering challenges that threatened to disrupt our operations:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Operational Gaps:<\/strong> With the increase in clusters, we started facing PR conflicts and difficulties in managing dependencies among the add-ons. What once felt manageable now required constant attention.<\/li>\n\n\n\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Intermittent Outages:<\/strong> Critical add-ons like CoreDNS and VPC CNI began experiencing outages, often accompanied by vague, generic error messages that made debugging a frustrating process.<\/li>\n\n\n\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Team Ownership:<\/strong> The management of cluster add-ons was distributed across various teams\u2014security, observability, and costing, to name a few. 
This division of responsibilities made it clear that we needed a more sustainable and coordinated deployment strategy.<\/li>\n<\/ul>\n\n\n\n<p class=\"sie-paragraph\"><strong class=\"sie-paragraph\">A Path Forward: Identifying the Requirements<\/strong><\/p>\n\n\n\n<p class=\"sie-paragraph\">To address these challenges, we outlined key requirements for a robust add-on management approach:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Distributed Deployment:<\/strong> We needed a system that could efficiently roll out add-ons across multiple clusters without causing disruptions.<\/li>\n\n\n\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Support for Variability:<\/strong> The solution had to accommodate clusters with different configurations, ensuring flexibility without compromising reliability.<\/li>\n\n\n\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Failure Reduction:<\/strong> Minimizing the impact of failures while maintaining deployment agility became a priority, as we aimed to reduce the risks associated with scaling.<\/li>\n<\/ul>\n\n\n\n<p class=\"sie-paragraph\"><strong class=\"sie-paragraph\">The Solution: Embracing ArgoCD<\/strong><\/p>\n\n\n\n<p class=\"sie-paragraph\">To meet these demands, we turned to ArgoCD, which provided a declarative, automated approach to managing Kubernetes add-ons:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Declarative Deployment:<\/strong> With ArgoCD, we began deploying Helm charts in a declarative manner, which not only streamlined the process but also provided clear visibility into the health of our Kubernetes workloads.<\/li>\n\n\n\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">ApplicationSet Controller:<\/strong> By using the ArgoCD ApplicationSet controller, we managed multiple clusters with a single manifest, ensuring consistency across the 
board.<\/li>\n\n\n\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Distributed Approach:<\/strong> We implemented dedicated ArgoCD instances and application sets for different teams within the organization. This approach reduced conflicts, kept workloads in sync with their source manifests, and allowed for more granular control.<\/li>\n\n\n\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Independent Operation:<\/strong> Each team was now empowered to manage their add-ons independently, with clear ownership and accountability. This not only improved efficiency but also reduced dependencies, allowing for faster, more reliable deployments.<\/li>\n<\/ul>\n\n\n\n<figure data-wp-context=\"{&quot;imageId&quot;:&quot;69d9763b5863d&quot;}\" data-wp-interactive=\"core\/image\" data-wp-key=\"69d9763b5863d\" class=\"wp-block-image wp-lightbox-container\"><img decoding=\"async\" data-wp-class--hide=\"state.isContentHidden\" data-wp-class--show=\"state.isContentVisible\" data-wp-init=\"callbacks.setButtonStyles\" data-wp-on--click=\"actions.showLightbox\" data-wp-on--load=\"callbacks.setButtonStyles\" data-wp-on-window--resize=\"callbacks.setButtonStyles\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXclLnvsIH7Bahtzun0yUgrwE5RMAPV1kc1gzN9RPR2NbdBcbVD-8ObqziCc8kYGwE8trfDPD2ITag0vnA6KAmWhuPek_LzpUPY-mojMfKm29IlameBX0EZGvP-n4qSXFgnHvElW?key=VIK-cAVeUAxHUf-SCtjLALtT\" alt=\"\" \/><button\n\t\t\tclass=\"lightbox-trigger\"\n\t\t\ttype=\"button\"\n\t\t\taria-haspopup=\"dialog\"\n\t\t\taria-label=\"Enlarge\"\n\t\t\tdata-wp-init=\"callbacks.initTriggerButton\"\n\t\t\tdata-wp-on--click=\"actions.showLightbox\"\n\t\t\tdata-wp-style--right=\"state.imageButtonRight\"\n\t\t\tdata-wp-style--top=\"state.imageButtonTop\"\n\t\t>\n\t\t\t<svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"12\" height=\"12\" fill=\"none\" viewBox=\"0 0 12 12\">\n\t\t\t\t<path fill=\"#fff\" d=\"M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 
1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z\" \/>\n\t\t\t<\/svg>\n\t\t<\/button><\/figure>\n\n\n\n<p class=\"sie-paragraph\"><strong class=\"sie-paragraph\">ArgoCD Incident: Lessons in Managing Critical Cluster Add-ons<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"sie-paragraph\">As part of our ongoing efforts to optimize cluster operations and enhance the reliability of our infrastructure, we recently faced a significant incident involving ArgoCD and critical add-ons like CoreDNS and VPC CNI. A cyclic dependency that emerged during routine updates affected the entire cluster environment. Both the tools cluster and the service clusters experienced issues when these vital add-ons were inadvertently removed during the update process.<\/li>\n\n\n\n<li class=\"sie-paragraph\">The situation was compounded by our practice of managing multiple clusters through a single manifest, which had been one of our original goals for reducing complexity. When that manifest was updated incorrectly, the blast radius was much larger: multiple clusters were affected, leading to widespread disruption. After conducting a detailed review, we traced the root cause to a few design decisions made early in the implementation phase.<\/li>\n<\/ul>\n\n\n\n<p class=\"sie-paragraph\">Specifically, managing critical cluster add-ons such as CoreDNS and VPC CNI through ArgoCD, and consolidating multiple clusters into a single manifest, introduced practical issues that ultimately compromised the stability of our system.<\/p>\n\n\n\n<p class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Post-Incident Improvements<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"sie-paragraph\">In response to these challenges, we took immediate action to improve our infrastructure&#8217;s resilience. 
First, we decided to migrate the management of critical add-ons, like CoreDNS and VPC CNI, from ArgoCD to Terraform. By tying these critical components more closely to our cluster Infrastructure-as-Code (IaC) operations, we established a more stable and reliable environment. This migration was executed seamlessly, with Terraform taking control of the existing resources without causing any downtime for applications.<\/li>\n\n\n\n<li class=\"sie-paragraph\">In addition, we split the ArgoCD add-on manifests to manage each cluster independently. This approach allows for more granular control and prevents errors in one cluster from cascading to others. To avoid the overhead of maintaining duplicate manifests for multiple clusters, we implemented an automated progressive deployment framework. This keeps add-on management smooth and consistent across all clusters, while still harnessing the power of ArgoCD for continuous delivery.<\/li>\n\n\n\n<li class=\"sie-paragraph\">Thanks to these improvements, we were able to perform the migration across all clusters without any impact on application infrastructure, achieving a more reliable and robust operational environment.<\/li>\n<\/ul>\n\n\n\n<p class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Defining Our Disaster Recovery Plan<\/strong><\/p>\n\n\n\n<p class=\"sie-paragraph\">We also defined and implemented a disaster recovery plan to make our backup strategy more effective, setting clear objectives to guide the process:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Recovery Time Objective (RTO):<\/strong> We defined the maximum acceptable downtime based on business requirements and SLAs, ensuring that our disaster recovery plan aligns with our operational needs.<\/li>\n\n\n\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Recovery Point Objective (RPO):<\/strong> By determining<strong class=\"sie-paragraph\"> 
<\/strong>the maximum acceptable data loss, we guided our backup frequency, balancing protection with resource efficiency.<\/li>\n<\/ul>\n\n\n\n<figure data-wp-context=\"{&quot;imageId&quot;:&quot;69d9763b58aaf&quot;}\" data-wp-interactive=\"core\/image\" data-wp-key=\"69d9763b58aaf\" class=\"wp-block-image wp-lightbox-container\"><img decoding=\"async\" data-wp-class--hide=\"state.isContentHidden\" data-wp-class--show=\"state.isContentVisible\" data-wp-init=\"callbacks.setButtonStyles\" data-wp-on--click=\"actions.showLightbox\" data-wp-on--load=\"callbacks.setButtonStyles\" data-wp-on-window--resize=\"callbacks.setButtonStyles\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXd6VNsIDJnqU-kh3Hz6gXWqoeCpQ3RvO_hO4wr4fb_g4yBUU816SGwee5aNev_JUNY_oEVV9z0ZFyWM9IvdsS6Y8bSGqyOqN-XMIDxjLqHinWZR2mTg_mqa5ICDxlct2CViFzG1?key=VIK-cAVeUAxHUf-SCtjLALtT\" alt=\"\" \/><button\n\t\t\tclass=\"lightbox-trigger\"\n\t\t\ttype=\"button\"\n\t\t\taria-haspopup=\"dialog\"\n\t\t\taria-label=\"Enlarge\"\n\t\t\tdata-wp-init=\"callbacks.initTriggerButton\"\n\t\t\tdata-wp-on--click=\"actions.showLightbox\"\n\t\t\tdata-wp-style--right=\"state.imageButtonRight\"\n\t\t\tdata-wp-style--top=\"state.imageButtonTop\"\n\t\t>\n\t\t\t<svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"12\" height=\"12\" fill=\"none\" viewBox=\"0 0 12 12\">\n\t\t\t\t<path fill=\"#fff\" d=\"M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z\" \/>\n\t\t\t<\/svg>\n\t\t<\/button><\/figure>\n\n\n\n<p class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Reaping the Rewards: The Benefits of Automation<\/strong><\/p>\n\n\n\n<p class=\"sie-paragraph\">Our journey to a fully automated Kubernetes platform has yielded remarkable results:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Consistent IaC and 
Efficiency: <\/strong>By selecting Terraform as our unified IaC tool, we standardized environments, reducing cluster creation issues considerably. A cookie-cutter approach cut <strong class=\"sie-paragraph\">cluster creation time by 90%<\/strong>, from 2&#8211;3 weeks to just 4&#8211;5 hours, improving consistency across the organization. Automated validation improved IaC accuracy, reducing misconfigurations considerably.<\/li>\n\n\n\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Scalability and Flexibility:<\/strong> We leveraged Kubernetes\u2019 native auto-scaling features to handle varying traffic loads, maintaining performance and user experience. External metrics through KEDA (Kubernetes Event-driven Autoscaling) allowed us to scale service pods automatically based on demand. Scalability has improved <strong class=\"sie-paragraph\">drastically<\/strong> because engineers no longer scale up service pods manually: pods scale on resource metrics, which in turn scales up worker nodes automatically so that the pods can be scheduled.<\/li>\n\n\n\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Enhanced Security and Compliance:<\/strong> Automated container scanning and policy enforcement tools like Kyverno helped us maintain continuous compliance with security and network standards. By automating backup and recovery, we reduced the <strong class=\"sie-paragraph\">RTO by 90%<\/strong>, ensuring that the platform can be restored from backup after a disaster.<\/li>\n\n\n\n<li class=\"sie-paragraph\"><strong class=\"sie-paragraph\">Increased Developer Productivity:<\/strong> The faster provisioning of Kubernetes Jenkins agents enabled quicker deployments and more frequent releases. 
Spinning up in seconds, significantly faster than EC2 agents, Kubernetes agents <strong class=\"sie-paragraph\">reduced deployment times by 70%<\/strong>.<\/li>\n<\/ul>\n\n\n\n<p class=\"sie-paragraph\"><strong class=\"sie-paragraph\">The Future: Continuing the Journey&nbsp;<\/strong><\/p>\n\n\n\n<p class=\"sie-paragraph\">As we look to the future, we remain committed to refining and expanding our automation capabilities. Our next steps include integrating more advanced scaling, enhancing our disaster recovery strategies, and exploring a service mesh on Kubernetes for even greater efficiency and cost savings.<\/p>\n\n\n\n<p class=\"sie-paragraph\">We are enabling a service mesh using Istio to provide application capabilities like mTLS authentication, encryption, policy controls, observability, and advanced traffic management without any code changes by developers. We are also establishing a multi-cluster environment so that our distributed systems stay resilient, connected, and protected.<\/p>\n\n\n\n<p class=\"sie-paragraph\">Karpenter is a Kubernetes-native cluster autoscaler and a highly efficient worker node manager. It simplifies Kubernetes infrastructure by provisioning right-sized nodes at the right time, using the most cost-effective EC2 instances for worker nodes, with fully automated lifecycle management. After a multi-month evaluation, we have begun rolling it out across our Kubernetes platform to modernize workload management.<\/p>\n\n\n\n<p class=\"sie-paragraph\">AWS Pod Identity is a new way of managing application IAM credentials in Kubernetes. 
An evolution of IRSA, Pod Identity simplifies IAM permission management for applications in Kubernetes clusters: by shifting to a unified trust policy model with granular access control, it removes the OIDC limitations we hit when onboarding large numbers of services to our platform.<\/p>\n\n\n\n<p class=\"sie-paragraph\">The journey to fully automated Kubernetes operations is ongoing, but our successes so far have proven that automation is not just a luxury\u2014it\u2019s a necessity for scaling and sustaining modern cloud-native environments. We\u2019re excited to continue pushing the boundaries of what\u2019s possible. <a class=\"sie-paragraph\" href=\"https:\/\/www.playstation.com\/en-us\/corporate\/playstation-careers\/\" target=\"_blank\" rel=\"noreferrer noopener\">We invite you to join us on this<\/a> journey to operational excellence and make PlayStation the best place to work.<\/p>\n\n\n\n<div class=\"wp-block-sie-scroll-to-top\" data-aa-modulename=\"sie-scroll-to-top\"><button class=\"sie-btn sie-btn--action\" type=\"button\"><span>Scroll to Top<\/span><\/button><\/div>\n","protected":false,"author":34,"parent":0,"template":"","byline":[319],"research-post-category":[197],"class_list":["post-3010744","research-post","type-research-post","status-publish","hentry","research-post-category-cloud","post-fully-automated-kubernetes-operations"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v27.3 (Yoast SEO v27.3) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>Fully Automated Kubernetes Operations - Sony Interactive Entertainment<\/title>\n<meta name=\"robots\" content=\"noindex, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Fully Automated Kubernetes Operations\" \/>\n<meta property=\"og:description\" content=\"Now, imagine a world where 
your Kubernetes clusters manage themselves. No more late-night crisis management, no manual updates, and no firefighting. What might sound like a distant fantasy is the new reality at PSN Platform. In the fast-paced, cloud-native world, Kubernetes quickly became the standard for container orchestration, offering immense power but at a cost: [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/sie-dev.altis.cloud\/en\/innovation\/research-academia\/research\/fully-automated-kubernetes-operations\/\" \/>\n<meta property=\"og:site_name\" content=\"Sony Interactive Entertainment\" \/>\n<meta property=\"article:modified_time\" content=\"2025-03-07T17:59:16+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXcHm7s-gWTVs4Hbnte03_F8PEq1idHP3BtHy8fS5g6ebglLi_N1u5GgFrl-1cIu0P0Ir11NbQhaEnbwbt0bC-zfbXrf5SVq3lO5jn_X1uqUWv2to1-LZSP4QmvsIqT94HUGY3PEFQ?key=VIK-cAVeUAxHUf-SCtjLALtT\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"10 minutes\" \/>\n\t<meta name=\"twitter:label2\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data2\" content=\"saorib\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/sie-dev.altis.cloud\\\/en\\\/innovation\\\/research-academia\\\/research\\\/fully-automated-kubernetes-operations\\\/\",\"url\":\"https:\\\/\\\/sie-dev.altis.cloud\\\/en\\\/innovation\\\/research-academia\\\/research\\\/fully-automated-kubernetes-operations\\\/\",\"name\":\"Fully Automated Kubernetes Operations - Sony Interactive Entertainment\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/sie-dev.altis.cloud\\\/en\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/sie-dev.altis.cloud\\\/en\\\/innovation\\\/research-academia\\\/research\\\/fully-automated-kubernetes-operations\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/sie-dev.altis.cloud\\\/en\\\/innovation\\\/research-academia\\\/research\\\/fully-automated-kubernetes-operations\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/lh7-rt.googleusercontent.com\\\/docsz\\\/AD_4nXcHm7s-gWTVs4Hbnte03_F8PEq1idHP3BtHy8fS5g6ebglLi_N1u5GgFrl-1cIu0P0Ir11NbQhaEnbwbt0bC-zfbXrf5SVq3lO5jn_X1uqUWv2to1-LZSP4QmvsIqT94HUGY3PEFQ?key=VIK-cAVeUAxHUf-SCtjLALtT\",\"datePublished\":\"2025-01-17T20:50:33+00:00\",\"dateModified\":\"2025-03-07T17:59:16+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/sie-dev.altis.cloud\\\/en\\\/innovation\\\/research-academia\\\/research\\\/fully-automated-kubernetes-operations\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/sie-dev.altis.cloud\\\/en\\\/innovation\\\/research-academia\\\/research\\\/fully-automated-kubernetes-operations\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/sie-dev.altis.cloud\\\/en\\\/innovation\\\/r
esearch-academia\\\/research\\\/fully-automated-kubernetes-operations\\\/#primaryimage\",\"url\":\"https:\\\/\\\/lh7-rt.googleusercontent.com\\\/docsz\\\/AD_4nXcHm7s-gWTVs4Hbnte03_F8PEq1idHP3BtHy8fS5g6ebglLi_N1u5GgFrl-1cIu0P0Ir11NbQhaEnbwbt0bC-zfbXrf5SVq3lO5jn_X1uqUWv2to1-LZSP4QmvsIqT94HUGY3PEFQ?key=VIK-cAVeUAxHUf-SCtjLALtT\",\"contentUrl\":\"https:\\\/\\\/lh7-rt.googleusercontent.com\\\/docsz\\\/AD_4nXcHm7s-gWTVs4Hbnte03_F8PEq1idHP3BtHy8fS5g6ebglLi_N1u5GgFrl-1cIu0P0Ir11NbQhaEnbwbt0bC-zfbXrf5SVq3lO5jn_X1uqUWv2to1-LZSP4QmvsIqT94HUGY3PEFQ?key=VIK-cAVeUAxHUf-SCtjLALtT\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/sie-dev.altis.cloud\\\/en\\\/innovation\\\/research-academia\\\/research\\\/fully-automated-kubernetes-operations\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/sie-dev.altis.cloud\\\/en\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Fully Automated Kubernetes Operations\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/sie-dev.altis.cloud\\\/en\\\/#website\",\"url\":\"https:\\\/\\\/sie-dev.altis.cloud\\\/en\\\/\",\"name\":\"Sony Interactive Entertainment\",\"description\":\"Pushing the Boundaries of Play\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/sie-dev.altis.cloud\\\/en\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"Fully Automated Kubernetes Operations - Sony Interactive Entertainment","robots":{"index":"noindex","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"og_locale":"en_US","og_type":"article","og_title":"Fully Automated Kubernetes Operations","og_description":"Now, imagine a world where your Kubernetes clusters manage themselves. No more late-night crisis management, no manual updates, and no firefighting. What might sound like a distant fantasy is the new reality at PSN Platform. In the fast-paced, cloud-native world, Kubernetes quickly became the standard for container orchestration, offering immense power but at a cost: [&hellip;]","og_url":"https:\/\/sie-dev.altis.cloud\/en\/innovation\/research-academia\/research\/fully-automated-kubernetes-operations\/","og_site_name":"Sony Interactive Entertainment","article_modified_time":"2025-03-07T17:59:16+00:00","og_image":[{"url":"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXcHm7s-gWTVs4Hbnte03_F8PEq1idHP3BtHy8fS5g6ebglLi_N1u5GgFrl-1cIu0P0Ir11NbQhaEnbwbt0bC-zfbXrf5SVq3lO5jn_X1uqUWv2to1-LZSP4QmvsIqT94HUGY3PEFQ?key=VIK-cAVeUAxHUf-SCtjLALtT","type":"","width":"","height":""}],"twitter_card":"summary_large_image","twitter_misc":{"Est. 
reading time":"10 minutes","Written by":"saorib"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/sie-dev.altis.cloud\/en\/innovation\/research-academia\/research\/fully-automated-kubernetes-operations\/","url":"https:\/\/sie-dev.altis.cloud\/en\/innovation\/research-academia\/research\/fully-automated-kubernetes-operations\/","name":"Fully Automated Kubernetes Operations - Sony Interactive Entertainment","isPartOf":{"@id":"https:\/\/sie-dev.altis.cloud\/en\/#website"},"primaryImageOfPage":{"@id":"https:\/\/sie-dev.altis.cloud\/en\/innovation\/research-academia\/research\/fully-automated-kubernetes-operations\/#primaryimage"},"image":{"@id":"https:\/\/sie-dev.altis.cloud\/en\/innovation\/research-academia\/research\/fully-automated-kubernetes-operations\/#primaryimage"},"thumbnailUrl":"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXcHm7s-gWTVs4Hbnte03_F8PEq1idHP3BtHy8fS5g6ebglLi_N1u5GgFrl-1cIu0P0Ir11NbQhaEnbwbt0bC-zfbXrf5SVq3lO5jn_X1uqUWv2to1-LZSP4QmvsIqT94HUGY3PEFQ?key=VIK-cAVeUAxHUf-SCtjLALtT","datePublished":"2025-01-17T20:50:33+00:00","dateModified":"2025-03-07T17:59:16+00:00","breadcrumb":{"@id":"https:\/\/sie-dev.altis.cloud\/en\/innovation\/research-academia\/research\/fully-automated-kubernetes-operations\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/sie-dev.altis.cloud\/en\/innovation\/research-academia\/research\/fully-automated-kubernetes-operations\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/sie-dev.altis.cloud\/en\/innovation\/research-academia\/research\/fully-automated-kubernetes-operations\/#primaryimage","url":"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXcHm7s-gWTVs4Hbnte03_F8PEq1idHP3BtHy8fS5g6ebglLi_N1u5GgFrl-1cIu0P0Ir11NbQhaEnbwbt0bC-zfbXrf5SVq3lO5jn_X1uqUWv2to1-LZSP4QmvsIqT94HUGY3PEFQ?key=VIK-cAVeUAxHUf-SCtjLALtT","contentUrl":"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXcHm7s-gWTVs4Hbnte03_F8PEq1idHP3BtHy
8fS5g6ebglLi_N1u5GgFrl-1cIu0P0Ir11NbQhaEnbwbt0bC-zfbXrf5SVq3lO5jn_X1uqUWv2to1-LZSP4QmvsIqT94HUGY3PEFQ?key=VIK-cAVeUAxHUf-SCtjLALtT"},{"@type":"BreadcrumbList","@id":"https:\/\/sie-dev.altis.cloud\/en\/innovation\/research-academia\/research\/fully-automated-kubernetes-operations\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/sie-dev.altis.cloud\/en\/"},{"@type":"ListItem","position":2,"name":"Fully Automated Kubernetes Operations"}]},{"@type":"WebSite","@id":"https:\/\/sie-dev.altis.cloud\/en\/#website","url":"https:\/\/sie-dev.altis.cloud\/en\/","name":"Sony Interactive Entertainment","description":"Pushing the Boundaries of Play","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/sie-dev.altis.cloud\/en\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"}]}},"ab_tests":{},"_links":{"self":[{"href":"https:\/\/sie-dev.altis.cloud\/en\/wp-json\/wp\/v2\/research-post\/3010744","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sie-dev.altis.cloud\/en\/wp-json\/wp\/v2\/research-post"}],"about":[{"href":"https:\/\/sie-dev.altis.cloud\/en\/wp-json\/wp\/v2\/types\/research-post"}],"author":[{"embeddable":true,"href":"https:\/\/sie-dev.altis.cloud\/en\/wp-json\/wp\/v2\/users\/34"}],"version-history":[{"count":2,"href":"https:\/\/sie-dev.altis.cloud\/en\/wp-json\/wp\/v2\/research-post\/3010744\/revisions"}],"predecessor-version":[{"id":3010998,"href":"https:\/\/sie-dev.altis.cloud\/en\/wp-json\/wp\/v2\/research-post\/3010744\/revisions\/3010998"}],"wp:attachment":[{"href":"https:\/\/sie-dev.altis.cloud\/en\/wp-json\/wp\/v2\/media?parent=3010744"}],"wp:term":[{"taxonomy":"byline","embeddable":true,"href":"https:\/\/sie-dev.altis.cloud\/en\/wp-json\/wp\/v2\/byline?post=3010744"},{"taxonomy":"research-post-category","embeddable":true,"href":"https:\/\/sie-dev.a
ltis.cloud\/en\/wp-json\/wp\/v2\/research-post-category?post=3010744"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}