Brief History

I am a Senior Cloud & Platform Engineer (DevOps & SRE) with 10+ years of experience building and operating platforms on AWS, GCP and Azure. I focus on reliable cloud-native platforms and Kubernetes ecosystems, backed by Infrastructure as Code and strong observability – helping product teams ship faster, stay resilient and keep cloud costs under control.

Over the last decade, I have worked with e-commerce, fintech and analytics teams, designing and operating cloud platforms that power real-world products. Recently my work has spanned AWS, Azure and GCP, using Terraform and AWS CDK in Python, Kubernetes where it fits, and observability tooling such as Datadog and Dynatrace.

I enjoy turning loosely defined problems into automated, reliable platforms – from greenfield designs to migrations, cost optimisation and incident response.

Here are some highlights of my profile:


  • Senior Cloud & DevOps Engineer (10+ years)
  • Hands-on with AWS, Azure & GCP
  • Infrastructure as Code: Terraform & AWS CDK (Python)
  • Writes about DevOps & cloud on Medium
  • Strong incident management and on-call experience
  • MSc in Cybersecurity
  • DevOps mindset – close collaboration with developers
  • Microservices & Kubernetes platforms
  • Change & release management in regulated environments
  • CI/CD automation and platform governance


Experience

Who I am and what I do

Senior Cloud Engineer

European Online Retail Platform

Designed and operated a Kubernetes-based cloud platform for a large European e-commerce organisation, hosting high-traffic web and backend services.

  • Deployed and evolved AWS EKS using Infrastructure as Code (Terraform and AWS CDK in Python).
  • Supported ~60% workload growth while maintaining 99.99% uptime.
  • Reduced cloud spend by ~20% through rightsizing, cleanup and autoscaling improvements.

Site Reliability Engineer

Online Auctions & Automotive Platform

Worked as an SRE for an online auctions and automotive platform, focusing on reliability, performance and modernising legacy systems.

  • Implemented monitoring, logging and alerting for critical services.
  • Helped drive cloud migration and containerisation efforts.
  • Supported incident response and post-incident reviews.

Associate Technical Lead – DevOps

Global Payments & Fintech

Led DevOps initiatives for a global payments and fintech organisation, building secure, scalable infrastructure for payment and merchant services.

  • Designed CI/CD pipelines and environments for payment services.
  • Strengthened security and compliance posture of cloud workloads.
  • Collaborated with engineering and product on deployment strategies.

Senior DevOps Engineer

Foodservice & Supply Chain Technology

Supported large-scale foodservice and supply chain systems, modernising infrastructure and improving deployment workflows.

  • Migrated services to containerised and cloud-native architectures.
  • Automated infrastructure and deployments with Terraform and CI/CD.
  • Improved reliability and observability across multiple environments.

DevOps Engineer / Senior DevOps Engineer

Analytics & Machine Learning Products

Built and operated cloud infrastructure for analytics and machine learning products used by customers across multiple industries.

  • Managed CI/CD, environments and configuration for product teams.
  • Introduced monitoring, logging and alerting as part of SRE practices.
  • Worked closely with data science and engineering teams.

Systems Engineer

Travel & Enterprise Solutions

Worked on enterprise systems in the travel and hospitality space, supporting mission-critical applications and infrastructure.

  • Supported application deployments and production operations.
  • Collaborated with developers and QA on release processes.
  • Gained strong foundations in Linux, networking and automation.

Associate Application Support Engineer

Capital Markets & Trading Platforms

Started my career supporting capital markets and trading platforms, working closely with customers and engineering teams.

  • Provided application support and troubleshooting for trading systems.
  • Helped investigate incidents and performance issues.
  • Built a foundation in financial technology and market data.
Projects & Case Studies

Selected impact stories

A few examples of real projects where I designed, debugged and improved cloud platforms with measurable impact.

Fixing EKS Cluster Autoscaler after AL2023 migration (IRSA + RBAC)

Role: Senior Cloud Engineer for a high-traffic e-commerce EKS platform.

During our migration from Amazon Linux 2 (AL2) to Amazon Linux 2023 (AL2023), the EKS Cluster Autoscaler suddenly stopped scaling: pods were stuck in Pending and logs showed “Failed to get nodes from apiserver: Unauthorized”. AL2023's stricter instance-metadata defaults (IMDSv2 with a hop limit of 1) broke our previous assumption that the autoscaler could “borrow” the node IAM role via instance metadata.

  • Identified that the autoscaler was implicitly using the node IAM role via EC2 instance metadata, which no longer worked with AL2023 defaults.
  • Moved the autoscaler to a dedicated IAM Role for Service Accounts (IRSA) with least-privilege AWS permissions.
  • Created a Kubernetes service account + RBAC role so the autoscaler had exactly the cluster permissions it needed (nodes, pods, leases, etc.).
  • Cleaned up legacy permissions on the node IAM role to remove hidden dependency on metadata and reduce blast radius.
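The core of the IRSA pattern above is a ServiceAccount carrying the `eks.amazonaws.com/role-arn` annotation that EKS's pod identity webhook acts on. A minimal sketch of rendering that manifest in Python — the role ARN, account ID and names are hypothetical placeholders, not values from the real cluster:

```python
# Sketch: render the ServiceAccount manifest that binds the Cluster Autoscaler
# to a dedicated IRSA role. The role ARN, namespace and account ID below are
# hypothetical placeholders, not values from the original cluster.

def irsa_service_account(name: str, namespace: str, role_arn: str) -> dict:
    """Build a ServiceAccount manifest carrying the EKS IRSA annotation."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # EKS's pod identity webhook reads this annotation and injects
            # AWS_ROLE_ARN / AWS_WEB_IDENTITY_TOKEN_FILE into the pod,
            # so the autoscaler no longer touches instance metadata.
            "annotations": {"eks.amazonaws.com/role-arn": role_arn},
        },
    }

manifest = irsa_service_account(
    "cluster-autoscaler",
    "kube-system",
    "arn:aws:iam::111122223333:role/cluster-autoscaler-irsa",  # hypothetical
)
print(manifest["metadata"]["annotations"]["eks.amazonaws.com/role-arn"])
```

In the real setup this manifest is paired with a Kubernetes RBAC role granting only the resources the autoscaler reads and writes (nodes, pods, leases), keeping AWS-side and cluster-side permissions cleanly separated.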

Impact: Restored safe, predictable autoscaling on AL2023 in non-production before touching production, and created a reusable IRSA + RBAC pattern for other controllers (Cluster Autoscaler, ExternalDNS, load balancer controllers) across the organisation.

Read the full story on Medium »

Key lessons & tech stack
  • Move critical controllers (like Cluster Autoscaler) to IRSA or Pod Identity before changing AMIs.
  • Separate concerns: IRSA for AWS APIs, Kubernetes RBAC for what the pod can do inside the cluster.
  • Treat AMI upgrades as application changes: test in non-production with cordon/drain and synthetic scale-up/scale-down runs.

Why IRSA here: For this incident, IRSA was the fastest safe fix – the cluster already had an OIDC provider, the Helm chart supported the “service account + annotation” pattern, and our AWS CDK stack had IRSA helpers. Pod Identity stays on the roadmap for new clusters where we can design the model from day one.

Tech stack: AWS EKS, Amazon Linux 2 & Amazon Linux 2023, Kubernetes Cluster Autoscaler, IAM Roles for Service Accounts (IRSA), Kubernetes RBAC, EKS OIDC, Terraform / AWS CDK (Python), Datadog.

Tag-based log retention in Datadog – giving ownership back to teams

Role: Senior Cloud & DevOps Engineer, leading Datadog governance for a multi-team engineering organisation.

Our Datadog logs setup “worked”, but ownership and costs were blurry. Dozens of teams shipped logs with inconsistent tags and ad-hoc indexes, making it hard to see who owned which volume, how long data stayed, and why costs kept creeping up.

  • Designed a tag-based, retention-first indexing strategy where teams choose their retention while the platform enforces guardrails.
  • Defined mandatory tags on every log: team, costcenter, appgroup, env and retention.
  • Replaced per-team indexes with shared “retention lanes” (3 / 7 / 15 / 30 / 90 days) driven entirely by tags.
  • Introduced a temporary “punishment lane” with short retention and a quota for untagged or badly tagged logs.
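The routing idea above can be sketched as plain filter-query construction: each retention lane matches only logs that carry every mandatory tag and declare that lane's retention, while untagged logs fall through to the punishment lane. The query strings follow Datadog's log search syntax, but the exact production filters are an assumption, not copied from the real setup:

```python
# Sketch of the tag-based routing model: one strict filter per retention lane,
# plus a catch-all for logs missing any mandatory tag. Illustrative only —
# the real production queries are not reproduced here.

MANDATORY_TAGS = ("team", "costcenter", "appgroup", "env")
RETENTION_LANES = (3, 7, 15, 30, 90)  # days

def lane_filter(days: int) -> str:
    """Build the filter query for one retention-lane index."""
    has_all_tags = " AND ".join(f"{tag}:*" for tag in MANDATORY_TAGS)
    return f"{has_all_tags} AND retention:{days}"

def fallback_filter() -> str:
    """Catch logs missing any mandatory tag (the short-retention lane)."""
    return " OR ".join(f"-{tag}:*" for tag in MANDATORY_TAGS)

for days in RETENTION_LANES:
    print(f"index-retention-period-{days:02d}: {lane_filter(days)}")
```

Because the filters are derived from one tag list, adding a mandatory tag or a new lane is a one-line change rather than an audit of dozens of hand-written index filters.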

Impact: Made log retention an explicit product team decision instead of a central bottleneck, improved cost transparency and paved the way for full IaC ownership of Datadog log indexes and enforcement rules.

Read the full story on Medium »

Index model, guardrails & automation
  • Created strict retention indexes such as index-retention-period-03, -07, -15, -30, -90 matching only fully tagged logs with allowed retention values.
  • Added a 7-day temporary index with a daily quota for logs missing mandatory tags, giving teams short-term visibility but a strong incentive to fix tagging.
  • Built a global Datadog monitor that detects logs with missing tags or invalid retention and alerts the platform team.
  • Implemented a LogsIndexManager module in Pulumi (Python) to manage indexes, routing rules and (optionally) index order and enforcement.
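Datadog evaluates index filters in order, so the module's main job before creating any resources is to compute an ordered list of index specs with the quota-limited fallback last. A minimal sketch of that computation — index names, the daily quota value and the exact queries are illustrative assumptions, and the actual `pulumi_datadog.LogsIndex` wiring is omitted:

```python
# Sketch of what a LogsIndexManager-style module computes before creating
# Datadog log index resources: an ordered list of index specs, with the
# quota-limited fallback index last so fully tagged logs always match a
# strict lane first. Names, quota and queries are illustrative assumptions.

FALLBACK = {
    "name": "index-untagged-temp",
    "retention_days": 7,
    "daily_limit": 1_000_000,  # hypothetical daily quota (events/day)
    "query": "-team:* OR -costcenter:* OR -appgroup:* OR -env:* OR -retention:*",
}

def ordered_index_specs(lanes=(3, 7, 15, 30, 90)) -> list:
    """Return index specs in evaluation order: strict lanes first, fallback last."""
    specs = [
        {
            "name": f"index-retention-period-{d:02d}",
            "retention_days": d,
            "daily_limit": None,  # strict lanes are not quota-limited
            "query": f"team:* AND costcenter:* AND appgroup:* AND env:* AND retention:{d}",
        }
        for d in lanes
    ]
    # First matching filter wins, so the catch-all must come last.
    return specs + [FALLBACK]

for spec in ordered_index_specs():
    print(spec["name"], spec["retention_days"])
```

Keeping the ordering logic in one pure function makes it easy to unit-test before any infrastructure change is applied.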

Tech stack: Datadog logs & monitors, tag-based routing and indexes, Pulumi (pulumi-datadog), AWS workloads (EKS/Lambda/EC2), shared tagging model for logs, metrics and traces across 70+ engineering teams.

Migrating container images from GCP to AWS ECR safely and repeatably

Role: Platform/DevOps Engineer leading a registry migration from Google Artifact Registry / Container Registry to AWS ECR.

As part of a wider platform move to AWS, dozens of image repositories had to move from GCP (with hierarchical paths like eu.gcr.io/project/app/service) to AWS ECR, which uses flatter repositories and tags. A naive “pull & push” risked overwriting tags or losing the original structure.

  • Designed a deterministic mapping from GCP’s hierarchical image paths to ECR repositories and tags (for example project-app-service:1.2.3).
  • Built a Python CLI that discovers tags, pulls from GCP, retags, pushes to ECR and validates digests to ensure images are identical.
  • Added tag filtering (semver awareness, prefix filters, --limit) and a safe dry-run mode with clear logging.
  • Included retry logic and optional cleanup so teams could migrate repositories one by one with confidence.
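The deterministic mapping is the heart of the tool: hierarchical GCP paths are flattened into a single hyphen-joined ECR repository name while the tag is preserved. A minimal sketch, assuming the hyphen-joining rule described above; the registry hosts and project names are placeholders, and edge cases (digest references, ports in hostnames) are out of scope:

```python
# Minimal sketch of the deterministic GCP -> ECR path mapping described above.
# Registry hosts and project names are placeholders; the 'latest' default for
# untagged refs is an assumption for illustration.

def map_gcp_to_ecr(image_ref: str, ecr_registry: str) -> str:
    """Map eu.gcr.io/project/app/service:tag -> <ecr>/project-app-service:tag."""
    path, _, tag = image_ref.partition(":")
    tag = tag or "latest"           # assumption: untagged refs default to 'latest'
    _, *segments = path.split("/")  # drop the registry host
    repo = "-".join(segments)       # flatten the hierarchy into one repo name
    return f"{ecr_registry}/{repo}:{tag}"

print(map_gcp_to_ecr(
    "eu.gcr.io/project/app/service:1.2.3",
    "111122223333.dkr.ecr.eu-west-1.amazonaws.com",  # hypothetical registry
))
# -> 111122223333.dkr.ecr.eu-west-1.amazonaws.com/project-app-service:1.2.3
```

Because the mapping is a pure function, the same logic backs both the dry-run output and the real pull/retag/push path, so what teams see in a dry run is exactly what gets pushed.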

Impact: Enabled teams to migrate image repositories without accidentally overwriting tags or losing traceability, and produced a reusable migration tool that can be shared or open-sourced for similar GCP → ECR moves.

Read the full story on Medium »

Migration workflow & tech stack
  • Discover images and tags from GCP, then normalise hierarchical image names into ECR-compatible repository + tag pairs using a deterministic mapping.
  • For each selected tag: pull from GCP → retag to the mapped ECR repository/tag → push to ECR → compare digests.
  • On digest match, optionally clean up local images and, if desired, the source images in GCP.
  • Log every action (discovery, mapping, pull, push, validation) with clear, human-readable output so teams can audit exactly how GCP paths were translated into ECR.
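The digest-comparison step can be sketched as a small pure helper: Docker reports repository digests as `<repo>@sha256:<hex>`, so validating a migration reduces to extracting and comparing the digest parts. The example digests below are fabricated placeholders, and invoking `docker inspect` to obtain the RepoDigests entries is left out:

```python
# Sketch of the digest-validation step: after pushing, compare the source and
# destination RepoDigests entries. Docker reports these as "<repo>@sha256:<hex>";
# the helpers below normalise and compare them. Example digests are fabricated.

def digest_of(repo_digest: str) -> str:
    """Extract the digest from a RepoDigests entry like 'repo@sha256:...'."""
    _, _, digest = repo_digest.partition("@")
    return digest

def digests_match(src: str, dst: str) -> bool:
    """True when both entries point at the same image manifest."""
    return digest_of(src) == digest_of(dst) != ""

src = "eu.gcr.io/project/app/service@sha256:" + "ab" * 32                        # placeholder
dst = "111122223333.dkr.ecr.eu-west-1.amazonaws.com/project-app-service@sha256:" + "ab" * 32
print(digests_match(src, dst))  # -> True
```

Guarding the optional cleanup behind this check means source images are only ever deleted once the destination is proven byte-identical.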

Tech stack: Python, Docker CLI, Google Artifact Registry / Container Registry, AWS ECR, AWS CLI, bash automation and CI integration where needed. The tool encapsulates the GCP hierarchical naming model and the flatter AWS ECR repository/tag model so teams don’t have to think about it on every migration.

Writing & Talks

From my blog & podcast

I write about Cloud, DevOps and platform engineering on Medium, and occasionally join podcasts to share lessons from real-world migrations and incidents.

Podcast / YouTube talk

A recent conversation where I talk about my work, platform engineering and lessons learned.

Tech stack

Tools I Use

  • AWS – Cloud Platform
  • Azure – Cloud Platform
  • GCP – Cloud Platform
  • Docker – Containers
  • Kubernetes – Container Orchestration
  • ServiceNow – Incident & Change Management
  • Jira Service Management – Incident & Change Management
  • xMatters – Incident Escalation
  • CrowdStrike – Security Scanning
  • Git – SCM, CI/CD
  • Jenkins – Automation / CI/CD
  • Datadog – Monitoring
  • Terraform – Infrastructure as Code
Education

School & University


School Education – 2011

Completed primary and secondary education at Royal College, Colombo 7, Sri Lanka.



Bachelor of Science – 2016

BSc (Hons) in Information Technology from the Sri Lanka Institute of Information Technology (SLIIT).



Master of Science – 2019

MSc in Information Technology – Cyber Security from the Sri Lanka Institute of Information Technology (SLIIT).



Get in touch



Contact Me


Location:
Rotterdam, South Holland, Netherlands

Email:

info@dilshanwijesooriya.me