Jun 2024 - Present
Amazon Web Services - Cloud Support Engineer
Lead production troubleshooting and root-cause analysis for enterprise-scale AWS workloads
(EC2, EKS, DNS, networking, load balancing), leveraging CloudWatch metrics, Grafana dashboards,
VPC Flow Logs, and Python-based diagnostics to restore reliability.
Technical customer support delivery
- Provide technical guidance to customers through email, phone, and live chat,
translating AWS-service internals into actionable steps for on-call engineers and
platform owners.
- Diagnose and resolve complex technical issues by analysing metrics, logs, packet
captures, and system behaviours to identify root causes and implement effective
solutions.
- Maintain a reusable Terraform toolkit for common customer environments, reducing case
resolution time by 10-30% by reproducing customer setups quickly instead of rebuilding
them from scratch for every case.
FIX protocol disconnections on Network Load Balancers
Worked on a long-running reliability issue where customers experienced intermittent
disconnections on long-lived FIX protocol connections running through a VPC PrivateLink
endpoint. The case was unusually hard to reason about because it affected multiple
customers in different time zones, produced no clear application errors, and was not
consistently reproducible on demand.
- Collaborated with the affected customers to collect VPC Flow Logs alongside packet
captures from both the client and server sides, so the signal set was wide enough to
rule out common explanations before proposing a root cause.
- Kept a daily written summary of the findings throughout the investigation so that
everyone involved (across time zones and teams) was aligned on what had been ruled out
and what was still open.
- Identified a previously unknown issue inside the service and worked with the owning
service team to develop and deploy a fix, improving reliability for the affected
customers.
- Invited to present the case at a regional all-hands so other engineers could recognise
the same pattern faster in the future.
ALB operational issue identification and product-reliability improvement
- Identified a recurring ALB DNS failover issue by analysing internal system logs and
reproducing the behaviour with a fellow engineer, then delivered a temporary mitigation
plan to the customer.
- Captured and analysed packets in Wireshark to clarify the ALB connection mechanism,
then shared improvement suggestions with the internal development team to inform a
longer-term product fix.
Customer IP blockage investigation and escalation
- Investigated customer reports of being blocked from AWS web properties by correlating
customer-provided logs with internal telemetry and confirming the IPs had been blocked
due to cyberattack traffic from shared proxies.
- Escalated beyond standard support by collaborating with my manager and the sales team
to engage the internal security team, ensuring the resolution path went through the
right channel rather than bouncing between queues.
EKS cross-cluster autoscaling misconfiguration
A customer reported that when they terminated a node in Cluster B, a new node would
unexpectedly launch in Cluster A. The obvious next step would have been to dig into the
Cluster Autoscaler or Karpenter logs, but those were not available on the affected side, so
I had to reason from adjacent signals instead.
- Checked CloudTrail logs for Auto Scaling events and confirmed the scale-out action was
triggered by the Cluster Autoscaler rather than Karpenter, narrowing the search before
inspecting any configuration.
- Inspected the IAM role used by the Cluster Autoscaler and noticed its OIDC trust
relationship pointed at Cluster A, which immediately suggested a cross-cluster
misconfiguration.
- Hypothesised that the customer had copied the autoscaler configuration from Cluster A
to Cluster B without updating the IAM role and OIDC trust, so Cluster B's autoscaler
was authorised to act on Cluster A's resources. Confirmed the root cause once we
collected Cluster B's autoscaler configuration file.
- The cross-cluster scaling behaviour stopped once the customer scoped each autoscaler
role to its own cluster. The case made a clean example of diagnosing an EKS issue from
infrastructure and security-configuration signals rather than application logs.
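The trust-relationship check described in the bullets above can be sketched roughly as follows. This is an illustrative sketch, not the tooling used on the case: the account ID, region, issuer IDs, and function names are hypothetical, and it assumes the role's trust policy document and each cluster's OIDC issuer URL have already been retrieved and are available as plain Python values.

```python
# Sketch: does an IAM role's trust policy name a given EKS cluster's OIDC provider?
# All identifiers below (account ID, issuer IDs, function names) are hypothetical.


def oidc_providers_in_trust_policy(trust_policy):
    """Return the OIDC provider identifiers listed as Federated principals."""
    providers = set()
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]
    for statement in statements:
        principal = statement.get("Principal", {})
        if not isinstance(principal, dict):  # e.g. "Principal": "*"
            continue
        federated = principal.get("Federated", [])
        if isinstance(federated, str):
            federated = [federated]
        for arn in federated:
            # ARN shape: arn:aws:iam::<account>:oidc-provider/<issuer without scheme>
            marker = ":oidc-provider/"
            if marker in arn:
                providers.add(arn.split(marker, 1)[1])
    return providers


def trusts_cluster(trust_policy, cluster_issuer_url):
    """True if the trust policy references the cluster's OIDC issuer."""
    issuer = cluster_issuer_url
    if issuer.startswith("https://"):
        issuer = issuer[len("https://"):]
    return issuer in oidc_providers_in_trust_policy(trust_policy)


# Hypothetical issuer URLs for the two clusters.
CLUSTER_A_ISSUER = "https://oidc.eks.eu-west-1.amazonaws.com/id/AAAA1111"
CLUSTER_B_ISSUER = "https://oidc.eks.eu-west-1.amazonaws.com/id/BBBB2222"

# Hypothetical trust policy of the role used by Cluster B's autoscaler.
EXAMPLE_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": (
                    "arn:aws:iam::111122223333:oidc-provider/"
                    "oidc.eks.eu-west-1.amazonaws.com/id/AAAA1111"
                )
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
        }
    ],
}

print("trusts Cluster A:", trusts_cluster(EXAMPLE_TRUST_POLICY, CLUSTER_A_ISSUER))
print("trusts Cluster B:", trusts_cluster(EXAMPLE_TRUST_POLICY, CLUSTER_B_ISSUER))
```

In this made-up example the role trusts only Cluster A's provider, which is the kind of mismatch that pointed towards a configuration copied between clusters.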
Terraform-based lab for customer-environment reproduction
Many networking and Kubernetes cases involve complex customer environments that cannot be
fully observed from logs or configuration snippets alone. Troubleshooting directly from
what customers can share is often slow, so I invested in a reusable reproduction
environment.
- Built modular Terraform configurations that can quickly stand up typical customer
architectures: VPCs, EKS clusters, load balancers, and common networking pieces such
as NAT gateways and security-group layouts.
- Use the toolkit to recreate a similar environment in my own AWS account when a new
case arrives, testing hypotheses directly instead of bouncing questions back to the
customer.
- Reduced my average case resolution time by around 10% by turning "ask the customer to
re-run something" into "reproduce it in my own account first."
Security and compliance automation
- Automated security notification workflows in support of ISO 27001 compliance using
GuardDuty and EventBridge, so customers could detect and react to suspicious activity
without relying on periodic manual reviews.
- Wired the notifications into existing customer alerting paths so the new signal
improved their compliance posture without adding a new surface to maintain.