Platform Engineer

Building reliable systems on AWS and Kubernetes.

Platform Engineer with 4+ years of experience operating distributed systems on AWS and Kubernetes. Focused on reliability engineering, incident response, production troubleshooting, and root-cause analysis.

I work on improving availability, reducing operational cost, and strengthening observability in high-traffic enterprise environments.

Snapshot

Skills, Certifications, and Contact

Technical Skills

Certifications

AWS Certified Solutions Architect - Professional (SAP)
Certified Kubernetes Administrator (CKA)
Certified Kubernetes Application Developer (CKAD)

Contact

Professional Experience

Jun 2024 - Present

Amazon Web Services - Cloud Support Engineer

Lead production troubleshooting and root-cause analysis for enterprise-scale AWS workloads (EC2, EKS, DNS, networking, load balancing), leveraging CloudWatch metrics, Grafana dashboards, VPC Flow Logs, and Python-based diagnostics to restore reliability.
Technical customer support delivery
- Provide technical guidance to customers through email, phone, and live chat, translating AWS-service internals into actionable steps for on-call engineers and platform owners.
- Diagnose and resolve complex technical issues by analysing metrics, logs, packet captures, and system behaviours to identify root causes and implement effective solutions.
- Maintain a reusable Terraform toolkit for common customer environments, cutting case resolution time by 10%-30% because customer setups can be reproduced quickly instead of being rebuilt from scratch for each case.
FIX protocol disconnections on Network Load Balancers

Worked on a long-running reliability issue where customers experienced intermittent disconnections on long-lived FIX protocol connections running through a VPC PrivateLink endpoint. The case was unusually hard to reason about because it affected multiple customers in different time zones, produced no clear application errors, and was not consistently reproducible on demand.
- Collaborated with the affected customers to collect VPC Flow Logs alongside packet captures from both the client and server sides, so the signal set was wide enough to rule out common explanations before proposing a root cause.
- Kept a daily written summary of the findings throughout the investigation so that everyone involved (across time zones and teams) was aligned on what had been ruled out and what was still open.
- Identified a previously unknown issue inside the service and worked with the owning service team to develop and deploy a fix, improving reliability for the affected customers.
- Invited to present the case at a regional all-hands so other engineers could recognise the same pattern faster in the future.
ALB operational issue identification and product-reliability improvement
- Identified a recurring ALB DNS failover issue by analysing internal system logs and reproducing the behaviour with a fellow engineer, then delivered a temporary mitigation plan to the customer.
- Captured and analysed packets in Wireshark to clarify the ALB connection mechanism, then shared improvement suggestions with the internal development team to inform a longer-term product fix.
Customer IP blockage investigation and escalation
- Investigated customer reports of being blocked from AWS web properties by correlating customer-provided logs with internal telemetry and confirming the IPs had been blocked due to cyberattack traffic from shared proxies.
- Escalated beyond standard support by collaborating with my manager and the sales team to engage the internal security team, ensuring the resolution path went through the right channel rather than bouncing between queues.
EKS cross-cluster autoscaling misconfiguration

A customer reported that when they terminated a node in Cluster B, a new node would unexpectedly launch in Cluster A. The obvious next step would have been to dig into the Cluster Autoscaler or Karpenter logs, but those were not available on the affected side, so I had to reason from adjacent signals instead.
- Checked CloudTrail logs for Auto Scaling events and confirmed the scale-out action was triggered by the Cluster Autoscaler rather than Karpenter, narrowing the search before inspecting any configuration.
- Inspected the IAM role used by the Cluster Autoscaler and noticed its OIDC trust relationship pointed at Cluster A, which immediately suggested a cross-cluster misconfiguration.
- Hypothesised that the customer had copied the autoscaler configuration from Cluster A to Cluster B without updating the IAM role and OIDC trust, so Cluster B's autoscaler was authorised to act on Cluster A's resources. Confirmed the root cause once we collected Cluster B's autoscaler configuration file.
- The cross-cluster scaling behaviour stopped once the customer scoped each autoscaler role to its own cluster. The case made a clean example of diagnosing an EKS issue from infrastructure and security-configuration signals rather than application logs.
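The trust-relationship check described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used on the case: the function names, the account ID, and the issuer URLs are all hypothetical, and the policy is assumed to follow the usual IAM-roles-for-service-accounts shape, where the federated principal is an OIDC provider ARN.

```python
from typing import Set


def trusted_oidc_issuers(trust_policy: dict) -> Set[str]:
    """Return the OIDC issuer paths that an IAM role trust policy federates with."""
    issuers = set()
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM also accepts a single statement object
        statements = [statements]
    for statement in statements:
        principal = statement.get("Principal")
        if statement.get("Effect") != "Allow" or not isinstance(principal, dict):
            continue
        federated = principal.get("Federated", [])
        if isinstance(federated, str):
            federated = [federated]
        for arn in federated:
            # Provider ARNs look like arn:aws:iam::<account>:oidc-provider/<issuer path>
            _, marker, issuer = arn.partition(":oidc-provider/")
            if marker:
                issuers.add(issuer)
    return issuers


def role_trusts_cluster(trust_policy: dict, cluster_issuer_url: str) -> bool:
    """True if the role can be assumed through the given cluster's OIDC issuer."""
    issuer = cluster_issuer_url.split("://", 1)[-1].rstrip("/")
    return issuer in trusted_oidc_issuers(trust_policy)


# Hypothetical issuer URLs and account ID, for illustration only.
CLUSTER_A = "https://oidc.eks.us-east-1.amazonaws.com/id/AAAA1111"
CLUSTER_B = "https://oidc.eks.us-east-1.amazonaws.com/id/BBBB2222"

autoscaler_role_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {
            "Federated": "arn:aws:iam::111122223333:oidc-provider/"
                         "oidc.eks.us-east-1.amazonaws.com/id/AAAA1111"
        },
        "Action": "sts:AssumeRoleWithWebIdentity",
    }],
}

print(role_trusts_cluster(autoscaler_role_policy, CLUSTER_A))  # True
print(role_trusts_cluster(autoscaler_role_policy, CLUSTER_B))  # False
```

Running a check like this against each cluster's autoscaler role makes a copied configuration visible quickly: a role attached to Cluster B that only trusts Cluster A's issuer is the signal worth following up.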
Terraform-based lab for customer-environment reproduction

Many networking and Kubernetes cases involve complex customer environments that cannot be fully observed from logs or configuration snippets alone. Troubleshooting directly from what customers can share is often slow, so I invested in a reusable reproduction environment.
- Built modular Terraform configurations that can quickly stand up typical customer architectures: VPCs, EKS clusters, load balancers, and common networking pieces such as NAT gateways and security-group layouts.
- When a new case arrives, recreate a similar environment in my own AWS account and test hypotheses directly instead of bouncing questions back to the customer.
- Reduced my average case resolution time by around 10% by turning "ask the customer to re-run something" into "reproduce it locally first."
Security and compliance automation
- Automated ISO 27001 security notification workflows using GuardDuty and EventBridge, so customers could detect and react to suspicious activity without relying on periodic manual reviews.
- Wired the notifications into existing customer alerting paths so the new signal improved their compliance posture without adding a new surface to maintain.
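As a rough illustration of the GuardDuty-to-EventBridge wiring, the sketch below builds an EventBridge event pattern that matches GuardDuty findings at or above a severity threshold. The helper name and the threshold of 7 are assumptions made for the example; the actual rules and thresholds used with customers are not stated here.

```python
import json


def guardduty_finding_pattern(min_severity: float = 7.0) -> dict:
    """Build an EventBridge event pattern for GuardDuty findings.

    min_severity is a hypothetical threshold; GuardDuty reports severity
    as a number in the finding's "detail" section.
    """
    return {
        "source": ["aws.guardduty"],
        "detail-type": ["GuardDuty Finding"],
        # EventBridge numeric matching: only findings at or above the threshold
        "detail": {"severity": [{"numeric": [">=", min_severity]}]},
    }


# The JSON string is what an EventBridge rule expects as its event pattern.
pattern_json = json.dumps(guardduty_finding_pattern(), indent=2)
print(pattern_json)
```

Keeping the pattern in a small function makes the threshold easy to review and adjust per customer, and the rule's target can then be whichever alerting path the customer already maintains.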
Aug 2022 - Jun 2024

KKCompany - Site Reliability Engineer

Operated distributed production systems on AWS (EC2, ECS, EKS) supporting 140k DAU / 900k MAU; participated in a 24/7 on-call rotation via PagerDuty and resolved critical production incidents to maintain a 99.9% availability SLA.
Service availability during high-traffic baseball events
- Identified database scaling speed as the underlying bottleneck behind service dips during traffic surges rather than treating the symptoms one incident at a time.
- Calculated required server resources and drafted pre-warm plans for upcoming games based on expected user load and known service capacity.
- Improved availability and resource efficiency by transitioning from Amazon RDS to Amazon ElastiCache for the hot read path, working with the back-end engineers on the change-over.
- Together, the pre-warm strategy and the DB scaling improvements lifted service availability and reduced operational cost by around 30% across the event window.
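The pre-warm calculation above can be illustrated with a small sizing helper. This is only a sketch under stated assumptions: the function name, the 20% headroom, and the traffic figures are hypothetical, and real planning would also account for warm-up time and per-service limits.

```python
def prewarm_instance_count(expected_peak_rps: int,
                           rps_per_instance: int,
                           headroom_pct: int = 20) -> int:
    """Instances to have running before the event starts.

    expected_peak_rps: forecast peak requests per second for the game
    rps_per_instance:  measured sustainable throughput of one instance
    headroom_pct:      extra capacity on top of the forecast (assumed value)
    """
    if expected_peak_rps < 0 or rps_per_instance <= 0 or headroom_pct < 0:
        raise ValueError("inputs must be non-negative and capacity must be positive")
    # Integer arithmetic keeps the ceiling exact (no float rounding surprises).
    needed = expected_peak_rps * (100 + headroom_pct)
    capacity = rps_per_instance * 100
    return -(-needed // capacity)  # ceiling division


# Hypothetical numbers: 10,000 rps forecast, 500 rps per instance, 20% headroom.
print(prewarm_instance_count(10_000, 500))  # 24
```

Rounding up rather than to the nearest instance is deliberate: under-provisioning during a surge costs more than one idle instance across the event window.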
CDN migration evaluation and execution
- Generated CDN reports and reviewed them with the Product Manager to align the migration with customer-facing requirements before any infrastructure moved.
- Collaborated with the new CDN provider's engineer to replicate routing rules from the old provider after the initial routing assessment, keeping request behaviour stable through the switch.
- Drafted the migration plan with effort estimates in consultation with the Product Manager so the schedule reflected engineering reality, not wishful thinking.
- Executed a zero-downtime CDN switch on ~11 TB/day of production traffic, and achieved roughly 10% cost optimisation at the new provider thanks to the re-assessed routing rules.
Automation-tool maintainability improvements
- Simplified the maintenance-mode infrastructure by replacing a Route 53 DNS switch and ALB-rule mechanism with a single WAF-based traffic-control design. This removed a class of DNS-propagation surprises from the runbook, cut maintenance-mode execution time by about 60%, and standardised the usage pattern across teams.
- Revamped the automation tool's Python script so new SREs could read it, extend it, and trust it under incident pressure.
- Migrated the fleet from AWS Classic Load Balancer to Application Load Balancer, unlocking target-group health checks, richer routing rules, and modern observability.
CI/CD modernisation from CloudFormation to Terraform

Our CI/CD pipelines ran in containers whose images were hosted in a private Docker registry that was no longer actively maintained. The images were also owned and published by an engineer who had already left the company, which made the pipeline runtime both unmaintainable and a long-term security risk.
- Rebuilt the CI/CD container runtime on Amazon ECR under company ownership so the pipeline images were versioned, maintainable, and fully controlled by the organisation.
- Replaced the existing CloudFormation-based deployment process with a Terraform-based one as the company was gradually migrating off CloudFormation, rather than letting SRE-owned repos drift further from the rest of the stack.
- Abstracted the deployment logic that most projects had duplicated into a shared Terraform module and reused it across SRE-owned repositories, which eliminated the copy-paste YAML, made future changes consistent, and reduced the maintenance cost of each pipeline.
Alarm architecture redesign using CloudWatch composite alarms

On-call engineers were being paged by P0 alerts that did not represent real system issues. One example was an alert that triggered whenever a single database instance reached 90% CPU. That condition usually came from uneven load and did not affect availability or user experience, but it still required a response within five minutes.
- Ran a design discussion within the SRE team and agreed on two rules for P0 alerts: they should reflect real user or system impact, and they should be actionable so the on-call engineer knows exactly what to do when paged.
- Rebuilt the database CPU alert using CloudWatch composite alarms so it only triggers when all database instances exceed the threshold at the same time, rather than firing on any single instance.
- Reduced alert triggers by around 30% across the service and, more importantly, made paged alerts trustworthy again, which improved on-call experience and operational efficiency.
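The "all instances at once" condition maps onto a CloudWatch composite alarm rule, which combines child alarms with boolean operators. The sketch below only builds the rule expression; the helper name and the child alarm names are hypothetical, and it assumes one per-instance CPU alarm already exists for each database instance.

```python
from typing import Iterable


def all_in_alarm_rule(child_alarm_names: Iterable[str]) -> str:
    """Build a composite alarm rule that fires only when every child alarm is in ALARM."""
    names = list(child_alarm_names)
    if not names:
        raise ValueError("at least one child alarm is required")
    # Composite alarm rules reference child alarms as ALARM("name"),
    # joined here with AND so that a single hot instance does not page anyone.
    return " AND ".join(f'ALARM("{name}")' for name in names)


# Hypothetical per-instance alarm names.
rule = all_in_alarm_rule(["db-1-cpu-high", "db-2-cpu-high", "db-3-cpu-high"])
print(rule)
# ALARM("db-1-cpu-high") AND ALARM("db-2-cpu-high") AND ALARM("db-3-cpu-high")
```

The resulting string is the kind of expression a composite alarm takes as its rule, for example through the `AlarmRule` parameter of boto3's `put_composite_alarm`. The per-instance alarms stay in place for dashboards, while only the composite one is routed to paging.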
ISO 27001 compliance support
- Built monitoring systems to detect abnormal logins across key systems, including the databases and the CMS, so compliance evidence came from real signals rather than periodic manual reviews.
- Helped devise a streamlined CloudTrail log-monitoring plan focused on the events that actually mattered for the audit scope, not the full log firehose.
P0 incident resolution
- Helped identify a customer-side misconfiguration as the root cause of a DNS routing issue during a P0, instead of letting the initial assumption ("it's our CDN") drive the remediation.
- Collaborated with the Product Manager to expedite the resolution path across teams and customer communication.
Feb 2022 - Jun 2022

AppWorks School - Back-End Trainee

Built a full-stack API documentation platform from scratch using React, Node.js, MongoDB, Docker, and AWS within five weeks.
End-to-end product delivery on a compressed timeline
- Shipped the API documentation platform (web, server, and database) in five weeks, teaching myself the stack and solving problems quickly under schedule pressure.
- Teamed up with Front-End and Android developers to build an admin back-office and coupon management system in a single week, integrating across three layers without a pre-agreed API contract.
- Built a URL-shortener system over three days in the system-design workshop, which taught me to estimate web traffic, choose cache policies, and split read and write database paths for scale.
Performance and reliability research
- Researched and ran experiments on Nginx as an HTTP load balancer over two days, converting "I read about it once" into something I could explain and defend.
- Explored cache penetration, cache stampede, and cache avalanche mitigations with Redis locks and Redis Cluster across a weekend, producing concrete examples rather than theory.
- Stood out in the Trouble-Shooting Lab on networking problems, where the core skill was matching symptoms to the relevant layer (DNS, TCP, HTTP) fast.
Technical writing and peer support
- Published two technical articles on Medium during the programme; both passed 100 claps and were referenced by fellow team members, which is where I learned that clear writing compounds.
- Shared English-writing tips for resumes with teammates when asked, which was appreciated by the cohort and reinforced the habit of treating help-asks as a two-way exchange.
Jul 2021 - Dec 2021

DIGI+ Talent Accelerator & Jumpstart Program - Software Engineer Intern

Designed and developed an end-to-end interactive game using Unity (C#) with Arduino hardware devices and MySQL backend services.

Talks & Awards

Aug 2024

Invited Talk on High-Traffic Reliability Lessons

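Invited by iThome to record a 30-minute webinar, "What to Do When Heavy Traffic Takes Your System Down: A Year of High-Traffic Operations Experience Condensed into 30 Minutes" (「系統被大流量衝垮了怎麼辦？」—用30分鐘濃縮1年的高流量維運經驗談), sharing roughly a year of lessons from handling a high-traffic incident through observation, monitoring, architecture adjustments, and operational response.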
2023

IT Ironman DevOps Top Honour

Published the series "A Glimpse into the Life of a Novice SRE" (一窺SRE初心者的生活：讓警報為您的人生畫下如交響樂一般的新篇章). The series received the IT Ironman DevOps Top Honour (IT 鐵人賽 DevOps 組冠軍) and was later adapted into a book.

Education

Sep 2019 - Jan 2022

National Taiwan University

M.A., Philosophy
Sep 2015 - Jan 2019

National Taiwan Normal University

BEd, Civic Education and Leadership
