Abhishek Soppanna.
I build the boring, invisible plumbing that keeps payments moving — multi-region Kubernetes, GPU ML infra, and FinOps tooling at Stripe.
Five years of DevOps / SRE across Stripe, Databricks, and Paytm — payments rails, multi-region Kubernetes, IaC, observability, FinOps. The site you're on is a working dashboard, not a brochure: scale the cluster, run a deploy, query the terminal. Click anything that looks clickable.
Watch $5 travel through 9 systems in ~140ms.
This is what payments infrastructure does in the moment between "Pay" and "Paid." Type any amount, hit trace. Hover any node for the why. Inject a failure to see how it survives.
Three dimensions. One impossible job.
Speed, reliability, cost. You don't get all three. Drag the dot. Watch what you'd give up.
What happens when one machine dies?
Click the button. Watch a node fail, traffic re-route, replica promote, alerts settle — under five seconds. No human in the loop.
Turn the knob. Watch it breathe.
One physical knob controls incoming traffic, from 100 RPS to 100K. Pods auto-scale. Latency creeps. Costs climb. At saturation, watch graceful degradation, not a collapse.
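For the curious, the scaling behaviour behind the knob can be sketched with the replica rule documented for the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current replicas × current metric / target metric). This is a rough illustration, not the code running on this page. The per-pod target of 250 RPS, the replica bounds, and the function names are all assumptions picked for the example, and the real autoscaler also applies a tolerance band and stabilization windows that are left out here.

```python
import math


def desired_replicas(current_replicas, current_rps_per_pod, target_rps_per_pod,
                     min_replicas=2, max_replicas=200):
    """HPA-style rule: ceil(current * current_metric / target_metric), then clamp.

    The bounds are hypothetical defaults chosen for this sketch.
    """
    ratio = current_rps_per_pod / target_rps_per_pod
    raw = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, raw))


def simulate(total_rps, replicas=2, target_rps_per_pod=250.0, **bounds):
    """Return (new_replica_count, saturated) for a given incoming load.

    'saturated' means the replica cap was hit and each pod is now
    carrying more than its target load, which is where latency creeps.
    """
    per_pod = total_rps / replicas
    new = desired_replicas(replicas, per_pod, target_rps_per_pod, **bounds)
    saturated = total_rps / new > target_rps_per_pod
    return new, saturated


if __name__ == "__main__":
    for rps in (100, 10_000, 100_000):
        count, saturated = simulate(rps)
        print(f"{rps:>7} RPS -> {count:>3} pods, saturated={saturated}")
```

With these assumed numbers, the low end of the knob sits at the replica floor, the middle scales linearly, and the top end hits the cap, which is the point where degradation has to be handled gracefully instead of by adding pods.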
Ask me anything in zsh.
Real command parser. Try whoami · stripe · kubectl get pods. Use ↑/↓ for history, tab to autocomplete.
help · whoami · skills · kubectl · deploy · contact · stripe · clear
Push to prod, watch it ship.
Click "Run pipeline" to simulate a real deploy: tests, build, security scan, staging, prod rollout. Same flow I build for engineering teams.
Where I've shipped.
Three companies, one throughline: making payments infrastructure boring. The numbers below are what I shipped, not what I aspire to.
Stripe.
SEP 2024 — PRESENT
- Designed and operate a multi-region AWS EKS platform with Helm, GitHub Actions, and Ansible — 99.9% uptime through 15% transaction volume growth, zero infrastructure-caused incidents.
- Architected AWS VPC networking for K8s workloads across multi-AZ environments — private subnets, ingress, security groups — cutting environment-related deployment issues by 20%.
- Integrated Trivy + Open Policy Agent into GitHub Actions, enforcing SOC 2 / PCI-DSS and blocking non-compliant deployments pre-prod.
- Partnered with ML and Quant teams on GPU infra (SageMaker + EKS) — reduced model inference latency by 20%.
- Built observability + FinOps dashboards in Prometheus / Grafana — cut MTTR by 25% and surfaced cost-optimization across the platform.
- Shipped Python predictive failure detection (scikit-learn / TensorFlow) on telemetry — cut incident response time by 30%.
- Tuned HA Postgres (RDS) + Redis (ElastiCache) with automated backups, perf tuning, and connection pooling for payment-system data integrity.
Databricks.
NOV 2023 — AUG 2024
- Engineered Azure IaC with Terraform + Bicep across AKS, App Services, and VNets — multi-team provisioning that went from days to minutes.
- Built scalable Azure DevOps YAML pipelines for microservices: blue/green rollouts, automated rollback on canary regression — release cycles up 15%.
- Containerized distributed apps on Docker + AKS for portability and environment consistency across staging and prod.
- Stood up Azure Monitor + Log Analytics for cross-service tracing and proactive alerting — cut MTTR by 20%.
Paytm.
JUN 2019 — JUL 2022
- Led SRE for the UPI payment gateway — peak ~30M+ txns/day. Deployed Prometheus + PagerDuty, drove SLO/SLI practices, cut incident resolution time by 15%.
- Built CI/CD with Jenkins, GitLab CI, Argo CD, Docker — daily deploys, release reliability up 25%.
- Designed automated disaster recovery with Terraform + CloudFormation — RTO 4hr → ~18min, validated quarterly.
- Implemented PCI-DSS-compliant IAM, AWS KMS, and HashiCorp Vault to secure cardholder data and harden cloud security posture.
What I'm shipping right now.
Live work at Stripe — payment-grade reliability, ML-driven ops, infrastructure that pays for itself.
Multi-region EKS Platform
Active-active K8s for payment workloads. Helm-based service rollouts, Ansible node config, GitHub Actions for promotion. Built to absorb regional failure without blinking.
GPU ML Infrastructure
SageMaker + EKS hybrid for fraud / risk model training and inference. Partnered with ML and Quant teams to size GPUs to actual workload shape, not vibes.
CI/CD Security Gates
Trivy image scanning + OPA policy-as-code + GitHub Actions. Catches CVEs and policy violations pre-merge — non-compliant deploys never reach prod.
Predictive Failure Detection
scikit-learn + TensorFlow models on Prometheus telemetry. Flags pre-incident anomalies — node thrash, latency drift, GC pressure — before they page on-call.
FinOps + Observability
Unified Prometheus / Grafana dashboards mapping cost and reliability to service / team / SLO. Engineers see what their deploys cost — that changes behavior fast.
HA Postgres + Redis
Tuning RDS Postgres + ElastiCache Redis for payment workloads. Automated backups, replica routing, connection pooling — the unsexy work that keeps p99 honest.
Got a gnarly infra problem?
Let's talk.
Best for: payments-grade SRE, multi-cloud platform engineering, FinOps audits, ML-ops infrastructure, and the occasional "why is our cluster on fire" consult.