AI Infrastructure Engineer

Ade
Daramola

I architect the foundational layers that make AI systems reliable at enterprise scale. Not prototypes — production systems running on AWS.

// 17+ years in production infrastructure · Cloud · Kubernetes · LLM systems · GitOps

See My Work GitHub ↗ Get in Touch
The Story

Infrastructure
first. Always.

I started in 2007 managing Oracle databases for defense clients — environments where a bad query plan or a failed backup had real, sometimes irreversible consequences. That's not where most cloud engineers start, and it shaped how I think about systems in ways that are hard to unlearn.

From there I spent close to a decade at Viper Technology building and operating enterprise AWS infrastructure for government and commercial workloads. VPCs, IAM, EC2 fleets, RDS clusters at scale. I wasn't reading about these things — I was the person on call when they broke.

At Zolon Tech I moved into senior infrastructure architecture. Multi-account AWS landing zones, EKS clusters, GitOps delivery pipelines, zero-downtime cloud migrations. The work became more complex and the stakes were higher, but the discipline was the same: build it so it doesn't break, and when it does, fix it faster than anyone notices.

The AI infrastructure work I do now isn't a career change — it's the same foundation applied to a harder problem. LLM systems have all the failure modes of distributed systems, plus a whole new class of problems that most engineers haven't seen yet. Production scars help.

17+
Years in Production
3
AI Infra Projects Shipped
4
Industry Certifications
AWS
Primary Cloud Platform

Featured Projects

Built to solve
problems that exist

Project 01
GitOps Sentinel
AIOps · Kubernetes Remediation
The real-world problem: In 2021, a misconfigured BGP announcement at Facebook took down Instagram, WhatsApp, and Facebook itself for six hours. In 2023, a bad Kubernetes config change at a fintech startup cascaded into a 14-hour incident. The pattern is the same every time — a config change hits production, Prometheus fires alerts at 3am, the on-call engineer is asleep, and by the time anyone is online the blast radius has grown. The question isn't whether this will happen to your cluster. It's whether your system can respond before a human has to.

GitOps Sentinel is an autonomous remediation platform for Kubernetes clusters. When Prometheus fires an alert, Sentinel doesn't just send a PagerDuty notification — it intercepts the anomaly, reasons through root cause using a multi-agent pipeline, and resolves the incident by committing a fix to Git. Argo CD picks it up, reviews it against policy, and applies it. The cluster heals itself through the same controlled write path a human engineer would use.

The design comes from a real operational insight: not all incidents are equal. A pod OOMKilled on a non-critical service at 3am doesn't need a human. A cascading failure across stateful workloads does. Sentinel gates every remediation on a confidence score — routine incidents resolve automatically in seconds, while ambiguous or high-risk scenarios escalate to on-call before anything touches the cluster. If a remediation fails, it triggers an automatic revert. The cluster is never worse off than it was before Sentinel ran.
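A minimal sketch of what such a confidence gate might look like — the names, thresholds, and blast-radius labels here are illustrative assumptions, not Sentinel's actual code:

```python
from dataclasses import dataclass

# Hypothetical confidence gate; threshold and field names are illustrative.
AUTO_REMEDIATE_THRESHOLD = 0.85

@dataclass
class Diagnosis:
    incident_id: str
    root_cause: str
    confidence: float   # 0.0-1.0, produced by the multi-agent pipeline
    blast_radius: str   # "pod" | "service" | "cluster"

def route_remediation(d: Diagnosis) -> str:
    """Decide whether a diagnosis is safe to remediate automatically."""
    # A cluster-wide blast radius always escalates, regardless of confidence.
    if d.blast_radius == "cluster":
        return "escalate"
    if d.confidence >= AUTO_REMEDIATE_THRESHOLD:
        return "auto-commit"   # commit the fix to Git; Argo CD applies it
    return "escalate"          # page on-call before anything touches the cluster

print(route_remediation(Diagnosis("inc-1", "oomkill", 0.93, "pod")))        # auto-commit
print(route_remediation(Diagnosis("inc-2", "cascading", 0.95, "cluster")))  # escalate
```

The point of the gate is that the write path never changes — only whether a human is in the loop before the commit lands.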

This is the kind of system that pays for itself the first time it silently fixes something at 4am that would have been a two-hour incident.

Python · AWS Lambda · Step Functions · EventBridge · Terraform · Argo CD · Kubernetes / EKS · Amazon Bedrock · Prometheus · Grafana · OPA Gatekeeper
Project 02
MedQuery
Agentic RAG · Medical AI
The real-world problem: Healthcare AI has a hallucination problem that other domains can absorb but medicine can't. A wrong drug interaction surfaced by a general-purpose LLM isn't a nuisance — it's a liability, and in a clinical setting, potentially a harm. Most "medical AI" products are thin wrappers around GPT-4 with no source attribution, no scope enforcement, and no mechanism to distinguish between a response grounded in PubMed data and one that's confidently fabricated. The FDA has started flagging this. Hospitals are pulling deployments. The infrastructure problem is real: how do you build a system that knows when it doesn't know?

MedQuery is a medical Q&A system built around the assumption that in a high-stakes domain, a wrong answer is worse than no answer. Before any query reaches a retrieval step, a safety guardrail classifies it — rejecting non-medical and high-risk queries before any LLM processing runs. This isn't a content filter bolted on at the end; it's the first thing that executes.

When a query passes the guardrail, it's routed across three source tiers — a structured PubMed/FDA medical corpus, FDA drug data, and live web search — with up to three relevance checks before a response is generated. The system knows whether it found enough grounding evidence before committing to an answer. Answers stream token-by-token with source-quality labels attached, so a clinician or patient always knows whether a response is backed by peer-reviewed data, an FDA record, or a web result. That distinction matters enormously in practice.
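The tiered retrieval loop can be sketched roughly like this — the tier names match the description above, but the function shape and relevance scoring are hypothetical stand-ins for the real components:

```python
# Illustrative sketch of tiered retrieval with grounding checks.
# SOURCE_TIERS mirrors the three tiers described above; the threshold
# and the search_tier callable are assumptions for this sketch.
SOURCE_TIERS = ["pubmed_corpus", "fda_drug_data", "web_search"]
RELEVANCE_THRESHOLD = 0.7

def retrieve_with_grounding(query: str, search_tier) -> dict:
    """Try each source tier in priority order; stop at the first tier
    whose evidence clears the relevance threshold (at most three checks)."""
    for tier in SOURCE_TIERS:
        docs, relevance = search_tier(tier, query)
        if relevance >= RELEVANCE_THRESHOLD:
            # The tier label travels with the answer, so the reader knows
            # whether it is backed by peer-reviewed data, an FDA record,
            # or a web result.
            return {"grounded": True, "tier": tier, "docs": docs}
    # No tier produced enough evidence: refuse rather than fabricate.
    return {"grounded": False, "tier": None, "docs": []}
```

The key design choice is the final branch: when no tier clears the threshold, the system returns "not grounded" instead of generating anyway.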

The infrastructure runs fully on AWS — ECS Fargate, RDS with pgvector, CloudFront — deployed via Terraform. No manual steps, no snowflake servers.

Python · LangGraph · FastAPI · PostgreSQL / pgvector · Terraform · ECS Fargate · RDS · CloudFront · OpenAI GPT · React · SSE · Docker · Alembic
Project 03
Multi-LLM Platform
LLM Infrastructure · Inference Gateway
The real-world problem: In early 2024, OpenAI experienced a series of degraded service windows. Startups that had built their entire product on a direct GPT-4 integration went dark with them — no failover, no fallback, nothing. Separately, teams that deployed Claude for one use case and GPT-3.5 for another were managing two different SDKs, two different rate limit strategies, two different observability pipelines. At scale, every LLM provider you add multiplies your operational surface area. The engineering overhead becomes its own problem, and most teams solve it wrong — by writing more glue code instead of building a proper abstraction layer.

The Multi-LLM Platform is an inference gateway that puts OpenAI, Anthropic, and AWS Bedrock behind a single unified API. Application code calls one endpoint and never knows — or needs to know — which provider handled the request. The gateway routes based on request complexity, cost threshold, and caller latency SLA. Simple requests go to cheaper models. Tight-SLA requests go to faster tiers. The routing logic is policy-driven, not hardcoded.
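A toy version of policy-driven routing might look like the following — the policy table, model names, and SLA cutoff are illustrative assumptions, not the platform's actual configuration:

```python
# Hypothetical routing policy: cheapest model that can handle the request,
# with a fast tier forced for tight latency SLAs. All values illustrative.
ROUTING_POLICY = [
    {"max_complexity": 0.3, "model": "openai:gpt-4o-mini"},     # cheap tier
    {"max_complexity": 0.7, "model": "anthropic:claude-haiku"},  # mid tier
    {"max_complexity": 1.0, "model": "bedrock:claude-sonnet"},   # full tier
]

def select_model(complexity: float, latency_sla_ms: int) -> str:
    """Pick the cheapest model whose tier covers the request's complexity;
    tight-SLA requests are pinned to the fastest (cheapest) tier."""
    if latency_sla_ms < 500:
        return ROUTING_POLICY[0]["model"]
    for rule in ROUTING_POLICY:
        if complexity <= rule["max_complexity"]:
            return rule["model"]
    return ROUTING_POLICY[-1]["model"]
```

Because the policy is a data structure rather than branching code, changing routing behavior is a config change, not a deploy.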

Provider health is monitored in real time. When a provider degrades, it's removed from the active pool immediately — not after a timeout, not after a human notices. A semantic cache layer sits in front of the routing logic and returns results for prompts that are similar enough to previous requests, which means a meaningful percentage of production traffic never hits an LLM at all. That translates directly to lower cost and lower latency.
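The semantic cache reduces to similarity search over prompt embeddings. A minimal in-memory sketch, assuming a cosine-similarity match against prior prompts (the threshold and the embedding source are stand-ins):

```python
import math

# Minimal semantic-cache sketch; in production this would sit on a vector
# store, and embeddings would come from an embedding model. Illustrative.
SIMILARITY_THRESHOLD = 0.95

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self):
        self._entries = []  # list of (embedding, cached_response)

    def get(self, embedding):
        """Return a cached response if a prior prompt is similar enough."""
        for cached_emb, response in self._entries:
            if cosine(embedding, cached_emb) >= SIMILARITY_THRESHOLD:
                return response  # cache hit: no LLM call needed
        return None

    def put(self, embedding, response):
        self._entries.append((embedding, response))
```

Every hit above the threshold is a request that never reaches a provider, which is where the cost and latency savings come from.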

Every request emits structured telemetry — tokens consumed, model selected, latency, cache hit rate, and cost by provider. If you're running an AI product without this level of visibility, you're operating blind. This platform is what it looks like to operate LLM infrastructure the same way you'd operate any other production service: with observability, redundancy, and cost accountability built in from day one.
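The shape of one such per-request record, serialized as a structured log line — field names here are assumptions based on the metrics listed above, not the platform's actual schema:

```python
import json
import time
from dataclasses import dataclass, asdict

# Illustrative per-request telemetry record; field names are assumptions.
@dataclass
class RequestTelemetry:
    provider: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    cache_hit: bool
    cost_usd: float

def emit(record: RequestTelemetry) -> str:
    """Serialize one request's telemetry as a structured JSON log line."""
    payload = asdict(record)
    payload["ts"] = time.time()  # emission timestamp
    return json.dumps(payload)
```

Structured records like this are what make per-provider cost and cache-hit-rate dashboards a query rather than a log-parsing project.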

Python · FastAPI · LangChain / LangGraph · Terraform · OpenAI · Anthropic · AWS Bedrock · Docker · Structured telemetry

Career Arc

From Oracle DBA
to AI infrastructure

Nov 2020 – Present
Lead Cloud Engineer
Zolon Tech Inc.

At Zolon I moved from building infrastructure into designing it. The shift mattered. My mandate wasn't to keep the lights on — it was to define the architecture that other people would build on top of. That meant designing a multi-account AWS landing zone from scratch: the VPC topology, Transit Gateway configuration, IAM role hierarchy, and Service Control Policies that would become the security and network model for production, staging, and development environments across the organization. Get this wrong and you're doing remediation work for years.

I led the containerization strategy — standing up EKS clusters, writing the Helm charts, implementing OPA Gatekeeper policies that enforced security posture at the admission controller level before workloads ever ran. I also inherited several legacy environments and ran zero-downtime cloud migrations using Terraform and CloudFormation, working directly with client stakeholders to sequence cutovers that couldn't afford a maintenance window.

On the delivery side, I standardized how the engineering teams shipped code — CI/CD pipelines via Jenkins and GitHub Actions, deployment patterns documented and enforced, release cycle time reduced across multiple workstreams. And beyond the technical work, I spent a meaningful chunk of this role translating between business requirements and cloud architecture: running design sessions, writing solution patterns, and making the decisions visible to non-technical stakeholders. That kind of work doesn't show up in a Terraform plan, but it's the difference between infrastructure that gets adopted and infrastructure that gets worked around.

Nov 2012 – Nov 2020
AWS Cloud Engineer
Viper Technology Services — Patuxent River, MD

Eight years is a long time to spend inside the same problem set, and I used it. At Viper I designed and operated enterprise AWS infrastructure for large-scale government and commercial clients — the kind of accounts where the blast radius of a mistake is measured in users and dollars, not test cases. VPCs, IAM architecture, EC2 fleets, RDS clusters, S3 at production scale. I wasn't experimenting in a sandbox; I was the person accountable for whether these systems stayed up.

This is where I built the automation foundation I still use today. Terraform for infrastructure-as-code, Ansible for configuration management, Jenkins for CI/CD, Python for everything else. By the time I left, those practices were battle-tested enough to carry directly into the AI infrastructure work I do now. That's what a decade of production work gives you: patterns that have actually been tested.

2007 – 2012
Oracle DBA / Systems Analyst
CACI International Inc. & Apptis, Inc.

This is where the discipline was formed. Managing Oracle environments for defense and enterprise clients means operating in contexts where data integrity is non-negotiable and downtime is not an acceptable outcome. Performance tuning, backup and recovery, complex analytical queries across relational systems serving real operational workflows. Starting here meant I never had the luxury of treating infrastructure as abstract — it was always connected to something that mattered. That stays with you.

Technical Skills

Tools earned
in production

AI & Agentic Systems
LangGraph · LangChain · RAG pipelines · Multi-agent orchestration · LLM routing · pgvector · Amazon Bedrock · OpenAI API
Cloud (AWS)
EKS · ECS Fargate · Lambda · Step Functions · EventBridge · RDS · CloudFront · API Gateway · IAM · Secrets Manager
Infrastructure & GitOps
Terraform · Kubernetes · Argo CD · Helm · OPA Gatekeeper · Ansible · Docker · Jenkins
Observability
Prometheus · Grafana · Alertmanager · AWS X-Ray · CloudWatch · Structured telemetry
Backend & Data
Python · FastAPI · PostgreSQL · Vector DBs · SSE streaming · REST APIs · SQL · Bash
Platform Engineering
Event-driven architecture · CI/CD · GitHub Actions · MLOps · Security-first IAM · Cost-aware architecture · TDD / pytest

Certifications

Validated
across domains

🟢
NVIDIA-Certified Professional — Agentic AI
NVIDIA · Advanced
🟢
NVIDIA-Certified Professional — Generative AI (LLMs)
NVIDIA · Advanced
🟠
AWS Certified Solutions Architect — Associate
Amazon Web Services
🟣
HashiCorp Certified: Terraform Associate
HashiCorp
Let's Talk

Looking for AI
infrastructure

I'm looking for AI Infrastructure Engineering roles where production reliability, cloud architecture, and LLM systems come together. If you're building something serious — not a prototype, an actual production system — and you need an engineer who has done this at scale, I'd like to hear about it.