Brillfy Technology Inc.
MLOps / AI Infrastructure Engineer
Job Title: MLOps / AI Infrastructure Engineer
Location: Remote (periodic travel to client sites, primarily government/on-prem environments)
Employment Type: Full-Time / Permanent
Work Hours: Standard business hours (flexible, async-first environment)
Job Overview & Description:
We are seeking a highly skilled MLOps / AI Infrastructure Engineer to design, deploy, and operate mission-critical AI infrastructure in on-premises and edge environments. This role focuses on building and maintaining GPU-powered compute clusters, Kubernetes-based orchestration systems, and scalable MLOps pipelines for AI training and inference workloads.
The ideal candidate will have deep expertise in GPU infrastructure (NVIDIA H200/A100), Kubernetes on bare metal, high-performance networking, and software-defined storage, along with hands-on experience deploying secure, compliant systems in air-gapped and government-regulated environments.
You will collaborate closely with AI/ML engineers, architects, and client technical teams to ensure high availability, performance, and compliance of AI platforms operating outside traditional cloud environments.
Roles & Responsibilities:
GPU Compute & Infrastructure
• Deploy, configure, and maintain GPU servers (NVIDIA H200, A100)
• Manage CUDA, drivers, firmware, NVLink/NVSwitch topology
• Implement NVIDIA tools (DCGM, MIG, NVIDIA Container Toolkit)
• Monitor hardware health, utilization, and performance (see the monitoring sketch after this list)
• Automate bare-metal provisioning (PXE/iPXE, MAAS, Foreman)
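To make the hardware-monitoring bullet above concrete, here is a minimal Python sketch that polls per-GPU utilization, memory, and temperature through NVML's Python bindings (nvidia-ml-py). The alert threshold is an illustrative assumption; fleet-scale monitoring in this role would typically run on DCGM instead.

    # Minimal GPU health poll via pynvml (pip install nvidia-ml-py).
    import pynvml

    TEMP_LIMIT_C = 85  # hypothetical alert threshold, not site policy

    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            name = name.decode() if isinstance(name, bytes) else name
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # % busy
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            flag = " OVER-TEMP" if temp >= TEMP_LIMIT_C else ""
            print(f"GPU{i} {name}: {util.gpu}% util, "
                  f"{mem.used // 2**20}/{mem.total // 2**20} MiB, {temp}C{flag}")
    finally:
        pynvml.nvmlShutdown()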
Kubernetes & Container Orchestration
• Build and manage Kubernetes clusters (kubeadm / Rancher RKE2)
• Configure GPU node pools using the NVIDIA GPU Operator (see the node-inventory sketch after this list)
• Implement CNI solutions (Calico, Cilium, SR-IOV)
• Manage ingress, load balancing (MetalLB), and service mesh (Istio/Linkerd)
• Enforce cluster security (RBAC, network policies, secrets management)
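As a small illustration of the GPU node-pool work above, the following sketch (assuming kubeconfig access and the official kubernetes Python client) lists how many nvidia.com/gpu resources each node advertises once the GPU Operator's device plugin is running; the nvidia.com/gpu.product label is populated by GPU Feature Discovery.

    # Sketch: inventory schedulable GPUs per node (pip install kubernetes).
    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()

    for node in v1.list_node().items:
        gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
        if gpus != "0":
            labels = node.metadata.labels or {}
            product = labels.get("nvidia.com/gpu.product", "unknown")
            print(f"{node.metadata.name}: {gpus} x {product}")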
MLOps & AI Workloads
• Deploy ML platforms (MLflow, Kubeflow; see the tracking sketch after this list)
• Manage model serving using NVIDIA Triton Inference Server
• Build GitOps-style CI/CD pipelines (Argo CD, Flux)
• Optimize GPU utilization for training and inference
• Manage model storage (Ceph, MinIO)
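The sketch below shows the flavor of the MLflow work in this block: logging a run to an on-prem tracking server backed by an S3-compatible artifact store (MinIO or Ceph RGW). The endpoints, credentials, and experiment name are placeholders, not real systems.

    # Sketch: log a run to an on-prem MLflow server (pip install mlflow).
    import os
    import mlflow

    # Point MLflow's artifact I/O at an S3-compatible store (hypothetical host).
    os.environ["MLFLOW_S3_ENDPOINT_URL"] = "https://minio.internal.example:9000"
    os.environ["AWS_ACCESS_KEY_ID"] = "CHANGE_ME"
    os.environ["AWS_SECRET_ACCESS_KEY"] = "CHANGE_ME"

    mlflow.set_tracking_uri("https://mlflow.internal.example")  # hypothetical
    mlflow.set_experiment("gpu-cluster-smoke-test")

    with mlflow.start_run():
        mlflow.log_param("nodes", 4)
        mlflow.log_metric("tokens_per_sec", 12345.6)
        mlflow.log_artifact("training.log")  # assumes this file exists locally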
Networking & Storage
• Design high-bandwidth networking (InfiniBand, RoCE v2, Ethernet)
• Configure RDMA for distributed AI workloads
• Deploy software-defined storage (Ceph, Rook, MinIO; see the object-storage sketch after this list)
• Implement VLANs, firewall policies, and secure connectivity (VPN)
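For the software-defined storage bullet, here is a short sketch that pushes a model artifact into an S3-compatible bucket with the minio Python client; the endpoint, credentials, bucket, and object names are all hypothetical.

    # Sketch: store a model artifact in MinIO/Ceph RGW (pip install minio).
    from minio import Minio

    client = Minio(
        "minio.internal.example:9000",  # hypothetical endpoint
        access_key="CHANGE_ME",
        secret_key="CHANGE_ME",
        secure=True,  # TLS, as expected on the secured networks this role manages
    )

    bucket = "model-registry"
    if not client.bucket_exists(bucket):
        client.make_bucket(bucket)

    client.fput_object(bucket, "llm/v1/model.safetensors", "model.safetensors")
    print(client.stat_object(bucket, "llm/v1/model.safetensors").size, "bytes stored")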
Security & Compliance
• Implement controls aligned with NIST SP 800-171 / CMMC
• Maintain OS hardening baselines (CIS benchmarks for RHEL/Rocky Linux and Ubuntu)
• Automate compliance checks (OpenSCAP; see the scan sketch after this list)
• Document infrastructure (SSP, diagrams, DR plans)
• Support audits and penetration testing remediation
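To illustrate the OpenSCAP automation above, this sketch drives an oscap CIS scan from Python and treats rule failures (exit code 2) as findings rather than crashes. The datastream path and profile ID follow scap-security-guide conventions but are assumptions to verify against the target OS release.

    # Sketch: run an OpenSCAP CIS scan and fail a pipeline on findings.
    import subprocess

    DATASTREAM = "/usr/share/xml/scap/ssg/content/ssg-rl9-ds.xml"  # assumed path
    PROFILE = "xccdf_org.ssgproject.content_profile_cis"           # assumed profile ID

    result = subprocess.run(
        ["oscap", "xccdf", "eval",
         "--profile", PROFILE,
         "--results", "scan-results.xml",
         "--report", "scan-report.html",
         DATASTREAM],
        check=False,  # oscap exits 2 when rules fail; a finding, not a crash
    )
    if result.returncode == 2:
        print("Compliance findings detected; see scan-report.html")
    elif result.returncode != 0:
        raise RuntimeError(f"oscap error (exit {result.returncode})")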
Required Qualifications:
• 6+ years of infrastructure engineering experience
• 3+ years managing GPU clusters or HPC environments
• Strong expertise in NVIDIA GPU stack (CUDA, DCGM, MIG, NVLink)
• Hands-on Kubernetes experience (bare-metal deployment and operations)
• Strong networking knowledge (BGP, VLANs, RDMA, load balancing)
• Experience with storage solutions (Ceph, MinIO, Rook)
• MLOps experience (MLflow, Kubeflow, Triton, GitOps pipelines)
• Knowledge of NIST SP 800-171 compliance
• Experience with Terraform or Ansible
• Strong Linux administration skills
• Excellent documentation and communication skills
Certifications (Preferred / Required):
• Certified Kubernetes Administrator (CKA) / Certified Kubernetes Security Specialist (CKS)
• NVIDIA Certifications (GPU / AI Infrastructure)
• RHCSA / RHCE (Red Hat Certified System Administrator/Engineer)
• Security or compliance certifications (preferred)