Brillfy Technology Inc.

MLOps / AI Infrastructure Engineer

Irving, TX • Posted 1 week ago
Remote • Full-Time • Level: General
Job Title: MLOps / AI Infrastructure Engineer
Location: Remote (periodic travel to client sites, primarily government/on-prem environments)
Duration: Full-Time / Permanent
Work Hours: Standard business hours (flexible, async-first environment)

Job Overview & Description:
We are seeking a highly skilled MLOps / AI Infrastructure Engineer to design, deploy, and operate mission-critical AI infrastructure in on-premises and edge environments. This role focuses on building and maintaining GPU-powered compute clusters, Kubernetes-based orchestration systems, and scalable MLOps pipelines for AI training and inference workloads. The ideal candidate will have deep expertise in GPU infrastructure (NVIDIA H200/A100), Kubernetes on bare metal, high-performance networking, and software-defined storage, along with hands-on experience deploying secure, compliant systems in air-gapped and government-regulated environments. You will collaborate closely with AI/ML engineers, architects, and client technical teams to ensure high availability, performance, and compliance of AI platforms operating outside traditional cloud environments.
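To give candidates a concrete flavor of the GPU-on-Kubernetes stack described above, here is a minimal sketch of a pod spec that requests a GPU through the NVIDIA device plugin (installed by the NVIDIA GPU Operator mentioned below). The pod name and image tag are illustrative placeholders, not details from this posting:

```yaml
# Illustrative only: schedule one NVIDIA GPU to a container.
# Assumes the NVIDIA GPU Operator (or device plugin) is installed,
# which advertises the extended resource "nvidia.com/gpu".
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test            # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1        # GPUs are requested via limits; whole GPUs only, unless MIG partitions are exposed
```

Day-to-day work in this role includes authoring and reviewing manifests like this, alongside the GPU Operator, MIG, and DCGM configuration that makes them schedulable.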
Roles & Responsibilities:

GPU Compute & Infrastructure
• Deploy, configure, and maintain GPU servers (NVIDIA H200, A100)
• Manage CUDA, drivers, firmware, and NVLink/NVSwitch topology
• Implement NVIDIA tooling (DCGM, MIG, NVIDIA Container Toolkit)
• Monitor hardware health, utilization, and performance
• Automate bare-metal provisioning (PXE/iPXE, MAAS, Foreman)

Kubernetes & Container Orchestration
• Build and manage Kubernetes clusters (kubeadm / Rancher RKE2)
• Configure GPU node pools using the NVIDIA GPU Operator
• Implement CNI solutions (Calico, Cilium, SR-IOV)
• Manage ingress, load balancing (MetalLB), and service mesh (Istio/Linkerd)
• Enforce cluster security (RBAC, network policies, secrets management)

MLOps & AI Workloads
• Deploy ML platforms (MLflow, Kubeflow)
• Manage model serving with NVIDIA Triton Inference Server
• Build CI/CD pipelines (ArgoCD, Flux; GitOps approach)
• Optimize GPU utilization for training and inference
• Manage model storage (Ceph, MinIO)

Networking & Storage
• Design high-bandwidth networking (InfiniBand, RoCE v2, Ethernet)
• Configure RDMA for distributed AI workloads
• Deploy software-defined storage (Ceph, Rook, MinIO)
• Implement VLANs, firewall policies, and secure connectivity (VPN)

Security & Compliance
• Implement controls aligned with NIST SP 800-171 / CMMC
• Maintain OS hardening (RHEL/Rocky Linux, Ubuntu; CIS benchmarks)
• Automate compliance checks (OpenSCAP)
• Document infrastructure (SSP, diagrams, DR plans)
• Support audits and penetration-testing remediation

Required Qualifications:
• 6+ years of infrastructure engineering experience
• 3+ years managing GPU clusters or HPC environments
• Strong expertise in the NVIDIA GPU stack (CUDA, DCGM, MIG, NVLink)
• Hands-on Kubernetes experience (bare-metal deployment and operations)
• Strong networking knowledge (BGP, VLANs, RDMA, load balancing)
• Experience with storage solutions (Ceph, MinIO, Rook)
• MLOps experience (MLflow, Kubeflow, Triton, GitOps pipelines)
• Knowledge of NIST SP 800-171 compliance
• Experience with Terraform or Ansible
• Strong Linux administration skills
• Excellent documentation and communication skills

Certifications (Preferred / Required):
• Certified Kubernetes Administrator (CKA) / Certified Kubernetes Security Specialist (CKS)
• NVIDIA certifications (GPU / AI Infrastructure)
• RHCSA / RHCE (Red Hat Certified System Administrator/Engineer)
• Security or compliance certifications (preferred)
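The GitOps-style CI/CD work listed above (ArgoCD / Flux) centers on declarative manifests such as the following Argo CD Application sketch, which syncs Kubernetes manifests for an inference service from Git into the cluster. The repository URL, paths, and names are placeholders for illustration, not details from this posting:

```yaml
# Illustrative Argo CD Application; all names and URLs are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: triton-inference            # hypothetical application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/ml-platform/deploy.git  # placeholder repo
    targetRevision: main
    path: triton/overlays/prod      # placeholder path
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-serving
  syncPolicy:
    automated:
      prune: true                   # remove resources deleted from Git
      selfHeal: true                # revert out-of-band cluster changes
```

In an air-gapped environment, the same pattern applies against an internal Git mirror and registry rather than public endpoints.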