Embrace Software Inc. (embracesoftwareinc.com)

Azure CloudOps Engineer

Worldwide (Remote), Full-time, posted 1h ago

This is a remote position.

We are looking for a CloudOps Engineer to operate and continuously improve the reliability, security, scalability, observability, and cost efficiency of our Azure-hosted SaaS products. Our products are deployed across development, QA, staging, and production environments, with infrastructure managed through Terraform and CI/CD automated through GitHub Actions.

In this role, you will work closely with engineering teams to ensure our SaaS platforms and AI-enabled solutions are deployed consistently, monitored effectively, secured properly, and operated reliably in production.

Environment and Technology Context

  • Microsoft Azure-hosted SaaS products across dev, QA, staging, and production environments.
  • Terraform for infrastructure as code and repeatable environment provisioning.
  • GitHub Actions for application and infrastructure CI/CD workflows.
  • Azure services including Static Web Apps, Container Apps, PostgreSQL, Storage Accounts, SignalR, Service Bus, Azure AI Foundry, Speech-to-Text services, Azure Arc, and related services.
  • AI-enabled product capabilities including STT workloads, LLM integrations, AI service endpoints, quotas, usage monitoring, latency monitoring, and cost controls.

Key Responsibilities

Cloud Infrastructure Operations

  • Manage and support Azure cloud infrastructure across dev, QA, staging, and production environments.
  • Maintain operational health of Azure services including Static Web Apps, Container Apps, PostgreSQL, Storage Accounts, SignalR, Service Bus, Azure AI Foundry, Azure Arc, and related platform services.
  • Ensure cloud resources are provisioned, configured, monitored, maintained, and retired according to company standards.
  • Support environment setup for new products, customers, integrations, and internal initiatives.
  • Identify and resolve infrastructure issues affecting performance, reliability, availability, or security.

Terraform and Infrastructure as Code

  • Build, maintain, and improve Terraform modules and environment configurations.
  • Ensure infrastructure changes are version-controlled, peer-reviewed, tested, approved, and repeatable.
  • Manage Terraform state, workspaces, variables, secrets integration, and deployment workflows.
  • Detect and resolve configuration drift between Terraform and deployed Azure resources.
  • Standardize naming conventions, tagging, resource group structure, environment isolation, and module patterns.
  • Support scalable provisioning of new SaaS environments using reusable infrastructure templates.

GitHub Actions and CI/CD

  • Build, maintain, and troubleshoot GitHub Actions workflows for application and infrastructure deployments.
  • Support CI/CD pipelines for multiple SaaS products and environments.
  • Implement deployment promotion flows from development to QA to staging to production.
  • Add deployment safeguards such as environment protection rules, approvals, rollback procedures, validation checks, release gates, and audit trails.
  • Manage pipeline secrets, service principals, managed identities, and secure deployment credentials.
  • Improve build and deployment reliability, speed, traceability, and auditability.

AI Service Operations

  • Operate and monitor Azure AI services, including Azure AI Foundry and Speech-to-Text workloads.
  • Support production operations for LLM-based integrations and AI-enabled product features.
  • Monitor AI service availability, latency, quota usage, token consumption, API failures, throttling, and cost.
  • Help define operational standards for AI workloads, including access control, logging, alerting, failover, usage governance, and provider disruption handling.
  • Work with engineering teams to troubleshoot AI service issues, integration failures, degraded model responses, or provider-side service disruptions.
  • Support secure handling of AI-related secrets, endpoints, keys, managed identities, and private network access where applicable.

Monitoring, Alerting, and Observability

  • Implement and maintain monitoring using Azure Monitor, Log Analytics, Application Insights, and related tools.
  • Create dashboards for infrastructure, application, database, messaging, storage, AI service, and deployment health.
  • Configure alerts for availability, latency, errors, resource saturation, queue depth, failed jobs, failed deployments, database health, quota exhaustion, and cost anomalies.
  • Improve signal quality by reducing alert noise and ensuring alerts are actionable.
  • Partner with engineering teams to define service-level indicators, service-level objectives, and production health metrics.

Incident Response and Production Support

  • Participate in production incident response for cloud infrastructure, deployments, integrations, and platform services.
  • Triage and resolve issues across Azure services, CI/CD pipelines, Terraform, networking, databases, messaging, and AI integrations.
  • Create and maintain runbooks for common operational issues.
  • Support root cause analysis and post-incident reviews.
  • Implement preventive actions after incidents to improve system reliability.
  • Help define severity levels, escalation paths, response expectations, on-call processes, and production support procedures.

Security, Identity, and Access Management

  • Implement cloud security best practices across Azure environments.
  • Manage Azure RBAC, managed identities, service principals, Key Vault access, and least-privilege permissions.
  • Secure GitHub Actions workflows, deployment credentials, environment secrets, and production access.
  • Support secret rotation, certificate management, and secure configuration management.
  • Help enforce network security using private endpoints, firewalls, IP restrictions, and environment-specific access rules.
  • Support compliance readiness for audits, security reviews, and customer due diligence under SOC 2, ISO 27001, or similar frameworks.

Database, Storage, and Messaging Operations

  • Support operational management of Azure Database for PostgreSQL, including backups, restores, performance monitoring, connection limits, high availability, and capacity planning.
  • Monitor and maintain Azure Storage Accounts, lifecycle policies, access controls, backup strategy, and usage trends.
  • Support Azure Service Bus operations, including queue/topic monitoring, dead-letter handling, retry behavior, and throughput issues.
  • Support SignalR operational health, connection metrics, scaling behavior, and related production issues.

Cost Management and Optimization

  • Monitor Azure spend across products, environments, services, and customers where applicable.
  • Implement tagging standards to support cost allocation by product, environment, customer, or business unit.
  • Create cost dashboards, budget alerts, anomaly detection processes, and recurring cost reviews.
  • Identify underutilized resources and recommend right-sizing opportunities.
  • Review AI service costs, LLM usage, token consumption, STT usage, storage growth, database sizing, and environment costs.
  • Recommend savings plans, reservations, scaling rules, lifecycle policies, or shutdown schedules where appropriate.

Reliability, Backup, and Disaster Recovery

  • Define and maintain backup and recovery procedures for critical cloud services.
  • Test database restores and validate backup reliability.
  • Help define recovery time objectives and recovery point objectives for production systems.
  • Support disaster recovery planning for SaaS products and customer-facing services.
  • Improve resilience through scaling rules, failover patterns, health checks, synthetic monitoring, and production readiness reviews.

Documentation and Operational Standards

  • Create and maintain CloudOps documentation, runbooks, deployment guides, troubleshooting guides, and environment standards.
  • Define standards for resource naming, tagging, logging, alerting, access control, Terraform structure, GitHub Actions workflow patterns, and production changes.
  • Document operational procedures for cloud services, CI/CD workflows, AI services, and incident response.
  • Enable engineering teams with reusable patterns, templates, and self-service guidance.

Requirements

Required Qualifications

  • 7+ years of hands-on experience operating production workloads in Microsoft Azure.
  • Strong experience with Terraform and infrastructure as code.
  • Experience building and maintaining CI/CD pipelines using GitHub Actions.
  • Experience supporting containerized workloads, preferably Azure Container Apps or similar platforms.
  • Experience with Azure monitoring and observability tools such as Azure Monitor, Log Analytics, and Application Insights.
  • Experience with Azure Database for PostgreSQL or similar managed relational databases.
  • Strong understanding of Azure networking, DNS, identity, RBAC, managed identities, Key Vault, and security best practices.
  • Experience troubleshooting production incidents across infrastructure, application deployments, networking, and cloud services.
  • Comfortable writing scripts using Bash, PowerShell, Python, or similar tools.
  • Strong documentation, communication, and cross-functional collaboration skills.

Preferred Qualifications

  • Experience operating AI-enabled applications or Azure AI services.
  • Experience with Azure AI Foundry, Azure OpenAI, Speech-to-Text, or LLM-based integrations.
  • Experience monitoring AI service usage, quotas, latency, throttling, token consumption, and cost.
  • Experience with Azure Service Bus, SignalR, Storage Accounts, and Static Web Apps.
  • Experience with Azure Arc.
  • Experience supporting multi-product or multi-tenant SaaS platforms.
  • Experience with SOC 2, ISO 27001, or similar compliance frameworks.
  • Experience with FinOps, cloud cost governance, or Azure cost optimization.
  • Experience designing production support processes, incident response workflows, on-call rotations, and operational runbooks.

Benefits

  • Competitive salary commensurate with experience.
  • Opportunities for career advancement and professional development.
  • The opportunity to collaborate with a diverse, global team in a remote work setting.