Senior Engineer, Site Reliability
Journey with us! Combine your career goals and sense of adventure by joining our exciting team of employees. Royal Caribbean Group is pleased to offer a competitive compensation and benefits package and excellent career development opportunities, each offering unique ways to explore the world.
We are proud to be the vacation-industry leader with global brands — including Royal Caribbean International, Celebrity Cruises and Silversea Cruises — the most innovative fleet and private destinations, and the best people. Together, we are dedicated to turning the vacation of a lifetime into a lifetime of vacations for our guests.
The Royal Caribbean Group’s Site Reliability Team has an exciting career opportunity for a full-time Senior Engineer, Site Reliability, reporting to the Senior Manager.
This position is onsite and based in Miramar, Florida.
This position is also not eligible for work authorization sponsorship.
Position Summary:
We are seeking a highly skilled Senior Site Reliability Engineer to own, operate, and continuously mature our enterprise observability platform across one of the most complex hospitality and maritime technology environments in the world. This role is the engineering backbone of RCG’s observability practice — responsible for ensuring deep, reliable system visibility across 950+ applications serving 100,000+ users across Royal Caribbean International, Celebrity Cruises, and Silversea.
You will operate at the intersection of infrastructure, application performance, network intelligence, and AIOps — driving measurable improvements in mean-time-to-detect (MTTD), mean-time-to-resolve (MTTR), and overall service reliability. This is a platform engineering and standards leadership role, not a tool administration position.
Key Responsibilities:
Platform Ownership & Architecture
- Own and evolve the enterprise observability platform spanning Cisco AppDynamics, Splunk, ThousandEyes, and PagerDuty AIOps across AWS and Azure environments.
- Architect and enforce a unified telemetry strategy — metrics, logs, traces, and events — standardized via OpenTelemetry across all application tiers.
- Design and govern telemetry data pipelines including ingestion, filtering, routing, and retention to optimize signal quality and platform cost at enterprise scale.
- Drive full-stack observability coverage across ship and shore environments, including maritime network paths, contact center platforms, and revenue-critical booking systems.
SLIs, SLOs & Reliability Engineering
- Define and implement Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets for all critical services across RCG’s three brands.
- Build alerting frameworks that minimize noise, surface actionable signals, and integrate cleanly with PagerDuty AIOps on-call workflows.
- Partner with SRE teams to drive MTTR reduction, post-incident observability improvements, and proactive reliability practices.
- Instrument and publish DORA metrics (Deployment Frequency, Lead Time, Change Failure Rate, MTTR) to support engineering productivity and release confidence.
AIOps & Intelligent Detection
- Drive AI-assisted incident detection, anomaly correlation, and root cause analysis using PagerDuty AIOps and Splunk IT Service Intelligence (ITSI).
- Tune and mature ML-based alert grouping and noise suppression models to reduce alert fatigue and accelerate triage.
- Integrate observability signals with ServiceNow ITSM for automated incident creation, enrichment, and closed-loop resolution workflows.
Kubernetes & Cloud-Native Observability
- Enable and govern Kubernetes observability for EKS and AKS workloads — container health, resource utilization, pod-level tracing, and cluster performance.
- Integrate observability instrumentation into CI/CD pipelines (GitHub Actions) to enable deployment-correlated performance analysis.
- Maintain and extend AWS CloudWatch and Azure Monitor integrations to ensure cloud infrastructure is fully represented in the observability estate.
Standards, Enablement & Technical Leadership
- Define observability standards, instrumentation best practices, and onboarding frameworks for product and platform engineering teams.
- Mentor junior engineers and serve as the technical authority for observability discipline across SRE and Platform Engineering.
- Lead post-incident reviews (PIRs) and translate findings into observability platform improvements.
- Govern observability cost optimization: telemetry volume management, retention tiering, and platform licensing efficiency.
Required Qualifications
- 6–9+ years in Observability, SRE, or Platform Engineering in enterprise-scale environments.
- Deep hands-on expertise with Cisco AppDynamics — APM configuration, business transaction mapping, code-level diagnostics, and baseline management.
- Strong proficiency with Splunk — SPL query development, ITSI service health trees, KPI configuration, alert policy management, and log pipeline design.
- Experience with Cisco ThousandEyes for network path monitoring, ISP/WAN intelligence, and BGP-level visibility.
- Proficiency with PagerDuty AIOps — intelligent alert grouping, noise suppression, event orchestration, and on-call workflow design.
- Strong command of OpenTelemetry — collector configuration, SDK instrumentation, semantic conventions, and multi-backend exporting.
- Hands-on Kubernetes experience (EKS/AKS) — container observability, resource metrics, and pod-level distributed tracing.
- Experience with AWS CloudWatch and/or Azure Monitor for cloud infrastructure observability.
- Scripting and automation proficiency: Python, Bash, Terraform, and/or Ansible for observability tooling deployment and configuration.
- Experience defining SLIs/SLOs, error budgets, and actionable alerting strategies tied to business service reliability.
- ServiceNow ITSM integration experience — event management, incident auto-creation, and CMDB-enriched alerting.
- Experience with CI/CD observability integration (GitHub Actions or equivalent).
Preferred Qualifications
- Experience with Prometheus, Grafana, Loki, or Tempo for supplemental or hybrid observability architectures.
- Familiarity with eBPF-based observability tooling (e.g., Pixie, Cilium) for deep kernel-level and network-layer visibility.
- Experience with synthetic monitoring and real user monitoring (RUM) to capture end-user experience across digital channels.
- Familiarity with Cribl or equivalent telemetry pipeline tooling for data routing, enrichment, and cost governance.
- Exposure to DORA metrics instrumentation and developer experience observability frameworks.
- Experience in large-scale hospitality, travel, maritime, or consumer digital platforms.
- Certifications: Cisco AppDynamics Certified Associate, Splunk Core Certified Power User, AWS Solutions Architect, Kubernetes (CKA/CKAD), or OpenTelemetry Certified Associate (OTCA/CNCF).
Agency and Third-Party Submissions: Please note this is a direct search by the Company, and applications through agencies and other third parties will not be accepted, nor will fees be paid for unsolicited resumes. Any unsolicited resumes will be considered the Company's property.
We know there's a lot to consider. As you go through the application process, our recruiters will be glad to provide guidance and any further details you need to answer your questions. Thank you again for your interest in Royal Caribbean Group. We hope to see you onboard soon!
It is the policy of the Company to ensure equal employment and promotion opportunity to qualified candidates without discrimination or harassment on the basis of race, color, religion, sex, age, national origin, disability, sexual orientation, sexuality, gender identity or expression, marital status, or any other characteristic protected by law. Royal Caribbean Group and each of its subsidiaries prohibit and will not tolerate discrimination or harassment.
Nearest Major Market: Miami