
Director, Digital Reliability Engineering

 

Journey with us! Combine your career goals and sense of adventure by joining our exciting team of employees. Royal Caribbean Group is pleased to offer a competitive compensation and benefits package and excellent career development opportunities, each offering unique ways to explore the world.

 

The Royal Caribbean Group’s Digital Team has an exciting career opportunity for a full-time Director, Digital Reliability Engineering reporting to the VP of Engineering.

 

The position is onsite and based in Miami, Florida.

 

Position Summary:

 

The Director, Digital Reliability Engineering will lead the global Technology Operations portfolio for Royal Caribbean’s Digital organization, ensuring the reliability, availability, and performance of guest-facing pre-cruise platforms across web and mobile.

 

This leader is responsible for both Site Reliability Engineering (SRE) practices and run-the-business engineering support. Beyond incident response, the Director is accountable for managing and delivering on the resolution of all production issues, executing ongoing maintenance activities, and coordinating technical communications. This role also manages a dedicated engineering development capacity focused on production fixes, ongoing maintenance, and technical debt reduction. This ensures that stability improvements are not only identified but also delivered. This person is expected to walk the talk: to jump in during incidents, work side by side with engineers, and demonstrate technical depth when guiding solutions.

This is a hands-on role where the leader is expected to actively support teams during critical incidents, work directly with engineers to troubleshoot, and ensure sustained improvements in reliability.

 

This role also carries executive accountability for critical incidents. The Director must be prepared to provide leadership and direct support during major incidents at any time, ensuring the organization responds with speed, clarity, and effectiveness.

 

 

Essential Duties and Responsibilities:

 

  • Strategic Leadership
    • Define and execute the global SRE strategy for Digital Operations, aligning with business priorities and Royal Caribbean’s long-term technology vision.
    • Build and nurture a culture of reliability, resilience, and continuous improvement across all digital platforms.
    • Drive initiatives to maintain zero downtime by rapidly addressing issues, conducting root cause analysis, and implementing remediations.
    • Build strong relationships with product management, engineering, design, and operations stakeholders.
    • Own and drive operational metrics (e.g., MTTx metrics, incident rates, error budgets, service availability) with visible progress and accountability.

 

  • Hands-On Operational Engagement
    • Lead global site reliability and operations teams across onshore, nearshore, and offshore locations while actively engaging in day-to-day challenges.
    • Actively participate in major incident response, including log analysis, recovery validation, and executive updates.
    • Lead problem bridges, collaborating across technical and functional teams for timely issue resolution.
    • Partner with engineers to diagnose, troubleshoot, and resolve critical issues in real time, demonstrating technical credibility.
    • Strengthen ITSM processes (Incident, Problem, Change, Major Incident) using tools like ServiceNow, PagerDuty, and JIRA.

 

  • Run-the-Business
    • Lead engineering support for production issue remediation, ensuring timely root-cause analysis, resolution, and prevention of recurring problems.
    • Lead a dedicated production engineering team responsible for developing and deploying fixes, patches, and enhancements that improve reliability and guest experience.
    • Ensure development workstreams include not only feature delivery but also operational hardening, technical debt remediation, and defect resolution.
    • Manage and prioritize ongoing maintenance activities, patches, upgrades, and operational improvements across the digital technology stack.
    • Establish strong feedback loops with product and engineering teams so that recurring issues and operational pain points are systematically eliminated.

 

  • Technology & Engineering
    • Work directly with teams to ensure the reliability of a hybrid technology stack spanning:
      • Mobile: Native iOS, Android, and cross-platform frameworks.
      • Web: React, Angular, and modern web technologies.
      • Backend Services: Microservices, APIs, and integration layers.
      • Commerce: SAP Hybris platform.
      • Cloud Infrastructure: AWS (EC2, ECS, S3, API Gateway), DKP/on-prem clusters, and observability pipelines.
    • Champion observability and performance practices leveraging platforms such as Splunk, Dynatrace, Prometheus, and Quantum Metric / RUM tools.
    • Promote automation, chaos engineering, and AI-driven anomaly detection to strengthen system resilience.
    • Guide teams in Infrastructure as Code and modern operational tooling.
    • Environment Management: Oversee all environment activities, including new code deployments.

 

  • Team Development & Leadership by Example
    • Recruit, mentor, and develop global SRE talent while modeling hands-on technical engagement.
    • Encourage engineers to take ownership and proactively solve problems, supported by your direct involvement when needed.
    • Manage vendor and partner teams with the same “roll-up-your-sleeves” approach as internal teams.
    • Deliver executive-ready dashboards and insights to communicate the health of digital operations.

 

 

Qualifications:

 

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or related field.
  • 15+ years of experience in technology operations, including 8+ years in global leadership roles.
  • Engineering Management: Experience leading software engineering teams delivering production fixes and technical debt remediation, not only operational monitoring.
  • Proven track record supporting and stabilizing large-scale digital/commerce platforms with high transaction volumes and direct customer impact.
  • Experience managing fast-paced 24x7 environments, demonstrating adaptability and confident decision-making.
  • Strong technical background in cloud platforms (AWS, hybrid/on-prem clusters), container orchestration (Docker, Kubernetes, DKP), and microservices.
  • Deep understanding of SOA principles and Web Services.
  • Proficiency in scripting: Bash, Python, JavaScript.
  • Experience running and scaling commerce platforms (preferably SAP Hybris or equivalent).
  • Advanced knowledge of observability, performance engineering, telemetry, automation, and incident management frameworks.
  • Ability to personally dive into logs, code, and dashboards during critical incidents.
  • Strong troubleshooting, root-cause analysis, and application design skills.
  • Demonstrated ability to lead through crisis situations with composure, speed, and clear communication.

 

Knowledge and Skills:

 

  • Technical Depth & Breadth: Mobile, web, backend, and commerce systems at enterprise scale.
  • Leadership by Example: Hands-on, willing to engage directly with engineers in solving problems.
  • Strategic Thinking: Ability to drive long-term improvements while ensuring short-term incident readiness.
  • Maintenance & Communication: Experience managing ongoing maintenance programs and crafting technical communications.
  • Engineering Collaboration: Skilled at bridging operations and engineering to ensure production issues are treated as high-priority deliverables.
  • Communication: Executive presence with the ability to brief leadership clearly during outages.
  • Global Experience: Skilled at leading distributed teams and managing vendor partnerships.
  • Resiliency Mindset: Comfortable with 24/7 operational accountability, especially during major incidents.

 

Financial Responsibilities:

 

  • Own and manage the Operational Expenditure (OPEX) budget for Digital Operations, ensuring efficient allocation of resources while balancing reliability, scalability, and cost optimization.
  • Provide transparency into operational spend through regular reporting and executive updates.
  • Partner with Finance and Procurement to negotiate, track, and optimize vendor contracts and third-party services.
  • Ensure budget discipline while identifying opportunities for automation and efficiency improvements to reduce operational costs without compromising reliability.

 

Working Conditions:

 

  • Global role requiring flexible availability to lead and engage directly in critical incidents outside of standard business hours.
  • Domestic and international travel may be required to support operations and vendor partners.

 

 

We know there's a lot to consider. As you go through the application process, our recruiters will be glad to provide guidance and any additional details to answer your questions. Thank you again for your interest in Royal Caribbean Group. We hope to see you onboard soon!

 

It is the policy of the Company to ensure equal employment and promotion opportunity to qualified candidates without discrimination or harassment on the basis of race, color, religion, sex, age, national origin, disability, sexual orientation, sexuality, gender identity or expression, marital status, or any other characteristic protected by law. Royal Caribbean Group and each of its subsidiaries prohibit and will not tolerate discrimination or harassment.

 



Nearest Major Market: Miami
