Site Reliability Engineer - VentureDive

Job Brief:
We are hiring a hands-on Senior Site Reliability Engineer to own the reliability, observability, cost, and security of a live mobility platform operating at real-time scale. This role is not advisory and not support-only. You will own production — including application, infrastructure, pipelines, and signals.
This role also requires meaningful AI fluency. We expect you to actively leverage AI-powered observability, AIOps tooling, and intelligent automation to reduce toil, detect incidents earlier, and operate at a higher level of confidence than manual methods alone can deliver. If you are curious about how AI is reshaping SRE practice and ready to help define that at VentureDive, this role is for you.

VentureDive Overview:
Founded in 2012 by veteran technology entrepreneurs from MIT and Stanford, VentureDive is the fastest-growing technology company in the region that develops and invests in products and solutions that simplify and improve the lives of people worldwide. We aspire to create a technology organization and an entrepreneurial ecosystem in the region that is recognized as second to none in the world.

Key Responsibilities

1. Production Reliability & Availability

Own uptime, latency, error rates, and system stability across the full production stack.
Design and enforce SLOs, SLIs, and error budgets with clear ownership and accountability.
Ensure zero- or near-zero-downtime deployments across all services.
Lead incident response, mitigation, blameless postmortems, and follow-through on action items.
Expectation: Fear of deploying or operating prod is unacceptable.

2. Observability & Early Warning Systems

Build and maintain a comprehensive observability stack: metrics, logs, distributed traces, and actionable alerts.
Detect abnormal traffic patterns, latency spikes, and network anomalies before they escalate to incidents.
Identify trends and leading indicators before incidents happen.
Ruthlessly eliminate noisy alerts — every signal must be actionable.
Expectation: The system must warn us before users do.

3. Infrastructure Ownership (AWS)

Own AWS infrastructure end-to-end — compute, networking, storage, and IAM.
Enforce Infrastructure as Code (Terraform or equivalent) as the sole mechanism for production changes.
Ensure scalability for ride/reservation spikes, peak hours, and variable geo traffic patterns.
Improve network ingress/egress visibility and performance.
Expectation: No manual, undocumented production infrastructure.

4. CI/CD Performance & Delivery

Own CI/CD pipelines during and after migration to Bitbucket Pipelines.
Reduce pipeline execution time through build caching, parallelism, and test efficiency improvements.
Ensure safe, repeatable, and fast deployments across all environments.
Expectation: CI/CD is a productivity engine, not a bottleneck.

5. Cost Ownership & FinOps

Monitor and optimize AWS and tooling costs continuously.
Detect abnormal cost increases and surface them proactively to engineering leadership.
Right-size resources without impacting reliability or performance.
Provide cost visibility and trend reporting to engineering leadership.
Expectation: Cost is a reliability signal, not an afterthought.

6. Security, Auditing & Incident Forensics

Enforce secure configurations, secrets management, and least-privilege access across all systems.
Detect and investigate suspicious access, traffic, or behavioral anomalies.
Make security investigations accessible and efficient for Dev and QA teams.
Improve auditability and traceability across all systems and environments.
Expectation: Security incidents should be detectable, explainable, and recoverable.

7. Migration Awareness & System Change Detection

Actively track behavior changes during platform migrations and architecture transitions.
Detect regressions caused by architecture, dependency, or traffic shifts.
Validate performance and reliability during all significant system transitions.
Expectation: Changes in system behavior should never go unnoticed.

8. AI-Augmented SRE Practice

Deploy and operationalize AIOps platforms (e.g., Dynatrace AI, Datadog Watchdog, New Relic AI, or equivalent) for automated anomaly detection, intelligent baselining, and ML-driven root-cause analysis — reducing mean time to detection (MTTD) and mean time to resolution (MTTR).
Use AI coding assistants (GitHub Copilot, Claude, or equivalent) to accelerate authoring and review of Terraform, Bash, Python, and CI/CD pipeline configurations — with mandatory human validation before any production application.
Apply AI-assisted security tools to continuously surface IAM misconfigurations, CVE exposure, and compliance drift across AWS environments.
Use AI-driven log intelligence to accelerate root-cause analysis across high-volume log streams during incidents and migrations.
Apply AI-powered FinOps tooling (e.g., AWS Cost Anomaly Detection) to detect cost spikes, automate right-sizing recommendations, and surface waste proactively.
Critically evaluate all AI-generated infrastructure code, configurations, and alert logic for correctness, security implications, and production-readiness — AI recommendations are inputs, not approvals.

Required Experience

Core SRE Skills

5+ years in SRE, DevOps, or Platform Engineering roles operating production systems at scale.
Strong AWS experience across security, networking, IAM, and compute.
Strong observability background: metrics, logs, distributed traces, and alert design.
Hands-on experience with New Relic and the Elastic stack.
Experience with AWS CloudWatch and CloudTrail for monitoring and audit.
PostgreSQL performance tuning and reliability engineering experience.
CI/CD pipeline optimization experience — build time, caching, test parallelism.
Incident response and postmortem leadership experience.
Demonstrated use of AIOps or AI-enhanced observability platforms in a production SRE or DevOps context.
Ability to configure and operationalize ML-based anomaly detection or intelligent alerting within an existing observability stack.
Experience using AI coding assistants to author, review, or refactor infrastructure-as-code or pipeline configurations.
Strong critical judgment in validating AI-generated infrastructure code and alert thresholds before production deployment.
Familiarity with AI-assisted FinOps or cost anomaly detection tooling is a strong plus.
Awareness of AI-driven cloud security posture management (CSPM) tools and their integration into AWS and other cloud-based environments.

Strongly Preferred

Experience with real-time or high-traffic platforms — mobility domain experience is a strong plus.
Hands-on experience with build systems including compilation, dependency resolution, and artifact generation across multiple technology stacks.
Hands-on experience with Ruby on Rails and/or NestJS application performance and reliability.
Experience with application profiling and performance monitoring — identifying, analyzing, and resolving production bottlenecks.
Infrastructure migration experience at production scale.
Demonstrated FinOps and cost optimization track record.
Security incident investigation and forensics experience.

Soft Skills & Professional Attributes

Ownership mindset — production is yours, not someone else's problem.
Clear communicator who can translate complex reliability and infrastructure topics for engineering leadership and non-technical stakeholders.
Analytical and structured under pressure — calm, methodical, and decisive during incidents.
Proactive and self-driven — surfaces risks and improvements without being asked.
Collaborative across Dev, QA, Security, and Product — SRE is a team sport.
Growth mindset with genuine curiosity about AI-driven reliability engineering and emerging SRE tooling.

What we look for beyond required skills
In order to thrive at VentureDive, you
…are intellectually smart and curious
…have passion for and take pride in your work
…deeply believe in VentureDive’s mission, vision, and values
…have a no-frills attitude
…are a collaborative team player
…are ethical and honest

Are you ready to put your ideas into products and solutions that will be used by millions?
You will find VentureDive to be a fast-paced, high-standards, fun, and rewarding place to work. Not only will your work reach millions of users worldwide, you will also be rewarded with competitive salaries and benefits. If you think you have what it takes to be a VenDian, come join us ... we're having a ball!
#LI-Hybrid
