Infrastructure 5 min read

Maintaining 99.9% Network Uptime at 11,500 Feet: Our Full Operational Playbook

Author: Engineering Team

Published: 2026-05-18

This document is our operational playbook. We publish it openly because the engineers we host make their decision to join based on infrastructure, not marketing.

Claimed uptime: 99.9% network availability during retreat months (July–August). Measured uptime (July–August 2025): 99.94%, with 26 minutes of total downtime across 31 days, all during two planned ISP maintenance windows announced 24 hours in advance.

Here is exactly how we achieve this.

Defining "Uptime"

We define network uptime as: any resident device connected to our network can reach at least one of (8.8.8.8, 1.1.1.1) within 4 seconds.

This is deliberately strict. A brief ISP switchover that self-resolves within 4 seconds does not count as downtime. Any period longer than 4 seconds in which neither address is reachable does count.

We measure this with a custom bash ping monitor running on a Raspberry Pi 4 connected independently to the LAN, writing results to a local InfluxDB instance and dashboarding in Grafana.
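The check behind that definition can be sketched in a few lines of Python. This is an illustration of the logic only, not the bash monitor itself: the function names are hypothetical, and the TCP connection to port 53 is an assumed stand-in for an ICMP ping (which needs elevated privileges in Python). Both targets share a single 4-second budget, so a slow first target cannot stretch the check past the window.

```python
import socket
import time
from typing import Callable, Iterable

TARGETS = ("8.8.8.8", "1.1.1.1")  # the two addresses named in the definition
DEADLINE_S = 4.0                   # the 4-second window from the definition


def tcp_probe(host: str, timeout: float) -> bool:
    """Assumed stand-in for ping: attempt a TCP connection to port 53."""
    try:
        with socket.create_connection((host, 53), timeout=timeout):
            return True
    except OSError:
        return False


def network_is_up(
    probe: Callable[[str, float], bool] = tcp_probe,
    targets: Iterable[str] = TARGETS,
    deadline_s: float = DEADLINE_S,
) -> bool:
    """Return True if any target answers before the shared deadline expires."""
    start = time.monotonic()
    for host in targets:
        remaining = deadline_s - (time.monotonic() - start)
        if remaining <= 0:
            break  # budget spent: this sample counts as down
        if probe(host, remaining):
            return True
    return False
```

Passing the probe in as a parameter keeps the deadline logic testable without touching the network; a monitor loop would call `network_is_up()` on a fixed schedule and record each result.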

The Five Failure Categories and Our Mitigation

1. ISP Fiber Cut

Probability: ~2 events per month (June–August), typically resolved in 2–6 hours.

Mitigation: Automatic WAN failover to secondary ISP via pfSense load balancer. Failover trigger: 3 consecutive failed pings to 8.8.8.8 with 500ms intervals. Switchover time: < 4 seconds.
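The trigger rule is a consecutive-failure counter, which can be modeled as below. This is a sketch of the rule as described, not of pfSense's internal gateway monitoring; the class name `FailoverTrigger` is hypothetical.

```python
class FailoverTrigger:
    """Fires once `threshold` ping failures have occurred in a row."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, ping_ok: bool) -> bool:
        """Record one ping result; return True when failover should fire."""
        if ping_ok:
            # Any successful ping resets the streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

Resetting on every success is what separates a real outage from scattered packet loss: two failures, one success, then two more failures never reach the threshold.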

Resident experience: Possible 1–4 second reconnection lag. Sessions on resilient clients (Mosh, iTerm2 with reconnect) survive. Standard SSH may require a reconnect.

2. Grid Power Outage

Probability: 4–8 planned hours per month, occasional unplanned events.

Mitigation: The full compound runs on a 10 kVA double-conversion UPS (zero-millisecond transfer time). The generator auto-starts after 8 hours and powers the compound for as long as the fuel reserve lasts.

Resident experience: None. With a double-conversion UPS, equipment is always powered by the inverter; the grid feeds the rectifier and battery charger, not the equipment directly. Grid drops are invisible.

3. ISP-Side Network Degradation (not full outage)

Probability: 3–5 events per month. Packet loss > 5%, latency spikes > 200ms.

Mitigation: pfSense configured with gateway quality monitoring. If packet loss exceeds 10% or latency exceeds 300ms on the primary WAN, traffic routes to secondary ISP even without a full outage.
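The rerouting decision reduces to two threshold comparisons over a window of recent probes. The sketch below assumes latency is averaged over the probes that were answered and that the window size is chosen by the caller; neither detail is stated above, and the function name is hypothetical.

```python
from statistics import mean
from typing import Optional, Sequence

LOSS_LIMIT_PCT = 10.0     # packet-loss threshold from the text
LATENCY_LIMIT_MS = 300.0  # latency threshold from the text


def should_reroute(rtts_ms: Sequence[Optional[float]]) -> bool:
    """Decide whether to move traffic off the primary WAN.

    rtts_ms holds one entry per probe in the window: the round-trip
    time in milliseconds, or None if the probe was lost.
    """
    if not rtts_ms:
        return False  # no data yet, so stay on the primary WAN
    answered = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(answered)) / len(rtts_ms)
    # If every probe was lost there is no latency to average.
    avg_latency = mean(answered) if answered else float("inf")
    return loss_pct > LOSS_LIMIT_PCT or avg_latency > LATENCY_LIMIT_MS
```

For example, a window of ten probes with two lost (20% loss) triggers rerouting, while one lost probe in ten (exactly 10%) does not, because the rule is "exceeds".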

Resident experience: Possible brief period of higher latency before automatic rerouting. Usually invisible if working in a session-based tool.

4. On-Site Hardware Failure (router, switch, AP)

Probability: Low. We have experienced 1 AP failure and 1 switch port failure in 12 months of operation.

Mitigation: Hot spares for all critical hardware on-site. Replacement time for an AP: < 10 minutes. We run redundant core switches — a single switch failure does not drop the network.

Resident experience: Brief disruption if the failed hardware is in their path. Typically < 5 minutes.

5. DNS Failure

An often-overlooked failure mode. If your DNS resolver fails, the internet appears to be down even if your IP connectivity is fine.

Mitigation: We run a local Unbound caching DNS resolver on the LAN (192.168.10.1). Residents resolve DNS locally, and the resolver only queries upstream (Cloudflare 1.1.1.1 / Google 8.8.8.8) on cache misses. If upstream DNS fails, cached responses continue to be served for up to 1 hour.
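The caching behavior can be modeled with a small Python class. This is a toy model and not Unbound's implementation: it assumes the 1-hour limit is measured from the moment an answer was cached, and the class, the `upstream` callable, and the injectable clock are all hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple

STALE_LIMIT_S = 3600.0  # assumed reading of "up to 1 hour"


class StaleServingCache:
    """Serve fresh answers from cache; fall back to stale ones on upstream failure."""

    def __init__(
        self,
        upstream: Callable[[str], Tuple[str, float]],  # returns (address, ttl_s)
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.upstream = upstream
        self.clock = clock
        # name -> (address, time stored, time the TTL expires)
        self.entries: Dict[str, Tuple[str, float, float]] = {}

    def resolve(self, name: str) -> Optional[str]:
        now = self.clock()
        entry = self.entries.get(name)
        if entry is not None and now < entry[2]:
            return entry[0]  # cache hit within TTL, no upstream query
        try:
            address, ttl_s = self.upstream(name)
        except OSError:
            # Upstream unreachable: serve the old answer if it is recent enough.
            if entry is not None and now - entry[1] <= STALE_LIMIT_S:
                return entry[0]
            return None
        self.entries[name] = (address, now, now + ttl_s)
        return address
```

Names that were never cached still fail while upstream is down, which is why this mitigation softens a DNS outage rather than removing it.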

What We Cannot Control

We believe in honesty about the limits of our guarantees.

  • Acts of nature: Extreme weather events (flash floods, landslides blocking roads) could prevent ISP repair crews from reaching fiber cut locations for 24+ hours. In 12 months of operation, this has not occurred during our retreat months.
  • Nationwide ISP routing issues: A BGP routing failure at the BSNL or Jio national level would affect both our ISPs simultaneously. These are extremely rare and usually resolved within 2–4 hours.
  • Your specific cloud provider: Our network delivers packets to your provider's ingress. What happens inside AWS, GCP, or Cloudflare is outside our control.

Real-World Downtime Log (July–August 2025)

Date | Duration | Cause | Mitigation Activated
July 14, 2025 | 11 min | BSNL planned maintenance | Jio failover activated, zero resident impact
August 3, 2025 | 15 min | BSNL planned maintenance | Jio failover activated, zero resident impact
Total | 26 min | |

Measured availability: 99.94% (exceeds our 99.9% commitment).
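The percentage can be reproduced from the log. The sketch below assumes a 31-day measurement window, the figure given alongside the 26-minute total in the summary near the top; the function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def availability_pct(downtime_min: float, days: int) -> float:
    """Availability as a percentage of the measurement window."""
    total_min = days * MINUTES_PER_DAY
    return 100.0 * (1 - downtime_min / total_min)


def downtime_budget_min(target_pct: float, days: int) -> float:
    """Minutes of downtime a target availability allows over the window."""
    return days * MINUTES_PER_DAY * (1 - target_pct / 100.0)


measured = availability_pct(11 + 15, days=31)  # the two logged events
budget = downtime_budget_min(99.9, days=31)    # allowance at 99.9%
print(f"measured: {measured:.2f}%  budget: {budget:.2f} min")
```

Over 31 days (44,640 minutes), 26 minutes of downtime works out to 99.94%, and a 99.9% target allows about 44.6 minutes, so the two logged events used a little over half of the allowance.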

The Monitoring Stack (Open to Residents)

Residents can access our live network dashboard at http://192.168.10.250 (local LAN only):

  • Real-time WAN 1 / WAN 2 status and throughput
  • Per-workstation bandwidth consumption
  • Last 30 days uptime history
  • Current DNS resolver cache hit rate

We have nothing to hide about our infrastructure. If the dashboard shows a problem, we already know about it.

If this level of infrastructure rigor is what you need, apply for our July or August 2026 cohort.

Tags: Remote Work, Lifestyle, Community