Multi-node streaming architecture: geographic failover between the Caribbean and France

The problem: why a single server isn't enough

An IPTV/OTT headend lives on continuity. A channel that drops for ten seconds a day adds up to about five minutes a month, and to a steady pile of complaint tickets over a year. In real operations (not a trade-show demo), the team has to assume that the original source can fail, the processing can crash, the link to the customer can degrade, and that all three can happen at once, at the worst possible moment.

The solution isn't buying a more expensive server. It's distributing the risk geographically: different capture sources, processing nodes in different regions, and clear rules about who takes over when something fails.

This post looks inside how an architecture like that is built, based on the kind of platforms we've designed and operated: a headend with satellite and DTT capture across two continents, a primary and a secondary Flussonic Media Server, around ten TVHeadend servers with USB TBS multi-tuner cards, and a Stalker portal serving MAG STBs with its own EPG adjusted per audience. All in production, serving live streams.

The high-level architecture

Before diving into each component, it's worth seeing the full picture. The headend has three layers:

Capture: where and how the signal enters the system.
Processing: what the system does with the signal between entry and exit (decode, transcode, packaging, ABR, EPG).
Distribution: how that signal reaches the end customer (portal, STB, app, browser).

Each layer is duplicated across two physical sites —in this case, the Caribbean and France— with explicit rules about what gets prioritized when both sites are alive, and what happens when one disappears. The goal isn't technical heroism: it's that when something fails, the signal keeps flowing.

Capture: satellite and DTT, in parallel

The original source for the channels comes through two independent paths:

Satellite (DVB-S/S2) from the Caribbean — the French operator's channels are received via dishes and decoders. It's the operator's "natural" source.
DTT (DVB-T) from France — the same channels broadcast terrestrially, captured directly from Europe.

The two sources are independent in infrastructure: different media (satellite vs. terrestrial), different countries, different operators. If the satellite drops due to extended bad weather, DTT keeps going. If there's a terrestrial blackout in France, satellite covers. And for the case where both fail simultaneously, the system knows how to degrade in a controlled way rather than showing a black screen.

A detail that matters

The operator's official smart cards are mounted in USB readers connected to the capture servers. It's the only legitimate way to decode the channels: using the operator's real subscription, not clones or cardsharing. This isn't just a compliance point — it's what makes the platform sustainable long-term.

Hardware: TVHeadend + TBS cards

The capture software is TVHeadend running on roughly 10 Linux servers distributed across both sites. Each server has one or two USB TBS multi-tuner cards (DVB-S/S2 for the satellite nodes, DVB-T for the DTT ones). The TBS choice isn't random: they're the cards that play best with TVHeadend on Linux, and the ones with drivers maintained on recent kernels.

Each server exposes the channels it captures as MPEG-TS streams over HTTP, accessible to the next layer. The "casting" of satellite or DTT into the IP world happens here.
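A quick way to confirm that a capture node really is handing MPEG-TS to the next layer is to read the first bytes of its HTTP stream and look for the transport stream sync byte (0x47) every 188 bytes. The sketch below is a minimal check under our own assumptions: the function names are ours, and the URL in the usage note is a hypothetical example of a TVHeadend stream address, not a value from a real deployment.

```python
from urllib.request import urlopen

TS_PACKET_SIZE = 188   # fixed MPEG-TS packet length in bytes
TS_SYNC_BYTE = 0x47    # first byte of every MPEG-TS packet


def looks_like_mpegts(data: bytes, min_packets: int = 5) -> bool:
    """Return True if `data` holds at least `min_packets` consecutive,
    correctly aligned MPEG-TS packets."""
    needed = TS_PACKET_SIZE * min_packets
    # The read may start mid-packet, so try every possible alignment.
    for offset in range(min(TS_PACKET_SIZE, len(data))):
        if len(data) - offset < needed:
            break
        if all(data[offset + i * TS_PACKET_SIZE] == TS_SYNC_BYTE
               for i in range(min_packets)):
            return True
    return False


def probe_stream(url: str, nbytes: int = TS_PACKET_SIZE * 50,
                 timeout: float = 5.0) -> bool:
    """Read the first bytes of an HTTP stream and check they are MPEG-TS."""
    with urlopen(url, timeout=timeout) as resp:
        return looks_like_mpegts(resp.read(nbytes))
```

In practice this would be called with something like probe_stream("http://capture-node.example:9981/stream/channel/<uuid>"), where the host and channel UUID are placeholders. It only proves that packets are arriving and aligned; it says nothing about whether the picture decodes.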

Processing: primary and secondary Flussonic

TVHeadend streams flow into Flussonic Media Server, which is where the signal gets transformed into something fit for mass distribution: re-packaging, ABR for different bitrates, HLS/DASH chunk generation, DVR management, catch-up control.

There are two Flussonic instances, one on each continent:

Primary Flussonic in the Caribbean — serves most of the traffic under normal conditions. Has local satellite capture as the main source and remote DTT as fallback.
Secondary Flussonic in France — functional replica. Takes DTT as the main source and remote satellite as fallback. If the primary fails, it absorbs all the traffic.

Switching between primary and secondary isn't manual. The distribution portal (the next layer) knows how to poll both endpoints and pick the healthy one. When a Flussonic stops responding, the end customer barely notices the change — seconds, not minutes.
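The "poll both and pick the healthy one" logic can be sketched in a few lines. This is an illustration of the idea, not the portal's actual code: the function names, the health URLs and the rule "any HTTP 2xx within two seconds counts as healthy" are all assumptions made for the example.

```python
import http.client
from typing import Callable, Optional, Sequence
from urllib.request import urlopen


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Treat any HTTP 2xx answer received within `timeout` as healthy."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, HTTP error, bad URL...
        return False


def pick_backend(backends: Sequence[str],
                 is_healthy: Callable[[str], bool] = http_ok) -> Optional[str]:
    """Return the first healthy backend in priority order, or None."""
    for url in backends:
        if is_healthy(url):
            return url
    return None
```

Passing the backends in priority order (primary first) keeps the rule explicit: the secondary is only chosen when the primary fails its check. Injecting is_healthy also makes the decision testable without touching the network.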

Why Flussonic and not something else

Flussonic is chosen for three concrete reasons, not for hype:

It handles MPEG-TS, HLS, DASH, RTMP, SRT, RIST and native catch-up. Covers basically any protocol the end customer might need.
Its HTTP API is reasonable, which lets you automate operations tasks (checking streams, moving channels, debugging issues) without touching the UI.
In our experience, it stays stable under sustained load, with hundreds of simultaneous streams on a single machine, without the problems open-source alternatives tend to show under heavy load.

Distribution: Stalker portal and MAG STBs

The visible face for the end customer is the Stalker portal. It's the interface the user sees on their STB when they turn it on: channel list, EPG, search, settings. The reference STB for this category is the MAG family (Infomir), which ships with an integrated Stalker client and handles the streams Flussonic produces without exotic configuration.

The Stalker portal is configured to point at both primary and secondary Flussonic. The channel list is the same for both backends; only the stream origin changes. This adds another layer of redundancy: even if the failover logic in the portal fails, the STB can be manually pointed at the alternate backend while you sort it out.

EPG: the one almost nobody gets right

The EPG (Electronic Program Guide) is what users see when they hit "Guide". It looks trivial — download the operator's XMLTV and expose it. It isn't.

Three concrete problems show up in production:

Source and format. Every operator delivers EPG differently — some in XMLTV, others in proprietary formats, some at long intervals and others almost in real time. You have to normalize everything to a consistent format before injecting it into Flussonic and Stalker.
Sync frequency. The EPG changes. Programs move, special broadcasts get announced hours in advance. Periodic download has to be frequent but not aggressive — a script every 6 hours is reasonable, every 5 minutes gets you blocked.
Data quality. It's not unusual to find EPG entries with inconsistent times, empty descriptions, channel IDs that don't match your list. It's worth a validation step that discards garbage before serving it to the user.
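The normalization and validation steps can be sketched as follows. This is a minimal example under stated assumptions: the Programme record and the function names are ours, the channel IDs are invented, and the parser only handles XMLTV timestamps that carry an explicit numeric offset such as "20240115200000 +0100". Dropping entries with empty descriptions is a policy choice; some teams prefer to keep them and flag them instead.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Programme:
    channel_id: str
    title: str
    description: str
    start: datetime  # timezone-aware
    stop: datetime   # timezone-aware


def parse_xmltv_time(value: str) -> datetime:
    """Parse an XMLTV timestamp with a numeric UTC offset."""
    return datetime.strptime(value, "%Y%m%d%H%M%S %z")


def drop_invalid(entries: Iterable[Programme],
                 known_channels: Set[str]) -> List[Programme]:
    """Keep only entries that are safe to show to the user."""
    kept = []
    for p in entries:
        if p.channel_id not in known_channels:
            continue  # channel ID does not match our list
        if not p.title.strip() or not p.description.strip():
            continue  # empty title or description
        if p.stop <= p.start:
            continue  # inconsistent times
        kept.append(p)
    return kept
```

Counting how many entries each rule discards per source is a cheap way to spot an upstream feed that has started to degrade.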

The detail: per-channel timezone by audience

Here's the fine-grained adjustment almost nobody manages, and that matters in real production:

French channels broadcast in French time (UTC+1, or UTC+2 during daylight saving time). If the viewers are in the French Caribbean, that time works almost as-is. But if the audience includes French speakers elsewhere in North America (Florida, Louisiana, Quebec via VPN), the EPG time as it comes from satellite is off by 5 to 7 hours relative to their local zone.

The EPG doesn't adjust itself. You have to apply a per-channel timezone shift at the moment of packaging the guide for each audience, so that when the Miami user sees "8 PM News" on their STB, that means 8 PM Miami time, not 8 PM Paris time. Invisible when it works, a support nightmare when it doesn't.
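One way to apply that shift is to keep every EPG time timezone-aware and convert it to the audience's zone when packaging the guide. The sketch below uses Python's zoneinfo rather than a fixed number of hours, because Europe and North America change their clocks on different dates and the gap is not constant. The function name and the audience-to-zone mapping are assumptions for the example.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Hypothetical mapping from audience label to IANA time zone.
AUDIENCE_TZ = {
    "miami": "America/New_York",
    "louisiana": "America/Chicago",
}


def to_audience_time(start: datetime, audience_tz: str) -> datetime:
    """Convert a timezone-aware EPG time to the audience's local time."""
    if start.tzinfo is None:
        raise ValueError("EPG times must be timezone-aware")
    return start.astimezone(ZoneInfo(audience_tz))


paris = ZoneInfo("Europe/Paris")
winter = datetime(2024, 1, 15, 20, 0, tzinfo=paris)
# US clocks had already moved forward on this date, European ones had not.
mid_march = datetime(2024, 3, 20, 20, 0, tzinfo=paris)

print(to_audience_time(winter, AUDIENCE_TZ["miami"]))      # 14:00 local
print(to_audience_time(mid_march, AUDIENCE_TZ["miami"]))   # 15:00 local
```

The mid-March case is the one a hard-coded "minus six hours" gets wrong: for a few weeks a year the difference between Paris and Miami is five hours, not six.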

Operational lesson

The EPG is the most boring component of the platform and the one that generates the most complaints. It's worth investing in active monitoring: check that every channel has at least N programs in the next 12 hours, and alert before the user notices.
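That check is simple enough to sketch. The example below assumes the guide has already been reduced to (channel ID, start time) pairs with timezone-aware times; the function name, the default of three programs and the channel IDs are illustrative choices, not fixed values.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def channels_with_thin_epg(programmes: Iterable[Tuple[str, datetime]],
                           channels: Iterable[str],
                           now: datetime,
                           min_programmes: int = 3,
                           horizon: timedelta = timedelta(hours=12)) -> List[str]:
    """Return the channels with fewer than `min_programmes` entries
    starting between `now` and `now + horizon`."""
    upcoming = Counter(
        channel for channel, start in programmes
        if now <= start < now + horizon
    )
    # Channels with no entries at all count as zero and are reported too.
    return sorted(ch for ch in channels if upcoming[ch] < min_programmes)
```

Run on a schedule, a non-empty result is what triggers the alert. Iterating over the channel list rather than over the guide matters: a channel whose EPG has vanished completely would otherwise never show up.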

Failover: what happens when something drops

The concrete failover rules, in plain language:

If satellite capture drops in the Caribbean (extended bad weather, antenna issue, decoder unresponsive): the primary Flussonic in the Caribbean switches to its fallback source, the remote DTT feed from France, over a private tunnel. Latency rises slightly; the content keeps flowing.
If DTT drops in France (terrestrial blackout, capture issue): the main flow switches to the Caribbean satellite via the same tunnel in reverse.
If primary Flussonic drops: the secondary on the other continent absorbs all traffic. The Stalker portal detects and redirects; the STB reloads the stream in a few seconds.
If both Flussonic instances drop at the same time (unlikely but not impossible — a shared bug in a new version, for example): the STBs display an informative portal message instead of a black screen. The operations team gets an immediate alert.
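Written as code, the four rules become a small decision function. This is a sketch of the rule set, not the production logic: the Health and Plan types are hypothetical, and two details are our assumptions, namely that losing both capture sources is handled with the same portal message, and that any degraded state raises an alert.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Health:
    satellite_ok: bool   # Caribbean satellite capture
    dtt_ok: bool         # France DTT capture
    primary_ok: bool     # Flussonic in the Caribbean
    secondary_ok: bool   # Flussonic in France


@dataclass(frozen=True)
class Plan:
    backend: Optional[str]   # "primary", "secondary" or None
    source: Optional[str]    # "satellite", "dtt" or None
    action: str              # "serve" or "portal_message"
    alert: bool


def plan(h: Health) -> Plan:
    # Each Flussonic prefers its local capture and falls back to the remote one.
    if h.primary_ok:
        backend, preferred = "primary", ("satellite", "dtt")
    elif h.secondary_ok:
        backend, preferred = "secondary", ("dtt", "satellite")
    else:
        return Plan(None, None, "portal_message", True)

    available = {"satellite": h.satellite_ok, "dtt": h.dtt_ok}
    source = next((s for s in preferred if available[s]), None)
    if source is None:
        # Assumption: no source at all is treated like no backend at all.
        return Plan(backend, None, "portal_message", True)

    degraded = backend != "primary" or source != preferred[0]
    return Plan(backend, source, "serve", degraded)
```

Having the rules in one pure function means they can be unit-tested against every combination of failures, which is the practical meaning of "encoded, not improvised".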

What matters is that these decisions are encoded, not improvised. When something fails at 3 AM, there's no time to think — there's time for the system to do what was already decided.

Lessons you learn operating this

After years building and maintaining architectures like this, these are the lessons that repeat most:

1. Redundancy matters where you don't see it

It's easy to duplicate Flussonic. Less easy to duplicate capture, and almost nobody duplicates the EPG. The dirtiest single points of failure are the ones "in the middle" — a single EPG download script running on a single machine, a single connection to the satellite antenna, a single TBS card with no backup. When the design starts, the reflex is to duplicate what's visible. You have to duplicate the invisible first.

2. Active monitoring > passive alerts

Waiting for a user to call saying "there's no signal" is losing. Monitoring has to test every channel every N minutes, pull a few seconds of the stream, validate it has audio and video, and alert before it's perceptible to the customer. Tools like Prometheus + dedicated scripts, or Flussonic's own APIs, work well here.
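One of those dedicated scripts could look like the sketch below, which asks ffprobe to analyze the start of a stream and checks that it reports both an audio and a video track. It assumes ffprobe is installed on the monitoring host; the function names and the 15-second timeout are our choices. It confirms that the tracks are present, not that the picture is correct.

```python
import json
import subprocess


def has_audio_and_video(ffprobe_json: str) -> bool:
    """Check ffprobe's JSON output for at least one audio and one video stream."""
    streams = json.loads(ffprobe_json).get("streams", [])
    kinds = {s.get("codec_type") for s in streams}
    return {"audio", "video"} <= kinds


def channel_is_alive(url: str, timeout: float = 15.0) -> bool:
    """Probe a stream URL with ffprobe; any failure counts as 'not alive'."""
    cmd = [
        "ffprobe", "-v", "error",
        "-show_entries", "stream=codec_type",
        "-of", "json",
        url,
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout, check=True)
        return has_audio_and_video(result.stdout)
    except (subprocess.SubprocessError, OSError, ValueError):
        # Timeout, non-zero exit, ffprobe missing, or unparsable output.
        return False
```

Looping channel_is_alive over the channel list every N minutes and exporting the result as a metric gives the alerting layer something concrete to fire on.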

3. Flussonic updates happen in known windows

Flussonic has releases that break things. Never update directly in production. Test in staging with simulated traffic, validate for 24 hours minimum, and update the secondary first — if it's fine for a week, then the primary.

4. The end customer sees details you don't

Audio out of sync by a few frames, EPG with weird-capitalized titles, a channel that takes 8 seconds to start instead of 3 — the operator doesn't notice because they have channels open all day and get used to it. The new customer notices immediately. It's worth keeping a "new customer checks" list the team walks through periodically from the user's POV, not the operator's.

5. Documentation is part of the infrastructure

When the team grows or someone goes on vacation, the only thing keeping the operation running is the documentation. The install README isn't enough — you need the runbook with "what to do if the Caribbean antenna stops responding" or "how to manually switch to the secondary Flussonic if the portal doesn't do it on its own". If this lives only in one person's head, the operation is fragile no matter how solid the code is.

Closing

A streaming architecture like this isn't built in a weekend. It's designed, tested, operated, and every real incident refines the decisions. The difference between a system that holds and one that doesn't almost never lies in the most expensive component — it's in how the components are connected and what happens when something breaks.

If you have a streaming operation that needs this level of robustness, or you're evaluating standing up an IPTV/OTT headend from scratch and want a technical second opinion, let's talk.