Apr 25 · in engineering
by Ryan Welch · 10 Min Read
My recent posts on Kokoro and Qwen3 covered TTS models that produce narration with word-level timestamps. As part of a larger captioning project, I now need to split that stream of timestamped words into subtitle pages. A subtitle page is a short chunk of text that appears on screen for a few seconds, then disappears. The segmentation problem is deciding which words belong together on the same page, and where to break between pages. This is a surprisingly complex problem to do well, and the solution involves a mix of timing data, linguistic rules, and machine learning.
The input to subtitle segmentation is a list of words with timestamps: the millisecond each word started and ended. Where these timestamps come from depends on your pipeline. ASR/STT (automatic speech recognition) is the most common source, where you transcribe existing audio and get word-level timing as part of the output. TTS engines can also produce this data during synthesis. Either way, the segmentation problem is the same: given timestamped words, group them into subtitle pages that appear and disappear in sync with the audio.
For this article I’m using the following example sentence:
Golden hour hit different. We stood there, wordless, watching the day quietly surrender to night.
Each word arrives with a startMs and endMs. The gaps between consecutive words reflect natural pauses in the speech. The job is deciding where one subtitle page ends and the next begins.
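To keep the later snippets concrete, here's the shape I'll assume for the input. The timestamps are illustrative, not taken from a real synthesis run.

```ts
interface TimedWord {
  text: string;
  startMs: number;
  endMs: number;
}

interface SubtitlePage {
  words: TimedWord[];
  startMs: number;
  endMs: number;
}

// Illustrative timings for the start of the example sentence.
const words: TimedWord[] = [
  { text: "Golden",     startMs: 0,    endMs: 320 },
  { text: "hour",       startMs: 340,  endMs: 560 },
  { text: "hit",        startMs: 580,  endMs: 760 },
  { text: "different.", startMs: 780,  endMs: 1240 },
  { text: "We",         startMs: 1860, endMs: 1980 },
  // ...
];
```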
Each approach below runs on the same timestamped sentence, so you can see where each technique breaks down and what the next one fixes.
Subtitle reading is time-pressured: the text appears, you read it, and it disappears. You can’t scroll back. If a page breaks at the wrong moment, the reader’s eye lands on a fragment like “We” and has to wait for the next page to understand what it belongs to. By the time the next page appears, some viewers have already looked away. Others re-read to reconstruct the sentence and fall behind.
The simplest approach: slice the word array into fixed-size chunks.
```ts
function chunkByCount(words: TimedWord[], n: number): SubtitlePage[] {
  const pages: SubtitlePage[] = [];
  for (let i = 0; i < words.length; i += n) {
    const chunk = words.slice(i, i + n);
    pages.push({
      words: chunk,
      startMs: chunk[0].startMs,
      endMs: chunk[chunk.length - 1]!.endMs,
    });
  }
  return pages;
}
```

Every break is arbitrary, landing wherever the count happens to fall. "We" gets split from "stood" across two pages. The reader sees "We" and has to wait for the next page to understand the subject.
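Running that on the example sentence with five words per page gives:

```ts
const pages = chunkByCount(words, 5);
// page 1: "Golden hour hit different. We"
// page 2: "stood there, wordless, watching the"
// page 3: "day quietly surrender to night."
```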
A step up: if a word ends with ., !, or ?, flush the current page there instead of waiting for the word count to fill up.
```ts
function endsSentence(word: string): boolean {
  return /[.!?]["')\]]*$/.test(word);
}
```

This immediately creates a problem. "Dr." triggers the sentence-ending check because the period looks like a full stop to a regex. "St." in "St. Pauls Lane" does the same. Periods pull double duty in English (sentence termination and abbreviation markers) and a regex can’t tell the difference without knowing the word.
The fix is a set of known abbreviations to skip:
```ts
const ABBREVIATIONS = new Set([
  "mr", "mrs", "ms", "dr", "prof", "st", "ave", "blvd", "rd",
  "vs", "etc", "approx", "dept", "jan", "feb", "mar" /* ... */,
]);

function endsSentence(word: string): boolean {
  if (!/[.!?]["')\]]*$/.test(word)) return false;
  const stem = word.replace(/[.!?]["')\]]*$/, "").toLowerCase();
  return !ABBREVIATIONS.has(stem);
}
```

Abbreviation lists are always finite and language-specific. They also can’t handle context: "May" could be a month, a name, or a modal verb. And domain-specific abbreviations (medical, legal, technical) mean the list needs extending for specialised content.
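On the earlier examples the check now behaves as intended:

```ts
endsSentence("different."); // true: a real sentence boundary
endsSentence("night.");     // true
endsSentence("Dr.");        // false: "dr" is in ABBREVIATIONS
endsSentence("St.");        // false: same for "st"
```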
“Golden hour hit different.” now breaks correctly at the period. But notice the timing: the speaker pauses for half a second after “wordless,” and the captions have no idea. Pages appear and disappear based on word count and punctuation alone. During long silences the previous caption sits frozen on screen, then the next page snaps in when speech resumes. The text is correct, but it doesn’t follow the audio.
The timestamp data includes gaps between consecutive words. If the gap exceeds a threshold, that’s a natural break point where the speaker paused.
```ts
const gap = words[i].startMs - words[i - 1].endMs;
if (gap > maxPauseMs) {
  pages.push(currentPage);
  currentPage = [];
}
```

The captions now breathe with the speaker. The 500ms gap after "wordless," triggers a page break, so there’s no more frozen text during silence. The sentence boundary after "different." naturally produces a long gap too, so it breaks there as well.
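For context, here's a runnable sketch of the loop that fragment lives in, combining the pause check with the word cap and the endsSentence helper from earlier. The thresholds are placeholders, not recommendations.

```ts
function chunkByPause(
  words: TimedWord[],
  wordsPerPage = 5,
  maxPauseMs = 300, // placeholder; anything below the 500ms pause works here
): SubtitlePage[] {
  const pages: SubtitlePage[] = [];
  let currentPage: TimedWord[] = [];

  const flush = () => {
    if (currentPage.length === 0) return;
    pages.push({
      words: currentPage,
      startMs: currentPage[0]!.startMs,
      endMs: currentPage[currentPage.length - 1]!.endMs,
    });
    currentPage = [];
  };

  for (let i = 0; i < words.length; i++) {
    const word = words[i]!;
    // Break before this word if the speaker paused long enough.
    if (i > 0 && word.startMs - words[i - 1]!.endMs > maxPauseMs) flush();
    currentPage.push(word);
    // Break after this word on sentence end or when the page is full.
    if (endsSentence(word.text) || currentPage.length >= wordsPerPage) flush();
  }
  flush();
  return pages;
}
```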
The createTikTokStyleCaptions function in @remotion/captions does roughly this: time-gap-based page grouping with no linguistic awareness. It’s a reasonable starting point if you just need something working.
Pause detection still leaves one awkward split: watching the day quietly surrender / to night. This respects timing and the 5-word cap, but it cuts through a prepositional phrase. The more natural subtitle would be watching the day / quietly surrender to night.
Word lists can’t solve this. You need phrase boundary information: knowledge of which words form a unit that shouldn’t be split. In a real system this comes from an NLP parser. Here’s a sketch of what that looks like:
```ts
import { parsePhrases } from "mock-nlp-lib";

// getPhraseBoundaries (below) turns the parser output into a set of
// word indices after which a phrase boundary exists. For our sentence:
// index 10 → end of "watching the day" (noun phrase)
// index 11 → end of "quietly" (adverb modifying the verb phrase)
// so we can break after index 10 without splitting a phrase.

async function getPhraseBoundaries(words: TimedWord[]): Promise<Set<number>> {
  const text = words.map((w) => w.text).join(" ");
  const phrases = await parsePhrases(text);

  // phrases is an array of token spans: [{ start: 0, end: 3 }, ...]
  // We want the index of the last word in each phrase.
  const boundaries = new Set<number>();
  for (const phrase of phrases) {
    boundaries.add(phrase.end - 1);
  }
  return boundaries;
}

// Then use it in the paging loop:
const phraseBreakAfter = await getPhraseBoundaries(words);

if (
  gap > maxPauseMs ||
  endsSentence(word.text) ||
  currentPage.length >= wordsPerPage ||
  phraseBreakAfter.has(i)
) {
  pages.push(currentPage);
  currentPage = [];
}
```

The key detail: parsePhrases returns spans based on the parser’s understanding of sentence structure (noun phrases, verb phrases, prepositional phrases) rather than hardcoded indices. In this example it would identify "watching the day" as a noun phrase (span ending at index 10) and "quietly surrender to night" as a verb phrase, so the break falls at the right place.
For this sentence, that produces the split you actually want. The timing rules still apply first; the phrase boundary adds one more valid break point inside the final chunk.
A subtitle page can contain multiple lines. The demos above show single-line pages for simplicity, but production subtitles usually allow two lines per page with a maximum character count per line (Netflix specifies 42 characters; BBC uses 37). The line break within a page follows the same linguistic rules as the page break between pages: you don’t want to split a phrase across lines any more than across pages. Subtitle Edit, an open-source subtitle authoring tool, applies many of these Netflix/BBC rules in its line-breaking logic, which is worth studying even if you don’t use the tool directly.
So the full problem has two levels: deciding where pages end, and deciding where lines break within a page. Everything above applies to both.
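A sketch of the within-page half: split a page's text into at most two lines under a per-line character cap, keeping the lines balanced. A fuller version would also prefer the phrase boundaries from earlier; the 42-character cap is the Netflix figure mentioned above.

```ts
function breakIntoLines(words: string[], maxCharsPerLine = 42): string[] {
  const text = words.join(" ");
  if (text.length <= maxCharsPerLine) return [text];

  // Try every break point between words and keep the most balanced
  // two-line split in which both lines fit the character cap.
  let best: string[] | null = null;
  let bestImbalance = Infinity;
  for (let i = 1; i < words.length; i++) {
    const top = words.slice(0, i).join(" ");
    const bottom = words.slice(i).join(" ");
    if (top.length > maxCharsPerLine || bottom.length > maxCharsPerLine) continue;
    const imbalance = Math.abs(top.length - bottom.length);
    if (imbalance < bestImbalance) {
      best = [top, bottom];
      bestImbalance = imbalance;
    }
  }
  // If no two-line split fits, the page itself needs splitting further.
  return best ?? [text];
}
```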
These four techniques (fixed chunks, punctuation and abbreviation detection, pause-based breaking, and phrase-boundary hints) cover a lot of ground for English with clean punctuation, but the remaining edge cases push beyond what word lists can handle.
Text without terminal punctuation (common in dialogue and social media captions) never triggers sentence breaks, falling back entirely to word count. Abbreviation lists miss domain-specific terms. And none of these techniques understand grammar directly.
NLP tools go further. A dependency parser (spaCy, for instance) identifies clause boundaries directly, so you’d never split a verb from its object or a preposition from its complement. Named Entity Recognition tags “St. Pauls” as a location without needing an abbreviation list.
wtpsplit takes a different approach: small transformer models trained specifically for text segmentation, including subtitle segmentation, supporting 85 languages. It handles both sentence boundary detection and subtitle-specific line breaking, making it useful as a pre-processing step. The simpler heuristics then only need to break within those boundaries.
Finding linguistically good break points isn’t enough on its own. The break also has to make sense with the audio timing.
Reading speed is part of the segmentation problem, not something you fix afterwards. Real subtitle systems enforce hard display constraints: maximum characters per line, maximum lines per subtitle, minimum event duration, and some target reading rate in characters per second or words per minute.
Netflix specifies a minimum of 5/6 of a second per subtitle event. The BBC targets 160-180 words per minute, about 0.33 seconds per word. A page that appears and disappears in 200ms is unreadable regardless of how grammatically perfect the break is.
Timing and linguistic quality often push in opposite directions. Pauses happen where the speaker breathes, not necessarily where clauses end. In practice, you treat pause-based grouping as a hard constraint and apply the linguistic fixes within those groups. Don’t split a group further if it would produce pages below the minimum duration.
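Using the figures above, a 5/6-second floor per event and roughly 0.33 seconds per word, a sketch of that guard might look like this. The constants are just the examples quoted here, not universal values, and speech duration stands in for display duration.

```ts
const MIN_EVENT_MS = 833; // ~5/6 of a second per subtitle event
const MS_PER_WORD = 330;  // ~160-180 words per minute

function minReadableMs(wordCount: number): number {
  return Math.max(MIN_EVENT_MS, wordCount * MS_PER_WORD);
}

// Accept a split inside a page only if both halves stay readable.
function canSplit(page: TimedWord[], breakAfter: number): boolean {
  const first = page.slice(0, breakAfter + 1);
  const second = page.slice(breakAfter + 1);
  if (first.length === 0 || second.length === 0) return false;
  const durationMs = (ws: TimedWord[]) =>
    ws[ws.length - 1]!.endMs - ws[0]!.startMs;
  return (
    durationMs(first) >= minReadableMs(first.length) &&
    durationMs(second) >= minReadableMs(second.length)
  );
}
```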
Rule-based systems and parsers depend on explicit logic: regexes, dictionaries, grammar rules, dependency trees. ML-based systems learn what good subtitle boundaries look like from human-authored examples.
In practice this is often framed as sequence labelling or boundary scoring: for each token boundary, predict whether a subtitle break belongs there. Transformer models can use wider context than handwritten rules, which helps with unusual syntax, ambiguous abbreviations, and other edge cases.
ML doesn’t replace the hard constraints though. A model might prefer a natural linguistic break that produces a 250ms subtitle or a line that’s too long. Production systems still filter or rerank model suggestions through timing and display limits.
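As a sketch of that filtering step: assume some model scores each candidate boundary (scoreBoundaries here is hypothetical, standing in for whatever sequence-labelling model you use). Hard constraints remove candidates first; the score only decides among what survives.

```ts
interface BoundaryCandidate {
  index: number; // break after this word
  score: number; // model's confidence that a break belongs here
}

// Hypothetical model call; any token-level boundary scorer fits here.
declare function scoreBoundaries(words: TimedWord[]): BoundaryCandidate[];

// Keep only candidates that survive the hard constraints (the duration
// check from the previous sketch), then rank the rest by model score.
function rankFeasibleBreaks(words: TimedWord[]): number[] {
  return scoreBoundaries(words)
    .filter((c) => canSplit(words, c.index))
    .sort((a, b) => b.score - a.score)
    .map((c) => c.index);
}
```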
Everything above assumes word-delimited text: words separated by spaces, Western punctuation. Many languages don’t work this way.
Chinese and Japanese have no spaces between words. Before you can detect phrase boundaries you need word segmentation, which is its own problem. Common tools: MeCab for Japanese, jieba for Chinese. Break points in these languages typically fall after grammatical particles or at clause boundaries determined by morphological analysis.
Arabic and Hebrew are right-to-left. Mixing RTL text with LTR content (numbers, names) creates BiDi rendering issues where a bad line break can change how the renderer interprets directional context.
Agglutinative languages like Finnish, Turkish, and Hungarian pack what English expresses in phrases into single words. A “word count” limit behaves very differently here; character count is more reliable.
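One practical consequence: the page-size cap should be measured in characters rather than words so it behaves consistently across languages. A minimal sketch:

```ts
// Character count is a more stable page-size limit than word count.
// Space-delimited languages join tokens with a space; CJK text, once
// segmented, joins with no separator.
function exceedsLimit(words: TimedWord[], maxChars: number, joiner = " "): boolean {
  return words.map((w) => w.text).join(joiner).length > maxChars;
}
```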
There’s no single subtitle segmentation system that handles all of this. Production multilingual systems either route each language through a language-specific processor, or use large pre-trained models like wtpsplit that have internalized language-specific behaviour.
Most real systems are hybrid: hard rules for non-negotiables like line length, event duration, and forbidden breaks; timing data to define where a split is feasible; and linguistic rules, parsers, or ML models to rank the feasible break points by naturalness. Each layer eliminates a different class of bad caption.
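Reusing the sketches from earlier (pause grouping, the boundary ranking, and the readability guard), the orchestration itself stays small; a real implementation would split oversized groups recursively rather than just once.

```ts
function toPage(words: TimedWord[]): SubtitlePage {
  return {
    words,
    startMs: words[0]!.startMs,
    endMs: words[words.length - 1]!.endMs,
  };
}

function buildPages(allWords: TimedWord[]): SubtitlePage[] {
  const pages: SubtitlePage[] = [];
  // 1. Timing defines the coarse groups: never merge across long pauses.
  for (const group of chunkByPause(allWords)) {
    // 2. A group that already fits the display limits stays whole.
    if (!exceedsLimit(group.words, 42)) {
      pages.push(group);
      continue;
    }
    // 3. Otherwise split at the best feasible boundary: linguistically
    //    ranked, but only where the readability guard allows it.
    const [breakAfter] = rankFeasibleBreaks(group.words);
    if (breakAfter === undefined) {
      pages.push(group); // no acceptable split; keep the group whole
      continue;
    }
    pages.push(toPage(group.words.slice(0, breakAfter + 1)));
    pages.push(toPage(group.words.slice(breakAfter + 1)));
  }
  return pages;
}
```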
Liked the post? Get in touch.
Have a question? Spotted a mistake? Or just want to say thank you?
Send me an email at hello@ryanwelch.co.uk - seriously! I love hearing from you.