VOICEROID ASR Quality Issue

Status: RESOLVED

Mitigations implemented:

Whisper medium (Task 009) + large-v3-turbo (Task 285): EP01 accuracy 82.6% → 91.4%
Speaker diarization investigated (Task 056): general-purpose diarization not viable for VOICEROID (80.3% accuracy)
Multi-source comparison infrastructure (Tasks 297, 298): accuracy + agreement charts on transcription pages
OCR pipeline (Task 289): captures on-screen text as additional source

YouTube's auto-generated subtitles for SOLAR LINE have very poor accuracy.
The series uses VOICEROID/software-talk voices (CeVIO, VOICEVOX, etc.), which
are synthetic speech that YouTube's speech recognition handles poorly.

- Speaker detection by voice is impossible from ASR alone

- Must use dialogue patterns (AI=polite/technical, きりたん=casual/decisions)
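
The dialogue-pattern approach above could be sketched as a simple sentence-ending heuristic. This is only an illustration: the marker lists and the `guess_speaker` helper are hypothetical, not part of the project, and a real pass would need tuning against manually verified lines. It treats polite endings as a hint for the AI and plain-form endings as a hint for きりたん.

```python
# Hypothetical marker lists (assumptions, not taken from the project).
POLITE_ENDINGS = ("です", "ます", "ました", "ません", "でしょう", "ください")
CASUAL_ENDINGS = ("だ", "だろ", "じゃん")

TRAILING_PUNCT = "。！？!?…、 \u3000"   # sentence punctuation and spaces
FINAL_PARTICLES = "ねよか"              # particles that can follow either style


def guess_speaker(line: str) -> str:
    """Guess the speaker of one subtitle line from its sentence ending.

    Returns "AI", "きりたん", or "unknown". The result is a hint for a
    human reviewer, not a final attribution.
    """
    text = line.strip().rstrip(TRAILING_PUNCT)
    # Drop sentence-final particles so "ですよね" is checked as "です".
    core = text.rstrip(FINAL_PARTICLES)
    if core.endswith(POLITE_ENDINGS):
        return "AI"
    if core.endswith(CASUAL_ENDINGS):
        return "きりたん"
    return "unknown"
```

Lines that come back as "unknown", and any line used as a quote, would still need checking against the video.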

Quote accuracy: Dialogue quotes in reports should be manually verified against the video

Manual transcription: Most accurate but labor-intensive
Whisper ASR: Run OpenAI Whisper locally on the audio; it may handle synthetic speech better
Video OCR: Some series display text/subtitles on screen that could be extracted
Human correction pass: Use ASR as starting point, correct against video
Multiple ASR engines: Cross-reference YouTube auto-subs with Whisper output
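
Cross-referencing two transcripts of the same line can be sketched with the standard library alone. The function names below are hypothetical, and this is not necessarily the metric behind the project's agreement charts. Comparison is character-level because Japanese text has no whitespace word boundaries.

```python
from difflib import SequenceMatcher


def agreement(a: str, b: str) -> float:
    """Character-level similarity of two transcripts, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def disagreements(a: str, b: str) -> list[tuple[str, str]]:
    """Return (text_in_a, text_in_b) pairs for every span where the sources differ."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Spans where the sources disagree are the natural candidates for manual review. Agreement alone does not prove correctness, since both engines can make the same mistake.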

Tested speaker diarization tools on VOICEROID content:

Resemblyzer embeddings: きりたん-ケイ cosine similarity 0.983 (near-identical)
Pitch analysis: F0 difference is only 28.8 Hz, while the standard deviation is ~100 Hz, so the distributions overlap heavily
Best accuracy: 80.3% binary (embedding nearest-centroid), insufficient for production use
Conclusion: General-purpose speaker diarization is NOT viable for VOICEROID content
Non-VOICEROID speakers (管制, 船乗り) have distinct pitch and ARE separable
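
For reference, the embedding nearest-centroid method mentioned above can be sketched as follows. This is a minimal illustration in pure Python, not the code used in Task 056: the helper names are hypothetical, and the embeddings are assumed to be already extracted (for example by Resemblyzer) as lists of floats.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def centroid(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of a speaker's labelled embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def nearest_centroid(embedding: list[float],
                     centroids: dict[str, list[float]]) -> str:
    """Return the speaker whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda name: cosine(embedding, centroids[name]))
```

When two centroids have a cosine similarity of 0.983, small variations in a single utterance can easily tip the decision either way, which fits the 80.3% ceiling. The pitch figures point the same direction: if the ~100 Hz value is read as a per-speaker standard deviation, a 28.8 Hz gap is roughly 0.29 standard deviations.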

For now, use YouTube auto-subs as structural scaffolding (timing, line breaks)
but treat the text content as unreliable. Speaker attribution and quote
correction should be done by reviewing the actual video content.
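
One way to use the auto-subs as scaffolding is to keep the cue timing and line breaks while marking every line as unverified. The sketch below assumes the subtitles are in SRT or WebVTT form with `HH:MM:SS.mmm --> HH:MM:SS.mmm` timing lines. The `Cue` class and `parse_cues` function are hypothetical names, and inline word-timing tags that may appear in auto-generated WebVTT are not stripped.

```python
import re
from dataclasses import dataclass

# Matches SRT ("," before milliseconds) and WebVTT (".") timing lines.
CUE_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})[.,](\d{3})\s*-->\s*"
    r"(\d{2}):(\d{2}):(\d{2})[.,](\d{3})"
)


@dataclass
class Cue:
    start: float               # seconds
    end: float                 # seconds
    text: str                  # ASR text, kept but treated as unreliable
    speaker: str = "unknown"   # filled in after reviewing the video
    verified: bool = False     # set to True once checked against the video


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_cues(subtitle_text: str) -> list[Cue]:
    """Extract timed cues, preserving line breaks inside each cue."""
    cues: list[Cue] = []
    current = None
    for line in subtitle_text.splitlines():
        match = CUE_RE.search(line)
        if match:
            parts = match.groups()
            current = Cue(_seconds(*parts[:4]), _seconds(*parts[4:]), "")
            cues.append(current)
        elif not line.strip():
            current = None  # a blank line ends the cue
        elif current is not None:
            current.text = (current.text + "\n" + line.strip()).strip()
        # Lines outside a cue (file header, SRT index numbers) are ignored.
    return cues
```

Every cue starts with `verified=False` and `speaker="unknown"`, so downstream reports can refuse to quote a line until someone has checked it against the video.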