Manga Frequency List Accuracy
Frequency lists for Japanese novels and TV subtitles are highly accurate because they use text-based data, easily parsed by software. Manga, however, relies on optical character recognition (OCR) to extract text from images, introducing errors.
OCR struggles with:
- Low-resolution kanji, often misidentified.
- Non-text elements mistaken for text.
- Handwritten text, which is hard to parse.
These errors produce false positives: entries in the frequency list that never actually appear in the manga. Manually correcting the OCR output is impractical because of the time and cost involved.
To avoid false positives when learning from a manga frequency list:
- Prioritize high-frequency words, as they’re more likely to be correct.
- When starting a series, use the series-wide frequency list rather than individual volume lists. A given OCR misreading is unlikely to recur in every volume, so false positives rank lower in the series-wide list.
- If you are reading multiple series, use the combined frequency list that covers those series and their volumes.
- Learn the lowest-frequency words in context while reading rather than from the list, since these entries are the most likely to be OCR errors.
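As a rough illustration of the tips above, the following Python sketch merges per-volume word counts into a series-wide list and keeps only words that are frequent and appear in more than one volume. It is only a sketch under stated assumptions: the function names, the thresholds (`min_count`, `min_volumes`), and the sample counts are all hypothetical and are not taken from any actual frequency list or tool.

```python
from collections import Counter


def combine_volumes(volume_counts):
    """Sum per-volume word counts into one series-wide Counter."""
    total = Counter()
    for counts in volume_counts:
        total.update(counts)  # adds the counts from each volume
    return total


def volumes_containing(volume_counts):
    """Count how many volumes each word appears in."""
    spread = Counter()
    for counts in volume_counts:
        spread.update(counts.keys())  # +1 per volume containing the word
    return spread


def likely_real_words(volume_counts, min_count=3, min_volumes=2):
    """Keep words that are frequent overall and recur across volumes.

    The thresholds are arbitrary assumptions; tune them to taste.
    """
    total = combine_volumes(volume_counts)
    spread = volumes_containing(volume_counts)
    return [
        (word, n)
        for word, n in total.most_common()
        if n >= min_count and spread[word] >= min_volumes
    ]


# Invented sample data: two volumes of one series. "犬大" and "末来"
# stand in for one-off OCR misreadings that show up in a single volume.
volumes = [
    {"仲間": 12, "海賊": 9, "犬大": 1},
    {"仲間": 10, "海賊": 7, "末来": 1},
]

study_list = likely_real_words(volumes)
print(study_list)  # [('仲間', 22), ('海賊', 16)]
```

Words dropped by the filter are not necessarily wrong; they are simply the ones better checked in context while reading.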