• unpossum@sh.itjust.works
    2 days ago

    State-of-the-art models rely on late-1800s and early-1900s print books for high-quality training data, and those books use ~30% more em-dashes than contemporary English prose. That’s why it’s so hard to get models to stop using em-dashes: because they learned English from texts that were full of them.

    That sounds really plausible – I associate the em-dash with old books and stilted prose, like the Sherlock Holmes stories.