• CerebralHawks@lemmy.dbzer0.com
    English
    arrow-up
    9
    ·
    2 days ago

    My guess is that the models that do were trained on forum posts where intelligence, including the knowledge of how and the wisdom of when to use non-standard punctuation marks like en and em dashes, the semicolon, and others, was considered valuable. Those posters would seem, on the surface, to know more about what they’re talking about, and so would provide better training data for the LLM. Those people used em dashes, so the AI models based on them do too.

    Also, sorry (not sorry). I am a religious em dash user and have been for over 30 years. I’m not saying I’m smarter than anyone about any one thing, but it is entirely possible some of my forum posts were used to train LLMs. I didn’t get paid for it, though, hence the “not sorry” part. If a model trained on my posts after the fact, I won’t take any blame for that. But people were using em dashes long before AIs were.

    • DrunkenPirate@feddit.org
      English
      arrow-up
      5
      arrow-down
      1
      ·
      3 days ago

      With this indicator in mind, it’s become a bit of a fad to spot AI text easily (and everywhere).

      I‘m not a bot — AI m a bitch.

  • unpossum@sh.itjust.works
    English
    arrow-up
    5
    ·
    2 days ago

    State-of-the-art models rely on late-1800s and early-1900s print books for high-quality training data, and those books use ~30% more em-dashes than contemporary English prose. That’s why it’s so hard to get models to stop using em-dashes: because they learned English from texts that were full of them.

    That sounds really plausible – I associate the em-dash with old books and stilted prose, like the Sherlock Holmes stories.