• 4 Posts
  • 25 Comments
Joined 2 years ago
Cake day: December 18th, 2023









  • It doesn’t explain tokenizers well (or at all). There are better videos on the subject.

    Anyway. Suppose you wanted to spell “giraffe” with the English alphabet in any arbitrary, phonetic way. You could also spell it, for example, “jeeruff” or “djirough”. You could count how many phonetically correct ways there are to spell “giraffe”.

    Tokenizers break a text into sequences of characters (even individual characters), called tokens. Different tokenizers use different tokens. The one they use here has “gira” as a token, but also “g”, “i”, “r”, and “a”. So you could tokenize the same text in different ways. They have a slide where they show the possibilities.
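
To make the "same text, different tokenizations" point concrete, here is a minimal Python sketch that lists every way to split a word into tokens from a given vocabulary. The vocabulary is a made-up toy: only “gira” and the single letters “g”, “i”, “r”, “a” come from the description above; “f”, “e”, and “ffe” are assumptions added so the whole word can be covered. The function name `segmentations` is likewise hypothetical, and this is not how any particular tokenizer library works internally.

```python
from functools import lru_cache


def segmentations(text, vocab):
    """Return every way to split `text` into tokens drawn from `vocab`."""

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the text yields one (empty) completion.
        if start == len(text):
            return ((),)
        results = []
        # Try every substring starting at `start` that is a known token.
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in vocab:
                for rest in split_from(end):
                    results.append((piece,) + rest)
        return tuple(results)

    return [list(s) for s in split_from(0)]


# Toy vocabulary (assumption): "gira" plus single letters, and "ffe".
VOCAB = {"g", "i", "r", "a", "f", "e", "gira", "ffe"}

ways = segmentations("giraffe", VOCAB)
for tokens in ways:
    print(tokens)
print(len(ways), "possible tokenizations")
```

With this toy vocabulary, the front of the word can be either “gira” or four single letters, and the back can be either “ffe” or three single letters, so the count is small. A real vocabulary with tens of thousands of tokens would allow far more splits for the same text.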






  • Look… Doesn’t that feel kind of like self-pitying rot to anyone here? Why are we dependent on US technology? Because we are so ethical and pure…

    Here’s the truth.

    In the late 90s, a German student created a search engine in Germany. It was a little thing. It only scraped a few hundred media outlets. You signed up, defined some keywords, and when an article matching those keywords was published, you received a notification.

    He was immediately sued and forced to shut down under copyright law. Google could operate in the US under Fair Use.

    Eventually, years later, search engines were legalized in Germany (and the EU). But by then the Internet was dominated by US companies. It makes no sense to spend billions to build a European Google that does exactly what Google already does.

    The reason that there is no European Google is that we insist that information must be owned. No data processing without the explicit consent of the owner. That means we insist that some intellectual property owners should be allowed to extract rent from us all.










    1. The problem with trad SoMe is that it is monopolistic. That’s because these companies “own” the data and gate-keep access. If you want open social media, you must not have a gate-keeper. Which means that you can’t have someone who controls access. That’s a fundamental trade-off.

    2. So what? Should posts be anonymous as long as they are short?

    3. No. It’s always data+owner. It doesn’t matter if the data is only a single bit.