2024-05-08 22:52:55

wired mia on Nostr:

so, you've seen ™ and ™️ before. but like. why are there two. well, i have an explanation! the answer is: `FE0F`

first, unicode. unicode is a standard definition of a bunch of codepoints, where a codepoint is just a number with meaning. for example, unicode codepoint `U+263A` refers to ☺︎, or "White Smiling Face", and `U+1F431` refers to 🐱, or "Cat Face"
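to make "a number with meaning" concrete, here's a small python sketch (not from the original post). it only uses the built-in `ord` and `chr` plus the standard `unicodedata` module; the variable names are just made up for illustration:

```python
import unicodedata

# a codepoint is just an integer; chr() turns it into the character
smiley = chr(0x263A)
cat = chr(0x1F431)

# ord() goes the other way, and unicodedata.name() looks up the official name
for ch in (smiley, cat):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

the names python reports are uppercase ("WHITE SMILING FACE", "CAT FACE"), since that's how the unicode character database stores them.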

so, let's start by looking at the codepoints for ™. decoding it, we get a single codepoint, `U+2122`, referred to as "Trade Mark Sign". this was added in unicode 1.1 in 1993, a decent time ago!

next, the codepoints for ™️. decoding it, we get two codepoints! `U+2122` (™︎) and `U+FE0F`. wait. who is `FE0F`. why is he in my emoji
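if you want to do the "decoding" yourself, here's a minimal python sketch. `codepoints` is a hypothetical helper name, and the two strings are written with escapes so nothing gets lost in copy-paste:

```python
def codepoints(text):
    """list each codepoint in text in U+XXXX notation"""
    return [f"U+{ord(ch):04X}" for ch in text]

plain = "\u2122"        # the trade mark sign on its own
emoji = "\u2122\uFE0F"  # the same sign followed by U+FE0F

print(codepoints(plain))  # ['U+2122']
print(codepoints(emoji))  # ['U+2122', 'U+FE0F']
```

iterating over a python `str` goes codepoint by codepoint, which is why the second string shows up as two entries even though it renders as one glyph.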

well, unicode isn't as simple as a series of codepoints that refer to single characters. take a look at `é̗` for example. this is three codepoints, `U+0065` (Latin Small Letter E), `U+0301` (Combining Acute Accent), and `U+0317` (Combining Acute Accent Below). the first codepoint is simple enough, it's just `e`. the next two, however, are *combining* codepoints. this means that they combine with the codepoint before them to modify it. `U+0301` adds an acute accent above the previous codepoint, and `U+0317` adds an acute accent below the previous codepoint. this example specifically isn't very useful (i don't know any language with a `é̗` character beyond conlangs), but it becomes very useful for languages that use a lot of diacritics. imagine if we had to make a new set of characters for each set of possible diacritics! big waste of space, we shouldn't have done that!
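the combining behaviour can be poked at from python too. this is just a sketch using the standard `unicodedata` module; the variable names are assumptions for illustration:

```python
import unicodedata

# e + combining acute accent + combining acute accent below
s = "e\u0301\u0317"

for ch in s:
    # combining() gives the canonical combining class:
    # 0 for the base letter, non-zero for these two marks
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))

# a precomposed é (U+00E9) also exists; NFC normalization folds
# e + U+0301 into that single codepoint
composed = unicodedata.normalize("NFC", "e\u0301")
print(len("e\u0301"), len(composed))
```

so the same visible é can be one codepoint or two, which is why comparing strings usually wants a normalization step first.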

so, what is `U+FE0F`? well, it's a special codepoint called "Variation Selector-16". variation selectors are a reserved block of 16 unicode codepoints (`U+FE00` through `U+FE0F`). only some of them have defined uses so far, but among those currently in use are `U+FE0E` (VS15) and `U+FE0F` (VS16). from wikipedia: "VS15 and VS16 are reserved to request that a character should be displayed as text or as an emoji respectively." so, what's happening with ™️ is that it's combining a `U+2122` (™) and a `U+FE0F` (Variation Selector-16) to create an emoji version of ™. they're the same character, just that one has been instructed to become an emoji!
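as a rough sketch of what "instructing" a character looks like in code: the helpers below (`strip_selectors`, `as_emoji`, `as_text`) are hypothetical names, not part of any library, and they only append or remove the selector. whether the result actually renders differently is up to the font and platform.

```python
VS15 = "\uFE0E"  # Variation Selector-15: request text presentation
VS16 = "\uFE0F"  # Variation Selector-16: request emoji presentation

def strip_selectors(text):
    """remove VS15/VS16 so only the base characters remain"""
    return text.replace(VS15, "").replace(VS16, "")

def as_emoji(ch):
    """request emoji presentation for a single base character"""
    return strip_selectors(ch) + VS16

def as_text(ch):
    """request text presentation for a single base character"""
    return strip_selectors(ch) + VS15

tm = "\u2122"
print([f"U+{ord(c):04X}" for c in as_emoji(tm)])  # ['U+2122', 'U+FE0F']
print([f"U+{ord(c):04X}" for c in as_text(tm)])   # ['U+2122', 'U+FE0E']
```

stripping the selectors before comparing is one way to treat both forms as "the same character", which is what they are underneath.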


also, for the interested, here's the word "unicode" with a shit ton of combining characters: ù́̂̃̄̅̆̇̈̉n̖̗̘̙̐̑̒̓̔̕i̡̢̧̨̠̣̤̥̦̩c̴̵̶̷̸̰̱̲̳̹ò͇͈͉́͂̓̈́͆ͅd͓͔͕͖͙͐͑͒͗͘eͣͤͥͦͧͨͩ͢͠͡. what appears to be seven letters is actually 77 codepoints, taking up 147 bytes when encoded in utf-8. or 156 in utf-16. or 312 in utf-32. why does anyone use utf-16 if it's longer? historical reasons :3
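the byte counts can be reproduced with a stand-in string of the same shape. this is a sketch, not the exact string from the post: it assumes seven ascii letters, each followed by ten combining marks (here always `U+0300` to `U+0309`). it also assumes the utf-16 and utf-32 figures were counted with a byte order mark, which is what python's plain `"utf-16"` and `"utf-32"` codecs prepend:

```python
# seven letters, each followed by ten combining marks (U+0300..U+0309)
# note: a stand-in with the same shape, not the exact string from the post
marks = "".join(chr(cp) for cp in range(0x0300, 0x030A))
s = "".join(letter + marks for letter in "unicode")

print("codepoints", len(s))

# the "-le" variants write no byte order mark; the plain names prepend one
sizes = {}
for enc in ("utf-8", "utf-16-le", "utf-16", "utf-32-le", "utf-32"):
    sizes[enc] = len(s.encode(enc))
    print(enc, sizes[enc])
```

the arithmetic: 7 one-byte letters plus 70 two-byte marks is 147 bytes in utf-8; 77 × 2 = 154 in utf-16 and 77 × 4 = 308 in utf-32, and adding a 2-byte or 4-byte byte order mark gives 156 and 312.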

TL;DR: ™️ is ™︎ but instructed to be an emoji
Author Public Key
npub1rtu82sne2xmvygpga3f5q6fzkmu9ndlskemhnx5xr6vsnxgqam6syykppj