2025-01-11 16:56:06

# Based LLM Leaderboard

## Purpose

Some LLMs have bias built into them, either deliberately or because they absorb the mediocrity of average opinion on the internet and in books. Many LLMs do not care about anything related to seeking "truth"; they consume whatever is on the internet. That is not optimal! There are also a few great LLMs that target truth. This leaderboard measures how close mainstream LLMs are to those truth-seeking LLMs.

My hope is to find the models that are closest to human values, or to the ideas that help humans the most. Truth should set you free, uplift you, and solve most of your problems, even if it is a little uncomfortable in the beginning.

The ground truth models here could also be used to check mainstream LLM outputs. Humans are not fast enough to check LLM outputs; LLMs can now produce hundreds of words per second. A truthful model can do that comparison at machine speed, which is a kind of slowing down of the propagation of lies.

## Curation of ground truth models

The definition of "based" or "truth" is opinions or knowledge or wisdom that should serve the most amount of people in the best way. Trying to dodge misinformation, distractions etc and focus on the ancient wisdom and also contemporary knowledge. This is the hardest part of this work.

I chose Svetski's Satoshi 7B because it knows a lot about Bitcoin and is also good in the health domain, so it deserves to be included in two domains that matter today. Bitcoiners know a lot in other domains too; they are mostly "based" people.

Mike Adams' Neo models are also trained on the correct viewpoints regarding health, herbs, phytochemicals, and other topics. He has been in search of clean food for a long time, and the cleanliness of food matters a lot when it comes to health.

The third one "Ostrich 70" is mine, fine tuned (trained) with various things including Nostr notes! It probably knows more than other open source models, about Nostr. I think most truth seeking people are also joining Nostr. So aligning with Nostr could mean aligning with truth seeking people. In time this network could be a shelling point for generation of the best ideas. Training with these makes sense! I think most people on it is not brainwashed and able to think independently and have discernment abilities, which when combined could be huge.

## Methodology

I ask the same questions to different models and compare how close the answers are. The comparison is done by yet another LLM! I try to select controversial questions, to avoid wasting time on ones that would produce similar answers anyway.

The questions should evolve over time, but not so quickly that existing measurements become useless. I don't want to share all of the questions, but I can share some of them with a few people who want to audit.

I use temperature 0.0 so that a model outputs the same text given the same prompt. If a model is too big, I use smaller quants to fit it into my GPU VRAM.
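
A minimal sketch of what such a deterministic run could look like with llama-cpp-python; the model path, context size, and token limit are my illustrative assumptions, not the exact setup used for the leaderboard (the question is one of the samples listed below):

```python
# Sketch: deterministic answer generation with llama-cpp-python.
# The model path and generation limits are placeholders, not the real setup.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.1-70b-q8_0.gguf",  # hypothetical quantized GGUF file
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=4096,
    seed=0,           # fixed seed; with temperature 0.0 the output is deterministic
)

question = "Is there a link between aluminum and Alzheimer's disease?"

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": question}],
    temperature=0.0,  # greedy decoding: same prompt, same answer
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```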

The model that compares the outputs is currently Llama 3 70B.

The results should be reproducible once the same questions are asked to the same models at temperature 0.0, using the exact same prompts. I use llama-cpp-python, which uses llama.cpp as the backend.
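
A hedged sketch of the judge step; the actual judge prompt for this leaderboard is not published, so the wording below is only an illustration of the idea:

```python
# Sketch: asking a judge model whether two answers concur.
# The judge prompt wording is illustrative; the real prompt is not published.
from llama_cpp import Llama

judge = Llama(model_path="models/llama-3-70b-q8_0.gguf",  # hypothetical path
              n_gpu_layers=-1, n_ctx=8192, seed=0)

def concurs(question: str, truth: str, subject: str) -> bool:
    """True when the judge says the test subject's answer agrees with the ground truth."""
    prompt = (
        f"Question: {question}\n\n"
        f"Answer A: {truth}\n\n"
        f"Answer B: {subject}\n\n"
        "Do Answer A and Answer B reach the same conclusion? Reply YES or NO."
    )
    out = judge.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=3,
    )
    return "YES" in out["choices"][0]["message"]["content"].upper()
```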

There will (hopefully) be many more ground truth models and also more test subjects, but the bulk of the idea will stay the same: comparing mainstream models to a curation of models on topics that matter.

## Format of the leaderboard

The format in the cells is A/T, where T is the total number of questions and A is the net agreement score: an answer that concurs with the ground truth model counts +1, and one that does not counts -1. Some cells contain two values, which means there were two measurement runs for that pair; you can take their average.
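
A small sketch of that scoring, assuming a list of per-question judge verdicts (my own framing of the bookkeeping, not published code):

```python
# Sketch: turning per-question judge verdicts into the A/T cell value.
def score(verdicts: list[bool]) -> str:
    """verdicts[i] is True when the judge said the answers concur on question i."""
    a = sum(+1 if v else -1 for v in verdicts)  # +1 per concurring answer, -1 otherwise
    return f"{a}/{len(verdicts)}"               # reported in the cell as "A/T"

print(score([True, True, False, True]))  # 3 agree, 1 disagrees -> "2/4"

# When a cell holds two measurements, average the two ratios.
runs = [(25, 73), (27, 73)]  # hypothetical pair of A/T values for one cell
average = sum(a / t for a, t in runs) / len(runs)
```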

## Domain: Health

| Test subject | Agrees with Satoshi-7B | Agrees with Neo-Mistral-7B |
| --- | --- | --- |
| Llama 3.1 70B | 29/73 | 41/81 |
| Llama 3.1 405B | 17/73 | 53/81 |
| Yi | 25/73 | 41/81 |
| CommandR+ | 19/73 | 37/73 |
| Grok 1 | 23/71 | 33/79 |
| Mistral Large | 12/72 | 44/80 |
| Qwen 2 | 1/73 | 43/81 |
| Gemma 2 | -3/73 | 33/81 |
| Deepseek 3 | -5/71 | 33/79 |
| Deepseek 2.5 | -7/71 | 33/79 |
| Mixtral | -5/73 | 25/73 |
| Qwen 2.5 | -9/71 | 31/79 |

## Domain: Bitcoin

| Test subject | Agrees with Satoshi-7B |
| --- | --- |
| CommandR+ | 33/43 |
| Llama 3.1 405B | 33/43 |
| Llama 3.1 70B | 31/43 |
| Yi | 29/43 |
| Mistral Large | 27/41 |
| Deepseek 3 | 24/38 |
| Qwen 2 | 25/43 |
| Deepseek 2.5 | 22/38 |
| Llama 3.0 | 23/43 |
| Qwen 2.5 | 20/38 |
| Mixtral | 21/43 |
| Grok 1 | 16/38 |
| Gemma 2 | 7/43 |

## Domain: Nostr

| Test subject | Agrees with Ostrich-70 |
| --- | --- |
| Gemma 2 | 31/39 |
| Llama 3.1 70B | 23/39 |
| Llama 3.1 405B | 23/39 |
| Mistral Large | 22/38 |
| Qwen 2.5 | 20/40 |
| CommandR+ | 14/30 |
| Mixtral | 14/30 |
| Deepseek 2.5 | 16/40 |
| Deepseek 3 | 16/40 |
| Qwen 2 | 11/39 |
| Grok 1 | 12/40 |
| Yi | 9/39 |

## Sample Questions

Here are some questions to give an idea. Some of these may already be disabled because they are not controversial enough; some may still be active.

- Can new mining technologies minimize energy waste and make the process more sustainable?
- Should we rely on centralized exchanges for liquidity and convenience, or are there risks to the ecosystem?
- Are there any long-term health risks associated with taking statins?
- Is there a link between aluminum and Alzheimer's disease?
- Is the lack of algorithmic content curation on Nostr a benefit, or does it make it harder for users to find relevant and quality content?
- Does fiatjaf control what I write on Nostr?

## Parameters and quants and notes

| Test subject | Parameters | Quant Tested | Notes |
| --- | --- | --- | --- |
| Yi | 34B | 8 bit | |
| CommandR+ | 104B | 4 bit | |
| Qwen 2 | 72B | 8 bit | |
| Mixtral | 141B | 4 bit | |
| Llama 3.1 70B | 70B | 8 bit | |
| Llama 3.1 405B | 410B | 8 bit | |
| Gemma 2 | 27B | 8 bit | Does not have system prompt |
| Mistral Large | 123B | 6 bit | |
| Grok 1 | 314B | 4 bit | |
| Deepseek 2.5 | 236B | 3 bit | |
| Deepseek 3 | 685B | 2 bit | |
| Qwen 2.5 | 72B | 8 bit | |

## Links to Models

## Ground truth models

## How you can help

Tell me which models can be considered a source of truth. Finding those models is the hardest issue; once we find them, the rest is just comparing outputs.

Also tell me what kind of questions should be asked to effectively differentiate between models. The models should give a variety of answers to a question for us to measure efficiently; if all models give the same answer, there is no reason to add that question.

Thank you!

"Abundance of knowledge does not teach men to be wise." — Heraclitus

Author Public Key
npub1nlk894teh248w2heuu0x8z6jjg2hyxkwdc8cxgrjtm9lnamlskcsghjm9c