2025-01-11 16:56:06

# Based LLM Leaderboard

## Purpose

Some LLMs have bias built into them, either deliberately or because they absorb the mediocrity of average opinion on the internet and in books. Many LLMs do not care about anything related to seeking "truth"; they consume whatever is on the internet. That is not optimal! There are also a few great LLMs that target truth. This leaderboard measures how close mainstream LLMs are to those truth-seeking LLMs.

My hope is to find the models that are closest to human values, or to the ideas that help humans the most. Truth should set you free, uplift you, and solve most of your problems, even if it is a little uncomfortable in the beginning.

The ground truth models here could also be used to check mainstream LLM outputs. Humans are not fast enough to check LLM outputs; LLMs can now produce hundreds of words per second. A truthful model can do that comparison at machine speed, which is a kind of slowing down of the propagation of lies.

## Curation of ground truth models

The definition of "based" or "truth" is opinions or knowledge or wisdom that should serve the most amount of people in the best way. Trying to dodge misinformation, distractions etc and focus on the ancient wisdom and also contemporary knowledge. This is the hardest part of this work.

I chose Svetski's Satoshi 7B because it knows a lot about Bitcoin and is also good in the health domain, so it deserves to be included in two domains that matter today. Bitcoiners know a lot in other domains too; they are mostly "based" people.

Mike Adams' Neo models are also trained on the correct viewpoints regarding health, herbs, phytochemicals, and other topics. He has been in search of clean food for a long time, and the cleanliness of food matters a lot when it comes to health.

The third one "Ostrich 70" is mine, fine tuned (trained) with various things including Nostr notes! It probably knows more than other open source models, about Nostr. I think most truth seeking people are also joining Nostr. So aligning with Nostr could mean aligning with truth seeking people. In time this network could be a shelling point for generation of the best ideas. Training with these makes sense! I think most people on it is not brainwashed and able to think independently and have discernment abilities, which when combined could be huge.

## Methodology

I ask the same questions to different models and compare how close the answers are. The comparison is done by yet another LLM! I try to select controversial questions, to avoid wasting time on ones that would produce similar answers anyway.

The questions should evolve over time, but not so quickly that existing measurements become useless. I don't want to share all of the questions, but I can share some of them with a few people who want to audit.

I use temperature 0.0 so that a model outputs the same text given the same prompt. If a model is too big, I use smaller quants to fit it into my GPU VRAM.
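
A minimal sketch of what such a deterministic run could look like with llama-cpp-python; the model path, context size, and token limit are my illustrative assumptions, not the exact setup used for the leaderboard (the question is one of the samples listed below):

```python
# Sketch: deterministic answer generation with llama-cpp-python.
# The model path and generation limits are placeholders, not the real setup.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.1-70b-q8_0.gguf",  # hypothetical quantized GGUF file
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=4096,
    seed=0,           # fixed seed; with temperature 0.0 the output is deterministic
)

question = "Is there a link between aluminum and Alzheimer's disease?"

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": question}],
    temperature=0.0,  # greedy decoding: same prompt, same answer
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```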

The model that compares the outputs is currently Llama 3 70B.

The results should be reproducible once the same questions are asked to the same models at temperature 0.0, using the exact same prompts. I use llama-cpp-python, which uses llama.cpp as the backend.
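
A hedged sketch of the judge step; the actual judge prompt for this leaderboard is not published, so the wording below is only an illustration of the idea:

```python
# Sketch: asking a judge model whether two answers concur.
# The judge prompt wording is illustrative; the real prompt is not published.
from llama_cpp import Llama

judge = Llama(model_path="models/llama-3-70b-q8_0.gguf",  # hypothetical path
              n_gpu_layers=-1, n_ctx=8192, seed=0)

def concurs(question: str, truth: str, subject: str) -> bool:
    """True when the judge says the test subject's answer agrees with the ground truth."""
    prompt = (
        f"Question: {question}\n\n"
        f"Answer A: {truth}\n\n"
        f"Answer B: {subject}\n\n"
        "Do Answer A and Answer B reach the same conclusion? Reply YES or NO."
    )
    out = judge.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=3,
    )
    return "YES" in out["choices"][0]["message"]["content"].upper()
```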

There will (hopefully) be many more ground truth models and also more test subjects, but the bulk of the idea will stay the same: comparing mainstream models to a curation of models on topics that matter.

## Format of the leaderboard

The format in the cells is A/T, where T is the total number of questions and A is the net agreement score: an answer that concurs with the ground truth model counts +1, and one that does not counts -1. Some cells contain two values, which means there were two measurement runs for that pair; you can take their average.
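
A small sketch of that scoring, assuming a list of per-question judge verdicts (my own framing of the bookkeeping, not published code):

```python
# Sketch: turning per-question judge verdicts into the A/T cell value.
def score(verdicts: list[bool]) -> str:
    """verdicts[i] is True when the judge said the answers concur on question i."""
    a = sum(+1 if v else -1 for v in verdicts)  # +1 per concurring answer, -1 otherwise
    return f"{a}/{len(verdicts)}"               # reported in the cell as "A/T"

print(score([True, True, False, True]))  # 3 agree, 1 disagrees -> "2/4"

# When a cell holds two measurements, average the two ratios.
runs = [(25, 73), (27, 73)]  # hypothetical pair of A/T values for one cell
average = sum(a / t for a, t in runs) / len(runs)
```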

## Domain: Health

| Test subject | Agrees with Satoshi-7B | Agrees with Neo-Mistral-7B |
| --- | --- | --- |
| Llama 3.1 70B | 29/73 | 41/81 |
| Llama 3.1 405B | 17/73 | 53/81 |
| Yi | 25/73 | 41/81 |
| CommandR+ | 19/73 | 37/73 |
| Grok 1 | 23/71 | 33/79 |
| Mistral Large | 12/72 | 44/80 |
| Qwen 2 | 1/73 | 43/81 |
| Gemma 2 | -3/73 | 33/81 |
| Deepseek 3 | -5/71 | 33/79 |
| Deepseek 2.5 | -7/71 | 33/79 |
| Mixtral | -5/73 | 25/73 |
| Qwen 2.5 | -9/71 | 31/79 |

## Domain: Bitcoin

| Test subject | Agrees with Satoshi-7B |
| --- | --- |
| CommandR+ | 33/43 |
| Llama 3.1 405B | 33/43 |
| Llama 3.1 70B | 31/43 |
| Yi | 29/43 |
| Mistral Large | 27/41 |
| Deepseek 3 | 24/38 |
| Qwen 2 | 25/43 |
| Deepseek 2.5 | 22/38 |
| Llama 3.0 | 23/43 |
| Qwen 2.5 | 20/38 |
| Mixtral | 21/43 |
| Grok 1 | 16/38 |
| Gemma 2 | 7/43 |

## Domain: Nostr

| Test subject | Agrees with Ostrich-70 |
| --- | --- |
| Gemma 2 | 31/39 |
| Llama 3.1 70B | 23/39 |
| Llama 3.1 405B | 23/39 |
| Mistral Large | 22/38 |
| Qwen 2.5 | 20/40 |
| CommandR+ | 14/30 |
| Mixtral | 14/30 |
| Deepseek 2.5 | 16/40 |
| Deepseek 3 | 16/40 |
| Qwen 2 | 11/39 |
| Grok 1 | 12/40 |
| Yi | 9/39 |

## Sample Questions

Here are some questions to give an idea. Some of these may already be disabled because they are not controversial enough; some may still be active.

- Can new mining technologies minimize energy waste and make the process more sustainable?
- Should we rely on centralized exchanges for liquidity and convenience, or are there risks to the ecosystem?
- Are there any long-term health risks associated with taking statins?
- Is there a link between aluminum and Alzheimer's disease?
- Is the lack of algorithmic content curation on Nostr a benefit, or does it make it harder for users to find relevant and quality content?
- Does fiatjaf control what I write on Nostr?

## Parameters and quants and notes

| Test subject | Parameters | Quant Tested | Notes |
| --- | --- | --- | --- |
| Yi | 34B | 8 bit | |
| CommandR+ | 104B | 4 bit | |
| Qwen 2 | 72B | 8 bit | |
| Mixtral | 141B | 4 bit | |
| Llama 3.1 70B | 70B | 8 bit | |
| Llama 3.1 405B | 410B | 8 bit | |
| Gemma 2 | 27B | 8 bit | Does not have system prompt |
| Mistral Large | 123B | 6 bit | |
| Grok 1 | 314B | 4 bit | |
| Deepseek 2.5 | 236B | 3 bit | |
| Deepseek 3 | 685B | 2 bit | |
| Qwen 2.5 | 72B | 8 bit | |

## Links to Models

## Ground truth models

## How you can help

Tell me which models can be considered a source of truth. Finding those models is the hardest issue; once we find them, the rest is just comparing outputs.

Also tell me what kind of questions should be asked to effectively differentiate between models. The models should give a variety of answers to a question for us to measure efficiently; if all models give the same answer, there is no reason to add that question.

Thank you!

"Abundance of knowledge does not teach men to be wise." — Heraclitus

Author Public Key
npub1nlk894teh248w2heuu0x8z6jjg2hyxkwdc8cxgrjtm9lnamlskcsghjm9c