John Dee on Nostr: ollama seems to load as much as it can into VRAM, and the rest into RAM. Llama 3.1 ...
ollama seems to load as much as it can into VRAM, and the rest into RAM. Llama 3.1 70b is running a lot slower than 8b on a 4090, but it's usable. The ollama library has a bunch of different versions that appear to be quantized:
https://ollama.com/library/llama3.1
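Not part of the original note, but a minimal sketch of what driving that setup looks like from code, assuming ollama's default local endpoint (localhost:11434) and that the model tag has already been pulled; the tag and prompt here are illustrative, not from the post:

import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # ollama's default local endpoint

payload = {
    # The 70b tag from the ollama library (quantized by default) still
    # overflows a 4090's 24 GB of VRAM, so ollama keeps what fits on the
    # GPU and offloads the remaining layers to system RAM -- slower, but usable.
    "model": "llama3.1:70b",
    "prompt": "In one paragraph, why is partial GPU offloading slower?",
    "stream": False,  # ask for a single JSON response instead of a token stream
}

req = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

print(result["response"])

With "stream" left at the API's default (true), the same endpoint returns tokens incrementally, which is friendlier when the 70b model is generating slowly from partially offloaded layers.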
Published at 2024-07-25 03:59:55
Event JSON
{
  "id": "6d74b5f0ee5eae9b9189ba4fe0c33fafae5d5f3d199ac776a7fa8a203394d22a",
  "pubkey": "fe32298e29aab4ec2911c0dbdda485c073f869c5444ee92f7ae247ed20516265",
  "created_at": 1721879995,
  "kind": 1,
  "tags": [
    [
      "e",
      "00003c9dda7204845a2ef6a0a5a08d7572caf85dc738e34a05648f113a342f49",
      "wss://nostr.wine/",
      "root"
    ],
    [
      "e",
      "00003c9dda7204845a2ef6a0a5a08d7572caf85dc738e34a05648f113a342f49",
      "wss://nostr.wine/",
      "reply"
    ],
    [
      "p",
      "b2d670de53b27691c0c3400225b65c35a26d06093bcc41f48ffc71e0907f9d4a",
      "",
      "mention"
    ]
  ],
  "content": "ollama seems to load as much as it can into VRAM, and the rest into RAM. Llama 3.1 70b is running a lot slower than 8b on a 4090, but it's usable. The ollama library has a bunch different versions that appear to be quantized: https://ollama.com/library/llama3.1",
  "sig": "fcee7e83df9bb9c7df2e2a1bd934254e1d34a307c0a04b24f6f23b2d7bc54a7a5438fd2304e889c4a5ae85ce3769dbd4a9e96afe408e0adb7f6541a7c4694b0e"
}
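For context on the raw event above, a hedged sketch (not part of the page) of how the "id" field is derived under NIP-01: it is the SHA-256 of the compact JSON serialization of [0, pubkey, created_at, kind, tags, content]. This assumes Python's json.dumps with compact separators matches the canonical serialization for this particular note (no tricky escaping involved):

import hashlib
import json

def nostr_event_id(pubkey, created_at, kind, tags, content):
    # NIP-01: id = sha256 of the UTF-8 compact serialization of this array
    serialized = json.dumps(
        [0, pubkey, created_at, kind, tags, content],
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()

# Fields copied verbatim from the event JSON above
tags = [
    ["e", "00003c9dda7204845a2ef6a0a5a08d7572caf85dc738e34a05648f113a342f49",
     "wss://nostr.wine/", "root"],
    ["e", "00003c9dda7204845a2ef6a0a5a08d7572caf85dc738e34a05648f113a342f49",
     "wss://nostr.wine/", "reply"],
    ["p", "b2d670de53b27691c0c3400225b65c35a26d06093bcc41f48ffc71e0907f9d4a",
     "", "mention"],
]
content = ("ollama seems to load as much as it can into VRAM, and the rest "
           "into RAM. Llama 3.1 70b is running a lot slower than 8b on a "
           "4090, but it's usable. The ollama library has a bunch different "
           "versions that appear to be quantized: "
           "https://ollama.com/library/llama3.1")

event_id = nostr_event_id(
    "fe32298e29aab4ec2911c0dbdda485c073f869c5444ee92f7ae247ed20516265",
    1721879995, 1, tags, content,
)
print(event_id)  # should equal the "id" field above if the serialization matches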