Cloud GPU selection
Find the instance type and cloud provider that fit your needs.
Tested hardware & providers
We ran our tests on the following hardware:
- NVIDIA GeForce RTX 3060 (mobile)
- NVIDIA GeForce RTX 3070 (Scaleway GPU-3070-S)
- NVIDIA A10 (Lambda Cloud gpu_1x_a10)
- NVIDIA A10G (AWS g5.xlarge)
- NVIDIA L4 (Scaleway L4-1-24G)
The laptop setup pairs the RTX 3060 (mobile) with an Intel(R) Core(TM) i7-12700H CPU.
Tested LLMs
The results are available for the following LLMs (cf. Ollama hub):
- Deepseek Coder 6.7b - instruct (Ollama, HuggingFace)
- OpenCodeInterpreter 6.7b (Ollama, HuggingFace, paper)
- Dolphin Mistral 7b (Ollama, HuggingFace, paper)
- CodeQwen 1.5 7b (Ollama, HuggingFace, blog)
- LLaMA 3 8b (Ollama, HuggingFace, blog)
- Phi 3 3.8b (Ollama, HuggingFace, paper)
- Coming soon: StarChat v2 (HuggingFace, paper)
and the following quantization formats: q3_K_M, q4_K_M, q5_K_M.
Throughput benchmark
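As a rough guide to reading the tables in this section, the sketch below shows one way throughput figures of this kind could be derived. It assumes each run is a response from Ollama's generate API, which reports token counts (`prompt_eval_count`, `eval_count`) and durations in nanoseconds (`prompt_eval_duration`, `eval_duration`). The helper names `throughputs` and `summarize`, the sample values, and the use of the sample standard deviation are assumptions made for illustration, not necessarily the procedure behind the published numbers.

```python
from statistics import mean, stdev

NS_PER_S = 1_000_000_000  # Ollama reports durations in nanoseconds


def throughputs(response: dict) -> tuple[float, float]:
    """Return (ingestion, generation) throughput in tok/s for one response."""
    ingestion = response["prompt_eval_count"] * NS_PER_S / response["prompt_eval_duration"]
    generation = response["eval_count"] * NS_PER_S / response["eval_duration"]
    return ingestion, generation


def summarize(responses: list[dict]) -> dict[str, tuple[float, float]]:
    """Aggregate per-run throughputs into (mean, sample std) pairs."""
    ingestion, generation = zip(*(throughputs(r) for r in responses))
    return {
        "ingestion": (mean(ingestion), stdev(ingestion)),
        "generation": (mean(generation), stdev(generation)),
    }


# Hypothetical responses: made-up values, not measurements from this page.
runs = [
    {"prompt_eval_count": 100, "prompt_eval_duration": 500_000_000,
     "eval_count": 50, "eval_duration": 1_000_000_000},
    {"prompt_eval_count": 100, "prompt_eval_duration": 400_000_000,
     "eval_count": 60, "eval_duration": 1_000_000_000},
]

stats = summarize(runs)
for phase, (avg, std) in stats.items():
    print(f"{phase}: {avg:.2f} tok/s (±{std:.2f})")
```

Ingestion (prompt processing) and generation (decoding) are kept separate because they stress the hardware differently, which is consistent with the much larger spread in the ingestion column below.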
NVIDIA GeForce RTX 3060 (mobile)
Model | Ingestion mean (std) | Generation mean (std) |
---|---|---|
deepseek-coder:6.7b-instruct-q5_K_M | 35.43 tok/s (±3.46) | 23.68 tok/s (±0.74) |
deepseek-coder:6.7b-instruct-q4_K_M | 72.27 tok/s (±10.69) | 36.82 tok/s (±1.25) |
deepseek-coder:6.7b-instruct-q3_K_M | 90.1 tok/s (±32.43) | 50.34 tok/s (±1.28) |
pxlksr/opencodeinterpreter-ds:6.7b-Q4_K_M | 78.94 tok/s (±10.2) | 37.95 tok/s (±1.65) |
dolphin-mistral:7b-v2.6-dpo-laser-q4_K_M | 126.75 tok/s (±31.5) | 50.05 tok/s (±0.84) |
dolphin-mistral:7b-v2.6-dpo-laser-q3_K_M | 89.47 tok/s (±29.91) | 47.09 tok/s (±0.67) |
codeqwen:7b-chat-v1.5-q4_1 | 171.72 tok/s (±53.37) | 54.74 tok/s (±0.82) |
dolphin-llama3:8b-v2.9-q4_K_M | 131.89 tok/s (±33.37) | 50.81 tok/s (±0.66) |
phi3:3.8b-mini-instruct-4k-q4_K_M | 271.40 tok/s (±52.48) | 88.43 tok/s (±13.22) |
NVIDIA GeForce RTX 3070 (Scaleway GPU-3070-S)
Model | Ingestion mean (std) | Generation mean (std) |
---|---|---|
deepseek-coder:6.7b-instruct-q4_K_M | 266.98 tok/s (±95.63) | 75.53 tok/s (±1.56) |
deepseek-coder:6.7b-instruct-q3_K_M | 141.43 tok/s (±50.4) | 73.69 tok/s (±1.61) |
pxlksr/opencodeinterpreter-ds:6.7b-Q4_K_M | 285.81 tok/s (±73.55) | 75.14 tok/s (±3.13) |
dolphin-mistral:7b-v2.6-dpo-laser-q4_K_M | 234.2 tok/s (±79.38) | 71.54 tok/s (±1.0) |
dolphin-mistral:7b-v2.6-dpo-laser-q3_K_M | 114.54 tok/s (±38.24) | 69.29 tok/s (±0.98) |
NVIDIA A10 (Lambda Cloud gpu_1x_a10)
Model | Ingestion mean (std) | Generation mean (std) |
---|---|---|
deepseek-coder:6.7b-instruct-q4_K_M | 208.65 tok/s (±74.02) | 78.68 tok/s (±1.64) |
deepseek-coder:6.7b-instruct-q3_K_M | 111.84 tok/s (±39.9) | 71.66 tok/s (±1.75) |
pxlksr/opencodeinterpreter-ds:6.7b-Q4_K_M | 226.66 tok/s (±65.65) | 77.26 tok/s (±2.72) |
dolphin-mistral:7b-v2.6-dpo-laser-q4_K_M | 202.43 tok/s (±69.55) | 73.9 tok/s (±0.87) |
dolphin-mistral:7b-v2.6-dpo-laser-q3_K_M | 112.82 tok/s (±38.46) | 66.98 tok/s (±0.79) |
NVIDIA A10G (AWS g5.xlarge)
Model | Ingestion mean (std) | Generation mean (std) |
---|---|---|
deepseek-coder:6.7b-instruct-q4_K_M | 186.61 tok/s (±66.03) | 79.62 tok/s (±1.52) |
deepseek-coder:6.7b-instruct-q3_K_M | 99.83 tok/s (±35.41) | 84.47 tok/s (±1.69) |
pxlksr/opencodeinterpreter-ds:6.7b-Q4_K_M | 212.08 tok/s (±86.58) | 79.02 tok/s (±3.35) |
dolphin-mistral:7b-v2.6-dpo-laser-q4_K_M | 187.2 tok/s (±62.24) | 75.91 tok/s (±1.0) |
dolphin-mistral:7b-v2.6-dpo-laser-q3_K_M | 102.36 tok/s (±34.29) | 81.23 tok/s (±1.02) |
NVIDIA L4 (Scaleway L4-1-24G)
Model | Ingestion mean (std) | Generation mean (std) |
---|---|---|
deepseek-coder:6.7b-instruct-q4_K_M | 213.46 tok/s (±76.24) | 49.97 tok/s (±1.01) |
deepseek-coder:6.7b-instruct-q3_K_M | 118.87 tok/s (±43.35) | 54.72 tok/s (±1.31) |
pxlksr/opencodeinterpreter-ds:6.7b-Q4_K_M | 225.62 tok/s (±60.21) | 49.39 tok/s (±1.9) |
dolphin-mistral:7b-v2.6-dpo-laser-q4_K_M | 211.52 tok/s (±72.76) | 47.27 tok/s (±0.58) |
dolphin-mistral:7b-v2.6-dpo-laser-q3_K_M | 120.13 tok/s (±41.09) | 51.9 tok/s (±0.71) |
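To compare models on a given GPU, the rows above can be parsed and ranked programmatically. The following is a minimal sketch, using the NVIDIA L4 rows as input; the helper names `parse_cell` and `parse_row` are hypothetical, and ranking by mean generation throughput is just one possible selection criterion.

```python
import re

# Matches cells such as "213.46 tok/s (±76.24)"; also tolerates "toks/s".
CELL = re.compile(r"([\d.]+)\s*toks?/s\s*\(±([\d.]+)\)")


def parse_cell(cell: str) -> tuple[float, float]:
    """Extract (mean, std) from a single throughput cell."""
    match = CELL.search(cell)
    if match is None:
        raise ValueError(f"unrecognized throughput cell: {cell!r}")
    return float(match.group(1)), float(match.group(2))


def parse_row(row: str) -> tuple[str, tuple[float, float], tuple[float, float]]:
    """Split a table row into (model, ingestion stats, generation stats)."""
    model, ingestion, generation = (
        part.strip() for part in row.strip().strip("|").split("|")
    )
    return model, parse_cell(ingestion), parse_cell(generation)


# Rows copied from the NVIDIA L4 table above.
L4_ROWS = """\
deepseek-coder:6.7b-instruct-q4_K_M | 213.46 tok/s (±76.24) | 49.97 tok/s (±1.01) |
deepseek-coder:6.7b-instruct-q3_K_M | 118.87 tok/s (±43.35) | 54.72 tok/s (±1.31) |
pxlksr/opencodeinterpreter-ds:6.7b-Q4_K_M | 225.62 tok/s (±60.21) | 49.39 tok/s (±1.9) |
dolphin-mistral:7b-v2.6-dpo-laser-q4_K_M | 211.52 tok/s (±72.76) | 47.27 tok/s (±0.58) |
dolphin-mistral:7b-v2.6-dpo-laser-q3_K_M | 120.13 tok/s (±41.09) | 51.9 tok/s (±0.71) |
"""

rows = [parse_row(line) for line in L4_ROWS.splitlines() if line.strip()]

# Rank by mean generation throughput, highest first.
ranked = sorted(rows, key=lambda row: row[2][0], reverse=True)
best = ranked[0]
print(f"Fastest generation on L4: {best[0]} at {best[2][0]} tok/s")
```

Generation throughput is used as the sort key here because it shows the smallest run-to-run deviation in the tables; sorting on `row[1][0]` instead would rank by ingestion throughput.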
If you’re looking for the latest benchmark results, head over here