## Tested hardware & providers

We ran our tests on the following hardware:

- Laptop: Intel(R) Core(TM) i7-12700H CPU with an NVIDIA GeForce RTX 3060 (mobile) GPU
- Cloud instances: Scaleway GPU-3070-S (NVIDIA GeForce RTX 3070), Lambda Cloud gpu_1x_a10 (NVIDIA A10), AWS g5.xlarge (NVIDIA A10G), Scaleway L4-1-24G (NVIDIA L4)

## Tested LLMs

The results are available for the following LLMs (cf. Ollama hub):

- deepseek-coder:6.7b-instruct
- pxlksr/opencodeinterpreter-ds:6.7b
- dolphin-mistral:7b-v2.6-dpo-laser
- codeqwen:7b-chat-v1.5
- dolphin-llama3:8b-v2.9
- phi3:3.8b-mini-instruct-4k

and the following quantization formats: q3_K_M, q4_K_M, q4_1, q5_K_M.
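As a sketch of how the per-request throughput numbers below can be derived, assuming Ollama's `/api/generate` JSON response (its `prompt_eval_count`/`eval_count` fields report token counts and `prompt_eval_duration`/`eval_duration` report nanosecond timings):

```python
def throughput(resp: dict) -> tuple[float, float]:
    """Compute ingestion and generation throughput (tok/s) from an
    Ollama /api/generate response. Durations are in nanoseconds."""
    ingestion = resp["prompt_eval_count"] / resp["prompt_eval_duration"] * 1e9
    generation = resp["eval_count"] / resp["eval_duration"] * 1e9
    return ingestion, generation

# Example with synthetic numbers (not taken from the tables below):
resp = {
    "prompt_eval_count": 120, "prompt_eval_duration": 1_500_000_000,  # 1.5 s
    "eval_count": 256, "eval_duration": 5_000_000_000,                # 5 s
}
ing, gen = throughput(resp)
print(f"{ing:.1f} tok/s ingestion, {gen:.1f} tok/s generation")
```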

## Throughput benchmark
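Each table reports the mean ingestion and generation throughput over repeated runs, with the spread in parentheses. A minimal sketch of how such a column entry can be produced from per-run tok/s samples (assuming the spread is the sample standard deviation):

```python
import statistics

def summarize(samples: list[float]) -> str:
    """Format per-run tok/s samples as 'mean tok/s (±std)', the column
    format used in the tables below (sample standard deviation assumed)."""
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)
    return f"{mean:.2f} tok/s (±{std:.2f})"

print(summarize([23.1, 24.0, 23.9]))  # e.g. three generation runs
```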

### NVIDIA GeForce RTX 3060 (mobile)

| Model | Ingestion mean (std) | Generation mean (std) |
|---|---|---|
| deepseek-coder:6.7b-instruct-q5_K_M | 35.43 tok/s (±3.46) | 23.68 tok/s (±0.74) |
| deepseek-coder:6.7b-instruct-q4_K_M | 72.27 tok/s (±10.69) | 36.82 tok/s (±1.25) |
| deepseek-coder:6.7b-instruct-q3_K_M | 90.1 tok/s (±32.43) | 50.34 tok/s (±1.28) |
| pxlksr/opencodeinterpreter-ds:6.7b-Q4_K_M | 78.94 tok/s (±10.2) | 37.95 tok/s (±1.65) |
| dolphin-mistral:7b-v2.6-dpo-laser-q4_K_M | 126.75 tok/s (±31.5) | 50.05 tok/s (±0.84) |
| dolphin-mistral:7b-v2.6-dpo-laser-q3_K_M | 89.47 tok/s (±29.91) | 47.09 tok/s (±0.67) |
| codeqwen:7b-chat-v1.5-q4_1 | 171.72 tok/s (±53.37) | 54.74 tok/s (±0.82) |
| dolphin-llama3:8b-v2.9-q4_K_M | 131.89 tok/s (±33.37) | 50.81 tok/s (±0.66) |
| phi3:3.8b-mini-instruct-4k-q4_K_M | 271.40 tok/s (±52.48) | 88.43 tok/s (±13.22) |

### NVIDIA GeForce RTX 3070 (Scaleway GPU-3070-S)

| Model | Ingestion mean (std) | Generation mean (std) |
|---|---|---|
| deepseek-coder:6.7b-instruct-q4_K_M | 266.98 tok/s (±95.63) | 75.53 tok/s (±1.56) |
| deepseek-coder:6.7b-instruct-q3_K_M | 141.43 tok/s (±50.4) | 73.69 tok/s (±1.61) |
| pxlksr/opencodeinterpreter-ds:6.7b-Q4_K_M | 285.81 tok/s (±73.55) | 75.14 tok/s (±3.13) |
| dolphin-mistral:7b-v2.6-dpo-laser-q4_K_M | 234.2 tok/s (±79.38) | 71.54 tok/s (±1.0) |
| dolphin-mistral:7b-v2.6-dpo-laser-q3_K_M | 114.54 tok/s (±38.24) | 69.29 tok/s (±0.98) |

### NVIDIA A10 (Lambda Cloud gpu_1x_a10)

| Model | Ingestion mean (std) | Generation mean (std) |
|---|---|---|
| deepseek-coder:6.7b-instruct-q4_K_M | 208.65 tok/s (±74.02) | 78.68 tok/s (±1.64) |
| deepseek-coder:6.7b-instruct-q3_K_M | 111.84 tok/s (±39.9) | 71.66 tok/s (±1.75) |
| pxlksr/opencodeinterpreter-ds:6.7b-Q4_K_M | 226.66 tok/s (±65.65) | 77.26 tok/s (±2.72) |
| dolphin-mistral:7b-v2.6-dpo-laser-q4_K_M | 202.43 tok/s (±69.55) | 73.9 tok/s (±0.87) |
| dolphin-mistral:7b-v2.6-dpo-laser-q3_K_M | 112.82 tok/s (±38.46) | 66.98 tok/s (±0.79) |

### NVIDIA A10G (AWS g5.xlarge)

| Model | Ingestion mean (std) | Generation mean (std) |
|---|---|---|
| deepseek-coder:6.7b-instruct-q4_K_M | 186.61 tok/s (±66.03) | 79.62 tok/s (±1.52) |
| deepseek-coder:6.7b-instruct-q3_K_M | 99.83 tok/s (±35.41) | 84.47 tok/s (±1.69) |
| pxlksr/opencodeinterpreter-ds:6.7b-Q4_K_M | 212.08 tok/s (±86.58) | 79.02 tok/s (±3.35) |
| dolphin-mistral:7b-v2.6-dpo-laser-q4_K_M | 187.2 tok/s (±62.24) | 75.91 tok/s (±1.0) |
| dolphin-mistral:7b-v2.6-dpo-laser-q3_K_M | 102.36 tok/s (±34.29) | 81.23 tok/s (±1.02) |

### NVIDIA L4 (Scaleway L4-1-24G)

| Model | Ingestion mean (std) | Generation mean (std) |
|---|---|---|
| deepseek-coder:6.7b-instruct-q4_K_M | 213.46 tok/s (±76.24) | 49.97 tok/s (±1.01) |
| deepseek-coder:6.7b-instruct-q3_K_M | 118.87 tok/s (±43.35) | 54.72 tok/s (±1.31) |
| pxlksr/opencodeinterpreter-ds:6.7b-Q4_K_M | 225.62 tok/s (±60.21) | 49.39 tok/s (±1.9) |
| dolphin-mistral:7b-v2.6-dpo-laser-q4_K_M | 211.52 tok/s (±72.76) | 47.27 tok/s (±0.58) |
| dolphin-mistral:7b-v2.6-dpo-laser-q3_K_M | 120.13 tok/s (±41.09) | 51.9 tok/s (±0.71) |

If you’re looking for the latest benchmark results, head over here.