
Comments (136)

  • franze
    I created "apfel" (https://github.com/Arthur-Ficial/apfel), a CLI for Apple's on-device local foundation model (Apple Intelligence). Yeah, it's super limited with its 4k context window and very common false-positive guardrails (just ask it to describe a color)... but still, using it in bash scripts that just work without calling home or incurring extra costs feels super powerful.
  • babblingfish
    LLMs on device are the future. It's more secure, it solves the problem of inference demand outstripping data-center supply, and it would use less electricity. It's just a matter of getting the performance good enough. Most users don't need frontier-model performance.
  • Yukonv
    Good to see Ollama is catching up with the times for inference on Mac. MLX-powered inference makes a big difference, especially on M5 as their graphs point out. What has really been a game changer for my workflow is using https://omlx.ai/, which has SSD KV cold caching. I no longer have to worry about a session falling out of memory and needing to prefill again. Combine that with the M5 Max prefill speed and more time is spent on generation than waiting for a 50k+ context window to process.
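    A minimal sketch of the SSD KV cold-caching idea described above, assuming nothing about omlx's actual implementation: hash the prompt, and if a cached KV blob for that hash already exists on disk, load it instead of redoing the prefill. `expensive_prefill` here is a toy stand-in for building a real attention KV cache.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp())

def expensive_prefill(prompt: str) -> list:
    # Toy stand-in for computing the attention KV cache over the prompt.
    return [ord(c) % 97 for c in prompt]

def prefill_with_disk_cache(prompt: str):
    """Return (kv_cache, was_cached): reuse a persisted prefill if one exists."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    path = CACHE_DIR / (key + ".kv")
    if path.exists():
        # Warm path: load the previously computed cache from disk.
        return pickle.loads(path.read_bytes()), True
    # Cold path: compute once, then persist for the next session.
    kv = expensive_prefill(prompt)
    path.write_bytes(pickle.dumps(kv))
    return kv, False
```

    The second session with the same long prompt then pays only an SSD read rather than a full prefill pass.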
  • robotswantdata
    Why are people still using Ollama? Seriously. Lemonade or even llama.cpp are much better optimised and arguably just as easy to use.
  • LuxBennu
    Already running Qwen 70B 4-bit on an M2 Max 96GB through llama.cpp and it's pretty solid for day-to-day stuff. The MLX switch is interesting because Ollama was basically shelling out to llama.cpp on Mac before, so native MLX should mean better memory handling on Apple silicon. Curious to see how it compares on the bigger models vs the GGUF path.
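    As a rough check on why a 70B model at 4-bit fits in 96 GB of unified memory while FP16 would not (weights only; the KV cache and runtime overhead come on top):

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    # Weight storage only: parameter count times bits, converted to gigabytes.
    return n_params * bits_per_param / 8 / 1e9

fp16_gb = weight_memory_gb(70e9, 16)  # 140.0 GB: would not fit in 96 GB
q4_gb = weight_memory_gb(70e9, 4)     # 35.0 GB: fits with room for context
```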
  • daveorzach
    What are significant differences between Ollama and LM Studio now? I haven’t used Ollama because it was missing MLX when I started using LLM GUIs.
  • codelion
    How does it compare to some of the newer MLX inference engines like optiq that support turboquantization? https://mlx-optiq.pages.dev/
  • janandonly
    > Please make sure you have a Mac with more than 32GB of unified memory.

    Yeah, I can still save money by buying a cheaper device with less RAM and just paying my PPQ.AI or OpenRouter.com fees.
  • harel
    What would be the non-Mac computer to run these models locally at the same performance profile? Are there any similar Linux ARM-based computers that can reach the same level?
  • dial9-1
    Still waiting for the day I can comfortably run Claude Code with local LLMs on macOS with only 16GB of RAM.
  • mfa1999
    How does this compare to llama.cpp in terms of performance?
  • AugSun
    "We can run your dumbed-down models faster":

    > The use of NVFP4 results in a 3.5x reduction in model memory footprint relative to FP16 and a 1.8x reduction compared to FP8, while maintaining model accuracy with less than 1% degradation on key language modeling tasks for some models.
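    The quoted ratios are consistent with NVFP4's layout, assuming 4-bit values with one FP8 scale shared per 16-element block (the per-tensor FP32 scale is negligible and ignored in this back-of-envelope check):

```python
# Effective bits per parameter: 4-bit value plus 8 scale bits per 16 elements.
BITS_NVFP4 = 4 + 8 / 16    # 4.5 bits

vs_fp16 = 16 / BITS_NVFP4  # about 3.56x, matching the ~3.5x claim
vs_fp8 = 8 / BITS_NVFP4    # about 1.78x, matching the ~1.8x claim
```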
  • puskuruk
    Finally! My local infra has been waiting for this for months!
  • brcmthrowaway
    What is the difference between Ollama, llama.cpp, ggml and gguf?
  • darshanmakwana
    Really nice to see this!