Learning OpenClaw and fighting local LLMs

I’ve spent the last several days deep in the world of OpenClaw, and it’s not intuitive. Story time for the struggles, and why Qwen3-Coder-Next has been an absolute dream.

The Dream

I wanted to run a local LLM that could power OpenClaw with:

  • Reliable tool usage
  • Fast inference speeds
  • Long context handling (128K+ for serious work)
  • Consistent reasoning without losing the thread

I started with the two “best” models I could find that would run on an NVIDIA GB10 (128 GB unified memory): GPT-OSS:120B and GLM-4.7-Flash.

GPT-OSS:120B - The Promise That Failed

[[https://artificialanalysis.ai]] always ranks GPT-OSS:120B highly, and it seems to handle general tasks just fine. The model is smart, claims to know how to use tools, and has great reasoning capabilities. But… it falls apart on large-context tasks and loses the needle in the haystack.

Tool usage

Initial tool usage was solid… key word “initial”. Web search integration worked… when it didn’t hallucinate how to use it after using it ten times the day before, or even just a few messages earlier the same day. This was a recurring theme with most tools, even after using the recommended temperature and top_p parameters for tool usage. Sometimes it would use web search correctly; other times it would ignore the tools entirely and say “I don’t have access to web search, I am an LLM with date XXXXX as my information cutoff”. Or, after setting up a web-search tool and using it, it would later try to use a tool that I don’t have configured, even though the memory had saved the working tool as the default.

I spent too much time debugging why it wasn’t using tools consistently, only to find that even the same prompt would sometimes work and sometimes fail.
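One way to pin this kind of inconsistency down is to tally responses instead of eyeballing them. A rough sketch, assuming the OpenAI-compatible tool-calling message shape that vLLM serves; the helper name and result categories are my own:

```python
import json

def check_tool_call(message: dict, configured_tools: set) -> str:
    """Classify one assistant message from an OpenAI-compatible API.

    Returns one of:
      "ok"      - every tool call targets a tool we actually configured
      "unknown" - a call to a tool that was never configured
      "bad_args" - tool name is fine but the arguments aren't valid JSON
      "no_call" - plain text answer, no tool call at all
    """
    calls = message.get("tool_calls") or []
    if not calls:
        return "no_call"
    for call in calls:
        fn = call.get("function", {})
        if fn.get("name") not in configured_tools:
            return "unknown"
        try:
            json.loads(fn.get("arguments", ""))
        except json.JSONDecodeError:
            return "bad_args"
    return "ok"
```

Run the same prompt twenty times, pipe each assistant message through this, and the failure modes described above stop being a feeling and become a histogram.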

Reasoning

Even though it runs at about 60 tokens/sec, which is fast, it spent way too much time “thinking” before responding. Reasoning was thorough, but sometimes it would start down the wrong path, and once that started it was hard to get it to come back. Look up “reasoning drift” or “probability cascade”. A tiny hallucination in reasoning would wreck accuracy later, and it would not let go of a false piece of information, completely stuck on bad reasoning. It’s like the butterfly effect of LLM reasoning.

Lost in the Context

As conversations grew, it would lose fine details and start contradicting itself or forgetting a key piece of information, which didn’t help when it had filled a large part of the context with bad reasoning. It would consistently lose attention and/or lose the needle in the haystack of its own reasoning. Good for general purpose, not good for tools.

GLM-4.7-Flash

GLM-4.7-Flash was more promising initially. It was fast, it was accessible, and it had decent capabilities. Ranked and benchmarked really high for tool usage.

Tool Usage

Tool usage was clearly better than GPT-OSS, but it would still struggle to consistently use the correct tools or follow the rules set in place for them. It struggled less than GPT-OSS but didn’t feel powerful or smart.

Reasoning

Standard reasoning: it didn’t reason forever like GPT-OSS. It was still susceptible to reasoning drift, but with shorter reasoning that was less of an issue. It still didn’t seem as smart.

Speed

Initial speed was great, about 45 tokens/sec. Not as fast as GPT-OSS, but very usable… until the context grew. This is a well-known issue at the time of writing: GLM-4.7-Flash has massive speed problems at large context. Around 64K tokens, speed dropped from 45 to 10-20 tokens/sec. With OpenClaw the context is always going to be high, making it near unusable.
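If you want to see the cliff rather than just feel it, a quick sweep over prompt sizes makes it obvious. A minimal sketch; `generate` here is a stand-in for whatever client call you use against your endpoint (e.g. a streaming request to an OpenAI-compatible server), and the sweep sizes are arbitrary:

```python
import time

def measure_tps(generate, prompt, n_new_tokens=256):
    """Time one generation call and return decode tokens/sec.

    `generate(prompt, n)` is a placeholder: it should block until
    n new tokens have been produced by your model server.
    """
    start = time.perf_counter()
    generate(prompt, n_new_tokens)
    elapsed = time.perf_counter() - start
    return n_new_tokens / elapsed

def sweep(generate, filler="x ", sizes=(8_000, 32_000, 64_000, 96_000)):
    """Feed ever-longer prompts and record throughput at each size."""
    return {n: measure_tps(generate, filler * n) for n in sizes}
```

Against GLM-4.7-Flash, the entries around 64K are where the numbers fall off a cliff; against Qwen3-Coder-Next they stay roughly flat.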

Overall

Reasonably smart for basic tasks, good at following instructions, fails at long context speed, and loses the needle in the haystack.

Before we move onto the winner…

The DGX-Spark GB10 Setup

Both models ran on my DGX-Spark GB10 with:

  • vLLM for fast inference and concurrency
  • OpenWebUI as a gateway to the models

The hardware was capable - the issue was the models themselves.

vllm and spark-vllm-docker

I owe a huge thank you to the eugr/spark-vllm-docker project. This Docker setup has been critical to getting vLLM optimizations and stability on my DGX-Spark GB10. Without it, I would have spent countless hours fighting with CUDA, drivers, and configuration issues.

The pre-configured Dockerfiles handle all the tricky bits:

  • CUDA driver compatibility
  • GPU memory optimization
  • Performance tuning for the DGX-Spark hardware
  • Stable container runtime
  • Prebaked model configuration for vLLM

It literally saved me days of debugging and let me focus on testing models instead of fixing infrastructure.

Qwen3-Coder-Next - The Winner

I gave up on the “best” models according to benchmarks and decided to try Qwen3-Coder-Next based on positive reviews. And wow, what a difference. Over and over again it surprised me: using tools correctly, offering me better tool options when the current one lacked a capability, and troubleshooting tool usage itself, which was critical when configuring browser use. We will hit the juicy bits of all the cool stuff further down.

Why It Works

  • Consistent speed - 35-45 tokens/sec even at 128K context
  • Minimal thinking time - It spends nearly no time “thinking” before responding (it’s not a reasoning model)
  • Reliable tool usage - Tools work every single time
  • Long context handling - Handles 128K+ without slowing down
  • Smart without being slow - It’s both fast and intelligent

Story Time

So I have a web app I wrote (vibe coded) a while back: a web-based todo list with auth that I can access from multiple devices. Strictly an app I wrote just for myself. Anyway, I was at the gym and I sent OpenClaw a message. Here is my summarized, from-memory transcript:

Me: Hey, can you go pull down my todo app from GitHub? I would like to explore giving it an API to make todo updates programmatically.
(a few minutes pass)
OpenClaw: I pulled it down, nice app. Here are 3 different ways we could implement an API. (It shows me the 3 ways with its recommendation, since it is a single-user app and doesn’t require extra complexity.)
Me: I like your recommendation, do it.
OpenClaw: I can make the changes directly to the files, or we can set up opencode and I can work through that framework. Both ways work, what would you like me to do?
Me: Wow, that would be cool. You can use the same API and API key from OpenClaw to configure opencode.
OpenClaw: Hmm, it’s not working, let me check the OpenCode docs.
OpenClaw: I got it working. I also saved the config to your Obsidian vault for future reference. Implementing the API changes now.
(Reminder: I’m still at the gym exercising, and a few more minutes pass.)
OpenClaw: Changes are pushed to branch X. (inserts PR link)
Me: That looks great, but I forgot, I use auth in front of the app. Will that cause a problem?
OpenClaw: Maybe. Can you tell me about the auth setup?
Me: Yeah, here is the gitops config (insert link to gitops repo), and I use tinyauth https://github.com/steveiliop56/tinyauth.
OpenClaw: I pulled down both repos and reviewed both code bases. I updated the PR. Once the changes are merged, I can update the gitops repo with the auth changes.
Me: Okay, I merged the changes and the CI made the new image.
OpenClaw: I created new creds for making API calls and updated the gitops repo. (insert PR link)
Me: Wow, crazy. I merged the changes.
OpenClaw: Ready to test the API?
Me: Yeah, add something like “get bananas”.
OpenClaw: I added “get bananas”, and have confirmed it was added with a list API call. Can you confirm from the UI?
Me: Yes, I see it.

At that moment I saw a glimpse of the future of AI agents.
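I don’t have the actual code it wrote (the app is private), but the shape of the API call it wired up was roughly this. The endpoint path, header name, and payload are my guesses, stdlib only:

```python
import json
import urllib.request

def build_add_todo_request(base_url: str, api_key: str, title: str):
    """Build a POST request that adds one todo item.

    Hypothetical sketch: "/api/todos" and "X-Api-Key" are stand-ins
    for whatever the real app uses behind its auth layer.
    """
    body = json.dumps({"title": title}).encode()
    return urllib.request.Request(
        f"{base_url}/api/todos",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Api-Key": api_key,  # simple single-user auth token
        },
    )

# To actually send it:
# urllib.request.urlopen(build_add_todo_request(
#     "https://todo.example.com", "s3cret", "get bananas"))
```

The nice property of a dumb key-in-header API like this is that an agent can call it with nothing more than an HTTP client and a stored credential.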

Fast Forward a few days

I’m lying in bed and decide again to see how powerful it is…

Me: Hey, can you search Amazon and find me a reverse osmosis water filter that has at least 4.6 stars and over 1,000 reviews?
OpenClaw: I can, but it looks like you don’t have browser use configured yet. Would you like assistance getting it set up?
Me: Yes. (Blah blah blah, it walks me through setup.)
OpenClaw: I have found a few that meet the criteria, and I added them to your to-do list. (The app that it created the API for a few days earlier.)
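The filtering itself is trivial; the impressive part was the agent doing the browsing and the todo write-back unprompted. The criteria from my message boil down to something like this (the product fields are hypothetical):

```python
def meets_criteria(product: dict, min_stars=4.6, min_reviews=1000) -> bool:
    """True if a scraped listing clears the rating and review-count bar."""
    return (product.get("stars", 0) >= min_stars
            and product.get("reviews", 0) > min_reviews)

# Made-up listings to show the cutoffs:
results = [
    {"name": "Filter A", "stars": 4.7, "reviews": 2310},
    {"name": "Filter B", "stars": 4.6, "reviews": 980},   # too few reviews
    {"name": "Filter C", "stars": 4.4, "reviews": 5000},  # rating too low
]
picks = [p["name"] for p in results if meets_criteria(p)]  # ["Filter A"]
```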

Why This Matters

Finding the right model made all the difference. Before Qwen3-Coder-Next, I was constantly fighting with the models to get them to work consistently. Now, OpenClaw just works.

The Tradeoff I Made

  • GPT-OSS:120B: More “thinking”, less reliability
  • GLM-4.7-Flash: Fast, but limited by context and intelligence
  • Qwen3-Coder-Next: Smart, excellent attention, fast, reliable, and consistent

For me, reliability and speed are more important than raw reasoning time. I want my assistant to get things done, not spend 30 seconds “thinking” about a simple command… and then fail.

Conclusion

If you’re running local LLMs for automation:

  1. Don’t chase the “biggest” model - chase the most reliable one
  2. Test consistency, not just peak performance
  3. Consider long-context handling for real-world use
  4. Speed matters, but consistency matters more

Qwen3-Coder-Next has been a game-changer for me. It’s fast, reliable, and actually gets things done. If you’re struggling with other models, give this one a try.


Published: February 18, 2026 Word Count: ~850 words