
Reasoning models such as OpenAI's o1 and o3-mini as well as DeepSeek's DeepSeek R1 are quickly gaining popularity.

These models “think” before answering. Well, technically, they are producing an internal chain of thought, effectively prompting themselves. In practice, these models produce “thinking tokens” during the thinking/reasoning phase. This is done to allow the model to break down complex problems into smaller steps, identify errors, and refine its response before generating the final answer.

Recently, I was asked whether working with reasoning models, from a local inference and coding perspective, would be any different from using more traditional LLMs. While there aren’t too many open reasoning models around, we can use DeepSeek R1 to demonstrate how this works.

In the following, I will be running inference on DeepSeek-R1-Distill-Qwen-1.5B using both the transformers library and the OpenAI library.

Usage Recommendations

As DeepSeek R1 works differently than other models, it is important to keep DeepSeek's usage recommendations in mind.

There are three key things to consider: first, we are not supposed to use a system prompt. Second, we are supposed to trigger the reasoning by prompting <think>\n. Lastly, the ideal temperature range is 0.5 to 0.7.
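As a small sketch of how these recommendations translate into code, before we get to the full examples below: the prompt and sampling values here simply mirror the recommendations above, and the example question is just a placeholder.

# Sketch: applying the usage recommendations.
user_input = "Why is the sky blue?"  # placeholder question

# 1) No system prompt; the prompt contains only the user turn.
# 2) The assistant turn starts with <think>\n to trigger the reasoning phase.
prompt = f"User: {user_input}\nAssistant: <think>\n"

# 3) Sampling within the recommended temperature range of 0.5 to 0.7.
sampling_settings = {"do_sample": True, "temperature": 0.6, "top_p": 0.9}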

Inference

For the following two examples, we will be running the model locally. The second example will access a locally hosted, OpenAI-compatible API. I will be using Jan for this, but there are many other options available.
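Before running the second example, it can help to verify that the locally hosted, OpenAI-compatible endpoint is actually reachable. This is a minimal sketch, assuming the same base URL and (arbitrary) API key as in the example further below; adjust both to your own Jan setup.

from openai import OpenAI

# Assumed local Jan endpoint; adjust base_url and api_key to your setup.
client = OpenAI(base_url="http://127.0.0.1:4444/v1", api_key="sk-1234567890")

# List the models the local server exposes to confirm the connection works.
for model in client.models.list():
    print(model.id)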

Inference Using the transformers Library

Running DeepSeek R1 using transformers is as simple as running any other model. Here, I am simply using the default text-generation pipeline. The only somewhat special part is the formatted_input, which also includes the special <think> instruction.

from transformers import pipeline, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
pipe = pipeline("text-generation", model=model_name, tokenizer=tokenizer)

# It is recommended to not use a system prompt and to trigger the thinking using <think>.
user_input = "What runs, but never walks, has a mouth, but never talks, has a bed, but never sleeps, has a head, but never weeps?"
formatted_input = f"User: {user_input}\nAssistant: <think>\n"

response = pipe(
    formatted_input,
    max_length=5000,
    do_sample=True,
    temperature=0.6, # Recommended value is 0.6
    top_p=0.9,
    truncation=False
)

for out in response:
    print(out['generated_text'])

As we can see, the model produces the thinking/reasoning tokens, effectively performing chain-of-thought prompting on itself. Then, the model provides the actual, final answer.

User: 
What runs, but never walks, has a mouth, but never talks, has a bed, but never sleeps, has a head, but never weeps?

Assistant: 
<think>
Okay, so I'm trying to solve this riddle: "What runs, but never walks, has a mouth, but never talks, has a bed, but never sleeps, has a head, but never weeps?" Hmm, let me break it down piece by piece.

[...]
</think>

The answer to the riddle is a **horse**. Despite the confusion with the "running but never walking" part, the other clues fit a horse: it has a bed, no sleep, a head, no weep, and is fast.

Unfortunately, DeepSeek R1 was not able to solve the riddle (the answer is a river). While this is not the point of this article at all, o1 easily gets it right.
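Since the generated text contains both the reasoning and the final answer, it can be useful to separate the two, for example to display only the answer. Below is a minimal sketch that splits the pipeline output from above on the </think> tag; split_reasoning is a hypothetical helper, not part of any library.

def split_reasoning(generated_text):
    # Split the generated text into reasoning and final answer,
    # assuming the reasoning ends with a closing </think> tag.
    if "</think>" in generated_text:
        reasoning, answer = generated_text.split("</think>", 1)
        reasoning = reasoning.split("<think>", 1)[-1]
        return reasoning.strip(), answer.strip()
    # Fallback: no closing tag found, treat everything as the answer.
    return "", generated_text.strip()

reasoning, answer = split_reasoning(response[0]["generated_text"])
print(answer)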

Inference Using a Locally Hosted API and the OpenAI Library

For this second example, I have hosted the model locally using Jan. This time, we are also streaming the results from the model.

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:4444/v1", api_key="sk-1234567890")
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
user_input = "What runs, but never walks, has a mouth, but never talks, has a bed, but never sleeps, has a head, but never weeps?"

response = client.chat.completions.create(
    model=model_name,
    messages=[
        {
            "role": "user",
            "content": user_input
        },
    ],
    max_tokens=5000,
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        content_chunk = chunk.choices[0].delta.content
        print(content_chunk, end="", flush=True)

As you can see below, the response looks, as expected, exactly like before.

<think>
Okay, so I've got this riddle here: "What runs, but never walks, has a mouth, but never talks, has a bed, but never sleeps, has a head, but never weeps?" Hmm, let's break it down. The question is asking for something that meets all these criteria, right?

[...]
</think>

The answer to the riddle is a horse. It runs but does not walk, has a head, can lie down on a bed without sleeping, and never weeps. 

**Answer:** The horse

Conclusion

Different models will work differently. However, as we have seen, working with reasoning models such as DeepSeek R1 is, from an inference point of view, not fundamentally different from working with more traditional LLMs.

That said, when using these models, we at least have to consider how to treat the thinking tokens from a UI/UX perspective. For example, OpenAI has decided not to show these tokens to users and only displays a summary. In contrast, DeepSeek allows users to look at the full thinking/reasoning process.
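If we want to mimic the first approach and hide the reasoning in a streaming setup, we can buffer the stream until the closing </think> tag appears. This is a minimal sketch building on the streaming example above; the buffering logic is purely illustrative, not a library feature.

# 'response' is a fresh streaming response, created as in the example above.
answer_started = False
buffer = ""

for chunk in response:
    content_chunk = chunk.choices[0].delta.content
    if content_chunk is None:
        continue
    if answer_started:
        print(content_chunk, end="", flush=True)
    else:
        # Buffer the reasoning and start printing once </think> has been seen.
        buffer += content_chunk
        if "</think>" in buffer:
            answer_started = True
            print(buffer.split("</think>", 1)[1], end="", flush=True)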

Furthermore, this example demonstrates why reasoning models use significantly more tokens. Even for relatively simple prompts, the reasoning/thinking step adds a substantial number of tokens. Of course, this has to be considered when choosing a model from a cost-to-performance perspective.
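To get a rough feel for this overhead, we can count how many tokens the model spent on reasoning versus the final answer. This sketch reuses the tokenizer and the pipeline output (response) from the first example.

# Count reasoning tokens versus answer tokens in the pipeline output from above.
generated_text = response[0]["generated_text"]
reasoning_part, _, answer_part = generated_text.partition("</think>")

reasoning_tokens = len(tokenizer.encode(reasoning_part, add_special_tokens=False))
answer_tokens = len(tokenizer.encode(answer_part, add_special_tokens=False))

print(f"Reasoning tokens: {reasoning_tokens}, answer tokens: {answer_tokens}")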