Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions - acit

adriennehuff02/acit

I ran a fast experiment investigating how DeepSeek-R1 performs on agentic tasks, in spite of not supporting tool use natively, and I was rather satisfied by initial results. This experiment runs DeepSeek-R1 in a single-agent setup, where the design not only prepares the actions however likewise creates the actions as executable Python code. On a subset1 of the GAIA recognition split, DeepSeek-R1 outperforms Claude 3.5 Sonnet by 12.5% outright, from 53.1% to 65.6% appropriate, and other models by an even bigger margin:

The followed design use standards from the DeepSeek-R1 paper and the model card: Don't utilize few-shot examples, avoid adding a system timely, and set the temperature to 0.5 - 0.7 (0.6 was used). You can discover further evaluation details here.

Approach

DeepSeek-R1's strong coding capabilities allow it to function as an agent without being clearly trained for tool usage. By allowing the model to produce actions as Python code, it can flexibly communicate with environments through code execution.

Tools are carried out as Python code that is included straight in the timely. This can be a simple function meaning or a module of a larger bundle - any legitimate Python code. The design then generates code actions that call these tools.

Results from executing these actions feed back to the design as follow-up messages, driving the next actions up until a final answer is reached. The agent framework is a basic iterative coding loop that mediates the discussion in between the model and its environment.

Conversations

DeepSeek-R1 is used as chat model in my experiment, where the model autonomously pulls additional context from its environment by utilizing tools e.g. by utilizing a search engine or bring information from websites. This drives the discussion with the environment that continues up until a final response is reached.

On the other hand, o1 designs are known to perform inadequately when used as chat designs i.e. they don't try to pull context throughout a discussion. According to the linked post, o1 models carry out best when they have the full context available, with clear directions on what to do with it.

Initially, I also attempted a complete context in a single timely technique at each step (with arise from previous steps consisted of), however this led to significantly lower scores on the GAIA subset. Switching to the conversational technique explained above, I had the ability to reach the reported 65.6% efficiency.

This raises an intriguing concern about the claim that o1 isn't a chat model - perhaps this observation was more relevant to older o1 models that lacked tool use abilities? After all, isn't tool usage support an essential mechanism for making it possible for classifieds.ocala-news.com designs to pull extra context from their environment? This conversational approach certainly appears effective for DeepSeek-R1, though I still require to perform comparable explores o1 models.

Generalization

Although DeepSeek-R1 was mainly trained with RL on math and coding jobs, it is amazing that generalization to agentic jobs with tool use by means of code actions works so well. This ability to generalize to agentic jobs reminds of recent research by DeepMind that shows that RL generalizes whereas SFT remembers, although generalization to tool use wasn't investigated in that work.

Despite its capability to generalize to tool usage, DeepSeek-R1 frequently produces very long thinking traces at each action, compared to other designs in my experiments, limiting the usefulness of this design in a single-agent setup. Even easier tasks in some cases take a very long time to complete. Further RL on agentic tool use, be it by means of code actions or not, could be one option to improve effectiveness.

Underthinking

I also observed the underthinking phenomon with DeepSeek-R1. This is when a thinking model often switches between different thinking thoughts without sufficiently exploring appealing courses to reach a correct solution. This was a significant reason for excessively long reasoning traces produced by DeepSeek-R1. This can be seen in the taped traces that are available for download.

Future experiments

Another typical application of reasoning models is to utilize them for planning just, while using other models for creating code actions. This might be a potential new function of freeact, if this separation of roles proves useful for more complex jobs.

I'm also curious about how thinking designs that currently support tool use (like o1, o3, ...) perform in a single-agent setup, with and without producing code actions. Recent developments like OpenAI's Deep Research or Hugging Face's open-source Deep Research, which likewise uses code actions, look fascinating.