Neko agents use Reinforcement Learning (RL), specifically Direct Preference Optimization (DPO), to continuously learn, adapt, and refine their decision-making for DeFi strategies. This allows agents managing tasks like Earn vault optimization to move beyond static rulesets toward adaptive behavior grounded in real-time conditions and learned experience.

The core idea is to teach the agent’s underlying Large Language Model (LLM) preferred behaviors by showing it which candidate actions lead to better outcomes in simulated environments.
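
As a concrete illustration, a single preference record pairs one state prompt with a preferred and a less-preferred action. The sketch below is hypothetical: the market numbers and protocol names are invented for the example, and the field names follow the triplet structure described later in this page rather than a fixed Neko schema.

```python
# Illustrative preference triplet; values and field names are hypothetical,
# shown only to make the "chosen vs. rejected" idea concrete.
preference_example = {
    "prompt": (
        "Vault: USDC Earn. Idle balance: 120,000 USDC. "
        "Protocol A supply APY: 3.1%. Protocol B supply APY: 4.4%. Gas: low."
    ),
    "chosen": "Move 100,000 USDC from Protocol A to Protocol B; keep 20,000 USDC idle as a buffer.",
    "rejected": "Hold all current positions unchanged.",
}
```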

The Learning & Optimization Cycle

Neko employs an iterative cycle of simulation, evaluation, and model fine-tuning, often orchestrated via specific API endpoints accessible through tool calling (a code sketch of one full iteration follows the list):

  1. State Input (Prompt): An agent (e.g., managing an Earn vault) receives its current context as a prompt. This includes real-time environmental data (market conditions, protocol states like lending rates) and the agent’s own status (asset balances, current positions).
  2. Action Prediction (predict API Call): Using the prompt, the agent (or its coordinating Orchestrator) makes a tool call to the predict API endpoint of the RL system. The underlying LLM, potentially augmented with Retrieval-Augmented Generation (RAG) for additional context, processes the prompt and generates multiple candidate actions or strategies (e.g., “rebalance portfolio X,” “borrow Y to deposit in Z,” “hold current positions”).
  3. Simulation & Reward Evaluation:
    • These candidate actions are fed into a sophisticated Simulation Environment. This environment uses historical market data and the agent’s past state data to model the likely outcomes of each potential action.
    • A predefined Reward Function, designed to reflect the specific goals of the strategy (e.g., maximizing risk-adjusted yield, minimizing impermanent loss), evaluates the simulated outcomes. This assigns scores, effectively determining which simulated action is preferred over the others.
  4. Preference Data Generation: Based on the simulation results and reward scores, preference data triplets are constructed. Each triplet contains:
    • The original prompt (the state input).
    • The chosen action (the candidate action that resulted in a higher reward/was deemed preferable).
    • The rejected action (a candidate action that resulted in a lower reward/was deemed less preferable).
  5. Model Fine-tuning (learn API Call): The generated preference data triplets are sent via a tool call to the learn API endpoint. The Direct Preference Optimization (DPO) algorithm uses this data to directly fine-tune the core LLM. This update adjusts the model’s parameters to increase the likelihood of generating the chosen (preferred) actions and decrease the likelihood of generating rejected actions for similar future prompts.
  6. Performance Monitoring: The real-world performance of the agent’s subsequent actions is continuously monitored using key financial metrics (e.g., Profit and Loss (PnL), changes in Total Value Locked (TVL)). This provides crucial feedback on the effectiveness of the learning cycle and can inform future adjustments to the reward function, simulation parameters, or even the core agent logic.
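
The Python sketch below walks through one iteration of steps 1–5 to show the data flow. The base URL, endpoint payload shapes, the n_candidates parameter, and the simulate/reward helpers are all assumptions made for illustration; they do not describe the actual Neko API.

```python
# Minimal sketch of one predict -> simulate -> score -> learn iteration.
# Endpoint paths, payload shapes, and helper functions are assumptions,
# not the actual Neko API.
import requests

RL_API = "https://rl.example.neko/api"  # hypothetical base URL


def run_learning_iteration(state_prompt: str) -> None:
    # 1-2. State input + action prediction: ask the RL system for candidate actions.
    resp = requests.post(
        f"{RL_API}/predict",
        json={"prompt": state_prompt, "n_candidates": 3},
    )
    candidates = resp.json()["candidates"]  # e.g. ["rebalance ...", "borrow ...", "hold ..."]

    # 3. Simulation & reward evaluation: score each candidate in the simulator.
    scored = []
    for action in candidates:
        outcome = simulate(state_prompt, action)   # placeholder simulation call
        scored.append((reward(outcome), action))   # placeholder reward function
    scored.sort(reverse=True, key=lambda pair: pair[0])

    # 4. Preference data generation: best candidate is "chosen", worst is "rejected".
    triplet = {
        "prompt": state_prompt,
        "chosen": scored[0][1],
        "rejected": scored[-1][1],
    }

    # 5. Model fine-tuning: submit the triplet to the learn endpoint for DPO updates.
    requests.post(f"{RL_API}/learn", json={"preferences": [triplet]})


def simulate(prompt: str, action: str) -> dict:
    """Placeholder for the simulation environment: in practice this replays
    historical market data and the agent's past state to project an outcome."""
    return {"pnl": 0.0, "drawdown": 0.0}


def reward(outcome: dict) -> float:
    """Placeholder reward: e.g. projected yield penalised by drawdown."""
    return outcome["pnl"] - 0.5 * outcome["drawdown"]
```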

Key Components

  • Simulation Environment: Models potential outcomes of DeFi actions using historical and agent-specific data.
  • Reward Function: Quantifies the desirability of different outcomes, guiding the agent’s learning towards specific objectives.
  • Direct Preference Optimization (DPO): An efficient algorithm for aligning LLM behavior with learned preferences derived from comparative data (chosen vs. rejected); see the loss sketch after this list.
  • Core LLM (interchangeable): The underlying model responsible for generating candidate actions, continuously improved via the DPO process.
  • predict & learn APIs: Interface endpoints allowing agents to request action predictions from the RL model and submit preference data to update the model through tool calling.
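
For intuition about what the learn endpoint does with preference triplets, the standard DPO objective can be computed directly from the per-sequence log-probabilities of the chosen and rejected actions under the policy and a frozen reference model. The PyTorch function below is a generic sketch of that loss, not Neko's implementation; the beta value is an illustrative default.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss: make the policy prefer chosen over rejected actions,
    measured relative to a frozen reference model."""
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # A larger margin means the chosen action becomes more likely than the
    # rejected one, relative to the reference; beta controls the strength.
    margin = beta * (policy_logratios - ref_logratios)
    return -F.logsigmoid(margin).mean()
```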

Outcome

This iterative RL/DPO cycle allows Neko agents to autonomously improve their strategies over time. They learn from simulated experience, adapt to dynamic market conditions, and align their actions more closely with complex objectives defined by the reward function. This leads to more robust, effective, and intelligent automation of DeFi strategies within the Neko ecosystem.