How I Built a Custom AI Agent with Tools from Scratch

Even though there are existing frameworks for AI Agents, I had an idea for a fairly simple way to build a custom AI Agent from scratch and decided to try it out.

Autonomous Agent with Tools

I specify tools for an LLM that functions as an Agent. A console app executes these tools based on the Agent’s instructions, returns the result to the Agent, and iterates until the Agent reports a final result.

This setup gives the Agent the autonomy to solve tasks as it sees fit. If it has the answer internally, it simply returns it. If it knows of a web page it wants but not its content, it retrieves the page. If it wants to perform a search and then browse the search results, it does that. I prefer this approach to setting up static workflows with chained prompts.

Dry Run and Implementation

The first step was to dry run the setup by sending the system prompt to the LLM, reading the response, manually executing its instructions, and reporting the result back to it in the specified format until it reported a task result. Since this worked, I decided to implement it in a C# .NET 8 console app that manages the tools and calls to the Agent.
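
The loop that drives the app is simple. Here is a minimal sketch of what it can look like, assuming hypothetical helpers CallAgentAsync and ExecuteToolAsync and a hypothetical AgentResponse type; the names are illustrative, not the actual implementation:

var history = new List<string>();                      // previous tool calls and results
while (true)
{
    // CallAgentAsync (hypothetical) sends the task plus history to the LLM
    AgentResponse response = await CallAgentAsync(task, history);
    if (response.TaskResult != null)                   // the Agent reports a final result
    {
        Console.WriteLine(response.TaskResult);
        break;
    }
    // ExecuteToolAsync (hypothetical) runs the tool the Agent asked for
    string toolResult = await ExecuteToolAsync(response.ToolCall);
    history.Add(toolResult);                           // context for the next iteration
}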

Web Browser Tool

The Agent sends the URL it wants along with instructions to a parser LLM that will extract information from the fetched page. The result is then sent back to the Agent. 

I get the web page with an HttpClient.GetAsync call. Then I send the HTML body to the parser LLM.
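
As a sketch, the fetch and hand-off could look like the following; the <body> extraction is a naive substring approach, and CallParserAsync is a hypothetical helper standing in for the Gemini call described below:

using var http = new HttpClient();
HttpResponseMessage page = await http.GetAsync(url);
string html = await page.Content.ReadAsStringAsync();

// Cut out the <body> part before handing it to the parser LLM
// (naive substring approach; a real HTML parser would be more robust)
int start = html.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
int end = html.IndexOf("</body>", StringComparison.OrdinalIgnoreCase);
string body = (start >= 0 && end > start) ? html[start..end] : html;

string result = await CallParserAsync(instruction, body);  // hypothetical parser helper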

Google Search API Tool

I prefer APIs to scraping when available, so I activated a custom search engine in the Google Search API, which allows 100 free calls a day.

The Agent sends the search text and gets the API's JSON result in return.

The API call is implemented using an HttpClient.GetAsync call.
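
A minimal sketch of that call against the Custom Search JSON API, assuming apiKey and engineId hold the Google API key and the custom search engine id:

// The raw JSON response is what gets sent back to the Agent
string url = "https://www.googleapis.com/customsearch/v1" +
             $"?key={apiKey}&cx={engineId}&q={Uri.EscapeDataString(searchText)}";
string json = await http.GetStringAsync(url);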

YR Weather API Tool

The YR weather API takes longitude and latitude and returns a 10-day forecast for the location. Since the Agent has geo coordinates for many locations internally, it simply specifies them in the tool call and receives the JSON result returned by YR. I filter it a bit before sending it back to the Agent LLM due to the 30,000+ character length.

This API call is also implemented using an HttpClient.GetAsync call.
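
As a sketch, assuming the public MET Norway locationforecast endpoint that YR is built on; note that the API requires an identifying User-Agent header:

var request = new HttpRequestMessage(HttpMethod.Get,
    $"https://api.met.no/weatherapi/locationforecast/2.0/compact?lat={latitude}&lon={longitude}");
request.Headers.UserAgent.ParseAdd("MyAgentApp/1.0 (contact@example.com)");  // required by the API
HttpResponseMessage response = await http.SendAsync(request);
string forecastJson = await response.Content.ReadAsStringAsync();
// forecastJson is filtered down before being returned to the Agent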

Windows Command Tool

This tool is a new addition that is described in detail in its own post. With this tool, the AI Agent can execute any command line task, like building and running a .NET 8 program or doing file system and network operations. The AI Agent sends a collection of instructions that are executed, and the results are returned to it for review and further action.
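
As a sketch, a single instruction could be executed through cmd.exe with its output captured, along these lines (command is a placeholder for one instruction from the Agent):

using System.Diagnostics;

var psi = new ProcessStartInfo("cmd.exe", $"/c {command}")
{
    RedirectStandardOutput = true,
    RedirectStandardError = true,
    UseShellExecute = false,
    CreateNoWindow = true
};
using var process = Process.Start(psi)!;
string output = await process.StandardOutput.ReadToEndAsync();
string errors = await process.StandardError.ReadToEndAsync();
await process.WaitForExitAsync();
// output and errors go back to the Agent for review and further action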

History Object

With every tool result sent to the Agent, I include a history object containing previous tool calls and their results, so the Agent has all the context.
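
The exact shape isn't shown here, but a minimal sketch of such a history object could be:

// Illustrative shape only; the actual object may differ
public record ToolCallRecord(string Tool, string Input, string Result);

public class History
{
    public List<ToolCallRecord> ToolCalls { get; } = new();
}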

Azure OpenAI 4o Model as the Agent

I currently use an OpenAI 4o model in Azure as the Agent. Azure's SDK for it is a bit of a mess, so I use the REST API in a custom class with HttpClient POST calls. The S0 pricing tier I use has a limit of 3 calls per minute, so I have a timer that triggers a Task.Delay if the limit is reached.
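
A sketch of the rate-limited call; the endpoint shape and api-key header follow the Azure OpenAI chat completions REST API, while resource, deployment, apiKey, requestJson, and the stored _lastCall timestamp are placeholders:

// Space calls at least 20 seconds apart to stay under ~3 calls per minute
TimeSpan sinceLast = DateTime.UtcNow - _lastCall;
if (sinceLast < TimeSpan.FromSeconds(20))
    await Task.Delay(TimeSpan.FromSeconds(20) - sinceLast);
_lastCall = DateTime.UtcNow;

var request = new HttpRequestMessage(HttpMethod.Post,
    $"https://{resource}.openai.azure.com/openai/deployments/{deployment}" +
    "/chat/completions?api-version=2024-06-01");
request.Headers.Add("api-key", apiKey);
request.Content = new StringContent(requestJson, System.Text.Encoding.UTF8, "application/json");
string responseJson = await (await http.SendAsync(request)).Content.ReadAsStringAsync();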

Google Gemini Flash 2.0 for Parsing

I use Google Gemini Flash 2.0 Experimental through Google's API to parse the web pages fetched by the web browser tool. The Agent specifies an instruction in the tool execution, which is sent to the parser along with the body part of the page markup. Gemini Flash 2.0 Experimental does a good job, is fast, and has a high token limit.
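
A sketch of the parser call, using the public generateContent REST endpoint with the gemini-2.0-flash-exp model id; apiKey, instruction, and pageBody are placeholders:

// Request body shape follows the Gemini REST API: contents -> parts -> text
var payload = new
{
    contents = new[] { new { parts = new[] { new { text = $"{instruction}\n\n{pageBody}" } } } }
};
var content = new StringContent(System.Text.Json.JsonSerializer.Serialize(payload),
                                System.Text.Encoding.UTF8, "application/json");
HttpResponseMessage response = await http.PostAsync(
    "https://generativelanguage.googleapis.com/v1beta/models/" +
    $"gemini-2.0-flash-exp:generateContent?key={apiKey}", content);
string parsed = await response.Content.ReadAsStringAsync();  // goes back to the Agent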

Agent Task Example

The Agent gets this task:

Which presidential candidate won the US presidential election November 2024? When is the inauguration and what will the weather be like during it?

The Agent triggers a search and gets the result. I have shortened the result for visibility.

{
    "searchText": "US presidential election 2024 winner",
    "result": [{
        "kind": "customsearch#result",
        "title": "Presidential Election Results Map: Trump Wins - The New York Times",
        "htmlTitle": "<b>Presidential Election</b> Results Map: Trump <b>Wins</b> - The New York Times",
        "link": "https://www.nytimes.com/interactive/2024/11/05/us/elections/results-president.html",
        "displayLink": "www.nytimes.com",
        "snippet": "Nov 5, 2024 ... ... 2024/11/05/us/elections/results-president.html. Advertisement. SKIP ADVERTISEMENT. Presidential Election Results: Trump Wins. Updated Jan. 15 ..."
    }]
}
 

The Agent triggers a fetch of the web page in the search result it believes contains the best information. It also sends an instruction to the LLM parser specifying the information it wants from the page. The result is then returned to the Agent.

{
    "url": "https://www.nytimes.com/interactive/2024/11/05/us/elections/results-president.html",
    "instruction": "Find the name of the presidential candidate who won the US presidential election in November 2024.",
    "result": "Based on the scraped web content, the presidential candidate who won the US presidential election in November 2024 was Donald J. Trump.\n"
}

The Agent triggers a search about the inauguration with the search API tool. It retrieves the information directly from the result and does not need to fetch a web page. I have shortened the JSON structure in the result for visibility.

{
    "searchText": "US presidential inauguration date 2025",
    "result": [{
        "kind": "customsearch#result",
        "title": "Inauguration of the president of the United States | USAGov",
        "htmlTitle": "<b>Inauguration</b> of the <b>president</b> of the <b>United States</b> | USAGov",
        "link": "https://www.usa.gov/inauguration",
        "displayLink": "www.usa.gov",
        "snippet": "The inauguration ceremony takes place at the U.S. Capitol building in Washington, DC. The next presidential inauguration is scheduled to be on January 20, 2025."
    }]
}

The Agent has access to longitude and latitude in its model, so it gets the weather with the YR weather tool. I have shortened the result for visibility. 

{
    "longitude": "-77.0369",
    "latitude": "38.9072",
    "result": [{
        "Time": "2025-01-16T09:00:00Z",
        "Temperature": -10.1,
        "Cloudiness": 16.4,
        "Humidity": 72.6,
        "WindSpeed": 1.4,
        "Precipitation": 0
    },
    {
        "Time": "2025-01-16T10:00:00Z",
        "Temperature": -10.4,
        "Cloudiness": 1.6,
        "Humidity": 72.9,
        "WindSpeed": 0.9,
        "Precipitation": 0
    }]
}

The Agent sends the task result to the console app.

{
  "taskResult": "Donald J. Trump won the US presidential election in November 2024. The inauguration is scheduled for January 20, 2025. On the day of the inauguration, the weather forecast for Washington, D.C. predicts a temperature of around -8.7°C at noon with no cloudiness and wind speed of 4.4 m/s, with no precipitation expected."
}

Possible Tools

The list of potential tools is long and growing, but here are a few interesting ones:

  • The Windows Console would allow the Agent to interact more dynamically with the OS on which the console app is running. Update: implemented.
  • Mail functionality would enable the Agent to communicate with humans and receive their input. The process could be paused, with its state saved to a database. A scheduler could periodically check for responses and continue the process by sending the response to the Agent.
  • A database tool would allow the Agent to store and retrieve data in various ways.
  • Image generation using DALL-E 3 in Azure.
  • A headless web browser tool that can be used programmatically to replicate user behavior, such as submitting forms.