Aussie AI
Chapter 20. Tool Usage in AI Projects
-
Book Excerpt from "Generative AI Applications: Planning, Design and Implementation"
-
by David Spuler
Tool Usage in Generative AI
We’re talking about tools for the AI engine, not tools that developers can use to create models and engines. LLMs require tools to do more advanced things, just like humans. Some of the things that are hard for an LLM to do without tools include:
- Having real-time or up-to-date information (e.g., stock prices or the latest AI research papers).
- Computation-related questions beyond basic arithmetic.
- Information that’s only in a different place (e.g., the company’s internal ERP database).
- Time-specific or locale-specific information that differs from its training.
Another type of tool that LLMs can use is one that performs an action for you, such as sending an email. However, we’ll mostly examine those in the chapter on agentic architectures.
As another simple example, if someone asks you the time, you look at your watch (or your phone). If you ask an LLM “What is the time?” there is nothing in its training data set that could possibly answer this correctly. The only way is to use a clock that’s integrated into the LLM, and executed by the AI engine as part of answering your query. In this case, the clock is a “tool” for the LLM.
The terminology for tool integrations used by LLMs is still evolving, as indeed are the tool capabilities themselves. Some other words you might hear include:
- Plug-ins — refers to the data source integrations offered in ChatGPT.
- Function calls — because that’s what the locally integrated tools are.
Tools are a very advanced part of an AI architecture. You don’t want to worry about them in your first rodeo. But if you want to extend the capabilities of LLMs beyond whatever was in their training data set, then tools are the way to achieve this.
Types of Tools
The AI engine can use several types of tools:
- Data tools
- Action tools
- Calculation tools
Data tools are ways that new data is integrated into the AI model without extra fine-tuning. The RAG architecture is the typical way that these tools are integrated: a “retriever” component looks up relevant data and then the AI engine converts that into an answer. This is how many business chatbots answer questions about their entire website, and indeed a similar architecture is probably used in the AI-enabled search available in Bing or Google.
Action tools are ways that the AI engine can change something, rather than just output some text. For example, it could integrate with your internal financials app so as to post an employee expense report (or, if the integration is two-way, it could also read data from the financials app, in which case it acts like a data tool as well as an action tool).
One way to think about this is in terms of “read-only” versus “read-write” tool interfaces. A data tool is mainly a read-only tool, whereas an action tool can also write data.
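To make the read-only versus read-write distinction concrete, here is a minimal sketch in Python, reusing the financials-app example above. The function names and the in-memory “app” are hypothetical placeholders, not a real integration.

```python
# A minimal sketch of the read-only vs. read-write distinction.
# FINANCIALS_DB is a hypothetical stand-in for the internal financials app.

FINANCIALS_DB = {"expenses": []}

def get_expense_total(employee: str) -> float:
    """Data tool (read-only): look up information for the LLM to summarize."""
    return sum(e["amount"] for e in FINANCIALS_DB["expenses"]
               if e["employee"] == employee)

def post_expense_report(employee: str, amount: float, memo: str) -> str:
    """Action tool (read-write): change something on the user's behalf."""
    FINANCIALS_DB["expenses"].append(
        {"employee": employee, "amount": amount, "memo": memo})
    return f"Expense of ${amount:.2f} posted for {employee}."

print(post_expense_report("alice", 42.50, "Taxi to airport"))
print(get_expense_total("alice"))   # 42.5
```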
Calculation tools are computational apps that the AI model can call to generate results. For example, if it detects a mathematical computation, it might launch such a tool to answer the math question. Models have to be trained on how to launch tools, and when to launch each particular tool (a sketch of a simple calculator tool follows the list below). Some examples of tools include:
- Math tools (e.g., calculators, converters)
- Clocks and calendars (e.g., for time/date computations)
- Wordsmithing tools (e.g., wordcount)
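As a concrete illustration of a calculation tool, here is a minimal sketch of a calculator that the engine could expose to the LLM. It is only a sketch: it uses Python’s ast module to evaluate basic arithmetic safely, and a production tool would need broader coverage and error handling.

```python
# A minimal calculator tool sketch: safely evaluates simple arithmetic
# expressions such as the LLM might extract from a user's math question.
import ast
import operator

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def calculator_tool(expression: str) -> float:
    """Evaluate a basic arithmetic expression, e.g. '12 * (3 + 4)'."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expression, mode="eval").body)

print(calculator_tool("12 * (3 + 4)"))  # 84
```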
Tool Architectures
LLMs with tool capabilities used to be called Tool-Augmented Language Models (TALMs), but this terminology has largely been dropped. Tool usage is table stakes for LLMs these days, and is just one part of their training regimen, along with high-protein eggnog. LLMs are trained generically about tools in regard to issues like the following (a sketch of a tool specification follows the list):
- Deciding whether to use a tool or not for a user query.
- Choosing which tool to use for a given query.
- Extracting the parameters from the user query to send to the tool.
- The input format required for the tool and its “function call” request (usually JSON or a Python script).
- The output format returned from the tool (e.g., JSON or HTML or plain text).
- How to use the tool results for a final answer.
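To make the “function call” request format concrete, here is a minimal sketch of a tool specification, loosely following the JSON style used by OpenAI’s “tools” parameter (discussed further below). The tool name and parameter schema are hypothetical examples, not a real API.

```python
# A hypothetical tool specification in the JSON "function call" style.
# The tool name and parameters are illustrative only.
get_time_tool = {
    "type": "function",
    "function": {
        "name": "get_current_time",
        "description": "Return the current time for a given timezone.",
        "parameters": {
            "type": "object",
            "properties": {
                "timezone": {
                    "type": "string",
                    "description": "IANA timezone name, e.g. 'Australia/Sydney'",
                },
            },
            "required": ["timezone"],
        },
    },
}
```

A specification like this is sent to the model along with the user query; the model then decides whether to emit a matching function call, extracting the parameter values from the query text.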
Tool usage is intended to be hidden from the user. A normal user won’t see the tool usage happening, because the LLM does everything in its own sandbox before presenting the final results as a summary. The normal sequence is for the LLM to receive a user query, decide that a tool call is needed, wait for the tool’s results, and then summarize the tool output back into nice text. However, developers who are building LLM-based applications can view traces of tool function calls and their results.
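Here is a minimal sketch of that receive/decide/execute/summarize loop. The llm_generate and run_tool functions are hypothetical stubs standing in for a real LLM API and tool runtime.

```python
# A minimal sketch of the hidden tool-call loop described above.
from datetime import datetime

def run_tool(name, arguments):
    """Hypothetical tool runtime: dispatch to a registered tool by name."""
    if name == "get_current_time":
        return datetime.now().strftime("%H:%M")
    return f"unknown tool: {name}"

def llm_generate(query, tool_result=None):
    """Hypothetical LLM call: returns either a tool request or final text."""
    if tool_result is None and "time" in query.lower():
        return {"tool_call": {"name": "get_current_time", "arguments": {}}}
    return {"tool_call": None,
            "text": f"The answer is: {tool_result or 'no tool needed'}"}

def answer_query(user_query):
    response = llm_generate(user_query)        # 1. LLM sees the user query
    while response.get("tool_call"):
        call = response["tool_call"]
        tool_output = run_tool(call["name"], call["arguments"])  # 2. engine runs the tool
        response = llm_generate(user_query, tool_result=tool_output)  # 3. summarize
    return response["text"]                     # 4. only this text reaches the user

print(answer_query("What is the time?"))
```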
One major limitation is that LLM tool usage is stateless. The LLM does not usually learn from, or even remember, the results it received from one of its tool interactions. The tool interfaces might support caching, which is a speedup, but this won’t help the LLM to make better use of tools, or to learn anything from the output results it sees from a tool.
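As a sketch of what caching at the tool interface might look like, the example below memoizes a hypothetical exchange-rate tool. The cache avoids re-running an identical call, but, as noted, it does nothing to help the LLM learn from the results (and real-time data would also need a time-limited cache).

```python
# A minimal sketch of result caching at the tool-interface layer.
from functools import lru_cache

@lru_cache(maxsize=1024)
def get_exchange_rate(base: str, quote: str) -> float:
    # Hypothetical placeholder: a real tool would call a market-data API here.
    print(f"(cache miss: fetching {base}/{quote})")
    return 0.64

get_exchange_rate("AUD", "USD")   # cache miss: runs the underlying tool
get_exchange_rate("AUD", "USD")   # cache hit: returns instantly, tool not re-run
```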
How are Tools Integrated?
Like humans, an AI needs to learn to look at its watch if someone asks the time. Specific training data sets are required that tell the AI what tool to use, and when.
In the early days of LLMs, only a small number of internal tools were used. However, modern LLMs now often offer the ability to submit an interface specification for a new tool, usually via a JSON configuration file. This means that you can create and add new tools for business-specific purposes without needing to do any fine-tuning of these powerful LLMs.
During inference, the AI engine has to recognize in the LLM output that a tool must be executed. Not all queries require tools, and hence the LLM output results may or may not indicate tool usage. There are a variety of ways to do this:
- Tool-specific tokens — i.e., the LLM can emit a “trigger” token to run a tool. Note that PEFT (parameter-efficient fine-tuning) could be used here to fine-tune new tool capabilities, by only adding a few new tool-triggering tokens to the vocabulary.
- Placeholder patterns — i.e., the LLM outputs a special pattern such as “--insert current time here--”, and the engine then scans for these patterns. This avoids adding tool tokens to the vocabulary, but is inefficient in that the placeholder takes multiple text tokens in the output (see the sketch after this list).
- JSON data processing — this is the method used by OpenAI’s API, whereby some models have been trained to return JSON-formatted text that indicates a function call. In this case, the client must call the tool, rather than the OpenAI server. The models can also call tools themselves automatically on the server side, but this is less under the control of the client. The OpenAI client API supports “tools” and “tool_choice” parameters, which give some control over the tool launching process.
- Code generation — there are various AI models that will generate code, such as Python, that can be executed to generate the answer. This is a general solution, because Python can call various submodules and can thereby implement many different tools.
- Multi-level planning — the AI first generates a plan of how to answer the query, including what tools to use, and then runs any tools, and then does another inference query to collate it into a final answer.
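Here is the placeholder-pattern sketch promised above: the engine scans the LLM output for special patterns and substitutes tool results before the text reaches the user. The placeholder syntax follows the example in the list but is purely illustrative, not a standard.

```python
# A minimal sketch of the "placeholder pattern" integration method.
import re
from datetime import datetime

def run_placeholder_tools(llm_output: str) -> str:
    def substitute(match: re.Match) -> str:
        tool_name = match.group(1)
        if tool_name == "current time":
            return datetime.now().strftime("%H:%M")
        return match.group(0)   # unknown tool: leave the placeholder untouched
    return re.sub(r"--insert (.+?) here--", substitute, llm_output)

print(run_placeholder_tools("The time is --insert current time here--."))
```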
Tools can be valuable to any type of LLM. They can enhance the output produced by any AI application, but where they really hit their stride is with agents, when tool capabilities are included in “agentic architectures” controlled by multiple agents.
Tool Usage in RAG Architectures
Tool usage is an issue in RAG architectures just as much as in other LLM systems. Not all queries can be handled by an LLM even with a RAG datastore. Any dynamic query that cannot be answered by a chunk of a stored document may need either:
(a) General knowledge from the model itself, or
(b) Tool execution, or
(c) Both of these.
Yet another case is where RAG chunks are required, but none are found. The model needs to be trained to emit a failure message, or a different prompt is required for the zero-chunks case: “Sorry, I do not know the answer to that question.”
Tool integration is a general LLM issue that is not specific to RAG architectures. Examples of tools include clocks, timers, calendars, calculators, and many more. Having an LLM launch tools means training it to know:
(a) Which queries need tools and/or RAG chunks.
(b) What tools are needed (if any).
(c) Whether the tools need the RAG chunk as input, or whether they can run without one.
(d) How to launch the tool (with input parameters).
(e) How to integrate the tool output results into the LLM’s answer.
Hence, an advanced LLM needs to decide on a number of issues at the same time: whether or not it needs to launch a tool, whether or not it needs to get an external RAG chunk, not to mention whether or not a user’s question is safe to answer (i.e., refusal and prompt shields). All of that in less than 200ms. That’s quite a lot to ask of the poor silicon elf.
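Here is a minimal sketch of that multi-way routing decision. In a real system the LLM itself is trained to make these choices in a single pass; the keyword checks below are hypothetical stand-ins just to show the control flow.

```python
# A minimal sketch of routing a query across tools, RAG chunks, and safety.

def is_safe(query: str) -> bool:
    """Hypothetical safety check (refusals, prompt shields)."""
    return "ignore previous instructions" not in query.lower()

def needs_tool(query: str) -> bool:
    """Hypothetical tool-need check (e.g., time, math, live prices)."""
    return any(word in query.lower() for word in ("time", "price", "calculate"))

def route_query(query: str, retrieved_chunks: list) -> str:
    if not is_safe(query):
        return "refuse"
    if retrieved_chunks and needs_tool(query):
        return "rag_plus_tool"        # tool may run on the chunk as input
    if retrieved_chunks:
        return "rag_only"
    if needs_tool(query):
        return "tool_only"
    return "general_knowledge_or_fallback"  # zero-chunks case: may need the
                                            # "Sorry, I do not know" answer

print(route_query("What is the current share price?", retrieved_chunks=[]))
```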
LLM Computer Usage
Computers can be a useful tool! Also, you know, your phone is a computer and it’s a lot more powerful than the one they had on Apollo 13. Could an LLM use your computer?
This idea of having the LLM use your phone or your computer is the latest hot area of tool usage and agentic architectures. Research has shown that LLMs can access the screen of your device in two ways:
(a) Screenshot analysis (image models), or
(b) Hierarchical internal views.
Either of these methods is possible with existing multimodal LLMs. Arguably, the view method might even need only text analysis, if it represents the logical layout of the screen (e.g., logical windows and vector image formats). As I write this, there are two major industry launches getting attention for putting this type of functionality live:
- Apple Intelligence — training on-device LLMs to view your screen and manipulate your apps.
- Anthropic — generalized LLM computer usage with mouse and keyboard control.
Apple Intelligence involves two separate facets. First, the on-device LLMs can learn about what’s on your screen, so the AI can respond in-context depending on what app you’re currently using (e.g., texting versus searching the web). Second, Apple is rushing to have all its own apps, and those from third parties, add LLM integrations called “intents” to their apps, so that the AI on your phone can actually perform actions using those apps.
Anthropic has trained a model with a similar goal, but it’s focused on full computer usage without any modifications to apps. Not only can it view the screen, but it can take control of the mouse and keyboard. Theoretically, the LLM can then do anything that a human could do with the GUI (for better or worse), using the apps as they already exist today.
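As a rough illustration of the second screen-access method, here is a sketch of a hierarchical internal view of a screen, flattened into plain text that even a text-only LLM could analyze. The view structure is a hypothetical example, not any platform’s real accessibility API.

```python
# A hypothetical hierarchical view of a screen, and a helper that flattens it
# into indented text suitable for a text-only LLM prompt.
screen_view = {
    "window": "Messages",
    "children": [
        {"type": "list", "name": "conversations", "selected": "Alice"},
        {"type": "text_input", "name": "compose", "value": "Running late, see"},
        {"type": "button", "name": "Send"},
    ],
}

def flatten_view(node, depth=0):
    """Render the view hierarchy as indented text, one line per element."""
    label = node.get("window") or f"{node.get('type')}: {node.get('name')}"
    lines = ["  " * depth + label]
    for child in node.get("children", []):
        lines.extend(flatten_view(child, depth + 1))
    return lines

print("\n".join(flatten_view(screen_view)))
```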
References
- Jerry Huang, Prasanna Parthasarathi, Mehdi Rezagholizadeh, Sarath Chandar, 14 Apr 2024, Towards Practical Tool Usage for Continually Learning LLMs, https://arxiv.org/abs/2404.09339
- Amy Marks, Jun 11, 2024, Clarifying Function Calling / Tool Use in LLMs, https://medium.com/@aevalone/clarifying-function-calling-tool-use-in-llms-6511af510f99
- Yicheng Fu, Raviteja Anantha, Prabal Vashisht, Jianpeng Cheng, Etai Littwin, 6 Sep 2024, UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity, https://www.arxiv.org/abs/2409.04081
- V Adrakatti, 2024, Exploring screen summarization with large language and multimodal models, Masters Thesis, University of Illinois Urbana-Champaign, Urbana, Illinois, USA, https://www.ideals.illinois.edu/items/131510
- Anthropic, 23 Oct 2024, Developing a computer use model, https://www.anthropic.com/news/developing-computer-use
- Anthropic, 23 Oct 2024, Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku, https://www.anthropic.com/news/3-5-models-and-computer-use
- Anirban Ghoshal, 23 Oct 2024, How Anthropic’s new ‘computer use’ ability could further AI automation, https://www.cio.com/article/3583260/how-anthropics-new-computer-use-ability-could-further-ai-automation.html