Aussie AI
Chapter 8. Data for AI Projects
Book Excerpt from "Generative AI Applications: Planning, Design and Implementation"
by David Spuler
AI Needs Data
One of the strategic imperatives for an advanced AI project is data. All of the pre-trained LLMs have already ingested vast amounts of data, and if you want to specialize their capabilities for your business, you’ll need to feed the beasts, too.
The process of shovelling data into an LLM is called “fine-tuning” or “specialization” of the model. The goal is an LLM that is better at whatever specific task it is being asked to do. A common example is a customer support chatbot that needs to ingest the data sheets for all your products so that it can answer questions about them. However, there are many other types of projects that need specific data.
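For concreteness, fine-tuning data is typically prepared as JSON Lines (JSONL), one training record per line. The sketch below uses the chat-style "messages" schema that several providers accept, but the exact field names vary by vendor, and the product Q&A pairs here are invented examples:

```python
import json

# Hypothetical product Q&A pairs drawn from data sheets.
examples = [
    {"question": "What sizes does the Model X widget come in?",
     "answer": "The Model X widget is available in small, medium, and large."},
    {"question": "Is the Model X widget waterproof?",
     "answer": "Yes, the Model X widget is rated IP67."},
]

def to_chat_record(example):
    """Convert a Q&A pair into a chat-style fine-tuning record."""
    return {"messages": [
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"]},
    ]}

# Write one JSON object per line (JSONL), the usual fine-tuning upload format.
with open("finetune.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(to_chat_record(ex)) + "\n")
```

Check your provider's documentation for the exact schema before uploading; some still use simple prompt/completion pairs instead of chat messages.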
Finding enough data is a common problem for AI projects. The requirements are new, and data has become far more important than in the past. Even worse, it has to be “good” data, and “dirty data” is a recurring theme in business projects. Hence, cleaning up your data is a common bottleneck, and there are various tools and external vendors that can help.
Data Inventory
There are several major data problems causing strain in many commercial AI projects:
(a) Finding good data.
(b) Cleaning it up!
(c) Structuring it logically.
(d) Formatting it.
All of these issues are bottlenecks that are often underestimated, especially since the people pushing for AI projects tend to be tech staff, who only care about code.
Some of the aspects of the data that need to be considered:
- Quality of the data
- Formats (input and output)
- Authorship
- Legal license rights
Data Formats
With regard to data formats, they could be anything, but common examples include:
- HTML pages of your public website (or internal intranet)
- Database records
- Emails
- PDF files
- Microsoft Word document files
- Plain text
- Free-form text (e.g., user questions and staff answers in a trouble ticketing support system)
If you’re working on code models, most of the programming languages are in plain text. There’s usually a fair amount of source code floating around if you visit the IT floor.
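Whatever the source format, most pipelines first convert everything to plain text. PDF and Word files generally need third-party libraries, but HTML pages from the website or intranet can be stripped with Python's standard library alone. A minimal sketch:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page, skipping scripts and styles."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = False

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    """Return the visible text content of an HTML document."""
    extractor = TextExtractor()
    extractor.feed(html)
    return " ".join(extractor.parts)

page = ("<html><head><style>p{color:red}</style></head>"
        "<body><h1>Model X</h1><p>A fine widget.</p></body></html>")
print(html_to_text(page))  # Model X A fine widget.
```

This is enough for tidy intranet pages; messy real-world HTML usually justifies a dedicated extraction library.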
Multimodal Data
And that list of data was just the text. Depending on the product, you might want data for your multimodal LLM. There are also these types of data to inventory:
- Images (e.g., photos, diagrams, drawings)
- Videos
- Animations
- Audio (e.g., music, sound effects)
- Advanced formats (e.g., 3D CAD/CAM design data)
Each of these non-text data sources has a variety of different available formats. I’m not going to go into the details of image formats and video codecs, because, well, I don’t know anything about that stuff, although I know someone who does.
What is Good Data?
Higher quality data is better for fine-tuning, and is also one of the ways that Small Language Models (SLMs) have improved. Some of the issues to consider in terms of data quality include:
- User-generated content versus professionally created content
- Completeness
- Accuracy
- Tone of writing (e.g., casual versus formal)
- Reading level
- Use of complex jargon
- Currency (up-to-date information)
- Safety (harmless and non-toxic content)
To put it more succinctly, would the content actually contain answers that would be helpful to users of your LLM project? This may depend on whether they’re internal staff or your external customers.
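A few of these quality checks can be roughed out automatically before a human ever looks at the document. The sketch below uses crude word-count and keyword heuristics; the thresholds and the jargon list are arbitrary illustrations, not recommendations:

```python
import re

def quality_flags(text, jargon_terms=("polysiloxane", "eigenvector")):
    """Return a list of crude quality warnings for a document."""
    flags = []
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if len(words) < 50:
        flags.append("too short to be complete")
    if sentences and len(words) / len(sentences) > 30:
        flags.append("very long sentences (high reading level)")
    lower = text.lower()
    if any(term in lower for term in jargon_terms):
        flags.append("contains specialist jargon")
    return flags

print(quality_flags("Tiny note about eigenvector stuff."))
```

In practice, documents flagged this way get routed to a human reviewer rather than discarded automatically.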
Generally speaking, good data for an LLM project is:
(a) professionally written,
(b) fully-owned by the company, and
(c) written with the general public as the intended audience.
A good example would be the customer “data sheets” about your products, which are either glossy brochures in PDF or “white papers” with technical details. That sort of data would be great to train a user support chatbot on how to answer customer questions about your products.
Hence, the first stop on your quest for good data: the marketing department.
Data Cleaning
Be careful before you load up a USB drive full of PDFs from the marketing server, or set your Linux box spidering the entire corporate intranet. It might not be such a great idea.
Also, if you’ve gathered some dodgy data, don’t expect the LLM to save your bacon! The AI engines are really dumb about this kind of stuff, and won’t recognize that they shouldn’t regurgitate all this out in answers to the general public. It’s kind of like having your kids at show-and-tell announce that you only cook microwave TV dinners at home.
Some of the issues with cleaning of internal proprietary data include:
- Confidential data (all sorts of things!)
- Source code
- Bank account numbers
- Internal discussions (e.g., developers cussing at support staff in the trouble ticket database)
- Individual names, email addresses, or other personally-identifiable information
- Out-of-date information
- Irrelevant information
- Cuss words
- Sensitive topics (many!)
But that's only the exciting stuff. A lot of the problems are much more mundane:
- Typos
- Badly formatted documents
- Poorly written content (e.g., in emails or trouble tickets)
- Incomplete data
- Just Plain Wrong (JPW) data
So, the main thing is that you have to very carefully curate the sources of all the information, and then run a lot of scans on it.
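Many of those scans boil down to pattern matching. Here is a minimal sketch of a red-flag scanner; the regexes are illustrative only, and a real pipeline would use a proper PII detection library:

```python
import re

# Illustrative patterns only -- real PII scanners are far more thorough.
PATTERNS = {
    "email address": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "possible card/account number": re.compile(r"\b\d{13,19}\b"),
    "profanity": re.compile(r"\bdamn\b", re.IGNORECASE),
}

def scan_document(text):
    """Return {issue: [matches]} for any red flags found in the text."""
    hits = {}
    for label, pattern in PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[label] = found
    return hits

ticket = "Damn printer again! Contact bob@example.com, card 4111111111111111."
print(scan_document(ticket))
```

Flagged documents can then be redacted, rewritten, or simply excluded from the training set.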
Open Source Data
Should you use open source data in your business application? There are plenty of curated data sets of AI training data available on the internet. These are free and mostly have permissive open licenses that permit commercial usage. Sounds great!
Firstly, what’s the point? If it’s a publicly available data set, it’s probably already been included in the training set of whatever pre-trained foundation model you’re going to use, whether it’s commercial or open source. Foundation models are data-hungry beasts, and everyone in the AI industry knows this trick!
Secondly, open source data is probably only generic data, too. The whole idea of fine-tuning is to find some special data, stored on a dusty mag tape hidden away in a cupboard down on the shipping dock floor. Then you can fine-tune your magic data into a fancy fine-tuned model, that has genius-level intelligence about “sewer solvents” or whatever you’re selling to your customers, and then you are the AI hero.
This is the data economy now. I read somewhere that it was the companies with the most data that would win at AI. Or maybe it was the companies with the most patents, I forget.
In any case, using open source data is not necessarily a great idea. There are probably situations where it's useful, such as fine-tuning a model to be more likely to follow the information or style of a specialized set of data. Or if your boss demands that you get some data because Apple used LoRA, then by all means use free data. Here's a tip: there are various companies that sell data, too.
Alternatively, if you can’t find any internal special data, you might consider just using a large foundation model in a dataless architecture, without any fine-tuning, LoRA or RAG component, where you focus the extra domain-specific application logic on prompt engineering, tool integrations, and UX enhancements.
Legal Issues with Data
Data is a good place to start the initial conversation with the legal department. Some of the legal concerns with regard to data being available for use with an LLM include:
- Ownership — was the data internally generated and thereby fully owned by the business, or is it subject to a third-party content license?
- Licensed rights — what does a third-party content license actually permit for third-party data?
- Copyrighted data — it’s a currently unresolved issue with regard to non-licensed copyrighted data being used in the training data set of an LLM. It’s also a highly controversial issue at the moment, with active lawsuits.
- User-generated content — does the user license and associated privacy policy allow you to use the data? Or do you want to disclose that you won't use it?
- Minor-created content — are all the users who agreed to your website user terms actually old enough to do so? Can you identify such content to handle it separately? Even if you have rights to such content, would you want to use it?
- Proof of license acceptance — if your user license supposedly permits your use of user data for training models, has anything documenting the user’s acceptance actually been retained? And what is required?
- Mixed copyrighted data — some data may be a mix of user-created content and copyrighted data from other sources, such as where a user uploads an excerpt from a published book.
- Open source license compliance — if some or all of a data set is open source, there are still some complicated compliance aspects, such as attribution and supplying a copy of the license, even for superficially simple licenses like the MIT License or the Apache 2 License.
- Copyleft-licensed content — if the content has a “copyleft” or “share-alike” license, such as Wikipedia or CC-BY-SA, can you use it without onerous obligations attaching to your other intellectual property? It’s an unclear legal issue whether copyleft licenses attach to an LLM trained using that data.
- International data — can data sets from overseas be moved between countries according to the applicable terms and privacy policies, and also the overarching legal systems of the government where it is located?
- Synthetic data — if you’re using “synthetic data” created by some other LLM, what are the legal issues surrounding that? What data sets were used to train the other LLM?
- Photos, images, and multimedia — don’t assume that you have unlimited rights to images, photos, or video just because they appear in company articles or multimedia materials. Many such media may have been licensed from clipart or stock photography websites, with very complex license terms that are often restrictive with onerous penalties for non-compliance.
There is, of course, a huge gray cloud of unclear boundaries hanging over all of these legal issues, some of which are currently making their way through the courts in various countries. All that data sits publicly on the web for anyone to consume for free, but that does not necessarily mean it can be used for training. And just because an author has not stated that you cannot use their content for AI does not mean they have granted permission, either. Five years ago, nobody had a clue that data like this could possibly be used for AI, and now we're dealing with it.
Anyway, your only response to the gray fog and that awful list should be to immediately enrol in a law degree. There are already enough copyright lawsuits to fill a book, and the AI patent lawsuits will be ramping up in a year or two. The only job that pays more than AI IP attorney is training a trillion-parameter model.
Dataless Projects
Finally, note that despite my many words to the contrary, you can actually do AI without data. Some examples of this type of project include:
- Summarization
- Copilot AIs
- Prompt engineering wrappers
Some use cases don’t really need a custom LLM. An example is summarization of documents. If the LLM project goal is to receive a PDF document from a user as its input, and then summarize it as its output, then the custom data is always found inside the input PDF file itself. Thus, a summarization project doesn’t always need specialized data baked into the LLM. On the other hand, a general LLM might not be familiar enough with the jargon and terminology of your problem domain, so there are cases where it might need pre-training to learn that.
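One practical wrinkle in a dataless summarization project: the input document often exceeds the model's context window, so it has to be split into chunks that are summarized separately or hierarchically. A minimal word-based chunker as a sketch (real systems usually split on tokens and respect paragraph boundaries):

```python
def chunk_text(text, max_words=500, overlap=50):
    """Split text into overlapping word-based chunks for summarization."""
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

# A 1,200-word document splits into three overlapping chunks of up to 500 words.
doc = " ".join(f"word{i}" for i in range(1200))
chunks = chunk_text(doc)
print(len(chunks))  # 3
```

The overlap between adjacent chunks helps the model keep context across chunk boundaries, at the cost of some duplicated processing.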
You also don’t need specialized company-specific data to buy an off-the-shelf copilot from a vendor. They’re happy to take your money, no matter how little data you have. In fact, some of these vendors offer ways to customize or modify the outputs of an LLM that’s acting as a copilot. Since it’s hard to do extra fine-tuning on-the-fly, this makes me wonder if they’re doing prompt engineering! Actually, no, some of them are doing RAG, and companies like Lamini are doing multi-LoRA, which is real fine-tuning, so I can eat my words now.
Regardless, my point is that you can also get a lot of the way towards a customized LLM by using prompt engineering, as covered more fully in Chapter 17. The basic idea is that you can add various “custom instructions” or “global context” to every user’s query, and thereby modify the way that the LLM answers the questions. Various factors such as style, tone, brand voice, and other overarching issues can sometimes be addressed without any fine-tuning, through astute use of prompt engineering techniques.
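As a sketch of the "custom instructions" idea, the same company-specific preamble can be prepended to every user query before it reaches the LLM. The message format below follows the common system/user chat convention, and the brand-voice instructions are hypothetical:

```python
# Hypothetical brand-voice instructions prepended to every query.
GLOBAL_CONTEXT = (
    "You are the support assistant for Acme Sewer Solvents. "
    "Answer in a friendly, professional tone. "
    "Only discuss Acme products; politely decline other topics."
)

def build_messages(user_query):
    """Wrap a raw user query with company-wide custom instructions."""
    return [
        {"role": "system", "content": GLOBAL_CONTEXT},
        {"role": "user", "content": user_query},
    ]

messages = build_messages("What dilution ratio should I use?")
print(messages[0]["role"], "->", messages[1]["content"])
```

Because the preamble rides along with every request, changing the style or tone is just an edit to one string, with no fine-tuning run required.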