
Don’t Let GenAI’s Hoover Your Data

Written by Will Poole
May 23, 2023

American inventor James Spangler sold his idea for an electric broomstick-like cleaner — with a cloth filter and dust-collection bag attached to the long handle — to William Hoover in 1908. His invention became the first truly practical domestic vacuum cleaner. And just as the US brand “Kleenex” became a generic name for soft facial tissues, “to Hoover” has become a verb meaning to suck things up.

OpenAI and other LLM providers have taken the hoovering of data to an extreme that many never dreamed possible, with recent models trained on trillions of “tokens” — chunks of words and numbers — that have been scraped from the web (with and without authorization) and used to train the AI.

Not long after ChatGPT’s popularity reached high orbit and became seemingly unstoppable, companies began realizing that the massive data-sucking monsters created by OpenAI and others were happy to hoover up data offered from within corporate firewalls, to train on it as well, and to make the resulting knowledge readily available to all comers. Amazon appeared to be the first major company to ban employees from using ChatGPT, in January, and other tech companies including Samsung and Apple have followed suit, as have banks and most others who care about their confidential information. Famously, Italy also temporarily banned ChatGPT, although it’s unclear how such a ban can be enforced without something like the Great Firewall of China.

I’ve been working with about 20 of our portfolio companies across the Global South since February to raise awareness of the opportunities in Generative AI (GenAI) — and of the near certainty of data leakage when using public LLM services. Many had no idea of the risk, and many more who had embraced rapid prototyping didn’t realize that the data hoovering could be avoided.

There are two fundamentally different approaches to addressing the data leakage problem. The first is to use the right cloud LLM provider the right way and to trust them. The second is to use your data only with a local LLM (inside the firewall) for training and/or fine-tuning, and then to host the tuned model either securely in the cloud or within your corporate firewall. I will discuss the second model in part two of this blog.

Selecting a Trusted GenAI Provider: Azure with GPT-4 or OpenAI with GPT-4?

Microsoft was the first to offer hosted versions of OpenAI’s GPT-4 on Azure with different privacy models, from a “don’t train on my data” model for its publicly hosted instance to a privately hosted model that does train on your data but is accessible only to you and your customers. Microsoft describes its privacy model here and summarizes it with “…No prompts or completions are stored in the model during these operations, and prompts and completions are not used to train, retrain or improve the models.”
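As a rough illustration of what “the right cloud provider, used the right way” looks like in practice, the sketch below assembles a request to an Azure-hosted GPT-4 deployment. Note that prompts are addressed to your own Azure resource endpoint, not to OpenAI’s public API. The resource name, deployment name, and API version here are placeholders, not values from this article — substitute your own.

```python
# Minimal sketch: building a request to an Azure OpenAI chat-completions
# deployment. Names below are assumptions for illustration only.

RESOURCE = "my-company"     # placeholder: your Azure OpenAI resource name
DEPLOYMENT = "gpt-4"        # placeholder: your model deployment name
API_VERSION = "2023-05-15"  # one of the Azure OpenAI API versions

def build_chat_request(api_key: str, prompt: str):
    """Assemble (url, headers, body) for a chat-completions call.

    The request goes to your own resource's endpoint; per Microsoft's
    stated policy, prompts and completions sent this way are not used
    to train or improve the underlying models.
    """
    url = (
        f"https://{RESOURCE}.openai.azure.com/openai/deployments/"
        f"{DEPLOYMENT}/chat/completions?api-version={API_VERSION}"
    )
    headers = {"api-key": api_key, "Content-Type": "application/json"}
    body = {"messages": [{"role": "user", "content": prompt}]}
    return url, headers, body
```

The point of the sketch is the URL: traffic stays scoped to a resource you control, which is what makes the “don’t train on my data” contractual promise auditable, rather than a matter of blind trust.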

Not to be left out of the data privacy race, in late April OpenAI announced “…an upcoming ChatGPT Business subscription in addition to its $20 / month ChatGPT Plus plan. The Business variant targets ‘professionals who need more control over their data as well as enterprises seeking to manage their end users.’ The new plan will follow the same data-usage policies as its API, meaning it won’t use your data for training by default. The plan will become available ‘in the coming months.’”

As of June 15, 2023, OpenAI says that it will not use data submitted by customers via its API to train or improve its models unless you explicitly decide to share your data for this purpose. Whatever provider you use, do look around before making a decision, and don’t make decisions you can’t easily undo when something better comes along a month or two later.


Privacy and IP / data security are not being discussed enough, nor acted on as broadly as they should be. In an otherwise-helpful 20+ page paper released by McKinsey in May, “What Every CEO Needs to Know About Generative AI,” there are only two paragraphs on IP and privacy issues, both framed as possibilities and risks rather than certainties. In my view, every CEO needs to take immediate steps to ensure: 1/ that your employees are not inadvertently leaking proprietary data into public GenAI systems, and 2/ that you have a strategy for prototyping and quickly deploying data-secure GenAI, either in a trusted cloud implementation like Azure or with a privately trained and hosted small-model GenAI, described in part two of this blog.

