Data privacy concerns linger around LLM training
We have all witnessed the accelerating capabilities of Large Language Models (LLMs) in recent years, with the scope of what’s possible widening at every exciting turn. On a technical level they are becoming ever more sophisticated, and they are gradually finding their way into public consciousness.
While most people may not be able to explain exactly what an LLM is, they will often have come into contact with one; in August 2024, ChatGPT reached more than 200 million weekly active users. But do the wider public actually understand how their data might be used to train these LLMs?
The vast majority of the training data consists of text scraped from publicly available internet resources (e.g. the latest Common Crawl dataset, which contains data from more than three billion pages). This can include personal data, which is especially problematic when that data is inaccurate. Clearly, data protection should be the priority here, but implementing these controls has proven very challenging.
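To illustrate why, here is a minimal, hypothetical Python sketch of a rule-based redaction pass over scraped text. The patterns and sample string are assumptions made for the example, not part of any real training pipeline; they show how easily the obvious identifiers can be masked, and how easily everything else slips through.

```python
import re

# Hypothetical illustration only: a naive redaction pass over scraped text.
# The patterns and sample below are assumptions for this sketch, not a real
# training pipeline.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace obvious personal identifiers with placeholder tokens."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

sample = "Contact Jane on +44 7700 900123 or jane.doe@example.com for details."
print(redact(sample))
# Output: Contact Jane on [PHONE] or [EMAIL] for details.
# The name "Jane" slips through untouched; free-text identifiers such as names
# and addresses are exactly what simple rules miss, which is part of why data
# protection at web scale is so hard to get right.
```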
In worst-case scenarios, LLMs may reveal sensitive information contained in their training datasets, amounting to worrying data breaches.
Now, social media giants are scraping their own sites for training purposes; controversially, Meta has resumed this activity after previously bowing to public concerns about privacy. Similarly, LinkedIn began the same process unannounced earlier this month, before likewise suspending it. Whether these cases push the data privacy challenges of LLMs further into the public domain remains to be seen. But there is no doubt that users should be educated and protected by robust regulation; balancing innovation with privacy is the billion-dollar challenge.
With Meta forging ahead with training its LLMs on user data despite the public outcry, the decision appears irreversible. The focus must therefore switch to the actions the social media giant should take as the process continues.
Unsurprisingly, transparent communication would be a good place to start, something users may feel has been lacking up to this point. After all, many users initially found out about the training via a news story or word of mouth, not through direct communication from Meta. In recent weeks, Meta has been notifying users of the impending process, although the option to opt out is not explicitly presented. Even then, the process is not a simple one; users must navigate multiple clicks and scrolls, and Meta claims that it is at its own discretion whether it honours the request anyway.
While Meta would argue that the policy is now GDPR compliant, its treatment of users has been poor overall. Straightforward opt-out mechanisms and greater transparency about how the data will be used would begin to rebuild those bridges. Regaining genuine trust will be more challenging, as Meta will have to demonstrate that it is adhering to regulation as it evolves. For many users, its current actions remain a little too murky.
Even casual users of the internet will have left a considerable digital footprint without realising it. Conventional wisdom has often held that consumers are indifferent to protecting their privacy online, but this view is becoming outdated, especially with the rise of language models scraping data across the web.
In 2023, the IAPP unveiled its first Privacy and Consumer Trust Report, which surveyed 4,750 individuals across 19 countries. It found 68% of consumers globally are either somewhat or very concerned about their privacy online, indicating that the value of online data may be becoming apparent.
It would not be surprising to see this figure rise year on year, especially as awareness grows. That is a positive development; personal data is a precious commodity and should not be obtained without the necessary protections and regulation, many of which have proven difficult to enforce. It is therefore essential for consumers to take steps, however limited, to protect their personal information if they do not wish it to be used. A robust understanding of a platform’s data privacy policies is ultimately a good place to start.
Balancing innovation and necessary protections
As ever, regulators are walking a tightrope: legislation must offer the necessary protections without stifling innovation more broadly. That is easier said than done. The EU AI Act is an example of a policy that took time to come to fruition and that, given the fast pace of artificial intelligence, some have argued is already behind the times. This must be kept in mind alongside encouraging risk-taking in the space.
It is no different when it comes to data privacy; regulators understandably do not want to starve LLM developers of the tools they need, but they must also be mindful of consumers’ rights. We can expect a similar tightrope walk as the UK Government’s Data Protection and Digital Information Bill goes through the committee stage. Time will tell whether the right balance has been struck.
What comes next?
When it comes to the future of data privacy, predictions are difficult. The rapid pace of AI means our understanding can change quickly, leaving earlier assumptions in the dust. And as AI becomes more advanced, the appetite for data will only grow, even though some experts argue that model capabilities will plateau as training increasingly relies on AI-generated data.
This may prompt developers like Meta to become more creative about the data they capture. But with consumer awareness growing, it feels as though we are on a collision course between those obtaining data and those who own it. Regulators have a colossal part to play in clearing a sensible path for both parties, and the sooner they do so, the better.
Thomas Hughes is a Data Scientist and Lead LLM Developer at Bayezian.