How Managing Unstructured Data is Boosting Industries and AI
The modern world relies heavily on digital data, much of which exists outside of traditional spreadsheets or databases. This unstructured data encompasses a wide range of formats, including text, documents, audio and video files, images, emails, log files, genomic data, sensor data from IoT devices, and medical imagery. As the variety and volume of data generated by machines and applications continue to expand, it accumulates across data centers, edge locations, and the cloud. Many IT organizations struggle with limited visibility into this data—uncertain about its location, quantity, user access, and growth rate.
According to a survey conducted by my company this year, nearly 50% of enterprises are now storing over 5PB of unstructured data, with about 30% exceeding 10PB. To put this into perspective, 10PB is equivalent to around 110,000 ultra high-definition movies or roughly half of the data housed by the U.S. Library of Congress. Additionally, most organizations allocate more than 30% of their IT budgets to data storage.
The business challenge of managing unstructured data
Now that AI, big data analytics and digital processes dominate business strategies, it’s imperative to make better use of all this data. Unstructured data is the fuel needed for AI, yet most organizations aren’t using it well. One reason is that unstructured data is difficult to find, search across and move because of its size and its distribution across common hybrid cloud environments.
The other reason unstructured data has been underutilized is that only recently have mainstream AI tools and services become affordable for organizations (SaaS and cloud-based) and usable without deep data science expertise. But times are changing, and our survey found that preparing for AI is a top data management priority for enterprises.
Based on the survey findings, enterprises have two main priorities in managing unstructured data: quickly finding, sorting and leveraging it for AI projects and, at the same time, controlling rapidly growing storage and backup costs.
Accomplishing these goals requires new ways of managing data: approaches tied less to managing individual storage devices, which has been the traditional approach, and focused instead on managing data independently to deliver useful, needed data services to the business.
Unstructured data management solutions and strategies can help IT gain holistic visibility and a detailed understanding of unstructured data across the organization: how much data is stored and where, which file types and sizes are most prominent, what it costs to store and back up, who the top owners are, what percentage of the data is “cold” or orphaned, and other identifying characteristics such as metadata describing file contents.
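As a rough illustration of the kind of visibility described above, the following Python sketch walks a single directory tree and reports file counts, capacity by file type, top owners and the share of cold data. It is a minimal sketch under stated assumptions, not how any particular product works: the function name `summarize_tree`, the one-year threshold and the use of last-modified time as the test for "cold" are all illustrative choices.

```python
import os
import time
from collections import Counter
from pathlib import Path


def summarize_tree(root, cold_after_days=365, now=None):
    """Walk `root` and return basic storage metrics for the files under it.

    Assumption: a file counts as "cold" when it has not been modified
    for `cold_after_days` days. Real tools may use access time instead.
    """
    now = time.time() if now is None else now
    cold_cutoff = now - cold_after_days * 86400
    file_count = 0
    total_bytes = 0
    cold_bytes = 0
    bytes_by_type = Counter()
    bytes_by_owner = Counter()

    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                info = path.stat()
            except OSError:
                continue  # file vanished or is unreadable; skip it
            file_count += 1
            total_bytes += info.st_size
            # Group capacity by lowercase extension and by numeric owner id.
            bytes_by_type[path.suffix.lower() or "(none)"] += info.st_size
            bytes_by_owner[info.st_uid] += info.st_size
            if info.st_mtime < cold_cutoff:
                cold_bytes += info.st_size

    cold_percent = round(100 * cold_bytes / total_bytes, 1) if total_bytes else 0.0
    return {
        "file_count": file_count,
        "total_bytes": total_bytes,
        "cold_bytes": cold_bytes,
        "cold_percent": cold_percent,
        "bytes_by_type": dict(bytes_by_type),
        "top_owners": bytes_by_owner.most_common(5),
    }
```

Calling `summarize_tree("/some/share", cold_after_days=180)` on a path of your choosing returns a dictionary that could feed a simple cost report. At petabyte scale a single-threaded walk like this would be far too slow, which is one reason dedicated indexing tools exist.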
With this information, organizations can clean up their data estates and choose the optimal, most cost-effective storage for different data sets. Simultaneously, they can create automated data workflows to find their data, tag it with new contextual metadata to aid search and move it to AI and ML technologies.
Leveraging unstructured data to improve business outcomes and decision-making
Let’s start with a look at healthcare, one of the largest data-producing industries. Roughly 30% of the world’s data volume is generated by the healthcare industry, and this will grow to 36% by 2025, according to research compiled by RBC Capital Markets. Clinical notes and records, medical images, digital pathology and research studies are valuable sources of information to better inform personalized medicine and improve patient outcomes.
While still nascent in practice, AI is starting to enable more accurate, faster analysis of common scans such as mammograms and colonoscopies. AI is also behind intelligent alerting systems for community health, such as an environmental health crisis tracked to ER patients from the same location. Research published in the New England Journal of Medicine indicates that generative AI has improved patient outcomes by up to 45% in clinical trials, particularly in the treatment of chronic diseases such as diabetes and heart disease. Generative AI solutions have been reported to reduce the paperwork burden of clinicians and even improve communications between physicians and their patients.
One significant challenge in healthcare is analyzing and managing the complexity of data and file types while ensuring tight adherence to the regulations governing their use and protection. A key strategy is to put in place the right policies and tools to discover, analyze and protect data, and to safely move it to locations where it can be anonymized and cleansed prior to analysis.
The auto industry is another sector navigating technology disruption. It’s hard to drive down the road for more than a few minutes without seeing an electric vehicle, whereas two years ago they were still a rare sight. Electric and autonomous vehicles collect large quantities of data from sensors, which helps the car adjust and take actions on the fly or issue alerts to the driver. The collection and analysis of this data is also white gold for manufacturers to troubleshoot issues and improve their designs. Using an unstructured data management system, a car manufacturer could create a workflow like this:
- Find crash test data related to the abrupt stopping of a specific vehicle model.
- Use an AI tool to identify and tag data with “Reason = Abrupt Stop”.
- Move only the related data to a cloud data lakehouse to reduce time and cost associated with moving and analyzing unrelated data.
- Move the unrelated data to an archival storage tier for cost savings (or delete it) once the analysis is complete.
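The four steps above can be sketched in a few lines of Python. This is a simplified, in-memory illustration rather than a real product API: `FileRecord`, `looks_like_abrupt_stop`, `stage_for_analysis`, `archive_unrelated` and the location names are hypothetical, and the keyword check merely stands in for whatever AI tool would do the classification.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """Hypothetical index entry for one file of test data."""
    path: str
    model: str
    test_type: str
    tags: dict = field(default_factory=dict)
    location: str = "primary"


def looks_like_abrupt_stop(record):
    # Placeholder for a call to an AI tool; here it is a plain keyword check.
    return "abrupt_stop" in record.path


def stage_for_analysis(records, model):
    # Step 1: find crash test data for the vehicle model.
    candidates = [r for r in records if r.model == model and r.test_type == "crash"]
    related = []
    for record in candidates:
        if looks_like_abrupt_stop(record):
            # Step 2: tag the matching data.
            record.tags["Reason"] = "Abrupt Stop"
            # Step 3: move only the related data to the lakehouse.
            record.location = "cloud-lakehouse"
            related.append(record)
    return candidates, related


def archive_unrelated(candidates):
    # Step 4: once analysis is complete, archive everything left untagged.
    archived = []
    for record in candidates:
        if record.tags.get("Reason") != "Abrupt Stop":
            record.location = "archive"
            archived.append(record)
    return archived
```

In practice the "move" steps would copy data between storage systems rather than change a field, but the sketch shows the design point: classification happens first, so only the tagged subset travels to the more expensive analytics tier.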
Imagine the implications for any manufacturer that wants to leverage the right machine data to avoid bad outcomes for its customers and to improve products faster than its competitors.
Businesses need easier ways to comply with data regulations and audits
From industry regulations governing sensitive data and geolocation requirements to e-discovery requests, ransomware prevention and data management during an M&A or divestiture, the list of data compliance needs continues to grow. Holistic data governance is becoming ever harder to achieve given the volume of data, the prevalence of shadow IT and the distribution of data across so many places. Being able to easily search and move regulated data as needed is critical to avoiding breaches and data loss or misuse that may result in fines, lawsuits, customer defections and brand damage.
Consider data management solutions that support automated workflows for compliance. For example, a user could create a query to find all data related to a divestiture project and then, through an API, use an external application like Amazon Macie to identify PII and tag it. Next, the system could automatically move the PII data to an object-locked cloud storage service where it cannot be modified or deleted.
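A compliance workflow of this shape might look like the following Python sketch. Everything here is an assumption made for illustration: the regular expressions are a crude stand-in for an external PII detection service such as Amazon Macie (no Macie API calls are shown), and `WriteOnceStore` is a toy in-memory class that only imitates the write-once behavior of object-locked storage.

```python
import re

# Crude stand-in for an external PII detection service.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def detect_pii(text):
    """Return the sorted list of PII kinds found in `text`."""
    return sorted(kind for kind, pattern in PII_PATTERNS.items() if pattern.search(text))


class WriteOnceStore:
    """Toy imitation of object-locked storage: no overwrite, no delete."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        if key in self._objects:
            raise PermissionError(f"{key} is locked and cannot be overwritten")
        self._objects[key] = data

    def delete(self, key):
        raise PermissionError(f"{key} is locked and cannot be deleted")

    def __contains__(self, key):
        return key in self._objects


def quarantine_pii(documents, project, store):
    """Tag documents of `project` that contain PII and copy them to `store`."""
    moved = []
    for doc in documents:
        if doc["project"] != project:
            continue  # the query step: keep only the divestiture project
        kinds = detect_pii(doc["text"])
        if kinds:
            doc["tags"]["pii"] = kinds
            store.put(doc["path"], doc["text"])
            moved.append(doc["path"])
    return moved
```

Each document is assumed to be a dictionary with `path`, `project`, `text` and `tags` keys. A real deployment would replace the pattern matching with calls to the chosen classification service and the toy store with the retention controls of the cloud provider.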
Growing volumes of unstructured data can be both a gift and a curse. Companies of all sizes are dealing with the strain on budget and time to store, manage and govern it all. Yet with intelligent automation, sound policies and collaboration among key data stakeholders across the business, IT teams can properly manage the data and effectively leverage it for game-changing AI and analytics initiatives.
By Krishna Subramanian