A technical overview on meta and user-driven data classification

June 4, 2024
Mark Stone
6 min read

 Modern enterprises struggle with massive data growth, often exponential from year to year. With an equally massive migration of data to the cloud, organizations are handling diverse types of data (such as intellectual property, financial, business confidential, and regulated PII/PCI/PHI data) in increasingly complex environments.  

 According to a Cybersecurity Ventures report, total global data storage is expected to exceed 200 zettabytes by 2025. 

 It’s an understatement to say that protecting that data is critical. But how can you protect the data if you don’t know how to identify and classify it?  

 Defining terms: what is data classification?  

 Data classification is about the ability to label your data semantically.  

 Think about data classification like going to a conference where you may have a blue, red, or green badge. One color may be for speakers, one for press, one for the show floor and another for full access.  

 These are like classification badges, with each badge controlling what you can do there. It’s the same with data classification: we use labels to control access. A label is easy to work with because it can be used flexibly across the network, where each system can have its own set of policies based on it. The label becomes a way to identify a person or a piece of data, and based on that label, you can take action. 

 Why is it important?  

 Data classification is one of the most critical steps in helping organizations identify high-value data by categorizing it into an agreed set of specific and meaningful categories.  

 Data classification drives multiple use cases — such as data labeling, sensitive data identification, automating protection, compliance, security, access control, and data retention.  

 Typically, organizations will classify their data as follows: 

  • Public, which means anybody can see it  
  • Internal, which means only people inside the company can see it  
  • Confidential, in which only certain sets of users and groups can see it  
  • Super secret, in which case it’s highly restricted 

The key is to take all the data you care about and classify it accordingly so that as the data moves through your network, the treatment of the data is consistent — no matter where it is.  

Then, you can put the appropriate set of access control rules around it.  

With classification, you will have some gradation and taxonomy that dictates how you want your data treated. 
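
As a rough illustration of how a graded taxonomy can drive access rules, here is a minimal Python sketch. It assumes the four levels listed above form a strict order and that each user holds a single clearance level; real policies for confidential data are usually scoped to specific users and groups, so treat the names `Label` and `can_access` as hypothetical.

```python
from enum import IntEnum


class Label(IntEnum):
    """Classification levels from the list above, ordered least to most restricted."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    SUPER_SECRET = 3


def can_access(user_clearance: Label, document_label: Label) -> bool:
    """Allow access only when the user's clearance meets or exceeds the document's label.

    Assumption: a single ordered clearance per user. Group-based rules are not modeled.
    """
    return user_clearance >= document_label


# Hypothetical checks: an employee with INTERNAL clearance.
print(can_access(Label.INTERNAL, Label.PUBLIC))        # True
print(can_access(Label.INTERNAL, Label.CONFIDENTIAL))  # False
```

Because the decision depends only on the label, any system the data passes through can apply the same check without knowing anything else about the document.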

The key factor of a data classification strategy is identifying any valuable data that must be secured with additional steps above and beyond your standard enterprise security posture.  

This means separating the vital few from the trivial many. 

Challenges in data classification  

Data classification presents numerous challenges, including the diversity of data types, the volume of data, and the dynamic nature of data usage.  

The key challenge is taking all the data an entity cares about and classifying it appropriately.   

Take unstructured data, for example, which can be emails and documents, Teams or Slack messages, social media posts, or audio and video files. These pose significant classification difficulties due to their varied formats and contexts. The increasing use of cloud storage and remote work has introduced new complexities in ensuring consistent data classification across different environments. 

Traditional data classification methodologies, such as rule-based and manual classification, have served organizations for years. However, these methods often fall short in handling the growing complexity and volume of data. Emerging methodologies, such as trainable classifiers powered by machine learning, offer a promising yet still limited alternative. 

How organizations classify their data and the challenges with each taxonomy 

There are several ways in which an organization can classify data: 

  • End-user based 
  • Centralized policy-based 
  • Metadata driven 
  • Trainable classifiers

End-user based  

Relying on end users for data classification is, in practice, the least effective method, yet far too many organizations deploy this strategy. They ask their end users to self-label data as they create, modify, and duplicate it. When done correctly and at the time of document creation, this can work well, provided the user applies expert judgment. 

However, the end user’s job is not to secure data. In fact, they will probably err on the side of marking everything public to keep it accessible. Unfortunately, while that makes users’ lives easier, it circumvents security policies and can lead to data exfiltration.

If you’re a security team managing an enterprise, imagine 100,000 employees creating, modifying, and sharing content all over the place. If you have busy employees, data classification will rarely be top of mind.  

Centralized 

A centralized policy-based data classification technique applies central rules to classify documents without requiring end-user input. DLP (Data Loss Prevention) and CASB (Cloud Access Security Broker) products generally have some policy-based classification in place. An example of a policy could be “Mark all documents with the code word Sedona as sensitive.”

The centralized approach can easily tag documents that are identified by rules.  
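
A minimal sketch of such a rule in Python might look like the following. The rule list, the `classify` function, and the `"unclassified"` default are assumptions made for illustration; they do not reflect how any particular DLP or CASB product stores or evaluates its policies.

```python
import re

# Hypothetical central rule set: each rule pairs a pattern with the label to apply.
RULES = [
    (re.compile(r"\bsedona\b", re.IGNORECASE), "sensitive"),
]

DEFAULT_LABEL = "unclassified"  # assumption: label used when no rule matches


def classify(text: str) -> str:
    """Return the label of the first rule whose pattern appears in the text."""
    for pattern, label in RULES:
        if pattern.search(text):
            return label
    return DEFAULT_LABEL


print(classify("Project Sedona launch plan"))        # sensitive
print(classify("Cafeteria menu for Friday"))         # unclassified
print(classify("Team offsite in Sedona, Arizona"))   # sensitive
```

The third call matches the code word even though the document is only about a trip, which is the kind of mislabeling a keyword rule cannot avoid on its own.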

All this is well and good, but how do you manage this classification at scale? If you’re an infosec team, you’d be writing a lot of rules and spending a significant amount of time going to all the business units for feedback. Typically, this method breeds many false positives. Plus, if you classify a file as more sensitive than it is, it can be blocked and hinder employee productivity. If you classify it as less sensitive than it is, people will have access that they shouldn’t.  

Most importantly, writing or relying upon regex rules is incredibly time-consuming and, depending on the size of your organization, can take months or even years.  

Metadata-driven  

Organizations also classify data by identifying document sensitivity using the associated metadata.  

Metadata might be:  

  • The file folder in which the document is stored  
  • The role of the person that creates or accesses a document  
  • The title, level, or organization of the person  

An example of metadata-driven classification might be marking any document to which a company executive has access as sensitive. Or, it could be automatically marking any document stored in the “Revenue Forecasts” folder as sensitive.  
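
The two examples above can be sketched in Python as follows. The `DocumentMetadata` structure, the role name `"executive"`, and the sample paths are hypothetical; the point is only that the decision uses metadata and never reads the document's content.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class DocumentMetadata:
    path: str                 # where the document is stored
    accessor_roles: set[str]  # roles of the people who can access it


# Hypothetical policy values taken from the examples in the text.
SENSITIVE_FOLDERS = {"Revenue Forecasts"}
SENSITIVE_ROLES = {"executive"}


def classify_by_metadata(meta: DocumentMetadata) -> str:
    """Label a document from its folder and accessor roles only."""
    folders = set(PurePosixPath(meta.path).parent.parts)
    if folders & SENSITIVE_FOLDERS:
        return "sensitive"
    if meta.accessor_roles & SENSITIVE_ROLES:
        return "sensitive"
    return "unclassified"


forecast = DocumentMetadata("/finance/Revenue Forecasts/q3.xlsx", {"analyst"})
menu = DocumentMetadata("/hr/lunch-menu.txt", {"executive", "analyst"})
print(classify_by_metadata(forecast))  # sensitive
print(classify_by_metadata(menu))      # sensitive
```

The lunch menu is marked sensitive simply because an executive can open it, which shows how coarse a metadata-only rule can be.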

However, this approach has several notable limitations: 

  • A sweeping and broad approach leaves little room for nuance 
  • It is possible to restrict access to legitimate content, thereby creating friction in the business 
  • The quality of the metadata itself may be suspect (e.g., the user’s title or role), which will be reflected in the quality of classification 

Trainable classifiers  

Trainable classifiers learn from labeled data to automatically categorize new data with high accuracy, reducing the manual effort required and improving consistency. This advanced approach enables organizations to adapt to changing data landscapes and maintain accurate classification at scale. For example, machine learning models can be trained on a dataset of labeled documents to recognize patterns and classify new documents accurately. This not only enhances efficiency but also ensures that classification remains consistent across large datasets. 
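
To make the idea concrete, here is a toy trainable classifier in Python: a multinomial naive Bayes model with Laplace smoothing, trained on four made-up labeled documents. Naive Bayes is chosen here only because it fits in a few lines; the training data, class name, and labels are all assumptions, and production classifiers use far larger datasets and more capable models.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase whitespace-separated words."""
    return text.lower().split()


class NaiveBayesClassifier:
    """Toy multinomial naive Bayes text classifier (illustrative only)."""

    def __init__(self):
        self.label_counts = Counter()            # documents seen per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocabulary = set()

    def train(self, labeled_docs):
        for text, label in labeled_docs:
            self.label_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            # Start from the log prior, then add one log likelihood per word.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                # Laplace smoothing keeps unseen words from zeroing the score.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up labeled examples standing in for a real training set.
training_data = [
    ("quarterly revenue forecast and margin targets", "sensitive"),
    ("customer ssn and credit card numbers", "sensitive"),
    ("cafeteria lunch menu for friday", "public"),
    ("office holiday party schedule", "public"),
]

classifier = NaiveBayesClassifier()
classifier.train(training_data)
print(classifier.predict("revenue forecast margin"))  # sensitive
print(classifier.predict("lunch menu schedule"))      # public
```

Unlike a hand-written rule, nothing here names a keyword explicitly: the model infers which words signal each label from the examples it was given.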

While trainable classifiers offer flexibility, they come with inherent challenges. First, the training process requires a substantial amount of labeled data. Plus, a classifier is only as accurate as the data it’s trained on.   

More importantly, maintaining and updating these classifiers can be resource-intensive, especially as data evolves and new categories emerge.   

There’s also the risk of overfitting, where the classifier performs very well on the training data but fails to classify new, unseen data correctly.  

Data Classification with Concentric AI  

Concentric AI’s Semantic Intelligence solution uses advanced machine learning technologies to scan and categorize data autonomously and with context. This eliminates the need for rules, regex patterns, or upfront policies — ensuring accurate and efficient data classification.  

Concentric AI autonomously identifies data, learns how it’s used, and determines whether it’s at risk. Concentric empowers you to know where your data is across unstructured or structured data repositories, email and messaging applications, in the cloud or on-premises, all with semantic context. 

Our data classification process is so precise that we have introduced the concept of archetypes: specific types of data or files containing sensitive information. Concentric AI can identify the exact type of document, be it a business insurance claim or an auto insurance policy, allowing for more precise risk assessments and data management strategies. 

With Concentric AI, customers do not need to provide our data models with good and bad data sets or validate the results. It’s as simple as pointing and connecting to data repositories to get a highly accurate, contextual view of your data in minutes. There is no need for training or validation.  

Want to see firsthand, with your own data, how Concentric AI can quickly and easily be deployed to identify, classify, and remediate risk for all your sensitive data? No rules, no regex, no complex policies, and no upfront work for your employees or security team.  

Book a demo today.   
