Data classification: reviewing challenges, methodologies, and best practices

December 6, 2022
Mark Stone
6 min read

Modern enterprises are struggling with massive data growth, often exponential from year to year. With an equally massive migration of data to the cloud, organizations are harnessing diverse types of data (such as intellectual property, financial, business-confidential, and regulated PII/PCI/PHI data) in increasingly complex environments.

It’s an understatement to say that protecting that data is critical. But the question is: how can you protect the data if you don’t know how to identify and classify it? 

What is data classification?

Data classification is the critical step that helps you identify high-value data in your enterprise by categorizing it into an agreed set of specific and meaningful categories. 

Data classification drives multiple use cases such as data labeling, sensitive data identification, automating protection, compliance, security, access control, and data retention. 

Simply put, data classification is the ability to label your data semantically.

Typically, organizations will classify their data as follows:

  • Public, which means anybody can see it 
  • Internal, which means only people inside the company can see it 
  • Confidential, in which only certain sets of users and groups can see it 
  • Super secret, in which case it’s highly restricted

The key is to take all the data you care about and classify it accordingly so that as the data moves through your network, the treatment of the data is consistent — no matter where it is. Then you can put the appropriate set of access control rules around it. 

With classification, you will have some gradation and taxonomy that dictates how you want your data treated.

Think about data classification like going to a conference where you may have a blue, red, or green badge. One color may be for speakers, one for press, one for the show floor and another for full access. 

These are essentially classification badges, with each badge giving you some control for what you can do there. It’s the same way with data classification: we use labels to control access. The label makes it easy because the label can then be flexibly used across the network, where each system can have its own set of policies based on the label. The label becomes a way to identify a person or our data, and based on that label, you can take action.
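The badge analogy can be sketched in a few lines of code. This is a hypothetical illustration, not any product's implementation: the level names and the ordering below are assumptions chosen to match the four-tier taxonomy above.

```python
# Hypothetical sketch: classification labels drive access decisions,
# much like conference badges. Level names and ordering are illustrative.
CLASSIFICATION_LEVELS = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}

def can_access(user_clearance: str, document_label: str) -> bool:
    """A user may read a document only if their clearance meets or exceeds its label."""
    return CLASSIFICATION_LEVELS[user_clearance] >= CLASSIFICATION_LEVELS[document_label]

print(can_access("internal", "public"))        # True
print(can_access("internal", "confidential"))  # False
```

Because the comparison depends only on the label, any system in the network can enforce its own policy consistently, wherever the data moves.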

Why is data classification so important? 

The central task of a data classification strategy is identifying any valuable data that must be secured with steps above and beyond your standard enterprise security posture. Essentially, this means separating the vital few from the trivial many.

The challenge is taking all the data an entity cares about and classifying it appropriately.  

With so much riding on this critical step, it is helpful to understand the upsides and downsides of the various approaches organizations can take to classify their data.

When deployed properly, data classification can be a foundational element to meaningful data security. But security leaders must continually assess the quality of their data classification efforts.

How organizations classify their data and the challenges with each taxonomy

There are three ways in which an organization can classify data:

  • End-user Based
  • Centralized Policy-Based
  • Metadata Driven

End-user Based 

Relying on end users for data classification is the least effective method, yet far too many organizations deploy it. These organizations ask end users to self-label data as they create, modify, and duplicate it. In principle, a user applying expert judgment at the moment of document creation can classify accurately; in practice, that rarely happens.

[Screenshot: how a user might classify a document using Microsoft Word.]

Sure, this step may take only five seconds, but the burden of something as critical as data protection should not rest with users.

Relying on your end users is an error-prone process, because securing your data is not their job. In fact, they will probably err on the side of marking everything public so it stays easy to access. That makes users’ lives easier, but it circumvents all your security policies.

If you’re a security team managing an enterprise, imagine 100,000 employees creating, modifying, and sharing content all over the place. If you have busy employees, data classification will rarely be top of mind. 

Centralized Policy-Based

A centralized policy-based data classification technique applies central rules to classify documents without requiring end-user input. DLP (Data Loss Prevention) and CASB (Cloud Access Security Broker) products generally have some policy-based classification in place. An example of a policy could be “Mark all documents containing the code word Sedona as sensitive.” The centralized approach can easily tag documents that are identified by rules.
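A minimal sketch of such a rule engine might look like the following. This is illustrative only: the rule list, the SSN-shaped pattern, and the default label are assumptions, with the “Sedona” rule taken from the example policy above.

```python
import re

# Hypothetical sketch of a centralized, policy-based classifier.
# Each rule pairs a compiled pattern with a label; the first match wins.
RULES = [
    (re.compile(r"\bSedona\b"), "sensitive"),               # code-word rule from the text
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "restricted"),   # SSN-shaped strings (illustrative)
]

def classify(text: str, default: str = "internal") -> str:
    """Return the label of the first matching rule, or a default label."""
    for pattern, label in RULES:
        if pattern.search(text):
            return label
    return default

print(classify("Project Sedona launch plan"))  # sensitive
print(classify("Lunch menu for Friday"))       # internal
```

Even this toy version hints at the scaling problem: every new data type or business unit means more hand-written patterns to author, test, and maintain.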

All this is well and good, but how do you manage this classification at scale? If you’re an infosec team, you’d be writing a lot of rules and spending a significant amount of time gathering feedback from all the business units. Typically, this method breeds many false positives. Plus, if you classify incorrectly, files can be blocked, hindering employee productivity; and if a document is classified as more public than it should be, people will have access they shouldn’t.

Most importantly, writing or relying upon regex rules is incredibly time-consuming and, depending on the size of your organization, can take months or even years. 

Metadata Driven

Organizations also classify data by inferring document sensitivity from the associated metadata.

What is metadata? 

It could mean:

  • The file folder in which the document is stored 
  • The role of the person that creates or accesses a document 
  • The title, level, or organization the person belongs to

An example of metadata-driven classification might mean that any document to which a company executive has access should be marked as sensitive. Or, it could be automatically marking any document stored in the “Revenue Forecasts” folder as sensitive. 
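Both examples can be sketched as a short function. This is a hypothetical illustration: the folder names, role names, and labels below are assumptions, not part of any real policy; the key point is that the decision uses only metadata, never the document’s contents.

```python
# Hypothetical sketch of metadata-driven classification: the decision is based
# solely on metadata (folder path, accessor roles), not document contents.
SENSITIVE_FOLDERS = {"Revenue Forecasts", "Board Materials"}  # illustrative names
EXECUTIVE_ROLES = {"CEO", "CFO", "VP"}                        # illustrative roles

def classify_by_metadata(folder: str, accessor_roles: set) -> str:
    if folder in SENSITIVE_FOLDERS:
        return "sensitive"
    if accessor_roles & EXECUTIVE_ROLES:  # any executive has access
        return "sensitive"
    return "internal"

print(classify_by_metadata("Revenue Forecasts", {"Analyst"}))  # sensitive
print(classify_by_metadata("Team Wiki", {"Engineer"}))         # internal
```

The sweeping nature of these rules is visible even here: every file an executive touches becomes “sensitive,” regardless of what it actually contains.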

However, this approach has several notable limitations:

  • A sweeping and broad approach leaves little room for nuance
  • It is possible to restrict access to legitimate content, thereby creating friction in the business
  • The quality of the metadata itself may be suspect (e.g., a user’s title or role), which is reflected in the quality of the classification

To summarize, here are the pros and cons of each method:

End-user Based

Benefits:
  • Less up-front burden on IT/security teams
  • No tangible cost
  • Minimal time required for end-users

Challenges:
  • Massive long-term burden on IT/security teams
  • Significant overall costs, potentially leading to a data breach
  • Users cannot be relied upon to make security decisions


Centralized Policy-Based

Benefits:
  • Shifts the classification burden to third-party products
  • No reliance on end users
  • Easy to classify rule-matched data

Challenges:
  • Reliance on all business units; very time-consuming
  • Difficult to manage at scale
  • Too many rules and regexes; too many false positives


Metadata Driven

Benefits:
  • Easy to classify data
  • No reliance on end users
  • Low burden on IT/security teams

Challenges:
  • A broad approach leaves little room for nuance
  • Potential for too much or too little access
  • Reliance on potentially suspect metadata


How Concentric classifies data

The Concentric Semantic Intelligence solution uses sophisticated machine learning technologies to autonomously scan and categorize data with context — from financial data to PII/PHI/PCI to intellectual property to confidential business information — wherever it is stored, without any rules, regex patterns, or upfront policies. This allows enterprise infosec teams to centrally classify the mission-critical data they care about and apply specific label tags without relying on end users or spending an inordinate amount of time writing rules to discover and classify sensitive data.

Our Risk Distance analysis autonomously identifies data, learns how it’s used, and determines whether it’s at risk. Concentric empowers you to know where your data is across unstructured and structured data repositories, email and messaging applications, and cloud or on-premises environments — all with semantic context.

To see firsthand — with your own data — how you can quickly and easily deploy Concentric’s MIND to classify your data without rules, regex, or end-user involvement, contact us today.

