Modern enterprises are struggling with massive data growth, often exponential year over year. With an equally massive migration of data to the cloud, organizations are handling diverse types of data (such as intellectual property, financial, business-confidential, and regulated PII/PCI/PHI data) in increasingly complex environments.
It’s an understatement to say that protecting that data is critical. But the question is: how can you protect the data if you don’t know how to identify and classify it?
Data classification is the critical step that helps you identify high-value data in your enterprise by categorizing it into an agreed set of specific and meaningful categories.
Data classification drives multiple use cases such as data labeling, sensitive data identification, automating protection, compliance, security, access control, and data retention.
Simply put, data classification is the ability to label your data semantically.
Typically, organizations will classify their data into tiers such as public, internal, confidential, and restricted.
The key is to take all the data you care about and classify it accordingly so that as the data moves through your network, the treatment of the data is consistent — no matter where it is. Then you can put the appropriate set of access control rules around it.
With classification, you will have some gradation and taxonomy that dictates how you want your data treated.
Think about data classification like going to a conference where you may have a blue, red, or green badge. One color may be for speakers, one for press, one for the show floor and another for full access.
These are essentially classification badges, with each badge giving you some control for what you can do there. It’s the same way with data classification: we use labels to control access. The label makes it easy because the label can then be flexibly used across the network, where each system can have its own set of policies based on the label. The label becomes a way to identify a person or our data, and based on that label, you can take action.
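The badge analogy above can be sketched in a few lines of code. This is a hypothetical illustration, not Concentric's implementation: the label names, the `User` type, and the ranking scheme are assumptions chosen to show how a label, once attached, can drive an access decision.

```python
# Hypothetical sketch: classification labels driving access decisions.
# Label names and the ranking scheme are illustrative only.

from dataclasses import dataclass

# Ordered sensitivity levels: a higher rank means stricter handling.
LABEL_RANK = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}

@dataclass
class User:
    name: str
    clearance: str  # the highest label this user may read

def can_access(user: User, document_label: str) -> bool:
    """A user may read a document only if their clearance meets
    or exceeds the document's classification label."""
    return LABEL_RANK[user.clearance] >= LABEL_RANK[document_label]

alice = User("alice", clearance="confidential")
print(can_access(alice, "internal"))    # True
print(can_access(alice, "restricted"))  # False
```

Because every system can evaluate the same label against its own policy table, the label travels with the data while enforcement stays local to each system.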
The key factor of a data classification strategy is identifying any valuable data that must be secured with additional steps above and beyond your standard enterprise security posture. Essentially, this means separating the vital few from the trivial many.
The challenge is taking all the data an entity cares about and classifying it appropriately.
With so much riding on this critical step, it is helpful to understand the upsides and downsides of the various approaches organizations can take to classify their data.
When deployed properly, data classification can be a foundational element to meaningful data security. But security leaders must continually assess the quality of their data classification efforts.
There are three ways in which an organization can classify data:
Relying on end users for data classification is the least effective method, yet far too many organizations deploy this strategy: they ask end users to self-label data as they create, modify, and duplicate it. In theory, a user applying expert judgment at the moment of document creation can label data accurately; in practice, that rarely happens.
Here’s a screenshot of how a user might classify a document using Microsoft Word.
Sure, this step may take only five seconds, but the burden for something as critical as data protection should not rest with users.
Relying on your end users is an error-prone process, because securing your data is not their job. In fact, they will probably err on the side of marking everything public so it stays easy to share. That makes users' lives easier, but it circumvents your security policies.
If you’re a security team managing an enterprise, imagine 100,000 employees creating, modifying, and sharing content all over the place. If you have busy employees, data classification will rarely be top of mind.
A centralized policy-based data classification technique applies central rules to classify documents without requiring end-user input. DLP (Data Loss Prevention) and CASB (Cloud Access Security Brokers) products generally have some policy-based classification in place. An example of a policy could be “Mark all documents with the code word Sedona as sensitive”. The centralized approach can easily tag documents that are identified by rules.
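A centralized policy like the "Sedona" example can be sketched as a small rule engine. This is a simplified illustration of the approach, not any particular DLP or CASB product: the rule list, patterns, and label names are assumptions.

```python
# Hypothetical sketch of centralized, policy-based classification:
# rules (keywords / regex) map matching documents to a label, with no
# end-user input. Rule names and patterns are illustrative only.

import re

POLICIES = [
    # (rule description, compiled pattern, label to apply)
    ("code word Sedona", re.compile(r"\bSedona\b"), "sensitive"),
    ("credit-card-like number", re.compile(r"\b\d{4}(?:[ -]?\d{4}){3}\b"), "sensitive"),
]

def classify(text: str, default: str = "public") -> str:
    """Apply central rules to a document's text; first match wins."""
    for name, pattern, label in POLICIES:
        if pattern.search(text):
            return label
    return default

print(classify("Project Sedona Q3 roadmap"))  # sensitive
print(classify("Lunch menu for Friday"))      # public
```

The sketch also hints at the scaling problem discussed next: every new data type or code word means another hand-written rule, and every imprecise pattern is a source of false positives.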
All this is well and good, but how do you manage this classification at scale? If you're an infosec team, you'd be writing a lot of rules and spending significant time gathering feedback from every business unit. This method typically breeds many false positives. If you classify data as more sensitive than it is, files can be blocked and employee productivity suffers; if you classify it as less sensitive than it is, people will have access they shouldn't.
Most importantly, writing or relying upon regex rules is incredibly time-consuming and, depending on the size of your organization, can take months or even years.
Organizations also classify data by identifying document sensitivity using the associated metadata.
What is metadata? It could mean attributes such as a document's owner, its storage location or folder, its creation date, or who has permission to access it.
An example of metadata-driven classification might mean that any document to which a company executive has access should be marked as sensitive. Or, it could be automatically marking any document stored in the “Revenue Forecasts” folder as sensitive.
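The two metadata rules just described can be sketched as follows. This is an illustrative example only: the folder names, executive list, and label names are assumptions, not a real product's configuration.

```python
# Hypothetical sketch of metadata-driven classification: a document is
# labeled from attributes such as its folder or who can access it,
# without inspecting its content. All names below are illustrative.

EXECUTIVES = {"ceo@example.com", "cfo@example.com"}
SENSITIVE_FOLDERS = {"Revenue Forecasts", "Board Materials"}

def classify_by_metadata(folder: str, readers: set) -> str:
    """Label a document 'sensitive' if it sits in a sensitive folder
    or if any company executive has access to it."""
    if folder in SENSITIVE_FOLDERS:
        return "sensitive"
    if EXECUTIVES & readers:  # any executive among the readers
        return "sensitive"
    return "public"

print(classify_by_metadata("Revenue Forecasts", {"bob@example.com"}))  # sensitive
print(classify_by_metadata("Marketing", {"bob@example.com"}))          # public
```

Note that the decision is only as good as the metadata itself: a mislabeled folder or a stale permission list silently misclassifies the document.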
However, this approach has several notable limitations: metadata can be inaccurate, stale, or missing entirely; a mislabeled attribute leaves no room for error; and broad metadata rules tend to grant either too much or too little access.
To summarize, here are the pros and cons of each method:

**End-user-driven classification**

| Pros | Cons |
| --- | --- |
| Less up-front burden on IT/security teams | Massive long-term burden on IT/security teams |
| No tangible cost | Significant overall costs, potentially leading to a data breach |
| Minimal time required from end users | Users cannot be relied upon to make security decisions |

**Centralized, policy-based classification**

| Pros | Cons |
| --- | --- |
| Shifts the classification burden to third-party products | Reliance on all business units; very time consuming |
| No reliance on end users | Difficult to manage at scale |
| Easy to classify data | Too many rules and regexes; too many false positives |

**Metadata-driven classification**

| Pros | Cons |
| --- | --- |
| Easy to classify data | No room for error |
| No reliance on end users | Potential for too much or too little access |
| Low burden on IT/security teams | Reliance on potentially suspect metadata |
The Concentric Semantic Intelligence solution uses sophisticated machine learning to autonomously scan and categorize data with context (from financial data to PII/PHI/PCI to intellectual property to confidential business information) wherever it is stored, without any rules, regex patterns, or upfront policies. This allows enterprise infosec teams to centrally classify the mission-critical data they care about and apply specific labels without relying on end users or spending inordinate time writing rules to discover and classify sensitive data.
Our Risk Distance analysis autonomously identifies data, learns how it's used, and determines whether it's at risk. Concentric lets you know where your data is (across unstructured and structured data repositories, email and messaging applications, cloud or on-premises) with full semantic context.
To see firsthand — with your own data — how you can quickly and easily deploy Concentric’s MIND to classify your data without rules, regex, or end-user involvement, contact us today.