Finding Sensitive Data

July 23, 2024

At C² Data Technology, we aim to find sensitive data in places where it’s not obvious. In practice, that means locating and classifying sensitive entities in your data repositories. Using machine learning, we detect over 35 types of sensitive data, covering the bases for HIPAA, PII, and national and international regulations. This post focuses on what makes C² Discover a next-generation tool for detecting and monitoring sensitive data.

 

What Is the Common Approach to Detecting Sensitive Data?

The most common approach is rule-based: it relies mainly on hand-crafted rules built on regular expressions. Rules can be designed around domain-specific labels and syntactic-lexical patterns.

Regex can work well when the lexicon is exhaustive. However, it’s impossible to cover all patterns because of domain-specific rules and incomplete dictionaries. Take the entity “address”, for example. It’s next to impossible to include patterns for every address format around the world, and constructing them relies heavily on manual effort. Regexes don’t work when the data doesn’t follow any known rules!
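To make the address problem concrete, here is a minimal Python sketch. The pattern `US_STREET` and the sample strings are hypothetical illustrations, not rules taken from C² Discover; they only show how a rule written for one address format can miss others.

```python
import re

# Hypothetical hand-crafted rule for US-style street addresses
# (house number first, then capitalized street name, then a street suffix).
US_STREET = re.compile(
    r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s+)+"
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd)\b"
)

def matches_rule(text: str) -> bool:
    """Return True if the hand-crafted rule finds an address in the text."""
    return US_STREET.search(text) is not None

samples = [
    "1600 Pennsylvania Avenue",   # number first: the rule was written for this
    "Friedrichstraße 43",         # number last: missed by the rule
    "12 rue de la Paix",          # lowercase street type before the name: missed
    "東京都千代田区丸の内1-9-1",    # no whitespace-separated tokens: missed
]

for text in samples:
    print(f"{text!r}: {matches_rule(text)}")
```

Each missed format would need its own rule, which is where the manual effort comes from.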

 

How Does C² Discover Develop a Next-Generation Solution?

By tapping into the breadth and depth of machine learning algorithms and innovative cloud technologies, C² Data developed a hybrid machine learning model, which we call C² Discover’s exclusive deep learning-based model. It combines machine learning resources powered by AWS (e.g., Amazon Comprehend) with additional layers of contextual rules based on our experience. Together, these methods provide a higher degree of accuracy than either one alone.
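The idea of layering contextual rules on top of a model’s output can be sketched roughly as follows. This is an illustration under stated assumptions, not C² Discover’s implementation: `ml_detect` is a hypothetical stand-in for a real ML service call, and the `COLUMN_HINTS` table and the `boost` value are invented for the example.

```python
from __future__ import annotations

from dataclasses import dataclass

@dataclass
class Detection:
    entity_type: str
    score: float  # model confidence in [0, 1]

def ml_detect(value: str) -> list[Detection]:
    """Hypothetical stand-in for an ML detector: it flags any value with
    exactly nine digits as a possible SSN, with modest confidence."""
    digits = sum(ch.isdigit() for ch in value)
    if digits == 9:
        return [Detection("SSN", 0.55)]
    return []

# Hypothetical contextual hints: column-name fragments that support a type.
COLUMN_HINTS = {
    "SSN": ("ssn", "social"),
    "PHONE": ("phone", "mobile"),
}

def apply_context(detections: list[Detection], column_name: str,
                  boost: float = 0.3) -> list[Detection]:
    """Raise the confidence of a detection when the column name agrees."""
    name = column_name.lower()
    adjusted = []
    for det in detections:
        score = det.score
        if any(hint in name for hint in COLUMN_HINTS.get(det.entity_type, ())):
            score = min(1.0, score + boost)
        adjusted.append(Detection(det.entity_type, score))
    return adjusted

raw = ml_detect("123-45-6789")
print(apply_context(raw, "employee_ssn"))  # column name supports the guess
print(apply_context(raw, "notes"))         # no supporting context
```

In this sketch the model alone is unsure about a nine-digit value, and the column name supplies the extra evidence.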

 

How Does C² Discover Detect Sensitive Data?

Reducing the Human Effort

Traditional rule-based approaches require considerable engineering skill and domain expertise. Deep learning-based models, on the other hand, are effective at automatically learning representations and underlying factors from raw data. C² Discover saves significant effort in designing rules and writing regular expressions, and it adapts quickly to new data environments.

Employing Rich Features in Model Training

By sourcing synthetic data based on real-world schemas, we were able to build C² Discover’s exclusive learning-based model. We incorporated not only word-level and character-level representations learned by an end-to-end neural model, but also additional information (e.g., gazetteers and linguistic dependencies). These rich features give our model a better understanding of different data repositories.
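As a simplified, non-neural illustration of these kinds of signals, the sketch below computes a word-level feature, a character-level “shape”, and a gazetteer lookup for a single token. In a neural model the word and character representations are learned rather than hand-coded; the function names and the tiny `CITY_GAZETTEER` here are hypothetical.

```python
# Hypothetical toy gazetteer; a real one would hold many thousands of entries.
CITY_GAZETTEER = {"toronto", "paris", "nairobi"}

def char_shape(token: str) -> str:
    """Character-level view: map uppercase to 'A', lowercase to 'a',
    digits to '9', and keep every other character as-is."""
    out = []
    for ch in token:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)
    return "".join(out)

def token_features(token: str) -> dict:
    """Combine word-level, character-level, and gazetteer signals."""
    return {
        "word": token.lower(),
        "shape": char_shape(token),
        "in_city_gazetteer": token.lower() in CITY_GAZETTEER,
    }

print(token_features("Toronto"))
print(token_features("M5V-2T6"))
```

The shape feature lets a model recognize a postal-code-like pattern it has never seen, while the gazetteer flags known names regardless of context.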

Applying Weighted Results

By combining the results from different resources, C² Discover delivers robust results. This greatly reduces bias compared with solutions that depend on a single model.
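One simple way to combine results from several sources is a weighted average of their confidence scores. The sketch below assumes that scheme for illustration only; the source names and weights are hypothetical, and C² Discover’s actual weighting is not described in this post.

```python
from __future__ import annotations

def combine_scores(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-source confidence scores.

    Only sources that actually reported a score contribute to the result.
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight

# Hypothetical sources and weights for one candidate entity.
weights = {"ml_model": 0.5, "regex": 0.2, "context_rule": 0.3}
scores = {"ml_model": 0.9, "regex": 0.0, "context_rule": 1.0}

print(round(combine_scores(scores, weights), 2))  # 0.75
```

Here the regex source misses the entity entirely, but the combined score stays high because the other two sources agree, so no single source decides the outcome.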