Effective URL: https://www.iri.com/blog/data-protection/find-mask-pii-in-bigtable-cosmos-and-dynamo/
Submission: On March 30 via api from SE — Scanned from FR


FIND & MASK PII IN BIGTABLE, COSMOS AND DYNAMO…

 * by Adam Lewis

Abstract: This article covers the use of the IRI DarkShield API for
automatically locating and de-identifying PII or other sensitive data in the
three major cloud provider NoSQL databases — Google BigTable, MS CosmosDB in
Azure, and Amazon DynamoDB. Prior articles in this blog cover how DarkShield
wizards in IRI Workbench find and mask data in other popular NoSQL DBs,
including Cassandra, Elasticsearch and MongoDB.1 A subsequent article covers
CouchDB, Redis and Solr.

WHAT IS NOSQL?

NoSQL typically stands for “not only SQL”, although some say it stands for
“non SQL”. NoSQL was introduced as an alternative to relational databases,
which at the time were the dominant force in the industry.

Because NoSQL databases are non-tabular, they store data differently from SQL
databases. There are several types of NoSQL database, distinguished by their
data model: document, key-value, wide-column, and graph.



THE STRENGTH OF NOSQL DATABASES

According to CloudGuru.com, relational databases have “inflexible schemas and
notoriously difficult horizontal scaling [which means] they don’t always fit
well in a highly scalable and geographically distributed infrastructure
stack”2. 

In comparison, the flexibility of the NoSQL document-model makes it easier to
change data. NoSQL databases are also easier to scale horizontally, and usually
the cloud providers handle the operational overhead of managing infrastructure.



Decision makers weighing NoSQL against relational databases generally consider
a few factors. According to MongoDB the drivers are:
“fast-paced Agile development, storage of structured and semi-structured data,
huge volumes of data, requirements for scale-out architecture, modern
application paradigms like microservices and real-time streaming”3.

NOSQL DB SECURITY CONCERNS

NoSQL DBs share many security issues with traditional relational (SQL)
databases, but also carry some unique risks. According to an article in the
International Journal of Digital Society, “insufficient or ineffective input
validation, errors in the application level permissions handling, weak
authentication, insecure communication, illegal access to unencrypted data, etc.
are some of the vulnerabilities applicable for NoSQL”4.

Like SQL injections, NoSQL injections become possible when input validation is
not handled properly. Because NoSQL databases do not share a common query
language, queries are written in the programming language (PHP, JavaScript,
Python, etc.) of the application connected to the database. This means NoSQL
injections can result in commands being executed not only in the database, but
also in the application itself.
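
As a minimal sketch of the input-validation problem described above, consider an
application that builds a MongoDB-style filter document from request data. The
function names (build_login_filter, require_plain_string) and the filter shape
are hypothetical, and no database is contacted; the point is only that a JSON
request body can carry an operator object where a plain string was expected.

```python
import json


def build_login_filter(username, password):
    """Naively build a MongoDB-style filter from request values (unsafe)."""
    return {"username": username, "password": password}


def require_plain_string(value):
    """Reject anything that is not a plain string, e.g. {"$ne": ""}."""
    if not isinstance(value, str):
        raise ValueError("expected a string, got %s" % type(value).__name__)
    return value


# A request body in which the attacker sends an operator instead of a password.
malicious_body = json.loads('{"username": "admin", "password": {"$ne": ""}}')

# Unsafe: the operator object flows straight into the filter, so the
# condition becomes "password is not empty" rather than an equality check.
unsafe_filter = build_login_filter(
    malicious_body["username"], malicious_body["password"]
)

# Safer: validate the type of each value before the filter is built.
try:
    build_login_filter(
        require_plain_string(malicious_body["username"]),
        require_plain_string(malicious_body["password"]),
    )
    rejected = False
except ValueError:
    rejected = True
```

Type checks like this are only one layer of validation; they do not replace the
endpoint and data-centric protections discussed in this article.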



There is a long list of endpoint security practices for NoSQL DBs. But even with
them, would-be assailants still manage to punch holes in those defenses.
Companies must thus evolve to harden the security profile of these collections
with another level of protection.  



This is where IRI DarkShield comes in. As a data-centric, or “startpoint
security” solution, DarkShield masking provides another important layer of data
protection atop the end-point measures deployed by cloud database service
providers.

ABOUT IRI DARKSHIELD

IRI DarkShield is a data masking tool for finding and de-identifying sensitive
data in semi-structured and unstructured files and databases. DarkShield is one
of three core data masking products in the IRI Data Protector Suite which
leverage graphical data classification, searching, and masking job design models
in the IRI Workbench IDE, built on Eclipse.



As of DarkShield Version 4, however, two powerful Remote Procedure Call (RPC)
Application Programming Interface (API) versions are also provided: the “Base”
DarkShield API and the DarkShield-Files API. The DarkShield APIs extend the use
of DarkShield functionality outside of Workbench and leverage a plugin on top of
an IRI Web Services platform named Plankton.



To find and protect sensitive data in a wide range of sources, the DarkShield
APIs apply user-specified search matchers and masking rules that reflect
business rules. For more information on creating search matchers and masking
rules, please refer to this article.

The “Base” DarkShield API is used to search and mask unstructured text outside
the context of files. Alternatively, the DarkShield-Files API provides the
ability to search and mask PII in files. 

With the assistance of the DarkShield-Files API, semi-structured and
unstructured data like plain text files, CSV/TSV, Word documents, Excel, PDF,
JSON, XML, Parquet, and JPEG and PNG images can be searched and masked.

AWS DYNAMODB, AZURE COSMOSDB, GOOGLE BIGTABLE AND THE DARKSHIELD API

The companies reigning over cloud services for NoSQL databases are Amazon AWS
with DynamoDB, Microsoft’s Azure CosmosDB, and Google’s Cloud BigTable. The
focus of this article is on these three well known service providers and how the
DarkShield-Files API can be leveraged to search and mask inside their NoSQL
databases located in the cloud.



Those unfamiliar with connecting to and querying NoSQL databases
programmatically need not worry. AWS, Azure, and Google Cloud all provide
extensive documentation on how to access database content using their software
development kits (SDKs), which are available for various programming languages.

The DarkShield-Files API demos currently uploaded to GitHub are written in
Python; as such, those projects use client libraries for Python. However, other
calling languages, like Java, can be used.

These calling programs, or “glue code” to the API, are where the connection and
query procedures are defined. See below for the links to the DarkShield-Files
API demos:

 * Azure CosmosDB
 * AWS DynamoDB
 * Google Cloud BigTable

DarkShield Search and Mask Contexts

Each IRI darkshield-files-api demo in GitHub includes a setup file. The setup
file defines the search context, mask context, file search context, and file
mask context that the DarkShield-Files API needs. Without these contexts
defined, the DarkShield-Files API will not search or mask.

DarkShield API Search Context

A search context uses matchers to designate the PII that will be annotated in
the files that are read. Several types of search matcher are available. The
DarkShield-Files API supports search matchers based on regular expressions,
named entity recognition (NER) models, and lookups against predefined values
stored in set files.

The image above displays an EmailMatcher that uses regular expression patterns
to search for any text that may contain a “@” and website suffix, an SsnMatcher
that uses regular expression patterns to search for any text that may follow the
format of an SSN, and a NameMatcher that uses a Named Entity Recognition (NER)
model to identify names.
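
To illustrate what pattern-based matchers of this kind do, here is a small
sketch using Python's re module. The two patterns are simplified assumptions
written for this example and may differ from the expressions used in the demo
setup file; the NER-based NameMatcher is not reproduced here.

```python
import re

# Simplified, illustrative patterns; the demo's actual expressions may differ.
EMAIL_PATTERN = re.compile(r"\b[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}\b")
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

MATCHERS = (("EmailMatcher", EMAIL_PATTERN), ("SsnMatcher", SSN_PATTERN))


def find_matches(text):
    """Return (matcher name, matched text) pairs in order of appearance."""
    found = []
    for name, pattern in MATCHERS:
        for match in pattern.finditer(text):
            found.append((match.start(), name, match.group()))
    return [(name, value) for _, name, value in sorted(found)]


sample = "Contact jane.doe@example.com, SSN 123-45-6789."
matches = find_matches(sample)
```

Patterns this loose will also flag look-alike strings that are not real SSNs or
email addresses, which is one reason matchers are usually tuned per data source.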

File Search Context

For specific file formats, the DarkShield-Files API provides users with
additional filtering and matching options. In this example, path matchers are
provided for json and xml files.

Mask Context
Note: In older versions of the DarkShield-Files API, the configuration for rules
and rulesMatcher requires the “type: cosort” and “type:name” in their respective
configurations.

For the API to know what to do with PII that has been discovered during search
operations, a mask context must be defined. The first part of a mask context
contains a list of rules that we want to apply. Each rule has an expression that
dictates what masking function will be used.

These expressions are also documented in the IRI FieldShield manual and IRI
Workbench, and because the functions are compatible, enterprise data integrity
can be preserved post-masking regardless of source. The list of possible masking
rules includes:

 * Assignment Expressions
 * Blur Functions
 * Deletion Functions
 * Encoding Functions
 * Encryption Functions (AES, 3DES, FPE, GPG) 
 * Hashing Functions
 * Pseudonym Replacement
 * Redaction Functions
 * String Manipulation Functions

In the code above we have three rules called HashRule, RedactSsnRule, and
FpeRule. Respectively, the rules were assigned a hashing function, a function to
replace characters with ‘*’, and format preserving encryption. The DarkShield
API uses the same masking functions as IRI FieldShield (which masks structured
data in SortCL-compatible job scripts).

After the masking rules come the rule matchers, which pair each search matcher
with the masking rule to apply to its matches.
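
The relationship between rules and rule matchers can be sketched in plain
Python. This is an illustration of the pairing concept only: the functions are
stand-ins rather than DarkShield's masking functions, the matcher-to-rule
pairing is assumed, and FpeRule is left out because format-preserving
encryption needs a proper cipher implementation.

```python
import hashlib
import re


# Stand-in masking functions; these are not DarkShield's implementations.
def hash_value(value):
    """Replace the value with its SHA-256 hex digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def redact_ssn(value):
    """Replace every digit with '*', keeping the dashes for readability."""
    return re.sub(r"\d", "*", value)


# Masking rules: a rule name mapped to the function it applies.
RULES = {"HashRule": hash_value, "RedactSsnRule": redact_ssn}

# Rule matchers: which rule handles the matches of which search matcher.
# The pairing shown here is assumed for illustration.
RULE_MATCHERS = {"EmailMatcher": "HashRule", "SsnMatcher": "RedactSsnRule"}


def mask(matcher_name, value):
    """Look up the rule paired with a matcher and apply it to the value."""
    rule_name = RULE_MATCHERS[matcher_name]
    return RULES[rule_name](value)


masked_ssn = mask("SsnMatcher", "123-45-6789")
masked_email = mask("EmailMatcher", "jane.doe@example.com")
```

Keeping the rules and the pairings in separate tables mirrors the way a mask
context lists its rules first and its rule matchers second.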

Last is the file mask context. For specific file formats, the DarkShield-Files
API provides users with additional configuration options. In this example, the
configuration for JSON files enables pretty printing.

File Mask Context

AUTHENTICATION CREDENTIALS OF NOSQL DEMOS

Accessing BigTable, CosmosDB, or DynamoDB programmatically requires the user’s
login credentials in some form for authentication. There are various ways to
store and access these credentials securely, but for the sake of simplicity the
three NoSQL demos either use credential files or environment variables.

CosmosDB credentials.json | DynamoDB .aws/credentials file

Google BigTable allows you to generate a private key for your credential and
download the newly generated key in a file.

The Google BigTable demo uses the environment variable
GOOGLE_APPLICATION_CREDENTIALS to designate the path to the private key file
downloaded from the Google Cloud Platform console.
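
Before running glue code that depends on this variable, it can help to fail
early with a clear message when the variable is missing or points nowhere. The
helper below is a hypothetical convenience written for this article, not part
of the BigTable demo or of any Google library.

```python
import os


def credentials_path(environ=os.environ):
    """Return the key-file path from GOOGLE_APPLICATION_CREDENTIALS, or raise."""
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise RuntimeError("key file not found: %s" % path)
    return path
```

Passing the environment in as a parameter keeps the check easy to exercise
without changing the real process environment.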

TAKING A CLOSER LOOK AT THE DARKSHIELD API INTERFACE TO BIGTABLE

THE MAIN PROGRAM

To give an idea of how the main program is implemented, below is a screenshot
of the Google BigTable main.py.





All of the previously linked demos use a main program that facilitates the
DarkShield-Files API call. The main program will contain the glue code that
performs the following actions:

 * Authenticates to the data source (NoSQL DB)
 * Accesses and queries the database
 * Makes POST requests to the DarkShield-Files API with the content of the DB
 * Writes the resulting output from the DarkShield-Files API back to the
   database
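
The flow listed above can be sketched with in-memory stand-ins so that it runs
without a database or an API server. Everything here is hypothetical: the table
dictionary replaces the NoSQL source, fake_mask replaces the POST request to the
DarkShield-Files API, and authentication is assumed to have happened before
read_rows is called.

```python
def run_pipeline(read_rows, mask_payload, write_row):
    """Read each row, send its content for masking, write the result back."""
    count = 0
    for row_key, content in read_rows():
        # In a real program this call would be an HTTP POST to the API.
        masked = mask_payload(content)
        write_row(row_key, masked)
        count += 1
    return count


# In-memory stand-in for the database.
table = {"row1": "SSN 123-45-6789", "row2": "no PII here"}


def read_rows():
    """Return a snapshot of (key, content) pairs from the stand-in table."""
    return list(table.items())


def fake_mask(content):
    """Stand-in for the API call: redact one known value."""
    return content.replace("123-45-6789", "***-**-****")


def write_row(key, value):
    """Write the masked content back to the same row."""
    table[key] = value


processed = run_pipeline(read_rows, fake_mask, write_row)
```

Because the reader, masker, and writer are passed in as functions, the same
loop could write to files or to a separate test database by swapping write_row.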

In the BigTable demo the resulting output has been written back into the
database. Alternatively, the code could be altered to write the masked results
to files or to a separate test database. The DarkShield-Files API is a flexible
tool that is only limited by the glue code that manipulates it.

EXECUTING THE PROGRAM

To execute, run python main.py project_id instance_id from your terminal,
where project_id is your Cloud Platform project ID and instance_id is the ID of
the Cloud Bigtable instance you wish to connect to.
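
One way a main program could read these two positional arguments is with
Python's argparse module. This is a sketch under that assumption; the demo's
main.py may parse its arguments differently, and the sample values are
placeholders.

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional arguments described above."""
    parser = argparse.ArgumentParser(
        description="Search and mask PII in a Cloud Bigtable instance."
    )
    parser.add_argument("project_id", help="Cloud Platform project ID")
    parser.add_argument("instance_id", help="ID of the Cloud Bigtable instance")
    return parser.parse_args(argv)


# Example call with placeholder values instead of the real command line.
args = parse_args(["my-project", "my-instance"])
```

When argv is left as None, argparse reads the arguments from the command line,
and a missing argument produces a usage message instead of a stack trace.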

Below is an example of what the execution may look like:



RESULTS OF SEARCHING AND MASKING OF PII VIA THE DARKSHIELD API

GOOGLE BIGTABLE



Below is a demonstration of the results of search and masking operations
performed on Google Cloud BigTable using the BigTable demo on GitHub:

 

 

BigTable Demo Project

Original data and masked results after execution of the IRI DarkShield BigTable
demo

AZURE COSMOSDB



Below is a demonstration of the results of search and masking operations
performed on CosmosDB:

 

 

CosmosDB data source explorer

Vulnerable PII in a CosmosDB collection.

CosmosDB collection item after masking

AMAZON DYNAMODB



Below is a demonstration of the results of search and masking operations
performed on DynamoDB:

 



AWS NoSQL Workbench provides UI to DynamoDB

Unmasked PII in DynamoDB Collections

Masked results exported to csv format part 1:



Masked results exported to csv format part 2:



Conclusion

Finding and masking PII through the DarkShield-Files API is an “open” solution
not constrained by the data source or silo. As with RDBs, files, documents and
images, DarkShield’s API delivers flexible, codable solutions to detect and
protect sensitive structured, semi-structured and unstructured data in almost
any NoSQL database, whether it runs on-premises or in the cloud.

 1.  Note that the same DarkShield base API described herein can also be used on
    those three as well, and IRI is now also working to support Couchbase,
    Redis, and Solr. The DarkShield API for files finds and masks data in RDB
    C/BLOB columns, unstructured text and log files, semi-structured EDI files
    like HL7, JSON, X12 and XML, MS and PDF documents and many image formats.
 2. Vanbuskirk, Mike, et al. “NoSQL Databases Comparison: Cosmos DB VS
    DynamoDB VS Cloud Datastore and Bigtable.” A Cloud Guru, 25 June 2021,
    acloudguru.com/blog/engineering/comparing-cloud-nosql-databases-dynamodb-vs-cosmos-db-vs-cloud-datastore-and-bigtable
 3. “What Is NoSQL? NoSQL Databases Explained.” MongoDB,
    www.mongodb.com/nosql-explained.
 4. Shahriar, Hossain, and Hisham M Haddad. “Security Vulnerabilities of NoSQL
    and SQL Databases for MOOC Applications.” International Journal of Digital
    Society, Mar. 2017.


© 2022 Innovative Routines International (IRI), Inc., All Rights Reserved |
Contact
