www.iri.com
173.230.136.199
Public Scan
Submitted URL: https://48263.r.a.d.sendibm1.com/mk/cl/f/oCXnBg-8oKE2OryN8EOy1OY8PcGbVeK4omPGAEh8ib85iE9XBXeE2LFJNpc49HT-ntt10slPhr-dSfxFEnKdT_vF...
Effective URL: https://www.iri.com/blog/data-protection/find-mask-pii-in-bigtable-cosmos-and-dynamo/
Submission: On March 30 via api from SE — Scanned from FR
Form analysis
3 forms found in the DOM
GET https://www.iri.com/blog/
<form role="search" method="get" id="searchform" class="search-form" action="https://www.iri.com/blog/">
<label class="screen-reader-text" for="s">Search for:</label>
<input type="text" placeholder="Type Here" value="" name="s" id="s">
<button type="submit" class="searchsubmit"><i class="fa fa-search" aria-hidden="true"></i><span class="screen-reader-text">Search</span></button>
</form>
POST https://www.iri.com/blog/wp-comments-post.php
<form action="https://www.iri.com/blog/wp-comments-post.php" method="post" id="commentform" class="comment-form" novalidate="">
<p class="comment-notes"><span id="email-notes">Your email address will not be published.</span> <span class="required-field-message" aria-hidden="true">Required fields are marked <span class="required" aria-hidden="true">*</span></span></p><input
id="author" placeholder="Your Name" name="author" type="text" value="" size="30" required="required">
<input id="email" name="email" type="email" placeholder="Email Address" value="" size="30" required="required">
<input placeholder="Your Website (optional)" id="url" name="url" type="text" value="" size="30"><textarea placeholder="Comment" id="comment" name="comment" cols="45" rows="8" aria-required="true" required="required"></textarea><textarea
name="ak_hp_textarea" cols="45" rows="8" maxlength="100"></textarea>
<p class="comment-form-cookies-consent"><input id="wp-comment-cookies-consent" name="wp-comment-cookies-consent" type="checkbox" value="yes"> <label for="wp-comment-cookies-consent">Save my name, email, and website in this browser for the next time
I comment.</label></p>
<p style="width: auto;"><label><input type="checkbox" name="s2_comment_request" value="1"> Check here to Subscribe to notifications for new posts</label></p>
<p class="form-submit"><input name="submit" type="submit" id="submit" class="submit" value="Leave Comment"> <input type="hidden" name="comment_post_ID" value="15269" id="comment_post_ID">
<input type="hidden" name="comment_parent" id="comment_parent" value="0">
</p>
<p style="display: none;"><input type="hidden" id="akismet_comment_nonce" name="akismet_comment_nonce" value="5ab0c92112"></p>
<p style="display: none !important;"><label>Δ</label><input type="hidden" id="ak_js_1" name="ak_js" value="1648624058337">
<script>
document.getElementById("ak_js_1").setAttribute("value", (new Date()).getTime());
</script>
</p>
</form>
GET https://www.iri.com/blog/
<form role="search" method="get" id="searchform" class="search-form" action="https://www.iri.com/blog/" target="_self">
<label class="screen-reader-text" for="s">Search for:</label>
<input id="popup-search-text" type="text" placeholder="Search Text" value="" name="s">
<button id="popup-search-button" type="submit" class="searchsubmit">Search</button>
</form>
Text Content
FIND & MASK PII IN BIGTABLE, COSMOS AND DYNAMO…

by Adam Lewis

Abstract: This article covers the use of the IRI DarkShield API for automatically locating and de-identifying PII or other sensitive data in the three major cloud provider NoSQL databases: Google BigTable, MS CosmosDB in Azure, and Amazon DynamoDB. Prior articles in this blog cover how DarkShield wizards in IRI Workbench find and mask data in other popular NoSQL DBs, including Cassandra, Elasticsearch and MongoDB. [1] A subsequent article covers CouchDB, Redis and Solr.

WHAT IS NOSQL?

NoSQL typically stands for “not only SQL”, although some say it stands for “non SQL”. NoSQL was introduced as an alternative to relational databases, which at the time were the dominant force in the industry. Because NoSQL databases are non-tabular, they store data differently than SQL databases do. There are several types of NoSQL databases, distinguished by their data model: document, key-value, wide-column, and graph.

THE STRENGTH OF NOSQL DATABASES

According to A Cloud Guru, relational databases have “inflexible schemas and notoriously difficult horizontal scaling [which means] they don’t always fit well in a highly scalable and geographically distributed infrastructure stack”. [2] In comparison, the flexibility of the NoSQL document model makes it easier to change data. NoSQL databases are also easier to scale horizontally, and the cloud providers usually handle the operational overhead of managing infrastructure. Decision makers generally weigh a few factors when choosing NoSQL over relational databases.
According to MongoDB, the drivers are: “fast-paced Agile development, storage of structured and semi-structured data, huge volumes of data, requirements for scale-out architecture, modern application paradigms like microservices and real-time streaming”. [3]

NOSQL DB SECURITY CONCERNS

NoSQL DBs share many security issues with traditional relational (SQL) databases, but they also carry some unique risks. According to the International Journal of Digital Society: “insufficient or ineffective input validation, errors in the application level permissions handling, weak authentication, insecure communication, illegal access to unencrypted data, etc. are some of the vulnerabilities applicable for NoSQL”. [4]

Like SQL injections, NoSQL injections are possible when input validation is not handled properly. Because NoSQL databases do not have a common query language, queries are written in the programming language (PHP, JavaScript, Python, etc.) of the application connected to the database. This means NoSQL injections can result in commands being executed not only in the database, but also in the application itself.

There is a long list of endpoint security practices for NoSQL DBs. But even with them, would-be assailants still manage to punch holes in those defenses. Companies must thus harden the security profile of these collections with another level of protection.

This is where IRI DarkShield comes in. As a data-centric, or “startpoint security”, solution, DarkShield masking provides another important layer of data protection atop the endpoint measures deployed by cloud database service providers.

ABOUT IRI DARKSHIELD

IRI DarkShield is a data masking tool for finding and de-identifying sensitive data in semi-structured and unstructured files and databases.
DarkShield is one of three core data masking products in the IRI Data Protector Suite that leverage graphical data classification, searching, and masking job design models in the IRI Workbench IDE, built on Eclipse. As of DarkShield Version 4, however, two Remote Procedure Call (RPC) Application Programming Interface (API) versions are also provided: the “Base” DarkShield API and the DarkShield-Files API. The DarkShield APIs extend the use of DarkShield functionality outside of Workbench and leverage a plugin on top of an IRI Web Services platform named Plankton.

To find and protect sensitive data in a wide range of sources, the DarkShield APIs use specified search matchers and masking rules that follow business rules. For more information on creating search matchers and masking rules, please refer to this article.

The “Base” DarkShield API searches and masks unstructured text outside the context of files. The DarkShield-Files API searches and masks PII in files. It covers semi-structured and unstructured data such as plain text files, CSV/TSV, Word documents, Excel, PDF, JSON, XML, Parquet, and JPEG and PNG images.

AWS DYNAMODB, AZURE COSMOSDB, GOOGLE BIGTABLE AND THE DARKSHIELD API

The leading cloud services for NoSQL databases are Amazon AWS with DynamoDB, Microsoft’s Azure CosmosDB, and Google’s Cloud BigTable. This article focuses on these three well-known services and on how the DarkShield-Files API can be used to search and mask inside their cloud-hosted NoSQL databases. Those unfamiliar with connecting to and querying NoSQL databases programmatically need not worry.
AWS, Azure, and Google Cloud provide copious documentation on how to access database content using software development kits (SDKs) in various programming languages. The DarkShield-Files API demos currently uploaded to GitHub are written in Python; as such, those projects use client libraries for Python. However, other calling languages, like Java, can be used. These calling programs, or “glue code” to the API, are where these procedures can be defined. See below for the links to the DarkShield-Files API demos:

* Azure CosmosDB
* AWS DynamoDB
* Google Cloud BigTable

DarkShield Search and Mask Contexts

Each of the IRI darkshield-files-api demos in GitHub includes a setup file. The setup file defines the search context, mask context, file search context, and file mask context that the DarkShield-Files API needs. Without these contexts defined, the DarkShield-Files API will not search or mask.

DarkShield API Search Context

A search context uses matchers to designate the PII that will be annotated in the files read. There are several types of search matchers. The DarkShield-Files API supports search matchers based on regular expressions, named entity recognition (NER) models, and lookups against predefined text stored in set files.

[Image: search context definition]

The image above displays an EmailMatcher that uses regular expression patterns to search for any text that may contain a “@” and website suffix, a SsnMatcher that uses regular expression patterns to search for any text that may follow the format of an SSN, and a NameMatcher that uses a Named Entity Recognition (NER) model to identify names.

File Search Context

For specific file formats, the DarkShield-Files API provides users with additional filtering and matching options. In this example, path matchers are provided for JSON and XML files.
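To make the matcher types from the search context more concrete, here is a minimal Python sketch of pattern-based and set-based matching. It does not call the DarkShield API and is not the demo's setup file: the regular expressions, the KNOWN_NAMES set, and the find_matches function are hypothetical, and a plain set lookup stands in for the NER model that the article's NameMatcher uses.

```python
import re

# Illustrative patterns only; these are assumptions, not the patterns used by DarkShield.
EMAIL_PATTERN = re.compile(r"\b[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}\b")
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Stand-in for a set file: a small in-memory collection of literal values.
KNOWN_NAMES = {"Ada Lovelace", "Alan Turing"}


def find_matches(text):
    """Return (matcher_name, start, end, value) tuples sorted by position."""
    found = []
    # Pattern matchers: every regex hit becomes an annotation.
    for name, pattern in (("EmailMatcher", EMAIL_PATTERN), ("SsnMatcher", SSN_PATTERN)):
        for m in pattern.finditer(text):
            found.append((name, m.start(), m.end(), m.group()))
    # Set matcher: look for each known literal value in the text.
    for value in KNOWN_NAMES:
        for m in re.finditer(re.escape(value), text):
            found.append(("NameSetMatcher", m.start(), m.end(), value))
    return sorted(found, key=lambda item: item[1])


if __name__ == "__main__":
    sample = "Contact Ada Lovelace at ada@example.com, SSN 123-45-6789."
    for match in find_matches(sample):
        print(match)
```

Each annotation records which matcher fired and where, which is the information a masking step would need in order to pair a match with a rule.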
Mask Context

Note: In older versions of the DarkShield-Files API, the configuration for rules and rulesMatcher requires the “type: cosort” and “type:name” in their respective configurations.

For the API to know what to do with PII that has been discovered during search operations, a mask context must be defined. The first part of a mask context contains the list of rules to apply. Each rule has an expression that dictates which masking function will be used. These expressions are also documented in the IRI FieldShield manual and IRI Workbench, and because the functions are compatible, enterprise data integrity can be preserved post-masking regardless of source. The list of possible masking rules includes:

* Assignment Expressions
* Blur Functions
* Deletion Functions
* Encoding Functions
* Encryption Functions (AES, 3DES, FPE, GPG)
* Hashing Functions
* Pseudonym Replacement
* Redaction Functions
* String Manipulation Functions

[Image: mask context code]

In the code above we have three rules called HashRule, RedactSsnRule, and FpeRule. Respectively, the rules were assigned a hashing function, a function to replace characters with ‘*’, and format-preserving encryption. The DarkShield API uses the same masking functions as IRI FieldShield (which masks structured data in SortCL-compatible job scripts).

Rule matchers follow the masking rules. They pair search matchers with masking rules.

Last is the file mask context. For specific file formats, the DarkShield-Files API provides users with additional configuration options. In this example, the configuration for JSON files specifies pretty printing.

[Figure: File Mask Context]

AUTHENTICATION CREDENTIALS OF NOSQL DEMOS

Accessing BigTable, CosmosDB, or DynamoDB programmatically requires the user’s login credentials in some form for authentication.
There are various ways to store and access these credentials securely, but for the sake of simplicity the three NoSQL demos use either credential files or environment variables.

[Figures: CosmosDB credentials.json | DynamoDB .aws/credentials file]

Google BigTable allows you to generate a private key for your credential and download the newly generated key in a file. The Google BigTable demo uses the environment variable GOOGLE_APPLICATION_CREDENTIALS to designate the path to the private key file downloaded from the Google Cloud Platform console.

TAKING A CLOSER LOOK AT THE DARKSHIELD API INTERFACE TO BIGTABLE

THE MAIN PROGRAM

To give an idea of how the main program would be implemented, below is a screenshot of the Google BigTable main.py.

[Image: Google BigTable main.py]

All of the previously linked demos use a main program that facilitates the DarkShield-Files API call. The main program contains the glue code that performs the following actions:

* Authenticates to the data source (NoSQL DB)
* Accesses and queries the database
* Makes POST requests to the DarkShield-Files API with the content of the DB
* Writes the resulting output from the DarkShield-Files API back to the database

In the BigTable demo, the resulting output is written back into the database. Alternatively, the code could be altered to write the masked results to files or to a separate test database. The DarkShield-Files API is a flexible tool that is limited only by the glue code that manipulates it.

EXECUTING THE PROGRAM

To execute, run python main.py "project_id" "instance_id" from your terminal. Here, project_id is your Cloud Platform project ID and instance_id is the ID of the Cloud Bigtable instance you wish to connect to.
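The glue-code steps listed in the main program section, together with the two command-line arguments, can be sketched roughly as follows. This is not the demo's main.py: InMemoryTable is a hypothetical stand-in for a cloud SDK table client, mask_text is a local placeholder for the POST request to the DarkShield-Files API, and the fallback argument values are invented so that the sketch runs without a cloud account.

```python
import argparse
import re
import sys

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


class InMemoryTable:
    """Hypothetical stand-in for a NoSQL table client (not a real cloud SDK)."""

    def __init__(self, rows):
        self.rows = dict(rows)

    def read_rows(self):
        # Return a copy of the items so rows can be rewritten while looping.
        return list(self.rows.items())

    def write_row(self, key, value):
        self.rows[key] = value


def mask_text(text):
    """Placeholder for the API call: replaces the digits of SSN-like values with '*'."""
    return SSN_PATTERN.sub(lambda m: re.sub(r"\d", "*", m.group()), text)


def run(table):
    """Query every row, mask its content, and write the result back in place."""
    for key, value in table.read_rows():
        table.write_row(key, mask_text(value))
    return table


def parse_args(argv):
    parser = argparse.ArgumentParser(description="Sketch of the glue-code flow")
    parser.add_argument("project_id", help="Cloud Platform project ID")
    parser.add_argument("instance_id", help="ID of the Cloud Bigtable instance")
    return parser.parse_args(argv)


if __name__ == "__main__":
    # Fall back to made-up IDs so the sketch also runs without arguments.
    args = parse_args(sys.argv[1:] or ["demo-project", "demo-instance"])
    # Authentication is skipped here because the table lives in memory.
    table = InMemoryTable({"row1": "SSN 123-45-6789", "row2": "no PII"})
    run(table)
    print(args.project_id, args.instance_id, table.rows)
```

Keeping the read, mask, and write steps behind small functions is one way to redirect the masked output to files or to a separate test database, as the article suggests, by swapping only the write step.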
Below is an example of what the execution may look like:

[Image: example execution of main.py]

RESULTS OF SEARCHING AND MASKING OF PII VIA THE DARKSHIELD API

GOOGLE BIGTABLE

Below is a demonstration of the results of search and masking operations performed on Google Cloud BigTable using the BigTable demo on GitHub: BigTable Demo Project

[Figure: Original data and masked results after execution of the IRI DarkShield BigTable demo]

AZURE COSMOSDB

Below is a demonstration of the results of search and masking operations performed on CosmosDB:

[Figure: CosmosDB data source explorer]
[Figure: Vulnerable PII in a CosmosDB collection]
[Figure: CosmosDB collection item after masking]

AMAZON DYNAMODB

Below is a demonstration of the results of search and masking operations performed on DynamoDB:

[Figure: AWS NoSQL Workbench provides UI to DynamoDB]
[Figure: Unmasked PII in DynamoDB Collections]
[Figure: Masked results exported to csv format, part 1]
[Figure: Masked results exported to csv format, part 2]

Conclusion

Finding and masking PII through the DarkShield-Files API is an “open” solution not constrained by the data source or silo. As with RDBs, files, documents and images, DarkShield’s API delivers flexible, codable solutions to detect and protect sensitive structured, semi-structured and unstructured data in almost any NoSQL database, whether it runs on-premise or in the cloud.

1. Note that the same DarkShield base API described herein can also be used on those three, and IRI is now also working to support Couchbase, Redis, and Solr. The DarkShield API for files finds and masks data in RDB C/BLOB columns, unstructured text and log files, semi-structured EDI files like HL7, JSON, X12 and XML, MS and PDF documents and many image formats.
2. Vanbuskirk, Mike Nov, et al. “NoSQL Databases Comparison: Cosmos DB VS DynamoDB VS Cloud Datastore and Bigtable.” A Cloud Guru, 25 June 2021, acloudguru.com/blog/engineering/comparing-cloud-nosql-databases-dynamodb-vs-cosmos-db-vs-cloud-datastore-and-bigtable
3. “What Is NoSQL?
NoSQL Databases Explained.” MongoDB, www.mongodb.com/nosql-explained.
4. Shahriar, Hossain, and Hisham M Haddad. “Security Vulnerabilities of NoSQL and SQL Databases for MOOC Applications.” International Journal of Digital Society, Mar. 2017.

© 2022 Innovative Routines International (IRI), Inc., All Rights Reserved