Submitted URL: https://pages.acryl.io/e3t/Ctc/GF*113/d1KqKk04/MW9pb-fKMLcW8kbWrr3RFwS1W3V7XmG57HpqnN4HzmfH3lYMRW7Y8-PT6lZ3mJW4Qj1mM8bK...
Effective URL: https://www.acryldata.io/blog/the-what-why-and-how-of-data-contracts?utm_medium=email&_hsmi=288122898&_hsenc=p2ANqtz-9n_p...
Submission: On January 08 via manual from IN — Scanned from DE

THE WHAT, WHY, AND HOW OF DATA CONTRACTS

Data Contract

Data Engineering

Metadata

Data Quality

Data Practitioner

Maggie Hays

Mar 14, 2023

Ah, Data Contracts — one of the buzziest topics in the data world. Despite the
topic flooding my LI/Reddit/Substack/Medium feeds, I found myself repeatedly
scratching my head, trying to make sense of the hype.



I wanted to get to the bottom of this, so I crowd-sourced questions and hosted
an AMA with Chad Sanderson (one of the biggest proponents of data contracts) and
Shirshanka Das (co-founder at Acryl Data) to talk about all things data
contracts:

 * The What: What, exactly, is a data contract?
 * The Why: Why do data contracts matter? What are the core use cases behind
   them? What problems do they solve?
 * The How: How do we implement data contracts? How do we start building them
   into our data stack?

There’s a lot to unpack here — let’s dig in!

FIRST THINGS FIRST: MEET THE EXPERTS

Chad Sanderson, one of the most prolific voices in the data platform and quality
space, runs the Data Quality Camp community. Chad writes at length
(https://dataproducts.substack.com/) about data, data products, data modeling,
and the future of data engineering and architecture.

Shirshanka Das is the CTO and Co-Founder of Acryl Data
(https://www.acryldata.io/), the company maintaining the open-source DataHub
project. He spent almost a decade at LinkedIn leading its data platform strategy
and founded the DataHub project. He continues to lead the charge on DataHub’s
developer-led approaches for modern data discovery, quality, and automated
governance.

THE WHAT: DEFINING A DATA CONTRACT

Let’s start with the basics.

WHAT, EXACTLY, IS A DATA CONTRACT?

At its core, a data contract is an agreement between a producer and a consumer
that clearly defines the following:

 * what data needs to move from a (producer’s) source to a (consumer’s)
   destination
 * the shape of that data, its schema, and semantics
 * expectations around availability and data quality
 * details about contract violation(s) and enforcement
 * how (and for how long) the consumer will use the data
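
To make the list above concrete, here is a minimal sketch of how those elements could be captured in code. It is an illustration only, not a format discussed in the AMA: the class names, field names, and the example "orders" dataset are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FieldSpec:
    """One column in the contract: its type plus what the value means."""
    name: str
    dtype: str
    description: str
    nullable: bool = False


@dataclass(frozen=True)
class DataContract:
    """Hypothetical container for the elements of a data contract."""
    name: str
    version: str
    producer: str
    consumer: str
    source: str                      # what data moves, and from where
    destination: str                 # where the data lands
    fields: Tuple[FieldSpec, ...]    # shape, schema, and semantics
    freshness_slo_minutes: int       # availability expectation
    quality_checks: Tuple[str, ...]  # data quality expectations
    on_violation: str                # what happens when the contract is broken
    permitted_uses: Tuple[str, ...]  # how the consumer may use the data
    retention_days: int              # for how long the consumer may keep it


# Example values are invented for illustration.
orders_contract = DataContract(
    name="orders",
    version="1.0.0",
    producer="checkout-service",
    consumer="finance-analytics",
    source="orders_db.orders",
    destination="warehouse.analytics.orders",
    fields=(
        FieldSpec("order_id", "string", "Unique identifier of the order"),
        FieldSpec("amount_cents", "long", "Order total in cents"),
        FieldSpec("coupon_code", "string", "Coupon applied, if any", nullable=True),
    ),
    freshness_slo_minutes=60,
    quality_checks=("order_id is unique", "amount_cents >= 0"),
    on_violation="quarantine records and notify the producer",
    permitted_uses=("revenue reporting",),
    retention_days=365,
)
```

Keeping the contract immutable (`frozen=True`) and versioned is one way to encourage renegotiation through a new version rather than a silent edit.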

DATA CONTRACTS CLEARLY DEFINE ROLES & RESPONSIBILITIES

Data contracts are bi-directional: an effective data contract sets clear
expectations for both the producer and consumer of data.

Even more, it holds both producers and consumers accountable for adherence to
the contract and is frequently revisited and renegotiated as use cases and/or
relevant parties evolve.

This ensures the producer reliably generates high-quality and timely data while
enforcing how that data is used downstream. This could mean auditing who has
access, how it has been shared with others, or how it has been used/replicated
for unforeseen use cases.

ISN’T A DATA CONTRACT JUST A ________?

DATA CONTRACTS VS. DATASET DDL (DATA DEFINITION LANGUAGE)

Dataset DDL defines the physical storage of data — what your technology will or
will not accept as a new record within the storage layer.

While dataset DDL is undoubtedly a part of the data contract, it fails to
capture semantic detail (what the data represents), data retention policies (how
long the data can be stored), SLA/SLO requirements (when the data will reliably
be available for consumption), and more.



DATA CONTRACTS VS. DATA PRODUCTS

Look at contracts as inputs to data products: a mechanism on which actual data
products can be constructed and fulfilled.

A data product can have multiple data contracts, and multiple data products can
rely on the same data contract(s).

THE WHY: WHY SHOULD WE CARE ABOUT DATA CONTRACTS?

Data practitioners’ workflows commonly involve rapid iteration and prototyping
to find the specific slices of data that address a business need. Whether they
are building BI reports, analyses, or training datasets for ML models,
practitioners are expected to prioritize delivering business value quickly over
long-term scalability.



By the time a data asset/data product is deployed to production, it’s highly
likely to be multiple steps of enrichment and transformation removed from its
source. The numerous layers of abstraction make it difficult for original data
producers to understand which fields/attributes are critical to driving business
value.

Introducing a data contract for these prod-level assets is an effective way to
align producers and consumers on the following:

 * technical schema requirements to be enforced upstream to minimize the impact
   of dropped columns, changes in data types, etc.
 * field- and dataset-level quality assertions to ensure high accuracy in
   output; no more “garbage-in, garbage-out”
 * Service Level Objectives to set guarantees of when the data will be available
   for processing
 * retention and masking policies to minimize compliance risk
 * in-scope business use cases to provide line-of-sight to data producers of how
   their resources are driving revenue

THE HOW: WHERE DO DATA CONTRACTS FIT WITHIN OUR STACKS?

Don’t overthink this one. You can introduce a contract anywhere you see a
handoff between a producer and consumer. Keep in mind that you and your team
may act as both the producer and the consumer in your ETL pipelines.

No matter where that handoff happens, contracts should be version-controlled,
easily discoverable, and programmatically enforced.

One suggestion is to define your technical schema with Protobuf, Avro, or a
similar format and store it in a registry. If you use Kafka, a schema registry
such as the Confluent Schema Registry is a great starting point, but even a
GitHub repository works fine for storing contracts.
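
As a rough sketch of the version-controlled option, the snippet below writes an Avro record schema as a Python dictionary and serializes it deterministically, so the resulting file produces small, reviewable diffs in a Git repository. The "Order" schema, its namespace, and the `render_schema` helper are hypothetical; only the Avro JSON layout (record type, fields, a nullable union with a null default) follows the Avro specification.

```python
import json

# Hypothetical Avro schema for an order event. Avro schemas are plain JSON,
# so they can live in a schema registry or in a file under version control.
ORDER_SCHEMA_V1 = {
    "type": "record",
    "name": "Order",
    "namespace": "com.example.orders",
    "doc": "One row per completed order.",
    "fields": [
        {"name": "order_id", "type": "string", "doc": "Unique order identifier"},
        {"name": "amount_cents", "type": "long", "doc": "Order total in cents"},
        {
            "name": "coupon_code",
            "type": ["null", "string"],  # optional field: union with null
            "default": None,
            "doc": "Coupon applied, if any",
        },
    ],
}


def render_schema(schema: dict) -> str:
    """Serialize with sorted keys so the committed file is stable across runs."""
    return json.dumps(schema, indent=2, sort_keys=True) + "\n"


schema_text = render_schema(ORDER_SCHEMA_V1)
```

The rendered text could be committed as, say, `contracts/orders/v1.avsc` (a made-up path) and reviewed like any other code change.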

While you need a way to discover/catalog your contracts, you must also detect
and flag violations and take action based on them. This means you must run
monitors, programmatically prevent breaking changes, and isolate bad data for
review.

Here are three ways to take action against violations:

 * In the CI/CD workflow: for example, evaluate and block schema-breaking
   changes before they are deployed.
 * On the data itself: if you’re using a stream processing system, you can
   check each data record to validate that it meets the contract’s expectations.
   Records that violate the contract are sent to an isolated queue for review,
   which keeps low-quality records out of the data product.
 * Through a monitoring layer: after the data arrives, you can look at the
   statistical distributions of the data and detect any unexpected changes in
   its shape.
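
The record-level and monitoring options can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a production design: the two expectations, the in-memory "quarantine" list standing in for an isolated queue, and the three-sigma threshold are all invented for the example.

```python
from statistics import mean, pstdev
from typing import Any, Dict, Iterable, List, Sequence, Tuple

Record = Dict[str, Any]


def violations(record: Record) -> List[str]:
    """Return the contract expectations (hypothetical) that this record fails."""
    failed = []
    order_id = record.get("order_id")
    if not isinstance(order_id, str) or not order_id:
        failed.append("order_id must be a non-empty string")
    amount = record.get("amount_cents")
    if not isinstance(amount, int) or isinstance(amount, bool) or amount < 0:
        failed.append("amount_cents must be a non-negative integer")
    return failed


def route(records: Iterable[Record]) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Split records into accepted ones and quarantined ones with their reasons."""
    accepted: List[Record] = []
    quarantined: List[Tuple[Record, List[str]]] = []
    for record in records:
        failed = violations(record)
        if failed:
            quarantined.append((record, failed))  # held back for review
        else:
            accepted.append(record)
    return accepted, quarantined


def mean_shift_detected(baseline: Sequence[float], current: Sequence[float],
                        max_sigmas: float = 3.0) -> bool:
    """Flag a batch whose mean drifts more than max_sigmas baseline deviations."""
    spread = pstdev(baseline)
    if spread == 0:
        return mean(current) != mean(baseline)
    return abs(mean(current) - mean(baseline)) / spread > max_sigmas


accepted, quarantined = route([
    {"order_id": "A-1", "amount_cents": 1250},
    {"order_id": "", "amount_cents": 300},
    {"order_id": "A-3", "amount_cents": -5},
])
```

In a real stream processor the quarantine list would typically be a separate topic or queue, and the drift check would run on a schedule against profiled statistics rather than raw lists.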

MAKING A BUSINESS CASE FOR DATA CONTRACTS

> You manage the rest of your software as code. Why not your data?

This framing, Shirshanka shared, resonates with executive leaders because they
are already bought in on managing software as code. Focus on the principle of
‘managing data using software engineering practices.’

The most effective way to secure funding for data contracts is to attach them
to existing initiatives and roll them out iteratively on a subset of the data
stack.

MANAGING DATA CONTRACTS AT SCALE

The big challenge in managing contracts is less technical and more
sociocultural.

You need to get people who don’t think about downstream data use cases to change
their approach and consider playing an active engineering role around the data.

Here’s an approach Chad recommends based on his work at Convoy:

STEP 1: SPREAD AWARENESS

The first step is building awareness of how producers’ data is leveraged
downstream.

Convoy had a data contract mechanism for defining column-level dependencies
between data sources. Any time an engineer went to change a data source, they
could easily see what impact that would have on downstream assets: what would
potentially break, the use case, and how important it was.
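
A lookup like the one described here can be approximated with a simple mapping from columns to the downstream assets that read them. The sketch below is not Convoy's implementation; the table and column names, the dependency entries, and the `impact_of_change` helper are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Dependency(NamedTuple):
    downstream_asset: str
    use_case: str
    importance: str  # "high", "medium", or "low"


# Hypothetical column-level dependency map; in practice this would come from
# lineage metadata rather than a hand-written dictionary.
COLUMN_DEPENDENCIES: Dict[str, List[Dependency]] = {
    "shipments.delivered_at": [
        Dependency("eta_model_training_set", "ML feature", "medium"),
        Dependency("on_time_delivery_dashboard", "operations reporting", "high"),
    ],
    "shipments.carrier_notes": [],
}

_RANK = {"high": 0, "medium": 1, "low": 2}


def impact_of_change(columns: Iterable[str]) -> List[Tuple[str, Dependency]]:
    """List what could break if the given columns change, most important first."""
    hits = [
        (column, dep)
        for column in columns
        for dep in COLUMN_DEPENDENCIES.get(column, [])
    ]
    return sorted(hits, key=lambda hit: _RANK[hit[1].importance])


report = impact_of_change(["shipments.delivered_at", "shipments.carrier_notes"])
```

Showing this report at the moment an engineer edits a source is what turns a dependency graph into awareness.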

That went a long way toward helping engineers understand the impact of breaking
the contract, and toward creating accountability.

STEP 2: MEET PEOPLE WHERE THEY ARE

At Convoy, a contract was implemented and defined through a schema registry and
a schema serialization framework. Software engineers would use an SDK to define
and push new versions of contracts. If backward-incompatible changes were
detected, they surfaced in the engineers’ GitHub workflow.
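
A check of this kind can be sketched as a small function that a CI job runs against the old and new versions of a schema. The rules below are deliberately simplified (a removed field or a changed type counts as breaking) and are not the full Avro or Protobuf compatibility rules; the schemas and function names are hypothetical.

```python
from typing import Dict, List


def breaking_changes(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Compare field-name -> type maps and describe changes that break readers."""
    problems = []
    for name, dtype in old.items():
        if name not in new:
            problems.append(f"field removed: {name}")
        elif new[name] != dtype:
            problems.append(f"type changed: {name} ({dtype} -> {new[name]})")
    return problems


def ci_exit_code(old: Dict[str, str], new: Dict[str, str]) -> int:
    """Return a non-zero exit code so a CI job fails on breaking changes."""
    problems = breaking_changes(old, new)
    for problem in problems:
        print(f"contract violation: {problem}")
    return 1 if problems else 0


# Hypothetical schema versions: v2 drops a field, retypes one, and adds one.
schema_v1 = {"order_id": "string", "amount_cents": "long", "coupon_code": "string"}
schema_v2 = {"order_id": "string", "amount_cents": "double", "region": "string"}

problems = breaking_changes(schema_v1, schema_v2)
```

A CI step could call `sys.exit(ci_exit_code(old, new))` so the failure appears on the pull request, which keeps the feedback inside the workflow engineers already use.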

Whenever possible, meet people where they are and introduce as little change to
their existing workflows as possible. The more deviation from their current
workflow, the harder it will be to scale.

DATA CONTRACTS AND THE MODERN DATA CATALOG

The cost of creating a single data contract is non-trivial, and managing a large
volume of contracts can quickly become challenging; you must ensure that you’re
creating contracts on the most valuable data assets.

The data catalog and its underlying metadata graph can help you prioritize which
assets require a contract by using the following:

 * data lineage to understand how often business-critical downstream assets
   reference a dataset
 * data quality assertions and profiling results to determine a dataset’s
   reliability
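
One way to turn those two signals into a priority list is sketched below. The asset names and numbers are made up, and the ranking rule (most business-critical downstream references first, least reliable first among ties) is an assumption for illustration, not a DataHub feature.

```python
from typing import Iterable, List, NamedTuple


class AssetStats(NamedTuple):
    name: str
    critical_downstream_count: int  # derived from data lineage
    assertion_pass_rate: float      # from quality assertions, 0.0 to 1.0


def rank_contract_candidates(assets: Iterable[AssetStats]) -> List[AssetStats]:
    """Order assets so heavily used, less reliable datasets come first."""
    return sorted(
        assets,
        key=lambda a: (-a.critical_downstream_count, a.assertion_pass_rate),
    )


# Invented example inventory.
candidates = rank_contract_candidates([
    AssetStats("warehouse.orders", 12, 0.97),
    AssetStats("warehouse.page_views", 2, 0.80),
    AssetStats("warehouse.payments", 12, 0.85),
])
```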

Companies such as Optum, Saxo Bank, and Zendesk already use this approach. If
you’re looking for inspiration, check out how Stripe uses DataHub to solve their
observability challenges by encoding their data contracts in their Airflow DAGs.

STARTING THE DATA CONTRACT JOURNEY: ADVICE AND RECOMMENDATIONS

START SMALL

Start with valuable, revenue-generating use cases. Introduce constraints
gradually. Start with one or two meaningful and easy-to-debug constraints and
introduce more nuanced use cases over time.

LEVERAGE WHAT YOU HAVE

Don’t look at data contracts as a net-new phenomenon. Maybe you’re already using
dbt Tests or encoding quality checks within your Airflow DAGs — treat that as
your starting point and build from there.

Phew, we made it through. I hope this cleared up a concept or two to help you
get started with data contracts. Best of luck on your data contract journey!
