How Synthetic Data Can Speed Up Development

When building data products, access, governance, and security are paramount. Mishandling data can lead to costly breaches and reputational damage. However, the processes that prevent this take a long time and slow development. By using synthetic data, companies can speed up development safely.

By using synthetic data as an alternative to production data, developers can build tools in parallel with governance and security work, without putting production data at risk. This significantly reduces lead time, as development no longer proceeds in a waterfall manner. It also aids security testing: because synthetic data can mimic the volume of production data, tests run at a realistic scale, which derisks the entire project.

Alternative Approaches

Traditional approaches to these problems typically involve either a small amount of manually created data or an anonymized copy of the production dataset. A small dataset does not reflect the load demands of production, and there are demonstrated ways to de-anonymize data, revealing sensitive information.

Because synthetic data contains no PII and cannot be de-anonymized, work can be outsourced to teams in other territories, who can develop tools on synthetic data without ever needing access to production data.

Use Cases for Synthetic Data

Reduce Risk For Data Migrations

When migrating data from one warehouse to another, synthetic data lets you test with realistic data loads, greatly reducing the risk of a data leak or a migration error. You can take this further by generating more synthetic data than production currently holds, to test whether the system can meet future requirements.

Single Customer View

When combining data from multiple sources, records may contain slightly different versions of the truth. An address may have changed, names could be spelled differently, and login locations change – but how do you connect all these disparate data sources accurately?

With synthetic data, you can build custom datasets that contain these mismatches but also carry unique identifiers, which serve as the ground truth for validating matches. This makes it possible to test the matching capability of the product and attach accuracy statistics to it, improving its credibility.
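The idea above can be sketched in a few lines of Python. This is a minimal illustration, not a production matcher: the record fields, the `corrupt` function, and both matchers (`match_by_email`, `match_by_exact_name`) are hypothetical names invented for this example. The `true_id` field plays the role of the ground-truth identifier and is only read when scoring, never by the matcher itself.

```python
import random

def make_customers(n):
    # true_id is the ground-truth key; a matcher under test must never read it.
    return [
        {
            "true_id": i,
            "name": f"Customer Name{i}",
            "email": f"user{i}@example.com",
            "address": f"{i} High Street",
        }
        for i in range(n)
    ]

def corrupt(record, rng):
    """Copy a record and introduce one mismatch, keeping true_id intact."""
    out = dict(record)
    kind = rng.choice(["name_typo", "email_case", "moved"])
    if kind == "name_typo":
        chars = list(out["name"])
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]  # swap adjacent characters
        out["name"] = "".join(chars)
    elif kind == "email_case":
        out["email"] = out["email"].upper()
    else:
        out["address"] = f"{rng.randint(1, 999)} New Road"
    return out

def match_by_email(record, candidates):
    """Hypothetical matcher: case-insensitive email equality."""
    key = record["email"].lower()
    return next((c for c in candidates if c["email"].lower() == key), None)

def match_by_exact_name(record, candidates):
    """Hypothetical, deliberately brittle matcher: exact name equality."""
    return next((c for c in candidates if c["name"] == record["name"]), None)

def matching_accuracy(source_a, source_b, matcher):
    """Share of source_b records linked to the source_a record with the same true_id."""
    correct = 0
    for rec in source_b:
        hit = matcher(rec, source_a)
        if hit is not None and hit["true_id"] == rec["true_id"]:
            correct += 1
    return correct / len(source_b)

rng = random.Random(42)
source_a = make_customers(200)
source_b = [corrupt(r, rng) for r in source_a]

email_accuracy = matching_accuracy(source_a, source_b, match_by_email)
name_accuracy = matching_accuracy(source_a, source_b, match_by_exact_name)
print(f"email matcher: {email_accuracy:.2%}, exact-name matcher: {name_accuracy:.2%}")
```

Because every synthetic record carries its true identifier, the two matchers can be compared with a single number each, which is the kind of accuracy statistic that is hard to produce from unlabeled production data.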

Offshore Workers

There are many cases where production data cannot leave a particular territory (for example, because of GDPR restrictions on transferring data out of the EU). This massively reduces the pool of workers available for these projects. If an offshore team develops against synthetic data, it can stay within data compliance while still working with representative data. The end product can then be deployed directly into production with maximal compatibility.

Machine Learning & Analytics

Most synthetic data reflects the production data only structurally: data types match, as do limits such as minimum and maximum values. However, if your workflow involves analytics or machine learning, the synthetic data must also follow the same distributions and correlations as the production data. This allows data scientists to develop models that would normally require highly sensitive data, while working in lower-security environments.

There are caveats, however. To produce this kind of data, a machine learning model has to access and learn from the production dataset, which can be a critical blocker for some projects. The output also needs proper validation to confirm that the model is truly generating new values and not copying the production data.
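As a first-pass illustration of the validation caveat, the sketch below measures how many synthetic rows are exact copies of a production row. The function name `copied_row_rate` and the sample rows are hypothetical, and the "production" rows here are stand-in values, not real data. An exact-match check like this would not catch near-duplicates, so it should be read as a starting point rather than a complete privacy test.

```python
def row_key(row):
    # Sort by column name so that column order does not affect the comparison.
    return tuple(sorted(row.items()))

def copied_row_rate(synthetic_rows, production_rows):
    """Fraction of synthetic rows that exactly duplicate some production row."""
    if not synthetic_rows:
        return 0.0
    production_keys = {row_key(row) for row in production_rows}
    copies = sum(row_key(row) in production_keys for row in synthetic_rows)
    return copies / len(synthetic_rows)

# Stand-in rows for illustration only.
production = [
    {"age": 34, "city": "Leeds"},
    {"age": 51, "city": "York"},
]
synthetic = [
    {"age": 34, "city": "Leeds"},  # exact copy of a production row
    {"age": 29, "city": "Bath"},
    {"age": 51, "city": "Hull"},
    {"age": 40, "city": "York"},
]

rate = copied_row_rate(synthetic, production)
print(rate)  # 0.25: one of the four synthetic rows is a copy
```

A check of this kind has to run inside the environment that holds the production data, with only the resulting rate shared outside it.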

Making Synthetic Data

Generating typical synthetic data does not require scanning or seeing the production data (with the exception of machine-learning-generated data). It does, however, need some descriptive information, such as table names, column names, and high-level information about the type of data and any limits or patterns.

For example, for numbers and dates, a lower and an upper limit help produce data that reflects the production data. For text, it is vital to know whether a column contains names or addresses, and what pattern any IDs follow (e.g. 2 letters followed by 4 numbers).
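To make this concrete, here is a minimal sketch of a generator driven purely by descriptive metadata. The `SCHEMA` layout, the column names, the limits, and the helper functions are all assumptions made up for this example. The point is only that nothing in it reads a production row.

```python
import random
import string
from datetime import date, timedelta

# Hypothetical column descriptions: metadata only, no production rows.
SCHEMA = {
    "customer_id": {"type": "pattern", "letters": 2, "digits": 4},
    "age": {"type": "int", "min": 18, "max": 95},
    "signup_date": {"type": "date", "min": date(2015, 1, 1), "max": date(2023, 12, 31)},
    "first_name": {"type": "choice", "values": ["Alice", "Bilal", "Chen", "Dana"]},
}

def generate_value(spec, rng):
    """Produce one value that respects the limits or pattern in `spec`."""
    kind = spec["type"]
    if kind == "int":
        return rng.randint(spec["min"], spec["max"])  # inclusive bounds
    if kind == "date":
        span_days = (spec["max"] - spec["min"]).days
        return spec["min"] + timedelta(days=rng.randint(0, span_days))
    if kind == "pattern":
        # e.g. 2 letters followed by 4 numbers
        letters = "".join(rng.choices(string.ascii_uppercase, k=spec["letters"]))
        digits = "".join(rng.choices(string.digits, k=spec["digits"]))
        return letters + digits
    if kind == "choice":
        return rng.choice(spec["values"])
    raise ValueError(f"unknown column type: {kind}")

def generate_rows(schema, n_rows, seed=0):
    """Generate `n_rows` dictionaries, one key per described column."""
    rng = random.Random(seed)  # seeded so test datasets are reproducible
    return [
        {name: generate_value(spec, rng) for name, spec in schema.items()}
        for _ in range(n_rows)
    ]

rows = generate_rows(SCHEMA, 1000)
print(rows[0])
```

Raising `n_rows` is all it takes to scale the dataset up for load testing. Values here are drawn uniformly within the limits, so this sketch matches types and ranges but not the distributions or correlations of real data.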

Conclusions

Synthetic data is a vital tool for developing products: it helps teams get up and running faster and develop more safely, with proper load testing. It has many applications and can meet many different demands.

Adam Fletcher