
Generating Realistic Mock Data in Microsoft Fabric

Tech In Focus
Category
Blog
Date published
08.05.2026
Written by
Tom Legge, Principal Data Solutions Architect at Seriös Group

In modern data and analytics projects, development often starts before production data is available. Without a way to generate realistic mock data, nothing can be meaningfully tested: pipelines remain unvalidated, models stay unproven, and dashboards cannot easily be built or performance-tested.

We’ve repeatedly encountered this challenge while delivering complex solutions on modern cloud platforms. To solve it, Tom Legge, Principal Data Solutions Architect at Seriös Group, developed a set of PySpark notebooks capable of generating realistic mock data directly into Lakehouse tables in Microsoft Fabric. These notebooks replicate the structure, grain, distribution, and behaviour of production data using a Kimball dimensional model and real dimension tables where available.

This guide walks through the complete approach using a real scenario.

You’ll learn:

  • Why mock data is essential for testing models and dashboards at realistic data volumes.

  • How to build dimension tables and define a fact table schema that matches your intended design.

  • How to craft the initial PySpark notebook to generate large volumes of realistic test data.

  • How to improve realism through skewed distributions, completeness options, and meaningful business rules.

  • How to incorporate existing dimensional data already present in the Lakehouse.

  • How to introduce controlled “bad data” such as nulls and negative values to expose edge cases early.
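
As a rough illustration of the last few points (skewed distributions and controlled "bad data"), here is a minimal sketch. It uses plain Python rather than PySpark so that it runs anywhere, and it is not taken from the guide's notebooks: the function name, the column names (`sale_id`, `product_key`, `quantity`, `amount`), the key values, and the null and negative rates are all hypothetical choices made for the example.

```python
import random


def generate_fact_rows(n_rows, product_keys, null_rate=0.02,
                       negative_rate=0.01, seed=42):
    """Return a list of mock fact rows (hypothetical sales schema)."""
    rng = random.Random(seed)  # fixed seed so every run produces the same data

    # Skewed distribution: the first key is the most popular, later keys
    # progressively rarer (weights 1, 1/2, 1/3, ...).
    weights = [1.0 / (rank + 1) for rank in range(len(product_keys))]

    rows = []
    for sale_id in range(1, n_rows + 1):
        product_key = rng.choices(product_keys, weights=weights, k=1)[0]
        quantity = rng.randint(1, 10)
        amount = round(quantity * rng.uniform(5.0, 50.0), 2)

        # Controlled bad data: a small share of rows gets a null amount,
        # and another small share gets a negative amount.
        roll = rng.random()
        if roll < null_rate:
            amount = None
        elif roll < null_rate + negative_rate:
            amount = -amount

        rows.append({
            "sale_id": sale_id,
            "product_key": product_key,
            "quantity": quantity,
            "amount": amount,
        })
    return rows


# Assumed dimension keys; in practice these could come from a real dimension table.
rows = generate_fact_rows(10_000, product_keys=[101, 102, 103, 104, 105])
```

In a Fabric notebook the same ideas would more likely be expressed with Spark DataFrame operations, so that the data is generated in parallel and written straight to a Lakehouse table. Keeping the bad-data rates as parameters makes it easy to switch the edge cases on or off for different test runs.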

