Generating Realistic Mock Data in Microsoft Fabric
- Category: Blog
- Date published: 08.05.2026
- Written by: Tom Legge, Principal Data Solutions Architect at Seriös Group
In modern data architectures and analytics, development often starts before production data is available. Without a way to generate realistic mock data, however, nothing can be meaningfully tested: pipelines remain unvalidated, models stay unproven, and dashboards cannot be built or performance-tested with confidence.
We’ve repeatedly encountered this challenge while delivering complex solutions on modern cloud platforms. To solve it, Tom Legge, Principal Data Solutions Architect at Seriös Group, developed a set of PySpark notebooks capable of generating realistic mock data directly into Lakehouse tables in Microsoft Fabric. These notebooks replicate the structure, grain, distribution, and behaviour of production data using a Kimball dimensional model and real dimension tables where available.
This guide breaks down that approach end to end, using a realistic scenario.
You’ll learn:
Why mock data is essential for testing models, dashboards, and performance at realistic data volumes.
How to build dimension tables and define a fact table schema that matches your intended design.
How to craft the initial PySpark notebook to generate large volumes of realistic test data.
How to improve realism through skewed distributions, completeness options, and meaningful business rules.
How to incorporate existing dimensional data already present in the Lakehouse.
How to introduce controlled “bad data” such as nulls and negative values to expose edge cases early.
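The sections that follow use PySpark notebooks writing to Lakehouse tables, but two of the core ideas above — skewed key distributions and a controlled dose of bad data — can be sketched in plain Python. Everything here (the `generate_sales_rows` function, the column names, the rates) is illustrative, not taken from the actual notebooks:

```python
import random

def generate_sales_rows(n, customer_keys, product_keys, bad_data_rate=0.02, seed=42):
    """Generate mock fact rows with skewed customer activity and a
    controlled share of deliberately bad data (nulls and negatives)."""
    rng = random.Random(seed)
    # Zipf-like weights: a few "heavy" customers account for most
    # transactions, approximating real-world activity distributions.
    weights = [1 / (rank + 1) for rank in range(len(customer_keys))]
    rows = []
    for i in range(n):
        amount = round(rng.uniform(5.0, 500.0), 2)
        # Inject controlled bad data so edge cases surface early in testing.
        roll = rng.random()
        if roll < bad_data_rate / 2:
            amount = None        # missing value
        elif roll < bad_data_rate:
            amount = -amount     # negative amount (refund-like or erroneous)
        rows.append({
            "sale_id": i + 1,
            "customer_key": rng.choices(customer_keys, weights=weights, k=1)[0],
            "product_key": rng.choice(product_keys),
            "amount": amount,
        })
    return rows

rows = generate_sales_rows(1_000,
                           customer_keys=list(range(1, 51)),
                           product_keys=list(range(1, 21)))
```

In the Fabric notebooks the same shape of logic is expressed over Spark DataFrames so it scales to large row counts, but the levers are identical: a weighting scheme for realism and an explicit, tunable rate for bad records.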