Due to persistent issues concerning sensitive information, when working with
big data, we present a new approach of generating arti cial data1in the form
of datasets. For this purpose, we specify the term dataset to represent a
UNIX directory structure, consisting of various les and folders.
Especially in computer science, there exists a distinct need for data. Mostly,
this data already exists, but contains sensitive information. Thus, such critical
data is supposed to stay protected against third parties. Hence, this
reservation of data leads to a lack of available data for open source developers
as well as for researchers.
Therefore, we discovered a way to produce replicated datasets, given an origin
dataset as input. Such replicated datasets represent the origin dataset as
accurate as possible, without leaking any sensitive information.
Thus, we introduce the Dataset Anonymization and Replication Tool, short
DART, a Python based framework, which allows the replication of datasets.
Since we aim to encourage the data science community to participate in our
work, we constructed DART as a framework with high degree of adaptability
We started with the analysis of datasets and various le and MIME types
to nd suitable properties which characterize datasets. Thus, we de ned
a broad range of properties, respectively characteristics, initiating with the
number of les, to the point of le speci c characteristics like permissions. In
the next step, we explored several mathematical and statistical approaches
to replicate the selected characteristics. Therefore, we chose to model characteristics
using relative frequency distributions, respectively unigrams, discrete
as well as continuous random variables. Finally, we started to produce replicated datasets and analyzed the replicated characteristics against the
characteristics of the corresponding origin dataset. Thus, the comparison
between origin and replicated datasets is exclusively based on the selected
The achieved results highly depend on the origin dataset as well as on the
characteristics of interest. Thus, origin datasets, which indicate a simple
structure, tend more likely to deliver utilizable results. Otherwise, large and
complex origin datasets might struggle to be replicated succiently. Nevertheless,
the results aspire that tools like DART will be utilized to provide
arti cial data1for persistent use cases.