A digital watermark is a marker covertly embedded in data; watermarking is sometimes described as “the practice of imperceptibly altering a work to embed a message about that work”. For Semantic Container, a digital watermark is a unique digital fingerprint applied to the data a container provides: every data request returns a dataset with insignificant errors that uniquely identify the recipient of that dataset. If such a dataset is leaked and appears in an unintended location, the recipient who originally requested and leaked it can be identified. This blog post describes the design of the digital watermarking that will be implemented in the course of the ongoing MyPCH project.
To embed a watermark into a dataset, the following two steps are performed:
- Pre-processing: The available data is split into fragments of a defined size, e.g., all measurements from a single day.
- Encoding: Based on a secret parameter (or key) unique to the requesting party, a sequence of errors with the same size as a data fragment is created and then applied to the original data. For numerical values, the error is simply added to the value.
The watermark should still identify the recipient after the following attacks on a watermarked dataset:
- Distortion Attack: Different kinds of distortions may be applied to a dataset, e.g., rounding to the n-th digit. Rounding at the least significant digit preserves the data’s usability the most, but the watermark may then be detected more easily than after rounding at a more significant digit.
- Deletion Attack: As with distortion attacks, different kinds of deletions may be applied to a dataset to make identifying the original recipient harder.
- Collusion Attack: A collusion attack combines n differently watermarked copies of the same dataset. For each measurement, the mean over all n copies is calculated to create a new dataset.
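The embedding steps and the three attacks can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MyPCH implementation: the names `split_into_fragments`, `error_sequence`, `embed`, and the attack functions are hypothetical, as are the fragment size, the error scale, the SHA-256 based seeding, and the choice of deleting every n-th value. Python's `random` module is used for reproducibility only and is not a cryptographically secure generator.

```python
import hashlib
import random
from statistics import mean

FRAGMENT_SIZE = 4    # assumption: values per fragment (e.g. one day of measurements)
ERROR_SCALE = 0.05   # assumption: largest error added to a single value


def split_into_fragments(values, size=FRAGMENT_SIZE):
    """Pre-processing: cut the series into consecutive fragments of `size` values."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def error_sequence(secret_key, fragment_index, size):
    """Derive a reproducible error sequence from a recipient's secret key (assumed scheme)."""
    digest = hashlib.sha256(f"{secret_key}:{fragment_index}".encode()).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    return [rng.uniform(-ERROR_SCALE, ERROR_SCALE) for _ in range(size)]


def embed(values, secret_key):
    """Encoding: add the key-specific error to each value, fragment by fragment."""
    marked = []
    for index, fragment in enumerate(split_into_fragments(values)):
        errors = error_sequence(secret_key, index, len(fragment))
        marked.extend(v + e for v, e in zip(fragment, errors))
    return marked


def distortion_attack(values, digits):
    """Distortion: round every value to `digits` decimal places."""
    return [round(v, digits) for v in values]


def deletion_attack(values, every):
    """Deletion (one possible variant): drop every `every`-th value."""
    return [v for i, v in enumerate(values) if (i + 1) % every != 0]


def collusion_attack(copies):
    """Collusion: take the mean of each measurement over all copies."""
    return [mean(column) for column in zip(*copies)]


original = [20.1, 20.4, 21.0, 21.3, 20.8, 20.2, 19.9, 19.7]
alice = embed(original, "alice-secret")   # copy delivered to one recipient
bob = embed(original, "bob-secret")       # copy delivered to another recipient

print(distortion_attack(alice, 1))
print(deletion_attack(alice, 4))
print(collusion_attack([alice, bob]))
```

Averaging the copies in a collusion attack also averages the embedded errors, which is one reason the error extracted from a leaked dataset may be noisy.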
Detecting a watermark in a suspicious dataset requires the original data to be available and consists of the following two steps:
- Detection: The suspicious dataset (already fragmented) is matched against the original data fragments through similarity search. In case of a match, the difference between the suspicious fragment and the original fragment is the (possibly noisy) unique error.
- Mapping: The extracted error is compared through similarity search with the original errors derived from the recipients' secret parameters (or keys). In case of a match, the original recipient of the data is identified.
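A minimal sketch of detection and mapping might look as follows. It is not the actual implementation: Euclidean distance stands in for the similarity search, the thresholds are arbitrary, and `error_sequence`, `detect`, and `map_to_recipient` are hypothetical names. The sketch also assumes that errors are derived from a key by seeding a pseudo-random generator with a SHA-256 hash, which the design above does not prescribe.

```python
import hashlib
import math
import random

ERROR_SCALE = 0.05   # assumption: largest error added to a single value


def error_sequence(secret_key, fragment_index, size):
    """Recreate the error sequence for a key (assumed derivation scheme)."""
    digest = hashlib.sha256(f"{secret_key}:{fragment_index}".encode()).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    return [rng.uniform(-ERROR_SCALE, ERROR_SCALE) for _ in range(size)]


def distance(a, b):
    """Euclidean distance, used here as a simple similarity measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def detect(suspicious_fragment, original_fragments, max_distance):
    """Detection: find the closest original fragment and extract the error."""
    index, best = min(
        enumerate(original_fragments),
        key=lambda pair: distance(suspicious_fragment, pair[1]),
    )
    if distance(suspicious_fragment, best) > max_distance:
        return None  # no original fragment is similar enough
    return index, [s - o for s, o in zip(suspicious_fragment, best)]


def map_to_recipient(fragment_index, extracted_error, recipient_keys, max_distance):
    """Mapping: compare the extracted error with each recipient's original error."""
    scores = {
        name: distance(
            extracted_error,
            error_sequence(key, fragment_index, len(extracted_error)),
        )
        for name, key in recipient_keys.items()
    }
    name = min(scores, key=scores.get)
    return name if scores[name] <= max_distance else None


original_fragments = [[20.1, 20.4, 21.0, 21.3], [20.8, 20.2, 19.9, 19.7]]
recipient_keys = {"alice": "alice-secret", "bob": "bob-secret"}

# Simulate a leak: fragment 1 as delivered to "bob", then rounded to 3 digits.
bob_errors = error_sequence("bob-secret", 1, 4)
leaked = [round(v + e, 3) for v, e in zip(original_fragments[1], bob_errors)]

match = detect(leaked, original_fragments, max_distance=0.2)
recipient = None
if match is not None:
    index, extracted = match
    recipient = map_to_recipient(index, extracted, recipient_keys, max_distance=0.02)
print(recipient)
```

Both thresholds trade false matches against robustness: the stronger the attack on the leaked data, the noisier the extracted error and the more tolerance the mapping step needs.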
The process described above, including various test cases for attacks, will be implemented in the coming weeks and will soon be available in the Semantic Container base package. Feel free to reach out to us with any questions or comments!