There is no better performance test data than production data, BUT sometimes production data must be "anonymized" before being used by developers. Does anyone have any suggestions about how to do this so that the SQL Anywhere query optimizer will pick the more-or-less the same plans and result in more-or-less the same performance? Some possible rules come to mind: Any two equal input values must result in two equal output values; any two different input values must result in two different output values; the distribution of output values must "look" more-or-less-the same as the distribution of input values. Does this only matter for columns in primary keys, foreign keys and indexes? Should statistics be recreated after anonymizing? ...or is there a danger recreation will throw everything off? |
It seems that you need the obfuscating functionality. I do not know if it makes sense to obfuscate primitive values, but for strings you can try to use the funny function from the link http://dba.stackexchange.com/questions/23786/data-obfuscation-in-sql-server.
Or something similar.
Thanks!
The point of the question, however, is not how to anonymize data, but how to do it to a SQL Anywhere database without affecting query performance one way or another.
It may not be possible... perhaps the answer is to leave the database data alone and anonymize the query output, but that may not satisfy the letter of the law.
In order to understand: Is there a further rule "input and output values must be of (almost) the same length"? - Say, encrypting all strings with the same key would satisfy the first and second rule but the distribution would certainly be different, and (at least when using the builtin ENCRYPT() function) output data would be noticably longer, so I guess that could have a significant influence on plans.
If you would however "simply" modify data by replacing particular bytes with other bytes (as in a "Substitution cipher", I guess all four rules would hold but apparently it would be not too difficult to encode that "anonymization".
Yes, that's a comment, no answer at all...
Or at least, translate the strings to any ASCII-based language before encrypting them. This will add a little more complexity :)