Data Masking

Background

Security control has always been a crucial part of orchestration, and data masking falls into this category. For both Internet enterprises and traditional sectors, data security is a highly valued and sensitive topic. Data masking means transforming sensitive information according to masking rules so that private data is reliably protected. Data that involves customer security or business sensitivity, such as ID numbers, phone numbers, card numbers, customer numbers and other personal information, must be masked according to relevant regulations.

For this reason, ShardingSphere provides data masking: it encrypts users' sensitive information before storing it in the database. When users query that information, it is decrypted and returned in its original form.

ShardingSphere makes the encryption and decryption processes completely transparent to users, who store masked data and read back the original data without having to be aware of the process. In addition, ShardingSphere provides built-in masking algorithms that can be used directly. At the same time, it exposes masking algorithm interfaces that users can implement themselves. After simple configuration, ShardingSphere can use user-provided algorithms to perform encryption, decryption and masking.

Solution

ShardingSphere provides two data masking solutions, which correspond to its two encryption and decryption interfaces: ShardingEncryptor and ShardingQueryAssistedEncryptor.

On the one hand, ShardingSphere provides built-in encryption and decryption implementations, which users can use after simple configuration. On the other hand, to meet the requirements of different scenarios, it also opens the relevant encryption and decryption interfaces so that users can supply their own implementation classes. After simple configuration, ShardingSphere then uses the user-defined encryption and decryption solution to mask data.

ShardingEncryptor

This solution provides two methods, encrypt() and decrypt(), to encrypt and decrypt the data to be masked.

When users execute INSERT, DELETE and UPDATE statements, ShardingSphere parses, rewrites and routes the SQL according to the configuration, and uses encrypt() to encrypt the data before storing it in the database. When users execute SELECT, ShardingSphere uses decrypt() to decrypt the sensitive data read from the database and returns the original data to users.

Currently, ShardingSphere provides two implementations of this masking solution, MD5 (irreversible) and AES (reversible), both of which can be used after configuration.
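The two-method contract and the difference between an irreversible and a reversible implementation can be sketched as follows. This is only an illustration under stated assumptions: SimpleEncryptor, Md5Encryptor and AesEncryptor are hypothetical names that stand in for the interface and built-in implementations described above, and the key derivation and AES mode are choices made for the example, not ShardingSphere's actual behavior.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

class EncryptorSketch {

    /** Hypothetical stand-in for the two-method contract; not the real interface. */
    interface SimpleEncryptor {
        String encrypt(String plaintext);
        String decrypt(String ciphertext);
    }

    /** Computes a digest and wraps checked exceptions so callers stay simple. */
    static byte[] digest(String algorithm, String input) {
        try {
            return MessageDigest.getInstance(algorithm)
                    .digest(input.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Irreversible: the stored value is an MD5 hex digest. */
    static final class Md5Encryptor implements SimpleEncryptor {
        @Override
        public String encrypt(String plaintext) {
            StringBuilder hex = new StringBuilder();
            for (byte b : digest("MD5", plaintext)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }

        @Override
        public String decrypt(String ciphertext) {
            return ciphertext; // a digest cannot be reversed, so return it unchanged
        }
    }

    /** Reversible: AES with a key derived from a configured secret. */
    static final class AesEncryptor implements SimpleEncryptor {
        private final SecretKeySpec key;

        AesEncryptor(String secret) {
            // Assumption: truncate a SHA-256 hash to 16 bytes to get an AES-128 key.
            this.key = new SecretKeySpec(Arrays.copyOf(digest("SHA-256", secret), 16), "AES");
        }

        private byte[] run(int mode, byte[] input) {
            try {
                // ECB is deterministic: equal plaintexts give equal ciphertexts.
                Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
                cipher.init(mode, key);
                return cipher.doFinal(input);
            } catch (GeneralSecurityException e) {
                throw new IllegalStateException(e);
            }
        }

        @Override
        public String encrypt(String plaintext) {
            byte[] raw = run(Cipher.ENCRYPT_MODE, plaintext.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(raw);
        }

        @Override
        public String decrypt(String ciphertext) {
            byte[] raw = run(Cipher.DECRYPT_MODE, Base64.getDecoder().decode(ciphertext));
            return new String(raw, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        SimpleEncryptor aes = new AesEncryptor("demo-secret");
        String stored = aes.encrypt("13800000000");
        System.out.println(stored + " -> " + aes.decrypt(stored));
        System.out.println(new Md5Encryptor().encrypt("13800000000"));
    }
}
```

In this sketch both implementations are deterministic, so identical input always yields identical stored values. That property is what keeps equality queries on the encrypted column working in this first solution.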

ShardingQueryAssistedEncryptor

Compared with the first masking scheme, this one is more secure and more complex. Its idea is that even identical data, for example two identical user passwords, should not be stored in the same masked form in the database. This helps protect user information and guards against credential stuffing.

This scheme provides three methods to implement: encrypt(), decrypt() and queryAssistedEncrypt(). In the encrypt() phase, users can introduce a variable, a timestamp for example, and encrypt the combination of original data + variable. Because of the variable, the same original data produces different encrypted data each time. In the decrypt() phase, users use the variable to decrypt the data according to the encryption algorithm chosen earlier.
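The "original data + variable" idea can be sketched as follows. This is a minimal illustration, not ShardingSphere's implementation: the class name VariableEncryptSketch, the two-argument encrypt() signature and the "variable:plaintext" layout are all assumptions made for the example, and the variable is assumed not to contain ':'.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

class VariableEncryptSketch {
    private final SecretKeySpec key;

    VariableEncryptSketch(String secret) {
        this.key = deriveKey(secret);
    }

    /** Assumption: AES-128 key taken from the first 16 bytes of SHA-256(secret). */
    private static SecretKeySpec deriveKey(String secret) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256")
                    .digest(secret.getBytes(StandardCharsets.UTF_8));
            return new SecretKeySpec(Arrays.copyOf(hash, 16), "AES");
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    private byte[] run(int mode, byte[] input) {
        try {
            Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
            cipher.init(mode, key);
            return cipher.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Encrypts "variable:plaintext", so a different variable changes the ciphertext. */
    String encrypt(String plaintext, String variable) {
        byte[] combined = (variable + ":" + plaintext).getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(run(Cipher.ENCRYPT_MODE, combined));
    }

    /** Decrypts, then drops everything up to and including the first ':'. */
    String decrypt(String ciphertext) {
        byte[] raw = run(Cipher.DECRYPT_MODE, Base64.getDecoder().decode(ciphertext));
        String combined = new String(raw, StandardCharsets.UTF_8);
        return combined.substring(combined.indexOf(':') + 1);
    }

    public static void main(String[] args) {
        VariableEncryptSketch sketch = new VariableEncryptSketch("demo-secret");
        // Two different variables for the same password give two different stored values.
        String first = sketch.encrypt("password123", "1700000000000");
        String second = sketch.encrypt("password123", "1700000000001");
        System.out.println(first.equals(second));
        System.out.println(sketch.decrypt(first) + " / " + sketch.decrypt(second));
    }
}
```

Putting the variable in front of the plaintext means the ciphertext already differs in its first block. A production implementation would more likely rely on a randomized cipher mode with an initialization vector; the sketch only mirrors the combination step described above.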

Although this method does increase data security, it introduces another problem: because the same data is stored in the database in different forms, an equality query on the encrypted column (SELECT * FROM table WHERE encryptedColumn = ?) may not find all rows with the same original data. To solve this, ShardingSphere introduces an assisted query column, whose values are generated by queryAssistedEncrypt(). Unlike encrypt(), this method encrypts the original data in a different way that always produces the same encrypted data for the same original data. Users store the data processed by queryAssistedEncrypt() to assist queries on the original data, so the table may contain one additional assisted query column.

queryAssistedEncrypt() and encrypt() generate and store different encrypted data; the output of encrypt() can be reversed with decrypt(), whereas queryAssistedEncrypt() is irreversible. When the original data is queried, ShardingSphere automatically parses, rewrites and routes the SQL, uses the assisted query column in the WHERE condition, and uses decrypt() to decrypt the data produced by encrypt() before returning it to users. All of this is transparent to users.
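The interplay of the three methods can be sketched with an in-memory table. Everything here is hypothetical and only illustrates the idea: the class AssistedQuerySketch, the Row fields, the internal counter that plays the role of the variable, and the use of SHA-256 for queryAssistedEncrypt() are assumptions, and the SQL rewriting that ShardingSphere performs is replaced by plain Java method calls.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

class AssistedQuerySketch {

    /** One stored row: the ciphertext column plus the assisted query column. */
    static final class Row {
        final String cipherColumn;
        final String assistedQueryColumn;

        Row(String cipherColumn, String assistedQueryColumn) {
            this.cipherColumn = cipherColumn;
            this.assistedQueryColumn = assistedQueryColumn;
        }
    }

    final List<Row> table = new ArrayList<>();
    private final SecretKeySpec key;
    private long counter; // stands in for a timestamp-like variable

    AssistedQuerySketch(String secret) {
        this.key = new SecretKeySpec(Arrays.copyOf(sha256(secret), 16), "AES");
    }

    private static byte[] sha256(String input) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    private byte[] aes(int mode, byte[] input) {
        try {
            Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
            cipher.init(mode, key);
            return cipher.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reversible and different on every call because of the changing variable. */
    String encrypt(String plaintext) {
        byte[] combined = ((counter++) + ":" + plaintext).getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(aes(Cipher.ENCRYPT_MODE, combined));
    }

    String decrypt(String ciphertext) {
        byte[] raw = aes(Cipher.DECRYPT_MODE, Base64.getDecoder().decode(ciphertext));
        String combined = new String(raw, StandardCharsets.UTF_8);
        return combined.substring(combined.indexOf(':') + 1);
    }

    /** Irreversible and deterministic: equal plaintexts give equal values. */
    String queryAssistedEncrypt(String plaintext) {
        return Base64.getEncoder().encodeToString(sha256(plaintext));
    }

    /** Rough analogue of a rewritten INSERT: both derived values are stored. */
    void insert(String plaintext) {
        table.add(new Row(encrypt(plaintext), queryAssistedEncrypt(plaintext)));
    }

    /** Rough analogue of a rewritten SELECT ... WHERE column = ?. */
    List<String> selectWhereEquals(String plaintext) {
        String probe = queryAssistedEncrypt(plaintext);
        List<String> result = new ArrayList<>();
        for (Row row : table) {
            if (row.assistedQueryColumn.equals(probe)) {
                result.add(decrypt(row.cipherColumn));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        AssistedQuerySketch db = new AssistedQuerySketch("demo-secret");
        db.insert("password123");
        db.insert("password123");
        db.insert("another-password");
        System.out.println(db.selectWhereEquals("password123"));
    }
}
```

The two rows holding the same password have different cipherColumn values but the same assistedQueryColumn value, which is why the equality lookup still finds both of them.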

For now, ShardingSphere abstracts this concept as an interface for users to implement and does not provide a concrete implementation of this masking solution. ShardingSphere uses the concrete implementation supplied by users to mask data.