The SQL written by engineers facing logic databases and tables cannot be executed directly in actual databases. SQL rewrite is used to rewrite logic SQL into rightly executable ones in actual databases, including two parts, correctness rewrite and optimization rewrite.
In situation with sharding tables, it requires to rewrite logic table names in sharding settings into actual table names acquired after routing. Database sharding does not require to rewrite table names. In addition to that, there are also column derivation, pagination information revision and other content.
Identifiers that need to be rewritten include table name, index name and schema name. Table name rewrite refers to the process to locate the position of logic tables in the original SQL and rewrite it as the physical table. Table name rewrite is one typical situation that requires to parse SQL. From a most plain case, if the logic SQL is as follow:
SELECT order_id FROM t_order WHERE order_id=1;
If the SQL is configured with sharding key order_id=1, it will be routed to Sharding Table 1. Then, the SQL after rewrite should be:
SELECT order_id FROM t_order_1 WHERE order_id=1;
In this most simple kind of SQL, whether parsing SQL to abstract syntax tree seems unimportant, SQL can be rewritten only by searching for and substituting characters. But in the following situation, it is unable to rewrite SQL rightly merely by searching for and substituting characters:
SELECT order_id FROM t_order WHERE order_id=1 AND remarks=' t_order xxx';
The SQL rightly rewritten is supposed to be:
SELECT order_id FROM t_order_1 WHERE order_id=1 AND remarks=' t_order xxx';
SELECT order_id FROM t_order_1 WHERE order_id=1 AND remarks=' t_order_1 xxx';
Because there may be similar characters besides the table name, the simple character substitute method cannot be used to rewrite SQL. Here is another more complex SQL rewrite situation:
SELECT t_order.order_id FROM t_order WHERE t_order.order_id=1 AND remarks=' t_order xxx';
The SQL above takes table name as the identifier of the field, so it should also be revised when SQL is rewritten:
SELECT t_order_1.order_id FROM t_order_1 WHERE t_order_1.order_id=1 AND remarks=' t_order xxx';
But if there is another table name defined in SQL, it is not necessary to revise that, even though that name is the same as the table name. For example:
SELECT t_order.order_id FROM t_order AS t_order WHERE t_order.order_id=1 AND remarks=' t_order xxx';
SQL rewrite only requires to revise its table name:
SELECT t_order.order_id FROM t_order_1 AS t_order WHERE t_order.order_id=1 AND remarks=' t_order xxx';
Index name is another identifier that can be rewritten. In some databases (such as MySQL), the index is created according to the table dimension, and its names in different tables can repeat. In some other databases (such as PostgreSQL), however, the index is created according to the database dimension, index names in different tables are required to be one and the only.
In sharding table, the same logic table will be split into many physical tables in one database. Therefore, index names created for those physical tables cannot repeat. As a result, ShardingSphere will revise the index name as the logic index name + its physical table name. In ShardingSphere, schema management method is similar to that of the table. It uses logic schema to manage a set of data sources, so it requires to replace the logic schema written by users in SQL with physical database schema.
ShardingSphere only supports to use schema in database management statements but not in DQL and DML statements, for example:
SHOW COLUMNS FROM t_order FROM order_ds;
Schema rewrite refers to rewriting logic schema as a right and real schema found arbitrarily with unicast route.
Column derivation in query statements usually results from two situations. First, ShardingSphere needs to acquire the corresponding data when merging results, but it is not returned through the query SQL. This kind of situation aims mainly at GROUP BY and ORDER BY. Result merger requires sorting and ranking according to items of
GROUP BY and
ORDER BYfield. But if sorting and ranking items are not included in the original SQL, it should be rewritten. Look at the situation where the original SQL has the information required by result merger:
SELECT order_id, user_id FROM t_order ORDER BY user_id;
Since user_id is used in ranking, the result merger needs the data able to acquire user_id. The SQL above is able to acquire user_id data, so there is no need to add columns.
If the selected item does not contain the column required by result merger, it will need to add column, as the following SQL:
SELECT order_id FROM t_order ORDER BY user_id;
Since the original SQL does not contain user_id needed by result merger, the SQL needs to be rewritten by adding columns, and after that, it will be:
SELECT order_id, user_id AS ORDER_BY_DERIVED_0 FROM t_order ORDER BY user_id;
What’s to be mentioned, column derivation will only add the missing column rather than all of them; the SQL that includes
* in SELECT will also selectively add columns according to the meta-data information of tables. Here is a relatively complex SQL column derivation case:
SELECT o.* FROM t_order o, t_order_item i WHERE o.order_id=i.order_id ORDER BY user_id, order_item_id;
Suppose only t_order_item table contains order_item_id column, according to the meta-data information of tables, the user_id in sorting item exists in table t_order as merging result, but order_item_id does not exist in t_order, so it needs to add columns. The SQL after that will be:
SELECT o.*, order_item_id AS ORDER_BY_DERIVED_0 FROM t_order o, t_order_item i WHERE o.order_id=i.order_id ORDER BY user_id, order_item_id;
Another situation of column derivation is using AVG aggregation function. In distributed situations, it is not right to calculate the average value with avg1 + avg2 + avg3 / 3, and it should be rewritten as (sum1 + sum2 + sum3) / (count1 + count2 + count3). This requires to rewrite the SQL that contains AVG as SUM and COUNT and recalculate the average value in result merger. Such as the following SQL:
SELECT AVG(price) FROM t_order WHERE user_id=1;
Should be rewritten as:
SELECT COUNT(price) AS AVG_DERIVED_COUNT_0, SUM(price) AS AVG_DERIVED_ SUM _0 FROM t_order WHERE user_id=1;
Then it can calculate the right average value through result merger.
The last kind of column derivation is in SQL with INSERT. With database auto-increment key, there is no need to fill in primary key field. But database auto-increment key cannot satisfy the requirement of only one primary key being in the distributed situation. So ShardingSphere provides a distributed auto-increment key generation strategy, enabling users to replace the current auto-increment key invisibly with a distributed one without changing existing codes through column derivation. Distributed auto-increment key generation strategy will be expounded in the following part, here we only explain the content related to SQL rewrite. For example, if the primary key of t_order is order_id, and the original SQL is:
INSERT INTO t_order (`field1`, `field2`) VALUES (10, 1);
It can be seen that the SQL above does not include an auto-increment key, which will be filled by the database itself. After ShardingSphere set an auto-increment key, the SQL will be rewritten as:
INSERT INTO t_order (`field1`, `field2`, order_id) VALUES (10, 1, xxxxx);
Rewritten SQL will add auto-increment key name and its value generated automatically in the last part of INSERT FIELD and INSERT VALUE.
xxxxx in the SQL above stands for the latter one.
If INSERT SQL does not contain the column name of the table, ShardingSphere can also automatically generate auto-increment key by comparing the number of parameter and column in the table meta-information. For example, the original SQL is:
INSERT INTO t_order VALUES (10, 1);
The rewritten SQL only needs to add an auto-increment key in the column where the primary key is:
INSERT INTO t_order VALUES (xxxxx, 10, 1);
When auto-increment key derives column, if the user writes SQL with placeholder, he only needs to rewrite parameter list but not SQL itself.
The scenarios of acquiring pagination data from multiple databases is different from that of one single database. If every 10 pieces of data are taken as one page, the user wants to take the second page of data. It is not right to take, acquire LIMIT 10, 10 under sharding situations, and take out the first 10 pieces of data according to sorting conditions after merging. For example, if the SQL is:
SELECT score FROM t_score ORDER BY score DESC LIMIT 1, 2;
The following picture shows the pagination execution results without SQL rewrite.
As shown in the picture, if you want to acquire the second and the third piece of data ordered by score common in both tables, and they are supposed to be
Since the executed SQL can only acquire the second and the third piece of data from each table, i.e.,
80 from t_score_0,
75 from t_score_1.
When merging results, it can only merge from
75 already acquired, so the right result cannot be acquired anyway.
The right way is to rewrite pagination conditions as
LIMIT 0, 3, take out all the data from the first two pages and combine sorting conditions to calculate the right data.
The following picture shows the execution of pagination results after SQL rewrite.
The latter the offset position is, the lower the efficiency of using LIMIT pagination will be. There are many ways to avoid using LIMIT as pagination method, such as constructing a secondary index to record line record number and line offset amount, or using the tail ID of last pagination data as the pagination method of conditions of the next query.
When revising pagination information, if the user uses placeholder method to write SQL, he only needs to rewrite parameter list rather than SQL itself.
When using batch inserted SQL, if the inserted data crosses sharding, the user needs to rewrite SQL to avoid writing excessive data into the database. The differences between insert operation and query operation are: though the query sentence has used sharding keys that do not exist in current sharding, they will not have any influence on data, but insert operation has to delete extra sharding keys. Take the following SQL for example:
INSERT INTO t_order (order_id, xxx) VALUES (1, 'xxx'), (2, 'xxx'), (3, 'xxx');
If the database is still divided into two parts according to odd and even number of order_id, this SQL will be executed after its table name is revised. Then, both shards will be written with the same record. Though only the data that satisfies sharding conditions can be taken out from query statement, it is not reasonable for the schema to have excessive data. So the SQL should be rewritten as:
INSERT INTO t_order_0 (order_id, xxx) VALUES (2, 'xxx'); INSERT INTO t_order_1 (order_id, xxx) VALUES (1, 'xxx'), (3, 'xxx');
IN query is similar to batch insertion, but IN operation will not lead to wrong data query result. Through rewriting IN query, the query performance can be further improved. Like the following SQL:
SELECT * FROM t_order WHERE order_id IN (1, 2, 3);
Is rewritten as:
SELECT * FROM t_order_0 WHERE order_id IN (2); SELECT * FROM t_order_1 WHERE order_id IN (1, 3);
The query performance will be further improved. For now, ShardingSphere has not realized this rewrite strategy, so the current rewrite result is:
SELECT * FROM t_order_0 WHERE order_id IN (1, 2, 3); SELECT * FROM t_order_1 WHERE order_id IN (1, 2, 3);
Though the execution result of SQL is right, but it has not achieved the most optimized query efficiency.
Its purpose is to effectively improve the performance without influencing the correctness of the query. It can be divided into single node optimization and stream merger optimization.
It refers to the optimization that stops the SQL rewrite from the route to the single node. After acquiring one route result, if it is routed to a single data node, result merging is unnecessary to be involved, so there is no need for rewrites as derived column, pagination information and others. In particular, there is no need to read from the first piece of information, which reduces the pressure for the database to a large extent and saves meaningless consumption of the network bandwidth.
It only adds sorting items and sorting orders identical with grouping items and
ORDER BY to
GROUP BY SQL, and they are used to transfer memory merger to stream merger. In the result merger part, stream merger and memory merger will be explained in detail.
The overall structure of rewrite engine is shown in the following picture.