Tables are a key concept for transparent data sharding. Apache ShardingSphere adapts to the data sharding requirements under different scenarios by providing diverse table types.
The logical name of the horizontally sharded database (table) of the same structure is the logical identifier of the table in SQL. Example: Order data is split into 10 tables according to the primary key endings, are t_order_0
to t_order_9
, and their logical table names are t_order
.
Physical tables that exist in the horizontally sharded databases. Those are, t_order_0
to t_order_9
in the previous example.
Refers to a set of sharded tables with consistent sharding rules. When using binding tables for multi-table associated query, a sharding key must be used for the association, otherwise, Cartesian product association or cross-library association will occur, affecting query efficiency.
For example, if the t_order
table and t_order_item
table are both sharded according to order_id
and are correlated using order_id
, the two tables are binding tables. The multi-table associated queries between binding tables will not have a Cartesian product association, so the associated queries will be much more effective. Here is an example,
If SQL is:
SELECT i.* FROM t_order o JOIN t_order_item i ON o.order_id=i.order_id WHERE o.order_id in (10, 11);
In the case where no binding table relationships are being set, assume that the sharding key order_id
routes the value 10 to slice 0 and the value 11 to slice 1, then the routed SQL should be 4 items, which are presented as a Cartesian product:
SELECT i.* FROM t_order_0 o JOIN t_order_item_0 i ON o.order_id=i.order_id WHERE o.order_id in (10, 11);
SELECT i.* FROM t_order_0 o JOIN t_order_item_1 i ON o.order_id=i.order_id WHERE o.order_id in (10, 11);
SELECT i.* FROM t_order_1 o JOIN t_order_item_0 i ON o.order_id=i.order_id WHERE o.order_id in (10, 11);
SELECT i.* FROM t_order_1 o JOIN t_order_item_1 i ON o.order_id=i.order_id WHERE o.order_id in (10, 11);
After the relationships between binding tables are configured and associated with order_id, the routed SQL should then be 2 items:
SELECT i.* FROM t_order_0 o JOIN t_order_item_0 i ON o.order_id=i.order_id WHERE o.order_id in (10, 11);
SELECT i.* FROM t_order_1 o JOIN t_order_item_1 i ON o.order_id=i.order_id WHERE o.order_id in (10, 11);
The t_order table will be used by ShardingSphere as the master table for the entire binding table since it specifies the sharding condition. All routing calculations will use only the policy of the primary table, then the sharding calculations for the t_order_item
table will use the t_order
condition.
Note: multiple sharding rules in the binding table need to be configured according to the combination of logical table prefix and sharding suffix, for example:
rules:
- !SHARDING
tables:
t_order:
actualDataNodes: ds_${0..1}.t_order_${0..1}
t_order_item:
actualDataNodes: ds_${0..1}.t_order_item_${0..1}
Refers to tables that exist in all data sources. The table structure and its data are identical in each database. Suitable for scenarios where the data volume is small and queries are required to be associated with tables of massive data, e.g., dictionary tables.
Refers to the only table that exists in all sharded data sources. Suitable for tables with a small amount of data and do not need to be sharded.
Note: Single tables that meet the following conditions will be automatically loaded:
For other single tables that do not meet the above conditions, ShardingSphere will not automatically load them, and users can configure single table rules as needed for management.
The smallest unit of the data shard, consists of the data source name and the real table. Example: ds_0.t_order_0
.
The mapping relationship between the logical table and the real table can be classified into two forms: uniform distribution and custom distribution.
refers to situations where the data table exhibits a uniform distribution within each data source. For example:
db0
├── t_order0
└── t_order1
db1
├── t_order0
└── t_order1
The configuration of data nodes:
db0.t_order0, db0.t_order1, db1.t_order0, db1.t_order1
Data table exhibiting a patterned distribution. For example:
db0
├── t_order0
└── t_order1
db1
├── t_order2
├── t_order3
└── t_order4
configuration of data nodes:
db0.t_order0, db0.t_order1, db1.t_order2, db1.t_order3, db1.t_order4
A database field is used to split a database (table) horizontally. Example: If the order primary key in the order table is sharded by modulo, the order primary key is a sharded field. If there is no sharded field in SQL, full routing will be executed, of which performance is poor. In addition to the support for single-sharding fields, Apache ShardingSphere also supports sharding based on multiple fields.
Algorithm for sharding data, supporting =
, >=
, <=
, >
, <
, BETWEEN
and IN
. The sharding algorithm can be implemented by the developers themselves or can use the Apache ShardingSphere built-in sharding algorithm, syntax sugar, which is very flexible.
Sharding algorithm—syntactic sugar is for conveniently hosting all data nodes without users having to concern themselves with the physical distribution of actual tables. Includes implementations of common sharding algorithms such as modulo, hash, range, and time.
Provides a portal for application developers to implement their sharding algorithms that are closely related to their business operations, while allowing users to manage the physical distribution of actual tables themselves. Customized sharding algorithms are further divided into:
=
, IN
, BETWEEN AND
, >
, <
, >=
, <=
.Consisting of a sharding key and sharding algorithm, which is abstracted independently due to the independence of the sharding algorithm. What is viable for sharding operations is the sharding key + sharding algorithm, known as sharding strategy.
For the scenario where the sharded field is not determined by SQL but by other external conditions, you can use SQL Hint to inject the shard value. Example: Conduct database sharding by employee login primary key, but there is no such field in the database. SQL Hint can be used both via Java API and SQL annotation. See Mandatory Sharding Routing for details.
Row expressions are designed to address the two main issues of configuration simplification and integration. In the cumbersome configuration rules of data sharding, the large number of repetitive configurations makes the configuration itself difficult to maintain as the number of data nodes increases. The data node configuration workload can be effectively simplified by row expressions.
For the common sharding algorithm, using Java code implementation does not help to manage the configuration uniformly. But by writing the sharding algorithm through line expressions, the rule configuration can be effectively stored together, which is easier to browse and store.
A Row Value Expressions consists of two parts as a string, the Type Name part of the corresponding SPI implementation at the beginning of the string and the expression part.
Take <GROOVY>t_order_${1..3}
as sample, the GROOVY
substring in the part of the <GROOVY>
string is the Type Name used by the corresponding SPI implementation for this Row Value Expressions, which is identified by the <>
symbol.
And the t_order_${1..3}
string is the expression part of this Row Value Expressions. When a Row Value Expressions does
not specify a Type Name, such as t_order_${1..3}
, the Row Value Expressions defaults to parse expressions by GROOVY
implementation for InlineExpressionParser
SPI.
The following sections describe the syntax rules for the GROOVY
implementation.
Row expressions are very intuitive, just use ${ expression }
or $->{ expression }
in the configuration to identify the row expressions. Data nodes and sharding algorithms are currently supported. The content of row expressions uses Groovy syntax, and all operations supported by Groovy are supported by row expressions. For example:
${begin..end}
denotes the range interval
${[unit1, unit2, unit_x]}
denotes the enumeration value
If there are multiple ${ expression }
or $->{ expression }
expressions in a row expression, the final result of the whole expression will be a Cartesian combination based on the result of each sub-expression.
e.g. The following row expression:
${['online', 'offline']}_table${1..3}
Finally, it can be parsed as this:
online_table1, online_table2, online_table3, offline_table1, offline_table2, offline_table3
In traditional database software development, automatic primary key generation is a basic requirement. Various databases provide support for this requirement, such as self-incrementing keys of MySQL, self-incrementing sequences of Oracle, etc. After data sharding, it is very tricky to generate global unique primary keys for different data nodes. Self-incrementing keys between different actual tables within the same logical table generate repetitive primary keys because they are not mutually aware. Although collisions can be avoided by constraining the initial value and step size of self-incrementing primary keys, additional operational and maintenance rules are necessary to be introduced, rendering the solution lacking in completeness and scalability.
Many third-party solutions can perfectly solve this problem, such as UUID, which relies on specific algorithms to self-generate non-repeating keys, or by introducing primary key generation services. To facilitate users and meet their demands for different scenarios, Apache ShardingSphere not only provides built-in distributed primary key generators, such as UUID and SNOWFLAKE but also abstracts the interface of distributed primary key generators to enable users to implement their own customized self-extending primary key generators.