Core Concept :: ShardingSphere

Core Concept

Table

Tables are a key concept for transparent data sharding. Apache ShardingSphere adapts to the data sharding requirements under different scenarios by providing diverse table types.

Logic Table

The logical name of the horizontally sharded database (table) of the same structure is the logical identifier of the table in SQL. Example: Order data is split into 10 tables according to the primary key endings, are t_order_0 to t_order_9, and their logical table names are t_order.

Actual Table

Physical tables that exist in the horizontally sharded databases. Those are, t_order_0 to t_order_9 in the previous example.

Binding Table

Refers to a set of sharded tables with consistent sharding rules. When using binding tables for multi-table associated query, a sharding key must be used for the association, otherwise, Cartesian product association or cross-library association will occur, affecting query efficiency.

For example, if the t_order table and t_order_item table are both sharded according to order_id and are correlated using order_id, the two tables are binding tables. The multi-table associated queries between binding tables will not have a Cartesian product association, so the associated queries will be much more effective. Here is an example,

If SQL is:

SELECT i.* FROM t_order o JOIN t_order_item i ON o.order_id=i.order_id WHERE o.order_id in (10, 11);

In the case where no binding table relationships are being set, assume that the sharding key order_id routes the value 10 to slice 0 and the value 11 to slice 1, then the routed SQL should be 4 items, which are presented as a Cartesian product:

SELECT i.* FROM t_order_0 o JOIN t_order_item_0 i ON o.order_id=i.order_id WHERE o.order_id in (10, 11);

SELECT i.* FROM t_order_0 o JOIN t_order_item_1 i ON o.order_id=i.order_id WHERE o.order_id in (10, 11);

SELECT i.* FROM t_order_1 o JOIN t_order_item_0 i ON o.order_id=i.order_id WHERE o.order_id in (10, 11);

SELECT i.* FROM t_order_1 o JOIN t_order_item_1 i ON o.order_id=i.order_id WHERE o.order_id in (10, 11);

After the relationships between binding tables are configured and associated with order_id, the routed SQL should then be 2 items:

SELECT i.* FROM t_order_0 o JOIN t_order_item_0 i ON o.order_id=i.order_id WHERE o.order_id in (10, 11);

SELECT i.* FROM t_order_1 o JOIN t_order_item_1 i ON o.order_id=i.order_id WHERE o.order_id in (10, 11);

The t_order table will be used by ShardingSphere as the master table for the entire binding table since it specifies the sharding condition. All routing calculations will use only the policy of the primary table, then the sharding calculations for the t_order_item table will use the t_order condition.

Note: multiple sharding rules in the binding table need to be configured according to the combination of logical table prefix and sharding suffix, for example:

rules:
- !SHARDING
  tables:
    t_order:
      actualDataNodes: ds_${0..1}.t_order_${0..1}
    t_order_item:
      actualDataNodes: ds_${0..1}.t_order_item_${0..1}
  bindingTables:
    - t_order, t_order_item

Naming reminder: ShardingSphere derives binding information by stripping numeric suffixes from actual table names. Keep the digits at the end exclusively for sharding suffixes such as _0, _1, etc.

Broadcast data frame

Refers to tables that exist in all data sources. The table structure and its data are identical in each database. Suitable for scenarios where the data volume is small and queries are required to be associated with tables of massive data, e.g., dictionary tables.

Single Table

Refers to the only table that exists in all sharded data sources. Suitable for tables with a small amount of data and do not need to be sharded.

Note: Single tables that meet the following conditions will be automatically loaded:

A single table showing the configuration in rules such as encrypt and mask
A single table created by users executing DDL statements through ShardingSphere

For other single tables that do not meet the above conditions, ShardingSphere will not automatically load them, and users can configure single table rules as needed for management.

Data Nodes

The smallest unit of the data shard, consists of the data source name and the real table. Example: ds_0.t_order_0.

The mapping relationship between the logical table and the real table can be classified into two forms: uniform distribution and custom distribution.

Uniform Distribution

refers to situations where the data table exhibits a uniform distribution within each data source. For example:

db0
  ├── t_order0
  └── t_order1
db1
  ├── t_order0
  └── t_order1

The configuration of data nodes:

db0.t_order0, db0.t_order1, db1.t_order0, db1.t_order1

Customized Distribution

Data table exhibiting a patterned distribution. For example:

db0
  ├── t_order0
  └── t_order1
db1
  ├── t_order2
  ├── t_order3
  └── t_order4

configuration of data nodes:

db0.t_order0, db0.t_order1, db1.t_order2, db1.t_order3, db1.t_order4

Sharding

Sharding key

A database field is used to split a database (table) horizontally. Example: If the order primary key in the order table is sharded by modulo, the order primary key is a sharded field. If there is no sharded field in SQL, full routing will be executed, of which performance is poor. In addition to the support for single-sharding fields, Apache ShardingSphere also supports sharding based on multiple fields.

Sharding Algorithm

Algorithm for sharding data, supporting =, >=, <=, >, <, BETWEEN and IN. The sharding algorithm can be implemented by the developers themselves or can use the Apache ShardingSphere built-in sharding algorithm, syntax sugar, which is very flexible.

Automatic Sharding Algorithm

Sharding algorithm—syntactic sugar is for conveniently hosting all data nodes without users having to concern themselves with the physical distribution of actual tables. Includes implementations of common sharding algorithms such as modulo, hash, range, and time.

Customized Sharding Algorithm

Provides a portal for application developers to implement their sharding algorithms that are closely related to their business operations, while allowing users to manage the physical distribution of actual tables themselves. Customized sharding algorithms are further divided into:

Standard Sharding Algorithm Used to deal with scenarios where sharding is performed using a single key as the sharding key =, IN, BETWEEN AND, >, <, >=, <=.
Composite Sharding Algorithm Used to cope with scenarios where multiple keys are used as sharding keys. The logic containing multiple sharding keys is very complicated and requires the application developers to handle it on their own.
Hint Sharding Algorithm For scenarios involving Hint sharding.

Sharding Strategy

Consisting of a sharding key and sharding algorithm, which is abstracted independently due to the independence of the sharding algorithm. What is viable for sharding operations is the sharding key + sharding algorithm, known as sharding strategy.

Mandatory Sharding routing

For the scenario where the sharded field is not determined by SQL but by other external conditions, you can use SQL Hint to inject the shard value. Example: Conduct database sharding by employee login primary key, but there is no such field in the database. SQL Hint can be used both via Java API and SQL annotation. See Mandatory Sharding Routing for details.

Row Value Expressions

Row expressions are designed to address the two main issues of configuration simplification and integration. In the cumbersome configuration rules of data sharding, the large number of repetitive configurations makes the configuration itself difficult to maintain as the number of data nodes increases. The data node configuration workload can be effectively simplified by row expressions.

For the common sharding algorithm, using Java code implementation does not help to manage the configuration uniformly. But by writing the sharding algorithm through line expressions, the rule configuration can be effectively stored together, which is easier to browse and store.

A Row Value Expressions consists of two parts as a string, the Type Name part of the corresponding SPI implementation at the beginning of the string and the expression part.

Take <GROOVY>t_order_${1..3} as sample, the GROOVY substring in the part of the <GROOVY> string is the Type Name used by the corresponding SPI implementation for this Row Value Expressions, which is identified by the <> symbol. And the t_order_${1..3} string is the expression part of this Row Value Expressions. When a Row Value Expressions does not specify a Type Name, such as t_order_${1..3}, the Row Value Expressions defaults to parse expressions by GROOVY implementation for InlineExpressionParser SPI.

The following sections describe the syntax rules for the GROOVY implementation.

Row expressions are very intuitive, just use ${ expression } or $->{ expression } in the configuration to identify the row expressions. Data nodes and sharding algorithms are currently supported. The content of row expressions uses Groovy syntax, and all operations supported by Groovy are supported by row expressions. For example:

${begin..end} denotes the range interval

${[unit1, unit2, unit_x]} denotes the enumeration value

If there are multiple ${ expression } or $->{ expression } expressions in a row expression, the final result of the whole expression will be a Cartesian combination based on the result of each sub-expression.

e.g. The following row expression:

${['online', 'offline']}_table${1..3}

Finally, it can be parsed as this:

online_table1, online_table2, online_table3, offline_table1, offline_table2, offline_table3

Distributed Primary Key

In traditional database software development, automatic primary key generation is a basic requirement. Various databases provide support for this requirement, such as self-incrementing keys of MySQL, self-incrementing sequences of Oracle, etc. After data sharding, it is very tricky to generate global unique primary keys for different data nodes. Self-incrementing keys between different actual tables within the same logical table generate repetitive primary keys because they are not mutually aware. Although collisions can be avoided by constraining the initial value and step size of self-incrementing primary keys, additional operational and maintenance rules are necessary to be introduced, rendering the solution lacking in completeness and scalability.

Many third-party solutions can perfectly solve this problem, such as UUID, which relies on specific algorithms to self-generate non-repeating keys, or by introducing primary key generation services. To facilitate users and meet their demands for different scenarios, Apache ShardingSphere not only provides built-in distributed primary key generators, such as UUID and SNOWFLAKE but also abstracts the interface of distributed primary key generators to enable users to implement their own customized self-extending primary key generators.