Character SetsΒΆ

The latest Unicode 9.0 is supported in MySQL 8.0 under the name of utf8mb4. utf8mb4 is variable length with individual characters consuming between one and four bytes. There are many naunces between bytes and characters and which is variable length:

  1. When creating a column as VARCHAR(n), n represents the character length. The byte storage requirement could be up to 4 times more (but is frequently less).
  2. Internally the InnoDB storage engine always [1] stores utf8mb4 as variable length (in indexes and table rows) for VARCHAR, CHAR and TEXT data types.
  3. In memory temporary tables that are used as part of materialization are fixed length. This may result in cases where temporary tables are either larger, or spill over to disk earlier when using the utf8mb4 character set.
  4. The buffers used to sort data are variable length (since MySQL 5.7).
  5. EXPLAIN will always show the maximum length of a variable length index (byte length). Frequently the storage requirements are much lower.

Example 37 shows EXPLAIN reporting the use of an index on a column of type CHAR(52) in latin1 character set. After converting the table to utf8mb4 the storage requirements of the table have not increased, but EXPLAIN is now showing an increase in key_length.

Example 37: EXPLAIN showing the maximum key length of an index (latin1 character set)

     EXPLAIN FORMAT=JSON
     SELECT * FROM Country WHERE name='Canada';
     {
       "query_block": {
         "select_id": 1,
         "cost_info": {
           "query_cost": "1.20"
         },
         "table": {
           "table_name": "Country",
           "access_type": "ref",
           "possible_keys": [
             "Name"
           ],
           "key": "Name",
           "used_key_parts": [
             "Name"
           ],
           "key_length": "52",  # CHAR(52)
           "ref": [
             "const"
           ],
           "rows_examined_per_scan": 1,
           "rows_produced_per_join": 1,
           "filtered": "100.00",
           "cost_info": {
             "read_cost": "1.00",
             "eval_cost": "0.20",
             "prefix_cost": "1.20",
             "data_read_per_join": "264"
           },
           "used_columns": [
             "Code",
             "Name",
             "Continent",
             "Region",
             "SurfaceArea",
             "IndepYear",
             "Population",
             "LifeExpectancy",
             "GNP",
             "GNPOld",
             "LocalName",
             "GovernmentForm",
             "HeadOfState",
             "Capital",
             "Code2"
           ]
         }
       }
     }

Example 38: EXPLAIN showing the maximum key length of an index (utf8mb4 character set)

     ALTER TABLE Country CONVERT TO CHARACTER SET utf8mb4;
     EXPLAIN FORMAT=JSON
     SELECT * FROM Country WHERE name='Canada';
     {
       "query_block": {
         "select_id": 1,
         "cost_info": {
           "query_cost": "1.20"
         },
         "table": {
           "table_name": "Country",
           "access_type": "ref",
           "possible_keys": [
             "Name"
           ],
           "key": "Name",
           "used_key_parts": [
             "Name"
           ],
           "key_length": "208", # CHAR(52) * 4 = 208
           "ref": [
             "const"
           ],
           "rows_examined_per_scan": 1,
           "rows_produced_per_join": 1,
           "filtered": "100.00",
           "cost_info": {
             "read_cost": "1.00",
             "eval_cost": "0.20",
             "prefix_cost": "1.20",
             "data_read_per_join": "968"
           },
           "used_columns": [
             "Code",
             "Name",
             "Continent",
             "Region",
             "SurfaceArea",
             "IndepYear",
             "Population",
             "LifeExpectancy",
             "GNP",
             "GNPOld",
             "LocalName",
             "GovernmentForm",
             "HeadOfState",
             "Capital",
             "Code2"
           ]
         }
       }
     }
[1]Always when using the DYNAMIC, COMPACT and COMPRESSED row formats. There is typically no practical use to use the earlier REDUNDANT row format.