Cost-based Optimization

The optimizer decides how to execute queries using a methodology called cost-based optimization. A simplification of this process is as follows:

  1. Assign a cost to each operation.
  2. Evaluate how many operations each possible plan would take.
  3. Sum up the total.
  4. Choose the plan with the lowest overall cost.
[Figure: cost-based optimization]
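
To see the total cost the optimizer computed for its chosen plan, one option is to check the Last_query_cost session status variable after running a statement (it is only populated for simple, single-SELECT queries):

SELECT * FROM Country WHERE continent='Asia' AND population > 5000000;
SHOW SESSION STATUS LIKE 'Last_query_cost';  # cost of the plan chosen above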

I said that the above is a simplification because the optimizer does not exhaustively search every possible execution plan. If there are 5 tables to be joined, each with 5 possible indexes, there could be over 5! * 5! = 14,400 ways that the query could be executed:

  • Each index may have more than one potential access method (e.g., index scan, range scan, or index lookup). Additionally, each table could potentially use a table scan.
  • For an INNER JOIN query, the tables could be joined in any order (the order the tables are specified does not matter).
  • When joining there may be multiple join buffering methods or subquery strategies available.

It is not feasible for the optimizer to evaluate every potential execution plan; in the worst case, optimization could take longer than execution itself. To guard against this, the optimizer will by default skip evaluating certain plans [1]. The configuration option optimizer_search_depth also exists to limit the search depth of query plans, but it is not enabled by default [2].
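
Both settings can be inspected and adjusted at the session level; the values below are illustrative, not recommendations:

SELECT @@optimizer_prune_level, @@optimizer_search_depth;
SET SESSION optimizer_prune_level = 0;   # evaluate all plans (can be slow)
SET SESSION optimizer_search_depth = 4;  # or cap the search depth instead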

Modifying the cost constants

The cost assigned to each operation is configurable via the server_cost and engine_cost tables that reside in the mysql system database. Here are the default values used in MySQL 8.0:

Cost  Operation
40    disk_temptable_create_cost
1     disk_temptable_row_cost
2     memory_temptable_create_cost
0.2   memory_temptable_row_cost
0.1   key_compare_cost
0.2   row_evaluate_cost
1     io_block_read_cost
1     memory_block_read_cost
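
The live values can be inspected directly; a NULL cost_value means the compiled-in default applies:

SELECT cost_name, cost_value FROM mysql.server_cost;
SELECT cost_name, cost_value FROM mysql.engine_cost;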

Tip

MySQL 8.0 introduces a new feature where the cost model adapts to the percentage of the index that is in memory. The cost model in prior versions of MySQL always assumed that IO was required to access a page.

Cost itself is a logical unit that represents resource usage. A single unit no longer has an exact meaning, but its origin can be traced back to one random IO on a 1990s hard drive.

As hardware improves, it may not do so at a consistent rate for all components (e.g., latency to storage has improved substantially with SSDs). Similarly, as software adapts to changes in hardware (with features such as compression), resource consumption can also change. Having configurable cost constants allows for refinement to handle these cases.

Example 4 shows that increasing row_evaluate_cost to five times its default makes table scans considerably more expensive relative to eliminating work via an index. As a result, the optimizer chooses the p (Population) index that was created in Example 2.

Example 4: Increasing the cost of row evaluation makes table scans more expensive

# Increase the cost from 0.2 to 1.0
UPDATE mysql.server_cost SET cost_value=1 WHERE cost_name='row_evaluate_cost';
FLUSH OPTIMIZER_COSTS;

# in a new session
EXPLAIN FORMAT=JSON
SELECT * FROM Country WHERE continent='Asia' and population > 5000000;
{
  "query_block": {
    "select_id": 1,
    "cost_info": {            # Since row evaluate is 5x higher,
      "query_cost": "325.01"  # the total cost for the query
    },                        # has gone up!
    "table": {
      "table_name": "Country",
      "access_type": "range",  # But it will execute as a range
      "possible_keys": [
        "p"
      ],
      "key": "p",
      "used_key_parts": [
        "Population"
      ],
      "key_length": "4",
      "rows_examined_per_scan": 108,
      "rows_produced_per_join": 15,
      "filtered": "14.29",
      "index_condition": "(`world`.`Country`.`Population` > 5000000)",
      "cost_info": {
        "read_cost": "309.58",
        "eval_cost": "15.43",
        "prefix_cost": "325.01",
        "data_read_per_join": "3K"
      },
      "used_columns": [
        "Code",
        "Name",
        "Continent",
        "Region",
        "SurfaceArea",
        "IndepYear",
        "Population",
        "LifeExpectancy",
        "GNP",
        "GNPOld",
        "LocalName",
        "GovernmentForm",
        "HeadOfState",
        "Capital",
        "Code2"
      ],
      "attached_condition": "(`world`.`Country`.`Continent` = 'Asia')"
    }
  }
}

Warning

Be careful when modifying cost constants, as a number of your query plans could change for the worse! This is shown here for demonstration purposes; in most production situations you will be better served by adding a query hint.
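
For example, an index hint such as FORCE INDEX constrains the plan for a single query with no global side effects (a sketch using the p index from Example 2):

SELECT * FROM Country FORCE INDEX (p)
WHERE continent='Asia' AND population > 5000000;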

Warning

Be careful to reset costs before proceeding onto the other examples:

UPDATE mysql.server_cost SET cost_value=NULL WHERE cost_name='row_evaluate_cost';
FLUSH OPTIMIZER_COSTS;
# close session

Metadata and Statistics

As demonstrated in Example 3, the distribution of data can affect the cost of execution plans. The optimizer makes use of both the data dictionary and statistics as part of its decision process.

Metadata

  • Index Information: The dictionary provides a list of indexes per table.
  • Uniqueness: If an index is unique, it can be used as part of a permanent transformation, shortcutting some parts of planning.
  • Nullability: The optimizer needs to handle potential NULL values correctly; the nullability of a column can affect the use of some execution plans.
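
As a quick illustration, this metadata is visible per table via SHOW INDEX (the Non_unique and Null columns expose uniqueness and nullability):

SHOW INDEX FROM Country;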

Statistics

Table Size
  Description: Provides an estimate of the total table size.
  Used for: All columns
  Calculated: In advance
  Automatically updated: During regular operation [3]
  Manually updated: ANALYZE TABLE
  Configuration options: N/A
  Accuracy: Least accurate
  Commonly used for: Determining table scan cost. May also be used for join order (largest table first) when indexes are lacking.

Cardinality
  Description: Samples a small number of random pages (default 20) and extrapolates to determine the number of unique values in the indexed column.
  Used for: Indexed columns
  Calculated: In advance
  Automatically updated: After 10% of the table has changed
  Manually updated: ANALYZE TABLE
  Configuration options: Number of pages to sample [4]
  Accuracy: Can be subject to distribution skew
  Commonly used for: Determining join order. Also used when range estimates exceed the maximum number of index dives.

Range Estimates
  Description: The optimizer provides InnoDB with a minimum and a maximum value and receives back an estimate of the number of rows in the range.
  Used for: Indexed columns
  Calculated: On demand
  Automatically updated: N/A
  Manually updated: N/A
  Configuration options: Maximum number of index dives [5]; maximum memory usage [6]
  Accuracy: Most accurate
  Commonly used for: Evaluating predicates (i.e., looking at the possible indexes that could be used and estimating how many rows would match). A range estimate was used to determine that population > 5000000 should not use an index.
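
To refresh the precalculated statistics and inspect the resulting cardinality estimates, a sketch against the world sample database:

ANALYZE TABLE Country;
SELECT INDEX_NAME, COLUMN_NAME, CARDINALITY
  FROM information_schema.STATISTICS
 WHERE TABLE_SCHEMA='world' AND TABLE_NAME='Country';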

Tip

Because of statistics, seemingly identical queries in QA and production may execute very differently. Even a query plan in production can change over time as data distribution does.

[1]The default optimizer_prune_level=1. To disable this heuristic and evaluate all plans, set it to zero. http://dev.mysql.com/doc/refman/5.7/en/controlling-query-plan-evaluation.html
[2]The default optimizer_search_depth=64. Lower values may reduce plan evaluation time, at the cost of potentially less optimal plans.
[3]Statistics are updated during regular operation, but not in a manner that is guaranteed to be accurate.
[4]The number of pages sampled can be changed with innodb_stats_persistent_sample_pages. A higher value may produce more accurate estimates (at some increase in generation cost).
[5]The optimizer will switch to cardinality estimates when IN lists exceed eq_range_index_dive_limit items (default: 200).
[6]The range optimizer is limited to range_optimizer_max_mem_size (default: 8M). Queries that use multiple IN lists could exceed this, as the combination of options may be expanded internally.