Performance Enhancement in Full-Text Search Query

Posted by Calvin Sun on Oracle Blogs See other posts from Oracle Blogs or by Calvin Sun
Published on Sat, 29 Sep 2012 16:00:00 +0000 Indexed on 2012/09/30 15:44 UTC
Read the original article Hit count: 561

Filed under:

/Performance

Ever since its first release, we are continuing consolidating and developing InnoDB Full-Text Search feature. There is one recent improvement that worth blogging about. It is an effort with MySQL Optimizer team that simplifies some common queries’ Query Plans and dramatically shorted the query time. I will describe the issue, our solution and the end result by some performance numbers to demonstrate our efforts in continuing enhancement the Full-Text Search capability.

The Issue:

As we had discussed in previous Blogs, InnoDB implements Full-Text index as reversed auxiliary tables. The query once parsed will be reinterpreted into several queries into related auxiliary tables and then results are merged and consolidated to come up with the final result. So at the end of the query, we’ll have all matching records on hand, sorted by their ranking or by their Doc IDs.

Unfortunately, MySQL’s optimizer and query processing had been initially designed for MyISAM Full-Text index, and sometimes did not fully utilize the complete result package from InnoDB.

Here are a couple examples:

Case 1: Query result ordered by Rank with only top N results:

mysql> SELECT FTS_DOC_ID, MATCH (title, body) AGAINST ('database')

AS SCORE FROM articles ORDER BY score DESC LIMIT 1;

In this query, user tries to retrieve a single record with highest ranking. It should have a quick answer once we have all the matching documents on hand, especially if there are ranked. However, before this change, MySQL would almost retrieve rankings for almost every row in the table, sort them and them come with the top rank result. This whole retrieve and sort is quite unnecessary given the InnoDB already have the answer.

In a real life case, user could have millions of rows, so in the old scheme, it would retrieve millions of rows' ranking and sort them, even if our FTS already found there are two 3 matched rows. Apparently, the million ranking retrieve is done in vain. In above case, it should just ask for 3 matched rows' ranking, all other rows' ranking are 0. If it want the top ranking, then it can just get the first record from our already sorted result.

Case 2: Select Count(*) on matching records:

mysql> SELECT COUNT(*) FROM articles WHERE MATCH (title,body)

AGAINST ('database' IN NATURAL LANGUAGE MODE);

In this case, InnoDB search can find matching rows quickly and will have all matching rows. However, before our change, in the old scheme, every row in the table was requested by MySQL one by one, just to check whether its ranking is larger than 0, and later comes up a count.

In fact, there is no need for MySQL to fetch all rows, instead InnoDB already had all the matching records. The only thing need is to call an InnoDB API to retrieve the count

The difference can be huge. Following query output shows how big the difference can be:

mysql> select count(*) from searchindex_inno where match(si_title, si_text) against ('people')

+----------+
| count(*) |
+----------+
| 666877 |
+----------+
1 row in set (16 min 17.37 sec)

So the query took almost 16 minutes.

Let’s see how long the InnoDB can come up the result. In InnoDB, you can obtain extra diagnostic printout by turning on “innodb_ft_enable_diag_print”, this will print out extra query info:

Error log:

keynr=2, 'people'
NL search
Total docs: 10954826 Total words: 0
UNION: Searching: 'people'
Processing time: 2 secs: row(s) 666877: error: 10
ft_init()
ft_init_ext()
keynr=2, 'people'
NL search
Total docs: 10954826 Total words: 0
UNION: Searching: 'people'
Processing time: 3 secs: row(s) 666877: error: 10

Output shows it only took InnoDB only 3 seconds to get the result, while the whole query took 16 minutes to finish. So large amount of time has been wasted on the un-needed row fetching.

The Solution:

The solution is obvious. MySQL can skip some of its steps, optimize its plan and obtain useful information directly from InnoDB. Some of savings from doing this include:

1) Avoid redundant sorting. Since InnoDB already sorted the result according to ranking. MySQL Query Processing layer does not need to sort to get top matching results.

2) Avoid row by row fetching to get the matching count. InnoDB provides all the matching records. All those not in the result list should all have ranking of 0, and no need to be retrieved. And InnoDB has a count of total matching records on hand. No need to recount.

3) Covered index scan. InnoDB results always contains the matching records' Document ID and their ranking. So if only the Document ID and ranking is needed, there is no need to go to user table to fetch the record itself.

4) Narrow the search result early, reduce the user table access. If the user wants to get top N matching records, we do not need to fetch all matching records from user table. We should be able to first select TOP N matching DOC IDs, and then only fetch corresponding records with these Doc IDs.

Performance Results and comparison with MyISAM

The result by this change is very obvious. I includes six testing result performed by Alexander Rubin just to demonstrate how fast the InnoDB query now becomes when comparing MyISAM Full-Text Search.

These tests are base on the English Wikipedia data of 5.4 Million rows and approximately 16G table. The test was performed on a machine with 1 CPU Dual Core, SSD drive, 8G of RAM and InnoDB_buffer_pool is set to 8 GB.

Table 1: SELECT with LIMIT CLAUSE

mysql> SELECT si_title, match(si_title, si_text) against('family') as rel FROM si WHERE match(si_title, si_text) against('family') ORDER BY rel desc LIMIT 10;

	InnoDB	MyISAM	Times Faster
Time for the query	*1.63 sec*	*3 min 26.31 sec*	127

You can see for this particular query (retrieve top 10 records), InnoDB Full-Text Search is now approximately 127 times faster than MyISAM.

Table 2: SELECT COUNT QUERY

mysql>select count(*) from si where match(si_title, si_text) against('family‘);

+----------+
| count(*) |
+----------+
| 293955 |
+----------+

	InnoDB	MyISAM	Times Faster
Time for the query	*1.35 sec*	*28 min 59.59 sec*	1289

In this particular case, where there are 293k matching results, InnoDB took only 1.35 second to get all of them, while take MyISAM almost half an hour, that is about 1289 times faster!.

Table 3: SELECT ID with ORDER BY and LIMIT CLAUSE for selected terms

mysql> SELECT <ID>, match(si_title, si_text) against(<TERM>) as rel FROM si_<TB> WHERE match(si_title, si_text) against (<TERM>) ORDER BY rel desc LIMIT 10;

Term	InnoDB (time to execute)	MyISAM(time to execute)	Times Faster
family	*0.5 sec*	*5.05 sec*	10.1
family film	0.95 sec	25.39 sec	26.7
Pizza restaurant orange county California	0.93 sec	32.03 sec	34.4
President united states of America	2.5 sec	36.98 sec	14.8

Table 4: SELECT title and text with ORDER BY and LIMIT CLAUSE for selected terms

mysql> SELECT <ID>, si_title, si_text, ... as rel FROM si_<TB> WHERE match(si_title, si_text) against (<TERM>) ORDER BY rel desc LIMIT 10;

Term	InnoDB (time to execute)	MyISAM(time to execute)	Times Faster
family	*0.61 sec*	*41.65 sec*	68.3
family film	1.15 sec	*47.17 sec*	41.0
Pizza restaurant orange county california	*1.03 sec*	*48.2 sec*	46.8
President united states of america	2.49 sec	*44.61 sec*	17.9

Table 5: SELECT ID with ORDER BY and LIMIT CLAUSE for selected terms

mysql> SELECT <ID>, match(si_title, si_text) against(<TERM>) as rel FROM si_<TB> WHERE match(si_title, si_text) against (<TERM>) ORDER BY rel desc LIMIT 10;

Term	InnoDB (time to execute)	MyISAM(time to execute)	Times Faster
family	*0.5 sec*	*5.05 sec*	10.1
family film	0.95 sec	25.39 sec	26.7
Pizza restaurant orange county califormia	0.93 sec	32.03 sec	34.4
President united states of america	2.5 sec	36.98 sec	14.8

Table 6: SELECT COUNT(*)

mysql> SELECT count(*) FROM si_<TB> WHERE match(si_title, si_text) against (<TERM>) LIMIT 10;

Term	InnoDB (time to execute)	MyISAM(time to execute)	Times Faster
family	*0.47 sec*	*82 sec*	174.5
family film	0.83 sec	*131 sec*	157.8
Pizza restaurant orange county califormia	*0.74 sec*	*106 sec*	143.2
President united states of america	*1.96 sec*	*220 sec*	112.2

Again, table 3 to table 6 all showing InnoDB consistently outperform MyISAM in these queries by a large margin. It becomes obvious the InnoDB has great advantage over MyISAM in handling large data search.

Summary:

These results demonstrate the great performance we could achieve by making MySQL optimizer and InnoDB Full-Text Search more tightly coupled. I think there are still many cases that InnoDB’s result info have not been fully taken advantage of, which means we still have great room to improve. And we will continuously explore the area, and get more dramatic results for InnoDB full-text searches.

Jimmy Yang, September 29, 2012

Developer IT