字符串搜索算法
目录
字符串搜索是指在一个字符串中查找另一个字符串(子串)是常用的基本编程操作。协议解析、文本编辑都会用到大量字符串搜索操作。从1977年至今,人们对此进行了大量研究,本文介绍本领域典型算法:暴力搜索、two-way、KMP、BM、BMH、BMS,以及x86 CPU SSE2 硬件加速实现。
1 问题定义
字符串搜索问题,又称为文本匹配问题。其定义如下:
- 定义长度为\(n\)的字符数组\(text\);
- 定义长度为\(m\)的字符数组\(pattern\),且\(m\le{}n\);
- 定义有限字符集\(\Sigma\),\(text\)和\(pattern\)由\(\Sigma\)元素组成;
- 在\(text\)中查找\(pattern\)出现的位置。
2 测试结果
测试Linux系统glibc strstr,native, KMP, BM, BMH算法。使用大量随机字符串测试,其性能排名是:strstr、BMH、BM、native、KMP。KMP的实现应该是有问题,后期分析更正。
3 glibc C库字符串算法
3.1 概要介绍
GLIBC库提供two-way搜索算法,以及针对各平台的硬件优化实现(X86平台是SSE2)。TWO WAY 算法由Maxime Crochemore和Dominique Perrin发表于1991年ACM。主要包括一个关键分解理论(critical factorization theorem pdf)。
代码实现参考Github string/strstr.c、x86_64/multiarch strstr.c和 strstr-sse2-unaligned.S。
3.2 参考资料
- TWO WAY ALGORITHM
- http://www-igm.univ-mlv.fr/~lecroq/string/node26.html
- TWO WAY STRING MATCHING (Maxime Crochemore, Dominique Perrin)
- https://dl.acm.org/citation.cfm?id=116845
4 Boyer-Moore
[BM 77]
4.1 概要介绍
BM算法由Robert S. Boyer和J Strother Moore于1977年发表。BM算法使用坏字符和好后缀规则,在失配后一次跳动多个字符,进行匹配加速。
4.2 参考资料
4.3 代码实现
2012/2/29实现此算法,发布于GitHub:
/* * [BM 77] A fast string search algorithm, * R. S. Boyer and J. S. Moore, * Comm. ACM, 20, 1977, pp. 762–772. * [Smit 1982] A Comparison of Three String Matching Algorithms, * Smit G. De V., * Software-Practice and Experience, 12(1), 57-66, 1982 */ #include <config-os.h> #include <ycc/common/debug.h> #include <ycc/algos/string.h> void strbm_init(const char *needle, size_t n, size_t *table_sgs, size_t *table_ebc) { /* * table_sgs: strong good suffix * ebc: extended bad character */ size_t i, j, k; size_t *backup = table_sgs + n; /* init */ memset(table_sgs, 0, n*sizeof(*table_sgs)); /* suffix */ i = n - 1; j = n; while ((size_t)-1 != i) { backup[i] = j; while (j < n && needle[i] != needle[j]) { if (!table_sgs[j]) table_sgs[j] = j - i; j = backup[j]; } --i; --j; } /* prefix */ k = j + 1; for (i = 0; i < k; ++i) { if (!table_sgs[i]) table_sgs[i] = k; } i = backup[j]; while (j < n) { while (j < i) { if (!table_sgs[j]) table_sgs[j] = i; ++j; } i = backup[i]; } /* ebc ... */ strbmh_init(needle, n, table_ebc); } char *strbm_find(const char *haystack, const char *needle, size_t h, size_t n, const size_t *table_sgs, const size_t *table_ebc) { size_t i, j, n1 = n - 1; if (!n) return (char*)haystack; while (h > n1) { for (i = n1; i != (size_t)-1 && haystack[i] == needle[i]; --i); if (i == (size_t)-1) /* [0, n1] are equal. */ return (char*)haystack; if (table_sgs[i] > table_ebc[(u_char)haystack[n1]]) j = table_sgs[i]; else j = table_ebc[(u_char)haystack[n1]]; DBG_PR("jump offset = %zu, h,n = %zu, %zu", j, h, n); h -= j; haystack += j; } return NULL; }
5 Boyer-Moore-Horspool
[Hor 80]
5.1 概要介绍
BMH算法由Nigel Horspool发表于1980年。在随机测试结果中,比BM有更好的表现,性能提升10%左右。BMH是BM算法的简化。
5.2 参考资料
5.3 代码实现
2012/2/29发布于GitHub:
/* * [Hor 80] Practical Fast Searching in Strings, * R. Nigel Horspool, * Softw. Pratt. Exp., 10, 501–506 (1980). */ #include <assert.h> #include <limits.h> #include <stddef.h> #include <sys/types.h> #include <config-os.h> #include <ycc/algos/string.h> void strbmh_init(const char *needle, size_t n, size_t *table) { size_t i; for (i = 0; i <= UCHAR_MAX; ++i) table[i] = n; --needle; for (i = 1; i < n; ++i) table[(u_char)needle[i]] = n - i; } char *strbmh_find(const char *haystack, const char *needle, size_t h, size_t n, const size_t *table) { size_t i, n1 = n - 1; if (!n) return (char*)haystack; while (h > n1) { for (i = n1; i != (size_t)-1 && haystack[i] == needle[i]; --i); if (i == (size_t)-1) /* [0, n1] are equal. */ return (char*)haystack; h -= table[(u_char)haystack[n1]]; haystack += table[(u_char)haystack[n1]]; } return NULL; }
6 Boyer-Moore-Sunday
[Sun 90]
7 Knuth-Morris-Pratt
[KMP 77]
7.1 概要介绍
7.3 代码实现
此算法测试结果十分糟糕,怀疑实现有问题。以后更正分析。 2012/2/29发布于GitHub:
/* * [KMP 77] Fast pattern matching in strings, * D. E. Knuth, J. H. Morris, Jr and V. B. Pratt, SIAM J. * Computing, 6, 1977, pp. 323–350. */ #include <assert.h> #include <stddef.h> #include <config-os.h> #include <ycc/algos/string.h> void strkmp_init(const char *patn, size_t *table) { size_t i, *t = table; const char *s = patn + 1; assert(patn && table); *t++ = 0; for(i = 0; *s; ++s) { while (i && patn[i] != *s) i = table[i-1]; if (patn[i] == *s) ++i; *t++ = i; } } char *strkmp_find(const char *text, const char *patn, const size_t *table) { size_t i; const char *s; assert(text && patn && table); if (!*patn) return (char*)text; for (i = 0, s = text; *s; ++s) { while (i && *s != patn[i]) i = table[i-1]; if (*s == patn[i]) { if (!patn[i+1]) return (char*)s - i; ++i; } } return NULL; }
8 References
- why grep so fast
- https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
- grep code
- https://savannah.gnu.org/git/?group=grep
- grep git url
git clone https://git.savannah.gnu.org/git/grep.git
- (no term)
- http://www.ruanyifeng.com/blog/2013/05/boyer-moore_string_search_algorithm.html
- WIKI
- https://en.wikipedia.org/wiki/String_searching_algorithm
- Stackoverflow Discuss
- https://stackoverflow.com/questions/7586990/strstr-faster-than-algorithms
- EXACT STRING MATCHING ALGORITHMS
- http://www-igm.univ-mlv.fr/~lecroq/string/index.html