
本文通过阅读 Go 语言标准库 strings 包中与搜索相关的源码,聚焦于一个核心的内部结构 stringFinder。它的作用是高效地在源文本中查找字符串,其实现基于经典的 Boyer-Moore 字符串搜索算法。该算法由 Robert S. Boyer 和 J Strother Moore 于 1977 年提出,其设计哲学是通过 坏字符规则(bad character rule) 和 好后缀规则(good suffix rule) 在匹配失败时跳过大量不必要的字符比较,从而在实际应用中达到比暴力匹配(Brute-Force)高数十倍的性能。
下面是 stringFinder 及相关方法的源码:
package strings
// stringFinder efficiently finds strings in a source text. It's implemented
// using the Boyer-Moore string search algorithm:
// https://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm
// https://www.cs.utexas.edu/~moore/publications/fstrpos.pdf (note: this aged
// document uses 1-based indexing)
type stringFinder struct {
// pattern is the string that we are searching for in the text.
pattern string
// badCharSkip[b] contains the distance between the last byte of pattern
// and the rightmost occurrence of b in pattern. If b is not in pattern,
// badCharSkip[b] is len(pattern).
//
// Whenever a mismatch is found with byte b in the text, we can safely
// shift the matching frame at least badCharSkip[b] until the next time
// the matching char could be in alignment.
badCharSkip [256]int
// goodSuffixSkip[i] defines how far we can shift the matching frame given
// that the suffix pattern[i+1:] matches, but the byte pattern[i] does
// not. There are two cases to consider:
//
// 1. The matched suffix occurs elsewhere in pattern (with a different
// byte preceding it that we might possibly match). In this case, we can
// shift the matching frame to align with the next suffix chunk. For
// example, the pattern "mississi" has the suffix "issi" next occurring
// (in right-to-left order) at index 1, so goodSuffixSkip[3] ==
// shift+len(suffix) == 3+4 == 7.
//
// 2. If the matched suffix does not occur elsewhere in pattern, then the
// matching frame may share part of its prefix with the end of the
// matching suffix. In this case, goodSuffixSkip[i] will contain how far
// to shift the frame to align this portion of the prefix to the
// suffix. For example, in the pattern "abcxxxabc", when the first
// mismatch from the back is found to be in position 3, the matching
// suffix "xxabc" is not found elsewhere in the pattern. However, its
// rightmost "abc" (at position 6) is a prefix of the whole pattern, so
// goodSuffixSkip[3] == shift+len(suffix) == 6+5 == 11.
goodSuffixSkip []int
}
func makeStringFinder(pattern string) *stringFinder {
f := &stringFinder{
pattern: pattern,
goodSuffixSkip: make([]int, len(pattern)),
}
// last is the index of the last character in the pattern.
last := len(pattern) - 1
// Build bad character table.
// Bytes not in the pattern can skip one pattern's length.
for i := range f.badCharSkip {
f.badCharSkip[i] = len(pattern)
}
// The loop condition is < instead of <= so that the last byte does not
// have a zero distance to itself. Finding this byte out of place implies
// that it is not in the last position.
for i := 0; i < last; i++ {
f.badCharSkip[pattern[i]] = last - i
}
// Build good suffix table.
// First pass: set each value to the next index which starts a prefix of
// pattern.
lastPrefix := last
for i := last; i >= 0; i-- {
if HasPrefix(pattern, pattern[i+1:]) {
lastPrefix = i + 1
}
// lastPrefix is the shift, and (last-i) is len(suffix).
f.goodSuffixSkip[i] = lastPrefix + last - i
}
// Second pass: find repeats of pattern's suffix starting from the front.
for i := 0; i < last; i++ {
lenSuffix := longestCommonSuffix(pattern, pattern[1:i+1])
if pattern[i-lenSuffix] != pattern[last-lenSuffix] {
// (last-i) is the shift, and lenSuffix is len(suffix).
f.goodSuffixSkip[last-lenSuffix] = lenSuffix + last - i
}
}
return f
}
func longestCommonSuffix(a, b string) (i int) {
for ; i < len(a) && i < len(b); i++ {
if a[len(a)-1-i] != b[len(b)-1-i] {
break
}
}
return
}
// next returns the index in text of the first occurrence of the pattern. If
// the pattern is not found, it returns -1.
func (f *stringFinder) next(text string) int {
i := len(f.pattern) - 1
for i < len(text) {
// Compare backwards from the end until the first unmatching character.
j := len(f.pattern) - 1
for j >= 0 && text[i] == f.pattern[j] {
i--
j--
}
if j < 0 {
return i + 1 // match
}
i += max(f.badCharSkip[text[i]], f.goodSuffixSkip[j])
}
return -1
}
源码结构与核心方法解析
stringFinder 结构体是算法实现的载体,包含模式串 (pattern)、坏字符跳转表 (badCharSkip) 和好后缀跳转表 (goodSuffixSkip)。这些设计体现了典型的空间换时间的算法优化思想。
相关的三个核心方法均未导出(小写字母开头),这可能是为了防止外部错误调用或确保其专用于包内特定场景:
-
makeStringFinder(pattern string): 工厂函数,负责预处理模式串。它构建了两张关键的跳转表,这是 Boyer-Moore 算法高效的前提。badCharSkip 表基于坏字符规则,goodSuffixSkip 表基于好后缀规则,两者共同决定了匹配失败时的最大安全移动距离。
-
longestCommonSuffix(a, b string): 一个辅助函数,用于计算两个字符串的最长公共后缀长度。实现方式是从两个字符串的末尾开始,向前逐字节比较,直到遇到不匹配的字符。该函数在构建好后缀表时被调用。
-
next(text string): 这是执行搜索的引擎。它从文本中可能匹配的末尾位置开始,反向比较模式串和文本。当发生不匹配时,它会查询预处理好的两张表,取 max(f.badCharSkip[text[i]], f.goodSuffixSkip[j]) 作为下一次比较的跳跃距离,从而跳过绝不可能匹配的字符区域,极大提升了搜索效率。如果找到完全匹配,则返回模式串在文本中的起始索引;否则返回 -1。
在 strings 包中的应用
一个自然的疑问是:如此精心实现的算法模块,为何没有对外暴露?答案在于 关注点分离和代码复用。stringFinder 被设计为 strings 包内部的一个高性能基础组件,专门服务于更高层次的功能。
例如,在实现字符串替换函数时,就需要高频次地进行子串查找。我们可以在 strings 包的 replace.go 文件中找到它的应用:
// singleStringReplacer is the implementation that's used when there is only
// one string to replace (and that string has more than one byte).
type singleStringReplacer struct {
finder *stringFinder // 这里使用了 stringFinder
// value is the new string that replaces that pattern when it's found.
value string
}
可见,stringFinder 为 strings.Replacer 等需要高效搜索的功能提供了底层支持。这种设计模式在 Go 标准库中很常见,通过封装复杂算法细节,对外提供简洁、稳定的 API。
总结
通过对 strings.Search 相关内部源码的解读,我们不仅学习了 Boyer-Moore 这一经典字符串搜索算法在 Go 中的工程化实现,更窥见了 Go 标准库在性能与设计上的考量。makeStringFinder 负责预处理构建跳转表,next 方法利用这些表进行高效的跳跃式匹配,而它们最终被整合到如 strings.Replace 这样的实用函数中,为开发者提供既简单又高性能的字符串操作能力。