找回密码
立即注册
搜索
热搜: Java Python Linux Go
发回帖 发新帖

570

积分

0

好友

82

主题
发表于 3 天前 | 查看: 7| 回复: 0

图片

本文通过阅读 Go 语言标准库 strings 包中与搜索相关的源码,聚焦于一个核心的内部结构 stringFinder。它的作用是高效地在源文本中查找字符串,其实现基于经典的 Boyer-Moore 字符串搜索算法。该算法由 Robert S. Boyer 和 J Strother Moore 于 1977 年提出,其设计哲学是通过 坏字符规则(bad character rule)好后缀规则(good suffix rule) 在匹配失败时跳过大量不必要的字符比较,从而在实际应用中达到比暴力匹配(Brute-Force)高数十倍的性能。

下面是 stringFinder 及相关方法的源码:

package strings

// stringFinder efficiently finds strings in a source text. It's implemented
// using the Boyer-Moore string search algorithm:
// https://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm
// https://www.cs.utexas.edu/~moore/publications/fstrpos.pdf (note: this aged
// document uses 1-based indexing)
type stringFinder struct {
    // pattern is the string that we are searching for in the text.
    pattern string
    // badCharSkip[b] contains the distance between the last byte of pattern
    // and the rightmost occurrence of b in pattern. If b is not in pattern,
    // badCharSkip[b] is len(pattern).
    //
    // Whenever a mismatch is found with byte b in the text, we can safely
    // shift the matching frame at least badCharSkip[b] until the next time
    // the matching char could be in alignment.
    badCharSkip [256]int
    // goodSuffixSkip[i] defines how far we can shift the matching frame given
    // that the suffix pattern[i+1:] matches, but the byte pattern[i] does
    // not. There are two cases to consider:
    //
    // 1. The matched suffix occurs elsewhere in pattern (with a different
    // byte preceding it that we might possibly match). In this case, we can
    // shift the matching frame to align with the next suffix chunk. For
    // example, the pattern "mississi" has the suffix "issi" next occurring
    // (in right-to-left order) at index 1, so goodSuffixSkip[3] ==
    // shift+len(suffix) == 3+4 == 7.
    //
    // 2. If the matched suffix does not occur elsewhere in pattern, then the
    // matching frame may share part of its prefix with the end of the
    // matching suffix. In this case, goodSuffixSkip[i] will contain how far
    // to shift the frame to align this portion of the prefix to the
    // suffix. For example, in the pattern "abcxxxabc", when the first
    // mismatch from the back is found to be in position 3, the matching
    // suffix "xxabc" is not found elsewhere in the pattern. However, its
    // rightmost "abc" (at position 6) is a prefix of the whole pattern, so
    // goodSuffixSkip[3] == shift+len(suffix) == 6+5 == 11.
    goodSuffixSkip []int
}

func makeStringFinder(pattern string) *stringFinder {
    f := &stringFinder{
       pattern:        pattern,
       goodSuffixSkip: make([]int, len(pattern)),
    }
    // last is the index of the last character in the pattern.
    last := len(pattern) - 1
    // Build bad character table.
    // Bytes not in the pattern can skip one pattern's length.
    for i := range f.badCharSkip {
       f.badCharSkip[i] = len(pattern)
    }
    // The loop condition is < instead of <= so that the last byte does not
    // have a zero distance to itself. Finding this byte out of place implies
    // that it is not in the last position.
    for i := 0; i < last; i++ {
       f.badCharSkip[pattern[i]] = last - i
    }
    // Build good suffix table.
    // First pass: set each value to the next index which starts a prefix of
    // pattern.
    lastPrefix := last
    for i := last; i >= 0; i-- {
       if HasPrefix(pattern, pattern[i+1:]) {
          lastPrefix = i + 1
       }
       // lastPrefix is the shift, and (last-i) is len(suffix).
       f.goodSuffixSkip[i] = lastPrefix + last - i
    }
    // Second pass: find repeats of pattern's suffix starting from the front.
    for i := 0; i < last; i++ {
       lenSuffix := longestCommonSuffix(pattern, pattern[1:i+1])
       if pattern[i-lenSuffix] != pattern[last-lenSuffix] {
          // (last-i) is the shift, and lenSuffix is len(suffix).
          f.goodSuffixSkip[last-lenSuffix] = lenSuffix + last - i
       }
    }
    return f
}

func longestCommonSuffix(a, b string) (i int) {
    for ; i < len(a) && i < len(b); i++ {
       if a[len(a)-1-i] != b[len(b)-1-i] {
          break
       }
    }
    return
}

// next returns the index in text of the first occurrence of the pattern. If
// the pattern is not found, it returns -1.
func (f *stringFinder) next(text string) int {
    i := len(f.pattern) - 1
    for i < len(text) {
       // Compare backwards from the end until the first unmatching character.
       j := len(f.pattern) - 1
       for j >= 0 && text[i] == f.pattern[j] {
          i--
          j--
       }
       if j < 0 {
          return i + 1 // match
       }
       i += max(f.badCharSkip[text[i]], f.goodSuffixSkip[j])
    }
    return -1
}

源码结构与核心方法解析

stringFinder 结构体是算法实现的载体,包含模式串 (pattern)、坏字符跳转表 (badCharSkip) 和好后缀跳转表 (goodSuffixSkip)。这些设计体现了典型的空间换时间算法优化思想。

相关的三个核心方法均未导出(小写字母开头),这可能是为了防止外部错误调用或确保其专用于包内特定场景:

  1. makeStringFinder(pattern string): 工厂函数,负责预处理模式串。它构建了两张关键的跳转表,这是 Boyer-Moore 算法高效的前提。badCharSkip 表基于坏字符规则,goodSuffixSkip 表基于好后缀规则,两者共同决定了匹配失败时的最大安全移动距离。

  2. longestCommonSuffix(a, b string): 一个辅助函数,用于计算两个字符串的最长公共后缀长度。实现方式是从两个字符串的末尾开始,向前逐字节比较,直到遇到不匹配的字符。该函数在构建好后缀表时被调用。

  3. next(text string): 这是执行搜索的引擎。它从文本中可能匹配的末尾位置开始,反向比较模式串和文本。当发生不匹配时,它会查询预处理好的两张表,取 max(f.badCharSkip[text[i]], f.goodSuffixSkip[j]) 作为下一次比较的跳跃距离,从而跳过绝不可能匹配的字符区域,极大提升了搜索效率。如果找到完全匹配,则返回模式串在文本中的起始索引;否则返回 -1

strings 包中的应用

一个自然的疑问是:如此精心实现的算法模块,为何没有对外暴露?答案在于 关注点分离和代码复用stringFinder 被设计为 strings 包内部的一个高性能基础组件,专门服务于更高层次的功能。

例如,在实现字符串替换函数时,就需要高频次地进行子串查找。我们可以在 strings 包的 replace.go 文件中找到它的应用:

// singleStringReplacer is the implementation that's used when there is only
// one string to replace (and that string has more than one byte).
type singleStringReplacer struct {
    finder *stringFinder // 这里使用了 stringFinder
    // value is the new string that replaces that pattern when it's found.
    value string
}

可见,stringFinderstrings.Replacer 等需要高效搜索的功能提供了底层支持。这种设计模式在 Go 标准库中很常见,通过封装复杂算法细节,对外提供简洁、稳定的 API。

总结

通过对 strings.Search 相关内部源码的解读,我们不仅学习了 Boyer-Moore 这一经典字符串搜索算法在 Go 中的工程化实现,更窥见了 Go 标准库在性能与设计上的考量。makeStringFinder 负责预处理构建跳转表,next 方法利用这些表进行高效的跳跃式匹配,而它们最终被整合到如 strings.Replace 这样的实用函数中,为开发者提供既简单又高性能的字符串操作能力。




上一篇:分页排序避坑指南:MyBatis-Plus多字段排序与SQL优化实战
下一篇:pytest-repeat实战:四种重复测试模式详解与稳定性验证
您需要登录后才可以回帖 登录 | 立即注册

手机版|小黑屋|网站地图|云栈社区 ( 苏ICP备2022046150号-2 )

GMT+8, 2025-12-12 01:47 , Processed in 0.082819 second(s), 42 queries , Gzip On.

Powered by Discuz! X3.5

© 2025-2025 云栈社区.

快速回复 返回顶部 返回列表