Contents

Figures

  1. Ferret Example Screenshot

Notes

  1. rmmseg-cpp is preferable to RMMSeg
  2. The latest code might be unstable

Chapter 1
Introduction

Note 1.  rmmseg-cpp is preferable to RMMSeg

rmmseg-cpp is a rewrite of RMMSeg in C++ with a Ruby interface. It is much faster and consumes much less memory than RMMSeg, and its interface is almost identical to that of RMMSeg, so rmmseg-cpp is the better choice for production use.
RMMSeg is an implementation of the MMSEG word segmentation algorithm. It is based on two variants of the maximum matching algorithm. Two algorithms are available:

  • simple algorithm that uses only forward maximum matching.
  • complex algorithm that uses three-word chunk maximum matching and 3 additional rules to resolve ambiguities.
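
To make the first item concrete, here is a minimal sketch of forward maximum matching. It is not RMMSeg's actual code: the toy dictionary, the constant names and the simple_segment method are all hypothetical, and it assumes Ruby 1.9 or later so that String#chars yields whole UTF-8 characters.

```ruby
# encoding: utf-8
require 'set'

# Toy dictionary (an assumption for illustration); a real segmenter
# would load its words from dictionary files.
DICT = Set.new(%w[研究 研究生 生命 起源])
MAX_WORD_LENGTH = 3 # length of the longest dictionary entry, in characters

# Forward maximum matching: at each position take the longest
# dictionary word that starts there, falling back to one character.
def simple_segment(text, dict = DICT, max_len = MAX_WORD_LENGTH)
  chars = text.chars
  words = []
  pos = 0
  while pos < chars.size
    len = [max_len, chars.size - pos].min
    # Shrink the candidate until it is a dictionary word or a single character.
    len -= 1 while len > 1 && !dict.include?(chars[pos, len].join)
    words << chars[pos, len].join
    pos += len
  end
  words
end

puts simple_segment("研究生命起源").join(" ")
# prints: 研究生 命 起源
```

The greedy choice of 研究生 at the start leaves 命 stranded, although 研究 生命 起源 would be the better reading. Ambiguities of this kind are what the complex algorithm is meant to resolve.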

For more information about the algorithm, please refer to the following essays:

  • http://technology.chtsai.org/mmseg/
  • http://pluskid.lifegoo.com/?p=261
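
As a rough illustration of the complex algorithm, the sketch below enumerates chunks of up to three words and ranks them by the rules described in the MMSEG essay: maximum total length, then largest average word length, then smallest variance of word lengths. The last MMSEG rule (degree of morphemic freedom of one-character words) is left out because it needs character frequency data. All names and the toy dictionary are hypothetical, and this is not RMMSeg's actual implementation; it assumes Ruby 2.4 or later for Array#sum.

```ruby
# encoding: utf-8
require 'set'

# Toy dictionary (an assumption for illustration).
DICT = Set.new(%w[研究 研究生 生命 起源])
MAX_WORD_LENGTH = 3

# Lengths of candidate words starting at pos: every dictionary match,
# plus the single character as a fallback.
def word_lengths_at(chars, pos, dict, max_len)
  limit = [max_len, chars.size - pos].min
  (1..limit).select { |len| len == 1 || dict.include?(chars[pos, len].join) }
end

# Enumerate chunks of up to three consecutive words starting at pos.
# A chunk is represented as an array of word lengths.
def chunks_at(chars, pos, dict, max_len, depth = 3)
  lengths = word_lengths_at(chars, pos, dict, max_len)
  return [[]] if depth.zero? || lengths.empty?
  lengths.flat_map do |len|
    chunks_at(chars, pos + len, dict, max_len, depth - 1).map { |rest| [len] + rest }
  end
end

def variance(lengths)
  mean = lengths.sum.to_f / lengths.size
  lengths.sum { |l| (l - mean)**2 } / lengths.size
end

# Rank chunks by total length, then average word length, then
# (negated) variance; arrays compare element by element.
def best_chunk(chunks)
  chunks.max_by { |c| [c.sum, c.sum.to_f / c.size, -variance(c)] }
end

# Emit the first word of the best chunk, then repeat from the next position.
def complex_segment(text, dict = DICT, max_len = MAX_WORD_LENGTH)
  chars = text.chars
  words = []
  pos = 0
  while pos < chars.size
    first = best_chunk(chunks_at(chars, pos, dict, max_len)).first
    words << chars[pos, first].join
    pos += first
  end
  words
end

puts complex_segment("研究生命起源").join(" ")
# prints: 研究 生命 起源
```

At the first position the chunks 研究/生命/起源 and 研究生/命/起源 tie on total length and average word length; the variance rule prefers the evenly sized words, so 研究 is emitted first.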

RMMSeg can be used either as a stand-alone program or as an Analyzer for Ferret.

Chapter 2
Setup

2.1  Requirements

Your system needs the following software to run RMMSeg.
Software  Notes
Ruby      Version 1.8.x is required
Rake      If you want to build the gem manually
rspec     If you want to run the test cases

2.2  Installation

2.2.1  Using RubyGems

To install the gem remotely from RubyForge:

sudo gem install rmmseg

Or you can download the gem file manually from RubyForge and install it locally:

sudo gem install --local rmmseg-x.y.z.gem

2.2.2  From Subversion

You can always get the latest source code from the Subversion repository hosted at RubyForge.

Note 2.  The latest code might be unstable

Some new features may be available only in the latest code in Subversion, but that code may be broken at times, so the released gem package is recommended for production.
To check out the code from RubyForge, you need to install Subversion, then run:

svn checkout http://rmmseg.rubyforge.org/svn/trunk/ rmmseg

Then you can run

rake gem

to build the gem file.

Chapter 3
Usage

3.1  Stand Alone rmmseg

RMMSeg comes with a script named rmmseg. To see its basic usage, run it with the -h option:

rmmseg -h

It reads from STDIN and prints the result to STDOUT. Here is a real example:

$ echo "我们都喜欢用 Ruby" | rmmseg
我们 都 喜欢 用 Ruby

3.2  Analyzer for Ferret

RMMSeg includes an analyzer for Ferret that is ready to use: just require it and pass it to Ferret. Here’s a complete example:

#!/usr/bin/env ruby
require 'rubygems'
require 'rmmseg'
require 'rmmseg/ferret'

analyzer = RMMSeg::Ferret::Analyzer.new { |tokenizer|
  Ferret::Analysis::LowerCaseFilter.new(tokenizer)
}

$index = Ferret::Index::Index.new(:analyzer => analyzer)

$index << {
  :title => "分词",
  :content => "中文分词比较困难,不像英文那样,直接在空格和标点符号的地方断开就可以了。"
}
$index << {
  :title => "RMMSeg",
  :content => "RMMSeg 我近日做的一个 Ruby 中文分词实现,下一步是和 Ferret 进行集成。"
}
$index << {
  :title => "Ruby 1.9",
  :content => "Ruby 1.9.0 已经发布了,1.9 的一个重大改进就是对 Unicode 的支持。"
}
$index << {
  :title => "Ferret",
  :content => <<END
Ferret is a high-performance, full-featured text search engine library
written for Ruby. It is inspired by Apache Lucene Java project. With
the introduction of Ferret, Ruby users now have one of the fastest and
most flexible search libraries available. And it is surprisingly easy
to use.
END
}

def highlight_search(key)
  $index.search_each(%Q!content:"#{key}"!) do |id, score|
    puts "*** Document \"#{$index[id][:title]}\" found with a score of #{score}"
    puts "-"*40
    highlights = $index.highlight("content:#{key}", id,
                                  :field => :content,
                                  :pre_tag => "\033[36m",
                                  :post_tag => "\033[m")
    puts "#{highlights}"
    puts ""
  end
end

ARGV.each { |key|
  puts "\033[33mSearching for #{key}...\033[m"
  puts ""
  highlight_search(key)
}

# Local Variables:
# coding: utf-8
# End:

Running it with the following keywords:

$ ruby ferret_example.rb Ruby 中文

generates the following results:

Searching for Ruby...

*** Document "RMMSeg" found with a score of 0.21875
----------------------------------------
RMMSeg 我近日做的一个 Ruby 中文分词实现,下一步是和 Ferret 进行集成。

*** Document "Ruby 1.9" found with a score of 0.21875
----------------------------------------
Ruby 1.9.0 已经发布了,1.9 的一个重大改进就是对 Unicode 的支持。

*** Document "Ferret" found with a score of 0.176776692271233
----------------------------------------
Ferret is a high-performance, full-featured text search engine library
written for Ruby. It is inspired by Apache Lucene Java project. With
the introduction of Ferret, Ruby users now have one of the fastest and
most flexible search libraries available. And it is surprisingly easy
to use.

Searching for 中文...

*** Document "分词" found with a score of 0.281680464744568
----------------------------------------
中文分词比较困难,不像英文那样,直接在空格和标点符号的地方断开就可以了。

*** Document "RMMSeg" found with a score of 0.281680464744568
----------------------------------------
RMMSeg 我近日做的一个 Ruby 中文分词实现,下一步是和 Ferret 进行集成。

If you run the example in a terminal, you’ll see the results highlighted as in Figure 1.

Figure 1.  Ferret Example Screenshot

3.3  Customization

RMMSeg can be customized through RMMSeg::Config. For example, to use your own dictionaries, set them before you start segmenting:

RMMSeg::Config.dictionaries = [["dict1.dic", true],  # with frequency info
                               ["dict2.dic", false], # without
                               ["dict3.dic", false]]
RMMSeg::Config.max_word_length = 6

Or, to use the simple algorithm for faster (but less accurate) segmentation:

RMMSeg::Config.algorithm = :simple

For more information on customization, please refer to the RDoc of RMMSeg::Config.

Chapter 4
Resources