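This time I'll explain how to force a CPAN module that does not support the cp932 locale to work anyway.
As a hands-on example we'll use "String::Trigram", a CPAN module that gives a quick measure of how similar two strings are.
First, save the following source in UTF-8 encoding under the file name "trigram.pl".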
#!/usr/bin/perl
# Install the module from CPAN at the command prompt:
# perl -MCPAN -e shell
# cpan>install String::Trigram
# When handling Japanese in Perl 5.8 or later, always write the source in UTF-8.
use strict;
use warnings;
use Encode;
use utf8;
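use String::Trigram; # <- the CPAN module used in this article
binmode STDIN, ":encoding(cp932)"; # on input the :encoding layer decodes, despite its name
binmode STDOUT, ":encoding(cp932)";
binmode STDERR, ":encoding(cp932)";
# Input read from file handles such as <IN> (without an :encoding layer) needs decode
# Output written to file handles such as <OUT> (without an :encoding layer) needs encode
our $debug=1;
# File names etc. must be decoded beforehand, e.g. $fname=&de_sjis($fname);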
my $str_a='今日はとても良い天気です。';
&vv('$str_a',$str_a);
my $str_b='明日はとても良い天気だといいなぁ。';
&vv('$str_b',$str_b);
my $str_c='昨日はあまり良い天気ではありませんでした。';
&vv('$str_c',$str_c);
my $str_d='私の名前はぶんぶんです。';
&vv('$str_d',$str_d);
my $score_ab=String::Trigram::compare($str_a, $str_b);
&vv('$score_ab',$score_ab);
my $score_ac=String::Trigram::compare($str_a, $str_c);
&vv('$score_ac',$score_ac);
my $score_ad=String::Trigram::compare($str_a, $str_d);
&vv('$score_ad',$score_ad);
print '[Finish!]';
<>; # wait for Enter before exiting
exit;
#----------------------------------
sub en_sjis
{
my ($buf)=@_;
encode('cp932',$buf);
}
sub de_sjis
{
my ($buf)=@_;
decode('cp932',$buf);
}
sub en_utf8
{
my ($buf)=@_;
encode('utf-8',$buf);
}
sub de_utf8
{
my ($buf)=@_;
decode('utf-8',$buf);
}
sub en_euc
{
my ($buf)=@_;
encode('euc-jp',$buf);
}
sub de_euc
{
my ($buf)=@_;
decode('euc-jp',$buf);
}
sub vv
{
my($Name,$Value)=@_;
if($debug) {
print $Name.'=['.$Value.']',"\n";
}
}
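The comments in the script about decoding input and encoding output can be illustrated in isolation. The following is a minimal sketch using only the core Encode module; the variable names are made up for illustration, and the in-memory file handle merely stands in for a real cp932-encoded file.

```perl
use strict;
use warnings;
use utf8;                       # literals below are character strings
use Encode qw(encode decode);

my $text  = '今日はとても良い天気です。';
my $bytes = encode('cp932', $text);   # characters -> cp932 bytes (for output)
my $back  = decode('cp932', $bytes);  # cp932 bytes -> characters (for input)

# An :encoding layer performs the same decode automatically while reading.
# The in-memory handle stands in for a real cp932-encoded file.
open my $in, '<:encoding(cp932)', \$bytes or die "open failed: $!";
my $line = <$in>;
close $in;

printf "characters=%d bytes=%d\n", length($text), length($bytes);
print "explicit decode matches\n" if $back eq $text;
print "layer decode matches\n"    if $line eq $text;
```

Inside the program everything is a character string; bytes only exist at the boundary, which is why the script sets the layers once with binmode instead of calling encode on every print.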
If String::Trigram has not yet been installed from CPAN, you get an error like the one below. (The command typed is the line after the ">" prompt.)
>perl trigram.pl
Can't locate String/Trigram.pm in @INC (you may need to install the String::Trigram module) (@INC entries checked: C:/Strawberry/perl/site/lib C:/Strawberry/perl/vendor/lib C:/Strawberry/perl/lib) at trigram.pl line 13.
BEGIN failed--compilation aborted at trigram.pl line 13.
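The "Can't locate" error above only says that the module's file was not found in any @INC directory. As a rough sketch, the same check can be done up front without aborting the script; module_available is a hypothetical helper name, not part of any library.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the named module can be loaded from @INC.
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # String::Trigram -> String/Trigram.pm
    return eval { require $file; 1 } ? 1 : 0; # a failed require dies inside eval
}

for my $m (qw(Encode String::Trigram)) {
    printf "%-16s %s\n", $m, module_available($m) ? 'installed' : 'missing';
}
```

Encode ships with Perl, so it should be reported as installed; String::Trigram is reported as missing until it has been installed from CPAN.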
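At the command prompt, run "perl -MCPAN -e shell" and then run "install String::Trigram".
(On Linux systems such as Ubuntu, sudo is usually needed when installing into the system Perl.) (Typed commands follow the prompt.)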
Microsoft Windows [Version 10.0.26100.2605]
(c) Microsoft Corporation. All rights reserved.
C:\Users\xxxx\OneDrive\デスクトップ>perl -MCPAN -e shell
Starting with version 2.29 of the cpan shell, a new download mechanism
is the default which exclusively uses cpan.org as the host to download
from. The configuration variable pushy_https can be used to (de)select
the new mechanism. Please read more about it and make your choice
between the old and the new mechanism by running
o conf init pushy_https
Once you have done that and stored the config variable this dialog
will disappear.
Unable to get Terminal Size. The Win32 GetConsoleScreenBufferInfo call didn't work. The COLUMNS and LINES environment variables didn't work. at C:\Strawberry\perl\vendor\lib/Term/ReadLine/readline.pm line 410.
cpan shell -- CPAN exploration and modules installation (v2.36)
Enter 'h' for help.
cpan> install String::Trigram
Database was generated on Tue, 07 Jan 2025 13:48:05 GMT
Running install for module 'String::Trigram'
Checksum for C:\STRAWB~1\cpan\sources\authors\id\T\TA\TAREKA\String-Trigram-0.12.tar.gz ok
Scanning cache C:\STRAWB~1\cpan\build for sizes
............................................................................DONE
Configuring T/TA/TAREKA/String-Trigram-0.12.tar.gz with Makefile.PL
Checking if your kit is complete...
Looks good
Generating a gmake-style Makefile
Writing Makefile for String::Trigram
Writing MYMETA.yml and MYMETA.json
TAREKA/String-Trigram-0.12.tar.gz
C:\Strawberry\perl\bin\perl.exe Makefile.PL -- OK
Running make for T/TA/TAREKA/String-Trigram-0.12.tar.gz
cp Trigram.pm blib\lib\String\Trigram.pm
TAREKA/String-Trigram-0.12.tar.gz
C:\STRAWB~1\c\bin\gmake.exe -- OK
Running make test for TAREKA/String-Trigram-0.12.tar.gz
"C:\Strawberry\perl\bin\perl.exe" "-Iblib\lib" "-Iblib\arch" test.pl
1..20
# Running under perl version 5.040000 for MSWin32
# Current time local: Wed Jan 8 05:09:06 2025
# Current time GMT: Tue Jan 7 20:09:06 2025
# Using Test.pm version 1.31
1-gram ............................... Locale 'Japanese_Japan.932' is unsupported, and may hang or crash the interpreter at blib\lib/String/Trigram.pm line 32.
ok 1
2-gram ............................... ok 2
3-gram ............................... ok 3
4-gram ............................... ok 4
7-gram ............................... ok 5
compare a to b equals compare b to a . ok 6
completely different strings ......... ok 7
extendBase ........................... ok 8
getBestMatch/1 ....................... ok 9
getBestMatch/2 ....................... ok 10
identical strings .................... ok 11
ignore case .......................... ok 12
keep only alphanumerics .............. ok 13
keeping base of comparison unique .... ok 14
minSim ............................... ok 15
padding .............................. ok 16
reInit/1 ............................. ok 17
reInit/2 ............................. ok 18
several tokens of one trigram type ... ok 19
warp ................................. ok 20
Lockfile removed.
TAREKA/String-Trigram-0.12.tar.gz
C:\STRAWB~1\c\bin\gmake.exe test -- OK
Running make install for TAREKA/String-Trigram-0.12.tar.gz
Installing C:\STRAWB~1\perl\site\lib\String\Trigram.pm
Appending installation info to C:\STRAWB~1\perl\lib/perllocal.pod
TAREKA/String-Trigram-0.12.tar.gz
C:\STRAWB~1\c\bin\gmake.exe install UNINST=1 -- OK
cpan> q
Lockfile removed.
The installation succeeded. Now let's run trigram.pl again.
>perl trigram.pl
$str_a=[今日はとても良い天気です。]
$str_b=[明日はとても良い天気だといいなぁ。]
$str_c=[昨日はあまり良い天気ではありませんでした。]
$str_d=[私の名前はぶんぶんです。]
Locale 'Japanese_Japan.932' is unsupported, and may hang or crash the interpreter at C:/Strawberry/perl/site/lib/String/Trigram.pm line 32.
Wide character (U+4ECA) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+65E5) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+306F) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+3068) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+3066) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+3082) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+826F) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+3044) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+5929) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+6C17) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+3067) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+3059) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+3002) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+660E) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+65E5) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+306F) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3068) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3066) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3082) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+826F) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3044) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+5929) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+6C17) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3060) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3068) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3044) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3044) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+306A) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3041) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3002) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
$score_ab=[0.307692307692308]
Wide character (U+4ECA) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+65E5) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+306F) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+3068) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+3066) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+3082) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+826F) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+3044) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+5929) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+6C17) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+3067) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+3059) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+3002) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+6628) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+65E5) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+306F) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3042) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+307E) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+308A) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+826F) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3044) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+5929) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+6C17) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3067) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+306F) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3042) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+308A) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+307E) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+305B) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3093) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3067) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3057) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+305F) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3002) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
$score_ac=[0.117647058823529]
Wide character (U+4ECA) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+65E5) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+306F) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+3068) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+3066) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+3082) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+826F) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+3044) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+5929) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+6C17) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+3067) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+3059) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+3002) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 524.
Wide character (U+79C1) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+306E) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+540D) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+524D) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+306F) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3076) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3093) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3076) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3093) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3067) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3059) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
Wide character (U+3002) in lc at C:/Strawberry/perl/site/lib/String/Trigram.pm line 258.
$score_ad=[0.115384615384615]
[Finish!]
The run first warns "Locale 'Japanese_Japan.932' is unsupported, and may hang or crash the interpreter at C:/Strawberry/perl/site/lib/String/Trigram.pm line 32."
and then prints a long stream of further warnings. This is the well-known "Strawberry Perl Japanese cp932 warning problem".
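To see which locale name Perl has picked up from the operating system, the current LC_CTYPE locale can be queried with the core POSIX module. This is only a sketch: current_ctype_locale is a hypothetical helper, and the assumption is that on a Japanese Windows machine the name printed is the same one that appears in the warning; on other systems it will differ.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: name of the current LC_CTYPE locale, or 'unknown'.
sub current_ctype_locale {
    my $name = setlocale(LC_CTYPE);   # the one-argument form only queries
    return (defined($name) && length($name)) ? $name : 'unknown';
}

print "LC_CTYPE locale: ", current_ctype_locale(), "\n";
```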
Trigram.pm is a module written in Perl, so you can read its source. (The file is marked read-only, however.)
Let's open "C:/Strawberry/perl/site/lib/String/Trigram.pm" in Hidemaru Editor and take a look.
package String::Trigram;
use Carp;
use locale;
use 5.6.0;
use strict;
use warnings;
require Exporter;
our @ISA = qw(Exporter);
our @EXPORT_OK = ('compare');
our $VERSION = '0.12';
our $DEFAULT_MIN_SIM = 0;
our $DEFAULT_WARP = 1.0;
our $DEFAULT_IGNORE_CASE = 1;
our $DEFAULT_KEEP_ONLY_ALNUMS = 0;
our $DEFAULT_DEBUG = 0;
our $DEFAULT_NGRAM_LEN = 3;
our $DEFAULT_PADDING = $DEFAULT_NGRAM_LEN - 1;
.................
There is a "use locale;" line. With this in place, Strawberry Perl warns that cp932 is not supported.
It really is unsupported, but if you comment out "use locale", the warnings stop for the time being.
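(As far as this run shows, the pragma only affects the lc calls that triggered the warnings above, and lowercasing is irrelevant for Japanese text anyway.)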
However, it occasionally freezes. That comes down to luck; since the locale really is unsupported, it cannot be helped.
To edit the file, right-click "C:/Strawberry/perl/site/lib/String/Trigram.pm" and open Properties.

Uncheck "Read-only", click "OK", then send the file to Hidemaru Editor again and comment out the "use locale;" line.
What a brute-force fix!!! (^^;
package String::Trigram;
use Carp;
# use locale;
use 5.6.0;
use strict;
use warnings;
.................
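The same edit can be expressed as a one-line substitution. The sketch below works on a string that mimics the top of the module rather than on the installed file; comment_out_use_locale is a hypothetical helper, and applying it to the real Trigram.pm would additionally require reading and writing the file (ideally after making a backup copy).

```perl
use strict;
use warnings;

# Hypothetical helper: comment out every "use locale;" line in a source text.
# Lines that are already commented out do not match and are left alone.
sub comment_out_use_locale {
    my ($source) = @_;
    (my $patched = $source) =~ s/^([ \t]*use[ \t]+locale[ \t]*;)/# $1/mg;
    return $patched;
}

my $sample = <<'END';
package String::Trigram;
use Carp;
use locale;
use 5.6.0;
END

print comment_out_use_locale($sample);
```

Because the pattern is anchored at the start of a line, running it a second time on the patched text changes nothing.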
Save Trigram.pm and run the script again. (There is no need to set the file back to read-only.)
>perl trigram.pl
$str_a=[今日はとても良い天気です。]
$str_b=[明日はとても良い天気だといいなぁ。]
$str_c=[昨日はあまり良い天気ではありませんでした。]
$str_d=[私の名前はぶんぶんです。]
$score_ab=[0.307692307692308]
$score_ac=[0.117647058823529]
$score_ad=[0.115384615384615]
[Finish!]
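Like magic! It runs properly with no warnings!! lol
From these results you can see that $str_a and $str_b are the most similar pair (about 0.31, versus about 0.12 for the other two)!!!
Trigram merely splits strings into 3-character units and computes how often those units match, so unlike AI it cannot measure closeness in meaning. It is very handy for finding similar file names.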
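The idea of comparing 3-character units can be sketched in a few lines of plain Perl. This is not String::Trigram's implementation or API: trigram_set and trigram_similarity are hypothetical helpers, and the score used here is a simple Jaccard index over distinct trigrams. The module's defaults shown earlier include padding and a warp factor, so its scores will differ from the ones this sketch produces.

```perl
use strict;
use warnings;
use utf8;   # the literals below are character strings, not bytes

# Hypothetical helper (not String::Trigram's API): the set of distinct
# 3-character substrings of a string.
sub trigram_set {
    my ($s) = @_;
    my %set;
    $set{ substr($s, $_, 3) } = 1 for 0 .. length($s) - 3;
    return \%set;
}

# Hypothetical helper: shared trigrams divided by all distinct trigrams
# (Jaccard index), giving a value between 0 and 1.
sub trigram_similarity {
    my ($s1, $s2) = @_;
    my ($t1, $t2) = (trigram_set($s1), trigram_set($s2));
    my $shared = grep { exists $t2->{$_} } keys %$t1;
    my $total  = keys(%$t1) + keys(%$t2) - $shared;
    return $total ? $shared / $total : 0;
}

my $str_a = '今日はとても良い天気です。';
my $str_b = '明日はとても良い天気だといいなぁ。';
my $str_d = '私の名前はぶんぶんです。';

printf "a-b: %.3f\n", trigram_similarity($str_a, $str_b);
printf "a-d: %.3f\n", trigram_similarity($str_a, $str_d);
```

Only numbers are printed, so the sketch does not depend on the console encoding. Strings that share a long run of characters, such as $str_a and $str_b, share many trigrams and score higher than strings that share only a sentence ending.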
This is strictly a stopgap. If you are unlucky and the input contains an unsupported character, the script freezes, so use it at your own risk!