m6米乐安卓版下载-米乐app官网下载
暂无图片
3

【pg15】postgresql 中文分词插件 pg-m6米乐安卓版下载

原创 严少安 2023-03-01
1929

pg菜鸟入门学习中,欢迎各位大佬留言技术指导。

bg

昨日群里在讨论 postgresql 的中文分词插件编译过程中异常中断的事情,我是第一次编译pg的插件,随笔记录下。

常见的中文分词插件有:zhparser、pg_jieba,可用于中文分词,全文检索。
但从github仓库看,这两个插件的代码已经有段没更新,未必直接适配pg15,且为了便于学习pg知识,所以采用自编译的方式进行安装。

测试环境

先交代环境,以免产生不必要的信息差,也为问题复现做好基础信息铺垫。

  1. os 为 centos 7.9
  2. 需要安装 gcc 不然编译会报错。gcc 版本为 gcc (gcc) 4.8.5 20150623 (red hat 4.8.5-44)
checking for gcc... no
checking for cc... no
configure: error: in `/root/postgres':
configure: error: no acceptable c compiler found in $path
see `config.log' for more details

其实这里也可以看到,gcc的版本是2015年release的,很老旧了,如有条件建议升级 kernel/os version.

  1. cmake需要安装,版本且看下文。

编译安装 pg 15

pg 版本为 v15.2, 直接下载源码包手动编译,需要安装其他依赖,此处暂不赘述,编译结果如下。

  1. 对源码进行配置
./configure --prefix=/opt/pgsql-152 --datadir=/data/pgdata-152 --with-extra-version="-yan"
...
checking whether gcc -std=gnu99 supports -wl,--as-needed... yes
configure: using compiler=gcc (gcc) 4.8.5 20150623 (red hat 4.8.5-44)
configure: using cflags=-wall -wmissing-prototypes -wpointer-arith -wdeclaration-after-statement -werror=vla -wendif-labels -wmissing-format-attribute -wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -o2
configure: using cppflags= -d_gnu_source 
configure: using ldflags=  -wl,--as-needed
configure: creating ./config.status
config.status: creating gnumakefile
config.status: creating src/makefile.global
config.status: creating src/include/pg_config.h
config.status: creating src/include/pg_config_ext.h
config.status: creating src/interfaces/ecpg/include/ecpg_config.h
config.status: linking src/backend/port/tas/dummy.s to src/backend/port/tas.s
config.status: linking src/backend/port/posix_sema.c to src/backend/port/pg_sema.c
config.status: linking src/backend/port/sysv_shmem.c to src/backend/port/pg_shmem.c
config.status: linking src/include/port/linux.h to src/include/pg_config_os.h
config.status: linking src/makefiles/makefile.linux to src/makefile.port
  1. 对源码进行编译,这个过程稍微久一点
gmake world
...
gmake -c vacuumlo all
gmake[2]: entering directory `/root/postgresql-15.2/contrib/vacuumlo'
gcc -std=gnu99 -wall -wmissing-prototypes -wpointer-arith -wdeclaration-after-statement -werror=vla -wendif-labels -wmissing-format-attribute -wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -o2 -i../../src/interfaces/libpq -i. -i. -i../../src/include  -d_gnu_source   -c -o vacuumlo.o vacuumlo.c
gcc -std=gnu99 -wall -wmissing-prototypes -wpointer-arith -wdeclaration-after-statement -werror=vla -wendif-labels -wmissing-format-attribute -wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -o2  vacuumlo.o -l../../src/common -lpgcommon -l../../src/port -lpgport -l../../src/interfaces/libpq -lpq -l../../src/port -l../../src/common   -wl,--as-needed -wl,-rpath,'/opt/pgsql-152/lib',--enable-new-dtags   -lpgcommon -lpgport -lz -lreadline -lpthread -lrt -ldl -lm  -o vacuumlo
gmake[2]: leaving directory `/root/postgresql-15.2/contrib/vacuumlo'
gmake[1]: leaving directory `/root/postgresql-15.2/contrib'

可以看到此时 pg 已经编译安装完成,查看安装路径,会发现生成了12个目录。

# tree /opt/pgsql-152/ -l 2 -d
/opt/pgsql-152/
├── bin
├── include
│   ├── informix
│   ├── internal
│   ├── libpq
│   └── server
├── lib
│   ├── pgxs
│   └── pkgconfig
└── share
    ├── doc
    └── man
12 directories

但是,插件并非安装在此目录,而是安装到了数据(data)目录,即编译时指定的--datadir路径。

# tree /data/pgdata-152/ -l 2 -d
/data/pgdata-152/
└── postgresql
    ├── extension
    ├── timezone
    ├── timezonesets
    └── tsearch_data
5 directories

对pg进行初始化,并启动pg,登录后,可以查看版本号,同时,可以看到编译时,自定义的版本信息 (15.2-yan)

$ psql psql (15.2-yan) type "help" for help. postgres=# select version(); version ------------------------------------------------------------------------------------------------------------- postgresql 15.2-yan on x86_64-pc-linux-gnu, compiled by gcc (gcc) 4.8.5 20150623 (red hat 4.8.5-44), 64-bit (1 row) postgres=# \l list of databases name | owner | encoding | collate | ctype | icu locale | locale provider | access privileges ----------- ---------- ---------- ------------- ------------- ------------ ----------------- ----------------------- postgres | postgres | utf8 | en_us.utf-8 | en_us.utf-8 | | libc | template0 | postgres | utf8 | en_us.utf-8 | en_us.utf-8 | | libc | =c/postgres | | | | | | | postgres=ctc/postgres template1 | postgres | utf8 | en_us.utf-8 | en_us.utf-8 | | libc | =c/postgres | | | | | | | postgres=ctc/postgres (3 rows) postgres=# \conninfo you are connected to database "postgres" as user "postgres" via socket in "/tmp" at port "5432". postgres=#

编译安装 pg_jieba

pg_jieba 的源码路径为 https://github.com/jaiminpan/pg_jieba/
由于 github 网络限制,我已镜像到 gitee: https://gitee.com/shawnyan/pg_jieba

代码的 readme 中显示该插件支持 postgresql 9.x, postgresql 10.x, postgresql 11.x
该插件需要确保 postgresql 已经安装,并可运行 pg_config 命令。

另外,该插件源码依赖另一个项目 – cppjieba
cppjieba是"结巴(jieba)"中文分词的c 版本
该项目支持utf8编码。支持载自定义用户词典,多路径时支持分隔符’|‘或者’;'分隔。

但依赖:

  • g (version >= 4.1 is recommended) or clang ;
  • cmake (version >= 2.6 is recommended);

这里留个伏笔,虽然 readme 中标记为推荐使用 cmake 2.6,但实际编译时这里就有问题。
如下图,编译时报错:

  1. 提示 foreach 不支持
  2. 需要使用 c99 标准进行编译源码。

20230228_230120.png


[root@centos7 build]# cmake .. -dcmake_prefix_path=/opt/pgsql-152/
-- setting pg_jieba build type - 
-- found postgresql: /opt/pgsql-152/lib (found version "15.2-yan") 
-- postgresql include dirs: /opt/pgsql-152/include;/opt/pgsql-152/include/server
-- postgresql library dirs: /opt/pgsql-152/lib
-- postgresql libraries:    pq
-- postgresql_executable is /opt/pgsql-152/bin/postgres
-- configuring done
-- generating done
-- build files have been written to: /root/postgresql-15.2/contrib/pg_jieba/build
[root@centos7 build]# 
[root@centos7 build]# make
scanning dependencies of target pg_jieba
[ 50%] building c object cmakefiles/pg_jieba.dir/pg_jieba.c.o
/root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c:258:1: warning: ‘definecustomconfigvariables’ was used with no prototype before its definition [-wmissing-prototypes]
 definecustomconfigvariables()
 ^
in file included from /opt/pgsql-152/include/server/nodes/primnodes.h:22:0,
                 from /opt/pgsql-152/include/server/nodes/parsenodes.h:27,
                 from /opt/pgsql-152/include/server/utils/guc.h:16,
                 from /root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c:17:
/root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c: in function ‘extract_dict_list’:
/opt/pgsql-152/include/server/nodes/pg_list.h:353:2: error: ‘for’ loop initial declarations are only allowed in c99 mode
  for (foreachstate cell##__state = {(lst), 0}; \
  ^
/root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c:459:2: note: in expansion of macro ‘foreach’
  foreach(lc, elemlist)
  ^
/opt/pgsql-152/include/server/nodes/pg_list.h:353:2: note: use option -std=c99 or -std=gnu99 to compile your code
  for (foreachstate cell##__state = {(lst), 0}; \
  ^
/root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c:459:2: note: in expansion of macro ‘foreach’
  foreach(lc, elemlist)
  ^
make[2]: *** [cmakefiles/pg_jieba.dir/pg_jieba.c.o] error 1
make[1]: *** [cmakefiles/pg_jieba.dir/all] error 2
make: *** [all] error 2
[root@centos7 build]# 

编译安装 limonp

翻阅源码,发现 pg_jieba 依赖于 deps/limonp,然而 limonp 也有依赖 cppjieba,那么我们就先跳过这步,直接编译 cppjieba,从最里层开始尝试编译。

<-- 此处先跳转到下一节 -->

代码中显示编译至少要使用 cmake3 以上版本,症结找到了,版本很重要,文档也很重要。
https://gitee.com/shawnyan/limonp/blob/master/cmakelists.txt#l1

cmake_minimum_required(version 3.0)

编译安装 cppjieba

剥“洋葱”是个有趣的过程,继续翻代码,可以发现 limonp 依赖于 cppjieba。
那么按照惯性,cppjieba也不可能一次编译成功了。

[ 16%] building cxx object test/cmakefiles/demo.dir/demo.cpp.o
in file included from /root/cppjieba/include/cppjieba/querysegment.hpp:7:0,
                 from /root/cppjieba/include/cppjieba/jieba.hpp:4,
                 from /root/cppjieba/test/demo.cpp:1:
/root/cppjieba/deps/limonp/include/limonp/logging.hpp:56:5: warning: identifier ‘nullptr’ is a keyword in c  11 [-wc  0x-compat]
     assert(tm_tmp != nullptr);
     ^
in file included from /usr/include/c  /4.8.2/cassert:43:0,
                 from /root/cppjieba/deps/limonp/include/limonp/logging.hpp:6,
                 from /root/cppjieba/include/cppjieba/querysegment.hpp:7,
                 from /root/cppjieba/include/cppjieba/jieba.hpp:4,
                 from /root/cppjieba/test/demo.cpp:1:
/root/cppjieba/deps/limonp/include/limonp/logging.hpp: in constructor ‘limonp::logger::logger(size_t, const char*, int)’:
/root/cppjieba/deps/limonp/include/limonp/logging.hpp:56:22: error: ‘nullptr’ was not declared in this scope
     assert(tm_tmp != nullptr);
                      ^

果然,那么问题到底出在哪里了呢。
root cause 是在 cmake 的版本。

果然,在安装 cmake3 之后,重新编译便可以通过。

  • output:
[root@centos7 build]# cmake3 . -dcmake_cxx_standard=11 .. 
-- the c compiler identification is gnu 4.8.5
-- the cxx compiler identification is gnu 4.8.5
-- check for working c compiler: /usr/bin/cc
-- check for working c compiler: /usr/bin/cc - works
-- detecting c compiler abi info
-- detecting c compiler abi info - done
-- detecting c compile features
-- detecting c compile features - done
-- check for working cxx compiler: /usr/bin/c  
-- check for working cxx compiler: /usr/bin/c   - works
-- detecting cxx compiler abi info
-- detecting cxx compiler abi info - done
-- detecting cxx compile features
-- detecting cxx compile features - done
-- configuring done
-- generating done
-- build files have been written to: /root/cppjieba/build
[root@centos7 build]# make
scanning dependencies of target gtest
[  6%] building cxx object deps/gtest/cmakefiles/gtest.dir/src/gtest-all.cc.o
[ 12%] linking cxx static library libgtest.a
[ 12%] built target gtest
scanning dependencies of target load_test
[ 18%] building cxx object test/cmakefiles/load_test.dir/load_test.cpp.o
[ 25%] linking cxx executable ../load_test
[ 25%] built target load_test
scanning dependencies of target demo
[ 31%] building cxx object test/cmakefiles/demo.dir/demo.cpp.o
[ 37%] linking cxx executable ../demo
[ 37%] built target demo
scanning dependencies of target test.run
[ 43%] building cxx object test/unittest/cmakefiles/test.run.dir/gtest_main.cpp.o
[ 50%] building cxx object test/unittest/cmakefiles/test.run.dir/keyword_extractor_test.cpp.o
[ 56%] building cxx object test/unittest/cmakefiles/test.run.dir/trie_test.cpp.o
[ 62%] building cxx object test/unittest/cmakefiles/test.run.dir/segments_test.cpp.o
[ 68%] building cxx object test/unittest/cmakefiles/test.run.dir/pos_tagger_test.cpp.o
[ 75%] building cxx object test/unittest/cmakefiles/test.run.dir/jieba_test.cpp.o
[ 81%] building cxx object test/unittest/cmakefiles/test.run.dir/pre_filter_test.cpp.o
[ 87%] building cxx object test/unittest/cmakefiles/test.run.dir/unicode_test.cpp.o
[ 93%] building cxx object test/unittest/cmakefiles/test.run.dir/textrank_test.cpp.o
[100%] linking cxx executable ../test.run
[100%] built target test.run
[root@centos7 build]# ls
cmakecache.txt  cmakefiles  cmake_install.cmake  ctesttestfile.cmake  demo  deps  load_test  makefile  test
[root@centos7 build]# ./demo 
他来到了网易杭研大厦
[demo] cut with hmm
他/来到/了/网易/杭研/大厦
[demo] cut without hmm 
他/来到/了/网易/杭/研/大厦
我来到北京清华大学
[demo] cutall
我/来到/北京/清华/清华大学/华大/大学
小明硕士毕业于中国科学院计算所,后在日本京都大学深造
[demo] cutforsearch
小明/硕士/毕业/于/中国/科学/学院/科学院/中国科学院/计算/计算所/,/后/在/日本/京都/大学/日本京都大学/深造
[demo] insert user word
男默/女泪
男默女泪
[demo] cutforsearch word with offset
[{"word": "小明", "offset": 0}, {"word": "硕士", "offset": 6}, {"word": "毕业", "offset": 12}, {"word": "于", "offset": 18}, {"word": "中国", "offset": 21}, {"word": "科学", "offset": 27}, {"word": "学院", "offset": 30}, {"word": "科学院", "offset": 27}, {"word": "中国科学院", "offset": 21}, {"word": "计算", "offset": 36}, {"word": "计算所", "offset": 36}, {"word": ",", "offset": 45}, {"word": "后", "offset": 48}, {"word": "在", "offset": 51}, {"word": "日本", "offset": 54}, {"word": "京都", "offset": 60}, {"word": "大学", "offset": 66}, {"word": "日本京都大学", "offset": 54}, {"word": "深造", "offset": 72}]
[demo] lookup tag for single token
[拖拉机:n, ceo:eng, 123:m, 。:x]
[demo] tagging
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上ceo,走上人生巅峰。
[我:r, 是:v, 拖拉机:n, 学院:n, 手扶拖拉机:n, 专业:n, 的:uj, 。:x, 不用:v, 多久:m, ,:x, 我:r, 就:d, 会:v, 升职:v, 加薪:nr, ,:x, 当上:t, ceo:eng, ,:x, 走上:v, 人生:n, 巅峰:n, 。:x]
[demo] keyword extraction
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上ceo,走上人生巅峰。
[{"word": "ceo", "offset": [93], "weight": 11.7392}, {"word": "升职", "offset": [72], "weight": 10.8562}, {"word": "加薪", "offset": [78], "weight": 10.6426}, {"word": "手扶拖拉机", "offset": [21], "weight": 10.0089}, {"word": "巅峰", "offset": [111], "weight": 9.49396}]
[root@centos7 build]# 

完整编译过程

下载源码:

cd /root/postgresql-15.2/contrib
git clone https://gitee.com/shawnyan/pg_jieba --depth=1
git clone https://gitee.com/shawnyan/cppjieba --depth=1 pg_jieba/libjieba
git clone https://gitee.com/shawnyan/limonp --depth=1 pg_jieba/libjieba/deps/limonp

编译:

cd pg_jieba
vi cmakelists.txt
set(cmake_c_standard 99)
mkdir build
cd build
cmake3 -dcmake_prefix_path=/opt/pgsql-152/ ..
make
  • output:
-- the c compiler identification is gnu 4.8.5
-- the cxx compiler identification is gnu 4.8.5
-- check for working c compiler: /usr/bin/cc
-- check for working c compiler: /usr/bin/cc - works
-- detecting c compiler abi info
-- detecting c compiler abi info - done
-- detecting c compile features
-- detecting c compile features - done
-- check for working cxx compiler: /usr/bin/c  
-- check for working cxx compiler: /usr/bin/c   - works
-- detecting cxx compiler abi info
-- detecting cxx compiler abi info - done
-- detecting cxx compile features
-- detecting cxx compile features - done
-- setting pg_jieba build type - 
-- found postgresql: /opt/pgsql-152/lib/libpq.so (found version "15.2") 
-- postgresql_executable is /opt/pgsql-152/bin/postgres
-- configuring done
-- generating done
-- build files have been written to: /root/postgresql-15.2/contrib/pg_jieba/build
[root@centos7 build]# make
scanning dependencies of target pg_jieba
[ 33%] building c object cmakefiles/pg_jieba.dir/pg_jieba.c.o
/root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c:258:1: warning: ‘definecustomconfigvariables’ was used with no prototype before its definition [-wmissing-prototypes]
 definecustomconfigvariables()
 ^
[ 66%] building cxx object cmakefiles/pg_jieba.dir/jieba.cpp.o
[100%] linking cxx shared library pg_jieba.so
[100%] built target pg_jieba
[root@centos7 build]# ll
total 684
-rw-r--r-- 1 root root  15772 mar  1 00:28 cmakecache.txt
drwxr-xr-x 5 root root    283 mar  1 00:28 cmakefiles
-rw-r--r-- 1 root root   7779 mar  1 00:28 cmake_install.cmake
-rw-r--r-- 1 root root   7961 mar  1 00:28 makefile
-rwxr-xr-x 1 root root 666720 mar  1 00:28 pg_jieba.so  --------------  
[root@centos7 build]# cp pg_jieba.so /opt/pgsql-152/lib

安装 pg_jieba 插件

启动 pg 服务之前,先在配置文件中添加预加载。

vi postgresql.conf
shared_preload_libraries = 'pg_jieba'

其次,需要复制辅助文件。

cp pg_jieba/libjieba/dict/jieba.dict.utf8 /data/pgdata-152/postgresql/tsearch_data/jieba_base.dict
cp pg_jieba/libjieba/dict/user.dict.utf8 /data/pgdata-152/postgresql/tsearch_data/jieba_user.dict
cp pg_jieba/libjieba/dict/hmm_model.utf8 /data/pgdata-152/postgresql/tsearch_data/jieba_hmm.model
cp pg_jieba/libjieba/dict/stop_words.utf8 /data/pgdata-152/postgresql/tsearch_data/jieba.stop
cp pg_jieba/pg_jieba.control /data/pgdata-152/postgresql/extension/pg_jieba.control
cp pg_jieba/pg_jieba.sql /data/pgdata-152/postgresql/extension/pg_jieba--1.1.1.sql

启动 pg 服务后,便可正常安装使用 pg_jieba 插件。

[postgres@centos7 ~]$ pg_ctl start
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start.....2023-03-01 00:44:13.044 cst [49163] log:  starting postgresql 15.2-yan on x86_64-pc-linux-gnu, compiled by gcc (gcc) 4.8.5 20150623 (red hat 4.8.5-44), 64-bit
2023-03-01 00:44:13.045 cst [49163] log:  listening on ipv6 address "::1", port 5432
2023-03-01 00:44:13.045 cst [49163] log:  listening on ipv4 address "127.0.0.1", port 5432
2023-03-01 00:44:13.048 cst [49163] log:  listening on unix socket "/tmp/.s.pgsql.5432"
2023-03-01 00:44:13.058 cst [49166] log:  database system was shut down at 2023-02-28 22:40:27 cst
2023-03-01 00:44:13.063 cst [49163] log:  database system is ready to accept connections
 done
server started
[postgres@centos7 ~]$ 
[postgres@centos7 ~]$ psql
psql (15.2-yan)
type "help" for help.
postgres=# create extension pg_jieba;
create extension
postgres=# 
postgres=# \dx
                        list of installed extensions
   name   | version |   schema   |               description                
---------- --------- ------------ ------------------------------------------
 pg_jieba | 1.1.1   | public     | a parser for full-text search of chinese
 plpgsql  | 1.0     | pg_catalog | pl/pgsql procedural language
(2 rows)

最后,实测 pg_jieba 插件,效果如下:

postgres=# select * from to_tsvector('jiebacfg', '小明硕士毕业于中国科学院计算所,后在pgccc学习postgresql');
                                        to_tsvector                                        
-------------------------------------------------------------------------------------------
 'pgccc':10 'postgresql':12 '中国科学院':5 '学习':11 '小明':1 '毕业':3 '硕士':2 '计算所':6
(1 row)
postgres=# select * from to_tsvector('jiebacfg', '小丽是个潜水王,也是数据库方面的专家');
                       to_tsvector                       
---------------------------------------------------------
 '专家':12 '小丽':1 '数据库':9 '方面':10 '潜水':4 '王':5
(1 row)

总结

简要总结,源码编译时,编译环境、依赖组件、所需版本都是要注意的。
coding 还是要写注释的,至少日志输出不能少,不然遇到错误不好调查。
尤其是数据库系统,万一遇到问题如何定位问题,如何快速且准确地解决,始终是 dba 们的必修课。

最后修改时间:2023-03-02 22:58:01
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【米乐app官网下载的版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

文章被以下合辑收录

评论

网站地图