--  Author: nlplab
--  Posted: 10/14/2010 5:09:00 PM

--  [Download] Corpus: Chinese Short Message Service
*************************************************************

NLPLAB No.: NLPLAB2010T003

Release Date: May 28, 2010

Corpus: Chinese Short Message Service

Abbreviation: CSMS

Version: 1.0

Copyright: Wuying Liu

Contact:
  (1)email: nlplab@163.com; <Natural Language Processing Laboratory>
  (2)mobile phone: 13787784974
  (3)qq: 44631423
  (4)web: http://nlplab.webhop.net

Data Type: Text, UTF-8 encoded

Language: Chinese

Application: SMS Spam Filtering, Short Text Processing

Introduction:
(1)The CSMS corpus is made up of real-world Chinese mobile messages in chronological order, obtained from volunteers and manually labeled with one of two categories {spam, ham} according to the volunteers' feedback.
(2)The CSMS corpus consists of 85,870 messages: 21,099 spam and 64,771 ham.
(3)Each message includes FromPhoneNumber, ToPhoneNumber and BodyText fields. To protect privacy, the phone numbers have been replaced without changing the communication relation network.
(4)The SMS texts and category labels are stored separately. The SMS texts are stored under the directory "csms/data/", which contains 85,870 text files. The category labels are stored under the directory "csms/full/".

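Example: (1)The SMS file "csms/data/csms.1" is shown below
              13910000001
              13810000002
              $$$$$$$$ 这八个金钱符转发给八个好朋友.你这一年就会财源滚滚.如果删除不发.那你这一年就会破财.发吧!我也是被逼的,谁叫你人缘好呢
              (English gloss, not part of the file: "$$$$$$$$ Forward these eight money symbols to eight good friends and money will roll in for you all year. If you delete this without forwarding it, you will lose money this year. Send it! I was forced into this too; who told you to be so popular?")
         (2)The category label file "csms/full/index" is shown below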
              spam ../data/csms.1
              ham ../data/csms.2
              ham ../data/csms.3
              ...
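
For readers who want to load the corpus, the layout described above could be read roughly as follows. This is only a minimal sketch, not code shipped with CSMS: the function names (parse_index, parse_message, label_counts, load_corpus) are hypothetical, and it assumes that each message file holds FromPhoneNumber on the first line, ToPhoneNumber on the second and BodyText on the remaining lines, as in example (1), and that paths in the index are relative to the directory holding the index file.

```python
from collections import Counter
from pathlib import Path


def parse_index(index_text):
    """Parse index lines of the form '<label> <relative path>'."""
    entries = []
    for line in index_text.splitlines():
        line = line.strip()
        if not line or line == "...":  # skip blanks and the elision marker
            continue
        label, rel_path = line.split(maxsplit=1)
        entries.append((label, rel_path))
    return entries


def parse_message(message_text):
    """Split one message file into sender, recipient and body.

    Assumption: line 1 is FromPhoneNumber, line 2 is ToPhoneNumber,
    and every following line belongs to BodyText.
    """
    lines = message_text.strip().splitlines()
    if len(lines) < 2:
        raise ValueError("expected at least two phone-number lines")
    body = "\n".join(line.strip() for line in lines[2:])
    return {"from": lines[0].strip(), "to": lines[1].strip(), "body": body}


def label_counts(entries):
    """Count how many index entries carry each label."""
    return Counter(label for label, _ in entries)


def load_corpus(index_path):
    """Yield (label, message) pairs, resolving paths relative to the index."""
    index_path = Path(index_path)
    # "utf-8-sig" also accepts files without a byte order mark.
    entries = parse_index(index_path.read_text(encoding="utf-8-sig"))
    for label, rel_path in entries:
        msg_path = index_path.parent / rel_path
        yield label, parse_message(msg_path.read_text(encoding="utf-8-sig"))
```

With the archive unpacked, iterating over load_corpus("csms/full/index") would yield the labeled messages in index order, and label_counts applied to the parsed index gives the spam/ham split.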

*************************************************************
Download: csms-toy.zip
http://cid-2c1d19cb59beaf62.skydrive.live.com/redir.aspx?client=wnf&resId=2C1D19CB59BEAF62!110&ct=&page=self&parid=&type=3

