Title: 中文新聞自動摘要系統
Automatic Text Summarization System for Chinese News
Authors: 廖贊瑋
Liao, Zan-Wei
Lee, Chia-Hoang
Keywords: 自動摘要;中文自動摘要;新聞自動摘要;summarization;text summarization;automatic text summarization;summarization for news
Issue Date: 2008
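Abstract: Advances in technology and the growth of the Internet have caused the volume of information to rise rapidly, yet this progress also forces users to spend more time browsing for the documents they need. Given the widespread use of search engines today, people want to obtain information with greater efficiency and effectiveness, and automatic summarization, together with the classification applications derived from it, plays an important role. If a search is paired with automatic summarization, users can judge from the summary whether an article is worth reading. This not only reduces the time users spend browsing documents but also speeds up their searches. This study analyzes news content from Yahoo News using the Academia Sinica part-of-speech tag set, extracts core keywords, and converts each sentence into a keyword string. The core keywords are then expanded using characteristics of Chinese grammar and a synonym thesaurus. Next, the expanded keyword set serves as the basis for selecting a keyword-based summary, and the concept proposed by [Yihong Gong, Xin Liu, 2001] is used to select a latent semantic analysis summary. The study integrates these two summary results, taking readability into account, to produce a single summary for the user to read.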
With the growing popularity of the Internet, information overload has become a major problem, and people have to spend more and more time looking for the information they need. In recent years, search engines have been used in many ways and for many purposes, so a system that can reduce the amount of content without losing its principal meaning is necessary. In this research, the application domain is Internet news summarization, and the corpus was collected from Yahoo. We use CKIP (Chinese Knowledge and Information Processing) to perform POS tagging. Based on the POS tags, the system analyzes and extracts the core keywords and converts each sentence into a keyword string. Keyword expansion is then performed based on the Chinese semantic architecture and HowNet. After the expansion, each core keyword is given a weight according to its type, and the weight of each sentence is obtained by summing the weights of the keywords it contains. The sentences are then ranked by weight to obtain a core summary set. We also use the linear-algebra idea proposed by [Yihong Gong, Xin Liu, 2001] to build an auxiliary summary set that captures information the topic-based approach may miss, making the summary more complete. Finally, the system integrates the two summary sets into a single summary and takes readability into account so that the whole summary reads fluently.
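The sentence-weighting step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the weight table `TYPE_WEIGHTS`, the type labels `"core"` and `"expanded"`, the function names, and the choice to return selected sentences in document order are all assumptions made for illustration, since the abstract does not state the actual weights or selection rule.

```python
from typing import Dict, List

# Hypothetical weight per keyword type; the abstract does not give the real values.
TYPE_WEIGHTS: Dict[str, float] = {"core": 2.0, "expanded": 1.0}


def sentence_weight(keywords: List[str], keyword_types: Dict[str, str]) -> float:
    """Sum the weights of the keywords in one sentence's keyword string.

    Keywords with no known type contribute 0.
    """
    return sum(TYPE_WEIGHTS.get(keyword_types.get(k, ""), 0.0) for k in keywords)


def core_summary(
    sentences: List[List[str]], keyword_types: Dict[str, str], k: int
) -> List[int]:
    """Return the indices of the k highest-weighted sentences.

    Ties are broken by earlier position. Indices are returned in document
    order, an assumption made here to keep the selected sentences readable.
    """
    weights = [sentence_weight(s, keyword_types) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: (-weights[i], i))
    return sorted(ranked[:k])


# Each sentence is already reduced to a keyword string (a list of keywords).
sentences = [["election", "vote"], ["weather"], ["election", "ballot", "vote"]]
keyword_types = {"election": "core", "vote": "core", "ballot": "expanded"}
print(core_summary(sentences, keyword_types, 2))
```

Here the three sentences receive weights 4.0, 0.0, and 5.0, so the first and third sentences form the core summary set. The auxiliary set based on [Yihong Gong, Xin Liu, 2001] would be computed separately and merged with this one.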
Appears in Collections: Thesis