Text Summarization

Sistem Peringkas Teks Otomatis Menggunakan Algoritme Page Rank

Zikra, Hilma. 2009.

With the rapid growth of the World Wide Web, a huge amount of information is available and accessible online. People do not have the time to read everything, yet they have to make critical decisions based on whatever information is available. Automatic text summarization is a technique that helps people digest the vast amount of information online. Recently, a number of graph-based approaches, such as PageRank, LexRank, and TextRank, have been suggested for text summarization. Automatic text summarization is a computerized process of distilling the most important information from a source (or sources) to produce a brief version of the text(s). This research implements a graph-based summarization algorithm that uses sentence similarity and the graph-based ranking concept to rank sentences. The process produces an extractive summary that consists of high-ranking sentences. The graph-based method applied is PageRank, combined with cosine and content overlap similarity. The summaries are evaluated by three human judges, and their judgements are compared using the kappa measure. The results of our experiment show that the accuracy of PageRank with cosine similarity is better than that of PageRank with content overlap similarity. Including similarity to the title produced better results than omitting it.
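
The ranking step described in this abstract can be illustrated with a minimal sketch. It assumes bag-of-words term-frequency vectors, a damping factor of 0.85, and a fixed number of iterations; none of these settings come from the thesis, and the function names are hypothetical. It is an illustration of weighted PageRank over a cosine-similarity sentence graph, not the author's implementation.

```python
import math
import re
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with weighted PageRank over a similarity graph.

    The damping factor and iteration count are assumed defaults.
    """
    n = len(sentences)
    if n == 0:
        return []
    vectors = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    # Edge weight between two sentences is their cosine similarity;
    # self-loops are excluded.
    w = [[cosine_similarity(vectors[i], vectors[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                w[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores


def summarize(sentences, k=2):
    """Return the k highest-scoring sentences in their original order."""
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Replacing cosine_similarity with a content overlap function would give the second variant compared in the abstract. Title similarity could be added as an extra term in each sentence's score.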


Pembobotan Fitur pada Peringkasan Teks Bahasa Indonesia Menggunakan Algoritme Genetika

Aristoteles. 2011.

This thesis performs text feature weighting for the summarization of Indonesian-language documents using a genetic algorithm. There are eleven text features, i.e., sentence position (f1), positive keywords in sentence (f2), negative keywords in sentence (f3), sentence centrality (f4), sentence resemblance to the title (f5), sentence inclusion of named entities (f6), sentence inclusion of numerical data (f7), sentence relative length (f8), bushy path of the node (f9), summation of similarities for each node (f10), and latent semantic feature (f11). We investigate the effect of the first ten sentence features on the summarization task. Then, we use the latent semantic feature to increase the accuracy. All feature score functions are used to train a genetic algorithm model to obtain a suitable combination of feature weights. The summaries are evaluated using the F-measure, which is directly related to the compression rate. The results showed that adding f11 increases the F-measure by 3.26% and 1.55% for compression rates of 10% and 30%, respectively. On the other hand, it decreases the F-measure by 0.58% for a compression rate of 20%. Analysis of the feature weights showed that using only f2, f4, f5, and f11 can deliver performance similar to that of using all eleven features.
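
As a rough illustration of how a candidate weight vector might be evaluated, the sketch below scores each sentence as a weighted sum of its feature values, keeps the top fraction given by the compression rate, and computes the F-measure against a reference summary. The weighted-sum scoring, the rounding rule, and all names are assumptions made for illustration. The abstract does not specify the exact fitness function used by the genetic algorithm.

```python
def select_sentences(features, weights, compression_rate):
    """Pick sentence indices by weighted feature score.

    features: one list of feature values per sentence (assumed in [0, 1]).
    weights: one weight per feature, e.g. a candidate from a genetic algorithm.
    compression_rate: fraction of sentences to keep, e.g. 0.1, 0.2, 0.3.
    """
    scores = [sum(w * f for w, f in zip(weights, row)) for row in features]
    k = max(1, round(compression_rate * len(features)))
    ranked = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def f_measure(system, reference):
    """Harmonic mean of precision and recall over selected sentence indices."""
    overlap = len(system & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(system)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: four sentences, two features each.
features = [[0.9, 0.1], [0.2, 0.9], [0.1, 0.1], [0.4, 0.4]]
selected = select_sentences(features, weights=[0.5, 0.5], compression_rate=0.5)
score = f_measure(selected, reference={1, 3})
```

A genetic algorithm would call a function like f_measure as the fitness of each candidate weight vector and keep the vectors that score highest on the training documents.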


Sistem Peringkasan Dokumen Berita Bahasa Indonesia Menggunakan Metode Regresi Logistik Biner

Marlina, Meri. 2012.

This thesis performs text feature weighting for the summarization of Indonesian-language documents using binary logistic regression. There are ten text features, i.e., sentence position (f1), positive keywords in sentence (f2), negative keywords in sentence (f3), sentence centrality (f4), sentence resemblance to the title (f5), sentence inclusion of named entities (f6), sentence inclusion of numerical data (f7), sentence relative length (f8), bushy path of the node (f9), and summation of similarities for each node (f10). These ten features are used as the independent variables in the binary logistic regression. An output value of 0 denotes that the sentence is not included in the summary, and an output value of 1 denotes that it is included. To evaluate the text summarization, we use N-grams with a compression rate of 30%. The results show that the accuracy of this method is 42.84%.
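
The decision rule of a fitted binary logistic regression model can be sketched as follows. The coefficients, intercept, and 0.5 threshold below are hypothetical placeholders, not values reported in the thesis.

```python
import math


def summary_probability(features, coefficients, intercept):
    """Logistic model: P(sentence in summary) = 1 / (1 + exp(-z))."""
    z = intercept + sum(b * f for b, f in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


def in_summary(features, coefficients, intercept, threshold=0.5):
    """Return 1 if the sentence is included in the summary, 0 otherwise."""
    p = summary_probability(features, coefficients, intercept)
    return 1 if p >= threshold else 0


# Hypothetical two-feature model for illustration only.
coefficients = [2.0, 1.0]
intercept = -1.0
label_high = in_summary([1.0, 1.0], coefficients, intercept)
label_low = in_summary([0.0, 0.0], coefficients, intercept)
```

In the setting of the abstract, features would hold the ten values f1 to f10 for one sentence, and the coefficients would be estimated from training documents.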


Perbandingan Kinerja Algoritme TextRank dengan Algoritme LexRank pada Peringkasan Dokumen Bahasa Indonesia

Marsyah, Yuzar. 2013.

Text summarization is an effective way to obtain information from a document without reading the whole document. Currently, few automatic text summarizers are available for Indonesian compared with other languages. This study develops an automatic text summarizer for Indonesian using two graph-based methods, the TextRank algorithm and the LexRank algorithm. Both algorithms have been shown to perform well in summarizing English texts. This work extracts about 25% of the sentences in the document. The kappa measure is applied to evaluate the results of the two algorithms. The results based on the kappa measure show that the TextRank algorithm performs better than the LexRank algorithm.
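
The abstract does not state which kappa variant was used. As one possibility, the sketch below computes Cohen's kappa for two judges who each label sentences as relevant (1) or not relevant (0). The two-judge setup and the function name are assumptions; a study with more than two judges might use Fleiss' kappa instead.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Agreement between two judges, corrected for chance agreement."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement if the judges labeled independently.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgements for four sentences.
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

A value of 1 means perfect agreement, and a value of 0 means agreement no better than chance.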


Peringkasan Dokumen Bahasa Indonesia Menggunakan Logika Fuzzy

Gerbawani, R. Ahmad Somadi. 2013.

Text summarization is needed to help interpret the large volumes of information in documents. Automatic text summarization is the process of creating a shorter version of a source document by machine. The goal is to present the most important information and help the user understand the large volume of information in the document. This research proposes automatic text summarization using a Fuzzy Inference System (FIS), because the level of importance of sentences in a document is uncertain (fuzzy). The advantage of fuzzy logic is its capacity for linguistic reasoning, so it does not require any mathematical equations. The simulation is conducted on 50 test data items. The best average accuracy in this research is 50.58%, and the best accuracy for single-document summarization is 100%.


Peringkasan Teks Bahasa Indonesia dengan Pemilihan Fitur C4.5 dan Klasifikasi Naive Bayes

Wibowo, Septiandi. 2013.

This research summarizes Indonesian text documents using the naive Bayes (NB) classification method. Segmentation of the documents into sentences and feature computation are the initial stages of training the system to determine which sentences are classified as summary sentences. The classification used 11 features (f1-f11). The features are selected using a C4.5 decision tree to determine which features affect the summary, reduce the number of features, and speed up the summarization. The accuracy of summarization using 10 features (f1-f10) was 34.63%, 37.96%, and 28.14% for compression rates (CR) of 10%, 20%, and 30%, respectively. Adding f11 and C4.5 produced an accuracy of 52.45%, 51.49%, and 51.35% for CR of 10%, 20%, and 30%, respectively. Text summarization using NB classification, C4.5 feature selection, and the additional f11 feature produced better accuracy and faster summarization.


Peringkas Dokumen Berbahasa Indonesia Berbasis Kata Benda dengan BM25

Pinandhita, Rendy Rivaldi. 2013.

This research develops noun-based summarization of Indonesian documents. The problem addressed in this study is that the high number of digital documents makes it difficult for readers to find the desired information. We use cosine similarity, content overlap, and Okapi BM25 in the summarization. This research used newspaper articles from previous research. Before the similarities were calculated, the documents were preprocessed using a stoplist, stemming, and noun selection. Then, the documents were ranked using PageRank. We used the kappa measure to evaluate the level of agreement among evaluators in assessing the relevance of the summaries. The Dice coefficient was used to compare the automatic summaries with manual ones. Based on the observations, we find that Okapi BM25 is better than cosine similarity and content overlap.
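
The comparison between automatic and manual summaries can be sketched with the Dice coefficient over sets of selected sentence indices. Representing each summary as a set of sentence indices, and returning 0.0 when both sets are empty, are assumptions made for this illustration.

```python
def dice_coefficient(automatic, manual):
    """Dice coefficient: 2 * |A and B| / (|A| + |B|)."""
    total = len(automatic) + len(manual)
    if total == 0:
        return 0.0  # convention chosen for this sketch
    return 2 * len(automatic & manual) / total


# Hypothetical summaries: indices of the sentences each one selected.
automatic_summary = {0, 2, 5}
manual_summary = {2, 5, 7}
similarity = dice_coefficient(automatic_summary, manual_summary)
```

The result ranges from 0 (no sentences in common) to 1 (identical selections).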