• Title/Summary/Keyword: Paragraph Extraction

Keyword Analysis Based Document Compression System

  • Cao, Kerang;Lee, Jongwon;Jung, Hoekyung
    • Journal of information and communication convergence engineering
    • /
    • v.16 no.1
    • /
    • pp.48-51
    • /
    • 2018
  • Traditional document analysis was word-centered and implemented with a morpheme analyzer. Such systems can classify the words used in a document but do not help the user understand or analyze it. To solve this problem, a system needs to extract the most valuable paragraphs, those that help the user understand the document. In this paper, we propose a system that extracts paragraphs from a normalized XML document. The user enters the filename of the XML document to be analyzed. The system then searches the document for keywords and shows the keywords it found. When the user selects and enters the desired keywords, the system extracts the paragraphs that contain them. After extraction, the system preserves the original paragraph order and checks for duplicates, deleting any duplicated paragraphs. Finally, the system reports the frequency and weight of each keyword to the user, along with the sorted paragraphs.
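
The workflow in this abstract (find paragraphs containing the chosen keywords, keep their order, drop duplicates, report keyword frequency and weight) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `<p>` tag name, the sample document, the function names, and the definition of weight as a keyword's share of all keyword hits are assumptions made here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document; the real system reads a normalized XML file.
SAMPLE_XML = """<doc>
  <p>XML documents store structured text.</p>
  <p>Keyword analysis finds frequent terms.</p>
  <p>XML documents store structured text.</p>
  <p>Paragraph extraction keeps paragraphs that mention a keyword.</p>
</doc>"""


def extract_paragraphs(xml_text, keywords, tag="p"):
    """Return paragraphs containing any keyword, in document order, without duplicates."""
    root = ET.fromstring(xml_text)
    seen = set()
    kept = []
    for node in root.iter(tag):
        # Collapse whitespace so identical paragraphs compare equal.
        text = " ".join("".join(node.itertext()).split())
        if text in seen:
            continue  # duplicate paragraph: skip it
        lowered = text.lower()
        if any(k.lower() in lowered for k in keywords):
            seen.add(text)
            kept.append(text)
    return kept


def keyword_stats(paragraphs, keywords):
    """Map each keyword to (frequency, weight); weight is an assumed share of all hits."""
    counts = Counter()
    for text in paragraphs:
        lowered = text.lower()
        for k in keywords:
            counts[k] += lowered.count(k.lower())
    total = sum(counts.values())
    return {k: (counts[k], counts[k] / total if total else 0.0) for k in keywords}


if __name__ == "__main__":
    paragraphs = extract_paragraphs(SAMPLE_XML, ["xml", "keyword"])
    for p in paragraphs:
        print(p)
    print(keyword_stats(paragraphs, ["xml", "keyword"]))
```

Matching is done by plain substring search here; a system for Korean text would more plausibly match on morpheme-analyzer output.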

A Deep Learning-based Article- and Paragraph-level Classification

  • Kim, Euhee
    • Journal of the Korea Society of Computer and Information
    • /
    • v.23 no.11
    • /
    • pp.31-41
    • /
    • 2018
  • Text classification has long been studied in the Natural Language Processing field. In this paper, we propose an article- and paragraph-level genre classification system using Word2Vec-based LSTM, GRU, and CNN models for large-scale English corpora. Both article- and paragraph-level classification achieved the highest accuracy with LSTM, followed by GRU and then CNN. This result suggests that, when pre-trained Word2Vec-based word embeddings are used in both article- and paragraph-level classification tasks, word-sequence information for articles is more useful than word-feature extraction for paragraphs.

A Study on Keyword Extraction From a Single Document Using Term Clustering (용어 클러스터링을 이용한 단일문서 키워드 추출에 관한 연구)

  • Han, Seung-Hee
    • Journal of the Korean Society for Library and Information Science
    • /
    • v.44 no.3
    • /
    • pp.155-173
    • /
    • 2010
  • In this study, a new keyword extraction algorithm based on term clustering is applied to a single document. The document is divided into multiple passages, and two ways of calculating the similarity between two terms are investigated: first-order similarity and second-order distributional similarity. In the first experiment, the best clustering performance is achieved with 50-term passages and second-order distributional similarity. Based on this result, second-order distributional similarity was also applied to various keyword extraction methods that use statistical information about terms. In the second experiment, pf (paragraph frequency) and $tf \times ipf$ (term frequency by inverse paragraph frequency) were found to improve the overall performance of keyword extraction. These results show that the algorithm fulfills the conditions that good keywords should satisfy.
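
As a rough illustration of the $tf \times ipf$ weighting named in this abstract, the sketch below scores each term by its frequency in the whole document multiplied by an inverse paragraph frequency. The abstract does not give the exact formula, so the form ipf = log(N / pf), chosen by analogy with the usual inverse document frequency, is an assumption, as are the function name and the whitespace tokenizer.

```python
import math
from collections import Counter


def tf_ipf(paragraphs):
    """Score each term by tf * log(N / pf) over the paragraphs of one document.

    tf: occurrences of the term in the whole document
    pf: number of paragraphs that contain the term
    N:  number of paragraphs
    """
    tokenized = [p.lower().split() for p in paragraphs]
    n = len(tokenized)
    tf = Counter(term for tokens in tokenized for term in tokens)
    pf = Counter(term for tokens in tokenized for term in set(tokens))
    return {term: tf[term] * math.log(n / pf[term]) for term in tf}


if __name__ == "__main__":
    scores = tf_ipf(["cat cat dog", "dog bird", "dog fish"])
    for term, score in sorted(scores.items(), key=lambda item: -item[1]):
        print(f"{term}: {score:.3f}")
```

Under this assumed formula, a term that appears in every paragraph gets a score of zero, while a term concentrated in a few paragraphs is ranked higher.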

XML Document Keyword Weight Analysis based Paragraph Extraction Model (XML 문서 키워드 가중치 분석 기반 문단 추출 모델)

  • Lee, Jongwon;Kang, Inshik;Jung, Hoekyung
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.21 no.11
    • /
    • pp.2133-2138
    • /
    • 2017
  • Existing analysis of XML documents and other documents has been centered on words. Such analysis can be implemented using a morpheme analyzer, but it only classifies the words in the document and cannot grasp the document's core content. For a user to understand a document efficiently, the paragraphs containing the main words must be extracted and presented to the user. The proposed system retrieves keywords from a normalized XML document. It then extracts the paragraphs containing the keyword entered by the user and displays them. In addition, the frequency and weight of the keywords used in the search are reported to the user, the order of the extracted paragraphs is preserved, and duplicated paragraphs are removed so that the user can understand the document. The proposed system can reduce the time and effort required to understand a document by allowing the user to grasp it without reading it in full.

Keyword Weight based Paragraph Extraction Algorithm (키워드 가중치 기반 문단 추출 알고리즘)

  • Lee, Jongwon;Joo, Sangwoong;Lee, Hyunju;Jung, Hoekyung
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • 2017.10a
    • /
    • pp.504-505
    • /
    • 2017
  • Existing morpheme analyzers classify the words used in a document. Systems that extract sentences and paragraphs based on a morpheme analyzer are being developed, but very few systems compress documents and extract important paragraphs. The algorithm proposed in this paper calculates the weights of the keywords in the document and extracts the paragraphs containing those keywords. Users can reduce the time needed to understand the document by reading only the paragraphs containing the keywords rather than the entire document. In addition, since the number of extracted paragraphs varies with the number of keywords used in the search, the user can search in more varied patterns than with existing systems.

Korean Summarization System using Automatic Paragraphing (단락 자동 구분을 이용한 문서 요약 시스템)

  • 김계성;이현주;이상조
    • Journal of KIISE:Software and Applications
    • /
    • v.30 no.7_8
    • /
    • pp.681-686
    • /
    • 2003
  • In this paper, we describe a system that extracts important sentences from Korean newspaper articles using automatic paragraphing. First, we detect words repeated between sentences. By observing the repeated words, the system computes the Closeness Degree between Sentences (CDS) from the degree of morphological agreement and the change of grammatical role. It then automatically divides the document into meaningful paragraphs, using the number of paragraphs specified by the user. Finally, it selects one representative sentence from each paragraph and generates a summary from these representative sentences. Although our system does not use features such as the title, sentence position, or rhetorical structure, it is able to extract meaningful sentences for inclusion in the summary.
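
The overall pipeline described here (score the closeness of adjacent sentences, cut the text into a user-chosen number of paragraphs at the weakest links, pick one representative sentence per paragraph) can be sketched as below. This is a deliberately simplified stand-in: closeness is approximated by the number of shared words, whereas the paper's CDS uses morphological agreement and changes of grammatical role. All function names are hypothetical.

```python
def closeness(a, b):
    """Assumed closeness measure: number of distinct words two sentences share."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def split_into_paragraphs(sentences, n_paragraphs):
    """Cut the sentence sequence at the n_paragraphs - 1 weakest adjacent links."""
    links = [closeness(sentences[i], sentences[i + 1])
             for i in range(len(sentences) - 1)]
    weakest = sorted(range(len(links)), key=lambda i: links[i])[:n_paragraphs - 1]
    paragraphs, start = [], 0
    for i in sorted(weakest):
        paragraphs.append(sentences[start:i + 1])
        start = i + 1
    paragraphs.append(sentences[start:])
    return paragraphs


def summarize(sentences, n_paragraphs):
    """Pick, per paragraph, the sentence sharing the most words with its neighbours."""
    summary = []
    for paragraph in split_into_paragraphs(sentences, n_paragraphs):
        best = max(
            paragraph,
            key=lambda s: sum(closeness(s, t) for t in paragraph if t is not s),
        )
        summary.append(best)
    return summary


if __name__ == "__main__":
    text = [
        "the cat sat",
        "the cat slept",
        "stocks fell today",
        "stocks fell again",
    ]
    print(split_into_paragraphs(text, 2))
    print(summarize(text, 2))
```

The number of paragraphs is taken from the user, as in the abstract, so the summary length is controlled directly by that parameter.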

Keyword Weight based Paragraph Extraction Algorithm (문단 가중치 분석 기반 본문 영역 선정 알고리즘)

  • Lee, Jongwon;Yu, Seongjong;Kim, Doan;Jung, Hoekyung
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • 2018.05a
    • /
    • pp.462-463
    • /
    • 2018
  • Traditional document analysis systems relied on word-based analysis using a morphological analyzer or the TF-IDF technique. These systems have the advantage of being able to derive the main keywords by calculating keyword weights. However, they are not well suited to analyzing the content of documents because of their structural limitations. To solve this problem, the proposed algorithm calculates the weights of the paragraphs in the document and divides the paragraphs into areas. It then calculates the importance of each area and informs the user of the area containing the most important paragraphs in the document. The user is thus expected to receive a service better suited to analyzing documents than existing document analysis systems.

Sentences Extraction System using Automatic Division of Paragraph (단락 자동 구분을 통한 중요 문자 추출)

  • 김계성;이현주;정영규;서연경;손기준;이상조
    • Proceedings of the Korean Society for Cognitive Science Conference
    • /
    • 2000.06a
    • /
    • pp.233-237
    • /
    • 2000
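  • This paper proposes a system that extracts important sentences through automatic paragraph division. First, it analyzes the patterns of recurring words by identifying whether a word reappears, the degree of lexical agreement, and changes in the word's role, and from this it quantitatively computes the closeness between sentences. Next, using the measured closeness between sentences, it divides the text into paragraphs according to the extraction range requested by the user and selects a representative sentence from each paragraph to generate the final summary. The proposed method does not use information such as the document title, sentence position, or rhetorical structure, and it can generate higher-quality summaries than existing statistical methods that rely only on word occurrence frequency. In addition, the proposed methodology can be applied not only to newspaper articles, the domain targeted in this paper, but also to other domains.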

Deep Learning Document Analysis System Based on Keyword Frequency and Section Centrality Analysis

  • Lee, Jongwon;Wu, Guanchen;Jung, Hoekyung
    • Journal of information and communication convergence engineering
    • /
    • v.19 no.1
    • /
    • pp.48-53
    • /
    • 2021
  • Herein, we propose a document analysis system that analyzes papers or reports transformed into XML(Extensible Markup Language) format. It reads the document specified by the user, extracts keywords from the document, and compares the frequency of keywords to extract the top-three keywords. It maintains the order of the paragraphs containing the keywords and removes duplicated paragraphs. The frequency of the top-three keywords in the extracted paragraphs is re-verified, and the paragraphs are partitioned into 10 sections. Subsequently, the importance of the relevant areas is calculated and compared. By notifying the user of areas with the highest frequency and areas with higher importance than the average frequency, the user can read only the main content without reading all the contents. In addition, the number of paragraphs extracted through the deep learning model and the number of paragraphs in a section of high importance are predicted.
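
The section analysis in this abstract (take the top-three keywords, partition the extracted paragraphs into 10 sections, then flag the highest-scoring section and the sections above average) might look roughly like the sketch below. It covers only the frequency-based part, not the deep learning prediction. The importance measure used here, a plain count of keyword occurrences per section, and all function names are assumptions.

```python
from collections import Counter


def top_keywords(paragraphs, k=3, stopwords=frozenset()):
    """Return the k most frequent words across all paragraphs."""
    counts = Counter(
        word
        for p in paragraphs
        for word in p.lower().split()
        if word not in stopwords
    )
    return [word for word, _ in counts.most_common(k)]


def split_sections(items, n_sections=10):
    """Partition items into n_sections contiguous, near-equal sections."""
    size, extra = divmod(len(items), n_sections)
    sections, start = [], 0
    for i in range(n_sections):
        end = start + size + (1 if i < extra else 0)
        sections.append(items[start:end])
        start = end
    return sections


def section_scores(paragraphs, keywords, n_sections=10):
    """Assumed importance: total keyword occurrences in each section."""
    scores = []
    for section in split_sections(paragraphs, n_sections):
        words = [w for p in section for w in p.lower().split()]
        scores.append(sum(words.count(k) for k in keywords))
    return scores


def highlight(scores):
    """Return the index of the top section and the indexes scoring above the mean."""
    mean = sum(scores) / len(scores)
    best = max(range(len(scores)), key=scores.__getitem__)
    above_average = [i for i, s in enumerate(scores) if s > mean]
    return best, above_average


if __name__ == "__main__":
    paragraphs = ["x"] * 20
    paragraphs[4] = "k k k"
    paragraphs[5] = "k"
    paragraphs[10] = "k"
    scores = section_scores(paragraphs, ["k"])
    print(scores)
    print(highlight(scores))
```

With fewer than 10 paragraphs some sections are empty and score zero, so a real system would need to decide how to handle short documents.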

A Study on the Definition of the Educational Facility Maintenance (교육시설물 유지관리 업무규명에 관한 연구)

  • Shon Woo-Kyung;Kim Jang-Young;Han Choong-Hee;Kim Sun-kuk
    • Proceedings of the Korean Institute Of Construction Engineering and Management
    • /
    • autumn
    • /
    • pp.567-570
    • /
    • 2002
  • In Korea, since the mid-1990s, spending on facility expansion to improve the educational environment and on maintenance and repair work for aging educational facilities has been increasing. Maintenance of educational facilities is carried out separately by each responsible party, with fragmented information and insufficient information standards. This leads to duplicated investment and the waste of limited financial resources. The information generated and required in the work process must therefore be clearly defined. We examine the current state of educational facility maintenance and present an analysis of its problems and obstacle factors. In future work, an improved process model and an information model should be built on the basis of the extracted obstacle factors.
