• Title/Summary/Keyword: Document Analysis

Search results: 1,170

Case Study on Public Document Classification System That Utilizes Text-Mining Technique in BigData Environment (빅데이터 환경에서 텍스트마이닝 기법을 활용한 공공문서 분류체계의 적용사례 연구)

  • Shim, Jang-sup;Lee, Kang-wook
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • 2015.10a
    • /
    • pp.1085-1089
    • /
    • 2015
  • In the past, text-mining techniques were hard to put into practice because of the complexity of text and the high degree of freedom of its variables: the algorithms demanded great effort to produce meaningful results, and mechanical text analysis often took longer than analysis by hand. With advances in hardware and analysis algorithms, however, big data technology has emerged, largely resolving these problems, and analysis through text mining is now recognized as valuable. Applying text mining to Korean text nevertheless remains at an early stage because of the linguistic characteristics of the Korean language. If not only data searching but also analysis through text mining becomes possible, the human and material cost of text analysis can be saved, leading to efficient resource utilization in many areas of public work. In this paper, we therefore compare and evaluate manual public-document classification against classification that uses text-mining-based term weighting (TF-IDF) and the cosine similarity between documents in a big data environment.

  • PDF
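The TF-IDF weighting and cosine similarity that the abstract above combines can be sketched in a few lines. This is a minimal pure-Python illustration of the general technique, not the paper's implementation (its tokenization, in particular Korean morphological analysis, and its decision thresholds are not given in the abstract):

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Compute TF-IDF weights for a list of tokenized documents."""
    n = len(docs)
    # document frequency: number of docs each term appears in
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        vectors.append(vec)
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# hypothetical toy corpus: two related documents and one unrelated one
docs = [
    ["budget", "report", "public", "works"],
    ["budget", "plan", "public", "hearing"],
    ["image", "filter", "pixel", "region"],
]
vecs = tf_idf_vectors(docs)
# documents sharing vocabulary score higher than unrelated ones
related, unrelated = cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2])
```

Classifying a new document then amounts to assigning it the category of its most cosine-similar labeled documents.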

Document Clustering using Term reweighting based on NMF (NMF 기반의 용어 가중치 재산정을 이용한 문서군집)

  • Lee, Ju-Hong;Park, Sun
    • Journal of the Korea Society of Computer and Information
    • /
    • v.13 no.4
    • /
    • pp.11-18
    • /
    • 2008
  • Document clustering is an important method for document analysis and is used in many information retrieval applications. This paper proposes a new document clustering model that uses term re-weighting based on NMF (non-negative matrix factorization) to cluster documents relevant to a user's requirement. The proposed model re-weights terms using user feedback in order to reduce the gap between the user's intent for document classification and the clusters produced by the machine. The method can improve the quality of document clustering because the re-weighted terms, together with the semantic feature matrix and the semantic variable matrix used in clustering, better represent the inherent structure of the document set. Experimental results demonstrate that applying the proposed method achieves better performance than existing document clustering methods.

  • PDF
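The NMF factorization underlying the clustering above can be sketched with the standard Lee-Seung multiplicative updates. This is a generic textbook implementation under stated assumptions, not the authors' code: the paper's user-feedback term re-weighting is omitted, and documents are simply assigned to the semantic feature with the largest loading in H.

```python
import random

def nmf(V, k, iters=300, seed=0):
    """Factor a non-negative matrix V (m x n, list of lists) into
    W (m x k) and H (k x n) with multiplicative updates that
    minimize the squared reconstruction error."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() for _ in range(k)] for _ in range(m)]
    H = [[rng.random() for _ in range(n)] for _ in range(k)]
    eps = 1e-9  # avoid division by zero

    def matmul(A, B):
        return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
                 for j in range(len(B[0]))] for i in range(len(A))]

    def T(A):
        return [list(row) for row in zip(*A)]

    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        WtV, WtWH = matmul(T(W), V), matmul(T(W), matmul(W, H))
        H = [[H[i][j] * WtV[i][j] / (WtWH[i][j] + eps) for j in range(n)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        VHt, WHHt = matmul(V, T(H)), matmul(W, matmul(H, T(H)))
        W = [[W[i][j] * VHt[i][j] / (WHHt[i][j] + eps) for j in range(k)]
             for i in range(m)]
    return W, H

# toy term-document matrix: 4 terms x 4 docs, two obvious topics
V = [[2, 2, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 3, 3],
     [0, 0, 1, 1]]
W, H = nmf(V, k=2)
# cluster each document by its dominant semantic feature
clusters = [max(range(2), key=lambda r: H[r][j]) for j in range(4)]
```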

The Region Analysis of Document Images Based on One Dimensional Median Filter (1차원 메디안 필터 기반 문서영상 영역해석)

  • 박승호;장대근;황찬식
    • Journal of the Institute of Electronics Engineers of Korea SP
    • /
    • v.40 no.3
    • /
    • pp.194-202
    • /
    • 2003
  • Converting printed images into electronic ones automatically requires region analysis of document images and character recognition. Region analysis segments a document image into detailed regions and classifies those regions into types such as text, picture, and table. It is difficult, however, to distinguish text from pictures exactly, because some regions are similar in size, density, and complexity of pixel distribution; this misclassification is the main obstacle to automatic conversion. In this paper, we propose a region analysis method that segments a document image into text and picture regions. The proposed method solves these problems by using a one-dimensional median filter for text/picture classification. The misclassification of boldface text and of picture regions such as graphs or tables, caused by the median filtering itself, is resolved by a skin-peeling filter and a maximal text length criterion. The resulting performance is better than that of previous methods, including commercial software.
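The one-dimensional median filter this abstract relies on is the standard sliding-window median: thin ink runs (text strokes) are suppressed while wide runs (pictures) survive. A minimal sketch; the window width here is illustrative, and the paper's skin-peeling filter and maximal-text-length test are not reproduced:

```python
import statistics

def median_filter_1d(signal, width=3):
    """Replace each sample with the median of its neighbourhood
    (window of odd width, clamped at the edges)."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(statistics.median_low(signal[lo:hi]))
    return out

# a thin stroke (isolated black pixel) is removed...
stroke = median_filter_1d([0, 0, 0, 1, 0, 0, 0], width=3)
# ...while a wide picture-like run keeps its interior intact
block = median_filter_1d([0, 1, 1, 1, 1, 1, 0], width=3)
```

Comparing the filtered profile against the original thus separates stroke-like (text) from block-like (picture) responses.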

Line Tracking Algorithm for Table Structure Analysis in Form Document Image (양식 문서 영상에서 도표 구조 분석을 위한 라인 추적 알고리즘)

  • Kim, Kye-Kyung
    • Journal of Software Assessment and Valuation
    • /
    • v.17 no.2
    • /
    • pp.151-159
    • /
    • 2021
  • To derive the grid lines needed to analyze a table layout, line-image enhancement techniques such as various filtering and morphology methods have been studied. Even with such enhancement, it is still hard to extract line components and to express a table's cell layout logically when lines contain gaps (cutting points) or the table is skewed. In this paper, we propose a line tracking algorithm that extracts line components despite gaps on the line or skewed lines. A table-layout analysis algorithm is then built by finding grid lines, line crossing points, and grid cells with the line tracking algorithm. Simulation results show that the proposed method achieves a 96.4% table-document analysis rate with an average processing time of 0.41 s.
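The gap-bridging idea behind the line tracking described above can be illustrated on a single image row: keep tracking a segment across short runs of missing pixels (cutting points). The `max_gap` and `min_len` parameters are hypothetical illustrations; the paper's actual tolerances and its handling of skew are not given in the abstract:

```python
def track_line(row, max_gap=2, min_len=5):
    """Scan one binary image row (list of 0/1 pixels) and return
    (start, end) spans of horizontal line segments, bridging gaps
    of up to max_gap background pixels."""
    segments = []
    start, gap = None, 0
    for x, px in enumerate(row):
        if px:
            if start is None:
                start = x          # a new segment begins
            gap = 0                # ink seen: reset the gap counter
        elif start is not None:
            gap += 1
            if gap > max_gap:      # gap too wide: close the segment
                end = x - gap
                if end - start + 1 >= min_len:
                    segments.append((start, end))
                start, gap = None, 0
    if start is not None:          # close a segment running to the edge
        end = len(row) - 1 - gap
        if end - start + 1 >= min_len:
            segments.append((start, end))
    return segments

# a line of 14 pixels with a 2-pixel cut in the middle
row = [1] * 6 + [0, 0] + [1] * 6
bridged = track_line(row, max_gap=2)   # cut is bridged: one segment
split = track_line(row, max_gap=1)     # stricter: two segments
```

Crossing points of tracked horizontal and vertical segments would then yield the grid cells of the table.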

Content-based Configuration Management System for Software Research and Development Document Artifacts

  • Baek, Dusan;Lee, Byungjeong;Lee, Jung-Won
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.10 no.3
    • /
    • pp.1404-1415
    • /
    • 2016
  • Because of properties of software such as invisibility, complexity, and changeability, software configuration management (SCM) of the artifacts generated during the software life cycle has been used to guarantee software quality. However, existing SCM systems focus only on code artifacts and software development documents such as the Software Requirements Specification (SRS), Software Design Description (SDD), and Software Test Description (STD). Moreover, research-oriented software projects produce their code artifacts and development documents late in the project. There is therefore a need to trace and manage software research documents, composed of highly abstract non-functional requirements such as 'the purpose of the project', 'the objectives', and 'the progress', during the long period before code and development documents are generated; existing SCM systems cannot do this. In this paper, we propose a content-based configuration management system, comprising a relevance-link generation phase and a content-based testing phase, to trace and manage these artifacts. Preliminary application results show the applicability and feasibility of the proposed system.

DP-LinkNet: A convolutional network for historical document image binarization

  • Xiong, Wei;Jia, Xiuhong;Yang, Dichun;Ai, Meihui;Li, Lirong;Wang, Song
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.15 no.5
    • /
    • pp.1778-1797
    • /
    • 2021
  • Document image binarization is an important pre-processing step in document analysis and archiving. The state-of-the-art models for document image binarization are variants of encoder-decoder architectures, such as FCN (fully convolutional network) and U-Net. Despite their success, they still suffer from three limitations: (1) reduced feature map resolution due to consecutive strided pooling or convolutions, (2) multiple scales of target objects, and (3) reduced localization accuracy due to the built-in invariance of deep convolutional neural networks (DCNNs). To overcome these three challenges, we propose an improved semantic segmentation model, referred to as DP-LinkNet, which adopts the D-LinkNet architecture as its backbone, with the proposed hybrid dilated convolution (HDC) and spatial pyramid pooling (SPP) modules between the encoder and the decoder. Extensive experiments are conducted on recent document image binarization competition (DIBCO) and handwritten document image binarization competition (H-DIBCO) benchmark datasets. Results show that our proposed DP-LinkNet outperforms other state-of-the-art techniques by a large margin. Our implementation and the pre-trained models are available at https://github.com/beargolden/DP-LinkNet.
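The gridding artifact that hybrid dilated convolution (HDC) is designed to avoid can be shown with a small receptive-field computation: stacking convolutions with one uniform dilation rate leaves input pixels unread, while a mixed schedule covers the receptive field densely. The rates (1, 2, 5) below follow the common HDC recipe and are an assumption for illustration; DP-LinkNet's actual rates are specified in the paper, not in this abstract:

```python
def receptive_positions(dilations, ksize=3):
    """1-D input offsets, relative to one output pixel, actually read
    by a stack of dilated convolutions with the given dilation rates."""
    taps = {0}
    for d in dilations:
        taps = {t + k * d
                for t in taps
                for k in range(-(ksize // 2), ksize // 2 + 1)}
    return taps

def coverage(dilations, ksize=3):
    """Fraction of the receptive field that is actually sampled."""
    taps = receptive_positions(dilations, ksize)
    span = range(min(taps), max(taps) + 1)
    return len(taps) / len(span)

grid_holes = coverage([2, 2, 2])   # uniform rates: gridding, holes remain
dense = coverage([1, 2, 5])        # HDC-style mixed rates: dense coverage
```

With 3-tap kernels the receptive field spans 1 + 2*(1+2+5) = 17 pixels for rates (1, 2, 5), and every pixel inside it is sampled.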

Design and Implementation of a Document-Oriented and Web-Based Nuclear Design Automation System (문서중심 및 웹기반 핵설계 자동화 시스템의 설계 및 구현)

  • Park, Yong-Soo;Kim, Jong-Kyung
    • The KIPS Transactions:PartD
    • /
    • v.11D no.6
    • /
    • pp.1319-1326
    • /
    • 2004
  • To automate nuclear design work, which is time-consuming and man-power intensive, the Innovative Design Processor ($IDP^{TM}$) is being developed. The two basic principles of IDP are document-oriented design and web-based design. In document-oriented design, the designer writes a design document called an active document and feeds it to a special program with a robust parser, which automatically produces the final document with complete analyses, tables, and plots. Active documents can be written with ordinary HTML/XML editors or created automatically on the web, which is the other framework of IDP. Using a proper mix of server-side and client-side programming under the LAMP (Linux/Apache/MySQL/PHP) environment, the design process on the web is modeled as a design wizard, so that even a novice designer can produce the design document easily.

The Analysis of the security requirements for a circulation of the classified documents (비밀문서유통을 위한 보안 요구사항 분석)

  • Lee, Ji-Yeong;Park, Jin-Seop;Kang, Seong-Ki
    • Journal of National Security and Military Science
    • /
    • s.1
    • /
    • pp.361-390
    • /
    • 2003
  • In this paper, we analyze the security requirements for the circulation of classified documents. Across all phases of document processing, including drafting, sending and receiving messages, approval, storing and saving, reading, examining, sending out, and canceling a document, we identify the accompanying threat factors and enumerate every security threat. We also propose an appropriate, corresponding security approach in a well-prepared way, and finally present security guidelines for the security architecture of classified-document circulation.

  • PDF

Document Layout Analysis Using Coarse/Fine Strategy (Coarse/fine 전략을 이용한 문서 구조 분석)

  • 박동열;곽희규;김수형
    • Proceedings of the IEEK Conference
    • /
    • 2000.06d
    • /
    • pp.198-201
    • /
    • 2000
  • We propose a method for analyzing document structure that consists of two processes, segmentation and classification. Segmentation first divides a low-resolution image coarsely and then finely splits the original document image using projection profiles. Classification determines each segmented region to be text, line, table, or image. An experiment with 238 document images shows a segmentation accuracy of 99.1% and a classification accuracy of 97.3%.

  • PDF
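The projection-profile splitting used in the coarse/fine segmentation above reduces, in one dimension, to counting ink pixels per row and cutting at empty runs. A minimal sketch of that primitive (the coarse low-resolution pass and the region classifier are omitted, and `min_gap` is an illustrative parameter):

```python
def horizontal_profile(image):
    """Number of black (1) pixels in each row of a binary image."""
    return [sum(row) for row in image]

def split_on_gaps(profile, min_gap=1):
    """Split row indices into (start, end) bands separated by runs
    of at least min_gap empty rows."""
    bands, start, gap = [], None, 0
    for i, v in enumerate(profile):
        if v > 0:
            if start is None:
                start = i          # a band of ink begins
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:     # enough blank rows: close the band
                bands.append((start, i - gap))
                start, gap = None, 0
    if start is not None:          # band running to the bottom edge
        bands.append((start, len(profile) - 1 - gap))
    return bands

# two text blocks separated by blank rows
page = [[1, 1, 1],
        [1, 1, 1],
        [0, 0, 0],
        [0, 0, 0],
        [1, 1, 0],
        [1, 1, 0],
        [1, 1, 0]]
profile = horizontal_profile(page)
bands = split_on_gaps(profile)
```

Applying the same split to the vertical profile of each band recursively yields the coarse block layout.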

Document Image Layout Analysis Using Image Filters and Constrained Conditions (이미지 필터와 제한조건을 이용한 문서영상 구조분석)

  • Jang, Dae-Geun;Hwang, Chan-Sik
    • The KIPS Transactions:PartB
    • /
    • v.9B no.3
    • /
    • pp.311-318
    • /
    • 2002
  • Document image layout analysis comprises a process that segments a document image into detailed regions and a process that classifies the segmented regions as text, picture, table, or other. In region classification, the size of a region, the density of black pixels, and the complexity of the pixel distribution are the bases of classification. For pictures, however, the ranges of these measures are so wide that it is difficult to set classification thresholds between pictures and the other region types; as a result, pictures suffer a higher classification error than other regions. In this paper, we propose a document image layout analysis method whose picture and text region classification performs better than previous methods, including commercial software. In picture/text classification, a median filter is used to reduce the influence of region size, black-pixel density, and pixel-distribution complexity. Furthermore, classification errors are corrected by a region-expanding filter and constrained conditions.