An Extraction Method of Bibliographic Information from the US Patents: Using an HTML Parsing Technique

A Study on a Method for Extracting Bibliographic Information from US Patents: Focusing on the Use of HTML Parsing Techniques

  • Han, Yoo-Jin (School of Global Service, Sookmyung Women's University)
  • Oh, Seung-Woo (Technology Management, Economics and Policy Program, Seoul National University)
  • Received : 2010.04.16
  • Accepted : 2010.06.13
  • Published : 2010.06.30

Abstract

This study presents a method for extracting the most recent information from US patent documents. It adopts an HTML parsing technique that connects directly to the US Patent and Trademark Office (USPTO) Web page. After a keyword search returns a list of 50 documents, the proposed algorithm uses HTML parsing to extract the patent number, the applicant, and the US patent class from that list. The study also presents an algorithm that extracts both a patent and its subsequent patents by using the close relationship between them, which is a distinctive characteristic of US patent documents. Although the proposed method has several limitations, it can effectively supplement existing databases in terms of timeliness and comprehensiveness.

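This study presents a method for extracting the most recent information from US patent documents. To this end, it proposes accessing the US Patent and Trademark Office Web page directly and parsing the HTML documents. After a search on a keyword of interest returns a list of 50 results, the proposed algorithm applies HTML parsing to extract key bibliographic information, such as the patent number, applicant, and US patent class, directly from that list. The study also shows an algorithm that simultaneously extracts the US patent classes of a patent and its subsequent patents, using the relationship between preceding and subsequent patents that US patent documents distinctively provide. Although the method presented here has some limitations, it can complement existing databases in terms of timeliness and comprehensiveness.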
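The extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not give the markup of the USPTO result page, so the table layout in `SAMPLE_HTML`, the field names, and the helper names `PatentRowParser` and `extract_records` are all assumptions made for illustration, and the patent numbers are made-up placeholders. The sketch uses Python's standard `html.parser` module and works on an inline string rather than connecting to the USPTO site.

```python
from html.parser import HTMLParser


class PatentRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td> cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows without <td> cells, e.g. the header
                self.rows.append(self._row)
            self._row = None


# Hypothetical markup and placeholder values, NOT the real USPTO page layout.
SAMPLE_HTML = """
<table>
  <tr><th>Patent No.</th><th>Applicant</th><th>US Class</th></tr>
  <tr><td><a href="#">0,000,001</a></td><td>Example Corp.</td><td>123/456</td></tr>
  <tr><td><a href="#">0,000,002</a></td><td>Sample Inc.</td><td>789/12</td></tr>
</table>
"""


def extract_records(html):
    """Return one dict per result row with the three fields named in the abstract."""
    parser = PatentRowParser()
    parser.feed(html)
    parser.close()
    fields = ("patent_number", "applicant", "us_class")
    return [dict(zip(fields, row)) for row in parser.rows if len(row) == len(fields)]


records = extract_records(SAMPLE_HTML)
for record in records:
    print(record)
```

Under the same assumptions, the patent-to-subsequent-patent step could reuse this parser on each linked page, provided the links in the first column are collected in `handle_starttag`.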
