• Title, Summary, Keyword: Web mining (웹마이닝)

Twitter Issue Tracking System by Topic Modeling Techniques (토픽 모델링을 이용한 트위터 이슈 트래킹 시스템)

  • Bae, Jung-Hwan;Han, Nam-Gi;Song, Min
    • Journal of Intelligence and Information Systems / v.20 no.2 / pp.109-122 / 2014
  • People now create a tremendous amount of data on social network services (SNS). In particular, the integration of SNS into mobile devices has led to massive data generation that greatly influences society. This phenomenon is unprecedented, and we now live in the Age of Big Data. SNS data qualifies as Big Data because it satisfies the three defining conditions: the amount of data (volume), the speed of data input and output (velocity), and the variety of data types (variety). Because SNS Big Data covers the whole of society, the trend of an issue discovered in it can serve as an important new source of value. In this study, we design and build a Twitter Issue Tracking System (TITS) to meet the need for analyzing SNS Big Data. TITS extracts issues from Twitter texts and visualizes them on the web. The proposed system provides four functions: (1) it provides the topic keyword set corresponding to the daily ranking; (2) it visualizes the daily time-series graph of a topic over a month; (3) it shows the importance of a topic through a treemap based on a scoring system and frequency; and (4) it visualizes the daily time-series graph of a keyword found by keyword search. The present study analyzes Big Data generated by SNS in real time. SNS Big Data analysis requires various natural language processing techniques, including stop-word removal and noun extraction, to process unrefined, unstructured data. It also requires recent big data technology that can rapidly process a large amount of real-time data, such as the Hadoop distributed system or NoSQL databases, which are an alternative to relational databases. We built TITS on Hadoop to optimize big data processing, because Hadoop is designed to scale from a single node to thousands of machines.
Furthermore, we use MongoDB, which is classified as a NoSQL database. MongoDB is an open-source, document-oriented database that provides high performance, high availability, and automatic scaling. Unlike existing relational databases, MongoDB has no fixed schema or tables, and its primary goals are data accessibility and data processing performance. Visualization is attractive to the Big Data community because it helps analysts examine such data easily and clearly. Therefore, TITS uses the d3.js library as its visualization tool. This library is designed for creating Data-Driven Documents that bind arbitrary data to the Document Object Model (DOM); it makes interaction with data easy and is useful for managing real-time data streams with smooth animation. In addition, TITS uses Bootstrap, a collection of pre-configured style sheets and JavaScript plug-ins, to build the web system. The TITS graphical user interface (GUI) is built with these libraries and can detect issues on Twitter in an easy and intuitive manner. The proposed work demonstrates the superiority of our issue detection techniques by matching detected issues with corresponding online news articles. The contributions of the present study are threefold. First, we suggest an alternative approach to real-time big data analysis, which has become an extremely important issue. Second, we apply a topic modeling technique that is used in various research areas, including Library and Information Science (LIS); based on this, we can confirm the utility of storytelling and time-series analysis. Third, we develop a web-based system and make it available for the real-time discovery of topics. The experiments used nearly 150 million tweets posted in Korea during March 2013.

Analysis of shopping website visit types and shopping pattern (쇼핑 웹사이트 탐색 유형과 방문 패턴 분석)

  • Choi, Kyungbin;Nam, Kihwan
    • Journal of Intelligence and Information Systems / v.25 no.1 / pp.85-107 / 2019
  • Online consumers either browse products belonging to a particular product line or brand with the intent to purchase, or simply navigate widely and leave without making a purchase. Research on the behavior and purchases of online consumers has progressed steadily, and related services and applications based on consumer behavior data have been developed in practice. In recent years, advances in big data technology have enabled customization strategies and recommendation systems, and attempts are being made to optimize users' shopping experience. Even so, online consumers who visit a website are very unlikely to proceed to the purchase stage. This is because online consumers do not visit a website only to purchase products; they use and browse websites differently according to their shopping motives and purposes. Therefore, to understand the behavior of online consumers, it is important to analyze the various types of visits, not only visits that lead to a purchase. In this study, we performed a session-level clustering analysis on the clickstream data of an e-commerce company to explain the diversity and complexity of online consumers' search behavior and to classify that behavior into types. For the analysis, we converted more than 8 million page-level records into visit-level sessions, yielding over 500,000 website visit sessions. For each visit session, 12 characteristics such as page views, duration, search diversity, and page type concentration were extracted for clustering. Considering the size of the data set, we used the Mini-Batch K-means algorithm, which offers advantages in learning speed and efficiency while maintaining clustering performance similar to that of standard K-means.
The optimal number of clusters was found to be four, and differences in session characteristics and purchase rates were identified for each cluster. An online consumer visits a website several times, learns about a product, and then decides whether to purchase. To analyze this purchasing process over several visits, we constructed visit sequence data for each consumer based on the navigation patterns derived from the clustering analysis. Each visit sequence contains the series of visits leading up to one purchase, and the items in a sequence are the cluster labels derived above. We built separate sequence data sets for consumers who made purchases and for consumers who only explored products without purchasing during the same period. Sequential pattern mining was then applied to extract frequent patterns from each data set. The minimum support was set to 10%, and each frequent pattern consists of a sequence of cluster labels. Some patterns are common to both data sets, while others are frequent in only one of them. A comparative analysis of the extracted frequent patterns showed that consumers who made purchases followed a visiting pattern of repeatedly searching for a specific product before deciding to purchase it. The implication of this study is that we analyze the search types of online consumers using large-scale clickstream data and analyze their patterns to explain the purchasing process from a data-driven point of view. Most studies on the typology of online consumers have focused on the characteristics of each type and on the key factors that distinguish the types.
In this study, we classified the behavior of online consumers into types and further analyzed the order in which these types follow one another to form a series of search patterns. In addition, online retailers will be able to improve their purchase conversion through marketing strategies and recommendations tailored to the various visit types, and to evaluate the effect of those strategies through changes in consumers' visit patterns.
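
The sequential pattern mining step described in this abstract can be illustrated with a minimal sketch. The abstract does not say which mining algorithm the authors used, so the brute-force enumeration below is an assumption chosen for brevity, and the function name `frequent_patterns`, the `max_len` cap, and the sample sequences are all hypothetical. Support is computed as the fraction of visit sequences that contain a pattern as an ordered (not necessarily contiguous) subsequence.

```python
from collections import Counter
from itertools import combinations


def frequent_patterns(sequences, min_support=0.10, max_len=3):
    """Return {pattern: support} for ordered subsequences meeting min_support.

    Brute-force illustration only; it enumerates every subsequence up to
    max_len items, which is impractical for long sequences.
    """
    counts = Counter()
    for seq in sequences:
        seen = set()
        for n in range(1, max_len + 1):
            # combinations() keeps index order, so each tuple is a subsequence
            for idx in combinations(range(len(seq)), n):
                seen.add(tuple(seq[i] for i in idx))
        counts.update(seen)  # count each pattern at most once per sequence
    total = len(sequences)
    return {p: c / total for p, c in counts.items() if c / total >= min_support}


# Hypothetical visit sequences; each item is a session cluster label (0-3).
purchase_sequences = [
    [0, 2, 2, 3],
    [1, 2, 3],
    [2, 2, 3],
    [0, 1],
]
# A higher threshold than the paper's 10% is used here because the toy data is tiny.
patterns = frequent_patterns(purchase_sequences, min_support=0.5)
```

Running the same function separately on purchaser and non-purchaser sequences and comparing the resulting key sets would mirror the comparison the abstract describes.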

Improving Performance of Recommendation Systems Using Topic Modeling (사용자 관심 이슈 분석을 통한 추천시스템 성능 향상 방안)

  • Choi, Seongi;Hyun, Yoonjin;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.21 no.3 / pp.101-116 / 2015
  • Recently, with the development of smart devices and social media, vast amounts of information in various forms have accumulated. In particular, considerable research effort is being directed toward analyzing unstructured big data to resolve various social problems. Accordingly, the focus of data-driven decision-making is moving from structured to unstructured data analysis. In the field of recommendation systems, a typical area of data-driven decision-making, the need to use unstructured data to improve system performance has also steadily increased. Approaches to improving the performance of recommendation systems fall into two categories: improving algorithms and acquiring useful, high-quality data. Traditionally, most efforts have followed the former approach, while the latter has attracted relatively little attention. In this sense, efforts to utilize unstructured data from various sources are timely and necessary. In particular, because users' interests are directly connected with their needs, identifying those interests through unstructured big data analysis can provide a clue for improving the performance of recommendation systems. This study therefore proposes a methodology for improving recommendation systems by measuring users' interests. Specifically, it proposes a method to quantify users' interests by analyzing their internet usage patterns and to predict their repurchases based on the discovered preferences. The methodology has two main modules. The first module predicts the repurchase probability of each category by analyzing users' purchase history. We include the first module in our research scope in order to compare the accuracy of the traditional purchase-based prediction model with that of the new model presented in the second module. This procedure extracts the purchase history of users.
The core of our methodology is the second module, which extracts users' interests by analyzing the news articles they have read. The second module first constructs a correspondence matrix between topics and news articles by performing topic modeling on real-world news articles. It then analyzes users' news access patterns and constructs a correspondence matrix between articles and users. By merging the results of these two steps, we obtain a correspondence matrix between users and topics, which describes users' interests in a structured manner. Finally, using this matrix, the second module builds a model for predicting the repurchase probability of each category. In this paper, we also provide the experimental results of our performance evaluation. The data used in our experiments is outlined as follows. We acquired web transaction data for 5,000 panel members from a company that specializes in analyzing the rankings of internet sites. We first extracted from the original data 15,000 URLs of news articles published from July 2012 to June 2013, and crawled the main content of those articles. We then selected the 2,615 users who had read at least one of the extracted news articles. Among these 2,615 users, 359 target users had purchased at least one item from our target shopping mall 'G'. In the experiments, we analyzed the purchase history and news access records of these 359 internet users. The performance evaluation showed that our prediction model, which uses both users' interests and purchase history, outperforms a prediction model that uses only purchase history in terms of misclassification ratio. In detail, our model outperformed the traditional one in the appliance, beauty, computer, culture, digital, fashion, and sports categories when artificial neural network based models were used.
Similarly, our model outperformed the traditional one in the beauty, computer, digital, fashion, food, and furniture categories when decision tree based models were used, although the improvement was very small.
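
The merging step in the second module can be sketched as a matrix product, assuming that "merging" the user-article and article-topic correspondence matrices means multiplying them; the abstract does not state the exact operation, so this is an illustration rather than the authors' procedure. The helper `matmul` and all matrix values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows (plain Python, no libraries)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Rows: users, columns: articles (1 = the user read the article). Hypothetical data.
user_article = [
    [1, 1, 0],
    [0, 0, 1],
]

# Rows: articles, columns: topics (topic weight of each article, e.g. from a
# topic model). Hypothetical data.
article_topic = [
    [0.75, 0.25],
    [0.50, 0.50],
    [0.25, 0.75],
]

# Rows: users, columns: topics. Each cell sums the topic weights of the
# articles that the user read.
user_topic = matmul(user_article, article_topic)
print(user_topic)  # [[1.25, 0.75], [0.25, 0.75]]
```

Each row of `user_topic` could then serve as a feature vector of a user's interests alongside purchase history in a repurchase prediction model.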

Revisiting the cause of unemployment problem in Korea's labor market: The job seeker's interests-based topic analysis (취업준비생 토픽 분석을 통한 취업난 원인의 재탐색)

  • Kim, Jung-Su;Lee, Suk-Jun
    • Management & Information Systems Review / v.35 no.1 / pp.85-116 / 2016
  • The present study explores the causes of employment difficulty on the basis of job applicants' interests from the P-E (person-environment) fit perspective. Our approach relies on a textual analytic method to reveal insights from applicants' situational interests in a job search during changes in the labor market. To investigate the types of major interests and psychological responses, we collected user-generated texts from January 1, 2013 through December 31, 2015 by crawling an online community devoted to job seeking and to sharing information and opinions. The results of the topic analysis indicated that users' primary interests fall into four types: perception of vocational expectations, employment pre-preparation behaviors, perception of the labor market, and job-seeking stress. Specifically, job applicants were concerned mainly with monetary reward and form of employment rather than with their work values or career exploration, and young job applicants expressed their psychological responses in contextualized language (e.g., slang, vulgarisms) that projected their unstable state under uncertainty in response to environmental changes. Additionally, they perceived a narrow set of preparation activities (e.g., certifications, English exams) as the determining factors for success in employment, and they suffered from job-seeking stress. On the basis of these findings, current unemployment problems are attributed to the absence of the pursuit of the value of vocation and work among individuals, organizations, and society. Concretely, job seekers are preoccupied with the social prestige of occupations and have not settled their vocational values. Most companies, on the other hand, do not recognize the importance of human resources and have overlooked the need to develop a proper work environment that stimulates individual motivation.
By classifying job applicants' interests with reference to linguistic and psychological theories, this study's reinterpretation of the effect of the environment not only supports a more comprehensive understanding of social matters but also suggests new directions for future research on job applicants' psychological factors (e.g., attitudes, motivation) using topic analysis.

Issue tracking and voting rate prediction for 19th Korean president election candidates (댓글 분석을 통한 19대 한국 대선 후보 이슈 파악 및 득표율 예측)

  • Seo, Dae-Ho;Kim, Ji-Ho;Kim, Chang-Ki
    • Journal of Intelligence and Information Systems / v.24 no.3 / pp.199-219 / 2018
  • With the everyday use of the Internet and the spread of various smart devices, users can communicate in real time, and the existing style of communication has changed. As the Internet changed who produces information, data grew massively, giving rise to what is called Big Data. Big Data is seen as a new opportunity to understand social issues. In particular, text mining explores patterns in unstructured text data to find meaningful information. Because text data exists in many places, such as newspapers, books, and the web, it is diverse and abundant, which makes it suitable for understanding social reality. In recent years, there have been an increasing number of attempts to analyze texts from web sources such as SNS and blogs, where the public can communicate freely. Such analysis is recognized as a useful way to grasp public opinion immediately, so it can be used for research on political, social, and cultural issues. Text mining has received much attention as a way to investigate candidates' public reputation and to predict the voting rate (each candidate's share of the vote) in place of polling. This is because many people question the credibility of surveys, and people tend to refuse to reveal their real intentions when asked to respond to a poll. This study collected comments from the largest Internet portal site in Korea and examined the 19th Korean presidential election in 2017. We collected 226,447 comments posted from April 29, 2017 to May 7, 2017, a span that includes the period just before election day during which public opinion polls are prohibited. We analyzed frequencies, associative emotional words, topic emotions, and candidate voting rates. Through frequency analysis, we identified the words that represented the most important issues each day. In particular, following each presidential debate, the candidate who became an issue ranked at the top of the frequency analysis.
Through the analysis of associative emotional words, we identified the issues most relevant to each candidate. Topic emotion analysis was used to identify each candidate's topics and to express the public's emotions about those topics. Finally, we estimated the voting rate by combining the volume of comments with the sentiment score. In this way, we explored the issues for each candidate and predicted the voting rate. The analysis showed that news comments are an effective tool for tracking issues related to presidential candidates and for predicting the voting rate. In particular, this study presented daily issues and a quantitative index of sentiment. It also predicted the voting rate for each candidate and precisely matched the ranking of the top five candidates. Each candidate will be able to grasp public opinion objectively and reflect it in election strategy. Candidates can use positive issues more actively in their election strategies and try to correct negative issues. In particular, candidates should be aware that a moral problem can severely damage their reputation. Voters can look objectively at the issues and public opinion surrounding each candidate and make more informed decisions when voting. If they refer to results such as those of this study before voting, they will be able to see public opinion as reflected in Big Data and vote for a candidate from a more objective perspective. If candidates campaign with reference to Big Data analysis, the public will become more active on the web, recognizing that their wants are being reflected. Political views can be expressed in various places on the web, and this can contribute to political participation by the people.

Re-ranking the Results from Two Image Retrieval System in Cooperative Manner (두 영상검색 시스템의 협력적 이용을 통한 재순위화)

  • Hwang, Joong-Won;Kim, Hyunwoo;Kim, Junmo
    • Journal of the Korean Institute of Intelligent Systems / v.24 no.1 / pp.7-15 / 2014
  • Image retrieval has become a major part of computer vision and data mining. Although commercial image retrieval systems such as Google perform well, improved performance is constantly in demand because of the rapid growth of data on the web. To satisfy this demand, many re-ranking methods, which enhance performance by reordering retrieved results with independent algorithms, have been proposed. Conventional re-ranking algorithms assume that visual patterns are not used in the initial image retrieval stage. However, present-day image search engines have begun to use visual features, so this assumption needs to be reconsidered. Also, although it is reasonable to suspect that integrating multiple retrieval systems can improve overall performance, this topic has not been studied sufficiently. In this paper, we set up a condition under which the ranking result cannot be improved by any means other than cooperation between the two systems. We evaluate the algorithm on a toy model and show that the proposed module can improve the retrieval results.

The Association of Institutional Information on Websites with Present and Future Financial Performance (웹사이트에 게시된 기업의 소개글 분석을 통한 기업의 현재 및 미래 가치 예측 분석 방법)

  • Na, Hyung Jong;Choi, Sukjae;Kwon, Ohbyung
    • The Journal of Society for e-Business Studies / v.23 no.4 / pp.63-85 / 2018
  • The "About Us" page on a corporation's website provides information about the organization's vision, philosophy, and values. We examine the association between the institutional information provided on corporate websites (i.e., the "About Us" section) and present and future financial performance. Using a text mining technique, we analyze the institutional information of S&P 500 firms in 2016. We conduct a factor analysis on words that are intentionally repeated in the introductory text of corporate websites. The results reveal that keywords from this institutional information can be grouped into six factors. We then carry out an ordinary least squares regression analysis to determine the associations between these six factors and present financial performance. The results show that keywords in Factor 2 (those related to purchasing experience) are positively associated with ROE, a variable representing present financial performance, while keywords in Factor 1 (those related to notes to customers) show a negative relationship with ROE. On the other hand, keywords in Factor 1 have a positive relationship with Tobin's Q, a variable representing future financial performance. These results indicate that there is some relationship between the words used in this section of corporate websites and firms' financial performance. Hence, the institutional information on a website may be a useful indicator of current firm performance and future firm value.

A Design of the OOPP(Optimized Online Portfolio Platform) using Enterprise Competency Information (기업 직무 정보를 활용한 OOPP(Optimized Online Portfolio Platform)설계)

  • Jung, Bogeun;Park, Jinuk;Lee, ByungKwan
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology / v.11 no.5 / pp.493-506 / 2018
  • This paper proposes the design of the OOPP (Optimized Online Portfolio Platform), which lets job seekers search for the job competencies necessary for employment and write and manage their portfolios online efficiently. The OOPP consists of three modules. First, the JDCM (Job Data Collection Module) stores the help-wanted advertisements of job information sites in a spreadsheet. Second, the CSM (Competency Statistical Model) classifies the core competencies for each job by text-mining the collected help-wanted ads. Third, the OBBM (Optimize Browser Behavior Module) lets users look up data rapidly by improving the processing speed of the browser. The OBBM in turn consists of the PSES (Parallel Search Engine Sub-Module), which optimizes the computation of the search engine, and the OILS (Optimized Image Loading Sub-Module), which optimizes the loading of images, text, etc. The performance analysis of the CSM shows that its output differs little from the actual advertisements, with a data accuracy of 99.4% to 100%. When browser optimization is performed using the OBBM, working time is reduced by about 68.37%. Therefore, the OOPP accurately analyzes the help-wanted ads of job information sites and lets users rapidly look up the analyzed results on a web page.

A Study on Social Media Sentiment Analysis for Exploring Public Opinions Related to Education Policies (교육정책관련 여론탐색을 위한 소셜미디어 감정분석 연구)

  • Chung, Jin-Myeong;Yoo, Ki-Young;Koo, Chan-Dong
    • Informatization Policy / v.24 no.4 / pp.3-16 / 2017
  • With the development of social media services in the era of Web 2.0, the venue where public opinion forms has partially shifted from traditional mass media to social media. This phenomenon continues to expand, and public opinions on government policies that are created and shared on social media are attracting more attention. Grasping public opinion is particularly important in formulating education policies, because setting them involves a variety of stakeholders and conflicts. The purpose of this study is to explore public opinion about education-related policies through an empirical analysis of social media documents on education policies using opinion mining techniques. For this purpose, we collected by keyword the education policy-related documents that users produced through social media services, tokenized the documents and extracted their sentiment-bearing terms, and scored those terms using sentiment dictionaries to find out public preferences for specific education policies. As a result, many negative opinions were found regarding the smart education policies, identified by the keywords digital textbooks and e-learning, while the software education policies, identified by the keywords coding education and computer thinking, drew more positive opinions. In addition, the general policies identified by the keywords free school terms and creative personality education drew more negative opinions. No sentiment could be extracted from as many as 20% of the documents, signifying that a certain share of blog posts and tweets still do not reflect their writers' opinions.
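
The tokenize-and-score procedure described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the sentiment dictionary `SENTIMENT`, the function `score_document`, the English sample documents, and the choice of averaging matched word polarities are all assumptions made for the example.

```python
import re

# Hypothetical sentiment dictionary: +1 for positive terms, -1 for negative terms.
SENTIMENT = {
    "good": 1, "useful": 1, "helpful": 1,
    "bad": -1, "confusing": -1, "expensive": -1,
}


def score_document(text):
    """Return the mean polarity of dictionary words in text, or None if none match."""
    tokens = re.findall(r"[a-z']+", text.lower())  # simple lowercase tokenizer
    hits = [SENTIMENT[t] for t in tokens if t in SENTIMENT]
    if not hits:
        return None  # no sentiment could be extracted from this document
    return sum(hits) / len(hits)


# Hypothetical documents collected by policy keyword.
docs = [
    "Digital textbooks are expensive and confusing",
    "Coding education is useful and helpful",
    "The new policy starts next semester",
]
scores = [score_document(d) for d in docs]
print(scores)  # [-1.0, 1.0, None]
```

The `None` result corresponds to the case the abstract reports for about 20% of documents, where no sentiment could be extracted.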