Noisy text analytics

Noisy text analytics is a process of information extraction whose goal is to automatically extract structured or semi-structured information from noisy unstructured text data. While text analytics is a mature and growing field, valuable because of the huge amounts of text data being produced, the processing of noisy text is gaining importance because many common applications produce noisy text data. Noisy unstructured text is found in informal settings such as online chat, text messages, e-mails, message boards, newsgroups, blogs, wikis and web pages. Text produced by transcribing spontaneous speech with automatic speech recognition, or printed and handwritten documents with optical character recognition, also contains processing noise.

Text produced under such circumstances is typically highly noisy, containing spelling errors, abbreviations, non-standard words, false starts, repetitions, missing punctuation, missing letter case information, pause-filling words such as “um” and “uh”, and other texting and speech disfluencies. Such text appears in large volumes in contact centers, chat rooms, optical character recognition (OCR) of documents, short message service (SMS) text, etc. Documents written in historical language can also be considered noisy with respect to today’s knowledge of the language; such texts preserve valuable historical, religious and medical knowledge. The nature of the noisy text produced in all these contexts warrants moving beyond traditional text analysis techniques.

Techniques for noisy text analysis

Missing punctuation and the use of non-standard words can often hinder standard natural language processing tools such as part-of-speech taggers and parsers. Techniques for learning from noisy data, and for then processing it, are only now being developed.
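
As an illustration of one such technique, the sketch below implements a minimal dictionary-based spelling corrector in the style of Norvig's noisy-channel corrector: it proposes candidate corrections within one edit of a noisy token and ranks them by corpus frequency. The corpus file name corpus.txt and the English-only alphabet are assumptions made for the example, not part of any standard tool.

    import re
    from collections import Counter

    # Assumption: corpus.txt is any large plain-text corpus of
    # well-formed text; its word frequencies stand in for a
    # simple language model.
    def tokenize(text):
        return re.findall(r"[a-z]+", text.lower())

    with open("corpus.txt", encoding="utf-8") as f:
        WORD_COUNTS = Counter(tokenize(f.read()))

    def edits1(word):
        """All strings one edit (delete, transpose, replace, insert) away."""
        letters = "abcdefghijklmnopqrstuvwxyz"
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
        inserts = [a + c + b for a, b in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def known(candidates):
        return {w for w in candidates if w in WORD_COUNTS}

    def correct(word):
        """Prefer the word itself if known, else the most frequent
        in-vocabulary word within one edit, else leave it unchanged."""
        candidates = known([word]) or known(edits1(word)) or {word}
        return max(candidates, key=lambda w: WORD_COUNTS[w])

With a reasonable corpus, correct("acount") would typically return "account". Real systems extend this scheme with error models learned from the specific noise source, such as ASR confusions, OCR character substitutions or texting abbreviations.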

Possible sources of noisy text

  • World Wide Web: Poorly written text is found in web pages, online chat, blogs, wikis, discussion forums and newsgroups. Most of these data are unstructured, and the style of writing differs greatly from, say, well-written news articles. Analysis of web data is important because it is a source for market-buzz analysis, market reviews, trend estimation, etc. Because of the sheer volume of data, efficient methods of information extraction, classification, automatic summarization and analysis are also needed.
  • Contact centers: A general term for help desks, information lines and customer service centers operating in domains ranging from computer sales and support to mobile phones to apparel. On average, a person in the developed world interacts with a contact center agent at least once a week, and a typical agent handles over a hundred calls per day. Contact centers operate in various modes such as voice, online chat and e-mail, and the industry produces gigabytes of data in the form of e-mails, chat logs, transcribed voice conversations, customer feedback, etc. The bulk of contact center data is voice conversation; transcribing it with state-of-the-art automatic speech recognition yields text with a 30-40% word error rate (word error rate is defined in the sketch following this list). Even the written modes of communication, such as online chat between customers and agents and interactions over e-mail, tend to be noisy. Analysis of contact center data is essential for customer relationship management, customer satisfaction analysis, call modeling, customer profiling, agent profiling, etc., and it requires sophisticated techniques to handle poorly written text.
  • Printed documents: Many libraries, government organizations and national defence organizations have vast repositories of hard-copy documents. To retrieve and process their content, these documents must be digitized using optical character recognition. In addition to printed text, they may also contain handwritten annotations. OCRed text can be highly noisy depending on font size, print quality, etc.; word error rates range from 2-3% to as high as 50-60%. Handwritten annotations are particularly hard to decipher, and error rates can be quite high in their presence.
  • Short Message Service (SMS): Language usage in computer-mediated discourse, such as chat, e-mail and SMS, differs significantly from the standard form of the language. The urge toward shorter messages that are faster to type, balanced against the need for semantic clarity, shapes this non-standard form, known as the texting language (a normalization sketch follows the word error rate sketch below).
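
The word error rates quoted above for speech transcription and OCR are conventionally computed as the Levenshtein (edit) distance between the hypothesis and the reference word sequence, divided by the length of the reference. A minimal sketch:

    def word_error_rate(reference, hypothesis):
        """WER = (substitutions + deletions + insertions) / reference length,
        computed with standard Levenshtein dynamic programming over words."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution/match
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

For example, word_error_rate("please check my account balance", "pls chek my acount balance") returns 0.6: three substitutions against a five-word reference.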
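
For texting language, a common first step is lexical normalization: mapping non-standard tokens back to their standard forms before downstream processing. The lookup table below is a tiny hand-picked sample for illustration, not a real resource; production systems induce such mappings automatically rather than hard-coding them.

    # Assumption: this abbreviation table is an illustrative sample only.
    TEXTING_LEXICON = {
        "u": "you", "r": "are", "ur": "your", "pls": "please",
        "thx": "thanks", "gr8": "great", "2moro": "tomorrow", "b4": "before",
    }

    def normalize_sms(message):
        """Replace known texting-language tokens with standard forms,
        leaving unknown tokens untouched."""
        tokens = message.lower().split()
        return " ".join(TEXTING_LEXICON.get(t, t) for t in tokens)

    print(normalize_sms("pls call me b4 2moro"))
    # -> "please call me before tomorrow"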
