What Is NLP Anyway?

October 2012

By Mark Morsch
Radiology Today
Vol. 13 No. 10 P. 12

As coders prepare for the complex transition to the ICD-10 code set, currently set for October 2014, healthcare organizations are increasingly considering more sophisticated coding technology to ease the burden. Enter computer-assisted coding and natural language processing (NLP), which are to encoders what encoders were to coding books: technology that, when implemented appropriately, can significantly improve coding accuracy, productivity, and consistency.

Computer-assisted coding technology is nothing new to healthcare. Defined by the American Health Information Management Association as “the use of computer software that automatically generates a set of medical codes for review, validation, and use based upon clinical documentation,” computer-assisted coding technology has been used in hundreds of hospitals, surgery centers, and physician clinics across the United States for the past 15 years.

But how does this technology actually work? To answer that, one must look at NLP, which is a driving force behind computer-assisted coding’s progress. What exactly is NLP? In general, it’s the intelligence engine that scans and analyzes clinical documentation, then recommends codes to assign to a clinical case. However, NLP encompasses various types of technologies, some more effective in healthcare than others, especially in relation to the new ICD-10 requirements. Knowing what options are out there will help providers find a vendor that uses the best symbolic rules and statistical components to assist their practice in the ICD-10 transition and beyond.

NLP Defined
NLP refers to a set of technologies and approaches that vary in effectiveness. Generally, the NLP technologies available today for computer-assisted coding use one of five methods:

• Medical dictionary matching pairs individual words or groups of words found within the documentation to standard terminology from a medical dictionary. For words that match, the text is typically highlighted and validated by the coder.

• Pattern matching extends the capabilities of medical dictionary matching by coordinating terms with specific patterns of text that describe a diagnosis or a procedure.

• Statistical NLP gathers information from a large, precoded sample of documents to train algorithms based on word and pattern distributions.

• Symbolic rules NLP analyzes language using rules or lexicons, identifying the elements of language with symbols that can be manipulated by the system.

• Symbolic rules with statistical components combines symbolic NLP with a mathematical model of linguistics, including semantics (levels of language that contribute to meaning) and pragmatics (applying domain knowledge to recognize information in the correct context).

Compare and Contrast
To understand how these methods differ, we first need to define the standard measurements of NLP accuracy. Precision is the share of returned results that are accurate: the number of accurate results divided by the total number of results returned. Higher rates of precision mean fewer false-positives. Recall is the share of all potentially accurate results that the system actually returns: the number of accurate results divided by the number of accurate results that could have been found. Higher rates of recall mean fewer false-negatives (or missed codes).
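The two measurements can be worked through on a small example. The sketch below is illustrative only: the function name and the single-letter code labels are hypothetical placeholders, not real ICD codes or part of any vendor's product.

```python
def precision_recall(suggested, correct):
    """Return (precision, recall) for suggested codes against the correct codes."""
    suggested, correct = set(suggested), set(correct)
    true_positives = len(suggested & correct)  # accurate results
    precision = true_positives / len(suggested) if suggested else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    return precision, recall


# Hypothetical labels: the engine suggests four codes, two of which are right,
# and it misses one code ("E") that a coder would have assigned.
suggested = {"A", "B", "C", "D"}
correct = {"A", "B", "E"}

p, r = precision_recall(suggested, correct)
print(f"precision={p:.2f} recall={r:.2f}")  # prints: precision=0.50 recall=0.67
```

Here the two wrong suggestions ("C" and "D") are false-positives that lower precision, and the missed code ("E") is a false-negative that lowers recall.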

Medical dictionary matching NLP typically produces the highest number of medical terms highlighted as potential codes. Precision of medical dictionary matching is very low due to the low number of accurate hits compared with the high number of total hits. This method does little to enhance coder productivity since coders are left to sift through many false-positives to find accurate codes.
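A minimal sketch shows why this method over-highlights. It assumes a tiny, made-up dictionary and placeholder code labels (none of these are real ICD codes), and it is not meant to represent how any commercial product is built.

```python
import re

# Hypothetical mini-dictionary: term -> placeholder code label.
DICTIONARY = {
    "pneumonia": "CODE_PNEUMONIA",
    "fracture": "CODE_FRACTURE",
    "pleural effusion": "CODE_EFFUSION",
}


def dictionary_match(text):
    """Return (term, code) for every dictionary term that appears in the text."""
    lowered = text.lower()
    hits = []
    for term, code in DICTIONARY.items():
        # Whole-word match only; context around the term is ignored entirely.
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            hits.append((term, code))
    return hits


report = "No evidence of pneumonia. Small left pleural effusion. History of fracture."
for term, code in dictionary_match(report):
    print(term, "->", code)
```

All three terms are flagged, even though the report rules out pneumonia and mentions the fracture only as history. The coder has to discard those hits by hand.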

Pattern matching NLP is more precise than medical dictionary matching, returning fewer false-positives. But because it can’t analyze the meaning and subtleties of language, it has somewhat lower recall than medical dictionary matching. Neither medical dictionary matching nor pattern matching includes the intelligence to apply coding guidelines to its analysis.
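The precision and recall trade-off can be illustrated with regular expressions. In this sketch the patterns, the sample sentences, and the placeholder code labels are all assumptions chosen for illustration.

```python
import re

# Hypothetical patterns: a term counts only when it appears in specific wording.
PATTERNS = [
    (re.compile(r"\b(?:consistent with|compatible with|findings of)\s+pneumonia\b"),
     "CODE_PNEUMONIA"),
    (re.compile(r"\b(?:small|moderate|large)\s+(?:left|right|bilateral)\s+pleural effusion\b"),
     "CODE_EFFUSION"),
]


def pattern_match(text):
    """Return the code label of every pattern found in the text."""
    lowered = text.lower()
    return [code for pattern, code in PATTERNS if pattern.search(lowered)]


# "No evidence of pneumonia" does not fit the pneumonia pattern, so the
# false-positive that plain dictionary matching would produce is avoided.
print(pattern_match("No evidence of pneumonia. Small left pleural effusion."))

# The same finding worded differently fits no pattern and is missed.
print(pattern_match("Fluid is seen in the left pleural space."))
```

The first report yields only the effusion label; the second yields nothing, because the patterns match wording rather than meaning. That missed finding is the lower recall described above.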

Statistical NLP relies on a large sample of documents where the meaning of the language already has been matched to accurate results. Only then can the training algorithm start to perform its analysis, form word-type distributions, and derive correlations between input and results that the statistical NLP can apply.

Statistical NLP systems often can be trained quickly to a moderate level of recall and precision, but high performance can be limited by the availability of a highly accurate training sample and the need to have a large number of examples of each specific coding scenario.
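As a rough illustration of training on precoded documents, the sketch below uses a small naive Bayes classifier with add-one smoothing. This is one common statistical technique chosen for the example, not necessarily what any coding product uses, and the four training sentences and placeholder code labels are invented.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into words."""
    return text.lower().replace(".", " ").replace(",", " ").split()


def train(examples):
    """Count documents and words per code from (text, code) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, code in examples:
        doc_counts[code] += 1
        for word in tokenize(text):
            word_counts[code][word] += 1
            vocabulary.add(word)
    return doc_counts, word_counts, vocabulary


def predict(model, text):
    """Return the code with the highest naive Bayes log-probability."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_code, best_score = None, -math.inf
    for code in doc_counts:
        score = math.log(doc_counts[code] / total_docs)  # prior for this code
        total_words = sum(word_counts[code].values())
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score.
            likelihood = (word_counts[code][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_code, best_score = code, score
    return best_code


# Hypothetical precoded sample; a real system would need far more examples.
TRAINING = [
    ("right lower lobe consolidation consistent with pneumonia", "CODE_PNEUMONIA"),
    ("patchy airspace opacity suggesting pneumonia", "CODE_PNEUMONIA"),
    ("displaced fracture of the distal radius", "CODE_FRACTURE"),
    ("nondisplaced fracture of the fifth metatarsal", "CODE_FRACTURE"),
]

model = train(TRAINING)
print(predict(model, "lobe consolidation with airspace opacity"))  # prints: CODE_PNEUMONIA
```

The classifier picks the pneumonia label even though the word "pneumonia" never appears in the new sentence, because the surrounding words were distributed that way in the training sample. With only two examples per code, though, any scenario not represented in the sample is a guess, which is the limitation noted above.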

Symbolic rules NLP uses inference rules to interpret meaning from text and therefore yields high precision rates (ie, fewer false-positives). Symbolic rules introduce more sophisticated techniques for analyzing medical language based on parsing phrases and sentences. Experts in linguistics construct symbolic rules based on parts of speech and standard English syntax. A medical condition or procedure is recognized when one or more rules successfully match a portion of the clinical documentation. Symbolic rules support more advanced language recognition but become difficult to maintain for large code sets such as ICD-9-CM and ICD-10-CM/PCS.
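A heavily simplified sketch of the symbolic idea follows. Real symbolic systems parse parts of speech and syntax; this example only tags words with symbols from a small lexicon and applies three hand-written rules per sentence. The lexicon, the symbol names, and the rules are all hypothetical.

```python
import re

# Hypothetical lexicon: word -> symbol the rules can reason about.
LEXICON = {
    "no": "NEG", "without": "NEG", "negative": "NEG",
    "history": "HIST", "prior": "HIST",
    "pneumonia": "FINDING", "fracture": "FINDING", "effusion": "FINDING",
}


def symbolize(sentence):
    """Turn a sentence into (word, symbol) pairs; unknown words get 'O'."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [(word, LEXICON.get(word, "O")) for word in words]


def apply_rules(sentence):
    """Label each finding in one sentence as negated, historical, or present."""
    results = []
    cues_seen = set()
    for word, symbol in symbolize(sentence):
        if symbol in ("NEG", "HIST"):
            cues_seen.add(symbol)
        elif symbol == "FINDING":
            if "NEG" in cues_seen:       # Rule 1: NEG ... FINDING -> negated
                status = "negated"
            elif "HIST" in cues_seen:    # Rule 2: HIST ... FINDING -> historical
                status = "historical"
            else:                        # Rule 3: bare FINDING -> present
                status = "present"
            results.append((word, status))
    return results


def analyze(report):
    """Apply the rules sentence by sentence so cues do not leak across sentences."""
    findings = []
    for sentence in re.split(r"[.;]\s*", report):
        findings.extend(apply_rules(sentence))
    return findings


report = "No evidence of pneumonia. Small left pleural effusion. History of fracture."
print(analyze(report))
# prints: [('pneumonia', 'negated'), ('effusion', 'present'), ('fracture', 'historical')]
```

Only the effusion would be suggested as a current finding, which is where the higher precision comes from. The maintenance burden is also visible: every new cue word or exception needs another lexicon entry or rule, and that effort grows with the size of the code set.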

A combined use of symbolic rules and statistical NLP provides the benefit of sophisticated inference rules that can “understand” how documentation relates to coding rules, and it integrates symbolic analysis, which allows for consistent interpretation of clinical content. Together, this combination of technologies enables high rates of precision and recall.

Final Thoughts
Not all computer-assisted coding and NLP technology is created equal; however, understanding the differences among NLP methods gives healthcare practices the best opportunity to select the technology that best suits their particular needs. Despite the new 2014 deadline for the ICD-10 transition, continuing to prepare now for the conversion will help ensure healthcare organizations’ readiness and compliance for the eventual changeover to ICD-10.

— Mark Morsch is vice president of technology at OptumInsight and coinventor of the LifeCode Natural Language Processing (NLP) Engine with two patents on NLP technology for computer-assisted coding.
