
Text mining

The UNRAVEL RDP includes all structured data from the EHR. However, some data remain unstructured, such as free text. These texts may contain valuable variables, such as NYHA class or other clinical symptoms. To enrich the UNRAVEL RDP with these unstructured data from (short) clinical notes, a prototype text mining tool was developed. In short, we defined pre-set variables for the tool to extract from clinical notes, e.g. NYHA classification and cardiovascular risk factors such as diabetes, hypercholesterolemia and hypertension. The pre-set variables currently correspond to the variables in the TORCH registry, but they can be redefined at the discretion of the researcher.

Example of text-mining tool

Text Retrieval tool

The text mining tool for structuring text data is based on automated information retrieval using regular expressions. An information retrieval process starts from a user query, which is a formal statement of an information need. A list of retrieval synonyms for different clinical variables in cardiomyopathy was compiled manually under the supervision of medical experts. We have defined these clinical variables as:

  Clinical_Variables = ["NYHA Class", "ankle oedema", "ascites", "pulmonary rales", "3rd heart sound", "chronic renal failure", "neuromuscular dystrophy", "arterial hypertension", "diabetes", "dyslipidaemia", "smoking", "alcohol consumption", "Previous ventricular fibrillation", "Previous syncope", "Family history of SCD", "Competitive sports", "Previous sustained VT", "Previous non-sustained VT", "Familial DCM"]
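As a minimal sketch of how such a synonym list could be turned into retrieval queries, the example below compiles one regular expression per clinical variable. The synonyms, the SYNONYMS dictionary and the functions compile_query and retrieve_variables are hypothetical and only serve as an illustration; the expert-curated lists used by the tool are not reproduced here.

```python
import re

# Hypothetical synonym lists (English and Dutch) for three of the variables.
# The real lists were compiled under the supervision of medical experts.
SYNONYMS = {
    "diabetes": ["diabetes", "diabetes mellitus", "DM"],
    "arterial hypertension": ["hypertension", "hypertensie", "high blood pressure"],
    "smoking": ["smoking", "smoker", "roken"],
}


def compile_query(synonyms):
    """Build one case-insensitive pattern matching any synonym as a whole word."""
    # Longest synonyms first, so multi-word terms are tried before their prefixes.
    ordered = sorted(synonyms, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


QUERIES = {variable: compile_query(terms) for variable, terms in SYNONYMS.items()}


def retrieve_variables(note):
    """Return the clinical variables whose synonyms occur in the note."""
    return [variable for variable, pattern in QUERIES.items() if pattern.search(str(note))]
```

Note that this sketch only detects mentions; it does not handle negation ("no diabetes") or context, which a production query would need to address.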

Regular expressions (regex) provide a powerful language for matching text patterns. A pattern can be anything from a single character to a complex string containing special characters, and it can match zero or more times in a given string. Using the Python “re” module, we experimented with extensive searches through the textual data. For example, the regular expressions for retrieving the NYHA class are shown below:

  import re

  # Any mention of NYHA or "New York Heart Association"
  match1 = re.search(r'\b[\w\.-]*NYHA[\w\.-]*|\b[\w\.-]*New\s*York\s*Heart\s*Association[\w\.-]*',
                     str(textData[row]), re.IGNORECASE)
  # NYHA class II (also written as 2/4, "klasse II" or "class II")
  match2 = re.search(r'[\w\.-]*NYHA\s*II[^I][^V]|[\w\.-]*2/4[\w\.-]*|\w*klasse\s*II[^I][^V]|\w*class\s*II[^I][^V]',
                     str(textData[row]), re.IGNORECASE)
  # NYHA class III
  match3 = re.search(r'[\w\.-]*NYHA\s*III[^V]|[\w\.-]*3/4[\w\.-]*|\w*klasse\s*III[^V]|\w*class\s*III[^V]',
                     str(textData[row]), re.IGNORECASE)
  # NYHA class IV
  match4 = re.search(r'[\w\.-]*NYHA\s*IV\w*|[\w\.-]*4/4[\w\.-]*|\w*klasse\s*IV\w*|\w*class\s*IV\w*',
                     str(textData[row]), re.IGNORECASE)

The call “match = re.search(pattern, string)” stores the search result in the variable match. If the pattern occurs anywhere in the string, re.search returns a match object, which evaluates as true, and the text is retrieved. Otherwise it returns None, which evaluates as false. The complete source code is available on GitHub: https://github.com/bagheria/Text_Analysis_UNRAVEL_RDP
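The following is a minimal sketch of how the result of re.search could be turned into a structured NYHA value. It is not the code from the repository: the names CLASS_PATTERNS and extract_nyha_class are hypothetical, and the patterns are simplified variants that use word boundaries (\b) instead of trailing character classes, so that a class at the very end of a note is also matched. As in the patterns above, only classes II to IV are covered.

```python
import re

# Simplified, illustrative patterns; the highest class is checked first.
CLASS_PATTERNS = [
    (4, re.compile(r"(?:NYHA|klasse|class)\s*IV\b|\b4/4\b", re.IGNORECASE)),
    (3, re.compile(r"(?:NYHA|klasse|class)\s*III\b|\b3/4\b", re.IGNORECASE)),
    (2, re.compile(r"(?:NYHA|klasse|class)\s*II\b|\b2/4\b", re.IGNORECASE)),
]


def extract_nyha_class(note):
    """Return the NYHA class (2, 3 or 4) mentioned in the note, or None."""
    text = str(note)
    for nyha_class, pattern in CLASS_PATTERNS:
        match = pattern.search(text)  # match object if found, otherwise None
        if match:
            return nyha_class
    return None
```

For example, extract_nyha_class("Dyspnoe, NYHA III.") yields 3, while a note without any NYHA mention yields None. Because "II" followed by a word boundary cannot match inside "III", the class II pattern does not fire on a class III mention.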

Short And Long clinical Text Classifier (SALTClass)

Short clinical text classification can be applied to extract a patient’s family history as a marker of the risk of cardiovascular disease. However, short clinical notes are sparse: the very small term counts inherent in such texts can lead to large sampling errors in classification.

The software package SALTClass (Short And Long Text Classifier) is a clustering-based NLP toolkit that uses seven clustering algorithms: latent Dirichlet allocation, K-Means, MiniBatch K-Means, BIRCH, MeanShift, DBSCAN and GMM. The cluster information is used in smoothing methods that give the sparse text an enriched representation. Currently, ten different supervised classifiers have been integrated in SALTClass, which can be applied to the original term-document matrix or used in an enrichment pipeline.
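To illustrate the general idea of cluster-based enrichment, the sketch below smooths each document’s term distribution by linear interpolation with the term distribution of its cluster. This is an assumption chosen for illustration and not necessarily the smoothing used inside SALTClass; it does not use the SALTClass API, the cluster labels are taken as given, and the function names and the weight lam are hypothetical.

```python
from collections import Counter, defaultdict


def term_distribution(counts):
    """Normalise raw term counts to a probability distribution."""
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def enrich_with_clusters(documents, cluster_labels, lam=0.5):
    """Interpolate each document's term distribution with that of its cluster.

    documents: list of token lists; cluster_labels: one cluster id per document.
    lam: weight of the document's own distribution (hypothetical default).
    """
    doc_counts = [Counter(tokens) for tokens in documents]

    # Pool the term counts of all documents that share a cluster.
    cluster_counts = defaultdict(Counter)
    for counts, label in zip(doc_counts, cluster_labels):
        cluster_counts[label].update(counts)

    enriched = []
    for counts, label in zip(doc_counts, cluster_labels):
        p_doc = term_distribution(counts)
        p_cluster = term_distribution(cluster_counts[label])
        # Terms absent from the document receive mass from the cluster.
        enriched.append({
            term: lam * p_doc.get(term, 0.0) + (1 - lam) * p_cluster[term]
            for term in p_cluster
        })
    return enriched


documents = [["father", "scd"], ["family", "history", "scd"], ["no", "smoking"]]
enriched = enrich_with_clusters(documents, cluster_labels=[0, 0, 1])
```

In this toy input the first note never contains the term "family", yet its enriched representation gives that term a non-zero weight because the second note in the same cluster does. A classifier trained on the enriched vectors therefore sees less sparse input.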

SALTClass can be downloaded as a Python package from the Python Package Index (PyPI) at https://pypi.org/project/saltclass and from GitHub at https://github.com/bagheria/saltclass.

Future perspectives

Future perspectives include the use of natural language processing to build an automated, standardized diagnosis registry from clinical notes. This registry would be based on the International Classification of Diseases, 10th revision (ICD-10), mapped to the diagnosis thesaurus and reimbursement codes set by the project group DHD diagnosis thesaurus-DBC-ICD 10 of the Dutch Society of Cardiology.