Information Extraction
Information extraction in the PaDaWaN ETL process is performed by an ontology-based information extraction system, similar to the systems described in [1]. The system is implemented in Java, using the UIMA framework for annotations and OWL as the ontology format.
The ontology is a terminology tree that is initially generated by an automatic algorithm and then refined by a domain expert. Because standard tools such as Excel are widely familiar, we defined a meta-ontology format that a physician can edit in Excel without having to learn additional tools. The ontology can be fine-tuned in ATHEN/typo3/ afterwards.
Each concept in our ontology format is an attribute-value pair and has a data type. The standard types are boolean, number and choice. We added a free-form data type ‘var’ for everything that does not fit into the other categories. A concept contains all of its synonyms and the regular expressions that are needed to find it in the text.
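The concept structure described above can be illustrated with a minimal Python sketch. It is not the actual PaDaWaN implementation, which is written in Java on top of UIMA; the class name `Concept`, its fields, and the example concept with its synonyms and pattern are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Concept:
    """Hypothetical stand-in for one ontology concept (not the real data model)."""
    name: str
    data_type: str  # one of 'boolean', 'number', 'choice', 'var'
    synonyms: list = field(default_factory=list)
    patterns: list = field(default_factory=list)  # regular expressions

    def find(self, text):
        """Return sorted (start, end, matched_text) spans for all synonyms and patterns."""
        spans = []
        for synonym in self.synonyms:
            # Synonyms are matched literally, as whole words, ignoring case.
            literal = r"\b" + re.escape(synonym) + r"\b"
            for m in re.finditer(literal, text, re.IGNORECASE):
                spans.append((m.start(), m.end(), m.group(0)))
        for pattern in self.patterns:
            for m in re.finditer(pattern, text, re.IGNORECASE):
                spans.append((m.start(), m.end(), m.group(0)))
        return sorted(spans)


# Assumed example concept; the synonyms and the pattern are invented.
heart_rate = Concept(
    name="heart rate",
    data_type="number",
    synonyms=["heart rate", "pulse"],
    patterns=[r"\bHR\s*[:=]?\s*(\d{2,3})\b"],
)

matches = heart_rate.find("Pulse regular. HR: 72 bpm.")
```

Keeping synonyms and regular expressions together on the concept means that a single lookup yields every surface form under which the concept may appear in the text.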
Because there are cases where this simple approach is not sufficient, we added a Python postprocessing engine that can run arbitrary Python code to transform the extracted facts. For example, Python can be used to translate and normalize vague time expressions such as “afternoon” into integer values such as the minute of the day. It is even possible to match a whole text segment and implement the entire IE logic in Python.
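A postprocessing script of this kind could look like the following sketch. The interface of the postprocessing engine is not described here, so the function name `to_minute_of_day`, the lookup table `VAGUE_TIMES`, and the minute values assigned to each expression are assumptions made for illustration.

```python
import re

# Assumed mapping from vague expressions to a representative minute of the day.
VAGUE_TIMES = {
    "morning": 9 * 60,
    "noon": 12 * 60,
    "afternoon": 15 * 60,
    "evening": 19 * 60,
    "night": 23 * 60,
}


def to_minute_of_day(expression):
    """Normalize a time expression to minutes since midnight, or None if unknown."""
    text = expression.strip().lower()
    # Exact clock times such as "14:30" are converted directly.
    clock = re.fullmatch(r"(\d{1,2}):(\d{2})", text)
    if clock:
        hours, minutes = int(clock.group(1)), int(clock.group(2))
        if hours < 24 and minutes < 60:
            return hours * 60 + minutes
        return None
    # Vague expressions fall back to the lookup table.
    return VAGUE_TIMES.get(text)
```

With these assumed values, `to_minute_of_day("afternoon")` yields 900 and `to_minute_of_day("14:30")` yields 870; expressions that are not recognized return `None` so that the caller can decide how to handle them.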
The ontology itself contains all information required to find the defined concepts. All regular expressions, configuration parameters and postprocessing scripts are stored in the OWL ontology file, so that everything needed can be transferred easily from developer to developer or from development to production environments.