Package nafparserpy
nafparserpy
nafparserpy is a lightweight python XML wrapper for NAF.
The parser wraps lxml to handle NAF trees, providing convenience classes for NAF layers and elements.
The resulting objects are decoupled from the underlying lxml tree: the user is responsible for creating
and handling NAF objects, while the parser handles tree manipulation.
NAF version and DTD
The currently supported NAF version is 3.3. Layer and element classes follow closely the NAF DTD:
- compulsory NAF attributes appear as fields (object attributes)
- NAF subelements appear as fields of the corresponding class
- all attributes (compulsory and optional) appear in an
attrsdict attribute
See NAF-4-Development for more information on NAF.
Background
nafparserpy follows on KafNafParserPy by wrapping
lxml to handle NAF XML trees, and providing convenience classes for handling NAF layers.
Unlike KafNafParserPy, layer objects are decoupled from the underlying lxml etree, so that the user is responsible for creating
and handling NAF objects, while the parser handles tree manipulation:
- the parser allows to add full NAF layer objects to the NAF tree. The user application is responsible for creating these objects; the parser recursively creates and adds nodes for the full layer.
- the parser creates layer objects when retrieving layers; these objects are decoupled from the lxml tree
Example usage
The following examples illustrate basic features of the parser. See the test modules for more examples.
Adding and modifying layers
In this example we will look at the file tests/data/coreference.naf and
- add an
entitieslayer to the tree - modify the
coreferenceslayer by adding a new span to the 'co1'corefelement
We will start by loading the NAF document:
naf = NafParser.load('tests/data/coreference.naf')
Adding layers
We want to add two entities, for the location USA and the person Kitty Genovese.
The Entity class provides a factory method to create entities from their id, type and target ids:
e1 = Entity.create('e1', 'LOC', ['w10'])
e2 = Entity.create('e2', 'PER', ['w12', 'w13'])
Now create the entities layer and add it to the tree:
entities = Entities([e1, e2])
naf.add_layer('entities', entities)
Alternatively, and because the entities layer is a container layer, it can be created directly from its elements list:
naf.add_layer_from_elements('entities', [e1, e2])
To verify that the layer has been added:
> naf.has_layer('entities')
True
We should also add a linguistic processor to the NAF header to explain how we came about these entities: Our linguistic processor is called 'linguistic intuition', version 1.0:
naf.add_linguistic_processor('entities', 'linguistic intuition', '1.0')
We could also have passed tool/data dependencies to this processor, and optional attributes like a timestamp.
The NAF header now holds one linguistic processor for the entities layer:
> len(naf.get_lps('entities'))
1
Modifying layers
The coreferences layer links the term 'murder' to the event murder of Kitty Genovese.
We will add 'Kitty Genovese' as corefering to the event.
Retrieve the coreferences layer:
coreferences = naf.get('coreferences')
Like entities and most NAF layers, the coreferences layer is a container element; we can index it to retrieve its Coref
elements:
co1 = coreferences[0]
NAF coref elements take one or more span children and optionally an externalReferences element. They are mapped to
Coref objects, which have a spans attribute listing their Span subelements, and a possibly ExternalReferences attribute.
Let us add a span over the terms 't12' and 't13':
co1.spans.append(Span.create(['t12', 't13']))
NAF objects are decoupled from the tree, we need now to replace the existing coreferences layer with a new one
constructed from our modified 'co1' coref. We will simply create a new layer object from its elements, and
allow it to replace the existing layer:
naf.add_layer_from_elements('coreferences', [co1], exist_ok=True)
Let us add a linguistic processor to record this modification
naf.add_linguistic_processor('coreferences', 'linguistic intuition', '1.0')
We now have 2 spans in the first coref element in the coreferences layer:
> len(naf.get('coreferences')[0].spans)
2
Adding covered text as comments
The parser is set to add the covered text of span elements as comments to span nodes.
To disable this, one can set the decorate flag of the constructor to False:
NafParser.load(file, decorate=False)
or
naf = NafParser(tree, decorate=False)
Note however that comments coming from an input file/tree are preserved.
Creating a NAF document from scratch
What if you have no NAF document yet, only text? We will create a NAF document, with the text "Colorless green ideas sleep furiously". The author is Noam Chomsky, and we will call this document 'chomsky_colorless.naf'.
Initiate a NAF document:
naf = NafParser(author='Noam Chomsky', filename='chomsky_colorless.naf')
Author name and filename are fileDesc attributes. Let us verify that they are now in the NAF header:
header = naf.get('nafHeader')
> header.fileDesc.has('author')
True
> header.fileDesc.get('author')
Noam Chomsky
> header.fileDesc.has('filename')
True
> header.fileDesc.get('filename')
chomsky_colorless.naf
Alternatively, and because fileDesc is the only element of its kind in NAF documents, we can directly retrieve it:
> naf.get('fileDesc').get('author')
Noam Chomsky
Now we can add a raw text layer:
naf.add_raw_layer('colorless green ideas sleep furiously')
Add the corresponding linguistic processor:
naf.add_linguistic_processor('raw', 'linguistic intuition', '1.0')
By default, the parser is set to keep previously defined linguistic processors for a given layer, so that each layer can have
several lp elements attached to it. To disable this and keep a single lp per layer, use the replace flag:
naf.add_linguistic_processor('raw', 'linguistic intuition', '1.0', replace=True)
Let us record this NAF document and write it to file:
os.makedirs('tests/out', exist_ok=True)
naf.write('tests/out/chomsky_colorless.naf')
To write to stdout:
> naf.write()
<?xml version='1.0' encoding='UTF-8'?>
<NAF xml:lang="en" version="3.3">
<nafHeader>
<fileDesc author="Noam Chomsky" filename="chomsky_colorless.naf"/>
<linguisticProcessors layer="raw">
<lp name="linguistic intuition" version="1.0"/>
</linguisticProcessors>
</nafHeader>
<raw><![CDATA[colorless green ideas sleep furiously]]></raw>
</NAF>
Expand source code
"""
.. include:: ../../docs/USAGE.md
"""
Sub-modules
nafparserpy.layers-
Layer modules provide convenience classes for NAF layers and their elements …
nafparserpy.parser-
Wraps lxml to facilitate handling of NAF documents