Package nafparserpy

nafparserpy

nafparserpy is a lightweight python XML wrapper for NAF.

The parser wraps lxml to handle NAF trees, providing convenience classes for NAF layers and elements. The resulting objects are decoupled from the underlying lxml tree: the user is responsible for creating and handling NAF objects, while the parser handles tree manipulation.

NAF version and DTD

The currently supported NAF version is 3.3. Layer and element classes follow closely the NAF DTD:

  • compulsory NAF attributes appear as fields (object attributes)
  • NAF subelements appear as fields of the corresponding class
  • all attributes (compulsory and optional) appear in an attrs dict attribute

See NAF-4-Development for more information on NAF.

Background

nafparserpy follows on KafNafParserPy by wrapping lxml to handle NAF XML trees, and providing convenience classes for handling NAF layers. Unlike KafNafParserPy, layer objects are decoupled from the underlying lxml etree, so that the user is responsible for creating and handling NAF objects, while the parser handles tree manipulation:

  • the parser allows to add full NAF layer objects to the NAF tree. The user application is responsible for creating these objects; the parser recursively creates and adds nodes for the full layer.
  • the parser creates layer objects when retrieving layers; these objects are decoupled from the lxml tree

Example usage

The following examples illustrate basic features of the parser. See the test modules for more examples.

Adding and modifying layers

In this example we will look at the file tests/data/coreference.naf and

  • add an entities layer to the tree
  • modify the coreferences layer by adding a new span to the 'co1' coref element

We will start by loading the NAF document:

naf = NafParser.load('tests/data/coreference.naf')

Adding layers

We want to add two entities, for the location USA and the person Kitty Genovese.

The Entity class provides a factory method to create entities from their id, type and target ids:

e1 = Entity.create('e1', 'LOC', ['w10'])
e2 = Entity.create('e2', 'PER', ['w12', 'w13'])

Now create the entities layer and add it to the tree:

entities = Entities([e1, e2])
naf.add_layer('entities', entities)

Alternatively, and because the entities layer is a container layer, it can be created directly from its elements list:

naf.add_layer_from_elements('entities', [e1, e2])

To verify that the layer has been added:

> naf.has_layer('entities')
True

We should also add a linguistic processor to the NAF header to explain how we came about these entities: Our linguistic processor is called 'linguistic intuition', version 1.0:

naf.add_linguistic_processor('entities', 'linguistic intuition', '1.0')

We could also have passed tool/data dependencies to this processor, and optional attributes like a timestamp.

The NAF header now holds one linguistic processor for the entities layer:

> len(naf.get_lps('entities'))
1

Modifying layers

The coreferences layer links the term 'murder' to the event murder of Kitty Genovese. We will add 'Kitty Genovese' as corefering to the event.

Retrieve the coreferences layer:

coreferences = naf.get('coreferences')

Like entities and most NAF layers, the coreferences layer is a container element; we can index it to retrieve its Coref elements:

co1 = coreferences[0]

NAF coref elements take one or more span children and optionally an externalReferences element. They are mapped to Coref objects, which have a spans attribute listing their Span subelements, and a possibly ExternalReferences attribute. Let us add a span over the terms 't12' and 't13':

co1.spans.append(Span.create(['t12', 't13']))

NAF objects are decoupled from the tree, we need now to replace the existing coreferences layer with a new one constructed from our modified 'co1' coref. We will simply create a new layer object from its elements, and allow it to replace the existing layer:

naf.add_layer_from_elements('coreferences', [co1], exist_ok=True)

Let us add a linguistic processor to record this modification

naf.add_linguistic_processor('coreferences', 'linguistic intuition', '1.0')

We now have 2 spans in the first coref element in the coreferences layer:

> len(naf.get('coreferences')[0].spans)
2

Adding covered text as comments

The parser is set to add the covered text of span elements as comments to span nodes. To disable this, one can set the decorate flag of the constructor to False:

NafParser.load(file, decorate=False)

or

naf = NafParser(tree, decorate=False)

Note however that comments coming from an input file/tree are preserved.

Creating a NAF document from scratch

What if you have no NAF document yet, only text? We will create a NAF document, with the text "Colorless green ideas sleep furiously". The author is Noam Chomsky, and we will call this document 'chomsky_colorless.naf'.

Initiate a NAF document:

naf = NafParser(author='Noam Chomsky', filename='chomsky_colorless.naf')

Author name and filename are fileDesc attributes. Let us verify that they are now in the NAF header:

header = naf.get('nafHeader')
> header.fileDesc.has('author')
True
> header.fileDesc.get('author')
Noam Chomsky
> header.fileDesc.has('filename')
True
> header.fileDesc.get('filename')
chomsky_colorless.naf

Alternatively, and because fileDesc is the only element of its kind in NAF documents, we can directly retrieve it:

> naf.get('fileDesc').get('author')
Noam Chomsky

Now we can add a raw text layer:

naf.add_raw_layer('colorless green ideas sleep furiously')

Add the corresponding linguistic processor:

naf.add_linguistic_processor('raw', 'linguistic intuition', '1.0')

By default, the parser is set to keep previously defined linguistic processors for a given layer, so that each layer can have
several lp elements attached to it. To disable this and keep a single lp per layer, use the replace flag:

naf.add_linguistic_processor('raw', 'linguistic intuition', '1.0', replace=True)

Let us record this NAF document and write it to file:

os.makedirs('tests/out', exist_ok=True)
naf.write('tests/out/chomsky_colorless.naf')

To write to stdout:

> naf.write()
<?xml version='1.0' encoding='UTF-8'?>
<NAF xml:lang="en" version="3.3">
  <nafHeader>
    <fileDesc author="Noam Chomsky" filename="chomsky_colorless.naf"/>
    <linguisticProcessors layer="raw">
      <lp name="linguistic intuition" version="1.0"/>
    </linguisticProcessors>
  </nafHeader>
  <raw><![CDATA[colorless green ideas sleep furiously]]></raw>
</NAF>
Expand source code
"""
.. include:: ../../docs/USAGE.md
"""

Sub-modules

nafparserpy.layers

Layer modules provide convenience classes for NAF layers and their elements …

nafparserpy.parser

Wraps lxml to facilitate handling of NAF documents