Nuxeo HxAI Connector (WIP)

An addon to bulk map, remap, chain-transform and send metadata and binaries to the Ingest service.

TL;DR

Get up and running quickly!

Ingesting files with default mapping

This will ingest all the files that are under the given <my-root-doc-id>:

curl -ss -u foo:bar -H 'Content-Type: application/json' <myNuxeoUrl>/nuxeo/api/v1/automation/Bulk.RunAction -d \
'{"params":{
    "query":"SELECT * FROM Document WHERE ecm:ancestorId = '\''<my-root-doc-id>'\''",
    "action":"ingest"
  }
}'

However, the IngestAction is very flexible and can be finely parameterized, as the other examples show.


Dry run mode

Ingestion offers a lot of possibilities via mapping and transformation. You will want to stay in dryRun mode until you have nailed your parameters.

Write the following content in myParameters.json and inject it the smart way:

{
  "dryRun": true
}

Start fresh from scratch

This will remove all mapping so you can build a new one piece by piece:

Write the following content in myParameters.json and inject it the smart way:

{
  "dryRun": true,
  "aggregateDefaultMapping": false,
  "aggregateDefaultTransformer": false,
  "replaceMapping": true
}

Start fresh with the defaults

This will get you back to default mapping on a document as if it was going to be ingested for the first time:

Write the following content in myParameters.json and inject it the smart way:

{
  "dryRun": true,
  "aggregateDefaultMapping": true,
  "aggregateDefaultTransformer": true,
  "replaceMapping": true
}

Overloading the current (last persisted) mapping

Let’s say you want to adjust your current mapping. You can override it this way:

Write the following content in myParameters.json and inject it the smart way:

{
  "dryRun": true,
  "inlineMapping": "!dc:contributors,!dc:description",
  "inlineTransformer": "dc:title=meta:name=_Flag"
}

That will remove a couple of properties and map dc:title to meta:name while changing its value with the _Flag function. See Mapping Documents and Transforming Documents for more info.


Replace current mapping while discarding defaults

Strip any mapping and transformer and add yours.

Write the following content in myParameters.json and inject it the smart way:

{
  "dryRun": true,
  "inlineMapping": "dc:contributors,dc:description",
  "inlineTransformer": "dc:title=meta:name=_Flag",
  "aggregateDefaultMapping": false,
  "aggregateDefaultTransformer": false,
  "replaceMapping": true
}

That will add a couple of properties and map dc:title to meta:name while changing its value with the _Flag function. See Mapping Documents and Transforming Documents for more info.


Replace current mapping and leverage defaults

Strip any mapping and transformer except defaults and add yours.

Write the following content in myParameters.json and inject it the smart way:

{
  "dryRun": true,
  "inlineMapping": "dc:contributors,dc:description",
  "inlineTransformer": "dc:title=meta:name=_Flag",
  "aggregateDefaultMapping": true,
  "aggregateDefaultTransformer": true,
  "replaceMapping": true
}

That will add a couple of properties and map dc:title to meta:name while changing its value with the _Flag function. See Mapping Documents and Transforming Documents for more info.


A word about Nuxeo

<schema xmlns:common="http://www.nuxeo.org/ecm/schemas/common/" name="common">
  <common:icon>/icons/pdf.png</common:icon>
</schema>
<schema xmlns:dc="http://www.nuxeo.org/ecm/schemas/dublincore/" name="dublincore">
  <dc:contributors>
    <item>Administrator</item>
  </dc:contributors>
  <dc:created>2024-11-21T15:38:08.620Z</dc:created>
  <dc:creator>Administrator</dc:creator>
  <dc:description>A poem from the heart</dc:description>
  <dc:lastContributor>Administrator</dc:lastContributor>
  <dc:modified>2024-11-21T15:55:19.496Z</dc:modified>
  <dc:nature>article</dc:nature>
  <dc:title>testPoem</dc:title>
</schema>

A word about Ingest

The Ingest service provides a REST API to send your documents to the Insight content lake.

The Ingest payload

The Ingest payload is an array of “ingest events” with two distinguishable parts.

The hard-coded part

A part of the schema is mandatory. It contains an object id and required fields.

The properties part

Data is expected this way:


Installing and configuring

Installation

  1. mp-install this addon’s package: <NUXEO_HOME>/nuxeoctl mp-install nuxeo-hxai-connector
  2. Update nuxeo.conf with the desired properties. Please refer to the list of configuration options in the section below.

Ingest Configuration

The following nuxeo.conf properties are available to configure the plugin.

Property name               Description
hxai.api.client.id          The HxAI client ID
hxai.api.client.secret      The HxAI client secret
hxai.api.auth.baseurl       The IDP base URL (ex: https://auth.iam.dev.experience.hyland.com)
hxai.api.ingest.baseurl     The HxAI ingest base URL (ex: https://ingestion-api.insight.dev.ncp.hyland.com/v1)
hxai.api.ingest.env.key     The ingest environment key
hxinsight.environment.type  The HxAI environment type
hxinsight.environment.id    The HxAI environment ID
hxinsight.service.name      The HxAI service name
nuxeo.hxai.sourceid         The HxAI source ID

Precisions about binaries flattening

As of today, Ingest only handles binaries at the root of the properties part. This is fine for simple properties like file:content but doesn’t work for complex properties nesting binaries, like files:files. There are several ways to flatten binaries so they comply with Ingest’s requirements:

The clean way

Use custom Mappers to split, for example, files:files into multiple simple properties and omit the initial array containing them. Custom mapping happens before Mapping and Transforming, so the properties generated during custom mapping can be transformed as well.

The fallback solution

Post-filtering the outgoing JSON payload flattens nested binaries that went unnoticed. If a complex type containing binaries does not have a custom mapping, the binaries are moved to the root of properties so that Ingest does not silently ignore them.

If this were done with files:files, the array containing the files would remain empty in the JSON payload, hence the contribution of a default IngestiblePropertyMapper.

Example

# original structure with a containing array:
{
  "my:complex": [
    {
      "file": {}
    },
    {
      "file": {}
    }
  ]
}

# with a custom mapper, could become:
{
  "renamed:transformed/0": {
    "file": {}
  },
  "renamed:transformed/1": {
    "file": {}
  }
}

# with post-filtering:
{
  "my:complex": [],
  "my:complex/0": {
    "file": {}
  },
  "my:complex/1": {
    "file": {}
  }
}
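
The post-filtering variant above can be sketched in a few lines of Java. This is only an illustration of the resulting shape, not the connector's code: the class name FlattenSketch is hypothetical, and the sketch assumes that every list found at the root of the properties holds binaries, which the real filter certainly decides more carefully.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: not the connector's actual implementation.
final class FlattenSketch {

    // Returns a copy of the properties where list items are moved to "<key>/<index>" root entries.
    // Assumption: every list at the root is treated as a container of binaries.
    static Map<String, Object> flatten(Map<String, Object> properties) {
        Map<String, Object> flat = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : properties.entrySet()) {
            if (entry.getValue() instanceof List<?>) {
                List<?> items = (List<?>) entry.getValue();
                // The emptied container stays in the payload, as in the example above.
                flat.put(entry.getKey(), new ArrayList<>());
                for (int i = 0; i < items.size(); i++) {
                    flat.put(entry.getKey() + "/" + i, items.get(i));
                }
            } else {
                flat.put(entry.getKey(), entry.getValue());
            }
        }
        return flat;
    }

    public static void main(String[] args) {
        Map<String, Object> properties = new LinkedHashMap<>();
        properties.put("my:complex", List.of(Map.of("file", Map.of()), Map.of("file", Map.of())));
        System.out.println(flatten(properties).keySet());
    }
}
```

The empty my:complex entry left behind is exactly what a custom mapper avoids.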

Nuxeo HxAI Connector to the rescue!

To ingest documents efficiently, the Nuxeo HxAI Connector does the following:

  1. Ingest existing repositories in a single command, leveraging the BAF.
  2. Map documents in a fine-grained way, that is, select which metadata to send for which document.
  3. Remap metadata, as metadata names might need to be modified.
  4. Transform the data on the fly by applying functions to it.
  5. Flatten binaries: final payloads cannot nest file metadata, everything must be flat.
  6. Automatically trigger ingestion with schedules.
  7. Consistently ingest documents the same way.
  8. Provide centralized configurations that can be modified for all eligible documents.
  9. Provide a default configuration per document type.
  10. Discard or aggregate the defaults at will.
  11. Offer a dry-run mode, so everything can be tested without any impact.

The HxAI Service

To be refactored. The Ingest operations of uploading files and sending events are implemented in the HxAI service.


Triggering ingestion on documents

We can leverage Nuxeo’s search capabilities, through a query language called NXQL, to target documents and send them to ingestion.

The IngestAction

We leverage the Nuxeo Bulk Action Framework (BAF) to run the ingestion:

curl -ss -u foo:bar -H 'Content-Type: application/json' <myNuxeoUrl>/nuxeo/api/v1/automation/Bulk.RunAction -d \
'{"params":{
    "query":"SELECT * FROM Document WHERE ecm:ancestorId = '\''<my-root-doc-id>'\''",
    "action":"ingest"
  }
}' -o /dev/null

Parameterizing the Action

The actual list of parameters taken by the Action:

{
  "inlineMapping": "dublincore,common",
  "inlineTransformer": "a=b=Function,c=d=OtherFunction",
  "replaceMapping": false,
  "aggregateDefaultMapping": false,
  "aggregateDefaultTransformer": false,
  "persistMapping": false
}

Parameters’ usage

The parameters of the IngestAction are of two types. Some can be persisted. Some can’t.

See the HxAI facet for more details about parameters’ persistence.

The dryRun param: boolean (false)

The inlineMapping param: String

An inline IngestMappingDescriptor to apply to Documents matching the NXQL query.

The inlineTransformer param: String

An inline IngestTransformerDescriptor to apply to Documents matching the NXQL query.

The replaceMapping param: boolean (false)

Replaces the mapping and transformer previously saved on the Document. Defaults to false.

The aggregateDefaultMapping param: boolean (true)

Leverages the default IngestMapping for the Document, based on its type. It is added to inlineMapping.

The aggregateDefaultTransformer param: boolean (true)

Leverages the default IngestTransformer for the Document, based on its type. It is added to inlineTransformer.

The persistMapping param: boolean (false)

Saves all parameters except persistMapping itself, replaceMapping and dryRun. This has no effect in dryRun mode.


Automating ingestion

Ingestion can be automated in two distinct ways: schedule-based (default) or purely event-based (disabled by default).

Schedule-based

Approach

Schedules are a long-standing feature of Nuxeo. They fire an event following a cron expression. Schedules are the preferred way to ingest your documents because this approach only requires read-only access to documents.

By setting up multiple schedules, the user could run multiple ingestion jobs on subparts of her repository, each with its own config.

Sample module with Schedules and corresponding EventListeners

Schedules

Here are 2 Schedules running every other second:

<?xml version="1.0" encoding="UTF-8"?>
<component name="org.nuxeo.hxai.crons.config" version="1.0.0">
  <extension target="org.nuxeo.ecm.core.scheduler.SchedulerService" point="schedule">
    <schedule id="ingest1">
      <eventId>ingest1</eventId>
      <eventCategory>ingest</eventCategory>
      <cronExpression>0/2 * * * * ?</cronExpression>
    </schedule>
    <schedule id="ingest2">
      <eventId>ingest2</eventId>
      <eventCategory>ingest</eventCategory>
      <cronExpression>1/2 * * * * ?</cronExpression>
    </schedule>
  </extension>
</component>

EventListener Contrib

<?xml version="1.0" encoding="UTF-8"?>
<component name="org.nuxeo.hxai.cron.events.listeners.config" version="1.0.0">
  <extension target="org.nuxeo.ecm.core.event.EventServiceComponent" point="listener">
    <listener name="ingest1" async="false" postCommit="false" priority="120" class="org.nuxeo.hxai.listeners.IngestListener1">
      <event>ingest1</event>
    </listener>
    <listener name="ingest2" async="false" postCommit="false" priority="120" class="org.nuxeo.hxai.listeners.IngestListener2">
      <event>ingest2</event>
    </listener>
  </extension>
</component>

EventListener code

This sample will always execute. You will want to update documents only if they have been updated in the last X time units depending on your specific use cases.

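package org.nuxeo.hxai.listeners;

import static org.nuxeo.ecm.core.api.security.SecurityConstants.SYSTEM_USERNAME;

import org.nuxeo.ecm.core.bulk.BulkService;
import org.nuxeo.ecm.core.bulk.message.BulkCommand;
import org.nuxeo.ecm.core.event.Event;
import org.nuxeo.ecm.core.event.EventListener;
import org.nuxeo.runtime.api.Framework;
// IngestAction and its parameter constants (INLINE_MAPPING, ...) come from this addon: import them as well.

public class IngestListener1 implements EventListener {

    @Override
    public void handleEvent(Event event) {
        // Target the documents to ingest with an NXQL query.
        String query = "SELECT * FROM Document WHERE ecm:path = '/default-domain/workspaces/test/test'";
        BulkCommand command = new BulkCommand.Builder(IngestAction.ACTION_NAME, query,
                SYSTEM_USERNAME).param(INLINE_MAPPING, "files:files,file:content,dublincore,tags,foo:bar")
                                .param(INLINE_TRANSFORMER, "files:files/=my:binaries")
                                .param(REPLACE_MAPPING, true)
                                .param(DRY_RUN_MODE, true) // nothing is sent while in dry-run mode
                                .build();
        Framework.getService(BulkService.class).submit(command);
    }
}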
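
To only pick up documents updated in the last X time units, the listener's query could be restricted on dc:modified. The sketch below only builds such an NXQL query string; the class and method names are hypothetical, and the time zone in which the server interprets the TIMESTAMP literal (UTC is assumed here) should be checked against your setup.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Illustrative sketch only: builds a query string, it does not talk to Nuxeo.
final class RecentDocumentsQuerySketch {

    // NXQL TIMESTAMP literals look like '2024-11-21 15:38:08'. Assumption: UTC is the right zone.
    static final DateTimeFormatter NXQL_TIMESTAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.UTC);

    // Selects the descendants of rootId modified within the given window before 'now'.
    static String modifiedSince(String rootId, Instant now, Duration window) {
        String since = NXQL_TIMESTAMP.format(now.minus(window));
        return "SELECT * FROM Document WHERE ecm:ancestorId = '" + rootId + "'"
                + " AND dc:modified >= TIMESTAMP '" + since + "'";
    }

    public static void main(String[] args) {
        System.out.println(modifiedSince("<my-root-doc-id>", Instant.now(), Duration.ofMinutes(5)));
    }
}
```

Choosing a window slightly larger than the schedule period avoids missing documents modified between two runs, at the cost of occasionally ingesting a document twice.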

The IngestUpdateListener

Approach

The IngestUpdateListener will trigger ingestion on a document when it is updated, if it has the hxai facet. However, for the document to have the facet, you must have already sent it for ingestion once with mapping persistence. This is not the preferred way to tackle ingestion because:

Enabling the IngestUpdateListener

It is disabled by default (see Automating Ingestion). You can enable it by contributing the following:

<?xml version="1.0" encoding="UTF-8"?>
<component name="org.nuxeo.hxai.events.listener.config.test" version="1.0.0">
  <require>org.nuxeo.hxai.events.listener.config</require>
  <extension target="org.nuxeo.ecm.core.event.EventServiceComponent" point="listener">
    <listener name="ingestlistener" enabled="true"/>
  </extension>
</component>

The HxAI facet

The flagging role

The HxAI facet acts like a flag to tell Nuxeo that a document has already been ingested once and is eligible for an ingestion update when necessary.

The persistence function: hxai schema

The hxai schema holds valuable ingestion-related information. Aside from the ingestion status, the following IngestAction parameters are stored so that, if desired, a document can be re-ingested (for update) exactly the same way as last time:

The following IngestAction parameters are not storable:


Default configuration

Document type based defaults

Default configuration is based on Document type. If you want to register a default IngestMappingDescriptor or a default IngestTransformerDescriptor for a certain Document type, simply give the type as Descriptor id.

Default provision

Contributing Descriptors

Please see Contributing Mappings and keep in mind that, to override the default IngestMappingDescriptor, you need to require the IngestMappingServiceComponent:

<require>org.nuxeo.hxai.IngestMappingServiceComponent</require>

Injecting ingestion parameters

Those parameters need to be stringified to be sent in our query (in an additional parameters key):

The hard way

We can write stringified JSON by hand, escaping all sensitive characters:

"{\"inlineMapping\":\"dublincore,common\",\"inlineTransformer\":\"a=b=Function,c=d=OtherFunction\",\"replaceMapping\":false,\"aggregateDefaultMapping\":false,\"aggregateDefaultTransformer\":false,\"persistMapping\":false}"

The cool way

But who wants to do that? Let’s simply do:

$(jq -c . myParameters.json | jq -R .)
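
When the call is built from Java rather than from a shell, the same stringification can be done in code. The following is a minimal sketch, not part of the connector: StringifySketch and its method are hypothetical names, and a JSON library would normally do this job.

```java
// Illustrative sketch only: a JSON library would normally handle this.
final class StringifySketch {

    // Wraps a JSON document into a JSON string literal, escaping what must be escaped.
    static String stringify(String json) {
        StringBuilder out = new StringBuilder("\"");
        for (char c : json.toCharArray()) {
            switch (c) {
                case '"':
                    out.append("\\\"");
                    break;
                case '\\':
                    out.append("\\\\");
                    break;
                default:
                    if (c < 0x20) {
                        // Control characters (line breaks, tabs...) become unicode escapes.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.append('"').toString();
    }

    public static void main(String[] args) {
        String parameters = "{\"dryRun\":true,\"inlineMapping\":\"dublincore,common\"}";
        String body = "{\"params\":{\"query\":\"SELECT * FROM Document\",\"action\":\"ingest\","
                + "\"parameters\":" + stringify(parameters) + "}}";
        System.out.println(body);
    }
}
```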

Sample parameterized query

Thus, the complete query becomes as below:

Plain

curl -ss -u foo:bar -H 'Content-Type: application/json' <myNuxeoUrl>/nuxeo/api/v1/automation/Bulk.RunAction -d \
'{"params":{
    "query":"SELECT * FROM Document WHERE ecm:ancestorId = '\''<my-root-doc-id>'\''",
    "action":"ingest",
    "parameters": "{\"inlineMapping\":\"dublincore,common\",\"inlineTransformer\":\"a=b=Function,c=d=OtherFunction\",\"replaceMapping\":false,\"aggregateDefaultMapping\":false,\"aggregateDefaultTransformer\":false,\"persistMapping\":false}"
  }
}'

Smart

curl -ss -u foo:bar -H 'Content-Type: application/json' <myNuxeoUrl>/nuxeo/api/v1/automation/Bulk.RunAction -d \
'{"params":{
    "query":"SELECT * FROM Document WHERE ecm:ancestorId = '\''<my-root-doc-id>'\''",
    "action":"ingest",
    "parameters": '"$(jq -c . myParameters.json | jq -R .)"'
  }
}'

Mapping documents

Mapping types

Mappings are defined this way:

Unprefixed properties

Although they are supported, it is discouraged to use unprefixed properties:

files # will add files:files to the mapping

Prefixed properties

dc:title # adding single properties, one by one.

Schemas

dublincore # map the 18 properties present in dublincore

Mapping reference

More info about Mapping references

@myMappingReference # map all the mappings found in the 'myMappingReference' Mapping

Inline IngestMappingDescriptors

Chaining comma-separated Mappings helps build complete mappings in one line:

dc:title,dc:description #,.. there are 18 properties with the dc: prefix...
# I don't want to type them all, just take the whole dublincore schema! (and the common schema! because why not?)
dublincore,common
# I can also get "simples", properties without a prefix
dublincore,icon
# OK I want the whole dublincore and common schemas, except dc:title
dublincore,common,!dc:title
# Order matters
dublincore,!dc:title # OK: add all dublincore except dc:title
!dc:title,dublincore # Useless: removes dc:title but adds it back!
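
Why order matters can be illustrated with a small Java sketch that applies the expressions from left to right. It is not the connector's implementation: the class name is hypothetical, the schema registry is a truncated stand-in for the real schemas, and unprefixed properties and @references are left out.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: not the connector's actual implementation.
final class InlineMappingSketch {

    // Hypothetical, truncated schema registry: the real schemas hold more properties.
    static final Map<String, List<String>> SCHEMAS = Map.of(
            "dublincore", List.of("dc:title", "dc:description", "dc:creator"),
            "common", List.of("common:icon"));

    // Applies the comma-separated mapping expressions in order.
    static Set<String> resolve(String inlineMapping) {
        Set<String> selected = new LinkedHashSet<>();
        for (String raw : inlineMapping.split(",")) {
            String expression = raw.trim();
            if (expression.isEmpty()) {
                continue; // tolerate stray commas
            }
            boolean removal = expression.startsWith("!");
            String name = removal ? expression.substring(1) : expression;
            // A schema name expands to its properties, anything else is a single property.
            List<String> properties = SCHEMAS.getOrDefault(name, List.of(name));
            if (removal) {
                selected.removeAll(properties);
            } else {
                selected.addAll(properties);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        System.out.println(resolve("dublincore,!dc:title"));
        System.out.println(resolve("!dc:title,dublincore"));
    }
}
```

In the second call the removal runs on an empty selection, so the schema that follows adds dc:title back.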

Default Mappings

Baseline defaults

These baseline default mappings are applied to documents without document-type-specific default mappings:

By document type defaults

If a mapping contribution’s id is a document type, it will be used as default mapping instead of the baseline defaults for that document type. See Contributing Mappings.


Custom Mappings

Contributing Mappings

The IngestMappingDescriptor can be contributed via XML, validated and ready to use at runtime.

<?xml version="1.0"?>
<component name="org.nuxeo.hxai.IngestMappingServiceComponent.test.referencing" version="1.0">
  <extension target="org.nuxeo.hxai.IngestMappingServiceComponent" point="ingestMappings">
    <!--default for Picture typed documents-->
    <ingest id="Picture">
      <properties>dc:title,icon,relatedtext:relatedtextresources</properties>
    </ingest>
    <!--to be referred to as @first-->
    <ingest id="first">
      <properties>dc:title,icon,relatedtext:relatedtextresources</properties>
    </ingest>
    <ingest id="second">
      <properties>dc:description,uid:major_version,uid:minor_version</properties>
    </ingest>
    <ingest id="third">
      <properties>dc:content-type</properties>
    </ingest>
  </extension>
</component>

Using contributed Mappings

Nesting a contributed Mapping

Now let’s say I have a contributed Mapping with 45 properties in it:

  1. I want them all!
  2. Ah… not this one…
  3. I want it all persisted in a new contributed IngestMappingDescriptor, so I can reuse it at will.

I can avoid retyping 44 properties and keep a strong connection with the original mapping by nesting it.

<?xml version="1.0"?>
<component name="org.nuxeo.hxai.IngestMappingServiceComponent.test.referencing" version="1.0">
  <extension target="org.nuxeo.hxai.IngestMappingServiceComponent" point="ingestMappings">
    <ingest id="first">
      <properties>@bigMapping,!un:wantedprop</properties>
    </ingest>
  </extension>
</component>

Referencing a contributed Mapping inline

Just like in the XML contributions:

dublincore,@first # duplicate dc:title mapping will be processed only once
# what if I don't want the relatedtext:relatedtextresources brought by @first ?
dublincore,@first,!relatedtext:relatedtextresources # let's take it off

Mixing things up: Yes we can!

In the same IngestMappingDescriptor, use schemas, properties, mapping references to add and remove whatever we want.

It is a good practice to add the removal mapping expressions at the end, so they don’t come back by mistake.

<?xml version="1.0"?>
<component name="org.nuxeo.hxai.IngestMappingServiceComponent.test.referencing" version="1.0">
  <extension target="org.nuxeo.hxai.IngestMappingServiceComponent" point="ingestMappings">
    <ingest id="mixItAllUp">
      <properties>common,dc:title,@bigMapping,!@optionalMappings,!un:wantedprop,!uid</properties>
    </ingest>
  </extension>
</component>

Debugging Mappings

Logs are an important part of this module. They can pinpoint errors in your Descriptors.

Successful (happy) logs

They let you follow the recursive instantiation:

DEBUG [IngestMappingServiceImpl] processing mapping descriptor: default
DEBUG [IngestMappingServiceImpl] IngestMapping: 'default' was processed successfully.
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: first
DEBUG [IngestMappingServiceImpl] IngestMapping: first directly depends on: second
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: second
DEBUG [IngestMappingServiceImpl] IngestMapping: second directly depends on: third
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: third
DEBUG [IngestMappingServiceImpl] IngestMapping: 'third' was processed successfully.
DEBUG [IngestMappingServiceImpl] IngestMapping: 'second' was processed successfully.
DEBUG [IngestMappingServiceImpl] IngestMapping: 'first' was processed successfully.

Successful (verbose) logs

They let you follow what happens for each mapping:

DEBUG [IngestMappingServiceImpl] processing mapping descriptor: default
TRACE [SimpleIngestMapping] the 'dublincore' mapping was identified as a schema.
TRACE [SimpleIngestMapping] processing mapping: 'dublincore'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'default' was processed successfully.
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: first
DEBUG [IngestMappingServiceImpl] IngestMapping: first directly depends on: second
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: second
DEBUG [IngestMappingServiceImpl] IngestMapping: second directly depends on: third
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: third
TRACE [SimpleIngestMapping] the 'dc:content-type' mapping was identified as a property.
TRACE [SimpleIngestMapping] processing mapping: 'dc:content-type'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'third' was processed successfully.
TRACE [SimpleIngestMapping] the 'dc:description' mapping was identified as a property.
TRACE [SimpleIngestMapping] processing mapping: 'dc:description'
TRACE [SimpleIngestMapping] the '@third' mapping was identified as reference to another mapping.
TRACE [SimpleIngestMapping] processing mapping: '@third'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'second' was processed successfully.
TRACE [SimpleIngestMapping] the 'dc:title' mapping was identified as a property.
TRACE [SimpleIngestMapping] processing mapping: 'dc:title'
TRACE [SimpleIngestMapping] the '@second' mapping was identified as reference to another mapping.
TRACE [SimpleIngestMapping] processing mapping: '@second'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'first' was processed successfully.

Mapping cycle detection

Set up a cycle

Normal contribution

Since we have opened the way for mapping references, we have cycle detection. Let’s consider the following:

<ingest id="first">
  <properties>dc:title,@second</properties>
</ingest>
<ingest id="second">
  <properties>dc:description,@third</properties>
</ingest>
<ingest id="third">
  <properties>dc:content-type,</properties>
</ingest>

Problematic contribution

Let’s break it with an override making third depend on forth and make a cyclic reference from forth to second:

<ingest id="third">
  <properties>dc:content-type,@foo,@forth</properties>
</ingest>
<ingest id="forth">
  <properties>@second</properties>
</ingest>

This prevents Nuxeo from starting (avoiding further harm), but we also need a way to track the problem down.
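
A cycle like the one contributed above can be detected with a depth-first traversal that remembers the path currently being explored. The sketch below is an assumption about how such a check may work, not the code of IngestMappingServiceImpl; the class name is hypothetical and each mapping is reduced to the list of mappings it references.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: not the connector's actual implementation.
final class MappingCycleSketch {

    // Walks the references depth-first and fails on the first mapping met twice on the current path.
    static void check(String id, Map<String, List<String>> references, Deque<String> path, Set<String> done) {
        if (path.contains(id)) {
            List<String> cycle = new ArrayList<>(path);
            cycle.add(id);
            throw new IllegalArgumentException("Detected cycle in IngestMapping: " + String.join("->", cycle));
        }
        if (done.contains(id)) {
            return; // already validated through another path
        }
        path.addLast(id);
        for (String dependency : references.getOrDefault(id, List.of())) {
            check(dependency, references, path, done);
        }
        path.removeLast();
        done.add(id);
    }

    static void check(String id, Map<String, List<String>> references) {
        check(id, references, new ArrayDeque<>(), new HashSet<>());
    }

    public static void main(String[] args) {
        Map<String, List<String>> references = Map.of(
                "first", List.of("second"),
                "second", List.of("third"),
                "third", List.of("foo", "forth"),
                "forth", List.of("second"));
        try {
            check("first", references);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the current path, rather than a plain set of visited ids, is what makes it possible to report the whole chain of references.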


Logging to the rescue!

Easily find cyclic references:

// TL;DR
java.lang.IllegalArgumentException: Detected cycle in IngestMapping: first->second->third->forth->second
// Full stack
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: default
TRACE [SimpleIngestMapping] the 'dublincore' mapping was identified as a schema.
TRACE [SimpleIngestMapping] processing mapping: 'dublincore'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'default' was processed successfully.
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: first
DEBUG [IngestMappingServiceImpl] IngestMapping: first directly depends on: second
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: second
DEBUG [IngestMappingServiceImpl] IngestMapping: second directly depends on: third
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: third
DEBUG [IngestMappingServiceImpl] IngestMapping: third directly depends on: foo->forth
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: foo
TRACE [SimpleIngestMapping] the 'common' mapping was identified as a schema.
TRACE [SimpleIngestMapping] processing mapping: 'common'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'foo' was processed successfully.
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: forth
DEBUG [IngestMappingServiceImpl] IngestMapping: forth directly depends on: second
ERROR [RegistrationInfoImpl] Component service:org.nuxeo.hxai.IngestMappingServiceComponent notification of application started failed: Detected cycle in IngestMapping: first->second->third->forth->second
java.lang.IllegalArgumentException: Detected cycle in IngestMapping: first->second->third->forth->second
        at org.nuxeo.hxai.service.IngestMappingServiceImpl.processMappingDescriptor(IngestMappingServiceImpl.java:67) ~[classes/:?]
        at org.nuxeo.hxai.service.IngestMappingServiceImpl.lambda$processMappingDescriptor$5(IngestMappingServiceImpl.java:93) ~[classes/:?]

Contributing custom Property Mappers

IngestiblePropertyMappers allow you to map certain properties the way you want. They are useful to customize how complex properties are mapped. They implement java.util.function.Consumer<PropertyMappingContext>, which lets them access any element inside the properties part object of the IngestibleDocument. This makes it possible to flatten a single property into multiple ones, as is done for files:files, which is spread into multiple files:files/n entries.

Sample custom contribution

Here is a sample contribution to map my:property. It declares my.custom.Mapper as the Mapper to use.

<?xml version="1.0" encoding="UTF-8"?>
<component name="my.component" version="1.0">
  <extension target="org.nuxeo.hxai.IngestMappingServiceComponent" point="ingestPropertyMappers">
    <ingestPropertyMappers id="myFileMappers">
      <class property="my:property">my.custom.Mapper</class>
    </ingestPropertyMappers>
  </extension>
</component>

Merge mechanism

The IngestiblePropertyMappers don’t merge but replace each other. You can still containerize your IngestiblePropertyMappers by descriptor ID.

Default IngestiblePropertyMapper

The default IngestionMappingService configuration comes with a single IngestiblePropertyMapper for files:files. As of today, it does mandatory work to comply with the Ingest REST API specs and should not be touched: it flattens the binaries in files:files at the root of the properties part of the JSON object representing each document.

Targeting the right property

To target a property, you need to write its prefixed form in the property attribute of each IngestPropertyMapper class. See Sample custom contribution for details.


Transformers: Remap and Transform

Transformations

Transforming covers two things: remapping keys and actually transforming values.

A Transformation has three optional parts: a left side, a right side and a function.

It is done this way:

# Remap only
dc:=base: # Remap all dublincore properties, replacing their prefix with 'base'
:title=:name # Remap all properties suffixed 'title' to the suffix 'name'
files:files/=ingestion:binaries # Remap all files:files/whatever into ingestion:binaries/whatever

# Transform only
==Function # Apply Function to everything
a==Function # Apply Function to a, don't rename it

# Remap and transform
a=b=Function # Map simple property a to b and apply Function
:title=:name=Function # Remap all properties suffixed 'title' and apply Function to them
a:b=c:d=Function # Exactly map a:b to c:d and apply Function to it
files:files/=ingestion:binaries=Function # Remap all flattened items from files:files/whatever to ingestion:binaries/whatever and apply Function to them one by one

Chaining Transformations into Transformers

Transformations can be chained (joined by , separators) into a Transformer, which will apply them in order. Here is an inline IngestTransformerDescriptor:

# ⚠️ The following will not work as expected ⚠️
a=b=Function,a=b=OtherFunction # After being remapped to b, a is no longer matched by the second transformation and OtherFunction is not applied.
# This would work, but there is a better way below
a=b=Function,b==OtherFunction # Function will be applied before OtherFunction

Transformation Functions

Chaining Transformation Functions

The functions used by Transformations implement java.util.function.Function<Serializable, Serializable>, which allows chaining them:

# The most reliable solution to chain functions on a single property doesn't require you to figure things out:
a=b=Function1=Function2=Function3,c==Function1=Function3 # a is renamed to b and Function1 to 3 are applied to it in order. c will then be transformed by Function1, then Function3
# Hard to distinguish both Transformations from each other? Add some commas. It's free!
a=b=Function1=Function2=Function3,,,c==Function1=Function3 # same result
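
Since the functions are plain java.util.function.Function instances, a chain such as left=right=Function1=Function2 can be composed with andThen. The sketch below illustrates that idea only: the class name, the function registry and the Upper and Exclaim functions are hypothetical, and the real parser validates far more than this one does.

```java
import java.io.Serializable;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch only: not the connector's actual implementation.
final class TransformationSketch {

    // Hypothetical functions, registered by simple name for the sake of the example.
    static final Map<String, Function<Serializable, Serializable>> FUNCTIONS = Map.of(
            "Upper", value -> value.toString().toUpperCase(),
            "Exclaim", value -> value + "!");

    final String left;  // source key, empty when omitted
    final String right; // target key, empty when omitted
    final Function<Serializable, Serializable> function;

    TransformationSketch(String expression) {
        // The -1 limit keeps empty parts, so "c==Upper" yields an empty right side.
        String[] parts = expression.split("=", -1);
        left = parts[0];
        right = parts.length > 1 ? parts[1] : "";
        Function<Serializable, Serializable> chain = Function.identity();
        for (int i = 2; i < parts.length; i++) {
            Function<Serializable, Serializable> next = FUNCTIONS.get(parts[i]);
            if (next == null) {
                throw new IllegalArgumentException("Unknown function: '" + parts[i] + "'");
            }
            chain = chain.andThen(next); // functions are applied in declaration order
        }
        function = chain;
    }

    public static void main(String[] args) {
        TransformationSketch transformation = new TransformationSketch("a=b=Upper=Exclaim");
        System.out.println(transformation.left + " -> " + transformation.right + ": "
                + transformation.function.apply("hello"));
    }
}
```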

Default function location

There is a default package for functions used by Transformers. If you put your functions there, you don’t need to specify their package:

// assumed package
org.nuxeo.hxai.ingest.functions

Custom function locations

However, functions can be anywhere else:

MyFunction # points to org.nuxeo.hxai.ingest.functions.MyFunction
.MyFunction # same thing
.my.sub.package.MyOtherFunction # points to org.nuxeo.hxai.ingest.functions.my.sub.package.MyOtherFunction
my.complete.package.MyFunction # use a canonical name
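
The naming rules above boil down to a short resolution routine. This sketch is an assumption about how the names may be resolved, with a hypothetical class name, and it only computes the class name without loading anything.

```java
// Illustrative sketch only: not the connector's actual implementation.
final class FunctionNameSketch {

    static final String DEFAULT_PACKAGE = "org.nuxeo.hxai.ingest.functions";

    // Turns the name used in a Transformation into a fully qualified class name.
    static String resolve(String name) {
        if (name.startsWith(".")) {
            return DEFAULT_PACKAGE + name; // relative to the default package
        }
        if (name.contains(".")) {
            return name; // already a canonical name
        }
        return DEFAULT_PACKAGE + "." + name; // simple name in the default package
    }

    public static void main(String[] args) {
        System.out.println(resolve("MyFunction"));
        System.out.println(resolve(".my.sub.package.MyOtherFunction"));
        System.out.println(resolve("my.complete.package.MyFunction"));
    }
}
```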

Provided testing functions

A few provided functions:

# The underscore differentiates bundled test functions from others.
_Flag # will flag a property so you can be sure it was touched
_Concat # will concatenate a distinguishable value to the property value
_Count # initiates or increments a numeric value to tell you how many times it was applied

Contributing Transformers

The IngestTransformerDescriptor can be contributed via XML, validated and ready to use at runtime. They are a centralized way to define remappings and transformations.

<?xml version="1.0"?>
<component name="org.nuxeo.hxai.IngestMappingServiceComponent.test.transforming" version="1.0">
  <extension target="org.nuxeo.hxai.IngestMappingServiceComponent" point="ingestTransformers">
    <transformer id="example1">
      <!-- dc:title will be remapped as foo:bar and transformed with the indicated implementation of Function<Serializable, Serializable> -->
      <transformations>dc:title=foo:bar=MyFunction</transformations>
    </transformer>
    <transformer id="example2"><transformations>foo:bar=dc:title=MyUnfunction</transformations></transformer>
  </extension>
</component>

Debugging Transformations

Detecting malformed Transformations

Transformations can be malformed too. Malformed Contributions will be caught at the initialization of Nuxeo:

// Missing left side
DEBUG [SimpleIngestTransformer$Transformation] Instanciating Transformation: 'inline#=c=_Flag'.
TRACE [SimpleIngestTransformer$Transformation] Transformation: 'inline#=c=_Flag' left side: 'null' is of type: 'STAR' right side: 'c' is of type: 'SIMPLE'.
        java.lang.IllegalArgumentException: Malformed Transformation: 'inline#=c=_Flag' with a missing left side.

// Left side only
DEBUG [SimpleIngestTransformer$Transformation] Instanciating Transformation: 'inline#a=='.
TRACE [SimpleIngestTransformer$Transformation] Transformation: 'inline#a==' left side: 'a' is of type: 'SIMPLE' right side: 'null' is of type: 'STAR'.
        java.lang.IllegalArgumentException: Malformed Transformation: 'inline#a==' with a left side only.

// Right side only
DEBUG [SimpleIngestTransformer$Transformation] Instanciating Transformation: 'inline#=c='.
TRACE [SimpleIngestTransformer$Transformation] Transformation: 'inline#=c=' left side: 'null' is of type: 'STAR' right side: 'c' is of type: 'SIMPLE'.
        java.lang.IllegalArgumentException: Malformed Transformation: 'inline#=c=' with a right side only.

Detecting excessive remappings

As mentioned, Transformations also remap properties. The remapping is expressed by the left and right sides, which are XPaths of the form prefix:suffix. The prefix acts like a directory and the suffix like a file inside that directory.

Be careful to avoid excessive remapping, that is, mapping several source properties to the same target:

// All a: prefixed properties would end up overriding each other as the simple c property (like a:foo, a:bar, a:baz, a:qux would overlap as 'icon' for example)
XPath: 'a:' cannot be the left side of: 'c' in Transformation: 'inline#a:=c=_Flag'. 'a:' is a prefix and can only be mapped to another prefix.
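
The rule behind this error can be sketched as follows. STAR and SIMPLE are the type names visible in the TRACE logs; the names PREFIX, SUFFIX and FULL, the class name and the method names are assumptions made for illustration, as is treating a lone ':' as STAR.

```java
// Sketch only: classifies one side of a remapping and flags the prefix rule described above.
final class XPathSideSketch {

    // STAR and SIMPLE appear in the logs; the three other names are assumed.
    enum Type { STAR, SIMPLE, PREFIX, SUFFIX, FULL }

    static Type typeOf(String side) {
        // Assumption: an absent, empty or ':'-only side matches everything.
        if (side == null || side.isEmpty() || side.equals(":")) {
            return Type.STAR;
        }
        int colon = side.indexOf(':');
        if (colon < 0) {
            return Type.SIMPLE;   // 'c'
        }
        if (colon == 0) {
            return Type.SUFFIX;   // ':title'
        }
        if (colon == side.length() - 1) {
            return Type.PREFIX;   // 'a:'
        }
        return Type.FULL;         // 'dc:title'
    }

    /** True when a prefix would be mapped to anything other than a prefix (or nothing). */
    static boolean isExcessive(String left, String right) {
        if (typeOf(left) != Type.PREFIX) {
            return false;
        }
        Type target = typeOf(right);
        return target != Type.PREFIX && target != Type.STAR;
    }
}
```

With this sketch, isExcessive("a:", "c") is true, matching the error above, while isExcessive("a:", "c:") is false. Only the prefix rule stated in the error message is covered; other invalid combinations are not checked.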

Remapping combinations glossary

You may also want to see the Transformation combinations glossary.

There are many possible combinations:

Legend

The full form of a remapping looks like so:

1:2=3:4
Symbol Meaning
valid remap
⚪️ no remap
invalid remap (many possible source, one target)

Star to…

Status Pattern Meaning
⚪️ = star to star
=3 star to simple
=3: star to prefix
=:4 star to suffix
=3:4 star to full

Simple to…

Status Pattern Meaning
⚪️ 1= simple to star
1=3 simple to simple
1=3: simple to prefix
1=:4 simple to suffix
1=3:4 simple to full

Prefix to…

Status Pattern Meaning
⚪️ 1:= prefix to star
1:=3 prefix to simple
1:=3: prefix to prefix
1:=:4 prefix to suffix
1:=3:4 prefix to full

Suffix to…

Status Pattern Meaning
⚪️ :2= suffix to star
:2=3 suffix to simple
:2=3: suffix to prefix
:2=:4 suffix to suffix
:2=3:4 suffix to full

Full to…

Status Pattern Meaning
⚪️ 1:2= full to star
1:2=3 full to simple
1:2=3: full to prefix
1:2=:4 full to suffix
1:2=3:4 full to full
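
To make the same-kind combinations concrete, here is a minimal sketch of how a single property key could be rewritten. It assumes that a prefix-to-prefix remap keeps the suffix and that a suffix-to-suffix remap keeps the prefix. The class and method names are hypothetical, and mixed combinations (for example full to prefix) are deliberately not handled.

```java
// Sketch only: rewrites one key for simple, prefix, suffix and full remaps of the same kind.
final class RemapKeySketch {

    /** Returns the remapped key, or the key unchanged when the left side does not match. */
    static String remap(String key, String left, String right) {
        if (left.endsWith(":") && right.endsWith(":")) {
            // Prefix to prefix, e.g. 'a:' to 'c:' turns 'a:foo' into 'c:foo'.
            return key.startsWith(left) ? right + key.substring(left.length()) : key;
        }
        if (left.startsWith(":") && right.startsWith(":")) {
            // Suffix to suffix, e.g. ':2' to ':4' turns '1:2' into '1:4'.
            return key.endsWith(left)
                    ? key.substring(0, key.length() - left.length()) + right
                    : key;
        }
        // Simple to simple and full to full: exact match only.
        return key.equals(left) ? right : key;
    }
}
```

For instance, remap("a:foo", "a:", "c:") gives "c:foo", and remap("dc:title", "dc:title", "foo:bar") gives "foo:bar".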

Transformation combinations glossary

There are fewer combinations than in the Remapping combinations glossary, but still quite a few:

Legend

The full form of a transformation looks like so:

left=right=function1[[=function2]...]

where left and right are the parts of a valid remapping.

Symbol Meaning
valid transformation
⚪️ no transformation
invalid transformation

Nothing

Status Pattern Meaning
⚪️ == no transformation (also valid for [=,:]*)

No left

Status Pattern Meaning
==Function Transform every value
=right= Only right side provided
=right=Function Missing left side

No right

Status Pattern Meaning
left== Left side only provided
left==Function Transform value for keys matching left expression without remapping

No Function

Status Pattern Meaning
left=right= Remap left matching keys to right expression

Complete Transformation

Status Pattern Meaning
left=right=Function Transform value for keys matching left expression and remap them to right expression
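
The form left=right=function1[[=function2]...] can be illustrated with a small sketch that applies a chain of functions to a property map. It assumes the functions run from left to right and that the left side is an exact key (or empty, meaning every key). The class and method names are hypothetical.

```java
import java.io.Serializable;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch only: applies 'left=right=f1=f2...' to a copy of the given properties.
final class TransformationChainSketch {

    static Map<String, Serializable> apply(
            Map<String, Serializable> properties,
            String left,
            String right,
            List<Function<Serializable, Serializable>> functions) {
        // Assumption: functions are chained in the order they are written.
        Function<Serializable, Serializable> chain = Function.identity();
        for (Function<Serializable, Serializable> function : functions) {
            chain = chain.andThen(function);
        }
        Map<String, Serializable> result = new LinkedHashMap<>();
        for (Map.Entry<String, Serializable> entry : properties.entrySet()) {
            boolean matches = left.isEmpty() || entry.getKey().equals(left);
            if (matches) {
                // An empty right side means "transform without remapping".
                String target = right.isEmpty() ? entry.getKey() : right;
                result.put(target, chain.apply(entry.getValue()));
            } else {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }
}
```

With an empty function list the sketch behaves like left=right= (remap only), and with an empty right side it behaves like left==Function. The input map is never modified, which keeps dry runs free of side effects.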

CI/CD

Workflows

CI/CD workflows are present here and they include:

Versioning and release strategy

To Pre-Production Marketplace

To Production Marketplace

Automatic Version bump

Once MAJOR.(MINOR+1).0 is released to production, the project’s version is automatically bumped to MAJOR.(MINOR+1)-SNAPSHOT. Until the next production release, MAJOR.(MINOR+2), upcoming PR merges are deployed to pre-production as MAJOR.(MINOR+1).1, MAJOR.(MINOR+1).2 and so on.


Updating the docs

Do not edit this file directly. It is autogenerated, as is content.html, which serves as the package’s embedded documentation.

Updating this README.md

The file to edit is doc/README.md. Then generate its table of contents (TOC):

pandoc --toc --toc-depth=6 -s -t gfm -o README.md doc/README.md

⚠️ The CI build and test pipeline will verify that this file, README.md, is equal to the result of the above command.

This is to make sure:

Updating the package doc: content.html

The content.html is not a source file, it is generated in the CI build and test pipeline by the following command and packaged appropriately:

Responsive color theme

Responsive to your OS settings:

pandoc --toc --toc-depth=6 -s --embed-resources --css doc/nour-auto-lail-nahar.css --highlight-style doc/nour-lail.theme -o <path-to-be-determined.html> doc/README.md

Dark theme

pandoc --toc --toc-depth=6 -s --embed-resources --css doc/nour-lail.css --highlight-style doc/nour-lail.theme -o <path-to-be-determined.html> doc/README.md

Light theme

pandoc --toc --toc-depth=6 -s --embed-resources --css doc/nour-nahar.css --highlight-style doc/nour-lail.theme -o <path-to-be-determined.html> doc/README.md

Extras

You can add cosmetics (neon logo and title) with the following extras:

-V logo="$(< doc/connector.svg)" --template doc/nuxeo-hxai-connector-template.html

Known limitations and WIP