Nuxeo HxAI Connector (WIP)

An addon to bulk map, remap, chain-transform and send metadata and binaries to Content Intelligence via the Ingest service.

TL;DR

Get up and running quickly!

Ingesting files with default mapping

This will ingest all the files that are under the given <my-root-doc-id>:

curl -sS -u foo:bar -H 'Content-Type: application/json' <myNuxeoUrl>/nuxeo/api/v1/automation/Bulk.RunAction -d \
'{"params":{
    "query":"SELECT * FROM Document WHERE ecm:ancestorId = '\''<my-root-doc-id>'\''",
    "action":"ingest"
  }
}'

However, The IngestAction is very flexible and can be parameterized very finely as you will see as you go through other examples.

Groups, Users, Members synchronization

This will handle synchronizing groups users and members provided by the UserManager:

curl -XPOST -sS -u foo:bar -H 'Accept: application/json' <myNuxeoUrl>/nuxeo/site/automation/Nucleus.Sync.Users.Groups -H "Content-type: application/json+nxrequest" -d "{}"

Dry run mode

Ingestion offers a lot of possibilities via mapping and transformation. You certainly want to stay in dryRun mode until you have nailed your parameters:

Write the following content in myParameters.json and inject it the smart way:

{
  "dryRun": true
}

Start fresh from scratch

This will remove all mapping so you can build a new one piece by piece:

We want to discard any default contribution: aggregateDefaultMappings and aggregateDefaultTransformations set to false.
We also want to discard previously persisted inlineMappings and inlineTransformations with replaceMapping set to true.

Write the following content in myParameters.json and inject it the smart way:

{
  "dryRun": true,
  "aggregateDefaultMappings": false,
  "aggregateDefaultTransformations": false,
  "aggregateDefaultPropertyMappers": false,
  "replaceMapping": true
}

Start fresh with the defaults

This will get you back to default mapping on a document as if it was going to be ingested for the first time:

We want to leverage any default contribution: aggregateDefaultMappings and aggregateDefaultTransformations set to true.
We also want to discard previously persisted inlineMapping and inlineTransformations with replaceMapping set to true.

Write the following content in myParameters.json and inject it the smart way:

{
  "dryRun": true,
  "replaceMapping": true
}

Overloading the current (last persisted) mapping

Let’s say you want to adjust your current mapping. You can override it this way:

Write the following content in myParameters.json and inject it the smart way:

{
  "dryRun": true,
  "inlineMappings": "!dc:contributors,!dc:description",
  "inlineTransformations": "dc:title=meta:name=_Flag",
  "inlinePropertyMappers": "dc:title ExtraPropertiesMapper dc:extra:my_value"
}

That will:

remove a couple properties and map dc:title to meta:name
change its value with the _Flag function.
add a new dc:extra property with my_value as value See Mapping Documents and Transforming Documents for more info.

Important Detail about Ingestion Phases

Folderish documents should be ingested first. This will spare a lot of ACL recompute downstream. See onlyContent See onlyAncestorsAndFolders

A word about Nuxeo

Nuxeo associates metadata and content (text, binaries…).
Nuxeo indexes documents and has powerfull search capabilities.
Nuxeo’s metadata are stored in schemas:

<schema xmlns:common="http://www.nuxeo.org/ecm/schemas/common/" name="common">
  <common:icon>/icons/pdf.png</common:icon>
</schema>
<schema xmlns:dc="http://www.nuxeo.org/ecm/schemas/dublincore/" name="dublincore">
  <dc:contributors>
    <item>Administrator</item>
  </dc:contributors>
  <dc:created>2024-11-21T15:38:08.620Z</dc:created>
  <dc:creator>Administrator</dc:creator>
  <dc:description>A poem from the heart</dc:description>
  <dc:lastContributor>Administrator</dc:lastContributor>
  <dc:modified>2024-11-21T15:55:19.496Z</dc:modified>
  <dc:nature>article</dc:nature>
  <dc:title>testPoem</dc:title>
</schema>

A word about Ingest

The Ingest service provides a REST API to send your documents to Content Intelligence.

The Ingest payload

The Ingest payload is an array of “ingest events” with two 2 distinguishable parts.

The hard-coded part

A part of the schema is mandatory and handled by the connector. You don’t need to configure it.

The properties part

Data is expected this way:

files: flat, at the root of properties. Nested files will be ignored.
values: regular metadata values that can be nested.

ACL

ACL are in between. They are mandatory but part of the properties. They are sent automatically as well.

Connector Setup

Installation

mp-install this addon’s package: <NUXEO_HOME>/nuxeoctl mp-install nuxeo-hxai-connector
Update nuxeo.conf with desired properties.Please refer to list of configuration options in below section

Configuration through nuxeo.conf

Configuring credentials

Property name	description
hxai.ingest.client.id
hxai.ingest.client.secret
hxai.ingest.env.key	hxai-<uuid>: identifies the env the repo is part of
hxai.ingest.source.id	<uuid> : identifies the repo in Ingest context
hxai.nucleus.client.id
hxai.nucleus.client.secret
hxai.nucleus.system.id	<uuid> : identifies the repo in Nucleus context

Configuring the backing bulk actions

Property name	default
nuxeo.bulk.action.ingestAction.defaultConcurrency	2
nuxeo.bulk.action.ingestAction.defaultPartitions	4
nuxeo.bulk.action.nucleusMappingAction.defaultConcurrency	2
nuxeo.bulk.action.nucleusMappingAction.defaultPartitions	4

Configuration through the ConfigurationService

Some configurations come with default values and are configurable through the ConfigurationService.

Property name	default
hxai.nucleus.auth.base.url	https://auth.iam.dev.experience.hyland.com
hxai.nucleus.system.integration.base.url	https://api.nucleus.dev.experience.hyland.com
hxai.ingest.base.url	https://ingestion.insight.dev.experience.hyland.com
hxai.connection.pool.max.size	1
hxai.executor.pool.size.max	1
hxai.ingest.binary.check.threshold.byte.size	26214400
hxai.ingest.presigned.url.cache.size.max	100
hxai.ingest.inline.consumer.cache.size.max	1000

About hxai.ingest.binary.check.threshold.byte.size

Expressed in bytes. Checking a digest is already known by Content Intelligence has a tangible time cost. If the file being checked is smaller than the threshold, it’s not worth checking. Sending the binary anyway in such case is faster.

In dry-run, this check is still done, to give users a feel of the time consumed by that option and tune the threshold. If one really means to bypass the check in dry-run (to work offline for example), putting this to a very high level is the way to go. Otherwise, it is recommanded to keep it at its real level.

About hxai.connection.pool.max.size

Used for binary upload.

About hxai.executor.pool.size.max

Used for serialization and binary upload.

About hxai.ingest.inline.consumer.cache.size.max

When an inline Consumer<IngestProperty> is submitted to the IngestAction, it is cached for reuse with matching documents. To avoid the cache to grow unexpectedly (especially as user tests and tries things out as inlineTransformers), there is a cache size limit. To keep things simple, the cache is just cleared when it reaches max size.

Configuration through contributions

Default configuration

Default configuration is based on Document type. Descriptors with ID matching a document type are targetted to that document type.

Contributing to ingest XP points (extention points)

There are 3 extention points:

IngestMappings
IngestTransformations
IngestPropertyMappers

They all take IngestDescriptors.

IngestDescriptor

The IngestDescriptor is a flexible descriptor that can take an args String or a list of items which are IngestItemDescriptors. The IngestItemDescriptor is also a flexible descriptor. It takes either an args String or a list of args whiche are IngestArgDescriptors.

Case Study: default configuration

Here is a representative sample showing how to use ingestion descriptors:

  <extension target="org.nuxeo.hxai.IngestMappingServiceComponent" point="ingestMappings">
    <ingest id="system" args="ingestProperty:type"/>
    <ingest id="Root" args="@system root:title"/>
    <ingest id="default" args="@system dublincore file:content files:files"/>
  </extension>
  <extension target="org.nuxeo.hxai.IngestMappingServiceComponent" point="ingestPropertyMappers">
    <ingest id="default">
      <item args="files:files FilesPropertyMapper"/>
      <item>
        <arg value="ingestProperty:type"/>
        <arg value="ExtraPropertiesMapper"/>
        <arg value="ingestProperty:type:DOCTYPE"/>
        <arg value="dc:title:BASENAME"/>
        <arg value="dc:created:EPOCH"/>
        <arg value="dc:creator:system"/>
        <arg value="dc:modified:EPOCH"/>
        <arg value="dc:lastContributor:system"/>
      </item>
    </ingest>
    <ingest id="Root">
      <item>
        <arg value="root:title"/>
        <arg value="ExtraPropertiesMapper"/>
        <arg value="ingestProperty:type:DOCTYPE"/>
        <arg value="dc:title:/"/>
        <arg value="dc:created:EPOCH"/>
        <arg value="dc:creator:system"/>
        <arg value="dc:modified:EPOCH"/>
        <arg value="dc:lastContributor:system"/>
      </item>
    </ingest>
  </extension>
  <extension target="org.nuxeo.hxai.IngestMappingServiceComponent" point="ingestTransformations">
    <ingest id="default">
      <item args="dc:title==AddKv annotation:name"/>
      <item args="dc:created==AddKv annotation:dateCreated"/>
      <item args="dc:creator==AddKv annotation:createdBy"/>
      <item args="dc:modified==AddKv annotation:dateModified"/>
      <item args="dc:lastContributor==AddKv annotation:modifiedBy"/>
      <item args="ingestProperty:type==AddKv annotation:type"/>
    </ingest>
  </extension>

Precisions about binaries flattening

As of today, Ingest only handles binaries at the root of the properties part. This is fine for simple properties like file:content but doesn’t work for complex properties nesting binaries, like files:files. There are several ways to flatten binaries so they comply to Ingest’s requirements:

The clean way

Using custom Mappers to separate, for example,files:files into multiple simple properties and omit the initial array containing them. Custom mapping happens before Mapping and Transforming, so the properties generated during custom mapping can be transformed as well.

The fallback solution

Post-filtering the outgoing JSON payload allows to flatten unnoticed nested binaries. If a complex type containing binaries does not have a custom mapping, we do move the binaries at the root of properties to avoid the binary to be silently ignored by Ingest.

If this was done with files:files, the array containing the files would remain empty in the JSON payload. Thus the contribution of a default IngestPropertyMapper

Example

# original structure with a containing array:
{
  "my:complex": [
    {
      "file": {}
    },
    {
      "file": {}
    }
  ]
}

# with a custom mapper, could become:
{
  "renamed:transformed/0": {
    "file": {}
  },
  "renamed:transformed/1": {
    "file": {}
  }
}

# with post-filtering:
{
  "my:complex": [],
  "my:complex/0": {
    "file": {}
  },
  "my:complex/1": {
    "file": {}
  }
}

Nuxeo HxAI Connector capabilities

To ingest documents efficiently, the Nuxeo HxAI Connector provides capabilities to:

Synchronize Groups Users Members with Nuclueus based on email address.
Ingest existing repos in a single command leveraging the BAF.
Map documents in a fine-grained way, which means select the metadata we want to send for which document.
Add extra metadata to comply to Ingest’s spec.
Transform the data on the go.
Flatten binaries as required.
Upload binaries to Ingest.
Mark ingested documents for future document updates.
Automatically trigger ingestion with schedules.
Consistently ingest documents the same way.
Provide centralized configurations that can modify for all eligible documents.
Support per document type defaults.
Do all combinations between default, saved and on-the-spot parameters.
Provide a dry-run mode to explore possibilities safely.

Syncrhonizing Groups Users and Members with Nucleus

This will synchronize entities returned by the UserManager. It needs to be run to update as Nuxeo is not notified for every update that is made in an IDP, active directory or the likes.

See instructions.

Triggering ingestion on documents

We can leverage Nuxeo’s query capabilities to target documents and send them to ingestion via a query language called NXQL.

The IngestAction

We leverage the Nuxeo Bulk Action Framework (BAF) which:

can handle all the documents matched by an NXQL query.
provides a REST API to call it and monitor it.
is fault tolerant and has a DLQ mechanism.

curl -ss -u foo:bar -H 'Content-Type: application/json' <myNuxeoUrl>/nuxeo/api/v1/automation/Bulk.RunAction -d \
'{"params":{
    "query":"SELECT * FROM Document WHERE ecm:ancestorId = '\''<my-root-doc-id>'\''",
    "action":"ingest",
  }
}' -o /dev/null

Parameterizing the Action

The actual list of parameters taken by the Action:

{
  "inlineMappings": "dublincore,common",
  "inlineTransformations": "a=b=Function,c=d=OtherFunction",
  "inlinePropertyMappers": "a=b=Function,c=d=OtherFunction",
  "aggregateDefaultMappings": true,
  "aggregateDefaultTransformations": true,
  "aggregateDefaultPropertyMappers": true,
  "replaceMapping": false,
  "persistMapping": false,
  "onlyContent": false,
  "onlyAncestorsAndFolders": false
}

Parameters’ usage

The parameters of the IngestAction are of two types. Some can be persisted. Some can’t.

See the HxAI facet for more details about parameters’ persistence.

The dryRun param: boolean (false)

prevents saving any inline param
prevents uploading binaries to S3
prevents sending payloads to the Ingest service
does not prevent checkDigest calls. This is to allow you to test and tune you check threshold.

The inlineMappings param: (String or Array)

An inline IngestDescriptor contributing to ingestMappings to apply to Documents matching the NXQL query.

The inlineTransformations param: (String or Array)

An inline IngestDescriptor contributing to ingestTransformations to apply to Documents matching the NXQL query.

The inlinePropertyMappers param: (String or Array)

An inline IngestDescriptor contributing to ingestPropertyMappers to apply to Documents matching the NXQL query.

The aggregateDefaultMappings param: boolean (true)

Leverages the default ingestMappings for the Document based on type. This adds up to inlineMappings.

The aggregateDefaultTransformations param: boolean (true)

Leverages the default ingestTransformations for the Document based on type. This adds up to inlineTransformations.

The aggregateDefaultPropertyMappers param: boolean (true)

Leverages the default ingestPropertyMappers for the Document based on type. This adds up to inlinePropertyMappers.

The onlyContent param: boolean (false)

Only ingest non-folderish documents.

The onlyAncestorsAndFolders param: boolean (false)

Only ingest folderish documents.

The replaceMapping param: boolean (false)

Allows to replace the mapping, transformations and property mappers previously saved on the Document. Defaults to false

The persistMapping param: boolean (false)

Allows saving inline* and aggregate* params. This has no effect in dryRun.

Automating ingestion

Ingestion can be automated in 2 distinct ways. Schedule-based (default) or purely Event-based (disabled by default).

Schedule-based

Approach

Schedules are an historical feature of Nuxeo. They fire an event following a cron expression. More info. Schedules are the prefered way to ingest your documents. Indeed, this approach requires read-only access to documents.

By setting up multiple schedules, the user could run multiple ingestion jobs on subparts of her repository, each with its own config.

Sample module with Schedules and corresponding EventListeners

Schedules

Here are 2 Schedules running every other second:

<?xml version="1.0" encoding="UTF-8"?>
<component name="org.nuxeo.hxai.crons.config" version="1.0.0">
  <extension target="org.nuxeo.ecm.core.scheduler.SchedulerService" point="schedule">
    <schedule id="ingest1">
      <eventId>ingest1</eventId>
      <eventCategory>ingest</eventCategory>
      <cronExpression>0/2 * * * * ?</cronExpression>
    </schedule>
    <schedule id="ingest2">
      <eventId>ingest2</eventId>
      <eventCategory>ingest</eventCategory>
      <cronExpression>1/2 * * * * ?</cronExpression>
    </schedule>
  </extension>
</component>

EventListener Contrib

<?xml version="1.0" encoding="UTF-8"?>
<component name="org.nuxeo.hxai.cron.events.listeners.config" version="1.0.0">
  <extension target="org.nuxeo.ecm.core.event.EventServiceComponent" point="listener">
    <listener name="ingest1" async="false" postCommit="false" priority="120" class="org.nuxeo.hxai.listeners.IngestListener1">
      <event>ingest1</event>
    </listener>
    <listener name="ingest2" async="false" postCommit="false" priority="120" class="org.nuxeo.hxai.listeners.IngestListener2">
      <event>ingest2</event>
    </listener>
  </extension>
</component>

EventListener code

This sample will always execute. You will want to update documents only if they have been updated in the last X time units depending on your specific use cases.

public class IngestListener1 implements EventListener {

    @Override
    public void handleEvent(Event event) {
        String query = "SELECT * from Document WHERE ecm:path = '/default-domain/workspaces/test/test'";
        BulkCommand command = new BulkCommand.Builder(IngestAction.ACTION_NAME, query,
                SYSTEM_USERNAME).param(INLINE_MAPPINGS, "files:files,file:content,dublincore,tags,foo:bar")
                                .param(INLINE_TRANSFORMATIONS, "files:files/=my:binaries")
                                .param(REPLACE_MAPPING, true)
                                .param(DRY_RUN_MODE, true)
                                .build();
        Framework.getService(BulkService.class).submit(command);
    }
}

The IngestUpdateListener

The IngestUpdateListener will trigger ingestion on a document when it is updated if it has the hxai facet. However, for the document to have the facet, you must have already sent it for ingestion once with the mapping persistence.

Once you have made your initial bulk import, this listener will keep you in live sync.

Disabling the IngestUpdateListener

It is enabled by default (see Automating Ingestion). You can disable it by contributing the following:

<?xml version="1.0" encoding="UTF-8"?>
<component name="org.nuxeo.hxai.events.listener.config.test" version="1.0.0">
  <require>org.nuxeo.hxai.events.listener.config</require>
  <extension target="org.nuxeo.ecm.core.event.EventServiceComponent" point="listener">
    <listener name="ingestlistener" enabled="false"/>
  </extension>
</component>

The flagging role

The Hxai facet acts like a flag to tell Nuxeo that a document’s ingestion has been done already once and is eligible for ingestion update when necessary.

The persistence function: hxai schema

The hxai schema holds some valuable ingestion-related informations. Aside from the ingestion status, the following IngestAction parameters allow to repeat document ingestion exactly the same way it was last done if desired (for update):

inlineMappings
inlineTransformations
inlinePropertyMappers
aggregateDefaultMappings
aggregateDefaultTransformations
aggregateDefaultPropertyMappers

The following IngestAction parameters are not storable:

onlyContent
onlyAncestorsAndFolders
replaceMapping
persistMapping

Injecting ingestion parameters

Those parameters need to be stringified to be sent in our query (in an additional parameters key):

The hard way

We can write stringified JSON by hand, escaping all sensitive characters:

"{\"inlineMappings\":\"dublincore,common\",\"inlineTransformations\":\"a=b=Function,c=d=OtherFunction\",\"replaceMapping\":false,\"aggregateDefaultMappings\":false,\"aggregateDefaultTransformations\":false,\"persistMapping\":false}"

A smarter way

This is safer, easier and uses POSIX syntax:

$(jq -c < myParams.json | jq -R)

Sample parameterized query

Thus, the complete query becomes as below:

Plain

curl -ss -u foo:bar -H 'Content-Type: application/json' <myNuxeoUrl>/nuxeo/api/v1/automation/Bulk.RunAction -d \
'{"params":{
    "query":"SELECT * FROM Document WHERE ecm:ancestorId = '\''<my-root-doc-id>'\''",
    "action":"ingest",
    "parameters": "{\"inlineMappings\":\"dublincore,common\",\"inlineTransformations\":\"a=b=Function,c=d=OtherFunction\",\"replaceMapping\":false,\"aggregateDefaultMappings\":false,\"aggregateDefaultTransformations\":false,\"persistMapping\":false}
  }
}'

Smart

curl -ss -u foo:bar -H 'Content-Type: application/json' <myNuxeoUrl>/nuxeo/api/v1/automation/Bulk.RunAction -d \
'{"params":{
    "query":"SELECT * FROM Document WHERE ecm:ancestorId = '\''<my-root-doc-id>'\''",
    "action":"ingest",
    "parameters": '$(jq -c < myParams.json | jq -R)'
  }
}'

Mapping documents

Mapping types

Mappings are defined this way:

Unprefixed properties

Although they are supported, it is discouraged to use unprefixed properties:

files # will add files:files to the mapping

Prefixed properties

dc:title # adding single properties, one by one.

Schemas

dublincore # map the 18 properties present in dublincore

Mapping refence

@myMappingReference # map all the mappings found in the 'myMappingReference' Mapping

Inline IngestMappingDescriptors

Chaining , or separated Mappings will help building complete mappings in one line:

dc:title,dc:description #,.. there are 18 properties with the dc: prefix...
# I don't want to type them all, just take the whole dublincore schema! (and the common schema! because why not?)
dublincore,common
# I can also get "simples", properties without a prefix
dublincore,icon
# OK I want the whole dublincore and common schemas, except dc:title
dublincore,icon,!dc:title
# Spaces also work
files:files,dublincore file:content
# Order matters
dublicore,!dc:title # OK: add all dublincore except dc:title
!dc:title,dublincore # Useless: removes dc:title but adds it back!

Default Mappings

The connector comes with a default mapping configuration: ingest-mapping-service-config.xml which ensures compliance to Ingest.

Contributing IngestMappings

IngestDescriptor can be contributed to the IngestMappings XP point via XML. See contributing-to-ingest-xp-points-extention-points.

Defaults by By document type

If a mapping contribution’s id is a document type, it will be used as default mapping. See contributing-to-ingest-xp-points.

Custom Mappings

Nesting a contributed Mapping

dublincore,@bigMapping # @bigMapping is a reference to a mapping with id `bigMapping`.
# what if I don't want the unwanted:prop brought by @bigMapping ?
dublincore,@bigMapping,!unwanted:prop # let's take it off

Note, mappings are deduplicated, it doesn’t matter if your mappings end up requestion the same property multiple times.

Debugging Mappings

Logs can pin point error in your Descriptors

Successfull (happy) logs

Those DEBUG logs represent nested mappings where first references @sencond which references @third:

DEBUG [IngestMappingServiceImpl] processing mapping descriptor: default
DEBUG [IngestMappingServiceImpl] IngestMapping: 'default' was processed successfully.
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: first
DEBUG [IngestMappingServiceImpl] IngestMapping: first directly depends on: second
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: second
DEBUG [IngestMappingServiceImpl] IngestMapping: second directly depends on: third
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: third
DEBUG [IngestMappingServiceImpl] IngestMapping: 'third' was processed successfully.
DEBUG [IngestMappingServiceImpl] IngestMapping: 'second' was processed successfully.
DEBUG [IngestMappingServiceImpl] IngestMapping: 'first' was processed successfully.

Successfull (verbose) logs

Same thing at TRACE level

DEBUG [IngestMappingServiceImpl] processing mapping descriptor: default
TRACE [SimpleIngestMapping] the 'dublincore' mapping was identified as a schema.
TRACE [SimpleIngestMapping] processing mapping: 'dublincore'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'default' was processed successfully.
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: first
DEBUG [IngestMappingServiceImpl] IngestMapping: first directly depends on: second
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: second
DEBUG [IngestMappingServiceImpl] IngestMapping: second directly depends on: third
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: third
TRACE [SimpleIngestMapping] the 'dc:content-type' mapping was identified as a property.
TRACE [SimpleIngestMapping] processing mapping: 'dc:content-type'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'third' was processed successfully.
TRACE [SimpleIngestMapping] the 'dc:description' mapping was identified as a property.
TRACE [SimpleIngestMapping] processing mapping: 'dc:description'
TRACE [SimpleIngestMapping] the '@third' mapping was identified as reference to another mapping.
TRACE [SimpleIngestMapping] processing mapping: '@third'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'second' was processed successfully.
TRACE [SimpleIngestMapping] the 'dc:title' mapping was identified as a property.
TRACE [SimpleIngestMapping] processing mapping: 'dc:title'
TRACE [SimpleIngestMapping] the '@second' mapping was identified as reference to another mapping.
TRACE [SimpleIngestMapping] processing mapping: '@second'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'first' was processed successfully.

Mapping cycle detection

Settng up a cycle

Since nested mappings are possible, cycle detection is provided. Here is a normal contribution:

<ingest id="first" args="dc:title,@second" />
<ingest id="second" args="dc:description,@third" />
<ingest id="third" args="dc:content-type" />

Let’s break it with an override making third depend on forth and make a cyclic reference from forth to second:

<ingest id="third" args="dc:content-type,@foo,@forth" />
<ingest id="forth" args="@second" />

This will not allow Nuxeo to start but we also need a way to track the problem.

Logging to the rescue

Easily find cyclic references:

// TL;DR
java.lang.IllegalArgumentException: Detected cycle in IngestMapping: first->second->third->forth->second
// Full stack
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: default
TRACE [SimpleIngestMapping] the 'dublincore' mapping was identified as a schema.
TRACE [SimpleIngestMapping] processing mapping: 'dublincore'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'default' was processed successfully.
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: first
DEBUG [IngestMappingServiceImpl] IngestMapping: first directly depends on: second
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: second
DEBUG [IngestMappingServiceImpl] IngestMapping: second directly depends on: third
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: third
DEBUG [IngestMappingServiceImpl] IngestMapping: third directly depends on: foo->forth
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: foo
TRACE [SimpleIngestMapping] the 'common' mapping was identified as a schema.
TRACE [SimpleIngestMapping] processing mapping: 'common'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'foo' was processed successfully.
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: forth
DEBUG [IngestMappingServiceImpl] IngestMapping: forth directly depends on: second
ERROR [RegistrationInfoImpl] Component service:org.nuxeo.hxai.IngestMappingServiceComponent notification of application started failed: Detected cycle in IngestMapping: first->second->third->forth->second
java.lang.IllegalArgumentException: Detected cycle in IngestMapping: first->second->third->forth->second
        at org.nuxeo.hxai.service.IngestMappingServiceImpl.processMappingDescriptor(IngestMappingServiceImpl.java:67) ~[classes/:?]
        at org.nuxeo.hxai.service.IngestMappingServiceImpl.lambda$processMappingDescriptor$5(IngestMappingServiceImpl.java:93) ~[classes/:?]

The most direct hint being the problematic chain print:

first->second->third->forth->second

IngestPropertyMapper: Do mappings with a broader context

IngestPropertyMappers allow you to map certain properties the way you want as they have access to the context of the whole IngestDocument.

It is useful to customize how complex properties will be mapped.
Add properties that are not in the object
Do logic that involves multiple properties.
…

They implement java.util.function.Consumer<PropertyMappingContext> which allows them to access any element inside the Document. This is useful in the case of files:files for example, which is destructured into multiple files:files/n entries at the root of properties.

Mappers default package

Default location

There is a default package for functions used by IngestPropertyMappers. If you put your functions there, you don’t need to specify their package:

// assumed package
org.nuxeo.hxai.client.objects.json.mappers

This is actually where provided mappers live.

Custom mapper locations

However, functions can be anywhere else:

MyMapper # default: points to org.nuxeo.hxai.client.objects.json.mappers
.MyMapper # same thing
.my.sub.package.MyOtherMapper # points to org.nuxeo.hxai.client.objects.json.mappers
my.complete.package.MyMapper # use a cannonical name

Provided IngestPropertyMappers

A few provided mappers:

ArrayPropertyMapper # Handles arrays destructuring as properties cannot be nested.
ExtraPropertyMapper # Add arbitrary properties to an IngestDocument
FilesPropertyMapper # Destructures Properties implementing a collections of Files

ExtraPropertyMapper usage

This mapper takes positional arguments on top of the key pattern. The following is a compact rewrite of what is defined for root in the default mapping contribution for Root. See case study

This mapper matches a property mapping root:title.
The property in and of itseslf does not exist but it is defined in the mapping of the Root doctype.
Which behaves as a hook to trigger the following mapper, hooking on that root:title property.
Then properties are enumerated as prefix:suffix:(PRESET|literal_value).
PRESET can be:
- BASENAME: the document’s path last segment
- DOCTYPE: the document type
- EPOCH: an Instant representing the oldest possible date
- NOW: an Instant representing the moment the property is created
literal_value: anything other than those PRESET values is used verbatim

So the line below, when it finds a document mapping root:title calls the ExtraPropertyMapper to add:

ingestProperty:type=Root
dc:title=/
dc:created=<EPOCH Value>
dc:creator=system
and so on…

root:title ExtraPropertiesMapper ingestProperty:type:DOCTYPE dc:title:/ dc:created:EPOCH dc:creator:system dc:modified:EPOCH dc:lastContributor:system
^          ^                     ^                   ^                ^                  ^
target     the Mapper to use     added key           preset value     literal value      yet another...

Contributing IngestPropertyMappers

IngestDescriptor can be contributed to the IngestPropertyMappers XP point via XML. See contributing-to-ingest-xp-points-extention-points.

Merge mechanism

The IngestPropertyMappers don’t merge but replace each others.

Transformations: Remap and Transform

Transformations

Transforming regroups two things: remapping keys and actually transforming values.

Transformations are 3-optional-ways parameters.

The targetted property name pattern
The output property name pattern
A chain of functions (taking arguments each) to be applied to the matching properties. The functions can be followed by arguments and are delimited by the same delimiter as parts 1 and 2.

It is done this way:

# Remap only
dc:=base: # Remap all dublincore properties to prefix them with 'base'
:title=:name # Remap all properties suffixed 'title' and apply Function to them
files:file/=ingest:binaries # Remap all files:files/whatever into ingestion:binaries/whatever

# Transform only
==Function # Apply Function to everything
a==Function # Apply Function to a, don't rename it

# Remap and transform
a=b=Function # Map simple property a to b and apply Function
:title=:name=Function # Remap all properties suffixed 'title' and apply Function to them
a:b=c:d=Function # Exactly map a:b to c:d and apply Function to it
files:files/=ingestion:binaries=Function # Remap all flattened items from files:files/whatever to ingestion:bindaries/whatever and apply Function to them one by one

# Order consideration
# ⚠️ The following will not work as expected ⚠️
a=b=Function1=Function2,a=b=OtherFunction # After being transformed int b, a is not matched by the second transformation and OtherFunction is not applied.
# This would work but there is a better way bellow
a=b=Function1=Function2,b==OtherFunction # Function will be applied before OtherFunction
# The most reliable solution to chain functions on a single property doesn't require you to figure things out:
a=b=Function1=Function2=Function3 # a is renamed to b and Function1 to 3 are applied to it in order.
# Adding parameters
a=b=Function1 arg1 arg2=Function2 arg1=Function3 # The function name is always first, then anything before an eventual = is params.
# Multiple chains in a single line
a=b=Function1 arg1 arg2=Function2 arg1=Function3,c==Function1 arg1 arg2 arg3=Function4

Transformation function specification

Interface

All functions must implement Consumer<IngestProperty> They work at a property level (unlike IngestPropertyMappers, they don’t have access to the whole IngestDocument).

Default location

There is a default package for functions used by IngestTransformations. If you put your functions there, you don’t need to specify their package:

// assumed package
org.nuxeo.hxai.ingest.functions

This is actually where provided functions live.

Custom function locations

However, functions can be anywhere else:

MyFunction # default: points to org.nuxeo.hxai.ingest.functions.MyFunction
.MyFunction # same thing
.my.sub.package.MyOtherFunction # points to org.nuxeo.hxai.ingest.functions.my.sub.package.MyFunction
my.complete.package.MyFunction # use a cannonical name

Provided testing functions

A few provided functions:

AddKv # Adds key:value pairs to the targetted property. Takes parameters like key1:value1 key2:value2.
# The following functions are prefixed `_` because they are provided test functions.
_Flag # will assure you touched a property
_Concat # will concatenate a distinguishable value to the property value
_Count # initiates or increments a numeric value to tell you how many times it was applied

Contributing IngestTransformations

IngestDescriptor can be contributed to the IngestTransformations XP point via XML. See contributing-to-ingest-xp-points-extention-points.

Debugging Transformations

Detecting malformed Transformations

Transformations can be malformed too. Malformed Contributions will be caught at the initialization of Nuxeo:

// Missing left side
DEBUG [SimpleIngestTransformations$Transformation] Instanciating Transformation: 'inline#=c=_Flag'.
TRACE [SimpleIngestTransformations$Transformation] Transformation: 'inline#=c=_Flag' left side: 'null' is of type: 'STAR' right side: 'c' is of type: 'SIMPLE'.
        java.lang.IllegalArgumentException: Malformed Transformation: 'inline#=c=_Flag' with a missing left side.

// Left side only
DEBUG [SimpleIngestTransformations$Transformation] Instanciating Transformation: 'inline#a=='.
TRACE [SimpleIngestTransformations$Transformation] Transformation: 'inline#a==' left side: 'a' is of type: 'SIMPLE' right side: 'null' is of type: 'STAR'.
        java.lang.IllegalArgumentException: Malformed Transformation: 'inline#a==' with a left side only.

// Right side only
DEBUG [SimpleIngestTransformations$Transformation] Instanciating Transformation: 'inline#=c='.
TRACE [SimpleIngestTransformations$Transformation] Transformation: 'inline#=c=' left side: 'null' is of type: 'STAR' right side: 'c' is of type: 'SIMPLE'.
        java.lang.IllegalArgumentException: Malformed Transformation: 'inline#=c=' with a right side only.

Detecting excessive remappings

As we said, Transformations have a remapping role. This is parameterized in the left and right side which are none other than XPaths. Thus, the prefix is like a directory and the suffix is like a file inside that directory.

So we need to be careful not to make excessive mapping, which means mapping several properties to the same target:

// All a: prefixed properties would end up overriding each other as the simple c property (like a:foo, a:bar, a:baz, a:qux would overlap as 'icon' for example)
XPath: 'a:' cannot be the left side of: 'c' in Transformation: 'inline#a:=c=_Flag'. 'a:' is a prefix and can only be mapped to another prefix.

Remapping combinations glossary

you may also want to see the transformation combinations glossary

There are many possible combinations:

Legend

The full form of a remapping looks like so:

1:2=3:4

Symbol	Meaning
✅	valid remap
⚪️	no remap
❌	invalid remap (many possible source, one target)

Star to…

Status	Pattern	Meaning
⚪️	=	star to star
❌	=3	star to simple
❌	=3:	star to prefix
❌	=:4	star to suffix
❌	=3:4	star to full

Simple to…

Status	Pattern	Meaning
⚪️	1=	simple to star
✅	1=3	simple to simple
✅	1=3:	simple to prefix
✅	1=:4	simple to suffix
✅	1=3:4	simple to full

Prefix to…

Status	Pattern	Meaning
⚪️	1:=	prefix to star
❌	1:=3	prefix to simple
✅	1:=3:	prefix to prefix
❌	1:=:4	prefix to suffix
❌	1:=3:4	prefix to full

Suffix to…

Status	Pattern	Meaning
⚪️	:2=	suffix to star
❌	:2=3	suffix to simple
❌	:2=3:	suffix to prefix
✅	:2=:4	suffix to suffix
❌	:2=3:4	suffix to full

Full to…

Status	Pattern	Meaning
⚪️	1:2=	full to star
✅	1:2=3	full to simple
✅	1:2=3:	full to prefix
✅	1:2=:4	full to suffix
✅	1:2=3:4	full to full

Transformation combinations glossary

There are less than for the remapping combinations glossary, but still quite a few combinations possible:

Legend

The full form of a transformation looks like so:

left=right=function1[[=function2]...]

where left and right are parts or a valid remapping

Symbol	Meaning
✅	valid transformation
⚪️	no transformation
❌	invalid transformation

Nothing

Status	Pattern	Meaning
⚪️	==	no transformation (also valid for `[=,:]*`)

No left

Status	Pattern	Meaning
✅	==Function	Transform every value
❌	=right=	Only right side provided
❌	=right=Function	Missing left side

No right

Status	Pattern	Meaning
❌	left==	Left side only provided
✅	left==Function	Transform value for keys matching left expression without remapping

No Function

Status	Pattern	Meaning
✅	left=right=	Remap left matching keys to right expression

Complete Transformation

Status	Pattern	Meaning
✅	left=right=Function	Transform value for keys matching left expression without remapping

CI/CD

Workflows

CI/CD workflows are present here and they include:

CI for PR: Build and test of the source code upon raising a PR against 2023 branch.
Deployment of package to Pre-production Marketplace: Merging a PR onto 2023 branch would trigger the CI followed by deployment of the generated package to the nuxeo pre-prod marketplace.
Release to Production Marketplace: A manual release job is available which, when run on a base branch, deploys the latest available minor tag version to the production marketplace.

Versioning and release strategy

To Pre-Production Marketplace

Deployment is done every time a PR is merged onto the base LTS branch (2023)
Follows MAJOR.MINOR.PATCH versioning strategy with patch precision
Whenever a deployment is done, we maintain the state in tags and next merge will be released as MAJOR.MINOR.PATCH+1 and so on.

To Production Marketplace

Deployment is done whenever the manual release job is run
Follows MAJOR.MINOR.0 versioning strategy with MINOR precision
Whenever the release job is run, latest available minor tag (MAJOR.MINOR.PATCH) of the current base branch, is deployed as MAJOR.MINOR+1.0

Automatic Version bump

Once a release of MAJOR.MINOR+1.0 happens to production, the project’s version is bumped automatically to MAJOR.MINOR+1-SNAPSHOT. Until the next release of MAJOR.MINOR+2 to production, the upcoming PR merges will be deployed to pre-production as MAJOR.MINOR+1.1, MAJOR.MINOR+1.2 and so on.

Formatting

The repository holds files in many languages. Formatting is verified in the CI.

Approach

The Makefile provides steps to:

setup text manipulation and formatting tools
use them in a cache-aware fashion

Setup

You can ensure your macOS or Ubuntu workstation has all the tools needed to work with the repository:

# ensures you have java, python, volta and all the formatters
$ make all

Some tools may be downloaded locally, under .make/dl.

Usage

$ make nice

Caching details

Only modified files since last format will be formatted. (even if you switch branch in the middle) This allows to run a single formatting command without thinking of what one wants to format. This also dramatically speeds up the process as it allows to completely skip maven when working on non-java files.

Since make relies on timestamps which git doesn’t restore when moving between comitishs’, the cache relies on file hashing stored in .make/stamps/*.sum files.

Updating the docs

Do not edit README.md, it is autogenerated as well as content.html which serves as the package’s embeded documentation. The file to edit is doc/README.md, then you need to generate its Table Of Content (TOC):

pandoc --toc --toc-depth=6 -s -t gfm -o README.md doc/README.md

⚠️ The CI build and test pipeline will verify that this file, README.md, is equal to the result of the above command.

This is to make sure:

the right file is edited
allow autogen TOC without having to remove it first by hand
avoid the TOC to be outdated

Updating the package doc: content.html

The content.html is not a source file, it can be generated on demand:

# - use pandoc to
# - format this file (doc/README.md)
# - generate README.md with a TOC
# - generate content.html
make README.md

Theming options

By default, the responsive theme will be used, but these possibilities are documented here as memo.

Responsive color theme

Responsive to your OS settings:

pandoc --toc --toc-depth=6 -s --embed-resources --css doc/nour-auto-lail-nahar.css --highlight-style doc/nour-lail.theme -o <path-to-be-determined.html> doc/README.md

Dark theme

pandoc --toc --toc-depth=6 -s --embed-resources --css doc/nour-lail.css --highlight-style doc/nour-lail.theme -o <path-to-be-determined.html> doc/README.md

Light theme

pandoc --toc --toc-depth=6 -s --embed-resources --css doc/nour-nahar.css --highlight-style doc/nour-lail.theme -o <path-to-be-determined.html> doc/README.md

Extras

You can add cosmetics (neon logo and title) with following extras:

-V logo="$(< doc/connector.svg)" --template doc/nuxeo-hxai-connector-template.html

Known limitations and WIP

Delete is not supported yet

Nuxeo HxAI Connector (WIP)

TL;DR

Ingesting files with default mapping

Groups, Users, Members synchronization

Dry run mode

Start fresh from scratch

Start fresh with the defaults

Overloading the current (last persisted) mapping

Important Detail about Ingestion Phases

A word about Nuxeo

A word about Ingest

The Ingest payload

The hard-coded part

The properties part

ACL

Connector Setup

Installation

Configuration through nuxeo.conf

Configuring credentials

Configuring the backing bulk actions

Configuration through the ConfigurationService

About hxai.ingest.binary.check.threshold.byte.size

About hxai.connection.pool.max.size

About hxai.executor.pool.size.max

About hxai.ingest.inline.consumer.cache.size.max

Configuration through contributions

Default configuration

Contributing to ingest XP points (extention points)

IngestDescriptor

Case Study: default configuration

Precisions about binaries flattening

The clean way

The fallback solution

Example

Nuxeo HxAI Connector capabilities

Syncrhonizing Groups Users and Members with Nucleus

Triggering ingestion on documents

The IngestAction

Parameterizing the Action

Parameters’ usage

The dryRun param: boolean (false)

The inlineMappings param: (String or Array)

The inlineTransformations param: (String or Array)

The inlinePropertyMappers param: (String or Array)

The aggregateDefaultMappings param: boolean (true)

The aggregateDefaultTransformations param: boolean (true)

The aggregateDefaultPropertyMappers param: boolean (true)

The onlyContent param: boolean (false)

The onlyAncestorsAndFolders param: boolean (false)

The replaceMapping param: boolean (false)

The persistMapping param: boolean (false)

Automating ingestion

Schedule-based

Approach

Sample module with Schedules and corresponding EventListeners

Schedules

EventListener Contrib

EventListener code

The IngestUpdateListener

Disabling the IngestUpdateListener

The Hxai facet

The flagging role

The persistence function: hxai schema

Injecting ingestion parameters

The hard way

A smarter way

Sample parameterized query

Plain

Smart

Mapping documents

Mapping types

Unprefixed properties

Prefixed properties

Schemas

Mapping refence

Inline IngestMappingDescriptors

Default Mappings

Contributing IngestMappings

Defaults by By document type

Custom Mappings