README

An addon to bulk map, remap, chain-transform and send metadata and binaries to the Ingest service.

TL;DR

Ingesting files with default mapping

curl -ss -u foo:bar -H 'Content-Type: application/json' <myNuxeoUrl>/nuxeo/api/v1/automation/Bulk.RunAction -d \
'{"params":{
    "query":"SELECT * FROM Document WHERE ecm:ancestorId = '\''<my-root-doc-id>'\''",
    "action":"ingest"
  }
}'

However, the IngestAction is very flexible and can be finely parameterized, as the following examples show.

Dry run mode

Ingestion offers a lot of possibilities via mapping and transformation. You will probably want to stay in dryRun mode until you have nailed your parameters:

{
  "dryRun": true
}

Start fresh from scratch

{
  "dryRun": true,
  "aggregateDefaultMapping": false,
  "aggregateDefaultTransformer": false,
  "replaceMapping": true
}

Start fresh with the defaults

This will get you back to default mapping on a document as if it was going to be ingested for the first time:

{
  "dryRun": true,
  "aggregateDefaultMapping": true,
  "aggregateDefaultTransformer": true,
  "replaceMapping": true
}

Overloading the current (last persisted) mapping

Let’s say you want to adjust your current mapping. You can override it this way:

{
  "dryRun": true,
  "inlineMapping": "!dc:contributors,!dc:description",
  "inlineTransformer": "dc:title=meta:name=_Flag"
}

That will remove a couple of properties and map dc:title to meta:name while changing its value with the _Flag function. See Mapping documents and Transformers: Remap and Transform for more info.

Replace current mapping while discarding defaults

{
  "dryRun": true,
  "inlineMapping": "dc:contributors,dc:description",
  "inlineTransformer": "dc:title=meta:name=_Flag",
  "aggregateDefaultMapping": false,
  "aggregateDefaultTransformer": false,
  "replaceMapping": true
}

That will add a couple of properties and map dc:title to meta:name while changing its value with the _Flag function. See Mapping documents and Transformers: Remap and Transform for more info.

Replace current mapping and leverage defaults

{
  "dryRun": true,
  "inlineMapping": "dc:contributors,dc:description",
  "inlineTransformer": "dc:title=meta:name=_Flag",
  "aggregateDefaultMapping": true,
  "aggregateDefaultTransformer": true,
  "replaceMapping": true
}

That will add a couple of properties on top of the defaults and map dc:title to meta:name while changing its value with the _Flag function. See Mapping documents and Transformers: Remap and Transform for more info.

A word about Nuxeo

<schema xmlns:common="http://www.nuxeo.org/ecm/schemas/common/" name="common">
  <common:icon>/icons/pdf.png</common:icon>
</schema>
<schema xmlns:dc="http://www.nuxeo.org/ecm/schemas/dublincore/" name="dublincore">
  <dc:contributors>
    <item>Administrator</item>
  </dc:contributors>
  <dc:created>2024-11-21T15:38:08.620Z</dc:created>
  <dc:creator>Administrator</dc:creator>
  <dc:description>A poem from the heart</dc:description>
  <dc:lastContributor>Administrator</dc:lastContributor>
  <dc:modified>2024-11-21T15:55:19.496Z</dc:modified>
  <dc:nature>article</dc:nature>
  <dc:title>testPoem</dc:title>
</schema>

A word about Ingest

The Ingest service provides a REST API to send your documents to the Insight content lake.

The Ingest payload

The Ingest payload is an array of “ingest events” with two distinguishable parts.

The hard-coded part

A part of the schema is mandatory. It contains an object id and required fields.

The properties part

Installing and configuring

Installation

Ingest Configuration

Property name	Description
hxai.api.client.id	The HxAI client ID
hxai.api.client.secret	The HxAI client secret
hxai.api.auth.baseurl	The IDP base URL (ex: https://auth.iam.dev.experience.hyland.com)
hxai.api.ingest.baseurl	The HxAI ingest base URL (ex: https://ingestion-api.insight.dev.ncp.hyland.com/v1)
hxai.api.ingest.env.key	The ingest environment key
hxinsight.environment.type	The HxAI environment type
hxinsight.environment.id	The HxAI environment ID
hxinsight.service.name	The HxAI service name
nuxeo.hxai.sourceid	The HxAI source ID

Precisions about binaries flattening

As of today, Ingest only handles binaries at the root of the properties part. This is fine for simple properties like file:content but doesn’t work for complex properties nesting binaries, like files:files. There are several ways to flatten binaries so they comply with Ingest’s requirements:

The clean way

Using custom Mappers to separate, for example, files:files into multiple simple properties and omit the initial array containing them. Custom mapping happens before Mapping and Transforming, so the properties generated during custom mapping can be transformed as well.

The fallback solution

Post-filtering the outgoing JSON payload flattens nested binaries that would otherwise go unnoticed. If a complex type containing binaries does not have a custom mapping, the binaries are moved to the root of the properties part so that Ingest does not silently ignore them.

If this were done with files:files, the array that contained the files would remain, empty, in the JSON payload. Hence the contribution of a default IngestiblePropertyMapper for files:files.

Example

# original structure with a containing array:
{
  "my:complex": [
    {
      "file": {}
    },
    {
      "file": {}
    }
  ]
}

# with a custom mapper, could become:
{
  "renamed:transformed/0": {
    "file": {}
  },
  "renamed:transformed/1": {
    "file": {}
  }
}

# with post-filtering:
{
  "my:complex": [],
  "my:complex/0": {
    "file": {}
  },
  "my:complex/1": {
    "file": {}
  }
}
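
The post-filtering shape shown above can be sketched as follows. This is only an illustration, not the connector's code: the `FlattenSketch` class is hypothetical, and it assumes that every element of a list-valued property holds a binary and must therefore be moved to the root.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not part of the connector.
class FlattenSketch {

    /** Moves every element of a list-valued property to a root-level "name/index" key, leaving the list empty. */
    static Map<String, Object> flatten(Map<String, Object> properties) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : properties.entrySet()) {
            if (entry.getValue() instanceof List) {
                List<?> items = (List<?>) entry.getValue();
                // The emptied container stays in place, as in the post-filtering example.
                result.put(entry.getKey(), new ArrayList<>());
                for (int i = 0; i < items.size(); i++) {
                    result.put(entry.getKey() + "/" + i, items.get(i));
                }
            } else {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> properties = new LinkedHashMap<>();
        properties.put("my:complex", List.of(Map.of("file", Map.of()), Map.of("file", Map.of())));
        // Prints the resulting root-level keys.
        System.out.println(flatten(properties).keySet());
    }
}
```

A custom mapper would go one step further and drop or rename the empty `my:complex` entry.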

Nuxeo HxAI Connector to the rescue!

The HxAI Service

To be refactored. The Ingest operations of uploading files and sending events are implemented in the HxAI service.

Triggering ingestion on documents

We can leverage Nuxeo’s search capabilities, through its query language NXQL, to target documents and send them to ingestion.

The IngestAction

curl -ss -u foo:bar -H 'Content-Type: application/json' <myNuxeoUrl>/nuxeo/api/v1/automation/Bulk.RunAction -d \
'{"params":{
    "query":"SELECT * FROM Document WHERE ecm:ancestorId = '\''<my-root-doc-id>'\''",
    "action":"ingest"
  }
}' -o /dev/null

Parameterizing the Action

{
  "inlineMapping": "dublincore,common",
  "inlineTransformer": "a=b=Function,c=d=OtherFunction",
  "replaceMapping": false,
  "aggregateDefaultMapping": false,
  "aggregateDefaultTransformer": false,
  "persistMapping": false
}

Parameters’ usage

The parameters of the IngestAction are of two types. Some can be persisted. Some can’t.

The dryRun param: boolean (false)

The inlineMapping param: String

An inline IngestMappingDescriptor to apply to Documents matching the NXQL query.

The inlineTransformer param: String

An inline IngestTransformerDescriptor to apply to Documents matching the NXQL query.

The replaceMapping param: boolean (false)

Replaces the mapping and transformer previously saved on the Document. Defaults to false.

The aggregateDefaultMapping param: boolean (true)

Leverages the default IngestMapping for the Document, based on its type. It is applied in addition to inlineMapping.

The aggregateDefaultTransformer param: boolean (true)

Leverages the default IngestTransformer for the Document, based on its type. It is applied in addition to inlineTransformer.

The persistMapping param: boolean (false)

Saves the parameters on the Document, except those that cannot be persisted: persistMapping itself, replaceMapping and dryRun. This has no effect in dryRun mode.

Automating ingestion

Ingestion can be automated in two distinct ways: schedule-based (the default) or purely event-based (disabled by default).

Schedule-based

Approach

Schedules are a long-standing feature of Nuxeo. They fire an event following a cron expression. Schedules are the preferred way to ingest your documents because this approach only requires read access to documents.

By setting up multiple schedules, the user could run multiple ingestion jobs on subparts of her repository, each with its own config.

Sample module with Schedules and corresponding EventListeners

Schedules

<?xml version="1.0" encoding="UTF-8"?>
<component name="org.nuxeo.hxai.crons.config" version="1.0.0">
  <extension target="org.nuxeo.ecm.core.scheduler.SchedulerService" point="schedule">
    <schedule id="ingest1">
      <eventId>ingest1</eventId>
      <eventCategory>ingest</eventCategory>
      <cronExpression>0/2 * * * * ?</cronExpression>
    </schedule>
    <schedule id="ingest2">
      <eventId>ingest2</eventId>
      <eventCategory>ingest</eventCategory>
      <cronExpression>1/2 * * * * ?</cronExpression>
    </schedule>
  </extension>
</component>

EventListener Contrib

<?xml version="1.0" encoding="UTF-8"?>
<component name="org.nuxeo.hxai.cron.events.listeners.config" version="1.0.0">
  <extension target="org.nuxeo.ecm.core.event.EventServiceComponent" point="listener">
    <listener name="ingest1" async="false" postCommit="false" priority="120" class="org.nuxeo.hxai.listeners.IngestListener1">
      <event>ingest1</event>
    </listener>
    <listener name="ingest2" async="false" postCommit="false" priority="120" class="org.nuxeo.hxai.listeners.IngestListener2">
      <event>ingest2</event>
    </listener>
  </extension>
</component>

EventListener code

This sample submits an ingestion every time the event fires. Depending on your use case, you will probably want to restrict it to documents that have been updated in the last X time units.

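import static org.nuxeo.ecm.core.api.security.SecurityConstants.SYSTEM_USERNAME;

import org.nuxeo.ecm.core.bulk.BulkService;
import org.nuxeo.ecm.core.bulk.message.BulkCommand;
import org.nuxeo.ecm.core.event.Event;
import org.nuxeo.ecm.core.event.EventListener;
import org.nuxeo.runtime.api.Framework;
// IngestAction and the INLINE_MAPPING, INLINE_TRANSFORMER, REPLACE_MAPPING and DRY_RUN_MODE
// constants are provided by this addon: import them from the package where it exposes them.

public class IngestListener1 implements EventListener {

    @Override
    public void handleEvent(Event event) {
        // Target a single document by path; adapt the NXQL query to your repository.
        String query = "SELECT * from Document WHERE ecm:path = '/default-domain/workspaces/test/test'";
        // Run the ingest action as the system user, replacing any persisted mapping, in dry-run mode.
        BulkCommand command = new BulkCommand.Builder(IngestAction.ACTION_NAME, query,
                SYSTEM_USERNAME).param(INLINE_MAPPING, "files:files,file:content,dublincore,tags,foo:bar")
                                .param(INLINE_TRANSFORMER, "files:files/=my:binaries")
                                .param(REPLACE_MAPPING, true)
                                .param(DRY_RUN_MODE, true)
                                .build();
        Framework.getService(BulkService.class).submit(command);
    }
}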
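
One possible way to restrict a scheduled run to recently modified documents is to add a condition on dc:modified to the NXQL query. The sketch below only builds the query string. The `RecentDocumentsQuerySketch` class and its `modifiedSince` helper are hypothetical, and the time zone in which the server interprets the TIMESTAMP literal is an assumption (UTC here) that should be checked against your setup.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: not part of the connector.
class RecentDocumentsQuerySketch {

    // NXQL TIMESTAMP literals use the 'yyyy-MM-dd HH:mm:ss' form; UTC is an assumption.
    private static final DateTimeFormatter NXQL_TIMESTAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.UTC);

    /** Builds a query matching documents under 'path' modified within 'window' before 'now'. */
    static String modifiedSince(String path, Instant now, Duration window) {
        String since = NXQL_TIMESTAMP.format(now.minus(window));
        // The path is concatenated as-is: only pass trusted values.
        return "SELECT * FROM Document WHERE ecm:path STARTSWITH '" + path + "'"
                + " AND dc:modified >= TIMESTAMP '" + since + "'";
    }

    public static void main(String[] args) {
        System.out.println(modifiedSince("/default-domain/workspaces/test", Instant.now(), Duration.ofMinutes(10)));
    }
}
```

The resulting string could replace the fixed query in the listener above, with a window matching the schedule's period.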

The IngestUpdateListener

Approach

The IngestUpdateListener triggers ingestion on a document when it is updated, provided it has the hxai facet. However, for the document to have the facet, you must have already sent it for ingestion once with mapping persistence enabled. This is not the preferred way to tackle ingestion because:

Enabling the IngestUpdateListener

It is disabled by default (see Automating Ingestion). You can enable it by contributing the following:

<?xml version="1.0" encoding="UTF-8"?>
<component name="org.nuxeo.hxai.events.listener.config.test" version="1.0.0">
  <require>org.nuxeo.hxai.events.listener.config</require>
  <extension target="org.nuxeo.ecm.core.event.EventServiceComponent" point="listener">
    <listener name="ingestlistener" enabled="true"/>
  </extension>
</component>

The HxAI facet

The flagging role

The HxAI facet acts like a flag to tell Nuxeo that a document has already been ingested once and is eligible for an ingestion update when necessary.

The persistence function: hxai schema

The hxai schema holds valuable ingestion-related information. Aside from the ingestion status, it stores the following IngestAction parameters, which make it possible to repeat a document's ingestion exactly the way it was last done (for an update):

Default configuration

Document type based defaults

Default configuration is based on Document type. If you want to register a default IngestMappingDescriptor or a default IngestTransformerDescriptor for a certain Document type, simply give the type as Descriptor id.

Default provision

Contributing Descriptors

Please see Contributing Mappings, and keep in mind that if you want to override the default IngestMappingDescriptor you need to require the IngestMappingServiceComponent:

<require>org.nuxeo.hxai.IngestMappingServiceComponent</require>

Injecting ingestion parameters

Those parameters need to be stringified to be sent in our query (in an additional parameters key):

The hard way

"{\"inlineMapping\":\"dublincore,common\",\"inlineTransformer\":\"a=b=Function,c=d=OtherFunction\",\"replaceMapping\":false,\"aggregateDefaultMapping\":false,\"aggregateDefaultTransformer\":false,\"persistMapping\":false}"
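
The escaped string above can also be produced programmatically rather than by hand. Here is a minimal, dependency-free sketch of JSON string escaping. The `StringifySketch` class is hypothetical, and a JSON library, if one is already on your classpath, would do the same job.

```java
// Hypothetical helper: not part of the connector.
class StringifySketch {

    /** Wraps raw text in double quotes and escapes it so it is a valid JSON string value. */
    static String toJsonString(String raw) {
        StringBuilder out = new StringBuilder("\"");
        for (char c : raw.toCharArray()) {
            switch (c) {
                case '"':
                    out.append("\\\"");
                    break;
                case '\\':
                    out.append("\\\\");
                    break;
                case '\n':
                    out.append("\\n");
                    break;
                case '\r':
                    out.append("\\r");
                    break;
                case '\t':
                    out.append("\\t");
                    break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters must be unicode-escaped in JSON.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.append('"').toString();
    }

    public static void main(String[] args) {
        String params = "{\"inlineMapping\":\"dublincore,common\",\"persistMapping\":false}";
        System.out.println(toJsonString(params));
    }
}
```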

The cool way

$(jq -c . myParams.json | jq -R .)

Sample parameterized query

Plain

curl -ss -u foo:bar -H 'Content-Type: application/json' <myNuxeoUrl>/nuxeo/api/v1/automation/Bulk.RunAction -d \
'{"params":{
    "query":"SELECT * FROM Document WHERE ecm:ancestorId = '\''<my-root-doc-id>'\''",
    "action":"ingest",
    "parameters": "{\"inlineMapping\":\"dublincore,common\",\"inlineTransformer\":\"a=b=Function,c=d=OtherFunction\",\"replaceMapping\":false,\"aggregateDefaultMapping\":false,\"aggregateDefaultTransformer\":false,\"persistMapping\":false}"
  }
}'

Smart

curl -ss -u foo:bar -H 'Content-Type: application/json' <myNuxeoUrl>/nuxeo/api/v1/automation/Bulk.RunAction -d \
'{"params":{
    "query":"SELECT * FROM Document WHERE ecm:ancestorId = '\''<my-root-doc-id>'\''",
    "action":"ingest",
    "parameters": '"$(jq -c . myParams.json | jq -R .)"'
  }
}'

Mapping documents

Mapping types

Unprefixed properties

files # will add files:files to the mapping

Prefixed properties

dc:title # adding single properties, one by one.

Schemas

dublincore # map the 18 properties present in dublincore

Mapping reference

@myMappingReference # map all the mappings found in the 'myMappingReference' Mapping

Inline IngestMappingDescriptors

Chaining comma-separated Mappings helps build complete mappings in one line:

dc:title,dc:description #,.. there are 18 properties with the dc: prefix...
# I don't want to type them all, just take the whole dublincore schema! (and the common schema! because why not?)
dublincore,common
# I can also get "simples", properties without a prefix
dublincore,icon
# OK I want the whole dublincore and common schemas, except dc:title
dublincore,common,!dc:title
# Order matters
dublincore,!dc:title # OK: add all dublincore except dc:title
!dc:title,dublincore # Useless: removes dc:title but adds it back!
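
Why order matters can be illustrated with a small resolver that applies each expression from left to right. This is only a sketch under stated assumptions: the `InlineMappingSketch` class and its truncated `SCHEMAS` registry are hypothetical, and unprefixed properties and `@` references are deliberately left out.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: not the connector's implementation.
class InlineMappingSketch {

    // Stand-in schema registry, truncated for brevity (the real dublincore schema has more properties).
    static final Map<String, List<String>> SCHEMAS = Map.of(
            "dublincore", List.of("dc:title", "dc:description", "dc:creator"),
            "common", List.of("common:icon"));

    /** Applies each comma-separated expression in order; a leading '!' removes instead of adding. */
    static Set<String> resolve(String inlineMapping) {
        Set<String> result = new LinkedHashSet<>();
        for (String expression : inlineMapping.split(",")) {
            String token = expression.trim();
            if (token.isEmpty()) {
                continue; // tolerate empty segments such as ",,"
            }
            boolean removal = token.startsWith("!");
            String name = removal ? token.substring(1) : token;
            // A schema name expands to its properties; anything else is taken as a single property.
            List<String> properties = SCHEMAS.getOrDefault(name, List.of(name));
            if (removal) {
                result.removeAll(properties);
            } else {
                result.addAll(properties);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(resolve("dublincore,!dc:title")); // [dc:description, dc:creator]
        System.out.println(resolve("!dc:title,dublincore")); // [dc:title, dc:description, dc:creator]
    }
}
```

In the second call the removal runs on an empty set, so the schema expansion that follows brings dc:title back.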

Default Mappings

Baseline defaults

These baseline default mappings are applied to documents without document-type-specific default mappings:

By document type defaults

If a mapping contribution’s id is a document type, it will be used as default mapping instead of the baseline defaults for that document type. See Contributing Mappings.

Custom Mappings

Contributing Mappings

The IngestMappingDescriptor can be contributed via XML, validated and ready to use at runtime.

<?xml version="1.0"?>
<component name="org.nuxeo.hxai.IngestMappingServiceComponent.test.referencing" version="1.0">
  <extension target="org.nuxeo.hxai.IngestMappingServiceComponent" point="ingestMappings">
    <!--default for Picture typed documents-->
    <ingest id="Picture">
      <properties>dc:title,icon,relatedtext:relatedtextresources</properties>
    </ingest>
    <!--to be referred to as @first-->
    <ingest id="first">
      <properties>dc:title,icon,relatedtext:relatedtextresources</properties>
    </ingest>
    <ingest id="second">
      <properties>dc:description,uid:major_version,uid:minor_version</properties>
    </ingest>
    <ingest id="third">
      <properties>dc:content-type</properties>
    </ingest>
  </extension>
</component>

Using contributed Mappings

Nesting a contributed Mapping

By nesting a contributed Mapping, I can avoid retyping 44 properties and keep a strong connection with the original mapping.

<?xml version="1.0"?>
<component name="org.nuxeo.hxai.IngestMappingServiceComponent.test.referencing" version="1.0">
  <extension target="org.nuxeo.hxai.IngestMappingServiceComponent" point="ingestMappings">
    <ingest id="first">
      <properties>@bigMapping,!un:wantedprop</properties>
    </ingest>
  </extension>
</component>

Referencing a contributed Mapping inline

dublincore,@first # duplicate dc:title mapping will be processed only once
# what if I don't want the relatedtext:relatedtextresources brought by @first ?
dublincore,@first,!relatedtext:relatedtextresources # let's take it off

Mixing things up: Yes we can!

In the same IngestMappingDescriptor, use schemas, properties, mapping references to add and remove whatever we want.

It is a good practice to add the removal mapping expressions at the end, so they don’t come back by mistake.

<?xml version="1.0"?>
<component name="org.nuxeo.hxai.IngestMappingServiceComponent.test.referencing" version="1.0">
  <extension target="org.nuxeo.hxai.IngestMappingServiceComponent" point="ingestMappings">
    <ingest id="mixItAllUp">
      <properties>common,dc:title,@bigMapping,!@optionalMappings,!un:wantedprop,!uid</properties>
    </ingest>
  </extension>
</component>

Debugging Mappings

Logs are an important part of this module. They can pinpoint errors in your Descriptors.

Successful (happy) logs

DEBUG [IngestMappingServiceImpl] processing mapping descriptor: default
DEBUG [IngestMappingServiceImpl] IngestMapping: 'default' was processed successfully.
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: first
DEBUG [IngestMappingServiceImpl] IngestMapping: first directly depends on: second
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: second
DEBUG [IngestMappingServiceImpl] IngestMapping: second directly depends on: third
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: third
DEBUG [IngestMappingServiceImpl] IngestMapping: 'third' was processed successfully.
DEBUG [IngestMappingServiceImpl] IngestMapping: 'second' was processed successfully.
DEBUG [IngestMappingServiceImpl] IngestMapping: 'first' was processed successfully.

Successful (verbose) logs

DEBUG [IngestMappingServiceImpl] processing mapping descriptor: default
TRACE [SimpleIngestMapping] the 'dublincore' mapping was identified as a schema.
TRACE [SimpleIngestMapping] processing mapping: 'dublincore'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'default' was processed successfully.
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: first
DEBUG [IngestMappingServiceImpl] IngestMapping: first directly depends on: second
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: second
DEBUG [IngestMappingServiceImpl] IngestMapping: second directly depends on: third
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: third
TRACE [SimpleIngestMapping] the 'dc:content-type' mapping was identified as a property.
TRACE [SimpleIngestMapping] processing mapping: 'dc:content-type'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'third' was processed successfully.
TRACE [SimpleIngestMapping] the 'dc:description' mapping was identified as a property.
TRACE [SimpleIngestMapping] processing mapping: 'dc:description'
TRACE [SimpleIngestMapping] the '@third' mapping was identified as reference to another mapping.
TRACE [SimpleIngestMapping] processing mapping: '@third'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'second' was processed successfully.
TRACE [SimpleIngestMapping] the 'dc:title' mapping was identified as a property.
TRACE [SimpleIngestMapping] processing mapping: 'dc:title'
TRACE [SimpleIngestMapping] the '@second' mapping was identified as reference to another mapping.
TRACE [SimpleIngestMapping] processing mapping: '@second'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'first' was processed successfully.

Mapping cycle detection

Set up a cycle

Normal contribution

Since we have opened the way for mapping references, we have cycle detection. Let’s consider the following:

<ingest id="first">
  <properties>dc:title,@second</properties>
</ingest>
<ingest id="second">
  <properties>dc:description,@third</properties>
</ingest>
<ingest id="third">
  <properties>dc:content-type,</properties>
</ingest>

Problematic contribution

Let’s break it with an override making third depend on forth and make a cyclic reference from forth to second:

<ingest id="third">
  <properties>dc:content-type,@foo,@forth</properties>
</ingest>
<ingest id="forth">
  <properties>@second</properties>
</ingest>

This will not allow Nuxeo to start (avoiding further harm) but we also need a way to track the problem.

Logging to the rescue!

// TL;DR
java.lang.IllegalArgumentException: Detected cycle in IngestMapping: first->second->third->forth->second
// Full stack
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: default
TRACE [SimpleIngestMapping] the 'dublincore' mapping was identified as a schema.
TRACE [SimpleIngestMapping] processing mapping: 'dublincore'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'default' was processed successfully.
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: first
DEBUG [IngestMappingServiceImpl] IngestMapping: first directly depends on: second
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: second
DEBUG [IngestMappingServiceImpl] IngestMapping: second directly depends on: third
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: third
DEBUG [IngestMappingServiceImpl] IngestMapping: third directly depends on: foo->forth
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: foo
TRACE [SimpleIngestMapping] the 'common' mapping was identified as a schema.
TRACE [SimpleIngestMapping] processing mapping: 'common'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'foo' was processed successfully.
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: forth
DEBUG [IngestMappingServiceImpl] IngestMapping: forth directly depends on: second
ERROR [RegistrationInfoImpl] Component service:org.nuxeo.hxai.IngestMappingServiceComponent notification of application started failed: Detected cycle in IngestMapping: first->second->third->forth->second
java.lang.IllegalArgumentException: Detected cycle in IngestMapping: first->second->third->forth->second
        at org.nuxeo.hxai.service.IngestMappingServiceImpl.processMappingDescriptor(IngestMappingServiceImpl.java:67) ~[classes/:?]
        at org.nuxeo.hxai.service.IngestMappingServiceImpl.lambda$processMappingDescriptor$5(IngestMappingServiceImpl.java:93) ~[classes/:?]
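
The kind of check that produces this message can be sketched as a depth-first walk over the mapping references. This is an illustration only, not the code of IngestMappingServiceImpl: the `MappingCycleSketch` class is hypothetical and the reference graph is hard-coded from the contributions above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: not the connector's implementation.
class MappingCycleSketch {

    /** Walks the references depth-first and fails on the first descriptor met twice on the current path. */
    static void check(String id, Map<String, List<String>> references, Deque<String> path, Set<String> done) {
        if (path.contains(id)) {
            path.addLast(id); // close the loop so the message shows the full cycle
            throw new IllegalArgumentException("Detected cycle in IngestMapping: " + String.join("->", path));
        }
        if (done.contains(id)) {
            return; // already validated through another path
        }
        path.addLast(id);
        for (String dependency : references.getOrDefault(id, List.of())) {
            check(dependency, references, path, done);
        }
        path.removeLast();
        done.add(id);
    }

    static void check(String id, Map<String, List<String>> references) {
        check(id, references, new ArrayDeque<>(), new HashSet<>());
    }

    public static void main(String[] args) {
        // Descriptor id -> ids it references with '@'.
        Map<String, List<String>> references = Map.of(
                "first", List.of("second"),
                "second", List.of("third"),
                "third", List.of("foo", "forth"),
                "forth", List.of("second"));
        try {
            check("first", references);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the current path in a deque is what makes it possible to report the whole chain rather than just the offending descriptor.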

Contributing custom Property Mappers

IngestiblePropertyMappers allow you to map certain properties the way you want, which is useful to customize how complex properties are mapped. They implement java.util.function.Consumer<PropertyMappingContext>, which lets them access any element inside the properties part of the IngestibleDocument and makes it possible to flatten a single property into multiple ones. This is the case for files:files, for example, which is spread into multiple files:files/n entries.

Sample custom contribution

Here is a sample contribution to map my:property. It indicates the Mapper to use: my.custom.Mapper.

<?xml version="1.0" encoding="UTF-8"?>
<component name="my.component" version="1.0">
  <extension target="org.nuxeo.hxai.IngestMappingServiceComponent" point="ingestPropertyMappers">
    <ingestPropertyMappers id="myFileMappers">
      <class property="my:property">my.custom.Mapper</class>
    </ingestPropertyMappers>
  </extension>
</component>

Merge mechanism

The IngestiblePropertyMappers don’t merge but replace each other. You can still containerize your IngestiblePropertyMappers by descriptor ID.

Default IngestiblePropertyMapper

The default IngestionMappingService configuration comes with a single IngestiblePropertyMapper for files:files. As of today, it does mandatory work to comply with the Ingest REST API specs and should not be touched: it flattens the binaries in files:files to the root of the properties part of the JSON object representing each document.

Targeting the right property

To target a property, you need to write its prefixed form in the property attribute of each IngestPropertyMapper class. See Sample custom contribution for details.

Transformers: Remap and Transform

Transformations

Transforming covers two things: remapping keys and actually transforming values.

# Remap only
dc:=base: # Remap all dublincore properties to prefix them with 'base'
:title=:name # Remap all properties suffixed 'title' to the suffix 'name'
files:files/=ingestion:binaries # Remap all files:files/whatever into ingestion:binaries/whatever

# Transform only
==Function # Apply Function to everything
a==Function # Apply Function to a, don't rename it

# Remap and transform
a=b=Function # Map simple property a to b and apply Function
:title=:name=Function # Remap all properties suffixed 'title' and apply Function to them
a:b=c:d=Function # Exactly map a:b to c:d and apply Function to it
files:files/=ingestion:binaries=Function # Remap all flattened items from files:files/whatever to ingestion:binaries/whatever and apply Function to them one by one

Chaining Transformations into Transformers

Transformations can be chained (joined by , separators) into a Transformer, which will apply them in order. Here is an inline IngestTransformerDescriptor:

# ⚠️ The following will not work as expected ⚠️
a=b=Function,a=b=OtherFunction # After being remapped into b, a is no longer matched by the second transformation and OtherFunction is not applied.
# This would work but there is a better way below
a=b=Function,b==OtherFunction # Function will be applied before OtherFunction
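
The ordering pitfall above can be reproduced with a toy model where each Transformation matches one exact key. This is a sketch, not the connector's Transformer: the `TransformerOrderSketch` class is hypothetical and pattern matching (prefixes, suffixes) is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: not the connector's implementation.
class TransformerOrderSketch {

    /** Renames key 'left' to 'right' (when right is not empty) and applies the function to its value. */
    static void apply(Map<String, String> properties, String left, String right, UnaryOperator<String> function) {
        if (!properties.containsKey(left)) {
            return; // nothing matches: the transformation is silently skipped
        }
        String value = function.apply(properties.remove(left));
        properties.put(right.isEmpty() ? left : right, value);
    }

    public static void main(String[] args) {
        UnaryOperator<String> upper = String::toUpperCase;
        UnaryOperator<String> exclaim = value -> value + "!";

        // a=b=Function,a=b=OtherFunction
        Map<String, String> first = new LinkedHashMap<>(Map.of("a", "poem"));
        apply(first, "a", "b", upper);
        apply(first, "a", "b", exclaim); // 'a' no longer exists, so this is skipped
        System.out.println(first); // {b=POEM}

        // a=b=Function,b==OtherFunction
        Map<String, String> second = new LinkedHashMap<>(Map.of("a", "poem"));
        apply(second, "a", "b", upper);
        apply(second, "b", "", exclaim); // targets the renamed key
        System.out.println(second); // {b=POEM!}
    }
}
```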

Transformation Functions

Chaining Transformation Functions

The functions used by Transformations implement java.util.function.Function<Serializable, Serializable>, which allows chaining them:

# The most reliable solution to chain functions on a single property doesn't require you to figure things out:
a=b=Function1=Function2=Function3,c==Function1=Function3 # a is renamed to b and Function1 to 3 are applied to it in order. c will then be transformed by Function1, then Function3
# Hard to distinguish both Transformations from each other? Add some commas. It's free!
a=b=Function1=Function2=Function3,,,c==Function1=Function3 # same result
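
Since the functions are plain java.util.function.Function instances, chaining them in order amounts to composing them with andThen. A minimal sketch follows. The `FunctionChainSketch` class and the three lambdas are made up for the example and are not functions shipped with the connector.

```java
import java.io.Serializable;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: not the connector's implementation.
class FunctionChainSketch {

    /** Composes the functions left to right, so the first one in the list is applied first. */
    static Function<Serializable, Serializable> chain(List<Function<Serializable, Serializable>> functions) {
        Function<Serializable, Serializable> composed = Function.identity();
        for (Function<Serializable, Serializable> function : functions) {
            composed = composed.andThen(function);
        }
        return composed;
    }

    public static void main(String[] args) {
        Function<Serializable, Serializable> trim = value -> value.toString().trim();
        Function<Serializable, Serializable> upper = value -> value.toString().toUpperCase();
        Function<Serializable, Serializable> exclaim = value -> value + "!";
        // Equivalent in spirit to a==Function1=Function2=Function3
        System.out.println(chain(List.of(trim, upper, exclaim)).apply("  testPoem ")); // TESTPOEM!
    }
}
```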

Default function location

There is a default package for functions used by Transformers. If you put your functions there, you don’t need to specify their package:

// assumed package
org.nuxeo.hxai.ingest.functions

Custom function locations

MyFunction # points to org.nuxeo.hxai.ingest.functions.MyFunction
.MyFunction # same thing
.my.sub.package.MyOtherFunction # points to org.nuxeo.hxai.ingest.functions.my.sub.package.MyOtherFunction
my.complete.package.MyFunction # use a canonical name
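
The four notations above boil down to three rules, sketched below. The `FunctionNameSketch` class is hypothetical, and the rule that any name containing a dot without starting with one is treated as canonical is an assumption drawn from the examples.

```java
// Hypothetical sketch: not the connector's implementation.
class FunctionNameSketch {

    static final String DEFAULT_PACKAGE = "org.nuxeo.hxai.ingest.functions";

    /** Turns a function name found in a Transformation into a fully qualified class name. */
    static String resolve(String name) {
        if (name.startsWith(".")) {
            return DEFAULT_PACKAGE + name; // relative to the default package
        }
        if (name.contains(".")) {
            return name; // assumed to be a canonical name already
        }
        return DEFAULT_PACKAGE + "." + name; // bare class name
    }

    public static void main(String[] args) {
        System.out.println(resolve("MyFunction"));
        System.out.println(resolve(".my.sub.package.MyOtherFunction"));
        System.out.println(resolve("my.complete.package.MyFunction"));
    }
}
```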

Provided testing functions

# The underscore differentiates bundled test functions from others.
_Flag # will assure you touched a property
_Concat # will concatenate a distinguishable value to the property value
_Count # initiates or increments a numeric value to tell you how many times it was applied

Contributing Transformers

The IngestTransformerDescriptor can be contributed via XML, validated and ready to use at runtime. They are a centralized way to define remappings and transformations.

<?xml version="1.0"?>
<component name="org.nuxeo.hxai.IngestMappingServiceComponent.test.transforming" version="1.0">
  <extension target="org.nuxeo.hxai.IngestMappingServiceComponent" point="ingestTransformers">
    <transformer id="example1">
      <!-- dc:title will be remapped as foo:bar and transformed with the indicated implementation of Function<Serializable, Serializable> -->
      <transformations>dc:title=foo:bar=MyFunction</transformations>
    </transformer>
    <transformer id="example2"><transformations>foo:bar=dc:title=MyUnfunction</transformations></transformer>
  </extension>
</component>

Debugging Transformations

Detecting malformed Transformations

Transformations can be malformed too. Malformed Contributions will be caught at the initialization of Nuxeo:

// Missing left side
DEBUG [SimpleIngestTransformer$Transformation] Instanciating Transformation: 'inline#=c=_Flag'.
TRACE [SimpleIngestTransformer$Transformation] Transformation: 'inline#=c=_Flag' left side: 'null' is of type: 'STAR' right side: 'c' is of type: 'SIMPLE'.
        java.lang.IllegalArgumentException: Malformed Transformation: 'inline#=c=_Flag' with a missing left side.

// Left side only
DEBUG [SimpleIngestTransformer$Transformation] Instanciating Transformation: 'inline#a=='.
TRACE [SimpleIngestTransformer$Transformation] Transformation: 'inline#a==' left side: 'a' is of type: 'SIMPLE' right side: 'null' is of type: 'STAR'.
        java.lang.IllegalArgumentException: Malformed Transformation: 'inline#a==' with a left side only.

// Right side only
DEBUG [SimpleIngestTransformer$Transformation] Instanciating Transformation: 'inline#=c='.
TRACE [SimpleIngestTransformer$Transformation] Transformation: 'inline#=c=' left side: 'null' is of type: 'STAR' right side: 'c' is of type: 'SIMPLE'.
        java.lang.IllegalArgumentException: Malformed Transformation: 'inline#=c=' with a right side only.
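
The three malformed shapes can be told apart by splitting the expression on `=`. The sketch below mirrors the messages above without the `inline#` prefix. It is an illustration, not the code of SimpleIngestTransformer, and the `TransformationParseSketch` and `Parsed` names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: not the connector's implementation.
class TransformationParseSketch {

    /** Holder for the parts of a Transformation; an empty string means "not provided". */
    static final class Parsed {
        final String left;
        final String right;
        final List<String> functions;

        Parsed(String left, String right, List<String> functions) {
            this.left = left;
            this.right = right;
            this.functions = functions;
        }
    }

    /** Splits left=right=function1[=function2...] and rejects the malformed shapes. */
    static Parsed parse(String expression) {
        String[] parts = expression.split("=", -1); // -1 keeps trailing empty segments
        String left = parts[0];
        String right = parts.length > 1 ? parts[1] : "";
        List<String> functions = Arrays.stream(parts)
                                       .skip(2)
                                       .filter(name -> !name.isEmpty())
                                       .collect(Collectors.toList());
        if (left.isEmpty() && !right.isEmpty()) {
            throw new IllegalArgumentException("Malformed Transformation: '" + expression + "' with "
                    + (functions.isEmpty() ? "a right side only." : "a missing left side."));
        }
        if (!left.isEmpty() && right.isEmpty() && functions.isEmpty()) {
            throw new IllegalArgumentException(
                    "Malformed Transformation: '" + expression + "' with a left side only.");
        }
        return new Parsed(left, right, functions);
    }

    public static void main(String[] args) {
        for (String expression : List.of("=c=_Flag", "a==", "=c=", "a=b=_Flag=_Count")) {
            try {
                Parsed parsed = parse(expression);
                System.out.println(parsed.left + " -> " + parsed.right + " via " + parsed.functions);
            } catch (IllegalArgumentException e) {
                System.out.println(e.getMessage());
            }
        }
    }
}
```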

Detecting excessive remappings

As we said, Transformations have a remapping role. This is parameterized in the left and right side which are none other than XPaths. Thus, the prefix is like a directory and the suffix is like a file inside that directory.

So we need to be careful not to make excessive mapping, which means mapping several properties to the same target:

// All a: prefixed properties (a:foo, a:bar, a:baz, a:qux...) would end up overwriting each other as the single simple property c
XPath: 'a:' cannot be the left side of: 'c' in Transformation: 'inline#a:=c=_Flag'. 'a:' is a prefix and can only be mapped to another prefix.
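
The rule behind this error can be sketched by classifying each side from the position of its `:` separator and then comparing the two kinds. The STAR and SIMPLE names appear in the logs above. The `RemapValiditySketch` class, the other enum names and the `Verdict` type are hypothetical, and paths ending with `/` are not covered.

```java
// Hypothetical sketch: not the connector's implementation.
class RemapValiditySketch {

    enum Kind { STAR, SIMPLE, PREFIX, SUFFIX, FULL }

    enum Verdict { VALID, NO_REMAP, INVALID }

    /** Classifies one side of a Transformation by the position of its ':' separator. */
    static Kind kindOf(String side) {
        if (side == null || side.isEmpty()) {
            return Kind.STAR; // nothing provided: matches everything
        }
        int colon = side.indexOf(':');
        if (colon < 0) {
            return Kind.SIMPLE; // e.g. icon
        }
        if (colon == side.length() - 1) {
            return Kind.PREFIX; // e.g. dc:
        }
        return colon == 0 ? Kind.SUFFIX : Kind.FULL; // e.g. :title or dc:title
    }

    /** Tells whether remapping 'left' to 'right' keeps one target per source. */
    static Verdict check(String left, String right) {
        Kind from = kindOf(left);
        Kind to = kindOf(right);
        if (to == Kind.STAR) {
            return Verdict.NO_REMAP; // no target: keys are left untouched
        }
        switch (from) {
            case SIMPLE:
            case FULL:
                return Verdict.VALID; // a single source key can go to any target
            case PREFIX:
            case SUFFIX:
                // Many possible sources: only a target pattern of the same kind keeps them apart.
                return from == to ? Verdict.VALID : Verdict.INVALID;
            default:
                return Verdict.INVALID; // everything cannot be mapped to one target
        }
    }

    public static void main(String[] args) {
        System.out.println(check("a:", "c"));     // INVALID
        System.out.println(check("dc:", "base:")); // VALID
        System.out.println(check("a", ""));        // NO_REMAP
    }
}
```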

Remapping combinations glossary

Legend

Symbol	Meaning
✅	valid remap
⚪️	no remap
❌	invalid remap (many possible sources, one target)

Star to…

Status	Pattern	Meaning
⚪️	=	star to star
❌	=3	star to simple
❌	=3:	star to prefix
❌	=:4	star to suffix
❌	=3:4	star to full

Simple to…

Status	Pattern	Meaning
⚪️	1=	simple to star
✅	1=3	simple to simple
✅	1=3:	simple to prefix
✅	1=:4	simple to suffix
✅	1=3:4	simple to full

Prefix to…

Status	Pattern	Meaning
⚪️	1:=	prefix to star
❌	1:=3	prefix to simple
✅	1:=3:	prefix to prefix
❌	1:=:4	prefix to suffix
❌	1:=3:4	prefix to full

Suffix to…

Status	Pattern	Meaning
⚪️	:2=	suffix to star
❌	:2=3	suffix to simple
❌	:2=3:	suffix to prefix
✅	:2=:4	suffix to suffix
❌	:2=3:4	suffix to full

Full to…

Status	Pattern	Meaning
⚪️	1:2=	full to star
✅	1:2=3	full to simple
✅	1:2=3:	full to prefix
✅	1:2=:4	full to suffix
✅	1:2=3:4	full to full

Transformation combinations glossary

Legend

Symbol	Meaning
✅	valid transformation
⚪️	no transformation
❌	invalid transformation

left=right=function1[[=function2]...]

Nothing

Status	Pattern	Meaning
⚪️	==	no transformation (also valid for `[=,:]*`)

No left

Status	Pattern	Meaning
✅	==Function	Transform every value
❌	=right=	Only right side provided
❌	=right=Function	Missing left side

No right

Status	Pattern	Meaning
❌	left==	Only left side provided
✅	left==Function	Transform values for keys matching the left expression, without remapping

No Function

Status	Pattern	Meaning
✅	left=right=	Remap keys matching the left expression to the right expression

Complete Transformation

Status	Pattern	Meaning
✅	left=right=Function	Remap keys matching the left expression to the right expression and transform their values

CI/CD

Workflows

Versioning and release strategy

To Pre-Production Marketplace

To Production Marketplace

Automatic Version bump

Once a release of MAJOR.MINOR+1.0 happens to production, the project’s version is bumped automatically to MAJOR.MINOR+1-SNAPSHOT so that, until the next release of MAJOR.MINOR+2 to production, the upcoming PR merges will be deployed to pre-production as MAJOR.MINOR+1.1, MAJOR.MINOR+1.2 and so on.

Updating the docs

Do not edit this file directly: it is autogenerated, as is content.html, which serves as the package’s embedded documentation.

Updating this README.md

The file to edit is doc/README.md. Once it is edited, regenerate this README.md along with its table of contents (TOC):

pandoc --toc --toc-depth=6 -s -t gfm -o README.md doc/README.md

⚠️ The CI build and test pipeline will verify that this file, README.md, is equal to the result of the above command.

Updating the package doc: content.html

The content.html is not a source file, it is generated in the CI build and test pipeline by the following command and packaged appropriately:

Nuxeo HxAI Connector (WIP)
