An addon to bulk map, remap, chain-transform and send metadata and binaries to Content Intelligence via the Ingest service.
Get up and running quickly!
This will ingest all the files under the given <my-root-doc-id>:

```
curl -sS -u foo:bar -H 'Content-Type: application/json' <myNuxeoUrl>/nuxeo/api/v1/automation/Bulk.RunAction -d \
'{"params":{
    "query":"SELECT * FROM Document WHERE ecm:ancestorId = '\''<my-root-doc-id>'\''",
    "action":"ingest"
  }
}'
```

However, the IngestAction is very flexible and can be parameterized very finely, as you will see going through the other examples.
This will synchronize the groups, users and memberships provided by the UserManager:

```
curl -XPOST -sS -u foo:bar -H 'Accept: application/json' <myNuxeoUrl>/nuxeo/site/automation/Nucleus.Sync.Users.Groups -H "Content-type: application/json+nxrequest" -d "{}"
```

Ingestion offers many possibilities via mapping and transformation. You will certainly want to stay in dryRun mode until you have nailed down your parameters:
Write the following content in myParameters.json and inject it the smart way:

```json
{
  "dryRun": true
}
```

This will remove all mappings so you can build a new one piece by piece. Write the following content in myParameters.json and inject it the smart way:
```json
{
  "dryRun": true,
  "aggregateDefaultMappings": false,
  "aggregateDefaultTransformations": false,
  "aggregateDefaultPropertyMappers": false,
  "replaceMapping": true
}
```

This will get you back to the default mapping on a document, as if it were going to be ingested for the first time. Write the following content in myParameters.json and inject it the smart way:
```json
{
  "dryRun": true,
  "replaceMapping": true
}
```

Let's say you want to adjust your current mapping. You can override it this way. Write the following content in myParameters.json and inject it the smart way:
```json
{
  "dryRun": true,
  "inlineMappings": "!dc:contributors,!dc:description",
  "inlineTransformations": "dc:title=meta:name=_Flag",
  "inlinePropertyMappers": "dc:title ExtraPropertiesMapper dc:extra:my_value"
}
```

That will:

- remove dc:contributors and dc:description from the mapping
- remap dc:title to meta:name and apply the _Flag function to it
- add a dc:extra property with my_value as its value

See Mapping Documents and Transforming Documents for more info.

Folderish documents should be ingested first; this spares a lot of ACL recomputation downstream. See onlyContent and onlyAncestorsAndFolders.
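As an illustrative sketch, the folderish-first approach could be done with two successive Bulk.RunAction calls. The URL, credentials and document id are placeholders, and the stringified `parameters` key mirrors the parameter-passing mechanism shown later in this document:

```
# 1. ingest folderish ancestors first
curl -sS -u foo:bar -H 'Content-Type: application/json' <myNuxeoUrl>/nuxeo/api/v1/automation/Bulk.RunAction -d \
'{"params":{
    "query":"SELECT * FROM Document WHERE ecm:ancestorId = '\''<my-root-doc-id>'\''",
    "action":"ingest",
    "parameters": "{\"onlyAncestorsAndFolders\":true}"
  }
}'

# 2. then ingest the remaining non-folderish content
curl -sS -u foo:bar -H 'Content-Type: application/json' <myNuxeoUrl>/nuxeo/api/v1/automation/Bulk.RunAction -d \
'{"params":{
    "query":"SELECT * FROM Document WHERE ecm:ancestorId = '\''<my-root-doc-id>'\''",
    "action":"ingest",
    "parameters": "{\"onlyContent\":true}"
  }
}'
```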
Nuxeo associates metadata and content (text, binaries…). Nuxeo indexes documents and has powerful search capabilities. Nuxeo's metadata are stored in schemas:

```xml
<schema xmlns:common="http://www.nuxeo.org/ecm/schemas/common/" name="common">
  <common:icon>/icons/pdf.png</common:icon>
</schema>
<schema xmlns:dc="http://www.nuxeo.org/ecm/schemas/dublincore/" name="dublincore">
  <dc:contributors>
    <item>Administrator</item>
  </dc:contributors>
  <dc:created>2024-11-21T15:38:08.620Z</dc:created>
  <dc:creator>Administrator</dc:creator>
  <dc:description>A poem from the heart</dc:description>
  <dc:lastContributor>Administrator</dc:lastContributor>
  <dc:modified>2024-11-21T15:55:19.496Z</dc:modified>
  <dc:nature>article</dc:nature>
  <dc:title>testPoem</dc:title>
</schema>
```

The Ingest service provides a REST API to send your documents to Content Intelligence.
The Ingest payload is an array of “ingest events” with two distinguishable parts.
A part of the schema is mandatory and handled by the connector. You don’t need to configure it.
Data is expected this way:
ACLs sit in between: they are mandatory but are part of the properties, and they are sent automatically as well.
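As a purely illustrative sketch (the field names here are hypothetical, not the actual Ingest schema), an ingest event can be pictured as a connector-handled mandatory part plus a properties part, ACLs included:

```json
{
  "metadata": { "comment": "mandatory part, handled by the connector" },
  "properties": {
    "acl": ["...sent automatically..."],
    "dc:title": "testPoem",
    "file:content": { "comment": "binary reference" }
  }
}
```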
Install the package:

```
<NUXEO_HOME>/nuxeoctl mp-install nuxeo-hxai-connector
```

Then update nuxeo.conf with the desired properties. Please refer to the list of configuration options in the section below.

| Property name | Description |
|---|---|
| hxai.ingest.client.id | |
| hxai.ingest.client.secret | |
| hxai.ingest.env.key | hxai-<uuid>: identifies the env the repo is part of |
| hxai.ingest.source.id | <uuid> : identifies the repo in Ingest context |
| hxai.nucleus.client.id | |
| hxai.nucleus.client.secret | |
| hxai.nucleus.system.id | <uuid> : identifies the repo in Nucleus context |
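For example, a nuxeo.conf fragment setting the properties above could look like the following (all values are placeholders to replace with your own):

```
hxai.ingest.client.id=<my-ingest-client-id>
hxai.ingest.client.secret=<my-ingest-client-secret>
hxai.ingest.env.key=hxai-<uuid>
hxai.ingest.source.id=<uuid>
hxai.nucleus.client.id=<my-nucleus-client-id>
hxai.nucleus.client.secret=<my-nucleus-client-secret>
hxai.nucleus.system.id=<uuid>
```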
| Property name | default |
|---|---|
| nuxeo.bulk.action.ingestAction.defaultConcurrency | 2 |
| nuxeo.bulk.action.ingestAction.defaultPartitions | 4 |
| nuxeo.bulk.action.nucleusMappingAction.defaultConcurrency | 2 |
| nuxeo.bulk.action.nucleusMappingAction.defaultPartitions | 4 |
Some configurations come with default values and are configurable through the ConfigurationService.
| Property name | default |
|---|---|
| hxai.nucleus.auth.base.url | https://auth.iam.dev.experience.hyland.com |
| hxai.nucleus.system.integration.base.url | https://api.nucleus.dev.experience.hyland.com |
| hxai.ingest.base.url | https://ingestion.insight.dev.experience.hyland.com |
| hxai.connection.pool.max.size | 1 |
| hxai.executor.pool.size.max | 1 |
| hxai.ingest.binary.check.threshold.byte.size | 26214400 |
| hxai.ingest.presigned.url.cache.size.max | 100 |
| hxai.ingest.inline.consumer.cache.size.max | 1000 |
Expressed in bytes. Checking whether a digest is already known by Content Intelligence has a tangible time cost. If the file being checked is smaller than the threshold, the check is not worth it: sending the binary anyway in such a case is faster.

In dry-run, this check is still done, to give users a feel for the time consumed by that option and to let them tune the threshold. If one really means to bypass the check in dry-run (to work offline, for example), setting this to a very high value is the way to go. Otherwise, it is recommended to keep it at its real level.
Used for binary upload.
Used for serialization and binary upload.
When an inline Consumer<IngestProperty> is submitted to the IngestAction, it is cached for reuse with matching documents. To prevent the cache from growing unexpectedly (especially as users try things out with inlineTransformers), there is a cache size limit. To keep things simple, the cache is simply cleared when it reaches its maximum size.
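The ConfigurationService-backed defaults above can be overridden with an XML contribution. The sketch below assumes the standard Nuxeo ConfigurationService extension point (target `org.nuxeo.runtime.ConfigurationService`, point `configuration`); verify the target against your Nuxeo version:

```xml
<component name="my.project.hxai.config.override">
  <extension target="org.nuxeo.runtime.ConfigurationService" point="configuration">
    <!-- raise the connection pool size from its default of 1 -->
    <property name="hxai.connection.pool.max.size">4</property>
  </extension>
</component>
```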
The default configuration is based on Document type. Descriptors whose ID matches a document type are targeted at that document type.

There are 3 extension points:

- IngestMappings
- IngestTransformations
- IngestPropertyMappers

They all take IngestDescriptors.
The IngestDescriptor is a flexible descriptor that can take an args String or a list of items which are IngestItemDescriptors. The IngestItemDescriptor is also a flexible descriptor: it takes either an args String or a list of args which are IngestArgDescriptors.
Here is a representative sample showing how to use ingestion descriptors:

```xml
<extension target="org.nuxeo.hxai.IngestMappingServiceComponent" point="ingestMappings">
  <ingest id="system" args="ingestProperty:type"/>
  <ingest id="Root" args="@system root:title"/>
  <ingest id="default" args="@system dublincore file:content files:files"/>
</extension>
<extension target="org.nuxeo.hxai.IngestMappingServiceComponent" point="ingestPropertyMappers">
  <ingest id="default">
    <item args="files:files FilesPropertyMapper"/>
    <item>
      <arg value="ingestProperty:type"/>
      <arg value="ExtraPropertiesMapper"/>
      <arg value="ingestProperty:type:DOCTYPE"/>
      <arg value="dc:title:BASENAME"/>
      <arg value="dc:created:EPOCH"/>
      <arg value="dc:creator:system"/>
      <arg value="dc:modified:EPOCH"/>
      <arg value="dc:lastContributor:system"/>
    </item>
  </ingest>
  <ingest id="Root">
    <item>
      <arg value="root:title"/>
      <arg value="ExtraPropertiesMapper"/>
      <arg value="ingestProperty:type:DOCTYPE"/>
      <arg value="dc:title:/"/>
      <arg value="dc:created:EPOCH"/>
      <arg value="dc:creator:system"/>
      <arg value="dc:modified:EPOCH"/>
      <arg value="dc:lastContributor:system"/>
    </item>
  </ingest>
</extension>
<extension target="org.nuxeo.hxai.IngestMappingServiceComponent" point="ingestTransformations">
  <ingest id="default">
    <item args="dc:title==AddKv annotation:name"/>
    <item args="dc:created==AddKv annotation:dateCreated"/>
    <item args="dc:creator==AddKv annotation:createdBy"/>
    <item args="dc:modified==AddKv annotation:dateModified"/>
    <item args="dc:lastContributor==AddKv annotation:modifiedBy"/>
    <item args="ingestProperty:type==AddKv annotation:type"/>
  </ingest>
</extension>
```

As of today, Ingest only handles binaries at the root of the properties part. This is fine for simple properties like file:content but doesn't work for complex properties nesting binaries, like files:files.

There are several ways to flatten binaries so that they comply with Ingest's requirements:
- Using custom Mappers to separate, for example, files:files into multiple simple properties and omit the initial array containing them. Custom mapping happens before Mapping and Transforming, so the properties generated during custom mapping can be transformed as well.
- Post-filtering the outgoing JSON payload, which makes it possible to flatten nested binaries that went unnoticed. If a complex type containing binaries does not have a custom mapping, we move the binaries to the root of the properties to avoid them being silently ignored by Ingest. If this were done with files:files, the array containing the files would remain empty in the JSON payload; hence the contribution of a default IngestPropertyMapper.
```
# original structure with a containing array:
{
  "my:complex": [
    { "file": {} },
    { "file": {} }
  ]
}
# with a custom mapper, could become:
{
  "renamed:transformed/0": { "file": {} },
  "renamed:transformed/1": { "file": {} }
}
# with post-filtering:
{
  "my:complex": [],
  "my:complex/0": { "file": {} },
  "my:complex/1": { "file": {} }
}
```

To ingest documents efficiently, the Nuxeo HxAI Connector provides capabilities to:
This will synchronize the entities returned by the UserManager. It needs to be run regularly, as Nuxeo is not notified of every update made in an IdP, Active Directory or the like.
We can leverage Nuxeo’s query capabilities to target
documents and send them to ingestion via a query language called
NXQL.
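For instance, NXQL can narrow the target set beyond an ancestor id. The queries below are illustrative sketches (verify date-literal support in your Nuxeo version):

```
-- all File documents under a given folder
SELECT * FROM File WHERE ecm:ancestorId = '<my-root-doc-id>'

-- only documents modified since a given date
SELECT * FROM Document WHERE dc:modified > DATE '2024-01-01'
```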
We leverage the Nuxeo Bulk Action Framework (BAF), which takes an NXQL query:

```
curl -sS -u foo:bar -H 'Content-Type: application/json' <myNuxeoUrl>/nuxeo/api/v1/automation/Bulk.RunAction -d \
'{"params":{
    "query":"SELECT * FROM Document WHERE ecm:ancestorId = '\''<my-root-doc-id>'\''",
    "action":"ingest"
  }
}' -o /dev/null
```

The actual list of parameters taken by the Action:
```json
{
  "inlineMappings": "dublincore,common",
  "inlineTransformations": "a=b=Function,c=d=OtherFunction",
  "inlinePropertyMappers": "a=b=Function,c=d=OtherFunction",
  "aggregateDefaultMappings": true,
  "aggregateDefaultTransformations": true,
  "aggregateDefaultPropertyMappers": true,
  "replaceMapping": false,
  "persistMapping": false,
  "onlyContent": false,
  "onlyAncestorsAndFolders": false
}
```

The parameters of the IngestAction are of two types: some can be persisted, some can't. See the HxAI facet for more details about parameter persistence.
An inline IngestDescriptor contributing to
ingestMappings to apply to Documents matching
the NXQL query.
An inline IngestDescriptor contributing to
ingestTransformations to apply to Documents
matching the NXQL query.
An inline IngestDescriptor contributing to
ingestPropertyMappers to apply to Documents
matching the NXQL query.
Leverages the default ingestMappings for the
Document based on type. This adds up to
inlineMappings.
Leverages the default ingestTransformations for the
Document based on type. This adds up to
inlineTransformations.
Leverages the default ingestPropertyMappers for the
Document based on type. This adds up to
inlinePropertyMappers.
Only ingest non-folderish
documents.
Only ingest folderish documents.
Allows replacing the mapping, transformations and property mappers previously saved on the Document. Defaults to false.
Allows saving inline* and aggregate*
params. This has no effect in dryRun.
Ingestion can be automated in 2 distinct ways: schedule-based (the default) or purely event-based (disabled by default).

Schedules are a historical feature of Nuxeo. They fire an event following a cron expression. More info. Schedules are the preferred way to ingest your documents; indeed, this approach only requires read-only access to documents.
By setting up multiple schedules, the user could run multiple ingestion jobs on subparts of her repository, each with its own config.
Here are 2 Schedules running every other second:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<component name="org.nuxeo.hxai.crons.config" version="1.0.0">
  <extension target="org.nuxeo.ecm.core.scheduler.SchedulerService" point="schedule">
    <schedule id="ingest1">
      <eventId>ingest1</eventId>
      <eventCategory>ingest</eventCategory>
      <cronExpression>0/2 * * * * ?</cronExpression>
    </schedule>
    <schedule id="ingest2">
      <eventId>ingest2</eventId>
      <eventCategory>ingest</eventCategory>
      <cronExpression>1/2 * * * * ?</cronExpression>
    </schedule>
  </extension>
</component>
```

```xml
<?xml version="1.0" encoding="UTF-8"?>
<component name="org.nuxeo.hxai.cron.events.listeners.config" version="1.0.0">
  <extension target="org.nuxeo.ecm.core.event.EventServiceComponent" point="listener">
    <listener name="ingest1" async="false" postCommit="false" priority="120" class="org.nuxeo.hxai.listeners.IngestListener1">
      <event>ingest1</event>
    </listener>
    <listener name="ingest2" async="false" postCommit="false" priority="120" class="org.nuxeo.hxai.listeners.IngestListener2">
      <event>ingest2</event>
    </listener>
  </extension>
</component>
```

This sample will always execute. Depending on your specific use cases, you will want to ingest documents only if they have been updated in the last X time units.
```java
public class IngestListener1 implements EventListener {

    @Override
    public void handleEvent(Event event) {
        String query = "SELECT * from Document WHERE ecm:path = '/default-domain/workspaces/test/test'";
        BulkCommand command = new BulkCommand.Builder(IngestAction.ACTION_NAME, query, SYSTEM_USERNAME)
                .param(INLINE_MAPPINGS, "files:files,file:content,dublincore,tags,foo:bar")
                .param(INLINE_TRANSFORMATIONS, "files:files/=my:binaries")
                .param(REPLACE_MAPPING, true)
                .param(DRY_RUN_MODE, true)
                .build();
        Framework.getService(BulkService.class).submit(command);
    }
}
```

The IngestUpdateListener will trigger ingestion on a document when it is updated, provided the document has the hxai facet. However, for the document to have the facet, you must already have sent it for ingestion once with mapping persistence enabled.
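For instance, an initial run enabling mapping persistence could look like this sketch (same placeholders as the earlier curl examples; the stringified `parameters` key is explained later in this document):

```
curl -sS -u foo:bar -H 'Content-Type: application/json' <myNuxeoUrl>/nuxeo/api/v1/automation/Bulk.RunAction -d \
'{"params":{
    "query":"SELECT * FROM Document WHERE ecm:ancestorId = '\''<my-root-doc-id>'\''",
    "action":"ingest",
    "parameters": "{\"persistMapping\":true}"
  }
}'
```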
Once you have made your initial bulk import, this listener will keep you in live sync.
It is enabled by default (see Automating Ingestion). You can disable it by contributing the following:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<component name="org.nuxeo.hxai.events.listener.config.test" version="1.0.0">
  <require>org.nuxeo.hxai.events.listener.config</require>
  <extension target="org.nuxeo.ecm.core.event.EventServiceComponent" point="listener">
    <listener name="ingestlistener" enabled="false"/>
  </extension>
</component>
```

The Hxai facet acts like a flag telling Nuxeo that a document has already been ingested once and is eligible for ingestion updates when necessary.

The hxai schema holds some valuable ingestion-related information. Aside from the ingestion status, the following IngestAction parameters allow repeating a document's ingestion exactly the way it was last done, if desired (for updates):
The following IngestAction parameters are not
storable:
Those parameters need to be stringified to be sent in our query (in an additional parameters key):
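For a quick feel of what stringification produces, here is a minimal POSIX shell sketch. It only escapes double quotes, which is sufficient for the simple values shown in this document; prefer the jq-based approach below for the general case:

```shell
# a small parameters payload
params='{"dryRun":true,"replaceMapping":false}'

# escape every inner double quote, then wrap the result in quotes
escaped=$(printf '%s' "$params" | sed 's/"/\\"/g')
printf '"%s"\n' "$escaped"
# prints "{\"dryRun\":true,\"replaceMapping\":false}"
```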
We can write the stringified JSON by hand, escaping all sensitive characters:

```
"{\"inlineMappings\":\"dublincore,common\",\"inlineTransformations\":\"a=b=Function,c=d=OtherFunction\",\"replaceMapping\":false,\"aggregateDefaultMappings\":false,\"aggregateDefaultTransformations\":false,\"persistMapping\":false}"
```

This alternative is safer, easier and uses POSIX syntax:

```
$(jq -c < myParams.json | jq -R)
```

Thus, the complete query becomes one of the following:
```
curl -sS -u foo:bar -H 'Content-Type: application/json' <myNuxeoUrl>/nuxeo/api/v1/automation/Bulk.RunAction -d \
'{"params":{
    "query":"SELECT * FROM Document WHERE ecm:ancestorId = '\''<my-root-doc-id>'\''",
    "action":"ingest",
    "parameters": "{\"inlineMappings\":\"dublincore,common\",\"inlineTransformations\":\"a=b=Function,c=d=OtherFunction\",\"replaceMapping\":false,\"aggregateDefaultMappings\":false,\"aggregateDefaultTransformations\":false,\"persistMapping\":false}"
  }
}'
```

```
curl -sS -u foo:bar -H 'Content-Type: application/json' <myNuxeoUrl>/nuxeo/api/v1/automation/Bulk.RunAction -d \
'{"params":{
    "query":"SELECT * FROM Document WHERE ecm:ancestorId = '\''<my-root-doc-id>'\''",
    "action":"ingest",
    "parameters": '$(jq -c < myParams.json | jq -R)'
  }
}'
```

Mappings are defined this way:
Although unprefixed properties are supported, their use is discouraged:

```
files                # will add files:files to the mapping
dc:title             # adding single properties, one by one
dublincore           # map the 18 properties present in dublincore
@myMappingReference  # map all the mappings found in the 'myMappingReference' Mapping
```

Chaining ','-separated Mappings will help build complete mappings in one line:
```
dc:title,dc:description #,.. there are 18 properties with the dc: prefix...
# I don't want to type them all, just take the whole dublincore schema! (and the common schema! because why not?)
dublincore,common
# I can also get "simples", properties without a prefix
dublincore,icon
# OK I want the whole dublincore and common schemas, except dc:title
dublincore,icon,!dc:title
# Spaces also work
files:files,dublincore file:content
# Order matters
dublincore,!dc:title # OK: add all dublincore except dc:title
!dc:title,dublincore # Useless: removes dc:title but adds it back!
```

The connector comes with a default mapping configuration, ingest-mapping-service-config.xml, which ensures compliance with Ingest.
IngestDescriptors can be contributed to the IngestMappings extension point via XML. See contributing-to-ingest-xp-points-extention-points.
If a mapping contribution’s id is a document type, it will be used as default mapping. See contributing-to-ingest-xp-points.
```
dublincore,@bigMapping # @bigMapping is a reference to a mapping with id `bigMapping`.
# what if I don't want the unwanted:prop brought by @bigMapping ?
dublincore,@bigMapping,!unwanted:prop # let's take it off
```

Note: mappings are deduplicated, so it doesn't matter if your mappings end up requesting the same property multiple times.
Logs can pinpoint errors in your Descriptors. These DEBUG logs represent nested mappings where first references @second, which references @third:
```
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: default
DEBUG [IngestMappingServiceImpl] IngestMapping: 'default' was processed successfully.
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: first
DEBUG [IngestMappingServiceImpl] IngestMapping: first directly depends on: second
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: second
DEBUG [IngestMappingServiceImpl] IngestMapping: second directly depends on: third
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: third
DEBUG [IngestMappingServiceImpl] IngestMapping: 'third' was processed successfully.
DEBUG [IngestMappingServiceImpl] IngestMapping: 'second' was processed successfully.
DEBUG [IngestMappingServiceImpl] IngestMapping: 'first' was processed successfully.
```

Same thing at TRACE level:
```
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: default
TRACE [SimpleIngestMapping] the 'dublincore' mapping was identified as a schema.
TRACE [SimpleIngestMapping] processing mapping: 'dublincore'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'default' was processed successfully.
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: first
DEBUG [IngestMappingServiceImpl] IngestMapping: first directly depends on: second
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: second
DEBUG [IngestMappingServiceImpl] IngestMapping: second directly depends on: third
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: third
TRACE [SimpleIngestMapping] the 'dc:content-type' mapping was identified as a property.
TRACE [SimpleIngestMapping] processing mapping: 'dc:content-type'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'third' was processed successfully.
TRACE [SimpleIngestMapping] the 'dc:description' mapping was identified as a property.
TRACE [SimpleIngestMapping] processing mapping: 'dc:description'
TRACE [SimpleIngestMapping] the '@third' mapping was identified as reference to another mapping.
TRACE [SimpleIngestMapping] processing mapping: '@third'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'second' was processed successfully.
TRACE [SimpleIngestMapping] the 'dc:title' mapping was identified as a property.
TRACE [SimpleIngestMapping] processing mapping: 'dc:title'
TRACE [SimpleIngestMapping] the '@second' mapping was identified as reference to another mapping.
TRACE [SimpleIngestMapping] processing mapping: '@second'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'first' was processed successfully.
```

Since nested mappings are possible, cycle detection is provided. Here is a normal contribution:
```xml
<ingest id="first" args="dc:title,@second" />
<ingest id="second" args="dc:description,@third" />
<ingest id="third" args="dc:content-type" />
```

Let's break it with an override making third depend on forth, and make a cyclic reference from forth back to second:

```xml
<ingest id="third" args="dc:content-type,@foo,@forth" />
<ingest id="forth" args="@second" />
```

This will prevent Nuxeo from starting, but we also need a way to track down the problem.
Cyclic references are easy to find:

```
// TL;DR
java.lang.IllegalArgumentException: Detected cycle in IngestMapping: first->second->third->forth->second

// Full stack
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: default
TRACE [SimpleIngestMapping] the 'dublincore' mapping was identified as a schema.
TRACE [SimpleIngestMapping] processing mapping: 'dublincore'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'default' was processed successfully.
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: first
DEBUG [IngestMappingServiceImpl] IngestMapping: first directly depends on: second
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: second
DEBUG [IngestMappingServiceImpl] IngestMapping: second directly depends on: third
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: third
DEBUG [IngestMappingServiceImpl] IngestMapping: third directly depends on: foo->forth
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: foo
TRACE [SimpleIngestMapping] the 'common' mapping was identified as a schema.
TRACE [SimpleIngestMapping] processing mapping: 'common'
DEBUG [IngestMappingServiceImpl] IngestMapping: 'foo' was processed successfully.
DEBUG [IngestMappingServiceImpl] processing mapping descriptor: forth
DEBUG [IngestMappingServiceImpl] IngestMapping: forth directly depends on: second
ERROR [RegistrationInfoImpl] Component service:org.nuxeo.hxai.IngestMappingServiceComponent notification of application started failed: Detected cycle in IngestMapping: first->second->third->forth->second
java.lang.IllegalArgumentException: Detected cycle in IngestMapping: first->second->third->forth->second
	at org.nuxeo.hxai.service.IngestMappingServiceImpl.processMappingDescriptor(IngestMappingServiceImpl.java:67) ~[classes/:?]
	at org.nuxeo.hxai.service.IngestMappingServiceImpl.lambda$processMappingDescriptor$5(IngestMappingServiceImpl.java:93) ~[classes/:?]
```

The most direct hint is the printed problematic chain: first->second->third->forth->second
IngestPropertyMappers allow you to map certain
properties the way you want as they have access to the context of the
whole IngestDocument.
They implement
java.util.function.Consumer<PropertyMappingContext>
which allows them to access any element inside the Document. This is
useful in the case of files:files for example, which is
destructured into multiple files:files/n entries at the
root of properties.
There is a default package for the functions used by IngestPropertyMappers. If you put your functions there, you don't need to specify their package:

```
// assumed package
org.nuxeo.hxai.client.objects.json.mappers
```

This is actually where the provided mappers live. However, mappers can live anywhere else:

```
MyMapper                       # default: points to org.nuxeo.hxai.client.objects.json.mappers.MyMapper
.MyMapper                      # same thing
.my.sub.package.MyOtherMapper  # points to org.nuxeo.hxai.client.objects.json.mappers.my.sub.package.MyOtherMapper
my.complete.package.MyMapper   # use a canonical name
```

A few provided mappers:
```
ArrayPropertyMapper # Handles array destructuring, as properties cannot be nested.
ExtraPropertyMapper # Adds arbitrary properties to an IngestDocument.
FilesPropertyMapper # Destructures Properties implementing a collection of Files.
```

The ExtraPropertiesMapper takes positional arguments on top of the key pattern. The following is a compact rewrite of what is defined for root:title in the default mapping contribution for the Root doctype (see case study). Each positional argument follows the pattern prefix:suffix:(PRESET|literal_value). PRESET can be:

- BASENAME: the last segment of the document's path
- DOCTYPE: the document type
- EPOCH: an Instant representing the oldest possible date
- NOW: an Instant representing the moment the property is created
- literal_value: anything other than those PRESET values is used verbatim

So the line below, when it finds a document mapping root:title, calls the ExtraPropertiesMapper to add:

- ingestProperty:type=Root
- dc:title=/
- dc:created=<EPOCH Value>
- dc:creator=system

```
root:title ExtraPropertiesMapper ingestProperty:type:DOCTYPE dc:title:/ dc:created:EPOCH dc:creator:system dc:modified:EPOCH dc:lastContributor:system
^          ^                     ^                   ^       ^          ^
target     the Mapper to use     added key           preset  literal    yet another...
                                                     value   value
```
IngestDescriptors can be contributed to the IngestPropertyMappers extension point via XML. See contributing-to-ingest-xp-points-extention-points.

Note that IngestPropertyMappers don't merge: they replace each other.
Transforming covers two things: remapping keys and actually transforming values. Transformations are parameters made of 3 optional parts, used this way:
```
# Remap only
dc:=base:    # Remap all dublincore properties to prefix them with 'base'
:title=:name # Remap all properties suffixed 'title' to the suffix 'name'
files:files/=ingest:binaries # Remap all files:files/whatever into ingest:binaries/whatever

# Transform only
==Function  # Apply Function to everything
a==Function # Apply Function to a, don't rename it

# Remap and transform
a=b=Function # Map simple property a to b and apply Function
:title=:name=Function # Remap all properties suffixed 'title' and apply Function to them
a:b=c:d=Function # Exactly map a:b to c:d and apply Function to it
files:files/=ingestion:binaries=Function # Remap all flattened items from files:files/whatever to ingestion:binaries/whatever and apply Function to them one by one

# Order consideration
# ⚠️ The following will not work as expected ⚠️
a=b=Function1=Function2,a=b=OtherFunction # After being transformed into b, a is not matched by the second transformation and OtherFunction is not applied.
# This would work, but there is a better way below
a=b=Function1=Function2,b==OtherFunction # Function1 and Function2 will be applied before OtherFunction
# The most reliable way to chain functions on a single property doesn't require you to figure things out:
a=b=Function1=Function2=Function3 # a is renamed to b and Function1 to Function3 are applied to it in order.

# Adding parameters
a=b=Function1 arg1 arg2=Function2 arg1=Function3 # The function name always comes first, then anything before an eventual = is parameters.

# Multiple chains in a single line
a=b=Function1 arg1 arg2=Function2 arg1=Function3,c==Function1 arg1 arg2 arg3=Function4
```

All functions must implement Consumer<IngestProperty>. They work at the property level (unlike IngestPropertyMappers, they don't have access to the whole IngestDocument).
There is a default package for the functions used by IngestTransformations. If you put your functions there, you don't need to specify their package:

```
// assumed package
org.nuxeo.hxai.ingest.functions
```

This is actually where the provided functions live. However, functions can live anywhere else:

```
MyFunction                       # default: points to org.nuxeo.hxai.ingest.functions.MyFunction
.MyFunction                      # same thing
.my.sub.package.MyOtherFunction  # points to org.nuxeo.hxai.ingest.functions.my.sub.package.MyOtherFunction
my.complete.package.MyFunction   # use a canonical name
```

A few provided functions:
```
AddKv # Adds key:value pairs to the targeted property. Takes parameters like key1:value1 key2:value2.

# The following functions are prefixed `_` because they are provided test functions.
_Flag   # will assure you that a property was touched
_Concat # will concatenate a distinguishable value to the property value
_Count  # initializes or increments a numeric value telling you how many times it was applied
```

IngestDescriptors can be contributed to the IngestTransformations extension point via XML. See contributing-to-ingest-xp-points-extention-points.
Transformations can be malformed too. Malformed contributions will be caught at Nuxeo initialization:

```
// Missing left side
DEBUG [SimpleIngestTransformations$Transformation] Instanciating Transformation: 'inline#=c=_Flag'.
TRACE [SimpleIngestTransformations$Transformation] Transformation: 'inline#=c=_Flag' left side: 'null' is of type: 'STAR' right side: 'c' is of type: 'SIMPLE'.
java.lang.IllegalArgumentException: Malformed Transformation: 'inline#=c=_Flag' with a missing left side.

// Left side only
DEBUG [SimpleIngestTransformations$Transformation] Instanciating Transformation: 'inline#a=='.
TRACE [SimpleIngestTransformations$Transformation] Transformation: 'inline#a==' left side: 'a' is of type: 'SIMPLE' right side: 'null' is of type: 'STAR'.
java.lang.IllegalArgumentException: Malformed Transformation: 'inline#a==' with a left side only.

// Right side only
DEBUG [SimpleIngestTransformations$Transformation] Instanciating Transformation: 'inline#=c='.
TRACE [SimpleIngestTransformations$Transformation] Transformation: 'inline#=c=' left side: 'null' is of type: 'STAR' right side: 'c' is of type: 'SIMPLE'.
java.lang.IllegalArgumentException: Malformed Transformation: 'inline#=c=' with a right side only.
```
As we said, Transformations have a remapping role. This is parameterized by the left and right sides, which are none other than XPaths. Thus, the prefix is like a directory and the suffix is like a file inside that directory.

So we need to be careful not to create an excessive mapping, that is, one mapping several properties to the same target:

```
// All a: prefixed properties would end up overriding each other as the simple c property (a:foo, a:bar, a:baz, a:qux would overlap as 'c' for example)
XPath: 'a:' cannot be the left side of: 'c' in Transformation: 'inline#a:=c=_Flag'. 'a:' is a prefix and can only be mapped to another prefix.
```

You may also want to see the transformation combinations glossary.
There are many possible combinations:
The full form of a remapping looks like so: `1:2=3:4`

| Symbol | Meaning |
|---|---|
| ✅ | valid remap |
| ⚪️ | no remap |
| ❌ | invalid remap (many possible sources, one target) |
| Status | Pattern | Meaning |
|---|---|---|
| ⚪️ | = | star to star |
| ❌ | =3 | star to simple |
| ❌ | =3: | star to prefix |
| ❌ | =:4 | star to suffix |
| ❌ | =3:4 | star to full |
| Status | Pattern | Meaning |
|---|---|---|
| ⚪️ | 1= | simple to star |
| ✅ | 1=3 | simple to simple |
| ✅ | 1=3: | simple to prefix |
| ✅ | 1=:4 | simple to suffix |
| ✅ | 1=3:4 | simple to full |
| Status | Pattern | Meaning |
|---|---|---|
| ⚪️ | 1:= | prefix to star |
| ❌ | 1:=3 | prefix to simple |
| ✅ | 1:=3: | prefix to prefix |
| ❌ | 1:=:4 | prefix to suffix |
| ❌ | 1:=3:4 | prefix to full |
| Status | Pattern | Meaning |
|---|---|---|
| ⚪️ | :2= | suffix to star |
| ❌ | :2=3 | suffix to simple |
| ❌ | :2=3: | suffix to prefix |
| ✅ | :2=:4 | suffix to suffix |
| ❌ | :2=3:4 | suffix to full |
| Status | Pattern | Meaning |
|---|---|---|
| ⚪️ | 1:2= | full to star |
| ✅ | 1:2=3 | full to simple |
| ✅ | 1:2=3: | full to prefix |
| ✅ | 1:2=:4 | full to suffix |
| ✅ | 1:2=3:4 | full to full |
There are fewer combinations than in the remapping combinations glossary, but still quite a few:

The full form of a transformation looks like so: `left=right=function1[[=function2]...]`, where left and right are the parts of a valid remapping.
| Symbol | Meaning |
|---|---|
| ✅ | valid transformation |
| ⚪️ | no transformation |
| ❌ | invalid transformation |
| Status | Pattern | Meaning |
|---|---|---|
| ⚪️ | == | no transformation (also valid for
[=,:]*) |
| Status | Pattern | Meaning |
|---|---|---|
| ✅ | ==Function | Transform every value |
| ❌ | =right= | Only right side provided |
| ❌ | =right=Function | Missing left side |
| Status | Pattern | Meaning |
|---|---|---|
| ❌ | left== | Left side only provided |
| ✅ | left==Function | Transform value for keys matching left expression without remapping |
| Status | Pattern | Meaning |
|---|---|---|
| ✅ | left=right= | Remap left matching keys to right expression |
| Status | Pattern | Meaning |
|---|---|---|
| ✅ | left=right=Function | Remap left matching keys to right expression and transform their values |
CI/CD workflows are present here and they include:
CI for PR: build and test of the source code upon raising a PR against the 2023 branch.

Deployment of package to Pre-production Marketplace: merging a PR onto the 2023 branch triggers the CI, followed by deployment of the generated package to the Nuxeo pre-prod marketplace.

Release to Production Marketplace: a manual release job is available which, when run on a base branch, deploys the latest available minor tag version to the production marketplace.
Once MAJOR.MINOR+1.0 is released to production, the project's version is automatically bumped to MAJOR.MINOR+1-SNAPSHOT. Until the next release of MAJOR.MINOR+2 to production, upcoming PR merges will be deployed to pre-production as MAJOR.MINOR+1.1, MAJOR.MINOR+1.2 and so on.
The repository holds files in many languages. Formatting is verified in the CI.
The Makefile provides steps to:
You can ensure your macOS or Ubuntu workstation has all the tools needed to work with the repository:

```
# ensures you have java, python, volta and all the formatters
$ make all
```

Some tools may be downloaded locally, under .make/dl.

```
$ make nice
```

Only the files modified since the last format will be formatted (even if you switch branches in the middle). This lets you run a single formatting command without thinking about what needs formatting. It also dramatically speeds up the process, as it completely skips Maven when working on non-Java files.

Since make relies on timestamps, which git doesn't restore when moving between committishs, the cache relies on file hashes stored in .make/stamps/*.sum files.
Do not edit README.md; it is autogenerated, as is content.html, which serves as the package's embedded documentation. The file to edit is doc/README.md; then you need to generate its Table Of Contents (TOC):

```
pandoc --toc --toc-depth=6 -s -t gfm -o README.md doc/README.md
```

⚠️ The CI build and test pipeline will verify that this file, README.md, is equal to the result of the above command. This is to make sure:

- you can regenerate the TOC without having to remove it first by hand
- the TOC never gets outdated

The content.html file is not a source file; it can be generated on demand:
```
# - use pandoc to
# - format this file (doc/README.md)
# - generate README.md with a TOC
# - generate content.html
make README.md
```

By default, the responsive theme will be used, but these possibilities are documented here as a memo.

Responsive to your OS settings:

```
pandoc --toc --toc-depth=6 -s --embed-resources --css doc/nour-auto-lail-nahar.css --highlight-style doc/nour-lail.theme -o <path-to-be-determined.html> doc/README.md
```

```
pandoc --toc --toc-depth=6 -s --embed-resources --css doc/nour-lail.css --highlight-style doc/nour-lail.theme -o <path-to-be-determined.html> doc/README.md
```

```
pandoc --toc --toc-depth=6 -s --embed-resources --css doc/nour-nahar.css --highlight-style doc/nour-lail.theme -o <path-to-be-determined.html> doc/README.md
```

You can add cosmetics (neon logo and title) with the following extras:

```
-V logo="$(< doc/connector.svg)" --template doc/nuxeo-hxai-connector-template.html
```