Queue Schema
Every queue has an associated schema that specifies which fields will be extracted from documents as well as the structure of the data sent to the connector and exported from the platform.
See the introduction to Rossum customization for a high-level overview of configuring the captured fields and managing schemas.
The best visual guide to the schema JSON is our tutorial on editing the extraction schema. It will get you started tweaking the schema in the Rossum app and is especially useful when maintaining long dropdown select boxes.
Rossum schema supports data fields with single values (datapoint),
fields with multiple values (multivalue) or tuples of fields (tuple). At the
topmost level, each schema consists of sections, which may either directly
contain actual data fields (datapoints) or use nested multivalues and tuples as
containers for single datapoints.
Although a schema may theoretically consist of an arbitrary number of nested containers, the Rossum UI supports only certain combinations of datapoint types. The supported shapes are:
- simple: atomic datapoints of type number, string, date or enum
- list: simple datapoint within a multivalue
- tabular: simple datapoint within a "multivalue tuple" (a multivalue list containing a tuple for every row)
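The three supported shapes can be recognized by walking the schema content. The following is a minimal sketch, not part of the Rossum API: the helper name `datapoint_shapes` and the `example_schema` data are hypothetical, and the code assumes the object layout described in this section (sections at the top level, a multivalue whose `children` is a single object, a tuple whose `children` is a list).

```python
def datapoint_shapes(schema_content):
    """Map each datapoint id to 'simple', 'list' or 'tabular' (hypothetical helper)."""
    shapes = {}
    for section in schema_content:
        for child in section["children"]:
            if child["category"] == "datapoint":
                shapes[child["id"]] = "simple"
            elif child["category"] == "multivalue":
                inner = child["children"]  # a single object, not a list
                if inner["category"] == "datapoint":
                    shapes[inner["id"]] = "list"
                elif inner["category"] == "tuple":
                    for column in inner["children"]:
                        shapes[column["id"]] = "tabular"
    return shapes


# Hypothetical schema content used only for illustration.
example_schema = [
    {"category": "section", "id": "basic_section", "label": "Basic", "children": [
        {"category": "datapoint", "id": "document_id", "label": "Invoice ID", "type": "string"},
        {"category": "multivalue", "id": "po_numbers", "label": "PO numbers",
         "children": {"category": "datapoint", "id": "po_number", "label": "PO number",
                      "type": "string"}},
        {"category": "multivalue", "id": "line_items", "label": "Line items",
         "children": {"category": "tuple", "id": "line_item", "label": "Line item",
                      "children": [
                          {"category": "datapoint", "id": "item_quantity",
                           "label": "Quantity", "type": "number"},
                      ]}},
    ]},
]

shapes = datapoint_shapes(example_schema)
```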
Schema content
Schema content consists of a list of section objects.
Common attributes
The following attributes are common for all schema objects:
| Attribute | Type | Description | Required |
|---|---|---|---|
| category | string | Category of an object, one of section, multivalue, tuple or datapoint. | yes |
| id | string | Unique identifier of an object. Maximum length is 50 characters. | yes |
| label | string | User-friendly label for an object, shown in the user interface | yes |
| hidden | boolean | If set to true, the object is not visible in the user interface, but remains stored in the database and may be exported. Default is false. Note that section is hidden if all its children are hidden. | no |
| disable_prediction | boolean | Can be set to true to disable field extraction, while still preserving the data shape. Ignored by aurora engines. | no |
Section
Example section object:
{
"category": "section",
"id": "amounts_section",
"label": "Amounts",
"children": [...],
"icon": ""
}
Section represents a logical part of the document, such as amounts or vendor info. It is allowed only at the top level. Schema allows multiple sections, and there should be at least one section in the schema.
| Attribute | Type | Description | Required |
|---|---|---|---|
| children | list[object] | Specifies objects grouped under a given section. It can contain multivalue or datapoint objects. | yes |
| icon | string | The icon that appears on the left panel in the UI for a given section (not yet supported on UI). | |
Datapoint
A datapoint represents a single value, typically a field of a document or some global document information. Fields common to all datapoint types:
| Attribute | Type | Description | Required |
|---|---|---|---|
| type | string | Data type of the object, must be one of the following: string, number, date, enum, button | yes |
| can_export | boolean | If set to false, datapoint is not exported through export endpoint. Default is true. | |
| can_collapse | boolean | If set to true, tabular (multivalue-tuple) datapoint may be collapsed in the UI. Default is false. | |
| rir_field_names | list[string] | List of references used to initialize an object value. See below for the description. | |
| default_value | string | Default value used either for fields that do not use hints from AI engine predictions (i.e. rir_field_names are not specified), or when the AI engine does not return any data for the field. | |
| constraints | object | A map of various constraints for the field. See Value constraints. | |
| ui_configuration | object | A group of settings affecting behaviour of the field in the application. See UI configuration. | |
| width | integer | Width of the column (in characters). Default widths are: number: 8, string: 20, date: 10, enum: 20. Only supported for table datapoints. | |
| stretch | boolean | If total width of columns doesn't fill up the screen, datapoints with stretch set to true will be expanded proportionally to other stretching columns. Only supported for table datapoints. | |
| width_chars | integer | (Deprecated) Use width and stretch properties instead. | |
| score_threshold | float [0;1] | Threshold used to automatically validate field content based on AI confidence scores. If not set, queue.default_score_threshold is used. | |
| formula | string[0;2000] | Formula definition, required only for fields of type formula, see Formula Fields. rir_field_names should also be empty for these fields. | |
| prompt | string[0;2000] | Prompt definition, required only for fields of type reasoning. | |
| context | list[string] | Context for the prompt, required only for fields of type reasoning, see Logical types. | |
The rir_field_names attribute specifies the source of the initial value of the object. List items may be:
- one of extracted field types to use the AI engine prediction value
- upload:id to identify a value specified while uploading the document
- edit:id to identify a value specified in edit_pages endpoint
- email_header:<id> to use a value extracted from email headers. Supported email headers: from, to, reply-to, subject, message-id, date.
- email_body:<id> to select email body. Supported values are text_html for HTML body or text_plain for plain text body.
- email:<id> to identify a value specified in email.received hook response
- emails_import:<id> to identify a value specified in the values parameter when importing an email.
If multiple list items are specified in rir_field_names, the first available value will be used.
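The "first available value" rule can be pictured as a simple lookup. This is only an illustrative sketch: the function `initial_value` and the `available` dictionary (mapping a source name to the value found for it) are hypothetical and do not describe how the platform stores these values internally.

```python
def initial_value(rir_field_names, available, default_value=None):
    """Return the first available value, falling back to default_value (hypothetical helper)."""
    for name in rir_field_names or []:
        value = available.get(name)
        if value is not None:
            return value
    return default_value


# Hypothetical inputs: the AI engine returned nothing for document_id,
# but a value was supplied while uploading the document.
available = {"upload:reference": "REF-1"}
picked = initial_value(["document_id", "upload:reference"], available)
```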
String type
Example string type datapoint with constraints:
{
"category": "datapoint",
"id": "document_id",
"label": "Invoice ID",
"type": "string",
"default_value": null,
"rir_field_names": ["document_id"],
"constraints": {
"length": {
"exact": null,
"max": 16,
"min": null
},
"regexp": {
"pattern": "^INV[0-9]+$"
},
"required": false
}
}
A string datapoint does not have any special attributes.
Date type
Example date type datapoint:
{
"id": "item_delivered",
"type": "date",
"label": "Item Delivered",
"format": "MM/DD/YYYY",
"category": "datapoint"
}
Attributes specific to Date datapoint:
| Attribute | Type | Description | Required |
|---|---|---|---|
| format | string | Enforces a format for date datapoint on the UI. See Date format below for more details. Default is YYYY-MM-DD. | |
Date format supported: available tokens
Example date formats:
- D/M/YYYY: e.g. 23/1/2019
- MM/DD/YYYY: e.g. 01/23/2019
- YYYY-MM-DD: e.g. 2019-01-23 (ISO date format)
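To illustrate how the example formats above map a date to text, here is a small sketch. It handles only the tokens that appear in these examples (YYYY, MM, DD, M, D); the function `render_date` is hypothetical and the full list of supported tokens is not covered.

```python
import re
from datetime import date


def render_date(value, fmt):
    """Render a date using only the YYYY, MM, DD, M and D tokens (hypothetical helper)."""
    tokens = {
        "YYYY": f"{value.year:04d}",
        "MM": f"{value.month:02d}",
        "DD": f"{value.day:02d}",
        "M": str(value.month),
        "D": str(value.day),
    }
    # Longer tokens are listed first so that MM is not read as two M tokens.
    return re.sub(r"YYYY|MM|DD|M|D", lambda m: tokens[m.group(0)], fmt)


sample = date(2019, 1, 23)
rendered = [render_date(sample, f) for f in ("D/M/YYYY", "MM/DD/YYYY", "YYYY-MM-DD")]
```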
Number type
Example number type datapoint:
{
"id": "item_quantity",
"type": "number",
"label": "Quantity",
"format": "#,##0.#",
"category": "datapoint"
}
Attributes specific to Number datapoint:
| Attribute | Type | Default | Description | Required |
|---|---|---|---|---|
| format | string | # ##0.# | Available number formats are listed in the table below. A null value is allowed. | |
| aggregations | object | | A map of various aggregations for the field. See Aggregations. | |
The following table shows numeric formats with their examples.
| Format | Example |
|---|---|
| # ##0,# | 1 234,5 or 1234,5 |
| # ##0.# | 1 234.5 or 1234.5 |
| #,##0.# | 1,234.5 or 1234.5 |
| #'##0.# | 1'234.5 or 1234.5 |
| #.##0,# | 1.234,5 or 1234,5 |
| # ##0 | 1 234 or 1234 |
| #,##0 | 1,234 or 1234 |
| #'##0 | 1'234 or 1234 |
| #.##0 | 1.234 or 1234 |
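One way to read the format strings in the table is that the character after the first `#` is the thousands separator and the character after `0`, if any, is the decimal separator. The sketch below parses text under that assumption; `parse_number` is a hypothetical helper and the interpretation is inferred from the examples above, not taken from the platform's implementation.

```python
from decimal import Decimal


def parse_number(text, fmt="# ##0.#"):
    """Parse text written in one of the listed formats (hypothetical helper)."""
    thousands_sep = fmt[1]  # character right after the first '#'
    decimal_sep = fmt[5] if len(fmt) > 5 else None  # character right after '0', if present
    cleaned = text.replace(thousands_sep, "")
    if decimal_sep is not None:
        cleaned = cleaned.replace(decimal_sep, ".")
    return Decimal(cleaned)


parsed = parse_number("1 234,5", "# ##0,#")
```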
Aggregations
Example number type datapoint with sum aggregation:
{
"id": "quantity",
"type": "number",
"label": "Quantity",
"category": "datapoint",
"aggregations": {
"sum": {
"label": "Total"
}
},
"default_value": null,
"rir_field_names": []
}
Aggregations allow computation of some informative values, e.g. a sum of a table column with numeric values.
These are returned among messages when the validate endpoint is called.
Aggregations can be computed only for tables (multivalues of tuples).
| Attribute | Type | Description | Required |
|---|---|---|---|
| sum | object | Sum of values in a column. Default label: "Sum". | |
All aggregation objects can have an attribute label that will be shown in the UI.
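Conceptually, the sum aggregation adds up the numeric values of one table column. The sketch below shows that idea only; `column_sum` and the row representation (a list of dictionaries keyed by column id, with empty cells skipped) are assumptions made for illustration and do not mirror the message format returned by the validate endpoint.

```python
from decimal import Decimal


def column_sum(rows, column_id):
    """Sum the non-empty values of one column (hypothetical helper)."""
    total = Decimal("0")
    for row in rows:
        value = row.get(column_id)
        if value not in (None, ""):
            total += Decimal(value)
    return total


# Hypothetical table rows; the last row has an empty quantity cell.
rows = [{"quantity": "2"}, {"quantity": "3.5"}, {"quantity": ""}]
total_quantity = column_sum(rows, "quantity")
```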
Enum type
Example enum type datapoint with options:
{
"id": "document_type",
"type": "enum",
"label": "Document type",
"hidden": false,
"category": "datapoint",
"options": [
{
"label": "Invoice Received",
"value": "21"
},
{
"label": "Invoice Sent",
"value": "22"
},
{
"label": "Receipt",
"value": "23"
}
],
"default_value": "21",
"rir_field_names": [],
"enum_value_type": "number"
}
Attributes specific to Enum datapoint:
| Attribute | Type | Description | Required |
|---|---|---|---|
| options | object | See object description below. | yes |
| enum_value_type | string | Data type of the option's value attribute. Must be one of the following: string, number, date | no |
Every option consists of an object with keys:
| Attribute | Type | Description | Required |
|---|---|---|---|
| value | string | Value of the option. | yes |
| label | string | User-friendly label for the option, shown in the UI. | yes |
Enum datapoint value is matched in a case insensitive mode, e.g. EUR currency value returned by the AI Core Engine is
matched successfully against {"value": "eur", "label": "Euro"} option.
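The case-insensitive matching described above can be sketched as follows. The helper `match_option` is hypothetical; it only illustrates the comparison and assumes option values are strings, as in the option table.

```python
def match_option(options, predicted_value):
    """Return the first option whose value equals predicted_value, ignoring case."""
    wanted = predicted_value.casefold()
    for option in options:
        if option["value"].casefold() == wanted:
            return option
    return None


options = [{"value": "eur", "label": "Euro"}, {"value": "usd", "label": "US Dollar"}]
matched = match_option(options, "EUR")
```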
Button type
Specifies a button shown in Rossum UI. For more details please refer to custom UI extension.
Example button type datapoint:
{
"id": "show_email",
"type": "button",
"category": "datapoint",
"popup_url": "http://example.com/show_customer_data",
"can_obtain_token": true
}
Buttons cannot be direct children of multivalues: simple multivalues with buttons are not allowed, and in tables, buttons are children of tuples. Despite being a datapoint object, a button currently cannot hold any value. Therefore, the set of available Button datapoint attributes is limited to:
| Attribute | Type | Description | Required |
|---|---|---|---|
| type | string | Data type of the object, must be one of the following: string, number, date, enum, button | yes |
| can_export | boolean | If set to false, datapoint is not exported through export endpoint. Default is true. | |
| can_collapse | boolean | If set to true, tabular (multivalue-tuple) datapoint may be collapsed in the UI. Default is false. | |
| popup_url | string | URL of a popup window to be opened when button is pressed. | |
| can_obtain_token | boolean | If set to true, the popup window is allowed to ask the main Rossum window for an authorization token. | |
Value constraints
Example datapoint with value constraints:
{
"id": "document_id",
"type": "string",
"label": "Invoice ID",
"category": "datapoint",
"constraints": {
"length": {
"max": 32,
"min": 5
},
"required": false
},
"default_value": null,
"rir_field_names": [
"document_id"
]
}
Constraints limit allowed values. When a constraint is not satisfied, the annotation is considered invalid and cannot be exported.
| Attribute | Type | Description | Required |
|---|---|---|---|
| length | object | Defines minimum, maximum or exact length for the datapoint value. By default, minimum and maximum are 0 and infinity, respectively. Supported attributes: min, max and exact | |
| regexp | object | When specified, content must match a regular expression. Supported attributes: pattern. To ensure that entire value matches, surround your regular expression with ^ and $. | |
| required | boolean | Specifies if the datapoint is required by the schema. Default value is true. | |
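The constraint semantics from the table can be approximated with a short checker. This is a sketch under stated assumptions: `check_constraints` is a hypothetical helper, an empty value is assumed to violate only `required`, and the regular expression is applied with search semantics (which is why the table advises anchoring with ^ and $).

```python
import re


def check_constraints(value, constraints):
    """Return the names of violated constraints for a string value (hypothetical helper)."""
    errors = []
    if value is None or value == "":
        # required defaults to true when not specified
        if constraints.get("required", True):
            errors.append("required")
        return errors
    length = constraints.get("length") or {}
    if length.get("exact") is not None and len(value) != length["exact"]:
        errors.append("length.exact")
    if length.get("min") is not None and len(value) < length["min"]:
        errors.append("length.min")
    if length.get("max") is not None and len(value) > length["max"]:
        errors.append("length.max")
    pattern = (constraints.get("regexp") or {}).get("pattern")
    if pattern and re.search(pattern, value) is None:
        errors.append("regexp")
    return errors


constraints = {
    "length": {"exact": None, "max": 16, "min": None},
    "regexp": {"pattern": "^INV[0-9]+$"},
    "required": False,
}
ok = check_constraints("INV123", constraints)
```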
UI configuration
Example datapoint with UI configuration:
{
"id": "document_id",
"type": "string",
"label": "Invoice ID",
"category": "datapoint",
"ui_configuration": {
"type": "captured",
"edit": "disabled"
},
"default_value": null,
"rir_field_names": [
"document_id"
]
}
UI configuration provides a group of settings, which alter behaviour of the field in the application. This does not affect behaviour of the field via the API.
For example, disabling edit prohibits changing a value of the datapoint in the application, but the value can still be modified through API.
| Attribute | Type | Description | Required |
|---|---|---|---|
| type | string | Logical type of the datapoint. Possible values are: captured, data, manual, formula, reasoning or null. Default value is null. | no |
| edit | string | When set to disabled, the value of the datapoint is not editable via the UI. When set to enabled_without_warning, no warnings are displayed in the UI regarding this field's editing behaviour. Default value is enabled, which allows editing the field, but the user receives dismissible warnings when doing so. | no |
Logical types
- Captured field represents information retrieved by the OCR model. If combined with the edit option disabled, the user can't overwrite the field's value, but is able to redraw the field's bounding box and select another value from the document by such an action.
- Data field represents information filled by extensions. This field is not mapped to the AI model, so it does not have a bounding box, nor is it possible to create one. If combined with the edit option disabled, the field can't be modified from the UI.
- Manual field behaves exactly like Data field, without the mapping to extensions. This field should be used for sharing information between users or to transfer information to downstream systems.
- Formula field is updated according to its formula definition, see Formula Fields. If the edit option is not disabled, the field value can be overridden from the UI (see no_recalculation).
- Reasoning field is updated according to its prompt and context. context supports adding related schema fields in a format of TxScript strings (e.g. field.invoice_id; self.attr.label and self.attr.description are also supported). If the edit option is not disabled, the field value can be overridden from the UI (see no_recalculation).
- null value is displayed in the UI as Unset and behaves similarly to the Captured field.
Multivalue
Example simple multivalue:
{
"category": "multivalue",
"id": "po_numbers",
"label": "PO numbers",
"children": {
...
},
"show_grid_by_default": false,
"min_occurrences": null,
"max_occurrences": null,
"rir_field_names": null
}
Example multivalue with grid configuration:
{
"category": "multivalue",
"id": "line_item",
"label": "Line Item",
"children": {
...
},
"grid": {
"row_types": [
"header", "data", "footer"
],
"default_row_type": "data",
"row_types_to_extract": [
"data"
]
},
"min_occurrences": null,
"max_occurrences": null,
"rir_field_names": ["line_items"]
}
A multivalue is a list of datapoints or tuples of the same type.
It represents a container for data with multiple occurrences
(such as line items) and can contain only objects with the same id.
| Attribute | Type | Description | Required |
|---|---|---|---|
| children | object | Object specifying type of children. It can contain only objects with categories tuple or datapoint. | yes |
| min_occurrences | integer | Minimum number of occurrences of nested objects. If condition of min_occurrences is violated corresponding fields should be manually reviewed. Minimum required value for the field is 0. If not specified, it is set to 0 by default. | |
| max_occurrences | integer | Maximum number of occurrences of nested objects. All additional rows above max_occurrences are removed by extraction process. Minimum required value for the field is 1. If not specified, it is set to 1000 by default. | |
| grid | object | Configure magic-grid feature properties, see below. | |
| show_grid_by_default | boolean | If set to true, the magic-grid is opened instead of footer upon entering the multivalue. Default false. Applied only in UI. Useful when annotating documents for custom training. | |
| rir_field_names | list[string] | List of names used to initialize content from the AI engine predictions. If specified, the value of the first field from the array is used, otherwise default name line_items is used. Attribute can be set only for multivalue containing objects with category tuple. | no |
Multivalue grid object
The multivalue grid object allows specifying a row type for each row of the
grid. For the data representation of actual grid data rows, see the Grid object description.
| Attribute | Type | Description | Default | Required |
|---|---|---|---|---|
| row_types | list[string] | List of allowed row type values. | ["data"] | yes |
| default_row_type | string | Row type to be used by default | data | yes |
| row_types_to_extract | list[string] | Types of rows to be extracted to related table | ["data"] | yes |
For example, to distinguish two header types and a footer in the validation interface, the following row types may be used: header,
subsection_header, data and footer.
Currently, data extraction classifies every row as either data or header (additional row types may be introduced
in the future). We remove rows returned by data extraction that are not in row_types list (e.g. header by
default) and are on the top/bottom of the table. When they are in the middle of the table, we mark them as skipped
(null).
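The trimming rule in the previous paragraph can be sketched as below. This is an interpretation for illustration only: `apply_row_types` is a hypothetical helper that works on a plain list of predicted row type names, dropping disallowed rows at the top and bottom and marking disallowed rows in the middle as skipped (None).

```python
def apply_row_types(predicted_types, row_types):
    """Trim disallowed rows at the edges and mark the remaining ones as skipped."""
    rows = [t if t in row_types else None for t in predicted_types]
    start, end = 0, len(rows)
    while start < end and rows[start] is None:
        start += 1  # drop disallowed rows at the top
    while end > start and rows[end - 1] is None:
        end -= 1  # drop disallowed rows at the bottom
    return rows[start:end]


predicted = ["header", "data", "header", "data", "header"]
kept = apply_row_types(predicted, ["data"])
```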
There are three visual modes, based on row_types quantity:
- More than two row types defined: User selects row types freely to any non-default row type. Clearing row type resets to a default row type. We support up to 6 colors to easily distinguish row types visually.
- Two row types defined (header and default): User only marks header and skipped rows. Clearing row type resets to a default row type.
- One row type defined: User is only able to mark row as skipped (null value in data). This is also a default behavior when no grid row types configuration is specified in the schema.
Only rows marked as one of row_types_to_extract values are transferred to a table by pressing "Read data from table" button in the Rossum UI (calling grid-to-table conversion API endpoint).
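The effect of row_types_to_extract can be pictured as a filter over grid rows. In this sketch the function `rows_to_extract` and the row representation (dictionaries with a `type` key, None for skipped rows) are assumptions for illustration; the actual conversion is done by the grid-to-table endpoint.

```python
def rows_to_extract(rows, grid=None):
    """Keep only rows whose type is listed in row_types_to_extract (default: ["data"])."""
    wanted = set((grid or {}).get("row_types_to_extract", ["data"]))
    return [row for row in rows if row["type"] in wanted]


grid = {
    "row_types": ["header", "data", "footer"],
    "default_row_type": "data",
    "row_types_to_extract": ["data"],
}
# Hypothetical grid rows; the row with type None was marked as skipped.
grid_rows = [
    {"type": "header", "text": "Description Qty"},
    {"type": "data", "text": "Widget 2"},
    {"type": None, "text": "continued on next page"},
    {"type": "footer", "text": "Total 2"},
]
extracted = rows_to_extract(grid_rows, grid)
```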
Tuple
Example tuple object:
{
"category": "tuple",
"id": "tax_details",
"label": "Tax Details",
"children": [
...
],
"rir_field_names": [
"tax_details"
]
}
Container representing tabular data with related values, such as tax details.
A tuple must be nested within a multivalue object, but unlike multivalue,
it may consist of objects with different ids.
| Attribute | Type | Description | Required |
|---|---|---|---|
| children | list[object] | Array specifying objects that belong to a given tuple. It can contain only objects with category datapoint. | yes |
| rir_field_names | list[string] | List of names used to initialize content from the AI engine predictions. If specified, the value of the first extracted field from the array is used, otherwise, no AI engine initialization is done for the object. | |
Updating Schema
As a project evolves, it is common practice to enhance or change the extracted field set. This is done by updating the schema object.
By design, Rossum supports multiple schema versions at the same time. However, each document annotation is related to only one of those schemas. If the schema is updated, all related document annotations are updated accordingly. See preserving data on schema change below for limitations of schema updates.
In addition, every queue is linked to a schema, which is used for all newly imported documents.
When updating a schema, there are two possible approaches:
- Update the schema object (PUT/PATCH). All related annotations will be updated to match current schema shape (even exported and deleted documents).
- Create a new schema object (POST) and link it to the queue. In such case, only newly created objects will use the current schema. All previously created objects will remain in the shape of their linked schema.
Formerly, we recommended always creating a new schema object when changing the set of extracted fields. This is no longer necessary, since the current schema object can be updated instead (PUT/PATCH). See the use cases below if you are not sure which approach is appropriate.
Use case 1 - Initial setting of a schema
- Situation: User is initially setting up the schema. This might be an iterative process.
- Recommendation: Edit the existing schema by updating schema (PUT) or updating part of a schema (PATCH).
Use case 2 - Updating attributes of a field (label, constraints, options, etc.)
- Situation: User is updating field, e.g. changing label, number format, constraints, enum options, hidden flag, etc.
- Recommendation: Edit existing schema (see Use case 1).
Use case 3 - Adding new field to a schema, even for already imported documents.
- Situation: User is extending a production schema by adding a new field. Moreover, user would like to see all annotations (to_review, postponed, exported, deleted, etc. states) in the look of the newly extended schema.
- Recommendation: Edit existing schema (see Use case 1). Data of already created annotations will be transformed to the shape of the updated schema. New fields of annotations in to_review and postponed state that are linked to extracted fields types will be filled by AI Engine, if available. New fields for already exported or deleted annotations (also purged, exporting and failed_export) will be filled with empty or default values.
Use case 4 - Adding new field to schema, only for newly imported documents
- Situation: User is extending a production schema by adding a new field. However, the user does not want to see the newly added field on previously created annotations.
- Recommendation: Create a new schema object (POST) and link it to the queue. Annotation data of previously created annotations will be preserved according to the original schema. This approach is recommended if there is an organizational need to keep different field sets before and after the schema update.
Use case 5 - Deleting schema field, even for already imported documents.
- Situation: User is changing a production schema by removing a field that was used previously. However, user would like to see all annotations (to_review, postponed, exported, deleted, etc. states) in the look of the newly updated schema. There is no need to see the original fields in already exported annotations.
- Recommendation: Edit existing schema (see Use case 1).
Use case 6 - Deleting schema field, only for newly imported documents
- Situation: User is changing a production schema by removing a field that was used previously. However, the user still wants to see the removed fields on previously created annotations.
- Recommendation: Create a new schema object (see Use case 4). Annotation data of previously created annotations will be preserved according to the original schema. This approach is recommended if there is an organizational need to retrieve the data in the original state.
When copying an annotation or moving it to a new queue by patching its queue attribute, the annotation in the new queue will still be associated with the old schema.
Preserving data on schema change
In order to transfer annotation field values properly during the schema update,
a datapoint's category and schema_id must be preserved.
Supported operations that preserve field values are:
- adding a new datapoint (filled from AI Engine, if available)
- reordering datapoints on the same level
- moving datapoints from section to another section
- moving datapoints to and from a tuple
- reordering datapoints inside a tuple
- making datapoint a multivalue (original datapoint is the only value in a new multivalue container)
- making datapoint non-multivalue (only first datapoint value is preserved)
Extracted field types
AI engine currently automatically extracts the following fields at the all endpoint, subject to ongoing expansion.
Identifiers
| Attr. rir_field_names | Field label | Description |
|---|---|---|
| account_num | Bank Account | Bank account number. Whitespaces are stripped. |
| bank_num | Sort Code | Sort code. Numerical code of the bank. |
| iban | IBAN | Bank account number in IBAN format. |
| bic | BIC/SWIFT | Bank BIC or SWIFT code. |
| const_sym | Constant Symbol | Statistical code on payment order. |
| spec_sym | Specific Symbol | Payee ID on the payment order, or similar. |
| var_sym | Variable symbol | In some countries used by the supplier to match the payment received against the invoice. Possible non-numeric characters are stripped. |
| terms | Terms | Payment terms as written on the document (e.g. "45 days", "upon receipt"). |
| payment_method | Payment method | Payment method defined on a document (e.g. 'Cheque', 'Pay order', 'Before delivery') |
| customer_id | Customer Number | The number by which the customer is registered in the system of the supplier. Whitespaces are stripped. |
| date_due | Date Due | The due date of the invoice. |
| date_issue | Issue Date | Date of issue of the document. |
| date_uzp | Tax Point Date | The date of taxable event. |
| document_id | Document Identifier | Document number. Whitespaces are stripped. |
| order_id | Order Number | Purchase order identification (Order Numbers not captured as "sender_order_id"). Whitespaces are stripped. |
| recipient_address | Recipient Address | Address of the customer. |
| recipient_dic | Recipient Tax Number | Tax identification number of the customer. Whitespaces are stripped. |
| recipient_ic | Recipient Company ID | Company identification number of the customer. Possible non-numeric characters are stripped. |
| recipient_name | Recipient Name | Name of the customer. |
| recipient_vat_id | Recipient VAT Number | Customer VAT Number |
| recipient_delivery_name | Recipient Delivery Name | Name of the recipient to whom the goods will be delivered. |
| recipient_delivery_address | Recipient Delivery Address | Address of the recipient where the goods will be delivered. |
| sender_address | Supplier Address | Address of the supplier. |
| sender_dic | Supplier Tax Number | Tax identification number of the supplier. Whitespaces are stripped. |
| sender_ic | Supplier Company ID | Business/organization identification number of the supplier. Possible non-numeric characters are stripped. |
| sender_name | Supplier Name | Name of the supplier. |
| sender_vat_id | Supplier VAT Number | VAT identification number of the supplier. |
| sender_email | Supplier Email | Email of the sender. |
| sender_order_id | Supplier's Order ID | Internal order ID in the suppliers system. |
| delivery_note_id | Delivery Note ID | Delivery note ID defined on the invoice. |
| supply_place | Place of Supply | Place of supply (the name of the city or state where the goods will be supplied). |
Starting from July 2020, the field invoice_id was renamed to document_id. However, the
invoice_id name will still be supported for backwards compatibility. Going forward, we recommend switching to
document_id in your extraction schemas.
Document attributes
| Attr. rir_field_names | Field label | Description |
|---|---|---|
| currency | Currency | The currency which the invoice is to be paid in. Possible values: AED, ARS, AUD, BGN, BRL, CAD, CHF, CLP, CNY, COP, CRC, CZK, DKK, EUR, GBP, GTQ, HKD, HUF, IDR, ILS, INR, ISK, JMD, JPY, KRW, MXN, MYR, NOK, NZD, PEN, PHP, PLN, RON, RSD, SAR, SEK, SGD, THB, TRY, TWD, UAH, USD, VES, VND, ZAR or other. May be also in lowercase. |
| document_type | Document Type | Possible values: credit_note, debit_note, tax_invoice (most typical), proforma, receipt, delivery_note, order or other. |
| language | Language | The language which the document was written in. Values are ISO 639-3 language codes, e.g.: eng, fra, deu, zho. See Languages Supported By Rossum |
| payment_method_type | Payment Method Type | Payment method used for the transaction. Possible values: card, cash. |
Starting from May 2020, the invoice_type document attribute was renamed to document_type. However, the
invoice_type name will still be supported for backwards compatibility. Going forward, we recommend switching to
document_type in your extraction schemas.
Amounts
| Attr. rir_field_names | Field label | Description |
|---|---|---|
| amount_due | Amount Due | Final amount including tax to be paid after deducting all discounts and advances. |
| amount_rounding | Amount Rounding | Remainder after rounding amount_total. |
| amount_total | Total Amount | Subtotal over all items, including tax. |
| amount_paid | Amount paid | Amount paid already. |
| amount_total_base | Tax Base Total | Base amount for tax calculation. |
| amount_total_tax | Tax Total | Total tax amount. |
Typical relations (may depend on local laws):
amount_total = amount_total_base + amount_total_tax
amount_rounding = amount_total - round(amount_total)
amount_due = amount_total - amount_paid + amount_rounding
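As a sanity check on extracted data, the typical relations can be verified programmatically. The sketch below is only an illustration: `check_amounts` and the tolerance are hypothetical, the sample values are invented, and the rounding relation is left out because the rounding precision is not specified here.

```python
from decimal import Decimal


def check_amounts(amounts, tolerance=Decimal("0.01")):
    """Report which of the typical relations hold within a tolerance (hypothetical helper)."""
    total_expected = amounts["amount_total_base"] + amounts["amount_total_tax"]
    due_expected = (
        amounts["amount_total"] - amounts["amount_paid"] + amounts["amount_rounding"]
    )
    return {
        "amount_total": abs(amounts["amount_total"] - total_expected) <= tolerance,
        "amount_due": abs(amounts["amount_due"] - due_expected) <= tolerance,
    }


# Invented sample values for illustration.
sample_amounts = {
    "amount_total_base": Decimal("100.00"),
    "amount_total_tax": Decimal("21.00"),
    "amount_total": Decimal("121.00"),
    "amount_paid": Decimal("21.00"),
    "amount_rounding": Decimal("0.00"),
    "amount_due": Decimal("100.00"),
}
relations = check_amounts(sample_amounts)
```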
All amounts are in the main currency of the invoice (as identified in the currency response field). Amounts in other currencies are generally excluded.
Tables
At the moment, the AI engine automatically extracts 2 types of tables.
In order to pick one of the possible choices, set rir_field_names attribute on multivalue.
| Attr. rir_field_names | Table |
|---|---|
| tax_details | Tax details |
| line_items | Line items |
For backwards compatibility, the rir_field_names on multivalue are by default set to line_items.
However, if the rir_field_names of any column in the schema contain a string starting with tax_detail_, then the table is assumed to be tax_details.
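The table selection rules above can be summarized in a few lines. This is a hedged sketch, not the platform's implementation: `table_kind` is a hypothetical helper that assumes a multivalue object whose `children` is a tuple with a list of column datapoints.

```python
def table_kind(multivalue):
    """Guess which extracted table a multivalue maps to (hypothetical helper)."""
    names = multivalue.get("rir_field_names")
    if names:
        return names[0]  # an explicit setting on the multivalue wins
    columns = multivalue["children"].get("children", [])
    for column in columns:
        for name in column.get("rir_field_names") or []:
            if name.startswith("tax_detail_"):
                return "tax_details"
    return "line_items"  # backwards-compatible default


# Hypothetical multivalue without rir_field_names set on the multivalue itself.
vat_table = {
    "category": "multivalue",
    "id": "vat_details",
    "label": "VAT details",
    "rir_field_names": None,
    "children": {
        "category": "tuple",
        "id": "vat_detail",
        "label": "VAT detail",
        "children": [
            {"category": "datapoint", "id": "vat_rate", "label": "Rate",
             "type": "number", "rir_field_names": ["tax_detail_rate"]},
        ],
    },
}
kind = table_kind(vat_table)
```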
Tax details
Tax details table and breakdown by tax rates.
| Attr. rir_field_names | Field label | Description |
|---|---|---|
| tax_detail_base | Tax Base | Sum of tax bases for items with the same tax rate. |
| tax_detail_rate | Tax Rate | One of the tax rates in the tax breakdown. |
| tax_detail_tax | Tax Amount | Sum of taxes for items with the same tax rate. |
| tax_detail_total | Tax Total | Total amount including tax for all items with the same tax rate. |
| tax_detail_code | Tax Code | Text on document describing tax code of the tax rate (e.g. 'GST', 'CGST', 'DPH', 'TVA'). If multiple tax rates belong to one tax code on the document, the tax code will be assigned only to the first tax rate. (in future such tax code will be distributed to all matching tax rates.) |
Line items
AI engine currently automatically extracts line item table content and recognizes row and column types as detailed below. Invoice line items come in a wide variety of different shapes and forms. The current implementation can deal with (or learn) most layouts, with borders or not, different spacings, header rows, etc. We currently make two further assumptions:
- The table generally follows a grid structure - that is, columns and rows may be represented as rectangle spans. In practice, this means that we may currently cut off text that overlaps from one cell to the next column. We are also not optimizing for table rows that are wrapped to multiple physical lines.
- The table contains just a flat structure of line items, without subsection headers, nested tables, etc.
We plan to gradually remove both assumptions in the future.
| Attribute rir_field_names | Field label | Description |
|---|---|---|
| table_column_code | Item Code/ID | Can be the SKU, EAN, a custom code (string of letters/numbers) or even just the line number. |
| table_column_description | Item Description | Line item description. Can be multi-line with details. |
| table_column_quantity | Item Quantity | Quantity of the item. |
| table_column_uom | Item Unit of Measure | Unit of measure of the item (kg, container, piece, gallon, ...). |
| table_column_rate | Item Rate | Tax rate for the line item. |
| table_column_tax | Item Tax | Tax amount for the line. Rule of thumb: tax = rate * amount_base. |
| table_column_amount_base | Amount Base | Unit price without tax. (This is the primary unit price extracted.) |
| table_column_amount | Amount | Unit price with tax. Rule of thumb: amount = amount_base + tax. |
| table_column_amount_total_base | Amount Total Base | The total amount to be paid for all the items excluding the tax. Rule of thumb: amount_total_base = amount_base * quantity. |
| table_column_amount_total | Amount Total | The total amount to be paid for all the items including the tax. Rule of thumb: amount_total = amount * quantity. |
| table_column_other | Other | Unrecognized data type. |
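The rules of thumb in the table can be combined to derive the remaining line item values from the base unit price. The sketch below is illustrative only: `derive_line_item` is a hypothetical helper, and it assumes the tax rate is expressed in percent (e.g. 20 for 20 %), which this section does not state.

```python
from decimal import Decimal


def derive_line_item(amount_base, rate_percent, quantity):
    """Apply the rules of thumb, assuming the rate is given in percent."""
    tax = amount_base * rate_percent / Decimal("100")  # tax = rate * amount_base
    amount = amount_base + tax                         # amount = amount_base + tax
    return {
        "tax": tax,
        "amount": amount,
        "amount_total_base": amount_base * quantity,
        "amount_total": amount * quantity,
    }


item = derive_line_item(Decimal("10"), Decimal("20"), Decimal("3"))
```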