Queue Schema

Every queue has an associated schema that specifies which fields will be extracted from documents, as well as the structure of the data sent to the connector and exported from the platform.

See the introduction to Rossum customization for a high-level overview of configuring the captured fields and managing schemas.

The best visual guide to the schema JSON, and the quickest way to start tweaking it in the Rossum app, is our tutorial on editing the extraction schema. It is especially useful when maintaining long dropdown select boxes.

Rossum schema supports data fields with single values (datapoint), fields with multiple values (multivalue) or tuples of fields (tuple). At the topmost level, each schema consists of sections, which may either directly contain actual data fields (datapoints) or use nested multivalues and tuples as containers for single datapoints.

Although a schema may theoretically consist of an arbitrary number of nested containers, the Rossum UI supports only certain combinations of datapoint types. The supported shapes are:

simple: atomic datapoints of type number, string, date or enum
list: simple datapoint within a multivalue
tabular: simple datapoint within a "multivalue tuple" (a multivalue list containing a tuple for every row)
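
The three shapes can be told apart by walking the schema content. The following is a minimal Python sketch, not part of the Rossum API: the function name `classify_shapes` and the example ids are hypothetical, and it assumes the schema content is available as a list of section dictionaries as described in this document.

```python
from typing import Dict, List

# Datapoint types that count as "simple" according to the list above.
SIMPLE_TYPES = {"number", "string", "date", "enum"}


def classify_shapes(content: List[dict]) -> Dict[str, str]:
    """Map each simple datapoint id to 'simple', 'list' or 'tabular'."""
    shapes: Dict[str, str] = {}
    for section in content:
        for child in section.get("children", []):
            if child["category"] == "datapoint":
                candidates = [(child, "simple")]
            elif child["category"] == "multivalue":
                inner = child["children"]  # a single object, not a list
                if inner["category"] == "tuple":
                    candidates = [(cell, "tabular") for cell in inner["children"]]
                else:
                    candidates = [(inner, "list")]
            else:
                candidates = []
            for datapoint, shape in candidates:
                if datapoint.get("type") in SIMPLE_TYPES:
                    shapes[datapoint["id"]] = shape
    return shapes


# Hypothetical schema content with one field of each shape.
example_content = [
    {
        "category": "section",
        "id": "basic_info",
        "label": "Basic info",
        "children": [
            {"category": "datapoint", "id": "document_id",
             "label": "Invoice ID", "type": "string"},
            {
                "category": "multivalue",
                "id": "po_numbers",
                "label": "PO numbers",
                "children": {"category": "datapoint", "id": "po_number",
                             "label": "PO number", "type": "string"},
            },
            {
                "category": "multivalue",
                "id": "line_items",
                "label": "Line items",
                "children": {
                    "category": "tuple",
                    "id": "line_item",
                    "label": "Line item",
                    "children": [
                        {"category": "datapoint", "id": "item_quantity",
                         "label": "Quantity", "type": "number"},
                    ],
                },
            },
        ],
    }
]

print(classify_shapes(example_content))
```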

Schema content

Schema content consists of a list of section objects.

Common attributes

The following attributes are common for all schema objects:

Attribute	Type	Description	Required
category	string	Category of an object, one of `section`, `multivalue`, `tuple` or `datapoint`.	yes
id	string	Unique identifier of an object. Maximum length is 50 characters.	yes
label	string	User-friendly label for an object, shown in the user interface	yes
hidden	boolean	If set to `true`, the object is not visible in the user interface, but remains stored in the database and may be exported. Default is false. Note that `section` is hidden if all its children are hidden.	no
disable_prediction	boolean	Can be set to `true` to disable field extraction, while still preserving the data shape. Ignored by aurora engines.	no

Section

Example section object:

{
  "category": "section",
  "id": "amounts_section",
  "label": "Amounts",
  "children": [...],
  "icon": ""
}

Section represents a logical part of the document, such as amounts or vendor info. It is allowed only at the top level. Schema allows multiple sections, and there should be at least one section in the schema.

Attribute	Type	Description	Required
children	list[object]	Specifies objects grouped under a given section. It can contain `multivalue` or `datapoint` objects.	yes
icon	string	The icon that appears on the left panel in the UI for a given section (not yet supported on UI).

Datapoint

A datapoint represents a single value, typically a field of a document or some global document information. Fields common to all datapoint types:

Attribute	Type	Description	Required
type	string	Data type of the object, must be one of the following: `string`, `number`, `date`, `enum`, `button`	yes
can_export	boolean	If set to `false`, datapoint is not exported through export endpoint. Default is true.
can_collapse	boolean	If set to `true`, tabular (multivalue-tuple) datapoint may be collapsed in the UI. Default is false.
rir_field_names	list[string]	List of references used to initialize an object value. See below for the description.
default_value	string	Default value used either for fields that do not use hints from AI engine predictions (i.e. `rir_field_names` are not specified), or when the AI engine does not return any data for the field.
constraints	object	A map of various constraints for the field. See Value constraints.
ui_configuration	object	A group of settings affecting behaviour of the field in the application. See UI configuration.
width	integer	Width of the column (in characters). Default widths are: number: 8, string: 20, date: 10, enum: 20. Only supported for table datapoints.
stretch	boolean	If total width of columns doesn't fill up the screen, datapoints with stretch set to true will be expanded proportionally to other stretching columns. Only supported for table datapoints.
width_chars	integer	(Deprecated) Use `width` and `stretch` properties instead.
score_threshold	float [0;1]	Threshold used to automatically validate field content based on AI confidence scores. If not set, `queue.default_score_threshold` is used.
formula	string[0;2000]	Formula definition, required only for fields of type `formula`, see Formula Fields. `rir_field_names` should also be empty for these fields.
prompt	string[0;2000]	Prompt definition, required only for fields of type `reasoning`.
context	list[string]	Context for the prompt, required only for fields of type `reasoning`. See Logical types.

The rir_field_names attribute specifies the source of the object's initial value. List items may be:

one of extracted field types to use the AI engine prediction value
upload:id to identify a value specified while uploading the document
edit:id to identify a value specified in edit_pages endpoint
email_header:<id> to use a value extracted from email headers. Supported email headers: from, to, reply-to, subject, message-id, date.
email_body:<id> to select email body. Supported values are text_html for HTML body or text_plain for plain text body.
email:<id> to identify a value specified in email.received hook response
emails_import:<id> to identify a value specified in the values parameter when importing an email.

If multiple list items are specified in rir_field_names, the first available value is used.
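
The "first available value" rule can be pictured with a short Python sketch. It is illustrative only: `initial_value` and the `available` mapping (source name to value) are hypothetical stand-ins for whatever the platform resolves internally, and treating `None` as "not available" is an assumption.

```python
from typing import Mapping, Optional, Sequence


def initial_value(
    rir_field_names: Sequence[str],
    available: Mapping[str, Optional[str]],
    default_value: Optional[str] = None,
) -> Optional[str]:
    """Return the value of the first listed source that has one."""
    for name in rir_field_names:
        value = available.get(name)
        if value is not None:
            return value
    # No source produced a value (or none was listed): use default_value.
    return default_value


# Hypothetical sources: nothing was passed at upload, the AI engine found an order number.
sources = {"order_id": "PO-1234"}
print(initial_value(["upload:order_id", "order_id"], sources))
```

With an empty rir_field_names list, the sketch returns default_value, matching the description of default_value in the attribute table.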

String type

Example string type datapoint with constraints:

{
  "category": "datapoint",
  "id": "document_id",
  "label": "Invoice ID",
  "type": "string",
  "default_value": null,
  "rir_field_names": ["document_id"],
  "constraints": {
    "length": {
      "exact": null,
      "max": 16,
      "min": null
    },
    "regexp": {
      "pattern": "^INV[0-9]+$"
    },
    "required": false
  }
}

A string datapoint has no type-specific attributes.

Date type

Example date type datapoint:

{
  "id": "item_delivered",
  "type": "date",
  "label": "Item Delivered",
  "format": "MM/DD/YYYY",
  "category": "datapoint"
}

Attributes specific to Date datapoint:

Attribute	Type	Description	Required
format	string	Enforces a format for `date` datapoint on the UI. See Date format below for more details. Default is `YYYY-MM-DD`.

Supported date formats: see the available tokens.

Example date formats:

D/M/YYYY: e.g. 23/1/2019
MM/DD/YYYY: e.g. 01/23/2019
YYYY-MM-DD: e.g. 2019-01-23 (ISO date format)
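
To make the example formats concrete, here is a small Python sketch that renders a date using only the tokens appearing in the examples above (`D`, `DD`, `M`, `MM`, `YYYY`). The function `format_date` is hypothetical and is not how the Rossum UI is implemented; other tokens are left untouched.

```python
import re
from datetime import date


def format_date(d: date, fmt: str) -> str:
    """Render a date using the D, DD, M, MM and YYYY tokens only."""
    tokens = {
        "YYYY": f"{d.year:04d}",
        "MM": f"{d.month:02d}",  # zero-padded month
        "DD": f"{d.day:02d}",    # zero-padded day
        "M": str(d.month),
        "D": str(d.day),
    }
    # Longer tokens come first in the alternation so "MM" is not read as two "M".
    return re.sub(r"YYYY|MM|DD|M|D", lambda m: tokens[m.group(0)], fmt)


for fmt in ("D/M/YYYY", "MM/DD/YYYY", "YYYY-MM-DD"):
    print(fmt, "->", format_date(date(2019, 1, 23), fmt))
```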

Number type

Example number type datapoint:

{
  "id": "item_quantity",
  "type": "number",
  "label": "Quantity",
  "format": "#,##0.#",
  "category": "datapoint"
}

Attributes specific to Number datapoint:

Attribute	Type	Default	Description	Required
format	string	`# ##0.#`	Available number formats are listed in the table below. `null` value is allowed.
aggregations	object		A map of various aggregations for the field. See aggregations.

The following table shows numeric formats with their examples.

Format	Example
`# ##0,#`	1 234,5 or 1234,5
`# ##0.#`	1 234.5 or 1234.5
`#,##0.#`	1,234.5 or 1234.5
`#'##0.#`	1'234.5 or 1234.5
`#.##0,#`	1.234,5 or 1234,5
`# ##0`	1 234 or 1234
`#,##0`	1,234 or 1234
`#'##0`	1'234 or 1234
`#.##0`	1.234 or 1234
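
The format strings in the table all follow one pattern: the second character is the thousands separator and, when present, the sixth character is the decimal separator. The Python sketch below relies on that observation. `format_number` is a hypothetical helper, and rounding to a whole number for the integer-only formats is an assumption, not documented behaviour.

```python
from decimal import Decimal


def format_number(value: Decimal, fmt: str) -> str:
    """Render value using one of the format strings from the table above."""
    thousands = fmt[1]                               # e.g. " ", ",", "'" or "."
    decimal_sep = fmt[5] if len(fmt) > 5 else None   # None for integer-only formats
    if decimal_sep is None:
        # Assumption: integer-only formats round to a whole number.
        value = value.quantize(Decimal("1"))
    text = format(value, ",f")                       # e.g. "1,234.5"
    int_part, _, frac_part = text.partition(".")
    int_part = int_part.replace(",", thousands)
    if decimal_sep is None or not frac_part:
        return int_part
    return int_part + decimal_sep + frac_part


print(format_number(Decimal("1234.5"), "# ##0,#"))
print(format_number(Decimal("1234.5"), "#.##0,#"))
print(format_number(Decimal("1234"), "#'##0"))
```

The table also lists ungrouped variants (such as 1234,5) as valid examples; the sketch always emits the grouped form.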

Aggregations

Example number type datapoint with sum aggregation:

{
  "id": "quantity",
  "type": "number",
  "label": "Quantity",
  "category": "datapoint",
  "aggregations": {
    "sum": {
      "label": "Total"
    }
  },
  "default_value": null,
  "rir_field_names": []
}

Aggregations allow computation of some informative values, e.g. a sum of a table column with numeric values. These are returned among messages when the validate endpoint is called. Aggregations can be computed only for tables (multivalues of tuples).

Attribute	Type	Description	Required
sum	object	Sum of values in a column. Default `label`: "Sum".

All aggregation objects can have an attribute label that will be shown in the UI.
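
As an illustration of what a sum aggregation computes, the following Python sketch adds up one column of a table. The row representation (a list of dictionaries keyed by column id) and the name `sum_column` are assumptions made for this example; the actual values are returned by the validate endpoint as described above.

```python
from decimal import Decimal
from typing import Dict, List, Optional


def sum_column(rows: List[Dict[str, Optional[str]]], column_id: str,
               label: str = "Sum") -> dict:
    """Sum the non-empty values of one table column."""
    total = sum(
        (Decimal(row[column_id]) for row in rows
         if row.get(column_id) not in (None, "")),
        Decimal("0"),
    )
    return {"label": label, "value": total}


# Hypothetical table rows; the last one has an empty quantity and is skipped.
rows = [{"quantity": "2"}, {"quantity": "3.5"}, {"quantity": ""}]
print(sum_column(rows, "quantity", label="Total"))
```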

Enum type

Example enum type datapoint with options:

{
  "id": "document_type",
  "type": "enum",
  "label": "Document type",
  "hidden": false,
  "category": "datapoint",
  "options": [
    {
      "label": "Invoice Received",
      "value": "21"
    },
    {
      "label": "Invoice Sent",
      "value": "22"
    },
    {
      "label": "Receipt",
      "value": "23"
    }
  ],
  "default_value": "21",
  "rir_field_names": [],
  "enum_value_type": "number"
}

Attributes specific to Enum datapoint:

Attribute	Type	Description	Required
options	object	See object description below.	yes
enum_value_type	string	Data type of the option's value attribute. Must be one of the following: `string`, `number`, `date`	no

Every option consists of an object with keys:

Attribute	Type	Description	Required
value	string	Value of the option.	yes
label	string	User-friendly label for the option, shown in the UI.	yes

Enum datapoint values are matched case-insensitively, e.g. the EUR currency value returned by the AI Core Engine is matched successfully against the {"value": "eur", "label": "Euro"} option.
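
A minimal Python sketch of case-insensitive option matching follows. The helper name `match_option` is hypothetical, and using `str.casefold` for the comparison is an assumption about how case is normalized.

```python
from typing import List, Optional


def match_option(value: str, options: List[dict]) -> Optional[dict]:
    """Return the first option whose value equals `value`, ignoring case."""
    wanted = value.casefold()
    for option in options:
        if option["value"].casefold() == wanted:
            return option
    return None


options = [{"value": "eur", "label": "Euro"}, {"value": "usd", "label": "US Dollar"}]
print(match_option("EUR", options))
```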

Button type

Specifies a button shown in Rossum UI. For more details please refer to custom UI extension.

Example button type datapoint:

{
  "id": "show_email",
  "type": "button",
  "category": "datapoint",
  "popup_url": "http://example.com/show_customer_data",
  "can_obtain_token": true
}

Buttons cannot be direct children of multivalues (simple multivalues with buttons are not allowed; in tables, buttons are children of tuples). Despite being a datapoint object, a button currently cannot hold any value. Therefore, the set of available Button datapoint attributes is limited to:

Attribute	Type	Description	Required
type	string	Data type of the object, must be one of the following: `string`, `number`, `date`, `enum`, `button`	yes
can_export	boolean	If set to `false`, datapoint is not exported through export endpoint. Default is true.
can_collapse	boolean	If set to `true`, tabular (multivalue-tuple) datapoint may be collapsed in the UI. Default is false.
popup_url	string	URL of a popup window to be opened when button is pressed.
can_obtain_token	boolean	If set to `true` the popup window is allowed to ask the main Rossum window for authorization token

Value constraints

Example datapoint with value constraints:

{
  "id": "document_id",
  "type": "string",
  "label": "Invoice ID",
  "category": "datapoint",
  "constraints": {
    "length": {
      "max": 32,
      "min": 5
    },
    "required": false
  },
  "default_value": null,
  "rir_field_names": [
    "document_id"
  ]
}

Constraints limit allowed values. When a constraint is not satisfied, the annotation is considered invalid and cannot be exported.

Attribute	Type	Description
length	object	Defines minimum, maximum or exact length for the datapoint value. By default, minimum and maximum are `0` and infinity, respectively. Supported attributes: `min`, `max` and `exact`
regexp	object	When specified, content must match a regular expression. Supported attributes: `pattern`. To ensure that entire value matches, surround your regular expression with `^` and `$`.
required	boolean	Specifies if the datapoint is required by the schema. Default value is `true`.
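
The constraint semantics in the table can be sketched as a small Python check. This is an illustration under assumptions: `check_constraints` is a hypothetical helper, Python's `re` syntax may differ from the regular expression flavour used by the platform, and skipping the length and regexp checks for an empty optional value is a guess.

```python
import re
from typing import List, Optional


def check_constraints(value: Optional[str], constraints: dict) -> List[str]:
    """Return a list of violated constraints (empty when the value is valid)."""
    if value is None or value == "":
        # `required` defaults to true when it is not given.
        return ["value is required"] if constraints.get("required", True) else []
    errors: List[str] = []
    length = constraints.get("length") or {}
    if length.get("exact") is not None and len(value) != length["exact"]:
        errors.append(f"value must be exactly {length['exact']} characters")
    if length.get("min") is not None and len(value) < length["min"]:
        errors.append(f"value is shorter than {length['min']} characters")
    if length.get("max") is not None and len(value) > length["max"]:
        errors.append(f"value is longer than {length['max']} characters")
    pattern = (constraints.get("regexp") or {}).get("pattern")
    # Search semantics: the pattern needs ^ and $ to cover the entire value.
    if pattern and re.search(pattern, value) is None:
        errors.append(f"value does not match {pattern}")
    return errors


# Constraints taken from the string datapoint example in this document.
invoice_id_constraints = {
    "length": {"exact": None, "max": 16, "min": None},
    "regexp": {"pattern": "^INV[0-9]+$"},
    "required": False,
}
print(check_constraints("INV12345", invoice_id_constraints))
print(check_constraints("ABC", invoice_id_constraints))
```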

UI configuration

Example datapoint with UI configuration:

{
  "id": "document_id",
  "type": "string",
  "label": "Invoice ID",
  "category": "datapoint",
  "ui_configuration": {
    "type":  "captured",
    "edit": "disabled"
  },
  "default_value": null,
  "rir_field_names": [
    "document_id"
  ]
}

UI configuration provides a group of settings, which alter behaviour of the field in the application. This does not affect behaviour of the field via the API. For example, disabling edit prohibits changing a value of the datapoint in the application, but the value can still be modified through API.

Attribute	Type	Description	Required
type	string	Logical type of the datapoint. Possible values are: `captured`, `data`, `manual`, `formula`, `reasoning` or `null`. Default value is `null`.	false
edit	string	When set to `disabled`, the value of the datapoint is not editable via the UI. When set to `enabled_without_warning`, no warnings are displayed in the UI regarding this field's editing behaviour. Default value is `enabled`; this option enables field editing, but the user receives dismissible warnings when doing so.	false

Logical types

Captured field represents information retrieved by the OCR model. If combined with the edit option disabled, the user can't overwrite the field's value, but can still redraw the field's bounding box and thereby select another value from the document.
Data field represents information filled by extensions. This field is not mapped to the AI model, so it does not have a bounding box, nor is it possible to create one. If combined with the edit option disabled, the field can't be modified from the UI.
Manual field behaves exactly like the Data field, without the mapping to extensions. This field should be used for sharing information between users or to transfer information to downstream systems.
Formula field is updated according to its formula definition, see Formula Fields. If the edit option is not disabled, the field value can be overridden from the UI (see no_recalculation).
Reasoning field is updated according to its prompt and context. context supports adding related schema fields in the format of TxScript strings (e.g. field.invoice_id; self.attr.label and self.attr.description are also supported). If the edit option is not disabled, the field value can be overridden from the UI (see no_recalculation).
null value is displayed in the UI as Unset and behaves similarly to the Captured field.

Multivalue

Example simple multivalue:

{
  "category": "multivalue",
  "id": "po_numbers",
  "label": "PO numbers",
  "children": {
    ...
  },
  "show_grid_by_default": false,
  "min_occurrences": null,
  "max_occurrences": null,
  "rir_field_names": null
}

Example multivalue with grid configuration:

{
  "category": "multivalue",
  "id": "line_item",
  "label": "Line Item",
  "children": {
    ...
  },
  "grid": {
    "row_types": [
      "header", "data", "footer"
    ],
    "default_row_type": "data",
    "row_types_to_extract": [
      "data"
    ]
  },
  "min_occurrences": null,
  "max_occurrences": null,
  "rir_field_names": ["line_items"]
}

A multivalue is a list of datapoints or tuples of the same type. It represents a container for data with multiple occurrences (such as line items) and can contain only objects with the same id.

Attribute	Type	Description	Required
children	object	Object specifying type of children. It can contain only objects with categories `tuple` or `datapoint`.	yes
min_occurrences	integer	Minimum number of occurrences of nested objects. If the min_occurrences condition is violated, the corresponding fields should be manually reviewed. Minimum required value for the field is 0. If not specified, it is set to 0 by default.
max_occurrences	integer	Maximum number of occurrences of nested objects. All additional rows above max_occurrences are removed by extraction process. Minimum required value for the field is 1. If not specified, it is set to 1000 by default.
grid	object	Configure magic-grid feature properties, see below.
show_grid_by_default	boolean	If set to `true`, the magic-grid is opened instead of footer upon entering the multivalue. Default `false`. Applied only in UI. Useful when annotating documents for custom training.
rir_field_names	list[string]	List of names used to initialize content from the AI engine predictions. If specified, the value of the first field from the array is used, otherwise default name `line_items` is used. Attribute can be set only for multivalue containing objects with category `tuple`.	no

Multivalue grid object

The multivalue grid object lets you specify a row type for each row of the grid. For the data representation of actual grid data rows, see the Grid object description.

Attribute	Type	Description	Default	Required
row_types	list[string]	List of allowed row type values.	`["data"]`	yes
default_row_type	string	Row type to be used by default	`data`	yes
row_types_to_extract	list[string]	Types of rows to be extracted to related table	`["data"]`	yes

For example, to distinguish two header types and a footer in the validation interface, the following row types may be used: header, subsection_header, data and footer.

Currently, data extraction classifies every row as either data or header (additional row types may be introduced in the future). We remove rows returned by data extraction that are not in row_types list (e.g. header by default) and are on the top/bottom of the table. When they are in the middle of the table, we mark them as skipped (null).

There are three visual modes, based on row_types quantity:

More than two row types defined: User selects row types freely to any non-default row type. Clearing row type resets to a default row type. We support up to 6 colors to easily distinguish row types visually.
Two row types defined (header and default): User only marks header and skipped rows. Clearing row type resets to a default row type.
One row type defined: User is only able to mark row as skipped (null value in data). This is also a default behavior when no grid row types configuration is specified in the schema.

Only rows marked as one of row_types_to_extract values are transferred to a table by pressing "Read data from table" button in the Rossum UI (calling grid-to-table conversion API endpoint).
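
The effect of row_types_to_extract can be sketched in Python as a simple filter. The input here is a hypothetical list holding one row type per grid row (with `None` for skipped rows), and `rows_to_extract` is not an API function; the real conversion is done by the grid-to-table endpoint mentioned above.

```python
from typing import List, Optional


def rows_to_extract(row_types: List[Optional[str]],
                    grid: Optional[dict] = None) -> List[int]:
    """Return indexes of grid rows that would be transferred to the table."""
    # Without a grid configuration, only "data" rows are extracted (the default).
    wanted = (grid or {}).get("row_types_to_extract", ["data"])
    return [index for index, row_type in enumerate(row_types) if row_type in wanted]


grid_config = {
    "row_types": ["header", "data", "footer"],
    "default_row_type": "data",
    "row_types_to_extract": ["data"],
}
# One row type per grid row; None marks a skipped row.
print(rows_to_extract(["header", "data", None, "data", "footer"], grid_config))
```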

Tuple

Example tuple object:

{
  "category": "tuple",
  "id": "tax_details",
  "label": "Tax Details",
  "children": [
    ...
  ],
  "rir_field_names": [
    "tax_details"
  ]
}

Container representing tabular data with related values, such as tax details. A tuple must be nested within a multivalue object, but unlike multivalue, it may consist of objects with different ids.

Attribute	Type	Description	Required
children	list[object]	Array specifying objects that belong to a given `tuple`. It can contain only objects with category `datapoint`.	yes
rir_field_names	list[string]	List of names used to initialize content from the AI engine predictions. If specified, the value of the first extracted field from the array is used, otherwise, no AI engine initialization is done for the object.

Updating Schema

As a project evolves, it is common practice to enhance or change the extracted field set. This is done by updating the schema object.

By design, Rossum supports multiple schema versions at the same time. However, each document annotation is related to only one of those schemas. If the schema is updated, all related document annotations are updated accordingly. See preserving data on schema change below for limitations of schema updates.

In addition, every queue is linked to a schema, which is used for all newly imported documents.

When updating a schema, there are two possible approaches:

Update the schema object (PUT/PATCH). All related annotations will be updated to match current schema shape (even exported and deleted documents).
Create a new schema object (POST) and link it to the queue. In such case, only newly created objects will use the current schema. All previously created objects will remain in the shape of their linked schema.

Formerly, we recommended always creating a new schema object when changing the set of extracted fields. This is no longer necessary, since updating the current schema object (PUT/PATCH) can be used instead. See the use cases below if you are not sure which approach is appropriate.

Use case 1 - Initial setting of a schema

Situation: User is initially setting up the schema. This might be an iterative process.
Recommendation: Edit the existing schema by updating schema (PUT) or updating part of a schema (PATCH).

Use case 2 - Updating attributes of a field (label, constraints, options, etc.)

Situation: User is updating field, e.g. changing label, number format, constraints, enum options, hidden flag, etc.
Recommendation: Edit existing schema (see Use case 1).

Use case 3 - Adding new field to a schema, even for already imported documents

Situation: User is extending a production schema by adding a new field. Moreover, user would like to see all annotations (to_review, postponed, exported, deleted, etc. states) in the look of the newly extended schema.
Recommendation: Edit existing schema (see Use case 1). Data of already created annotations will be transformed to the shape of the updated schema. New fields of annotations in to_review and postponed state that are linked to extracted fields types will be filled by AI Engine, if available. New fields for already exported or deleted annotations (also purged, exporting and failed_export) will be filled with empty or default values.

Use case 4 - Adding new field to schema, only for newly imported documents

Situation: User is extending a production schema by adding a new field. However, the user does not want to see the newly added field on previously created annotations.
Recommendation: Create a new schema object (POST) and link it to the queue. Annotation data of previously created annotations will be preserved according to the original schema. This approach is recommended if there is an organizational need to keep different field sets before and after the schema update.

Use case 5 - Deleting schema field, even for already imported documents

Situation: User is changing a production schema by removing a field that was used previously. Moreover, the user would like to see all annotations (to_review, postponed, exported, deleted, etc. states) in the shape of the updated schema. There is no need to see the original fields in already exported annotations.
Recommendation: Edit existing schema (see Use case 1).

Use case 6 - Deleting schema field, only for newly imported documents

Situation: User is changing a production schema by removing a field that was used previously. However, the user still wants to be able to see the removed fields on previously created annotations.
Recommendation: Create a new schema object (see Use case 4). Annotation data of previously created annotations will be preserved according to the original schema. This approach is recommended if there is an organizational need to retrieve the data in the original state.

When copying an annotation or moving it to a new queue by patching its queue attribute, the annotation in the new queue will still be associated with the old schema.

Preserving data on schema change

In order to transfer annotation field values properly during the schema update, a datapoint's category and schema_id must be preserved.

Supported operations that preserve field values are:

adding a new datapoint (filled from AI Engine, if available)
reordering datapoints on the same level
moving datapoints from section to another section
moving datapoints to and from a tuple
reordering datapoints inside a tuple
making datapoint a multivalue (original datapoint is the only value in a new multivalue container)
making datapoint non-multivalue (only first datapoint value is preserved)
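
Before updating a schema in place, it can help to compare the datapoint ids of the old and new schema content. The Python sketch below is a rough pre-check under assumptions: `compare_schemas` is a hypothetical helper, it only reports which ids still exist as datapoints, and it does not model the multivalue special cases listed above.

```python
from typing import Dict, Iterator, List, Set


def iter_objects(node: dict) -> Iterator[dict]:
    """Yield a schema object and all of its descendants."""
    yield node
    children = node.get("children", [])
    if isinstance(children, dict):  # multivalue: a single child object
        children = [children]
    for child in children:
        yield from iter_objects(child)


def datapoint_ids(content: List[dict]) -> Set[str]:
    return {obj["id"] for section in content for obj in iter_objects(section)
            if obj["category"] == "datapoint"}


def compare_schemas(old: List[dict], new: List[dict]) -> Dict[str, List[str]]:
    """Group datapoint ids by what happens to them in the new schema."""
    old_ids, new_ids = datapoint_ids(old), datapoint_ids(new)
    return {
        "preserved": sorted(old_ids & new_ids),  # same id, still a datapoint
        "removed": sorted(old_ids - new_ids),    # values are not carried over
        "added": sorted(new_ids - old_ids),      # filled from AI engine or defaults
    }


def dp(field_id: str) -> dict:
    return {"category": "datapoint", "id": field_id, "label": field_id, "type": "string"}


old_content = [{"category": "section", "id": "s1", "label": "S1",
                "children": [dp("document_id"), dp("order_id")]}]
# document_id moves to another section, order_id is dropped, date_issue is new.
new_content = [{"category": "section", "id": "s1", "label": "S1",
                "children": [dp("date_issue")]},
               {"category": "section", "id": "s2", "label": "S2",
                "children": [dp("document_id")]}]
print(compare_schemas(old_content, new_content))
```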

Extracted field types

AI engine currently automatically extracts the following fields at the all endpoint, subject to ongoing expansion.

Identifiers

Attr. rir_field_names	Field label	Description
account_num	Bank Account	Bank account number. Whitespaces are stripped.
bank_num	Sort Code	Sort code. Numerical code of the bank.
iban	IBAN	Bank account number in IBAN format.
bic	BIC/SWIFT	Bank BIC or SWIFT code.
const_sym	Constant Symbol	Statistical code on payment order.
spec_sym	Specific Symbol	Payee ID on the payment order, or similar.
var_sym	Variable symbol	In some countries used by the supplier to match the payment received against the invoice. Possible non-numeric characters are stripped.
terms	Terms	Payment terms as written on the document (e.g. "45 days", "upon receipt").
payment_method	Payment method	Payment method defined on a document (e.g. 'Cheque', 'Pay order', 'Before delivery')
customer_id	Customer Number	The number by which the customer is registered in the system of the supplier. Whitespaces are stripped.
date_due	Date Due	The due date of the invoice.
date_issue	Issue Date	Date of issue of the document.
date_uzp	Tax Point Date	The date of taxable event.
document_id	Document Identifier	Document number. Whitespaces are stripped.
order_id	Order Number	Purchase order identification (Order Numbers not captured as "sender_order_id"). Whitespaces are stripped.
recipient_address	Recipient Address	Address of the customer.
recipient_dic	Recipient Tax Number	Tax identification number of the customer. Whitespaces are stripped.
recipient_ic	Recipient Company ID	Company identification number of the customer. Possible non-numeric characters are stripped.
recipient_name	Recipient Name	Name of the customer.
recipient_vat_id	Recipient VAT Number	Customer VAT Number
recipient_delivery_name	Recipient Delivery Name	Name of the recipient to whom the goods will be delivered.
recipient_delivery_address	Recipient Delivery Address	Address of the recipient where the goods will be delivered.
sender_address	Supplier Address	Address of the supplier.
sender_dic	Supplier Tax Number	Tax identification number of the supplier. Whitespaces are stripped.
sender_ic	Supplier Company ID	Business/organization identification number of the supplier. Possible non-numeric characters are stripped.
sender_name	Supplier Name	Name of the supplier.
sender_vat_id	Supplier VAT Number	VAT identification number of the supplier.
sender_email	Supplier Email	Email of the sender.
sender_order_id	Supplier's Order ID	Internal order ID in the suppliers system.
delivery_note_id	Delivery Note ID	Delivery note ID defined on the invoice.
supply_place	Place of Supply	Place of supply (the name of the city or state where the goods will be supplied).

Starting from July 2020, the field invoice_id was renamed to document_id. However, the invoice_id name will still be supported for backwards compatibility. Going forward, we recommend switching to document_id in your extraction schemas.

Document attributes

Attr. rir_field_names	Field label	Description
currency	Currency	The currency which the invoice is to be paid in. Possible values: AED, ARS, AUD, BGN, BRL, CAD, CHF, CLP, CNY, COP, CRC, CZK, DKK, EUR, GBP, GTQ, HKD, HUF, IDR, ILS, INR, ISK, JMD, JPY, KRW, MXN, MYR, NOK, NZD, PEN, PHP, PLN, RON, RSD, SAR, SEK, SGD, THB, TRY, TWD, UAH, USD, VES, VND, ZAR or other. May be also in lowercase.
document_type	Document Type	Possible values: credit_note, debit_note, tax_invoice (most typical), proforma, receipt, delivery_note, order or other.
language	Language	The language which the document was written in. Values are ISO 639-3 language codes, e.g.: eng, fra, deu, zho. See Languages Supported By Rossum
payment_method_type	Payment Method Type	Payment method used for the transaction. Possible values: card, cash.

Starting from May 2020, the invoice_type document attribute was renamed to document_type. However, the invoice_type name will still be supported for backwards compatibility. Going forward, we recommend switching to document_type in your extraction schemas.

Amounts

Attr. rir_field_names	Field label	Description
amount_due	Amount Due	Final amount including tax to be paid after deducting all discounts and advances.
amount_rounding	Amount Rounding	Remainder after rounding amount_total.
amount_total	Total Amount	Subtotal over all items, including tax.
amount_paid	Amount paid	Amount paid already.
amount_total_base	Tax Base Total	Base amount for tax calculation.
amount_total_tax	Tax Total	Total tax amount.

Typical relations (may depend on local laws):

amount_total = amount_total_base + amount_total_tax
amount_rounding = amount_total - round(amount_total)
amount_due = amount_total - amount_paid + amount_rounding

All amounts are in the main currency of the invoice (as identified in the currency response field). Amounts in other currencies are generally excluded.
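
The typical relations above can be checked against extracted values with a short Python sketch. `check_amount_relations` is a hypothetical helper, the tolerance is arbitrary, and `round` is assumed to mean rounding to a whole currency unit; as noted, the relations themselves may depend on local laws.

```python
from decimal import Decimal
from typing import Dict


def check_amount_relations(a: Dict[str, Decimal],
                           tolerance: Decimal = Decimal("0.01")) -> Dict[str, bool]:
    """Report which of the typical relations hold for the given amounts."""
    def close(x: Decimal, y: Decimal) -> bool:
        return abs(x - y) <= tolerance

    # Assumption: round() in the relation means rounding to a whole unit.
    rounded_total = a["amount_total"].to_integral_value()
    return {
        "amount_total": close(a["amount_total"],
                              a["amount_total_base"] + a["amount_total_tax"]),
        "amount_rounding": close(a["amount_rounding"],
                                 a["amount_total"] - rounded_total),
        "amount_due": close(a["amount_due"],
                            a["amount_total"] - a["amount_paid"] + a["amount_rounding"]),
    }


# Hypothetical extracted values.
amounts = {
    "amount_total_base": Decimal("100.00"),
    "amount_total_tax": Decimal("21.00"),
    "amount_total": Decimal("121.00"),
    "amount_paid": Decimal("20.00"),
    "amount_rounding": Decimal("0.00"),
    "amount_due": Decimal("101.00"),
}
print(check_amount_relations(amounts))
```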

Tables

At the moment, the AI engine automatically extracts 2 types of tables. To pick one of them, set the rir_field_names attribute on the multivalue.

Attr. rir_field_names	Table
tax_details	Tax details
line_items	Line items

For backwards compatibility, the rir_field_names on a multivalue are set to line_items by default. However, if the rir_field_names of any column contain a string starting with tax_detail_, the table is assumed to be tax_details.
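
The selection rule for the table type can be written down as a small Python sketch. `table_kind` is a hypothetical helper operating on a multivalue object as described in this document; it mirrors the explicit setting, the tax_detail_ prefix fallback and the line_items default.

```python
def table_kind(multivalue: dict) -> str:
    """Return the table name the AI engine would use for a tabular multivalue."""
    names = multivalue.get("rir_field_names")
    if names:
        return names[0]  # explicitly configured: the first name wins
    tuple_object = multivalue["children"]
    for column in tuple_object.get("children", []):
        for name in column.get("rir_field_names") or []:
            if name.startswith("tax_detail_"):
                return "tax_details"
    return "line_items"  # backwards-compatible default


# Hypothetical multivalue without rir_field_names; its columns decide the type.
vat_table = {
    "category": "multivalue",
    "id": "vat_details",
    "label": "VAT details",
    "children": {
        "category": "tuple",
        "id": "vat_detail",
        "label": "VAT detail",
        "children": [
            {"category": "datapoint", "id": "vat_rate", "label": "Rate",
             "type": "number", "rir_field_names": ["tax_detail_rate"]},
        ],
    },
}
print(table_kind(vat_table))
```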

Tax details

Tax details table and breakdown by tax rates.

Attr. rir_field_names	Field label	Description
tax_detail_base	Tax Base	Sum of tax bases for items with the same tax rate.
tax_detail_rate	Tax Rate	One of the tax rates in the tax breakdown.
tax_detail_tax	Tax Amount	Sum of taxes for items with the same tax rate.
tax_detail_total	Tax Total	Total amount including tax for all items with the same tax rate.
tax_detail_code	Tax Code	Text on document describing tax code of the tax rate (e.g. 'GST', 'CGST', 'DPH', 'TVA'). If multiple tax rates belong to one tax code on the document, the tax code will be assigned only to the first tax rate. (In the future, such a tax code will be distributed to all matching tax rates.)

Line items

AI engine currently automatically extracts line item table content and recognizes row and column types as detailed below. Invoice line items come in a wide variety of different shapes and forms. The current implementation can deal with (or learn) most layouts, with borders or not, different spacings, header rows, etc. We currently make two further assumptions:

The table generally follows a grid structure - that is, columns and rows may be represented as rectangle spans. In practice, this means that we may currently cut off text that overlaps from one cell to the next column. We are also not optimizing for table rows that are wrapped to multiple physical lines.
The table contains just a flat structure of line items, without subsection headers, nested tables, etc.

We plan to gradually remove both assumptions in the future.

Attribute rir_field_names	Field label	Description
table_column_code	Item Code/ID	Can be the SKU, EAN, a custom code (string of letters/numbers) or even just the line number.
table_column_description	Item Description	Line item description. Can be multi-line with details.
table_column_quantity	Item Quantity	Quantity of the item.
table_column_uom	Item Unit of Measure	Unit of measure of the item (kg, container, piece, gallon, ...).
table_column_rate	Item Rate	Tax rate for the line item.
table_column_tax	Item Tax	Tax amount for the line. Rule of thumb: `tax = rate * amount_base`.
table_column_amount_base	Amount Base	Unit price without tax. (This is the primary unit price extracted.)
table_column_amount	Amount	Unit price with tax. Rule of thumb: `amount = amount_base + tax`.
table_column_amount_total_base	Amount Total Base	The total amount to be paid for all the items excluding the tax. Rule of thumb: `amount_total_base = amount_base * quantity`.
table_column_amount_total	Amount Total	The total amount to be paid for all the items including the tax. Rule of thumb: `amount_total = amount * quantity`.
table_column_other	Other	Unrecognized data type.
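
The rules of thumb in the table can be combined into one worked example. The Python sketch below is illustrative: `line_item_amounts` is a hypothetical helper, and it assumes the tax rate is given as a fraction (0.20 for 20 %), whereas documents often print the rate as a percentage.

```python
from decimal import Decimal
from typing import Dict


def line_item_amounts(amount_base: Decimal, rate: Decimal,
                      quantity: Decimal) -> Dict[str, Decimal]:
    """Derive the remaining line item columns using the rules of thumb."""
    tax = rate * amount_base        # tax = rate * amount_base
    amount = amount_base + tax      # amount = amount_base + tax
    return {
        "table_column_tax": tax,
        "table_column_amount": amount,
        "table_column_amount_total_base": amount_base * quantity,
        "table_column_amount_total": amount * quantity,
    }


# Unit price 10.00 without tax, 20 % tax rate, 3 pieces.
print(line_item_amounts(Decimal("10.00"), Decimal("0.20"), Decimal("3")))
```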