TorchicTab: Semantic Table Annotation

Assigning semantic annotations to the elements of a table.

This post is based on the following publications:

TorchicTab: Semantic Table Annotation with Wikidata and Language Models. Dasoulas, I., Yang, D., Duan, X., Dimou, A., SemTab’23: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching 2023, co-located with the 22nd International Semantic Web Conference (ISWC), November 6-10, 2023, Athens, Greece.

This publication won the SemTab 2023 challenge award at the 22nd International Semantic Web Conference. The SemTab challenge benchmarks systems that assign semantic tags to the elements of a table by comparing their accuracy.

TorchicTab is a semantic table annotation system that automatically understands the content of a table and assigns semantic tags to its elements with high accuracy. TorchicTab provides annotations both by leveraging heuristic methods to link table concepts with Wikidata entities and properties, and by applying classification methods trained on annotated tables to predict column types and relations for unseen tables. It was developed to participate in the SemTab challenge. TorchicTab consists of two complementary subsystems: TorchicTab-Heuristic and TorchicTab-Classification.

  • TorchicTab-Heuristic leverages heuristic data mining to link tables with existing RDF graphs. It annotates tables by inferring the subject column from other elements in the table and predicts properties and qualifiers from large RDF graphs, such as Wikidata, for n-ary relations.

  • TorchicTab-Classification leverages DODUO, a framework built on pre-trained language models that learns from provided table annotations, and therefore requires a sufficiently large number of annotated tables for training. It adopts a sub-table sampling strategy to handle larger tables.

High level view of TorchicTab.

SemTab Challenge

The Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab) benchmarks table annotation systems over different table types (complete, horizontal, etc.) and annotation tasks (an illustrative sketch follows the list):

  • The Cell Entity Annotation (CEA) task associates a table cell with an entity.
  • The Column Type Annotation (CTA) task assigns a semantic type to a column.
  • The Column Property Annotation (CPA) task discovers a semantic relation contained in the RDF graph that best represents the relation between two columns.
  • The Topic Detection (TD) task identifies the topic of a table that lacks a subject column and assigns a class.

SemTab tasks: CEA, CPA and CTA. The entire table is considered for TD.

TorchicTab

TorchicTab consists of two subsystems: (1) TorchicTab-Heuristic to annotate datasets with Wikidata as the reference knowledge base, and (2) TorchicTab-Classification to annotate datasets based on a sufficiently large number of annotated tables for training.

TorchicTab-Heuristic

TorchicTab-Heuristic semantically annotates tables with entities and relations from Wikidata. We follow two similar workflows depending on whether the table has an entity column or not.

Complete tables

To semantically annotate complete tables (i.e., tables that have an entity column), TorchicTab-Heuristic follows four steps:

TorchicTab-Heuristic.

(1) Table pre-processing: all tables are pre-processed, e.g., non-cell values and HTML tags are removed or incorrect encodings are fixed, and columns are distinguished into NE-columns, whose values can be named entities in the reference RDF graph, or L-columns, whose values can be literal values.
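
The snippet below is a minimal sketch of such pre-processing and column classification, assuming simple regex-based heuristics and hypothetical helper names; the actual system's rules may be more elaborate.

```python
import re

def clean_cell(value: str) -> str:
    """Strip HTML tags, normalise whitespace and drop obvious non-cell markers."""
    value = re.sub(r"<[^>]+>", "", value)          # remove HTML tags
    value = re.sub(r"\s+", " ", value).strip()     # collapse whitespace
    return "" if value.lower() in {"n/a", "null", "-"} else value

def is_literal(value: str) -> bool:
    """Heuristic: numbers, dates and similar codes are treated as literals."""
    return bool(re.fullmatch(r"[\d.,/:\-\s%€$]+", value))

def classify_columns(rows):
    """Label each column as 'NE' (named entities) or 'L' (literals)."""
    labels = []
    for column in zip(*rows):
        cleaned = [clean_cell(cell) for cell in column if cell]
        literal_ratio = sum(is_literal(c) for c in cleaned) / max(len(cleaned), 1)
        labels.append("L" if literal_ratio > 0.5 else "NE")
    return labels
```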

(2) Candidate search: candidate entities are assigned to all NE-column cells. We employ four lookup strategies targeting an Elasticsearch index which contains names and aliases from Wikidata (see the sketch after this list):

  1. The Complete Cell Strategy uses the complete cell;

  2. the Fuzzy Strategy leverages Elasticsearch fuzzy queries for noisy cells and cells with spelling mistakes;

  3. the Cell Token Strategy removes stop-words, splits cells into tokens and queries the index for each token separately; and

  4. the Cell Token Combinations Strategy removes stop-words from the cell, splits it into tokens and queries the index with combinations of those tokens.
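
The sketch below illustrates how the four strategies could be expressed with the Elasticsearch Python client (8.x assumed); the index name (wikidata_labels), field name (label) and stop-word list are assumptions, not the system's actual configuration.

```python
from itertools import combinations
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # assumed local deployment
INDEX = "wikidata_labels"                     # hypothetical index of names and aliases
STOPWORDS = {"the", "of", "and", "a", "in"}   # simplified stop-word list

def complete_cell(cell):
    # 1. query with the complete cell value
    return es.search(index=INDEX, query={"match": {"label": cell}})

def fuzzy(cell):
    # 2. fuzzy query for noisy cells and spelling mistakes
    return es.search(index=INDEX,
                     query={"fuzzy": {"label": {"value": cell, "fuzziness": "AUTO"}}})

def cell_tokens(cell):
    # 3. one query per token after stop-word removal
    tokens = [t for t in cell.lower().split() if t not in STOPWORDS]
    return [es.search(index=INDEX, query={"match": {"label": t}}) for t in tokens]

def cell_token_combinations(cell):
    # 4. queries over combinations of the remaining tokens
    tokens = [t for t in cell.lower().split() if t not in STOPWORDS]
    results = []
    for size in range(len(tokens), 1, -1):
        for combo in combinations(tokens, size):
            results.append(es.search(index=INDEX,
                                     query={"match": {"label": " ".join(combo)}}))
    return results
```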

(3) Ranking: the candidates are ranked based on string similarity, comparing their labels to the table’s cells, and context similarity, comparing their sub-graphs within the RDF graph to the cell’s context within the table; a confidence score is assigned to each candidate.
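
A minimal sketch of such a scoring function is shown below; the similarity measures and the equal weighting are assumptions, not necessarily what TorchicTab uses.

```python
from difflib import SequenceMatcher

def string_similarity(cell: str, candidate_label: str) -> float:
    """Normalised edit-based similarity between the cell and a candidate label."""
    return SequenceMatcher(None, cell.lower(), candidate_label.lower()).ratio()

def context_similarity(row_context: set, candidate_neighbourhood: set) -> float:
    """Overlap between the other cells in the row and the labels/literals
    reachable from the candidate in the RDF graph."""
    if not row_context:
        return 0.0
    return len(row_context & candidate_neighbourhood) / len(row_context)

def score_candidate(cell, candidate_label, row_context, candidate_neighbourhood,
                    w_string=0.5, w_context=0.5):
    # Hypothetical weighting; the actual confidence score may be computed differently.
    return (w_string * string_similarity(cell, candidate_label)
            + w_context * context_similarity(row_context, candidate_neighbourhood))
```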

(4) Task estimation: The candidates’ scores and properties are used to calculate the most suitable relations between columns, via majority voting (CPA). The outputs of the CPA and candidate ranking are used to select the most suitable candidate for each NE-column cell (CEA). The outputs of CEA are then used for each NE-column to rank candidate types that could represent them and select the best (CTA).
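
The sketch below outlines how the three estimations could chain together through majority voting; the candidate data structures and scores are hypothetical.

```python
from collections import Counter

def vote_column_property(row_candidates):
    """CPA: each row proposes the Wikidata property linking two columns;
    the most frequent proposal wins (hypothetical input format)."""
    counts = Counter(p for p in row_candidates if p)
    return counts.most_common(1)[0][0] if counts else None

def pick_cell_entity(ranked_candidates, winning_property):
    """CEA: prefer the highest-scoring candidate consistent with the CPA result."""
    consistent = [c for c in ranked_candidates if winning_property in c["properties"]]
    pool = consistent or ranked_candidates
    return max(pool, key=lambda c: c["score"])["entity"]

def vote_column_type(cell_entities, entity_types):
    """CTA: count the types of the entities selected by CEA and keep the most frequent."""
    counts = Counter(t for e in cell_entities for t in entity_types.get(e, []))
    return counts.most_common(1)[0][0] if counts else None
```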

Entity-less tables

To semantically annotate tables that lack a subject column, the workflow is similar to the one for complete tables, but the candidate selection step is adjusted to account for the missing subject column. To achieve this, we extend our Elasticsearch setup with an additional index containing all Wikidata triples, since we can no longer rely on extracting candidates using only entities’ names and aliases.
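
A minimal sketch of such a value-based lookup is shown below; the index name (wikidata_triples) and field names are assumptions.

```python
def candidates_by_value(es, cell_value, size=20):
    """Look up candidate subjects through a triples index rather than names/aliases:
    any entity whose property value matches the cell becomes a candidate."""
    response = es.search(
        index="wikidata_triples",                 # assumed index of subject-property-object triples
        query={"match": {"object": cell_value}},  # assumed field name for the triple object
        size=size,
    )
    return [hit["_source"]["subject"] for hit in response["hits"]["hits"]]
```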

TorchicTab-Classification

TorchicTab-Classification considers pre-defined terms from vocabularies like Schema.org and DBpedia, unlike TorchicTab-Heuristic, which focuses on entity discovery in Wikidata. We treat the task as multi-class classification, where each column or column pair can be annotated with only one label, and apply DODUO, a multi-task learning framework based on pre-trained language models.

However, due to the maximum input length of most language models (typically 512 tokens), it is challenging to feed an entire table to DODUO. Moreover, DODUO only models the information contained in the labeled columns, disregarding the extensive context available in the unlabeled columns of the table.
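
The short sketch below illustrates the length problem with a standard BERT tokenizer; the serialization is simplified and is not DODUO's exact input format.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def serialized_length(table_columns):
    """Serialise all column values into one sequence (simplified, DODUO-style)
    and count the resulting sub-word tokens."""
    text = " ".join(cell for column in table_columns for cell in column)
    return len(tokenizer.tokenize(text))

# A large table easily exceeds the 512-token input limit of BERT-style models,
# which is why the sub-table sampling strategy below is needed.
```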

TorchicTab-Classification.

Sub-table Sampling Strategy

To deal with DODUO’s limitations, we follow a sub-table sampling strategy that splits larger tables and incorporates richer context within the original DODUO. We perform several steps before and after applying DODUO (a sketch of the full pipeline follows the steps):

(1) Row selection: a specific number of rows (e.g., 40 out of the 56 rows of a table) is randomly selected; these rows are randomly divided into smaller, equally sized sub-tables (e.g., 8 sub-tables with 5 rows each); each sub-table serves as a single training sample after further processing.

(2) Column selection: each sub-table is reduced to a maximum of 10 randomly selected columns (e.g., 10 out of 14 columns). Sub-tables derived from the same large table may share some columns and differ in others.

(3) Token construction: each sub-table holds 50 tabular cells (5 rows × 10 columns) that are serialized as one training or prediction sample; each cell is therefore limited to a maximum of 10 sub-word tokens so that the sample (50 × 10 = 500 tokens) stays within the 512-token input limit of the language model.

(4) Majority voting: once DODUO has predicted the type or relation for a column within each sub-table, a majority voting strategy consolidates the final prediction for the column or column pair: the individual sub-table predictions are aggregated and the most frequent type or relation is selected as the final result.
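
The sketch below puts the four steps together, with predict_column_types standing in for a DODUO prediction call (hypothetical signature); the default parameters mirror the example numbers above.

```python
import random
from collections import Counter

def sample_subtables(rows, n_rows=40, rows_per_subtable=5, max_columns=10):
    """Row and column selection: split a large table into smaller sub-tables,
    keeping track of which original columns each sub-table contains."""
    selected = random.sample(rows, min(n_rows, len(rows)))
    subtables = [selected[i:i + rows_per_subtable]
                 for i in range(0, len(selected), rows_per_subtable)]
    sampled = []
    for sub in subtables:
        n_cols = len(sub[0])
        cols = sorted(random.sample(range(n_cols), min(max_columns, n_cols)))
        sampled.append((cols, [[row[c] for c in cols] for row in sub]))
    return sampled

def annotate_large_table(rows, predict_column_types):
    """Apply a column-type model (stand-in for DODUO) to each sub-table,
    then consolidate per-column predictions with majority voting."""
    votes = {}
    for cols, subtable in sample_subtables(rows):
        predictions = predict_column_types(subtable)   # one label per sampled column
        for original_col, label in zip(cols, predictions):
            votes.setdefault(original_col, Counter())[label] += 1
    return {col: counter.most_common(1)[0][0] for col, counter in votes.items()}
```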

Evaluation

Our results include the F1-score (F1), precision (P) and recall (R) for the validation and test sets provided by the competition, with four groups of datasets across two rounds.

Datasets

The datasets which were used by the challenge are:

  • WikidataTables: datasets with tables generated using an improved version of SemTab’s data generator that creates realistic-looking tables using SPARQL queries. The target knowledge graph for this dataset is Wikidata, and the tasks are CEA, CTA, and CPA.

  • tFood: datasets derived from the food domain. tFood contains horizontal relational tables, where each table represents a collection of entities, and entity tables, each representing a single entity. The target knowledge graph for this dataset is Wikidata, and the tasks are CEA, CTA, CPA and CQA.

  • SOTAB: a benchmark dataset created using tables from the WDC Schema.org Table Corpus for the CTA and CPA tasks. The column types and the column relationships are annotated using the Schema.org and DBpedia vocabularies.

  • CQA: a dataset based on Wikary, consisting of Wikipedia tables for the CQA task.

Results

Results SOTAB.

Round 2 concluded with six core participants. TorchicTab outperforms the other systems, in some cases by a wide margin, with F1 scores of up to 0.9. Thanks to its high accuracy, TorchicTab won the award for the SemTab challenge during the ISWC 2023 conference!
