Understanding document structure is a crucial step towards accurate data extraction.
Why?
Extracting document structures means that your team can understand, and extract, the entire content of a document, not just individual or specific data points.
With Structure Extraction, you get text in contextual blocks, including headings, paragraphs, lists, footnotes, and tables, along with other formatting information. No coding is needed on your side; this is done by our engineers.
You can export the structured data as JSON or use the REST API.
Document structures help you organise complex information. Using a process called "Layout Segmentation", a page is segmented into individual blocks, such as headings, paragraphs, lists, footnotes, and tables.
Now, why do you need structure extraction?
Structure Extraction replaces all of the nightmares associated with data coding with a no-code method to extract document data in just a few clicks.
Whether your team of data scientists need to evaluate a sample of documents, or your business team need quick data analytics, Structure Extraction provides the fundamental tool for any team to find, use, and store data.
The Structure Extraction platform is then able to assign meanings to each block, for example: "This text block is a title level one", "This table has the caption 'Durchschnittsprämien' and belongs to chapter X".
By knowing what type of data each block contains, Structure Extraction can reconstruct the contents of a document and support more complex analysis.
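To make the idea of typed blocks concrete, here is a minimal Python sketch of how blocks with assigned meanings could be linked to their chapter. It is an illustration only, not the Acodis data model: the `Block` class, its field names, and the sample paragraph are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Block:
    """Hypothetical block representation; field names are assumptions."""
    kind: str                      # e.g. "title", "paragraph", "table"
    text: str
    level: int = 0                 # heading level for titles, 0 otherwise
    caption: Optional[str] = None  # only meaningful for tables and figures


def assign_chapters(blocks: List[Block]) -> List[Tuple[Optional[str], Block]]:
    """Pair every block that is not a level-one title with the latest level-one title."""
    current_chapter = None
    labelled = []
    for block in blocks:
        if block.kind == "title" and block.level == 1:
            current_chapter = block.text
            continue
        labelled.append((current_chapter, block))
    return labelled


# Made-up sample blocks, in reading order.
blocks = [
    Block("title", "Chapter X", level=1),
    Block("paragraph", "Sample paragraph text."),
    Block("table", "", caption="Durchschnittsprämien"),
]

labelled = assign_chapters(blocks)
for chapter, block in labelled:
    print(chapter, block.kind, block.caption)
```

Once every block carries a type and a chapter, statements such as "this table belongs to chapter X" become a simple lookup.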
We can compare Structure Extraction to someone who is skimming through a book. The person is trying to absorb as much information as possible, searching for relevant pieces of information without needing to read everything.
But why do people skim through information? To save themselves time and effort; to get straight to what they're looking for.
Structure Extraction does exactly this for users – it enables people to find specific information quickly, without needing to study their entire document landscape.
Acodis guides you through several steps to extract the entire document structure. You can choose exactly what you want to extract depending on your goals.
In the end, you have the option to export a JSON or use the API to connect your preferred app. More info on integration/API with Acodis here.
(The image illustrates an example of options to extract structure with Acodis.)
This option lets you extract all the text identified within your uploaded document.
Reading order refers to the sequence in which a reader is meant to go through the document: where the title page is, where the appendix is, and so on.
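As a rough illustration of reading order, the Python sketch below sorts positioned blocks by page, then top to bottom, then left to right. This is a simple single-column heuristic chosen for the example, not a description of how Acodis determines reading order; the `PositionedBlock` type and its coordinates are assumptions.

```python
from typing import List, NamedTuple


class PositionedBlock(NamedTuple):
    """Hypothetical block with a page number and a top-left position."""
    page: int
    top: float
    left: float
    text: str


def reading_order(blocks: List[PositionedBlock]) -> List[PositionedBlock]:
    """Order blocks by page, then vertical position, then horizontal position."""
    return sorted(blocks, key=lambda b: (b.page, b.top, b.left))


# Made-up blocks, deliberately out of order.
blocks = [
    PositionedBlock(2, 50.0, 40.0, "Appendix"),
    PositionedBlock(1, 300.0, 40.0, "Introduction"),
    PositionedBlock(1, 80.0, 40.0, "Title page"),
]

ordered = [b.text for b in reading_order(blocks)]
print(ordered)
```

Multi-column layouts, sidebars, and footnotes need more than a coordinate sort, which is why reading order is treated as its own extraction step.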
Headers can easily define "sections" of any given document and can indicate where specific pieces of data are located. For many who need to analyse the layout of their documents, it is fundamental to understand the information/location of main headers.
And while this process may be easy for single-page documents, it can be a time-consuming task if you are handling hundreds of PDFs at once.
While some documents include an appendix that indicates where figures are located, it can be time-consuming to actually pinpoint them if the document landscape spans hundreds of pages. This is no longer the case: figures can be located with a single click.
When users select "Figure", all of the relevant pieces of content will then become highlighted across all document pages.
Good to know: all of the content locked inside those figures is also transformed into structured information, meaning that you can analyse all of the data within them.
When text is laid out on several pages, it undergoes a series of transformations.
For example:
Text aggregation tries to undo these transformations and recover the original text.
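A minimal Python sketch of text aggregation follows. It assumes just two layout transformations, both chosen for illustration: paragraphs broken into lines and pages, and words hyphenated at line ends. The function name and the sample lines are made up, and this is not the Acodis implementation.

```python
from typing import List


def aggregate_text(pages: List[List[str]]) -> str:
    """Join laid-out lines from consecutive pages back into running text."""
    # Flatten pages into one list of non-empty, trimmed lines.
    lines = [line.strip() for page in pages for line in page if line.strip()]
    text = ""
    for line in lines:
        if text.endswith("-"):
            # Assumption: a trailing hyphen marks a word split across lines.
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text


# Made-up sample: one sentence laid out over two pages.
pages = [
    ["Structure extraction re-", "builds the original"],
    ["text from page frag-", "ments."],
]

restored = aggregate_text(pages)
print(restored)
```

The hyphen rule is deliberately naive: a genuine compound such as "no-code" that happens to break at a line end would be merged incorrectly, so a real system needs more context than this sketch uses.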
When we say "noise", we're not referring to your documents being loud, but rather to any repeated elements that do not contribute to the normal contents of a document. This can include:
Acodis Document Structure Extraction can identify much of the noise within documents and still manage to analyse any data around it.
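One simple way to spot repeated elements is to look for lines that appear on most pages. The Python sketch below shows that heuristic; the 80% threshold, the function names, and the sample pages are assumptions for illustration and do not describe how Acodis detects noise.

```python
from collections import Counter
from typing import List, Set


def find_repeated_lines(pages: List[List[str]], min_share: float = 0.8) -> Set[str]:
    """Return lines that occur on at least `min_share` of all pages."""
    counts: Counter = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page})
    threshold = min_share * len(pages)
    return {line for line, n in counts.items() if line and n >= threshold}


def strip_noise(pages: List[List[str]], noise: Set[str]) -> List[List[str]]:
    """Drop the noise lines and keep everything around them."""
    return [[line for line in page if line.strip() not in noise] for page in pages]


# Made-up pages with a repeated first and last line.
pages = [
    ["Annual Report", "Revenue grew.", "Confidential"],
    ["Annual Report", "Costs fell.", "Confidential"],
    ["Annual Report", "Outlook is stable.", "Confidential"],
]

noise = find_repeated_lines(pages)
clean = strip_noise(pages, noise)
print(sorted(noise))
print(clean)
```

Exact string matching misses repeated elements that change slightly from page to page, so treat this as a starting point rather than a complete filter.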
Captions are generally short explanations located under figures or tables in documents.
Analysing them provides more context for the figures and therefore deepens the analysis of a document's contents.
The table's caption, highlighted in purple, was extracted by Structure Extraction.
You can now export the content and the semantic structure as JSON, or connect Acodis via the API to your preferred app and use the data in subsequent processing.
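For subsequent processing, an exported JSON file can be read with a few lines of Python. The sketch below is hypothetical: the schema (`blocks`, `type`, `level`, `text`, `caption`, `rows`) and the sample values are invented for illustration and are not the actual Acodis export format, so check the real export before relying on any field name.

```python
import json

# Made-up export; the real Acodis JSON schema may look different.
EXPORT = """
{
  "blocks": [
    {"type": "heading", "level": 1, "text": "Overview"},
    {"type": "paragraph", "text": "First paragraph."},
    {"type": "table", "caption": "Durchschnittsprämien",
     "rows": [["A", "1"], ["B", "2"]]}
  ]
}
"""


def blocks_of_type(document: dict, block_type: str) -> list:
    """Return all blocks whose (assumed) "type" field matches block_type."""
    return [b for b in document.get("blocks", []) if b.get("type") == block_type]


document = json.loads(EXPORT)
tables = blocks_of_type(document, "table")
headings = blocks_of_type(document, "heading")
print(tables[0]["caption"], len(tables[0]["rows"]))
print([h["text"] for h in headings])
```

Filtering by block type like this is usually the first step before loading tables into a spreadsheet or passing paragraphs to a search index.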
Extract content from Marketing and Sales literature to enable an automated chat-bot to help (prospective) customers learn about your product.
For more information on this and to get your own free demo, contact one of our experts and they'll gladly walk you through it.