Load and Export Layout Data

Dataframe and CSV

layoutparser.io.load_dataframe(df: pandas.core.frame.DataFrame, block_type: str = None) → layoutparser.elements.layout.Layout

Load the Layout object from the given dataframe.

Parameters
  • df (pd.DataFrame) – The dataframe to load. A row of the table represents an individual layout element.

  • block_type (str) – If there’s no block_type column in the dataframe, you must pass in a block_type value so that layoutparser can determine the type of the layout elements.

Returns

The parsed Layout object from the dataframe.

Return type

Layout

layoutparser.io.load_csv(filename: str, block_type: str = None) → layoutparser.elements.layout.Layout

Load the Layout object from the given CSV file.

Parameters
  • filename (str) – The name of the CSV file. A row of the table represents an individual layout element.

  • block_type (str) – If there’s no block_type column in the CSV file, you must pass in a block_type value so that layoutparser can determine the type of the layout elements.

Returns

The parsed Layout object from the CSV file.

Return type

Layout
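Whether block_type has to be passed depends on the CSV header. The following is a minimal sketch, using only the standard library, of checking for that column before calling load_csv. The helper name needs_block_type and the coordinate column names (x_1, y_1, x_2, y_2) are assumptions made for illustration; they are not part of the layoutparser API.

```python
import csv
import io


def needs_block_type(csv_text: str) -> bool:
    """Return True when the CSV header has no block_type column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    return "block_type" not in header


# Hypothetical CSV contents; the coordinate column names are assumed.
with_type = "x_1,y_1,x_2,y_2,block_type\n10,20,110,60,rectangle\n"
without_type = "x_1,y_1,x_2,y_2\n10,20,110,60\n"

print(needs_block_type(with_type))     # False
print(needs_block_type(without_type))  # True
```

When the helper returns True, the block_type argument of load_csv would need to be supplied by the caller.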

Dict and JSON

layoutparser.io.load_dict(data: Union[Dict, List[Dict]]) → Union[layoutparser.elements.base.BaseLayoutElement, layoutparser.elements.layout.Layout]

Load a dict or list of dict representations of some layout data, automatically parse its type, and return it as either a BaseLayoutElement or a Layout.

Parameters

data (Union[Dict, List]) – A dict or list of dict representations of the layout data.

Raises
  • ValueError – If the data format is incompatible with the layout-data-JSON format, raise a ValueError.

  • ValueError – If any block_type name is not in the available list of layout element names defined in BASECOORD_ELEMENT_NAMEMAP, raise a ValueError.

Returns

Based on the dict format, it will automatically parse the type of the data and load it accordingly.

Return type

Union[BaseLayoutElement, Layout]
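The type detection described above can be pictured as a dispatch on the shape of the input. The sketch below is not the library's implementation: classify_layout_data and KNOWN_BLOCK_TYPES are hypothetical names, the set of block type names is an assumption standing in for BASECOORD_ELEMENT_NAMEMAP, and the sketch ignores any other dict shapes the library may accept.

```python
from typing import Any, Dict, List, Union

# Assumed element names; the real list is defined in BASECOORD_ELEMENT_NAMEMAP.
KNOWN_BLOCK_TYPES = {"interval", "rectangle", "quadrilateral"}


def classify_layout_data(
    data: Union[Dict[str, Any], List[Dict[str, Any]]]
) -> str:
    """Name the kind of object the documented rules would produce."""
    if isinstance(data, list):
        for item in data:
            classify_layout_data(item)  # validate every element in the list
        return "Layout"
    if isinstance(data, dict):
        block_type = data.get("block_type")
        if block_type not in KNOWN_BLOCK_TYPES:
            raise ValueError(f"Unknown block_type: {block_type!r}")
        return "BaseLayoutElement"
    raise ValueError("Expected a dict or a list of dicts")


element = {"x_1": 10, "y_1": 20, "x_2": 110, "y_2": 60, "block_type": "rectangle"}
print(classify_layout_data(element))             # BaseLayoutElement
print(classify_layout_data([element, element]))  # Layout
```

Both failure modes listed under Raises map to the two ValueError branches in this sketch: an incompatible overall format, and an unrecognized block_type name.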

layoutparser.io.load_json(filename: str) → Union[layoutparser.elements.base.BaseLayoutElement, layoutparser.elements.layout.Layout]

Load a JSON file and save it as a layout object with appropriate data types.

Parameters

filename (str) – The name of the JSON file.

Returns

Based on the JSON file format, it will automatically parse the type of the data and load it accordingly.

Return type

Union[BaseLayoutElement, Layout]

PDF

layoutparser.io.load_pdf(filename: str, load_images: bool = False, x_tolerance: int = 1.5, y_tolerance: int = 2, keep_blank_chars: bool = False, use_text_flow: bool = True, horizontal_ltr: bool = True, vertical_ttb: bool = True, extra_attrs: Optional[List[str]] = None, dpi: int = 72) → Union[List[layoutparser.elements.layout.Layout], Tuple[List[layoutparser.elements.layout.Layout], List[Image.Image]]]

Load all tokens for each page from a PDF file, and save them in a list of Layout objects with the original page order.

Parameters
  • filename (str) – The path to the PDF file.

  • load_images (bool, optional) – Whether to load a screenshot for each page of the PDF file. When set to True, the function returns both the layout and the screenshot image for each page. Defaults to False.

  • x_tolerance (int, optional) – The threshold used for extracting “word tokens” from the PDF file. PDF characters are merged into a word token if the difference between the x_2 of one character and the x_1 of the next is less than or equal to x_tolerance. See details in pdfplumber’s documentation. Defaults to 1.5.

  • y_tolerance (int, optional) – The threshold used for extracting “word tokens” from the PDF file. PDF characters are merged into a word token if the difference between the y_2 of one character and the y_1 of the next is less than or equal to y_tolerance. See details in pdfplumber’s documentation. Defaults to 2.

  • keep_blank_chars (bool, optional) – When keep_blank_chars is set to True, blank characters are treated as part of a word, not as a space between words. See details in pdfplumber’s documentation. Defaults to False.

  • use_text_flow (bool, optional) – When use_text_flow is set to True, the PDF’s underlying flow of characters is used as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) See details in pdfplumber’s documentation. Defaults to True.

  • horizontal_ltr (bool, optional) – When horizontal_ltr is set to True, the document’s text is read from left to right; when set to False, from right to left. Defaults to True.

  • vertical_ttb (bool, optional) – When vertical_ttb is set to True, the document’s text is read from top to bottom; when set to False, from bottom to top. Defaults to True.

  • extra_attrs (Optional[List[str]], optional) – Passing a list of extra_attrs (e.g., ["fontname", "size"]) will restrict each word to characters that share exactly the same value for each of those attributes extracted by pdfplumber, and the resulting word dicts will indicate those attributes. See details in pdfplumber’s documentation. Defaults to ["fontname", "size"].

  • dpi (int, optional) – When loading images of the PDF, you can also specify the resolution (DPI, dots per inch) used to render the images. Higher DPI values give clearer images and larger file sizes. Setting dpi also automatically resizes the extracted pdf_layout to match the size of the images, so that the pdf_layouts are rendered correctly when visualized. Defaults to DEFAULT_PDF_DPI=72, which is also the default rendering DPI of the pdfplumber PDF parser.
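The x_tolerance rule can be illustrated with a small standalone sketch. It applies only the condition stated in the parameter description (merge when the gap between one character's x_2 and the next character's x_1 is at most x_tolerance) to characters on a single line. The function merge_chars and the character dicts are hypothetical; this is not pdfplumber's actual word extraction, which also accounts for y_tolerance, blank characters, and text flow.

```python
from typing import Dict, List


def merge_chars(chars: List[Dict], x_tolerance: float = 1.5) -> List[str]:
    """Group characters on one line into words using the x_tolerance rule."""
    words: List[str] = []
    current = ""
    prev_x2 = None
    for ch in chars:  # assumed to be sorted left to right
        close_enough = prev_x2 is not None and ch["x_1"] - prev_x2 <= x_tolerance
        if current and not close_enough:
            words.append(current)  # gap too wide: finish the current word
            current = ""
        current += ch["text"]
        prev_x2 = ch["x_2"]
    if current:
        words.append(current)
    return words


chars = [
    {"text": "H", "x_1": 0.0, "x_2": 5.0},
    {"text": "i", "x_1": 5.5, "x_2": 7.0},    # gap 0.5: same word
    {"text": "y", "x_1": 10.0, "x_2": 14.0},  # gap 3.0: new word
    {"text": "o", "x_1": 15.5, "x_2": 19.0},  # gap 1.5: same word (<=)
]
print(merge_chars(chars))  # ['Hi', 'yo']
```

Raising x_tolerance to 3.0 in this sketch would merge all four characters into a single token, which shows why a larger tolerance tends to glue neighbouring words together.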

Returns

List[Layout]:

When load_images=False, it will only load the pdf_tokens from the PDF file. Each element of the list holds all the tokens that appear on a single page, and the list follows the original PDF page order.

Tuple[List[Layout], List["Image.Image"]]:

When load_images=True, it returns a list of page images in addition to all_page_layout.

Return type

List[Layout]

Examples
>>> import layoutparser as lp
>>> pdf_layout = lp.load_pdf("path/to/pdf")
>>> pdf_layout[0] # the layout for page 0
>>> pdf_layout, pdf_images = lp.load_pdf("path/to/pdf", load_images=True)
>>> lp.draw_box(pdf_images[0], pdf_layout[0])
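The dpi parameter description says the extracted layout is resized to match the rendered images. A minimal sketch of the arithmetic this implies, assuming the scale factor is dpi divided by DEFAULT_PDF_DPI (72); scale_box is a hypothetical helper, not a layoutparser function.

```python
DEFAULT_PDF_DPI = 72  # default stated in the dpi parameter description


def scale_box(box, dpi):
    """Scale an (x_1, y_1, x_2, y_2) box from 72-DPI units to image pixels."""
    factor = dpi / DEFAULT_PDF_DPI  # assumed scaling rule
    return tuple(coord * factor for coord in box)


print(scale_box((72, 36, 144, 108), dpi=144))  # (144.0, 72.0, 288.0, 216.0)
```

Under this assumption, a token box read at the default 72 DPI doubles in every coordinate when the page is rendered at 144 DPI, so boxes drawn on the page image stay aligned with the text.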

Other Formats

Stay tuned! We are working on supporting more formats.