How Marker converts PDFs to Markdown

October 28, 2024

Marker is an open-source tool for converting PDF documents to Markdown. Apart from passing your PDF directly to an LLM, it's the most accurate tool available. It's also multilingual, supporting 90+ languages.

How it works

Marker uses a pipeline of neural networks to perform the conversion. The high-level pipeline is detailed in the convert_single_pdf function in the source code. In total, there are four main steps:

  1. Text detection: First, Marker uses a segmentation model (a Vision Transformer, or ViT) to detect sections of the page that contain text. This first pass can take in low-resolution images because it is only used to find the locations of the text.
  2. Text recognition: Next, Marker uses a recognition model that takes in each text segment and converts the raw pixels into a string of text. This model uses an encoder-decoder architecture, where the encoder compresses the image into a latent representation and the decoder reconstructs the text from that representation. The encoder is a custom Donut model, which is based on the Swin Transformer.
  3. Layout detection: Marker then uses a layout detection model to detect the structure of the page, including the locations of headers, footers, and non-textual elements. The layout detection model is another ViT-based segmentation model. Additional post-processing then merges adjacent tables, resolves overlapping text, and removes blank text boxes.
  4. Reading order: If you've ever used an OCR tool, you've probably noticed that the lines of text often don't match up with the actual reading order. This can be very annoying if you're trying to select or highlight text. To fix this, Marker uses another model to annotate each piece of text with a reading order index. This model is an encoder-decoder model that uses a Donut-based encoder and an mBART-based decoder.
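
The four steps above can be sketched as a small Python pipeline. This is only an illustration of the data flow, not Marker's actual code: the Block class, the convert_page function, the label names ("text", "header", "footer"), and the four callables standing in for the models are all hypothetical, and dropping headers and footers from the output is a choice made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Block:
    bbox: BBox
    text: str = ""
    label: str = "text"  # hypothetical layout labels: "text", "header", "footer"
    order: int = -1      # reading order index, set in step 4


def label_for(bbox: BBox, regions: List[Tuple[BBox, str]]) -> str:
    """Return the label of the first layout region containing the box center."""
    cx, cy = (bbox[0] + bbox[2]) / 2, (bbox[1] + bbox[3]) / 2
    for (x0, y0, x1, y1), label in regions:
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            return label
    return "text"


def convert_page(
    page: object,
    detect: Callable[[object], List[BBox]],
    recognize: Callable[[object, BBox], str],
    layout: Callable[[object], List[Tuple[BBox, str]]],
    order: Callable[[object, List[BBox]], List[int]],
) -> str:
    # Step 1: text detection finds where the text is.
    boxes = detect(page)
    # Step 2: text recognition turns each detected region into a string.
    blocks = [Block(bbox=b, text=recognize(page, b)) for b in boxes]
    # Post-processing: remove blank text boxes.
    blocks = [blk for blk in blocks if blk.text.strip()]
    # Step 3: layout detection labels each block by the region it falls in.
    regions = layout(page)
    for blk in blocks:
        blk.label = label_for(blk.bbox, regions)
    # Step 4: reading order assigns an index to each block, then we sort by it.
    for blk, idx in zip(blocks, order(page, [b.bbox for b in blocks])):
        blk.order = idx
    blocks.sort(key=lambda blk: blk.order)
    # Assumption of this sketch: headers and footers are left out of the output.
    return "\n\n".join(b.text for b in blocks if b.label == "text")


def demo() -> str:
    """Run the pipeline with stub functions in place of the neural networks."""
    lines = {
        (0, 0, 100, 10): "Page 1",
        (0, 20, 100, 30): "second line",
        (0, 10, 100, 20): "first line",
        (0, 40, 100, 50): "   ",  # blank box, removed in post-processing
    }
    regions = [((0, 0, 100, 10), "header"), ((0, 10, 100, 30), "text")]

    def order_by_top(page: object, boxes: List[BBox]) -> List[int]:
        # Stub for the reading order model: rank boxes from top to bottom.
        ranked = sorted(range(len(boxes)), key=lambda i: boxes[i][1])
        indices = [0] * len(boxes)
        for rank, i in enumerate(ranked):
            indices[i] = rank
        return indices

    return convert_page(
        page=None,
        detect=lambda page: list(lines),
        recognize=lambda page, box: lines[box],
        layout=lambda page: regions,
        order=order_by_top,
    )
```

Calling demo() returns the two body lines in reading order, even though the stub detector emits them out of order. Passing the models in as plain callables keeps each stage swappable, which is convenient when experimenting with merging or replacing steps.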

Marker can also extract equations and tables using a separate model for each.

Possible improvements

Marker is already quite accurate, but there are still a few areas where it can be improved. The biggest one is speed. Although it is much more accurate, Marker is about 100x slower than OCR tools like OCRmyPDF and docTR. Processing a 1,000-page PDF can take half an hour with Marker on a datacenter-grade GPU, whereas OCRmyPDF can do it in about 3 minutes on a CPU.
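
To put the 1,000-page example in per-page terms, here is a quick back-of-the-envelope calculation in Python. It uses only the two timings quoted above; the function name is made up for illustration. Note that the wall-clock gap in this particular example comes out to roughly 10x, and that it compares a GPU run against a CPU run, so it is not a like-for-like measurement of the two tools.

```python
def seconds_per_page(total_minutes: float, pages: int) -> float:
    """Average wall-clock seconds spent on each page."""
    return total_minutes * 60 / pages


# Timings from the example above: 1,000 pages in 30 minutes vs. 3 minutes.
marker_spp = seconds_per_page(30, 1000)    # Marker on a datacenter-grade GPU
ocrmypdf_spp = seconds_per_page(3, 1000)   # OCRmyPDF on a CPU

# Wall-clock ratio for this example only; hardware differences are not factored in.
wall_clock_ratio = marker_spp / ocrmypdf_spp

print(f"Marker:   {marker_spp:.2f} s/page")
print(f"OCRmyPDF: {ocrmypdf_spp:.2f} s/page")
print(f"Ratio:    {wall_clock_ratio:.1f}x")
```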

Intuitively, the text detection step could be combined with layout detection and reading order prediction, so that a single model handles all three. Alternatively, as models gradually get smaller and better, merging these steps might not be necessary.