
LlamaIndex • 2025-03-20
Parsing PDFs with LlamaParse: a how-to guide
Generative AI (GenAI) is rapidly and radically changing how we produce and consume information. But it’s a ravenous beast when it comes to data.
In order to reach timely and relevant conclusions, it must consume large volumes of accurate data. The large language models that drive much of AI are often trained on public web pages due to their abundance and easy access. However, not all information is available in HTML.
Millions of Adobe PDF (Portable Document Format) documents contain data in the form of facts, figures, equations, laws and regulations, etc. that could be extremely useful to extract and store. Getting that data out of those PDFs can be a challenge.
LlamaIndex’s LlamaParse simplifies extracting information from PDFs. Let us show you how it’s done with just a few lines of code.
Challenges with parsing PDFs
Unlike HTML, which inherently provides some structure to the content it presents, PDF was originally created to preserve the visual look of a printed document. Although the ability to encode structural metadata was eventually added, doing so can be a cumbersome process.
Sadly, apps that generate PDFs with included structural metadata often just get it wrong. PDFs generated by Adobe’s InDesign, for example, will often set the reading order of a page based upon the order the designer worked on the elements on that page.
What is LlamaParse?
LlamaParse is a GenAI-native parsing platform for parsing and transforming complex documents into clean data for LLM applications. LlamaParse integrates with LlamaIndex, the open source data orchestration framework for building large language model (LLM) applications. LlamaIndex makes it easier to build agents and the contextual data that supports them, leveraging AI to extract information from a number of document formats — including PDFs.
LlamaParse is really good at:
✅ Parsing a variety of unstructured file types (.pdf, .pptx, .docx, .xlsx, .html, .jpeg, and more).
✅ Parsing embedded tables accurately into text and semi-structured representations.
✅ Extracting data from visual elements (images/diagrams).
✅ Using natural language instructions to customize the output the way you want it.
Benefits of using LlamaParse for PDF extraction
You can parse PDFs on your own or with a standard, non-GenAI-enabled PDF parser. However, using LlamaParse provides numerous improvements:
- Significantly reduces the time and manual effort spent on data extraction by leveraging LLM intelligence
- Increases the volume of data available to your GenAI apps by freeing it from hard-to-parse sources
- Converts unstructured data into structured data
Extracting and storing information from PDFs using LlamaParse
You can use LlamaParse via our UI (LlamaCloud), the API, or one of our language SDKs. Below we’ll demonstrate some of LlamaParse’s features using Python. If you would prefer to try LlamaParse without the bother of coding, you can create a free LlamaCloud account and experiment using some of your own documents.
Getting started: Installing LlamaIndex and LlamaParse
First we need to install LlamaIndex and LlamaParse to make them available for our Python scripts.
Caution: We recommend creating and activating a virtual Python environment before proceeding to avoid issues caused by conflicting dependency versions.
```shell
pip install llama-index
pip install llama-parse
```
API key
You also need an API key. To create one, create a free LlamaCloud account.

Once you have created an account and are logged in, click API Keys from the left navigation and then the Generate New Key button.

Caution: Once your key is displayed, copy it and save it to a secure location. You won’t be able to redisplay it once you dismiss the screen. If you lose your key, you should revoke it and create a new one.
We recommend setting your API key as an environment variable in your Python virtual environment.

```shell
export LLAMA_CLOUD_API_KEY='yourkey12345…'
```
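If you go the environment-variable route, a quick sanity check in Python can save a confusing failure later. This small helper is just an illustration (our own code, not part of the LlamaParse API); only the variable name `LLAMA_CLOUD_API_KEY` comes from the setup above:

```python
import os

def get_llama_cloud_api_key() -> str:
    """Return the LlamaCloud API key from the environment, failing fast if unset."""
    key = os.environ.get("LLAMA_CLOUD_API_KEY")
    if not key:
        raise RuntimeError(
            "LLAMA_CLOUD_API_KEY is not set; export it, or pass api_key= to LlamaParse."
        )
    return key
```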
Now we’re ready to begin parsing PDF files. Let’s write some code.
Parse your first PDF with LlamaParse and Python
We’ll begin by parsing a two-page report comparing investment returns from the Nasdaq-100 vs. the S&P 500. You can download the PDF file with your browser or by using wget.
```shell
wget https://indexes.nasdaqomx.com/docs/NDX-vs-SPX_2%20pager.pdf
```
Now we’ll create a simple Python script to parse the PDF file and save its content as markdown.
```python
from llama_parse import LlamaParse

parser = LlamaParse(
    # api_key="llx-...",  # set this if you did not create an environment variable
    result_type="markdown",  # "markdown" and "text" are available
)

file_name = "NDX-vs-SPX_2 pager.pdf"
extra_info = {"file_name": file_name}

with open(file_name, "rb") as f:
    # extra_info with a file_name key is required when passing a file object
    documents = parser.load_data(f, extra_info=extra_info)

# Write the output to a file
with open("output.md", "w", encoding="utf-8") as f:
    for doc in documents:
        f.write(doc.text)
```
Save the script as parse.py and then run it.

```shell
python parse.py
```
Once the script finishes, open output.md. You’ll see that LlamaParse has successfully extracted the content from the PDF and correctly applied structure via headers. Notice that the table at the bottom of the first page was also recreated in markdown.
| |Nasdaq-100 TR|S&P 500 TR|
|---|---|---|
|Cumulative Return|315%|156%|
|Annualized Return|13%|9%|
|Annualized Volatility|22%|20%|
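Because the result is plain markdown, you can post-process it with nothing but the standard library. As a minimal sketch (our own helper, not a LlamaParse feature), here is one way to turn a markdown table like the one above into Python rows:

```python
def parse_markdown_table(md: str) -> list[list[str]]:
    """Convert a markdown table into a list of rows of cell strings,
    skipping the |---|---| divider line."""
    rows = []
    for line in md.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        # Skip the header/body divider, whose cells contain only dashes and colons
        if all(c.replace("-", "").replace(":", "") == "" for c in cells):
            continue
        rows.append(cells)
    return rows
```

Feeding it the returns table above yields a header row followed by one row per metric, ready to load into a DataFrame or a database.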
Extracting data from charts and graphs
However, you may notice that the data from the two graphs on the first page are missing. This is actually by design. LlamaParse supports multiple parsing modes, which allows you to balance speed, cost, and advanced parsing power based upon your needs at the moment.
The default mode, which we used above, skips most graphs since they require more advanced and costly processing. But we can enable LlamaParse to utilize a different mode with a slight configuration change to our parser.
```python
from llama_parse import LlamaParse

parser = LlamaParse(
    # api_key="llx-...",  # set this if you did not create an environment variable
    result_type="markdown",  # "markdown" and "text" are available
    extract_charts=True,
    auto_mode=True,
    auto_mode_trigger_on_image_in_page=True,
    auto_mode_trigger_on_table_in_page=True,
)

file_name = "NDX-vs-SPX_2 pager.pdf"
extra_info = {"file_name": file_name}

with open(file_name, "rb") as f:
    # extra_info with a file_name key is required when passing a file object
    documents = parser.load_data(f, extra_info=extra_info)

# Write the output to a file
with open("output.md", "w", encoding="utf-8") as f:
    for doc in documents:
        f.write(doc.text)
```
When the parser completes its work, you’ll see, for example, that the industry breakdown chart on the second page has been parsed and rendered as a table.
Industry (ICB) Breakdown
| Industry | Nasdaq-100 Industry (ICB) Weights | S&P 500 Industry (ICB) Weights |
|----------|-----------------------------------|--------------------------------|
| Technology | 55% | 22% |
| Consumer Services | 25% | 14% |
| Health Care | 8% | 13% |
| Consumer Goods | 6% | 8% |
| Industrials | 6% | 12% |
| Telecommunications | 1% | 2% |
| Utilities | 3% | 3% |
| Financials | - | 18% |
| Oil & Gas | - | 5% |
| Basic Materials | 2% | 2% |
Learn more about parsing modes and how LlamaParse auto mode enables you to optimize your parsing costs by only invoking the premium modes when they are genuinely needed.
Storing your data in a vector database
Once you’ve extracted your data from the PDF, you can send it to a vector database such as Elasticsearch. You’ll need API keys for OpenAI and Elastic Cloud, plus a few extra dependencies installed, including nest_asyncio. From there it’s just a few more lines of code to store your data.
```python
import nest_asyncio
nest_asyncio.apply()

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.elasticsearch import ElasticsearchStore

es_store = ElasticsearchStore(
    index_name="llama-parse-docs",
    es_cloud_id=es_cloud_id,  # found on the deployment page
    es_api_key=es_api_key,  # create an API key within Kibana (Security -> API Keys)
)

# Split the parsed documents into nodes
node_parser = SimpleNodeParser()
nodes = node_parser.get_nodes_from_documents(documents)

# Embed the nodes and store them in Elasticsearch
storage_context = StorageContext.from_defaults(vector_store=es_store)
index = VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
    embed_model=OpenAIEmbedding(api_key=openai_api_key),
)
```
We won’t go through the full process here, but we’ve prepared detailed examples of sending extracted data to Elasticsearch as well as to Astra DB in our repo.
Fine-tuning your PDF parsing
Almost any document that can be displayed on a computer screen can be saved as a PDF, which, as we discussed earlier, makes PDFs difficult to parse consistently. LlamaParse supports numerous ways to adjust the parsing instructions fed to the LLM for better results or advanced capabilities.
Translate your output
Have a document in Spanish but need to store its content in English? LlamaParse can do that.
Download this Spanish language document.
```shell
wget https://www.scusd.edu/sites/main/files/file-attachments/howtohelpyourchildsucceedinschoolspanish.pdf
```
LlamaParse lets you customize the prompt sent to the LLM that initiates parsing.
```python
from llama_parse import LlamaParse

parser = LlamaParse(
    result_type="markdown",
    user_prompt="If the input is not in English, translate the output into English.",
)

file_name = "howtohelpyourchildsucceedinschoolspanish.pdf"
extra_info = {"file_name": file_name}

with open(file_name, "rb") as f:
    # extra_info with a file_name key is required when passing a file object
    documents = parser.load_data(f, extra_info=extra_info)

# Write the output to a file
with open("output.md", "w", encoding="utf-8") as f:
    for doc in documents:
        f.write(doc.text)
```
Prompts are incredibly powerful and provide you with granular control of your output. Learn more about how to use prompts in the docs.
Limiting/targeting your parsing
If you only need to parse selected pages from your PDF, you don’t need to extract those pages from your source file before parsing. Just pass the pages you wish to parse.
```python
parser = LlamaParse(
    target_pages="0,10,12,22-33"
)
```
Note that pages are numbered starting at 0.
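To make the syntax concrete, here is a small helper of our own (an illustration only, not part of LlamaParse) that expands a target_pages-style string into the zero-based page numbers it selects, assuming ranges are inclusive:

```python
def expand_target_pages(spec: str) -> list[int]:
    """Expand a spec like "0,10,12,22-33" into a sorted list of page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(p) for p in part.split("-"))
            pages.update(range(start, end + 1))  # treat ranges as inclusive
        else:
            pages.add(int(part))
    return sorted(pages)
```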
If your source file has headers or footers, you can instruct LlamaParse to ignore content that falls within a specified section of the page. Here we instruct the parser to ignore the top 10% of the page and the bottom 5%.
```python
parser = LlamaParse(
    bbox_top=0.1,
    bbox_bottom=0.05
)
```
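Those bbox values are fractions of the page height rather than absolute units. As a sketch of the arithmetic only (assuming a simple top-down coordinate system; this is not LlamaParse code), here is the region that survives trimming:

```python
def content_region(page_height: float, top_frac: float, bottom_frac: float) -> tuple[float, float]:
    """Return (top, bottom) coordinates of the area kept after ignoring
    top_frac of the page at the top and bottom_frac at the bottom."""
    top = page_height * top_frac
    bottom = page_height * (1 - bottom_frac)
    return top, bottom
```

For an 800-point page with bbox_top=0.1 and bbox_bottom=0.05, roughly the region from 80 to 760 points is parsed.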
More examples
There are dozens of additional ways you can fine-tune your parsing to fit your specific needs. You can apply output schemas to extract structured data from documents such as invoices and resumes. Although we have focused on parsing PDF files, LlamaParse supports extracting data from dozens of different file formats, including audio files.
To help you explore the full range of capabilities, we’ve compiled dozens of examples and made them available in our repo.
Conclusion
LlamaParse makes parsing PDFs and other unstructured data easier and less labor-intensive than ever. It dramatically increases the data available to your GenAI apps while freeing your developers to innovate rather than spending time writing bespoke apps to extract and parse data from hard-to-crack sources.
Get started with LlamaParse today and parse up to 1000 pages per day for free.