PDF Component
Search and extract data from PDF documents
Component key: pdf
Description
PDF (Portable Document Format) is a file format developed by Adobe for presenting documents independently of software, hardware, or operating systems. The pdf component finds text in PDF documents, lists page numbers, extracts specific pages, and extracts text, layout, and table data from a document.
This component does not support encrypted PDF documents.
Actions
Extract All Text
Extracts all text from the specified PDF document and returns it as an array of text strings. | key: extractAllText
| Input | Notes | Example |
|---|---|---|
| PDF Data | The PDF file data to process. This can be a file reference from a previous step. | |
{
"data": [
"Product Specification Document\n\nRevision: 2.1\nDate: 2024-01-15\n\nOverview\nThis document outlines the technical specifications for the XR-500 series product line. The XR-500 incorporates advanced features including wireless connectivity, enhanced power management, and improved thermal design.",
"Technical Specifications\n\nDimensions: 15.2 x 8.4 x 2.1 inches\nWeight: 3.5 lbs\nPower Requirements: 100-240V AC, 50/60Hz\nOperating Temperature: 0°C to 40°C\nStorage Temperature: -20°C to 60°C\n\nConnectivity\n- WiFi 802.11ax (WiFi 6)\n- Bluetooth 5.2\n- USB-C 3.2 Gen 2\n- Ethernet (RJ45) 10/100/1000 Mbps",
"Safety and Compliance\n\nThis product meets the following standards:\n- FCC Part 15 Class B\n- CE Marking (EU)\n- RoHS Compliant\n- UL Listed\n\nWarranty Information\nStandard warranty: 2 years from date of purchase\nExtended warranty options available\n\nFor support, contact: support@example.com\nPhone: 1-555-SUPPORT (1-555-787-7678)"
]
}
Extract Page
Extracts the specified page from the PDF document and returns it as a new separate PDF document. | key: extractPage
| Input | Notes | Example |
|---|---|---|
| Page Number | The page number to extract from the PDF. | 5 |
| PDF Data | The PDF file data to process. This can be a file reference from a previous step. | |
{
"data": {
"type": "Buffer",
"data": [
37,
80,
68,
70,
45,
49,
46,
52,
10,
49,
32,
48,
32,
111,
98,
106,
10,
60,
60,
10,
47,
84,
121,
112,
101,
32,
47,
67,
97,
116,
97,
108,
111,
103,
10,
47,
80,
97,
103,
101,
115,
32,
50,
32,
48,
32,
82,
10,
62,
62,
10,
101,
110,
100,
111,
98,
106,
10
]
},
"contentType": "application/pdf"
}
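The sample output above serializes the extracted page as a Buffer-style object: a `type` field plus a `data` array of byte values. As a minimal sketch, assuming a downstream consumer receives this JSON shape unchanged, the array can be turned back into raw bytes and checked for the `%PDF-` header. The helper name `buffer_to_bytes` is hypothetical, and the sample is shortened to the first eight bytes.

```python
import json

# Abbreviated copy of the sample output (first eight bytes only).
sample = json.loads("""
{
  "data": {"type": "Buffer", "data": [37, 80, 68, 70, 45, 49, 46, 52]},
  "contentType": "application/pdf"
}
""")


def buffer_to_bytes(payload: dict) -> bytes:
    """Convert a serialized Buffer ({"type": "Buffer", "data": [...]}) to bytes.

    Hypothetical helper; not part of the component.
    """
    buffer = payload["data"]
    if buffer.get("type") != "Buffer":
        raise ValueError("expected a serialized Buffer object")
    return bytes(buffer["data"])


pdf_bytes = buffer_to_bytes(sample)
print(pdf_bytes)  # b'%PDF-1.4'
```

The leading bytes decode to `%PDF-1.4`, the header that identifies the data as a PDF file.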
Extract Page Text
Extracts text from the specified page range in the PDF document. | key: extractPageText
| Input | Notes | Example |
|---|---|---|
| Page End | The ending page number for extraction. If not provided, only the start page is extracted. | 5 |
| Page Start | The starting page number for extraction. | 1 |
| PDF Data | The PDF file data to process. This can be a file reference from a previous step. | |
{
"data": [
"Product Specification Document\n\nRevision: 2.1\nDate: 2024-01-15\n\nOverview\nThis document outlines the technical specifications for the XR-500 series product line. The XR-500 incorporates advanced features including wireless connectivity, enhanced power management, and improved thermal design."
]
}
Extract Structured Text
Extracts all text items from the PDF with their position coordinates, dimensions, font metadata, and layout flags for custom parsing. | key: extractStructuredText
| Input | Notes | Example |
|---|---|---|
| Page End | The ending page number for extraction. If not provided, extraction continues to the last page. | 5 |
| Page Start | The starting page number for extraction. If not provided, extraction starts from the first page. | 1 |
| PDF Data | The PDF file data to process. This can be a file reference from a previous step. | |
{
"data": [
{
"pageNumber": 1,
"pageWidth": 612,
"pageHeight": 792,
"items": [
{
"text": "Product Specification",
"x": 72,
"y": 48.96,
"width": 142.8,
"height": 14.4,
"fontName": "g_d0_f1",
"hasEOL": true,
"direction": "ltr",
"pageNumber": 1,
"pageWidth": 612,
"pageHeight": 792
},
{
"text": "Revision: 2.1",
"x": 72,
"y": 80.64,
"width": 78.96,
"height": 11.04,
"fontName": "g_d0_f2",
"hasEOL": true,
"direction": "ltr",
"pageNumber": 1,
"pageWidth": 612,
"pageHeight": 792
}
]
}
]
}
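Because each item carries `x` and `y` coordinates, custom parsing often starts by grouping items into lines. The following is a rough sketch of one way to do that, not the component's own logic. It assumes that smaller `y` values sit higher on the page (consistent with the sample, where the title has the smallest `y`); the function name, the tolerance default, and the second "2.1" fragment are illustrative assumptions.

```python
def group_into_lines(items, tolerance=2.0):
    """Group text items into lines by comparing y coordinates.

    An item joins the current line when its y value is within
    `tolerance` points of the first item on that line.
    """
    lines = []
    for item in sorted(items, key=lambda it: (it["y"], it["x"])):
        if lines and abs(item["y"] - lines[-1][0]["y"]) <= tolerance:
            lines[-1].append(item)
        else:
            lines.append([item])
    # Order each line left to right and join its text fragments.
    return [
        " ".join(it["text"] for it in sorted(line, key=lambda it: it["x"]))
        for line in lines
    ]


items = [
    {"text": "Product Specification", "x": 72, "y": 48.96},
    {"text": "2.1", "x": 130, "y": 80.9},  # hypothetical split fragment
    {"text": "Revision:", "x": 72, "y": 80.64},
]
print(group_into_lines(items))  # ['Product Specification', 'Revision: 2.1']
```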
Extract Table Data
Detects and extracts tabular structures from the PDF using coordinate-based row and column clustering, returning two-dimensional string arrays. | key: extractTableData
| Input | Notes | Example |
|---|---|---|
| Column Tolerance | X-coordinate tolerance in PDF points for detecting table column boundaries. Default is 10 points. Decrease for dense tables, increase for tables with wider spacing. | 10 |
| Page End | The ending page number for extraction. If not provided, extraction continues to the last page. | 5 |
| Page Start | The starting page number for extraction. If not provided, extraction starts from the first page. | 1 |
| PDF Data | The PDF file data to process. This can be a file reference from a previous step. | |
| Row Tolerance | Y-coordinate tolerance in PDF points for grouping text items into table rows. Default is 3 points. | 3 |
{
"data": [
{
"pageNumber": 2,
"tables": [
{
"rows": [
[
"Property",
"Value",
"Unit"
],
[
"Dimensions",
"15.2 x 8.4 x 2.1",
"inches"
],
[
"Weight",
"3.5",
"lbs"
]
],
"boundingBox": {
"x": 72,
"y": 144,
"width": 468,
"height": 48
}
}
]
}
]
}
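Each table in the sample output is a two-dimensional array of strings under `rows`. When the first row holds column headers, as it does in the sample, the remaining rows can be converted into records keyed by header. This is a small sketch under that assumption; `table_to_records` is a hypothetical helper, and tables without a header row would need different handling.

```python
def table_to_records(table: dict) -> list[dict]:
    """Treat the first row as headers and map each later row to a dict."""
    header, *body = table["rows"]
    return [dict(zip(header, row)) for row in body]


# Table taken from the sample output (bounding box omitted).
table = {
    "rows": [
        ["Property", "Value", "Unit"],
        ["Dimensions", "15.2 x 8.4 x 2.1", "inches"],
        ["Weight", "3.5", "lbs"],
    ]
}

records = table_to_records(table)
print(records[1])  # {'Property': 'Weight', 'Value': '3.5', 'Unit': 'lbs'}
```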
Extract Text by Pattern
Extracts text from the specified PDF document that matches the search pattern. | key: extractTextByPattern
| Input | Notes | Example |
|---|---|---|
| Case Sensitive | When true, the search is case-sensitive. | true |
| Characters After | The number of characters to extract after the search pattern. If not provided, the entire page is returned. | 10 |
| Search Pattern | The text to search for in the PDF document. | Some Text |
| PDF Data | The PDF file data to process. This can be a file reference from a previous step. | |
{
"data": [
"Technical Specifications\n\nDimensions: 15.2 x 8.4 x 2.1 inches\nWeight: 3.5 lbs\nPower Requirements: 100-240V AC, 50/60Hz\nOperating Temperature: 0°C to 40°C\nStorage Temperature: -20°C to 60°C\n\nConnectivity\n- WiFi 802.11ax (WiFi 6)\n- Bluetooth 5.2\n- USB-C 3.2 Gen 2\n- Ethernet (RJ45) 10/100/1000 Mbps"
]
}
Extract Text with Layout
Extracts text from the PDF with line breaks and paragraph spacing preserved from the original document layout. | key: extractTextWithLayout
| Input | Notes | Example |
|---|---|---|
| Line Tolerance | Y-coordinate tolerance in PDF points for grouping text items into lines. Items within this vertical distance are considered same-line. Default is 2 points. Increase for PDFs with inconsistent text positioning. | 2 |
| Page End | The ending page number for extraction. If not provided, extraction continues to the last page. | 5 |
| Page Start | The starting page number for extraction. If not provided, extraction starts from the first page. | 1 |
| PDF Data | The PDF file data to process. This can be a file reference from a previous step. | |
{
"data": [
{
"pageNumber": 1,
"text": "Product Specification Document\n\nRevision: 2.1\nDate: 2024-01-15\n\nOverview\nThis document outlines the technical specifications\nfor the XR-500 series product line."
}
]
}
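Since this action preserves line breaks and paragraph spacing, the `text` field can be split on blank lines to recover paragraphs. The sketch below assumes paragraphs are separated by exactly one blank line, as in the sample output; `split_paragraphs` is a hypothetical helper name.

```python
def split_paragraphs(text: str) -> list[list[str]]:
    """Split layout-preserved text into paragraphs, each a list of lines."""
    return [block.splitlines() for block in text.split("\n\n") if block.strip()]


# The "text" value from the sample output.
page_text = (
    "Product Specification Document\n\n"
    "Revision: 2.1\nDate: 2024-01-15\n\n"
    "Overview\nThis document outlines the technical specifications\n"
    "for the XR-500 series product line."
)

paragraphs = split_paragraphs(page_text)
print(len(paragraphs))  # 3
print(paragraphs[1])    # ['Revision: 2.1', 'Date: 2024-01-15']
```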
Find Pattern
Searches the PDF document and returns page numbers containing text that matches the search criteria. | key: findPattern
| Input | Notes | Example |
|---|---|---|
| Case Sensitive | When true, the search is case-sensitive. | true |
| Contains | When true, returns pages containing the pattern; when false, returns pages without the pattern. | true |
| Search Pattern | The text pattern to search for in the PDF document. | Some Text |
| PDF Data | The PDF file data to process. This can be a file reference from a previous step. | |
| Use Regex | When true, treats the search pattern as a regular expression. | true |
{
"data": [
2,
4,
7
]
}
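To make the interaction of the Case Sensitive, Contains, and Use Regex inputs concrete, here is a small model of the documented behavior applied to a list of page texts. It is an illustration only and not the component's implementation; the function name `find_pages` and the sample page texts are assumptions.

```python
import re


def find_pages(page_texts, pattern, case_sensitive=True, contains=True, use_regex=False):
    """Return 1-based page numbers that match (or, if contains=False, do not match)."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # Without regex mode the pattern is matched literally.
    source = pattern if use_regex else re.escape(pattern)
    compiled = re.compile(source, flags)
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if bool(compiled.search(text)) == contains
    ]


# Hypothetical page texts, one string per page.
pages = ["Overview of the XR-500", "Technical Specifications", "Warranty: 2 years"]

print(find_pages(pages, "xr-500", case_sensitive=False))  # [1]
print(find_pages(pages, "xr-500", case_sensitive=True))   # []
print(find_pages(pages, r"\d+ years", use_regex=True))    # [3]
print(find_pages(pages, "Warranty", contains=False))      # [1, 2]
```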
Find Text Position
Searches the PDF document and returns the position coordinates of all occurrences of the specified text. | key: findTextPosition
| Input | Notes | Example |
|---|---|---|
| Case Sensitive | When true, the search is case-sensitive. | true |
| Page Number | Limit the search to a specific page number. If not provided, all pages are searched. | 1 |
| PDF Data | The PDF file data to process. This can be a file reference from a previous step. | |
| Search Text | The text to search for in the PDF document. | Tenant |
{
"data": [
{
"pageNumber": 6,
"text": "Segments",
"x": 101.53,
"y": 509.28,
"width": 47.25,
"height": 13.44,
"pageWidth": 612,
"pageHeight": 792
}
]
}
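Assuming the coordinates are reported in PDF points, as the tolerance inputs of the other extraction actions are, they can be converted to inches using the standard 72 points per inch. The sketch below applies that conversion to the sample match; `to_inches` is a hypothetical helper and the two-decimal rounding is an arbitrary choice.

```python
POINTS_PER_INCH = 72  # one PDF point is 1/72 inch

# The match from the sample output.
match = {
    "pageNumber": 6,
    "text": "Segments",
    "x": 101.53,
    "y": 509.28,
    "width": 47.25,
    "height": 13.44,
    "pageWidth": 612,
    "pageHeight": 792,
}


def to_inches(position: dict) -> dict:
    """Convert the point-based fields of a match to inches."""
    keys = ("x", "y", "width", "height", "pageWidth", "pageHeight")
    return {key: round(position[key] / POINTS_PER_INCH, 2) for key in keys}


inches = to_inches(match)
print(inches["pageWidth"], inches["pageHeight"])  # 8.5 11.0
```

A 612 x 792 point page is therefore US Letter size, 8.5 x 11 inches.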
Page Numbers
Returns a sequence of page numbers for the PDF document, from 1 to the last page. | key: pageNumbers
| Input | Notes | Example |
|---|---|---|
| PDF Data | The PDF file data to process. This can be a file reference from a previous step. | |
{
"data": [
1,
2,
3,
4,
5
]
}
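One possible use of this output is to split a long document into page ranges that are then passed as Page Start and Page End to a range-based action such as Extract Page Text. The following sketch shows only the batching arithmetic; `page_ranges` and the batch size of 2 are illustrative assumptions.

```python
def page_ranges(page_numbers: list[int], batch_size: int) -> list[tuple[int, int]]:
    """Split a page-number sequence into inclusive (start, end) ranges."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ranges = []
    for index in range(0, len(page_numbers), batch_size):
        batch = page_numbers[index:index + batch_size]
        ranges.append((batch[0], batch[-1]))
    return ranges


# Page numbers from the sample output.
print(page_ranges([1, 2, 3, 4, 5], batch_size=2))  # [(1, 2), (3, 4), (5, 5)]
```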
Changelog
2026-03-30
Added PDF text and table extraction actions:
- Find Text Position - Returns coordinate positions of text occurrences within a PDF document
- Extract Structured Text - Returns all text items with position coordinates, dimensions, and font metadata for custom parsing
- Extract Text with Layout - Preserves line breaks and paragraph spacing from the original PDF layout
- Extract Table Data - Detects and extracts tabular structures as two-dimensional arrays using coordinate-based heuristics
2026-02-12
Improved error handling across PDF processing actions with descriptive error messages and detection of corrupted or invalid PDF structures