PDF Component
Search and extract data from PDF documents
Component key: pdf
Description
The pdf component provides actions for working with PDF documents, such as finding text in a document, listing its page numbers, or extracting a specific page.
This component cannot currently handle PDF documents that are encrypted.
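Because encrypted documents are not supported, it can be useful to check the input before passing it to an action. The following is a minimal sketch, not part of the component: the function names are hypothetical, and the check is a heuristic based on the fact that an encrypted PDF declares an /Encrypt entry in its trailer dictionary. A plain byte search can report a false positive if that string happens to appear elsewhere in the file.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes begin with the standard PDF header."""
    return data.startswith(b"%PDF-")


def looks_encrypted(data: bytes) -> bool:
    """Heuristic only: encrypted PDFs reference an /Encrypt dictionary."""
    return b"/Encrypt" in data


# Tiny hand-written stand-ins for real files, used only to exercise the checks.
plain_sample = b"%PDF-1.7\ntrailer\n<< /Root 1 0 R >>\n%%EOF"
encrypted_sample = b"%PDF-1.7\ntrailer\n<< /Encrypt 5 0 R /Root 1 0 R >>\n%%EOF"

for name, sample in [("plain", plain_sample), ("encrypted", encrypted_sample)]:
    print(name, looks_like_pdf(sample), looks_encrypted(sample))
```

A check like this only flags likely problems early; a real PDF parser remains the reliable way to confirm whether a document is encrypted.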
Actions
Extract All Text
Extracts all the text from the specified PDF document and returns it as an array of strings. | key: extractAllText
Input | Notes |
---|---|
PDF data data / Required pdfData | This must refer to a buffer containing the raw bytes of a PDF. |
Example Payload for Extract All Text
{
"data": [
"This is an example text extracted from page 1 of a PDF document.",
"This is an example text extracted from page 2 of a PDF document.",
"This is an example text extracted from page 3 of a PDF document."
]
}
Extract Page
Returns the specified page in the given PDF document as a new separate PDF document. | key: extractPage
Input | Notes | Example |
---|---|---|
Page Number string / Required pageNumber | The number of the page to extract from the PDF document. | 5 |
PDF data data / Required pdfData | This must refer to a buffer containing the raw bytes of a PDF. | |
Example Payload for Extract Page
{
"data": {
"type": "Buffer",
"data": [
69,
120,
97,
109,
112,
108,
101
]
},
"contentType": "application/pdf"
}
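The example payload above shows the extracted page as a serialized buffer: an object with "type": "Buffer" and a list of byte values. As a rough sketch of how a consumer might turn that shape back into raw bytes, assuming the payload has already been parsed from JSON (the helper name is hypothetical):

```python
import json


def buffer_payload_to_bytes(payload: dict) -> bytes:
    """Convert a payload whose "data" field is a serialized Buffer into raw bytes."""
    buffer = payload["data"]
    if buffer.get("type") != "Buffer":
        raise ValueError("payload['data'] is not a serialized Buffer")
    # Each list entry is one byte value in the range 0-255.
    return bytes(buffer["data"])


example_payload = json.loads("""
{
  "data": {"type": "Buffer", "data": [69, 120, 97, 109, 112, 108, 101]},
  "contentType": "application/pdf"
}
""")

raw = buffer_payload_to_bytes(example_payload)
print(raw)  # b'Example'
```

The bytes in the documented example spell the placeholder word "Example"; a real result would contain the bytes of a complete PDF file.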
Extract Page Text
Extracts the text of the pages in the specified page range from the PDF document. | key: extractPageText
Input | Notes | Example |
---|---|---|
Page End string pageEnd | This specifies the last page to extract from the PDF document. If omitted, only the page given by pageStart is extracted. | 5 |
Page Start string / Required pageStart | This specifies the first page to extract from the PDF document. | 1 |
PDF data data / Required pdfData | This must refer to a buffer containing the raw bytes of a PDF. | |
Example Payload for Extract Page Text
{
"data": [
"This is an example text extracted from page 1 of a PDF document."
]
}
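The relationship between pageStart and pageEnd can be sketched as follows. This is an illustration only, with a hypothetical function name: it assumes pages are numbered from 1, that the range is inclusive at both ends, and that both inputs arrive as strings, as the input table indicates.

```python
from typing import List, Optional


def resolve_page_range(page_start: str, page_end: Optional[str] = None) -> List[int]:
    """Return the page numbers selected by pageStart and an optional pageEnd."""
    start = int(page_start)
    # When pageEnd is missing or empty, only the pageStart page is selected.
    end = int(page_end) if page_end not in (None, "") else start
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {page_start!r} to {page_end!r}")
    return list(range(start, end + 1))


print(resolve_page_range("1"))       # [1]
print(resolve_page_range("1", "5"))  # [1, 2, 3, 4, 5]
```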
Extract Text by Pattern
Extracts text from the specified PDF document that matches the search text. | key: extractTextByPattern
Input | Default | Notes | Example |
---|---|---|---|
Case Sensitive boolean / Required caseSensitive | false | This specifies whether searching should be case-sensitive. You can choose true or false. | true |
Characters After string charactersAfter | | This specifies the number of characters to extract from the PDF document after the matched search pattern. If not specified, the entire page is returned. | 10 |
Search Pattern string / Required pattern | | This is the text to search for in the PDF document. | Some Text |
PDF data data / Required pdfData | | This must refer to a buffer containing the raw bytes of a PDF. | |
Example Payload for Extract Text by Pattern
{
"data": [
"This is an example text extracted from page 1 of a PDF document.",
"This is an example text extracted from page 2 of a PDF document.",
"This is an example text extracted from page 3 of a PDF document."
]
}
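To illustrate how the caseSensitive and charactersAfter inputs could interact, here is a small sketch that works on page text that has already been extracted. It is not the component's implementation. The function name is hypothetical, charactersAfter is taken as an integer for simplicity, and the sketch assumes that only the first match on each page is used and that the returned text starts immediately after the match.

```python
import re
from typing import List, Optional


def extract_after_pattern(
    pages: List[str],
    pattern: str,
    case_sensitive: bool = False,
    characters_after: Optional[int] = None,
) -> List[str]:
    """Return text from each page that contains the literal search pattern."""
    flags = 0 if case_sensitive else re.IGNORECASE
    regex = re.compile(re.escape(pattern), flags)  # treat the pattern as plain text
    results = []
    for text in pages:
        match = regex.search(text)
        if match is None:
            continue  # pages without a match contribute nothing
        if characters_after is None:
            results.append(text)  # no limit given: return the entire page
        else:
            results.append(text[match.end():match.end() + characters_after])
    return results


pages = ["Invoice Total: 42.00 USD", "No match on this page"]
print(extract_after_pattern(pages, "total: ", characters_after=5))  # ['42.00']
```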
Find Pattern
Searches the PDF document and returns a list of page numbers containing text that satisfies the search criteria. | key: findPattern
Input | Default | Notes | Example |
---|---|---|---|
Case Sensitive boolean / Required caseSensitive | false | This specifies whether searching should be case-sensitive. You can choose true or false. | true |
Contains? boolean / Required contains | true | This specifies whether to return the page numbers that contain the search pattern or those that do not. Options are true or false. | |
Search Pattern string / Required pattern | | This is the pattern to search for in the PDF document. | Some Text |
PDF data data / Required pdfData | | This must refer to a buffer containing the raw bytes of a PDF. | |
Use Regex boolean / Required useRegex | false | This specifies whether the search pattern is a regular expression. | true |
Example Payload for Find Pattern
{
"data": [
1,
2,
3,
4,
5
]
}
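The way the caseSensitive, contains, and useRegex inputs combine can be sketched over a list of already-extracted page texts. This is a hedged illustration with a hypothetical function name, assuming 1-based page numbers and Python's regular expression syntax, which may differ from the syntax the component accepts.

```python
import re
from typing import List


def find_pattern(
    pages: List[str],
    pattern: str,
    case_sensitive: bool = False,
    contains: bool = True,
    use_regex: bool = False,
) -> List[int]:
    """Return 1-based page numbers whose text does (or does not) match."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # Escape the pattern unless the caller asked for regular expression matching.
    regex = re.compile(pattern if use_regex else re.escape(pattern), flags)
    return [
        number
        for number, text in enumerate(pages, start=1)
        if (regex.search(text) is not None) == contains
    ]


pages = ["Some Text here", "nothing relevant", "some   text again"]
print(find_pattern(pages, "Some Text"))                       # [1]
print(find_pattern(pages, "Some Text", contains=False))       # [2, 3]
print(find_pattern(pages, r"some\s+text", use_regex=True))    # [1, 3]
```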
Page Numbers
Returns a sequence of page numbers for the specified PDF document, from 1 to the last page number. | key: pageNumbers
Input | Notes |
---|---|
PDF data data / Required pdfData | This must refer to a buffer containing the raw bytes of a PDF. |
Example Payload for Page Numbers
{
"data": [
1,
2,
3,
4,
5
]
}