Loop Over and Process Files
In this tutorial we will build an integration that downloads and processes files stored in Google Cloud Storage, but simpler concepts can be applied to files stored in Dropbox, Amazon S3, Azure Blob Storage, and SFTP server, or any other file storage system.
For this integration, assume that some third-party service writes an XML file to a Google Cloud Platform (GCP) storage bucket whenever they process an order.
We'll configure our integration to run every five minutes, and our integration will do the following:
- Look for files in the
unprocessed/
directory of our GCP Storage bucket - For each file that we find:
- Download the file
- Deserialize the XML contained in the file
- Do some data mapping
- Post the transformed data to an HTTP endpoint
- Move the file from the
unprocessed/
directory to aprocessed/
directory
Our integration will take advantage of the loop component to process files one by one.
If you would like to view the YAML definition of this example integration, it's available on GitHub. You can import it by creating a new integration and selecting Import.
List files
We'll start by adding a Google Cloud Storage List Files action to our integration. This will automatically create a Google Cloud Storage connection to our config wizard.
We'll want our customers to be able to specify their own bucket, so let's add a Select Bucket data source config variable to our config wizard:
We can also add two more string config variables to represent the unprocessed/
and processed/
directories in the bucket (which our customer may want to change, so it's handy to make these config variables).
Now, configure the List Files action to reference our config variables:
Next, we'll open the Test Configuration drawer and select Test-instance configuration to set some test credentials and config variable values.
Finally, we'll click Run.
If you see any errors about permissions, ensure that the Google IAM account you created as the proper permissions to the bucket you created.
You should see the files in your unprocessed/
directory:
Create our loop
Next, we'll loop over the files that our List Files step found. We'll add a Loop Over Items step.
Under the Items input we will reference the list of files our previous step returned:
Add tasks to the loop
Our loop is now configured to run once for each file that was found in the unprocessed/
directory in our GCP bucket.
Our loop will contain a few steps to process and send the data to an external system.
Download the file we're currently looping over
First, we'll download the file we're currently looping over.
The item that we're currently processing from our is accessible using the currentItem
key of the loop.
We'll add a Download File action from the GCP component.
- For File Name we'll reference the loop's
currentItem
. - For Bucket Name we'll reference the bucket name config variable we created.
- Our Connection is already set up for us.
For example, if there's a file named
unprocessed/order-123.xml
in our bucket,loopOverEachFile.currentItem
would be equal to"unprocessed/order-123.xml"
:
Because we're downloading a XML file, this action will return parsed XML in a string format.
Deserialize the XML
Next, we'll use the Deserialize XML action to convert the XML string into a JavaScript object whose keys can be referenced by subsequent steps.
Map the data
Next, suppose the API we're sending the data to expects a different format ("quantity" instead of "qty", etc). We can use the Collection Tools Create Object action to create a new object for us, referencing the results of the Deserialize XML step:
Send the data
Next, we'll use the HTTP component's POST Request action to send the data we generated.
As a placeholder for an external API, we'll post the data to Postman's https://postman-echo.com/post
endpoint.
For our Data input, we'll reference the Create Object's results:
Move the file to a processed directory
Finally, we'll move the file that we downloaded out of the way by moving the file from unprocessed/
to processed/
.
First, we need to replace the word unprocessed
with processed
.
We'll use the Text Manipulation component's Find & Replace action for that, once again referencing the loop's currentItem
:
We'll add a Google Cloud Storage Move File action to move our folder from one folder to another:
Conclusion
That's it! At this point we have an integration that loops over files in a directory, processes them, and sends the data to an HTTP endpoint. This integration can be published, and instances of this integration can be configured and deployed to customers.