
Loop Over and Process Files

In this tutorial we will build an integration that downloads and processes files stored in Google Cloud Storage, but similar concepts can be applied to files stored in Dropbox, Amazon S3, Azure Blob Storage, an SFTP server, or any other file storage system.

For this integration, assume that some third-party service writes an XML file to a Google Cloud Platform (GCP) storage bucket whenever it processes an order.

We'll configure our integration to run every five minutes. On each run, it will do the following:

  • Look for files in the unprocessed/ directory of our GCP Storage bucket
  • For each file that we find:
    • Download the file
    • Deserialize the XML contained in the file
    • Do some data mapping
    • Post the transformed data to an HTTP endpoint
    • Move the file from the unprocessed/ directory to a processed/ directory

Our integration will take advantage of the loop component to process files one by one.
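To make the overall flow concrete before we build it in the designer, here is a minimal Python sketch of the same steps. It is not how the integration itself is implemented: the bucket is a plain dictionary, the HTTP post is simulated by collecting JSON bodies in a list, and the XML layout (an order element with id and qty children) and the function name are assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET


def process_unprocessed_files(bucket, source_prefix="unprocessed/", dest_prefix="processed/"):
    """Process every file under source_prefix, then move it under dest_prefix.

    `bucket` is a plain dict (object name -> XML text) standing in for real storage.
    Returns the JSON bodies that would be posted to the HTTP endpoint.
    """
    posted = []
    # List files: snapshot the matching names first, because the loop mutates the bucket.
    file_names = sorted(name for name in bucket if name.startswith(source_prefix))
    for current_item in file_names:  # plays the role of the loop's currentItem
        xml_text = bucket[current_item]  # download the file
        root = ET.fromstring(xml_text)  # deserialize the XML
        payload = {  # map the data (assumed field names)
            "orderId": root.findtext("id"),
            "quantity": int(root.findtext("qty")),
        }
        posted.append(json.dumps(payload))  # "post" the transformed data
        new_name = dest_prefix + current_item[len(source_prefix):]
        bucket[new_name] = bucket.pop(current_item)  # move the file
    return posted


# Hypothetical bucket contents for the example.
bucket = {
    "unprocessed/order-123.xml": "<order><id>123</id><qty>2</qty></order>",
    "unprocessed/order-124.xml": "<order><id>124</id><qty>5</qty></order>",
    "processed/order-100.xml": "<order><id>100</id><qty>1</qty></order>",
}
posted = process_unprocessed_files(bucket)
```

After the call, both order files live under processed/ and posted holds one JSON body per file. Because each file is moved only after it has been handled, a later run will not pick it up again.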

If you would like to view the YAML definition of this example integration, it's available on GitHub. You can import it by creating a new integration and selecting Import.

List files

We'll start by adding a Google Cloud Storage List Files action to our integration. This will automatically add a Google Cloud Storage connection to our config wizard.

We'll want our customers to be able to specify their own bucket, so let's add a Select Bucket data source config variable to our config wizard:

Google Cloud Storage - Select Bucket data source

We can also add two more string config variables to represent the unprocessed/ and processed/ directories in the bucket (which our customer may want to change, so it's handy to make these config variables).

Now, configure the List Files action to reference our config variables:

Google Cloud Storage - List Files inputs

Next, we'll open the Test Configuration drawer and select Test-instance configuration to set some test credentials and config variable values.

Google Cloud Storage - Test config variables

Finally, we'll click Run. If you see any errors about permissions, ensure that the Google IAM account you created has the proper permissions to the bucket you created. You should see the files in your unprocessed/ directory:

Google Cloud Storage - List Files result

Create our loop

Next, we'll loop over the files that our List Files step found. We'll add a Loop Over Items step.

Under the Items input we will reference the list of files our previous step returned:

Loop Over Items in Prismatic integration designer

Add tasks to the loop

Our loop is now configured to run once for each file that was found in the unprocessed/ directory in our GCP bucket. Our loop will contain a few steps to process and send the data to an external system.

Download the file we're currently looping over

First, we'll download the file we're currently looping over. The item that our loop is currently processing is accessible using the currentItem key of the loop.

We'll add a Download File action from the GCP component.

  • For File Name we'll reference the loop's currentItem.
  • For Bucket Name we'll reference the bucket name config variable we created.
  • Our Connection is already set up for us.

For example, if there's a file named unprocessed/order-123.xml in our bucket, loopOverEachFile.currentItem would be equal to "unprocessed/order-123.xml":

Download current file inputs

Because we're downloading an XML file, this action will return the file's XML contents as a string.

Download current file results

Deserialize the XML

Next, we'll use the Deserialize XML action to convert the XML string into a JavaScript object whose keys can be referenced by subsequent steps.

Download current file results
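To illustrate what deserializing does conceptually, here is a small Python sketch that turns an XML string into nested dictionaries using the standard library. The element_to_dict helper and the sample order layout are assumptions for illustration, and the exact shape produced by the Deserialize XML action may differ.

```python
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Recursively convert an element to nested dicts; leaf elements become their text."""
    children = list(element)
    if not children:
        return element.text
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # A repeated tag is collected into a list of values.
            existing = result[child.tag]
            if not isinstance(existing, list):
                result[child.tag] = [existing]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


# Hypothetical order file contents.
xml_text = "<order><id>123</id><item><sku>A-1</sku><qty>2</qty></item></order>"
root = ET.fromstring(xml_text)
order = {root.tag: element_to_dict(root)}
```

Once the data is in this form, later steps can reference individual values by key, such as order["order"]["item"]["qty"]. Note that every leaf value is still a string at this point.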

Map the data

Next, suppose the API we're sending the data to expects a different format (for example, "quantity" instead of "qty"). We can use the Collection Tools Create Object action to create a new object for us, referencing the results of the Deserialize XML step:

Create object inputs
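The mapping step amounts to building a new object whose keys match what the destination API expects. A minimal Python sketch follows, assuming a deserialized order shaped like the dictionary below; apart from "qty" and "quantity", all field names here are hypothetical.

```python
def map_order(deserialized):
    """Build the payload the destination API expects from the deserialized order."""
    order = deserialized["order"]
    item = order["item"]
    return {
        "orderId": order["id"],
        "sku": item["sku"],
        # Rename "qty" to "quantity" and convert the XML text to a number.
        "quantity": int(item["qty"]),
    }


# Hypothetical output of the deserialize step.
deserialized = {"order": {"id": "123", "item": {"sku": "A-1", "qty": "2"}}}
payload = map_order(deserialized)
```

Keeping the mapping in one place makes it easy to see which source field feeds each destination field.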

Send the data

Next, we'll use the HTTP component's POST Request action to send the data we generated. As a placeholder for an external API, we'll post the data to Postman's https://postman-echo.com/post endpoint. For our Data input, we'll reference the Create Object's results:

HTTP POST inputs
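For comparison, here is roughly what an equivalent JSON POST looks like in Python using only the standard library. This is a sketch rather than what the HTTP component does internally: the build_post_request helper and the sample payload are assumptions, and the lines that actually send the request are commented out so the example runs without network access.

```python
import json
import urllib.request


def build_post_request(url, payload):
    """Create a POST request that carries the payload as a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_post_request(
    "https://postman-echo.com/post",
    {"orderId": "123", "quantity": 2},  # hypothetical mapped order
)

# To send it (requires network access):
# with urllib.request.urlopen(request, timeout=10) as response:
#     echoed = json.load(response)
```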

Move the file to a processed directory

Finally, we'll get the file that we just processed out of the way by moving it from unprocessed/ to processed/.

First, we need to replace the word unprocessed with processed. We'll use the Text Manipulation component's Find & Replace action for that, once again referencing the loop's currentItem:

Find-and-replace inputs
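The renaming itself is a simple string operation. The Python sketch below shows a plain find-and-replace next to a stricter variant that only rewrites the leading directory. The to_processed_path helper is hypothetical, and whether the Find & Replace action replaces every occurrence or only the first is not covered in this tutorial.

```python
def to_processed_path(file_name, source_prefix="unprocessed/", dest_prefix="processed/"):
    """Swap only the leading directory, leaving the rest of the name untouched."""
    if not file_name.startswith(source_prefix):
        raise ValueError(f"{file_name!r} is not under {source_prefix!r}")
    return dest_prefix + file_name[len(source_prefix):]


current_item = "unprocessed/order-123.xml"

# Plain find-and-replace, as in the step above.
simple = current_item.replace("unprocessed", "processed")

# Prefix-only variant.
strict = to_processed_path(current_item)
```

For a name like unprocessed/order-123.xml both approaches give the same result. They differ only when the word unprocessed also appears later in the file name, in which case a replace-all would alter the file name as well as the directory.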

Then, we'll add a Google Cloud Storage Move File action to move our file from one folder to the other:

Move File inputs

Conclusion

That's it! At this point we have an integration that loops over files in a directory, processes them, and sends the data to an HTTP endpoint. This integration can be published, and instances of this integration can be configured and deployed to customers.