Using Salesforce with Pentaho Data Integration

Pentaho Data Integration is the tool of the trade to move data between systems, and it doesn’t have to be just a business intelligence process. We can actually use it as an agile tool for point-to-point integration between systems. PDI has its own Salesforce input step which makes it a good candidate for integration.

What is Salesforce?

Salesforce is a cloud solution for customer relationship management (CRM). As a next generation multi-tenant Platform as a Service (PaaS), its unique infrastructure enables you to focus your efforts where they are most essential: creating microservices that can be leveraged in innovative applications and really speeding up the CRM development process.

Salesforce is the right platform to give you a complete 360º vision of your customer and his interactions with your brand, whether this happens via your email campaigns, call centres, social networks, or a simple phone call. Marketing automation is, for example, just one of the many great things Salesforce brings to you in an all-in-one platform.

How do we use PDI to connect to Salesforce?

For this access we need all our Salesforce connection details: the username, password and the SOAP web service URL. PDI has to be compatible with the SOAP API version that you use. For example:

  PDI version   SOAP API version number
  2.0   1.0
  3.8   20.0
  4.2   21.0
  6.0   24.0
  7.0   37.0
  8.2   40.0


Nevertheless, even if Salesforce gives us a new version of the API we can still use the old API perfectly well. Just be careful, because if you’ve created new modules inside the platform, the new API won’t have these customisations, and so you’ll need to use the Salesforce Object Query Language (SOQL) to get the data. But don’t worry, we’ll explain it all in the next section.

How do we use PDI to connect to Salesforce?

The SOQL syntax is quite similar to SQL syntax, but with a few differences:

  1. The SOQL does not recognise any special characters (such as * or ; ) and so we have to use all the fields that we will get from Salesforce, and we cannot add the ; at EOF.
  2. We cannot use comments in a query; SOQL does not recognise this either.
  3. To create joins we need to know a few things:
    • For the native modules that we need linkage to (direct relationship), we need to add in final name a ’s‘. For example:

Get all Orders with and without has Products (OrderItem Module)

  • For the customisation modules that we need to get data from another module (direct relationship) we need add to final name the  ‚__r‘ . For example:
    Filter  OrderItems by Product_Skins__c field inside Product 2 Module 

How do we extract data from Salesforce with PDI?

We can use the Salesforce input step inside PDI to get data from Salesforce using SOQL; just be aware you can only use up to 20,000 characters to create the query.

  • Connection parameters specified:
    • Salesforce web service URL:

<url of Salesforce Platform>/services/Soap/u/<number of API Soap updated>

  • Username: Username Access to the Platform  (i.e. [email protected])
  • Password:Password + Token (the company provides the token for us add to the password in Kettle.Properties) i.e: PASSWORDTOKEN

Settings parameters specified:

    • Specify query: Without active (like we can see in the image below) we only need to choose the module (the table containing records that we need to access).

For the next tab (Content) we have the following parameters options:

  • If we want to get all records from Salesforce (I mean, if we want to get delete records and insert records) you need place a tick in Query All Records, and choose from the parameters below one of the following options:
    • All (get new/update records and delete records), Update (get only inserts and update records) ;
    • If you untick the tick from Query All Records parameters, we only get insert/update registers;
    • Delete (we get only delete records).

How does PDI know if records are new/updates or deletes?

The Salesforce has native fields very useful for controlling the process. However we cannot see these fields in layout or on builder schema in SF. We only can see the data associated with these specific fields if we’re using the SOQL or PDI to access these fields.

  • CreatedById and CreateDate are fields that shows the user and data time when records were created.
  • The LastModifiedDate and LastModifiedID shows the data time and the user who modified the record. We can use these fields to get data updated in SF.
  • Id (Salesforce Id) present in URL as a string of 18 characters (Java config.) displays the register.
    For example:
  • We have more one native field IsDeleted with data type = Boolean that shows if the record was removed (IsDelete = true) or not (IsDelete = false).

In Additional field option we have three options:

  • Time out is useful in asynchrony systems because we can configure the timeout interval in milliseconds before the step times out;
  • Use Compression is useful to get more performance from the process. Because when you tick it, the system will direct all calls to the API and send all in .qzip field;
  • Limit is for configuring the maximum number of records to retrieve from the query.

Inside the last tab, we can see all fields from the query inside the first tab. Without SOQL we get all the module fields. With SOQL we get all the fields inside on SELECT function.

And for these cases, we need to do the manually changes.
For more details:

The base64 displays images or PDFs present in SF.

If we need send images (.jpeg) or pdf (.pdf) directly to SF we load these type of fields  using JAVA to convert binary files to the base64.

For example, to send a PDF file to SF:

How to load data to Salesforce with PDI?

Send data to Salesforce from other databases or from Salesforce.

The connection option is equal as described in Salesforce Input.
In Settings Options we have new parameters:

  • Rollback all Changes on error – if we got any error nothing will integrate into SF;
  • Batch Size – we can bring a static number of the records and integrate them simultaneously (the same batch) to SF;
  • In Output Fields Label we need to add the field name that we want to get the Salesforce ID for each record integrated.

In the Fields Option, we need to put field mapping.

  • For Module Field, we need to put the API Name field in SF to get the new data;
  • In the Steam Field, we need to put the name of the field that will be integrated into the respective field in SF;
  • Use External id = N to all field updated inside the respective Module;
  • Use External id = Y to all records that we need updating but are not present in the current module, but present in another module.

Delete records inside Salesforce

We delete records from Salesforce with Delete Salesforce step. We need to specify the key field from Table Input that does the reference to the key in Salesforce (Saleforce Id).

Update Salesforce records

If we only want to update records in SF we need to use the Salesforce Update Step.
Inside Fields (Key included) Option we need to add the key to records for the specific module.

Upsert data to Salesforce

If we want to insert and update in the same Batch to SF, we need to use Salesforce Upsert.
The parameter Upsert Comparison Field helps match the data in SF.

Fátima MirandaUsing Salesforce with Pentaho Data Integration

Do you want to receive amazing news about the IT industry's hot topics and the best articles about state-of-the-art technology?
Subscribe to our newsletter and be the first one to receive information to keep you constantly on edge.