ETL Gone Portable: Reducing Cloud Vendor Lock-in

Why does portability matter?

In this fast-moving technological world, there is an overwhelming number of technologies for storing, transforming and querying your data. Depending on your internal strategy, you might decide to keep all your infrastructure on-premise and cope with the maintenance costs, or opt for a more streamlined solution: embrace the cloud and select one (or more) cloud providers to host your IT needs. The quest starts here.

Typically, one of the questions we get from our customers is: does cloud provider selection matter? Like any good consultant, our answer is usually “it depends”. Costs differ, the interfaces and APIs differ and, more importantly, the available toolsets and their maturity differ too. Selecting a cloud provider usually ends up as a compromise between cost and technology.

Now, specifically for ETL-type workloads, you usually need one or more places to read and write your data to, a piece of technology where you can code your data pipelines and an API where you can query and/or make your data available to other external consumers (e.g. a database or similar). Cloud providers offer all of the above; using some of the tools available on Google Cloud Platform (GCP) and Amazon Web Services (AWS) as examples:

GCP: Google Cloud Storage + Google Dataflow + Google BigQuery
AWS: S3 + Glue + Redshift

All these technologies have their strengths and weaknesses, and they do what they are intended to do very well, but what happens if you implement your whole ETL pipeline using these tools and then decide to either move away from the cloud or swap provider?

Portability vs. Performance (and how it affects vendor lock-in)

One can argue that there is not always a need for portability, and that is very true. This whole blog post might not even make sense if you’re sure that you’re going to stay strictly within the remit of a single cloud provider. However, larger enterprises that operate in a multitude of countries typically face the challenge that a single cloud provider doesn’t operate in all the countries required (think banks or financial institutions that need to load/transform data in regulated countries where the data cannot leave the host country). From our experience implementing ETL projects, there are usually two main scenarios (which are actually not mutually exclusive):

The ETL pipelines are very simple and operate over moderately sized datasets (let’s assume datasets that range from high megabytes to low gigabytes in size in this scenario), and there is no need to use any type of elastic compute technology. This typically equates to circa 60–85% of the usual use cases.
The ETL pipelines are very complex, operate over very large datasets (high gigabytes and beyond) and require the processing technology to be able to scale in order to compute results in a reasonable time. These usually account for the remaining 15–40% of the use cases.

The first scenario is by far the most common one, and this is a good thing. The fact that most ETL pipeline tasks are typically “simple” means that you might be able to avoid using technologies that would quickly lock you in to a particular storage and/or compute technology. Scenario 2 is usually when things get trickier (and more fun!).

In a scenario where you need the underlying technology to scale up with the datasets, you need to start making choices and compromising on technology. Fortunately, there are many tools in this space these days, and cloud providers typically focus on maintaining and investing in a small number of them. On GCP, you might consider Dataflow, Dataproc and BigQuery to process your data, while on AWS you can consider Athena, EMR and Redshift.

Regardless of the tool selection or the cloud provider, there will always be a need to orchestrate how these processes glue together to implement your data pipelines. This is where a tool like Pentaho Data Integration can help regardless of the technological choice.

How can a tool like Pentaho Data Integration help?

For anyone not aware, Pentaho Data Integration (PDI) by Hitachi Vantara is an open-source ETL tool that you can use to implement all your common ETL tasks. Fortunately, it can also help with cloud execution and orchestration scenarios with some of its internal features:

It has its own transformation engine that you can use to process data independently of the cloud provider.
It can abstract the underlying storage system using the Virtual File System (VFS) concept.
It can interact with other external technologies, such as Google BigQuery or Redshift, for example, in order to orchestrate other parts of the processing pipeline.

So, how can you do this?

Abstracting the underlying storage layer

PDI offers a very useful Virtual File System (VFS) feature, built on top of the Apache Commons VFS project, which abstracts file system access. This means that every process that reads from or writes to a filesystem can be implemented without having to think about the type of filesystem in use. Whether it is a local filesystem, a remote SFTP server, Google Cloud Storage or S3, ETL transformations correctly implemented using the VFS will work with all of them. How does this work?

It is very easy. The VFS notation prefixes every path with the type of filesystem it lives on, so a path takes the form:
filesystem://path/to/file.txt

This means that if your ETL pipeline needs to access a file or files under a path named “/input-data/sales/”, you can express that path for a local filesystem, S3 or Google Cloud Storage respectively as:

file:///input-data/sales/
s3://bucket-name/input-data/sales/
gs://bucket-name/input-data/sales/
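
As a minimal sketch of the idea (in Python rather than PDI itself), the paths above can be produced from a single environment-independent path plus a per-environment prefix. The names STORAGE_ROOTS, vfs_uri and scheme_of are hypothetical and exist only for illustration, and "bucket-name" is the same placeholder used above.

```python
from urllib.parse import urlsplit

# Hypothetical storage roots, one per environment. "bucket-name" is the
# placeholder used in the example paths, not a real bucket.
STORAGE_ROOTS = {
    "local": "file://",
    "aws": "s3://bucket-name",
    "gcp": "gs://bucket-name",
}


def vfs_uri(environment: str, path: str) -> str:
    """Prefix an environment-independent path with that environment's root."""
    return STORAGE_ROOTS[environment] + "/" + path.lstrip("/")


def scheme_of(uri: str) -> str:
    """Return the filesystem prefix (scheme) of a VFS-style URI."""
    return urlsplit(uri).scheme


if __name__ == "__main__":
    # The pipeline logic only ever sees the relative path; the prefix is
    # the single thing that changes between environments.
    for env in STORAGE_ROOTS:
        uri = vfs_uri(env, "/input-data/sales/")
        print(env, scheme_of(uri), uri)
```

Only the prefix changes between environments, which is why it is a natural candidate for a parameter rather than a hard-coded value.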

In a nutshell, what this means is that you can write highly portable code that can operate under a filesystem independently of the storage type! Great, so what else can we do to maintain some portability?

An external portable engine to execute ETL on

As we stated previously, PDI bundles its own execution engine, and since it’s based on Java, you can run it wherever the JVM runs, which means that you can implement your ETL processes without having to think about the underlying technology – which makes it highly portable across environments. You can code it once and deploy it everywhere the engine runs.

What this means is that for simple ETL pipelines, you can achieve a very portable design with PDI, which will allow you to move across environments very easily. Let’s see an example:

You start by designing and testing your ETL on premise with your own servers and local filesystem, where PDI is hosted in its own VM or dedicated server.
You decide you want to move to AWS: you parametrise the ETL in order to use AWS S3 as the file system and deploy PDI in its own EC2 instance.
You are required to use Google Cloud Platform because you need to operate in a very specific region: you parametrise the ETL to use GCS as the filesystem and deploy PDI in its own Google Compute Engine instance.
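
The three steps above could be driven by one job definition and one parameter set per environment. The sketch below only builds the command line for Kitchen, PDI's command-line job runner. It assumes Kitchen's -file and -param: options, and the job file name, the INPUT_ROOT parameter and the ENVIRONMENTS table are hypothetical.

```python
# Hypothetical parameter sets: the same job, pointed at different storage.
ENVIRONMENTS = {
    "on_prem": {"INPUT_ROOT": "file:///input-data/sales/"},
    "aws": {"INPUT_ROOT": "s3://bucket-name/input-data/sales/"},
    "gcp": {"INPUT_ROOT": "gs://bucket-name/input-data/sales/"},
}


def kitchen_command(job_file: str, environment: str) -> list:
    """Build (but do not run) a Kitchen command line for one environment."""
    params = ENVIRONMENTS[environment]
    command = ["kitchen.sh", f"-file={job_file}"]
    # Sort parameter names so the generated command is deterministic.
    command += [f"-param:{name}={value}" for name, value in sorted(params.items())]
    return command


if __name__ == "__main__":
    for env in ENVIRONMENTS:
        print(" ".join(kitchen_command("load_sales.kjb", env)))
```

The job file stays identical across the three environments; moving from one to another means changing the parameter values and where PDI is deployed.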

Of course, this is still a very limited view of what you can achieve in a cloud platform, but it does give you 100% portability and ease of transportation of your ETL environment. For more complex scenarios, PDI gives you options to either scale out or connect to other external tools:

You can use a farm of Carte servers to scale your execution horizontally, either to split or to cluster the processing.
You can implement your data transformations using MapReduce.
You can use the Adaptive Execution Layer (AEL) to scale out your data processing using Spark.
You can connect to tools like Google BigQuery and/or Redshift after loading the data to do further processing.

These more complex scenarios add a further requirement: you must be able to orchestrate the pipeline easily to avoid ending up in an implementation tangle.

Orchestrating the data pipeline

Although, as stated at the beginning of this blog post, there are typically two types of ETL pipeline scenarios, in reality they usually come together to form the overall ETL processing pipeline. Where (some) portability between cloud providers is desired, it is crucial to have a tool at your disposal that is flexible and parameterisable enough to adapt dynamically to the execution requirements.

For example, if you have your AWS implementation utilizing S3 and Redshift and want to move to GCP:

The code for file manipulation and processing that purely utilizes S3 is 100% portable if implemented correctly using the VFS capabilities.
The code that loads and operates Redshift will most likely not be portable, but it can be implemented in a way that can be substituted by a Google BigQuery implementation that uses the same input file layouts and table structure to achieve the same functionality.

Of course, this situation will require you to maintain code modules that are specific to each cloud provider, but at least you can compartmentalise and encapsulate this functionality so that the modules can be swapped easily if required. With PDI, the job orchestration functionality allows you to parameterise not only settings but also the actual execution pipeline, which makes this situation much easier to implement.
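
The compartmentalisation described above can be sketched as a small dispatch table: the portable steps are shared, and a parameter selects the provider-specific module. This is an illustration in Python and not how PDI jobs are defined. Every name here is hypothetical, and the loader functions are placeholders that only describe what a real Redshift or BigQuery module would do.

```python
from typing import Callable, Dict, List


def load_into_redshift(input_uri: str, table: str) -> str:
    # Placeholder for the AWS-specific module; no real Redshift call is made.
    return f"redshift: load {input_uri} into {table}"


def load_into_bigquery(input_uri: str, table: str) -> str:
    # Placeholder for the GCP-specific module; same inputs, same table layout.
    return f"bigquery: load {input_uri} into {table}"


# The only provider-specific part of the pipeline, selected by a parameter.
WAREHOUSE_LOADERS: Dict[str, Callable[[str, str], str]] = {
    "aws": load_into_redshift,
    "gcp": load_into_bigquery,
}


def run_pipeline(provider: str, input_uri: str, table: str) -> List[str]:
    """Run the portable steps, then the provider-specific load step."""
    steps = [f"stage files at {input_uri}"]  # portable, VFS-style step
    steps.append(WAREHOUSE_LOADERS[provider](input_uri, table))
    return steps


if __name__ == "__main__":
    for step in run_pipeline("gcp", "gs://bucket-name/input-data/sales/", "sales"):
        print(step)
```

Because both loaders accept the same input location and table name, swapping providers changes one parameter value rather than the structure of the pipeline.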

Conclusion

We hope this blog post was enlightening on how you can design your ETL pipelines and maintain some portability across cloud vendors. Tools like Pentaho Data Integration make this much easier to achieve, and we hear that upcoming versions of PDI will further ease portability with some cool new features!

André Simões

Business Intelligence & Big Data Evangelist, Xpand IT
