Airbyte’s AWS S3 connector brings open source data integration to data lakes


Open source data integration platform Airbyte has announced its first data lake integration, allowing users to replicate data from myriad sources to Amazon’s Simple Storage Service (S3). The San Francisco-based startup said that it plans to support data lakes from “other cloud providers” — including Databricks’ open source Delta Lake — soon.

Businesses of all sizes have an abundance of data spread across myriad tools such as CRM, marketing, customer support, and product analytics. While accessing the data isn’t the problem, deriving meaningful insights from data stored in different locations and formats is — so businesses have to combine it in a centralized location and transform it into a common format that makes it easier to analyze.

From ETL to ELT

Historically, a typical way to achieve this was “extract, transform, load” (ETL), which transforms the data before it arrives in a central data warehouse. This made more sense when on-premises storage was expensive, even though the transformation process could be painfully slow and users would often have to re-extract the data if their needs changed. The modern alternative, “extract, load, transform” (ELT), allows companies to transform the raw data on demand once it is already in the warehouse. This has been enabled by the lower costs of modern cloud-based storage and computation platforms such as Databricks, Snowflake, Google’s BigQuery, and Amazon’s Redshift.
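To make the difference between ETL and ELT concrete, here is a minimal Python sketch. It is an illustration only, not Airbyte’s implementation: an in-memory SQLite database stands in for the warehouse, and the sample records, table names, and the `etl` and `elt` functions are all hypothetical.

```python
import sqlite3

# Hypothetical raw records, standing in for rows extracted from a source such as a CRM.
RAW_ROWS = [
    {"id": 1, "email": " Ada@Example.com ", "plan": "pro"},
    {"id": 2, "email": "bob@example.com", "plan": "free"},
]


def etl(rows):
    """ETL: clean the data in the pipeline and load only the transformed result."""
    db = sqlite3.connect(":memory:")  # stand-in for a warehouse
    db.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
    # Transform before loading; the 'plan' field is dropped and cannot be
    # recovered later without re-extracting from the source.
    cleaned = [(r["id"], r["email"].strip().lower()) for r in rows]
    db.executemany("INSERT INTO customers VALUES (?, ?)", cleaned)
    return db


def elt(rows):
    """ELT: load the raw data first, then transform it inside the warehouse."""
    db = sqlite3.connect(":memory:")  # stand-in for a warehouse
    db.execute("CREATE TABLE raw_customers (id INTEGER, email TEXT, plan TEXT)")
    db.executemany("INSERT INTO raw_customers VALUES (:id, :email, :plan)", rows)
    # Transform on demand with SQL; the raw table stays available for new needs.
    db.execute(
        "CREATE VIEW customers AS "
        "SELECT id, lower(trim(email)) AS email FROM raw_customers"
    )
    return db


if __name__ == "__main__":
    for name, build in (("ETL", etl), ("ELT", elt)):
        warehouse = build(RAW_ROWS)
        print(name, warehouse.execute("SELECT * FROM customers ORDER BY id").fetchall())
```

Both functions produce the same cleaned `customers` data, but only the ELT variant keeps the raw `plan` column in the warehouse, so a later change in requirements can be met with a new query rather than a new extraction.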

Airbyte is chiefly concerned with the “EL” part of ELT, though it also supports the transformation phase through integrations with third-party tools such as dbt. The company recently launched its Connector Development Kit (CDK) to enable businesses to create their own custom data source connectors, though it also offers dozens of pre-built connectors. This makes it easier for companies to create data pipelines and transport their data from sources such as CRMs (e.g. Salesforce), databases (e.g. MySQL, PostgreSQL), and analytics (e.g. Amplitude) to destinations including databases (e.g. BigQuery), data warehouses (e.g. Snowflake) and, as of now, data lakes.

Data lakes and data warehouses serve very distinct purposes: the former houses raw, unstructured data, which is more flexible but storage-intensive, while the latter is all about structured data that has already been processed and filtered for specific use cases as determined by the company. Thus, Airbyte’s decision to support S3 makes sense, given that it needs to open itself to as many potential data integration scenarios as possible.

Above: Airbyte: Data replication

Open for business

Open source data integration tools have been big news of late. Last week GitLab announced it was spinning out its open source ELT platform Meltano as a standalone business; the project is setting out to achieve something similar to Airbyte. Moreover, as an independent business, Meltano has also managed to attract some big-name investors, including Alphabet’s GV and WordPress founder Matt Mullenweg. Elsewhere, dbt Labs (formerly Fishtown Analytics) last week raised $150 million at a $1.5 billion valuation to build out its open source dbt data transformation tool, which both Meltano and Airbyte leverage in their respective products.

Airbyte, for its part, has raised north of $31 million in the past few months, starting with a $5.2 million seed round in March, followed less than three months later by a $26 million Series A round. The open source ETL industry, it seems, is heating up.

For now, Airbyte’s core product is the free and MIT-licensed community edition, though it eventually plans to go commercial through a hosted cloud incarnation, with an additional enterprise-grade offering in the works too.
