Integrating Google Cloud Storage with ClickHouse Cloud
The GCS ClickPipe provides a fully-managed and resilient way to ingest data from Google Cloud Storage (GCS). It supports both one-time and continuous ingestion with exactly-once semantics.
GCS ClickPipes can be deployed and managed manually using the ClickPipes UI, as well as programmatically using OpenAPI and Terraform.
Supported formats
Features
One-time ingestion
By default, the GCS ClickPipe will load all files matched by a pattern from the specified bucket into the ClickHouse destination table in a single batch operation. Once the ingestion task completes, the ClickPipe stops automatically. This one-time ingestion mode provides exactly-once semantics, ensuring that each file is processed reliably without duplicates.
Continuous ingestion
When continuous ingestion is enabled, ClickPipes continuously ingests data from the specified path. By default, the GCS ClickPipe relies on the implicit lexicographical order of file names to determine ingestion order. It can also be configured to ingest files in any order using a Google Cloud Pub/Sub subscription that provides notifications for the bucket.
Lexicographical order
The GCS ClickPipe assumes files are added to a bucket in lexicographical order, and relies on this implicit order to ingest files sequentially. This means that any new file must be lexically greater than the last ingested file. For example, files named file1, file2, and file3 will be ingested sequentially, but if a new file named file0 is added to the bucket, it will be ignored because its name isn't lexically greater than the last ingested file.
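As a quick sketch, the ordering described above is plain byte-wise string sorting, which you can reproduce with sort (the file names are illustrative):

```shell
# File names are compared lexicographically, so "file0" sorts before
# "file1" even if it was uploaded last, and would therefore be skipped.
printf 'file1\nfile2\nfile3\nfile0\n' | sort
```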
In this mode, the GCS ClickPipe does an initial load of all files in the specified path, and then polls for new files at a configurable interval (by default, 30 seconds). It is not possible to start ingestion from a specific file or point in time — ClickPipes will always load all files in the specified path.
Any order
Unordered mode is not supported for public buckets. It requires Service Account authentication and a Google Cloud Pub/Sub subscription connected to the bucket.
It's possible to configure a GCS ClickPipe to ingest files that don't have an implicit order by setting up a Google Cloud Pub/Sub subscription that receives notifications from the bucket. This allows ClickPipes to listen for object created events and ingest any new files regardless of the file naming convention.
In this mode, the GCS ClickPipe does an initial load of all files in the selected path, and then listens for object notifications via the Pub/Sub subscription that match the path. Any message for a previously seen file, a file that doesn't match the path, or an event of a different type will be ignored. It is not possible to start ingestion from a specific file or point in time — ClickPipes will always load all files in the selected path.
Setting up Pub/Sub notifications
To use unordered mode, you need to configure automatic notifications from your GCS bucket to a Pub/Sub topic. Follow the official documentation for Pub/Sub notifications to create a Pub/Sub topic and subscription, then set up notifications for the bucket.
To create the notification:
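A sketch using the gcloud CLI (the bucket, topic, and subscription names here are placeholders; OBJECT_FINALIZE is the event type Cloud Storage emits when an object is created):

```shell
# Publish object-created events from the bucket to a Pub/Sub topic.
gcloud storage buckets notifications create gs://my-bucket \
  --topic=my-topic \
  --event-types=OBJECT_FINALIZE \
  --payload-format=json

# Create a subscription on that topic for ClickPipes to consume.
gcloud pubsub subscriptions create my-subscription --topic=my-topic
```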
Granting permissions to the service account
Unordered mode requires Service Account authentication. The service account used by ClickPipes must have the following permissions:
- Read objects from the GCS bucket — to fetch the data files.
- Read messages from the Pub/Sub subscription — to receive object notifications.
- Get the Pub/Sub subscription — to verify the subscription exists and retrieve its metadata.
Grant these permissions using the following gcloud commands:
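A sketch with gcloud (the service account email, bucket, and subscription names are placeholders — substitute your own):

```shell
# Allow the service account to read objects from the bucket.
gcloud storage buckets add-iam-policy-binding gs://my-bucket \
  --member="serviceAccount:clickpipes@my-project.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"

# Allow it to receive and acknowledge messages from the subscription.
gcloud pubsub subscriptions add-iam-policy-binding my-subscription \
  --member="serviceAccount:clickpipes@my-project.iam.gserviceaccount.com" \
  --role="roles/pubsub.subscriber"

# Allow it to fetch subscription metadata.
gcloud pubsub subscriptions add-iam-policy-binding my-subscription \
  --member="serviceAccount:clickpipes@my-project.iam.gserviceaccount.com" \
  --role="roles/pubsub.viewer"
```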
Configuring the ClickPipe
In the ClickHouse Cloud console, navigate to Data Sources > Create ClickPipe, then choose Google Cloud Storage. Enter the details to connect to your GCS bucket, using Service Account as the authentication method, and upload the service account key JSON file. Next, click Incoming data.
Toggle on Continuous ingestion, where you’ll see the new Any order ingestion option. Then, input the Pub/Sub subscription path in the following format:
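```
projects/<project-id>/subscriptions/<subscription-id>
```

Here <project-id> is your Google Cloud project ID and <subscription-id> is the ID of the subscription created earlier; this is the standard fully qualified Pub/Sub resource name.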
File pattern matching
Object Storage ClickPipes follow the POSIX standard for file pattern matching. All patterns are case-sensitive and match the full path after the bucket name. For better performance, use the most specific pattern possible (e.g., data-2024-*.csv instead of *.csv).
Supported patterns
| Pattern | Description | Example | Matches |
|---|---|---|---|
| `?` | Matches exactly one character (excluding `/`) | `data-?.csv` | data-1.csv, data-a.csv, data-x.csv |
| `*` | Matches zero or more characters (excluding `/`) | `data-*.csv` | data-1.csv, data-001.csv, data-report.csv, data-.csv |
| `**` | Matches zero or more characters (including `/`); enables recursive directory traversal | `logs/**/error.log` | logs/error.log, logs/2024/error.log, logs/2024/01/error.log |
Examples:
- https://storage.googleapis.com/bucket/folder/*.csv
- https://storage.googleapis.com/bucket/logs/**/data.json
- https://storage.googleapis.com/bucket/file-?.parquet
- https://storage.googleapis.com/bucket/data-2024-*.csv.gz
Unsupported patterns
| Pattern | Description | Example | Alternatives |
|---|---|---|---|
| `{abc,def}` | Brace expansion (alternatives) | `{logs,data}/file.csv` | Create separate ClickPipes for each path. |
| `{N..M}` | Numeric range expansion | `file-{1..100}.csv` | Use `file-*.csv` or `file-?.csv`. |
Examples:
- https://storage.googleapis.com/bucket/{documents-01,documents-02}.json
- https://storage.googleapis.com/bucket/file-{1..100}.csv
- https://storage.googleapis.com/bucket/{logs,metrics}/data.parquet
Exactly-once semantics
Various types of failures can occur when ingesting large datasets, which can result in partial inserts or duplicate data. Object Storage ClickPipes are resilient to insert failures and provide exactly-once semantics. This is accomplished by using temporary "staging" tables. Data is first inserted into a staging table. If something goes wrong with this insert, the staging table can be truncated and the insert retried from a clean state. Only when an insert completes successfully are the partitions in the staging table moved to the target table. To read more about this strategy, check out this blog post.
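As an illustrative SQL sketch (the table names, file pattern, and partition value are hypothetical; ClickPipes manages its staging tables internally), the sequence looks roughly like:

```sql
-- Insert the batch into a staging table that mirrors the target schema.
INSERT INTO staging_table
SELECT * FROM gcs('https://storage.googleapis.com/bucket/data-*.csv');

-- On failure: wipe the staging table and retry from a clean state.
TRUNCATE TABLE staging_table;

-- On success: atomically move each written partition into the target.
ALTER TABLE staging_table MOVE PARTITION 202401 TO TABLE target_table;
```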
Virtual columns
To track which files have been ingested, add the _file virtual column to the column mapping list. The _file virtual column contains the filename of the source object and can be used to query which files have been processed.
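For example, assuming _file was mapped to a destination column of the same name (the table name is hypothetical), you can count the rows ingested from each source file:

```sql
SELECT _file, count() AS rows
FROM my_destination_table
GROUP BY _file
ORDER BY _file;
```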
Access control
Permissions
The GCS ClickPipe supports public and private buckets. Requester Pays buckets are not supported.
The roles/storage.objectViewer role must be granted at the bucket level. This role contains the storage.objects.list and storage.objects.get IAM permissions, which allow ClickPipes to list and fetch objects in the specified bucket.
Pub/Sub subscription
When using unordered mode, the service account must have the following roles on the Pub/Sub subscription:
- roles/pubsub.subscriber — to receive and acknowledge messages.
- roles/pubsub.viewer — to get subscription metadata.
Authentication
Service account
Service Account authentication is required when using unordered mode with Pub/Sub notifications. Select Service Account as the authentication method and upload the service account key JSON file.
HMAC credentials
To use HMAC keys to authenticate, choose Credentials under Authentication method when setting up your ClickPipe connection. Then, provide the access key (e.g., GOOGTS7C7FUP3AIRVJTE2BCDKINBTES3HC2GY5CBFJDCQ2SYHV6A6XXVTJFSA) and secret key (e.g., bGoa+V7g/yqDXvKRqq+JTFn4uQZbPiQJo4pf9RzJ) under Access key and Secret key, respectively.
Follow this guide to create a service account with an HMAC key.
Network access
GCS ClickPipes use two distinct network paths for metadata discovery and data ingestion: the ClickPipes service and the ClickHouse Cloud service, respectively. If you want to configure an additional layer of network security (e.g., for compliance reasons), network access must be configured for both paths.
For IP-based access control, the IP filtering rules for your GCS bucket must allow the static IPs for the ClickPipes service region listed here, as well as the static IPs for the ClickHouse Cloud service. To obtain the static IPs for your ClickHouse Cloud region, open a terminal and run:
Advanced settings
ClickPipes provides sensible defaults that cover the requirements of most use cases. If your use case requires additional fine-tuning, you can adjust the following settings:
| Setting | Default value | Description |
|---|---|---|
| Max insert bytes | 10GB | Number of bytes to process in a single insert batch. |
| Max file count | 100 | Maximum number of files to process in a single insert batch. |
| Max threads | auto(3) | Maximum number of concurrent threads for file processing. |
| Max insert threads | 1 | Maximum number of concurrent insert threads for file processing. |
| Min insert block size bytes | 1GB | Minimum size of bytes in the block which can be inserted into a table. |
| Max download threads | 4 | Maximum number of concurrent download threads. |
| Object storage polling interval | 30s | Configures the maximum wait period before inserting data into the ClickHouse cluster. |
| Parallel distributed insert select | 2 | Parallel distributed insert select setting. |
| Parallel view processing | false | Whether to enable pushing to attached views concurrently instead of sequentially. |
| Use cluster function | true | Whether to process files in parallel across multiple nodes. |
Scaling
Object Storage ClickPipes are scaled based on the minimum ClickHouse service size determined by the configured vertical autoscaling settings. The size of the ClickPipe is determined when the pipe is created. Subsequent changes to the ClickHouse service settings won't affect the ClickPipe size.
To increase the throughput on large ingest jobs, we recommend scaling the ClickHouse service before creating the ClickPipe.
Known limitations
File size
ClickPipes will only attempt to ingest objects that are 10GB or smaller in size. If a file is greater than 10GB, an error will be appended to the ClickPipes dedicated error table.
Compatibility
The GCS ClickPipe relies on the Cloud Storage XML API for interoperability, which requires using the https://storage.googleapis.com/ bucket prefix (instead of gs://) and HMAC keys for authentication.
View support
Materialized views on the target table are also supported. ClickPipes will create staging tables not only for the target table, but also any dependent materialized view.
We don't create staging tables for non-materialized views. This means that if you have a target table with one or more downstream materialized views, those materialized views should avoid selecting data via a view from the target table. Otherwise, you may find that you're missing data in the materialized view.