Minhaz Kazi, Developer Advocate, Google Analytics – October 2022
Measuring exact distinct counts (i.e. cardinality) for large datasets requires significant memory and affects performance. Google Analytics 4 properties use HyperLogLog++ (HLL++) algorithm to estimate cardinality for most used metrics including Active Users and Sessions. For example, when viewing Active Users in the Google Analytics UI as well as through the Data API, the count will be an approximation. Using HLL++ for these metrics ensures better performance with higher estimation accuracy and lower error bounds. This blogpost provides details on how HLL++ is used in Google Analytics 4 properties and how you can replicate some of the estimations using BigQuery event export data.
About HLL++
HLL++ estimates cardinality while using less memory and improving performance.
HLL++ has augmentations made over the HyperLogLog algorithm and is based on
HyperLogLog in Practice: Algorithmic Engineering of a State of the Art
Cardinality Estimation Algorithm. Changes in the HLL++ precision
parameter
trade off memory usage vs. accuracy of the computed stats. Increasing the
parameter value lowers the error, but increases the memory consumption.
Unique count implementation in BigQuery
- Use
COUNT(DISTINCT)
to measure exact cardinality. This approach requires more memory and will take longer to run, especially for large datasets. APPROX_COUNT_DISTINCT
approximates results with HLL++. However,APPROX_COUNT_DISTINCT
does not allow users to configure the approximation accuracy.- To use custom
precision
values, use HyperLogLog++ functions. See HLL++ Sketches for permittedprecision
values and the confidence intervals for typical precisions. sparse precision
is another parameter for HLL+. In BigQuery,sparse precision
value is not user definable and is fixed atprecision + 5
.
HLL++ Implementation in Google Analytics 4 properties
Google Analytics 4 uses the following configuration for measuring cardinality of the related metrics.
Metric | precision |
sparse precision |
---|---|---|
Sessions | 12 | 17 |
Active Users | 14 | 25 |
Total Users | 14 | 25 |
Using BigQuery HLL++ Functions with Google Analytics Event data
If your Google Analytics 4 property's event data is available in BigQuery, you can try to match the metrics from UI with BigQuery data. The following examples assume:
- Reporting identifier is set to
By device only
for the Google Analytics 4 property. - Other potential sources of discrepancy, e.g. reporting timezone, are already addressed.
Total Users
Exact count using COUNT(DISTINCT)
:
SELECT
COUNT(DISTINCT user_pseudo_id) AS exact_total_user_count,
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
Approximate count using APPROX_COUNT_DISTINCT
:
SELECT
APPROX_COUNT_DISTINCT(user_pseudo_id) AS approx_total_user_count,
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
You can replicate APPROX_COUNT_DISTINCT
using BigQuery HLL++ functions. This
will return the identical or very similar results as APPROX_COUNT_DISTINCT
:
SELECT
HLL_COUNT.EXTRACT(HLL_COUNT.INIT(user_pseudo_id, 15)) AS approx_total_user_count,
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
And finally, to replicate the data in Google Analytics UI, use precision = 14
:
SELECT
HLL_COUNT.EXTRACT(HLL_COUNT.INIT(user_pseudo_id, 14)) AS total_user_count,
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
Active Users
To calculate Active Users count from BigQuery event export table, you first have to filter the events for Active Users only. Implementing the Active User filter is out of scope for this article.
WITH ActiveUsers AS
(
SELECT
user_pseudo_id
FROM
`bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE
<implement active user filter logic>
)
SELECT
HLL_COUNT.EXTRACT(HLL_COUNT.INIT(user_pseudo_id, 14)) AS active_user_count,
FROM ActiveUsers
Sessions
The ga_session_id
event parameter identifies individual unique sessions for
each user. The combination of user_pseudo_id
and ga_session_id
will be
unique across your dataset for unique sessions. This is the standard method of
counting sessions for Google Analytics 4 properties. For sessions, precision
is 12.
SELECT
HLL_COUNT.EXTRACT(
HLL_COUNT.INIT(
CONCAT(
user_pseudo_id,
(SELECT `value` FROM UNNEST(event_params) WHERE key = 'ga_session_id' LIMIT 1).int_value),
12)) AS session_count,
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
You can use queries in this blogpost with the Basic Queries and Advanced Queries examples to gain additional insights from BigQuery event export data for your Google Analytics 4 properties. If your property generates significant amount of event data, you can also implement the HLL++ functions to estimate cardinality of other metrics that you commonly use in your BigQuery queries. To learn more about HLL++ and why cardinality computation is expensive for large datasets, read the in-depth blogpost on Cloud Blog: Using HLL++ to speed up count-distinct in massive datasets.