Behavior of the CMP with Bots

What are bots?

👉 Bots are software applications that run automated tasks over the Internet. They are used to index internet content or to automatically gather information from websites. 

Some bots serve legitimate purposes, whereas others collect data for malicious purposes, such as:

  • Content reselling
  • Click generation
  • Price undercutting
  • Etc.

Like any client-based web solution, Didomi is impacted by bot traffic, which generates “false” data. As a consequence, bot traffic can make CMP analytics inaccurate.

Impact on CMP Analytics Indicators

The most impacted metric is the total number of notices, which increases in volume. This directly raises the notice bounce rate and degrades the addressability rate performance indicator.
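As a rough illustration of why extra notices skew these indicators, the sketch below uses a simplified, assumed definition of the notice bounce rate (notices that received no consent choice divided by total notices). The exact formulas used in the Didomi dashboards may differ, and the traffic figures are invented for the example.

```python
def notice_bounce_rate(notices: int, notices_with_choice: int) -> float:
    """Share of displayed notices that received no consent choice.

    Assumed, simplified definition used only for this illustration.
    """
    if notices == 0:
        return 0.0
    return (notices - notices_with_choice) / notices


# Hypothetical human traffic: 1,000 notices displayed, 800 of them answered.
human_only = notice_bounce_rate(1000, 800)

# Same traffic plus 250 undetected bot visits that display a notice
# but never make a choice.
with_bots = notice_bounce_rate(1000 + 250, 800)

print(f"without bots: {human_only:.0%}")  # 20%
print(f"with bots:    {with_bots:.0%}")   # 36%
```

The number of real users who answered the notice is unchanged, yet the indicator moves because the denominator is inflated by bot traffic.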

Provide analytics data without bots

👉 Bots distort web data by generating false user data. They degrade the addressability rate and the pageview consent rate by increasing the volume of notice bounces and the number of pageviews without consent.

To preserve the compliance of your reports, we advise against excluding user agents (UAs) wholesale. A given UA can belong to a hiding bot, but also to real users who have given their consent.

In that case, excluding UAs represents both a compliance and a legal risk.

There are two types of bots:

Declared Bots: they can be detected thanks to their user agent (UA). They are excluded with the user agent filtering method. A few examples of bots:
    • Scraper bots: programmed to capture the content offline, such as names, prices, and product details on e-commerce websites.
    • Crawler bots: used by large companies, such as Google, Yahoo, etc., for content indexing purposes.
    • Performance/audit bots: used by website performance tools to perform SEO audits or to evaluate page loading time. Didomi also uses a bot to evaluate the compliance of websites.

Hiding Bots: they use standard user agents and therefore can’t be identified with the UA filtering method.
A specialized solution/technology is required to detect them and then exclude them from analytics data.

Example of user agents

Declared Bots

  • Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) TagInspector/500.1 Chrome/90.0.4430.72 Safari/537.36 Edg/90.0.818.42
  • Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/85.0.4183.102 Safari/537.36
  • Mozilla/5.0 (iplabel; Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36

Each of these user agents contains an element that is not part of a standard user agent: TagInspector/500.1, HeadlessChrome, and iplabel, respectively.

Hiding Bot User agents

  • Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36
  • Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.64

Even if the user agents above are used by bots, they are also used by regular visitors, so these user agents can’t be excluded.
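The difference between the two families can be sketched with a simple term check. This is a minimal illustration, not Didomi's actual detection code: the term list is a three-item subset taken from the examples above, and the case-insensitive matching is an assumption.

```python
# Illustrative subset of declared-bot terms, taken from the examples above.
DECLARED_BOT_TERMS = ("TagInspector", "HeadlessChrome", "iplabel")


def is_declared_bot(user_agent: str) -> bool:
    """Return True if the UA contains a known bot term.

    Case-insensitive matching is an assumption made for this sketch.
    """
    ua = user_agent.lower()
    return any(term.lower() in ua for term in DECLARED_BOT_TERMS)


declared_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/85.0.4183.102 Safari/537.36"
)
standard_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"
)

print(is_declared_bot(declared_ua))  # True
print(is_declared_bot(standard_ua))  # False
```

A hiding bot sending the second user agent is indistinguishable from a regular Chrome visitor with this method, which is why UA filtering alone cannot remove it.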

Be careful with your own bots

If you are using tools to evaluate the performance of your website (page loading time, SEO audit, etc.), they probably rely on bots to do it.

As a consequence, these bots generate data if they are not identified by our technology. You can:

  1. Check the bots we detect (see the list below). 
  2. Verify with your solutions if the bots have a UA pattern.
  3. Add the patterns to your custom bot management configuration.

Behavior of the CMP with Bots

⚙️ By default, bots "bypass" the consent notice: we consider that consent is already given for bots, so all the scripts are fired, along with the consent events. The banner is therefore not displayed to bots and doesn't collect any consent from them.

➡️ If you need to collect consent from bots in your consent notice, follow the instructions in Bypass consent collection for bots.

You can add the JSON code to your consent notice in 2. Customization > Advanced settings > Custom JSON.

Remember that, in that case, the banner is displayed to bots, but they will probably not be able to make a consent choice: they only get a consent notice with the default consent string. Since no consent is collected, the bot will probably not be able to browse the website.

Custom bot management, bypass consent collection for bots

👉 You can directly customize the bot management with custom JSON in your SDK implementation.

The feature offers the following capabilities:

  • Defining the category of bots to block
  • Adding user agent patterns (terms) for exclusion purposes

All the details are available in the developer documentation.
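As a hedged sketch of what such a custom JSON might look like, the snippet below builds and prints one. The key names (`user.bots.consentRequired`, `types`, `extraUserAgents`) and the category values are assumptions to verify against the developer documentation before use, and `MyMonitoringTool` is a hypothetical user agent pattern.

```python
import json


def build_bot_config(consent_required, bot_types, extra_user_agents):
    """Build a custom JSON fragment for bot management.

    Key names are assumptions; check them against the developer documentation.
    """
    return {
        "user": {
            "bots": {
                # Assumed semantics: False lets detected bots bypass the notice.
                "consentRequired": consent_required,
                # Categories of bots to detect (assumed values).
                "types": bot_types,
                # Additional user agent patterns (hypothetical example).
                "extraUserAgents": extra_user_agents,
            }
        }
    }


config = build_bot_config(False, ["crawlers", "performance"], ["MyMonitoringTool"])
print(json.dumps(config, indent=2))
```

The printed JSON is the kind of fragment you would paste into the Custom JSON field of your consent notice.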

Didomi’s bot list

👉 More than 90 bots are automatically detected at the CMP level and during data cleaning processing. Below are the lists of bot patterns (terms) used to identify bot traffic. Every visitor whose user agent contains one of the following terms is identified as a bot.

Crawler bots

Googlebot, adsbot, feedfetcher, mediapartners, bingbot, bingpreview, slurp, linkedin, msnbot, teoma, alexabot, exabot, facebot, facebook, twitter, yandex, baidu, duckduckbot, qwant, archive, applebot, addthis, slackbot, reddit, whatsapp, pinterest, moatbot, google-xrawler, NETVIGIE, PetalBot, PhantomJS, NativeAIBot, Cocolyzebot, SMTBot, EchoboxBot, Quora-Bot, BLP_bbot, MAZBot, ScooperBot, BublupBot, Cincraw, HeadlessChrome, diffbot, Google Web Preview, Doximity-Diffbot, Rely Bot, pingbot, cXensebot, PingdomTMS, AhrefsBot, semrush, seenaptic, netvibes, taboolabot, SimplePie, APIs-Google, Google-Read-Aloud, googleweblight, DuplexWeb-Google, Google Favicon, Storebot-Google, TagInspector, Rigor, Bazaarvoice, KlarnaBot, pageburst, naver, iplabel, plus generic terms like “robot”, “scraper”, “crawler”, “spider”, “crawling” and “oncrawl”.

Performance bots

Chrome-Lighthouse, gtmetrix, speedcurve, DareBoost, PTST, StatusCake_Pagespeed_Indev.
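To show how the two lists can be used together, here is a minimal sketch that labels a user agent with its bot category. It only includes a handful of terms from each list, the category labels are chosen for the example, and the case-insensitive substring matching is an assumption rather than a description of Didomi's implementation.

```python
from typing import Optional

# Small, illustrative subset of the terms listed above.
BOT_TERMS = {
    "crawler": ("Googlebot", "bingbot", "AhrefsBot", "HeadlessChrome", "spider"),
    "performance": ("Chrome-Lighthouse", "gtmetrix", "speedcurve", "DareBoost", "PTST"),
}


def bot_category(user_agent: str) -> Optional[str]:
    """Return the category of the first matching term, or None for no match."""
    ua = user_agent.lower()
    for category, terms in BOT_TERMS.items():
        if any(term.lower() in ua for term in terms):
            return category
    return None


googlebot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
lighthouse_ua = (
    "Mozilla/5.0 (Linux; Android 7.0; Moto G (4)) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/98.0.4695.0 Mobile Safari/537.36 Chrome-Lighthouse"
)
regular_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.64"
)

print(bot_category(googlebot_ua))   # crawler
print(bot_category(lighthouse_ua))  # performance
print(bot_category(regular_ua))     # None
```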

Bot management diagram

[Figure: bot management diagram]

(1) SDK is loaded

(2) Notice triggering rules verification:

  • SDK scans the user agent to identify if it’s a bot or not.
  • If a bot is detected, the behavior of the notice is defined by the notice config (whether or not to trigger the notice).
  • If the visitor is not labelled as a bot, the notice is triggered.

(3) CMP events (notice display) are triggered

(4) Data Processing (turn events into analytics)

👉 All the events (data) collected from (identified) bots are excluded from the analytics, even if the notice has been displayed to the bot on purpose.

(5) Analytics data is displayed in the dashboards
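The flow above can be sketched as follows. This is a simplified model under stated assumptions, not the SDK's code: the bot terms are a small subset, the event name `notice.shown` and the `show_notice_to_bots` flag are hypothetical, and only the notice display event is modelled.

```python
from dataclasses import dataclass
from typing import List

BOT_TERMS = ("googlebot", "headlesschrome", "chrome-lighthouse")  # illustrative subset


@dataclass
class Event:
    name: str
    user_agent: str
    is_bot: bool


def is_bot(user_agent: str) -> bool:
    """Step 2: scan the user agent for a known bot term."""
    ua = user_agent.lower()
    return any(term in ua for term in BOT_TERMS)


def handle_visit(user_agent: str, show_notice_to_bots: bool) -> List[Event]:
    """Steps 2-3: apply the notice config and emit the notice display event."""
    bot = is_bot(user_agent)
    if bot and not show_notice_to_bots:
        return []  # notice bypassed: no notice display event in this model
    return [Event("notice.shown", user_agent, bot)]


def to_analytics(events: List[Event]) -> List[Event]:
    """Step 4: exclude every event collected from an identified bot."""
    return [event for event in events if not event.is_bot]


human_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/95.0.4638.69 Safari/537.36"
bot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# The notice is deliberately shown to bots here, yet their events are dropped.
events = handle_visit(human_ua, True) + handle_visit(bot_ua, True)
print(len(events))                # 2
print(len(to_analytics(events)))  # 1
```

Keeping the bot flag on each event lets the exclusion happen at processing time, regardless of how the notice was configured for bots.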

Bot protection tools

[Figure: bot protection solution placed in front of the website]

Some solutions are specialized in bot detection and protection. They protect your website from bot traffic. 

As these solutions detect bots before they reach the website (see drawing), they can prevent bots from loading any page and therefore from impacting the CMP analytics data.

For more information, see solutions such as Datadome, Human, Cloudflare, Netacea, etc.