OPINION 11 August 2022

War of the worlds: Tackling survey fraud

Ryan Howard shares methods to defend against survey fraud and maintain the quality of research data.

An automated survey responder, particularly within business-to-business (B2B) research, can provide a steady trickle of ill-gotten gains. These bots are not alien tentacles occasionally sent forth by the dark web. Rather, anyone, including those without IT skills, can download and launch their own within minutes. A colleague at a tech manufacturer revealed that their in-house surveys consist mainly of automated responses – less a mischievous side hustle, more an invasion.

At its head is a growing community of coders, intellects vast and unsympathetic, to whom this is sport. The mainstreaming of virtual private networks (VPNs) lets fraudsters operate from anywhere, often where exchange rates work in their favour.

Keeping them at bay was once an effortless thwack, but as we do more with artificial intelligence (AI), so do they. Today’s bots are bred to squirm naturally: they mimic reading time to navigate speed traps and adjust to sentiment so that their answers vary realistically, not randomly. They approach gingerly, careful not to raise suspicion. Without precaution, it is already impossible to tell them apart from real respondents. Researchers near the coalface are getting twitchy.

As the classic science fiction trope goes, to save our world we must turn the invader’s inherent strength against itself:

Set trapdoors
Repeating a question on a reversed answer scale foils more than just unengaged humans. A ‘please click threee here’ (sic) works a treat. AI freaks out at poor spelling.
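
As a rough illustration, both trapdoors can be checked in a few lines of Python. This is only a sketch: the field names (satisfaction, satisfaction_reversed, instructed_item), the five-point scale and the one-point tolerance are assumptions made for the example, not part of any particular survey platform.

```python
def failed_trapdoors(answers, scale_max=5, tolerance=1):
    """Return the names of the trapdoor checks this respondent failed."""
    failures = []
    # An item and its reversed twin should sum to roughly scale_max + 1.
    mirrored = answers["satisfaction"] + answers["satisfaction_reversed"]
    if abs(mirrored - (scale_max + 1)) > tolerance:
        failures.append("reversed_scale")
    # The instructed item ('please click threee here') has one right answer.
    if answers["instructed_item"] != 3:
        failures.append("instructed_item")
    return failures


# Hypothetical respondents, for illustration only.
engaged = {"satisfaction": 4, "satisfaction_reversed": 2, "instructed_item": 3}
straightliner = {"satisfaction": 5, "satisfaction_reversed": 5, "instructed_item": 5}

print(failed_trapdoors(engaged))
print(failed_trapdoors(straightliner))
```

A failed trapdoor is better treated as one strike among several than as grounds for deletion on its own.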

Play dirty
Allow multiple responses to a question which logically can only have a single answer, such as age. A bot will do as the input expects, inclined to tick more than one box.
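
Checking for the impossible combination afterwards is trivial. A minimal sketch, assuming the age question is exported as a list of ticked bands per respondent (the IDs and band labels are made up):

```python
def impossible_multi_select(selected_options):
    """True if more than one mutually exclusive option was ticked."""
    return len(set(selected_options)) > 1


# Hypothetical export: respondent ID -> ticked age bands.
responses = {
    "r001": ["35-44"],
    "r002": ["18-24", "45-54", "65+"],
}

flagged = sorted(
    rid for rid, ages in responses.items() if impossible_multi_select(ages)
)
print(flagged)
```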

Capture page timings
I still see ‘non-AI’ bots, the minions if you will, characterised by constant or inhuman pacing. Respondents should not be able to move a mouse, or scroll and tap, faster than you can.
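
Both symptoms, inhuman speed and constant pacing, can be screened from per-page timings. The sketch below assumes timings are recorded in seconds per page; the two-second floor and the 10% variation threshold are illustrative guesses that would need calibrating against your own questionnaire.

```python
from statistics import mean, pstdev


def timing_flags(page_seconds, min_seconds=2.0, min_variation=0.10):
    """Flag respondents whose page timings look too fast or too regular."""
    flags = []
    if min(page_seconds) < min_seconds:
        flags.append("too_fast")
    # Coefficient of variation: people speed up and slow down between pages.
    variation = pstdev(page_seconds) / mean(page_seconds)
    if variation < min_variation:
        flags.append("constant_pacing")
    return flags


# Hypothetical timings, in seconds per page.
human = [14.2, 31.0, 8.5, 22.7, 47.3]
metronome = [5.0, 5.1, 5.0, 4.9, 5.0]
sprinter = [0.4, 0.6, 0.3, 0.5, 0.4]

for name, timings in [("human", human), ("metronome", metronome), ("sprinter", sprinter)]:
    print(name, timing_flags(timings))
```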

Avoid field names and form validation
Form engines populate attribute labels, hidden within the page’s HTML, to communicate what is contained within each field to the receiving server: a name, address, password, etc. Bots read the same labels to work out what to enter. Instead, use JavaScript workarounds to submit information, and if validation fails, trigger a reCAPTCHA task.

Ask imaginative open-ended questions
Just like our chatbots do, survey bots can ingest frequently appearing noun phrases from questionnaires, identify commonality and respond on topic, albeit still a little clunkily. Asking an “anything else?” style question is an opportunity to pick from a list of generic answers. Formulating creative questions within a specific context early in your questionnaire lays bare canned responses. Those rehearsed in data cleaning can spot these a mile away.
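
For those less rehearsed, a crude first pass is to normalise the open-ended answers and count repeats. This sketch only catches near-verbatim copies (it assumes English text and made-up respondent IDs), so it complements rather than replaces a human read-through.

```python
import re
from collections import Counter


def normalise(text):
    """Lowercase, drop punctuation and collapse spaces so near-copies collide."""
    return " ".join(re.sub(r"[^a-z0-9\s]", "", text.lower()).split())


def canned_answers(open_ends, min_repeats=2):
    """Return IDs whose normalised answer appears at least min_repeats times."""
    counts = Counter(normalise(text) for text in open_ends.values())
    return {
        rid for rid, text in open_ends.items()
        if counts[normalise(text)] >= min_repeats
    }


# Hypothetical open-ended responses.
open_ends = {
    "r001": "We'd mostly use it for onboarding contractors in our Leeds depot.",
    "r002": "Good product, no complaints.",
    "r003": "good product no complaints",
    "r004": "Good product,  no complaints!",
}

print(sorted(canned_answers(open_ends)))
```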

Segmentation reveals programmatic behaviour. Look for instances where the pattern of answers is directly influenced by onscreen elements. For example, you may find a segment whose Max Diff scores mirror item length/verbosity. At first glance, their priorities appear consistent and considered. Chilling, isn’t it?
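
One way to test for this is to correlate each segment’s scores with the character length of the items. The sketch below uses a hand-rolled Pearson correlation; the items, segment names, scores and the 0.9 threshold are all invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical Max Diff items and rescaled segment-level scores.
items = [
    "Price",
    "Ease of use",
    "Integrates with our existing systems",
    "Round-the-clock customer support availability",
]
lengths = [len(item) for item in items]
segment_scores = {
    "segment_a": [0.9, 0.4, 0.7, 0.2],
    "segment_b": [0.1, 0.3, 0.8, 1.0],
}

# Flag segments whose priorities track how wordy each item is.
suspect = [
    name for name, scores in segment_scores.items()
    if pearson(lengths, scores) > 0.9
]
print(suspect)
```

A high correlation is a prompt to inspect the segment, not proof of fraud: genuinely important items are sometimes the wordier ones.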

Call a sheepdog
Isolation Forests are machine learning algorithms which generate random rules to separate out respondents. They run hundreds of times over, around and around, reporting which respondents can be split from the herd with the fewest rules. These are the multivariate outliers, harder to pin down by eye, those that don’t fit a mould – say, ahem, bots less able to convey coherent personas.

At the time of writing, this algorithm works remarkably well, and it would be a poetic solution were it not so heavy-handed. That is, it will point to a pensioner who commutes on a bike and has a PlayStation 5. Alone, it cannot differentiate between a bot and your eccentric aunt.
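
To make the idea concrete, here is a deliberately simplified, pure-Python sketch of isolation-style scoring. It is not a full Isolation Forest (in practice a library such as scikit-learn’s IsolationForest would be the usual choice), and the respondent attributes are invented. In the standard formulation, respondents that random splits isolate in few steps are scored as outliers.

```python
import random


def path_length(point, rows, rng, depth=0, max_depth=10):
    """Count the random splits needed to separate `point` from `rows`."""
    if len(rows) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))
    values = [row[feature] for row in rows]
    low, high = min(values), max(values)
    if low == high:  # simplification: stop if this feature cannot split
        return depth
    cut = rng.uniform(low, high)
    # Keep only the rows that fall on the same side of the cut as the point.
    same_side = [
        row for row in rows
        if (row[feature] < cut) == (point[feature] < cut)
    ]
    return path_length(point, same_side, rng, depth + 1, max_depth)


def mean_path_lengths(rows, n_trees=200, sample_size=64, seed=0):
    """Average path length per respondent over many random passes."""
    rng = random.Random(seed)
    totals = [0] * len(rows)
    for _ in range(n_trees):
        sample = rng.sample(rows, min(sample_size, len(rows)))
        for i, row in enumerate(rows):
            totals[i] += path_length(row, sample, rng)
    return [total / n_trees for total in totals]


# Hypothetical respondents: (age, commute_km, weekly_gaming_hours).
data_rng = random.Random(1)
respondents = [
    (data_rng.randint(25, 45), data_rng.randint(5, 30), data_rng.randint(0, 6))
    for _ in range(40)
]
respondents.append((78, 2, 35))  # the eccentric aunt, or a bot

scores = mean_path_lengths(respondents)
most_isolated = scores.index(min(scores))
print(respondents[most_isolated])
```

The shortest average path marks the most isolated respondent, which is exactly why the flag needs a second opinion before anyone is removed.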

Use premium panels
There are tiers of sample providers. Some will supplement their panel using river sampling, luring relevant visitors with flashy pop-ups. Like blood in the water, the survey is wide open for a single bot to rotate VPNs, attempting it many times over while signalling the swarm to follow. Each pass, another roll of the dice to evade security and outfox quality checks.

Then our standard aggregate analysis tables conceal bots, leading some to believe there is still such a thing as a cheap and dirty alternative to premium sample – a false economy by anyone’s definition.

That’s why some panels validate and monitor panellist engagement, some like mother hens. Wary that some bots have human helpers, they’ve forged hostile, walled gardens, with reCAPTCHA at every turn. They also clean their data aggressively. As a modeller, I rely on considered answers that produce meaningful relationships at a respondent level, so I am compelled to seek out the industry’s most skilled and ruthless.

That said, when it comes to data quality, we must not be overzealous, lest real people be deleted, harming representation and sending the costs of research skyrocketing; the economics of which are exponential and immediate. The pragmatic approach is to combine and alternate strategies, accepting that the odd bot might get through. Though, mark my words, as AI becomes more sophisticated, no single ploy will be enough, no combination of tricks a permanent haven.

So, this is the time for vigilance, creativity, and paranoia. Firstly, let’s acknowledge this new generation among us and that they’re more insidious than we’d like to believe. They are getting smarter, and still they come.

Ryan Howard is a freelance data science consultant.