SQL patterns I use to catch transaction fraud

redbell1 pts0 comments

Six SQL patterns I use to catch transaction fraud Quick disclaimer: I do data work on a program-integrity team. Examples below use generic transaction tables and made-up scenarios. Nothing here comes from anything I’ve actually worked on or seen. Views are mine, not my employer’s.

Fraud detection in transaction data is mostly SQL. Not machine learning, not graph databases, not whatever Gartner is hyping this year. SQL, run against the right tables, with the right joins, looking for the right shapes.

I work mostly with government-funded benefit programs, but the patterns below port over to anything with a transactions table: credit cards, healthcare claims, e-commerce, point-of-sale. If money moves and gets logged, these queries will find weird things in the log.

Six patterns. Roughly in the order I’d build them out on a new dataset.

1. Velocity

The simplest one. Someone with a stolen card wants to drain it before the holder notices. So they hit the card fast.

SELECT<br>cardholder_id,<br>date_trunc('hour', timestamp) AS hour_bucket,<br>count(*) AS tx_count,<br>min(timestamp) AS first_tx,<br>max(timestamp) AS last_tx<br>FROM transactions<br>WHERE timestamp >= current_date - INTERVAL '30 days'<br>GROUP BY 1, 2<br>HAVING count(*) > 10;<br>Tune two knobs: the window size and the count threshold. I usually run a 1-minute, 5-minute, and 1-hour version in parallel and compare. Different fraud shows up at different scales — a card-testing ring hits a server in seconds; a benefits-trafficking ring might take an afternoon.

A few cardholders will legitimately blow past the threshold. Route operators servicing vending machines. People reloading prepaid cards in bulk. Your false positives. Worth keeping a whitelist after the first pass.

For sliding-window velocity, this is the form I use:

SELECT<br>cardholder_id,<br>timestamp,<br>count(*) OVER (<br>PARTITION BY cardholder_id<br>ORDER BY timestamp<br>RANGE BETWEEN INTERVAL '5 minutes' PRECEDING AND CURRENT ROW<br>) AS tx_in_last_5min<br>FROM transactions<br>QUALIFY tx_in_last_5min >= 5<br>ORDER BY cardholder_id, timestamp;<br>QUALIFY works in Snowflake, BigQuery, Databricks, Teradata. For Postgres you wrap the whole thing in a CTE and filter on the outside. Slight pain, same result.

2. Impossible travel

If a card swipes in Chicago and seven minutes later swipes in Los Angeles, one of those swipes is fake. The card is cloned. This is the most uncontroversial fraud signal you’ll find — there’s almost no legitimate reason a single card is in two distant places in seven minutes.

WITH ordered_tx AS (<br>SELECT<br>cardholder_id,<br>timestamp,<br>location,<br>LAG(timestamp) OVER (PARTITION BY cardholder_id ORDER BY timestamp) AS prev_ts,<br>LAG(location) OVER (PARTITION BY cardholder_id ORDER BY timestamp) AS prev_loc<br>FROM transactions<br>SELECT<br>cardholder_id,<br>prev_ts AS first_tx,<br>timestamp AS second_tx,<br>prev_loc AS first_location,<br>location AS second_location,<br>EXTRACT(EPOCH FROM (timestamp - prev_ts)) / 60 AS minutes_apart,<br>haversine(prev_loc, location) AS miles_apart<br>FROM ordered_tx<br>WHERE prev_ts IS NOT NULL<br>AND prev_loc <> location<br>AND haversine(prev_loc, location)<br>/ nullif(EXTRACT(EPOCH FROM (timestamp - prev_ts)), 0)<br>* 3600 > 600;<br>haversine is the great-circle distance function. Most warehouses ship one. If yours doesn’t, it’s about ten lines to write your own.

The 600 mph threshold is rough — commercial jet cruise is around 575, so this is “faster than a plane could possibly do it.” You can tighten it to 100 mph if you want to catch suspiciously-fast ground travel too, but at that threshold you start picking up real airline travelers, kids with parents driving them home from camp, that kind of thing.

A few other shapes in the same family are worth running:

Two distant cities, same state, inside 5 minutes. Local cloning rings.

Multiple ZIP codes inside an hour. Skimmer rings working a region.

Border crossings inside 10 minutes. International rings.

3. Amount anomalies

There are a couple of amounts that show up disproportionately in fraud and almost never in normal use.

SELECT cardholder_id, timestamp, amount, merchant_id<br>FROM transactions<br>WHERE<br>(amount >= 99.50 AND amount 100.00)<br>OR (amount >= 499.50 AND amount 500.00)<br>OR amount IN (1.00, 5.00, 10.00)<br>ORDER BY cardholder_id, timestamp;<br>What’s happening:

Round dollar amounts at small values — $1.00, $5.00, $10.00 — are almost always card tests. Someone got a card number from a dump and they’re checking if it works before reselling it. Real cardholders almost never buy something for exactly $1.00. Coffee is $4.73, gas is $52.81. The roundness is the signal.

Amounts just below a threshold are different. $99.99 is interesting because at a lot of places, $100 is the line where the cashier is supposed to check ID. $499.99 is interesting because $500 is often a daily ATM cap. Whoever’s doing the transaction knows the rules and is staying under them.

(For benefits transactions specifically, the round-number pattern doesn’t help much. Benefits don’t get card-tested the same way. There the...

timestamp from cardholder_id card amount fraud

Related Articles