Harvesting the Noise While it's Fresh, Revisited

A year's worth of logs yields entertaining but unsurprising findings about spammer behavior.

spam mail, masked but detected (from the archive)

Returning readers will be almost painfully aware that here at nxdomain.no (also known as bsdly.net) we host and maintain a blocklist, which in turn is the product of traffic that hits our mail system with attempts at delivery to one or more of the now more than three hundred thousand known bad addresses, also featured at the blocklist home page.

Note: This piece is also available with trackers but nicer formatting here.

When I first set up the greytrapping back in 2007, the initial spamtraps were non-deliverable addresses in our domains that I had extracted from mail server logs. I won't bore you with the details (which are anyway documented at length in earlier articles), but it was clear from those logs that the domains we hosted back then were more or less continuously subject to Joe jobs, as in somebody sending messages with a forged From: field with a made-up address in our domains.

After a while I started extracting the potential new spamtraps from the greylist, dumping the data from there once per hour as part of the script that also generated the exported blocklist. The basic process is described in the July 25 2007 article Harvesting the noise while it's still fresh; SPF found potentially useful (also available trackerless but with links to tracked articles).

Then today it struck me that while that method is useful, by extracting only from the greylist we will only ever collect the addresses from the initial connections. Any addresses attempted after the miscreants enter the blocklist will simply not be recorded there.

This of course led to the question: What did we miss?

Fortunately I keep my logs around for a while; the most easily accessible log archive for my main spamd spans a little over a year. So I set about with some very basic grep and awk, which netted me this raw list of targeted addresses from the spamd logs.

The list weighs in at a total of 269903 entries, as counted by wc -l.
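For readers who want to try something similar, here is a minimal sketch of the extraction step in Python rather than the grep and awk one-liner actually used. It assumes spamd log lines of the general shape shown in the comment; that shape, the sample lines and the name extract_recipients are illustrative assumptions, so check them against your own logs before relying on any of it.

```python
import re

# Assumed log line shape (an assumption, not copied from the article):
#   ... spamd[4242]: (BLACK) 192.0.2.10: <sender@example.com> -> <rcpt@example.net>
LINE_RE = re.compile(
    r"spamd\[\d+\]: \((?:BLACK|GREY)\) \S+ <([^>]*)> -> <([^>]*)>"
)


def extract_recipients(lines):
    """Return the sorted, de-duplicated recipient addresses found in lines."""
    recipients = set()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not a delivery attempt line, skip it
        rcpt = match.group(2).strip().lower()
        if rcpt:
            recipients.add(rcpt)
    return sorted(recipients)


# Hypothetical sample data, for illustration only.
sample = [
    "Jan 18 07:12:34 host spamd[4242]: (BLACK) 192.0.2.10: <a@example.com> -> <Trap1@example.net>",
    "Jan 18 07:12:40 host spamd[4242]: 192.0.2.10: disconnected after 387 seconds.",
    "Jan 18 07:13:02 host spamd[4242]: (GREY) 192.0.2.22: <b@example.org> -> <other@example.net>",
    "Jan 18 07:14:11 host spamd[4242]: (BLACK) 192.0.2.10: <c@example.com> -> <trap1@example.net>",
]

targets = extract_recipients(sample)
print(len(targets))  # the equivalent of piping the list through wc -l
```

Lowercasing the addresses is a simplification that keeps the list free of case-only duplicates; whether that is appropriate depends on how your trap list is matched.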

Some of those addresses are valid, and a small, but actually significant, number are in domains we do not actually serve here, and some entries do not look like mail addresses at all. The stranger ones could be strings encoded in a character set that spamd is not equipped to handle, or could be other binary data that might have been intended to trigger bugs in some of the variants of fully equipped SMTP servers that are out there. Or simply noise of any other kind, including a byproduct of the not very intelligent extraction one-liner I used.

The target addresses in foreign domains I take as a sign that at least some spamming operators mistake a reasonably configured spamd for an open relay, just like they did all those years ago when I started running the greytrapping.

Some things apparently stay the same no matter how the rest of the world has found a way to move forward.

While I did a few other tasks and finally started writing this article, the bulk of the processes that would answer the question posed earlier (What did we miss?) could fortunately run unattended in the background, and after some manual massaging we are left with a results file with 1530 entries that were neither actually useful deliverable addresses in our domains nor existing spamtraps.
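The filtering described above amounts to a set difference. The following Python sketch shows one way to do it; it is not the script that produced the results file. The function name new_spamtraps, its parameters and the sample data are hypothetical, and the restriction to locally served domains is an added assumption.

```python
def new_spamtraps(candidates, valid_addresses, existing_traps, local_domains):
    """Return candidates that are neither deliverable nor already spamtraps.

    Restricting the result to local_domains is an assumption made for this
    sketch; the article does not spell out how foreign domains were handled.
    """
    valid = {a.strip().lower() for a in valid_addresses}
    known = {a.strip().lower() for a in existing_traps}
    domains = {d.strip().lower() for d in local_domains}

    result = set()
    for candidate in candidates:
        address = candidate.strip().lower()
        local, sep, domain = address.rpartition("@")
        if not sep or not local or domain not in domains:
            continue  # not an address, or not in a domain we serve
        if address in valid or address in known:
            continue  # deliverable address or existing spamtrap
        result.add(address)
    return sorted(result)


# Hypothetical sample data, for illustration only.
fresh = new_spamtraps(
    candidates=[
        "Info@example.net",
        "postmaster@example.net",
        "old-trap@example.net",
        "someone@elsewhere.example",
        "garbage",
        "info@example.net",
    ],
    valid_addresses=["postmaster@example.net"],
    existing_traps=["old-trap@example.net"],
    local_domains=["example.net"],
)
print(fresh)
```

The check against valid addresses is the important part: a deliverable address that slips into the trap list will get its legitimate correspondents blocked.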

This means of course that the collection of imaginary friends expanded by the same number, and now stands at 304154 entries.

Which I suppose means that harvesting the noise even after a period of aging for refinement can be a good thing.

The entries added represent a wide variety of phenomena. Quite a few seem to be truncated versions of earlier spamtrap entries, and a fair number of the new entries look like they may have descended from artifacts of stupidity such as products of SMTP callbacks. Proving mainly that in mail and spam handling, there appears to be a space still for the less intellectually astute.

With all of this said, the natural followup question is, given the modest net result, was this worth the effort?

Well, the raw output that yielded 269903 entries needed some manual operations in order to weed out the obvious noise (exact time used not recorded), followed by another background task that took, according to time(1)

    real        105m24.220s
    user        73m3.280s
    sys         29m14.930s

which yielded 1577 entries that were pared down to 1530 entries that met the criteria for inclusion in the circle of imaginary friends (also known as spamtraps).

Before this experiment, the spamtraps list numbered 302625; after including the result here, the count stands at 304154, for a gain of less than one percent of the previous total. If you check the traplist home page now, the total number is likely to have increased again.

So was it worth the effort? I feel that as an experiment, it was worth doing.

Whether or not it is an experiment that is worth repeating is a question for another day.

If you have opinions on this, I would love to hear from you, in comments, via email or messages on whichever social media brought you the link to this article.

As always, parties interested in studying the data referenced in this article and other pieces I have written are welcome to contact me for arrangements. I can easily dig out more and rawer data than directly referenced here on request.

Stay safe out there.

As a side note, a slightly improved way of extracting useful data about other domains' mail service via SPF records can be found in the November 2018 article Goodness, Enumerated by Robots. Or, Handling Those Who Do Not Play Well With Greylisting.

That article (naturally) works from the premise that you are running a recent OpenBSD system.

Addendum 2025-01-12

For those so inclined, it is perhaps worth noting that after a bit of pondering some time after writing this piece, I started looking at extracting other items from the spamd log entries.

Some time in mid-2024 I ended up extracting the local parts for new spamtraps from the purported sender addresses of entries for trapped delivery attempts. This made for a significant increase in the number of new imaginary friends, and by the final months of that year I had also started extracting similarly from the string offered by the spam senders as their host name in the EHLO/HELO exchange, which of course swelled the population further.
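As a rough illustration of that idea, the Python sketch below builds candidate spamtrap addresses from a sender's local part and from a HELO string. The function names, the character whitelist and the example.net trap domain are assumptions made for the sketch, not a description of the scripts actually in use here.

```python
import re

# Conservative whitelist of characters accepted in a generated local part.
# This is an assumption for the sketch; adjust it to your own needs.
SAFE_RE = re.compile(r"[a-z0-9._+-]+")


def trap_from_sender(sender, trap_domain):
    """Reuse the local part of a purported sender address in trap_domain."""
    local, sep, _ = sender.strip().lower().rpartition("@")
    if not sep or not SAFE_RE.fullmatch(local):
        return None  # null sender, garbage, or unsafe characters
    return f"{local}@{trap_domain}"


def trap_from_helo(helo, trap_domain):
    """Use the host name offered in EHLO/HELO as a local part in trap_domain."""
    name = helo.strip().strip(".").lower()
    if not SAFE_RE.fullmatch(name):
        return None  # empty string, address literal, or unsafe characters
    return f"{name}@{trap_domain}"


# Hypothetical inputs, for illustration only.
print(trap_from_sender("Johnson@mail.example.org", "example.net"))
print(trap_from_helo("mail.example.org.", "example.net"))
```

Anything generated this way still needs to be checked against the valid addresses in your domains before it goes anywhere near the spamtrap list.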

The effect is clearly to be seen in the file that records the number of spamtraps added per year, updated via trivial scriptery roughly daily.

I hope this article and its addendum help inspire others in our efforts of green cybercrime prevention by giving the actually intelligent detection methods less work to do.

Addendum some more 2025-01-18

I suppose it had to happen sooner or later, as commemorated in this toot, which said

Likely not blogworthy in itself, but #openbsd #spamd aficionados will get a light chuckle from hearing that some scraping and massaging relevant logs had the number of imaginary friends at https://nxdomain.no/~peter/traplist.shtml for our not-friends to play with roll past the one million mark in the early hours of today CET.

The recent update of https://nxdomain.no/~peter/harvesting_the_noise_revisited.html has links to more info. #spam #antispam #greytrapping #blocklists #cybercrime

Yes, that's right, after I turned to extracting vaguely relevant data from logs in order to salt the mine and poison the well further, the number of imaginary friends quickly grew past the one million mark.

And as if this particular Saturday morning was not already quite weird enough for most tastes, somebot produced another remarkable item that I just could not resist tooting about:

And ref previous toot, the 1006089th imaginary friend to join the collection at https://nxdomain.no/~peter/traplist.shtml is, mail.protection.outlook.com@bsdly.net following this sequence: https://nxdomain.no/~peter/blogpix/2025-01_18_johnson@vicglobalintelligence.com_to_mail.protection.outlook.com@bsdly.net.txt

The bots never cease to amaze #openbsd #spamd #greytrapping #antispam #cybercrime

And the two episodes combined proved addendum-worthy, at least, see https://nxdomain.no/~peter/harvesting_the_noise_revisited.html

Yes, you read that right: For reasons known only to the bots' herders (if that), the subdomain that houses mail services for a large number of Microsoft customers entered the lexicon of spammers' spamto: addresses. Only to be included at first sight in the herd of imaginary friends I hope will help poison the spammers' data further.

The activity here did of course not stop the bots from keeping on trying. A few minutes after the second addendum here was added and tooted out, my logs showed the following activity from the hosts involved in trying to spam mail.protection.outlook.com@bsdly.net: https://nxdomain.no/~peter/blogpix/2025-01-18_host_targeting_mail.protection.outlook.com@bsdly.net_all_spamd_log_entries.txt. And more likely than not, they will keep trying.

How was the start of your weekend?

Also worth noting is that if you do try to do this at home, please keep in mind that you will need to implement a scheme that keeps actually valid addresses in your domains out of the spamtrap pool. Otherwise regrettable episodes may arise.