If you just want to download the latest source or try the live CGI version of CMOScript, here you go:
Download link:
http://sandgnat.com/cmos/obfu.pl (latest version)
Examples (try it out -- live CGI rule generator) here:
http://sandgnat.com/cmos/cmos.jsp
Otherwise, read on!
Users don't want spam. Mail admins put filters in place to reduce the spam so their users are happy. SpamAssassin is a filter that uses a cumulative score. Simple word rules, such as adding to the score when the word "viagra" appears in the subject line or body of a message, have been quite effective in filtering viagra spam. However, spammers are sneaky fellers. They will do anything they can to get their spam into your inbox. One tactic they use to bypass word-based rules/filters is obfuscation.
While there are many different kinds of obfuscation (HTML comment tags, small fonts, etc.; see Jennifer's Rules and MaskedWordList for more), let's focus on the following three forms of word obfuscation:
All three of these methods rely on a human's ability to recognize the original word regardless of extra, missing, or replaced characters.
SpamAssassin rules can be written that match these obfuscation techniques. For instance, one day you receive a spam with "v1codin" in the subject line. Easy, you think, and write a rule...
header MY_VICODIN Subject =~ /v[i1]codin/i
... and life is good. However, the next day your vicodin-peddling chum sends you another spam with a slightly different "v|codin" in the subject line. Doh. Let's get smarter...
header MY_VICODIN Subject =~ /v[i1l|]c[o0]d[i1l|]n/i
... that'll keep 'em at bay. However, the next day... DOH! Now they're coming in as "V_Ì+Ç+0-D_Ï-Ñ"!!! The low-life scum! GRRRR!!!! You hate playing whack-a-mole with your rules, but how can you feasibly write a rule that matches all the possible permutations? (OK, so maybe this example is a little far-fetched, but they really do use these tactics.)
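To make the whack-a-mole problem concrete, here is a small sketch that runs hand-written rules in the spirit of the two above against the three example subjects. Python's `re` module is used only as a stand-in for SpamAssassin's Perl regexps, and the names `RULE_V1`, `RULE_V2`, `SUBJECTS`, and `hits` are invented for this illustration.

```python
import re

# Hand-written rules in the spirit of the two MY_VICODIN rules above.
# Assumption: '|' is included in the second rule so it catches "v|codin".
RULE_V1 = re.compile(r"v[i1]codin", re.IGNORECASE)
RULE_V2 = re.compile(r"v[i1l|]c[o0]d[i1l|]n", re.IGNORECASE)

# The three subject lines from the story above.
SUBJECTS = ["v1codin", "v|codin", "V_Ì+Ç+0-D_Ï-Ñ"]


def hits(rule, subjects):
    """Return the subjects that the compiled rule matches."""
    return [s for s in subjects if rule.search(s)]


if __name__ == "__main__":
    print(hits(RULE_V1, SUBJECTS))
    print(hits(RULE_V2, SUBJECTS))
```

The first rule catches only "v1codin" and the second adds "v|codin", but neither touches the third subject: every hand-tuned character class covers one more variant and still misses the next one.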
This page details my attempt at creating a rule generation script for obfuscated text that will catch many of the permutations of the obfuscation techniques above.
I will call it CMOScript (See-Moh-Script) because I'm getting sick of typing "The script".
ChrissMediocreObfuScript (CMOScript) is a script written in Perl that generates a rules file of rules that match obfuscated text. The primary goal (duh!) is to match as much spam as possible with as few false positives as possible.
This thing sort of evolved as I was writing it. My spam corpus contains mostly smut spam. The default SpamAssassin ruleset wasn't catching many of the obfu "naughty words". Originally, the script was intended to be run on rules files from the default distribution, such as 20_porn.cf. However, after running stats, I found I wasn't doing much better than the original ruleset.
Now, I create a simple rules file with very short rules that match the naughty words. I use CMOScript to generate a complex obfuscation-detecting rules file. The hit rate of my obfuscation-detecting smut rules is pretty decent for my corpus.
Here's how I actually use it in my environment:
user1@ns1:~$ ./bin/obfu.pl -o /etc/spamassassin/local_pron.cf.src > local_pron.cf
user1@ns1:~$ sudo chown root.root local_pron.cf
user1@ns1:~$ sudo mv local_pron.cf /etc/spamassassin/local_pron.cf
The first line will generate the destination rules file from the source rules file and place it in my home directory. The second line changes the ownership of the destination file to root. The third line places the destination rules file in the appropriate place for my configuration.
Usage: obfu.pl [arguments] [source rules file] > [destination rules file]
Arguments:
  -o      Create rules that ONLY match if text is obfuscated
  -g      Don't use gappy text obfuscation check. (faster matches)
  -s      Use simple gappy text pattern (faster matches) (can't be used with -g)
  -m <n>  Allow n characters for gappy text match (more lenient matches: catches v..iag.ra)
          (probably causes more false positives)
  -w      Don't use special gap for short words
          (disabling this catches more spam, but probably causes more false positives)
  -u      Don't check for (unicode) html entities/entity ranges (faster matches)
  -h      Output high ASCII characters directly (greater than 127; don't use \xFF in rules)
  -v      Make vowels matches optional
          (Very experimental-NOT RECOMMENDED when using gappy text!)
  -D      Print debug information (as comments in generated rules file)
CMOScript operates either on a pipe or takes a filename arg, like many standard unix tools (including SpamAssassin). You feed your source rules file to the script and it spits out an obfuscation-matching rules file, which you can then redirect to a file in your /etc/spamassassin directory, or wherever.
SpamAssassin body and header rules are extracted from the source rules file (as of this writing, Subject is the only header rule supported). Each rule is broken apart into tokens. For each token that is a letter of the alphabet, such as 'a', a replacement pattern such as [a4\*\@\xC0-\xC5\xAA\xE0-\xE5] is looked up and inserted into the target rule. As each source rule from your rules file is converted to an obfuscation-detecting rule, it is renamed with a prefix of "LOCAL_OBFU_".
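The lookup-and-replace step can be sketched roughly as follows. This is not CMOScript's actual Perl code, just an illustration in Python. Only the pattern for 'a' is taken from the text above; the entries for 'i' and 'o', along with the names `REPLACEMENTS`, `obfu_pattern`, and `obfu_rule`, are made up for the example. Note that in a Python string pattern `\xC0` means a code point, whereas in a SpamAssassin rule it is a byte.

```python
import re

# Hypothetical replacement table. Only the 'a' entry comes from the text;
# the 'i' and 'o' entries are invented for illustration.
REPLACEMENTS = {
    "a": r"[a4\*\@\xC0-\xC5\xAA\xE0-\xE5]",
    "i": r"[i1l\|]",
    "o": r"[o0]",
}


def obfu_pattern(word):
    """Swap each letter for its replacement class; escape anything unknown."""
    return "".join(REPLACEMENTS.get(ch, re.escape(ch)) for ch in word.lower())


def obfu_rule(name, word):
    """Format a SpamAssassin-style body rule carrying the LOCAL_OBFU_ prefix."""
    return "body LOCAL_OBFU_%s /%s/i" % (name, obfu_pattern(word))


if __name__ == "__main__":
    print(obfu_rule("VIAGRA", "viagra"))
```

Even with this tiny table, the pattern generated for "viagra" already matches "v1@gra", because each known letter has been widened into a class of look-alike characters.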
(Optional, Default) G*A*P filler is inserted between each simple character token.
(Optional) The rule is made to ignore non-obfuscated versions of the word. This can be useful, for instance, to match commonly obfuscated words that are not always spammy. Example: your mother may tell you she just "refinanced" her "mortgage", but a spammer might tell you that you need to "R.efin.an.ce" your "M0rtg@ge". You can flip a switch and generate rules that match the latter but not the former.
(Optional) All vowels are made "optional characters" via the "?" regexp modifier (This is still experimental. It seems to generate too many false positive hits when used in conjunction with gappy text detection).
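The three optional steps above can be sketched like this, again in Python rather than the script's own Perl. The gap class, the per-letter classes in the usage example, and the helper names `gappy`, `only_obfuscated`, and `optional_vowels` are all assumptions made for illustration; the real script's gap pattern is considerably more involved. The negative lookahead follows the `(?!\bviagra\b)` style of prefix that appears in generated rules.

```python
import re

# Hypothetical gap filler: at most one non-alphanumeric character.
GAP = r"[^a-z0-9]?"


def gappy(classes, gap=GAP):
    """Join per-letter patterns with an optional gap filler between them."""
    return gap.join(classes)


def only_obfuscated(word, pattern):
    """Prefix a negative lookahead so the plain word itself never matches."""
    return r"(?!\b%s\b)%s" % (re.escape(word), pattern)


def optional_vowels(word, classes):
    """Append '?' to every vowel's pattern (the experimental vowel option)."""
    return [c + "?" if ch in "aeiou" else c for ch, c in zip(word, classes)]


if __name__ == "__main__":
    # Invented per-letter classes for "mortgage".
    classes = ["m", "[o0]", "r", "t", "g", "[a4@]", "g", "e"]
    rule = re.compile(only_obfuscated("mortgage", gappy(classes)), re.IGNORECASE)
    for text in ["mortgage", "M0rtg@ge", "m.o.r.t.g.a.g.e"]:
        print(text, bool(rule.search(text)))
```

With these assumed classes, the rule ignores a plain "mortgage" but still fires on "M0rtg@ge" and on the dotted variant. Making vowels optional on top of a gap filler loosens the pattern a great deal, which fits the warning above about false positives.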
Add yours to this list:
Feel free to add bugs to this list:
Version History: ChrissMediocreObfuScriptVersionHist
Courtesy of ChrisThielen
Subject: Reverse character range with Unicode
I just noticed that when using Unicode (not using -u), the generated rules contain character ranges that are in reverse order:
body LOCAL_OBFU_CSIM_VIAGRA /(?!\bviagra\b)(?:\b[vu]|\B(?:\\\/|\xCE\xBD))[\x01-\x2F\x3A-\x40\x5B-\x60\|\x7F-\xA1\xA4-\xA8\xAB-\xAD\xAF-\xB1\xB4\xB7-\xBB\xBF\xF7]?(?:[il1:\|\*\xCC-\xCF\xEC-\xEF\xA6]|(?:\xC4[\xB0-\xAF]|\xC4[\xAF-\xAE]|...
See the [\xB0-\xAF], then the [\xAF-\xAE], and so on. (I truncated the rule.)
So SA complains:
[2677] warn: config: invalid regexp for rule LOCAL_OBFU_CSIM_VIAGRA: /(?!\bviagra\b)(?:\b[vu]|\B(?:\\\/|\xCE\xBD))[\x01-\x2F\x3A-\x40\x5B-\x60\|\x7F-\xA1\xA4-\xA8\xAB-\xAD\xAF-\xB1\xB4\xB7-\xBB\xBF\xF7]?(?:[il1:\|\*\xCC-\xCF\xEC-\xEF\xA6]|(?:\xC4[\xB0-\xAF]|\xC4[\xAF-\xAE]|...
Invalid [] range "\xB0-\xAF" in regex; marked by <-- HERE in m/(?i)(?!\bviagra\b)(?:\b[vu]|\B(?:\\\/|\xCE\xBD))[\x01-\x2F\x3A-\x40\x5B-\x60\|\x7F-\xA1\xA4-\xA8\xAB-\xAD\xAF-\xB1\xB4\xB7-\xBB\xBF\xF7]?(?:[il1:\|\*\xCC-\xCF\xEC-\xEF\xA6]|(?:\xC4[\xB0-\xAF <-- HERE ]|\xC4[\xAF-\xAE]|...
When using -u (no Unicode), the rule is shorter and there is no such character range trouble.
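As a stopgap, a generated rules file could be scanned for such reversed ranges before it is installed. The following is a minimal sketch in Python, not part of CMOScript; `HEX_RANGE` and `reversed_ranges` are hypothetical names. It only looks at the text of `\xHH-\xHH` ranges and does not check that they actually sit inside a bracketed class.

```python
import re

# Matches a \xHH-\xHH range as it appears in the text of a generated rule.
HEX_RANGE = re.compile(r"\\x([0-9A-Fa-f]{2})-\\x([0-9A-Fa-f]{2})")


def reversed_ranges(rule_text):
    """Return every hex range whose start value is greater than its end value."""
    return [
        m.group(0)
        for m in HEX_RANGE.finditer(rule_text)
        if int(m.group(1), 16) > int(m.group(2), 16)
    ]


if __name__ == "__main__":
    # Fragment shaped like the truncated rule quoted in the report above.
    fragment = r"(?:[il1:\|\*\xCC-\xCF\xEC-\xEF\xA6]|(?:\xC4[\xB0-\xAF]|\xC4[\xAF-\xAE]|"
    print(reversed_ranges(fragment))
```

Run on that fragment, the function reports the two ranges named in the warning and leaves the well-formed ones alone.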