Eradicating image spam with FuzzyOcr.


Email spam keeps getting worse, pouring freely into the mail servers run by ISPs, hosting providers, companies and so on. My office mail server is also hit by this image spam, as many as 5-10 messages a day. The spam uses CID URLs and the Content-ID header to embed an image encoded in base64, so the body of the email is a JPEG or GIF file containing advertising text (see also RFC 2557). It is irritating and a real headache. To deal with it I tried FuzzyOCR as a SpamAssassin plugin. FuzzyOCR uses an OCR program, in this case gocr (an optical character recognition program), to look for specific keywords inside the image/gif, image/jpeg or image/png parts attached to the email. Without further ado I deployed FuzzyOCR on the mail server. Below are the installation steps I managed to document, although the record of installing the supporting libraries/programs and of editing the configuration files is incomplete:
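To make the message layout described above a little more concrete, here is a minimal shell sketch that checks a raw message file for image parts and for a cid: reference in the HTML body. The helper names count_image_parts and has_cid_reference are made up for this illustration and are not part of FuzzyOcr or SpamAssassin; the sketch only greps header lines and does not parse MIME properly.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not part of FuzzyOcr.

# Count part headers that declare a GIF, JPEG or PNG image.
count_image_parts() {
    grep -c -i -E '^Content-Type:[[:space:]]*image/(gif|jpeg|png)' "$1"
}

# Succeed if the body references an embedded part through a cid: URL,
# e.g. <img src="cid:...">. The optional "3D" covers quoted-printable bodies.
has_cid_reference() {
    grep -q -i -E 'src=(3D)?"?cid:' "$1"
}

# Usage (message.eml is a placeholder file name):
#   count_image_parts message.eml
#   has_cid_reference message.eml && echo "inline image referenced"
```

A message that has at least one image part and a cid: reference matches the pattern of spam described here, but plenty of legitimate mail looks the same, which is why the actual decision is left to the OCR step.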

This implementation was done on a Debian Sarge 3.1 Linux machine.

gtoms@mail-server-bcu:~$ uname -a
Linux mail-server-bcu 2.4.27-2-386 #1 Wed Aug 17 09:33:35 UTC 2005 i686 GNU/Linux

mail-server-bcu:/home/gtoms# wget http://internap.dl.sourceforge.net/sourceforge/libungif/libungif-4.1.4.tar.gz
--10:47:52-- http://internap.dl.sourceforge.net/sourceforge/libungif/libungif-4.1.4.tar.gz
=> `libungif-4.1.4.tar.gz'
Resolving internap.dl.sourceforge.net... 64.74.207.43
Connecting to internap.dl.sourceforge.net[64.74.207.43]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 602,359 [application/x-gzip]
100%[====================================================================>] 602,359 60.57K/s ETA 00:00
10:48:01 (72.99 KB/s) - `libungif-4.1.4.tar.gz' saved [602359/602359]


mail-server-bcu:/home/gtoms# tar xzvf libungif-4.1.4.tar.gz

mail-server-bcu:/home/gtoms/libungif-4.1.4# apt-get install libnetpbm10-dev netpbm giflib3g-dev libimage-exif-perl libstring-approx-perl
Reading Package Lists... Done
Building Dependency Tree... Done
The following extra packages will be installed:
giflib3g libexif10 libnetpbm10
Recommended packages:
gs gs-aladdin
The following NEW packages will be installed:
giflib3g giflib3g-dev libexif10 libimage-exif-perl libnetpbm10 libnetpbm10-dev libstring-approx-perl netpbm
0 upgraded, 8 newly installed, 0 to remove and 3 not upgraded.
Need to get 1602kB of archives.
After unpacking 5493kB of additional disk space will be used.
Do you want to continue? [Y/n] y
Get:1 http://debian.indika.net.id stable/main libexif10 0.6.9-6 [81.0kB]
Get:2 http://debian.indika.net.id stable/main libimage-exif-perl 0.99.4-3 [53.7kB]
Get:3 http://debian.indika.net.id stable/main libnetpbm10 2:10.0-8sarge3 [65.0kB]
Get:4 http://debian.indika.net.id stable/main libnetpbm10-dev 2:10.0-8sarge3 [111kB]
Get:5 http://debian.indika.net.id stable/main libstring-approx-perl 3.24-1 [43.3kB]
Get:6 http://debian.indika.net.id stable/main netpbm 2:10.0-8sarge3 [1200kB]
Get:7 http://debian.indika.net.id stable/main giflib3g 3.0-11 [23.6kB]
Get:8 http://debian.indika.net.id stable/main giflib3g-dev 3.0-11 [25.0kB]
Fetched 1602kB in 2s (653kB/s)
Selecting previously deselected package libexif10.
(Reading database ... 32381 files and directories currently installed.)
Unpacking libexif10 (from .../libexif10_0.6.9-6_i386.deb) ...
Selecting previously deselected package libimage-exif-perl.
Unpacking libimage-exif-perl (from .../libimage-exif-perl_0.99.4-3_i386.deb) ...
Selecting previously deselected package libnetpbm10.
Unpacking libnetpbm10 (from .../libnetpbm10_2%3a10.0-8sarge3_i386.deb) ...
Selecting previously deselected package libnetpbm10-dev.
Unpacking libnetpbm10-dev (from .../libnetpbm10-dev_2%3a10.0-8sarge3_i386.deb) ...
Selecting previously deselected package libstring-approx-perl.
Unpacking libstring-approx-perl (from .../libstring-approx-perl_3.24-1_i386.deb) ...
Selecting previously deselected package netpbm.
Unpacking netpbm (from .../netpbm_2%3a10.0-8sarge3_i386.deb) ...
Selecting previously deselected package giflib3g.
Unpacking giflib3g (from .../giflib3g_3.0-11_i386.deb) ...
Selecting previously deselected package giflib3g-dev.
Unpacking giflib3g-dev (from .../giflib3g-dev_3.0-11_i386.deb) ...
Setting up libexif10 (0.6.9-6) ...
Setting up libimage-exif-perl (0.99.4-3) ...
Setting up libnetpbm10 (10.0-8sarge3) ...
Setting up libnetpbm10-dev (10.0-8sarge3) ...
Setting up libstring-approx-perl (3.24-1) ...
Setting up netpbm (10.0-8sarge3) ...
Setting up giflib3g (3.0-11) ...
Setting up giflib3g-dev (3.0-11) ...
mail-server-bcu:/home/gtoms/libungif-4.1.4# apt-get install imagemagick libjpeg-progs
Reading Package Lists... Done
Building Dependency Tree... Done
The following extra packages will be installed:
libdps1 libjasper-1.701-1 liblcms1 libmagick6
Suggested packages:
gs html2ps libjasper-runtime liblcms-utils
The following NEW packages will be installed:
imagemagick libdps1 libjasper-1.701-1 libjpeg-progs liblcms1 libmagick6
0 upgraded, 6 newly installed, 0 to remove and 3 not upgraded.
Need to get 3258kB of archives.
After unpacking 10.7MB of additional disk space will be used.
Do you want to continue? [Y/n] y
Get:1 http://debian.indika.net.id stable/main libdps1 4.3.0.dfsg.1-14sarge1 [286kB]
Get:2 http://debian.indika.net.id stable/main libjasper-1.701-1 1.701.0-2 [135kB]
Get:3 http://debian.indika.net.id stable/main liblcms1 1.13-1 [123kB]
Get:4 http://debian.indika.net.id stable/main libmagick6 6:6.0.6.2-2.7 [1172kB]
Get:5 http://debian.indika.net.id stable/main imagemagick 6:6.0.6.2-2.7 [1466kB]
Get:6 http://debian.indika.net.id stable/main libjpeg-progs 6b-10 [77.1kB]
Fetched 3258kB in 1s (1736kB/s)
Selecting previously deselected package libdps1.
(Reading database ... 32951 files and directories currently installed.)
Unpacking libdps1 (from .../libdps1_4.3.0.dfsg.1-14sarge1_i386.deb) ...
Selecting previously deselected package libjasper-1.701-1.
Unpacking libjasper-1.701-1 (from .../libjasper-1.701-1_1.701.0-2_i386.deb) ...
Selecting previously deselected package liblcms1.
Unpacking liblcms1 (from .../liblcms1_1.13-1_i386.deb) ...
Selecting previously deselected package libmagick6.
Unpacking libmagick6 (from .../libmagick6_6%3a6.0.6.2-2.7_i386.deb) ...
Selecting previously deselected package imagemagick.
Unpacking imagemagick (from .../imagemagick_6%3a6.0.6.2-2.7_i386.deb) ...
Selecting previously deselected package libjpeg-progs.
Unpacking libjpeg-progs (from .../libjpeg-progs_6b-10_i386.deb) ...
Setting up libdps1 (4.3.0.dfsg.1-14sarge1) ...
Setting up libjasper-1.701-1 (1.701.0-2) ...
Setting up liblcms1 (1.13-1) ...
Setting up libmagick6 (6.0.6.2-2.7) ...
Setting up imagemagick (6.0.6.2-2.7) ...
Setting up libjpeg-progs (6b-10) ...


mail-server-bcu:/home/gtoms/libungif-4.1.4# cd gocr-0.40/src

mail-server-bcu:/home/gtoms/libungif-4.1.4/gocr-0.40/src# wget http://antispam.imp.ch/patches/patch-gocr-segfault
--10:58:14-- http://antispam.imp.ch/patches/patch-gocr-segfault
=> `patch-gocr-segfault'
Resolving antispam.imp.ch... 157.161.9.64
Connecting to antispam.imp.ch[157.161.9.64]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 596 [text/plain]
100%[====================================================================>] 596 --.--K/s
10:58:20 (5.68 MB/s) - `patch-gocr-segfault' saved [596/596]


mail-server-bcu:/home/gtoms/libungif-4.1.4/gocr-0.40/src# patch pgm2asc.c < patch-gocr-segfault
patching file pgm2asc.c
Hunk #1 succeeded at 1202 with fuzz 2 (offset 2 lines).
Hunk #2 succeeded at 1255 (offset 2 lines).


mail-server-bcu:/home/gtoms/libungif-4.1.4/gocr-0.40/src# cd ..

mail-server-bcu:/home/gtoms/libungif-4.1.4/gocr-0.40# ./configure --prefix=/usr && make && make install
checking for gcc... gcc
checking for C compiler default output... a.out
checking whether the C compiler works... yes


mail-server-bcu:/home/gtoms/libungif-4.1.4/gocr-0.40# cd

mail-server-bcu:~# wget http://www200.pair.com/mecham/spam/image001.gif
--11:07:17-- http://www200.pair.com/mecham/spam/image001.gif
=> `image001.gif'
Resolving www200.pair.com... 209.68.2.45
Connecting to www200.pair.com[209.68.2.45]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12,312 [image/gif]
100%[====================================================================>] 12,312 22.06K/s
11:07:18 (22.00 KB/s) - `image001.gif' saved [12312/12312]


mail-server-bcu:~# giftopnm image001.gif > image001.pnm
giftopnm: too much input data, ignoring extra...
giftopnm: bogus character 0x00, ignoring
mail-server-bcu:~# gocr image001.pnm
' AnENTlON ALL DAY TRADERS AND INVESTORS '
DPER UP 5$.330/o IN A SINGLE DAY!.
m HAVE A RUNNE_!.!.
CoUED THIS 8E THE NEXT EXXoN9.
AEE sIcNs sH_THAT DpERIs A8ouT To ExpEoDE!.
WATCH DPE_PH LIHE A HAWH STARTING ON THURS AUG _7_H
Company Hame DEEP EARTH RESOURCES IDPER PHI
&ock Symbol DPER
Wednesday Close O OOS5 IUP 59 334b Wednesday alonel
Wednesday _olume 60,5l5,066
5_day Targd O 07
Currem RdIng STROHE BUY
WATCH THIS AS SooN AS THE MARKET opENS!.!.!.
DPER RELEASES BREAm!HG HEWS
SIHGAPORE__(MARmET WIRE>_May l5, 2008 __ Deep Eadh ResouIres, Inr (the ''Compan_')
(OtheI OTC DPER Pm _ Hem) Iepo_ that tudheI to the Compan_s nem Ielease
dated May 4, 2008 In whIrh the dlIe_on announred the rhange ot the Compan_s name,
management has InItIated eno_ to IdentIh a_quIsItIon and loInt ventuIe oppodunItIes
wIthIn the eneIgy se_oI
The Compan_s InItIal eno_ wIII be ronrentIated on IdentIhIng oII and gas pIopedIes
Iorated In the UnIted States Management wIII be _udyIng IntoImatIon on pIopedIes lorated
In Texas, AII_ona, Hew MexIro and Utah It Is the Compan_s IntentIon to employ a lowIId_
arquIsItIon _Iategy In oIdeI to sele_ the InItIal arquIsItIon oI loInt ventuIe taIgeb
Onre the rompany has arquIIed an InteIe_ In a roIe gIoup ot pIopedIes, a tInanrIal
a_e_ment wIII be rondu_ed to deteImIne the appIopIIate level ot II* toleIanre
Management torus wIII be on sele_Ing the InItIal taIgeb to e_ablIsh a base wIthIn
the eneIgy se_oI
OPPORTUNl_ DOES NOT HNOCH ON THE DOOR Mm DA_
So ADD DpER To YoUR RADAR N_ AND WATCH IT SoA_
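The manual test above (convert the image to PNM, then run gocr on it) can be wrapped in a small helper. This is only a sketch: pnm_converter_for and ocr_image are names invented here, the choice of converter by file extension is my own simplification, and it assumes the netpbm tools giftopnm, jpegtopnm and pngtopnm are installed and that gocr reads standard input when called with "-i -".

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not part of FuzzyOcr.

# Pick a netpbm converter from the file extension.
pnm_converter_for() {
    case "$1" in
        *.gif|*.GIF)               echo giftopnm ;;
        *.jpg|*.jpeg|*.JPG|*.JPEG) echo jpegtopnm ;;
        *.png|*.PNG)               echo pngtopnm ;;
        *)                         return 1 ;;
    esac
}

# Convert an image to PNM and feed it to gocr on stdin.
# Converter warnings (like the giftopnm messages above) are discarded.
ocr_image() {
    conv=$(pnm_converter_for "$1") || { echo "unsupported: $1" >&2; return 1; }
    "$conv" "$1" 2>/dev/null | gocr -i -
}

# Usage: ocr_image image001.gif
```

Choosing by extension is fragile, since spam often mislabels its images; it is good enough for testing files by hand.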


mail-server-bcu:/home/gtoms# wget http://users.own-hero.net/~decoder/fuzzyocr/fuzzyocr-2.3b.tar.gz
--11:11:45-- http://users.own-hero.net/%7Edecoder/fuzzyocr/fuzzyocr-2.3b.tar.gz
=> `fuzzyocr-2.3b.tar.gz'
Resolving users.own-hero.net... 85.214.51.57
Connecting to users.own-hero.net[85.214.51.57]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 76,411 [application/x-gzip]
100%[====================================================================>] 76,411 65.44K/s
11:11:50 (65.33 KB/s) - `fuzzyocr-2.3b.tar.gz' saved [76411/76411]


mail-server-bcu:/home/gtoms# tar xzvf fuzzyocr-2.3b.tar.gz
FuzzyOcr-2.3b/
FuzzyOcr-2.3b/FuzzyOcr.pm
FuzzyOcr-2.3b/INSTALL
FuzzyOcr-2.3b/FuzzyOcr.words.sample
FuzzyOcr-2.3b/FuzzyOcr.cf
FuzzyOcr-2.3b/LICENSE
FuzzyOcr-2.3b/FAQ
FuzzyOcr-2.3b/samples/
FuzzyOcr-2.3b/samples/png.eml
FuzzyOcr-2.3b/samples/corrupted-gif.eml
FuzzyOcr-2.3b/samples/jpeg.eml
FuzzyOcr-2.3b/samples/README
FuzzyOcr-2.3b/samples/animated-gif.eml


mail-server-bcu:/home/gtoms# cd FuzzyOcr-2.3b

mail-server-bcu:/home/gtoms/FuzzyOcr-2.3b# wget http://www200.pair.com/mecham/spam/fuzzyocr-23b-hashdb-poison.patch
--11:12:29-- http://www200.pair.com/mecham/spam/fuzzyocr-23b-hashdb-poison.patch
=> `fuzzyocr-23b-hashdb-poison.patch'
Resolving www200.pair.com... 209.68.2.45
Connecting to www200.pair.com[209.68.2.45]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2,052 [text/plain]
100%[====================================================================>] 2,052 --.--K/s
11:12:29 (19.57 MB/s) - `fuzzyocr-23b-hashdb-poison.patch' saved [2052/2052]


mail-server-bcu:/home/gtoms/FuzzyOcr-2.3b# patch FuzzyOcr.pm < fuzzyocr-23b-hashdb-poison.patch
patching file FuzzyOcr.pm

mail-server-bcu:/home/gtoms/FuzzyOcr-2.3b# cp FuzzyOcr.pm /etc/spamassassin/

mail-server-bcu:/home/gtoms/FuzzyOcr-2.3b# cp FuzzyOcr.cf /etc/spamassassin/

mail-server-bcu:/home/gtoms/FuzzyOcr-2.3b# cp FuzzyOcr.words.sample /etc/spamassassin/FuzzyOcr.words

mail-server-bcu:/home/gtoms/FuzzyOcr-2.3b# vi /etc/spamassassin/init.pre
Add this line: loadplugin FuzzyOcr /etc/spamassassin/FuzzyOcr.pm

mail-server-bcu:/home/gtoms/FuzzyOcr-2.3b# vi /etc/spamassassin/FuzzyOcr.cf
Remove the hash mark (#) from the following line:
#loadplugin FuzzyOcr FuzzyOcr.pm

If you are running a SpamAssassin version older than 3.1.4, remove the hash mark and set the value to 1.0:
focr_pre314 1.0

Set focr_base_score to 2:
focr_base_score 2

To test the sample images, set focr_autodisable_score to 50:
focr_autodisable_score 50
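After editing by hand it is easy to leave one of these lines commented out. A quick way to double-check is to list only the active (uncommented) settings touched in this guide. The function name show_focr_settings is made up for this sketch.

```shell
#!/bin/sh
# Hypothetical helper for illustration only.
# Print the uncommented focr_* settings named in this guide.
show_focr_settings() {
    grep -E '^[[:space:]]*focr_(pre314|base_score|autodisable_score)[[:space:]]' "$1"
}

# Usage (path as used in this guide):
#   show_focr_settings /etc/spamassassin/FuzzyOcr.cf
```

Lines that still start with a hash mark are not printed, so a missing setting in the output means it is still disabled.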

mail-server-bcu:/home/gtoms/FuzzyOcr-2.3b# spamassassin --lint
[11615] warn: config: created user preferences file: /root/.spamassassin/user_prefs
[11615] warn: Subroutine new redefined at /etc/spamassassin/FuzzyOcr.pm line 116.
[11615] warn: Subroutine parse_config redefined at /etc/spamassassin/FuzzyOcr.pm line 126.
[11615] warn: Subroutine dummy_check redefined at /etc/spamassassin/FuzzyOcr.pm line 223.
[11615] warn: Subroutine fuzzyocr_check redefined at /etc/spamassassin/FuzzyOcr.pm line 227.
[11615] warn: Subroutine load_global_words redefined at /etc/spamassassin/FuzzyOcr.pm line 237.
[11615] warn: Subroutine load_personal_words redefined at /etc/spamassassin/FuzzyOcr.pm line 255.
[11615] warn: Subroutine parse_scansets redefined at /etc/spamassassin/FuzzyOcr.pm line 278.
[11615] warn: Subroutine max redefined at /etc/spamassassin/FuzzyOcr.pm line 285.
[11615] warn: Subroutine reorder redefined at /etc/spamassassin/FuzzyOcr.pm line 293.
[11615] warn: Subroutine pipe_io redefined at /etc/spamassassin/FuzzyOcr.pm line 298.
[11615] warn: Subroutine handle_error redefined at /etc/spamassassin/FuzzyOcr.pm line 410.
[11615] warn: Subroutine logfile redefined at /etc/spamassassin/FuzzyOcr.pm line 416.
[11615] warn: Subroutine check_image_hash_db redefined at /etc/spamassassin/FuzzyOcr.pm line 435.
[11615] warn: Subroutine add_image_hash_db redefined at /etc/spamassassin/FuzzyOcr.pm line 475.
[11615] warn: Subroutine calc_image_hash redefined at /etc/spamassassin/FuzzyOcr.pm line 497.
[11615] warn: Subroutine debuglog redefined at /etc/spamassassin/FuzzyOcr.pm line 537.
[11615] warn: Subroutine wrong_ctype redefined at /etc/spamassassin/FuzzyOcr.pm line 543.
[11615] warn: Subroutine corrupt_img redefined at /etc/spamassassin/FuzzyOcr.pm line 562.
[11615] warn: Subroutine known_img_hash redefined at /etc/spamassassin/FuzzyOcr.pm line 587.
[11615] warn: Subroutine check_fuzzy_ocr redefined at /etc/spamassassin/FuzzyOcr.pm line 602.

mail-server-bcu:/home/gtoms/FuzzyOcr-2.3b# cd samples


mail-server-bcu:/home/gtoms/FuzzyOcr-2.3b/samples# spamassassin -t < animated-gif.eml
mail-server-bcu:/home/gtoms/FuzzyOcr-2.3b/samples# spamassassin -t < corrupted-gif.eml
Content analysis details: (14.0 points, 5.0 required)
pts rule name description
---- ---------------------- --------------------------------------------------
0.0 HTML_MESSAGE BODY: HTML included in message
1.5 FUZZY_OCR_WRONG_CTYPE BODY: Mail contains an image with wrong
content-type set
Image has format "GIF" but content-type is
"image/jpeg"
2.5 FUZZY_OCR_CORRUPT_IMG BODY: Mail contains a corrupted image
Corrupt image: GIF-LIB error: Image is
defective, decoding aborted.
10 FUZZY_OCR BODY: Mail contains an image with common spam text inside
Words found:
"alert" in 1 lines
"alert" in 1 lines
"stock" in 2 lines
"investor" in 1 lines
"company" in 1 lines
"trade" in 1 lines
"target" in 1 lines
"service" in 1 lines
"recommendation" in 1 lines
(10 word occurrences found)
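The "Words found" report counts, for each word in the word list, how many lines of OCR text contain it. A very rough shell imitation of that idea is sketched below. It is a simplification and not how the plugin works internally: it does exact, case-insensitive substring matching, whereas FuzzyOcr matches approximately so that OCR mistakes are tolerated. The function name count_word_lines and the one-word-per-line list format are assumptions made for this sketch.

```shell
#!/bin/sh
# Hypothetical helper for illustration only; exact matching, not fuzzy.
# Usage: some_ocr_command | count_word_lines WORDLIST
# Prints: "word" in N lines   (only for words that occur at least once)
count_word_lines() {
    text=$(cat)                                    # read OCR text from stdin once
    while IFS= read -r word; do
        case "$word" in ''|'#'*) continue ;; esac  # skip blank lines and comments
        # grep -c exits non-zero on zero matches, hence the "|| true"
        n=$(printf '%s\n' "$text" | grep -c -i -F -- "$word" || true)
        if [ "$n" -gt 0 ]; then
            printf '"%s" in %s lines\n' "$word" "$n"
        fi
    done < "$1"
    return 0
}
```

With exact matching, a garbled line such as "&ock Symbol DPER" from the gocr test earlier would not count as a hit for "stock"; approximate matching exists to catch cases like that.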

mail-server-bcu:/home/gtoms/FuzzyOcr-2.3b/samples# spamassassin -t < jpeg.eml
Content analysis details: (8.1 points, 5.0 required)
pts rule name description
---- ---------------------- --------------------------------------------------
1.8 SUB_HELLO Subject starts with "Hello"
2.3 DATE_IN_FUTURE_12_24 Date: is 12 to 24 hours after Received: date
0.0 HTML_MESSAGE BODY: HTML included in message
4.0 FUZZY_OCR BODY: Mail contains an image with common spam text inside
Words found:
"viagra" in 2 lines
"cialis" in 1 lines
"levitra" in 1 lines
(4 word occurrences found)

mail-server-bcu:/home/gtoms/FuzzyOcr-2.3b/samples# spamassassin -t < png.eml
Content analysis details: (35.1 points, 5.0 required)
pts rule name description
---- ---------------------- --------------------------------------------------
0.8 EXTRA_MPART_TYPE Header has extraneous Content-type:...type= entry
2.0 DATE_IN_FUTURE_03_06 Date: is 3 to 6 hours after Received: date
0.0 HTML_MESSAGE BODY: HTML included in message
3.0 LONGWORDS Long string of long words
3.4 FORGED_MUA_OUTLOOK Forged mail pretending to be from MS Outlook
26 FUZZY_OCR BODY: Mail contains an image with common spam text inside
Words found:
"alert" in 2 lines
"news" in 2 lines
"symbol" in 1 lines
"alert" in 2 lines
"stock" in 1 lines
"investor" in 3 lines
"company" in 2 lines
"buy" in 1 lines
"price" in 2 lines
"trade" in 2 lines
"target" in 2 lines
"service" in 2 lines
"recommendation" in 1 lines
"levitra" in 1 lines
"software" in 2 lines
(26 word occurrences found)
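To compare the sample messages at a glance, the total score can be pulled out of each report's "Content analysis details" line. A minimal sketch, with sa_score being a name invented here:

```shell
#!/bin/sh
# Hypothetical helper for illustration only.
# Read a "spamassassin -t" report on stdin and print "<points> <required>".
sa_score() {
    sed -n 's/^Content analysis details:[[:space:]]*(\([0-9.-]*\) points, \([0-9.]*\) required).*/\1 \2/p'
}

# Usage, run inside the samples directory:
#   for f in *.eml; do
#       printf '%s: ' "$f"
#       spamassassin -t < "$f" | sa_score
#   done
```

For the png.eml report above this would print "35.1 5.0".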

mail-server-bcu:/home/gtoms/FuzzyOcr-2.3b/samples# vi /etc/spamassassin/FuzzyOcr.cf
Now change focr_autodisable_score back from 50 to 3.5; this value must match $sa_kill_level_deflt in your amavisd.conf file.

mail-server-bcu:/home/gtoms/FuzzyOcr-2.3b/samples# amavisd-new reload
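Since the two values have to be kept in step by hand, a small check can catch a mismatch. The sketch below assumes the amavisd.conf line has the usual form "$sa_kill_level_deflt = 3.5;" and compares the values as text, so 3.5 and 3.50 would count as different. All three function names are made up, and the location of amavisd.conf is passed as an argument because it varies between installations.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Active focr_autodisable_score value from FuzzyOcr.cf (last one wins).
focr_autodisable() {
    sed -n 's/^[[:space:]]*focr_autodisable_score[[:space:]]\{1,\}\([0-9.]*\).*/\1/p' "$1" | tail -n 1
}

# Value assigned to $sa_kill_level_deflt in amavisd.conf.
amavis_kill_level() {
    sed -n 's/^[[:space:]]*\$sa_kill_level_deflt[[:space:]]*=[[:space:]]*\([0-9.]*\).*/\1/p' "$1" | tail -n 1
}

# Succeed only when both values are present and textually equal.
scores_match() {
    a=$(focr_autodisable "$1")
    b=$(amavis_kill_level "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage (second path is a placeholder for wherever your amavisd.conf lives):
#   scores_match /etc/spamassassin/FuzzyOcr.cf /path/to/amavisd.conf || echo "mismatch"
```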

FuzzyOCR is now running. If you run top you can see gocr appear whenever an email containing images arrives, and it can eat up CPU resources, so if you want to use this plugin, try to run it on a machine with high specifications. That's it: from my monitoring, image spam dropped drastically right away, and sometimes only 1 image spam a day gets through.

Source: http://www200.pair.com/mecham/spam/image_spam.html

@2006 henry@gultom.or.id


