
Akademia Świętokrzyska im. Jana Kochanowskiego w Kielcach

Selective aspects of the language of spam

Patryk Pawlak

Diploma thesis written under the supervision of Prof. Dr Habil. Piotr Ruszkiewicz
in the Zakład Neofilologii (Department of Modern Languages) of the Akademia Świętokrzyska im. Jana Kochanowskiego w Kielcach

Kielce, June 2007

The Holy Cross Academy of Education at Kielce

Selective aspects of the language of spam

Patryk Pawlak

This thesis has been written under the guidance of
Prof. Dr Habil. Piotr Ruszkiewicz

Submitted to the Department of Foreign Languages and Literatures
Holy Cross Academy of Education at Kielce
in partial fulfillment of the requirements for the degree of
LICENTIATE

June, 2007
Kielce

Patryk Pawlak

Selective aspects of the language of spam

Approved as to content and style by:

Prof. Dr Habil. Piotr Ruszkiewicz, supervisor

reviewer

Kielce, 2007

Selective aspects of the language of spam

Table of contents

Title page
Table of contents
Introduction
Chapter 1 Information about spam
1.1 Different media spam
1.1.1 Messaging spam
1.1.2 Chat spam
1.1.3 Newsgroup and forum spam
1.1.4 Blog spam
1.1.5 M-spam
1.1.6 DoS spam
1.2 Origins and etymology of the term spam
1.2.1 Etymology
1.2.2 History
1.3 Spam costs
1.4 Legal status
1.5 Political issues
Chapter 2 Spammers techniques (morphing messages)
2.1 Fake content technique
2.1.1 Additional messages
2.1.2 Address
2.1.3 Numbers
2.2 Substitution technique
2.2.1 Letter-number substitution
2.2.2 HTML text
2.3 Other techniques
2.3.1 Graphic/Image spam
2.3.2 Dyslexia
2.3.3 BIG letters
2.3.4 Clogging filter database
Chapter 3 Fighting with spam
3.1 Fingerprint matching method
3.2 Machine learning filters
3.2.1 Bayesian filters
3.2.2 Discriminative linear models
3.2.3 N-gram models
3.3 Proof system
3.3.1 Human challenge
3.3.2 Computational challenge
3.3.3 Financial challenge
3.4 Sender list
3.4.1 Blacklist
3.4.2 Whitelist
3.5 Other techniques
3.5.1 URL analysis
3.5.2 Analyzing images
Conclusions
Polish summary
Additional resources
Bibliography

Introduction

In this thesis I will try to provide some insight into selected aspects of the language of spam. In the first chapter I shall present some specific information concerning different kinds of spam appearing in various media. This will be followed by a brief history of spam messaging and the etymology of the term. In the final part of the first chapter I shall provide some additional information concerning the political and legal status of spam, as well as its total cost. The second chapter will describe some of the methods used by spammers to circumvent anti-spam filters and systems, such as the fake content technique (adding additional content or random characters to the body of the message), substitution techniques, and other methods used to fool computer filters. The last chapter will include a schematic description of techniques used by anti-spam researchers and specialists in their fight against spam. These include fingerprint matching methods, machine learning systems such as Bayesian filters or models that track words specially morphed by spammers, proof systems including challenge methods, safe sender lists and other techniques (e.g. URL analysis, image analysis).

Chapter 1. Information about spam

The term spam, when related to the internet or e-mail, means a great number of unwanted messages; hence it is also sometimes called UBE (Unsolicited Bulk E-mail). An e-mail message can only be called spam if it is both unsolicited and bulk, because other “normal” messages can possess one of these features without being spam themselves, e.g. job enquiries (the receiver has generally not granted permission to get them) or subscriber newsletters (usually sent to a large group of people). Because of the trouble spam causes at present and the extent to which it has grown, spamming has become a serious and hot topic, discussed thoroughly in many newsgroups and forums, and not only on the internet. One of the reasons why spamming is so common nowadays is its economy: the only costs of sending spam are the costs of gathering e-mail addresses and widening the mailing lists or, what is getting more widespread nowadays, buying them on the net underground. Moreover, a spammer does not have to be a computer genius to send junk mail; therefore the number of spammers around the world is huge and still growing.

1.1 Different media spam

Although the term is usually connected with e-mail, there are also other media in which spam can appear, e.g. instant messaging spam, Usenet newsgroup spam, Web search engine spam, spam in blogs, mobile phone messaging spam, internet forum spam and junk fax transmissions. Here I will discuss some of them.

1.1.1 Messaging spam

Messaging spam is connected with instant messaging systems (it is also called spim, from spam and IM, i.e. instant messaging). It usually uses one scheme of spam messages, repeated at short, more or less regular intervals, e.g. “Want to get rid of this message, visit ...”. The given web site has information on how to get rid of spam messages, for a fee. The problem here is that these messages can be removed for free. However, the scheme works, because a solution is offered and people do not have to ponder about it any more (that is, until the next spam message appears).

1.1.2 Chat spam

Chat spam is connected with any chat environment, e.g. IRC (Internet Relay Chat) or any of the online games. According to Wikipedia, “It consists of repeating the same word or sentence many times to get attention or to interfere with normal operations. It is generally considered very rude and may lead to swift exclusion of the user from the used chat service by the owners or moderators.” This phenomenon is perhaps the best controlled, as action is generally taken immediately after it appears. Usually, moderators ban the person who broke the rule, expelling him or her from the chat room for a limited time or, in extreme cases, imposing a permanent ban.

1.1.3 Newsgroup and forum spam

As the name suggests, newsgroup and forum spam is connected with these media, and as a matter of fact it is older than e-mail spam. According to the Usenet convention (Usenet is a worldwide network of newsgroups in which spam first appeared), the term spam here means “excessive multiple posting” (the abbreviation EMP is sometimes used to describe it) of messages, usually of very similar or identical content. This type of spam caused the closure of many newsgroups, because the great number of unwanted messages made it impossible to read any forums or publications. Usually, spam messages are posted by specially made programs called spam bots (spam robots), which use different IDs, sometimes even different IPs, after each publication and shut down as fast as possible, so as not to be visible to anyone for a longer span of time and so that they cannot be traced.

1.1.4 Blog spam

Blog spam, or spamdexing, is a form of spam whose main aim is “to manipulate the relevancy or prominence of resources indexed by a search engine” (http://en.wikipedia.org/wiki/Spamdexing), by mechanically adding random posts to any type of discussion board which tolerates hyperlinks. When a hyperlink to a spammer’s site is added to a message board together with the post, it falsely increases the relevancy ranking assigned by a search engine, thus putting this site at the beginning of the search results.

1.1.5 M-spam

Another spam type is called m-spam, in other words mobile phone spam or SMS spam. In most cases m-spam consists of a short message with a request to call back. It is based on the notion that anyone who receives a message from an unknown number would like to know who it was and will try to call back. On phoning back, one is unaware of being additionally charged for the call. Another technique used in mobile phone spamming is called one-ring fraud. It is based on the same scheme, except that there is no message but only a short connection, which leaves just the number on the cell phone. Though this kind of spam is getting more and more common because of the low costs of spamming (a single signal without an actual connection is not charged), it is better controlled, as it is guarded by mobile phone companies and the Telephone Consumer Protection Act.

1.1.6 DoS spam

Another type of spam that caused many newsgroups and blogs to shut down is spam used as a denial of service (DoS). The idea is the same as in newsgroup and forum spam, which is to flood forums with thousands of worthless and meaningless posts. In effect, one is practically unable to find any important messages. This activity grew to such an extent that it has become a threat. With spam e-mails, spammers usually send computer viruses or worms (special programs collecting information about personal passwords, bank account numbers etc.). These programs gathered addresses and, having obtained personal data, spammers could threaten the affected person that the data would be made public. At the same time these programs turned a personal computer into a bot (zombie) computer, which spammers later used to send spam to other people.

1.2 Origins and etymology of the term spam

1.2.1 Etymology

Surfing the internet, one can find thousands of etymologies of the word spam, mostly false and made up by net users, for example spam as “shit posing as mail” or “stop posting antisocial messages”. Surprisingly, the term is connected with the Hormel Foods luncheon meat, whose name is explained as “Shoulder Pork and hAM” or “SPiced hAM”. The connection between the pork and the modern definition of the term spam was made in the Monty Python SPAM sketch. It is in this sketch that we can notice the basic meaning of the word, that is, the meaning with which we are familiar nowadays.

The sketch presents a restaurant in which all the food on the menu is served with SPAM. While the waitress describes how much SPAM is in the meals, a group of Vikings begins to sing in chorus: "Spam, spam, spam, spam, spam, spam, spam, spam, lovely spam! Wonderful spam!" This “song”, and thus essentially the word spam, attempts to drown out the normal conversation, presenting the “new” meaning: spam is something extremely annoying that drowns out normal discourse on the internet.

1.2.2 History

The first use of the term spam on the internet dates back to the 1980s, in MUD communities. MUDs, that is Multi-User Dungeons, were the first on-line games with a shared “world” in which users might interact with one another and chat (like today’s Ragnarok Online, MU and many others). Spamming in such an environment covered a few types of actions by the users: “One was to flood the computer with too much data to crash it. Another was to ‘spam the database’ by having a program create a huge number of objects, rather than creating them by hand. And the term was sometimes used to mean simply flooding a chat session with a bunch of text inserted by a program (commonly called a "bot" today) or just by inserting a file instead of your own real time typing output.” (http://www.templetons.com/brad/spamterm.html). One of these tactics was also used in the first days of chat rooms. When a group of friends was having a conversation and someone new came in, one of the old users would begin to type or paste lines from the Monty Python sketch to disable normal conversation; the rest of the group would stay silent until the newcomer left and then continue to chat freely.

This was also practised among rival newsgroups to prevent one another from chatting. Members of one group would enter another’s chat room and start to send random messages, effectively ending normal conversation.

During the 1980s and 1990s Usenet had already had some problems with spam. This was because of the chain mail “Make Money Fast”, which was being cross-posted from one user to another. Here spam was used as a blend of two words, “spew” (eject or send out in large quantities) and “scam” (a fraudulent business scheme), which perfectly described the nature of that post. However, the first usage of the term in the sense of excessive multiple posting occurred, ironically, during one of the attempts to limit unwanted posts. In 1993 Richard Depew wanted to introduce some modifications to Usenet. While he was testing new software called ARMM, which was to introduce the changes, it turned out that the software had a bug, and it sent 200 messages to news.admin.policy, the newsgroup where people discussed the running of the net. When this happened, people who knew the term from MUDs called these messages spam. That was the first time that spam was actually called spam. Thus, the term was adopted and is commonly used nowadays.

As far as commercial spamming is concerned, it began in 1994 with two lawyers, Canter and Siegel, advertising their services in the U.S. Green Card lottery by means of bulk posting. They were severely attacked by the users of Usenet, although to no avail, and after some time they vanished. However, their action gave birth to modern spamming, as some people figured out how to use mass mailing software and flood mailboxes with junk e-mails.

In 1998, the New Oxford Dictionary of English, which had previously only defined spam in relation to the trademarked food product, added a second definition to its entry for spam: "Irrelevant or inappropriate messages sent on the Internet to a large number of newsgroups or users." (http://en.wikipedia.org/wiki/Spam_(electronic)). Though Hormel Foods Corporation had no direct connection with the new meaning of the term spam, it was still their trademark and they spent a lot of money to bring back their good name. Eventually, they decided not to object to the new definition and usage of the term spam, although on their web site we can read: “Please Do: Always put the trademark SPAM in all capital letters. Follow SPAM with ‘Luncheon Meat’ or other descriptor. Remember, a trademark is a formal adjective and as such, should always be followed by a noun.”

1.3 Spam costs

According to the European Union’s Internal Market Commission report of 2001, the annual cost of spam around the world totals about 10 billion euro and is still growing. The sum covers the costs of new equipment, software and the professionals responsible for removing and “fighting” the unwanted mail. Nowadays, 10 billion euro is only a small percentage of what is being spent to fight spam. There are also additional expenses in cases in which spam has damaged the system or even the computer, which is common nowadays, as spam sometimes carries viruses. Moreover, among the losses and costs of spam we can find cases of theft involving identity, data or intellectual property. These kinds of theft occur when, by downloading attachments, consciously or not (sometimes the attachment downloads itself automatically when a spam mail is opened), we receive a hidden program whose task is to gather information about the user and his or her passwords. The program then transfers them to the spammer, thus enabling him to enter personal accounts and use them. One last thing that is lost because of spam is time: not only the time of the people who create new anti-spam software, track spammers and fight this problem in other ways, but also the time of ordinary internet users, who have to go through the flood of junk mail to find the few messages which are important. What is burdensome and annoying for some is, however, fruitful and lucrative for others, that is, for spammers and the companies that employ them to advertise their services.

As stated at the beginning of this chapter, the cost of spam for the sender is basically only the cost of gathering or buying mailing lists, as sending e-mails is usually free. With such competition, other types of advertising campaigns seem too expensive. Even though only a small percentage of people decide to use the services advertised in spam messages, spam is still profitable, and the best proof of that is that it still exists and is expanding to other media.

1.4 Legal status

Spam, though profitable, is illegal in many jurisdictions. As mentioned above, it can be used to send computer viruses or other malicious software to internet users, as well as to obtain account passwords or credit card numbers for later theft. Hence we can classify spammers into two groups: one which uses spam for advertising, and the other which uses people’s inexperience to cheat them. Both groups are equally harmful, though to some people the second one may seem worse because of the actual crime being committed. Yet one should remember that the content of junk mail is often pornography, and spammers do not choose to which account it should or should not be sent. This means that pornographic pictures can eventually end up in a child’s mailbox, which is also a crime.

1.5 Political issues

However troublesome and annoying spam is at present, people are also afraid of spam filtering tools, as they see a threat in them. Some of today’s anti-spam techniques may in the future result in restrictions on free expression, loss of privacy, censorship, and commercialization of e-mail.

There are many controversial actions, taken even nowadays by people dealing with junk mail, that resemble the beginnings of censorship and loss of privacy, for example “stealth blocking”. Stealth blocking describes actions performed by special anti-spam programs which block e-mails from spam sites without the mailbox user’s permission or knowledge. What may cause concern is the fact that these programs might also mistakenly block messages from sites that are important to the user. Other actions include the passing of new laws about spam, like the CAN-SPAM Act of 2003, seen by some as restrictions leading to taxation of e-mail and censorship. However, thanks to these laws, some of the major spammers could be sued and sentenced, which gives some vague idea that the problem can be solved.

Chapter 2. Spammers techniques (morphing messages)

The fight against spam and spammers is a serious problem because it involves hundreds of people and equipment worth a huge amount of money; moreover, junk e-mails often spread illegal content. Spam is also very difficult to combat because of the swiftness of spammers and the number of techniques they use. Though there are different types of spam in many different media, spam is usually connected with e-mail messages, and this kind of spam is thought by many to be the most common nowadays. Among e-mail spam techniques we can distinguish several types similar to each other, and thus we can divide them into different groups. One of these groups is the fake content technique, which involves adding additional messages or random numbers and letters to dispatched messages, or changing the receiver and sender addresses. Similar to it is the substitution technique group, which involves substituting parts of the message, as in letter-to-number substitution or HTML text substitution. Another group is graphic/image spam, including any type of spam sent as images. These and other smaller groups of spam techniques will be discussed further.

2.1 Fake Content Technique

This group is applied most often and we can subdivide it into 4 major subgroups:

2.1.1 Additional messages

Because of the fingerprint matching technique (which will be discussed further), spammers were forced to add additional messages to the one sent as spam. The idea is to make the message seem more complex and sophisticated to the filtering machine. At first the additions were only random sets of characters and numbers. They might also be weather forecasts, short stories or any type of text that would enrich the content of the message. Scientific American gives a great example of a message prepared by a spammer using the morphing messages technique, which is included at the end of the diploma project. Sometimes not only the content but also the title can be changed. Here are some examples:

As can be seen in the examples of the last two messages, sometimes even words are mixed with random characters or signs to cheat the anti-spam filters. This phenomenon will be discussed in detail further.

2.1.2 Address

Another type of the morphing messages technique is the address change method. According to this method, spammers change the user-visible address and the sender address to trick the spam filtering software. It was introduced to trick another kind of spam filter, which disposes of messages coming from a web site connected with spam sending. New addresses are generated every time so that they seem spam-free. It is not only the software that can be tricked by these changes. Usually spam messages are clearly visible when flooding mailboxes, but spam with a changed address can resemble a message from someone one could be familiar with. Here are some examples of such changes:

All of these messages are spam. As can be seen above, an address change connected with an appropriate message in the title may resemble an e-mail including some important information (“Significant message. You should to read”) or a letter from an old friend (as in the example “hi”, sometimes connected with “remember me ?”). To Polish people this may look like spam at first sight if they do not have any friends abroad, but in English-speaking countries it is a serious problem. Some spammers even use the short form of the word response (Re:) used in e-mails, thus making the message look like a real response to somebody’s letter. Another change might be not giving a title at all, thus making people curious about what might be inside such a message and prompting them to open it.

2.1.3 Numbers

Additional content might also appear in the form of numbers, or letters and numbers, sometimes connected with additional symbols. These might be more or less random sequences typed by the spammer or just copied from any text (stock exchange information, phone book, etc.). Scientific American can be used here again as another example of this method.

2.2 Substitution Technique

As stated at the beginning of the chapter, this technique includes two major subgroups:

2.2.1 Letter-number substitution

This substitution technique is based on replacing letters in a single word in such a way that it is still readable and comprehensible. Letters might be changed into numbers, symbols or signs. They can also be replaced with the same letters in different alphabets, or with numbers or signs, thus multiplying the number of ways in which a single word can be written. Such changes are also visible in the Scientific American example. There are numerous ways of morphing a single word so that it retains its basic meaning. I will try to present this on the example of the word “money”, which can be found in most spam messages.

To begin with, using slashes, different letter combinations and exclamation marks, the letter m can be written in 19 different ways (and many others that I may not know): m, M, /\/\, /V\, /Y\, /U\, lVl, lYl, l\/l, IVI, IYI, I\/I, 1V1, 1Y1, 1\/1, !V!, !Y!, !\/!, N\. Here we already have 19 different ways of writing money, and these are only a few examples of such variations.

The letter o can be written as: o, O, with brackets as (), as the number 0, or even as the capital letter Q.

Using slashes, the letter n can be replaced by: n, N, I\I, l\l, 1\1, !\!.

And the letters e and y may have two variations each: e, E and y, Y respectively. Summing all the variations up, there can be 2280 (19 x 5 x 6 x 2 x 2) ways to portray the word money. Additionally, if we included some symbols and signs inside the word, the number would surely grow. For example, there are 6 positions in the word ‘money’ (before it, between its letters and after it) where a character can be inserted. We can insert there any of various characters, for example the plus sign (+), and thus we can have: +money, m+oney, mo+ney, mon+ey, mone+y, money+, +m+oney, +mo+ney, +mon+ey, +mone+y, +money+, m+o+ney, m+on+ey, m+one+y, m+oney+, mo+n+ey, mo+ne+y, mo+ney+, mon+e+y, mon+ey+, mone+y+, etc. Since each of the 6 positions either holds the new character or not, inserting only one new character in different places already gives 64 combinations (2 x 2 x 2 x 2 x 2 x 2, counting the unchanged word). And if we assume that there are around 15 characters that are less disturbing to the eye, including dashes, commas, etc., we already have 960 (15 x 64) more ways of writing this word. All in all, combining these with the previous number of variations, we have 2,188,800 (2280 x 960) different spellings (and that still is not the final number). As can be seen from the example above, the ways of writing a particular word are numerous and are practically limited only by human invention.

2.2.2 HTML text

Another type of substitution technique is HTML text substitution. It works similarly to the previous technique, though here, instead of random characters, HTML text is inserted, or the whole message is written in HTML in such a way that the proper content is invisible to the anti-spam filter; hence the filter cannot determine whether the message is spam or not. For example, the following is a word written in HTML:

&#77;&#79;&#78;&#69;&#89;

When encoded, it makes no sense to people or to anti-spam filters, but once decoded by a mail program it is clearly visible:

MONEY

There are also many other HTML tricks used by spammers to circumvent the security software. One such trick is word splitting. Spammers just insert some HTML markup into the “dangerous” words, thus splitting them so that they do not alert filters. For a normal user these words are clearly visible and comprehensible while reading the message, but for filters they are just a set of unrelated character strings. Another trick is to insert a message or random sets of letters into a word, using HTML in such a way that these letters are too small for people to read and simultaneously make no sense to anti-spam software. For example:

Mo<FONT SIZE="1">whatever</FONT>ney

Would look to us like:

Mo.ney

And to the filter as:

Mowhateverney

Similarly, messages embedded in JavaScript would be ignored by anti-spam software, as they also contain HTML.

2.3 Other techniques

Although these two groups might be considered the most common, there are still many other smaller, but no less troublesome, groups of techniques used by spammers. Only some of them will be discussed here, as they are too numerous and their number is still growing.
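Looking back at the HTML tricks of section 2.2.2, the following is a minimal Python sketch of how a filter might undo the two examples given there before inspecting the text. It is only an illustration under stated assumptions: the function name normalise and both regular expressions are hypothetical and are not taken from any real anti-spam product, and the rule "text inside a FONT tag of size 1 is invisible" is a simplification of what a mail program actually renders.

```python
import html
import re

# Hypothetical pattern: a <font size=1> element together with its content,
# i.e. text that is assumed to be too small for a human reader to see.
TINY_FONT = re.compile(
    r"<font\b[^>]*size\s*=\s*[\"']?1(?!\d)[^>]*>.*?</font>",
    re.IGNORECASE | re.DOTALL,
)
# Hypothetical pattern: any remaining HTML tag.
ANY_TAG = re.compile(r"<[^>]+>")


def normalise(fragment: str) -> str:
    """Return the text roughly as a human reader would see it."""
    text = TINY_FONT.sub("", fragment)  # drop text hidden in a tiny font
    text = ANY_TAG.sub("", text)        # drop the remaining tags
    return html.unescape(text)          # decode numeric character references


print(normalise("&#77;&#79;&#78;&#69;&#89;"))            # MONEY
print(normalise('Mo<FONT SIZE="1">whatever</FONT>ney'))  # Money
```

Tags are removed before the character references are decoded, so that an encoded angle bracket cannot turn into a new tag halfway through the clean-up. A real filter would need a proper HTML parser, since regular expressions cover only simple cases such as these.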

2.3.1 Graphic/Image spam

A spam message can also be included in a graphic/image attachment of an e-mail. It is extremely difficult for anti-spam software to find this message, as filters are not capable of distinguishing letters in an image. But even if they were, letters can be portrayed in such a way that only the human eye can see them. However, nowadays graphic filters are getting more and more sophisticated and are able to filter some of these messages. To deal with this problem, spammers invented a new way to deliver the message to their “clients”. Instead of sending an attachment with an image containing a spam message, they simply upload the image to a website and just include an HTML reference to it in the e-mail message.

2.3.2 Dyslexia

As the name suggests, this method of writing spam is similar to the word-perceiving ability of a dyslexic person, though it works in the opposite direction. Spammers using this technique either scramble the order of letters in a word, leaving only the first and the last letter in their original places, or make some changes in the spelling, like deleting some characters or adding new ones. For example:

CILCK HRE EARON BLLINOS

2.3.3 BIG letters

Among the software spammers use there are programs which generate big words from single letters or characters. Though it is difficult to use them to generate large pieces of text, they can be used to mask only those words which might look suspicious to spam filters. For example:

[ASCII-art figure: a word drawn in large letters built from the character X]

2.3.4 Clogging filter database

Most of today’s spam filters have databases which include words connected directly with spam. They also have learning capabilities and thus present a serious threat to spammers, which has made them a prime target of attacks. The most common way of “breaking” these filters is clogging their database with thousands of words not connected with spam. Usually, these are randomly chosen words from a dictionary, though some spammers with more skills and experience arrange random words into simple sentences. Usually, these words are placed randomly at the end of the message. Most of the words spammers add to the messages do not appear in common usage, so most of them would never be included in anyone’s e-mails, and because of that this technique is not very successful. The only way for such a method to be successful would be to insert words that people of a given group actually receive in their messages, and that is not possible without spy programs. Moreover, even if a spammer somehow got access to the mailboxes, he or she would be forced to create special sets of words for each group, if not for every single person, as everyone receives mail with different content.

Example taken from (http://www.process.com/techsupport/spamtricks.html):

Sound is drop. Why fresh wire. book. parthenogenesis. Cover part reason. Line whether soft oxygen. accoutrements. minute. feet. pick other busy. Cross burn make suggest. Near hot. Notice. Move such light city. are fact find hold.

Chapter 3. Fighting with spam

As said in the previous chapter, spam is a serious problem nowadays and requires the involvement of many people, for example researchers who discover the techniques spammers use, or programmers who create new software dealing with the new forms of junk e-mails. Though combat with spammers is hard because of their cleverness, it is successful at least to some degree. Every action triggers a reaction: as soon as spammers find a new way to circumvent anti-spam filters or other software, those who fight spam create new programs and rules to stop them and to prevent them from sending junk messages to normal users' e-mail boxes. Those programs and rules include: fingerprint matching technique, machine learning filters, discriminative models, n-gram models, proof systems, sender list rules, URL analysis, graphic filters and many others. Because the number of anti-spam techniques, methods and programs is so big, only some of them will be discussed here.

3.1 Fingerprint matching method

Fingerprint matching is one of the earliest anti-spam techniques used. Fingerprints are special marks and rules designed and set by anti-spam specialists in order to make the filter recognize spam as accurately as possible. There are two kinds of fingerprint gathering methods:
- Bulk detection: a fingerprint is taken from any message whose content meets particular network traffic conditions, e.g. rapid transmission.
- Collaborative filtration: fingerprints are set by a community of users who identify spam in their mailboxes and send it to the database server, which accumulates all of the fingerprints. These users might also point out messages that were unfairly treated as spam.

After the junk e-mail reaches the database server, its content is analyzed and a fingerprint, in the form of a number derived from the content, is set. Thus, when a message appears which has the same or a similar structure as the one analyzed, the anti-spam filter deletes it or archives it in a special folder. Scientific American gives a simplified example of the fingerprint matching mechanism: “To give a simplified example, one could add the number of As in a message plus 10 times the number of Bs plus 100 times the number of Cs, and so forth. When a new message arrives, anti-spam programs compute its fingerprint and then compare it with those of known spam”. Unfortunately, this method was easily tricked by spammers, as they included additional messages, random characters or numbers in the content of the spam.

3.2 Machine learning filters

These are filters whose technology is based on machine learning systems. Thanks to their advanced software they are able to identify spam messages, even those which are enriched with some additional content. Additionally, they are capable of noticing that messages including graphic images or links might also be spam.

3.2.1 Bayesian filters

The “learning” ability of Bayesian filters rests in fact on a large pile of data, including spam messages and “good” messages, entered into their database. When presented with such examples, filters “learn” that some words are more likely to appear in spam messages (e.g. “free”, “sex”, “nude”) and that other words can be indicators of “good” messages (e.g. “yesterday”, “whatever”).
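The word-counting idea behind a Bayesian filter can be sketched as follows. This is only a minimal naive Bayes sketch with add-one smoothing, not the implementation of any particular filter; the tiny training messages, the function names `train` and `spam_probability`, and the equal prior are assumptions made for illustration.

```python
import math
from collections import Counter


def train(messages):
    """Count how often each word occurs in a list of example messages."""
    counts = Counter()
    for text in messages:
        counts.update(text.lower().split())
    return counts


def spam_probability(text, spam_counts, ham_counts, prior_spam=0.5):
    """Estimate P(spam | words) with naive Bayes and add-one smoothing."""
    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    log_spam = math.log(prior_spam)
    log_ham = math.log(1 - prior_spam)
    for word in text.lower().split():
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        log_spam += math.log((spam_counts[word] + 1) / (spam_total + len(vocab)))
        log_ham += math.log((ham_counts[word] + 1) / (ham_total + len(vocab)))
    # Convert the two log scores back into a probability of spam.
    return 1 / (1 + math.exp(log_ham - log_spam))


# Hypothetical training data built from the indicator words named in the text.
spam_counts = train(["free nude pictures", "free sex free offer"])
ham_counts = train(["see you yesterday", "whatever you say yesterday"])

print(spam_probability("free nude offer", spam_counts, ham_counts))
print(spam_probability("whatever happened yesterday", spam_counts, ham_counts))
```

In this sketch a message made only of words the filter has never seen stays at the prior of 0.5, which is one way to see why padding a message with rare dictionary words gives the filter little to learn from.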

Nevertheless, Bayesian filters are not flawless either. For instance, words like “click”, “here” or “unsubscribe” might sometimes appear in valid messages such as newspaper or newsgroup subscriptions, and as the filter considers these words spam related, some of the messages might be deleted needlessly. These systems were also easily cheated by spammers, who began to morph the content of junk e-mails by adding characters, numbers or elements of HTML to the original words, thus making them seem harmless and spam free. Moreover, spammers attempted to overload the filters' databases by flooding them with spam messages including randomly chosen words from the dictionary (both of these spamming techniques were discussed in the previous chapter).

3.2.2 Discriminative linear models

To solve some of the problems occurring with Bayesian filters, anti-spam specialists introduced discriminative linear models. They are also based on machine learning systems, but thanks to some innovations they are able to distinguish spam messages from valid e-mails more accurately. For example, discriminative linear models would learn that some words might appear in both spam and valid messages, hence putting less or sometimes no weight on them. If such a filter saw in a message words like “here” or “free”, which can be found in most spam e-mails, it would know that they do not have to be spam related at all. Moreover, this model could also learn that some words might appear together, like “click” and “here”, thus making the content analysis even more accurate.

3.2.3 N-gram models

Another model introduced to cope with the ever-changing spam language is the n-gram model. This model was created to deal with the technique of morphing words. An n-gram is a sequence of n consecutive signs, here letters within a word, and an n-gram model estimates how likely the next sign in a sequence is. Thus, it uses “subsequences of words to detect the key words often associated with spam. If an e-mail message contains the phrase "n@ked l@dies", for instance, the n-grams extracted from this phrase would include "n@k," "n@ke," "@ked," and so on. Because these word fragments appear in confirmed spam messages, their presence provides valuable clues” (Goodman J., …). This model was also used for foreign languages, as some do not use spaces and it would be difficult to apply other rules to them. Thanks to this model they can be analyzed as a whole, and the filter will select only those words which are spam related.

3.3 Proof system

Another field of battle, which might turn out to be more successful in the long run, is the introduction of proof systems. The aim is to make the distribution of e-mail messages more demanding for the spammer than he/she can afford. According to the mechanism of these systems, when a message is sent and the filter recognizes it as spam related, the sender receives a challenge which is to prove that he/she is a human and not a computer program. If the sender completes the challenge properly, the mail passes the filter; if not, it is deleted. There are three types of such challenges: human, computational and financial.

3.3.1 Human challenge

These are usually called HIPs (Human Interactive Proofs) or CAPTCHAs (“completely automated public Turing test to tell computers and humans apart”). These challenges are made in such a way that they are quite easy for humans and impossible for computers to solve. For example, such a challenge might include a picture with random sets of letters which are in some way distorted.
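The n-gram extraction quoted in section 3.2.3 can be sketched in a few lines. This is an illustrative sketch only: the function names `char_ngrams` and `shared_fraction`, the choice of 3- and 4-character fragments, and the overlap score are assumptions, not the method of any specific filter.

```python
def char_ngrams(text, n_min=3, n_max=4):
    """Return the set of character n-grams found inside each word of text."""
    grams = set()
    for word in text.lower().split():
        for n in range(n_min, n_max + 1):
            for i in range(len(word) - n + 1):
                grams.add(word[i:i + n])
    return grams


# Fragments taken from a phrase assumed to come from confirmed spam.
spam_grams = char_ngrams("n@ked l@dies")


def shared_fraction(message):
    """Fraction of the message's n-grams that also occur in confirmed spam."""
    grams = char_ngrams(message)
    return len(grams & spam_grams) / len(grams) if grams else 0.0


print(sorted(char_ngrams("n@ked")))
print(shared_fraction("n@ked"), shared_fraction("hello there"))
```

Because the fragments are shorter than the word, a slightly different spelling of the same word still shares some of them with the known spam phrase, which is the property the quoted passage relies on.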

The human eye is much better at distinguishing distorted letters than any computer software. Nevertheless, these kinds of messages might also be disturbing and annoying for any normal user, and a real danger here is that people would in the end give up sending e-mails altogether.

3.3.2 Computational challenge

Another type of challenge is the computational challenge. It is based on the same idea as the previous one. That is, the sender sends a message, the receiver's server sends a challenge, the challenge is solved by the sender's server, and the message lands in the mailbox. Its greatest advantage is that these challenges can be automated, and an everyday user might never be bothered by them again after the first time. Usually, these challenges take the form of jigsaw puzzles. The main idea is that such a challenge might take a few minutes to solve, thus considerably prolonging the time of sending a message, which would make the whole “business” unprofitable. And as spam messages are usually sent in great numbers, and each of them would be challenged before getting to the receiver, the distribution of any spam related messages would require a great amount of time which spammers do not have.

3.3.3 Financial challenge

This challenge system uses real money. That is, one could set a rule in his/her mail server to receive e-mails only if a small amount of money is attached in the form of an electronic check. The rule might also be that he/she will cash the money only from a message that is spam. For a “normal” sender e-mails would remain free, but for a spammer who sends thousands of e-mails every minute it would be a disaster. Moreover, all of the mailboxes might have rate-limiting software installed; thus, if the balance of the spammer's account fell to zero, he or she would be unable to send more messages. However, the problem here is that this kind of system would require the cooperation of all of the mailbox servers, as well as some government services, to administer the whole system.
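The exchange described in section 3.3.2 can be illustrated with a hash-based puzzle. The thesis does not specify the puzzle, so a hashcash-style search is swapped in here as an assumption: the sender's server must find a number (`nonce`) whose SHA-256 digest, combined with the challenge text, starts with a given number of zeros. The function names and the `difficulty` parameter are hypothetical.

```python
import hashlib
from itertools import count


def digest_for(challenge: str, nonce: int) -> str:
    """Hex digest of the challenge text combined with a candidate nonce."""
    return hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()


def solve(challenge: str, difficulty: int = 4) -> int:
    """Sender's side: try nonces until the digest starts with enough zeros."""
    target = "0" * difficulty
    for nonce in count():
        if digest_for(challenge, nonce).startswith(target):
            return nonce


def verify(challenge: str, nonce: int, difficulty: int = 4) -> bool:
    """Receiver's side: checking a proposed answer takes a single hash."""
    return digest_for(challenge, nonce).startswith("0" * difficulty)


# The receiver's server issues a challenge; the sender's server answers it.
nonce = solve("message-001", difficulty=3)
print(nonce, verify("message-001", nonce, difficulty=3))
```

The design point is the asymmetry: solving costs the sender many hash attempts per message, while the receiver verifies with one. Raising `difficulty` by one makes the search roughly sixteen times longer, which is negligible for a single letter and costly for a bulk mailing.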

3.4 Sender list

The technology of sender lists can be combined with any other anti-spam technique, be it fingerprinting or machine learning. This method helps to eliminate, at least to some degree, false positives (messages which were unfairly treated as spam). There are two types of sender lists: blacklists and whitelists. They might be set by the user, downloaded, or set by default by the mailbox server services.

3.4.1 Blacklist

Blacklists are lists of e-mail addresses and links to sites which are thought or known to be spam related. Any message from an e-mail address or containing a link on the blacklist will be deleted or archived in the spam folder.

3.4.2 Whitelist

Analogically, whitelists include e-mail addresses and links which the user has determined to be completely safe. These will pass filters and rules even though they might otherwise be considered spam. Some people prefer to trust whitelists only, but these are extreme cases: users who do not receive any messages from people or companies they do not know. This attitude is very risky, as you might receive a message from an old friend who has just obtained your e-mail address; such a message would probably be deleted or archived and discovered only after opening the spam folder.

3.5 Other techniques

3.5.1 URL Analysis

This technique is based on the analysis of uniform resource locator (URL) information. To simplify, URLs are the addresses of the sites which they represent (although they may represent not only sites but also other resources placed on the internet). Because spammers usually aim to make someone visit their sites, most spam messages include URLs; thus, URL filtering might be a good way of identifying spam. Anti-spam software can use the information given by the URL in various ways. For example, sites analyzed and considered spam related can be placed on blacklists, thereby blocking any messages that include these sites' URLs. Moreover, as spammers tend to create new domains every time one is marked as spam related, information about whether the site is new or old can help to point out which sites should be analyzed.

3.5.2 Analyzing images

Many spammers use pictures and images to hide the content of a spam message from filtering. To fight this phenomenon, computer-vision researchers are improving anti-spam filters to make them capable of analyzing the pictures which we might receive with e-mails. A great number of junk mails are connected with the distribution and promotion of pornography; thus, such filters search for a high intensity of colors which resemble human skin. There are two major disadvantages of such filtering: it is time-consuming, and it might result in many false positives (e.g. any picture presenting big areas of skin can confuse filters). To cope with messages that hide their text in images, anti-spam researchers want to introduce optical character recognition (OCR) methods for spam filtering. These methods are used nowadays to scan documents, so there is a chance that they might also be applied to find text in graphic spam messages.

Probably the most effective anti-spam technique would be to combine all of the filters and rules, thus creating an ultimate and very complex anti-spam system, as in the example from Scientific American: “our favorite strategy to halt spam combines e-mail filtering technology with a choice of proof tests: HIPs, computational puzzles and micropayments. In this approach, if the sender of a message is not on the recipient's safe list, the message is shunted to a machine-learning-based anti-spam filter that is designed to be especially aggressive; if the message is even a bit suspicious, the recipient is challenged. Most messages from one person to another, however, will not be contested, which reduces the number of proofs dramatically. The original sender is then given a choice: solve a HIP or a computational puzzle or make a refundable micropayment. If the sender's computer has newer software, it will work out the puzzle automatically, without the sender even being aware of the challenge. Otherwise, the sender will solve a HIP or make a micropayment. (…)”
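The order of checks in the quoted combined strategy can be sketched as a small routing function. Everything named here is hypothetical: the `Message` class, the `proof_ok` flag (standing for a solved HIP, a solved puzzle or a micropayment), the stand-in scoring function and the low threshold that makes the filter "aggressive" are assumptions for illustration. The sketch follows the later sentences of the quotation in sending the challenge to the original sender.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    body: str
    proof_ok: bool = False  # hypothetical: sender solved a HIP/puzzle or paid


def route(message, safe_list, spam_score, threshold=0.2):
    """Decide what happens to a message under the combined strategy."""
    if message.sender in safe_list:
        return "deliver"  # safe senders bypass the filter entirely
    if spam_score(message.body) < threshold:
        return "deliver"  # not even a bit suspicious: no challenge needed
    # Suspicious message from an unknown sender: require a proof.
    return "deliver" if message.proof_ok else "challenge"


# Stand-in scorer; a real system would plug in a trained filter here.
def toy_score(body):
    return 0.9 if "free" in body.lower() else 0.0


safe = {"friend@example.org"}
print(route(Message("friend@example.org", "FREE tickets for you"), safe, toy_score))
print(route(Message("stranger@example.com", "free pills"), safe, toy_score))
```

Putting the safe list first is what keeps most person-to-person mail uncontested, so proofs are demanded only for the small share of messages that the filter finds suspicious.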

Conclusions

Spam is an unsolicited bulk message which causes many serious problems nowadays and is a hot topic for discussion. Although mainly connected with e-mail messages, it can appear in various media, for example as instant messaging spam, mobile phone spam, web search engine spam, or blog and forum spam. The phenomenon began in the 1980s, and though throughout the years there were, and there still are, many attempts to stop it, the number of spammers is growing, as this “business” remains the cheapest means of advertising.

The main concerns about spam are political and legal issues. The topic of spam is very controversial: on the one hand, stopping spam is considered by many to be the beginning of censorship, and on the other, junk e-mails often spread illegal content like pornography, and in many cases these messages can end up in a child's mailbox. Because spam is banned in many jurisdictions, and because new systems were introduced to fight against it, spammers were forced to invent new techniques which would enable them to circumvent anti-spam filters. Some of those discussed in this thesis include: the fake content technique (adding content to spam messages), the substitution technique (substituting words with numbers, other letters or HTML) and graphic techniques (including spam in images, creating glyphs).

Even though the language of spam is still evolving in order to get past anti-spam software, there are many people involved in research on spam messages who try to prevent the flooding of our mailboxes with junk e-mails. Some of the techniques already introduced to help us in the struggle with spam are discussed in the third chapter. These include: the fingerprint matching method, machine learning filters, proof systems, safe sender lists and others.

Summary (translated from the Polish)

The aim of this thesis was to present selected aspects of the language of spam. The first chapter gives selected information about spam appearing in various media. These include, among others, telephone spam, instant messenger spam, spam used to deceive web search engines, forum spam, as well as other kinds. Next, a short history of spam is presented, together with the etymology of the term itself. The last part of the chapter is devoted to additional information on the political and legal status of spam, as well as general information about its costs.

The second chapter describes selected methods and techniques used by spammers to get around anti-spam systems and filters. Some of these techniques are: the additional content technique (adding random content to the distributed message in order to enrich the language presented in it), the substitution technique (replacing letters in a word with digits or elements of HTML) and graphic techniques (including the content of the message in the form of an image).

The last chapter presents selected techniques used by specialists in the field of fighting the phenomenon of spam. These are: fingerprint matching methods, proof systems including challenge methods, machine learning systems such as Bayesian filters or models tracking words specially morphed by spammers, URL analysis, safe sender lists and other techniques (e.g. image analysis).

Additional resources

An example from Scientific American presenting a message prepared by a spammer using morphing message techniques.

Bibliography

Internet sources:

“Common Spammers Tricks” White Paper, Process Software, 24 May 2007 URL: <http://www.process.com/techsupport/spamtricks.html>
Glasner, Joanna “A brief history of SPAM, and spam” URL: <http://www.wired.com/techbiz/media/news/2001/05/44111>
Graham P. 2002. “A plan for Spam” 24 May 2007 URL: <http://www.paulgraham.com/spam.html>
Templeton B. “origins of the term “spam” to mean net abuse” 04 May 2007 URL: <http://www.templetons.com/brad/spamterm.html>
“There are 600,426,974,379,824,381,952 ways to spell Viagra” URL: <http://www.cockeyed.com/lessons/viagra/viagra.html>

Wikipedia articles (accessed 06 and 07 May 2007):
“Forum spam”, Wikipedia, URL: <http://en.wikipedia.org/wiki/Forum_spam>
“Messaging spam”, Wikipedia, URL: <http://en.wikipedia.org/wiki/Messaging_spam>
“Mobile phone spam”, Wikipedia, URL: <http://en.wikipedia.org/wiki/Mobile_phone_spam>
“Newsgroup spam”, Wikipedia, URL: <http://en.wikipedia.org/wiki/Newsgroup_spam>
“Spam (electronic)”, Wikipedia, URL: <http://en.wikipedia.org/wiki/Spam_(electronic)>
“Spamdexing”, Wikipedia, URL: <http://en.wikipedia.org/wiki/Spamdexing>

Articles:

Goodman, Joshua and Microsoft Research. 2003. “Spam: Technologies and Policies” White Paper
Goodman, Joshua, Heckerman David and Rounthwaite Robert. 2005. “Stopping Spam” Scientific American p. 42-49
Konig, Volker 2007. “SMS do Buhaja” Forum 13
Kruger, Alfred 2007. “SPAM nieuchwytny bandyta” Forum 13
O’Donell, Adam J. and Prakash Vipul V. 2006. “Applying Collaborative Anti-spam Techniques to the Anti-virus Problem” San Francisco, USA