Naive Bayes detects spam that a human won't

I collected a bunch of recent Reddit and Hacker News posts, about 2800 of them. I put them all into one file, then set the spammy titles and the interesting titles aside into their own lists.

Then I fed these to a Naive Bayes classifier to see what it would detect. I was surprised by how effective it is at picking interesting posts out of the garbage.

It is not intelligent

All a Naive Bayes classifier does is extract features from the text and calculate the probability that the text belongs to each class.

All the knowledge in the system comes from the person who categorized the titles by their perceived value. Those categorized titles are used to calculate the probability that a new title is bad.
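
Those two steps, feature extraction and probability calculation, can be sketched in a few lines of Python. This is an illustration only and not necessarily how the scripts described later do it: the word-level tokenizer, the add-one smoothing, the normalization so that the scores sum to one, and the names `tokenize` and `NaiveBayes` are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(title):
    # Feature extraction: lowercase word tokens (an assumed, very simple scheme).
    return re.findall(r"[a-z0-9']+", title.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of titles
        self.vocab = set()                       # every word seen in training

    def train(self, title, label):
        # The "knowledge" is nothing more than counts of hand-labeled titles.
        words = tokenize(title)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def probabilities(self, title):
        words = tokenize(title)
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in self.doc_counts:
            total_words = sum(self.word_counts[label].values())
            # Prior: how common this label is among the training titles.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in words:
                # Add-one smoothing keeps unseen words from zeroing the product.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            log_scores[label] = score
        # Turn log scores into probabilities that sum to one.
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}


nb = NaiveBayes()
nb.train("Everything you need to get started with Kotlin", "spam")
nb.train("Top ten Kotlin tips you need", "spam")
nb.train("Advances in program obfuscation", "interesting")
nb.train("A scientific calculator with physical units", "interesting")
print(nb.probabilities("Everything you need for Kotlin"))
```

With only four made-up training titles the numbers mean little, but the shape is the point: the classifier knows nothing beyond the word counts of the titles someone sorted by hand.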

The algorithm classifies hundreds of times faster than any human could.

But I noticed that it is also far more accurate at the classification than I am.

I chuckle at the things it classifies as spam and click the links anyway, only to find out that the cold computer was correct and there was nothing worth reading behind the title.

The effectiveness is uncanny. It tells me there must be a great deal of information in the files that the classifier acts upon.

Does filtering of social media sound bad to you?

To many it must sound like I am sealing myself into a bubble and narrowing my news diet. I think the effect of filtering depends entirely on how you use the filters.

Did you know there are many ways to use even a Naive Bayes classifier?

It tells you the probability that a message belongs to each category. Here are some examples:

hacker news [1] Current pay/pension of every State of California employee by name
0.00 1.00 0.00 0.00 [worthless]

hacker news [1] In Cryptography, Advances in Program Obfuscation
0.00 0.72 0.00 0.00 [maybe worthless?]

Uninteresting stories tend to have titles that already point to a worthless story: stuff that I would do nothing with and have no reference point for.

Also, if the score is near 100%, it is likely that the story, or a very similar one, is already in the category list in some form.

hacker news [1] Everything you need to get up and running with Kotlin programming language
0.98 0.00 0.00 0.00 [spam]

Spam messages are advertisey. I also think that the Kotlin and Java stories are stupid and not interesting. My categorization reflects these opinions.

It is highly likely that I would not open this link, and if I did, it would be because I was in a cranky mood and ready to write an aggressive anti-Kotlin post.

So my anti-Kotlin+Java filtering makes your life happier too, if you happen to like them. I don't have a great track record of convincing people about language choices.

hacker news [1] Show HN: Insect – a high precision scientific calculator with physical units
0.00 0.13 0.00 0.00 [unfamiliar]

hacker news [1] The Share of Free and Proprietary Drivers on Linux (nVidia/AMD)
0.00 0.00 0.00 0.00 [haven't seen this before]

You cannot classify what you haven't seen before. And when the classification lists are your own, it means that you haven't yet added these to any list.

So the Bayesian classifier can pick out stuff you have never seen before from a huge pile of stuff you have already seen.
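
The bracketed notes in the examples above can be read as a small decision rule over the four scores. Here is a sketch of one such rule in Python. The thresholds are guesses chosen only so that they reproduce the five examples, the function name `describe` is made up, and `category3` and `category4` are placeholders because the examples never name the last two columns. The actual rule in the scripts may well differ.

```python
def describe(scores, names, sure=0.9, maybe=0.5, faint=0.05):
    # Find the category with the highest score.
    best = max(range(len(scores)), key=lambda i: scores[i])
    top = scores[best]
    if top >= sure:
        # Near 100%: this story, or one much like it, is probably in the list.
        return names[best]
    if top >= maybe:
        return "maybe " + names[best] + "?"
    if top >= faint:
        # A weak signal: a few words overlap with something already sorted.
        return "unfamiliar"
    # Nothing matches any list: the title is new to the classifier.
    return "haven't seen this before"


# Column order taken from the examples: spam first, worthless second.
names = ["spam", "worthless", "category3", "category4"]
for scores in ([0.00, 1.00, 0.00, 0.00],
               [0.00, 0.72, 0.00, 0.00],
               [0.98, 0.00, 0.00, 0.00],
               [0.00, 0.13, 0.00, 0.00],
               [0.00, 0.00, 0.00, 0.00]):
    print(scores, "[" + describe(scores, names) + "]")
```

The last branch is the useful one for reading: a row of zeros is not a failure of the classifier, it is the signal that a title is worth a look.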

The time spent elsewhere

I never really thought about how many hours I spend reading Reddit or Hacker News. It is my major procrastination method. The filter now does all the work I used to do searching those lists for useful stories. That work also exhausted me, so I didn't get to think about anything else during the day.

Now that I've gotten rid of one waste of time, I have to figure out something else to replace it. Perhaps I could read more actual blog posts.

The tool

I have a tool I've been tuning with my friends. I call it read reddit (and hacker news). It's a bunch of scripts running in a terminal. I think it works anywhere a bash terminal works.

It's really just a bunch of scripts hacked together, nothing that could withstand mass adoption. If someone reads this, I thought they might like to know where to find it.

The latest thing

So far I've only classified posts by reading their titles, and I am only using the Bayes classifier. I am also only classifying the streams that I've managed to follow myself so far: no more than a few hundred entries a day.

But there is a lot more to all of this. A whole lot more. Bayes is baby-level stuff and the easiest of anything to use. What I found last week has motivated me to study all of this further.

Instead of hundreds of posts a day, I could just as well analyse tens of thousands of posts an hour. The only question is how much better that would be compared to what I am doing now.