<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-7032875</id><updated>2012-02-10T02:02:25.092-08:00</updated><title type='text'>Text Categorization</title><subtitle type='html'>Text Categorization using perl</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://perlcity.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>26</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-7032875.post-8862758078768531897</id><published>2012-02-10T01:53:00.000-08:00</published><updated>2012-02-10T02:02:25.107-08:00</updated><title type='text'>Text Categorization Framework Released</title><content type='html'>Clarabridge the text analytics company recently &lt;a href="http://newsblaze.com/story/2011100406013800001.bw/topstory.html"&gt;launched Link&lt;/a&gt;, a framework of web services that gives direct access to their text analytics engine and their sentiment engine.&lt;br /&gt;&lt;br /&gt;The 
Link engine can detect attitudes and emotions, and identify key topics and phrases for text categorization.&lt;br /&gt;&lt;br /&gt;Link has more than 15 categorization templates, plus natural language processing capabilities.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-8862758078768531897?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/8862758078768531897'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/8862758078768531897'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2012/02/text-categorization-framework-released.html' title='Text Categorization Framework Released'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-117001646731042832</id><published>2007-01-28T12:34:00.000-08:00</published><updated>2007-01-28T12:34:27.366-08:00</updated><title type='text'>Chinese Blue Jeans Sweatshop the Subject of Expose'</title><content type='html'>How would you feel if you knew that the jeans you're wearing were made by young Asian girls who got paid the equivalent of about six cents an hour? Not only six cents per hour, but under terrible conditions - they are forced to stay awake and working. And most of us are benefiting from this directly. 
&lt;br/&gt;&lt;br/&gt;&lt;a href="http://newsblaze.com/story/20070126232727nnnn.nb/newsblaze/TOPSTORY/Top-Story.html"&gt;read more&lt;/a&gt;&amp;nbsp;|&amp;nbsp;&lt;a href="http://digg.com/movies/Chinese_Blue_Jeans_Sweatshop_the_Subject_of_Expose"&gt;digg story&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-117001646731042832?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/117001646731042832'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/117001646731042832'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2007/01/chinese-blue-jeans-sweatshop-subject.html' title='Chinese Blue Jeans Sweatshop the Subject of Expose&apos;'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-116952983159059656</id><published>2007-01-22T21:23:00.000-08:00</published><updated>2007-01-22T21:23:52.040-08:00</updated><title type='text'>The State of the Union: An Ordinary Citizen Responds</title><content type='html'>It is imperative that our government recognize the fact that we the people are FED UP with the attempt to undermine American sovereignty, homeland security, and culture.&lt;br/&gt;&lt;br/&gt;&lt;a href="http://newsblaze.com/story/20070122111133lill.nb/newsblaze/OPINIONS/Opinions.html"&gt;read more&lt;/a&gt;&amp;nbsp;|&amp;nbsp;&lt;a href="http://digg.com/political_opinion/The_State_of_the_Union_An_Ordinary_Citizen_Responds"&gt;digg story&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/7032875-116952983159059656?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/116952983159059656'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/116952983159059656'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2007/01/state-of-union-ordinary-citizen.html' title='The State of the Union: An Ordinary Citizen Responds'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-114542960263304875</id><published>2006-04-18T23:53:00.000-07:00</published><updated>2006-04-18T23:53:23.623-07:00</updated><title type='text'>Soldiers Work to Provide Iraqis with More Water Treatment Plants</title><content type='html'>&lt;a href="http://newsblaze.com/story/20060418185514mpad.nb/newsblaze/IRAQ0001/Iraq.html"&gt;Soldiers Work to Provide Iraqis with More Water Treatment Plants&lt;/a&gt;&lt;br /&gt;The progress comes none too soon for many of the residents of this northern Iraqi province. Some have never had drinking water available to their homes said Spc. 
Vilhelm Heerup, a civil affairs specialist with Company C.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-114542960263304875?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://newsblaze.com/story/20060418185514mpad.nb/newsblaze/IRAQ0001/Iraq.html' title='Soldiers Work to Provide Iraqis with More Water Treatment Plants'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/114542960263304875'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/114542960263304875'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2006/04/soldiers-work-to-provide-iraqis-with.html' title='Soldiers Work to Provide Iraqis with More Water Treatment Plants'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-114460561400849732</id><published>2006-04-09T11:00:00.000-07:00</published><updated>2006-04-09T11:00:14.040-07:00</updated><title type='text'>Independent Journalist Bill Putnam Prepares to Visit Kurdistan</title><content type='html'>Independent US journalist preparing to go to Kurdish area of Iraq, needs help to visit this dangerous area for two weeks to see what's really going on there..&lt;br/&gt;&lt;br/&gt;&lt;a href="http://newsblaze.com/story/20060408222908nnnn.nb/newsblaze/IRAQ0001/Iraq.html"&gt;read more&lt;/a&gt;&amp;nbsp;|&amp;nbsp;&lt;a href="http://digg.com/links/"&gt;digg story&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-114460561400849732?l=perlcity.blogspot.com' alt='' 
/&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/114460561400849732'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/114460561400849732'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2006/04/independent-journalist-bill-putnam.html' title='Independent Journalist Bill Putnam Prepares to Visit Kurdistan'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-114413525548152878</id><published>2006-04-04T00:20:00.000-07:00</published><updated>2006-04-04T00:20:55.516-07:00</updated><title type='text'>Jillian Ann's Diary - an Intimate Portrayal of Independent Music</title><content type='html'>Jillian Ann, web icon, fashion model and Indie musician, has just released what may be the first 'real' show designed for Internet distribution entitled "Jillian Ann's Diary."  
Brilliant album, great video diary.&lt;br/&gt;&lt;br/&gt;&lt;a href="http://newsblaze.com/story/20060403233541nnnn.nb/newsblaze/TOPSTORY/Top-Story.html"&gt;read more&lt;/a&gt;&amp;nbsp;|&amp;nbsp;&lt;a href="http://digg.com/music/"&gt;digg story&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-114413525548152878?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/114413525548152878'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/114413525548152878'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2006/04/jillian-anns-diary-intimate-portrayal.html' title='Jillian Ann&apos;s Diary - an Intimate Portrayal of Independent Music'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-113938000017764096</id><published>2006-02-07T22:26:00.000-08:00</published><updated>2006-02-07T22:26:40.220-08:00</updated><title type='text'>Google To Telcos: Who Needs You?</title><content type='html'>Telcos like AT&amp;T and Verizon have announced their plans to extort money from Google and other sites if those sites want to get adequate bandwidth. But there's some evidence that Google may be planning to bypass the Telcos altogether, and roll out its own national broadband service.&lt;br /&gt;&lt;br /&gt;My take on it is that Google is building its own network for its own use, not to become an ISP. The ISP business is not as profitable as the other business Google does and Google has no expertise in that area. 
Unless it buys a company, Google would be asking for big trouble if it did that, IMO.&lt;br /&gt;&lt;br /&gt;So the Google network I see is for its own use only.&lt;br/&gt;&lt;br/&gt;&lt;a href="http://www.networkingpipeline.com/blog/archives/2006/02/google_to_telco.html"&gt;read more&lt;/a&gt;&amp;nbsp;|&amp;nbsp;&lt;a href="http://digg.com/technology/Google_To_Telcos:_Who_Needs_You_"&gt;digg story&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-113938000017764096?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/113938000017764096'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/113938000017764096'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2006/02/google-to-telcos-who-needs-you.html' title='Google To Telcos: Who Needs You?'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-113834513505401962</id><published>2006-01-26T22:58:00.000-08:00</published><updated>2006-01-26T22:58:55.096-08:00</updated><title type='text'>Russia: Spy gadget hailed as 'technological miracle' Worth tens of millions</title><content type='html'>The 'rock' found in Russia that was being used by British agents to transmit data is said to be space-age technology worth tens of millions of dollars.&lt;br /&gt;&lt;br /&gt;The FSB accused four British diplomats of involvement in a spy ring in which agents allegedly passed secrets through the device, located in a Moscow park. 
&lt;br /&gt;&lt;br /&gt;The Russians said the British 'spy rock' is a multi-million dollar miracle of technology&lt;br /&gt;http://newsblaze.com/story/20060126201650tsop.nb/newsblaze/TOPSTORY/Top-Story.html&lt;br/&gt;&lt;br/&gt;&lt;a href="http://breakingnews.iol.ie/news/story.asp?j=140234720&amp;p=y4xz353xx"&gt;read more&lt;/a&gt;&amp;nbsp;|&amp;nbsp;&lt;a href="http://digg.com/technology/Russia:_Spy_gadget_hailed_as_technological_miracle_Worth_tens_of_millions"&gt;digg story&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-113834513505401962?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/113834513505401962'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/113834513505401962'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2006/01/russia-spy-gadget-hailed-as.html' title='Russia: Spy gadget hailed as &apos;technological miracle&apos; Worth tens of millions'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-113787860995584406</id><published>2006-01-21T13:23:00.000-08:00</published><updated>2006-01-21T13:23:31.050-08:00</updated><title type='text'>Dead Body Guy to Star in 'Horrorween' - a triumph of technology</title><content type='html'>The internet, websites, blogs and ebooks made this happen.&lt;br /&gt;&lt;br /&gt;A guy dreams about being on the TV and in the movies. He finds a book that gives him the courage to do something about it. He creates a website and a blog. 
Finds a great publicist and the rest is history - Newspaper, Radio and TV interviews and now - OFFERS !&lt;br/&gt;&lt;br/&gt;&lt;a href="http://newsblaze.com/story/20060121130737nnnn.nb/newsblaze/TOPSTORY/Top-Story.html"&gt;read more&lt;/a&gt;&amp;nbsp;|&amp;nbsp;&lt;a href="http://digg.com/movies/Dead_Body_Guy_to_Star_in_Horrorween_-_a_triumph_of_technology"&gt;digg story&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-113787860995584406?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/113787860995584406'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/113787860995584406'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2006/01/dead-body-guy-to-star-in-horrorween.html' title='Dead Body Guy to Star in &apos;Horrorween&apos; - a triumph of technology'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-113778449923422159</id><published>2006-01-20T11:14:00.000-08:00</published><updated>2006-01-20T11:14:59.256-08:00</updated><title type='text'>Hi-Res Pluto Liftoff</title><content type='html'>Hi-Res image of Liftoff of the Atlas V carrying NASA's New Horizons spacecraft to a distant date with Pluto!&lt;br /&gt;&lt;br /&gt;It's a great photo and an interesting project, because Pluto is too hard to study from here.&lt;br /&gt;&lt;br/&gt;&lt;br/&gt;&lt;a href="http://www.nasa.gov/images/content/141349main_06pd0094.jpg"&gt;read more&lt;/a&gt;&amp;nbsp;|&amp;nbsp;&lt;a href="http://digg.com/science/Hi-Res_Pluto_Liftoff"&gt;digg 
story&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-113778449923422159?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/113778449923422159'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/113778449923422159'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2006/01/hi-res-pluto-liftoff.html' title='Hi-Res Pluto Liftoff'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-112457864333099995</id><published>2005-08-20T15:57:00.000-07:00</published><updated>2005-08-20T15:57:23.360-07:00</updated><title type='text'>Consumers Rejection of RFID Passports Consistent With Research Findings</title><content type='html'>&lt;a href="http://newsblaze.com/story/2005042712155900002.mwir/newsblaze/RFIDRFID/RFID.html"&gt;Consumers Rejection of RFID Passports Consistent With Research Findings&lt;/a&gt;: "RFID (Radio Frequency Identification) used to be an acronym known only by a handful of technology companies and retailers looking to implement slick new supply chain solutions, but it is increasingly becoming a consumer household word. Awareness of RFID among consumers has grown from just 28% in September 2004 to over 40% in March 2005. 
(Source: RFID Consumer Buzz March 2005)"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-112457864333099995?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://newsblaze.com/story/2005042712155900002.mwir/newsblaze/RFIDRFID/RFID.html' title='Consumers Rejection of RFID Passports Consistent With Research Findings'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/112457864333099995'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/112457864333099995'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2005/08/consumers-rejection-of-rfid-passports.html' title='Consumers Rejection of RFID Passports Consistent With Research Findings'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-111188709362133154</id><published>2005-03-26T17:31:00.000-08:00</published><updated>2005-03-26T17:31:33.620-08:00</updated><title type='text'>Perl books</title><content type='html'>&lt;a href="http://perlcity.com/"&gt;Perl books&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;My Text Categorizer written in Perl is doing a reasonable job, but I want it to do more.  I'm finding that it misses many possible categories.&lt;br /&gt;&lt;br /&gt;What I'm really looking for is natural language processing, but with some differences. I want it to know about paragraphs of text and I want it to be able to handle "must have" and "must not have".  I already built those two features into the existing text categorizer, but they only work if the categorizer selected the document for a category.
&lt;br /&gt;&lt;br /&gt;The problem is that its missing some categories.&lt;br /&gt;&lt;br /&gt;I will need to text it, but maybe I just want a boolean learner.&lt;br /&gt;&lt;br /&gt;The main problem with what I'm doing is that I don't want to have a large set of documents that act as the training set. All I want to do is to have a slim specification of the matching text, like a set of search engine keywords.&lt;br /&gt;&lt;br /&gt;Maybe I'll look for someone who can help me get it done.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-111188709362133154?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://perlcity.com/' title='Perl books'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/111188709362133154'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/111188709362133154'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2005/03/perl-books.html' title='Perl books'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-109201466606495918</id><published>2004-08-08T18:24:00.000-07:00</published><updated>2004-08-08T18:53:25.403-07:00</updated><title type='text'>O'Reilly Network: What Percentage of Developer Positions Should Be Junior?</title><content type='html'>&lt;a href="http://www.oreillynet.com/pub/wlg/5388"&gt;O'Reilly Network: What Percentage of Developer Positions Should Be Junior?&lt;/a&gt;&lt;br /&gt;by William Grosso -- I've been paying attention to the local help-wanteds for the 
past couple of weeks. By my count, approximately 4% of the positions being advertised are junior positions. I wonder what that means?&lt;br /&gt;&lt;hr&gt;&lt;br /&gt;You would think it would be more than 5%, but there are some special circumstances right now.&lt;br&gt;&lt;br /&gt;First, there are some experienced developers out of work, who are probably willing to accept lower wages and an employer isn't going to train someone when they can get experience at a good rate. &lt;br&gt;&lt;br /&gt;Second, employers are learning they can do more with less.&lt;br&gt;&lt;br /&gt;Third, some employers think they can save money by outsourcing overseas.&lt;br&gt;&lt;br /&gt;Fourth, the economy isn't exactly stable and employers are holding off.&lt;br&gt;&lt;br /&gt;Fifth, some companies are on the edge of crashing and they are not only not hiring, but they are going to create some unemployed developers soon.&lt;br&gt;&lt;br /&gt;&lt;br /&gt;The number of companies with vision, foresight and cash is not a big number. And for some reason or other, there are a lot of Venture Capitalists who can't even give their money away. That's not to say there aren't companies willing to burn up their money - I think it means the VCs aren't being presented with enough business plans that make sense.
&lt;br /&gt;&lt;br /&gt;TrackBack URL to William's article:&lt;br /&gt;http://www.oreillynet.com/cs/user/trackback/cs_msg?x-lr=cs_disc/9427&amp;x-lr2=wlg/5388&amp;x-a=submit&amp;trackback=1 &lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-109201466606495918?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/109201466606495918'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/109201466606495918'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2004/08/oreilly-network-what-percentage-of.html' title='O&apos;Reilly Network: What Percentage of Developer Positions Should Be Junior?'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-108708153741034707</id><published>2004-06-12T15:15:00.000-07:00</published><updated>2004-06-12T16:42:14.806-07:00</updated><title type='text'>Categorization Improvement</title><content type='html'>After another few days effort and many experiments, a reasonable improvement that I am very happy with.&lt;br /&gt;&lt;br /&gt;The latest system uses Naive Bayes Categorization with a low threshold, that gives more positive hits and more overage, coupled with 3 verifiers that enhance the results in various ways.&lt;br /&gt;It even discovered 30 real hits that I missed in my manual classification.
&lt;br /&gt;&lt;br /&gt;The Boolean Natural Language NOT verifier reduces overage&lt;br /&gt;&lt;br /&gt;The Boolean Natural Language MUSTHAVE verifier &lt;br /&gt;&lt;br /&gt;The Composite Classifier increases hits by triggering a categorization when one category makes up part of another, for instance Chile is a South American country and the Composite classifier adds any story tagged for Chile into the South America topic, if the standard South America topic didn't already tag it.&lt;br /&gt;&lt;br /&gt;&lt;hr&gt;Results&lt;hr&gt;&lt;br /&gt;This is the result without verification&lt;br /&gt;379=Hit, 10=NoCat, 172=OverCat&lt;br /&gt;&lt;br /&gt;The NOT verifier reduces overage, reduces hits&lt;br /&gt;372=Hit, 16=NoCat, 155=OverCat&lt;br /&gt;&lt;br /&gt;The MUSTHAVE verifier greatly reduces overage, reduces hits&lt;br /&gt;349=Hit, 37=NoCat, 11=OverCat&lt;br /&gt;&lt;br /&gt;Together, the verifiers cause a further improvement&lt;br /&gt;342=Hit, 44=NoCat, 3=OverCat&lt;br /&gt;&lt;br /&gt;Finally, adding the Composite Classifier increases hits&lt;br /&gt;359=Hit, 44=NoCat, 4=OverCat&lt;br /&gt;&lt;br /&gt;Overall, its a good improvement. Out of 600 manually tagged topics, the Categorizer picks up just over half of them and the overage is only 4 topics. This is very good.&lt;br /&gt;&lt;br /&gt;Now I'm ready to test documents from other sources.&lt;br /&gt;&lt;br /&gt;I suspect that further improvements will be gained by improving the category definitions.&lt;br /&gt;&lt;br /&gt;I still want to try the SVM (Support Vector Machine), as soon as I get a new server set up.
&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-108708153741034707?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108708153741034707'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108708153741034707'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2004/06/categorization-improvement.html' title='Categorization Improvement'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-108667419041962759</id><published>2004-06-07T21:39:00.000-07:00</published><updated>2004-06-07T22:56:30.420-07:00</updated><title type='text'>Eureka! I found it!</title><content type='html'>Now I'm happy.&lt;br /&gt;I have more hits and less overage. In fact a lot less overage.&lt;br /&gt;When I first set up the changes, there were about 6 overages and I thought that was really great. &lt;br /&gt;&lt;br /&gt;Then I realised that they weren't overages at all - the auto&lt;br /&gt;categorizer actually picked up topics that I missed.&lt;br /&gt;&lt;br /&gt;So here are the results.&lt;br /&gt;t7bbxcNLQ1:    230=Hit, 54=NoCat, 3=OverCat&lt;br /&gt;&lt;br /&gt;There should be only 10 with no categories, so there is still room for improvement.&lt;br /&gt;&lt;br /&gt;One improvement was to better specify some of the categories.&lt;br /&gt;&lt;br /&gt;What brought these improvements was using NaiveBayes with a low threshold of 0.1 instead of 0.3 and then using a boolean/Natural Language query on the results.
&lt;br /&gt;&lt;br /&gt;More information soon.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-108667419041962759?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108667419041962759'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108667419041962759'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2004/06/eureka-i-found-it.html' title='Eureka! I found it!'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-108598591911872082</id><published>2004-05-30T23:41:00.000-07:00</published><updated>2004-05-30T23:45:19.120-07:00</updated><title type='text'>Experimenting with front_bias</title><content type='html'>I have manually identified 213 categories in the small corpus&lt;br /&gt;of 220 documents&lt;br /&gt;&lt;br /&gt;# front_bias =&gt; 0.9&lt;br /&gt;perl Z-Categorizer.pl stopwords7 b&lt;br /&gt;93=Hit, 106=NoCategory, 35=OverCat&lt;br /&gt;&lt;br /&gt;front_bias =&gt; 0.9&lt;br /&gt;perl Z-Categorizer.pl stopwords7 b&lt;br /&gt;175=Hit, 51=NoCategory, 40=OverCat&lt;br /&gt;&lt;br /&gt;front_bias =&gt; 0.99   *** Best result ***&lt;br /&gt;perl Z-Categorizer.pl stopwords7 b&lt;br /&gt;176=Hit, 47=NoCategory, 37=OverCat&lt;br /&gt;&lt;br /&gt;front_bias =&gt; 0.9999&lt;br /&gt;perl Z-Categorizer.pl stopwords7 b&lt;br /&gt;154=Hit, 52=NoCategory, 40=OverCat&lt;br /&gt;&lt;br /&gt;front_bias =&gt; 0.99,&lt;br /&gt;stemming =&gt; "porter",&lt;br /&gt;151=Hit, 22=NoCategory, 
81=OverCat&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-108598591911872082?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108598591911872082'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108598591911872082'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2004/05/experimenting-with-frontbias.html' title='Experimenting with front_bias'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-108597167614771570</id><published>2004-05-30T16:53:00.000-07:00</published><updated>2004-05-30T19:47:56.146-07:00</updated><title type='text'>Handling the NOT case in Text Categorization</title><content type='html'>I'm not sure that this is the best way to handle it, but as I haven't discovered a better way so far, this is my solution:&lt;br /&gt;&lt;br /&gt;I've statically created an extra topic that has the same name as the topic I'm looking for, except that it has a minus sign in front of it. This topic contains the items that I don't want in the topic, thus qualifying the input document.&lt;br /&gt;&lt;br /&gt;For instance, the text for the JORDAN category is "Jordan" and the text for the -JORDAN category is "Michael". Then on output, I make two passes through the corpus list. The first pass gathers the NOT cases, remembering only the category name, with the '-' character removed.&lt;br /&gt;In the second pass, if the topic is in the NOT hash, it is skipped.
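&lt;br /&gt;&lt;br /&gt;Here is a minimal sketch of those two passes. It is written in Python rather than the Perl of my script, and the function name filter_not_topics and the sample topic lists are made up for illustration. It only shows the idea; it is not the code from Z-Categorizer.pl.&lt;br /&gt;
&lt;pre&gt;
```python
def filter_not_topics(assigned):
    """Drop any topic whose '-' twin was also assigned to the document."""
    # First pass: remember the names of the NOT topics, without the '-'.
    not_topics = set()
    for topic in assigned:
        if topic.startswith("-"):
            not_topics.add(topic[1:])

    # Second pass: keep real topics that are not in the NOT set.
    kept = []
    for topic in assigned:
        if topic.startswith("-"):
            continue  # the '-' topics are never reported themselves
        if topic in not_topics:
            continue  # vetoed by its NOT twin
        kept.append(topic)
    return kept


# A Michael Jordan story hits both JORDAN and -JORDAN, so JORDAN is dropped.
print(filter_not_topics(["JORDAN", "-JORDAN", "BASKETBALL"]))
# A story about the country only hits JORDAN, so it is kept.
print(filter_not_topics(["JORDAN", "MIDDLE EAST"]))
```
&lt;/pre&gt;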
&lt;br /&gt;&lt;br /&gt;This is not the ultimate in elegance, but it works and that's all I care about right now - and its only a few lines of code.&lt;br /&gt;&lt;br /&gt;I could handle this by using my original topic data structure in which there was a NOT entry in the body of the topic definition and then splitting it out at the time the base knowledge set is created.&lt;br /&gt;&lt;br /&gt;So, by using just one NOT case, the hit/miss score looks like this:&lt;br /&gt;&lt;br /&gt;t4bbxcT8: 179=Hit, 43=NoCat, 38=OverCat&lt;br /&gt;&lt;br /&gt;An improvement over the previous best, which was:&lt;br /&gt;&lt;br /&gt;t7bbxcT7: 182=Hit, 39=NoCat, 42=OverCat&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-108597167614771570?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108597167614771570'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108597167614771570'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2004/05/handling-not-case-in-text.html' title='Handling the NOT case in Text Categorization'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-108590324477388533</id><published>2004-05-30T00:32:00.000-07:00</published><updated>2004-05-30T00:47:24.773-07:00</updated><title type='text'>What are the Text Categorization overages</title><content type='html'>The overages are interesting. 
They are all related to miscategorization due to acting on a single word rather than a phrase and my inability to program NOT into the system.&lt;br /&gt;&lt;br /&gt;for example, a Horoscope story gets picked up by the Cancer &lt;br /&gt;topic. I could stop that with NOT horoscope, aquarius etc&lt;br /&gt;&lt;br /&gt;A Michael Jordan story is tagged by the Jordan (country) topic&lt;br /&gt;2 Appeals court stories are tagged by the Supreme Court topic&lt;br /&gt;&lt;br /&gt;A child custody story is tagged by the Child Abuse topic&lt;br /&gt;&lt;br /&gt;I was hoping to find other people working on this problem but I can't find anything recent.&lt;br /&gt;&lt;br /&gt;At least I'm getting closer to where I want to be, but I'm now over a week behind.&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-108590324477388533?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108590324477388533'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108590324477388533'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2004/05/what-are-text-categorization-overages.html' title='What are the Text Categorization overages'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-108590212276280569</id><published>2004-05-29T23:57:00.000-07:00</published><updated>2004-05-30T01:41:01.786-07:00</updated><title type='text'>Hacking my topic definitions and rechecking manual classifications</title><content type='html'>After almost a whole day of 
trying to get individual term weighting running, I decided to have a break and try something else.&lt;br /&gt;&lt;br /&gt;First, I saved the topic definitions and started hacking them.&lt;br /&gt;This classification system has a few problems. First, it only knows about words, not phrases, so Michael Jordan and Jordan the country get classified the same. &lt;br /&gt;&lt;br /&gt;I obviously need some Word Sense Disambiguation here, but I don't know how to do that with these modules. So I settled on hacking the topic definitions. &lt;br /&gt;&lt;br /&gt;At about the same time, I reviewed the hits and misses and realised that I may have missed classifying some topics in a few documents.&lt;br /&gt;&lt;br /&gt;I'm pleased with the results of this effort because now, the number of positive hits is up and the overage is down. It's moving in the direction I wanted. Also, the number of files with no category is down.&lt;br /&gt;&lt;br /&gt;t7bbxcT1: 155=Hit, 35=NoCat, 82=OverCat&lt;br /&gt;t7bbxcT2: 159=Hit, 36=NoCat, 76=OverCat&lt;br /&gt;t7bbxcT3: 160=Hit, 36=NoCat, 75=OverCat&lt;br /&gt;t7bbxcT4: 160=Hit, 36=NoCat, 76=OverCat&lt;br /&gt;t7bbxcT5: 162=Hit, 36=NoCat, 73=OverCat&lt;br /&gt;t7bbxcT6: 176=Hit, 36=NoCat, 51=OverCat&lt;br /&gt;t7bbxcT7: 182=Hit, 39=NoCat, 42=OverCat&lt;br /&gt;&lt;br /&gt;So that was a very useful exercise.&lt;br /&gt;I tested KNN and Rocchio again.&lt;br /&gt;&lt;br /&gt;t4kbxcT7: 44=Hit, 174=NoCat,  3=OverCat&lt;br /&gt;t4rbxcT7: 179=Hit, 84=NoCat, 65=OverCat&lt;br /&gt;&lt;br /&gt;I also tested term weighting:&lt;br /&gt;t4btxcT7: 180=Hit, 40=NoCat, 52=OverCat&lt;br /&gt;&lt;br /&gt;So binary weighting with cosine smoothing and lots of&lt;br /&gt;stopwords with a Naive Bayes learner works best so far.&lt;br /&gt;&lt;br /&gt;Handling the NOTs should definitely do it for me.
&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-108590212276280569?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108590212276280569'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108590212276280569'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2004/05/hacking-my-topic-definitions-and.html' title='Hacking my topic definitions and rechecking manual classifications'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-108587133395404582</id><published>2004-05-29T15:25:00.000-07:00</published><updated>2004-05-29T16:01:20.386-07:00</updated><title type='text'>Attempts to improve the Over-classification</title><content type='html'>The first thing I tried was to alter the term weighting system.&lt;br /&gt;The best results came from using bxc. I was already using bxx in the previous tests. The difference is cosine smoothing of results. This reduced the overage and increased the number of good hits.&lt;br /&gt;&lt;br /&gt;tfidf_weighting specifies how document word counts should be converted to vector values.&lt;br /&gt;This scheme uses the three-character specification strings from Salton &amp; Buckley's paper "Term-weighting approaches in automatic text retrieval".  The three characters indicate the three factors that will be multiplied for each feature to find the final vector value for that feature.  The default weighting is "xxx".&lt;br /&gt;&lt;br /&gt;The first character is "term frequency".
&lt;br /&gt;The second character is "collection frequency".&lt;br /&gt;The third character is "normalization".&lt;br /&gt;&lt;br /&gt;b   Binary weighting - 1 for terms present, 0 for terms absent&lt;br /&gt;x   No change - multiply by 1&lt;br /&gt;c   cosine normalization - multiply by 1/length(doc_vector)&lt;br /&gt;&lt;br /&gt;t7bbxc: 124=Partial, 157=Hit, 35=NoCategory, 62=OverCategory, 84=TotalOver&lt;br /&gt;t7rbxc: 81=Partial, 161=Hit, 83=NoCategory, 57=OverCategory, 92=TotalOver&lt;br /&gt;t7kbxc: 39=Partial, 39=Hit, 178=NoCategory, 4=OverCategory, 4=TotalOver&lt;br /&gt;&lt;br /&gt;Naive Bayes, KNN and Rocchio all improved with smoothing and, surprisingly, Rocchio came right up with Naive Bayes and got the best number of positives and only slightly more negatives. &lt;br /&gt;&lt;br /&gt;I'm trying to get a server set up where I can run a C compiler and build a Support Vector Machine (SVM) because I think this would be a very interesting comparison.&lt;br /&gt;&lt;br /&gt;The next thing I tried was to reduce the amount of document data passed to the classifier. I tried sending it only the document TITLE and the 200 character summary.&lt;br /&gt;&lt;br /&gt;I call this the shortdoc test. As you see, it reduced the overage, but the number of good hits is down too.&lt;br /&gt;shortdoc:  39=Partial, 39=Hit, 168=NoCategory, 14=OverCategory, 15=TotalOver&lt;br /&gt;&lt;br /&gt;The next thing to do (apart from building the SVM) is to try individual term weighting in a category.&lt;br /&gt;If I could do that, I'd be able to get the NOT terms working, which I think are going to give the biggest improvement. There won't be more hits, but there should be far fewer overages, which is what I need right now.
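&lt;br /&gt;&lt;br /&gt;As a rough illustration of what the three characters in "bxc" do to one document, here is a small Python sketch. It is not the code inside AI::Categorizer; the function name weight_bxc and the sample sentence are made up, and I am assuming that length(doc_vector) means the Euclidean length of the vector.&lt;br /&gt;
&lt;pre&gt;
```python
import math
from collections import Counter


def weight_bxc(words):
    """Turn a list of words into a 'bxc' vector: binary, no idf, cosine-normalized."""
    counts = Counter(words)
    # b: binary term frequency, 1 for every term that is present
    vector = {term: 1.0 for term in counts}
    # x: the collection frequency factor is 1, so nothing changes here
    # c: cosine normalization, divide by the (assumed Euclidean) vector length
    length = math.sqrt(sum(v * v for v in vector.values()))
    if length == 0:
        return {}
    return {term: v / length for term, v in vector.items()}


doc = "jordan signs with the bulls as jordan returns".split()
vec = weight_bxc(doc)
for term, value in sorted(vec.items()):
    print(term, round(value, 4))
```
&lt;/pre&gt;
With binary weighting the repeated word counts only once, and after normalization every term present carries the same weight, so long and short documents end up with vectors of the same length.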
&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-108587133395404582?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108587133395404582'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108587133395404582'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2004/05/attempts-to-improve-over.html' title='Attempts to improve the Over-classification'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-108550230820674728</id><published>2004-05-25T09:07:00.000-07:00</published><updated>2004-05-29T15:24:45.673-07:00</updated><title type='text'>First Categorization Results (Numeric)</title><content type='html'>These are the first results that will hopefully lead me to making changes in various areas to improve the automated categorization.
&lt;br /&gt;&lt;br /&gt;k = KNN&lt;br /&gt;r = Rocchio&lt;br /&gt;b = Naiive Bayes&lt;br /&gt;&lt;br /&gt;t1k:    Exact=4, Partial=22, NoCategory=205, OverCategory=1&lt;br /&gt;t2k:    Exact=11, Partial=38, NoCategory=187, OverCategory=4&lt;br /&gt;t3k:    Exact=14, Partial=54, NoCategory=177, OverCategory=3&lt;br /&gt;t4k:    Exact=16, Partial=58, NoCategory=173, OverCategory=3&lt;br /&gt;&lt;br /&gt;t1r:    Exact=5, Partial=28, NoCategory=22, OverCategory=147&lt;br /&gt;t1rn:   Exact=7, Partial=30, NoCategory=28, OverCategory=171&lt;br /&gt;t2r:    Exact=8, Partial=32, NoCategory=43, OverCategory=78&lt;br /&gt;t2rn:   Exact=15, Partial=54, NoCategory=64, OverCategory=115&lt;br /&gt;t3r:    Exact=21, Partial=86, NoCategory=90, OverCategory=67&lt;br /&gt;t3rn:   Exact=21, Partial=86, NoCategory=90, OverCategory=67&lt;br /&gt;t4r:    Exact=10, Partial=28, NoCategory=30, OverCategory=22&lt;br /&gt;t4rn:   Exact=22, Partial=90, NoCategory=92, OverCategory=62&lt;br /&gt;&lt;br /&gt;t1b:    Exact=22, Partial=74, NoCategory=3, OverCategory=159&lt;br /&gt;t1bn:   Exact=22, Partial=74, NoCategory=3, OverCategory=159&lt;br /&gt;t2b:    Exact=34, Partial=92, NoCategory=17, OverCategory=124&lt;br /&gt;t2bn:   Exact=34, Partial=92, NoCategory=17, OverCategory=124&lt;br /&gt;t3b:    Exact=45, Partial=108, NoCategory=22, OverCategory=100&lt;br /&gt;t3bn:   Exact=45, Partial=108, NoCategory=22, OverCategory=100&lt;br /&gt;t4b:    Exact=49, Partial=104, NoCategory=24, OverCategory=96&lt;br /&gt;t4bn:   Exact=49, Partial=104, NoCategory=24, OverCategory=96&lt;br /&gt;&lt;br /&gt;Discussion of these results:&lt;br /&gt;&lt;h5&gt;Overall&lt;/h5&gt;&lt;br /&gt;It is obvious that increasing the number (and probably quality)&lt;br /&gt;of stopwords leads to better categorization, so that is an area to be investigated further.
&lt;br /&gt;&lt;br /&gt;&lt;h5&gt;KNN&lt;/h5&gt;&lt;br /&gt;KNN gives the least number of overcategorizations, but it obviously misses a lot of things too&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-108550230820674728?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108550230820674728'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108550230820674728'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2004/05/first-categorization-results-numeric.html' title='First Categorization Results (Numeric)'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-108546550026358580</id><published>2004-05-24T22:52:00.000-07:00</published><updated>2004-05-25T09:37:51.690-07:00</updated><title type='text'>Text Categorization Algorithms and Methodology</title><content type='html'>In this first test run, I have 4 Algorithms for selecting Article Categories.&lt;br /&gt;&lt;br /&gt;Guesser, Naiive Bayes, KNN and Rocchio&lt;br /&gt;I will use these abbreviations: g=guesser b=bayes k=knn r=rocchio&lt;br /&gt;&lt;br /&gt;There are other interesting algorithms I want to try, but due to &lt;br /&gt;the limitation of no access to a C compiler on this server, I&lt;br /&gt;couldn't build a Support Vector Machine (SVM) or a Decision Tree&lt;br /&gt;(DT). I will try these later.
&lt;br /&gt;&lt;br /&gt;k-Nearest Neighbours (KNN) is an example-based classifier and&lt;br /&gt;Rocchio is a linear classifier, which depend on a test set of&lt;br /&gt;documents, so these may no be useful for what I want.&lt;br /&gt;&lt;br /&gt;Guesser was by far the fastest to execute, but now that I have a baseline, its easy to see why - its completely useless.&lt;br /&gt;&lt;br /&gt;With a base set of 224 documents, using 4 different sets of stopwords and stemming or no stemming, there were no exact matches and no partial matches.&lt;br /&gt;&lt;br /&gt;The test names tell something about the test - they are also&lt;br /&gt;the filenames I used to store the results.&lt;br /&gt;&lt;br /&gt;t = test&lt;br /&gt;1-4 = the stopword files = 5, 23, 571 and 1602 words&lt;br /&gt;       (I'll explain these in detail later)&lt;br /&gt;g,b,k,r = the algorithm used, as shown above&lt;br /&gt;s or n  = stemming or no stemming&lt;br /&gt;&lt;br /&gt;Im not going to discuss results in this post, except to say&lt;br /&gt;that I'm not going to use the Guesser algorithm again&lt;br /&gt;because its a waste of time - OK for academic interest, but&lt;br /&gt;no good to me.&lt;br /&gt;&lt;br /&gt;This is how the results will show. These are the guesser&lt;br /&gt;results - you can see why its a waste of time.&lt;br /&gt;&lt;br /&gt;The table shows the number of Exact matches, Partial matches&lt;br /&gt;where the Algorithm gave a subset of the manual categorization&lt;br /&gt;No category at all and finally, categorization, but additional&lt;br /&gt;topics. These will need further investigation.&lt;br /&gt;&lt;br /&gt;I would rather have it under-categorizing than give extra&lt;br /&gt;categories that are false. For instance, I'd rather it not&lt;br /&gt;put a Washington State story in the Washington DC category.
&lt;br /&gt;&lt;br /&gt;However, there could be perfectly valid over-categorization that&lt;br /&gt;I missed seeing, so this first set of results will need more&lt;br /&gt;investigation.&lt;br /&gt;&lt;br /&gt;These are the results from Guesser&lt;br /&gt;&lt;br /&gt;t1gs:   Exact=0, Partial=0, NoCategory=82, OverCategory=139&lt;br /&gt;t1gn:   Exact=0, Partial=1, NoCategory=86, OverCategory=134&lt;br /&gt;t2gs:   Exact=0, Partial=0, NoCategory=76, OverCategory=145&lt;br /&gt;t2gn:   Exact=0, Partial=0, NoCategory=75, OverCategory=146&lt;br /&gt;t3gs:   Exact=0, Partial=0, NoCategory=74, OverCategory=147&lt;br /&gt;t3gn:   Exact=0, Partial=0, NoCategory=76, OverCategory=145&lt;br /&gt;t4gs:   Exact=0, Partial=0, NoCategory=82, OverCategory=139&lt;br /&gt;t4gn:   Exact=0, Partial=0, NoCategory=85, OverCategory=136&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-108546550026358580?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108546550026358580'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108546550026358580'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2004/05/text-categorization-algorithms-and.html' title='Text Categorization Algorithms and Methodology'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-108545880451871201</id><published>2004-05-24T21:03:00.000-07:00</published><updated>2004-05-25T08:48:32.066-07:00</updated><title type='text'>Text Categorization baseline</title><content type='html'>I spent 4 hours manually categorizing almost 400 documents, to use as the baseline for the automated tests. &lt;br /&gt;&lt;br /&gt;After that I ran a quick check to see how close any of the automated tests were to the baseline. Some of the automated tests were completely useless, while others were remarkably close.&lt;br /&gt;&lt;br /&gt;One big problem I noticed easily was the confusion between Washington DC and Washington State and there was one instance of George Washington.&lt;br /&gt;&lt;br /&gt;The Categorizer Algorithms all seem to only handle the documents word by word, less the stopwords and any stemming. &lt;br /&gt;&lt;br /&gt;Here is a list of things I think it needs to handle:&lt;br /&gt;- 2 and 3 word phrases&lt;br /&gt;- proximity (x within 2/3/4 words of y)&lt;br /&gt;- NOT a word / phrase / proximity phrase&lt;br /&gt;- a OR b OR c OR d etc&lt;br /&gt;- MUST contain a OR b OR c etc AND d OR e OR f etc&lt;br /&gt;&lt;br /&gt;If I can get all this into the system, I'll be very happy.&lt;br /&gt;&lt;br /&gt;The next report will show initial results comparisons between the Human Categorizer (me) and the automated Categorizers.
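&lt;br /&gt;&lt;br /&gt;The phrase and proximity items in the list above could be checked with something like this Python sketch. It is only a rough idea of the tests I have in mind, not something the Categorizer does; the function names has_phrase and within and the sample sentence are made up.&lt;br /&gt;
&lt;pre&gt;
```python
def has_phrase(words, phrase):
    """True if the words of phrase appear next to each other, in order."""
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))


def within(words, x, y, distance):
    """True if x occurs no more than distance words away from y."""
    xs = [i for i, w in enumerate(words) if w == x]
    ys = [i for i, w in enumerate(words) if w == y]
    return any(abs(i - j) in range(1, distance + 1) for i in xs for j in ys)


text = "the appeals court in washington state upheld the ruling".split()
print(has_phrase(text, ["washington", "state"]))  # 2 word phrase
print(has_phrase(text, ["washington", "dc"]))     # phrase that is absent
print(within(text, "court", "washington", 2))     # proximity test
```
&lt;/pre&gt;
A NOT rule would then just be the negation of one of these tests, and the OR and AND rules would combine several of them.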
&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-108545880451871201?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108545880451871201'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108545880451871201'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2004/05/text-categorization-baseline.html' title='Text Categorization baseline'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-108533412071655162</id><published>2004-05-23T10:42:00.000-07:00</published><updated>2004-05-23T10:47:26.866-07:00</updated><title type='text'>Text Categorization improvements</title><content type='html'>Well, back to the drawing board. None of those searches helped.&lt;br /&gt;It was basically a waste of 4 hours.&lt;br /&gt;&lt;br /&gt;So I went back to AI::Categorizer and found a really good paper by Fabrizio Sebastiani called &lt;a href="http://faure.iei.pi.cnr.it/~fabrizio/Publications/ACMCS02.pdf" target="_new"&gt;Machine Learning in Automated Text Categorization&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I also thought through what I was trying to achieve and realised I need to build an extra module of my own and also need to call the Categorizer differently.&lt;br /&gt;- It's amazing the things you can accomplish when you think and investigate before you act!
&lt;br /&gt;&lt;br /&gt;So I revisited all the documentation (its not set out to be easily accessible), reworked the code and came up with results much closer to what I was looking for.&lt;br /&gt;&lt;br /&gt;I ran some tests and got interesting results, which I'll publish soon. The poor initial results were what made me look elsewhere, but now I realise that subtle changes can drastically improve the results.&lt;br /&gt;Now its back to the code and tests.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-108533412071655162?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108533412071655162'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108533412071655162'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2004/05/text-categorization-improvements.html' title='Text Categorization improvements'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-108524843550035205</id><published>2004-05-22T10:39:00.000-07:00</published><updated>2004-05-22T10:53:55.500-07:00</updated><title type='text'>Text categorization</title><content type='html'>I've been experimenting with text categorization for a new project. I started with many searches on Google. There are quite a few commercial products from large companies, all with large price tags. Probably all worth it, because its not a simple thing to do.&lt;br /&gt;&lt;br /&gt;I was only experimenting so I was looking for code to build on. 
Ken Williams' AI::Categorizer is a Perl module that keeps coming up in the searches. &lt;br /&gt;&lt;br /&gt;After a day of searching, I decided I'd try it. Neat interface, easy to follow the examples. Unfortunately, I got really weird results. It must be because of the way the Learner is configured. The winning pattern seems to dominate all the others. I was trying to run it on my site for &lt;a href="http://perlcity.com/"&gt;Perl Books&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;Unfortunately, although I have SSH access, I can't run the C compiler and that meant I could only try 3 of the basic learners. The Decision Tree learner has a lot of C code and it needs to be compiled and linked to Perl.&lt;br /&gt;&lt;br /&gt;After another day of testing, I was looking in dmoz and I saw a link to Text Categorizer, so I'm going to investigate that.&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-108524843550035205?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108524843550035205'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108524843550035205'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2004/05/text-categorization.html' title='Text categorization'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7032875.post-108493050595868522</id><published>2004-05-18T18:23:00.000-07:00</published><updated>2004-05-24T21:03:02.396-07:00</updated><title type='text'>Business Strategy</title><content type='html'>If your small business isn't 
performing, you probably don't have a strategy to make it perform and keep it on track.&lt;br /&gt;&lt;br /&gt;At least not a written strategy.&lt;br /&gt;It's amazing that many businesses just fly by the seat of their pants.&lt;br /&gt;Some of them probably do OK but I can't help thinking that they could do much better if they had a roadmap. &lt;br /&gt;&lt;br /&gt;It's like driving in the country without a map. I did this last week, north of Lincoln. Just kept moving south and west and eventually found a main road. Then I looked at the map when I got home to see where I should have gone and realised I could have been home an hour earlier. I'll remember to take the map next time!&lt;br /&gt;&lt;br /&gt;Last month, I ran a series of PPC tests on Overture advertising &lt;a href="http://loopbiz.com/?-small-business-"&gt;Small Business&lt;/a&gt;. I was surprised at the number of clickthroughs. Most of them were for "free business plan".&lt;br /&gt;&lt;br /&gt;I'm not sure whether people are looking for a plan to use or whether they need some guidance. In any case, without a strategy, what's the point? &lt;br /&gt;&lt;br /&gt;I tried to tune out the freebie seekers by setting the ad up to dissuade them. This month I'm going to try 2 ads in the same section. One will dissuade them and the other won't.&lt;br /&gt;&lt;br /&gt;Look here for the results in a few weeks.&lt;br /&gt;&lt;br /&gt;Alan.
&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7032875-108493050595868522?l=perlcity.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108493050595868522'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7032875/posts/default/108493050595868522'/><link rel='alternate' type='text/html' href='http://perlcity.blogspot.com/2004/05/business-strategy.html' title='Business Strategy'/><author><name>NewsBlaze</name><uri>http://www.blogger.com/profile/02425972917288659866</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry></feed>
