bot: roadmap
Doug Snead
July, 2004

DrugSense Newsbot: Overview

The DrugSense newsbot ("newsbot", "bot", etc.) is a software system that efficiently gathers news from online sources, on a given topical area. The prototype implementation we will discuss here has the example topic area of "drug related" news articles.
DrugSense News Bot

The system can gather the very latest news on a topic area, and it does this from any online html news source. Articles of interest are discovered by very limited spidering of news sites, following links of topical interest and ignoring other links. Articles are automatically categorized and summarized according to selected concept terms. Once properly configured, the system runs continuously and robustly to gather and (sub)categorize the latest news on the desired topic. No human intervention is then needed.

Unlike news "aggregators", the newsbot does not need or use other RSS XML as input. Instead, it scrapes online papers' html and produces RSS (etc.) as output, which other RSS aggregators can then use.

So, while the articles that the newsbot scans and classifies are arbitrary html, the newsbot's output is RSS XML (and html, javascript, etc.). The topic area, as well as the configuration (sites spidered, and the way they are spidered) are all defined in simple XML formats.

Concept-based News Bot

Concept Dictionary


Central to the operation of the newsbot is the "concept dictionary." The concept dictionary is a type of thesaurus, where keywords are grouped together to form "concepts." Concepts, basically, are named lists of keywords. The concept dictionary the newsbot uses is a type of lightweight keyword ontology, a conceptual schema or a taxonomy, for a given topic area.
Newsbot's Concept Dictionary
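
As a rough illustration of "named lists of keywords", here is a minimal Python sketch. The concept names, the keywords, and the concepts_in helper are all invented for this example; the newsbot itself is written in perl and its real dictionary is not reproduced here.

```python
# Hypothetical concept dictionary: each concept is a named list of keywords.
# The concept names and keywords below are invented for illustration.
CONCEPTS = {
    "cannabis": ["cannabis", "marijuana", "hemp"],
    "oxycodone": ["oxycodone", "oxycontin"],
    "prison": ["prison", "jail", "incarceration"],
}


def concepts_in(text, concepts=CONCEPTS):
    """Return the sorted names of all concepts with at least one keyword in text."""
    lowered = text.lower()
    return sorted(
        name
        for name, keywords in concepts.items()
        if any(keyword in lowered for keyword in keywords)
    )


print(concepts_in("Marijuana arrests fill county jail"))  # ['cannabis', 'prison']
```

Plain substring matching is used here only to keep the sketch short.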

Concepts Drive Site Content, Organization

The concepts drive the content of the system. The robot uses the concepts to guide the article finding process. The concepts are used to categorize and rank articles by relevance to the desired topic area. The concepts also are reflected as the "topics" hierarchy. Concepts are used to summarize articles. The concepts are central to the operation of the entire newsbot system and are crafted to let the system quickly sort items of topical interest. The concepts are made to let the newsbot efficiently categorize news articles; they are not intended to be a rigorous ontology, nor are they intended to be completely exhaustive.

Concept-based Spidering, Text Mining

The robot/spider portion of the system uses the concept dictionary (the concepts' keywords) to guide the new article discovery process. Online newspaper front pages and index pages are scanned for concepts. Links are followed when they look interesting, that is, when they contain certain (user specified) concepts.
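
The link-following rule can be sketched roughly as follows, in Python. This assumes that "interesting" is decided from a link's anchor text, which the text above does not state outright; the INTERESTING pattern, the LinkCollector class and the sample page are hypothetical.

```python
from html.parser import HTMLParser
import re

# Hypothetical concept pattern; the real concepts are user specified.
INTERESTING = re.compile(r"\b(marijuana|cannabis|drug court)\b", re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an index page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def interesting_links(html):
    """Return hrefs whose anchor text mentions a concept keyword."""
    collector = LinkCollector()
    collector.feed(html)
    return [href for href, text in collector.links if INTERESTING.search(text)]


page = (
    '<a href="/sports/1">Local team wins</a>'
    '<a href="/news/2">Council debates cannabis zoning</a>'
)
print(interesting_links(page))  # ['/news/2']
```

Only the second link would be fetched; the sports link is ignored, which is what keeps the spidering "very limited".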

The technique that the newsbot uses in 'extracting interesting and non-trivial information and knowledge from unstructured text' is often called text mining.

Concept Relationships Reflected As Topics

Concepts may be linked to other concepts, as sub-concepts. (This is similar to broader term/narrower term thesauri.) Concepts can be linked together to form hierarchies. The (possibly tangled) hierarchies formed by the concepts and their related sub-concepts are reflected in the topic/sub-topic organization in the newsbot site. Automatically generated "topic" pages are maintained by the system. These pages list recent articles that contain the concept/topic. So topics reflect the organization of the concepts; for each concept a corresponding topic page is maintained by the bot.
Newsbot's list of topics, automatically made from concept dictionary
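
A "tangled" hierarchy is one where a concept can sit under more than one broader concept. The following Python sketch shows one way such sub-concept links could be walked to list the sub-topics of a topic. The SUBCONCEPTS table and the narrower function are hypothetical and are not taken from the newsbot.

```python
# Hypothetical sub-concept links (concept -> narrower concepts).
SUBCONCEPTS = {
    "drugs": ["opioids", "cannabis"],
    "opioids": ["oxycodone", "heroin"],
    "medicine": ["oxycodone"],  # "tangled": oxycodone has two parents
}


def narrower(concept, links=SUBCONCEPTS):
    """Return every concept reachable below `concept`, each listed once."""
    seen = []
    queue = list(links.get(concept, []))
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue  # already visited via another parent
        seen.append(current)
        queue.extend(links.get(current, []))
    return seen


print(narrower("drugs"))  # ['opioids', 'cannabis', 'oxycodone', 'heroin']
```

Tracking visited concepts keeps the walk finite even if the links were to form a loop.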

Concept Keywords

Concept keywords (called "terms") are implemented as regular expressions for some flexibility.
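
To show the kind of flexibility regular expressions give, here is a small Python sketch in which one term covers spelling variants or plurals. The patterns and the count_hits function are invented for illustration; the newsbot's own terms are perl regular expressions and are not shown in this document.

```python
import re

# Hypothetical terms, written as regular expressions so that a single term
# can cover spelling variants and plurals.
TERMS = {
    "cannabis": [r"\bmari[jh]uana\b", r"\bcannabis\b"],
    "methamphetamine": [r"\bmeth(?:amphetamine)?s?\b"],
}

COMPILED = {
    name: [re.compile(term, re.IGNORECASE) for term in terms]
    for name, terms in TERMS.items()
}


def count_hits(text):
    """Count regex matches per concept, omitting concepts with no matches."""
    hits = {}
    for name, patterns in COMPILED.items():
        total = sum(len(pattern.findall(text)) for pattern in patterns)
        if total:
            hits[name] = total
    return hits


print(count_hits("Marihuana and meth seized; cannabis plants destroyed"))
# {'cannabis': 2, 'methamphetamine': 1}
```

Counts like these could feed a relevance ranking, although how the newsbot actually ranks articles is not specified here.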

The Content: Automatic Discovery of Breaking News Articles

As mentioned before, the system uses the concepts to spider news sites for interesting articles. Articles discovered by the system are cached on the server, and their categorization (a list of concepts applied to the given article) is also stored.

Richly categorizing news articles with a set of concept tags allows for concept-based searches later. Searches can be run from a search form that appears on every page of the newsbot site.

It is possible to write simple (cgi, shell, etc.) scripts to create/cache a series of complicated searches. For example, one script caches a "breaking UK drug news" page every 15 minutes. This page is created by a trivial sh(1) script that uses the bot's command-line shell script modes to make a page that embodies five searches.
Newsbot page built with simple script

Newsbot: Open Service

Like Meerkat-style servers, the newsbot also lets users specify searches as parameters or options in urls. The newsbot is an open service. Its API is open and documented (and trivially simple).
Newsbot search, specified in the url.

Point a browser at that url, and the results of a search for items containing "Mexico" are returned, as html.

One may search for specific concepts, also.
search for the oxycodone concept

Boolean expressions are permitted, too.
search for "Texas" articles that don't mention cannabis etc.
search articles that mention "Thailand" or "Philippines"
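
The two boolean searches above can be modelled roughly as follows. This Python sketch does not use the newsbot's url syntax, which is not reproduced in this document; the sample articles and the helper names (mentions, has_concept, both, either, negate) are assumptions made for illustration.

```python
# Hypothetical cached articles: a title plus the concept tags applied to it.
ARTICLES = [
    {"title": "Texas lawmakers weigh hemp bill", "concepts": {"cannabis"}},
    {"title": "Texas county expands drug court", "concepts": {"drug court"}},
    {"title": "Thailand revises sentencing rules", "concepts": {"prison"}},
    {"title": "Philippines reports new arrests", "concepts": {"arrest"}},
]


def mentions(word):
    """Predicate: the article title contains the word (case-insensitive)."""
    return lambda article: word.lower() in article["title"].lower()


def has_concept(name):
    """Predicate: the article was tagged with the named concept."""
    return lambda article: name in article["concepts"]


def both(p, q):
    return lambda article: p(article) and q(article)


def either(p, q):
    return lambda article: p(article) or q(article)


def negate(p):
    return lambda article: not p(article)


def search(predicate, articles=ARTICLES):
    """Return the titles of all articles that satisfy the predicate."""
    return [a["title"] for a in articles if predicate(a)]


texas_without_cannabis = both(mentions("Texas"), negate(has_concept("cannabis")))
thailand_or_philippines = either(mentions("Thailand"), mentions("Philippines"))

print(search(texas_without_cannabis))   # ['Texas county expands drug court']
print(search(thailand_or_philippines))
```

Because concepts are stored with each article, excluding "cannabis etc." is a tag lookup rather than a second keyword scan.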

XML, RSS and Javascript Output

RSS (Really Simple Syndication) allows anyone to syndicate unique content. Others can subscribe to one's own syndicated output. Users can make the newsbot output RSS, as an option.
search for "Mexico" articles, output it in RSS -> News
examples of sites using this newsbot's RSS feeds
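
For readers unfamiliar with RSS, the following Python sketch turns a list of search results into a small feed. It uses a plain RSS 2.0 layout for brevity, which is an assumption: the newsbot's own feeds are described elsewhere in this document as RDF-based. The to_rss function, the example.org urls and the sample item are all hypothetical.

```python
import xml.etree.ElementTree as ET


def to_rss(feed_title, feed_link, items):
    """Build a minimal RSS 2.0 document from a list of result dicts."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = f"Search results: {feed_title}"
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
        ET.SubElement(node, "description").text = item["summary"]
    return ET.tostring(rss, encoding="unicode")


# Placeholder data; the urls are illustrative only.
results = [
    {
        "title": "Mexico debates sentencing reform",
        "link": "http://example.org/articles/1",
        "summary": "Lawmakers in Mexico opened debate on sentencing rules.",
    }
]
feed = to_rss("Mexico", "http://example.org/search?q=Mexico", results)
print(feed)
```

ElementTree escapes characters such as & in the query string, so the output stays well-formed XML.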

Other users may prefer javascript. A similar option outputs data as javascript. This makes the system's categorized stream of articles even more accessible - anyone can add a few lines of javascript to their web page and get headlines.
search for "Mexico" articles, output in javascript
examples of sites using this newsbot's javascript feeds

User Accounts, Profiles, News Analyst Features

The newsbot supports user registration and login. Once registered and logged in, users may create concept and keyword based interest profiles. Once a profile is created, users then have the option of having "interesting" articles mailed to them.

Other (login-based) features are designed to mesh with MAPInc operations. For example, the system attempts to determine if articles cached by the newsbot have been submitted to MAPInc's drug policy article clipping service already. If so, users are alerted to that article's submission status.

Recap: Newsbot Output

The newsbot can output html (the default), RSS (XML RDF), XFML, and javascript. A simple, open, url interface lets users take advantage of the newsbot's richly categorized database of breaking news articles.
the simple API for news feeds, searches

Newsbot Design

The concept dictionary is stored as XML. For the current prototype system that deals with the topic area of "drug related" articles, the concept dictionary was hand-written in XML. The newsbot itself is written in perl.
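
The actual XML format of the concept dictionary is not shown in this document. As a sketch only, the following Python example assumes a hypothetical layout (the concept, term and sub element names are invented) and loads it into a simple structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout; the real dictionary's element names may differ.
SAMPLE = """
<concepts>
  <concept name="opioids">
    <term>opioids?</term>
    <term>opiates?</term>
    <sub>oxycodone</sub>
  </concept>
  <concept name="oxycodone">
    <term>oxycodone</term>
    <term>oxycontin</term>
  </concept>
</concepts>
"""


def load_concepts(xml_text):
    """Parse the assumed XML layout into {name: {"terms": [...], "subconcepts": [...]}}."""
    root = ET.fromstring(xml_text.strip())
    concepts = {}
    for node in root.findall("concept"):
        concepts[node.get("name")] = {
            "terms": [term.text for term in node.findall("term")],
            "subconcepts": [sub.text for sub in node.findall("sub")],
        }
    return concepts


dictionary = load_concepts(SAMPLE)
print(dictionary["opioids"]["subconcepts"])  # ['oxycodone']
```

Keeping terms and sub-concept links in one file means a new topic area needs only a new dictionary, not new code.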

The DrugSense Newsbot: Some Statistics
number of online papers spidered daily: about 650 sites
html pages examined per site: 7 to 15 (configurable; currently set to look at at least 7 pages and, if interesting pages are found, to spider up to 15)
minimum number of web pages spidered per day: over 4550 html pages
maximum number of web pages spidered per day: 9750 html pages
interesting articles found per day: about 400 articles
articles cached for: 5 days back
running since: January 2003, on Baremetal servers
number of "concepts" in the concept dictionary: about 50 concepts
number of keywords per concept: from 2 to around 70 keywords
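
The daily page counts follow from the site count and the per-site page limits. The short Python check below assumes each site is visited once per day, which the figures imply but do not state. The cache estimate at the end is an inference from the numbers above, not a figure from the source.

```python
SITES = 650            # online papers spidered daily (approximate)
MIN_PAGES_PER_SITE = 7
MAX_PAGES_PER_SITE = 15

min_pages_per_day = SITES * MIN_PAGES_PER_SITE
max_pages_per_day = SITES * MAX_PAGES_PER_SITE
print(min_pages_per_day, max_pages_per_day)  # 4550 9750

# Inferred, not stated in the source: about 400 articles a day kept for
# 5 days suggests roughly this many articles in the cache at any time.
ARTICLES_PER_DAY = 400
CACHE_DAYS = 5
print(ARTICLES_PER_DAY * CACHE_DAYS)  # 2000
```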

Newsbot Operation: Bot Makes RSS News feeds from HTML-Only News Sites

Unlike Meerkat, the newsbot examines the html article text, so sources don't need rss/xml themselves. However, the newsbot's output can be RSS XML. So the newsbot makes RSS news feeds from sites that don't, themselves, have feeds. (Meerkat uses existing RSS feeds as input; the newsbot can make use of any site or text source, regardless of whether the source offers a feed of its own.) The newsbot automatically generates summaries for interesting articles. These summaries are used for descriptions in the output.
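
How the summaries are generated is not detailed here, beyond their being driven by concept terms. One simple approach, shown as a Python sketch, keeps the first few sentences that mention a concept term. This is an assumed technique, not necessarily the newsbot's own method; the summarize function and the sample text are hypothetical.

```python
import re


def summarize(text, patterns, max_sentences=2):
    """Join the first few sentences that match any of the given term patterns."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    compiled = [re.compile(pattern, re.IGNORECASE) for pattern in patterns]
    picked = [s for s in sentences if any(c.search(s) for c in compiled)]
    return " ".join(picked[:max_sentences])


article = (
    "The council met Tuesday. Members debated a cannabis ordinance. "
    "Parking was also discussed. A marijuana dispensary owner spoke."
)
print(summarize(article, [r"\bcannabis\b", r"\bmarijuana\b"]))
# Members debated a cannabis ordinance. A marijuana dispensary owner spoke.
```

A summary built this way could serve as the description element of each item in the feed.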

Server Usage

The newsbot's spidering and article discovery are somewhat CPU intensive. To balance system load against the speed of article discovery, the newsbot automatically slows down, using less CPU, when CPU usage becomes too high. This feature is configurable.
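
A load-based slowdown of this kind can be sketched as follows in Python. The threshold, the maximum delay and the function names are assumptions chosen for illustration; the newsbot's actual mechanism and settings are not documented here.

```python
import os
import time


def throttle_delay(load, cpu_count, threshold=0.75, max_delay=5.0):
    """Return seconds to pause before the next fetch, given the 1-minute load."""
    per_cpu = load / max(cpu_count, 1)
    if per_cpu <= threshold:
        return 0.0
    # Scale the pause with how far the load exceeds the threshold.
    return min(max_delay, (per_cpu - threshold) * max_delay)


def polite_pause():
    """Sleep if the machine is busy; returns the delay that was applied."""
    load, _, _ = os.getloadavg()  # available on Unix-like systems only
    delay = throttle_delay(load, os.cpu_count() or 1)
    if delay:
        time.sleep(delay)
    return delay


print(throttle_delay(0.5, 1))   # 0.0
print(throttle_delay(1.0, 1))   # 1.25
```

A spider loop would call polite_pause() between page fetches, trading discovery speed for lower server load.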

Competitive Concept-based products

Apelon's Content Tagging Tool (apelon.com)
DCARS (Document Content Analysis and Retrieval System)
AmikaNow! Content Analysis Toolkit
TEMIS - Text Mining Solutions
Autonomy Corp. (autonomy.com)
NetOwl (SRA, Inc.) (netowl.com)
Mesa Dynamics "theConcept" Text Mining Software
IBM Masala text mining web crawler
USAF / Cymfony Inc. Dashboard product, text mining

Where we are: Summary

That's a brief overview of the newsbot, and how it uses the concept dictionary. The concept dictionary is central to the newsbot's operation. The concepts are used to automatically categorize articles. The concepts determine the topic area of interest, and the granularity at which articles are grouped. The dictionary organizes lists of keywords (regular expression patterns) so that they can efficiently guide the system as it spiders online news sites. The concept dictionary is implemented as an XML configuration file.

The newsbot runs 24/7 without human intervention, continuously gathering breaking news relating to the desired topic area. A prototype of the newsbot exists, and has been running for over a year. No other site delivers more recent breaking drug war and drug policy news.

The output of the system is a stream of breaking news, articles of interest. The output of the system is both html (cached pages on the site, searches on demand), and XML RSS news feeds. Other output types, like javascript or XFML, are also available. By making the search requests specifiable as url parameters, and by making the search output available as RSS XML news feeds, people can easily use the newsbot as an open resource. When someone needs news feeds for their site (related to the general topic area, in our example, of "drug related" news items), they are able to easily get them from the newsbot site. For example, PHPnuke-based sites can use the RSS output of the newsbot directly.
