Data, Data Everywhere

by Maitri on March 2, 2010

in computing & internet, gizmos & hacks, science & technology

You read yesterday’s VatulBlog post on natural disasters and the benefits of educating yourself in this time of information plenty. Where do you start, however, when search engine output has a rather low signal-to-noise ratio?

1) A list of Google’s or Ask’s search terms shows many internet users cannot pose a question to save their lives. Use these 20 Tips For More Effective Google Searches to avoid being overwhelmed by search results. Become one with Boolean.

2) There are many free science search tools out there. Here is a review of the prominent ones. Search early, search often.

3) Ask a librarian. There exist these humans, many of them Library & Information Science graduates and other smart people, who work at libraries and can help you hone your search skills as well as find answers to your questions. They also do funny dances with library carts, I hear, which is always worth the time spent at your local library. Jokes aside, considering the number of people who have neither a home computer nor computer and online literacy, well-funded libraries and intelligent, helpful librarians are crucial to the future of information facilitation. Also consider publishing your search results and analyses as a blog post to help others who may have the same question.

4) For humanity’s sake, lower the noise or help verify data.

Blair pointed me to The Economist’s The Data Deluge, which simultaneously celebrates and bemoans the exponential increase in data, be it photographic, tabular, scientific or vital-statistical in nature, and notes that corporations are only beginning to find value in separating the wheat from the chaff. Violation of privacy aside, the greatest threats to signal quality are data hoarding (especially by individuals, taxpayer-funded organizations and publishing houses) and the large amount of noise in the system. Those who do put information out there don’t (want to) curate it, which includes ensuring accuracy and constant updates. In a comment on the aforementioned Economist article, a D. Sherman says it best:

… the vast majority of “noise” in databases is simply bad data, duly entered and propagated … We put a great deal of effort into collecting more and more data, but comparatively little into weeding out bad data. This implies that the sign[al]-to-noise ratio is only going to deteriorate. Part of the reason for that is that the incentives all favor collecting more data, but do not adequately penalize bad data.

… Members of the database resistance movement who are willing to risk more radical means of spoofing more important databases can readily imagine even more creative (though less legal) methods. When RFID tags and image-recognition tracking of people and vehicles becomes more common, the opportunities and means for injecting noise into the system will multiply exponentially.

The solution is a selective and sophisticated willingness to part with data depending on what the data involves. Not divulging personal information and data hoarding are two different things. While I value my personal privacy and often purposefully give wrong phone numbers and zip codes to websites and salespeople, I will readily part with any scientific, technological or social information that is not classified explicitly as proprietary. Google and Bing are not entitled to my street address but they are to my paper on transtensional folding because Americans paid for it and not Elsevier or Springer.

Another sure-fire way to lower noise is to cut back on content replication. A mature search engine will a) enforce canonical URLs and b) take a scythe to unabashed content scrapers who enjoy high search engine rankings. I’ve lost track of the number of MY blog posts that show up on others’ sites as higher-ranked search hits. (Just because my content is published under a CC-BY-NC-SA license does not mean you are entitled to pilfer it and ignore the non-commercial and share-alike components of the license.) Also, take-down notices only work if you can find a responsive human on the other end, and who has time for that? There has to be an easier way to punish websites and aggregators for outright plagiarism and internet abuse.
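For the curious, here is a rough sketch of what those two ideas could look like in code. It is not how any real search engine does it. The function names, the choice to drop "utm_" tracking parameters, and the three-word shingle size are all my own assumptions for illustration: one function collapses URL variants into a single canonical form, and the other scores how much two pieces of text overlap, which is one simple way to flag a scraped copy.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: query parameters with these prefixes are tracking noise.
TRACKING_PREFIXES = ("utm_",)
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonical_url(url: str) -> str:
    """Collapse common variants of a URL into one form (hypothetical helper)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    if parts.port and DEFAULT_PORTS.get(scheme) != parts.port:
        host = f"{host}:{parts.port}"
    # Treat "/post/" and "/post" as the same page.
    path = parts.path.rstrip("/") or "/"
    # Drop tracking parameters and sort the rest for a stable order.
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    query = urlencode(sorted(pairs))
    # The fragment ("#top") never reaches the server, so discard it.
    return urlunsplit((scheme, host, path, query, ""))


def shingles(text: str, k: int = 3) -> set:
    """Return the set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    set_a, set_b = shingles(text_a), shingles(text_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    print(canonical_url("HTTP://Example.com:80/post/?utm_source=feed&b=2&a=1#top"))
    original = "become one with boolean and search early and often"
    print(similarity(original, original))
```

A score near 1.0 between my post and a page on someone else's domain would be a decent hint that the page is a copy. Deciding which of the two is the original is the hard part, and this sketch does not attempt it.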

Speaking of content replication, here’s an interesting “what-if” article on information copying and machine sentience. I guess there is one benefit to a low signal-to-noise ratio in the network: really dumb Artificial Intelligence, should it emerge.

4 comments

D March 2, 2010, 11:39 AM

“Become one with Boolean.”??????

physician, heal thyself!

Maitri March 2, 2010, 11:43 AM

:-PPP Not my fault I have an expert search engine user for a husband and others don’t.

Patrix March 2, 2010, 5:15 PM

Feed scrapers are the sole reason I sometimes contemplate going the partial feed way but then better sense prevails.

Clay March 4, 2010, 8:34 PM

http://www.wired.com/gadgetlab/2010/03/popular-science-puts-entire-scanned-archive-online-free/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+wired%2Findex+%28Wired%3A+Index+3+%28Top+Stories+2%29%29

All of Popular Science’s archives online for free. Going ALL the way back. Jet packs and flying subs galore!
