Nerd Words: Fallacies of Data Science

Good piece by Shane Brennan on Medium about the realities of data science in day-to-day working life (in contrast with how it’s taught).

His ten fallacies:

1. The data exists.
2. The data is accessible.
3. The data is consistent.
4. The data is relevant.
5. The data is intuitively understandable.
6. The data can be processed.
7. Analyses can be easily re-executed.
8. Where we’re going we don’t need encryption.
9. Analytics outputs are easily shared and understood.
10. The answer you’re looking for is there in the first place.

I have always considered Excel primarily a medium for creative expression!

He is writing about a business context–for instance, one where Google Analytics, and its attendant woes, are likely to play a big role in answering a client’s marketing strategy question. But what struck me about his fallacies is their aptness in the worlds I hang in–journalism and education. Data journalism is, of course, the flavor of the week, month, and year, and no doubt it is of value–but it is sometimes seen as a magic toolbox that can be used without a hypothesis, without a real data set, and, most importantly, without a clear idea of what would actually constitute a newsworthy answer to the query.

I know there are data journalism efforts that don’t fall prey to Brennan’s list, but I wonder how many. In particular, overcoming that last point in the affirmative is a high bar. Is the information really there for the finding? Reminds me of a quote from Confucius.

“The hardest thing of all is to find a black cat in a dark room, especially if there is no cat.”–Confucius

(As for education, I’ll save my gripes about use and misuse of data for another day.)


Panama Papers in 30 Seconds

Via Vox, a funny and easy-to-digest explanation of the Panama Papers, originally from Reddit:

When you get a quarter you put it in the piggy bank. The piggy bank is on a shelf in your closet. Your mom knows this and she checks on it every once in a while, so she knows when you put more money in or spend it.

Now one day, you might decide “I don’t want mom to look at my money.” So you go over to Johnny’s house with an extra piggy bank that you’re going to keep in his room…

That secret tax-free piggy banks are popular among elected officials, who are paid, of course, by taxes, is not surprising, but does seem particularly odious.

Data, Data Everywhere…

The news biz how it was…words.

…but any room to think?

 

Big data has come to the newspaper biz in a big way. London’s Guardian has a data leaderboard in its newsroom with real-time metrics for how stories are “performing,” but the Financial Times, overachiever that it is, has a whole integrated data enterprise embedded in its news operation.

Digiday has the story. In the excerpt below, the Betts in the quote is Tom Betts, the FT’s chief data officer.

“Tech companies don’t have chief data officers.”
Betts’ appointment also marks the publisher’s evolution to decentralize its analysts. Before last year, engineers and analysts were separate from the rest of the organization. Now, data analysts are embedded in marketing and editorial.

The audience engagement team sits in the newsroom so it can work directly with journalists. It includes data analysts, SEO experts, engagement strategists, social media managers and journalists. Its objectives are to get the FT journalism out to more people and evolve the newsroom with digital readers in mind.

“An analytically mature business is where the vast majority of analysts sit within the other teams,” Betts said. Tech organizations, he added, “don’t have chief data officers.”

The news biz as it is now: numbers.

It goes down so reasonably that you are almost lulled into forgetting to ask what SEO, engagement strategy, and social media actually have to do with journalism. One of these things is not like the other. Still, the FT manages to remain pretty newspapery, certainly compared to many other papers, which seem to be lame print versions of their lame websites.

 

Now I’m off to check my metrics!

Platforms, platforms all around us…

Journalism (and much else for that matter) is mostly a question of platform now; you just may not have noticed it. By platform–a word that, like ‘risk,’ means so many things it has almost withered to a semantic husk of itself–I’m thinking about the technological variety, generally a software system that facilitates, automates and otherwise organizes some human activity. Once upon a time it was the human bit that constituted the ends, with the platform as the means; now ends and means are mixed up, perhaps nowhere more than in journalism.

At least that’s the conclusion I draw from three bits of today’s reading, of passing interest to anybody who is watching the intersection between media and technology with fascination or dread.

First (and most interesting), New York Magazine’s Max Read asks “Can Medium Be Both a Tech Company and a Media Company?”
This is pegged to a story I didn’t know about (and am going to catch up on) in which a tech publication covering its own domain trips up on just what enterprise they are engaged in.

http://nymag.com/following/2015/11/medium-a-tech-company-or-a-media-company.html

The platform of yesteryear, a speaker holding forth at Chautauqua.

“Medium wants to straddle the divide between media and tech — to be both a platform (tech) and a publisher (media). This can place it in an awkward position: Institutionally, is it on the closed-ranks side of the “new class of industrialists” of the tech industry, to whom the question of Airbnb’s liability in the deaths of its guests is already settled? Or is it an editorially independent media company with a mandate to ask uncomfortable questions? So far, its defense against the differing interests of its two halves is transparency. This morning, Matter’s editor Mark Lotto weighed in on the entire set of comments: “I can’t think of another publication or platform where an editor and his boss would have this exchange in front of everybody.”

 

As Read points out, it’s possible to find such transparency pretty easily (including going back to the pre-web days). Further, is ‘transparency’–another word of the moment, along with its sibling ‘disclosure’–enough? Is having a lively debate about the meta-ethics of a story the same as facing up to a potential conflict of interest? Platforms, it seems to me, are awfully conducive to this recursive hall-of-mirrors feel–they yield data about data about data (think of the cascade of comments that never ends, even after it’s become very, very meta). Not sure what this is–or even if it’s bad per se–but it just doesn’t feel like journalism.

Exhibit 2 is a piece from BBC Social Media Editor Mark on the BBC Academy Blog, “#Paris: UGC expertise can no longer be a niche newsroom skill” that raises lots of interesting points. Underscores the reality that social media/user generated content sources are now not just part of the reporter’s job but often most of it, with challenges about verification, and just stomaching what you see in your feeds. (We’ve come a long way from checking the AP and UPI wires instead of writing your story.)

http://www.bbc.co.uk/blogs/collegeofjournalism/entries/fbb87059-ab13-4b79-b5ad-5a47a417fb8a

“But how much time and resource can we afford to spend on uncovering the truth? Are we as invested in the search and verification tools as we are in training our staff across newsrooms to appreciate and understand the risks of UGC fakery? And finally, perhaps most importantly, do we have enough safeguards in place to help those who work with UGC on a regular basis cope with the distressing and disturbing material they see?”

Finally, The NYTimes profiles its most prolific commenters.

“These frequent commenters have also become a community, one that has its own luminaries.

“But who are they? We decided to take a look at some of the most popular commenters on The Times site, which receives around 9,000 online comments a day.”

http://www.nytimes.com/interactive/2015/11/23/nytnow/23commenters.html?_r=0

Short profiles, with photos, that predictably (see recursion above) garnered more than 700 comments.

Fact Checking, Online Communities and Journalism

A few tidbits about fact checking that caught my eye recently:

First, here’s a shocker: algorithms can do it. Tipped via the National Science Foundation’s Science 360 site:

Indiana University scientists create computational algorithm for fact-checking

  • June 17, 2015

FOR IMMEDIATE RELEASE

BLOOMINGTON, Ind. — Network scientists at Indiana University have developed a new computational method that can leverage any body of knowledge to aid in the complex human task of fact-checking.

As a former professional fact checker, I smiled a bit at their “complex human task” description. Sometimes it is, but many of the facts you check for publication in a daily newspaper (people’s names, titles, addresses, spellings, dates, quotations) are pretty straightforward, and the sources are fairly structured. If you think about the sources as sitting in one part of a network, and the query in another, it’s basically a (geometric) math problem. So it makes complete sense that an algorithmic approach could, in principle, do this work. Following paths to explore where the fact “lives,” and whether it can be located in multiple sources with different axes to grind, is what a resourceful fact checker will do–computer or human.
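To make the “paths through a network” idea concrete, here is a toy sketch in Python. To be clear, this is not the Indiana University team’s method, which I haven’t seen in code; it is just a plain breadth-first search over a tiny hand-made graph. The graph entries and the function name find_path are my own inventions for illustration.

```python
from collections import deque

# Toy "knowledge network": each item maps to the items it is linked to.
# Hand-made for illustration; a real system would draw on a large source.
KNOWLEDGE = {
    "Violetta Valery": ["La Traviata"],
    "La Traviata": ["Giuseppe Verdi", "Violetta Valery"],
    "Giuseppe Verdi": ["La Traviata", "Il Trovatore"],
    "Il Trovatore": ["Giuseppe Verdi", "Ines"],
    "Ines": ["Il Trovatore"],
}

def find_path(graph, start, goal):
    """Return the shortest chain of links from start to goal, or None."""
    queue = deque([[start]])  # each queue entry is a path walked so far
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no connection found in this body of knowledge

print(find_path(KNOWLEDGE, "Ines", "Giuseppe Verdi"))
```

A short chain (here Ines, then Il Trovatore, then Giuseppe Verdi) suggests the claim has somewhere to “live” in the sources; no chain at all means this particular network can’t support it, which is not the same as the claim being false.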

I wonder what is next for these researchers, and I hope it involves not just checking facts fed in but also finding a way to determine what facts (and biases) in a document need to be checked. Although there is often misinformation at the root of factual errors, more pernicious and harder to automate is smoking out persistent bias, a problem of sense-making, in which true facts are nonetheless marshaled to dubious or faulty ends, or less balefully, are just not applicable to the question at hand. (Insert old pirates and global warming joke here.) If such tests were computerizable it might end, or at least put a dent in, blog commenting as we know it. Not a bad outcome.

Fact checking is also a notoriously unoptimized activity, at least when done by humans. The more obscure the fact, and honestly, the less relevant, the more heroic and inefficient the quest (microfilm anybody?). That works for sleuthing in the stacks for that telling citation, but on the web, bad facts spread like wildfire, and catching them fast and correcting them decisively would be a real service.

Second, Poynter, a resource for journalists, has an interesting piece about using “gossip communities” (their term) as sources for journalists. Writer Ben Lyons pegs his story to a now-debunked social science study (about whether people became more open to gay marriage based on in-person canvassing). He sheds light on what happens when a journalist needs to enter a subculture, abrasive and unreliable though it sometimes is, to get or check a story.

Complications abound with such “online community beats”: real names are rare, verifiable sources likewise, and the details can often only be checked against the comments of other people in the same world. Nonetheless, in the case of the gay marriage canvassing story, the PoliSciRumors.com community did raise doubts about the data long before it unraveled more publicly. I suppose a modern-day Woodward & Bernstein team wouldn’t be meeting in a parking garage, but in a chat room on Tor!

The next bite, “fact checking, are you doing it at all?” comes from the science journalist (and inventor of dance your PhD thesis!) John Bohannon, who explains that the results from his fake study linking chocolate to weight loss were an all too easy sell to the media, who didn’t bother to sniff out that the results (and the publication they appeared in) were rubbish. From his lively explanation at io9:

Here’s a dirty little science secret: If you measure a large number of things about a small number of people, you are almost guaranteed to get a “statistically significant” result. Our study included 18 different measurements—weight, cholesterol, sodium, blood protein levels, sleep quality, well-being, etc.—from 15 people. (One subject was dropped.) That study design is a recipe for false positives.

It was a perfect storm of problems: p-values, a very small n, and then to top it off a “pay to play” journal that published it two weeks after submission, without changing a word, and for the low low cost of 600 Euros.
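The arithmetic behind that “recipe for false positives” is easy to check. Here is a rough sketch, under two assumptions of mine that the excerpt doesn’t spell out: that each of the 18 measurements is tested independently, and that the usual 0.05 cutoff for “significance” is used. The function name is made up for illustration.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests
    comes up "significant" by luck alone, at threshold alpha
    (independence and alpha=0.05 are assumptions, not from the study)."""
    # Chance that every test stays quiet is (1 - alpha) ** num_tests;
    # subtract from 1 to get the chance that at least one fires.
    return 1 - (1 - alpha) ** num_tests

print(f"{chance_of_false_positive(18):.0%}")  # prints 60%
```

So with 18 things measured, the odds are better than even that something will look like a finding even when nothing is going on. Real measurements like weight and cholesterol are not truly independent, so treat the number as a ballpark.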

The experiment was craptastic, but the news coverage was a dream. And now his “results” are probably part of the corpus of facts that the IU researchers’ computers have to untangle. Maybe they will factor in the questions from commenters, who, unlike professional journalists, raised questions.

And as a bonus, Priceonomics has a timely entry about scientific retractions, with the point that the increase in number is possibly due more to better policing than to an epidemic of cheating (although that remains a possibility).

Fact Checking Words: Doing it Diligently

In the 24-hour 360-degree news cycle that is the Web, fact checking seems to be a lost art, but I encountered this interview with an editor at a small Virginia paper that suggests otherwise:

From the American Press Institute site:

Fact checking a sensitive story: 6 good questions with News Leader editor William Ramsey.

I was particularly struck by these bits:

Q: Can you describe how the fact checking was conducted for this series? Did you use a checklist? A spreadsheet? A particular process?

A: We had a multi-pronged approach. We generated a list of every factual statement (not actual copy) from the main stories and sent it to state officials, who used investigators and PIOs to verify the information. This was critical since a portion of our reporting featured narratives rebuilt from disjointed case records. We also sampled a percentage of our hand-built database and determined an error rate, which was really low. We made those error fixes, and re-sampled another portion, which held up. For accuracy in [Borns’]  writing, we extracted facts from her project’s main story and made a Google spreadsheet for the team, using it to log verification of each fact, the source, the person checking and a note when a change was made to the draft.

Q: In the fact checking of this series, were there any lessons learned that will be used at the News Leader in the future, or could be replicated at other news organizations?

A: I hope so. We tried two new ideas I liked: war room Fridays and a black hat review.

For the Friday sessions, we took over a conference room and brought in reporters not connected to the project. On one Friday, for example, our government reporter spent the day checking story drafts against state records.

For the “black hat” review, borrowed from the software development industry, we took turns playing a critic’s role, peppering ourselves in a hostile interview about process, sources and conclusions. It gave us actionable information to improve the content before it published.

There is so much yammer about computational journalism (much of it hype to my old-school ears), but this example of using both old-fashioned and computer approaches to fact check the work of journalism seems to me a lot more valid than trawling the data for “news” and then reporting it even if specious or trivial. I particularly like the image of “black hat” fact checkers. In cybersecurity, it seems you call (at least some of) these Pen Testers.
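The sample-and-resample check Ramsey describes for the hand-built database is simple enough to sketch. I have no idea how the News Leader actually did it (quite possibly by hand), so take this as one guess at the shape of the thing; the function name estimate_error_rate and the is_correct checker are hypothetical.

```python
import random

def estimate_error_rate(records, is_correct, sample_size, seed=None):
    """Spot-check a random sample and return the fraction that fail.

    records     -- list of database rows (any objects)
    is_correct  -- hypothetical checker: returns True if a row verifies
    sample_size -- how many rows to pull for checking
    """
    if not records or sample_size < 1:
        raise ValueError("need at least one record and a positive sample size")
    rng = random.Random(seed)
    # Draw without replacement so no row is checked twice.
    sample = rng.sample(records, min(sample_size, len(records)))
    errors = sum(1 for row in sample if not is_correct(row))
    return errors / len(sample)
```

In practice is_correct would be a person comparing a row against the case record, not a function. The point is the loop: draw a sample, count the misses, fix them, then draw a fresh sample and see whether the rate holds up.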

 

The Washington Times (a different publication from the current one) as it was 100 years ago. From the Library of Congress’s Historic American Newspapers Collection, which has a display of front pages from 100 years ago each day.

 

30 Days of Musical Tidbits, Day 17: Quick tips for preparing your written materials: A guide for performing classical musicians.

Over years of steady (if part-time) work as a music journalist and as program annotator, I’ve read a lot of résumés, bio-sketches, programs, web sites etc. for classical musicians. They are sometimes, even often, a bit of a mess, felled by typographical errors, out of date copy, fuzzy writing, and unusable visual or other resources.

They don’t do the job of presenting the artist in a clear, engaging way, and certainly don’t help the harried program note preparer or music critic find a needed fact and get on with it.

In the spirit of helping (with what is admittedly one of the world’s less pressing problems) here are some tips on improving editorial materials for classical artists (with singers in mind specifically).

1. Establish a set of consistent and easily updated editorial materials and keep them fresh. Go for quality over quantity, both for your sake and that of your readers. I would recommend a résumé, and at most a bio in two flavors, short and long. Be strategic about the way the bio is written: build it out in modular sections that can be swapped out and supplemented when there are new things to add, rather than requiring a redo from scratch.

2. Keep track of versions of documents by clearly naming and dating them (in both the file name and inside the doc). A file naming convention is good, for instance Violetta_Valery_Bio_Short_11_14_14.docx. Note the filename, author of the doc and important details right in the document as well.

3. Make it as easy as possible for all involved to tell at a glance whether materials are up to date and what to do if they aren’t. Nothing wastes time (and annoys editors) like trying to determine which of 3 or 4 different versions of a bio flying around as email attachments should be used for a program. One approach is to write something like “Violetta Valery’s bio was last updated 11/17/14. Please check http://www.allaboutvioletta.com for the latest version.” Alternatively, you can just say something like “This biography is valid for the 2014-2015 season only; see the web site for more info” and make a point of doing an annual update. Importantly, if you have doubts about your ability to keep the web site up, be realistic with yourself, and don’t set up expectations you can’t meet.

4. Avoid excessive revision. Good: an opera singer slightly reworking a bio to emphasize her achievements in song as she prepares to make her debut in a distinguished lieder series. Bad: completely rewriting a bio because you got cast as the cover Marullo for a big opera company. Also, while it’s reasonable to present the best possible take on your background, don’t lie and don’t inflate. Singing “Ines” in a volunteer performance of Trovatore at Una Volta Opera Company of West Pitchfork, Montana, is all well and good, but does not a major credit make. Arts editors and presenters are savvy readers and can generally read through padding and discern what credits actually mean for the career in question. Don’t oversell or undersell yourself.

“Chicago,” not the musical, the authoritative style guide.

5. Choose and abide by a consistent copy-editing style. This is simple to describe although not so simple to do. It means applying consistent rules for things like the capitalization of titles, names of composers, working with foreign terms, abbreviations, etc. Style guides also specify which choice should be made when several options are acceptable (form of titles, writing out numbers, certain spellings, use of the serial comma, the way sources should be cited, etc.). Newspapers frequently use the Associated Press Stylebook. Academe typically uses a style guide specific to the discipline or the big kahuna of style guides, The Chicago Manual of Style. None of these is targeted to the needs of practicing performing artists and presenters (as far as I know, there is not a resource tailored to this task). If that’s all too much for you, just make sensible rules for yourself and follow them. (Keep track somewhere of your decisions.)

6. Set reasonable expectations of yourself and enlist others to help. If writing and updating these materials is not a strength, don’t sweat it. There is probably an English major in your life who would be happy to help. I’ve flirted with the idea of starting a service to help with this–although in truth, I’m not much of a copyeditor. (Extra points to readers if they can spot all of the inconsistencies and other errors in this post!) Professional writers at all levels have editors, so it’s certainly no shame to ask for help on your materials.

7. Provide a range of photographs in usable formats. Print requires higher resolution photos (300 dpi or greater is preferable), and a large photo at that resolution may be inconvenient to email. Provide print- and digital-friendly versions of key photos (again, don’t go overboard) on your site for download (or in a cloud resource). Provide a caption and a photo credit, and explain any restrictions on use. Make other media (audio clips, video) as easy to use as possible (for instance, making sure it can be embedded).

8. Abide by copyright and other IP requirements. Don’t use materials without permission and don’t put your presenters in the position of inadvertently using copyrighted materials unlawfully. Just because it’s on the web and can be downloaded doesn’t mean it’s okay to use in your press kit or materials you submit for a program. Also, “fair use” is a complicated issue, in that it is a decision that depends on a number of relative factors, one of which is whether there is any commercial interest involved. Given that marketing and promotion are implicit in a singer’s biographical and other materials, there is a risk in assuming that material could be used on a fair use basis automatically.

9. Be on time and responsive. There are a lot of things to juggle for artists, god knows, but stay on top of this, and don’t let the line go dead on this topic. Many presenters and program editors pull their hair out waiting for long-requested materials, or holding a spot to accommodate a program change or bio update. Playbill–and most other publications–fine presenters for late material, and, of course, late changes breed opportunities for errors. Managing things in a timely fashion will be enormously appreciated, as will being forthright when problems come up. A practical example: if you can’t get a program note you had hoped to write done in time, call and explain yourself. That will give enough time to consider a “plan B” (for instance, doing a Q&A on the program that can be pulled together in two days). Just hiding in a bunker and not answering email for two weeks risks making a minor glitch into a major hassle.

Your editorial materials are not the most important part of your tasks, certainly, but handle them professionally from the get go and they’ll add polish to your presentation, save everybody some time and headaches, and might even open some unexpected doors.

Reasonable Words: The Linotype

Just finished Keith Houston’s informative and droll Shady Characters: The Secret Life of Punctuation, Symbols & Other Typographical Marks, a book teeming with a lot of news about such creatures as the pilcrow, interrobang, octothorpe, and the surprisingly complicated history of the hyphen.

This last is of course related to the rules for word division, which once upon a time, long before computer word processing programs relieved us of this task, writers (even mere typers, like myself) were supposed to master. I took typing in high school and I doubt I ever correctly applied the 10 rules for word hyphenation–not sure I even learned them.

While illuminating the hyphen, Houston takes us on a side trip to the Linotype and Monotype machines, nearly mythic to me–as both my parents started in journalism in the era of hot type. These wildly complicated contraptions automated the setting of type, but they still left hyphenation up to the operators. This was the least of their worries, as Houston relates:

A Linotype machine at the Charles River Museum of Industry in Waltham, MA.

“For all the speed gained over hand composition, there were dangers inherent in the machines that required their users to work beside bubbling crucibles of molten lead. The joy of mechanically setting line after line o’ type came with the added frisson that a “squirt” might occur at any time: any detritus caught between two adjacent Linotype matrices would allow molten type metal to jet through the gap. And aside from the immediate dangers of seared flesh, operators of both Linotype and Monotypes ran the more insidious risk of poisoning from the (highly flammable) benzene used to clean matrices, the natural gas that some machines burned to melt the type metal, and the fumes emitted by the molten type metal itself.”

Makes my regular carping about the annoyances of WordPress seem a little silly! Updating versions has not, as yet, required me to dodge squirts of molten lead, but God knows what they are thinking up for the next release.

BTW, Houston has a blog on the same topic–also charming, but somehow this topic seems to really twinkle in book form.

Surprising Words: Fact Checking in Books

When I was a news researcher, it was surprising to me that you were allowed to cite a fact previously reported in our own pages to resolve a query. But at least the effort to get things right was serious; if this Atlantic piece is correct, book publishers don’t bother now, and never really did.

One of the most notorious and colorful publishing frauds. One quibble with the Atlantic piece…fact-checking and fraud detection are distinct tasks. As is rooting out bias. Most editorial “gatekeepers,” the few that are left, don’t attempt all three.

http://www.theatlantic.com/entertainment/archive/2014/09/why-books-still-arent-fact-checked/378789/

“When I was working on my book, I did an anecdotal survey asking people: Between books, magazines, and newspapers, which do you think has the most fact-checking?” explained Craig Silverman, author of Regret the Error, a book on media accuracy, and founder of a blog by the same name. Almost inevitably, the people Silverman spoke with guessed books.

“A lot of readers have the perception that when something arrives as a book, it’s gone through a more rigorous fact-checking process than a magazine or a newspaper or a website, and that’s simply not the case,” Silverman said. He attributes this in part to the physical nature of a book: Its ink and weight imbue it with a sense of significance unlike that of other mediums.

Fact-checking dates back to the founding of Time in 1923, and has a strong tradition at places like Mother Jones and The New Yorker. (The Atlantic checks every article in print.) But it’s becoming less and less common even in the magazine world. Silverman suggests this is in part due to the Internet and the drive for quick content production. “Fact-checkers don’t increase content production,” he said. “Arguably, they slow it.”

What many readers don’t realize is that fact-checking has never been standard practice in the book-publishing world at all.

Commonplace Book: Meg Greenfield

Finally getting to Washington, the hybrid memoir and take down of Washington, DC, which Meg Greenfield, the late editor of the op-ed page of the Post, wrote in secret as she was battling cancer. Having read This Town a few months back, it’s clear that Greenfield was diagnosing many of the same ills as Leibovich is, with fewer wisecracks but a sharper scalpel.

Like him, she doesn’t excuse herself from playing a role in perpetuating aspects of a system she finds baleful in many ways. But she has an interesting view of the “two-track” language of Washington (which she acknowledges she learned to decode, even if not to speak, early in her career).

Here’s a representative excerpt; Greenfield on politics and lying:

Reflecting on what had taken me by surprise at the outset, and on so much like it that I observed in Washington and politics generally over the years, I was eventually to conclude that there is a two-part truth that just about every one of us knows and has always known but that practically none of us will admit for fear of being seen as an accomplice. It is, first, that the basic linguistic unit of speech in politics–all politics, not just the Washington kind–is a statement that is already somewhere between one-eighth and one-fourth of the way to being a lie. (I will leave it to others to decide whether this is any different from the basic speech of either commerce or love and, if so, in what degree and with what moral difference.)

The other part of the proposition is that such deception appears to be built into the process, a function of what we demand and expect and what they feel is required to stay alive and get anything whatever to happen. Politics, in other words–and not just politics practiced by people you don’t like, but politics across the board–pretty much rests on a foundation of fractional lies, justified by some commonly shared if rarely acknowledged presumptions of necessity. The phenomenon, after all, has long been copiously in evidence from the White House briefing room to the debate among candidates for city council and every other office to the solemn pronouncements of State Department spokesmen to the ocean of near parody gibbledy-gabble that engulfs the pages of the Congressional Record.

Our preferred way of dealing with this discomfiting truth seems to be to add one more smallish-to-medium lie of our own. We grouse or look the other way or pretend to be shocked, even though at some level we have known all along that what is being asserted as truth really didn’t happen that way and never was going to.

We knew that our elected representatives were not acting out of the unalloyed high purpose they solemnly claimed. And we knew that likely as not when they promised tireless, strong, conclusive action on something we cared about deeply, the odds were that more than a few of them were already shopping around for a respectable looking cop-out in which to take shelter and pretend they had done their utmost but that the bad guys had stopped them or, even more contemptuous of our intelligence, that the cop-out was the great deal they had promised.

Yet whenever one of these implausible fictions that we never took seriously to begin with hits the news for some reason and is exposed for what it is, we manage to project heart-wrenching disillusion all over again. More reckless, in my opinion, we leap to endorse and thus encourage what must be the most tiresome strain of commentary running through the nation’s op-ed pages–and that is saying something–namely, that if the politicians don’t cut it out, we the people will become cynical. Yes, we say, you are making us cynical, awful you–ignoring the fact that America and Americans were born cynical, or at least profoundly skeptical, about politicians, which is perhaps why we have survived as long as we have.

 

Ground zero of ‘gibbledy-gabble,’ aka the Government Printing Office.

It’s still worth a read, for, among other things, a reminder that the political culture we so avidly deplore in Washington now is hardly new. The book also casts a sometimes hard-to-read shadow narrative about Greenfield’s personal life, including the price paid for breaking through barriers–although she’d probably cringe at that term–in the all-boys world of newspapering in mid-century.