Big Data Analytics: Articles, Movies, Songs Robo-written by Computer?

Does big data analytics permit robo-journalism, robo-songwriting, and robo-movie scripting? Shown here is TOPIO, TOSY's humanoid ping-pong-playing robot. Photo: TOPIO/WikiMedia/Creative Commons 3 Attribution Share Alike

First in a series of articles on big data analytics

This is the first in a series of planned blog articles on applications of big data analytics. We are an Internet of Things analytics company, focusing initially on indoor air quality analytics with our AQcalc app.

Home environment and health/fitness applications are the first killer applications for the Internet of Things. This is driven by the dramatic price drop in the technology that permits everyday appliances, devices, and sensors — even traditionally ‘dumb’ and static objects like washing machines, windows, kitchen cupboards, and refrigerator shelves — to be connected to the Internet (see our Cisco CES talk article). Consumers are seeing the first of these killer applications in the form of fitness trackers. (The much-rumored Apple iWatch may well turn out to be a cross between an iPhone and something like a FitBit or a Misfit Shine.) There are also Internet-connected devices such as the Nest learning thermostat and its companion smoke alarm. And of course there is our own app, which helps you diagnose your indoor air quality and will eventually help you control your home automation devices to assist with air quality.

Suggestions from readers on (non-controversial?!) uses of big data

(We’re looking for suggestions from readers for future articles on big data analytics and the Internet of Things. One timely application involves government consumers using data analytics and Internet-connected devices to engage in territorial land grabs against neighboring countries. We are considering future blog articles showing how to use analytics to figure out which countries to target for territorial aggression. Topics like using big data analytics and online media to efficiently deploy propaganda to soften the opposition prior to the invasion and sugarcoat the actual territorial expansion. 🙂 We’ll have another article on coordinating tanks and heavy armor using the Internet of Things during the actual invasion. Finally, we will have an article about using algorithmics to more efficiently consolidate your hold on power in the aftermath of the enlargement. We’ve previously suggested using Internet of Things applications to improve the air quality and health of citizens (including a light-hearted government backdoor), but we know this is a secondary consideration. If your country is looking to expand at the expense of your neighbors in this way — and we know of many such prospective customer countries — our blog articles will help you with an easy-to-read, step-by-step how-to guide.)

In view of various recent events, some people might joke that replacing certain politicians with robots should be of higher priority than putting writers out of work. In at least one strange and non-obvious way, feedback loops already exist whereby insights from big data are able to influence policies almost instantaneously. In this way, robo-politicians may already exist (although not in a way that threatens anyone’s job!). We’ll hold that thought until the end of the article.

Earthquakes herald the dawn of robo-journalism?

The Los Angeles Times’ initial article on yesterday’s earthquake was written by a computer, which Slate heralded as the dawn of robo-journalism.

(Obviously, if Slate is writing articles like this, the earthquake itself was no biggie. It did appear to have generated a lasting Internet meme of a terrified-looking TV anchor, however. It was a 4.4 on the Richter scale, a snoozer by California standards, except that it was right under the city’s Westside.)



The robo-journalistic feat itself is not that interesting. The LA Times reporter simply wrote a script, dubbed ‘Quakebot’, that extracts reports of earthquakes above magnitude 3.0 from the USGS feed and fills out an article template with the information. The initial article is not especially good, including extraneous information from the USGS such as the distance between Los Angeles and Sacramento. It includes a disclaimer at the bottom that the article was written by a computer program written by the journalist. It hit the LA Times website within three minutes. (The delay was due to a required human approval step, put in place because of occasional glitches in the USGS feed.) After landing the journalistic scoop, the article was then vastly improved over time by the LA Times’ human staff of editors and journalists into a very professional piece. The LA Times has other bots that auto-generate the initial scoop from similar government databases, such as city homicide records.
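To make the mechanics concrete, here is a minimal Quakebot-style sketch in Python. This is not the LA Times’ actual code: the feed URL and field names below follow the public USGS GeoJSON API, but the sentence template and threshold handling are our own assumptions.

```python
# Quakebot-style sketch (illustrative only; not the LA Times' actual code).
# Polls the public USGS earthquake GeoJSON feed and fills out a sentence
# template for any quake at or above the magnitude threshold.
import json
import urllib.request
from datetime import datetime, timezone

USGS_FEED = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_hour.geojson"
TEMPLATE = ("A magnitude {mag:.1f} earthquake struck {place} at {time} UTC, "
            "according to a U.S. Geological Survey feed.")

def draft_quake_articles(min_magnitude=3.0):
    """Return draft article leads for recent quakes above the threshold."""
    with urllib.request.urlopen(USGS_FEED) as response:
        feed = json.load(response)
    drafts = []
    for feature in feed["features"]:
        props = feature["properties"]
        mag = props.get("mag")
        if mag is not None and mag >= min_magnitude:
            when = datetime.fromtimestamp(props["time"] / 1000, tz=timezone.utc)
            drafts.append(TEMPLATE.format(mag=mag, place=props["place"],
                                          time=when.strftime("%H:%M")))
    return drafts  # a human editor would still approve these before publishing

if __name__ == "__main__":
    for draft in draft_quake_articles():
        print(draft)
```

The human approval step mentioned above would sit between the returned drafts and the content management system, which is where the LA Times catches feed glitches.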

The intent, supposedly, is to eliminate the journalistic drudge work associated with these types of articles. The computer takes care of getting that first article out to the public very quickly, ‘under deadline.’ With the deadline met and the first article out, human journalists are then free to do the real, deep-digging work.

However, we will point out that ‘automated journalism’ is nothing new. Years ago, Google replaced the human editor, or at least the human news aggregator/content selector, with Google News. Some of our readers may remember the joke Google used to put at the bottom of its Google News page: ‘No humans were harmed in the preparation of this page.’ Google later replaced this with a sentence explaining that the news articles and their placement on the page were generated automatically by a computer algorithm, which is what it meant by ‘no humans were harmed.’ Some editors apparently felt that some humans had been harmed by Google News, which they saw as a major and very successful competitor to their front pages. (A few organizations sued; AFP was one of them, if memory serves. Google reportedly argued that it was sending AFP and other organizations a lot of traffic in exchange for snippets. If memory serves, there was eventually a settlement, which may have included Google agreeing not to use some content.)

You can almost detect a certain pride in the Slate piece. Major newspapers like the LA Times, not traditionally known for their technological prowess, had finally stepped up to the plate and borrowed a page from Google’s playbook, using computers to generate content.

Robo-journalism is nothing new

In some sense, of course, blog and newspaper articles generated at least partially by computer are also nothing new. For at least ten years there have been articles recommending optimizing content around advertiser or search keywords to drive traffic. The recommendations go as far as selecting content pieces entirely around the most-searched-for keywords, or the most expensive paid-search keywords. (Your blog will be very boring if you do this. The author remembers one such article from ten years back recommending writing a blog, or blog articles, entirely around private jets, because private jet manufacturers at the time were paying several dollars for search keywords. Good luck if you attempt this.)

There are computer algorithms that analyze Google and Bing databases and try to identify which keywords or topics to write your blog article around, based on the most popular or the most expensive topics. If, on some blogs, the content has been selected on these recommendations, are these blog and webzine articles not essentially computer generated? Some publishers have taken this a step further, where the content really is auto-generated from keyword databases, or scraped from Twitter (minus even the human quality-control check that the LA Times imposes, and without the disclaimer that it is an auto-generated article). It’s usually immediately obvious that this is poor-quality, computer-generated content, and Google has implemented algorithms to try to weed out these publishers. (These kinds of auto-generated pages are sometimes referred to as ‘advertiser arbitrage’, where the page links one advertising system with another, using an auto-generated page as a go-between. ‘Arbitrage’ is usually seen as a good word in financial markets, a necessary market player that links different financial exchanges to improve price discovery, greatly increase liquidity, and decrease the spreads that are essentially a tax on the user. But in the world of online advertising it is considered something of a dirty word.)
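To give a flavor of the keyword-driven topic selection described above, here is a toy sketch. The keyword list, search volumes, and cost-per-click figures are entirely made up, and the scoring formula is arbitrary; a real tool would pull these numbers from a search engine’s keyword-planner data rather than a hard-coded list.

```python
# Toy keyword-driven topic picker. The keyword data below is hypothetical;
# a real tool would query a keyword-planner service for volumes and CPCs.
keywords = [
    {"keyword": "private jets",       "monthly_searches": 40_000, "cpc_usd": 6.50},
    {"keyword": "indoor air quality", "monthly_searches": 25_000, "cpc_usd": 1.20},
    {"keyword": "fitness trackers",   "monthly_searches": 90_000, "cpc_usd": 0.80},
]

def topic_score(kw, cpc_weight=0.5):
    # Blend raw popularity with advertiser value; the weighting is arbitrary.
    return kw["monthly_searches"] * ((1 - cpc_weight) + cpc_weight * kw["cpc_usd"])

best = max(keywords, key=topic_score)
print(f"Write your next post about: {best['keyword']}")
```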

Robo-journalism and the Singularity

As computers become increasingly more powerful due to Moore’s law, we expect computer-generated content to become increasingly harder to distinguish from human-generated content as we approach the much-hyped Singularity described in Ray Kurzweil’s famous book and the related movies. (We previously touched on the Singularity in our coverage of Cisco’s IoT talk at CES.) Already, the Google News-style function is being used in many online newsrooms to select which human-generated content to give prominence, so even if the articles are not written by computer, computer algorithms are deciding which topics should be written about and which articles should be displayed. Is the next step, which the Los Angeles Times has now taken, of having the entire article also be written by computer, really that much of a novelty?

Speaking of Kurzweil and the Singularity, we should mention that Kurzweil, whose father was a musician, first achieved fame as a young man on a 1960s TV quiz show where he introduced ‘computer-generated’ music. (This was before he achieved commercial success with an early OCR machine for the blind, or his eponymous brand of commercial music synthesizers.) The idea, which has been copied and elaborated on many times since, is to randomly select different human-written musical phrases under rules that govern which pre-written phrase may transition into which other phrase. The author remembers a 1980s ‘Mozart Machine’ for home computers that did something similar; the music was vaguely Mozart-esque except long-winded and, due to the random selection of phrases, lacking the thematic discipline used by real composers. (In the same way, ‘Quakebot’, the Los Angeles Times’ robot journalist, lacks the ‘elegant variation’ in its writing described in the Slate article. A human musician would normally have the different sections of a piece be related in some way, so that each section is an elegant variation of another, rather than the sections being either verbatim repetitions or entirely unrelated, as these early computer robo-composers would produce.) Most of these attempts were limited to classical music because it was easy to synthesize on the 4-voice monaural synthesizer chips of the time, but the concept can nowadays be applied to any genre of music (although it may still sound bad). We will come back to the concept of computer-generated, or at least computer-analyzed, modern commercial pop music in a bit.
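The phrase-transition idea behind these early robo-composers can be sketched in a few lines of Python. The phrase names and allowed transitions below are invented for illustration, and, just as described above, nothing in the sketch enforces any overall thematic structure.

```python
# Phrase-transition sketch in the spirit of early robo-composers: pre-written
# musical phrases plus rules about which phrase may follow which. Phrase names
# and transitions are invented; nothing here enforces overall thematic form.
import random

TRANSITIONS = {
    "theme_A":     ["theme_A_var", "bridge"],
    "theme_A_var": ["bridge", "theme_A"],
    "bridge":      ["theme_B", "theme_A"],
    "theme_B":     ["theme_A_var", "cadence"],
    "cadence":     [],  # end of the piece
}

def compose(start="theme_A", max_phrases=16):
    piece, current = [start], start
    while TRANSITIONS[current] and len(piece) < max_phrases:
        current = random.choice(TRANSITIONS[current])  # purely random choice
        piece.append(current)
    return piece

print(" -> ".join(compose()))
```

Because each choice is purely local, two runs can produce pieces that repeat verbatim or wander aimlessly, which is exactly the missing ‘elegant variation’ the article describes.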

Computer-generated feature films?

Let’s talk about computer-generated movies. Epagogix is a company, reportedly under contract to at least one major US studio, that uses computer algorithms to analyze movie scripts and predict their success. Apparently, it analyzes the themes used in the movie script and how closely they match certain well-known formulas. (It’s not immediately clear to us whether it requires humans to read the script and encode the themes in a machine-readable format first.) According to an article by BBC Focus Magazine on Epagogix’s website, they predicted the $50 million 2007 Drew Barrymore movie Lucky You would flop. (It did flop, making only $5.7 million.) This analysis is independent of the movie’s marketing budget (normally a very important predictor of box office success), in that the intention is to help executives decide which movies to invest in based purely on the script.

Not mentioned in the BBC article, but Epagogix supposedly scored the movie poorly because it was not shot in exotic locations, although the casting of Drew Barrymore was a factor in the movie’s favor. Obviously, at least on this particular movie, Epagogix did much better than the human studio executives who green-lighted it; this isn’t a surprising result, as even expert human decision-making is known to be biased and can be supplemented by computer models with domain expertise. In the sense that movie scripts poorly scored by Epagogix are then rejected by movie studios, the criticism has been made that scripts nowadays are being ‘written’ by computer programs (in the sense that human writers will write and re-write their scripts until the computer is pleased).
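Epagogix’s actual model and inputs are proprietary, so the following is only a toy sketch of the general idea: hand-encoded script features combined into a rough score with a simple weighted sum. The feature names, weights, and numbers are all invented.

```python
# Toy script-scoring sketch in the spirit of Epagogix. Feature names, weights,
# and numbers are invented; the real system's model and inputs are proprietary.
FEATURE_WEIGHTS = {
    "exotic_locations": 12.0,  # reportedly counted against Lucky You
    "star_power":        8.0,  # e.g. the Drew Barrymore casting
    "formula_match":    15.0,  # how closely themes match well-known formulas
    "clear_antagonist":  5.0,
}

def score_script(features):
    """Hand-encoded script features in [0, 1] -> rough success score."""
    return sum(FEATURE_WEIGHTS[name] * value for name, value in features.items())

hypothetical_script = {"exotic_locations": 0.1, "star_power": 0.7,
                       "formula_match": 0.3, "clear_antagonist": 0.2}
max_score = sum(FEATURE_WEIGHTS.values())
print(f"Predicted score: {score_script(hypothetical_script):.1f} out of {max_score:.0f}")
```

Whether the theme-encoding step is done by humans or by software is exactly the open question raised above; either way, the studio only sees the score.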

Computer-generated pop music?

Let’s come back to the topic of computers writing hit songs. Is this possible? We already mentioned earlier experiments, dating back to the 1960s, with computer-generated music, although this music mostly lacked elegant variation or overall thematic sense. Of course, there are now music-industry counterparts to Epagogix, such as TheHitEquation, which can reportedly predict with 60% accuracy whether a given song will be a hit, looking purely at the contents of the music.

Some celebrities have criticized such software, saying it can never be as good as a human expert at finding hits because it cannot analyze important song elements such as lyrics. (And, of course, the all-important marketing budget is ignored by these algorithms, since the point is to help industry executives decide which songs to invest marketing budget in.) But the software doesn’t have to be as good as a human expert on every aspect of analyzing a song. It just has to be ‘pretty good’ at predicting hits using some subset of criteria that it is expert at evaluating. The reason is simple: the decision about what’s going to be the next hit isn’t typically made by a single executive; it’s made by a team of executives combining their individual expertise. If the algorithm is as good as any one of these team members (and being right 60% of the time is pretty good), then you can replace one highly paid executive on the team with the computer algorithm. The lyrics and other important song aspects that the computer doesn’t know anything about will be analyzed by the remaining team members. And according to a Harvard Business Review article, that’s exactly how these algorithms are implemented in practice: by combining three human experts plus the computer expert and voting on songs.
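A minimal sketch of that ‘three humans plus one algorithm’ panel might look like the following. The simple-majority voting rule and the stand-in hit model are our assumptions; the HBR article does not spell out the exact mechanics.

```python
# Sketch of the 'three human experts plus one algorithm' panel. The
# simple-majority voting rule and the stand-in hit model are assumptions.
def greenlight(song_features, human_votes, hit_model):
    votes = list(human_votes)
    votes.append(hit_model(song_features))  # the model votes like any other member
    return sum(votes) > len(votes) / 2      # simple majority wins

# Hypothetical ~60%-accurate model stub plus one round of human votes:
hit_model = lambda features: features.get("danceability", 0.0) > 0.6
print(greenlight({"danceability": 0.7},
                 human_votes=[True, False, True],
                 hit_model=hit_model))
```

The point of the design is that the humans cover the aspects the model cannot see (lyrics, marketing fit), while the model adds one unbiased vote on the audio itself.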

What if you combined the two technologies? Pair the 1960s random music composer (updated to include the latest popular music phrases) with something like TheHitEquation, and keep generating random music until the algorithm says you have a hit. Would it work? To be efficient, it would require some feedback from the evaluation model on how best to improve each song (otherwise you have the proverbial monkey writing Shakespeare by punching random typewriter keys, which will succeed but take a very long time). The algorithms would likely need considerable updating to understand thematic consistency as well, but in time something like this is possible. If you throw in the three human executives doing a further evaluation of the songs scored by the computer (and providing machine-readable feedback to the algorithm writing/editing the song), then it becomes very tractable. (Although who is the songwriter in this scenario? The computer algorithms, or the human music executives who provided the feedback to the algorithm? We’re ignoring the issue of lyrics here; expert lyric writers might still be needed for a while, or this approach could be limited to instrumental music.)
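A crude way to picture the generate-and-test loop is simple hill climbing: mutate a phrase sequence and keep the change only if the scoring model likes it better. The phrase list and the stand-in scoring function below are invented; a real system would plug in something like TheHitEquation’s trained model as the scorer.

```python
# Generate-and-test sketch: mutate a phrase sequence and keep the change only
# if the scoring model prefers it (simple hill climbing). The phrase list and
# the stand-in scorer are invented; a real system would plug in a trained
# hit-prediction model here.
import random

PHRASES = ["hook", "verse", "pre_chorus", "chorus", "bridge", "drop"]

def hit_score(sequence):
    # Stand-in for a model like TheHitEquation: reward choruses and hooks,
    # and penalize songs that never use some phrase types at all.
    variety_penalty = len(set(PHRASES) - set(sequence))
    return (sequence.count("chorus")
            + 0.5 * sequence.count("hook")
            - 0.2 * variety_penalty)

def compose_hit(length=12, iterations=1000):
    song = [random.choice(PHRASES) for _ in range(length)]
    for _ in range(iterations):
        candidate = list(song)
        candidate[random.randrange(length)] = random.choice(PHRASES)
        if hit_score(candidate) > hit_score(song):  # the feedback step
            song = candidate
    return song

print(compose_hit())
```

Without the feedback step this degenerates into the monkey-at-the-typewriter approach; with it, the search converges quickly, which is the whole argument above.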

Of course, some might argue that computer-generated pop music is nothing new. Song- and voice-manipulation technologies like Autotune (not to mention ‘technologies’ such as lip-synching) have become widespread, if controversial. Less controversial everyday music software like GarageBand can be installed for free on your iPhone, and it has plenty of professional analogues.

Big data analytics, governance, and democracy?

We started this article off by light-heartedly alluding to some darker governmental uses of big data. What about replacing politicians with computer algorithms? Assuming this were desirable, is it possible? To some extent, given the use of market-research-like techniques in politics (government by opinion poll or campaign-contributor profiling), it has already happened. (This is lamentable, as public opinion can be short-sighted and donors may be closely aligned with special interests.) Is it possible to go further than this using big data? Of course it is. There were articles a few years back about detecting mood on Twitter and using it to predict the stock market a few days in advance (which worked, at least for a short period of time back when the articles were written). Stock markets influence politicians; this system could be taken a step further by having politicians actually attempt to respond to Twitter moods, thereby creating a fast feedback loop (which may, by virtue of stock market traders influencing the market via algorithms that analyze Twitter, and the stock market in turn influencing governments, already exist to some extent, albeit indirectly). This ‘governance by Twitter’, or governance by big data analytics, is an interesting topic, but one we’ll leave for a possible future blog post. [Update: we did multiple articles on governance by computer (click here), as well as using AI for crisis mitigation (click here).]
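For flavor, here is a toy version of the first link in that chain: checking whether a daily ‘Twitter mood’ series correlates with market returns a few days later. Both series below are invented and the sketch only shows the shape of the analysis; the real studies used far more careful sentiment extraction and much more data.

```python
# Toy lagged-correlation check between a daily 'Twitter mood' series and
# market returns. Both series are invented; real studies used far more
# careful sentiment extraction. Requires Python 3.10+ for statistics.correlation.
import statistics

mood    = [0.2, 0.5, -0.1, 0.4, 0.0, 0.3, -0.2, 0.1]    # hypothetical daily sentiment
returns = [0.1, 0.3,  0.6, -0.2, 0.5, 0.1,  0.4, -0.3]  # hypothetical daily % returns

def lagged_correlation(signal, target, lag):
    """Correlate today's signal with the target value `lag` days later."""
    return statistics.correlation(signal[:-lag], target[lag:])

for lag in (1, 2, 3):
    print(f"lag {lag} day(s): r = {lagged_correlation(mood, returns, lag):+.2f}")
```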

Photo credit: TOPIO/Humanrobo/WikiMedia/Creative Commons 3 Attribution Share Alike

Next steps: Check out our YouTube channel for more great info, including our popular "Data Science Careers, or how to make 6-figures on Wall Street" video (click here)!