Re: Moving Towards Algorithmic Corroboration

This is my response to Nick Diakopoulos’ blog post, “Moving Towards Algorithmic Corroboration”. I really like the set of problems Nick identifies concerning the difficulty of processing information on the Internet. Computationally processing semantic information in order to organize it more holistically and filter it more intelligently is a process that has yet to be worked out, and there are many problems that need solving along the way. At least, on the level that Nick is talking about.


If we find two (or more) independent sources that reinforce each other, and that are credible, we gain confidence in the truth-value of a claim. Independence is key, since political, monetary, legal, or other connections can taint or at least place contingencies on the value of corroborated information.

The juxtaposition of two (or more) pieces of semantic information is a key part of this endeavor. Borrowing some ideas from Dr. Luciano Floridi, I think that in comparing two sources of information we would be looking at the operational and derivative classifications of information. Why? Because the operational information that we, as information consumers, either have or are able to get would include, for example, the understanding that Fox News presents semantic information composed of conservative political positions. The slippery-slope consequences of that understanding are tied to many emotionally based presumptions, which can skew the balance between what we take to be factual, truthful, nonfactual, or untruthful. This process of obtaining and owning operational information about the semantic information that we consume is reinforced by derivative information, including other people’s opinions about said operational information. Trust and distrust affect our derivative information.

In Nick’s example of “taint”, he is talking about the connections that we make with the operational and derivative information that we are predisposed to and/or expose ourselves to as we take in new semantic or environmental information.


How can we scale this idea to the web by teaching computers to effectively corroborate information claims online? An automated system could allow any page online to be quickly checked for misinformation. Violations could be flagged and highlighted, either for lack of corroboration or for a multi-faceted corroboration (i.e. a controversy).

I’m a huge skeptic of the term “misinformation” being used in the context of extrapolating relatively useful information from a presumed source of information. That is because the end result of a misinformed user is a disinformed user. The goal, a highly presumption-based one given our predispositions to operational and derivative information, is to become informed. The “flags” that we independently throw depend on our preexisting networked information, aka knowledge. With the highly anticipated goal of learning, we expect the information that we consume to be truthful. Dr. Floridi describes “truthfulness” as part of the nature of, and a requirement for, being informative, which is his fundamental reason why “disinformation” is not information. As an aside, even if the information producer is presenting accidental misinformation or purposeful disinformation, the information consumer is, more likely than not, processing said mis/disinformation as information, and is thus disinformed.


First of all, we need to define and extract the units that are to be corroborated. Computers need to be able to differentiate a factually stated claim from a speculative or hypothetical one, since only factual claims can really be meaningfully corroborated.

It’s more complicated than that. I’m only guessing, but I suspect there are at least four levels of presumed information: factual, truthful, untruthful, and nonfactual. And I’m going to go out on a limb and presume that people are, in general, fairly competent at distinguishing those four categories if pressed, even if they get it wrong because of a predisposition to primary, meta, operational, or derivative information.

From that position, systematically corroborating two or more pieces of semantic information is completely dependent on the user’s existing position on any given piece of primary information. It’s my opinion that, in order either to filter information and reconstruct it into something more efficient or effective, or to apply metrics that further inform an information consumer about the content they process, you have to have some prior knowledge of where that consumer is coming from before they are introduced to a new piece of semantic information. You can’t present a “truthful” argument to someone who is predisposed (by operational and derivative information) to process it as untruthful or nonfactual. I mean, you can, but will that be meaningful?


Then, the simplest aggregation strategy might consider the frequency of a statement as a proxy for its truth-value (the more sources that agree with statement X, the more we should believe it) but this doesn’t take into account the credibility of the source or their other relationships, which also need to be enumerated and factored in. We might want algorithms to consider other dimensions such as the relevance and expertise of the source to the claim, the source’s originality (or lack thereof), the prominence of the claim in the source, and the source’s spatial or temporal proximity to the information.

Whew. Lots of challenges deserving of research, indeed!

Frequency alone can’t automatically produce information, because a claim that is repeated often can still be disinformative. Source predictability is also, at most, reducible to probability. And that’s the key, because that’s how we process the meaning of information: minimizing and maximizing probability.
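To make the gap between raw frequency and credibility concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Report` type, the source names, and the hand-assigned credibility weights are hypothetical, and the weighting rule is only one simple way to fold credibility in. It does not model independence between sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    source: str         # hypothetical source name
    supports: bool      # True if the source agrees with the claim
    credibility: float  # assumed weight in [0, 1], assigned by hand here


def frequency_score(reports):
    """Naive proxy: the fraction of sources that agree with the claim."""
    if not reports:
        return 0.0
    return sum(1 for r in reports if r.supports) / len(reports)


def weighted_score(reports):
    """Share of total credibility held by the sources that agree."""
    total = sum(r.credibility for r in reports)
    if total == 0:
        return 0.0
    return sum(r.credibility for r in reports if r.supports) / total


# Three low-credibility sources repeat a claim; one stronger source disputes it.
reports = [
    Report("blog_a", True, 0.2),
    Report("blog_b", True, 0.2),
    Report("blog_c", True, 0.2),
    Report("wire_service", False, 0.9),
]

naive = frequency_score(reports)
weighted = weighted_score(reports)
print(naive, weighted)
```

With these made-up numbers, frequency alone says three out of four sources agree (0.75), while the credibility-weighted share drops to roughly 0.4. Either number is still only a probability, not a verdict.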

Looking at the source (the information producer) and applying a dynamic metric that’s based on operational and derivative information has to be one small part of the system. The metrics have to be dynamic because the producer is not a static, closed-loop system. And the metrics shouldn’t be based on a source treated as a single node of information; they should be based on network theory, on the complex of nodes that makes up the predefined scope of the system (the information producer). Why? Because a news network isn’t a singular entity. It’s a composite of many information aggregators and producers. And if you extend the scope just a little bit to include commentators, which is quite common in social media on the Internet, your calculations have to include a metric-based subsystem for classifying the information that they produce, too. For example, as an information consumer, reading an article on Slashdot is just a warm-up compared to reading the comments.

Additionally, even a singular author is not a closed-loop system. Whatever calculates a rating (or what have you) has to depend on a systematic, adjustable (ideally in real time) process for generating metrics, or for changing the construction of computationally re-processed semantic information.
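A rough sketch of what an adjustable, composite metric could look like follows. It is an illustration under stated assumptions, not a proposal for the real thing: the exponential moving average, the `rate` parameter, the contributor names, and the volume weighting are all hypothetical choices, and a weighted average is a much cruder tool than network theory proper.

```python
def update_reliability(current, outcome, rate=0.1):
    """Nudge an estimate toward the latest outcome.

    outcome is 1.0 if a claim held up and 0.0 if it did not; rate is an
    assumed tuning knob controlling how quickly old history is forgotten.
    """
    return (1 - rate) * current + rate * outcome


def producer_score(contributors):
    """Volume-weighted mean reliability across one producer's contributors.

    contributors maps a name to a (reliability, volume) pair.
    """
    total_volume = sum(volume for _, volume in contributors.values())
    if total_volume == 0:
        return 0.0
    weighted = sum(rel * volume for rel, volume in contributors.values())
    return weighted / total_volume


# A producer treated as a composite: one staff writer plus its commentators.
contributors = {
    "staff_reporter": (0.8, 10),
    "commentators": (0.4, 30),
}
before = producer_score(contributors)

# A commentator claim holds up, so that part of the composite is adjusted.
reliability, volume = contributors["commentators"]
contributors["commentators"] = (
    update_reliability(reliability, 1.0, rate=0.5),
    volume,
)
after = producer_score(contributors)
print(before, after)
```

Because the score is recomputed from its parts, a change in any one contributor moves the producer's overall rating, which is the point of not treating the producer as a single static node.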


Any automated corroboration method would rely on a corpus of information that acts as the basis for corroboration.

Yes. And data, whenever possible. And yes, linking data to information is a whole other problem set.


It’s important not to forget that there are limits to corroboration too, both practical and philosophical. Hypothetical statements, opinions and matters of taste, or statements resting on complex assumptions may not benefit at all from a corroborative search for truth.

I do presume that all information rests on complex assumptions, especially philosophical and, worse, emotional ones. There may be a fine line between a Wikipedia article and a footnote to an artist’s painting, but whatever system is developed is going to process them differently. Why couldn’t AI respond in the same way yet be tuned for generating useful feedback? Not everything corroborated has to turn out to be fact or truth. The system could simply highlight the probability of a claim being nonfactual or untruthful and it would still be beneficial to an information consumer, especially if it’s a transparent process so that the information consumer can see how the metric was generated. Which is a perfect segue.

A transparent and open-source corroboration system will be more useful to an information consumer than a closed-source system, especially if s/he can add or remove their own sources of information. Why? The system could easily be wrong, especially when it leans heavily on a fallacy that a person could readily identify. Feedback loops will only strengthen an information system when they are implemented carefully. Design is obviously critical even if it’s a closed system.
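As a small illustration of the transparency idea, the sketch below returns a per-source breakdown alongside the score and lets the consumer drop a source and recompute. The source names, the weights, and the `explain` and `without` helpers are all hypothetical; this is one possible shape for such a system, not a description of any existing one.

```python
def explain(sources):
    """Return (score, breakdown) for a claim.

    sources maps a name to a (weight, agrees) pair. The breakdown records
    how much each source contributed, so the score can be inspected
    instead of taken on faith.
    """
    total = sum(weight for weight, _ in sources.values())
    if total == 0:
        return 0.0, {}
    breakdown = {
        name: (weight / total if agrees else 0.0)
        for name, (weight, agrees) in sources.items()
    }
    return sum(breakdown.values()), breakdown


def without(sources, name):
    """Copy of sources with one entry removed at the consumer's request."""
    return {key: value for key, value in sources.items() if key != name}


sources = {
    "encyclopedia": (0.5, True),
    "forum_post": (0.25, True),
    "press_release": (0.25, False),
}

score, breakdown = explain(sources)
print(score, breakdown)

# The consumer decides the press release should not count and recomputes.
trimmed_score, trimmed_breakdown = explain(without(sources, "press_release"))
print(trimmed_score, trimmed_breakdown)
```

Seeing the breakdown is what lets a person spot a score that leans on a source they have reason to distrust, and removing that source is the simplest kind of feedback loop.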

Even if hypotheticals, opinions, or subjective statements are to be processed by Nick’s notion of a computationally based corroboration system, information can still be categorized. Even something as simple as noting whether a thing is described in a statistically negative or positive manner is going to help piece together a probability of informativeness.


We’ll still need smart people around, but, I would argue, finding effective ways to automate corroboration would be a huge advance and a boon in the fight against a misinformed public.


