Making our way to cognitive inspired architectures

“What is interesting about cognitive architectures?”

Truthfully, until recently I wasn’t convinced the brain was a useful object to model algorithms after. The brain’s neural network is a black box of highly complex and constantly modulating connections. Not to mention that the ways in which we can study it are tightly constrained by ethics. As far as problems go, it seemed like replacing one difficult problem (creating AGI) with another difficult problem.

“What has changed?”

Interesting models based on the brain, deeper analysis of brain functions, and emerging trends.

Also there are two other points of interest; one is short and one is long.

“What are they?”

Well, the first one is that the brain -works-. It does basically what we want AGI to do, just on a smaller scale. Regardless of whether there is a spirit residing within us, starting from a working structure will greatly increase the likelihood of success. The closer we can copy it, the more we increase the chances.

“And the second point of interest?”

‘Neural networks’ are ‘not good’ at ‘symbolic manipulation.’ Judicious air quotes. Human brains seem pretty good at it. This is a lot to unpack. Where should we start?

“Well first off, what does symbolic manipulation mean?”

People’s ability to combine ideas and produce novelty that is directed and useful. For example, let’s take the idea of a “tree.” We have many different directions we can take from this ‘anchor.’ Given our needs, we can consider properties of trees, like color, height, smell, feeling, geographic location. We can relate trees to paper, books, wooden houses. Or there is CO2 absorption and the tree’s role in regulating the atmosphere. Leaves changing color, symbols of states and countries, metaphors, the list goes on and on.

For reasoning in computers we want the ability to actively group information together in a chunk, or assign it to an object, observe relations to other objects, and arrive at conclusions. Central to this ability is being able to change the relations or object information trivially. How conclusions were reached must be interpretable and explainable.
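To make that wish list a bit more concrete, here is a minimal sketch of the “tree” anchor as a tiny store of (subject, relation, object) triples. Everything in it is an assumption made for illustration: the `ConceptGraph` class, the relation names, and the facts themselves. It only shows the three properties asked for above: relations are trivial to add or remove, related objects can be looked up, and a conclusion comes with the chain of facts that produced it.

```python
from collections import defaultdict


class ConceptGraph:
    """Tiny store of (subject, relation, object) triples. Purely illustrative."""

    def __init__(self):
        self._edges = defaultdict(set)  # subject -> {(relation, object), ...}

    def add(self, subject, relation, obj):
        self._edges[subject].add((relation, obj))

    def remove(self, subject, relation, obj):
        self._edges[subject].discard((relation, obj))

    def related(self, subject, relation):
        """Objects directly linked to `subject` by `relation` (sorted for stable output)."""
        return sorted(o for r, o in self._edges[subject] if r == relation)

    def explain(self, start, goal, path=()):
        """Depth-first search: return the chain of triples linking start to goal, or None."""
        for relation, obj in sorted(self._edges[start]):
            step = path + ((start, relation, obj),)
            if obj == goal:
                return list(step)
            if all(obj != subject for subject, _, _ in step):  # do not revisit a subject
                found = self.explain(obj, goal, step)
                if found:
                    return found
        return None


# Hypothetical facts hanging off the "tree" anchor.
graph = ConceptGraph()
graph.add("tree", "has_property", "height")
graph.add("tree", "absorbs", "CO2")
graph.add("tree", "made_into", "paper")
graph.add("paper", "made_into", "book")

# The conclusion "a tree can end up as a book" comes with its reasoning chain.
print(graph.explain("tree", "book"))

# Changing a relation is a one-liner, and the conclusion changes with it.
graph.remove("paper", "made_into", "book")
print(graph.explain("tree", "book"))
```

The first `print` shows the two-step chain through “paper”, and after the removal the second one prints `None`. A real system would need far more than this, but it is the kind of editable, inspectable structure the paragraph above is asking for.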

In the next figure, from “Theoretical Impediments to Machine Learning With Seven Sparks from the Causal Revolution”, we can see examples of causal reasoning.

Explains the different levels of causal reasoning

The author states that today’s machine learning solutions only reach the first level in the hierarchy. That doesn’t seem quite right, but if we take it at face value then there is an obvious question.
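To get a feel for what separates the levels, here is a minimal sketch of the three rungs of that hierarchy (association, intervention, counterfactual) on a toy structural causal model. The model is entirely made up for illustration: a hidden factor `z` influences both the action `x` and the outcome `y`, and the probabilities and function names are assumptions, not anything taken from the paper.

```python
from itertools import product

# Hypothetical toy model: Z confounds X and Y. U_Z and U_X are the unobserved noise terms.
P_UZ = {0: 0.5, 1: 0.5}
P_UX = {0: 0.75, 1: 0.25}


def model(u_z, u_x, do_x=None):
    """Structural equations. Passing do_x overrides the equation for X (an intervention)."""
    z = u_z
    x = (z if u_x == 0 else 1 - z) if do_x is None else do_x
    y = x + 2 * z
    return z, x, y


def worlds():
    """Enumerate every setting of the noise terms together with its probability."""
    for (u_z, p_z), (u_x, p_x) in product(P_UZ.items(), P_UX.items()):
        yield u_z, u_x, p_z * p_x


def expected_y_given_x(x_obs):
    """Level 1, association: E[Y | X = x_obs], i.e. what we expect after *seeing* X."""
    num = den = 0.0
    for u_z, u_x, p in worlds():
        _, x, y = model(u_z, u_x)
        if x == x_obs:
            num += p * y
            den += p
    return num / den


def expected_y_do_x(x_set):
    """Level 2, intervention: E[Y | do(X = x_set)], i.e. what happens if we *set* X."""
    return sum(p * model(u_z, u_x, do_x=x_set)[2] for u_z, u_x, p in worlds())


def counterfactual_y(x_obs, y_obs, x_alt):
    """Level 3, counterfactual: expected Y had X been x_alt, given what was observed.

    Assumes the observation (x_obs, y_obs) is possible under the model.
    """
    num = den = 0.0
    for u_z, u_x, p in worlds():
        _, x, y = model(u_z, u_x)
        if (x, y) == (x_obs, y_obs):                   # abduction: keep consistent worlds
            num += p * model(u_z, u_x, do_x=x_alt)[2]  # action, then prediction
            den += p
    return num / den


print(expected_y_given_x(1))      # seeing X = 1
print(expected_y_do_x(1))         # forcing X = 1
print(counterfactual_y(0, 2, 1))  # observed X = 0, Y = 2; what if X had been 1?
```

In this toy world, seeing X = 1 gives an expected Y of 2.5, forcing X = 1 gives 2.0, and the counterfactual query gives 3.0. All three numbers are about the same X and Y, which is the point: a system that only fits the observed data answers the first question and says nothing about the other two.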

“ML seems to be getting quite good at #1, why haven’t we built #3 yet?”

Obvious questions are good questions! Let’s explore and see what might be required to achieve #3, and importantly, what problems does it solve, and what data can be used to train it?

“Okay, how are others approaching the ‘reasoning adventure’?”

As usual, here is a non-exhaustive list.

  • Visual question answering (CLEVR, FVQA, TallyQA)
  • Mathematical proof proving (E, VAMPIRE, Prover9)
  • Named entity recognition (CoNLL-2003)
  • Noun reasoning (Winograd Schema Challenge, COPA)
  • Question answering (SQuAD, MCTest, CNN/Daily Mail, CBT)
  • Fuzzy Cognitive Maps (FCM)
  • Robotics & Reasoning

Visual Question Answering:

Example image and question answer data
Stanford’s CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

Computer vision is not ‘solved’, and CLEVR addresses an important link between object classification and general interpretability.

Since CLEVR was released, teams have built architectures that virtually solve the dataset.

“Can we use those architectures to achieve imagining or introspection?”

It’s a great question. The CLEVR dataset itself didn’t have any examples from either the 2nd or 3rd tier of reasoning listed above. By chance, Stanford created a follow-up dataset which happens to have questions of that type, called CLEVR-Humans.

In CLEVR-Humans they included some inference questions about hypothetical futures:

Image showing tower of blocks with hypothetical question

“Will the tower fall if the top block is removed?”

What do you think? I think the answer would be yes, because after squinting for a little bit it seems like only the top block’s weight is balanced on the block below it. The other blocks, sans the bottom block, are unbalanced and would fall without the weight from above keeping them level.

Regardless of whether the answer we come up with, or the reason we give, is right or wrong, we’ve used knowledge derived from sources outside of the situation. The team that achieved an impressive 99.8% accuracy on the base CLEVR set only scored 67% on the CLEVR-Humans set. Although they didn’t release the statistics for the Humans set, it is likely that these sorts of hypothetical questions were poorly answered.

“What kind of architecture did they use to achieve their impressive results on the base CLEVR dataset?”

The answer to this is quite long, so let’s table this question for now in the interest of getting a broad look at reasoning approaches.

“What is the difference between VQA and FVQA?”

FVQA takes the supporting knowledge the network needs out of the image and stores it separately, in a database of facts. The network must then look at an image and associate it with the correct fact. In the FVQA database there are 190,000 facts. A large comparison can be found here by the group that released the FVQA dataset.

“What is TallyQA?”

From their website:

“Most counting questions in Visual Question Answering (VQA) datasets (e.g., VQA 1.0, VQA 2.0, and TDIUC) are simple and can be easily answered using an object detector. Complex counting questions involve understanding relationships between objects along with their attributes and require more reasoning. Thus, performance of counting models cannot be estimated on complex counting questions using these datasets. To address this, we created the TallyQA dataset that has both simple and complex questions. Simple counting questions are those which require only object detection whereas complex counting questions demand more, as shown by the example image below.”
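To make the simple/complex distinction in that quote concrete, here is a minimal sketch that assumes an object detector has already run. The `Detection` record, the attribute names, and the “boxes overlap” stand-in for a relationship are all hypothetical; TallyQA itself doesn’t prescribe any of this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str
    color: str
    box: tuple  # (x_min, y_min, x_max, y_max) in hypothetical pixel coordinates


def count_simple(detections, label):
    """Simple question, e.g. "How many dogs are there?": detection alone is enough."""
    return sum(d.label == label for d in detections)


def overlaps(a, b):
    """Crude stand-in for a spatial relationship: do the two boxes intersect?"""
    ax0, ay0, ax1, ay1 = a.box
    bx0, by0, bx1, by1 = b.box
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


def count_complex(detections, label, color, touching_label):
    """Complex question, e.g. "How many brown dogs are on the couch?":
    needs an attribute (color) and a relationship to another object."""
    anchors = [d for d in detections if d.label == touching_label]
    return sum(
        d.label == label and d.color == color and any(overlaps(d, a) for a in anchors)
        for d in detections
    )


# Hypothetical detector output for one image.
scene = [
    Detection("dog", "brown", (0, 0, 10, 10)),
    Detection("dog", "white", (50, 0, 60, 10)),
    Detection("dog", "brown", (100, 0, 110, 10)),
    Detection("couch", "grey", (5, 5, 40, 30)),
]

print(count_simple(scene, "dog"))                     # three dogs in total
print(count_complex(scene, "dog", "brown", "couch"))  # only one brown dog touches the couch
```

The hard part in practice is of course everything this sketch hand-waves: getting attributes and relationships out of pixels, and parsing the question into that filter in the first place.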

“Is there a big architecture difference between this and other VQA designs?”

To be explored.


Named Entity Recognition:

Here is an example of NER:

image of the dandelion output
Dandelion entity extraction demo

NER is in a particular sweet spot of being universally useful, commercially viable, and accessible to a host of computationally tractable methods. It’s also a problem that could be approached from a neural-symbolic ‘reasoning’ based viewpoint. Grammar rules, statistical models, and neural networks have all been applied to this problem set. While a neural-symbolic architecture has been proposed, it dates from before the golden era of deep learning, from a time when resource scarcity forced people to seek efficiency in the form of knowledge representation.

“What is knowledge representation?”


“Is a neural-symbolic architecture necessary?”

Well let’s have a look at the current state of the art for this dataset.

Shows NER accuracy across different architectures
From: “A Survey on Recent Advances in Named Entity Recognition from Deep Learning models”

Given that on the challenging CoNLL-2003 dataset, which is extracted from the Reuters News Corpus, the current top performing methods only get 91%, there is still real headroom: the remaining error rate would have to shrink by a couple of orders of magnitude before the task could be called solved. While not necessary, named entity recognition is definitely an attractive test bed for neuro-symbolic based architectures. It’s hard to say anything is ‘necessary’, since so many ways can work. Perhaps ‘interesting’ is better.

Noun Reasoning (Winograd Schema Challenge)

Another useful reasoning benchmark, one that native speakers have no problem solving, is resolving pronoun ambiguity. Take this example from Wikipedia:

The city councilmen refused the demonstrators a permit because they [feared/advocated] violence.

The task is to identify what the pronoun ‘they’ is referring to. Alternating between ‘feared’ and ‘advocated’ changes the answer. For people this is a trivial task, but the current state of the art is 61.5%, from a Google Brain team.

“Is this really reasoning?”

Right, the Winograd Schema Challenge was designed to be a test of common-sense reasoning ability. However, the schemas must be written carefully so that they cannot be solved by ‘selectional restrictions’ alone.

“What is a selectional restriction?”

Some verbs only go with some categories of nouns. From Wikipedia: “Sam drank a car.” Car isn’t in the category of things that can be drunk, and a system wouldn’t need to ‘reason’ about it any further.
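A minimal sketch of how such a check could work, using a tiny hand-written lexicon. The word lists, the category names, and both function names are invented for illustration; a real system would learn or look up this information rather than hard-code it.

```python
# Hypothetical, hand-written lexicon for illustration only.
NOUN_CATEGORY = {"water": "liquid", "tea": "liquid", "car": "vehicle", "sandwich": "food"}
VERB_ALLOWS = {"drank": {"liquid"}, "drove": {"vehicle"}, "ate": {"food"}}


def violates_selectional_restriction(verb, obj):
    """True if `obj` is in a category that `verb` does not accept as its object."""
    allowed = VERB_ALLOWS.get(verb)
    category = NOUN_CATEGORY.get(obj)
    if allowed is None or category is None:
        return False  # unknown word: make no judgment either way
    return category not in allowed


def plausible_objects(verb, candidates):
    """Keep only the candidates that the verb can take, by table lookup alone."""
    return [c for c in candidates if not violates_selectional_restriction(verb, c)]


print(violates_selectional_restriction("drank", "car"))  # "Sam drank a car."
print(plausible_objects("drank", ["car", "tea"]))
```

If a pronoun’s candidate referents can be filtered down to one with a lookup like this, the question never tests common sense, which is why the schemas have to be written so that every candidate passes.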

“Isn’t comparing things this way still reasoning?”

What is reasoning? Where is the division between reasoning and statistics? Haha. To be explored.

“What kind of architecture did the top performer use?”

Let’s save architectures for the next post and do a big comparison.


Mathematical proof proving

“What in the world is automatic proof proving?”

Automated theorem proving (ATP) is a field of study focusing on the generation or verification of proofs. Wikipedia says it’s a very difficult field, but it has some commercial uses. Notably, ATPs are used by Intel and AMD in the design and verification of floating point circuitry.

“What is the connection to reasoning?”

ATPs have achieved decent results with first order logic, which is an essential component of reasoning.

“Is there a relationship to machine learning?”

According to “Reinforcement Learning of Theorem Proving” current ATP research is focused on developing more hand engineered designs. The authors of that paper showed that RL + Monte Carlo Tree Search had somewhat decent results.

“Is there a significant difference between the ability to reason logically in English and the ability to make mathematical conjecture?”


“Is there anything that immediately jumps out as being interesting to pursue?”

I haven’t found anything that grabs my interest immediately. The kinds of algorithms used would be good to look at.


Textual Question Answering

This is a massive topic; it’s difficult to even summarize it. The wheels just keep spinning without targeted questions.

“What are the currently used datasets?”

“What kinds of questions do they have?”

SQuAD 2.0:

Southern California, often abbreviated SoCal, is a geographic and cultural region that generally comprises California’s southernmost 10 counties. The region is traditionally described as “eight counties”, based on demographics and economic ties: Imperial, Los Angeles, Orange, Riverside, San Bernardino, San Diego, Santa Barbara, and Ventura. The more extensive 10-county definition, including Kern and San Luis Obispo counties, is also used based on historical political divisions. Southern California is a major economic center for the state of California and the United States.

What is Southern California often abbreviated as?

Ground Truth Answers: SoCal, SoCal, SoCal

What is a major importance of Southern California in relation to California and the United States?

Ground Truth Answers: economic center, major economic center, economic center
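Since each question above comes with several ground truth answers, it helps to see how a predicted span is typically scored against them. Below is a minimal sketch of a token-overlap F1 in the style commonly used for SQuAD-like datasets; it is a simplified approximation written for this post, not the official evaluation script, and the function names are my own.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, and split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction, truth):
    """Harmonic mean of token precision and recall between one prediction and one answer."""
    pred, gold = normalize(prediction), normalize(truth)
    if not pred or not gold:
        # Both empty counts as a match (useful for unanswerable questions), otherwise a miss.
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction, truths):
    """Score against every ground truth answer and keep the best match."""
    return max(token_f1(prediction, t) for t in truths)


truths = ["economic center", "major economic center", "economic center"]
print(best_f1("major economic center", truths))
print(best_f1("a major economic center for the state", truths))
```

An exact match with any one of the annotators scores 1.0, while the longer second prediction is penalized on precision and comes out around 0.75. That partial credit is worth keeping in mind when reading F1 numbers: a score well below 100% can still mean the model is pointing at roughly the right place.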


QuAC:

“QuAC shares many principles with SQuAD 2.0 such as span based evaluation and unanswerable questions (including website design principles! Big thanks for sharing the code!) but incorporates a new dialog component.”

There were criticisms about the nature of the data in SQuAD 1.0, which led both to version 2.0 and to QuAC, and pushed people to look for other datasets such as TriviaQA.


CoQA:

“CoQA contains 127,000+ questions with answers collected from 8000+ conversations. Each conversation is collected by pairing two crowdworkers to chat about a passage in the form of questions and answers.”


TriviaQA:

“TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions.”

MS MARCO:

“In MS MARCO, all questions are sampled from real anonymized user queries. The context passages, from which answers in the dataset are derived, are extracted from real web documents using the most advanced version of the Bing search engine. The answers to the queries are human generated if they could summarize the answer.”


NewsQA:

“Leveraging CNN articles from the DeepMind Q&A Dataset, we prepared a crowd-sourced machine reading comprehension dataset of 120K Q&A pairs.”


WikiQA:

“The WikiQA corpus is a new publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering. In order to reflect the true information need of general users, we used Bing query logs as the question source. Each question is linked to a Wikipedia page that potentially has the answer. Because the summary section of a Wikipedia page provides the basic and usually most important information about the topic, we used sentences in this section as the candidate answers. With the help of crowdsourcing, we included 3,047 questions and 29,258 sentences in the dataset, where 1,473 sentences were labeled as answer sentences to their corresponding questions.”

This one seems to be the smallest of the available datasets.


TREC:

“The TREC conference series has produced dozens of test collections. Each of these collections consists of a set of documents, a set of topics (questions), and a corresponding set of relevance judgments (right answers).”


HotpotQA:

“HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. It is collected by a team of NLP researchers at Carnegie Mellon University, Stanford University, and Université de Montréal.”


Who-did-What (WDW):

“We have constructed a new “Who-did-What” dataset of over 200,000 fill-in-the-gap (cloze) multiple choice reading comprehension problems constructed from the LDC English Gigaword newswire corpus.”

“Are we currently crushing these datasets or is there still a ways to go?”

Top performers per dataset, F1 score:

  • SQuAD – 77%
  • QuAC – 64%
  • CoQA – 75%
  • TriviaQA – 56%
  • HotpotQA – 59%
  • NewsQA – 66%
  • MS MARCO – 85%
  • TREC has various different datasets
  • WDW – 60%

“Where is the threshold where QA becomes usable in our daily lives?”

“If a Question Answering system was perfect, would we even need any other ‘design’?”


Fuzzy Cognitive Mapping (FCM)

Rod Taber’s FCM depicting eleven factors of the American drug market

Fuzzy cognitive mapping is a way to learn and represent causal factors. The ‘cognitive map’ looks like a natural way to store relationship information, and graphs are hot these days in ML. The ‘fuzzy’ comes from fuzzy logic.

“What is fuzzy logic?”

Instead of boolean logic, which is limited to True and False (1 and 0), fuzzy logic can take partial values between 0 and 1.
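As a minimal sketch, here are the classic min/max/complement versions of AND, OR, and NOT applied to truth values between 0 and 1. Other operator families exist; the function names and the “warm”/“humid” example values are just assumptions for illustration.

```python
def fuzzy_and(a, b):
    """Degree to which both statements hold: the weaker of the two."""
    return min(a, b)


def fuzzy_or(a, b):
    """Degree to which at least one statement holds: the stronger of the two."""
    return max(a, b)


def fuzzy_not(a):
    """Degree to which the statement does not hold."""
    return 1.0 - a


# Hypothetical truth degrees: "the room is warm" and "the room is humid".
warm, humid = 0.7, 0.4

print(fuzzy_and(warm, humid))  # warm AND humid
print(fuzzy_or(warm, humid))   # warm OR humid
print(fuzzy_not(warm))         # NOT warm

# With inputs restricted to 0 and 1 the operators reduce to ordinary boolean logic.
print(fuzzy_and(1, 0), fuzzy_or(1, 0), fuzzy_not(1))
```

So “warm and humid” holds to degree 0.4, “warm or humid” to degree 0.7, and “not warm” to roughly 0.3.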

“Have there been any successes with this?”

FCMs are widely lauded as a simulation technique and used in product planning, medical applications, robotics, etc.
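To show what “simulation” means here, below is a minimal sketch of one common FCM update rule: each concept’s next activation is a sigmoid of the weighted sum of the concepts that influence it, repeated until the values stop changing. The concept names and weights are invented for illustration and are not taken from Taber’s map, and real FCM variants differ in details such as whether a concept’s own previous value is added in.

```python
import math


def sigmoid(x, steepness=1.0):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-steepness * x))


def fcm_step(state, weights):
    """One synchronous update. weights[src][dst] in [-1, 1] is the influence of src on dst."""
    concepts = list(state)
    return {
        dst: sigmoid(sum(state[src] * weights.get(src, {}).get(dst, 0.0) for src in concepts))
        for dst in concepts
    }


def fcm_run(state, weights, max_steps=100, tol=1e-6):
    """Iterate until no activation changes by more than tol, or max_steps is reached."""
    for _ in range(max_steps):
        nxt = fcm_step(state, weights)
        if max(abs(nxt[c] - state[c]) for c in state) < tol:
            return nxt
        state = nxt
    return state


# Hypothetical three-concept map with a feedback loop.
weights = {
    "street_crime": {"police_presence": 0.6},
    "police_presence": {"street_crime": -0.8},
    "unemployment": {"street_crime": 0.5},
}
start = {"street_crime": 0.9, "police_presence": 0.1, "unemployment": 0.7}

for concept, value in fcm_run(start, weights).items():
    print(f"{concept}: {value:.3f}")
```

A “what if” scenario is then just a different starting state or a changed weight, run again. Note that with this particular rule a concept with no incoming edges drifts to 0.5 regardless of where it started, which is one reason practical variants often clamp input concepts or add a memory term.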

A search on arXiv reveals very few papers, but one is fresh out of the oven from last month. It helpfully details issues with FCMs, invents a novel network structure to incorporate expert knowledge, and achieves good results vs other standard techniques.

“What issues did they outline for FCMs?”

  1. Causation fallacy of learning algorithms
  2. Restrictions imposed by causal weights
  3. Implications of unique attractors in simulation scenarios

“Why aren’t FCMs more interesting to me?”

Perhaps it’s the hand designed factor? Perhaps it’s hard to see how they can be combined with recent ML trends?

“How can FCMs be extended with recent ML trends?”

Robotics & Reasoning

This is a big topic and hard to find a good place to start, so as usual let’s have a list of exciting upcoming ‘robot’ tech!

  • Self driving vehicles
  • Automated cleaning
  • Elderly assisting
  • Warehouse logistics
  • Drones!
  • Marine operations
  • Education

“What are fundamental problems robots have to deal with?”

“Artificial Intelligence for Long-Term Robot Autonomy: A Survey” helpfully breaks down the categories into:

  • Navigation and Mapping
  • Environment classification
  • Knowledge representation
  • Interaction
  • Ongoing learning

Classification and image recognition are well covered. Interaction would be good to look at, as there are comments floating around to the effect that manipulation of the environment is critical to creating a representation.

“What are the theories of object manipulation as they relate to robots/humans developing environmental understanding?”


“Is there anything unique to robot design and reasoning?” 

One thing is that robots need to have an emphasis on safety. Previously I thought that was a little boring. Given this context though, safety systems might be the biggest incentive to give ML designs the ability to incorporate hypothetical information. Or, pay more attention to low probability events and uncertainty in predictions.

“How are safety systems designed to combine with ML for self driving cars?”

To be explored!



“What have we seen?”

Research progress and interest in question answering systems is picking up. New datasets are coming out regularly. We haven’t even scratched the surface of the different approaches to reasoning systems.

“What would be good to look at next?”

It would be helpful to look at the architectures for the VQA and QA models. In particular I would like to see if anyone has had success with graphs or neuro-symbolic architectures. Graphs seem like a good way to store information so that it can be accessed from a variety of different paths that might enable hypothetical perusal.

After that it would make sense to look at cognitive models and the recent cognitive inspired architectures. From there, let’s re-evaluate!


Unfinished questions:

  • Where does information reasoning fit in an AGI design?
  • Is it even true that all ML designs only reach level 1 of the causal hierarchy chart? What about Monte Carlo Tree Search in AlphaGo? Or Counterfactual regret minimization? What about the implicit comparison of actions stored in the Q value in RL?
  • What are the models of the mind?
  • What are the theories of object manipulation as they relate to robots/humans developing environmental understanding?
  • How are safety systems designed to combine with ML for self driving cars?
  • What architectures do all these approaches use?
  • Where is the threshold where QA becomes usable in our daily lives?
  • Is there a significant difference between the ability to reason logically in English and the ability to make mathematical conjecture?

