AIs Are Getting Too Smart – Time For A New “IQ Test” 🎓


Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér. In a world where learning-based algorithms are rapidly becoming more capable, I increasingly find myself asking the question: “so, how smart are these algorithms, really?” I am clearly not alone in this. To be able to answer this question, a set of tests has been proposed, and many of these tests share one important design decision: they are very difficult to solve for someone without generalized knowledge.

In an earlier episode, we talked about DeepMind’s paper in which they created a bunch of randomized, mind-bending (or, in the case of an AI, maybe silicon-bending) questions that looked quite a bit like a nasty, nasty IQ test. Even in the presence of additional distractions, their AI did extremely well. I noted that on this test, finding the correct solution around 60% of the time would be quite respectable for a human, whereas their algorithm succeeded over 62% of the time, and upon removing the annoying distractions, this success rate skyrocketed to 78%. Wow.
More specialized tests have also been developed. For instance, scientists at DeepMind also released a modular math test with over 2 million questions, in which their AI did extremely well at tasks like interpolation and rounding decimals and integers, whereas it was not too accurate at detecting primality or at factorization.

Furthermore, a little more than a year ago, the GLUE benchmark appeared, which was designed to test the natural language understanding capabilities of these AIs. When benchmarking the state-of-the-art learning algorithms, its authors found that they were approximately 80% as good as non-expert human beings. That is remarkable. Given the difficulty of the test, they were likely not expecting human-level performance, which you see marked with the black horizontal line, and yet it was surpassed in less than a year. So, what do we do in this case? Well, as always, of course: design an even harder test.
In comes SuperGLUE, the paper we’re looking at today, which is meant to provide an even harder challenge for these learning algorithms. Have a look at these example questions here. For instance, this time around, drawing on general background knowledge gets more emphasis in the questions. As a result, the AI has to be able to learn and reason with more finesse to successfully address these questions. Here you see a bunch of examples, and you can see that these are anything but trivial little tests for a baby AI: not all, but some of these are calibrated for humans with around a college-level education.
So, let’s have a look at how the current state-of-the-art AIs fared in this one! Well, not as well as humans, which is good news, because that was the main objective. However, they still did remarkably well. For instance, the BoolQ package contains a set of yes-or-no questions, and on these, the AIs are reasonably close to human performance. On MultiRC, the multi-sentence reading comprehension package, they still do OK, but humans outperform them by quite a bit. Note that you see two numbers for this test; the reason is that there are multiple test sets for this package, and on the second one, even humans seem to fail almost half the time, so I can only imagine the revelation we’ll have a couple more papers down the line.
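If you would like a feel for what one of these BoolQ questions actually looks like, here is a minimal sketch, assuming the Hugging Face datasets library and its “super_glue”/“boolq” configuration (an illustration of mine, not something taken from the paper or the video):

from datasets import load_dataset

# BoolQ, one of the SuperGLUE tasks: given a short passage and a
# yes/no question about it, the model must answer True or False.
boolq = load_dataset("super_glue", "boolq", split="validation")

example = boolq[0]
print(example["passage"][:300], "...")  # the supporting paragraph
print(example["question"])              # a yes/no question about it
print(bool(example["label"]))           # the ground-truth answer

The other tasks are packaged in a similar way, so swapping “boolq” for, say, “multirc” loads the multi-sentence reading comprehension questions instead.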
I am very excited to see that, and if you are too, make sure to subscribe and hit the bell icon so you don’t miss future episodes. Thanks for watching and for your generous support, and I’ll see you next time!
