NLP is more than just LLMs

This is a guest post from our good friend James Ravenscroft, CTO of our sister company Filament AI, who holds a PhD in Machine Learning and NLP.
Originally posted here – where the full index of resources used to complete the article is given 🙂
There is sooo much hype around LLMs at the moment. As an NLP practitioner of 10 years, I find it exhausting and quite annoying. Amongst the junior ranks, there’s a lot of despondency and dejection and a feeling of “what’s the point? ClosedOpenAI have solved NLP”.

Well, I’m here to tell you that NLP is more than just LLMs and that there are plenty of opportunities to get into the field. What’s more, there are plenty of interesting, ethical use cases that can benefit society. In this post I will describe a number of opportunities for research and development in NLP that are unrelated or tangential to training bigger and bigger transformer-based LLMs.

Combatting Hallucination

If you take the hype at face value, you could be forgiven for believing that NLP is pretty much a solved problem. However, that simply isn’t the case. LLMs hallucinate (make stuff up) and, whilst there is a marked improvement in hallucinations between versions of GPT, hallucination is a problem with transformer-based LLMs in general, as Ilya Sutskever, the technical co-founder of OpenAI, admits. Instead of relying on pure LLMs, there are lots of opportunities for building NLP pipelines that can reliably retrieve answers from specific documents via semantic search. This sort of approach allows the end user to make their own mind up about the trustworthiness of the source rather than relying on the LLM itself, which might be right or might spit out alphabet soup.

This week OpenAI announced a plugin interface for ChatGPT that, in theory, facilitates a hybrid LLM and retrieval approach through their system. However, it seems like GPT can still hallucinate incorrect answers even when the correct one is in the retrieved response. There’s definitely some room for improvement here!
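The retrieval approach described above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the `embed` function is a toy bag-of-words stand-in (an assumption; a real system would use a sentence-embedding model), and the document names and contents are made up for the example. The point is that each answer comes back with its source, so the reader can judge its trustworthiness.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy 'embedding': a bag-of-words count vector.

    Assumption: a real pipeline would call a sentence-embedding model here.
    """
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, top_k=1):
    """Return (score, source, passage) tuples for the top_k passages, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), source, text) for source, text in documents.items()]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Hypothetical document store, invented for illustration.
DOCUMENTS = {
    "hr-handbook.txt": "Employees accrue 25 days of annual leave per year.",
    "it-policy.txt": "Laptops must be encrypted and locked when unattended.",
    "expenses.txt": "Travel expenses are reimbursed within 30 days of submission.",
}

results = retrieve("How many days of annual leave do employees get?", DOCUMENTS)
for score, source, passage in results:
    # Show the source alongside the passage so the user can check it themselves.
    print(f"{source} ({score:.2f}): {passage}")
```

Because the passage is returned verbatim with its source, nothing is generated that was not already in the documents.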

As use of LLMs becomes more widespread and people ask them questions and use them to write blog posts, we’re going to start seeing more hallucinations presented as facts online. What’s more, we’re already seeing LLMs citing misinformation generated by other LLMs to their users.

Bot Detection

There are certainly opportunities in bot vs human detection. Solutions like GPTZero and GLTR rely on the statistical likelihood that a model would use a given sequence of words based on historical output (for example, if the words “bananas in pajamas” never appear in known GPT output but they appear in the input document, the probability that it was written by a human is increased). Approaches like DetectGPT use a model to perturb (subtly change) the text and compare the probabilities of the original and perturbed strings: if the original “sticks out” as markedly more probable than its perturbed variants, it was probably generated by the model, whereas human-written text tends not to show this gap. Edit: I was also contacted by Tracey Deacker, a computer science student in Reykjavik, who recommended CrossPlag, another such detection tool.
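The likelihood idea behind tools like GPTZero and GLTR can be illustrated with a toy model. This is only a sketch under loud assumptions: the real tools use token probabilities or ranks from a neural language model, whereas the `UnigramModel` class here is a made-up stand-in that simply counts words in a small sample of “known model output”. Text full of words the model has never produced scores a lower average log-likelihood and so looks more human.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class UnigramModel:
    """Toy stand-in for a language model: add-one smoothed word frequencies."""

    def __init__(self, corpus):
        self.counts = Counter(tokenize(corpus))
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # one extra bucket for unseen words

    def log_prob(self, token):
        # Counter returns 0 for unseen tokens, so they get the smoothed minimum.
        return math.log((self.counts[token] + 1) / (self.total + self.vocab_size))


def mean_log_likelihood(model, text):
    """Average per-token log-probability of the text under the model."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no tokens")
    return sum(model.log_prob(t) for t in tokens) / len(tokens)


# Hypothetical sample of known model output, invented for illustration.
KNOWN_MODEL_OUTPUT = (
    "as an ai language model i am happy to help. "
    "the answer is that the model is trained on data. "
    "i hope this answer is helpful"
)

model = UnigramModel(KNOWN_MODEL_OUTPUT)
machine_like = mean_log_likelihood(model, "the answer is that the model is helpful")
human_like = mean_log_likelihood(model, "bananas in pajamas are coming down the stairs")

# Higher (less negative) means the text looks more like known model output.
print(f"machine-like: {machine_like:.2f}, human-like: {human_like:.2f}")
```

A real detector would still need a calibrated threshold, which is exactly where the arms race between detection and evasion comes in.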

It seems like bot detection and evading detection are likely to be a new arms race: as new detection methods emerge, people will build more and more complex methods for evading detection or rely on adversarial training approaches to train existing models to evade new detection approaches automatically.

Fact Checking and Veracity

Regardless of who wrote the content, bots or humans, fact checking remains a key topic for NLP and, again, something that generative LLMs are not really set up to do. Fact checking is a relatively mature area of NLP with challenges and workshops like FEVER. However, it remains a tricky area which may require models to make multiple logical “hops” to arrive at a conclusion.

When direct evidence of something is not available, rumour verification is another tool in the NLP arsenal that may help us to derive the trustworthiness of a source. It works by identifying support or denial from parties who may be involved in a particular rumour. For example, Donald Trump tweets that he’s going to be arrested and some AI-generated photos of his arrest appear online, posted by unknown actors, but we can determine that this is unlikely to be true because social media accounts at trustworthy newspapers tweet that Trump created a false expectation of arrest. Kochkina et al. currently hold the state of the art on the RumourEval dataset.
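To make the support-or-denial idea concrete, here is a rough sketch of how stances from different parties might be combined. It is not the method of Kochkina et al.: the stance labels are assumed to come from an upstream classifier, and the `trust` weights, source names and `rumour_score` function are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Reaction:
    source: str
    stance: str   # "support", "deny", "query" or "comment"
    trust: float  # hypothetical trust weight for the source, between 0 and 1


def rumour_score(reactions):
    """Trust-weighted stance balance in [-1, 1]; negative means mostly denied."""
    signed = {"support": 1.0, "deny": -1.0}
    # Queries and comments carry no signed evidence, so they are ignored.
    weighted = sum(signed.get(r.stance, 0.0) * r.trust for r in reactions)
    total = sum(r.trust for r in reactions if r.stance in signed)
    return weighted / total if total else 0.0


# Invented reactions to a rumour, for illustration only.
reactions = [
    Reaction("anonymous_account_1", "support", 0.1),
    Reaction("anonymous_account_2", "support", 0.1),
    Reaction("national_newspaper", "deny", 0.9),
    Reaction("newswire", "deny", 0.8),
    Reaction("bystander", "comment", 0.3),
]

score = rumour_score(reactions)
verdict = "likely false" if score < 0 else "likely true" if score > 0 else "unverified"
print(f"score={score:.2f} -> {verdict}")
```

The hard research problems sit upstream of this arithmetic: classifying stance reliably and deciding how much each source should be trusted.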

Temporal Reasoning

Things change over time. The answer to “who is the UK Prime Minister” today is different to this time last year. GPT 3.5 got around this by often prefixing information with big disclaimers about being trained in 2021 before telling you that the UK Prime Minister is Boris Johnson, with no knowledge of who Rishi Sunak is. Early Bing/Sydney (which we now know was GPT-4) simply tried to convince you that it was actually 2022, not 2023, and that you must be wrong: “You have been a bad user. I have been a good Bing”.
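One way to picture the problem is to treat a fact as something that is only valid for a period of time. The sketch below stores the Prime Minister example as a dated timeline and answers “as of” a given date. The `holder_as_of` function and the `KNOWN_UNTIL` cutoff are illustrative assumptions; the cutoff makes the lookup refuse to answer past the end of its data rather than confidently returning a stale name.

```python
from bisect import bisect_right
from datetime import date

# (start date, office holder), sorted by start date.
UK_PRIME_MINISTERS = [
    (date(2019, 7, 24), "Boris Johnson"),
    (date(2022, 9, 6), "Liz Truss"),
    (date(2022, 10, 25), "Rishi Sunak"),
]

# Hypothetical date after which this timeline makes no claims.
KNOWN_UNTIL = date(2023, 3, 31)


def holder_as_of(timeline, when, known_until=KNOWN_UNTIL):
    """Return whoever held the office on the given date, or raise if unknown."""
    if when > known_until:
        raise LookupError(f"no data after {known_until.isoformat()}")
    starts = [start for start, _ in timeline]
    # Index of the last entry whose start date is on or before `when`.
    index = bisect_right(starts, when) - 1
    if index < 0:
        raise LookupError(f"no record covering {when.isoformat()}")
    return timeline[index][1]


print(holder_as_of(UK_PRIME_MINISTERS, date(2022, 3, 24)))
print(holder_as_of(UK_PRIME_MINISTERS, date(2023, 3, 24)))
```

A pure LLM has no equivalent of the `when` argument or the cutoff check: its knowledge is frozen at training time, with no explicit record of when each fact stopped being true.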

Again, this is something that a pure transformer-based LLM sucks at and around which there are many opportunities. Recent work in this area includes modelling moments of change in people’s mood based on social media posts, and some earlier work has been done to model things like how topics of discussion in scientific research change over time.

Specialised Models and Low Compute Modelling

LLMs are huge and power-hungry language generalists but often get outperformed by smaller specialised models at specific tasks. Furthermore, recent developments have shown that we can get pretty good performance out of LLMs by shrinking them so that they run on laptops, Raspberry Pis and even mobile phones. It also looks like it’s possible to get ChatGPT-like performance from relatively small LLMs with the right datasets: Databricks yesterday announced their Dolly model, which was trained on a single machine in under an hour.

There is plenty more work to be done in continuing to shrink models so that they can be used on-site, on mobile or in embedded settings, in order to support use cases where flexibility and trustworthiness are key. Many of my customers would be very unlikely to let me send their data to OpenAI to be processed and potentially learned from in a way that would benefit their competitors or that could accidentally leak confidential information and cause GDPR headaches.

Self-hosted models are also a known quantity but the big organisations that can afford to train and host these gigantic LLMs stand to make a lot of money off people just using their APIs as black boxes. Building small, specialised models that can run on cheap commodity hardware will allow small companies to benefit from NLP without relying on OpenAI’s generosity. It might make sense for small companies to start building with a hosted LLM but when you get serious, you need to own your model.

Trust and Reproducibility

Explainability and trustworthiness of models are now a crucial part of the machine learning landscape. It is often very important to understand why an algorithm made a particular decision in order to eliminate latent biases and discrimination and to ensure that the reasoning behind a decision is sound in general. There are plenty of opportunities to improve the current state-of-the-art in this space by training models that can explain their rationale as part of their decision and by developing benchmarks and tests that can draw out problematic biases.

The big players have started to signal their intent not to make their models and datasets open any more. By hiding this detail, they are effectively withdrawing from the scientific community, and we can no longer meaningfully reproduce their findings or trust their results. For example, there are some pretty feasible hypotheses around about how GPT-4 may have previously been exposed to and overfit on the bar exam papers that it supposedly aced. Without access to the model’s dataset or weights, nobody can check this.

In fact, we’ve got something of a reproducibility crisis when it comes to AI in general. There are lots of opportunities for budding practitioners to enter the arena and tidy up processes and tools and reproduce results.


In conclusion, while the world’s gone mad with GPT fever, it’s important to remember that there are still a huge number of opportunities within the NLP space for small research groups and businesses.

I sort of see ChatGPT a bit like how many software engineers see MongoDB: a prototyping tool you might use at a hackathon to get a proof-of-concept working but which you subsequently revisit and replace with a more appropriate, tailored tool.

So for early career researchers and engineers considering NLP: it’s definitely worth learning about LLMs and considering their strengths and weaknesses, but also consider that, regardless of what the Silicon Valley giants would have you believe, NLP is more than just LLMs.

Other Resources for AI Beyond LLMs

Here are some more resources on NLP and ML stuff that is going on outside of the current LLM bubble from others in the NLP space:
– a thread where some NLP experts weigh in on unsolved problems
– a recent chat between AI and ML practitioners on stuff they are working on outside of LLMs
– a blog post from an NLP professor about finding problems to work on outside of the LLM bubble
