Blog Post Categorisation with Embeddings & LLMs

Learn how to 'automagically' create and de-duplicate categories for your blog posts.


Categorising Blog Posts

Categories provide structure to your site by organising individual posts and sub-topics under several main topics...


James Anthony Phoenix

Data Engineer | Full Stack Developer
Premium subscription required.
Python experience recommended.
1. Scenario
As we gather in the employee cafeteria during lunchtime, I can already sense the excitement in the air. William Winters, our brilliant Brand Strategist at Whipple, has just arrived to share his expertise on blog post categorization using LLMs and embeddings.
William Winters
at Whipple

I'm thrilled to be here today to talk about blog post categorization using LLMs and embeddings.

This is a game-changer for us in terms of efficiency and accuracy.

By automating the categorization process, we can save valuable time and ensure that our blog posts are reaching the right audience.

So let's dive in and learn how to level up our tagging game.

Are you ready to take on this challenge?

This course is a work of fiction. Unless otherwise indicated, all the names, characters, businesses, data, places, events and incidents in this course are either the product of the author's imagination or used in a fictitious manner. Any resemblance to actual persons, living or dead, or actual events is purely coincidental.

2. Brief

In marketing and SEO, categorizing blog posts is an important but tedious task: consistent categories keep content organized and make it easier for readers to discover. The process can be automated with large language models (LLMs), embeddings, and tools like LangChain. The first steps are installing an open-source embedding library such as Sentence Transformers and importing the required packages, including LangChain, transformers, pydantic, and numpy.

The categorization process starts by setting up three data structures: "tagged articles", each with text and categories; "untagged articles", with text only; and "tags", a list of category names. The data is then prepared accordingly.
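As a minimal sketch of that setup, the three structures could look like the following. The class names, field names, and sample articles are illustrative assumptions, not the course's actual data. The brief lists pydantic among the packages; standard-library dataclasses are swapped in here only to keep the example dependency-free.

```python
from dataclasses import dataclass


@dataclass
class TaggedArticle:
    """An article that already has categories (hypothetical shape)."""
    text: str
    categories: list[str]


@dataclass
class UntaggedArticle:
    """An article with text only (hypothetical shape)."""
    text: str


# Made-up sample data for illustration only.
tagged_articles = [
    TaggedArticle("How to build backlinks for a new site", ["SEO"]),
    TaggedArticle("Writing subject lines that get opened", ["Email Marketing"]),
    TaggedArticle("Keyword research for beginners", ["SEO"]),
]
untagged_articles = [
    UntaggedArticle("A checklist for on-page optimisation"),
]

# "tags": the list of category names, collected once and de-duplicated.
tags = sorted({c for article in tagged_articles for c in article.categories})
print(tags)
```

Building `tags` from a set removes exact duplicates only. Near-duplicates such as "SEO" and "Search Engine Optimisation" need the embedding comparison described next.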

Utility functions do most of the work. "get_category_embeddings" uses Sentence Transformers to generate an embedding for each category, which makes duplicate categories detectable. "find_similar_category" uses cosine similarity to compare a new category with the existing ones and returns matches above a set threshold.
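The two utilities can be sketched roughly as below. The function names come from the brief, but the signatures, the 0.8 threshold, and the choice to return only the single best match are assumptions. To stay runnable without any downloads, the example uses a toy letter-count embedder (`toy_embed`, hypothetical) in place of Sentence Transformers; in practice you would pass a real model's encode function as `embed`.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]
Embedder = Callable[[str], Vector]


def toy_embed(text: str) -> list[float]:
    """Stand-in embedder (assumption): counts of the letters a-z."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def get_category_embeddings(
    categories: Sequence[str], embed: Embedder = toy_embed
) -> dict[str, Vector]:
    """Map each category name to its embedding."""
    return {name: embed(name) for name in categories}


def find_similar_category(
    new_category: str,
    category_embeddings: dict[str, Vector],
    embed: Embedder = toy_embed,
    threshold: float = 0.8,  # assumed value; tune it for your embedder
) -> Optional[str]:
    """Return the best existing category at or above the threshold, else None."""
    new_vec = embed(new_category)
    best_name, best_score = None, threshold
    for name, vec in category_embeddings.items():
        score = cosine_similarity(new_vec, vec)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


existing = get_category_embeddings(["SEO", "Email Marketing"])
print(find_similar_category("seo", existing))    # matches the existing "SEO"
print(find_similar_category("Video", existing))  # no match above the threshold
```

When a match is found, the new category can be mapped to the existing one instead of being added, which is what keeps the category list free of duplicates.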

Additionally, an LLM predicts categories for untagged articles based on their content. "encode_articles" uses Sentence Transformers to encode article text into embeddings, and "find_similar_articles" uses those embeddings to find category matches. Finally, "process_untagged_articles" ties the steps together: it encodes the tagged articles, initializes the categories, and assigns relevant categories to untagged articles based on similarity.
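One possible reading of that pipeline is sketched below: each untagged article inherits the categories of its most similar tagged articles. The function names follow the brief, while the signatures, the `top_k` parameter, and the nearest-article strategy are assumptions. The LLM prediction step is left out, and a toy letter-count embedder (`toy_embed`, hypothetical) stands in for Sentence Transformers so the example runs on its own.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]
Embedder = Callable[[str], Vector]


def toy_embed(text: str) -> list[float]:
    """Stand-in embedder (assumption): counts of the letters a-z."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def encode_articles(texts: Sequence[str], embed: Embedder = toy_embed) -> list[Vector]:
    """Encode each article text into an embedding."""
    return [embed(text) for text in texts]


def find_similar_articles(
    query_vec: Vector, article_vecs: Sequence[Vector], top_k: int = 1
) -> list[int]:
    """Return indices of the top_k articles most similar to the query."""
    ranked = sorted(
        range(len(article_vecs)),
        key=lambda i: cosine_similarity(query_vec, article_vecs[i]),
        reverse=True,
    )
    return ranked[:top_k]


def process_untagged_articles(
    tagged: Sequence[tuple[str, list[str]]],
    untagged: Sequence[str],
    embed: Embedder = toy_embed,
    top_k: int = 1,
) -> dict[str, list[str]]:
    """Assign each untagged article the categories of its nearest tagged ones."""
    tagged_vecs = encode_articles([text for text, _ in tagged], embed)
    results: dict[str, list[str]] = {}
    for text in untagged:
        categories: list[str] = []
        for i in find_similar_articles(embed(text), tagged_vecs, top_k):
            for category in tagged[i][1]:
                if category not in categories:  # keep order, skip repeats
                    categories.append(category)
        results[text] = categories
    return results


# Made-up sample data for illustration only.
tagged = [
    ("seo keyword ranking tips", ["SEO"]),
    ("email newsletter subject lines", ["Email Marketing"]),
]
untagged = ["keyword ranking tips for seo"]
print(process_untagged_articles(tagged, untagged))
```

With a real embedding model, raising `top_k` lets an article pick up categories from several neighbours, at the cost of occasionally assigning a less relevant one.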

3. Tutorial

Hey, if you work in marketing or SEO, one of your jobs can be to tag or categorize blog posts. Maybe you have a client who has already been publishing blog posts and already has categories. You want to stick to their system, but you also want to automate the categorization of blog posts going forward.

4. Exercises
5. Certificate
