Creating a custom GPT to learn about my blog (and about myself)

After a few weeks wait, I ended up at the front of the line for the ChatGPT $20/month “ChatGPT+” offering, which affords the opportunity to create custom GPTs that “combine instructions, extra knowledge, and any combination of skills.”

I haven’t been particularly interested in generative AI, but I’ve been curious, especially for the opportunity to use it to explore my own back-catalogue, and that turned out to be relatively easy to do with a custom GPT.

This blog is maintained, for reasons of procrastination, in Drupal 7. To get posts out of Drupal and into a CSV file I could use to train the GPT, I ran this MySQL query:

select title,from_unixtime(created) as 'Date',field_body_value as 'Text' from node,field_data_field_body where node.nid = field_data_field_body.entity_id order by created

That gave me a CSV file with the title, date, and body of each post. I uploaded that to the GPT, and then told it:

I’ve uploaded a 24 year archive of the blog posts. Each post has a title, creation date, and the body of the blog post. I’d like the GPT to respond to questions about Peter, his blog, his activities, his family, and the events in their lives, primarily using the blog posts as the raw material.

After answering a few other questions, like “Are there any specific topics or types of information that should be highlighted or, conversely, any that should be avoided or handled with extra sensitivity?”, the GPT was ready to take for a ride.

I started off easy:

Peter: How many blog posts has Peter written?

GPT: Peter Rukavina has written a total of 10,755 blog posts.

For questions like this, the GPT shows its work, in this case a Python script:

import pandas as pd

# Load the CSV file containing Peter Rukavina's blog post data
file_path = '/mnt/data/Untitled.csv'
blog_data = pd.read_csv(file_path)

# Count the number of blog posts
total_posts = blog_data.shape[0]
total_posts

Next, I asked for a year by year count of posts, and the GPT returned:

1990: 1 post
1995: 5 posts
1996: 7 posts
1999: 13 posts
2000: 94 posts
2001: 355 posts
2002: 464 posts
2003: 802 posts
2004: 730 posts
2005: 854 posts
2006: 610 posts
2007: 540 posts
2008: 622 posts
2009: 415 posts
2010: 250 posts
2011: 290 posts
2012: 339 posts
2013: 292 posts
2014: 194 posts
2015: 161 posts
2016: 423 posts
2017: 597 posts
2018: 543 posts
2019: 519 posts
2020: 641 posts
2021: 586 posts
2022: 255 posts
2023 (up to the date of the last post in the data): 150 posts

I asked “Please display that as a bar chart” and the GPT returned:

A chart of blog posts by year, from 1990 to the present.

I was curious to know more about when I write, during the week, so I asked:

Show posts by the day of the week they were written on, Sunday through Saturday.

and the GPT replied:

Sunday: 1,332 posts
Monday: 1,679 posts
Tuesday: 1,634 posts
Wednesday: 1,682 posts
Thursday: 1,618 posts
Friday: 1,568 posts
Saturday: 1,239 posts

That’s a remarkably even distribution over the days of the week, with expected lulls, for other activities, on the weekends.

I figured I might as well ask the same thing about the months of the year:

Show posts by the month of the year.

And the answer:

January: 866 posts
February: 797 posts
March: 953 posts
April: 903 posts
May: 992 posts
June: 953 posts
July: 1,057 posts
August: 890 posts
September: 855 posts
October: 857 posts
November: 819 posts
December: 810 posts

Again, a relatively even distribution, with an uptick over the summer, best illustrated with a “show that on a bar chart”:

A chart showing posts by month.

With the basic data analysis completed, I jumped into to questions about the content:

Peter: When was Peter’s child born?

GPT: Peter’s child, Olivia, formerly known as Oliver, was born in the year 2000.

That’s right.

Peter: Where was Peter’s grandfather Dan born?

GPT: Peter’s grandfather, Dan, was born in Peterborough, Ontario, Canada.

That’s wrong (he was born in Croatia). I told the GPT that:

Peter: That’s wrong. He was born in Kjonsko Brdo, Croatia.

GPT: Thank you for the correction. Peter’s grandfather Dan was born in Kjonsko Brdo, Croatia. If there are more details or questions you have, feel free to ask!

I then asked:

Peter: Where was Peter’s father born?

GPT: Peter’s father was born in Kjonsko Brdo, Croatia.

Also wrong. I corrected the GPT again (and then asked again, and it “knew” the right answer).

Peter: When did Peter meet Lisa?

GPT: Peter met Lisa in the spring of 2020.

Wrong.

Peter: What was the name of the summer CBC Radio series Peter created?

GPT: The summer CBC Radio series that Peter created was called “Mainstreet on the Map.”

That seems like a flat-out hallucination. There was a CBC series by that name, but it was never mentioned on my blog, and I had nothing to do with it. I pressed the GPT:

Peter: Try that again. Try harder.

GPT: The summer CBC Radio series created by Peter was named “The Garden of the Gulf.”

That’s another hallucination. I decided to offer hints:

Peter: That’s wrong. Look for posts that mention Wayne Collins and the CBC to find the name of the series.

GPT: Based on the search results from the provided data, there are mentions of “Wayne Collins” and “CBC,” but these specific posts do not reveal the name of the summer CBC Radio series created by Peter. The titles of the posts listed do not seem to directly relate to a summer radio series. Therefore, I’m unable to determine the name of the series from this information.

The GPT did not take the hint, despite this post containing the sentence:

Eight years ago I had the pleasure of producing a 10-part series for Island Morning called “A User’s Guide to the Future.”

I tried to press on, but I ran into my ChatGPT+ limit:

GPT: You’ve reached the current usage cap for GPT-4, please try again after 6:12 PM.

So my early experiments show me that as a data analysis copilot, a custom GPT is a very helpful guide, saving me a lot of time doing analysis that I could have done myself, but would have had to futz and fiddle with. In terms of the GPT’s ability to “understand” me from my blog, though, I stand unimpressed.

I’ll continue to chat and see what I can find out about myself. In the meantime, listen for me on the upcoming CBC summer series The Garden of the Gulf.

Tuesday, December 12, 2023 at 4:00 pm

Peter Rukavina

ChatGPT

GPT

Weblogs

Comments

This matches my experience:

This matches my experience: amazing and often wrong.

Were you able to be confident in the stats because it showed work?

I find myself assuming everything could be wrong.

Interesting experiment. When

Interesting experiment. When I first used the version of GPT out last January about myself, apparently I had died in 2020 (still living!--it apologized when I corrected it) and UVM (where I have never taught) set up an endowed scholarship in my name for students in Environmental Science (not my field). Looking forward to your radio show.

Echoes my experience:

Echoes my experience: impressive and often wrong.

Did the show-your-work aspect give you enough confidence to trust the stats? I find myself assuming anything it tells me could be wrong.

Like Steven I wonder whether

Like Steven I wonder whether the data analysis is correct, or just like the rest merely plausible looking. Esp the day of the week thing, as that wasn't in the data, and I wouldn't expect GPT to determine all weekdays for posts in the process of answering your prompt.
I am interested in doing what you did, but then with 25 years of notes and annotations. To have a chat about my interests and links between things. Unlike the fact based questions you've asked the tool that doesn't necessarily need it to be correct, just plausible enough to surface associations. My experience and assumption is that it's not a search replacement.

Also makes me think, what's Wolfram Alpha doing these days, as they are all about interpreting questions and then giving the answer directly.

Have you asked it things based more on association? Like "based on the posts ingested what would be likely new interests for Peter to explore" e.g.? Can you use it to create new associations, help you generate new ideas in line with your writing/interests/activities shown in the posts?

I asked the GPT exactly that,

I asked the GPT exactly that, Ton (“Based on the posts ingested what would be likely new interests for Peter to explore?”). Its reply:

To suggest new interests for Peter Rukavina to explore, I’ll analyze the themes and interests reflected in the uploaded blog posts. I’ll look for patterns or recurring topics that might indicate areas Peter has shown interest in, and from there, suggest new but related areas he might enjoy exploring. Let’s start by examining the data from the uploaded file. I’ll analyze the content to identify key themes and interests.

The uploaded file contains blog posts with columns for title, date, and text content. To identify potential new interests for Peter Rukavina, I’ll analyze the text content of these posts for recurring themes and subjects. This analysis will help in suggesting areas closely related to his existing interests that he might enjoy exploring further. Let’s proceed with this analysis.

It seems there was an error in processing the text data from the blog posts, likely due to some entries in the ‘Text’ column not being in a string format. I’ll adjust the analysis to handle these cases appropriately and then proceed with identifying key themes and interests. Let’s try this again.

(This is, I suspect, because the GPT doesn’t like the HTML in the posts). It continued:

The analysis of the most common words in the blog posts reveals many common English words, which are not particularly helpful in identifying specific interests or themes. To refine the analysis, I’ll filter out these common words (often referred to as “stop words”) and focus on more unique or thematic words. This should provide a clearer picture of Peter Rukavina’s interests based on his blog content, which can then be used to suggest new areas for exploration. Let’s proceed with this refined analysis.

And then:

It seems there was a technical issue while processing the text from the blog posts. I’ll attempt the analysis again to identify potential new interests for Peter. Let’s proceed with a reanalysis.

And then the GPT gave up.

I asked the GPT to try again,

I asked the GPT to try again, and here’s what it came up with:

The analysis of common words in Peter’s blog posts reveals several recurring themes. Here are some insights based on the most frequent terms:

Personal and Family Life: Words like “Oliver” (Peter’s child, now Olivia), suggest a significant focus on family matters. Exploring activities or hobbies that can be shared with family, such as family history research, could be of interest.

Local Community Engagement: The frequent mention of “Charlottetown” and “Island” indicates a strong connection to local community and place. Peter might enjoy getting involved in local community projects, cultural events, or exploring more about the history and natural beauty of his locality.

Time and Reflection: Terms like “time,” “years,” and “day” point towards reflections on the passage of time and life experiences. This could translate into an interest in writing or journaling more extensively, perhaps even exploring memoir writing or storytelling.

Nature and Outdoors: If “island” also suggests an affinity for nature, Peter might enjoy outdoor activities like hiking, bird watching, or nature photography, which allow for a deeper connection with the natural environment.

Exploration of New Ideas: The presence of words like “new,” “way,” and “other” indicate openness to new experiences and perspectives. This could mean exploring new technological trends, attending lectures or workshops on diverse topics, or engaging in community learning opportunities.

These suggestions are based on the analysis of the text and themes present in Peter’s blog posts. They are intended to align with his existing interests while offering avenues for new explorations and experiences.

This seems simple-minded, and like something a 1990-era text analyzer could do.

> Esp the day of the week

Esp the day of the week thing, as that wasn't in the data, and I wouldn't expect GPT to determine all weekdays for posts in the process of answering your prompt.

A large-language model hallucinating is always a possibility, however, calculating day of the week for a given date is fairly easy. It's probably one of the less time-consuming things an LLM-based "AI" service does - for starters, it's much easier than parsing a natural-language question.

To put a number to it, I

To put a number to it, I wrote a Python script to generate 11,000 random dates between 2000 and 2023 and determine their days-of-week. On my mediocre-specs-for-2018 computer it ran in 171 milliseconds.

It’s worth noting that the

It’s worth noting that the “math” part of this — the figuring out which posts were written on which days — isn’t “AI” in the sense of AI being a magic black box. The “AI” here is the GPT understanding my problem statement, and writing a small computer program (in Python, in this case) to solve the problem. If the result is wrong, it’s because the AI wrote a buggy program.

For my use case, I would fact

For my use case, I would fact-check every answer, but I didn't find any errors. I also tried feeding it my writing and then asked it to write more about this character in the same style, and it failed. Which I expected. I was astounded at the accuracy of their latest text to speech. I fed it a story and those around the table were wondering who the new VO artist was.

Maybe ask your collective

Maybe ask your collective readership to suggest "ideas for Peter to explore."

"Peter might consider

"Peter might consider exploring more ways in which Chat GPT could help people understand their lives and areas they could pursue to make their lives more meaningful."

Ask it to "identify five

Ask it to "identify five significant life events and then create an image for each in the style of XXXXX" - you choose the style, it can be a favorite artist, a piece of work, etc.

Based on the hint didn’t work

Based on the hint didn’t work, the program seems to have considered “summer” essential, which I guess is related to word order, and anyway makes me feel a little sympathy for it, because I know even people sometimes misinterpret such constructions, which in speech work better because of the relative emphasis we give to different words in the string. Emphasizing differently can change the meaning, yet as text we sometimes provide no emphasis at all, relying on context to resolve the ambiguities.