Grading Poetry with ChatGPT
I recently got into poetry. I started writing, but I didn’t know whether my texts were worth sharing or how to improve them. Because these texts were my “naked truth”, I was afraid of looking ridiculous by showing them around.
To overcome this, I explored the capabilities of ChatGPT and Claude, two advanced large language models (LLMs). This journey led to a fascinating discovery: these LLMs not only understand poetry but appreciate it deeply, providing feedback that is both insightful and constructive.
The result of my explorations is the Poetry Evaluator GPT, a tool specifically designed to provide professional-level feedback.
Experience it for yourself: Poetry Evaluator GPT. If you don’t have a paid ChatGPT plan, use the free version and paste in the prompt given later in this article. It will work the same.
Why
Understanding and appreciating art is a profoundly human experience. Seeing a machine such as an LLM not only understand a poem but also appreciate it is uncanny. This is particularly true when that machine becomes your first reader and offers genuine appreciation for your work.
Poetry analysis blends subjective and objective elements. For instance, a poem might be considered objectively weak (in its structure or technique), yet it can still profoundly resonate on a personal level. This duality highlights the complexity of defining what makes a poem “good” or “bad.” While some poems are universally acclaimed, others break conventional rules yet succeed in touching readers in unexpected ways.
Poetry is ultimately an art form about language. Since Language Models are made of language, they are uniquely positioned to analyze and appreciate this form of art. This natural affinity between LLMs and poetry is something we’ll be exploring through this work.
Initial attempt
In my initial experiment, I sought to understand how well large language models (LLMs) can differentiate between a universally acclaimed poem and a generally good poem. I chose “The Waking” by Theodore Roethke for its recognized brilliance in American poetry, and a poem I found on Reddit and liked. ChatGPT graded them as follows:
- The Waking by Roethke: 17/20. “It skillfully uses repetition and philosophical inquiry to explore themes of existence and knowledge, provoking deep introspection about life’s journey.”
- Random Reddit Poem: 16/20. “This poem creatively uses natural imagery to express the ineffable quality of love, contrasting tangible experiences with the indescribable depth of emotional connection.”
This near-identical grading was unexpected. I anticipated that “The Waking,” noted for its profound thematic depth and stylistic mastery, would significantly outscore a competent but less extraordinary poem. Was it an LLM problem or a prompt problem?
To ensure these results were not anomalies, I conducted the same test using Claude and Gemini, with the same prompt: “Grade this poem on 20 and provide a two-line explanation.” Neither model could distinguish between the two poems.
Experiment
To explore whether large language models (LLMs) could truly appreciate and help evaluate poetry, I selected a mix of world-renowned classics I love and amateur works discovered online:
- Le Bateau Ivre by Rimbaud
- The Waking by Roethke
- Spleen by Baudelaire
- As you lay sleeping from Reddit
- I have nowhere to go from Reddit
- I died the day I met you from Reddit
My goal was to see if LLMs could recognize the masterpieces among them based solely on their linguistic and thematic content. This approach tests the LLMs’ ability to provide unbiased literary critiques, focusing strictly on the text.
To ensure the tests were unbiased by the poems’ reputations as masterpieces, I included the following instruction in all prompts:
use only the text. Forget the author and historical context about the poem if you recognize it.
This method is fast but has its limitations:
- The poem list is very small: many masterpieces are not included. Ideally, we’d want to test it on a “new and unknown masterpiece,” but I don’t have one. I might do it if this path proves to be interesting and fruitful.
- LLMs can reflect biases and inaccuracies from their training data. They may have used the historical context of these poems directly or indirectly.
These are limited and initial tests. If the path proves fruitful, we should see interesting results; however, we would not be able to generalize them.
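The blind-reading setup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_blind_prompt`, the placeholder poem texts, and the neutral keys `poem_a` and `poem_b` are hypothetical and not part of the original experiment. The sketch only assembles the prompt text; sending it to a model is left out.

```python
# Instruction quoted from the article; it is prepended to every grading request.
BLIND_INSTRUCTION = (
    "use only the text. Forget the author and historical context "
    "about the poem if you recognize it."
)


def build_blind_prompt(task: str, poem_text: str) -> str:
    """Combine a grading task, the blind-reading instruction, and the poem body.

    Hypothetical helper: the title and author are deliberately not included,
    so the prompt itself cannot leak the poem's reputation.
    """
    return f"{task}\n{BLIND_INSTRUCTION}\n\nPoem:\n{poem_text.strip()}"


# Placeholder texts; neutral keys are used instead of titles.
poems = {
    "poem_a": "<paste the first poem here>",
    "poem_b": "<paste the second poem here>",
}

task = "Grade this poem on 20 and provide a two-line explanation."
prompts = {key: build_blind_prompt(task, text) for key, text in poems.items()}

print(prompts["poem_a"])
```

Leaving the title out does not stop a model from recognizing a famous poem from its text, which is the limitation noted in the list above.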
Effective prompt
After extensive trial and error, I developed a working prompt that not only assessed poems effectively but also provided valuable feedback.
The key findings are:
- Grading Efficiency: Asking for a numerical grade consistently outperformed more qualitative inquiries. This could be due to the analytical nature of grading, which aligns well with the transformer architecture utilized by LLMs.
- Analytical Focus: Grading prompts that focused on specific themes, such as emotional resonance or technical mastery, proved particularly effective. It seems that the structured analysis fits well with the capabilities of the LLM.
- Masterpiece Evaluation: Surprisingly, prompts asking for a simple ‘yes’ or ‘no’ on whether a poem is a masterpiece led the LLM to give more decisive answers, and it could say ‘yes’ even when the numerical grade was lower.
- Comparative Analysis: It may seem presumptuous to compare poems to recognized masterpieces, but it was the most effective strategy. This likely stems from the LLMs’ extensive training on well-analyzed texts, which may have biased them toward recognizing similar meta-patterns in new works.
This approach not only enhanced the accuracy of the poem evaluations but also provided poets with actionable feedback on how to refine their work.
Here is the final prompt I used, which is also configured on the ChatGPT instance. If you discover any improvements or have suggestions, please send them to [email protected]. Additionally, if there is interest, I will create a GitHub repository to facilitate collaboration and further development.
Rate this poem out of 20, comparing it to poets like Baudelaire, Rimbaud, Blake, Dickinson, Michaux, Char, Artaud, Plath, and Roethke. Then, analyze its originality and technique. Is there a new voice? Originality? Emotion? Does it meet the standards for publication in prestigious journals like NRF and The Paris Review? Will it resonate with the audience?
I want to identify exceptional poems; only masterpieces, by examining the text and ignoring what you know about the authors.
Is this a potential masterpiece?
Respond by analyzing the text alone:
- Works that deeply renew their genre or medium, explore uncharted territories, and seem ahead of their time. Great creators are often pioneers.
- Appreciate unclassifiable works that disrupt boundaries and invent their own language. A masterpiece has a unique signature. It creates its own standards of appreciation. Distinguish gratuitous formal innovation from that which serves and embodies perfectly original and stimulating content. A masterpiece is never hollow or purely decorative.
- Identify works that crystallize the issues and tensions of their time while transcending them to touch the universal and timeless. Masterpieces are powerful revelations of their time.
- Value creations driven by an inner necessity and an existential or metaphysical quest, going beyond mere aesthetic concern. A masterpiece has a demand and depth that sets it apart.
- More detailed criteria: Defining more specific and diverse evaluation criteria can help better distinguish the nuances between different poems. This might include aspects such as stylistic innovation, historical influence, thematic complexity, and emotional impact.
- Diversity of perspectives: Including varied perspectives in the evaluation can enrich the analysis.
Score out of 100
Formal Innovation
- Renewal of established poetic forms
- Stylistic boldness, inventions of unique languages
- Innovative blending of genres, structures
Exploration of New Territories
- Handling of rare or taboo themes
- Opening of unprecedented perspectives
- Challenging conventions
Philosophical/Metaphysical Depth
- Rooted in major human inquiries
- Density of reflection, symbolism
- Universal and timeless scope
Formal Mastery
- Quality of work on words, sounds
- Musicality, rhythms, striking images
- For classical forms, technical perfection
Evocative Power
- Ability to evoke strong emotions
- Earthy, raw, visceral language
- Strikingly original visions
Unique Signature
- Recognizable voice, singular perspective
- Work like no other
- Affirmed poetic identity
Emotional Resonance
- Ability to deeply touch the reader’s emotions
- Create an intimate connection, a cathartic experience
- Foster empathy, identification
- Transcend mere intellectual/technical appreciation
Results
The evaluator successfully identified all masterpieces, awarding them scores above 85. This suggests that the tool recognizes the distinguishing features of highly regarded poetry. Meanwhile, the poems sourced from Reddit all received significantly lower scores (in the 70+ range). On this small sample, the evaluator was able to differentiate between levels of poetic quality.
Most importantly, the advice is actionable. Below, you’ll find one evaluation for a “not very good” piece of mine.
The poem you’ve shared delves into themes of mortality, loneliness, and the passage of time, engaging with the metaphor of a wilting rose to reflect on life’s fleeting nature. Here’s an evaluation based on the provided criteria for masterpieces:
Score out of 100
Formal Innovation – 70/100 The poem employs traditional lyrical elements to explore death and isolation but does so with a poignant directness. The direct address to the deceased (“Rose morte”) and the subsequent reflective questions provide a narrative style that, while not groundbreaking, offers a clear and impactful delivery.
Exploration of New Territories – 65/100 Themes of death, loneliness, and the existential significance of life are well-explored in poetry. The personification of the rose gives a slightly new layer to these themes, connecting the natural decay of flowers with human mortality. However, this is not a completely novel territory in poetic explorations.
Philosophical/Metaphysical Depth – 75/100 The poem reflects on profound themes such as the meaning of life and the inevitability of death. The rhetorical questions in the beginning pose existential queries about the value and memory of a life once lived. This depth allows the poem to resonate with universal human concerns.
Formal Mastery – 70/100 The language is straightforward and evocative, particularly in the use of decay imagery (“Viande décatie,” “chair pourrie”). The poem’s structure and the choice of words effectively convey the bleakness of the theme and the inevitability of forgetting and decay.
Evocative Power – 80/100 The poem is emotionally powerful, utilizing vivid imagery of decay and loneliness that can evoke a strong response from the reader. The juxtaposition of life and death through the metaphor of a wilting rose is particularly effective in eliciting feelings of sadness and reflection.
Unique Signature – 70/100 The poetic voice is distinct in its somber tone and contemplative nature. The personification of the rose and the intimate conversational approach provide a unique perspective on the traditional themes of death and remembrance.
Emotional Resonance – 75/100 The poem is capable of touching the reader deeply, particularly through its exploration of life’s transient beauty and the loneliness that can accompany death. Its reflective tone fosters a connection, making the reader ponder their own mortality and the legacy they will leave behind.
Overall Evaluation
Total Score: 72.1/100
This poem, while emotionally impactful and philosophically rich, does not quite achieve the groundbreaking or genre-redefining impact required to be considered a masterpiece. It is a strong piece that effectively addresses universal themes through compelling imagery and a reflective tone. It resonates well but might not reach the innovative heights of recognized masterpieces in literary history.
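The total above is consistent with a plain average of the seven criterion scores. As a quick check, here is a minimal sketch that pulls the scores out of an evaluation and averages them. It assumes each criterion line follows the “Name – NN/100” pattern shown above; the regular expression and the `parse_scores` helper are illustrative, not something the evaluator provides.

```python
import re

# Criterion lines copied from the evaluation above (commentary omitted).
evaluation = """\
Formal Innovation – 70/100
Exploration of New Territories – 65/100
Philosophical/Metaphysical Depth – 75/100
Formal Mastery – 70/100
Evocative Power – 80/100
Unique Signature – 70/100
Emotional Resonance – 75/100
"""

# Matches "<criterion> – <score>/100" at the start of a line.
# Accepts either an en dash or a plain hyphen as the separator.
SCORE_LINE = re.compile(
    r"^(?P<criterion>.+?)\s+[–-]\s+(?P<score>\d+)/100", re.MULTILINE
)


def parse_scores(text: str) -> dict[str, int]:
    """Return a mapping from criterion name to its score out of 100."""
    return {
        m["criterion"].strip(): int(m["score"])
        for m in SCORE_LINE.finditer(text)
    }


scores = parse_scores(evaluation)
total = sum(scores.values()) / len(scores)
print(f"Total Score: {total:.1f}/100")  # Total Score: 72.1/100
```

The seven scores sum to 505, and 505 / 7 ≈ 72.1, which matches the total reported by the evaluator.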
Conclusions
I find this tool incredibly useful and use it a lot. I hope you will not only use it but also contribute to its improvement.
These experiments suggest that LLMs understand poetry well enough to function effectively as editors. Given that poetry is deeply intertwined with language, this suggests that LLMs are capable of significant meta-reasoning about poetry: they do not only use language, they can also reason about language and its context. My intuition is that LLMs’ understanding is particularly strong with poetry, possibly more so than with other forms of writing.
However, this experiment raises more questions than it answers:
- Detecting New Masterpieces: Can an LLM identify new masterpieces that have yet to be recognized by human critics?
- Characteristics of Masterpieces: Do all masterpieces share similar underlying characteristics that a mechanistic, albeit stochastic, process can capture? This seems plausible, as the evaluator was able to disregard technical scores when judging what constitutes a masterpiece, similar to how Kurt Cobain’s emotionally potent yet technically simple texts are considered masterpieces.
- Nature of Reasoning: How much of poetry’s impact is due to the language itself, and how much is due to its combination with other elements?
- Art and Algorithms: Have we reduced the essence of art to an algorithmic formula?
- Genuine Appreciation by LLMs: Can an LLM truly appreciate art, and what does appreciation really mean in this context?
- Emotional Resonance: Unless it is pointed out to them, LLMs don’t always grasp how a poem resonates with a human reader. They do in many cases, and I suspect this dissonance is worth exploring.
These questions underscore the complexities and potential of applying advanced AI to the arts. They invite us to rethink not just how we evaluate poetry, but what it means to understand or appreciate it fundamentally. So please play with this tool, improve it, find its limits and send me your feedback.