I track movies and TV shows on my blog’s Almanac page. It’s a timeline of everything I’ve watched, sorted by date. I liked it. Then I thought: what about games?
I have 367 games on my PSN account. 108 Platinums. Over 3,500 trophies. That data is sitting on PSNProfiles, publicly visible. How hard could it be to pull it into my blog?
Pretty hard, actually.
The scraper
PSNProfiles doesn’t have an API. So I wrote a Python scraper using Playwright and BeautifulSoup. Playwright handles the browser automation; BeautifulSoup parses the HTML.
The scraper navigates to my profile, waits for the page to load, then extracts each game row: title, platform, trophy breakdown, completion percentage, rank, and cover image URL.
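The row-parsing step looks roughly like this (a sketch only; the selectors and field names are placeholders, not PSNProfiles’ real markup):

```python
from bs4 import BeautifulSoup

def parse_games(html: str) -> list[dict]:
    # html is the rendered page from Playwright (page.content()).
    # Selectors below are illustrative, not PSNProfiles' actual classes.
    soup = BeautifulSoup(html, "html.parser")
    games = []
    for row in soup.select("tr.game-row"):
        games.append({
            "title": row.select_one(".title").get_text(strip=True),
            "platform": row.select_one(".platform").get_text(strip=True),
            "trophies": row.select_one(".trophy-count").get_text(strip=True),
            "completion": row.select_one(".completion").get_text(strip=True),
            "rank": row.select_one(".rank").get_text(strip=True),
            "cover": row.select_one("img")["src"],
        })
    return games
```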
Simple enough in theory. Then Cloudflare showed up.
Cloudflare ruins everything
PSNProfiles sits behind Cloudflare. You can’t just requests.get() the page. You get a challenge, and if you don’t solve it, you get nothing.
My workaround: Playwright with a persistent browser context. First run uses --headed mode so I can manually solve the Cloudflare challenge. The browser session is saved to a browser_data/ directory. Future runs reuse that session headlessly.
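In code, the session handling is roughly this (a sketch, not my actual script; the profile URL is a placeholder):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # browser_data/ keeps cookies between runs, so the solved Cloudflare
    # challenge survives. First run: headless=False and solve it by hand.
    context = p.chromium.launch_persistent_context(
        "browser_data",
        headless=False,  # switch to True once the session is established
    )
    page = context.new_page()
    page.goto("https://psnprofiles.com/your-psn-id")  # placeholder profile URL
    page.wait_for_load_state("networkidle")
    # ... scrape the game table, then dump_cf_clearance(context) ...
    context.close()
```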
But I also needed the Cloudflare cookie for downloading images later. So the scraper dumps the cf_clearance cookie to a text file after each run. Feels hacky. Works fine.
```python
def dump_cf_clearance(context):
    cookies = context.cookies()
    for cookie in cookies:
        if cookie["name"] == "cf_clearance":
            with open("data/cf_clearance.txt", "w") as f:
                f.write(cookie["value"])
```

The image problem
PSNProfiles hosts game covers on img.psnprofiles.com. I tried hotlinking them. 403. Tried routing through wsrv.nl as a proxy. Also 403. Cloudflare blocks everything that doesn’t come from a real browser session.
So I gave up on hotlinking and mirrored every image locally. A TypeScript script reads games.json, extracts each game’s ID from the cover URL, and downloads the large version using the stolen cf_clearance cookie.
```typescript
function extractGameId(coverUrl: string): string | null {
  const match = coverUrl.match(/game\/[sl]\/(\d+)\//);
  return match ? match[1] : null;
}
```

364 out of 367 games downloaded. The missing three had broken URLs on PSNProfiles itself. I can live with that.
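The downloader itself is TypeScript, but each request is just a GET with the cookie attached. A rough Python equivalent of that one request (paths and header values are illustrative, not what my script actually sends):

```python
import requests

def download_cover(cover_url: str, game_id: str, cf_clearance: str) -> None:
    # Cloudflare checks cf_clearance together with the User-Agent, so the
    # UA should match the browser session that solved the challenge.
    resp = requests.get(
        cover_url,
        headers={
            "Cookie": f"cf_clearance={cf_clearance}",
            "User-Agent": "Mozilla/5.0",  # placeholder; match the scraping browser
            "Referer": "https://psnprofiles.com/",
        },
        timeout=30,
    )
    resp.raise_for_status()
    # Destination path is an assumption about the project layout.
    with open(f"public/images/psn/{game_id}.jpg", "wb") as f:
        f.write(resp.content)
```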
Trophy parsing is weird
PSNProfiles formats trophy counts differently depending on completion. If you’ve earned 12 out of 91 trophies, it says “12 of 91 Trophies”. Normal.
But if you’ve earned all of them, it says “All 91 Trophies”. My regex only handled the first format. Every Platinum game showed NaN trophies. Took me longer than I’d like to admit to notice.
```python
trophy_match = re.search(r"(\d+)\s*of\s*(\d+)\s*Trophies?", trophy_text)
all_match = re.search(r"All\s*(\d+)\s*Trophies?", trophy_text)
```

There was also a whitespace issue. BeautifulSoup’s get_text(strip=True) sometimes removes spaces between words, turning “1 of 91 Trophies” into “1of91Trophies”. The regex needed \s* instead of literal spaces. Classic scraping pain.
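Put together, the fix ends up as something like this (a sketch, not the exact function in the scraper):

```python
import re

def parse_trophies(trophy_text: str) -> tuple[int, int] | None:
    """Return (earned, total) from PSNProfiles trophy text, or None."""
    # "12 of 91 Trophies" -> (12, 91); \s* instead of literal spaces
    # because get_text(strip=True) can glue the tokens together.
    m = re.search(r"(\d+)\s*of\s*(\d+)\s*Trophies?", trophy_text)
    if m:
        return int(m.group(1)), int(m.group(2))
    # "All 91 Trophies" -> (91, 91), the 100% case the first regex missed.
    m = re.search(r"All\s*(\d+)\s*Trophies?", trophy_text)
    if m:
        return int(m.group(1)), int(m.group(1))
    return None
```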
The sync pipeline
Updating the blog with fresh game data involves three steps, glued together with a bash script:
- Copy games.json from the scraper directory into the Astro project
- Read the cf_clearance cookie
- Run the Bun image downloader to grab any missing covers
```bash
#!/bin/bash
cp ../psngames/data/games.json src/data/psn/games.json
CF_COOKIE=$(cat ../psngames/data/cf_clearance.txt)
bun run scripts/download-psn-covers.ts "$CF_COOKIE"
```

Not elegant. But it runs in under a minute and I don’t have to think about it.
Putting it on the page
The Almanac already had utilities for movies and TV shows. I added a psn.ts utility that normalizes the scraped data and a GameCard.astro component for rendering.
One design issue: PSNProfiles serves covers in two sizes. Some are 320x176 landscape banners, others are 320x320 squares. My initial card used a portrait aspect ratio, which cropped everything badly. Switching to aspect-square with object-contain fixed it. The landscape banners sit inside the square with some padding. Not perfect, but nothing gets cut off.
The unified almanac view merges movies, shows, and games into one chronological list. A function in almanac.ts sorts everything by date. Now my Almanac page shows what I watched, what I played, and when.
What I’d do differently
The Cloudflare dance is fragile. If my browser session expires, I have to manually re-solve the challenge. There’s no way around this without paying for a proxy service, and I’m not doing that for a hobby project.
I also should have started with local image mirroring instead of trying three different proxy services first. Would have saved me an afternoon.
Anyway
The whole thing is held together with a Python scraper, a bash script, a TypeScript image downloader, and a Cloudflare cookie I’m smuggling between processes. It’s not pretty. But my blog now shows 367 games with trophy progress, and the Almanac page has a unified timeline of everything I consume.
Sometimes the best architecture is the one you can explain in one sentence: scrape it, download the images, render it.