Podcast Transcript Scraper

A Node.js tool that automatically downloads podcast transcripts from Podscribe.ai for entire series, handling even large collections with 700+ episodes efficiently.

Overview

Podscribe.ai hosts podcast transcripts but lacks a bulk download option. This script:

Fetches all episodes for a podcast series via the Podscribe API
Automates browser interactions to download each transcript
Organizes and saves all transcripts locally
Tracks progress to allow resuming interrupted downloads
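The resume behaviour in the last step can be pictured with a small sketch. This is not the script's actual code: the log shape (`completed` and `failed`) and the helper names `loadProgress`, `markCompleted`, and `pendingEpisodes` are assumptions made for illustration, and the real `scraper_log.json` may be structured differently.

```javascript
const fs = require('fs');

// Hypothetical log shape (an assumption, not the script's documented format):
// { completed: [episodeId, ...], failed: { [episodeId]: "error message" } }
function loadProgress(logFile) {
  // A missing log file means a fresh run with nothing completed yet.
  if (!fs.existsSync(logFile)) return { completed: [], failed: {} };
  return JSON.parse(fs.readFileSync(logFile, 'utf8'));
}

// Record one finished episode and write the log back to disk immediately,
// so an interrupted run loses at most the episode in flight.
function markCompleted(logFile, progress, episodeId) {
  if (!progress.completed.includes(episodeId)) progress.completed.push(episodeId);
  delete progress.failed[episodeId];
  fs.writeFileSync(logFile, JSON.stringify(progress, null, 2));
}

// Return only the episodes that still need downloading.
function pendingEpisodes(episodes, progress) {
  const done = new Set(progress.completed);
  return episodes.filter((episode) => !done.has(episode.id));
}
```

On restart, a run would call `loadProgress`, filter the episode list through `pendingEpisodes`, and continue from there. Writing the log after every episode costs a little I/O but keeps the resume point accurate.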

Prerequisites

Node.js (v16+)
npm or yarn

Quick Start

# Install dependencies
npm install

# Run with required Series ID parameter
node podcast-scraper.js --seriesId=123

# Run with additional custom settings
node podcast-scraper.js --seriesId=123 --outputDir=./my-transcripts

Configuration Options

Option	CLI Argument	Default	Description
Series ID	`--seriesId`	Required	Podcast series ID
Output Directory	`--outputDir`	./transcripts	Where transcripts will be saved
Download Wait Time	`--downloadWaitTime`	5000	Wait time for downloads (ms)
Request Delay	`--requestDelay`	2000	Delay between requests (ms)
Max Retries	`--maxRetries`	3	Retry attempts for failed downloads
Log File	`--logFile`	./scraper_log.json	Progress log location
Headless Mode	`--headless`	false	Run browser invisibly when true

For help with all options:

node podcast-scraper.js --help
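As a rough illustration of how the `--key=value` arguments and the defaults in the table above could fit together, here is a minimal sketch. The `DEFAULTS` object mirrors the table, but `parseArgs` is a hypothetical helper; the script may well use an argument-parsing library instead.

```javascript
// Defaults copied from the options table; seriesId has no default because it is required.
const DEFAULTS = {
  seriesId: null,
  outputDir: './transcripts',
  downloadWaitTime: 5000,
  requestDelay: 2000,
  maxRetries: 3,
  logFile: './scraper_log.json',
  headless: false,
};

// Hypothetical parser for "--key=value" arguments (illustrative only).
function parseArgs(argv) {
  const config = { ...DEFAULTS };
  for (const arg of argv) {
    const match = /^--([^=]+)=(.*)$/.exec(arg);
    // Skip malformed arguments and keys that are not in the table.
    if (!match || !(match[1] in DEFAULTS)) continue;
    const [, key, raw] = match;
    if (typeof DEFAULTS[key] === 'number') config[key] = Number(raw);
    else if (typeof DEFAULTS[key] === 'boolean') config[key] = raw === 'true';
    else config[key] = raw;
  }
  if (!config.seriesId) throw new Error('--seriesId is required');
  return config;
}

// Typical use: const config = parseArgs(process.argv.slice(2));
```

Coercing each value to the type of its default keeps numeric options such as `--requestDelay` usable in timers without further conversion.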

Example Commands

# Basic usage with required Series ID
node podcast-scraper.js --seriesId=123

# Custom series and output location
node podcast-scraper.js --seriesId=359 --outputDir=./my-transcripts

# Increase the delay between requests to avoid rate limiting
node podcast-scraper.js --seriesId=123 --requestDelay=5000

# Run without visible browser
node podcast-scraper.js --seriesId=123 --headless=true

Key Features

Configurable: Command-line options for all settings
Resilient: Progress tracking and automatic retries
Rate-Limited: Prevents overwhelming the server
Organized: Consistent file naming with metadata

Troubleshooting

Failed Downloads: Increase --requestDelay (default: 2000 ms)
UI Interaction Issues: Run with --headless=false (the default) to watch the browser
Timeouts: Increase --downloadWaitTime (default: 5000 ms)
Persistent Failures: Review scraper_log.json (or the file set with --logFile) for error details

Technical Details

The script uses:

Puppeteer: Browser automation
Axios: API requests
p-throttle: Rate limiting
fs-extra: File operations
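To show how the request delay and retry settings might interact, here is a dependency-free sketch. It deliberately uses plain timers rather than p-throttle, and the helper names `sleep`, `withRetries`, and `processSequentially` are assumptions for illustration, not functions from the script. It also assumes that `maxRetries` counts retries after the first attempt.

```javascript
// Resolve after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run an async task, retrying up to maxRetries times after the first failure
// (assumption: maxRetries = 3 means at most 4 attempts in total).
async function withRetries(task, maxRetries = 3, requestDelay = 2000) {
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, but not after the final one.
      if (attempt < maxRetries) await sleep(requestDelay);
    }
  }
  throw lastError;
}

// Process items one at a time, pausing requestDelay ms between them
// so the server never sees a burst of requests.
async function processSequentially(items, handler, requestDelay = 2000) {
  const results = [];
  for (const [index, item] of items.entries()) {
    if (index > 0) await sleep(requestDelay);
    results.push(await handler(item));
  }
  return results;
}
```

A download loop could combine the two, for example `processSequentially(episodes, (ep) => withRetries(() => download(ep)))`, where `download` stands in for whatever function fetches a single transcript.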

Disclaimer

This tool is for educational purposes. Ensure you have permission to download content and respect website terms of service and rate limits.
