mex7xx/podcast-transcript-scraper

Podcast Transcript Scraper

A Node.js tool that automatically downloads podcast transcripts from Podscribe.ai for entire series, handling even large collections with 700+ episodes efficiently.

Overview

Podscribe.ai hosts podcast transcripts but lacks a bulk download option. This script:

  • Fetches all episodes for a podcast series via the Podscribe API
  • Automates browser interactions to download each transcript
  • Organizes and saves all transcripts locally
  • Tracks progress to allow resuming interrupted downloads
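The resume behaviour in the last bullet can be sketched as follows. This is a minimal illustration, not the script's actual code: the log shape (`completed`, `failed`), the `id` field on episodes, and the helper names `loadProgress`, `saveProgress`, and `pendingEpisodes` are all assumptions.

```javascript
const fs = require('fs');

// Read the progress log, or start fresh if it does not exist yet.
// The { completed, failed } shape is hypothetical.
function loadProgress(logFile) {
  try {
    return JSON.parse(fs.readFileSync(logFile, 'utf8'));
  } catch (err) {
    if (err.code === 'ENOENT') return { completed: [], failed: {} };
    throw err; // a corrupt or unreadable log should not be silently ignored
  }
}

// Persist the log after each episode so an interrupted run loses little work.
function saveProgress(logFile, progress) {
  fs.writeFileSync(logFile, JSON.stringify(progress, null, 2));
}

// Keep only the episodes whose ids are not already marked as completed.
function pendingEpisodes(episodes, progress) {
  const done = new Set(progress.completed);
  return episodes.filter((episode) => !done.has(episode.id));
}
```

On a restart, a run would call `loadProgress`, pass the full episode list through `pendingEpisodes`, and download only what remains.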

Prerequisites

  • Node.js (v16+)
  • npm or yarn

Quick Start

# Install dependencies
npm install

# Run with required Series ID parameter
node podcast-scraper.js --seriesId=123

# Run with additional custom settings
node podcast-scraper.js --seriesId=123 --outputDir=./my-transcripts

Configuration Options

| Option             | CLI Argument       | Default            | Description                         |
|--------------------|--------------------|--------------------|-------------------------------------|
| Series ID          | --seriesId         | Required           | Podcast series ID                   |
| Output Directory   | --outputDir        | ./transcripts      | Where transcripts will be saved     |
| Download Wait Time | --downloadWaitTime | 5000               | Wait time for downloads (ms)        |
| Request Delay      | --requestDelay     | 2000               | Delay between requests (ms)         |
| Max Retries        | --maxRetries       | 3                  | Retry attempts for failed downloads |
| Log File           | --logFile          | ./scraper_log.json | Progress log location               |
| Headless Mode      | --headless         | false              | Run browser invisibly when true     |
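
To illustrate how `--key=value` arguments might be merged with these defaults, here is a minimal sketch. The default values come from the options listed above; the `parseArgs` helper and its parsing rules are assumptions and may differ from what the script does.

```javascript
// Defaults taken from the documented options; seriesId has none because it is required.
const DEFAULTS = {
  outputDir: './transcripts',
  downloadWaitTime: 5000,
  requestDelay: 2000,
  maxRetries: 3,
  logFile: './scraper_log.json',
  headless: false,
};

// Hypothetical parser: accepts --key=value pairs and coerces each value
// to the type of its default (number, boolean, or string).
function parseArgs(argv) {
  const options = { ...DEFAULTS };
  for (const arg of argv) {
    const match = /^--([^=]+)=(.*)$/.exec(arg);
    if (!match) continue; // skip anything that is not --key=value
    const [, key, raw] = match;
    const known = Object.prototype.hasOwnProperty.call(DEFAULTS, key);
    if (key !== 'seriesId' && !known) continue; // skip unknown options
    const fallback = DEFAULTS[key];
    if (typeof fallback === 'number') {
      const value = Number(raw);
      if (!Number.isFinite(value)) throw new Error(`--${key} must be a number`);
      options[key] = value;
    } else if (typeof fallback === 'boolean') {
      options[key] = raw === 'true';
    } else {
      options[key] = raw; // strings, including seriesId
    }
  }
  if (!options.seriesId) throw new Error('--seriesId is required');
  return options;
}
```

A script entry point would typically call it as `parseArgs(process.argv.slice(2))`.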

For help with all options:

node podcast-scraper.js --help

Example Commands

# Basic usage with required Series ID
node podcast-scraper.js --seriesId=123

# Custom series and output location
node podcast-scraper.js --seriesId=359 --outputDir=./my-transcripts

# Avoid rate limiting
node podcast-scraper.js --seriesId=123 --requestDelay=5000

# Run without visible browser
node podcast-scraper.js --seriesId=123 --headless=true

Key Features

  • Configurable: Command-line options for all settings
  • Resilient: Progress tracking and automatic retries
  • Rate-Limited: Prevents overwhelming the server
  • Organized: Consistent file naming with metadata
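
The automatic retries mentioned above could look roughly like this. It is a sketch under assumptions: `withRetries` is a hypothetical helper, and it reads `--maxRetries` as the number of additional attempts after the first one, which may not match the script's interpretation.

```javascript
// Resolve after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run an async task, retrying on failure with a fixed pause between attempts.
// Makes one initial attempt plus up to maxRetries retries, then rethrows the last error.
async function withRetries(task, { maxRetries = 3, delayMs = 2000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < maxRetries) await sleep(delayMs); // no pause after the final failure
    }
  }
  throw lastError;
}
```

A download step would be wrapped as `await withRetries(() => downloadTranscript(episode))`, where `downloadTranscript` stands in for whatever function performs the browser interaction.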

Troubleshooting

  • Failed Downloads: Increase --requestDelay (default: 2000ms)
  • UI Interaction Issues: Run with --headless=false (the default) to watch the browser as it works
  • Timeouts: Increase --downloadWaitTime (default: 5000ms)
  • Check Failures: Review scraper_log.json for error details

Technical Details

The script uses:

  • Puppeteer: Browser automation
  • Axios: API requests
  • p-throttle: Rate limiting
  • fs-extra: File operations
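
To show the idea behind the rate limiting, here is a dependency-free sketch of spacing calls at least `intervalMs` apart. It does not use p-throttle and is not how that library is implemented; `createScheduler` and `createThrottle` are hypothetical names introduced for illustration.

```javascript
// Tracks the next free time slot. Given the current time in ms, returns
// how long a new call must wait so that starts are at least intervalMs apart.
function createScheduler(intervalMs) {
  let nextSlot = 0;
  return function reserve(now) {
    const start = Math.max(now, nextSlot);
    nextSlot = start + intervalMs;
    return start - now;
  };
}

// Wraps the scheduler: each task starts after its reserved wait
// and the returned promise settles with the task's result or error.
function createThrottle(intervalMs) {
  const reserve = createScheduler(intervalMs);
  return (task) =>
    new Promise((resolve, reject) => {
      const wait = reserve(Date.now());
      setTimeout(() => Promise.resolve().then(task).then(resolve, reject), wait);
    });
}
```

With `const throttle = createThrottle(2000)`, calling `throttle(() => fetchEpisode(id))` for each id (where `fetchEpisode` is a placeholder) would start the requests roughly two seconds apart, matching the default `--requestDelay`.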

Disclaimer

This tool is for educational purposes. Ensure you have permission to download content and respect website terms of service and rate limits.
