How I Built an Audio Analysis Pipeline From Scratch

Matt / April 23, 2026 / AI, Development, Integration, Music, Personal, Projects, Salesforce

In the last post, I mentioned that Spotify deprecated their Audio Features API – the endpoint that used to return energy, danceability, valence, and a dozen other musical characteristics for any track. It just disappeared. No replacement. No warning. No timeline.

That was the moment this project stopped being a submission tracker and started becoming something I hadn’t planned to build.

If you want to understand why a track fits a playlist – or doesn’t – you need to understand what the track actually sounds like. Not the genre tag in a database. Not the label’s marketing copy. The actual sonic character of the music: how intense it is, how emotionally positive it feels, whether it has a strong beat, how acoustic or electronic the production is. Playlist curators respond to these qualities intuitively. My Scoring Engine needs to respond to them systematically.

Without Spotify’s API, I had to compute all of it myself.

The Thirty-Second Advantage

Here’s the thing about audio analysis that I didn’t fully appreciate before I dug into it: you don’t need the whole song.

Spotify (and Deezer, and Apple Music) surfaces 30-second preview clips for most tracks in their catalog. These previews are free to access, reasonably representative of the track’s overall character, and – crucially – small enough to analyze in memory without writing anything to disk.

Thirty seconds of audio at 22,050 Hz is a float32 array of about 661,500 samples. That’s enough signal to measure tempo, key, spectral energy, rhythmic structure, and a half-dozen other meaningful dimensions. The academic music information retrieval field has been working with clips this size for decades.

So the pipeline starts there: get the preview clip, decode it into raw samples, hand it to the analysis engine.
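Here’s a minimal sketch of that first step. load_preview is a hypothetical helper, and it assumes the requests library plus an MP3-capable decoder backend for librosa (soundfile or audioread/ffmpeg):

```python
import io

import librosa
import requests

def load_preview(preview_url: str, sr: int = 22050):
    """Fetch a 30-second preview clip and decode it entirely in memory."""
    resp = requests.get(preview_url, timeout=10)
    resp.raise_for_status()
    # librosa.load accepts file-like objects, so the MP3 never touches disk.
    y, sr = librosa.load(io.BytesIO(resp.content), sr=sr, mono=True)
    return y, sr  # y is a float32 array: ~661,500 samples for 30 s at 22,050 Hz
```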

Why Three Sources, Not One

You’d think Spotify would be the obvious — and only — source for Spotify track previews. In practice it’s unreliable. Around 30% of the catalog simply has no preview URL in the Spotify API response, usually due to label restrictions. The field is null. No preview, no analysis.

So the pipeline uses a three-source fallback chain. Spotify is still the first call — a single batch request resolves preview URLs for up to 50 tracks at once, which is fast and covers the majority of the catalog. But for every track Spotify can’t serve, the pipeline turns to Deezer. Their API is fast, doesn’t require OAuth for basic lookups, and returns previews for a meaningfully higher percentage of tracks. Apple Music is the third option, reached only when both Spotify and Deezer come up empty.

In practice: Spotify resolves most of the catalog in one round trip, Deezer handles the bulk of the remainder, and Apple Music catches the long tail. The combination pushes coverage high enough that the analysis is meaningful across the full catalog rather than patchy.
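Here’s a hedged sketch of that fallback chain. The track dict shape and the single-track Spotify call are simplifications (the real pipeline batches up to 50 IDs per request, as described above), but preview_url, preview, and previewUrl are the actual field names in the Spotify, Deezer, and iTunes Search API responses:

```python
import requests

def resolve_preview(track: dict, spotify_token: str) -> str | None:
    """Try Spotify, then Deezer, then Apple; return the first preview URL found."""
    # 1. Spotify -- preview_url is null for roughly 30% of the catalog.
    r = requests.get(
        f"https://api.spotify.com/v1/tracks/{track['spotify_id']}",
        headers={"Authorization": f"Bearer {spotify_token}"},
        timeout=10,
    )
    url = r.json().get("preview_url")
    if url:
        return url

    # 2. Deezer -- no OAuth required for basic search lookups.
    query = f"{track['artist']} {track['title']}"
    r = requests.get("https://api.deezer.com/search",
                     params={"q": query, "limit": 1}, timeout=10)
    hits = r.json().get("data", [])
    if hits and hits[0].get("preview"):
        return hits[0]["preview"]

    # 3. Apple -- the iTunes Search API exposes a previewUrl field.
    r = requests.get("https://itunes.apple.com/search",
                     params={"term": query, "media": "music", "limit": 1},
                     timeout=10)
    results = r.json().get("results", [])
    if results and results[0].get("previewUrl"):
        return results[0]["previewUrl"]

    return None  # all three sources came up empty
```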

What the Engine Actually Measures

The analysis library at the core of this is librosa – an open-source Python library for music and audio analysis. It’s the standard tool in academic music information retrieval research, and it runs entirely on the raw waveform.

No metadata. No genre tags. Just the signal.

From a 30-second clip, librosa extracts:

  • Spectral centroid – the “brightness” of the sound, measured as the frequency centre of mass
  • Spectral flatness – how noise-like vs tonal the signal is
  • Zero-crossing rate – how often the waveform crosses the zero line, a rough proxy for high-frequency content and vocal presence
  • Spectral rolloff – the frequency below which 85% of the total energy falls
  • RMS energy – the root-mean-square amplitude, correlating with perceived loudness
  • MFCCs – Mel-frequency cepstral coefficients, the standard fingerprint for timbre and vocal texture
  • Chroma features – energy distributed across the 12 pitch classes (C, C#, D, and so on)
  • Onset strength envelope – the timing and intensity of musical attack events (drum hits, chord strikes, note starts)
  • Beat positions – where the beat tracker identifies metrically strong moments in the onset envelope

None of these are musical features by themselves. They’re signal measurements. The interesting engineering work is turning them into something a musician – or a curator – would recognize.
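For reference, here’s roughly what pulling those raw measurements looks like in librosa. The mean-pooling here is a simplification for illustration, collapsing each feature to a single summary value:

```python
import librosa
import numpy as np

def raw_measurements(y: np.ndarray, sr: int) -> dict:
    """Extract the raw signal measurements listed above from a 30 s clip."""
    onset_env = librosa.onset.onset_strength(y=y, sr=sr)
    tempo, beats = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr)
    return {
        "spectral_centroid": librosa.feature.spectral_centroid(y=y, sr=sr).mean(),
        "spectral_flatness": librosa.feature.spectral_flatness(y=y).mean(),
        "zero_crossing_rate": librosa.feature.zero_crossing_rate(y).mean(),
        "spectral_rolloff": librosa.feature.spectral_rolloff(
            y=y, sr=sr, roll_percent=0.85).mean(),
        "rms": librosa.feature.rms(y=y).mean(),
        "mfcc": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1),
        "chroma": librosa.feature.chroma_stft(y=y, sr=sr).mean(axis=1),
        "onset_env": onset_env,   # kept whole for tempo and meter analysis
        "tempo": float(tempo),
        "beats": beats,           # beat positions as frame indices
    }
```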

The Feature Layer

On top of the raw signal measurements sits a layer of purpose-built formulas that translate signal into musical meaning. This is where the proprietary work lives, so I’ll describe what each feature measures without going into how I compute it.

Energy – How intense, dense, and produced a track sounds. Heavy electronic production scores high. A solo acoustic guitar scores low. Energy is not the same as volume; two tracks at the same loudness can have very different energy profiles depending on their spectral character.

Danceability – How strong and consistent the beat is. This is specifically about beat salience – whether the beat stands out clearly above the surrounding musical activity – not just whether the track has a fast tempo. A slow, heavily syncopated groove can outscore a fast track with a weak kick.

Acousticness – How natural and unplugged the track sounds. Acoustic instruments produce clean, harmonic tones; electronic production adds noise and brightness. This dimension separates the two.

Valence – The emotional positivity of the track. High valence is euphoric and happy; low valence is dark or melancholic. Key (major vs minor) is the strongest single predictor here, but it’s not the whole story – plenty of minor-key tracks sound upbeat, and vice versa.

Instrumentalness – The probability of no lead vocals. Close to 1.0 means instrumental; close to 0.0 means the track almost certainly has a singer.

Liveness – The likelihood the track was recorded in front of a live audience. Studio recordings have tightly controlled acoustics; live recordings have the ambient chaos of a room full of people.

Speechiness – The presence of spoken words. Below a certain threshold: music. Above it: rap, spoken word, or a podcast.

Loudness – Overall perceived loudness in decibels, measured the way an audio engineer would measure it: as the average power of the signal, converted to the logarithmic dB scale. Ranges from approximately −60 dB (near silence) to 0 dB (maximum amplitude).
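Loudness is the one measurement described explicitly enough to sketch without giving anything away, since it’s the standard engineering definition:

```python
import numpy as np

def loudness_db(y: np.ndarray, eps: float = 1e-10) -> float:
    """Average signal power on the logarithmic dB scale.
    0 dB is full-scale; silence tends toward -inf, clamped here by eps."""
    mean_power = float(np.mean(y ** 2))  # average power of the waveform
    return float(10.0 * np.log10(mean_power + eps))
```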

Tempo – Beats per minute, detected from the periodic structure of the onset strength envelope.

Key – The musical key of the track (C through B), determined using the Krumhansl-Schmuckler key-finding algorithm – a standard technique from music psychology research that compares the track’s harmonic content against known perceptual profiles for each key.

Mode – Whether the track is in a major or minor scale. Major keys are conventionally associated with positive emotional character; minor keys with tension or melancholy. Mode is the strongest single predictor of perceived valence in the Scoring Engine.
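Since Krumhansl-Schmuckler is published research rather than part of the proprietary layer, a textbook sketch is fair game. The profile values below are the standard Krumhansl-Kessler listener ratings from the music psychology literature; a production implementation may differ in the details:

```python
import librosa
import numpy as np

# Krumhansl-Kessler perceptual profiles: how strongly each of the 12 pitch
# classes "belongs" in a major or minor key, per listener ratings.
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
KEYS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def estimate_key(y: np.ndarray, sr: int) -> tuple[str, str]:
    """Correlate the clip's average chroma against all 24 rotated key profiles."""
    chroma = librosa.feature.chroma_stft(y=y, sr=sr).mean(axis=1)
    best_r, best_key, best_mode = -2.0, "C", "major"
    for i in range(12):
        rotated = np.roll(chroma, -i)  # put the candidate tonic at position 0
        for profile, mode in ((MAJOR, "major"), (MINOR, "minor")):
            r = np.corrcoef(rotated, profile)[0, 1]
            if r > best_r:
                best_r, best_key, best_mode = r, KEYS[i], mode
    return best_key, best_mode  # e.g. ("A", "minor")
```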

Time Signature – Whether the track is in 4/4 (standard pop and rock) or 3/4 (waltz time), determined by comparing the autocorrelation of the beat structure at three-beat versus four-beat intervals.
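Sketched in the same spirit, reusing the onset envelope and beat frames from the extraction step earlier (this is a reading of the description above, not the exact production code):

```python
import librosa
import numpy as np

def estimate_meter(onset_env: np.ndarray, beats: np.ndarray) -> int:
    """Compare onset autocorrelation at 3-beat vs 4-beat lags."""
    beat_period = int(np.median(np.diff(beats)))  # frames per beat
    ac = librosa.autocorrelate(onset_env)
    lag3, lag4 = 3 * beat_period, 4 * beat_period
    if lag4 >= len(ac):
        return 4                             # clip too short to compare
    return 4 if ac[lag4] >= ac[lag3] else 3  # 4/4 vs 3/4
```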

Duration – The full track length in milliseconds, retrieved directly from the streaming service API rather than inferred from the preview clip.

That’s thirteen features computed per track, all derived from a 30-second audio clip and a set of signal processing formulas I designed and tuned against my own catalog. Thirteen features is too many to reason about directly. The Scoring Engine takes those raw measurements and condenses them into three composite scores – and that’s where the next post picks up.

The Microservice Architecture

The pipeline runs as a microservice – a scheduled, stateless process that knows nothing about my Salesforce application except the API credentials it needs to write results back. It queries Salesforce for tracks that need analysis, fetches preview audio, runs the librosa pipeline, applies the scoring formulas, and bulk-updates the results.
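In outline, the service looks something like the sketch below. simple_salesforce is one way to handle the Salesforce round trip; the Track__c object and its field names are placeholders, not the real schema, and the helpers are the sketches from earlier in this post:

```python
from simple_salesforce import Salesforce

def run_pipeline(sf: Salesforce, spotify_token: str) -> None:
    """One stateless pass: query, fetch, analyze, bulk-update."""
    rows = sf.query(
        "SELECT Id, Spotify_Id__c, Artist__c, Title__c "
        "FROM Track__c WHERE Analysis_Status__c = 'Pending'"
    )["records"]

    updates = []
    for row in rows:
        track = {"spotify_id": row["Spotify_Id__c"],
                 "artist": row["Artist__c"],
                 "title": row["Title__c"]}
        url = resolve_preview(track, spotify_token)  # three-source fallback
        if url is None:
            continue                                 # no preview, no analysis
        y, sr = load_preview(url)
        feats = raw_measurements(y, sr)
        updates.append({
            "Id": row["Id"],
            "Loudness__c": loudness_db(y),
            "Tempo__c": feats["tempo"],
            # ...the proprietary formulas populate the remaining fields
        })

    # One bulk call writes every result back; nothing persists locally.
    sf.bulk.Track__c.update(updates)
```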

Keeping the analysis logic outside of Salesforce was a deliberate architectural choice. Signal processing at this scale doesn’t belong in Apex. The microservice can be updated, redeployed, and rerun independently of anything in the Salesforce org. When the scoring formulas improve – and they will – I rerun the pipeline and the scores update everywhere automatically.

For the playlist side of the engine, a second microservice runs the same analysis on every track in every active Spotify playlist in the database. That’s a much larger job: roughly 3,500 playlists, each with dozens to hundreds of tracks, many of which will change week to week as curators update their lists. That microservice runs incrementally, processing only new or changed tracks and building up a statistical profile of each playlist over time.

Once both pipelines have run, the Scoring Engine has what it needs: a set of composite scores for a submitted track and a corresponding set of averages for every playlist in the database. Playlist fit is a function of how close those scores are across all three dimensions.
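The exact formula is next post’s territory, but as a purely illustrative stand-in, one plausible closeness measure is an inverted, normalized Euclidean distance across the three composite scores:

```python
import numpy as np

def playlist_fit(track_scores: np.ndarray, playlist_avg: np.ndarray) -> float:
    """Map distance across three score dimensions to a 0-1 fit value."""
    dist = np.linalg.norm(track_scores - playlist_avg)
    max_dist = np.linalg.norm(np.ones_like(track_scores))  # scores assumed in [0, 1]
    return 1.0 - float(dist) / float(max_dist)
```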

What’s Next

The audio features and composite scores are live in Salesforce. Every track in my catalog has been analyzed. Every active Spotify playlist in my database is accumulating a profile.

The next post goes deeper into the Scoring Engine itself: how the three composite scores are designed to be directly comparable across tracks and playlists, why I chose these three dimensions over others, and what the theoretical frameworks from music psychology and music information retrieval research have to do with whether a punk track fits a curation playlist.

It’s the part of this project I’m most proud of. See you there.


Matt McGuire is an independent punk artist and Salesforce architect. He’s presenting “How Agentforce Helped Me Go Punk (And Viral)” at True North Dreamin’ in May 2026.


About Matt

Matt McGuire is a Salesforce architect, AI builder, and punk musician based in Toronto. He’s Canada's #1 certified Salesforce professional, 42× certified across architecture, development, AI, and a wide range of platform products. He’s been building on Salesforce for 17 years and currently spends most of his time at the intersection of AI and the platform. The Music Intelligence Engine is his most interesting project to date. He thinks you should read the whole series.
