Sales & Demand Forecasting for Book Titles

Technical report on hybrid time-series forecasting for book sales.

📊 Project Overview

This project forecasts weekly demand for selected best-selling books over a 32-week horizon. It combines classical time-series modelling (SARIMA) with gradient-boosted trees (XGBoost) to capture both linear seasonal patterns and nonlinear effects.

📂 Data Sources

Historical Sales Data (2008–2014): Weekly transaction records (~200,000 rows)
Books Metadata: ISBN, title, author, imprint, publisher group, product class, and category (~500 unique titles)
Variables: Sales volume, value, average selling price (ASP), recommended retail price (RRP), binding type, and time interval
Frequency: Weekly time series per title (aggregated from raw transactions)
Titles: Including The Alchemist and The Very Hungry Caterpillar.

💼 Business Problem

Accurate demand forecasting is critical for publishers and retailers to balance inventory, pricing, and marketing decisions.
Traditional forecasting models often underperform due to seasonal volatility, promotions, and title-specific behavior.

This project aimed to build a robust forecasting framework that predicts weekly sales 32 weeks ahead for each book title, helping the publisher:

Reduce stockouts and overstock
Optimize reprint timing
Improve pricing and promotion planning

⚙️ Approach & Methodology

We developed a hybrid time-series forecasting pipeline that integrates statistical and machine learning models to improve prediction accuracy, interpretability, and strategic usability.
The workflow followed five structured phases:

1️⃣ Data Preparation

Cleaned and aligned weekly time series for each ISBN.
Handled missing weeks using interpolation and smoothing techniques.
Removed outliers caused by bulk promotions or anomalous spikes.
Resampled data to a consistent weekly frequency, filling gaps with zeros.
Book Selection: Focused modeling on two representative titles:
- The Alchemist — steady, long-term sales pattern.
- The Very Hungry Caterpillar — strong seasonal variation.
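
The resampling step can be sketched with pandas as follows. This is a minimal illustration rather than the project's actual pipeline: the column names `end_date` and `volume` and the helper `to_weekly_series` are hypothetical, and the sample rows are made up for the demonstration.

```python
import pandas as pd

def to_weekly_series(transactions: pd.DataFrame) -> pd.Series:
    """Aggregate one title's raw rows into a gap-free weekly sales series.

    'end_date' and 'volume' are hypothetical column names.
    """
    sales = transactions.set_index("end_date")["volume"].sort_index()
    # "W" creates week bins ending on Sunday; weeks with no rows sum to 0.
    return sales.resample("W").sum()

# Illustrative rows only (not project data).
raw = pd.DataFrame({
    "end_date": pd.to_datetime(
        ["2008-01-06", "2008-01-06", "2008-01-13", "2008-01-27"]
    ),
    "volume": [5, 3, 2, 4],
})
weekly = to_weekly_series(raw)
print(weekly.tolist())  # [8, 2, 0, 4]: the week ending 2008-01-20 had no sales
```

Zero-filling treats a missing week as a week with no sales. If a gap instead reflects missing records, interpolation may be the better choice for that span.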

2️⃣ Exploratory Analysis

Conducted time series decomposition to separate trend, seasonality, and residual components.
Performed ACF/PACF analysis and stationarity testing to identify autocorrelation structures.
Applied Auto ARIMA to establish an initial statistical baseline and optimal parameters for trend modeling.
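
As a rough illustration of the autocorrelation analysis, the sample ACF can be computed directly with NumPy. The helper name `sample_acf` and the toy series are assumptions for this sketch. In practice a library such as statsmodels (`acf`, `adfuller`) would normally be used for ACF and stationarity testing.

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelation of y for lags 0..max_lag."""
    y = np.asarray(y, dtype=float)
    centered = y - y.mean()
    denom = np.sum(centered ** 2)
    return np.array([
        np.sum(centered[lag:] * centered[:len(y) - lag]) / denom
        for lag in range(max_lag + 1)
    ])

# Toy series with a period of 4 (standing in for a yearly cycle of 52 weeks).
toy = np.tile([1.0, 2.0, 3.0, 4.0], 10)
acf = sample_acf(toy, max_lag=8)
print(np.round(acf[[2, 4]], 2))  # strong positive spike at the seasonal lag 4
```

A pronounced spike at the seasonal lag is the kind of pattern that motivates a seasonal term in the SARIMA specification.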

3️⃣ Feature Engineering

Generated lag features (4, 8, 12 weeks) and rolling averages to capture recent sales dynamics.
Created time-based features (month, quarter, week number, year) to model cyclical patterns.
Incorporated categorical metadata (category, binding, imprint) to represent product characteristics.
Engineered trend and residual features for downstream hybrid modeling.
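
The lag, rolling, and calendar features described above might be built along these lines. This is a sketch under assumptions: the function name, the column names, and the 4-week rolling window are illustrative choices, not details taken from the project.

```python
import pandas as pd

def add_lag_features(weekly: pd.Series, lags=(4, 8, 12), window=4) -> pd.DataFrame:
    """Build lag, rolling-mean and calendar features from a weekly series."""
    df = pd.DataFrame({"sales": weekly})
    for lag in lags:
        df[f"lag_{lag}"] = df["sales"].shift(lag)
    # Shift by one week before rolling so each row only sees past sales.
    df[f"roll_mean_{window}"] = df["sales"].shift(1).rolling(window).mean()
    idx = df.index
    df["month"] = idx.month
    df["quarter"] = idx.quarter
    df["week"] = idx.isocalendar().week.astype("int64").to_numpy()
    df["year"] = idx.year
    # Rows without a full lag history cannot be used for training.
    return df.dropna()
```

Shifting before the rolling mean avoids leaking the current week's sales into its own feature row, which would otherwise inflate validation accuracy.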

4️⃣ Modeling Framework

SARIMA Model: Captured trend and seasonal components.
XGBoost Regressor: Modeled nonlinear residuals left by SARIMA, improving fine-grained accuracy.
LSTM Model: Captured temporal dependencies for sequential forecasting.
Hybrid Models: Combined statistical and machine learning forecasts via two approaches:
- Sequential Hybrid (SARIMA → XGBoost): XGBoost learns residual errors from SARIMA.
- Parallel Hybrid (SARIMA + LSTM): Weighted ensemble integrating both models’ strengths.
Tuned hyperparameters using KerasTuner and time-series cross-validation to ensure robust performance.
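
The structure of the sequential hybrid can be sketched as below. To keep the example dependency-free, it uses stand-ins: a seasonal-naive forecast replaces SARIMA and an ordinary least-squares fit replaces XGBoost. Only the residual-learning structure carries over; all function names are hypothetical.

```python
import numpy as np

def seasonal_naive(y, season, horizon):
    """Stage-1 stand-in for SARIMA: repeat the last full season."""
    last = y[-season:]
    return np.array([last[h % season] for h in range(horizon)])

def fit_residual_model(X, resid):
    """Stage-2 stand-in for XGBoost: least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, resid, rcond=None)
    return coef

def predict_residual(coef, X):
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

def hybrid_forecast(y, X, X_future, season):
    """Final forecast = stage-1 baseline + predicted stage-1 residual."""
    # In-sample stage-1 fit: each week is predicted by the same week
    # one season earlier, so residuals exist from index `season` onward.
    resid = y[season:] - y[:-season]
    coef = fit_residual_model(X[season:], resid)
    baseline = seasonal_naive(y, season, len(X_future))
    return baseline + predict_residual(coef, X_future)
```

The second stage is trained only on what the first stage failed to explain. Its features (`X`, `X_future`) must therefore be known for the whole forecast horizon, which is why calendar features are a natural fit.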

5️⃣ Evaluation & Aggregation

Evaluated model performance using:
- Mean Absolute Error (MAE)
- Mean Absolute Percentage Error (MAPE)
- Root Mean Squared Error (RMSE)
Aggregated weekly forecasts to monthly summaries for executive-level decision-making and inventory management.
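
For reference, the three metrics can be computed as in the sketch below. The function name is illustrative. Skipping zero-sales weeks in MAPE is an assumption made here because the percentage error is undefined when the actual value is zero, which matters for series whose gaps were filled with zeros.

```python
import numpy as np

def forecast_metrics(actual, predicted):
    """Return MAE, MAPE (in percent) and RMSE for two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = actual - predicted
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    # MAPE is undefined where actual == 0, so those weeks are skipped here.
    nonzero = actual != 0
    mape = np.mean(np.abs(err[nonzero] / actual[nonzero])) * 100
    return {"MAE": mae, "MAPE": mape, "RMSE": rmse}

# Illustrative numbers only (not project results).
metrics = forecast_metrics([100, 200, 0, 400], [110, 180, 5, 400])
print(metrics["MAE"])  # 8.75
```

RMSE penalizes large misses more heavily than MAE, so comparing the two gives a quick sense of whether errors are dominated by a few spike weeks.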

👩‍💻 My Role

Led data engineering and feature pipeline design
Implemented SARIMA and XGBoost hybrid model for 32-week forecast horizon
Designed forecast comparison visualizations (actual vs predicted)
Evaluated model stability and generalization across multiple titles
Authored final forecast interpretation and business recommendations

📈 Key Findings

Best Model: SARIMA + XGBoost hybrid

Average Forecast Accuracy: 91% (MAPE < 9%)

Improvement: +18% accuracy over standalone SARIMA baseline

Forecast Horizon: 32 weeks ahead (just over 7 months)

Performance Insights:

Sales followed strong seasonal peaks (Q4 holiday season)
Children’s and Fiction categories had the highest volatility
Non-fiction titles showed stable, predictable demand patterns

🔍 Analytical Insights

Hybrid modeling effectively combined the interpretability of SARIMA with the flexibility of XGBoost
Lag features and rolling averages improved short-term responsiveness to recent trends
Performance metrics (MAE, MAPE, RMSE) confirmed that residual learning reduced forecast error relative to standalone SARIMA
Visual trend analysis revealed predictable post-holiday sales dips ideal for restocking decisions

✅ Recommendations

Automate weekly forecast updates to align with sales and marketing cycles
Use forecasts to optimize print and distribution schedules
Expand feature set to include promotion and weather data for richer context
Integrate the forecasting pipeline into Power BI dashboards for continuous tracking

💡 Business Impact

Enabled the publisher to anticipate demand up to 32 weeks (just over 7 months) ahead
Reduced stock imbalances and lost sales opportunities
Provided an explainable, reproducible forecasting framework
Improved planning collaboration between sales, printing, and supply chain teams

📘 Conclusion

This project showcased how hybrid statistical–machine learning forecasting can transform sales planning for the publishing industry.

By leveraging historical patterns and feature-based learning, the model provided more accurate, explainable, and scalable forecasts, empowering data-driven decision-making across the organization.
