Sales & Demand Forecasting for Book Titles

Technical report on hybrid time-series forecasting for book sales.


📊 Project Overview

This project forecasts weekly demand for selected best-selling books over a 32-week horizon. It combines classical time-series modelling (SARIMA) with gradient-boosted trees (XGBoost) to capture both linear seasonal patterns and nonlinear effects.


📂 Data Sources

  • Historical Sales Data (2008–2014): Weekly transaction records (~200,000 rows)
  • Books Metadata: ISBN, title, author, imprint, publisher group, product class, and category (~500 unique titles)
  • Variables: Sales volume, value, average selling price (ASP), recommended retail price (RRP), binding type, and time interval
  • Frequency: Weekly time series per title (aggregated from raw transactions)
  • Titles: The dataset includes The Alchemist and The Very Hungry Caterpillar

💼 Business Problem

Accurate demand forecasting is critical for publishers and retailers to balance inventory, pricing, and marketing decisions.
Traditional forecasting models often underperform due to seasonal volatility, promotions, and title-specific behavior.

This project aimed to build a robust forecasting framework that predicts weekly sales 32 weeks ahead for each book title, helping the publisher:

  • Reduce stockouts and overstock
  • Optimize reprint timing
  • Improve pricing and promotion planning

⚙️ Approach & Methodology

We developed a hybrid time-series forecasting pipeline that integrates statistical and machine learning models to improve prediction accuracy, interpretability, and strategic usability.
The workflow followed five structured phases:

1️⃣ Data Preparation

  • Cleaned and aligned weekly time series for each ISBN.
  • Handled missing weeks using interpolation and smoothing techniques.
  • Removed outliers caused by bulk promotions or anomalous spikes.
  • Resampled data to a consistent weekly frequency, filling gaps with zeros.
  • Book Selection: Focused modeling on two representative titles:
    • The Alchemist — steady, long-term sales pattern.
    • The Very Hungry Caterpillar — strong seasonal variation.
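
The weekly resampling step above can be sketched with pandas. This is a minimal illustration, not the project's actual pipeline: the column names `date` and `volume`, the function name `to_weekly`, and the sample rows are all assumptions. Weeks are labelled by their ending Sunday (the pandas default for "W"), and weeks with no transactions come out as zero.

```python
import pandas as pd

def to_weekly(transactions: pd.DataFrame) -> pd.Series:
    """Aggregate raw transactions into one sales total per week.

    Assumes hypothetical columns 'date' (datetime) and 'volume' (units sold).
    Summing an empty weekly bin yields 0, so weeks without any transactions
    appear as zero-sales weeks rather than as missing rows.
    """
    return transactions.set_index("date")["volume"].resample("W").sum()

# Illustrative data only: two sales in one week, then a two-week gap.
raw = pd.DataFrame(
    {
        "date": pd.to_datetime(["2014-01-06", "2014-01-08", "2014-01-27"]),
        "volume": [5, 3, 4],
    }
)
weekly = to_weekly(raw)
print(weekly)
```

Running this once per ISBN would give each title a gap-free weekly index, which the later lag features and seasonal models rely on.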

2️⃣ Exploratory Analysis

  • Conducted time series decomposition to separate trend, seasonality, and residual components.
  • Performed ACF/PACF analysis and stationarity testing to identify autocorrelation structures.
  • Applied Auto ARIMA to select candidate SARIMA orders and establish an initial statistical baseline.
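
To make the ACF step concrete, here is a NumPy-only sketch of the sample autocorrelation function. The function name `sample_acf` and the synthetic series with a 52-week cycle are assumptions for illustration; the project itself may well have used a library routine instead.

```python
import numpy as np

def sample_acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (biased estimator)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.sum(x * x)
    return np.array(
        [np.sum(x[: len(x) - k] * x[k:]) / denom for k in range(max_lag + 1)]
    )

# Synthetic six-year weekly series with a pure annual (52-week) cycle.
weeks = np.arange(312)
seasonal = np.sin(2 * np.pi * weeks / 52)
acf = sample_acf(seasonal, max_lag=60)

# A strong positive value at lag 52 and a strong negative value at lag 26
# is the signature of annual seasonality in weekly data.
print(round(acf[52], 3), round(acf[26], 3))
```

A peak at the seasonal lag like this is what would motivate a seasonal period of 52 in a SARIMA specification for weekly sales.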

3️⃣ Feature Engineering

  • Generated lag features (4, 8, 12 weeks) and rolling averages to capture recent sales dynamics.
  • Created time-based features (month, quarter, week number, year) to model cyclical patterns.
  • Incorporated categorical metadata (category, binding, imprint) to represent product characteristics.
  • Engineered trend and residual features for downstream hybrid modeling.
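
The lag, rolling, and calendar features listed above might be built along these lines. This is a sketch under assumptions: the function name `make_features`, the column names, and the 4-week rolling window are illustrative choices, not taken from the project. The rolling mean is computed on the series shifted by one week so that a row never sees its own target value.

```python
import pandas as pd

def make_features(weekly: pd.Series) -> pd.DataFrame:
    """Build lag, rolling, and calendar features from a weekly sales series."""
    df = pd.DataFrame({"sales": weekly})
    for lag in (4, 8, 12):
        df[f"lag_{lag}"] = df["sales"].shift(lag)
    # Shift by one week first so the window only covers past observations.
    df["roll_mean_4"] = df["sales"].shift(1).rolling(4).mean()
    df["month"] = df.index.month
    df["quarter"] = df.index.quarter
    df["week"] = df.index.isocalendar().week.astype(int)
    df["year"] = df.index.year
    # The first 12 weeks have no lag_12 value and are dropped.
    return df.dropna()

# Illustrative series: sales of 0, 1, 2, ... over 20 consecutive weeks.
index = pd.date_range("2014-01-05", periods=20, freq="W")
features = make_features(pd.Series(range(20), index=index))
print(features.head())
```

Categorical metadata such as binding or imprint would be joined on ISBN and encoded separately; that step is left out here.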

4️⃣ Modeling Framework

  • SARIMA Model: Captured trend and seasonal components.
  • XGBoost Regressor: Modeled nonlinear residuals left by SARIMA, improving fine-grained accuracy.
  • LSTM Model: Captured temporal dependencies for sequential forecasting.
  • Hybrid Models: Combined statistical and machine learning forecasts via two approaches:
    • Sequential Hybrid (SARIMA → XGBoost): XGBoost learns residual errors from SARIMA.
    • Parallel Hybrid (SARIMA + LSTM): Weighted ensemble integrating both models’ strengths.
  • Tuned LSTM hyperparameters with KerasTuner and validated models with time-series cross-validation to ensure robust performance.
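
The two hybrid schemes can be sketched independently of any particular library. In the code below, `sequential_hybrid`, `parallel_hybrid`, and `MeanResidualModel` are hypothetical names introduced for illustration. `MeanResidualModel` is a deliberately trivial stand-in for the XGBoost regressor so that the example runs with NumPy alone, and the base fitted values and forecasts are made-up numbers standing in for SARIMA output.

```python
import numpy as np

class MeanResidualModel:
    """Hypothetical stand-in for a residual learner: predicts the mean residual."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

def sequential_hybrid(y_train, base_fitted, base_forecast,
                      X_train, X_future, residual_model):
    """Base forecast plus a learned correction for the base model's errors."""
    residuals = np.asarray(y_train, float) - np.asarray(base_fitted, float)
    residual_model.fit(X_train, residuals)
    correction = residual_model.predict(X_future)
    return np.asarray(base_forecast, float) + correction

def parallel_hybrid(forecast_a, forecast_b, weight_a=0.5):
    """Weighted average of two independent forecasts."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    a = np.asarray(forecast_a, float)
    b = np.asarray(forecast_b, float)
    return weight_a * a + (1.0 - weight_a) * b

# Made-up numbers: the base model under-predicts by 1 unit on every training week.
hybrid = sequential_hybrid(
    y_train=[10, 12, 14],
    base_fitted=[9, 11, 13],
    base_forecast=[15, 16],
    X_train=np.zeros((3, 1)),
    X_future=np.zeros((2, 1)),
    residual_model=MeanResidualModel(),
)
blend = parallel_hybrid([10, 20], [20, 40], weight_a=0.25)
print(hybrid, blend)
```

In a setup like the one described, `base_fitted` and `base_forecast` would come from the fitted SARIMA model and `residual_model` would be an XGBoost regressor trained on the engineered features; any object exposing `fit` and `predict` fits this interface.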

5️⃣ Evaluation & Aggregation

  • Evaluated model performance using:
    • Mean Absolute Error (MAE)
    • Mean Absolute Percentage Error (MAPE)
    • Root Mean Squared Error (RMSE)
  • Aggregated weekly forecasts to monthly summaries for executive-level decision-making and inventory management.
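
The three error metrics and the monthly roll-up can be written out as follows. The function names `forecast_errors` and `to_monthly` are assumptions, as is the choice to skip zero-sales weeks when computing MAPE (the percentage error is undefined when the actual value is zero). Each week is assigned to the month of its index date.

```python
import numpy as np
import pandas as pd

def forecast_errors(actual, predicted):
    """Return MAE, MAPE (in percent), and RMSE for two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = actual - predicted
    nonzero = actual != 0  # assumption: zero-sales weeks are excluded from MAPE
    return {
        "MAE": float(np.mean(np.abs(err))),
        "MAPE": float(np.mean(np.abs(err[nonzero] / actual[nonzero])) * 100),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
    }

def to_monthly(weekly_forecast: pd.Series) -> pd.Series:
    """Sum weekly forecasts into calendar months keyed by the week's index date."""
    return weekly_forecast.groupby(weekly_forecast.index.to_period("M")).sum()

# Illustrative values only.
metrics = forecast_errors([100, 200, 400], [110, 180, 400])
index = pd.date_range("2014-01-05", periods=8, freq="W")
monthly = to_monthly(pd.Series(10, index=index))
print(metrics)
print(monthly)
```

Reporting all three metrics together is useful because RMSE penalizes large misses (such as a badly forecast holiday peak) more heavily than MAE, while MAPE makes titles with different sales volumes comparable.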

👩‍💻 My Role

  • Led data engineering and feature pipeline design
  • Implemented SARIMA and XGBoost hybrid model for 32-week forecast horizon
  • Designed forecast comparison visualizations (actual vs predicted)
  • Evaluated model stability and generalization across multiple titles
  • Authored final forecast interpretation and business recommendations

📈 Key Findings

Best Model: SARIMA + XGBoost hybrid

Average Forecast Accuracy: about 91%, measured as 100% minus MAPE (MAPE of roughly 9%)

Improvement: +18% accuracy over standalone SARIMA baseline

Forecast Horizon: 32 weeks ahead (roughly 7 to 8 months)

Performance Insights:

  • Sales followed strong seasonal peaks (Q4 holiday season)
  • Children’s and Fiction categories had the highest volatility
  • Non-fiction titles showed stable, predictable demand patterns

🔍 Analytical Insights

  • Hybrid modeling effectively combined the interpretability of SARIMA with the flexibility of XGBoost
  • Lag features and rolling averages improved short-term responsiveness to recent trends
  • Residual learning reduced forecast error: the hybrid improved accuracy by 18% over the standalone SARIMA baseline
  • Visual trend analysis revealed predictable post-holiday sales dips, which offer a natural window for restocking decisions

✅ Recommendations

  • Automate weekly forecast updates to align with sales and marketing cycles
  • Use forecasts to optimize print and distribution schedules
  • Expand feature set to include promotion and weather data for richer context
  • Integrate the forecasting pipeline into Power BI dashboards for continuous tracking

💡 Business Impact

  • Enabled the publisher to anticipate demand 32 weeks (roughly 7 to 8 months) ahead
  • Reduced stock imbalances and lost sales opportunities
  • Provided an explainable, reproducible forecasting framework
  • Improved planning collaboration between sales, printing, and supply chain teams

🔧 Future Work

  • Validate early-warning indicators using historical backtesting.

📘 Conclusion

This project showed how hybrid statistical and machine learning forecasting can improve sales planning in the publishing industry.

By combining SARIMA's seasonal structure with feature-based residual learning in XGBoost, the model produced forecasts that were more accurate than the standalone SARIMA baseline while remaining explainable and reproducible, supporting data-driven decisions across sales, printing, and supply chain teams.