Palak Gupta - Data Analyst

Portfolio Project 7:

Covid Data Prediction

Services:

Data Analysis

Overview

The COVID-19 Data Prediction project focused on analyzing and forecasting the spread of COVID-19 using historical case data. The aim was to build predictive models that could estimate future infection rates and help health agencies, policymakers, and the public prepare for potential surges. The analysis involved trend visualization, statistical modeling, and machine learning techniques to forecast daily cases, recoveries, and deaths.

Research: The research phase involved studying publicly available COVID-19 datasets from sources like Johns Hopkins University and Kaggle. Key variables included daily confirmed cases, deaths, recoveries, testing rates, and vaccination status. Additional features like population density and lockdown dates were also reviewed to assess their impact on infection spread.

Information Architecture: The data was organized by country, region, and date. After extensive cleaning—handling missing values, normalizing population data, and dealing with reporting inconsistencies—the dataset was prepared for time series modeling and regression analysis.
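As an illustration of the cleaning step, the sketch below turns a cumulative case series into daily counts and scales them per 100,000 people. The function names, the sample numbers, and the choice to clip downward revisions to zero are assumptions made for this example, not the project's actual code.

```python
def cumulative_to_daily(cumulative):
    """Convert cumulative case counts to daily new cases.

    Downward revisions in the cumulative series would give negative
    daily counts; clipping them to zero is an assumed policy here.
    The series is assumed to start from zero before the first entry.
    """
    daily = []
    previous = 0
    for total in cumulative:
        daily.append(max(total - previous, 0))
        previous = total
    return daily


def per_100k(counts, population):
    """Scale raw counts to a rate per 100,000 inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return [count * 100_000 / population for count in counts]


# Made-up numbers containing one downward revision (25 -> 22).
cumulative_cases = [10, 25, 25, 22, 40]
daily_cases = cumulative_to_daily(cumulative_cases)
rates = per_100k(daily_cases, population=2_000_000)
```

Expressing counts as rates is what makes regions of very different size comparable on the same chart.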

Time series plots (using Matplotlib and Plotly) and dashboards (in Power BI) were created to display trends in confirmed cases, death rates, and recovery patterns. Prototypes of forecasting models were built using ARIMA, Facebook Prophet, and LSTM neural networks to simulate future case trajectories.

Challenges

Data Volatility and Noise:

Challenge: Inconsistent case reporting and missing values, especially from underreported regions, made forecasting difficult.

Solution: Used rolling averages and data smoothing techniques to reduce volatility. Imputation techniques were applied for missing values.
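A minimal sketch of the two ideas, using only the standard library. Linear interpolation is one possible imputation technique and a trailing window is one possible smoothing choice; both are assumptions here, as are the function names and sample values.

```python
def interpolate_missing(values):
    """Fill None gaps by linear interpolation between known neighbours.

    Leading and trailing gaps are filled with the nearest known value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(values)
    # Gaps before the first and after the last observation.
    for i in range(known[0]):
        filled[i] = values[known[0]]
    for i in range(known[-1] + 1, len(values)):
        filled[i] = values[known[-1]]
    # Gaps between two observations.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


def rolling_mean(values, window=7):
    """Trailing moving average; early points use the shorter history available."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Made-up daily counts with unreported days marked as None.
reported = [100, None, None, 130, 90, None, 150]
filled = interpolate_missing(reported)
smoothed = rolling_mean(filled, window=3)
```

A 7-day window is a common default for daily case data because it averages out weekday reporting effects; the short window above just keeps the example small.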

Model Selection:

Challenge: Choosing the right model for short- and long-term predictions given the non-linear, highly dynamic nature of the pandemic.

Solution: Compared statistical models like ARIMA with deep learning models like LSTM. LSTM was found to capture temporal dependencies better, while ARIMA performed well for countries with consistent data.
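One way to run such a comparison is walk-forward evaluation, scoring each model separately at each forecast horizon. The sketch below shows only the evaluation harness: the two forecasters are simple stand-ins (not ARIMA or LSTM), and all names and numbers are assumptions for illustration.

```python
def naive_forecast(history, horizon):
    """Stand-in model: repeat the last observed value for every future step."""
    return [history[-1]] * horizon


def drift_forecast(history, horizon):
    """Stand-in model: extend the average historical slope from the last value."""
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return [history[-1] + slope * step for step in range(1, horizon + 1)]


def walk_forward_mae(series, forecaster, horizon, min_train):
    """Mean absolute error exactly `horizon` steps ahead, over rolling origins.

    At each origin the model sees only series[:origin], so no future
    observations leak into the forecast.
    """
    errors = []
    for origin in range(min_train, len(series) - horizon + 1):
        prediction = forecaster(series[:origin], horizon)[-1]
        actual = series[origin + horizon - 1]
        errors.append(abs(prediction - actual))
    return sum(errors) / len(errors)


series = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]  # made-up, steadily rising
models = {"naive": naive_forecast, "drift": drift_forecast}
scores = {
    (name, horizon): walk_forward_mae(series, model, horizon, min_train=4)
    for name, model in models.items()
    for horizon in (1, 3)
}
```

Any model that exposes the same `(history, horizon)` interface could be dropped into `models`, which keeps the short-term and long-term comparison on equal footing.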

External Factors Impact:

Challenge: Lockdowns, vaccination drives, and variants drastically affected case trends.

Solution: Introduced external features such as stringency index, mobility data, and vaccination rates as covariates to improve model accuracy.
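A hedged sketch of how covariates like these might be aligned with the target before modeling. Lagging each covariate so the model only sees values available before the forecast date is an assumption of this example, as are the helper name, the feature names, and the sample values.

```python
def build_lagged_rows(target, covariates, lag):
    """Pair each day's target with covariate values observed `lag` days earlier.

    `covariates` maps a feature name to a list aligned day by day with
    `target`. Returns (features, labels); features is a list of dicts.
    """
    for name, values in covariates.items():
        if len(values) != len(target):
            raise ValueError(f"{name} is not aligned with the target series")
    features, labels = [], []
    for day in range(lag, len(target)):
        row = {
            f"{name}_lag{lag}": values[day - lag]
            for name, values in covariates.items()
        }
        # The lagged target itself is usually a strong predictor as well.
        row[f"cases_lag{lag}"] = target[day - lag]
        features.append(row)
        labels.append(target[day])
    return features, labels


# Made-up values for one region.
cases = [50, 55, 61, 58, 52, 47]
covariates = {
    "stringency": [20, 20, 60, 60, 80, 80],
    "vaccinated_pct": [0.0, 0.0, 0.5, 1.0, 1.5, 2.0],
}
X, y = build_lagged_rows(cases, covariates, lag=2)
```

The resulting rows can be fed to any regression or sequence model; the first `lag` days are dropped because they have no earlier covariate values to pair with.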

Overfitting Risks:

Challenge: High variability in data caused models to overfit on training sets.

Solution: Used regularization, cross-validation, and early stopping in neural networks to generalize predictions.
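Two of these safeguards can be sketched without any framework. The expanding-window split below is one assumed form of cross-validation suited to time series, and the patience rule is a generic version of early stopping; the function names and loss values are hypothetical.

```python
def expanding_window_splits(n_samples, n_folds, test_size):
    """Yield (train_indices, test_indices) with training always before test."""
    first_test_start = n_samples - n_folds * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        start = first_test_start + fold * test_size
        yield list(range(start)), list(range(start, start + test_size))


def best_epoch_with_early_stopping(validation_losses, patience):
    """Return the epoch whose weights would be kept.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs; later epochs are never evaluated.
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch


splits = list(expanding_window_splits(n_samples=10, n_folds=3, test_size=2))
# Made-up validation losses; the final 0.5 is never reached with patience=2.
stop_at = best_epoch_with_early_stopping([0.9, 0.7, 0.6, 0.65, 0.66, 0.5], patience=2)
```

Shuffled k-fold splits would let a model train on days that come after its test days, which is why the folds here always keep the training window earlier in time.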

Results/Conclusion:

The project successfully demonstrated the use of data science in real-world crisis prediction. LSTM-based models outperformed traditional time series models, especially for longer forecast windows. Forecasts helped predict peaks and declines with reasonable accuracy, and region-specific insights were derived regarding infection trends and recovery speed. The findings were visualized on interactive dashboards, aiding public understanding and institutional decision-making. This project highlighted the importance of data quality, external factor inclusion, and transparent modeling in health analytics. Future work could involve real-time data pipelines and integration with hospital resource prediction models.

Let's 👋 Work Together
