You should write about it and try to prove what you say; maybe you're right, and I'd be interested to see it. Of course, if I am wrong here, that would be good to know, since it would make my research better.
My thinking was primarily as follows: if SHAP aggregates the marginal contributions of a feature in a predictive model, then a fair representation of that feature's importance should be derived from genuine predictions (i.e., on unseen data). Otherwise, you are deriving marginal contributions of a feature to predictions that are not representative of the actual model performance.
Maybe the marginal contributions would be the same in both the training and test sets, and only the predictive performance differs; if so, you have a point. But if the predictions are biased, I can see how the feature importances derived from them would be biased too.
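To make the aggregation step concrete, here is a minimal sketch of exact Shapley attribution for a toy prediction function (the model, baseline, and data points are all hypothetical, not from either of our experiments). The point it illustrates is that the per-instance attributions and their global aggregation can be computed on either the training set or the test set, and nothing in the formula itself ties them to one or the other; the choice of which set to feed in is exactly where our disagreement sits.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values for one instance x against a single baseline.

    A feature in the coalition takes its value from x; a feature outside
    the coalition is replaced by the baseline value."""
    n = len(x)

    def v(S):
        z = [x[i] if i in S else baseline[i] for i in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                # classic Shapley weight |S|!(n-|S|-1)!/n!
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (v(set(S) | {i}) - v(set(S)))
    return phi

# Hypothetical toy model with one linear term and one interaction term.
def model(z):
    return 3.0 * z[0] + z[1] * z[2]

baseline = [0.0, 0.0, 0.0]  # e.g. feature means of a background set
train = [[1.0, 2.0, 1.0], [0.5, 1.0, 2.0]]  # made-up "training" points
test = [[2.0, 1.0, 1.0]]                    # made-up "test" point

def global_importance(points):
    """Aggregate per-instance attributions (mean |phi|) into a global score.

    This can be called on the training set or the test set; the formula
    is identical, only the predictions being explained differ."""
    phis = [shapley_values(model, x, baseline) for x in points]
    n = len(phis[0])
    return [sum(abs(p[i]) for p in phis) / len(phis) for i in range(n)]

print(global_importance(train))
print(global_importance(test))
```

The efficiency property (the attributions sum to the prediction minus the baseline prediction) holds on any input, seen or unseen; what changes between the two calls is only which predictions are being decomposed.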