A backtest that produces a clean, rising equity curve is not evidence that a strategy works. It is evidence that the strategy worked on one specific historical price series — which you already had in front of you when you designed the setup. There are four ways grid bot backtests systematically flatter the strategy, and each one has a specific antidote.
1. Period selection bias
The most common problem is also the simplest. You pull 30 or 60 days of OHLCV data, run the backtest, see a strong result, and treat it as validation. What you have actually done is found a period where the market happened to cooperate — price stayed within a range, oscillated regularly, and didn't trend hard enough to break out of your boundaries.
This is not bad luck or deliberate cheating. It is a natural consequence of how humans select data. We tend to pull recent data, and recent data is whatever the market was doing when we decided to start testing. If the market has been ranging for the past two months, a backtest on those two months will look excellent. When you deploy live and the market enters a trend — as it eventually will — the result will be nothing like the backtest.
The antidote is to test across multiple periods that you did not select because they looked good. At minimum, run the same configuration against a 30-day ranging period, a 30-day trending period, and a 30-day high-volatility period. If the strategy only survives one of those three, it is not robust — it is period-specific.
Minimum backtest portfolio for a meaningful validation:

- Period A: the most recent 30–60 days (current regime)
- Period B: a month where the asset trended >20% in one direction
- Period C: a month where realised vol exceeded 80% annualised
- Period D: a quiet month where price moved <10% total

If the strategy is profitable across A, B, C, and D — or at least survives B and C without liquidating — that is a more honest signal than a single clean-looking period.
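Screening candidate months against these criteria can be automated rather than eyeballed. A minimal sketch follows; the function name and the exact thresholds (20% net move, 80% annualised vol, 10% total range) simply mirror the portfolio above and are not part of the simulator.

```python
import math

def classify_period(closes, periods_per_year=365):
    """Classify a daily close series as trending / high-vol / quiet / ranging.

    Illustrative thresholds, matching the portfolio above:
    >20% net move = trending, >80% annualised vol = high-vol,
    <10% total range = quiet, anything else = ranging.
    """
    rets = [math.log(b / a) for a, b in zip(closes, closes[1:])]
    mean = sum(rets) / len(rets)
    var = sum((r - mean) ** 2 for r in rets) / (len(rets) - 1)
    ann_vol = math.sqrt(var * periods_per_year)           # annualised realised vol
    net_move = abs(closes[-1] / closes[0] - 1)            # net directional move
    total_range = (max(closes) - min(closes)) / closes[0]

    if net_move > 0.20:
        return "trending"   # Period B candidate
    if ann_vol > 0.80:
        return "high-vol"   # Period C candidate
    if total_range < 0.10:
        return "quiet"      # Period D candidate
    return "ranging"        # Period A / default
```

Run it over a sliding 30-day window of your full history and keep one month from each bucket, so the test periods are picked by rule rather than by how good they make the equity curve look.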
2. Range look-ahead bias
Look-ahead bias in grid bots is subtler than in other strategies. You are not peeking at future prices to decide when to enter — but you may be using the historical data to set the range, which is just as distorting.
If you pull 60 days of BTC data, notice that price oscillated between $92,000 and $108,000, and then set your grid range to $91,000–$109,000, you have built a range that is perfectly sized for the period you are testing. The backtest will show few or no breakouts because the range was calibrated to fit the data. Live, you will not have that luxury — you will set the range before knowing what the next 60 days look like.
The test for look-ahead bias is simple: could you have arrived at this range configuration without looking at the period you are testing? If the answer is no, the backtest is contaminated. Use the volatility-based range sizing method — deriving the range from a prior period's realised vol — so the range is set from data that predates the test window.
Clean range-setting process for backtesting:

- Test period: 1 Feb – 28 Feb
- Range source: 30d realised vol from 1 Jan – 31 Jan (the prior period)
- Calculation: ±1.5σ from the entry price as at 1 Feb

This ensures the range was derivable without any knowledge of what happened during the test period.
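The calculation step can be sketched as follows. The function name, the ±1.5σ multiplier, and the lognormal scaling of daily vol to the test horizon are illustrative assumptions; the essential property is that only prior-period closes go in.

```python
import math

def grid_range_from_prior_vol(prior_closes, entry_price,
                              horizon_days=30, k=1.5):
    """Derive a grid range from a *prior* period's realised vol, so the
    range is set without any knowledge of the test window.

    Sketch only: k=1.5 gives the +/-1.5 sigma band described above.
    """
    rets = [math.log(b / a) for a, b in zip(prior_closes, prior_closes[1:])]
    mean = sum(rets) / len(rets)
    var = sum((r - mean) ** 2 for r in rets) / (len(rets) - 1)
    daily_vol = math.sqrt(var)
    sigma = daily_vol * math.sqrt(horizon_days)   # vol scaled to the horizon
    lower = entry_price * math.exp(-k * sigma)    # symmetric in log space
    upper = entry_price * math.exp(+k * sigma)
    return lower, upper
```

For the February example, `prior_closes` would be the 1 Jan – 31 Jan daily closes and `entry_price` the price at the open on 1 Feb.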
3. Candle size overstates fill frequency
The simulator's backtest engine uses a high-low sweep model: for each candle, it checks whether the candle's high crossed any ask orders (filling them top-down) and whether the low crossed any bid orders (filling them bottom-up). This is conservative and correct for small candles. For large candles — 4-hour or daily — it becomes a significant source of overstated fills.
Consider a 4-hour candle with a low of $97,000 and a high of $103,000, on a grid entered at $100,000 with levels at $98,000, $99,000, $101,000, and $102,000. The high-low sweep model fills all four orders — two buys and two sells — and, because each fill's replacement order also sits inside the candle's range, counts four completed round trips. In reality, price may have moved from $100,000 to $103,000 to $97,000 in a single sweep, completing only two round trips before the candle closed. The sweep model cannot distinguish these cases.
The result is that backtests on 4-hour or daily candles will show more round trips, and therefore more income, than actually occurred. The effect compounds over a long backtest period. On 1-hour candles the distortion is smaller; on 15-minute candles it is small enough to be negligible for most grid spacings.
| Candle interval | Fill overcount risk | Use for |
|---|---|---|
| 1 – 5 minutes | Negligible | Short-duration accuracy checks (7–30 days) |
| 15 – 30 minutes | Low | Standard backtests up to 90 days |
| 1 hour | Moderate on tight grids | Multi-month backtests — adequate for most setups |
| 4 hours | High on spacing < 1% | Long-period overviews only — not for income estimates |
| Daily | Very high | Structural breakout analysis only — not fill counting |
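The distortion in the table can be demonstrated directly. The sketch below contrasts the high-low sweep count with a replay of an explicit intra-candle path; it assumes the engine treats each filled order whose replacement price also lies inside the candle as a completed round trip, which is what produces the four-round-trip count in the example above. Both functions are illustrative, not the simulator's engine.

```python
def sweep_round_trips(low, high, bids, asks):
    """High-low sweep model (sketch): every resting order inside the
    candle's range fills, and each fill's replacement order, one grid
    step away and also inside the range, is assumed to fill too, so
    each original order counts as one completed round trip."""
    return len([p for p in bids + asks if low <= p <= high])


def path_round_trips(path, bids, asks, step):
    """Replay an explicit intra-candle path instead. Limit buys fill on
    downward moves and limit sells on upward moves; a round trip
    completes only when a fill's replacement order is itself filled."""
    orders = {p: ("buy", False) for p in bids}
    orders.update({p: ("sell", False) for p in asks})
    trips, pos = 0, path[0]
    for nxt in path[1:]:
        rising = nxt > pos
        side = "sell" if rising else "buy"
        lo, hi = min(pos, nxt), max(pos, nxt)
        # orders of the fillable side, in the order price traverses them
        hit = sorted((p for p, (s, _) in orders.items()
                      if s == side and lo <= p <= hi), reverse=not rising)
        for p in hit:
            _, is_replacement = orders.pop(p)
            if is_replacement:
                trips += 1  # second leg of a pair: round trip complete
            # place the opposite order one grid step away
            q = p - step if side == "sell" else p + step
            orders[q] = ("buy" if side == "sell" else "sell", True)
        pos = nxt
    return trips
```

On the example candle, the sweep model counts four round trips while the explicit 100k → 103k → 97k path completes fewer, which is exactly the overcount the table warns about on large candles.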
4. The single-path problem
A backtest produces one result: what happened on one specific sequence of prices. That sequence is one path out of the enormous number of paths that could have plausibly occurred given the same starting conditions. A strategy that performed well on that specific path may have performed poorly on most other plausible paths with similar statistical properties.
This is not a flaw in backtesting — it is its inherent nature. A backtest tells you what happened. It cannot tell you what was likely to happen. A Monte Carlo simulation, calibrated to the same period's realised volatility, shows the distribution of outcomes across many paths with similar statistical properties. The two tools together give a more complete picture than either alone.
The practical implication: if a backtest shows a strong result but the Monte Carlo on the same configuration shows a median outcome that is flat or negative, the backtest result was probably a lucky path — not a signal of genuine edge. If both show strong results, the configuration is more robustly profitable. If the backtest shows a loss but Monte Carlo shows a positive median, the historical period was unusually hostile and the strategy may still have merit.
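A minimal version of that cross-check can be sketched as follows, assuming GBM paths calibrated to the period's realised vol and a toy crossing-count as the outcome metric (neither is the simulator's actual Monte Carlo engine; a level re-crossed twice is counted as roughly one round trip, and a breakout ends the run).

```python
import math
import random

def gbm_paths(s0, daily_vol, days, n_paths, drift=0.0, seed=7):
    """Generate GBM price paths calibrated to a period's realised vol."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        px, path = s0, [s0]
        for _ in range(days):
            px *= math.exp(drift - 0.5 * daily_vol ** 2
                           + daily_vol * rng.gauss(0, 1))
            path.append(px)
        paths.append(path)
    return paths

def grid_trips_on_path(path, lower, upper, n_levels):
    """Toy outcome metric: count grid-level crossings until the first
    breakout; roughly two crossings make one completed round trip."""
    step = (upper - lower) / (n_levels + 1)
    levels = [lower + step * i for i in range(1, n_levels + 1)]
    crossings = 0
    for a, b in zip(path, path[1:]):
        if b < lower or b > upper:
            break                     # breakout ends the run
        crossings += sum(1 for lv in levels
                         if min(a, b) <= lv <= max(a, b))
    return crossings // 2

# Distribution of outcomes across 2,000 paths with the same config
paths = gbm_paths(s0=100_000, daily_vol=0.03, days=30, n_paths=2_000)
trips = sorted(grid_trips_on_path(p, 92_000, 108_000, 8) for p in paths)
p10, p50, p90 = (trips[int(len(trips) * q)] for q in (0.10, 0.50, 0.90))
print(f"round trips  P10={p10}  P50={p50}  P90={p90}")
```

The single backtest result is one draw from a distribution like this; comparing it against the P50 (and P10) tells you whether the historical path was typical, lucky, or unusually hostile.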
A checklist before trusting a backtest
Before treating a backtest result as evidence:
☐ Tested across at least 3 distinct market periods (ranging, trending, high-vol)
☐ Range was set from data prior to the test period (not fitted to the test data)
☐ Used candles of 1 hour or smaller
☐ Compared against Monte Carlo on the same configuration — backtest result is consistent with the P50 outcome
☐ Checked whether the strategy would have survived the worst period in the dataset without liquidating

If any box is unticked, the backtest result carries less weight than it appears to.
Upload the same configuration against two different 30-day OHLCV files — one from a ranging period and one from a trending period. Then run Monte Carlo on the same setup with vol calibrated to each period. The contrast between the four results is the most honest picture of what the strategy actually does.
Launch the simulator →