To Do

The never-ending list.

Android

Skylake YUV sampling

Vulkan

Guardband clipping
Double-check that all SLM / scratch space fixes are done
Write a multisample-scaled-blit extension

Bugs

FEZ on Ironlake - SIMD16 with discard doesn’t work
TIS-100 on Ironlake - some SIMD16 bug
Rendercheck failures on Skylake? (Works on Broadwell…)

Performance

Core
- Trim unused components of giant vec4 arrays (see deus-ex 256).
NIR
- Phi distributor
i965 Compiler
- Consider disabling sample-EOT outside of BLORP, due to overlapping polygon scoreboarding woes
- Drop PSIZ/POS slots from the VUE map for non-SSO shaders, except for the last geometry stage (krh log, February 28th)
- Ditch varying packing
- Push UBOs
- SENDS
- NoSrcDepSet for SEND
- IVB: “VS URB Entry Allocation Size equal to 4 (5 512-bit URB rows) may cause performance to decrease due to banking in the URB. Element sizes of 16 to 20 should be programmed with six 512-bit URB rows.”
- VF component packing
- Use repdata message for more than clears
- Atomics
- Combine URB writes
- Better pull vs. push thresholds
- Scheduling
- Make sure Curro’s spilling heuristic work lands
- SIMD32 FS (for Unigine Valley)
- 3-source register banking
- Store uniforms in 1 component rather than 8/16/32
- Compare performance of shadow comparison messages: c, l_c, b_c, lz. Make sure we hit the fast rate in more cases (possibly use the LOD message).
  - Sampler messages of length 9 or higher (implicit header adds 1) may be slower.
- TCS Improvements (low priority)
  - Output Shadowing
  - Varying packing for TCS inputs (I think tarceri is doing this)
  - Write per-patch outputs from only one invocation
  - New Skylake modes (dual patch seems slower…try 8 patch?)
BLORP
- Use send instead of sendc - there are no overlapping polygons
i965 Surfaces
- Stencil PMA Fix (Skylake)
- When the stencil op is KEEP, disable stencil writes?
- Clear and CopyImage using a larger BPP
  - jekstrand measured a huge difference - see Nov. 11 “ish” logs
  - 8bpb: 58s, 16bpb: 38s, 32bpb: 18s, 64bpb: 18s, 128bpb: 15s.
  - A 2x2 subspan at 128bpp (4 pixels × 16 bytes) processes a whole cacheline
- Ignore alpha blending when checking if formats are renderable in BLORP
- Half float texture results
- Implement InvalidateBufferSubData and friends
- Check BO_ALLOC_FOR_RENDER usage to avoid stalls
- Drop BRW_NEW_BLORP from more atoms
- MOCS
- Force unwritten RT components to 0 for better compression
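A rough sketch of the arithmetic behind the "Clear and CopyImage using a larger BPP" item above: how many bytes one 2x2 subspan touches at each pixel size. The 64-byte cacheline size and the names `CACHELINE_BYTES` and `subspan_bytes` are assumptions made for illustration, not taken from the driver, and the snippet does not model the measured timings.

```python
# Assumption: a 64-byte cacheline. Names here are hypothetical helpers.
CACHELINE_BYTES = 64
PIXELS_PER_SUBSPAN = 4  # a 2x2 subspan covers four pixels


def subspan_bytes(bits_per_pixel: int) -> int:
    """Bytes covered by one 2x2 subspan at the given pixel size."""
    return PIXELS_PER_SUBSPAN * bits_per_pixel // 8


if __name__ == "__main__":
    for bpp in (8, 16, 32, 64, 128):
        covered = subspan_bytes(bpp)
        fraction = covered / CACHELINE_BYTES
        print(f"{bpp:3d} bpp: {covered:2d} bytes per subspan "
              f"({fraction:.4f} of a cacheline)")
```

Under these assumptions only the 128bpp case reaches a full cacheline per subspan, which fits the note that wider formats clear and copy faster.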
Defer some DrawTransformFeedback calculations?
Queries
- Use a vertex shader for query math rather than CS ALU
- Fix timer queries on Skylake
State upload code
- Port to genxml (brwxml branch)
- Use a separate state buffer (allowing larger batches), retire in lockstep
- Keep the state buffer around for two batches (so we don’t re-emit state just because we need it in a new buffer)
- Generate shader packets at compile time, save them in the cache
- Rewrite program cache
- See how command buffer sizes impact Xonotic performance (see Jira 7424, Eero June 2016 MSR)