The never-ending list.

Android

  • Skylake YUV sampling

Vulkan

  • Guardband clipping
  • Double-check that all SLM / scratch space fixes are done
  • Write a multisample-scaled-blit extension

Bugs

  • FEZ on Ironlake - SIMD16 combined with discard doesn’t work
  • TIS-100 on Ironlake - some SIMD16 bug
  • Rendercheck failures on Skylake? (Works on Broadwell…)

Performance

  • Core
    • Trim unused components of giant vec4 arrays (see deus-ex 256).
  • NIR
    • Phi distributor
  • i965 Compiler
    • Consider disabling sample-EOT outside of BLORP, due to overlapping polygon scoreboarding woes
    • Drop PSIZ/POS slots from VUE map for non-SSO shaders except for last geometry stage (krh log, February 28th)
    • Ditch varying packing
    • Push UBOs
    • SENDS
    • NoSrcDepSet for SEND
    • IVB: “VS URB Entry Allocation Size equal to 4 (5 512-bit URB rows) may cause performance to decrease due to banking in the URB. Element sizes of 16 to 20 should be programmed with six 512-bit URB rows.”
    • VF component packing
    • Use repdata message for more than clears
    • Atomics
    • Combine URB writes
    • Better pull vs. push thresholds
    • Scheduling
    • Make sure Curro’s spilling heuristic work lands
    • SIMD32 FS (for Unigine Valley)
    • 3-source register banking
    • Store uniforms in 1 component rather than 8/16/32
    • Compare performance of shadow comparison messages: c, l_c, b_c, lz. Make sure we hit the fast rate in more cases (possibly by using the LOD message).
      • Sampler messages of length 9 or higher (implicit header adds 1) may be slower.
    • TCS Improvements (low priority)
      • Output Shadowing
      • Varying packing for TCS inputs (I think tarceri is doing this)
      • Write per-patch outputs from only one invocation
      • New Skylake modes (dual patch seems slower… try 8 patch?)
  • BLORP
    • Use send instead of sendc - there are no overlapping polygons
  • i965 Surfaces
    • Stencil PMA Fix (Skylake)
    • When the stencil op is KEEP, disable stencil writes?
    • Clear and CopyImage using a larger BPP
      • jekstrand measured a huge difference - see the logs from around Nov. 11
      • 8bpb: 58s, 16bpb: 38s, 32bpb: 18s, 64bpb: 18s, 128bpb: 15s.
      • A 2x2 subspan at 128bpp processes a whole cacheline
    • Ignore alpha blending when checking if formats are renderable in BLORP
    • Half float texture results
    • Implement InvalidateBufferSubData and friends
    • Check BO_ALLOC_FOR_RENDER usage to avoid stalls
    • Drop BRW_NEW_BLORP from more atoms
    • MOCS
    • Force unwritten RT components to 0 for better compression
  • Defer some DrawTransformFeedback math calculations?

  • Queries
    • Use a vertex shader for query math rather than CS ALU
    • Fix timer queries on Skylake
  • State upload code
    • Port to genxml (brwxml branch)
    • Use a separate state buffer (allowing larger batches), retire in lockstep
    • Keep the state buffer around for two batches (so we don’t re-emit state just because we need it in a new buffer)
    • Generate shader packets at compile time, save them in the cache
    • Rewrite program cache
    • See how command buffer sizes impact Xonotic performance (see Jira 7424, Eero June 2016 MSR)
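
Note on the "Clear and CopyImage using a larger BPP" item above: a rough sketch of the arithmetic behind it, using only the timings quoted in that item. The 64-byte cacheline size and the helper names (subspan_bytes, speedup_vs_8) are assumptions made for illustration; they are not taken from the driver.

```python
# Timings in seconds, copied from the "larger BPP" item in the list above.
timings_s = {8: 58, 16: 38, 32: 18, 64: 18, 128: 15}

# Assumption: one cacheline is 64 bytes.
CACHELINE_BYTES = 64


def subspan_bytes(bits_per_pixel):
    """Bytes covered by one 2x2 subspan at the given pixel size."""
    return 2 * 2 * bits_per_pixel // 8


def speedup_vs_8(bits_per_pixel):
    """Speedup relative to the 8-bit run, from the quoted timings."""
    return timings_s[8] / timings_s[bits_per_pixel]


if __name__ == "__main__":
    for bpp in sorted(timings_s):
        print(
            f"{bpp:>3} bits: {subspan_bytes(bpp):>2} bytes per 2x2 subspan, "
            f"{speedup_vs_8(bpp):.1f}x vs 8 bits"
        )
```

Under that assumption, only the 128-bit case fills a whole cacheline per subspan, which is consistent with it being the fastest of the quoted runs.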