Images need PHOTO for vision, audio needs VOICE for speech-to-text (STT), and all other files get DOCUMENT for text inlining.
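The mapping above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `MessageType` enum and the `classify_mime` helper are hypothetical names, and dispatching on the MIME type prefix is an assumption about how attachments are identified.

```python
from enum import Enum
from typing import Optional


class MessageType(Enum):
    # Hypothetical enum; member names mirror the types named in the text.
    PHOTO = "photo"        # routed to vision
    VOICE = "voice"        # routed to speech-to-text
    DOCUMENT = "document"  # text is inlined


def classify_mime(mime_type: Optional[str]) -> MessageType:
    """Pick a message type from a MIME type string (assumed input)."""
    if not mime_type:
        # Unknown type: fall back to the catch-all DOCUMENT.
        return MessageType.DOCUMENT
    major = mime_type.lower().split("/", 1)[0]
    if major == "image":
        return MessageType.PHOTO
    if major == "audio":
        return MessageType.VOICE
    # Everything else, including text/* and application/*.
    return MessageType.DOCUMENT
```

Treating DOCUMENT as the fallback means an unrecognized or missing MIME type never raises; it simply takes the text-inlining path.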