paraphernalia.torch.clip module¶
Evaluate images with CLIP.
- class CLIP(prompt, anti_prompt=None, detail=None, use_tiling=True, macro=0.5, chops=64, model='ViT-B/32', device=None)[source]¶
A CLIP-based perceptor that evaluates how well an image fits one or more target text prompts.
The underlying model is limited to (224, 224) resolution, so this class presents it with multiple perspectives on an image:
Macro: random crops of 90-100% of the image, used to counteract aliasing
Micro: small near-pixel-perfect random crops, plus an optional tiling, enabling the fine details of high-resolution images to be processed.
A lot of internals are exposed via methods to facilitate debugging and experimentation.
- Parameters
prompt (Union[str, List[str]]) – the text prompt to use in general
anti_prompt (Optional[Union[str, List[str]]]) – a description to avoid
detail (Optional[Union[str, List[str]]]) – a text prompt to use for micro-perception, defaults to “A detail from a picture of {prompt}”
use_tiling (bool) –
macro (float) –
chops (int) –
model (str) –
device (Optional[str]) –
- use_tiling¶
if true, add a covering of near-pixel-perfect perceptors into the mix
- Type
bool
- chops¶
the number of augmentation operations; these are split 50-50 between macro and micro
- Type
int
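When no detail prompt is supplied, it defaults to a template filled with the main prompt. A minimal sketch of that defaulting logic (a hypothetical helper illustrating the documented default, not the library's own code):

```python
def default_detail(prompt, detail=None):
    # Mirror the documented default: "A detail from a picture of {prompt}".
    if detail is None:
        detail = f"A detail from a picture of {prompt}"
    return detail

print(default_detail("a sunflower"))
# A detail from a picture of a sunflower
```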
- encode_text(text_or_texts)[source]¶
Encode text.
Returns a detached tensor.
- Parameters
text_or_texts (str) – the text (or texts) to encode
- Return type
torch.Tensor
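Detaching matters because the prompt embedding is a fixed optimisation target: gradients should flow only through the image pathway. A minimal sketch of that contract (illustrative tensors, not library code):

```python
import torch

# The prompt embedding is detached (as from encode_text), so it carries
# no gradient; the image embedding (as from encode_image) does.
target = torch.rand(1, 512).detach()
image_emb = torch.rand(1, 512, requires_grad=True)
loss = -torch.cosine_similarity(image_emb, target).mean()
loss.backward()
print(target.requires_grad, image_emb.grad is not None)  # False True
```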
- encode_image(batch)[source]¶
Encode an image.
Does not detach.
- Parameters
batch (torch.Tensor) –
- Return type
torch.Tensor
- get_macro(img)[source]¶
Get a set of high-level views on an image batch.
- Parameters
img (Tensor) – A (b, c, h, w) image batch
- Returns
an expanded (b, c, h, w) image batch
- Return type
Tensor
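A macro view can be sketched as one random crop covering 90-100% of each spatial dimension, resized to the model's (224, 224) input. This is a simplified hypothetical implementation, not the library's exact code:

```python
import torch
import torch.nn.functional as F

def random_macro_crop(img, low=0.9, high=1.0, size=224):
    # img: (b, c, h, w). Pick a random scale in [low, high], crop that
    # fraction of each spatial dimension at a random offset, then resize
    # to the model's fixed input resolution.
    b, c, h, w = img.shape
    scale = low + (high - low) * torch.rand(1).item()
    ch, cw = int(h * scale), int(w * scale)
    top = int(torch.randint(0, h - ch + 1, (1,)))
    left = int(torch.randint(0, w - cw + 1, (1,)))
    crop = img[:, :, top:top + ch, left:left + cw]
    return F.interpolate(crop, size=(size, size), mode="bilinear",
                         align_corners=False)

views = random_macro_crop(torch.rand(2, 3, 512, 512))
print(views.shape)  # torch.Size([2, 3, 224, 224])
```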
- get_micro(img)[source]¶
Get a set of detailed (near pixel-perfect) views on an image batch.
- Parameters
img (Tensor) – A (b, c, h, w) image batch
- Returns
an expanded (b, c, h, w) image batch
- Return type
Tensor
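The tiling used for micro views can be sketched as covering the image with non-overlapping (224, 224) tiles taken without resampling, so each tile is pixel-perfect. This simplified version (not the library's implementation) drops any remainder at the right and bottom edges:

```python
import torch

def tile_views(img, size=224):
    # img: (b, c, h, w). unfold twice to slice out size x size tiles,
    # then flatten the tile grid into a batch of views.
    b, c, h, w = img.shape
    tiles = img.unfold(2, size, size).unfold(3, size, size)
    # tiles: (b, c, n_h, n_w, size, size) -> (b * n_h * n_w, c, size, size)
    return tiles.permute(0, 2, 3, 1, 4, 5).reshape(-1, c, size, size)

tiles = tile_views(torch.rand(1, 3, 448, 448))
print(tiles.shape)  # torch.Size([4, 3, 224, 224])
```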
- get_similarity(img, prompts, batch_size, match='all')[source]¶
Compute the average similarity between a combined-but-contiguous batch of images and a set of prompts.
- Parameters
img (Tensor) – A combined-but-contiguous image batch with shape (batch_size * t, c, h, w)
prompts (Tensor) – A tensor of prompt embeddings with shape (n, 512)
batch_size (int) – The size of the original image batch
match (str) – Policy for combining multiple prompts: “any”, “all”, or (in future) “one”
- Returns
A tensor of average similarities with shape (batch_size,)
- Return type
Tensor
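The reduction over views and prompts can be sketched as follows, assuming cosine similarity between L2-normalised embeddings (a hypothetical re-implementation, not the library's code): average over the t views of each image, then reduce over prompts according to the match policy.

```python
import torch
import torch.nn.functional as F

def average_similarity(img, prompts, batch_size, match="all"):
    # img: (batch_size * t, d) view embeddings; prompts: (n, d).
    sim = F.normalize(img, dim=-1) @ F.normalize(prompts, dim=-1).T
    sim = sim.view(batch_size, -1, sim.shape[-1]).mean(dim=1)  # avg views
    if match == "all":
        return sim.mean(dim=-1)        # every prompt should fit
    if match == "any":
        return sim.max(dim=-1).values  # best-matching prompt wins
    raise ValueError(f"unknown match policy: {match}")

# 3 images x 2 views each, scored against 2 prompts
scores = average_similarity(torch.rand(6, 512), torch.rand(2, 512), batch_size=3)
print(scores.shape)  # torch.Size([3])
```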